[Serialization] How to restore ref-counted strings

Bill Lear

9 Mar 2007 9 Mar '07

2:22 p.m.

We have a massive amount of data to serialize, on the order of several gigabytes. Lots of strings involved, maybe hundreds of millions. We discovered that the data structure in memory would bloat enormously when read back in from disk (say, from 2 gig to 3.1 gig). We think we have tracked this down to (gcc implementation) string reference counts not being "restored". I think a solution for us is to do something like the following: static map<string, bool> string_map; template <class Archive> void read_string(Archive ar, string& a_string) { string s; ar >> s; // read from disk map<string, bool>::iterator i = string_map.find(s); if (i == string_map.end()) { i = string_map.insert(make_pair(s, true)); } a_string = i->first; } void destroy_map() { string_map.clear(); } Then, when the data structures have all been read, invoke the destroy_map() method to clear the string_map object, thus decremented all refcounts of strings by one. Has anyone else encountered this and found a solution? Also, if anyone has bright ideas on a better data structure than std::map to use for storing hundreds of millions of strings at once for the above purpose, that also might be nice. Thanks. Bill

Show replies by date

Robert Ramey

9 Mar 9 Mar

6:53 p.m.

I would first consider something else. It turns out that for reasons I don't want to go into here that std::string is designated as "primitive". That is there is no reference counting (by serialization load), version, etc with std::string type objects. This is probably a good choice except that it makes std::string "special" in comparison to other types. In general, the only other types consider "primitive" are C++ data types that are truely primitive. So, in contrast to the default behavior for std::collections, if the SAME string is saved twice - an actual copy is saved an restored. This might be what most people expect from a datatype like std::string, but it might be an issue in some applicatons. Note that by "SAME string" I'm referring to the same datum - not two different strings with the same contents. So, the serialization could be the source of your issue if you're saving the SAME string many times. A closer reading of your post, suggests that the above isn't what you're referring to. I left in the above just to clarify my thinking on the issue. It sounds to me that you're telling me that the gcc standard library keeps a counted string - "copy on write" implementation so that if a and b are strings the operation a = b doesn't result in a duplication of the content. Which would surprise me. But assuming that's the case, serialization would "lose" the reference counts when the strings are recreated without using the "=" operator. If this is the case, here are a couple of ideas to consider. Define your own serializaton for std::string and use it instead of the one in the serialization library. This is probably a bad idea as it would attribute your special behavior to a standard object and would make your archives and programs non portable and harder to support if you want to ask us for help. Define you're own string class derived from std::string. This string class could be serialized using your own special sauce without losing portablity. The could be formlated as a "serialization wrapper" as described in the manual so that you're code would only have to use this "special string" in the process of serialization and not through out your program. Look in the recent document and the "is_wrapper" typetrait for more information. So now the problem boils down to how your going to capture and restore the fact that these strings share underlying data. At first one would think that just letting your wrapper class use the default tracking behavior eliminate duplicates would solve your problem. But I don't think so. As I said above, I don't think that you're serializing the SAME (see above) string one million times. I think you're serializing a million different strings which happen to contain the same data. It seems to me that you'll have to delve into the implementation of the string class you're using and gain access to the internals of the implementation and figure out how to capture the reference to the shared contents and serialize that. Robert Ramey Bill Lear wrote:

...

We have a massive amount of data to serialize, on the order of several gigabytes. Lots of strings involved, maybe hundreds of millions.

We discovered that the data structure in memory would bloat enormously when read back in from disk (say, from 2 gig to 3.1 gig). We think we have tracked this down to (gcc implementation) string reference counts not being "restored". I think a solution for us is to do something like the following:

static map<string, bool> string_map;

template <class Archive> void read_string(Archive ar, string& a_string) { string s; ar >> s; // read from disk map<string, bool>::iterator i = string_map.find(s);

if (i == string_map.end()) { i = string_map.insert(make_pair(s, true)); }

a_string = i->first; }

void destroy_map() { string_map.clear(); }

Then, when the data structures have all been read, invoke the destroy_map() method to clear the string_map object, thus decremented all refcounts of strings by one.

Has anyone else encountered this and found a solution?

Also, if anyone has bright ideas on a better data structure than std::map to use for storing hundreds of millions of strings at once for the above purpose, that also might be nice.

Thanks.

Bill

Robert Ramey

7:29 p.m.

Hmmm - now I want to ammend my answer. I think your approach IS on the right track. Your usage of maps assures that content of the strings is compared rather than whether the strings are the SAME string. So I would suggest something likethe following

Bill Lear

7:45 p.m.

On Friday, March 9, 2007 at 10:53:20 (-0800) Robert Ramey writes:

...

... Define your own serializaton for std::string and use it instead of the one in the serialization library. This is probably a bad idea as it would attribute your special behavior to a standard object and would make your archives and programs non portable and harder to support if you want to ask us for help.

Definite downsides, true, but I'm not sure that it would be non-portable, except perhaps that I have a different idea of "define your own serialization for std::string". I have done what I considered to be this, and posted it below.

...

Define you're own string class derived from std::string. This string class could be serialized using your own special sauce without losing portablity. The could be formlated as a "serialization wrapper" as described in the manual so that you're code would only have to use this "special string" in the process of serialization and not through out your program. Look in the recent document and the "is_wrapper" typetrait for more information.

Ok, I'll have a look at that --- sounds like a reasonable alternative to what I've done.

...

So now the problem boils down to how your going to capture and restore the fact that these strings share underlying data. At first one would think that just letting your wrapper class use the default tracking behavior eliminate duplicates would solve your problem. But I don't think so. As I said above, I don't think that you're serializing the SAME (see above) string one million times. I think you're serializing a million different strings which happen to contain the same data.

It seems to me that you'll have to delve into the implementation of the string class you're using and gain access to the internals of the implementation and figure out how to capture the reference to the shared contents and serialize that.

The strings share data on assign, so: string a = "foo"; string b = a; means they share the underlying memory "foo", with a logical refcount of 2 (the physical refcount, for implementation reasons, is actually 1). Once you muck with a or b, they get their own copy of the memory, decremented ref count, etc. If I serialize a and b, and deserialize, the load will "break" this ref count --- I get two "unshared" strings, each with a block of memory "foo". Not the fault of the serialization library, of course ... So, here is how I've coded this to test it out. The test I've just completed shows that the memory bloat is completely removed --- this is a major relief, as the bloat was literally expanding by 3-4 gigabytes a process that was already near our VM limit. Think of this as just a proof-of-concept, if you like (boost/archive/impl/text_iarchive_impl.ipp): #ifdef LL_STRING_DESERIALIZATION_CACHE typedef std::map<std::string, bool> ll_cache; static std::map<std::string, bool> ll_string_cache; void nuke_ll_string_cache() { ll_string_cache.clear(); } #endif template<class Archive> BOOST_ARCHIVE_DECL(void) text_iarchive_impl<Archive>::load(std::string &s) { #ifndef LL_STRING_DESERIALIZATION_CACHE std::size_t size; * this->This() >> size; // skip separating space is.get(); // borland de-allocator fixup #if BOOST_WORKAROUND(_RWSTD_VER, BOOST_TESTED_AT(20101)) if(NULL != s.data()) #endif s.resize(size); is.read(const_cast<char *>(s.data()), size); #else std::size_t size; * this->This() >> size; // skip separating space is.get(); std::string input_string; input_string.resize(size); is.read(const_cast<char *>(input_string.data()), size); ll_cache::iterator i = ll_string_cache.find(input_string); if (i == ll_string_cache.end()) { std::pair<ll_cache::iterator, bool> x = ll_string_cache.insert(make_pair(input_string, true)); i = x.first; } s = i->first; #endif } If you have thoughts on how to make this cleaner, without actually hacking into the actual boost implementation details, that would be great (if this is what you already suggested about a wrapper, just say so). Thanks for the help. Bill

Robert Ramey

7:47 p.m.

Here is your proposal

...

static map<string, bool> string_map;

template <class Archive> void read_string(Archive ar, string& a_string) { string s; ar >> s; // read from disk map<string, bool>::iterator i = string_map.find(s);

if (i == string_map.end()) { i = string_map.insert(make_pair(s, true)); }

a_string = i->first; }

void destroy_map() { string_map.clear(); }

First I would use a set rather than map as I think it would be sufficient. Second I would make this a "helper class" . Your special "Lear Archive" would be derived from an existing archive and the helper class using multiple inheritance. Look at the implemenation of shared_ptr checked into the HEADof CVS to see how to do this. class LearHelper { std::set<std::string> m_strings; }; class LearArchive : binary_oarchive_impl, LearHelper { }; class lear_string : std::string { typedef mpl:_true? boost::serialization:is_wrapper; template<class Archive> void save(Archive &ar, const unsigned int version) const { std::string::iterator it = m_strings.find(*this); // if the string isn't in the set if(it != std::string::end()) m_strings.insert(*this); // serialize the one in the set ar << *it; } void load(Archive &ar, const unsigned int version) { // load the string to a temporary location std::string tstr; ar >> tsr; // look it up in the set; std::string::iterator it = m_strings.find(*this); // if the string isn't in the set if(it != std::string::end()) it = m_strings.insert(tsr); // and copy from the one in the set *this = tsr; reset_object_address(... // in case someone loads a pointer to a string } } finally, whenever you serialize a string, you use the lear_string wrapper some class ... std::sting my_string; .... save(... ar << lear_string(my_string); ... load(... ar >> lear_string(my_string); }; Food for thought. This should keep you busy for a while. Good Luck. Robert Ramey

Bill Lear

8:29 p.m.

On Friday, March 9, 2007 at 11:47:21 (-0800) Robert Ramey writes:

...

... First I would use a set rather than map as I think it would be sufficient.

Second I would make this a "helper class" . Your special "Lear Archive" would be derived from an existing archive and the helper class using multiple inheritance. Look at the implemenation of shared_ptr checked into the HEADof CVS to see how to do this. ... Food for thought. This should keep you busy for a while. Good Luck.

Excellent suggestions, thanks very much for taking the time. Bill

Bill Lear

10 Mar 10 Mar

2:24 a.m.

On Friday, March 9, 2007 at 11:47:21 (-0800) Robert Ramey writes:

...

Second I would make this a "helper class" . Your special "Lear Archive" would be derived from an existing archive and the helper class using multiple inheritance. Look at the implemenation of shared_ptr checked into the HEADof CVS to see how to do this. ...

I checked out the head of CVS, but did not see anything relating to a helper class in shared_ptr.hpp. Not sure if I'm looking in the right place. This is how I checked it out: % cvs -z3 -d:pserver:anonymous@boost.cvs.sourceforge.net:/cvsroot/boost checkout boost I'm not sure I need to use a cache of strings when saving, in fact, I think not, but I went ahead and more-or-less followed your instructions. I can't seem to get this to compile, but it's getting closer. The complete code is below my sig. The compiler (gcc) complains: error: 'class boost::archive::text_iarchive' has no member named 'cache' error: 'class boost::archive::text_oarchive' has no member named 'cache' So, it seems text_iarchive is being used as the type in the load method, instead of ll_text_iarchive, which I guess makes sense, but I'm not really sure how to cast to the proper archive, or if that's even a good idea. Also, for some reason, I can't use this idiom: ar << ll_string(s); In fact, this won't work, either: ar << string("foo"); for some reason. Finally, I really have no idea how to use the reset_object_address() method. I looked through the boost code, but couldn't decipher it. Any help appreciated. Bill #include <iostream> #include <string> #include <fstream> #include <set> #include <boost/archive/text_oarchive.hpp> #include <boost/archive/text_iarchive.hpp> using namespace std; using namespace boost; using namespace boost::archive; typedef set<string> cache_t; class ll_cache { public: cache_t cache; }; class ll_text_iarchive: public text_iarchive_impl<text_iarchive>, public ll_cache { public: ll_text_iarchive(std::istream& is, unsigned int flags = 0) : text_iarchive_impl<text_iarchive>(is, flags) {} }; class ll_text_oarchive: public text_oarchive_impl<text_oarchive>, public ll_cache { public: ll_text_oarchive(std::ostream& os, unsigned int flags = 0) : text_oarchive_impl<text_oarchive>(os, flags) {} }; class ll_string: public string { public: ll_string(const std::string& s) : string(s) {} ll_string() {} typedef mpl::true_ boost::serialization::is_wrapper; template <class Archive> void save(Archive& ar, const unsigned int version) const { cache_t::iterator i = ar.cache.find(*this); if (i == ar.cache.end()) { pair<cache_t::iterator, bool> p = ar.cache.insert(make_pair(*this, true)); i = p.first; } ar << *i; } template <class Archive> void load(Archive &ar, const unsigned int version) { string t; ar >> t; cache_t::iterator i = ar.cache.find(t); if (i == ar.cache.end()) { pair<cache_t::iterator, bool> p = ar.cache.insert(make_pair(t, true)); i = p.first; } static_cast<string>(*this) = t; } BOOST_SERIALIZATION_SPLIT_MEMBER() }; BOOST_CLASS_TRACKING(ll_string, boost::serialization::track_never) int main(int ac, char* av[]) { { string a("test");; string b = a; ofstream ofs("foo", ios::binary); ll_text_oarchive oa(ofs); ll_string x(a); oa << x; ll_string y(b); oa << y; } ll_string a; ll_string b; ifstream ifs("foo", ios::binary); ll_text_iarchive ia(ifs); ia >> a; ia >> b; }

Robert Ramey

4:29 a.m.

Bill Lear wrote:

...

On Friday, March 9, 2007 at 11:47:21 (-0800) Robert Ramey writes:

...
Second I would make this a "helper class" . Your special "Lear Archive" would be derived from an existing archive and the helper class using multiple inheritance. Look at the implemenation of shared_ptr checked into the HEADof CVS to see how to do this. ...

I checked out the head of CVS, but did not see anything relating to a helper class in shared_ptr.hpp. Not sure if I'm looking in the right place. This is how I checked it out:

% cvs -z3 -d:pserver:anonymous@boost.cvs.sourceforge.net:/cvsroot/boost checkout boost

Its in boost/archive/binary_iarchive.hpp You don't have to check out, you can always use the web browser interface to browser though the library. Its much better in that it gives a "3 dimensional" view so you can see changes that have been made and differences between versions. The relevant section looks like: 65 class binary_iarchive : 66 public binary_iarchive_impl< 67 boost::archive::binary_iarchive, 68 std::istream::char_type, 69 std::istream::traits_type 70 >, 71 public detail::shared_ptr_helper 72 { 73 public: ... The shared_ptr_helper creates and destroy an extra tracking object that shared_ptr requires. This is similar to your case.

...

I'm not sure I need to use a cache of strings when saving, in fact, I think not, but I went ahead and more-or-less followed your instructions.

lol - well good luck with that!!!. Actually, now that I think about it. I guess we only need the cache of strings on input.

...

I can't seem to get this to compile, but it's getting closer.

The complete code is below my sig.

The compiler (gcc) complains:

error: 'class boost::archive::text_iarchive' has no member named 'cache' error: 'class boost::archive::text_oarchive' has no member named 'cache'

try changing

...

class ll_text_iarchive: public text_iarchive_impl<text_iarchive>, public ll_cache { public: ll_text_iarchive(std::istream& is, unsigned int flags = 0) : text_iarchive_impl<text_iarchive>(is, flags) {} };

...

class ll_text_iarchive: public text_iarchive_impl<ll_text_iarchive>, public ll_cache { public: ll_text_iarchive(std::istream& is, unsigned int flags = 0) : text_iarchive_impl<ll_text_iarchive>(is, flags) {} };

Note that the argumetn to text_iarchive_impl should be the MOST DERIVED class

...

Also, for some reason, I can't use this idiom:

ar << ll_string(s);

In fact, this won't work, either:

ar << string("foo");

My guess is that is the is the "const" trap which using is_wrapper will get around. But this only works in 1.35. So for now do one of the following: use & instead of << or use a const_cast which will be somewhat complex. The "const" trap is also explained in the documentation under the section "rationale"

...

Finally, I really have no idea how to use the reset_object_address() method. I looked through the boost code, but couldn't decipher it.

Did you find anything about it in the documentation?

...

Any help appreciated.

You're welcome. Robert Ramey

Robert Ramey

7:54 a.m.

Robert Ramey wrote:

...

Bill Lear wrote:

...

...
I'm not sure I need to use a cache of strings when saving, in fact, I think not, but I went ahead and more-or-less followed your instructions.

I think I was thinking along the lines of the following template<Archive> void save(Archive &ar, ll_string & l, const unsigned int version) const { std::set<std::string>::iterator it = m_cache.find(l); if(m_cache.end() == it) m_cache.insert(l); // serialize a pointer to the cached value ar << & (* it); // or ar << it - serializing an iterator!!! that's a first } void load(Archive &ar, ll_string & l, const unsigned int version){ std::string * t; ar >> t; // warning or emitted by serialization library - try & or cast. // copy de-serialized string to "real" destination l = *t; } Thus, all the strings what are equal to each other would have all have the same pointer. Serialization tracking ensures that only one copy of the same object is in the archive. So when you load, all the instances have thier contents shared. This could make the file much smaller and faster to load. For example suppose were going to serialize the bible. (King James version). and we have it stored as a long list of words. In this system all words would be stored only once. + one small integer each time the word is stored again. Given a guess of 10,000 different words used in this 250,000 ? word work - the archive would be a lot smaller. Also it might be considerably faster to load. I'm not sure if this would really work - its just an idea you might want to take this idea with a grain of salt. Robert Ramey

6699

Age (days ago)

6700

Last active (days ago)

List overview

Download

0 comments

1 participants

participants (1)

Bill Lear
Robert Ramey