On Friday, March 9, 2007 at 10:53:20 (-0800) Robert Ramey writes:
... Define your own serialization for std::string and use it instead of the one in the serialization library. This is probably a bad idea, as it would attribute your special behavior to a standard object and would make your archives and programs non-portable and harder to support if you want to ask us for help.
Definite downsides, true, but I'm not sure that it would be non-portable, except perhaps that I have a different idea of "define your own serialization for std::string". I have done what I considered to be this, and posted it below.
Define your own string class derived from std::string. This string class could be serialized using your own special sauce without losing portability. This could be formulated as a "serialization wrapper" as described in the manual so that your code would only have to use this "special string" in the process of serialization and not throughout your program. Look in the recent documentation and the "is_wrapper" type trait for more information.
Ok, I'll have a look at that --- sounds like a reasonable alternative to what I've done.
So now the problem boils down to how you're going to capture and restore the fact that these strings share underlying data. At first one would think that just letting your wrapper class use the default tracking behavior to eliminate duplicates would solve your problem. But I don't think so. As I said above, I don't think that you're serializing the SAME (see above) string one million times. I think you're serializing a million different strings which happen to contain the same data.
It seems to me that you'll have to delve into the implementation of the string class you're using and gain access to the internals of the implementation and figure out how to capture the reference to the shared contents and serialize that.
The strings share data on assign, so:
string a = "foo";
string b = a;
means they share the underlying memory "foo", with a logical refcount
of 2 (the physical refcount, for implementation reasons, is actually
1). Once you muck with a or b, they get their own copy of the memory,
decremented ref count, etc. If I serialize a and b, and deserialize,
the load will "break" this ref count --- I get two "unshared" strings,
each with a block of memory "foo". Not the fault of the serialization
library, of course ...
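
To make the idea concrete, here is a minimal standalone sketch of a load-side cache keyed on string contents. It is not the Boost patch itself and uses no Boost code: the class name StringLoadCache and its intern() member are hypothetical, and because current standard library strings are generally not reference-counted, the sharing is modelled with std::shared_ptr<const std::string> rather than with the string's own refcount.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical helper, not part of Boost.Serialization: hands out one shared
// buffer per distinct string value seen while loading an archive.
class StringLoadCache {
public:
    using Shared = std::shared_ptr<const std::string>;

    // Return the cached buffer for `value`, creating it the first time
    // that value is seen. Equal values always yield the same buffer.
    Shared intern(const std::string& value) {
        auto it = cache_.find(value);
        if (it == cache_.end()) {
            it = cache_.emplace(value,
                                std::make_shared<const std::string>(value)).first;
        }
        return it->second;
    }

    // Number of distinct values currently held.
    std::size_t size() const { return cache_.size(); }

private:
    // Assumption: the map key duplicates the contents once per distinct
    // value, which is cheap when a few values repeat many times.
    std::map<std::string, Shared> cache_;
};
```

A loader would call intern() on each string it reads, so a million loaded copies of "foo" end up pointing at one buffer instead of a million separate ones.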
So, here is how I've coded this to test it out. The test I've just
completed shows that the memory bloat is completely removed --- this
is a major relief, as the bloat was literally expanding by 3-4
gigabytes a process that was already near our VM limit. Think of this
as just a proof-of-concept, if you like
(boost/archive/impl/text_iarchive_impl.ipp):
#ifdef LL_STRING_DESERIALIZATION_CACHE
typedef std::map