[serialization] speed & size optimized binary archive
I'm trying to create a binary archive that handles serialization of derived types through their base class, with versioning for the archive as a whole rather than type by type. The binary archive that ships with the distribution serializes a lot of things that are unnecessary for our use case.

Going by the documentation and by how the other archives are implemented, I've created a basic implementation. It compiles, but it doesn't handle serialization of exported types. The documentation is mostly great, but this is one small area that seems to be left out: what an archive must provide to serialize and deserialize exported types properly. Looking at the provided archive types, it's not clear exactly what's needed for this functionality. The documentation mentions BOOST_SERIALIZATION_REGISTER_ARCHIVE, which is used for the archive.

It would also be great to know in more depth what actually gets serialized, even if that is subject to change in newer versions. For example, I'd like to know how serialization of exported types affects archive size. I'm thinking just an extra id per instance? Reading the documentation, I got a hunch that extra metadata on a per-type basis gets serialized to the archive as well. Under "Types used by the serialization library" several types are mentioned; which of these can be ignored for our use case?

This is a reduced form of the output archive. Together with an analogous input archive, it fails to serialize derived types through their base even when they are exported; the same test works with the supplied binary archive type.
class BinaryOArchive : public boost::archive::detail::common_oarchive< BinaryOArchive >
{
    typedef std::ostream Sink;
    friend class boost::archive::save_access;

    template< typename Type >
    void save( Type& a_Type )
    {
        m_Out.write( reinterpret_cast< const char* >( &a_Type ), sizeof( Type ) );
    }

public:
    BinaryOArchive( Sink& a_Out ) : m_Out( a_Out ) {}

    void save_binary( void *a_Address, std::size_t a_Count )
    {
        m_Out.write( static_cast< const char* >( a_Address ), a_Count );
    }

private:
    Sink& m_Out;
};

#include "boost/archive/impl/archive_serializer_map.ipp"
BOOST_SERIALIZATION_REGISTER_ARCHIVE(Altus::BinaryOArchive)
BOOST_SERIALIZATION_USE_ARRAY_OPTIMIZATION(Altus::BinaryOArchive)
Turns out I needed to handle serialization of strings explicitly, which fixed the error.

Looking at how class_name_type is handled throughout the source, its type seems to be set in stone. Furthermore, it seems I pay at least an additional 128 bytes per exported class type. I guess this is convenient when serializing in text format for easy viewing, but for use cases where memory is at a premium and a hash would suffice, it would be nice if this type were configurable throughout the system. I'm guessing this string is used in the extended_type_info system mentioned in the manual, which leads me to believe that the lookup table for exported types uses a costly string compare to find the correct derived type.

Neither performance, memory use nor archive size is mentioned as a specific consideration among the 11 goals in the manual, which leads me to believe that perhaps the library simply isn't suited for domains where these matter? I would find this rather sad, especially since these are areas where the language is often used.
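To make the "a hash would suffice" idea concrete, here is a minimal sketch of deriving a fixed 4-byte key from a class name with the 32-bit FNV-1a hash. This is only an illustration of the request: `class_key` is a hypothetical helper, not anything Boost.Serialization provides, and the sketch ignores what should happen when two names collide.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, NOT part of Boost.Serialization: maps a class name
// to a fixed-size 32-bit key using the FNV-1a hash.
std::uint32_t class_key(const std::string& name)
{
    std::uint32_t hash = 2166136261u;            // FNV-1a 32-bit offset basis
    for (char c : name) {
        hash ^= static_cast<unsigned char>(c);   // mix in one byte of the name
        hash *= 16777619u;                       // FNV-1a 32-bit prime
    }
    return hash;
}
```

With such a key the per-type cost would be a constant 4 bytes, at the price of losing the readable name and having to detect collisions at registration time.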
Sebastian Karlsson wrote:
> Turns out I needed to handle serialization of strings explicitly, which fixed the error.
>
> Looking at how class_name_type is handled throughout the source, its type seems to be set in stone. Furthermore, it seems I pay at least an additional 128 bytes per exported class type. I guess this is convenient when serializing in text format for easy viewing, but for use cases where memory is at a premium and a hash would suffice, it would be nice if this type were configurable throughout the system. I'm guessing this string is used in the extended_type_info system mentioned in the manual, which leads me to believe that the lookup table for exported types uses a costly string compare to find the correct derived type.
The class name is used only for exported types that are serialized through base class pointers. Each class name is looked up only once during the loading of an archive, so I would expect the impact on execution time to be undetectable. The class name is used only once per archive. If your application is so fast and your archives are so small that these factors make an impact, you could choose short text names, perhaps generated by your own method, which would make them very short, maybe 2 characters long.

Of course, you can avoid all of the above by not serializing pointers through a base class, which is the only place where the class name lookup is used. In other words, don't use BOOST_CLASS_EXPORT. If your application is so time critical that the time for storing/loading the class name is detectable, I would guess you're not using this facility anyway, as the redirection through a virtual base class has a measurable time cost that occurs on every read/write, not just once per archive.
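The size argument above can be illustrated with a toy model: the class name is written the first time a class appears in an archive, and every later instance carries only a small numeric id. This is a sketch under stated assumptions, not Boost.Serialization's actual wire format; `ToyWriter`, `toy_overhead` and the 2-byte id are all made up for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Toy model only; NOT Boost.Serialization's real format or API.
class ToyWriter {
public:
    explicit ToyWriter(std::ostream& out) : out_(out) {}

    // Emits the type tag for one object of class `name`.
    void write_class(const std::string& name)
    {
        auto it = ids_.find(name);
        const bool first_use = (it == ids_.end());
        if (first_use) {
            // Assign the next free id to a class seen for the first time.
            const auto id = static_cast<std::uint16_t>(ids_.size());
            it = ids_.emplace(name, id).first;
        }
        put_u16(it->second);                 // every instance: 2-byte id
        if (first_use) {                     // first instance only: the name
            put_u16(static_cast<std::uint16_t>(name.size()));
            out_.write(name.data(), static_cast<std::streamsize>(name.size()));
        }
    }

private:
    void put_u16(std::uint16_t v)
    {
        const char bytes[2] = { static_cast<char>(v & 0xFF),
                                static_cast<char>(v >> 8) };
        out_.write(bytes, 2);
    }

    std::ostream& out_;
    std::map<std::string, std::uint16_t> ids_;
};

// Bytes of type tagging emitted for `count` objects of a single class.
std::size_t toy_overhead(const std::string& name, int count)
{
    std::ostringstream out;
    ToyWriter writer(out);
    for (int i = 0; i < count; ++i) {
        writer.write_class(name);
    }
    return out.str().size();
}
```

In this model, shortening a name from "DerivedWidget" to "D" saves 12 bytes per archive no matter how many instances are written, which is why a short export name only matters for very small archives.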
> Neither performance, memory use nor archive size is mentioned as a specific consideration among the 11 goals in the manual, which leads me to believe that perhaps the library simply isn't suited for domains where these matter?
This serialization library separates out the saving/loading of data types into an "archive" implementation. Any conforming archive implementation can be used once the serialization for all data types is defined. The library includes a number of archive implementations for different purposes. Some are as fast as they can be (the binary archives), while others address other requirements, e.g. XML compatibility. These can be used as examples, along with the documentation, demos and tests, to create a new archive implementation should a user find that the included implementations fail to meet one or more of his requirements. The test suite is parameterized by archive implementation, so if a user makes his own archive (as some have, see the MPI library), the current tests can be used to verify that the new archive handles all the serialization facilities.
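The separation described above can be sketched with two toy archives and one data type whose `serialize` function is written once and works with either. These classes are illustrative stand-ins, not Boost's archive classes, and they leave out versioning, tracking and pointer handling entirely; `save_point` is a hypothetical helper for the example.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Toy archive writing human-readable text. NOT a Boost class.
class ToyTextOArchive {
public:
    explicit ToyTextOArchive(std::ostream& out) : out_(out) {}
    ToyTextOArchive& operator&(std::int32_t value)
    {
        out_ << value << ' ';
        return *this;
    }
private:
    std::ostream& out_;
};

// Toy archive writing raw bytes in native byte order. NOT a Boost class.
class ToyBinaryOArchive {
public:
    explicit ToyBinaryOArchive(std::ostream& out) : out_(out) {}
    ToyBinaryOArchive& operator&(std::int32_t value)
    {
        out_.write(reinterpret_cast<const char*>(&value), sizeof value);
        return *this;
    }
private:
    std::ostream& out_;
};

// The data type describes its members once, independent of the archive.
struct Point {
    std::int32_t x;
    std::int32_t y;

    template <typename Archive>
    void serialize(Archive& ar)
    {
        ar & x & y;
    }
};

// Serializes a Point with the chosen archive type and returns the bytes.
template <typename Archive>
std::string save_point(Point p)
{
    std::ostringstream out;
    Archive ar(out);
    p.serialize(ar);
    return out.str();
}
```

Calling `save_point<ToyTextOArchive>(Point{3, -4})` and `save_point<ToyBinaryOArchive>(Point{3, -4})` runs the same `serialize` code; only the archive decides how the two integers end up in the stream.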
> I would find this rather sad, especially since these are areas where the language is often used.
If you're really interested in this subject, I think you would find that spending some more time reading the documentation and demos would be productive.

Robert Ramey
Thanks for clearing that up, it sounds reasonable.
On Thursday 19 November 2009 14:15:46, Sebastian Karlsson wrote:
> Turns out I needed to handle serialization of strings explicitly, which fixed the error.
>
> Looking at how class_name_type is handled throughout the source, its type seems to be set in stone. Furthermore, it seems I pay at least an additional 128 bytes per exported class type. I guess this is convenient when serializing in text format for easy viewing, but for use cases where memory is at a premium and a hash would suffice, it would be nice if this type were configurable throughout the system. I'm guessing this string is used in the extended_type_info system mentioned in the manual, which leads me to believe that the lookup table for exported types uses a costly string compare to find the correct derived type.
>
> Neither performance, memory use nor archive size is mentioned as a specific consideration among the 11 goals in the manual, which leads me to believe that perhaps the library simply isn't suited for domains where these matter? I would find this rather sad, especially since these are areas where the language is often used.
There has been a discussion about that in the past. The type registration and object tracking functionality of Boost.Serialization is currently not configurable or replaceable by the user. The thread in which it took place was "Using boost::serialization in real-time without allocating memory" on boost-users in November 09. I think Robert would welcome patches to change that.
Sebastian Karlsson wrote:
> I'm trying to create a binary archive that handles serialization of derived types through their base class, with versioning for the archive as a whole rather than type by type. The binary archive that ships with the distribution serializes a lot of things that are unnecessary for our use case.
Like what? Archives don't contain anything that doesn't have to be there in order to restore the data.
> Going by the documentation and by how the other archives are implemented, I've created a basic implementation. It compiles, but it doesn't handle serialization of exported types. The documentation is mostly great, but this is one small area that seems to be left out: what an archive must provide to serialize and deserialize exported types properly.
All you have to do is include boost/serialization/export.hpp.
> Looking at the provided archive types, it's not clear exactly what's needed for this functionality. The documentation mentions BOOST_SERIALIZATION_REGISTER_ARCHIVE, which is used for the archive.
>
> It would also be great to know in more depth what actually gets serialized, even if that is subject to change in newer versions. For example, I'd like to know how serialization of exported types affects archive size. I'm thinking just an extra id per instance? Reading the documentation, I got a hunch that extra metadata on a per-type basis gets serialized to the archive as well. Under "Types used by the serialization library" several types are mentioned; which of these can be ignored for our use case?
The easiest way to find this stuff out is to serialize your data to an xml_oarchive. This labels all the data with descriptive tags and the structure is very apparent.
> This is a reduced form of the output archive. Together with an analogous input archive, it fails to serialize derived types through their base even when they are exported; the same test works with the supplied binary archive type.
I don't see any advantage at all to this effort. The beauty of template metaprogramming is that only the code actually used is generated. Let the compiler do the work. The headers in the library are almost all optional. For example, if you're not going to serialize derived types, don't include boost/serialization/base_object.hpp, etc. This way you know for certain that you're not generating code for facilities that aren't being used.

Robert Ramey
participants (3)
- Robert Ramey
- Sebastian Karlsson
- Stefan Strasser