After fiddling around on Linux with clang 3.8 and gcc optimizer options, I got it down to this. With gcc and -O3:
Benchmark          Time(ns)  CPU(ns)  Iterations
-------------------------------------------------
to_wire_xml           11174    11178      381818
to_wire_text           5148     5149      820313
to_wire_binary         3327     3330     1141304
to_wire_cstyle           63       63    65217391
from_wire_xml         27170    27183      155096
from_wire_text         5371     5370      783582
from_wire_binary       3226     3228     1296296
from_wire_cstyle         45       45    93750000
These results look very nice. <6µs to serialize/deserialize a structure to a portable text archive seems very nice :)
The difference compared to Windows is again pretty big, though.....
For what it's worth, in tests I've done in the past, binary serialization using Boost.Serialization and other similar systems was not this far from memcpy. I was seeing maybe a 5x to 10x difference compared to memcpy (yours is 50x). Of course, this depends on a lot of factors, like how much data is involved, since that determines whether you are memory bound, but I am wondering if your cstyle tests are actually being completely optimized away. Have you examined the disassembly? If you find the code is being optimized away, Google Benchmark has a handy "benchmark::DoNotOptimize" function to help keep the optimizer from throwing away the side effects of a particular address. -- chris