[iostreams] Devices and WOULD_BLOCK
Hi all, Boost newcomer here ...

I have an application where I need to encode/decode a serialized data stream. I have used Boost.Iostreams filtering_stream filters to implement the encoding and decoding functions of each layer. The device interface is via a USB device, e.g., think USB-to-Serial device, but without the classic Virtual COM Port or ttyUSB0 layer available. For example, the FTDI FT232H USB-to-MPSSE interface is a suitable cable:

http://www.ftdichip.com/Products/Cables/USBMPSSE.htm

The cable can be used to implement either JTAG or SPI mode access (the mode selection introduces different filters into the filtering_stream stack). I have the filters working, and am working on getting a device working. Rather than deal with the details of the final hardware, I figured I'd simplify the design by creating a client/server design via sockets; the client code matches what I would use with real hardware, while the server code emulates the hardware. I started with this example for a socket device:

http://stackoverflow.com/questions/12023166/is-there-a-boostiostreams-bidire...

The "problem" with the protocol I need to decode (which I cannot change) is that the data stream possibly contains escaped characters, so there is no way to know the read data length at the socket layer - it's up to the filtering_stream layers to request data from the device layer until a complete packet is decoded.

This all sounds good in theory, but in practice the filter layers attempt to read in blocks, and the read size is often larger than the amount of data the device layer can supply. This led to higher-level layers blocking in read(). I figured (perhaps incorrectly) that I could deal with this using non-blocking socket reads. After reading about Boost.Iostreams non-blocking support in Section 3.6:

http://www.boost.org/doc/libs/1_57_0/libs/iostreams/doc/index.html
http://www.boost.org/doc/libs/1_57_0/libs/iostreams/doc/guide/asynchronous.h...

I modified the socket device example in the link above to:

* put the socket in non-blocking mode before constructing the device
* change the device 'read' procedure to return WOULD_BLOCK rather than throw an exception

This does not work, and it's not due to the filter layers; there is an issue with the device layer. Here's the issue ... given a design with a filtering_stream created only with a socket_device and no filters, if I trace the code in a debugger (Boost 1.57.0 source, Visual Studio 2012 under Win7), the socket_device read method call return sequence is:

boost/iostreams/read.hpp
 - read_device_impl read template at line 187
 - read at line 52
boost/iostreams/detail/adapter/concept_adapter.hpp
 - device_wrapper_impl read at line 169
 - read at line 77
boost/iostreams/detail/streambuf/indirect_streambuf.hpp
 - line 258

i.e., this source file

https://github.com/boostorg/iostreams/blob/master/include/boost/iostreams/de...

and this particular block of code:

    // Read from source.
    std::streamsize chars =
        obj().read(buf.data() + pback_size_, buf.size() - pback_size_, next_);
    if (chars == -1) {
        this->set_true_eof(true);
        chars = 0;
    }
    setg(eback(), gptr(), buf.data() + pback_size_ + chars);

The code tests for EOF (-1), but not WOULD_BLOCK (-2), so after this point, since chars is -2, things go bad.

So I guess my question now is: have I just bumped into the as-yet-unsupported part of Boost.Iostreams support for asynchronous I/O?

Cheers,
Dave

PS. I can post example code if anyone wants to trace the code for themselves, I just figured I'd post the question to start with.
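PS2. To show what I mean by the device 'read' returning WOULD_BLOCK, here is a minimal sketch of a non-blocking source device; the class name and the use of a Boost.Asio socket are illustrative, not my actual code:

#include <boost/asio.hpp>
#include <boost/iostreams/categories.hpp>
#include <iosfwd>

// Illustrative Source device: report WOULD_BLOCK (-2) instead of
// throwing when the non-blocking socket has no data available.
class socket_source {
public:
    typedef char char_type;
    typedef boost::iostreams::source_tag category;

    explicit socket_source(boost::asio::ip::tcp::socket& s) : socket_(s) {}

    std::streamsize read(char* s, std::streamsize n)
    {
        boost::system::error_code ec;
        std::size_t got = socket_.read_some(boost::asio::buffer(s, n), ec);
        if (!ec)
            return static_cast<std::streamsize>(got);
        if (ec == boost::asio::error::would_block)
            return -2;   // WOULD_BLOCK, per Section 3.6 of the docs
        if (ec == boost::asio::error::eof)
            return -1;   // EOF
        throw boost::system::system_error(ec);
    }

private:
    boost::asio::ip::tcp::socket& socket_;
};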
On 20/01/2015 11:14, David Hawkins wrote:
The "problem" with the protocol I need to decode (which I have no choice to change) is that the data stream possibly contains escaped characters, so there is no way to know the read data length at the socket layer - its up to the filtering_stream layers to request data from the device layer until a complete packet is decoded.
This all sounds good in theory, but in practice the filter layers attempt to read in blocks, and the read size is often larger than the read data that the device layer will supply. This lead to higher-level layers blocking in read(). I figured (perhaps incorrectly) that I could deal with this using non-blocking socket reads.
I can't really answer your specific questions about the Boost implementations, but in general sockets (both blocking and non-blocking) and code dealing with them are expecting that read() will only block (or return WOULD_BLOCK) if no data can be read -- if at least one byte of data is available then that is what it will return, regardless of the amount actually requested. (The read size acts only as a maximum.)

Serial ports in particular sometimes operate this way and sometimes don't. Under Windows, the SetCommTimeouts API function selects (via ReadIntervalTimeout and ReadTotalTimeoutConstant) whether a normal serial port will return "early" as above or wait longer to see if more data is received, and whether there's an overall timeout or it will block forever. There may be a similar API call you need to make to the FTDI library.
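For example, a sketch of the "return early, with an overall timeout" configuration (untested; adjust the timeout to taste):

#include <windows.h>

// Make ReadFile() return as soon as at least one byte has arrived,
// or give up after 500 ms if nothing arrives at all.
void configure_port(HANDLE hSerial)
{
    COMMTIMEOUTS timeouts = {0};
    timeouts.ReadIntervalTimeout        = MAXDWORD;
    timeouts.ReadTotalTimeoutMultiplier = MAXDWORD;
    timeouts.ReadTotalTimeoutConstant   = 500;   // overall timeout, in ms
    SetCommTimeouts(hSerial, &timeouts);
}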
Hi Gavin,
The "problem" with the protocol I need to decode (which I have no choice to change) is that the data stream possibly contains escaped characters, so there is no way to know the read data length at the socket layer - its up to the filtering_stream layers to request data from the device layer until a complete packet is decoded.
This all sounds good in theory, but in practice the filter layers attempt to read in blocks, and the read size is often larger than the read data that the device layer will supply. This lead to higher-level layers blocking in read(). I figured (perhaps incorrectly) that I could deal with this using non-blocking socket reads.
I can't really answer your specific questions about the Boost implementations
No problem, I appreciate you taking time to respond.
but in general sockets (both blocking and non-blocking) and code dealing with them are expecting that read() will only block (or return WOULD_BLOCK) if no data can be read -- if at least one byte of data is available then that is what it will return, regardless of the amount actually requested. (The read size acts only as a maximum.)
Yep, understood. In theory the filtering_stream 'filter' and 'device' layers should pass characters, EOF, and WOULD_BLOCK. Unfortunately the code snippet I posted suppresses the propagation of the WOULD_BLOCK return value from the 'device' layer to the 'filter' layer.
Serial ports in particular sometimes operate this way and sometimes don't. Under Windows, the SetCommTimeouts API function selects (via ReadIntervalTimeout and ReadTotalTimeoutConstant) whether a normal serial port will return "early" as above or wait longer to see if more data is received, and whether there's an overall timeout or it will block forever. There may be a similar API call you need to make to the FTDI library.
Ok, this is good to know, thanks. I consider solving this particular "problem" a good way to learn Boost. Without a "problem" to solve, reading code gets old quickly :)

The Boost chat client/server has similar features to my problem, i.e., it involves encoding/decoding messages. The encoding/decoding is different from my case, since the chat protocol adds a header with the message length. This simplifies the socket read code, since you can read the fixed-length header, then read the (now known) message length, i.e., each call to read has a fixed length parameter. In my case, the message length is unknown, and depends on the content of the message (since data can be escaped). The data stream is generated by a field-programmable gate array (FPGA), and adding a buffer to determine the message length before sending the response would use too many resources, so the encoding protocol cannot easily be changed.

I'm in the process of modifying the chat example to encode the messages with a start-of-packet [, end-of-packet ], and escape \ code (so that a [ or ] or \ in the message is encoded as \[ or \] or \\), i.e.,

Hello!   ->(encodes to)-> [Hello!]
[Hello!] ->(encodes to)-> [\[Hello!\]]

The modified code will use async_read_until to parse the encoded data streams ... I'll probably have to use a regex or match procedure to deal with the fact that \] is not the end-of-packet.

Once I get the packet parsing working, I'll see if I can still use the filtering_stream components I came up with for my application, but instead of having them operate on a socket or serial_port device, I'll just have them operate directly on the contents of the streambuf that I filled using async_read_until. It's possible that the async_read_until match condition could use the filtering_stream, i.e., successful decoding of a message results in a match.

Cheers,
Dave
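PS. To make the encoding concrete, here's a minimal sketch of the escape scheme (the function name is mine, not from the chat example):

#include <string>

// '[' starts a packet, ']' ends it, and '\' escapes [, ], or \ in the payload.
std::string encode(const std::string& payload)
{
    std::string out;
    out.push_back('[');
    for (std::string::size_type i = 0; i < payload.size(); ++i) {
        char c = payload[i];
        if (c == '[' || c == ']' || c == '\\')
            out.push_back('\\');   // escape the special character
        out.push_back(c);
    }
    out.push_back(']');
    return out;
}

// encode("Hello!")   returns "[Hello!]"
// encode("[Hello!]") returns "[\[Hello!\]]"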
On 22/01/2015 12:56, David Hawkins wrote:
In theory the filtering_stream 'filter' and 'device' layers should pass characters, EOF, and WOULD_BLOCK. Unfortunately the code snippet I posted suppresses the propagation of the WOULD_BLOCK return value from the 'device' layer to the 'filter' layer.
Generally I would recommend using only blocking or async code. Non-blocking is a holdover from when async didn't really exist yet, and with ASIO you shouldn't have that excuse.
The Boost chat client/server has similar features to my problem, i.e., it involves encoding/decoding messages. The encoding/decoding is different from my case, since the chat protocol adds a header with the message length. This simplifies the socket read code, since you can read the fixed-length header, then read the (now known) message length, i.e., each call to read has a fixed length parameter. In my case, the message length is unknown, and depends on the content of the message (since data can be escaped). The data stream is generated by a field-programmable gate array (FPGA), and adding a buffer to determine the message length before sending the response would use too many resources, so the encoding protocol cannot easily be changed.
My point is that theoretically this shouldn't be a problem, as long as the lowest level (actual device reading) behaves as described earlier. When you read(buffer, 512) it would block until the device provides some data; perhaps it sends a 28 byte packet, of which the driver eagerly grabs the first 8 bytes and so read() returns 8; you look at that data, decide it's not a complete message yet, so you stash it into a separate buffer and call read(buffer, 512) again. (There are other approaches if you want zero-copy.) This time the driver already had the remaining 20 bytes queued up and so read() immediately returns 20. You tack those bytes onto the end of your prior stash, parse the whole thing, and now you've got a message you can return to a higher layer, and then go back to reading. Async works similarly, you just break the code up a little more so that it can return while it's waiting. (Just remember that it might also return a few bytes from the start of the *next* message, so you need to strip exactly one message out of the stash and keep accumulating bytes for the next message.)
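In sketch form (read_some, parse_one_message, and deliver are placeholders for your own device and protocol code):

#include <cstddef>
#include <string>

std::size_t read_some(char* buf, std::size_t n);              // blocking read
bool parse_one_message(std::string& stash, std::string& msg); // strip one msg
void deliver(const std::string& msg);

void read_loop()
{
    std::string stash;
    char buffer[512];
    for (;;) {
        std::size_t n = read_some(buffer, sizeof buffer);  // n may be < 512
        stash.append(buffer, n);

        // Strip zero or more complete messages off the front of the
        // stash; a trailing partial message stays for the next read.
        std::string message;
        while (parse_one_message(stash, message))
            deliver(message);
    }
}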
Hi Gavin,
In theory the filtering_stream 'filter' and 'device' layers should pass characters, EOF, and WOULD_BLOCK. Unfortunately the code snippet I posted suppresses the propagation of the WOULD_BLOCK return value from the 'device' layer to the 'filter' layer.
Generally I would recommend using only blocking or async code. Non-blocking is a holdover from when async didn't really exist yet, and with ASIO you shouldn't have that excuse.
Right, I'd be happy using async code, but this particular feature is new with filtering_streams - per the documentation:

http://www.boost.org/doc/libs/1_57_0/libs/iostreams/doc/guide/asynchronous.h...

Since async support is new, it might not be fully supported (or at least fully debugged). Now that I am getting more familiar with the Boost code, I need to dig around a little more in the Boost.Iostreams example and test folders.
The Boost chat client/server has similar features to my problem, i.e., it involves encoding/decoding messages. The encoding/decoding is different from my case, since the chat protocol adds a header with the message length. This simplifies the socket read code, since you can read the fixed-length header, then read the (now known) message length, i.e., each call to read has a fixed length parameter. In my case, the message length is unknown, and depends on the content of the message (since data can be escaped). The data stream is generated by a field-programmable gate array (FPGA), and adding a buffer to determine the message length before sending the response would use too many resources, so the encoding protocol cannot easily be changed.
My point is that theoretically this shouldn't be a problem, as long as the lowest level (actual device reading) behaves as described earlier.
When you read(buffer, 512) it would block until the device provides some data; perhaps it sends a 28 byte packet, of which the driver eagerly grabs the first 8 bytes and so read() returns 8; you look at that data, decide it's not a complete message yet, so you stash it into a separate buffer and call read(buffer, 512) again. (There are other approaches if you want zero-copy.) This time the driver already had the remaining 20 bytes queued up and so read() immediately returns 20. You tack those bytes onto the end of your prior stash, parse the whole thing, and now you've got a message you can return to a higher layer, and then go back to reading. Async works similarly, you just break the code up a little more so that it can return while it's waiting.
Yes, this is what I'm in the process of getting working with my modified version of the chat client/server. The async_read_until callbacks do nicely sequence through my simplified packet protocol.
(Just remember that it might also return a few bytes from the start of the *next* message, so you need to strip exactly one message out of the stash and keep accumulating bytes for the next message.)
Yep, I ran a few tests where I ensured that two complete messages were received into a boost::asio::streambuf, and confirmed that read_until('[') returned the first SOP index, and read_until(']') returned the first EOP index. I then consume()'d those characters, repeated the call to read_until, and confirmed it returned immediately based on the streambuf contents. So this all works as described by the documentation.

What is the policy of this list with regard to posting code inline in messages? Once I finish my variation on the chat client/server, I'd be happy to post the code. At a minimum it would provide code for people on the list to review/comment on, and any final version of the code would benefit anyone interested in reading streams containing a different style of packet than that used in the Boost example chat client/server.

I'll then go back to looking at what I did wrong with the filtering_stream 'filters' and 'devices', or see if I can get the async stuff working.

Thanks again for the helpful discussion.

Cheers,
Dave
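PS. For the \]-is-not-EOP issue, I'm thinking of a match condition along these lines (a sketch, not yet tested):

#include <boost/asio.hpp>
#include <utility>

typedef boost::asio::buffers_iterator<
    boost::asio::streambuf::const_buffers_type> buf_iterator;

// Match condition for [async_]read_until: succeed one past an
// unescaped ']'; a '\' escapes whatever character follows it.
std::pair<buf_iterator, bool> match_eop(buf_iterator begin, buf_iterator end)
{
    bool escaped = false;
    for (buf_iterator i = begin; i != end; ++i) {
        if (escaped)
            escaped = false;
        else if (*i == '\\')
            escaped = true;
        else if (*i == ']')
            return std::make_pair(++i, true);
    }
    return std::make_pair(end, false);
}

// Plain function pointers are accepted as match conditions, e.g.
// boost::asio::async_read_until(socket, streambuf, match_eop, handler);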
On 22/01/2015 18:40, David Hawkins wrote:
Yep, I ran a few tests where I ensured that two complete messages were received into a boost::asio::streambuf, and confirmed that read_until('[') returned the first SOP index, and read_until(']') returned the first EOP index. I then consume()'d those characters, repeated the call to read_until, and confirmed it returned immediately based on the streambuf contents. So this all works as described by the documentation.
Yep. Just be careful if you're mixing calls to read_until with calls to read -- read_until may fetch a larger amount of data into the streambuf (beyond the delimiter), while read will always wait for new data, even if the streambuf isn't empty. So you need to explicitly check the streambuf before calling read(). Of course if you're only dealing with purely delimited data then this shouldn't be an issue, as you'll only be using [async_]read_until.
Once I finish my variation on the chat client/server, I'd be happy to post the code. At a minimum it would provide code for people on the list to review/comment on, and any final version of the code would benefit anyone interested in reading streams containing a different style of packet than that used in the Boost example chat client/server.
Asio has a dedicated mailing list (https://lists.sourceforge.net/lists/listinfo/asio-users), which I believe the library maintainer pays closer attention to than this list; it may be worthwhile asking there. Maybe you could even get it included in the official docs. :) (I have some working streambuf code but it's not really in consumable-example form.)
Hi Gavin,
Yep, I ran a few tests where I ensured that two complete messages were received into a boost::asio::streambuf, and confirmed that read_until('[') returned the first SOP index, and read_until(']') returned the first EOP index. I then consume()'d those characters, repeated the call to read_until, and confirmed it returned immediately based on the streambuf contents. So this all works as described by the documentation.
Yep. Just be careful if you're mixing calls to read_until with calls to read -- read_until may fetch a larger amount of data into the streambuf (beyond the delimiter), while read will always wait for new data, even if the streambuf isn't empty. So you need to explicitly check the streambuf before calling read().
Of course if you're only dealing with purely delimited data then this shouldn't be an issue, as you'll only be using [async_]read_until.
That is an interesting warning. I hadn't really thought about whether I could mix the blocking and non-blocking read commands (but would have assumed I should not!). The modified chat application uses only the async versions, i.e., async_read_until and async_write.
Once I finish my variation on the chat client/server, I'd be happy to post the code. At a minimum it would provide code for people on the list to review/comment on, and any final version of the code would benefit anyone interested in reading streams containing a different style of packet than that used in the Boost example chat client/server.
Asio has a dedicated mailing list (https://lists.sourceforge.net/lists/listinfo/asio-users), which I believe the library maintainer pays closer attention to than this list; it may be worthwhile asking there.
Oh, good point. My code started in Boost.Iostreams, but now it's in the Boost.Asio camp :)
Maybe you could even get it included in the official docs. :)
Yes, that was my hope. You can never have too many examples!
(I have some working streambuf code but it's not really in consumable-example form.)
Ok, so here's a streambuf question for you. In my attempt to modify the chat server as much as possible, I initially modified the buffering to use a streambuf member variable ... but that fails, since a streambuf is non-copyable. My solution was to use a shared_ptr<streambuf> - see the chat_message.hpp code below.

Given that the shared_ptr is reference counted, when the server 'delivers' a read message to multiple clients, the shared pointer reference count on a new message read from a client will increment as it's copied to each of the clients connected to a session, and then decrement as each server client handler write pops the message off its write queue.

At least I assume that is what is going on (I just finished modifying the server and it's working ok); I'll use the debugger and trace the server code tomorrow to check.

Do you see anything wrong with using a shared_ptr<streambuf> member variable? I wanted to use a streambuf so that I could pass it directly to the async_read_until and async_write functions (saving the buffer() conversions used in the original code).

Cheers,
Dave
//
// chat_message.hpp
// ~~~~~~~~~~~~~~~~
//
// Copyright (c) 2003-2014 Christopher M. Kohlhoff (chris at kohlhoff dot com)
//
// Distributed under the Boost Software License, Version 1.0. (See accompanying
// file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
//
// Chat client/server message coder/decoder.
//
// This is based on the Boost chat client/server example modified
// to use a different encoding/decoding method.
//
// The encoded message to send or receive is stored in a
// std::shared_ptr<boost::asio::streambuf> buffer_;
// a shared pointer is used, since streambufs are non-copyable.
//

#ifndef CHAT_MESSAGE_HPP
#define CHAT_MESSAGE_HPP

#include
On 22/01/2015 20:52, David Hawkins wrote:
Yep. Just be careful if you're mixing calls to read_until with calls to read -- read_until may fetch a larger amount of data into the streambuf (beyond the delimiter), while read will always wait for new data, even if the streambuf isn't empty. So you need to explicitly check the streambuf before calling read().
Of course if you're only dealing with purely delimited data then this shouldn't be an issue, as you'll only be using [async_]read_until.
That is an interesting warning. I hadn't really thought about whether I could mix the blocking and non-blocking read commands (but would have assumed I should not!). The modified chat application uses only the async versions, i.e., async_read_until and async_write.
I wasn't talking about mixing async with non-async (although you can do that as long as you're careful to not have concurrent operations, it's rarely-to-never useful to do so in practice). I was talking about mixing the two types of read calls ("read", which reads up to a specified number of bytes, and "read_until", which reads up to a specified termination value/sequence).
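Concretely, the two flavours (a sketch):

#include <boost/asio.hpp>

void demo(boost::asio::ip::tcp::socket& socket, boost::asio::streambuf& sb)
{
    // "read": a byte count; always transfers new data from the socket,
    // even if the streambuf already holds unread bytes.
    boost::asio::read(socket, sb, boost::asio::transfer_exactly(512));

    // "read_until": reads until the delimiter appears, and may fetch
    // additional bytes beyond it into the streambuf.
    boost::asio::read_until(socket, sb, ']');
}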
Ok, so here's a streambuf question for you. In my attempt to modify the chat server as much as possible, I initially modified the buffering to use a streambuf member variable ... but that fails, since a streambuf is non-copyable. My solution was to use a shared_ptr<streambuf> - see the chat_message.hpp code below.
Given that the shared_ptr is reference counted, when the server 'delivers' a read message to multiple clients, the shared pointer reference count on a new message read from a client will increment as it's copied to each of the clients connected to a session, and then decrement as each server client handler write pops the message off its write queue.
At least I assume that is what is going on (I just finished modifying the server and it's working ok); I'll use the debugger and trace the server code tomorrow to check.
Do you see anything wrong with using a shared_ptr<streambuf> member variable? I wanted to use a streambuf so that I could pass it directly to the async_read_until and async_write functions (saving the buffer() conversions used in the original code).
There are negative performance consequences to copying a shared_ptr (ie. incrementing or decrementing its refcount). *Most* applications don't need to care about this (it's very small) but sometimes it's worthy of note, and there's no harm in avoiding copies in silly places (which is why I thwack people that pass a shared_ptr as a value parameter). Personally, though, I use a streambuf only in the connection-management class, which isn't copyable, so I've never had that problem. Messages themselves are copied out of the streambuf into a std::string or some dedicated message data type, and then copies of these can be made reasonably freely. Of course, copying data is also a negative performance consequence, so it's a matter of picking the right tradeoff for your particular application and workload. :)
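i.e., the difference between these two declarations (chat_message is a stand-in for whatever your message type is):

#include <memory>

struct chat_message;  // placeholder message type

// Copies the shared_ptr: an atomic increment on entry and a decrement
// on exit, for no ownership benefit - the "thwack" case.
void deliver_by_value(std::shared_ptr<chat_message> msg);

// No refcount traffic; copy inside only where the callee genuinely
// needs to retain ownership (e.g. binding an async callback).
void deliver_by_ref(const std::shared_ptr<chat_message>& msg);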
Hi Gavin,
I wasn't talking about mixing async with non-async (although you can do that as long as you're careful to not have concurrent operations, it's rarely-to-never useful to do so in practice). I was talking about mixing the two types of read calls ("read", which reads up to a specified number of bytes, and "read_until", which reads up to a specified termination value/sequence).
Thanks for correcting me :)
Do you see anything wrong with using a shared_ptr<streambuf> member variable? I wanted to use a streambuf so that I could pass it directly to the async_read_until and async_write functions (saving the buffer() conversions used in the original code).
There are negative performance consequences to copying a shared_ptr (ie. incrementing or decrementing its refcount). *Most* applications don't need to care about this (it's very small) but sometimes it's worthy of note, and there's no harm in avoiding copies in silly places (which is why I thwack people that pass a shared_ptr as a value parameter).
In the case of the original chat example, the chat message class contained a char array. I added a std::cout message in the constructor, copy constructor, and destructor, and could see that the message was copied numerous times inside the bind command sequence, so figured a shared_ptr<> was a lighter-weight way to go. I then made sure to pass a reference to a shared_ptr<> wherever I could to avoid a head "thwack" :)
Personally, though, I use a streambuf only in the connection-management class, which isn't copyable, so I've never had that problem. Messages themselves are copied out of the streambuf into a std::string or some dedicated message data type, and then copies of these can be made reasonably freely.
Of course, copying data is also a negative performance consequence, so it's a matter of picking the right tradeoff for your particular application and workload. :)
It turns out that my use of a streambuf was not suitable for use within the message class, since the message can be sent to multiple connected clients - sending to the first client "consumes" the streambuf within the shared message, so the next client sees an empty streambuf - oops! The solution was, as you point out, to pass around a message object (containing a char array or std::string), and copy the message into or out of a streambuf as needed.

My modified chat client/server now works with the modified protocol. I've modified the client code to work with a serial port, so that I can connect two clients using two USB-to-Serial ports and a cross-over cable. For some strange reason the async_read_until calls work fine under Cygwin but not Linux, so I'm in the process of tracking that down :)

Thanks again.

Cheers,
Dave
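PS. In sketch form, the working approach (names illustrative; error handling omitted) - each write gets its own copy of the message, owned by the completion handler:

#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/make_shared.hpp>
#include <boost/shared_ptr.hpp>
#include <string>

void on_write(boost::shared_ptr<std::string> /*keeps buffer alive*/,
              const boost::system::error_code& ec,
              std::size_t /*bytes_transferred*/)
{
    // handle ec ...
}

// Copy the encoded message into a per-write buffer, so one client's
// async_write cannot consume data another client still needs.
void deliver(boost::asio::ip::tcp::socket& socket, const std::string& msg)
{
    boost::shared_ptr<std::string> copy =
        boost::make_shared<std::string>(msg);
    boost::asio::async_write(socket, boost::asio::buffer(*copy),
        boost::bind(&on_write, copy,
                    boost::asio::placeholders::error,
                    boost::asio::placeholders::bytes_transferred));
}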
On 27 Jan 2015 at 10:58, Gavin Lambert wrote:
There are negative performance consequences to copying a shared_ptr (ie. incrementing or decrementing its refcount). *Most* applications don't need to care about this (it's very small) but sometimes it's worthy of note, and there's no harm in avoiding copies in silly places (which is why I thwack people that pass a shared_ptr as a value parameter).
As food for thought: AFIO, which uses shared_ptr very heavily indeed to avoid any locking at all, passes them around all by value. It was bugging me whether this was costing me performance, so I tried replacing the lot with reference semantics.

Total effect on performance: ~0.1%.

The key is that AFIO very, very rarely has more than one thread touch a shared_ptr at once. That, on Intel at least, makes their atomic reference counting almost as cheap as non-atomic reference counting. Combine that with the compiler judiciously folding out copies for you where it can, and the overhead for the benefits to debugging and maintenance is irrelevant.

Of course, I'm currently seeing a 300k CPU cycle per op average. shared_ptr is tiny compared to that. With a 10k CPU cycle per op average I might care a bit more.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
On 27/01/2015 15:08, Niall Douglas wrote:
On 27 Jan 2015 at 10:58, Gavin Lambert wrote:
There are negative performance consequences to copying a shared_ptr (ie. incrementing or decrementing its refcount). *Most* applications don't need to care about this (it's very small) but sometimes it's worthy of note, and there's no harm in avoiding copies in silly places (which is why I thwack people that pass a shared_ptr as a value parameter).
As food for thought: AFIO, which uses shared_ptr very heavily indeed to avoid any locking at all, passes them around all by value. It was bugging me whether this was costing me performance, so I tried replacing the lot with reference semantics.
Total effect on performance: ~0.1%.
As I said, it's not a big difference (atomic ops are typically ~1us, and that was on the previous CPU generation), but it's still one of my pet peeves, as while there are many places where shared_ptrs do need to get copied for correctness, parameter passing is not one of those places. (And performance gets worse if you end up passing the object through many layers as part of keeping methods short or similar "tidiness" or abstraction guidelines; and it wastes more stack too.) You're going to have to make lots of copies anyway in an asynchronous library like AFIO, because binding an asynchronous callback is one of those places that you *do* need to copy a shared_ptr, so if you have a high percentage of async code (which is what I would expect with that sort of library) then it's not going to make much difference either way.
The key is that AFIO very, very rarely has more than one thread touch a shared_ptr at once. That, on Intel at least, makes their atomic reference counting almost as cheap as non-atomic reference counting. Combine that with the compiler judiciously folding out copies for you where it can, and the overhead for the benefits to debugging and maintenance is irrelevant.
Writing a single shared_ptr instance from multiple threads requires even more overhead from the extra spinlock (via the atomic_*(&sp...) family of functions). Though an uncontended spinlock basically only costs 2 atomic-ops, so it's usually not too bad. (But those functions do mildly irritate me in that they're also passing by value, but at least in that case they're inlined template methods so the compiler will almost certainly elide the parameter copy. Another case where generic library code may "win" over application code.) Multi-writers is one case where it may be better to create separate per-thread copies from some "safe" context up front, if you can (assuming you're ok with operating on stale data until some sync point). But again, to a certain extent async code patterns may already be doing these copies "for you". And if you're limiting yourself to WORM access only, you can skip the spinlock if you're careful.
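For reference, the family I mean (C++11 <memory>; the Config type is illustrative):

#include <memory>

struct Config { /* shared, read-mostly state */ };

std::shared_ptr<Config> g_current;  // written rarely, read by many threads

// Reader: take a consistent snapshot (implementations typically take
// the internal spinlock mentioned above).
std::shared_ptr<Config> snapshot()
{
    return std::atomic_load(&g_current);
}

// Writer: atomically publish a replacement.
void publish(std::shared_ptr<Config> next)
{
    std::atomic_store(&g_current, std::move(next));
}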
Of course, I'm currently seeing a 300k CPU cycle per op average. shared_ptr is tiny compared to that. With a 10k CPU cycle per op average I might care a bit more.
I'm probably biased the other way, because about half of the code I work on has sub-millisecond budgets. :)
On Mon, Jan 26, 2015 at 10:51 PM, Gavin Lambert wrote:
On 27/01/2015 15:08, Niall Douglas wrote:
As I said, it's not a big difference (atomic ops are typically ~1us, and that was on the previous CPU generation), but it's still one of my pet peeves, as while there are many places where shared_ptrs do need to get copied for correctness, parameter passing is not one of those places. (And performance gets worse if you end up passing the object through many layers as part of keeping methods short or similar "tidiness" or abstraction guidelines; and it wastes more stack too.)
I got distracted by the ~1us estimate you gave here. I just wrote a quick benchmark for an uncontended fetch_add + compare and repeat, and came up with about 22 cycles total per iteration, which is about 7 ns per iteration. If I use a volatile int instead of an atomic, it is just over 2 ns per iteration. It's more expensive, but it seems to be less than an order of magnitude, rather than the 3 orders of magnitude mentioned above. Here's the code for posterity.

#include <atomic>

int main(int argc, char **args)
{
#if 1
    std::atomic<int> count(1000000000);
    while (count.fetch_add(-1, std::memory_order_relaxed));
#else
    volatile int count = 1000000000;
    while (count--);
#endif
    return 0;
}

Sorry to derail the discussion. Carry on.
On 27 Jan 2015 at 8:16, Brian Budge wrote:
I got distracted by the ~1us estimate you gave here. I just wrote a quick benchmark for an uncontended fetch_add + compare and repeat, and came up with about 22 cycles total per iteration, which is about 7 ns per iteration. If I use a volatile int instead of an atomic, it is just over 2 ns per iteration. It's more expensive, but it seems to be less than an order of magnitude, rather than the 3 orders of magnitude mentioned above. Here's the code for posterity.
You may find the results at

https://ci.nedprod.com/view/Boost%20Thread-Expected-Permit/job/Boost.Spinlock%20Test%20Linux%20GCC%204.8/228/console

of interest. Some figures for Haswell uncontended:

=== Binary spinlock performance ===
1. Achieved 102531608.982708 transactions per second
2. Achieved 102767464.092175 transactions per second
3. Achieved 102837332.390959 transactions per second
=== Tristate spinlock performance ===
1. Achieved 97338235.338779 transactions per second
2. Achieved 99064689.648506 transactions per second
3. Achieved 99594486.110628 transactions per second
=== Pointer spinlock performance ===
1. Achieved 85935625.193489 transactions per second
2. Achieved 85551904.532972 transactions per second
3. Achieved 85074926.977000 transactions per second

Haswell contended:

=== Binary spinlock performance ===
1. Achieved 100056328.085670 transactions per second
2. Achieved 99038604.412362 transactions per second
3. Achieved 93814414.369464 transactions per second
=== Tristate spinlock performance ===
1. Achieved 73303113.800913 transactions per second
2. Achieved 87909718.117258 transactions per second
3. Achieved 66449661.784678 transactions per second
=== Pointer spinlock performance ===
1. Achieved 75884031.753741 transactions per second
2. Achieved 80199058.554479 transactions per second
3. Achieved 78455657.805638 transactions per second

One can draw from this that atomics are fast even when contended, if and only if the cache line invalidation coherency traffic is kept below the CPU's coherency bus bandwidth. An uncontended unordered_map:

=== Large unordered_map spinlock write performance ===
1. Achieved 18456343.370007 transactions per second
2. Achieved 18493407.792700 transactions per second
3. Achieved 18589912.112064 transactions per second

Versus a contended one:

=== Large unordered_map spinlock write performance ===
1. Achieved 17174649.408718 transactions per second
2. Achieved 17177112.056269 transactions per second
3. Achieved 17468407.872320 transactions per second

The performance of atomics when you keep cache line invalidations low isn't the problem - modern CPUs are very good at that. The big performance problem *introduced* by the use of atomics is that using them tells the compiler that global state is changing in a way the compiler does not understand. This means that the compiler cannot eliminate any code with a dependency touching any atomic. In my testing of ideas for non-allocating future designs a few months ago, even a non-atomic boolean which, when flipped true, "turned on" the use of atomics over non-atomics made the difference between thousands of opcodes being generated and fewer than five opcodes. Atomics are so fundamentally anti-optimiser that their use ought to be avoided in header-only code as much as possible; they are incredibly penalising. In non-header-only code I wouldn't worry; even with link time optimisation present, compilers do a poor job of optimising past ABI boundaries.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
On 27 Jan 2015 at 19:51, Gavin Lambert wrote:
As I said, it's not a big difference (atomic ops are typically ~1us, and that was on the previous CPU generation), but it's still one of my pet peeves, as while there are many places where shared_ptrs do need to get copied for correctness, parameter passing is not one of those places. (And performance gets worse if you end up passing the object through many layers as part of keeping methods short or similar "tidiness" or abstraction guidelines; and it wastes more stack too.)
A good compiler optimiser will collapse those if the code is header only. In AFIO's case it does exactly that - five nested function calls, each taking a shared_ptr by value, turn into a single shared_ptr copy only.
You're going to have to make lots of copies anyway in an asynchronous library like AFIO, because binding an asynchronous callback is one of those places that you *do* need to copy a shared_ptr, so if you have a high percentage of async code (which is what I would expect with that sort of library) then it's not going to make much difference either way.
You're right in general, and in AFIO until v1.3 of the engine. In v1.4 I'm going even more intrusive, and I expect to elide all but necessary copying completely in the main engine loop. I will do this via the batch detachable and reattachable node_ptr support in my concurrent_unordered_map, basically you can detach and recycle op state rather than ever allocating or deallocating. This should let me stop pinning shared_ptrs to their callbacks as the new custom future implementation will tag a shared_ptr exactly once for you.
Of course, I'm currently seeing a 300k CPU cycle per op average. shared_ptr is tiny compared to that. With a 10k CPU cycle per op average I might care a bit more.
I'm probably biased the other way, because about half of the code I work on has sub-millisecond budgets. :)
300k CPU cycles is still 0.1ms. But no, it's the stochastic variance that upsets me. I want a worst-case latency of 0.1ms; then I would be pleased, I think.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
On 21 Jan 2015 at 21:40, David Hawkins wrote:
Right, I'd be happy using async code, but this particular feature is new with filtering_streams - per the documentation
http://www.boost.org/doc/libs/1_57_0/libs/iostreams/doc/guide/asynchronous.h...
Since async support is new, it might not be fully supported (or at least fully debugged). Now that I am getting more familiar with the Boost code, I need to dig around a little more in the Boost.Iostreams example and test folders.
FYI I don't believe Iostreams has been substantially changed since Boost 1.44. It is currently without a named maintainer.

Also, non-blocking i/o is not async i/o, despite what Iostreams claims in its docs (it seems very confused about the distinction). Async i/o on files lets you easily issue structured queue depth to an optimal level for performance. Non-blocking i/o on files requires a lot more work to achieve the same (and besides, non-blocking i/o on files isn't supported on any major operating system, so Iostreams has to emulate it using inefficient fragment i/o).

ASIO can do async file i/o, though support is not well tested, and strong ordering of writes has to be done by hand. A library in the Boost review queue, AFIO, which extends ASIO, does do async file i/o and makes life somewhat easier, though until coroutine support is added you still end up writing fragments of completion handlers, as with ASIO without coroutines.

AFIO may also add async NFS- and Samba-safe file byte range locking soon. This is a real devil to implement sanely, but I believe I have found a viable if evil algorithm.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
Hi Niall,
Right, I'd be happy using async code, but this particular feature is new with filtering_streams - per the documentation
http://www.boost.org/doc/libs/1_57_0/libs/iostreams/doc/guide/asynchronous.h...
Since async support is new, it might not be fully supported (or at least fully debugged). Now that I am getting more familiar with the Boost code, I need to dig around a little more in the Boost.Iostreams example and test folders.
FYI I don't believe Iostreams has been substantially changed since Boost 1.44. It is currently without a named maintainer.

Also, non-blocking i/o is not async i/o, despite what Iostreams claims in its docs (it seems very confused about the distinction). Async i/o on files lets you easily issue structured queue depth to an optimal level for performance. Non-blocking i/o on files requires a lot more work to achieve the same (and besides, non-blocking i/o on files isn't supported on any major operating system, so Iostreams has to emulate it using inefficient fragment i/o).

ASIO can do async file i/o, though support is not well tested, and strong ordering of writes has to be done by hand. A library in the Boost review queue, AFIO, which extends ASIO, does do async file i/o and makes life somewhat easier, though until coroutine support is added you still end up writing fragments of completion handlers, as with ASIO without coroutines.

AFIO may also add async NFS- and Samba-safe file byte range locking soon. This is a real devil to implement sanely, but I believe I have found a viable if evil algorithm.
Thanks for the insight. I appreciate the feedback I'm getting from the experienced Boost community members. Cheers, Dave
participants (4): Brian Budge, David Hawkins, Gavin Lambert, Niall Douglas