[asio] io_service threadpool performance
Dear all, I have been testing asio's io_service in a threadpool setup for job dispatching. However, it seems as if adding threads doesn't improve performance; perhaps even the opposite with 1 thread having the best performance. See below for some results of a simple test I'm doing, posting 10 M jobs to the io_service, and starting N threads at io_service::run after that. Timings below are measured on an 8-core machine, I would expect the execution performance to improve (not to get worse) for execution by more threads. Posting to the io_service is done single-threaded, so these timings should remain approx. the same. Please find attached the test program. Is there something I've missed and/or should tweak to get the desired improvement per added thread? Many thanks, Kind regards, Rutger ter Borg Concurrency = 1 Finished posting after: 3.15 Finished execution after: 5.44 Execs / sec: 1e+07/2.29=4.36681e+06 Concurrency = 2 Finished posting after: 2.85 Finished execution after: 5.47 Execs / sec: 1e+07/2.62=3.81679e+06 Concurrency = 3 Finished posting after: 3.15 Finished execution after: 11.65 Execs / sec: 1e+07/8.5=1.17647e+06 Concurrency = 4 Finished posting after: 3.15 Finished execution after: 9.8 Execs / sec: 1e+07/6.65=1.50376e+06 Concurrency = 5 Finished posting after: 3.28 Finished execution after: 12.45 Execs / sec: 1e+07/9.17=1.09051e+06 Concurrency = 6 Finished posting after: 3.29 Finished execution after: 8.84 Execs / sec: 1e+07/5.55=1.8018e+06 Concurrency = 7 Finished posting after: 3.51 Finished execution after: 10.09 Execs / sec: 1e+07/6.58=1.51976e+06 Concurrency = 8 Finished posting after: 3.38 Finished execution after: 12.54 Execs / sec: 1e+07/9.16=1.0917e+06
Hi!
posting 10 M jobs to the io_service, and starting N threads at io_service::run after that. Timings below are measured on an 8-core machine; I would expect the execution performance to improve (not to get worse) with more threads.

I suppose you are the one instantiating the connection. I had exactly the same symptoms a week ago... not sure if this is the same problem. However, instead of using a threadpool, I was using a thread_group. The pseudocode looked something like this: for (int i = 0; i < n; i++) { tg.create_thread(bind(&io_service::run, &io_service_)); } Performance was pretty poor... so to fix the issue, all I had to do was add usleep(1000) before each call. Just for reference, it took 13 s to transfer a 1 GB file, instead of 50 s. I haven't had time to investigate what was causing the issue. HTH. :)

vjeko
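[A compilable version of that pseudocode with the usleep() workaround in place, where io_service_ and n stand in for the real code's members, might be:]

    #include <unistd.h>            // usleep()
    #include <boost/asio.hpp>
    #include <boost/bind.hpp>
    #include <boost/thread.hpp>

    void start_pool(boost::thread_group& tg,
                    boost::asio::io_service& io_service_, int n)
    {
        for (int i = 0; i < n; ++i) {
            usleep(1000); // the staggered start that fixed the slowdown
            tg.create_thread(
                boost::bind(&boost::asio::io_service::run, &io_service_));
        }
    }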
Vjekoslav Brajkovic wrote:
I suppose you are the one instantiating the connection.
This is apart from any connection; it's measuring io_service as a pure job dispatcher.
I had exactly the same symptoms a week ago... not sure if this is the same problem. However, instead of using a threadpool, I was using a thread_group. The pseudocode looked something like this:
Interesting -- what would be the expected performance difference between a thread group and a pool of threads?
I haven't had time to investigate what was causing the issue.
Do you know if all of io_service's jobs are dispatched through a platform-specific dispatcher, e.g., epoll?

Thanks,
Rutger
On Fri, Mar 13, 2009 at 2:32 PM, Rutger ter Borg wrote:
I have been testing asio's io_service in a threadpool setup for job dispatching. However, it seems as if adding threads doesn't improve performance; perhaps even the opposite with 1 thread having the best performance.
Someone more familiar with the implementation could comment, but just poking around through the implementation it appears there is one queue of handlers that will get shared by all threads; right there I think there'll be a lot of lock contention between threads on the single queue.

I tried translating this example to Intel's TBB library, and started to see concurrency effects as I moved up beyond 8 threads on my quad-core box (using parallel_for with a blocked_range that results in a single call to f per task). Increasing the amount of work done per task (by increasing the size of the blocked_range passed to parallel_for) speeds up the run greatly, presumably because of the reduced number of tasks and reduced context switching.

I'm guessing that the asio io_service isn't really geared towards effective use of multi-core CPUs where you're trying to schedule a large number of small computational tasks; I'll go out on a limb and say that this *wasn't* the intent of the library (as the name somewhat implies).

Not sure if that was helpful, but it let me play around with TBB, which seems very nice.

Cheers,
Oliver
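[The TBB translation described would look roughly like this; f and the grain size are placeholders, and a grain of 1 gives the single-call-to-f-per-task case:]

    #include <cstddef>
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>

    void f(); // the same trivial job as in the asio test

    struct run_f
    {
        void operator()(const tbb::blocked_range<std::size_t>& r) const
        {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                f();
        }
    };

    void run_tbb(std::size_t n_jobs, std::size_t grain)
    {
        // Larger grains bundle more work per task and cut scheduling overhead.
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, n_jobs, grain), run_f());
    }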
Someone more familiar with the implementation could comment, but just poking around through the implementation it appears there is one queue of handlers that will get shared by all threads; right there I think there'll be a lot of lock contention between threads on the single queue.
What if you use an io_service-per-CPU approach? How does it affect the performance?
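[The io_service-per-CPU idea is roughly the io_service_pool class from asio's HTTP server 2 example: one io_service and one thread per core, so threads never share a handler queue. A sketch, where the round-robin hand-out is an assumption:]

    #include <cstddef>
    #include <vector>
    #include <boost/asio.hpp>
    #include <boost/bind.hpp>
    #include <boost/shared_ptr.hpp>
    #include <boost/thread.hpp>

    class io_service_pool
    {
    public:
        explicit io_service_pool(std::size_t n) : next_(0)
        {
            for (std::size_t i = 0; i < n; ++i)
                io_services_.push_back(boost::shared_ptr<boost::asio::io_service>(
                    new boost::asio::io_service));
        }

        // Pick the next io_service to post work to, round-robin.
        boost::asio::io_service& get()
        {
            boost::asio::io_service& io = *io_services_[next_];
            next_ = (next_ + 1) % io_services_.size();
            return io;
        }

        // One thread per io_service; returns when all queues are drained.
        void run()
        {
            boost::thread_group threads;
            for (std::size_t i = 0; i < io_services_.size(); ++i)
                threads.create_thread(boost::bind(
                    &boost::asio::io_service::run, io_services_[i].get()));
            threads.join_all();
        }

    private:
        std::vector< boost::shared_ptr<boost::asio::io_service> > io_services_;
        std::size_t next_;
    };

[For the benchmark, each job would go to pool.get().post(...) before calling pool.run().]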
Oliver Seiler wrote:
I'm guessing that the asio io_service isn't really geared towards effective use of multi-core CPUs where you're trying to schedule a large number of small computational tasks; I'll go out on a limb and say that this *wasn't* the intent of the library (as the name somewhat implies).
Not sure if that was helpful, but it let me play around with TBB, which seems very nice.
You're saying that it is taken for granted that ASIO is bad at handling a great number of (small) tasks? If that is true, then asio must also be bad at handling a large number of small network messages. I.e., I shouldn't try to handle all the data of a couple of NICs with ASIO, at least not in a threadpool setup?

I was under the impression that ASIO is a high-performance asynchronous event and IO library, and as such, is good at everything it does... Perhaps a lock-free task-queue would change things for the better.

Thanks for pointing out TBB, I'll take a look -- however I'm primarily interested in taking message handling/event handling to the max.

Cheers,
Rutger
On Tue, Mar 17, 2009 at 2:44 PM, Rutger ter Borg wrote:
[...] You're saying that it is taken for granted that ASIO is bad at handling a great number of (small) tasks? If that is true, then asio must also be bad at handling a large number of small network messages. I.e., I shouldn't try to handle all the data of a couple of NICs with ASIO, at least not in a threadpool setup?
As I said, perhaps someone more familiar with it could comment; that assessment came from trying out the sample code, playing around with it, and playing around with TBB. There appears to be a 2-lock queue implementation for the handler_queue that is used by the io_service for dispatching handlers; I can't really tell whether it is being used, or whether it has to be enabled manually. This might help performance in the sample.
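[For context, a two-lock queue in the Michael & Scott style keeps separate head and tail locks, so producers and consumers contend only with their own kind. A minimal sketch of the idea, not asio's actual code:]

    #include <boost/thread/mutex.hpp>

    template <typename T>
    class two_lock_queue
    {
        struct node
        {
            T value;
            node* next;
            node() : value(), next(0) {}
        };

    public:
        two_lock_queue() { head_ = tail_ = new node; }

        ~two_lock_queue()
        {
            while (head_) { node* n = head_; head_ = head_->next; delete n; }
        }

        void push(const T& value)
        {
            node* n = new node;
            n->value = value;
            boost::mutex::scoped_lock lock(tail_mutex_);
            tail_->next = n;   // link in at the tail...
            tail_ = n;         // ...and swing the tail pointer
        }

        bool try_pop(T& value)
        {
            boost::mutex::scoped_lock lock(head_mutex_);
            node* first = head_->next;  // head_ is always a dummy node
            if (!first)
                return false;           // queue empty
            value = first->value;
            delete head_;               // old dummy retires...
            head_ = first;              // ...consumed node becomes the new dummy
            return true;
        }

    private:
        // As in the published algorithm, the one word producer and consumer can
        // both touch (the dummy's next pointer when the queue is empty) is
        // assumed to be read/written atomically by the hardware.
        node* head_;              // guarded by head_mutex_ (consumers)
        node* tail_;              // guarded by tail_mutex_ (producers)
        boost::mutex head_mutex_;
        boost::mutex tail_mutex_;
    };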
I was under the impression that ASIO is a high-performance asynchronous event and IO library, and as such, is good at everything it does... Perhaps a lock-free task-queue would change things for the better.
You should maybe develop a more realistic test. The sample code was testing the parallelism of the dispatching code in a sort of worst case (a trivial CPU-bound operation that probably doesn't even need to access any memory), and seemingly it doesn't scale well to multi-core hardware. But unless you're doing something just as trivial in your handlers as in the sample, so what?

Having written similar things (i.e., asynchronous message passing to/from network sockets), I've never really found much use, performance-wise, for more than one thread dealing with select/poll/epoll/etc. on a pool of sockets, compared to the overhead of the computation/IO associated with actually doing something with what gets read off the wire.

Anyway, I'm not trying to dissuade you from using ASIO, nor trying to imply that it isn't high-performance (high-performance compared to what, for example?). I have no idea what your intended use is. I do notice that the ASIO documentation doesn't really focus on use beyond device IO, and it seems perfectly suitable to that. Based solely on that documentation, and having now looked a bit at the implementation, I would tend to prefer something like TBB or a simple thread-pool implementation for dispatching computational work, rather than doing it in ASIO.
Thanks for pointing out TBB, I'll take a look -- however I'm primarily interested in taking message handling/event handling to the max.
Depends on what happens in the event handling; if you're doing something to disk or a database, I'd suggest you not worry about this aspect of it. If you're doing some sort of computation, bundle up more work per message.
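[Concretely, bundling up more work per message in the original test could mean posting one handler that runs a batch of k jobs rather than k separate handlers; the names here are stand-ins:]

    #include <algorithm>
    #include <cstddef>
    #include <boost/asio.hpp>
    #include <boost/bind.hpp>

    void job(); // the trivial unit of work, as before

    // Execute a whole batch of jobs inside a single handler.
    void run_batch(std::size_t k)
    {
        for (std::size_t i = 0; i < k; ++i)
            job();
    }

    void post_batched(boost::asio::io_service& io_service,
                      std::size_t n_jobs, std::size_t batch)
    {
        // n_jobs / batch handlers instead of n_jobs handlers, so the shared
        // queue (and its lock) is touched far less often per unit of work.
        for (std::size_t i = 0; i < n_jobs; i += batch)
            io_service.post(
                boost::bind(&run_batch, std::min(batch, n_jobs - i)));
    }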
participants (4)

- Igor R
- Oliver Seiler
- Rutger ter Borg
- Vjekoslav Brajkovic