
Oliver,
Do you have some performance data for your fiber implementation? What is the (amortized) overhead introduced for one fiber (i.e. the average time required to create, schedule, execute, and delete one fiber which runs an empty function, when executing a large number of those, perhaps 500,000 fibers)? It would be interesting to see this number when giving 1..N cores to the scheduler.
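For concreteness, here is a minimal sketch of the kind of test I mean. It assumes the boost::fibers::fiber create/join interface and runs the work in batches so that not all 500,000 stacks are live at once; all names, the batch size and the build line are illustrative only.

    // fiber_overhead.cpp - amortized create/schedule/run/destroy cost of a fiber
    // running an empty function. Sketch only: names, batch size and the build
    // line are illustrative, and it assumes the boost::fibers::fiber/join API.
    // Build (roughly): g++ -O2 fiber_overhead.cpp -lboost_fiber -lboost_context
    #include <boost/fiber/all.hpp>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        constexpr std::size_t total = 500000;   // number of fibers to measure
        constexpr std::size_t batch = 10000;    // bound the number of live stacks

        auto const start = std::chrono::steady_clock::now();
        for (std::size_t done = 0; done < total; done += batch) {
            std::vector<boost::fibers::fiber> fibers;
            fibers.reserve(batch);
            for (std::size_t i = 0; i < batch; ++i)
                fibers.emplace_back([] { /* empty body */ });
            for (auto& f : fibers)
                f.join();                       // schedule, run, and reclaim each fiber
        }
        auto const stop = std::chrono::steady_clock::now();

        auto const ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                            stop - start).count();
        std::printf("%.1f ns per fiber (amortized over %zu fibers)\n",
                    static_cast<double>(ns) / total, total);
    }

Scaling this to 1..N cores would additionally require telling the scheduler to use N worker threads, which is a property of the library's scheduling machinery rather than of the test itself.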
Unfortunately I've no performance tests yet - maybe I'll write one after some optimizations (like replacing the STL containers with a singly-linked list of intrusive_ptrs).
I'd write the test before starting to do optimizations.
I'm not sure what a fiber should execute within such a test. Should the fiber-function have an empty body (i.e. execute nothing)? Or should it at least yield one time?
Well, those are two separate performance tests already :-P However, having it yield just adds two more context switches and a scheduling cycle, so I wouldn't expect too much additional insight from this. While you're at it, I'd suggest also writing a test measuring the overhead of using futures. For an idea of what such tests could look like, you might want to glance here: https://github.com/STEllAR-GROUP/hpx/tree/master/tests/performance.
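Such a futures test can be as trivial as the sketch below, written in the spirit of HPX's future_overhead test. It assumes Boost.Fiber exposes boost::fibers::async()/future<> analogous to std::async; if the actual API differs, the structure of the measurement stays the same.

    // future_overhead.cpp - rough cost of one async()/get() round trip.
    // Sketch only: assumes boost::fibers::async()/future<> exist as in the
    // current Boost.Fiber; names and counts are illustrative.
    #include <boost/fiber/all.hpp>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>

    static int trivial_task() { return 42; }   // the "work" is intentionally trivial

    int main() {
        constexpr std::size_t count = 100000;

        auto const start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < count; ++i) {
            boost::fibers::future<int> f = boost::fibers::async(trivial_task);
            f.get();   // launch a fiber, run the task, transport the result back
        }
        auto const stop = std::chrono::steady_clock::now();

        auto const ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                            stop - start).count();
        std::printf("%.1f ns per future\n", static_cast<double>(ns) / count);
    }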
If the code executed by the fiber does nothing, then the execution time will be determined by the memory allocation algorithm of the C library, the context switches for resuming and suspending the fiber, and the time required to insert and remove the fiber from the ready-queue inside the fiber-scheduler.
Those are assumptions which are by no means conclusive. From our experience with HPX (https://github.com/STEllAR-GROUP/hpx), the overheads for a fiber (which is an hpx::thread in our case) are determined by many more factors than just the memory allocator. Things like contention caused by work stealing, or NUMA effects such as when you start stealing across NUMA domains, usually overshadow the memory allocation costs. Additionally, the quality of the scheduler implementation greatly affects things.
This queue is currently an STL container and will be replaced by a singly-linked list of intrusive_ptrs.
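As an illustration of the idea only (not the exact representation I have in mind, which is the singly-linked list of intrusive_ptrs mentioned above), an allocation-free ready-queue could also be built on Boost.Intrusive's slist, where the link lives inside the fiber's own control block:

    // Illustrative ready-queue with no heap allocation per enqueue/dequeue.
    // "fiber_context" is a made-up name standing in for the fiber's control block.
    #include <boost/intrusive/slist.hpp>

    namespace bi = boost::intrusive;

    struct fiber_context : public bi::slist_base_hook<> {
        // ... the fiber's state (stack pointer, status, ...) would live here ...
    };

    // cache_last<true> gives O(1) push_back, which a FIFO scheduler needs.
    using ready_queue = bi::slist<fiber_context,
                                  bi::cache_last<true>,
                                  bi::constant_time_size<false>>;

    int main() {
        fiber_context a, b;
        ready_queue rq;

        rq.push_back(a);                    // enqueue without allocating
        rq.push_back(b);

        fiber_context& next = rq.front();   // pick the next fiber to run ...
        rq.pop_front();                     // ... and remove it, again without allocating
        (void)next;
    }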
If you had a performance test you'd immediately see whether this improves your performance. Doing optimizations based on gut feeling is most of the time not very effective; you need measurements to support your work.
A context switch (suspending/resuming a coroutine) takes ca. 80 CPU cycles on an Intel Core2 Q6700 (64-bit Linux).
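A number in that ballpark can be reproduced with a loop that does nothing but bounce control between two contexts. The sketch below uses the current boost::context::fiber class and wall-clock time (cycles then follow from the clock rate); the actual measurement may have been done against a different Boost.Context interface, and all names are illustrative.

    // switch_cost.cpp - rough measurement of a single suspend/resume.
    // Build (roughly): g++ -O2 switch_cost.cpp -lboost_context
    #include <boost/context/fiber.hpp>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <utility>

    namespace ctx = boost::context;

    int main() {
        constexpr std::size_t switches = 1000000;

        // The fiber does nothing but bounce control straight back to main().
        ctx::fiber f{ [](ctx::fiber&& main_ctx) {
            for (;;)
                main_ctx = std::move(main_ctx).resume();
            return std::move(main_ctx);   // never reached
        } };

        auto const start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < switches; ++i)
            f = std::move(f).resume();    // one round trip = two context switches
        auto const stop = std::chrono::steady_clock::now();

        auto const ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                            stop - start).count();
        std::printf("%.1f ns per switch (multiply by the clock rate for cycles)\n",
                    static_cast<double>(ns) / (2.0 * switches));
    }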
Sure, but this does not tell you how much time is consumed by executing those. The actual execution time will be determined by many factors, such as caching effects, TLB misses, memory bandwidth limitations and other contention effects. IMHO, for this library to be accepted, it has to prove to be of high quality, which implies best possible performance. You might want to compare the performance of your library with other existing solutions (for instance TBB, qthreads, OpenMP, HPX). The link I provided above will give you a set of trivial tests for those. Moreover, we'd be happy to add an equivalent test for your library to our repository.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu