Hartmut Kaiser
There's also the issue of the multi-in (N:1 & N:M) transformations, i.e. those that take input queues. They would either have to take a whole thread, provide some kind of yield, or something else. I'd probably ignore this for now (just consume a thread) and focus on the simple 1:1 and 1:N cases, as they naturally return after each transformation.
All of that functionality is available from HPX. It gives you work-queue based scheduling (with work stealing) of suspendable/resumable threads (i.e. supporting yield) with very little overhead. It also manages the 1:N or N:M threading for you.
You could try building all of your functionality on top of HPX first. That would let you figure out the actual underlying mechanisms your library would rely on. Later on you can move it to Boost, once all of the required functionality has been accepted there.
There is also Christophe Henry's Boost.Asynchronous [0]. Just like the HPX guys, he spent a lot of time thinking about this kind of problem and, from what I understand, his solution is reasonable. The interface is still a little rough around the edges, but that can be changed. Advantage over HPX: it is a library already targeted at becoming part of Boost one day.
As for the original question, I think option one (dedicate a thread to each segment) is fine for now. Maybe it can be implemented with a pseudo task abstraction that can be exchanged for a real task abstraction (or whatever HPX or Boost.Asynchronous call it) once it is available. I would consider a fixed-size threadpool with not enough threads to run all segments a runtime error (or a compile-time error, if the threadpool allows for this) - for now.
[0] https://github.com/henry-ch/asynchronous
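To make the "pseudo task abstraction" idea a bit more concrete, here is a rough sketch using only the standard library. The class name `segment_runner` and its members are made up for illustration; nothing here is HPX or Boost.Asynchronous API. It dedicates one `std::thread` to each segment and treats running out of threads as a runtime error, as suggested above.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical placeholder for a real task abstraction: each segment gets
// its own dedicated thread, up to a fixed limit chosen at construction.
class segment_runner {
public:
    explicit segment_runner(std::size_t max_threads)
        : max_threads_(max_threads) {}

    // Join any segments still running so no std::thread is destroyed joinable.
    ~segment_runner() { join_all(); }

    segment_runner(const segment_runner&) = delete;
    segment_runner& operator=(const segment_runner&) = delete;

    // Start a segment on its own thread. Having more segments than threads
    // is reported as a runtime error instead of queueing the segment.
    void launch(std::function<void()> segment) {
        if (threads_.size() >= max_threads_) {
            throw std::runtime_error("not enough threads to run all segments");
        }
        threads_.emplace_back(std::move(segment));
    }

    // Wait for all launched segments to finish and release their slots.
    void join_all() {
        for (auto& t : threads_) {
            if (t.joinable()) {
                t.join();
            }
        }
        threads_.clear();
    }

    // Number of segments launched and not yet joined.
    std::size_t running() const { return threads_.size(); }

private:
    std::size_t max_threads_;
    std::vector<std::thread> threads_;
};
```

Because callers only see `launch` and `join_all`, the body could later be swapped for whatever task type a real scheduler provides without touching the segment code. Whether that interface is the right seam is of course an assumption on my part.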