Hello,

Following a previous thread that asked about how to parallelize a large number of calculations, I took this summary. I am thinking of choosing a simple model whereby:

- There are M computers (possibly heterogeneous). I will stick to 1 process per computer.
- Each process will have N threads, giving a total of M*N "execution units".
- I will take a simple solution in that M and N are fixed, though determined at runtime.

So there is this pool of M*N execution units and I give them 100,000 tasks to do. Tasks scheduled in the same process/computer share their memory. Each of the computers gets a duplicate of the memory used as input to the tasks.

For the thread pool, TBB and boost::asio were suggested; I believe there is also threadpool.sourceforge.net. For the cross-computer communication, boost::mpi.

Is this something boost::mpi + an MPI implementation can help with?

Regards,
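The two-level division described above (100000/M tasks per machine, then that block divided by N per thread) can be computed up front. A minimal sketch in plain C++; the `split` helper and the sample values of M and N are illustrative, not from any library:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Split `total` tasks as evenly as possible into `parts` contiguous
// [begin, end) ranges; the first (total % parts) ranges get one extra task.
std::vector<std::pair<std::size_t, std::size_t>>
split(std::size_t total, std::size_t parts) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t base = total / parts, extra = total % parts, begin = 0;
    for (std::size_t i = 0; i < parts; ++i) {
        std::size_t len = base + (i < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + len);
        begin += len;
    }
    return ranges;
}

// Two-level use: split(100000, M) gives each machine its block of tasks,
// and each machine then calls split(block_size, N) for its own threads,
// which yields the flat M*N view of the task pool.
```

For example, `split(100000, 7)` hands the first five machines 14286 tasks and the last two 14285, so every task is assigned exactly once.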
Hi Hicham -
Yes, you can use MPI (possibly through boost::mpi) to distribute tasks to multiple machines, and then use threads on those machines to work on finer-grained portions of those tasks. From another thread on this list, there are constructs in boost::asio that handle task queuing for the thread tasks.
Brian
On Wed, Nov 3, 2010 at 3:43 PM, Hicham Mouline wrote:
_______________________________________________
Boost-users mailing list
Boost-users@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users
On Thu, Nov 4, 2010 at 8:16 AM, Brian Budge wrote:
Hi Hicham -
Yes, you can use MPI (possibly through boost::mpi) to distribute tasks to multiple machines, and then use threads on those machines to work on finer grained portions of those tasks. From another thread on this list, there are constructs in boost::asio that handle task queuing for the thread tasks.
If I were you I would start by trying to do this with N processes per machine, rather than N threads, since you need the MPI communication anyway.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
-----Original Message-----
From: boost-users-bounces@lists.boost.org [mailto:boost-users-bounces@lists.boost.org] On Behalf Of Dave Abrahams
Sent: 03 November 2010 23:54
To: boost-users@lists.boost.org
Subject: Re: [Boost-users] hybrid parallelism
If I were you I would start by trying to do this with N processes per machine, rather than N threads, since you need the MPI communication anyway.
Just temporarily? You would still, after that, add a layer of multithreading to each process, and have only 1 process per machine, no?

1 process with N threads on 1 machine probably gives better total wall time than N single-threaded processes, because there is no need to duplicate the input memory for the tasks.

The question I really wanted to ask is that I expect to have M*N outstanding threads (M computers, N threads in each process) just sitting there waiting for jobs. Then, from the user interface, I click and that starts 100,000 tasks, which are spread over the M machines and the N threads in each process. Then the results come back and are displayed... Then the user clicks again and the same thing happens. You're saying this is doable with Boost.MPI + an MPI implementation?

I wasn't expecting to divide the tasks into finer-grained ones. All the tasks are atomic and have about the same execution time. It's rather: pass 100000/M tasks to each machine, then divide this number by N for each thread in that process. This last bit is up to me to code. Ideally, the task is just a functor with an operator() member, and the M machines and N threads are treated uniformly. I guess it's up to me to write some abstraction layer to view the whole M*N set in a flat way.

Other questions, more architectural in nature; I'm not sure they are best asked here?

Regards,
At Thu, 4 Nov 2010 00:19:58 -0000, Hicham Mouline wrote:
If I were you I would start by trying to do this with N processes per machine, rather than N threads, since you need the MPI communication anyway.
Just temporarily? You would still after that add a layer of multithreading to each process, and have only 1 process per machine, after that, no?
Not necessarily. It would depend if the performance was good enough or not.
1 process with N threads on 1 machine probably gives better total wall time than N single-threaded processes, because there is no need to duplicate the input memory for the tasks.
You can use shared memory for that, if you want. Whether there is a tangible advantage probably depends on the size of your input.
The question I really wanted to ask about is that I expect to have M*N outstanding threads (M computers, N threads in each process) just sitting there waiting for jobs. Then from the user interface, I click and that starts 100000 tasks, then it is spread all over the M machines and N threads in each process. Then result comes back, displayed... Then user clicks again and same thing happens.
You're saying this is doable with Boost.MPI + MPI impl?
If you are determined to have threads, you'll need more than just MPI, but yes, you can use Boost.MPI to handle the communication and synchronization across machines.
I wasn't expecting to divide the tasks into finer-grained ones. All the tasks are atomic and have about the same execution time. It's rather: pass 100000/M tasks to each machine, then divide this number by N for each thread in that process. This last bit is up to me to code. Ideally, the task is just a functor with an operator() member, and the M machines and N threads are treated uniformly. I guess it's up to me to write some abstraction layer to view the whole M*N set in a flat way.
That's why I'd start with processes and MPI; it already has the abstraction layers that make two processes running on the same machine look identical to the same two processes running on different machines.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
Do these tasks share a lot of data? If they are really lightweight memory-wise, heavy computationally, and don't require fine-grained communication with each other, I'd go with David's suggestion, as it will be easier to write, and the performance won't be much different.

If you use a lot of memory, need fine-grained chatter between tasks, or the tasks are pretty cheap, threads may be (much) better.
Brian
On Wed, Nov 3, 2010 at 5:19 PM, Hicham Mouline wrote:
On Nov 4, 2010, at 3:14, Brian Budge wrote:
Do these tasks share a lot of data? If they are really lightweight memory-wise, heavy computationally, and don't require fine-grained communication with each other, I'd go with David's suggestion, as it will be easier to write, and the performance won't be much different.
If you use a lot of memory, need fine-grained chatter between tasks, or the tasks are pretty cheap, threads may be (much) better.
Brian
I second this opinion, for several reasons.

First, mixing MPI with multithreading can be hard, since many MPI implementations are not thread-safe. Be sure to only let the master thread use MPI.

Secondly, it adds another level of complexity. Starting M*N MPI processes is much easier, unless you waste too much memory that way.

Third, we have just recently benchmarked several multithreaded LAPACK routines in the Intel, AMD and other LAPACK libraries and compared them to the MPI-based routines in ScaLAPACK. Surprisingly, the MPI implementations outperformed the multithreaded ones by a large margin. For me that shows that, at least in these applications, it is easier to write efficiently parallel code using MPI than using multithreading, and the advantage easily overshadows any loss in efficiency due to the distributed-memory nature. Keep in mind that between processes on the same computer, MPI uses a shared-memory mechanism to send data and does not use the network.

Matthias
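One common way to honor the "only the master thread uses MPI" rule is to have worker threads deposit results into a locked outbox that the master alone drains before talking to MPI (MPI_Init_thread with the MPI_THREAD_FUNNELED level declares exactly this usage to the library). Below is a sketch of that structure in standard C++ with the MPI call stubbed out; Outbox and run_master are illustrative names, not library API:

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Workers never touch MPI: they deposit results into a shared outbox.
// Only the master thread drains the outbox and performs the (stubbed)
// MPI sends, so a non-thread-safe MPI implementation is used safely.
struct Outbox {
    std::mutex m;
    std::vector<int> results;
    void deposit(int r) {
        std::lock_guard<std::mutex> lk(m);
        results.push_back(r);
    }
};

int run_master(unsigned n_workers) {
    Outbox box;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n_workers; ++i)
        workers.emplace_back([&box, i] {
            box.deposit(static_cast<int>(i) * 2);  // stand-in for real work
        });
    for (auto& w : workers) w.join();

    // The master thread alone "sends" the gathered results; in a real
    // program this is where boost::mpi / MPI_Send would be called.
    int sent = 0;
    for (int r : box.results) sent += r;
    return sent;
}
```

The design point is that MPI calls are funneled through a single thread, so the MPI library never sees concurrent calls regardless of how many worker threads run.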
At Thu, 4 Nov 2010 16:34:19 +0100, Matthias Troyer wrote:
I second this opinion for several reasons
First, mixing MPI with multithreading can be hard since many MPI implementations are not thread safe. Be sure to only let the master thread use MPI.
Good point; I hadn't thought of that.
Secondly, it adds another level of complexity. Starting M*N MPI processes is much easier, unless you waste too much memory that way.
That was mostly what I had in mind.
Third, we have just recently benchmarked several multithreaded LAPACK routines in the Intel, AMD and other lapack libraries and compared them to the MPI based routines in SCALAPACK. Surprisingly the MPI implementations outperformed the multithreaded ones by a large margin.
Was that done running the MPI processes together on one machine?

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
On 5 Nov 2010, at 01:46, David Abrahams wrote:
Was that done running the MPI processes together on one machine?
Yes, on a 16-core Sun Blade and also on other shared-memory workstations. It seems to show (at least this is my first conclusion) that the more explicit memory management and explicit barriers in the MPI-based program can give faster code than the easier-to-write multithreaded method.

Matthias
participants (5)

- Brian Budge
- Dave Abrahams
- David Abrahams
- Hicham Mouline
- Matthias Troyer