Re: [Boost-users] Hybrid parallelism, no more + mpi+serialization, many questions
hello,

Subsequent to a previous thread asking whether to combine MPI and OpenMP to parallelize a large problem, I've been advised to go through MPI only, as it would be simpler, and because MPI implementations on the same box use shared memory, which doesn't have a huge cost (still some, compared to a single-process multithreaded program where objects are actually shared naturally). Writing this, a question comes up:

1. In the "shared memory" of many MPI processes on the same box, is an object (say a list of numbers) actually shared between the two processes' address spaces? I guess not, unless one explicitly makes it so with the "shared memory API" (Unix-specific?).

So, I currently have a serial application with a GUI that runs some calculations. My next step is to use Open MPI with the help of the Boost.MPI wrapper library in C++ to parallelize those calculations. There is a set of static data objects created once at startup or loaded from files.

2. What are the pros/cons of loading the static data objects individually from each separate MPI process vs. broadcasting the static data via MPI itself, after only the master reads/sets up the static data?

3. Is it possible to choose the binary archive instead of the text archive when serializing my user-defined types? Where do I deal with the endianness issue, given that I may have Intel/Sparc/PowerPC CPUs?

regards,
On Tue, Nov 16, 2010 at 10:22 AM, Hicham Mouline wrote:
hello,
Subsequent to a previous thread asking whether to combine MPI and OpenMP to parallelize a large problem, I've been advised to go through MPI only, as it would be simpler, and because MPI implementations on the same box use shared memory, which doesn't have a huge cost (still some, compared to a single-process multithreaded program where objects are actually shared naturally). Writing this, a question comes up:

1. In the "shared memory" of many MPI processes on the same box, is an object (say a list of numbers) actually shared between the two processes' address spaces? I guess not, unless one explicitly makes it so with the "shared memory API" (Unix-specific?).
In MPI, each process has access only to the memory that it directly controls, and data must be explicitly transferred between processes, even if that memory is physically shared. If you break that model, you are playing with fire.
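A minimal sketch of what "explicitly transferred" means in Boost.MPI terms; the vector stands in for the "list of numbers" from question 1, and the ranks and tag are arbitrary illustrative choices:

    #include <boost/mpi.hpp>
    #include <boost/serialization/vector.hpp> // lets Boost.MPI serialize std::vector
    #include <iostream>
    #include <vector>

    int main(int argc, char* argv[])
    {
        boost::mpi::environment env(argc, argv);
        boost::mpi::communicator world;

        if (world.rank() == 0) {
            std::vector<double> numbers(1000, 3.14);
            world.send(1, 0, numbers);   // rank 1 sees nothing until this send
        } else if (world.rank() == 1) {
            std::vector<double> numbers; // a separate object in a separate address space
            world.recv(0, 0, numbers);   // filled only by the explicit receive
            std::cout << "rank 1 received " << numbers.size() << " values\n";
        }
        return 0;
    }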
So, I currently have a serial application with a GUI that runs some calculations. My next step is to use OpenMPI with the help of the Boost.MPI wrapper library in C++ to parallelize those calculations. There is a set of static data objects created once at startup or loaded from files.
2. What are the pros/cons of loading the static data objects individually from each separate MPI process vs. broadcasting the static data via MPI itself, after only the master reads/sets up the static data?
It is easier to load them from disk on each process (you don't have to deal with serialization and passing the structure). Typically you will not see a performance problem if this is only a one-time startup cost and if you are not loading massive data files from a file system with weak IO capabilities onto very many MPI processes.
3. Is it possible to choose the binary archive instead of the text archive when serializing my user-defined types? Where do I deal with the endianness issue given that I may have Intel/Sparc/PowerPC CPUs?
Not sure how boost::serialization handles that one... There are probably compiler flags that you can set to change endian-ness if needed though.
regards,
At Tue, 16 Nov 2010 12:41:08 -0700, James C. Sutherland wrote:
Where do I deal with the endianness issue given that I may have Intel/Sparc/PowerPC CPUs?
Not sure how boost::serialization handles that one... There are probably compiler flags that you can set to change endian-ness if needed though.
IIUC MPI, and thus Boost.MPI, handles it for you transparently.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
On 16 November 2010 at 20:19, David Abrahams wrote:
At Tue, 16 Nov 2010 12:41:08 -0700, James C. Sutherland wrote:
Where do I deal with the endianness issue given that I may have Intel/Sparc/PowerPC CPUs?
Not sure how boost::serialization handles that one... There are probably compiler flags that you can set to change endian-ness if needed though.
IIUC MPI, and thus Boost.MPI, handles it for you transparently.
I'm a bit unclear. MPI uses serialization to serialize user-defined types (you write the serialize template function). I don't know if MPI lets you choose whether you want a binary archive or a text/xml archive. If you can choose the binary archive, wouldn't the issue then be with serialization rather than MPI? What about primitive types like a double?

This will be a quick test.

regards,
On Tue, Nov 16, 2010 at 9:52 PM, Hicham Mouline wrote:
MPI uses serialization to serialize user-defined types (you write the serialize template function). I don't know if MPI lets you choose whether you want a binary archive or a text/xml archive.
Boost.MPI chooses the archive: its own specialized archive version.

In general, with Boost.Serialization, the serialization functions you write are independent of the archive: they work equally well with a text/xml archive as with a binary archive. Boost.MPI exploits this and defines its own archive types to translate your classes into something that (C-level) MPI can handle.

You can influence the process to gain some more speed with types that directly map to MPI types; see the Boost.MPI manual:

http://www.boost.org/doc/libs/1_44_0/doc/html/mpi/tutorial.html#mpi.performance_optimizations

Best regards,
Riccardo

--
Riccardo Murri, Grid Computing Competence Centre, http://www.gc3.uzh.ch/
Organisch-Chemisches Institut, University of Zurich
Winterthurerstrasse 190, CH-8057 Zürich (Switzerland)
Tel: +41 44 635 4222    Fax: +41 44 635 6888
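A sketch of the optimization Riccardo points to, assuming a fixed-layout struct (the type below is made up for illustration): marking it with BOOST_IS_MPI_DATATYPE lets Boost.MPI build an MPI datatype for it once, instead of running the serialization code for every value sent.

    #include <boost/mpi.hpp>
    #include <boost/mpi/datatype.hpp>

    struct sample_point {                // hypothetical fixed-layout type
        double value;
        int index;

        // still required: Boost.MPI walks it once to learn the layout
        template <class Archive>
        void serialize(Archive& ar, const unsigned int /*version*/)
        {
            ar & value & index;
        }
    };

    // the opt-in: declares that sample_point maps directly to an MPI datatype
    BOOST_IS_MPI_DATATYPE(sample_point)

This is only safe for types whose layout MPI can describe directly: no pointers and no dynamically sized members such as std::vector.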
On 16 November 2010 at 21:08, Riccardo Murri wrote:
On Tue, Nov 16, 2010 at 9:52 PM, Hicham Mouline wrote:

MPI uses serialization to serialize user-defined types (you write the serialize template function). I don't know if MPI lets you choose whether you want a binary archive or a text/xml archive.
Boost.MPI chooses the archive: its own specialized archive version.
In general, with Boost.Serialization, the serialization functions you write are independent of the archive: they work equally well with a text/xml archive as with a binary archive. Boost.MPI exploits this and defines its own archive types to translate your classes into something that (C-level) MPI can handle.
You can influence the process to gain some more speed with types that directly map to MPI types, see the Boost.MPI manual:
http://www.boost.org/doc/libs/1_44_0/doc/html/mpi/tutorial.html#mpi.performance_optimizations
Therefore endianness and bitness are not an issue, even for primitive types inside complex user-defined types.

For example, given

    struct my_type { double d1; int d2; };

I then write the templated serialize() function for my_type. MPI should be able to send (after serialization) an instance of my_type from an Intel box to a Sparc box, where the instance is deserialized and the object is properly reconstructed (d1 will be correct).

cool,
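Spelled out, a minimal sketch of that round trip (the ranks and tag are arbitrary choices for illustration):

    #include <boost/mpi.hpp>

    struct my_type {
        double d1;
        int d2;

        // the templated serialization hook; the same function works with
        // text, binary, and Boost.MPI's own archive types
        template <class Archive>
        void serialize(Archive& ar, const unsigned int /*version*/)
        {
            ar & d1 & d2;
        }
    };

    int main(int argc, char* argv[])
    {
        boost::mpi::environment env(argc, argv);
        boost::mpi::communicator world;

        if (world.rank() == 0) {
            my_type v = {3.14, 42};
            world.send(1, 0, v);   // serialized by Boost.MPI on the way out
        } else if (world.rank() == 1) {
            my_type v;
            world.recv(0, 0, v);   // reconstructed here, whatever this
        }                          // machine's native byte order is
        return 0;
    }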
On 16 Nov 2010, at 20:41, James C. Sutherland wrote:
On Tue, Nov 16, 2010 at 10:22 AM, Hicham Mouline wrote:

2. What are the pros/cons of loading the static data objects individually from each separate MPI process vs. broadcasting the static data via MPI itself, after only the master reads/sets up the static data?
It is easier to load them from disk on each process (you don't have to deal with serialization and passing the structure). Typically you will not see a performance problem if this is only a one-time startup cost and if you are not loading massive data files from a file system with weak IO capabilities onto very many MPI processes.
I would go with reading once and broadcasting, especially if, as was mentioned before, one aims at going to thousands of processes. No I/O system can scale, and implementing the broadcast is trivial: a single function call.

Matthias
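A sketch of that single call; load_static_data() below is a placeholder for however the master actually reads its files:

    #include <boost/mpi.hpp>
    #include <boost/serialization/vector.hpp> // lets Boost.MPI serialize std::vector
    #include <vector>

    // placeholder for the real file I/O
    std::vector<double> load_static_data()
    {
        return std::vector<double>(1000, 1.0);
    }

    int main(int argc, char* argv[])
    {
        boost::mpi::environment env(argc, argv);
        boost::mpi::communicator world;

        std::vector<double> static_data;
        if (world.rank() == 0)
            static_data = load_static_data(); // only the master touches the file system

        // the single function call: afterwards every rank holds a copy
        boost::mpi::broadcast(world, static_data, /*root=*/0);
        return 0;
    }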
Matthias Troyer wrote:

I would go with reading once and broadcasting, especially if, as was mentioned before, one aims at going to thousands of processes. No I/O system can scale, and implementing the broadcast is trivial: a single function call.
Matthias
The large calculation that I currently do serially, and that I intend to parallelize, is the maximum of the return values of a large number of evaluations of a given "function" in the mathematical sense. The number of arguments of the function is only known at runtime. Let's say it is determined at runtime that the number of arguments is 10, i.e. we have 10 arguments x0, ..., x9. Each argument can take a different number of values: e.g. x0 can be x0_0, x0_1, ..., x0_n0; x1 can be x1_0, x1_1, ..., x1_n1; and so on. n0, n1, ... are typically known only at runtime and differ from each other.

So, serially, I run

    f(x0_0, x1_0, ..., x9_0)
    f(x0_0, x1_0, ..., x9_1)
    ...
    f(x0_0, x1_0, ..., x9_n9)

then with all the x8, then all the x7, ..., then all the x0. There are n0*n1*...*n9 runs. Then I take the maximum of the return values.

Imagining I have N MPI processes, ideally each process would run n0*n1*...*n9/N function evaluations. How do I split?

In terms of the current implementation, each of the x's is a boost::variant over 4 types: a double, a pair, a triplet, or a vector<double>. A visitor is applied recursively to the variants in order to traverse the whole parameter space: apply_visitor on x0 => say x0 is a triplet, then for (x0 from min to max with increment) apply_visitor on x1, and so on until x9; then we actually call the f function with all the arguments collected so far.

How can one parallelize such a beast?

rds,
This partly depends on how many processors/machines you have available. You need to find a way of partitioning your state space into tasks, and then doling those tasks out to processes/threads. How expensive is f()? How much memory is used?
Brian
On Thu, Nov 18, 2010 at 3:00 AM, Hicham Mouline wrote:
The large calculation that I currently do serially, and that I intend to parallelize, is the maximum of the return values of a large number of evaluations of a given "function" in the mathematical sense. The number of arguments of the function is only known at runtime. Let's say it is determined at runtime that the number of arguments is 10, i.e. we have 10 arguments x0, ..., x9. Each argument can take a different number of values: e.g. x0 can be x0_0, x0_1, ..., x0_n0; x1 can be x1_0, x1_1, ..., x1_n1; and so on. n0, n1, ... are typically known only at runtime and differ from each other.

So, serially, I run

    f(x0_0, x1_0, ..., x9_0)
    f(x0_0, x1_0, ..., x9_1)
    ...
    f(x0_0, x1_0, ..., x9_n9)

then with all the x8, then all the x7, ..., then all the x0. There are n0*n1*...*n9 runs. Then I take the maximum of the return values.

Imagining I have N MPI processes, ideally each process would run n0*n1*...*n9/N function evaluations. How do I split?

In terms of the current implementation, each of the x's is a boost::variant over 4 types: a double, a pair, a triplet, or a vector<double>. A visitor is applied recursively to the variants in order to traverse the whole parameter space: apply_visitor on x0 => say x0 is a triplet, then for (x0 from min to max with increment) apply_visitor on x1, and so on until x9; then we actually call the f function with all the arguments collected so far.

How can one parallelize such a beast?

rds,
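One possible sketch of the partitioning Brian describes, not the thread's actual code: flatten the n0*n1*...*n9 grid into a single index range, deal the flat indices out round-robin, and combine the per-process maxima with a reduction. The toy f() and the hard-coded sizes stand in for the real variant/visitor machinery.

    #include <boost/mpi.hpp>
    #include <algorithm>
    #include <iostream>
    #include <limits>
    #include <vector>

    // placeholder for the real evaluation reached through the visitors
    double f(const std::vector<std::size_t>& idx)
    {
        double s = 0;
        for (std::size_t i = 0; i < idx.size(); ++i) s += idx[i];
        return s;
    }

    int main(int argc, char* argv[])
    {
        boost::mpi::environment env(argc, argv);
        boost::mpi::communicator world;

        std::vector<std::size_t> n;              // n0, n1, ..., known at runtime
        n.push_back(4); n.push_back(5); n.push_back(3);

        std::size_t total = 1;                   // n0*n1*...*n9 evaluations in all
        for (std::size_t d = 0; d < n.size(); ++d) total *= n[d];

        double local_max = -std::numeric_limits<double>::infinity();

        // round-robin split: rank r evaluates flat indices r, r+N, r+2N, ...
        for (std::size_t flat = world.rank(); flat < total; flat += world.size()) {
            std::vector<std::size_t> idx(n.size());
            std::size_t rest = flat;
            for (std::size_t d = 0; d < n.size(); ++d) {  // mixed-radix decode
                idx[d] = rest % n[d];
                rest /= n[d];
            }
            local_max = std::max(local_max, f(idx));
        }

        double global_max;
        boost::mpi::all_reduce(world, local_max, global_max,
                               boost::mpi::maximum<double>());
        if (world.rank() == 0)
            std::cout << "max = " << global_max << std::endl;
        return 0;
    }

Round-robin indexing keeps the load roughly balanced without any communication. If individual f() calls vary wildly in cost, a master/worker scheme that hands out chunks of indices on demand would balance better, at the price of some messaging.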
participants (6)
- Brian Budge
- David Abrahams
- Hicham Mouline
- James C. Sutherland
- Matthias Troyer
- Riccardo Murri