Re: [Boost-users] Hybrid parallelism, no more + mpi+serialization, many questions
-----Original Message-----
From: "Brian Budge" [brian.budge@gmail.com]
Date: 18/11/2010 03:20 PM
To: boost-users@lists.boost.org
Subject: Re: [Boost-users] Hybrid parallelism, no more + mpi+serialization, many questions

>> The large calculation that I currently do serially, and that I intend to
>> parallelize, is the maximum of the return values of a large number of
>> evaluations of a given "function" in the mathematical sense.
>> The number of arguments of the function is only known at runtime.
>> Let's say it is determined at runtime that the number of arguments is 10,
>> i.e. we have 10 arguments x0, ..., x9.
>> Each argument can take a different number of values, e.g. x0 can be
>> x0_0, x0_1, ..., x0_n0; x1 can be x1_0, x1_1, ..., x1_n1; and so on.
>> n0 and n1 are typically known only at runtime and differ.
>>
>> So serially, I run
>>   f(x0_0, x1_0, ..., x9_0)
>>   f(x0_0, x1_0, ..., x9_1)
>>   ...
>>   f(x0_0, x1_0, ..., x9_n9)
>> then the same with all the x8 values, then all the x7, ..., then all the x0.
>> There are n0*n1*...*n9 runs. Then I take the maximum of the return values.
>>
>> Imagining I have N MPI processes, ideally each process would run
>> n0*n1*...*n9/N function evaluations. How do I split?
>>
>> In terms of the current implementation, each of the x is a boost::variant
>> over 4 types: a double, a pair, a triplet, or a vector<double> [...]
Here's how I'd probably do this using a pull model:
I'd have N nodes, each of which might have M_i threads. Node 0 is the
master, and it traverses your "tree". The other N-1 nodes ask the master
for work. When the master gets a request for work, it sends back a message
with some number of tasks (maybe 1000 at a time), built by traversing the
tree and putting the parameter sets for those tasks into the message. When
the slaves receive a message full of tasks, a pull model can again be
employed with the M_i threads on each machine. Each thread keeps track of
its current maximum along with the parameters used to compute it. When the
master runs out of work, the response message will contain zero tasks;
each slave then takes the best of its threads' local maxima and sends that
parameter set plus maximum to the master. Finally, the master runs through
the maxima from all the slaves and chooses the parameter set that gave the
overall maximum value.
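A minimal sketch of that protocol with Boost.MPI, assuming each parameter
set is flattened to a std::vector<double>; next_batch() (the master-side
tree traversal) and evaluate() (standing in for your f) are hypothetical
stand-ins, not real library calls:

// Pull-model sketch (not a definitive implementation). Rank 0 hands out
// batches of parameter sets; every other rank asks for work, tracks its
// local best, and reports (max, argmax) once it receives an empty batch.
#include <boost/mpi.hpp>
#include <boost/serialization/vector.hpp>
#include <boost/serialization/utility.hpp>
#include <limits>
#include <utility>
#include <vector>

namespace mpi = boost::mpi;

typedef std::vector<double>        Params;  // one flattened parameter set
typedef std::vector<Params>        Batch;   // tasks shipped per message
typedef std::pair<double, Params>  Result;  // (max value, argmax parameters)

Batch  next_batch(std::size_t n);   // hypothetical: pull up to n leaves off the tree
double evaluate(const Params& p);   // hypothetical: your f()

const int TAG_REQUEST = 0, TAG_BATCH = 1, TAG_RESULT = 2;

int main(int argc, char* argv[]) {
  mpi::environment env(argc, argv);
  mpi::communicator world;

  if (world.rank() == 0) {                    // master
    int active = world.size() - 1;
    while (active > 0) {
      mpi::status s = world.recv(mpi::any_source, TAG_REQUEST);
      Batch batch = next_batch(1000);         // empty batch == no work left
      world.send(s.source(), TAG_BATCH, batch);
      if (batch.empty()) --active;
    }
    Result best(-std::numeric_limits<double>::infinity(), Params());
    for (int i = 1; i < world.size(); ++i) {  // reduce over slave results
      Result r;
      world.recv(mpi::any_source, TAG_RESULT, r);
      if (r.first > best.first) best = r;
    }
    // best.first is the global maximum, best.second its parameter set.
  } else {                                    // slave
    Result local(-std::numeric_limits<double>::infinity(), Params());
    for (;;) {
      world.send(0, TAG_REQUEST);             // ask for work
      Batch batch;
      world.recv(0, TAG_BATCH, batch);
      if (batch.empty()) break;
      for (std::size_t i = 0; i < batch.size(); ++i) {
        double v = evaluate(batch[i]);
        if (v > local.first) local = Result(v, batch[i]);
      }
    }
    world.send(0, TAG_RESULT, local);         // report local best
  }
  return 0;
}

Each slave here is single-threaded for brevity; with per-node threads, the
M_i threads would pull parameter sets off a local queue in the same way.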
You could simplify this slightly by running one MPI process per CPU core
instead of one per node, so you don't have to manage threads yourself.
That has the advantage of being easier to write, but may not perform quite
as well. Make sure you send enough tasks at a time that workers don't
spend much time waiting for work (also, larger messages tend to achieve
higher throughput).
As a second round, if you want to get fancier, a slave can request more
work while it is still computing, so its worker threads are never starved.
Additionally, the master could queue up parameter sets between requests in
order to answer requests more quickly. The slave side of that might look
like the sketch below.
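A sketch of such a double-buffered slave loop, reusing the hypothetical
types and tags from the sketch above; the request for batch i+1 is already
in flight while batch i is being evaluated:

// Double-buffered slave loop (sketch): overlap communication and compute.
void slave_loop(mpi::communicator& world, Result& local) {
  world.send(0, TAG_REQUEST);                 // prime the pipeline
  Batch current;
  world.recv(0, TAG_BATCH, current);
  while (!current.empty()) {
    world.send(0, TAG_REQUEST);               // ask for the next batch now
    Batch next;
    mpi::request pending = world.irecv(0, TAG_BATCH, next);
    for (std::size_t i = 0; i < current.size(); ++i) {  // compute batch i
      double v = evaluate(current[i]);
      if (v > local.first) local = Result(v, current[i]);
    }
    pending.wait();                           // batch i+1 is (ideally) already here
    current.swap(next);
  }
  world.send(0, TAG_RESULT, local);
}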
Brian
On Thu, Nov 18, 2010 at 7:54 AM, Hicham Mouline wrote:
-----Original Message-----
From: "Brian Budge" [brian.budge@gmail.com]
Date: 18/11/2010 03:20 PM
To: boost-users@lists.boost.org
Subject: Re: [Boost-users] Hybrid parallelism, no more + mpi+serialization, many questions
The large calculation that I currently do serially, and that I intend to
parallelize, is the maximum of the return values of a large number of
evaluations of a given "function" in the mathematical sense. The number of
arguments of the function is only known at runtime. Let's say it is
determined at runtime that the number of arguments is 10, i.e. we have 10
arguments x0, ..., x9. Each argument can take a different number of
values, e.g. x0 can be x0_0, x0_1, ..., x0_n0; x1 can be x1_0, x1_1, ...,
x1_n1; and so on. n0 and n1 are typically known only at runtime and
differ.

So serially, I run
  f(x0_0, x1_0, ..., x9_0)
  f(x0_0, x1_0, ..., x9_1)
  ...
  f(x0_0, x1_0, ..., x9_n9)
then the same with all the x8 values, then all the x7, ..., then all the
x0. There are n0*n1*...*n9 runs. Then I take the maximum of the return
values.

Imagining I have N MPI processes, ideally each process would run
n0*n1*...*n9/N function evaluations. How do I split?

In terms of the current implementation, each of the x is a boost::variant
over 4 types: a double, a pair, a triplet, or a vector<double>. A visitor
is applied recursively to the variants in order to traverse the whole
parameter space: apply_visitor on x0 => say x0 is a triplet, then for (x0
from min to max with increment) apply_visitor on x1, and so on until x9,
at which point we actually call the f function with all the arguments
collected so far. How can one parallelize such a beast?
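For concreteness, a sketch of the traversal being described; the Triplet
type (min, max, increment), the meaning of the pair (two candidate
values), and f's signature are guesses, not the poster's actual code:

// Recursive visitor enumeration over boost::variant arguments (sketch).
#include <boost/variant.hpp>
#include <utility>
#include <vector>

double f(const std::vector<double>& args);   // hypothetical: the function

struct Triplet { double min, max, inc; };    // guessed (min, max, increment)

typedef boost::variant<double,
                       std::pair<double, double>,
                       Triplet,
                       std::vector<double> > Arg;

struct Enumerate : boost::static_visitor<> {
  const std::vector<Arg>& args;     // x0 .. x9
  std::size_t depth;                // index of the argument being expanded
  std::vector<double>& current;     // values collected so far
  double& best;                     // running maximum

  Enumerate(const std::vector<Arg>& a, std::size_t d,
            std::vector<double>& c, double& b)
      : args(a), depth(d), current(c), best(b) {}

  // Fix one value for args[depth], then recurse or evaluate at the leaf.
  void descend(double value) const {
    current.push_back(value);
    if (depth + 1 == args.size()) {
      double v = f(current);                 // leaf: evaluate f
      if (v > best) best = v;
    } else {
      Enumerate next(args, depth + 1, current, best);
      boost::apply_visitor(next, args[depth + 1]);
    }
    current.pop_back();
  }

  void operator()(double x) const { descend(x); }
  void operator()(const std::pair<double, double>& p) const {
    descend(p.first);
    descend(p.second);
  }
  void operator()(const Triplet& t) const {
    for (double x = t.min; x <= t.max; x += t.inc) descend(x);
  }
  void operator()(const std::vector<double>& v) const {
    for (std::size_t i = 0; i < v.size(); ++i) descend(v[i]);
  }
};

Driving it would start with boost::apply_visitor(e, args[0]) for an
Enumerate e(args, 0, current, best); a parallel version could cut this
recursion off at some depth and ship the partial `current` vectors out as
tasks.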
> This partly depends on how many processors/machines you have available.
> You need to find a way of partitioning your state space into tasks, and
> then doling those tasks out to processes/threads. How expensive is f()?
> How much memory is used?
>
> Brian

f() takes about 1 ms. I have a couple of billion n-tuples in the parameter
space for now, but I expect to reach a couple of trillion easily. In
serial mode, only about 20 MB of RAM is used. The parameter space is not
generated ahead of time; it looks like a tree. The tree is generated on
the fly, and when I reach a leaf I evaluate f() and compare the return
value to the current maximum, so I am not storing much.

rds,
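Since the leaves are never stored, one way to split the work without
shipping explicit parameter sets is to number the leaves 0 to
n0*n1*...*n9 - 1 and decode a leaf index as a mixed-radix number; the
master then only hands out index ranges. A sketch, assuming the counts
n_i are cheap to compute up front and a hypothetical value_of(i, k)
returning the k-th value of argument i:

// Mixed-radix decoding of a leaf index into one parameter set (sketch).
#include <cstddef>
#include <vector>

double value_of(std::size_t arg, std::size_t k);  // hypothetical lookup

// counts[i] is n_i; index runs over [0, n0*n1*...*n9), so use a 64-bit
// type (a couple of trillion leaves overflows 32 bits).
std::vector<double> decode(unsigned long long index,
                           const std::vector<std::size_t>& counts) {
  std::vector<double> params(counts.size());
  for (std::size_t i = counts.size(); i-- > 0; ) {  // x9 is the least-significant digit
    params[i] = value_of(i, static_cast<std::size_t>(index % counts[i]));
    index /= counts[i];
  }
  return params;
}

With this, rank r of N could evaluate indices [r*T/N, (r+1)*T/N) for
T = n0*...*n9, or the master in the pull model above could dole out index
ranges instead of batches of parameter sets.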