Flyweight: wrapping shared_ptr
Hi all,

I'd like to add hash-consing on top of my hierarchy-based AST implementation, which uses shared_ptr (I'm in C++11) to guarantee value semantics (the pointee is const). Below is a tiny version that would help me see if using Boost.Flyweight is adequate.

It works, but it is not transparent, because I have to explicitly call '.get()' on my flyweights to dereference to the pointers. So for instance my code reads:

  size_t hash() const
  {
    size_t res = 0;
    hash_combine(res, boost::hash_value(op));
    hash_combine(res, l.get()->hash());
    hash_combine(res, r.get()->hash());
    return res;
  }

where I expect to have l->hash(), not l.get()->hash(). I have tried deriving from flyweight to add an operator->, but it fails horribly.

I realize that it is suboptimal to use flyweight on top of shared_ptr, as I get two reference counters. Yet this change is less invasive than migrating my AST hierarchy to using variants, as is done in the example of the documentation.

What would be the right approach? Thanks a lot.
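A minimal sketch of the kind of setup being described --a hypothetical reconstruction, not the original attachment: the Exp_impl name is the one that appears later in the thread, and the deep_hash/deep_equal predicates are assumptions needed for the factory to intern by value rather than by pointer identity:

  #include <boost/flyweight.hpp>
  #include <boost/flyweight/hashed_factory.hpp>
  #include <boost/functional/hash.hpp>
  #include <cstddef>
  #include <memory>

  // Hypothetical reconstruction: a polymorphic AST node held through
  // shared_ptr-to-const, wrapped in a flyweight for hash-consing.
  struct Exp_impl
  {
    virtual ~Exp_impl() {}
    virtual std::size_t hash() const = 0;
    virtual bool equals(const Exp_impl&) const = 0;
  };

  // Deep predicates: without these, the default factory would hash and
  // compare the shared_ptrs themselves, i.e., intern by address.
  struct deep_hash
  {
    std::size_t operator()(const std::shared_ptr<const Exp_impl>& p) const
    { return p->hash(); }
  };

  struct deep_equal
  {
    bool operator()(const std::shared_ptr<const Exp_impl>& a,
                    const std::shared_ptr<const Exp_impl>& b) const
    { return a->equals(*b); }
  };

  using Exp = boost::flyweight<
    std::shared_ptr<const Exp_impl>,
    boost::flyweights::hashed_factory<deep_hash, deep_equal>>;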
On 07/10/2014 15:06, Akim Demaille wrote:
Hi all,
I'd like to add hash-consing on top of my hierarchy-based AST
implementation, which uses shared_ptr (I'm in C++11) to guarantee
value semantics (the pointee is const).
Below is a tiny version that would help me see if using Boost.Flyweight
is adequate.
It works, but it is not transparent, because I have to explicitly
call '.get()' on my flyweights to dereference to the pointers. So for
instance my code reads:
  size_t hash() const
  {
    size_t res = 0;
    hash_combine(res, boost::hash_value(op));
    hash_combine(res, l.get()->hash());
    hash_combine(res, r.get()->hash());
    return res;
  }
where I expect to have l->hash(), not l.get()->hash().
I have tried deriving from flyweight to add an operator->, but it
fails horribly.
This is how the derivation would be done (I've tried with your code and it works,
effectively eliminating the need to do .get()-> rather than -> ):
  struct Exp : boost::flyweight<std::shared_ptr<const Exp_impl>>
  {
    using super = boost::flyweight<std::shared_ptr<const Exp_impl>>;
    using super::super;
    using element_type = super::value_type::element_type;
    const element_type& operator*() const { return *(this->get()); }
    const element_type* operator->() const { return &*(this->get()); }
  };
On 7 Oct 2014, at 18:18, Joaquín Mª López Muñoz wrote:
On 07/10/2014 15:06, Akim Demaille wrote:
I have tried deriving from flyweight to add an operator->, but it fails horribly.
This is how the derivation would be done (I've tried with your code and it works, effectively eliminating the need to do .get()-> rather than -> ):
  struct Exp : boost::flyweight<std::shared_ptr<const Exp_impl>>
  {
    using super = boost::flyweight<std::shared_ptr<const Exp_impl>>;
    using super::super;
    using element_type = super::value_type::element_type;
    const element_type& operator*() const { return *(this->get()); }
    const element_type* operator->() const { return &*(this->get()); }
  };
That's about what I did, except that I felt free to use decltype.
bool operator==(const Exp& l, const Exp& r) { return *l == *r; }
bool operator!=(const Exp& l, const Exp& r) { return !(l == r); }
Actually I also had to solve this issue for == and !=, which seemed fishy, and I was afraid I would have to address a larger set of functions in the Real World Case.
I think this approach is OK: elements must be stored through some type of pointer because they're polymorphic, and, with variants ruled out, std::shared_ptr<Base> seems a sensible choice.
Agreed.
An alternative would be to use some sort of cloning smart pointer to spare the reference count, but it's not clear to me this would be faster than shared_ptr:
* element exists: shared_ptr creation is slower than clone_ptr creation
* element does not exist: shared_ptr creation and copy is probably faster than clone_ptr creation and cloning.
Yes, indeed.
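For context, the clone_ptr above is not a standard component; a minimal sketch of such a cloning pointer, assuming a hypothetical virtual clone() on the pointee, could be:

  #include <utility>

  // Sketch of a cloning smart pointer: copying deep-copies the pointee
  // instead of bumping a reference count. Assumes T exposes a virtual
  // clone() returning a raw owning pointer (hypothetical interface).
  template <typename T>
  class clone_ptr
  {
  public:
    explicit clone_ptr(T* p = nullptr) : p_(p) {}
    clone_ptr(const clone_ptr& other)
      : p_(other.p_ ? other.p_->clone() : nullptr) {}
    clone_ptr& operator=(clone_ptr other)      // copy-and-swap
    { std::swap(p_, other.p_); return *this; }
    ~clone_ptr() { delete p_; }
    T& operator*() const { return *p_; }
    T* operator->() const { return p_; }
  private:
    T* p_;
  };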
So, if in doubt, profile and determine what's best for you.
Yes, I'll bench the result. Is there any way Flyweight could have supported the operator-> natively? It seems that with some SFINAE under the hood, it would work, don't you think?
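For the record, a sketch of the kind of SFINAE being asked about --a hypothetical wrapper fw, not Boost.Flyweight's actual code, whose operator-> is a member template that drops out of overload resolution when the held type is not pointer-like:

  #include <memory>
  #include <type_traits>

  // Detect pointer-like types via a nested element_type (as in shared_ptr).
  template <typename T>
  class is_pointer_like
  {
    template <typename U> static char test(typename U::element_type*);
    template <typename U> static long test(...);
  public:
    static const bool value = sizeof(test<T>(nullptr)) == sizeof(char);
  };

  // Hypothetical flyweight-like holder: operator-> is a member template,
  // so SFINAE removes it when T is not pointer-like; the class still
  // compiles (without operator->) for plain value types.
  template <typename T>
  class fw
  {
  public:
    explicit fw(const T& t) : t_(t) {}
    const T& get() const { return t_; }

    template <typename U = T>
    typename std::enable_if<is_pointer_like<U>::value,
                            const typename U::element_type*>::type
    operator->() const { return &*t_; }

  private:
    T t_;
  };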
Hi Joaquín,
You are a tough one, I'm surprised (and very pleased) you still answer
this thread :)
On 10 Oct 2014, at 14:00, Joaquin M Lopez Munoz wrote:
a. a single factory based on Flyweight, and use derived_poly_flyweight
b. one factory made by hand on top of map or unordered_map
c. likewise, but with Flyweight and key_value
As for c, you can have something more or less along that line by using your latest attempt (http://coliru.stacked-crooked.com/a/2b768fa26574adea ), declaring the flyweights with the no_tracking policy and having them store a weak_ptr (which boils down to your hand-made scenario b).
Well, I couldn't get this to work. In fact b neither. There are several issues. One is that the key-value approach, while appealing (you don't build at all if an object of this type was already built), keeps values alive because they are keys. Consider Bin('+', Num(42), Num(51)). To build it, I have used the key <'+', Exp(Num(42)), Exp(Num(51))> and mapped it to a weak pointer. But when I kill the pointer, the key remains, and therefore 42 and 51 are still alive and are not reclaimed.

So I tried a set of weak_ptr, but couldn't make it work. In fact, my problem is that I would like to get a callback from the shared_ptr at use_count == 1, not == 0 (to erase the value from the factory). But I could not find a means to have that.

I have also tried to store real objects in the factory and use shared_from_this to create the shared_ptr, but this is illegal: shared_from_this can only be called when there is already a shared_ptr somewhere.
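For what it's worth, the usual workaround for the missing "use_count == 1" callback is to keep only weak_ptrs in the factory (so the event of interest becomes use_count == 0) and fire the erasure from a custom deleter. A single-threaded sketch with made-up names (intern_table, Make):

  #include <map>
  #include <memory>

  // Sketch of a hand-rolled interning factory (hypothetical, not
  // thread-safe). The table holds only weak_ptrs, so clients own the
  // values; the custom deleter erases the entry when the last shared_ptr
  // dies, playing the role of the "use_count == 1" callback wished for
  // above.
  template <typename Key, typename T>
  class intern_table
  {
  public:
    template <typename Make>   // Make: () -> const T*, builds a new value
    std::shared_ptr<const T> get(const Key& k, Make make)
    {
      auto it = table_.find(k);
      if (it != table_.end())
        if (std::shared_ptr<const T> sp = it->second.lock())
          return sp;           // already interned and still alive
      intern_table* self = this;
      std::shared_ptr<const T> sp(make(),
                                  [self, k](const T* p)
                                  { self->table_.erase(k); delete p; });
      table_[k] = sp;
      return sp;
    }
  private:
    std::map<Key, std::weak_ptr<const T>> table_;
  };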
Well, this is not so clear to me, as the expressions I store often have a strong regularity, with many elements of the same kind in a row.
Of course, benchmarking is the best way to go.
I agree.
I'd be interested in knowing your outcomes.
Well, you won: the single store is roughly 20% faster in my experiment. I am surprised; I really expected the in-place allocation in the per-type factories to be much cheaper than the pointer and malloc agitation in the case of the flyweight of single-pointer type. Maybe my attempt to have a polymorphic wrapper around flyweight is slow, I don't know. But after all, it is true that most of the time these values are manipulated via the base type, which is cheap (well, gratis) in the single-factory case, while it requires quite some allocations in the per-type-factory case. So after all, it might make some sense anyway.

The bench was as follows:

  for (unsigned i = 0; i < 500U * 1000U; ++i)
  {
    Num n = num(1);
    Bin e = bin('*',
                bin('+', num(1), num(1)),
                bin('+', num(1), num(1)));
    // Duplicates large parts of e.
    Exp f = bin('/',
                bin('*', bin('+', num(1), num(1)), bin('+', num(1), num(1))),
                bin('*', bin('+', num(1), num(1)), bin('+', num(1), num(1))));
    assert(f->eval() == 1);
  }

  $ rm 7 9 && make 7 9 && time ./7 && time ./9
  clang++-mp-3.5 -std=c++11 -isystem /opt/local/include -ggdb -O3 7.cc -o 7
  clang++-mp-3.5 -std=c++11 -isystem /opt/local/include -ggdb -O3 9.cc -o 9

  real 0m4.550s   This is single factory
  user 0m4.540s
  sys  0m0.003s

  real 0m5.261s   This is per-type factory
  user 0m5.256s
  sys  0m0.003s

The corresponding files are here:

  single:   http://coliru.stacked-crooked.com/a/d7028e28fcc86217
  per-type: http://coliru.stacked-crooked.com/a/4566f6e2e7dd6d81

Again, thanks for your feedback.
Akim Demaille
Hi Joaquín,
You are a tough one, I'm surprised (and very pleased) you still answer this thread :)
I love C++ in general, and these low-level design problems in particular.
Well, I couldn't get this to work. In fact b neither. [...]
All the problems you describe are precisely the ones Boost.Flyweight tries to solve :-) It is very hard to implement a flyweight-like design (refcount semantics plus unique values) on top of shared_ptr. In fact, Boost.Flyweight does not use shared_ptrs internally. Also, Boost.Flyweight is not designed for easy and efficient casting of flyweight<Derived> to flyweight<Base>: I'm afraid that, to do scenario b right, you'll basically have to write everything from scratch --factory, tracking, locking, etc.
I'd be interested in knowing your outcomes.
Well, you won: the single store is roughly 20% faster in my experiment. I am surprised; I really expected the in-place allocation in the per-type factories to be much cheaper than the pointer and malloc agitation in the case of the flyweight of single-pointer type.
Maybe my attempt to have a polymorphic wrapper around flyweight is slow, I don't know. But after all, it is true that most of the time these values are manipulated via the base type, which is cheap (well, gratis) in the single-factory case, while it requires quite some allocations in the per-type-factory case. So after all, it might make some sense anyway.
The bench was as follows [...]
I think the main penalty in your per-type test comes from the fact that you're maintaining two separate refcounting systems (flyweight and shared_ptr<Holder>) and two virtual hierarchies (Exp_impl and Holder) in parallel, which effectively doubles the bookkeeping load of the program. On the other hand, I think the benchmark is probably too lightweight to really discern whether per-type hashing is beneficial or detrimental: the number of elements involved is too low, so everything will fit in working memory very comfortably, and hash collisions, either in the single or in the per-type case, are almost certain not to occur. May I suggest you extend the benchmark to deal with tens of thousands of unique values *at a given time*? This would yield, I guess, fairer results.

Joaquín M López Muñoz
Telefónica
Hi Joaquín,
On 10 Oct 2014, at 22:02, Joaquin M Lopez Munoz wrote:
Well, I couldn't get this to work. In fact b neither. [...]
All the problems you describe are precisely the ones Boost.Flyweight tries to solve :-)
Yes. The more I was writing code, failing, and working around issues, the more I realized everything Flyweight needs to do.
The bench was as follows [...]
I think the main penalty in your per-type test comes from the fact that you're maintaining two separate refcounting systems (flyweight and shared_ptr<Holder>) and two virtual hierarchies (Exp_impl and Holder) in parallel, which effectively doubles the bookkeeping load of the program.
Indeed.
On the other hand, I think the benchmark is probably too lightweight to really discern whether per-type hashing is beneficial or detrimental: the number of elements involved is too low, so everything will fit in working memory very comfortably, and hash collisions, either in the single or in the per-type case, are almost certain not to occur. May I suggest you extend the benchmark to deal with tens of thousands of unique values *at a given time*? This would yield, I guess, fairer results.
You are right, and I am amazed by the figures below. Actually, I have checked several times whether I did something wrong, but I don't think so. The new bench is as follows:

  {
    size_t n = 10 * 1000;
    Exp exp = num(0);
    for (unsigned j = 1; j <= n; ++j)
    {
      Exp e = bin('*',
                  bin('+', num(j), num(j)),
                  bin('+', num(j), num(j)));
      Exp f = bin('/', e, e);  // <======== a fancy "1"
      exp = bin('+', exp, f);  // <======== so this is ++exp
    }
    assert(exp->eval() == n);
  }

which I ran on several implementations:

  1.cc: basic implementation using inheritance
        http://coliru.stacked-crooked.com/a/840ddd39842323e8
  7.cc: inheritance, single Flyweight factory for the base class
        http://coliru.stacked-crooked.com/a/0d9a4eb64cbbe820
  9.cc: inheritance, per-concrete-type factories and custom wrapper
        http://coliru.stacked-crooked.com/a/e0c2e130fc668e30
  10.cc: variant-based implementation, no flyweight at all
        http://coliru.stacked-crooked.com/a/d51374d1272915d2
  11.cc: variant-based implementation, single flyweight for the resulting
        variant type (i.e., same as your example in the doc).
        http://coliru.stacked-crooked.com/a/44d2c491dcfcaf6c

The results are:

  ./1    0,02s user  0,00s system  93% cpu    0,031 total
  ./7    0,06s user  0,00s system  97% cpu    0,070 total
  ./9   46,19s user  0,09s system  99% cpu   46,424 total
  ./10 171,29s user  9,71s system  99% cpu  3:01,09 total
  ./11  17,25s user  0,04s system  99% cpu   17,296 total

The variants are amazingly costly compared to "native" polymorphism. They seem to consume so much space that the system part is now visible. As to whether it is better to have per-derived or per-base factories, it's not even a close call. Of course it is even better not to share at all, but that's expected given the bench itself. In my case, I expect to get a ROI from faster hash/operator== in my library. I originally planned to implement a variant of flyweights per type, but it no longer seems like an interesting alternative.
Akim Demaille
Hi Joaquín,
The new bench is as follows:
  {
    size_t n = 10 * 1000;
    Exp exp = num(0);
    for (unsigned j = 1; j <= n; ++j)
    {
      Exp e = bin('*',
                  bin('+', num(j), num(j)),
                  bin('+', num(j), num(j)));
      Exp f = bin('/', e, e);  // <======== a fancy "1"
      exp = bin('+', exp, f);  // <======== so this is ++exp
    }
    assert(exp->eval() == n);
  }
I think you can make the test even more realistic: as it stands now, the level of value redundancy is very low --for instance, each num value is repeated only four times--, which is not the ideal setting for Boost.Flyweight. If you change the creation of e to something like

  Exp e = bin('*',
              bin('+', num(1 + (j % m)), num(1 + (j % m))),
              bin('+', num(2 + (j % m)), num(2 + (j % m))));

for some constant m (in the range of 1000, for instance), then you can modulate the level of redundancy (the lower m, the more duplicate values): tuning m can tell you how much Boost.Flyweight improves wrt the shared_ptr case (which should not be affected by m at all). I bet that with m sufficiently low and n sufficiently high, #7 can get to beat #1.
which I ran on several implementations:
[...]
The results are:
  ./1    0,02s user  0,00s system  93% cpu    0,031 total
  ./7    0,06s user  0,00s system  97% cpu    0,070 total
  ./9   46,19s user  0,09s system  99% cpu   46,424 total
  ./10 171,29s user  9,71s system  99% cpu  3:01,09 total
  ./11  17,25s user  0,04s system  99% cpu   17,296 total
Well, seems like there's a clear winner here (ruling out #1, which I understand from what you say is not an acceptable alternative, presumably because of memory limitations).

Joaquín M López Muñoz
Telefónica
Hi!
On 12 Oct 2014, at 20:46, Joaquin M Lopez Munoz wrote:
I think you can make the test even more realistic: as it stands now, the level of value redundancy is very low --for instance, each num value is repeated only four times--, which is not the ideal setting for Boost.Flyweight. If you change the creation of e to something like
  Exp e = bin('*',
              bin('+', num(1 + (j % m)), num(1 + (j % m))),
              bin('+', num(2 + (j % m)), num(2 + (j % m))));
for some constant m (in the range of 1000, for instance), then you can modulate the level of redundancy (the lower m, the more duplicate values): tuning m can tell you how much Boost.Flyweight improves wrt the shared_ptr case (which should not be affected by m at all). I bet that with m sufficiently low and n sufficiently high, #7 can get to beat #1.
I couldn't find such n and m. Actually, too large an n gives not-so-nice errors:

  $ time ./7
  Assertion failed: (pthread_mutex_lock(&m_)==0), function scoped_lock, file /opt/local/include/boost/flyweight/detail/recursive_lw_mutex.hpp, line 72.

but I'm not surprised: the tree must be too deep, and I'm not even sure my stack is in good shape. I have not tried to disable locking: I'd be happy to go multithreaded in the future, though currently it wouldn't be a problem.

For n = 10.000 and m = 100, I get (1 is pure shared_ptr, and 7 is Flyweight):

  $ time ./1
  ./1  0,02s user  0,01s system  93% cpu  0,030 total
  $ time ./7
  ./7  0,05s user  0,00s system  96% cpu  0,051 total

Note that my test case does some sharing even with shared_ptr:

  Bin f = bin('/', e, e);

it's not a tree. But that's coherent with my code base, where I already have _some_ sharing, but not maximal. With no sharing at all, I get:

  $ time ./1
  ./1  0,03s user  0,01s system  96% cpu  0,042 total
  $ time ./7
  ./7  0,07s user  0,00s system  97% cpu  0,070 total
  ./1    0,02s user  0,00s system  93% cpu    0,031 total
  ./7    0,06s user  0,00s system  97% cpu    0,070 total
  ./9   46,19s user  0,09s system  99% cpu   46,424 total
  ./10 171,29s user  9,71s system  99% cpu  3:01,09 total
  ./11  17,25s user  0,04s system  99% cpu   17,296 total
Well, seems like there's a clear winner here (ruling out #1, which I understand from what you say is not an acceptable alternative, presumably because of memory limitations).
No, my hope is that I can get faster hashing and operator== on my expressions in all the algorithms that are on top of these expressions. I am not yet worried about memory consumption. But maybe I'm hoping for something I won't get...
Akim Demaille
Hi!
On 12 Oct 2014, at 20:46, Joaquin M Lopez Munoz wrote:
If you change the creation of e to something like
  Exp e = bin('*',
              bin('+', num(1 + (j % m)), num(1 + (j % m))),
              bin('+', num(2 + (j % m)), num(2 + (j % m))));
for some constant m (in the range of 1000, for instance), then you can modulate the level of redundancy (the lower m, the more duplicate values): tuning m can tell you how much Boost.Flyweight improves wrt the shared_ptr case (which should not be affected by m at all). I bet that with m sufficiently low and n sufficiently high, #7 can get to beat #1.
I couldn't find such n and m. Actually, too large an n gives not-so-nice errors:
  $ time ./7
  Assertion failed: (pthread_mutex_lock(&m_)==0), function scoped_lock, file /opt/local/include/boost/flyweight/detail/recursive_lw_mutex.hpp, line 72.
but I'm not surprised: the tree must be too deep, and I'm not even sure my stack is in good shape. I have not tried to disable locking: I'd be happy to go multithreaded in the future, though currently it wouldn't be a problem.
Yep, this definitely looks like the reentrant lock recursion limit has been hit (Pthreads ERECURSE error). Expression depth is limiting how far you can go with n in your benchmark. If you changed the code so that the complexity of e remained bounded as n grows (10,000 is not that large), the performance of Boost.Flyweight would get better and better with respect to #1. An alternative could be to maintain a vector of expressions so that redundancy is higher even for moderately high n's. Anyway, this is just trying to prove a point that at the end of the day might not be too interesting for your purposes.
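A sketch of that alternative, reusing the thread's Exp, num() and bin() helpers (the loop shape is made up): many unique-but-redundant values stay alive at once, with no deep recursion:

  #include <vector>

  // Sketch: bounded-depth benchmark body. Keeps n shallow expressions
  // alive simultaneously, so tens of thousands of unique values coexist
  // without the deep nesting that tripped the recursive mutex above.
  // Assumes the thread's Exp, num() and bin(); n and m as discussed.
  void bench(unsigned n, unsigned m)
  {
    std::vector<Exp> pool;
    pool.reserve(n);
    for (unsigned j = 1; j <= n; ++j)
      pool.push_back(bin('*',
                         bin('+', num(1 + j % m), num(1 + j % m)),
                         bin('+', num(2 + j % m), num(2 + j % m))));
  }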
  ./1    0,02s user  0,00s system  93% cpu    0,031 total
  ./7    0,06s user  0,00s system  97% cpu    0,070 total
  ./9   46,19s user  0,09s system  99% cpu   46,424 total
  ./10 171,29s user  9,71s system  99% cpu  3:01,09 total
  ./11  17,25s user  0,04s system  99% cpu   17,296 total
Well, seems like there's a clear winner here (ruling out #1, which I understand from what you say is not an acceptable alternative, presumably because of memory limitations).
No, my hope is that I can get faster hashing and operator== on my expressions in all the algorithms that are on top of these expressions. I am not yet worried about memory consumption. But maybe I'm hoping for something I won't get...
Hashing and equality comparison of flyweight'd expressions is going to be blazingly fast as compared with recursive non-flyweight'd versions. Also, have you considered memoizing eval(), either by storing the eval result in the expressions themselves or by using something like the following?

http://www.boost.org/libs/flyweight/doc/examples.html#example5

Joaquín M López Muñoz
Telefónica
Hi Joaquin!
On 14 Oct 2014, at 11:34, Joaquin M Lopez Munoz wrote:
  ./1    0,02s user  0,00s system  93% cpu    0,031 total
  ./7    0,06s user  0,00s system  97% cpu    0,070 total
  ./9   46,19s user  0,09s system  99% cpu   46,424 total
  ./10 171,29s user  9,71s system  99% cpu  3:01,09 total
  ./11  17,25s user  0,04s system  99% cpu   17,296 total
Well, seems like there's a clear winner here (ruling out #1, which I understand from what you say is not an acceptable alternative, presumably because of memory limitations).
No, my hope is that I can get faster hashing and operator== on my expressions in all the algorithms that are on top of these expressions. I am not yet worried about memory consumption. But maybe I'm hoping for something I won't get...
Hashing and equality comparison of flyweight'd expressions is going to be blazingly fast as compared with recursive non-flyweight'd versions.
Yes, that's where I expect to get most of my speedup. However, for deep expressions, ordered containers were quite fast, faster than hashes, since often the comparison does not go deep inside. So I'll have to bench whether to stick to ordered containers or move to unordered ones.
Also, have you considered memoizing eval(), either by storing the eval result in the expressions themselves or by using something like the following?
In this regard, I don't like Boost.Flyweight too much, and prefer to do the caching myself. Using an unordered_map, for instance in a functor, makes memory management cheap and easy. AFAICT, there is nothing comparable in Flyweight: either I pay a useless ref-counting, or I leak, right?

I prefer a map to storing the result in the expression, because in my case (rational expressions, sort of generalized regular expressions if you wish) there are many (deep) algorithms for which I'd like to get such a cache.

However, a friend of mine, for a similar set-up, implemented his own flyweight, and uses it to store a (unique) number in the formulas. So it's even better than hashing, as he can use plain vectors for his caches.

I've not deployed this in my project yet, so I can't provide figures. I expect I'll be able to be back on this topic in a couple of weeks or so. Cheers!
Akim Demaille
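A sketch of the kind of caching functor described above (hypothetical names; assumes the thread's Exp flyweight and its eval(); identity_hash/identity_equal key on the interned entry's address, which is what makes flyweight keys O(1) here):

  #include <unordered_map>
  #include <boost/functional/hash.hpp>

  // Identity-based key operations: two flyweights holding the same
  // interned value share the same stored entry, so its address suffices.
  struct identity_hash
  {
    std::size_t operator()(const Exp& e) const
    { return boost::hash<const void*>()(&e.get()); }
  };

  struct identity_equal
  {
    bool operator()(const Exp& a, const Exp& b) const
    { return &a.get() == &b.get(); }
  };

  // Sketch of a memoizing evaluator: the cache lives inside the functor,
  // so clearing or discarding the functor reclaims all memory, with no
  // extra ref-counting and no leak.
  struct memo_eval
  {
    int operator()(const Exp& e)
    {
      auto it = cache_.find(e);
      if (it != cache_.end())
        return it->second;
      int res = e.get()->eval();  // could recurse through *this instead
      cache_.emplace(e, res);
      return res;
    }
    std::unordered_map<Exp, int, identity_hash, identity_equal> cache_;
  };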
Hi Joaquin!
On 14 Oct 2014, at 11:34, Joaquin M Lopez Munoz wrote:
Hashing and equality comparison of flyweight'd expressions is going to be blazingly fast as compared with recursive non-flyweight'd versions.
Yes, that's where I expect to get most of my speedup. However, for deep expressions, ordered containers were quite fast, faster than hashes, since often the comparison does not go deep inside. So I'll have to bench whether to stick to ordered containers or move to unordered ones.
I'm not sure I'm following you here... Comparison and hashing of flyweights is blazingly fast *and* does not depend on the expression depth of the flyweight'd values --it does not depend on the values at all, in fact. If you mean that flyweight *creation*, which requires hashing the expression (but only one level deep) with the default hash-based Boost.Flyweight factory, is slower than using a std::set-based factory, well, profiling can tell.
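To make the "one level deep" point concrete, here is a sketch of what the node-level hash looks like once children are interned (assuming the op/l/r members from the code at the top of the thread; the address-based identity here is precisely what is discussed further below as non-repeatable):

  // Sketch: node-level hash under hash-consing. The children l and r are
  // interned flyweights, so their identity (the address of the stored
  // entry) stands in for their whole subtree; the hash is O(1) whatever
  // the expression depth. Fast, but not repeatable across runs.
  size_t hash() const
  {
    size_t res = 0;
    boost::hash_combine(res, boost::hash_value(op));
    boost::hash_combine(res, &l.get());  // identity of the left subtree
    boost::hash_combine(res, &r.get());  // identity of the right subtree
    return res;
  }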
Also, have you considered memoizing eval(), either by storing the eval result in the expressions themselves or by using something like the following?
In this regard, I don't like Boost.Flyweight too much, and prefer to do the caching myself. Using an unordered_map, for instance in a functor, makes memory management cheap and easy. AFAICT, there is nothing comparable in Flyweight: either I pay a useless ref-counting, or I leak, right?
Yes, you're right. There's no such feature as a run-time flyweight factory, so I think your home-made approach is going to be more convenient.

Joaquín M López Muñoz
Telefónica
On 22 Oct 2014, at 16:24, Joaquin M Lopez Munoz wrote:
Akim Demaille writes:
Hi Joaquin!
On 14 Oct 2014, at 11:34, Joaquin M Lopez Munoz wrote:
Hashing and equality comparison of flyweight'd expressions is going to be blazingly fast as compared with recursive non-flyweight'd versions.
Yes, that's where I expect to get most of my speedup. However, for deep expressions, ordered containers were quite fast, faster than hashes, since often the comparison does not go deep inside. So I'll have to bench whether to stick to ordered containers or move to unordered ones.
I'm not sure I'm following you here... Comparison and hashing of flyweights is blazingly fast *and* does not depend on the expression depth of the flyweight'd values --it does not depend on the values at all, in fact.
Yes, of course. I was referring (with the "were", I agree it's subtle :) to the pre-Flyweight implementation. I was commenting that I was surprised that, in general, ordered structures were faster than unordered ones, and I suspect it's mostly because hashing is always deep, and comparison is not. However, I don't like Flyweight's address comparison. Sure, it's fast. It's also not well defined, not repeatable, not portable, etc. At least when developing, I like having something more stable, so I will probably stick with a deep comparison. So I was saying that moving to flyweights also means, in my case, looking for places where unordered is OK, and moving to unordered in those places.
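A sketch of that stable alternative: a structural ordering that is repeatable across runs, while interning still short-circuits the equal case (the virtual three-way compare() on the node hierarchy is hypothetical):

  // Sketch: deterministic deep ordering for interned expressions.
  // Identical flyweights share storage, so the identity check returns
  // before any recursion; otherwise a structural compare gives an order
  // that does not depend on allocation addresses.
  bool exp_less(const Exp& a, const Exp& b)
  {
    if (&a.get() == &b.get())
      return false;                         // same interned node
    return a.get()->compare(*b.get()) < 0;  // hypothetical virtual compare()
  }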
participants (3)

- Akim Demaille
- Joaquin M Lopez Munoz
- Joaquín Mª López Muñoz