Hi Erik,
Erik Erlandson
I would like to request a formal review for inclusion of the edit_distance() function.
Congratulations on your progress. I have tried your code and done a quick benchmark comparison with my own version of Myers' algorithm. I'm pleased to see that you have excellent performance. In particular, the way that you have chosen to store the state needed for generating the script seems more performant than mine, which uses std containers and copying. (If I were doing it again today I might try to use Boost.Intrusive.) Some comments: - I think it should be possible to relax the requirement for random access iterators. Either continue to store integers and then use std::advance, or store iterators. I'm considering, for example, comparing UTF-8 with an iterator adaptor than changes a random-access iterator over the underlying bytes into a bidirectional iterator over the characters. - Have you looked at the supposedly better algorithm "An O(NP) Sequence Comparison Algorithm" by Sun Wu, Udi Manber, and Gene Myers ? - It may be useful to group contiguous operations of the same kind, i.e. a "run" of N equal characters would result in one call to the output script object, rather than N calls. This can be achieved by merging in the script object, but that would be less efficient than passing the entire run as an iterator pair: PSEUDO-CODE!!! struct output { std::string pending; enum ... last_action; void insertion(char e, int) { if (last_action != insert) { flush(); last_action = insert; } pending.append(e); } // etc. void flush() { do_something_with(pending); pending.clear(); } }; vs. struct output { template <typename ITER_T> void insertion(ITER_T begin, ITER_T end) { do_something_with(begin,end); } }; - Some might prefer a version that doesn't use Boost.Parameter. - In the same vein, you might offer a version that takes an iterator pair rather than a range. - My experience has been that early termination is best expressed in terms of RAM, i.e. the caller could indicate a maximum amount of RAM, in bytes, rather than a maximum for the edit cost. You can probably convert between them via sizeof. - I'm not convinced that the /sequence/ in the include path is useful. Why not just put it in boost/algorithm/ ? - It would be great to see a benchmark against GNU diff! I hope Marshall is reading. What exactly is the process going to be for getting this into Boost.Algorithm? Regards, Phil.