Re: [boost] [algorithm] Review Request: edit_distance

28 Jan 2014

      Hi Erik,

Erik Erlandson <eje@redhat.com> wrote:
...
I would like to request a formal review for inclusion 
of the edit_distance() function.
Congratulations on your progress.

I have tried your code and done a quick benchmark comparison
with my own version of Myers' algorithm.  I'm pleased to see
that you have excellent performance.  In particular, the way
that you have chosen to store the state needed for generating
the script seems more performant than mine, which uses std
containers and copying.  (If I were doing it again today I
might try to use Boost.Intrusive.)

Some comments:

- I think it should be possible to relax the requirement for
random access iterators.  Either continue to store integers
and then use std::advance, or store iterators.  I'm considering,
for example, comparing UTF-8 with an iterator adaptor than
changes a random-access iterator over the underlying bytes
into a bidirectional iterator over the characters.

- Have you looked at the supposedly better algorithm "An O(NP)
Sequence Comparison Algorithm" by Sun Wu, Udi Manber, and Gene
Myers ?

- It may be useful to group contiguous operations of the same
kind, i.e. a "run" of N equal characters would result in one
call to the output script object, rather than N calls.  This
can be achieved by merging in the script object, but that
would be less efficient than passing the entire run as an
iterator pair:

PSEUDO-CODE!!!

struct output {

   std::string pending;
   enum ... last_action;
   void insertion(char e, int) {
     if (last_action != insert) { flush(); last_action = insert; }
     pending.append(e);
   }
   // etc.
   void flush() {
     do_something_with(pending);
     pending.clear();
   }
};

vs.

struct output {

   template <typename ITER_T>
   void insertion(ITER_T begin, ITER_T end) {
     do_something_with(begin,end);
   }

};

- Some might prefer a version that doesn't use Boost.Parameter.

- In the same vein, you might offer a version that takes an
iterator pair rather than a range.

- My experience has been that early termination is best expressed
in terms of RAM, i.e. the caller could indicate a maximum amount
of RAM, in bytes, rather than a maximum for the edit cost.  You
can probably convert between them via sizeof.

- I'm not convinced that the /sequence/ in the include path is
useful.  Why not just put it in boost/algorithm/ ?

- It would be great to see a benchmark against GNU diff!

I hope Marshall is reading.  What exactly is the process going to
be for getting this into Boost.Algorithm?

Regards,  Phil.

Re: [boost] [algorithm] Review Request: edit_distance

Phil Endecott