RFC: edit_distance / edit_alignment library
Boost Community,
I cooked up a library that provides a pair of functions: edit_distance() and edit_alignment(). The library currently lives here:
https://github.com/erikerlandson/edit_distance
Some short example programs are here:
https://github.com/erikerlandson/edit_distance/tree/master/examples
A few example expressions:
// the edit distance (aka levenshtein distance)
unsigned dist = edit_distance("abc", "bcd");
// using a custom cost function for insertion, deletion, substitution
unsigned dist = edit_distance("abc", "bcd", custom_cost_function());
// accepts arbitrary sequences, ranges, range adaptors
unsigned dist = edit_distance(as_vector("abc"), as_list("bcd") | boost::adaptors::reversed);
// obtain the edit distance and the list of the edit operation codes (in output iterator)
unsigned dist = edit_alignment("abc", "bcd", std::ostream_iterator
Hi Erik, Erik Erlandson wrote:
Boost Community,
I cooked up a library that provides a pair of functions: edit_distance() and edit_alignment(). The library currently lives here: https://github.com/erikerlandson/edit_distance
I have had a quick look. It seems that you've implemented the Needleman-Wunsch algorithm, which takes quadratic time and space. There are less complex algorithms, at least for similar input sequences, e.g. Myers - "An O(ND) Difference Algorithm and Its Variations"; I have an implementation of that here: http://svn.chezphil.org/anyterm/trunk/src/diff.cc & .hh. What is your motivation for choosing Needleman-Wunsch? Regards, Phil.
I have had a quick look. It seems that you've implemented the Needleman-Wunsch algorithm, which takes quadratic time and space. There are less complex algorithms, at least for similar input sequences, e.g. Myers - "An O(ND) Difference Algorithm and Its Variations"; I have an implementation of that here: http://svn.chezphil.org/anyterm/trunk/src/diff.cc & .hh.
What is your motivation for choosing Needleman-Wunsch?
It was the simplest algorithm to get going. I was interested in developing a nice boost-ish programmatic interface, and developing unit tests, so for the implementation I stuck with the easy algorithm that I was familiar with. Now I (or anybody else) can drop in better algorithms with some unit test assurance. Plus, Needleman-Wunsch is pretty effective for applications involving relatively small strings or sequences. Not so much for applications like large file diffs, or DNA sequences, etc. At any rate, I view the particular algorithm as somewhat fungible, provided that the interface design is appealing to people. If there is disagreement on that point, I'm interested to know that too.
On Sunday, June 30, 2013, Erik Erlandson wrote:
I have had a quick look. It seems that you've implemented the Needleman-Wunsch algorithm, which takes quadratic time and space. There are less complex algorithms, at least for similar input sequences, e.g. Myers - "An O(ND) Difference Algorithm and Its Variations"; I have an implementation of that here: http://svn.chezphil.org/anyterm/trunk/src/diff.cc & .hh.
What is your motivation for choosing Needleman-Wunsch?
It was the simplest algorithm to get going. I was interested in developing a nice boost-ish programmatic interface, and developing unit tests, so for the implementation I stuck with the easy algorithm that I was familiar with. Now I (or anybody else) can drop in better algorithms with some unit test assurance.
Plus, Needleman-Wunsch is pretty effective for applications involving relatively small strings or sequences. Not so much for applications like large file diffs, or DNA sequences, etc.
At any rate, I view the particular algorithm as somewhat fungible, provided that the interface design is appealing to people. If there is disagreement on that point, I'm interested to know that too.
In the current interface I don't see how one could implement a replacement algorithm. -- -- -- Grafik - Don't Assume Anything -- Redshift Software, Inc. - http://redshift-software.com -- rrivera/acm.org - grafik/redshift-software.com -- 102708583/icq - grafikrobot/aim - grafikrobot/yahoo
----- Original Message -----
Hi Erik,
Erik Erlandson wrote:
Boost Community,
I cooked up a library that provides a pair of functions: edit_distance() and edit_alignment(). The library currently lives here: https://github.com/erikerlandson/edit_distance
I have had a quick look. It seems that you've implemented the Needleman-Wunsch algorithm, which takes quadratic time and space. There are less complex algorithms, at least for similar input sequences, e.g. Myers - "An O(ND) Difference Algorithm and Its Variations"; I have an implementation of that here: http://svn.chezphil.org/anyterm/trunk/src/diff.cc & .hh.
Than you for the Myers paper! I've been collecting some thoughts on this topic. I would certainly want Myers' algorithm (or something like it) in a Boost sequence alignment library for its performance on large sequence data (previously I had been thinking in terms of Hirschberg, which is less efficient). I do also wish to provide an algorithm that supports substitution as an edit operation, and configurable cost functions whose output may vary with the sequence elements. Replacing Needleman-Wunsch with a Dijkstra/SSSP implementation under the hood seems like a good idea. I am grappling with how best to represent the returned "edit script". Considering the Myers paper, some kind of run-compression seems useful. The information required depends on what operation. To represent a run of 'equal elements', you need begin/length for sequence 1 and sequence 2. Substitution is similar. For insertion and deletion, you only need begin/length for one sequence. One might use boost::any to address the differing information, although it seems to involve overhead. Not sure if there is a clear boost-style answer that stands out. In my draft, I also experimented with allowing selection of varying edit-script information via function adaptors. The useful options there differ between Myers, where edit costs are constant, and my desired variant that allows non-constant edit costs. The minimal edit script output in any case is a pair (op-code, run-length). If one wants additional location information, then the op-dependent variation appears, adding begin-index for either one or two sequences. In the case of Myers, edit cost information is not relevant. I see no real problem with supporting different options for different algorithms, as long as the interface is as consistent as possible. I still like this as a feature, although with run compression one might argue that the benefits decrease.
Erik Erlandson wrote:
I am grappling with how best to represent the returned "edit script".
"Best" obviously depends on what the caller is going to do with it.
A good design for the interface should emerge once some experience
has been gained with using the algorithm in actual applications.
Attempting to design a general-purpose interface too soon can be
a mistake.
Having said that, the most generic way to return the edit script
would be to template the algorithm on an output-handling class:
template
----- Original Message -----
I am grappling with how best to represent the returned "edit script".
"Best" obviously depends on what the caller is going to do with it. A good design for the interface should emerge once some experience has been gained with using the algorithm in actual applications. Attempting to design a general-purpose interface too soon can be a mistake.
Having said that, the most generic way to return the edit script would be to template the algorithm on an output-handling class:
template
void diff(ITER1 begin1, ITER1 end1, ITER2 begin2, ITER2 end2, OUTPUT& output)
I had a design using this approach, which has obvious advantages. The one thing I wasn't crazy about is that to do this, the implementation layer is committed to always computing and/or storing start+length information and edit-op cost information, so that the output handler has its space of choices about which information it cares about. My motivation for the draft I presented is that it leverages some template specializations so that it computes exactly the information requested via the adaptors. For example, if you don't ask for edit cost information, it invokes version of the implementation that doesn't spend the effort to unpack that information from the working matrix. On the other hand, quite possibly the Dijkstra and Meyers algorithms don't require that kind of extra work, as they don't operate on that kind of working matrix. I agree strongly with your point about leveraging application experience to make good general-application choices. In fact, the last time I worked with edit distances or dynamic programming was nearly 20 years ago, and it was not on very long sequence data such as base-pairs or large-file diffs. Still, the universe of outputs from these algorithms isn't big: They return a sequence of edit op-codes. Optionally, they can include location information (start/end) and operation costs. It seems possible to deduce a small set of output facilities that provide good coverage. Not that I don't prefer induction to deduction :-)
Erik Erlandson wrote:
I am grappling with how best to represent the returned "edit script".
the most generic way to return the edit script would be to template the algorithm on an output-handling class:
template
void diff(ITER1 begin1, ITER1 end1, ITER2 begin2, ITER2 end2, OUTPUT& output) I had a design using this approach, which has obvious advantages. The one thing I wasn't crazy about is that to do this, the implementation layer is committed to always computing and/or storing start+length information and edit-op cost information, so that the output handler has its space of choices about which information it cares about.
Not necessarily; you can allow a variety of methods in the output handler, with different names or signatures, and use "enable-if" SFINAE techniques to enable appropriate parts of the implementation algorithm. (I don't think I've ever seen anything like this done in a std:: or Boost library. Maybe someone else can think of concrete examples?) Regards, Phil.
----- Original Message -----
Erik Erlandson wrote:
I am grappling with how best to represent the returned "edit script".
the most generic way to return the edit script would be to template the algorithm on an output-handling class:
template
void diff(ITER1 begin1, ITER1 end1, ITER2 begin2, ITER2 end2, OUTPUT& output) I had a design using this approach, which has obvious advantages. The one thing I wasn't crazy about is that to do this, the implementation layer is committed to always computing and/or storing start+length information and edit-op cost information, so that the output handler has its space of choices about which information it cares about.
Not necessarily; you can allow a variety of methods in the output handler, with different names or signatures, and use "enable-if" SFINAE techniques to enable appropriate parts of the implementation algorithm.
True, it might be possible to apply Clever MPL Tricks, maybe involving boost::function_types, to allow smart template specializations from an output-class method. Considering how the Meyers and SSSP algorithms work, I'm now feeling less sure about how much it even matters. Good fodder for next round of tinkering.
I reworked both the algorithms and the interface. https://github.com/erikerlandson/edit_distance I devised a variation on the Dijkstra Single Source Shortest Path algorithm, optimized for the particular structure of an edit graph. Like the Meyers algorithm, it is very efficient on long sequences with localized differences (e.g. computing the diff of two similar files, or comparing two similar DNA sequences). Under such conditions, I've been comparing two similar sequences with 100 million elements in under a second. I went with an SSSP-based approach because the Meyers algorithm makes assumptions about the edit operation cost function, and a general-purpose library should allow any cost function (think std::sort() or std::map<> allowing any definition of '<'). I don't feel like I sacrificed a lot of performance. It's quite fast. I doubt it would be hard to allow the user to provide a specialized 'meyers_cost' object, and have it trigger a specific Meyers algorithm under the hood. I also upgraded the interface to a named-argument style function, using the boost::parameter library, so calls now support some named optional parameters: // apply an optional custom cost, and an optional beam constraint int d = edit_distance(seq1, seq2, _cost = my_cost(), _beam=100); In the case of edit_alignment(), the actual sequence of edit operations is received by a user-supplied output handling object: // output_handler provides methods for handling insertion, deletion, substitution and equal elements int d = edit_alignment(seq1, seq2, output_handler, _cost=my_cost(), _beam=1000);
----- Original Message -----
I reworked both the algorithms and the interface. https://github.com/erikerlandson/edit_distance
Under such conditions, I've been comparing two similar sequences with 100 million elements in under a second.
Nope, more egg on face. For arbitrary cost function, I'm computing fast approximation of minimum edit path, not true minimum.
project repo: https://github.com/erikerlandson/edit_distance
Since last time:
I implemented specializations for the Myers algorithm, which matches whenever the necessary conditions are met:
*) sequences are random access
*) no substitution (i.e. _allow_sub = boost::false_type(), which is now the default)
*) unit cost for insertion and deletion (also the default)
*) no optional pruning behaviors (excepting the new _max_cost, which is supported)
Default behavior is now such that the Myers algorithm applies, unless options are specifically chosen such that it cannot be used.
I altered the semantic for cost function: a separate equality relation is now applied explicitly to determine equality, which takes precedence over substitution cost function.
optional behaviors are now:
_cost = <cost-function-object> // customize the cost functions
_equal = <equal-object> // customize the element equality relation
_allow_sub =
On 12/14/2013 03:54 AM, Erik Erlandson wrote:
I've been considering possible alternate names for the functions: edit_distance / edit_alignment or: edit_cost / edit_path or: edit_cost / edit_script
I suggest that you rename the boost::algorithm::sequence_alignment namespace to boost::algorithm::edit. Then we can drop the "edit_" prefix on the function names, giving boost:algorithm::edit::distance and likewise for alignment. The edit::distance is reminiscient of std::distance, so I think that name should be retained. I have no strong opinion about edit::alignment. Research papers use "edit alignment" but I find "edit path" slightly more descriptive. Talking about naming, it may be a good idea avoid abbreviations for some of the names. So "allow_sub" could become "allow_substitution" (or just "substitution".) Likewise, the cost function object has names like cost_ins and cost_del, but as these are already located inside the cost function object, they could be renamed to insertion and deletion (or erasure to match the STL namining a bit better.)
----- Original Message -----
On 12/14/2013 03:54 AM, Erik Erlandson wrote:
I've been considering possible alternate names for the functions: edit_distance / edit_alignment or: edit_cost / edit_path or: edit_cost / edit_script
I suggest that you rename the boost::algorithm::sequence_alignment namespace to boost::algorithm::edit.
My original idea for boost::algorithm::sequence_alignment was that it could also serve as a home for other sequence alignment algorithms, such as BLAST, or HMMs. However, that is pretty speculative.
Then we can drop the "edit_" prefix on the function names, giving boost:algorithm::edit::distance and likewise for alignment.
Just 'distance' or 'alignment' (or 'path') seem like they might be too generic. The likelihood that they'd need to be qualified seems higher. I'll give it some thought.
The edit::distance is reminiscient of std::distance, so I think that name should be retained. I have no strong opinion about edit::alignment. Research papers use "edit alignment" but I find "edit path" slightly more descriptive.
I've been leaning toward 'edit_distance' and 'edit_path'. My survey of the literature hasn't given me any impression of an unambiguous standard for terminology.
Talking about naming, it may be a good idea avoid abbreviations for some of the names. So "allow_sub" could become "allow_substitution" (or just "substitution".)
Agreed, my abbreviations aren't really warranted.
Likewise, the cost function object has names like cost_ins and cost_del, but as these are already located inside the cost function object, they could be renamed to insertion and deletion (or erasure to match the STL namining a bit better.)
That makes sense. I could do the same for the output function object.
I've been considering possible alternate names for the functions: edit_distance / edit_alignment or: edit_cost / edit_path or: edit_cost / edit_script
Come to think of it, now that the interface sits on top of boost::parameter, it ought to be possible to simply present the single 'edit_distance' function, and enable the path-output implementations iff an '_output = outputter_class' is provided. Tangentially, I considered 'edit_cost' instead of 'edit_distance' mostly because in the general case, the function may violate the definition of a distance metric, for example if the cost of insertion is different than the cost of deletion. In that case, edit_cost(seq1, seq2) != edit_cost(seq2, seq1).
On 12/16/2013 10:45 PM, Erik Erlandson wrote:
Come to think of it, now that the interface sits on top of boost::parameter, it ought to be possible to simply present the single 'edit_distance' function, and enable the path-output implementations iff an '_output = outputter_class' is provided.
Very good idea.
Tangentially, I considered 'edit_cost' instead of 'edit_distance' mostly because in the general case, the function may violate the definition of a distance metric, for example if the cost of insertion is different than the cost of deletion. In that case, edit_cost(seq1, seq2) != edit_cost(seq2, seq1).
Which reminds me... If you are running without substitution (_allow_sub set to false), then we may need to have a lower cost for deletions than for insertions, because for substitutions we want a script with delete(A) followed by insert(A), rather than insert(A) followed by delete(A). The latter may result in 'A' being removed rather than replaced.
----- Original Message -----
Which reminds me... If you are running without substitution (_allow_sub set to false), then we may need to have a lower cost for deletions than for insertions, because for substitutions we want a script with delete(A) followed by insert(A), rather than insert(A) followed by delete(A). The latter may result in 'A' being removed rather than replaced.
I did a few experiments trying to cajole the algorithms into outputting edit operations in some kind of canonical ordering. Getting the algorithms to do this natively didn't look very promising, and it left me feeling like the best solution would be to collect operations in the output-object, and then sort them after-the-fact into some desired ordering, for cases where that is important.
On 12/17/2013 10:22 PM, Erik Erlandson wrote:
I did a few experiments trying to cajole the algorithms into outputting edit operations in some kind of canonical ordering. Getting the algorithms to do this natively didn't look very promising, and it left me feeling like the best solution would be to collect operations in the output-object, and then sort them after-the-fact into some desired ordering, for cases where that is important.
You are probably right. I can think of two solutions, but both are rather complex. One solution would be to have a path-dependent cost function (as you suggested earlier in issue #5), and the other to prune the insert-delete combinations from the internal search tree. Regarding your suggestion, AFAIK, the contraint is always between adjacent path elements, and it therefore can be handled in O(1) time in the output object via element swapping. So it seems like a feasible solution.
participants (4)
-
Bjorn Reese
-
Erik Erlandson
-
Phil Endecott
-
Rene Rivera