[algorithm] Review Request: edit_distance
I would like to request a formal review for inclusion of the edit_distance() function.

Description: The edit distance is the length of the shortest (or least-cost) edit script from one sequence to another, where an edit script is defined as a sequence of insertion, deletion and (optionally) substitution operations. The function implementing the edit distance is named edit_distance. This function will return the edit distance between two sequences, where sequences may be any valid range object supporting forward iteration. The edit_distance function will also, if requested, return the edit script. http://en.wikipedia.org/wiki/Edit_distance

Documentation: http://erikerlandson.github.io/algorithm/libs/algorithm/doc/html/algorithm/S...

Code resides on a branch of my fork of Boost.Algorithm: https://github.com/erikerlandson/algorithm/tree/edit_distance

The feature branch above includes the doc, unit tests and examples that are integrated via Jamfiles. The new dev resides specifically in this subdir: https://github.com/erikerlandson/algorithm/tree/edit_distance/sequence

I have built the tests and examples on gcc and clang, with and without c++11 mode. I don't have access to a windows/MSVC build env, so I am definitely curious if it compiles properly in that env.

My concept is that boost::algorithm::sequence can eventually serve as a home to additional related algorithms, for example other sequence comparison algorithms or perhaps other categories of sequence-related algorithms.
Hi Erik,
Erik Erlandson
I would like to request a formal review for inclusion of the edit_distance() function.
Congratulations on your progress.

I have tried your code and done a quick benchmark comparison with my own version of Myers' algorithm. I'm pleased to see that you have excellent performance. In particular, the way that you have chosen to store the state needed for generating the script seems more performant than mine, which uses std containers and copying. (If I were doing it again today I might try to use Boost.Intrusive.)

Some comments:

- I think it should be possible to relax the requirement for random access iterators. Either continue to store integers and then use std::advance, or store iterators. I'm considering, for example, comparing UTF-8 with an iterator adaptor that changes a random-access iterator over the underlying bytes into a bidirectional iterator over the characters.

- Have you looked at the supposedly better algorithm "An O(NP) Sequence Comparison Algorithm" by Sun Wu, Udi Manber, and Gene Myers?

- It may be useful to group contiguous operations of the same kind, i.e. a "run" of N equal characters would result in one call to the output script object, rather than N calls. This can be achieved by merging in the script object, but that would be less efficient than passing the entire run as an iterator pair:

  PSEUDO-CODE!!!

  struct output {
      std::string pending;
      enum ... last_action;
      void insertion(char e, int) {
          if (last_action != insert) { flush(); last_action = insert; }
          pending.push_back(e);
      }
      // etc.
      void flush() { do_something_with(pending); pending.clear(); }
  };

  vs.

  struct output {
      template <typename ITER_T>
      void insertion(ITER_T begin, ITER_T end) { do_something_with(begin, end); }
  };

- Some might prefer a version that doesn't use Boost.Parameter.

- In the same vein, you might offer a version that takes an iterator pair rather than a range.

- My experience has been that early termination is best expressed in terms of RAM, i.e. the caller could indicate a maximum amount of RAM, in bytes, rather than a maximum for the edit cost. You can probably convert between them via sizeof.

- I'm not convinced that the /sequence/ in the include path is useful. Why not just put it in boost/algorithm/ ?

- It would be great to see a benchmark against GNU diff!

I hope Marshall is reading. What exactly is the process going to be for getting this into Boost.Algorithm?

Regards, Phil.
----- Original Message -----
Hi Erik,
Erik Erlandson
wrote: I would like to request a formal review for inclusion of the edit_distance() function.
Congratulations on your progress.
I have tried your code and done a quick benchmark comparison with my own version of Myers' algorithm. I'm pleased to see that you have excellent performance.
Thanks for looking at it, and benchmarking too, I really appreciate it!
Some comments:
- I think it should be possible to relax the requirement for random access iterators. Either continue to store integers and then use std::advance, or store iterators. I'm considering, for example, comparing UTF-8 with an iterator adaptor that changes a random-access iterator over the underlying bytes into a bidirectional iterator over the characters.
I briefly toyed with this idea earlier on, but it seemed (imo) like it would undermine the virtues of the algorithm. I guess what I mean by that is, Myers does stuff like storing only one index in the working vector 'V' for memory efficiency, and so we have "j2 = j1-k", which either requires index manipulation, or if you wanted to do it via iterators, it is only efficient for random-access iterators. Similar issues apply to other checks, e.g. "if ((j1-j2) == (r1-r2) && j1 >= r1) return 2*D-1;" Part of the genius of the algorithm is how baked-down it is, and trying to support non-random-access iterators would either turn certain operations from constant-time to linear time, or require carrying around indexes as well as iterators, at substantial additional memory cost. For that, I have the SSSP specialization. However, I might be mis-reading the trade-offs. I'll keep thinking about it.
- Have you looked at the supposedly better algorithm "An O(NP) Sequence Comparison Algorithm" by Sun Wu, Udi Manber, and Gene Myers ?
I have not, I'll read up on it.
struct output {
    template <typename ITER_T>
    void insertion(ITER_T begin, ITER_T end) { do_something_with(begin, end); }
};
One issue with the above is that the actual operation cost can be different for each element. So I can't get away with just passing begin/end iterators. I'd have to also provide some kind of iteration over the corresponding edit costs. Unless one didn't care about the cost information.

How to best handle script output has been thorny for me. Not every user is going to care about all the same information. My philosophy has been that I will provide it, and it can be ignored. Using in-line methods, ignoring it can even allow the compiler to avoid generating code around it, I think.

Another implication is that it requires doing what you did above -- writing handler methods templated to account for unknown/opaque iterator types, which seems less easy to think about than describing the handler methods as simply taking element values. I would also note that substitution() and equality() involve two iterator types, two begin/end pairs, etc. However, the potential scaling benefits on large sequences might make some added complexity worthwhile.
- Some might prefer a version that doesn't use Boost.Parameter.
I'm not sure how that would look: would it just take all 8 parameters all the time?
- In the same vein, you might offer a version that takes an iterator pair rather than a range.
I agree -- I originally put off figuring out the examples that involved multiple function specializations combined with Boost.Parameter, but it could be time to revisit. Given that both range arguments and hypothetical iterator arguments are templates, I wasn't even sure if I could get it to properly differentiate. Tangentially, I'm not totally clear if the 'range' view of the universe is The Future, and doing function interfaces taking iterator pairs is now deprecated, or if the idea is to support both schemes in parallel.
- My experience has been that early termination is best expressed in terms of RAM, i.e. the caller could indicate a maximum amount of RAM, in bytes, rather than a maximum for the edit cost. You can probably convert between them via sizeof.
Agreed -- tangentially, I filed an issue for myself using maximum time: https://github.com/erikerlandson/edit_distance/issues/6
- I'm not convinced that the /sequence/ in the include path is useful. Why not just put it in boost/algorithm/ ?
I'm pretty open to ideas about where it should live. My logic is that there are other worthy proposals for sequence-related algorithms, and so it might be good to open up a new bucket that might include things besides edit_distance(). For example, last year a couple different proposals of this nature were submitted as GSoC candidates, like this one: http://lists.boost.org/Archives/boost/2013/04/202971.php
- It would be great to see a benchmark against GNU diff!
Also a good idea, although my impression of that code (which I poked around while doing this) is that it isn't factored in a way that would make dropping in a replacement very practical. I suppose the other option is to write a version of diff using edit_distance, which in fact is more or less the kind of project that got me to wanting a generic edit_distance library in the first place.
Erik Erlandson
- I think it should be possible to relax the requirement for random access iterators. Either continue to store integers and then use std::advance, or store iterators. I'm considering, for example, comparing UTF-8 with an iterator adaptor that changes a random-access iterator over the underlying bytes into a bidirectional iterator over the characters.
I briefly toyed with this idea earlier on, but it seemed (imo) like it would undermine the virtues of the algorithm. I guess what I mean by that is, Myers does stuff like storing only one index in the working vector 'V' for memory efficiency, and so we have "j2 = j1-k", which either requires index manipulation, or if you wanted to do it via iterators, it is only efficient for random-access iterators.
OK, I've had a proper look at this. If sizeof(iterator) == sizeof(size_t), then I think you have to store a pair of iterators, which doubles the memory requirements. I was hoping to find that you could continue to store a single integer and use std::advance and keep O(ND), but I think it probably becomes O(N D^2). So probably not worth doing.
- Some might prefer a version that doesn't use Boost.Parameter.
I'm not sure how that would look: would it just take all 8 parameters all the time?
Yes, with defaults.
- It would be great to see a benchmark against GNU diff!
Also a good idea, although my impression of that code (which I poked around while doing this) is that it isn't factored in a way that would make dropping in a replacement very practical. I suppose the other option is to write a version of diff using edit_distance
Yes, that's exactly the sort of thing I had in mind; something that would work like diff, though without most of its myriad options. You could use this as a tutorial, as suggested in another post. Another tutorial idea is an approximate match / spell check replacement word suggester, i.e. scan /usr/dict/words and output words closest to the input by edit distance. You could use the max_cost feature. Regards, Phil.
----- Original Message -----
- My experience has been that early termination is best expressed in terms of RAM, i.e. the caller could indicate a maximum amount of RAM, in bytes, rather than a maximum for the edit cost. You can probably convert between them via sizeof.
As I see it, there are three early-termination options on the table:

_max_cost
_max_memory
_max_time

Each of those has its potential utility. It is easy to code them, and I have techniques to implement them so that they cost nothing when they aren't invoked: https://gist.github.com/erikerlandson/7455677

A few questions on my mind:

1) Adding any further parameters will require increasing BOOST_PARAMETER_MAX_ARITY from its default value. Is it best to just do that automatically in edit_distance.hpp, or explain to the user that they should do it?

2) Currently, _max_cost comes paired with the _max_cost_exception flag. Adding _max_memory and _max_time would seem to imply _max_memory_exception and _max_time_exception. So now we have six optional parameters governing early-termination conditions. Not sure I really have a problem with that, but it's getting populous. It is possible that the '*_exception' flags could be collapsed to a single flag, say '_terminate_exception'. I'm not sure how likely it is that a person would want to use fallback for one condition, and throw an exception on another.

3) Maybe better names would be _terminate_cost, etc.?

4) I see that there are now even monadic approaches to exception handling: boost::make_optional(). Is exception throwing deprecated? Does make_optional require c++11? Does _terminate_exception change from a boolean flag to an enumerated type: { fallback, exception, optional }?
----- Original Message -----
As I see it, there are three early-termination options on the table:
_max_cost _max_memory _max_time
Or, we might unify all of it into a single parameter:

_terminate = cost(cost_threshold, [bool exception - defaults to false])
_terminate = time(time_threshold, etc)
_terminate = memory(mem_threshold, etc)

That is, 'cost', 'time' and 'memory' are data types that carry an appropriate threshold and a termination policy flag that defaults to 'false', aka 'fallback without exception.'

The above assumes only one kind of termination condition. That might be OK. Time, memory and cost are roughly surrogates of each other, and limiting any one of them limits the other two.

I guess it wouldn't be a c++ project if it didn't have a large, confusing solution space.
Each of those has its potential utility. It is easy to code them, and I have techniques to implement them so that they cost nothing when they aren't invoked: https://gist.github.com/erikerlandson/7455677
A few questions on my mind:
1) Adding any further parameters will require increasing BOOST_PARAMETER_MAX_ARITY from its default value. Is it best to just do that automatically in edit_distance.hpp, or explain to the user that they should do it?
2) Currently, _max_cost comes paired with the _max_cost_exception flag. Adding _max_memory and _max_time would seem to imply _max_memory_exception and _max_time_exception. So now we have six optional parameters governing early-termination conditions. Not sure I really have a problem with that, but it's getting populous. It is possible that the '*_exception' flags could be collapsed to a single flag, say '_terminate_exception'. I'm not sure how likely it is that a person would want to use fallback for one condition, and throw an exception on another.
3) maybe better names would be _terminate_cost, etc?
4) I see that there are now even monadic approaches to exception handling: boost::make_optional(). Is exception throwing deprecated? Does make_optional require c++11? Does _terminate_exception change from a boolean flag to an enumerated type: { fallback, exception, optional }?
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
On 01/30/2014 05:13 PM, Erik Erlandson wrote:
Or, we might unify all of it into a single parameter:
_terminate = cost(cost_threshold, [bool exception - defaults to false])
_terminate = time(time_threshold, etc)
_terminate = memory(mem_threshold, etc)
That is, 'cost', 'time' and 'memory' are data types that carry an appropriate threshold and a termination policy flag that defaults to 'false', aka 'fallback without exception.' The above assumes only one kind of termination condition. That might be OK. Time, memory and cost are roughly surrogates of each other, and limiting any one of them limits the other two.
This looks like a really good solution (and I assume that you are also going to rename _max_cost_exception to _terminate_exception.) You may consider using duration() (possibly chrono::duration()) instead of time().
----- Original Message -----
- Have you looked at the supposedly better algorithm "An O(NP) Sequence Comparison Algorithm" by Sun Wu, Udi Manber, and Gene Myers ?
Assuming I'm reading this paper correctly, their O(NP) improvement is essentially a tweak to the previous O(ND) algorithm: it examines a narrower band of path-heads on each iteration, combined with a clever re-parameterization of the distance as D = delta + 2*P. The semantics of the working vector look unchanged aside from its required width. If so, it may be a fairly simple modification of the existing code to get this working. Either way, their implied performance is uniformly better than the original O(ND) variation, so getting an implementation working eventually is definitely on my to-do list.
----- Original Message -----
----- Original Message -----
- Have you looked at the supposedly better algorithm "An O(NP) Sequence Comparison Algorithm" by Sun Wu, Udi Manber, and Gene Myers ?
The semantics of the working vector looks unchanged aside from its required width. If so, it may be a fairly simple modification of the existing code to get this working.
Hmm, not quite so easy. The indexing logic wants the 2nd string to be longer, and stores y instead of x. So it will be one of those algorithms that needs two variations: one for seq2 longer than seq1, and one for vice versa.
----- Original Message -----
- It would be great to see a benchmark against GNU diff!
So I cooked up an implementation of unix diff: https://github.com/erikerlandson/algorithm/blob/edit_distance/sequence/examp...

It produces standard default unix diff output: http://en.wikipedia.org/wiki/Diff#Usage

That is to say, edit_distance_diff_example will reproduce the output of diff, at least if you just give both programs two files for arguments:

$ edit_distance_diff_example foo.txt bar.txt > ed.out
$ diff foo.txt bar.txt > diff.out
$ diff ed.out diff.out
$

I generated a few benchmarking data-sets by editing /usr/share/dict/words. On my machine (F18), the 'words' file has 479,828 lines (one word per line) and is 4,953,680 bytes. So, a decent size of file in both line length and bytes.

I did a couple different experiments, where I compared against a variation that differed by around 150 lines, and then another variation that differed by around 1200 lines. I also varied input size, by just duplicating the contents, so the result was twice the length and size.

Across these variations of input size and file difference that I tested, edit_distance_diff_example ran consistently 40% to 60% faster than unix diff. So, roughly speaking, it's performing about twice as fast. And less than 150 lines of code!
----- Original Message -----
- It may be useful to group contiguous operations of the same kind, i.e. a "run" of N equal characters would result in one call to the output script object, rather than N calls. This can be achieved by merging in the script object, but that would be less efficient than passing the entire run as an iterator pair:
struct output {
    template <typename ITER_T>
    void insertion(ITER_T begin, ITER_T end) { do_something_with(begin, end); }
};
I've been mulling this over, and I have an idea that is starting to grow on me. What if we left insertion() and deletion() (and substitution) as-is: they continue to receive individual sequence elements, and operation scores. But we *do* modify the equality() method to take two ranges, in the manner you suggested above. It would introduce an asymmetry of equality() with respect to the other methods, but I think there are some real arguments for it:

*) The order of insertions and deletions identified by these algorithms is not generally "consecutive" - that is, if the edit script involves a sequence of deletions and insertions, you cannot count on these being presented consecutively, and that will tend to undermine the benefit of accepting such ranges in the handling methods.

*) The algorithms *do* identify consecutive runs of equal elements.

*) The insertion() and deletion() methods receive operation-cost information, and it isn't trivial to fold that information into an iterator stream.

*) The equality() method does *not* receive operation cost information, because that cost is always zero by definition. That means it is sensible (and easy) to simply present a range constructed of the iterators at the start and end of the equal run in question. This begin/end information for equal runs is readily available, and saves cost both internal to the algorithms, and (likely) externally as well.

*) Runs of 'equal' dominate the script in most larger-scale applications of interest. Actual insertions, deletions, etc., are relatively small. Compressing equal runs in this way is where almost all of the bang/buck resides.
*) This is more speculative, but just looking at how I ended up doing my 'diff' benchmark implementation, accepting ranges for insertion() and deletion() would actually have required me to do a bit *more* coding, to handle the logic of appending an incoming range of data onto my 'ins' and 'del' vectors, compared to simply push_back() as I did: https://github.com/erikerlandson/algorithm/blob/edit_distance/sequence/examp...
Hi Erik,
Erik Erlandson
- It may be useful to group contiguous operations of the same kind, i.e. a "run" of N equal characters would result in one call to the output script object, rather than N calls. This can be achieved by merging in the script object, but that would be less efficient than passing the entire run as an iterator pair:
struct output {
    template <typename ITER_T>
    void insertion(ITER_T begin, ITER_T end) { do_something_with(begin, end); }
};
I've been mulling this over, and I have an idea that is starting to grow on me.
What if we left insertion() and deletion() (and substitution) as-is: they continue to receive individual sequence elements, and operation scores. But we *do* modify the equality() method to take two ranges, in the manner you suggested above. It would introduce an asymmetry of equality() with respect to the other methods, but I think there are some real arguments for it:
[snip]
*) Runs of 'equal' dominate the script in most larger-scale applications of interest. Actual insertions, deletions, etc, are relatively small.
I can't agree with that; for me, a common case is where the two inputs are entirely different.

Here's another possibility: don't pass ranges, but do pass iterators:

struct output {
    ITER pending;
    enum ... last_action;
    void insertion(ITER i, int) {
        if (last_action != insert) {
            do_something_with(pending, i);
            pending = i;
            last_action = insert;
        }
    }
    // etc.
};

The point here is that I just need to store the iterator for the start of the range until I get a different action; in contrast, if the value is passed I need to accumulate a copy of the range.

Regards, Phil.
----- Original Message -----
I've been mulling this over, and I have an idea that is starting to grow on me.
What if we left insertion() and deletion() (and substitution) as-is: they continue to receive individual sequence elements, and operation scores. But we *do* modify the equality() method to take two ranges, in the manner you suggested above. It would introduce an asymmetry of equality() with respect to the other methods, but I think there are some real arguments for it:
[snip]
*) Runs of 'equal' dominate the script in most larger-scale applications of interest. Actual insertions, deletions, etc, are relatively small.
I can't agree with that; for me, a common case is where the two inputs are entirely different.
Something of a tangent, but I'm curious how that works for you. When two strings are entirely different, the algorithm cost is O(NxM). On larger data, say a million elements, that's terabytes of working data, and tera-ops.
Here's another possibility: don't pass ranges, but do pass iterators:
struct output {
    ITER pending;
    enum ... last_action;
    void insertion(ITER i, int) {
        if (last_action != insert) {
            do_something_with(pending, i);
            pending = i;
            last_action = insert;
        }
    }
    // etc.
};
The point here is that I just need to store the iterator for the start of the range until I get a different action; in contrast, if the value is passed I need to accumulate a copy of the range.
That seems promising. It makes it straightforward to handle the script data either as it appears, or save up an iterator range. I could use a similar variation on the 'diff' example, and not bother to copy the file data, just refer back to the original for output.
Hi Erik,
Erik Erlandson
*) Runs of 'equal' dominate the script in most larger-scale applications of interest. Actual insertions, deletions, etc, are relatively small.
I can't agree with that; for me, a common case is where the two inputs are entirely different.
Something of a tangent, but I'm curious how that works for you. When two strings are entirely different, the algorithm cost is O(NxM). On larger data, say a million elements, that's terabytes of working data, and tera-ops.
This is what the max-cost / max-memory / max-time is useful for. I just give up and return the "trivial" output. Regards, Phil.
----- Original Message -----
- Have you looked at the supposedly better algorithm "An O(NP) Sequence Comparison Algorithm" by Sun Wu, Udi Manber, and Gene Myers ?
So, I've been fooling around with this O(NP) variation. It took a few weeks of head scratching, but I think I have worked out how to properly do the bi-directional variation: https://github.com/erikerlandson/algorithm/blob/order_np_alg/include/boost/a... As you can see, the code is all up on blocks, but you can get the flavor of how it works. More work remaining to do, with additional testing, plugging the fallback logic back in, porting it over to the version that extracts edit scripts, benchmarking, etc. But looks promising.
- Have you looked at the supposedly better algorithm "An O(NP) Sequence Comparison Algorithm" by Sun Wu, Udi Manber, and Gene Myers ?
I've been doing more cross-testing and benchmarking against the current O(ND) version. The latest O(NP) prototype is passing randomized cross-testing and is now about 25% faster than the O(ND) baseline. The latest code now looks like this: https://github.com/erikerlandson/algorithm/blob/order_np_alg/include/boost/a... To actually get it performing faster than O(ND), I had to reorganize the looping over the diagonals so it proceeds outside-in, and derive adaptive looping bounds as a function of P and current-best distance. In the paper, Myers et al claim more like a 50% improvement, but I'm not sure if their experiments were comparing the two uni-directional variations. It should use closer to 50% less memory, but I haven't done measurements.
----- Original Message -----
- Have you looked at the supposedly better algorithm "An O(NP) Sequence Comparison Algorithm" by Sun Wu, Udi Manber, and Gene Myers ?
I've been doing more cross-testing and benchmarking against the current O(ND) version. The latest O(NP) prototype is passing randomized cross-testing and is now about 25% faster than the O(ND) baseline.
I wrote up a blog post that describes the various ideas I applied to get this result: http://erikerlandson.github.io/blog/2014/02/20/a-bi-directional-variation-of...
----- Original Message -----
- Have you looked at the supposedly better algorithm "An O(NP) Sequence Comparison Algorithm" by Sun Wu, Udi Manber, and Gene Myers ?
I've been doing more cross-testing and benchmarking against the current O(ND) version. The latest O(NP) prototype is passing randomized cross-testing and is now about 25% faster than the O(ND) baseline.
My latest prototype code based on O(NP) is now running about 35% faster than O(ND) in my benchmarking: https://github.com/erikerlandson/algorithm/blob/order_np_alg/include/boost/a...
I wrote up a blog post that describes the various ideas I applied to get this result:
http://erikerlandson.github.io/blog/2014/02/20/a-bi-directional-variation-of...
Hello Erik,
I have received your request and will add the library to the review schedule.
On 2014-01-27, at 9:26 AM, Erik Erlandson
I would like to request a formal review for inclusion of the edit_distance() function.
Description: The edit distance is the length of the shortest (or least-cost) edit script from one sequence to another, where an edit script is defined as a sequence of insertion, deletion and (optionally) substitution operations. The function implementing the edit distance is named edit_distance. This function will return the edit distance between two sequences, where sequences may be any valid range object supporting forward iteration. The edit_distance function will also, if requested, return the edit script. http://en.wikipedia.org/wiki/Edit_distance
Documentation: http://erikerlandson.github.io/algorithm/libs/algorithm/doc/html/algorithm/S...
Code resides on a branch of my fork of Boost.Algorithm: https://github.com/erikerlandson/algorithm/tree/edit_distance
The feature branch above includes the doc, unit tests and examples that are integrated via Jamfiles. The new dev resides specifically in this subdir: https://github.com/erikerlandson/algorithm/tree/edit_distance/sequence
I have built the tests and examples on gcc and clang, with and without c++11 mode. I don't have access to a windows/MSVC build env, so I am definitely curious if it compiles properly in that env.
My concept is that boost::algorithm::sequence can eventually serve as a home to additional related algorithms, for example other sequence comparison algorithms or perhaps other categories of sequence-related algorithms.
----- Original Message -----
Hello Erik,
I have received your request and will add the library to the review schedule.
Thank you Ron!

I have a question. I was looking at the current review schedule: http://www.boost.org/community/review_schedule.html

Taking the '10 days per review' as typical, it appears that there is a six-month backlog of reviews, assuming 10-day review sessions back to back. That seems like a long backlog; am I missing something?
On 2/4/14, Erik Erlandson
I have a question.
I was looking at the current review schedule: http://www.boost.org/community/review_schedule.html
Taking the '10 days per review' as typical, it appears that there is a six-month backlog of reviews. Assuming 10 day review sessions, back to back. That seems like a long backlog, am I missing something?
There doesn't appear to be a list of available volunteer review managers to assign for any particular request. If a potential review manager volunteers themselves and the submitter is agreeable a review date can be set, otherwise I think the consensus is for the submitter to seek a willing volunteer to manage the review. In short, some of the entries in the review schedule have been there longer than six months, and some have come later, gotten a review manager and have been reviewed. Regards Brian -- www.maidsafe.net
On 02/04/2014 11:05 PM, Brian Smith wrote:
There doesn't appear to be a list of available volunteer review managers to assign for any particular request. If a potential review manager volunteers themselves and the submitter is agreeable a review date can be set, otherwise I think the consensus is for the submitter to seek a willing volunteer to manage the review. In short, some of
If Marshall Clow agrees, then I would like to volunteer as review manager for edit_distance.
----- Original Message -----
On 02/04/2014 11:05 PM, Brian Smith wrote:
There doesn't appear to be a list of available volunteer review managers to assign for any particular request. If a potential review manager volunteers themselves and the submitter is agreeable a review date can be set, otherwise I think the consensus is for the submitter to seek a willing volunteer to manage the review. In short, some of
If Marshall Clow agrees, then I would like to volunteer as review manager for edit_distance.
Has Marshall Clow weighed in on this yet?
Erik Erlandson
I have received your request and will add the library to the review schedule.
[snip]
As a proposed addition to Boost.Algorithm, my understanding is that this applies (from the Boost.Algorithm introduction):

"I will be soliciting submissions from other developers, as well as looking through the literature for existing algorithms to include. [...] My goal is to run regular algorithm reviews, similar to the Boost library review process, but with smaller chunks of code."

To date I don't believe that any code other than the originally-reviewed content has been added to Boost.Algorithm. Marshall, are you reading this? How do you propose to handle it?

Regards, Phil.
On Jan 27, 2014, at 12:26 PM, Erik Erlandson
I would like to request a formal review for inclusion of the edit_distance() function.
[snip]
Documentation: http://erikerlandson.github.io/algorithm/libs/algorithm/doc/html/algorithm/S...
I realize that your documentation is for an algorithm, and not a library, but you would do well to give use cases to which it applies and even flesh out one of them in a tutorial. The example code you supplied merely illustrates usage. Those who don't already understand their need for computing the edit distance will ignore this algorithm without more information.

___ Rob (Sent from my portable computation engine)
----- Original Message -----
[snip]
Both of those seem like good ideas. I've had an idea for a tutorial in the back of my mind, where I would write a rudimentary unix-diff style program. That reminds me of a question. One angle I thought would be fun is if I could use a range (or iterator) that iterates directly over lines in a text file (analogous to how you can say "for line in File" in python). Mostly just to show that you can use any exotic kind of range you want to. I don't know if such an iterator exists, or if it is forward and not just single-pass. It seems like something somebody would have already written, but my googling hasn't turned one up.
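(Editor's note: in the absence of a ready-made forward line iterator, one workaround is to materialize the lines into a `std::vector<std::string>`, which then models a random-access — hence forward — range. The sketch below is illustrative only: `read_lines` and `line_edit_distance` are hypothetical names, and the plain dynamic-programming Levenshtein stands in for the proposed `edit_distance()` so the example is self-contained.)

```cpp
#include <algorithm>
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Materialize the lines of any istream into a vector. The result is a
// random-access range of lines, usable wherever a forward range is required.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}

// Plain dynamic-programming Levenshtein distance over two vectors, standing
// in for the proposed edit_distance(). Unit cost for insertion, deletion
// and substitution; O(len(b)) memory via the two-row trick.
template <typename T>
std::size_t line_edit_distance(const std::vector<T>& a, const std::vector<T>& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, sub});
        }
        prev.swap(cur);
    }
    return prev[b.size()];
}
```

With this, a rudimentary diff tutorial only needs `read_lines` on two `std::ifstream`s and a distance (or script) computation over the two vectors; no exotic iterator is required, at the cost of holding both files in memory.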
On Jan 28, 2014, at 6:11 PM, Erik Erlandson
----- Original Message -----
[snip]
That reminds me of a question. One angle I thought would be fun is if I could use a range (or iterator) that iterates directly over lines in a text file (analogous to how you can say "for line in File" in python). Mostly just to show that you can use any exotic kind of range you want to. I don't know if such an iterator exists, or if it is forward and not just single-pass. It seems like something somebody would have already written, but my googling hasn't turned one up.
Eric Niebler had a blog post about this just a few months ago. http://ericniebler.com/2013/10/13/out-parameters-vs-move-semantics/ — Marshall
On Tue, Jan 28, 2014 at 9:11 PM, Erik Erlandson
I've had an idea for a tutorial in the back of my mind, where I would write a rudimentary unix-diff style program.
[snip]
I especially appreciate a diff tool that not only indicates different lines, but within each change block, indicates which characters have changed. Admittedly that often just highlights the whole block -- but for small edits it can be *really* helpful.
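(Editor's note: one cheap way diff tools approximate character-level highlighting within a changed line pair is to trim the longest common prefix and suffix and mark only the middle span. This is a heuristic sketch, not part of the proposed library; `changed_span` is a hypothetical name, and the returned span is expressed in coordinates of the first string.)

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Return the half-open [begin, end) span of characters in `a` that differ
// from `b`, computed by trimming the longest common prefix and suffix.
// For small edits this tightly brackets the change (an empty span marks a
// pure insertion point); for larger rewrites it degrades to most of the line.
std::pair<std::size_t, std::size_t>
changed_span(const std::string& a, const std::string& b) {
    std::size_t p = 0;
    while (p < a.size() && p < b.size() && a[p] == b[p])
        ++p;
    // Guard the suffix scan with the prefix length so the two trims
    // never overlap (important for inputs like "aa" vs. "a").
    std::size_t s = 0;
    while (s < a.size() - p && s < b.size() - p &&
           a[a.size() - 1 - s] == b[b.size() - 1 - s])
        ++s;
    return {p, a.size() - s};
}
```

A full edit-script backtrace (as the proposed `edit_distance` can produce) gives a more precise per-character answer, but the prefix/suffix trim is O(n), allocation-free, and good enough for the "small edit" case Nat describes.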
On Jan 28, 2014, at 10:09 PM, Nat Goodspeed
[snip]
I especially appreciate a diff tool that not only indicates different lines, but within each change block, indicates which characters have changed. Admittedly that often just highlights the whole block -- but for small edits it can be *really* helpful.
Emacs' ediff FTW! ;) ___ Rob (Sent from my portable computation engine)
----- Original Message -----
[snip]
I especially appreciate a diff tool that not only indicates different lines, but within each change block, indicates which characters have changed. Admittedly that often just highlights the whole block -- but for small edits it can be *really* helpful.
meld is nice for this, and viewing diffs in general: http://meld.sourceforge.net/ you can hook meld into git: http://blog.deadlypenguin.com/blog/2011/05/03/using-meld-with-git-diff/
On 1/28/2014 20:42 PM, Rob Stewart wrote:
[snip] you would do well to give use cases to which it applies and even flesh out one of them in a tutorial.
A few years ago I wrote a fuzzy string comparison using edit distance, which was used to merge two databases that shared common fields (names/addresses) but contained a lot of human errors. Maybe that would make a great use case, as it illustrates the usefulness of the algorithm as well as stressing its performance characteristics. Cheers, Brandon
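(Editor's note: the record-merging use case Brandon describes typically reduces to a normalized edit distance with a tolerance threshold. The sketch below is illustrative, not from the proposed library: `levenshtein` stands in for `edit_distance()`, `fuzzy_match` is a hypothetical helper, and the 0.2 tolerance is an arbitrary example value.)

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Classic dynamic-programming Levenshtein distance between two strings,
// standing in for the proposed edit_distance() to keep this self-contained.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, sub});
        }
        prev.swap(cur);
    }
    return prev[b.size()];
}

// Treat two field values as "the same record" when their edit distance is a
// small fraction of the longer value's length. The tolerance is an
// illustrative default; real deduplication would tune it per field.
bool fuzzy_match(const std::string& a, const std::string& b, double tol = 0.2) {
    std::size_t n = std::max(a.size(), b.size());
    if (n == 0) return true;  // two empty fields trivially match
    return static_cast<double>(levenshtein(a, b))
           <= tol * static_cast<double>(n);
}
```

Note this is also where the early-termination parameter discussed in the review request pays off: when only "distance ≤ tol * n" matters, the search can be abandoned as soon as the bound is exceeded, rather than computing the exact distance.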
participants (9)
- Bjorn Reese
- Brandon Kohn
- Brian Smith
- Erik Erlandson
- Marshall Clow
- Nat Goodspeed
- Phil Endecott
- Rob Stewart
- Ron Garcia