[gsoc 2013] Approximate string matching

Hi, I'm Jan Strnad, currently a Master's student at CTU in Prague, Czech Republic. This summer, I would like to participate in GSOC and I believe that approximate string matching is a good match for me, since my background i this area is quite strong. Now more specifically. I would like to implement approximate string matching algorithm(s) based on the NFA. I would like to support various approximate distances, such as Hamming distance [1], Levenstein distance [2], Damerau-Levenstein distance [3], Delta distance, Gamma distance and finally (delta, gamma) distance. Sorry no explanation of the last three found on the Web, but I can provide one if interested. In the matter of NFA implementation, I would probably choose dynamic programming approach, but I'm thinking of shift-or algorithm as well. Can I ask you for any kind of thoughts or feedback regarding this topic? Best regards, Jan Strnad [1] http://en.wikipedia.org/wiki/Hamming_distance [2] http://en.wikipedia.org/wiki/Levenstein_distance [3] http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

On 4/12/13 4:52 AM, Jan Strnad wrote:
Hi, I'm Jan Strnad, currently a Master's student at CTU in Prague, Czech Republic. This summer, I would like to participate in GSOC and I believe that approximate string matching is a good match for me, since my background i this area is quite strong.
Now more specifically. I would like to implement approximate string matching algorithm(s) based on the NFA. I would like to support various approximate distances, such as Hamming distance [1], Levenstein distance [2], Damerau-Levenstein distance [3], Delta distance, Gamma distance and finally (delta, gamma) distance. Sorry no explanation of the last three found on the Web, but I can provide one if interested.
In the matter of NFA implementation, I would probably choose dynamic programming approach, but I'm thinking of shift-or algorithm as well.
Can I ask you for any kind of thoughts or feedback regarding this topic?
Best regards, Jan Strnad
[1] http://en.wikipedia.org/wiki/Hamming_distance [2] http://en.wikipedia.org/wiki/Levenstein_distance [3] http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
Sounds very useful! I guess this would fit in the string algorithm library? What would the interface look like? Is it as simple as: namespace boost { namespace algorithm { template< typename Range1T, typename Range2T > std::size_t levenstein_distance( Range1T first, Range2T second ); } }

Dne 12.4.2013 19:50, Michael Marcin napsal(a):
On 4/12/13 4:52 AM, Jan Strnad wrote:
Hi, I'm Jan Strnad, currently a Master's student at CTU in Prague, Czech Republic. This summer, I would like to participate in GSOC and I believe that approximate string matching is a good match for me, since my background i this area is quite strong.
Now more specifically. I would like to implement approximate string matching algorithm(s) based on the NFA. I would like to support various approximate distances, such as Hamming distance [1], Levenstein distance [2], Damerau-Levenstein distance [3], Delta distance, Gamma distance and finally (delta, gamma) distance. Sorry no explanation of the last three found on the Web, but I can provide one if interested.
In the matter of NFA implementation, I would probably choose dynamic programming approach, but I'm thinking of shift-or algorithm as well.
Can I ask you for any kind of thoughts or feedback regarding this topic?
Best regards, Jan Strnad
[1] http://en.wikipedia.org/wiki/Hamming_distance [2] http://en.wikipedia.org/wiki/Levenstein_distance [3] http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
Sounds very useful!
I guess this would fit in the string algorithm library?
That seems to me reasonable as well.
What would the interface look like?
Is it as simple as:
namespace boost { namespace algorithm {
template< typename Range1T, typename Range2T > std::size_t levenstein_distance( Range1T first, Range2T second );
} }
Sure, that would be possible, but I was originally aiming at something little bit different. Basically, I want to design an algorithm that would find all (or just the first) occurrences of a given pattern with given distance in some string. So to your question, simple interface could look like this: namespace boost { namespace algorithm { template <typename T> class approximate_distance; template <typename TRange> find_iterator<TRange> approximate_find (TRange text, approximate_distance<TRange>& model); } } And a simple usage: string pattern = "acbdef"; approximate_distance<string>* model = new hamming_distance<string>(pattern, 1); find_iterator<string> result = approximate_find("aacdefbcdefabccef"); while(result()) { cout << * result << endl; ++result; } // prints aacdef, fbcdef and abccef

Jan Strnad
Dne 12.4.2013 19:50, Michael Marcin napsal(a):
On 4/12/13 4:52 AM, Jan Strnad wrote:
...
Now more specifically. I would like to implement approximate string matching algorithm(s) based on the NFA. I would like to support various approximate distances, such as Hamming distance [1], Levenstein distance [2], Damerau-Levenstein distance [3], Delta distance, Gamma distance and finally (delta, gamma) distance. ...
Sounds very useful!
I guess this would fit in the string algorithm library?
That seems to me reasonable as well.
What would the interface look like?
Is it as simple as:
namespace boost { namespace algorithm {
template< typename Range1T, typename Range2T > std::size_t levenstein_distance( Range1T first, Range2T second );
} }
Sure, that would be possible, but I was originally aiming at something little bit different. Basically, I want to design an algorithm that would find all (or just the first) occurrences of a given pattern with given distance in some string.
So to your question, simple interface could look like this: namespace boost { namespace algorithm { template <typename T> class approximate_distance;
template <typename TRange> find_iterator<TRange> approximate_find (TRange text, approximate_distance<TRange>& model); } } ...
I would recommend starting off with the basic implementations of these distance algorithms if possible. There are several situations where I currently use my own Levenshtein distance implementation, and could use other alternative 'distance' measures, but the approximate_find is useless here. Of course it could be useful in other situations, but I would suggest exposing the underlying distance algorithms first, so they can then be adapted into further utilities like approximate_find or used directly as necessary. -Adam D. Walling

Dne 15.4.2013 15:11, Adam D. Walling napsal(a):
Jan Strnad
writes: Dne 12.4.2013 19:50, Michael Marcin napsal(a):
On 4/12/13 4:52 AM, Jan Strnad wrote:
...
Now more specifically. I would like to implement approximate string matching algorithm(s) based on the NFA. I would like to support various approximate distances, such as Hamming distance [1], Levenstein distance [2], Damerau-Levenstein distance [3], Delta distance, Gamma distance and finally (delta, gamma) distance. ...
Sounds very useful!
I guess this would fit in the string algorithm library?
That seems to me reasonable as well.
What would the interface look like?
Is it as simple as:
namespace boost { namespace algorithm {
template< typename Range1T, typename Range2T > std::size_t levenstein_distance( Range1T first, Range2T second );
} }
Sure, that would be possible, but I was originally aiming at something little bit different. Basically, I want to design an algorithm that would find all (or just the first) occurrences of a given pattern with given distance in some string.
So to your question, simple interface could look like this: namespace boost { namespace algorithm { template <typename T> class approximate_distance;
template <typename TRange> find_iterator<TRange> approximate_find (TRange text, approximate_distance<TRange>& model); } } ...
I would recommend starting off with the basic implementations of these distance algorithms if possible. There are several situations where I currently use my own Levenshtein distance implementation, and could use other alternative 'distance' measures, but the approximate_find is useless here.
Of course it could be useful in other situations, but I would suggest exposing the underlying distance algorithms first, so they can then be adapted into further utilities like approximate_find or used directly as necessary.
-Adam D. Walling
OK, I should definitely include these "word" distance functions as well. Jan Strnad

On 04/12/2013 11:52 AM, Jan Strnad wrote:
In the matter of NFA implementation, I would probably choose dynamic programming approach, but I'm thinking of shift-or algorithm as well.
You may also want to look into Thomson NFAs: http://swtch.com/~rsc/regexp/regexp1.html

Dne 13.4.2013 10:41, Bjorn Reese napsal(a):
On 04/12/2013 11:52 AM, Jan Strnad wrote:
In the matter of NFA implementation, I would probably choose dynamic programming approach, but I'm thinking of shift-or algorithm as well.
You may also want to look into Thomson NFAs:
When I was writing about NFA I actually meant Thomson like NFA -- no backtracking required. I still believe that dynamic programming approach is the best: its complexity is low (O(m*n), m is length of pattern, n is length of a string) and so are memory requirements (e.g. for Hamming distance k it is O(k*m)). There is another area where NFA can be used: finding (approximate) repetitions in a string. However, I'm not sure how useful/desired that functionality is.

On 04/13/2013 01:45 PM, Jan Strnad wrote:
Dne 13.4.2013 10:41, Bjorn Reese napsal(a):
On 04/12/2013 11:52 AM, Jan Strnad wrote:
In the matter of NFA implementation, I would probably choose dynamic programming approach, but I'm thinking of shift-or algorithm as well.
You may also want to look into Thomson NFAs:
When I was writing about NFA I actually meant Thomson like NFA -- no backtracking required.
Maybe I was a bit to brief in my previous reply. What I meant was that you should consider, as an alternative to dynamic programming and shift-or, to execute the NFA in a simulation as described in the above link. This is just a suggestion though.

Dne 13.4.2013 14:30, Bjorn Reese napsal(a):
On 04/13/2013 01:45 PM, Jan Strnad wrote:
Dne 13.4.2013 10:41, Bjorn Reese napsal(a):
On 04/12/2013 11:52 AM, Jan Strnad wrote:
In the matter of NFA implementation, I would probably choose dynamic programming approach, but I'm thinking of shift-or algorithm as well.
You may also want to look into Thomson NFAs:
When I was writing about NFA I actually meant Thomson like NFA -- no backtracking required.
Maybe I was a bit to brief in my previous reply. What I meant was that you should consider, as an alternative to dynamic programming and shift-or, to execute the NFA in a simulation as described in the above link. This is just a suggestion though.
Sure, I understood you in the first place. I'm considering all possibilities including NFA simulation. All I meant is that at this moment dyn. prog. approach is suits the problem in my opinion the best. I can't think of any advantage of NFA simulation against dyn. prog., and I only see disadvantages (preprocessing required, higher memory requirements).

So, to sum it up: I would like to implement following approximate distance functions as an extension of Boost String Algorithm library: Hamming distance Levenstein distance Generalized Levenstein (weighted Levenstein) distance Damerau–Levenshtein distance Weighted Damerau–Levenshtein distance Jaro distance Jaro-Winkler distance For an ordered alphabet, I would also implement Delta distance Gamma distance The interface for most distances would look like: template< typename Range1T, typename Range2T > std::size_t hamming_distance( Range1T first, Range2T second ); Jaro and Jaro-Winkler distances have following interface (return float instead of size_t): template< typename Range1T, typename Range2T > float hamming_distance( Range1T first, Range2T second ); Finally, I would implement functions template <typename TRange> find_iterator<TRange> approximate_find (TRange text, TRange pattern, approximate_distance<TRange>& distance); template <typename TRange> find_iteratorstd::size_t approximate_find_index (TRange text, TRange pattern, approximate_distance<TRange>& distance); These would return all approximate matches of pattern int text. The first one return the approximate string itself, the later one returns its index. Do you have any further suggestions? Is there anybody willing to mentor this project? Jan Strnad

On 4/17/2013 12:40 PM, Jan Strnad wrote:
Finally, I would implement functions template <typename TRange> find_iterator<TRange> approximate_find (TRange text, TRange pattern, approximate_distance<TRange>& distance);
template <typename TRange> find_iteratorstd::size_t approximate_find_index (TRange text, TRange pattern, approximate_distance<TRange>& distance);
These would return all approximate matches of pattern int text. The first one return the approximate string itself, the later one returns its index.
These functions are for finding substrings in the text? This seems to me like how the regex_search interface works. Is it possible to make the interface similar? Something like string pattern = "acbdef"; string text = "aacdefbcdefabccef"; boost::amatch m; boost::approx e( hamming_distance, pattern, 1 ); while( boost::approx_search( text, m, e ) { cout << m.str() << endl; text = m.suffix().str(); } // prints aacdef, fbcdef abccef Although I could be seeing similarities where they don't exist.

Dne 17.4.2013 20:00, Michael Marcin napsal(a):
On 4/17/2013 12:40 PM, Jan Strnad wrote:
Finally, I would implement functions template <typename TRange> find_iterator<TRange> approximate_find (TRange text, TRange pattern, approximate_distance<TRange>& distance);
template <typename TRange> find_iteratorstd::size_t approximate_find_index (TRange text, TRange pattern, approximate_distance<TRange>& distance);
These would return all approximate matches of pattern int text. The first one return the approximate string itself, the later one returns its index.
These functions are for finding substrings in the text?
This seems to me like how the regex_search interface works. Is it possible to make the interface similar?
Something like
string pattern = "acbdef"; string text = "aacdefbcdefabccef"; boost::amatch m; boost::approx e( hamming_distance, pattern, 1 ); while( boost::approx_search( text, m, e ) { cout << m.str() << endl; text = m.suffix().str(); } // prints aacdef, fbcdef abccef
Although I could be seeing similarities where they don't exist.
Well, the problem with this interface is that may not find all occurrences of pattern. Consider text="ababa" and pattern "aba". There are clearly two matches, the first starts at index 0 and the second at index 2. Using the interface above, I only get the one at index 0. The point is that I can not just drop all matched characters. I may still need them. So, I think the interface could look like this: string pattern = "acbdef"; string text = "aacdefbcdefabccef"; boost::amatches m; boost::approx e( hamming_distance, pattern, 1 ); boost::approx_search(text, m, e); while(m.next()) { cout << m.str() << endl; } // prints aacdef, fbcdef abccef Therefore, boost:approx_search should return an interator-like structure. Jan Strnad

On 4/17/2013 4:54 PM, Jan Strnad wrote:
Dne 17.4.2013 20:00, Michael Marcin napsal(a):
On 4/17/2013 12:40 PM, Jan Strnad wrote:
Finally, I would implement functions template <typename TRange> find_iterator<TRange> approximate_find (TRange text, TRange pattern, approximate_distance<TRange>& distance);
template <typename TRange> find_iteratorstd::size_t approximate_find_index (TRange text, TRange pattern, approximate_distance<TRange>& distance);
These would return all approximate matches of pattern int text. The first one return the approximate string itself, the later one returns its index.
These functions are for finding substrings in the text?
This seems to me like how the regex_search interface works. Is it possible to make the interface similar?
Something like
string pattern = "acbdef"; string text = "aacdefbcdefabccef"; boost::amatch m; boost::approx e( hamming_distance, pattern, 1 ); while( boost::approx_search( text, m, e ) { cout << m.str() << endl; text = m.suffix().str(); } // prints aacdef, fbcdef abccef
Although I could be seeing similarities where they don't exist.
Well, the problem with this interface is that may not find all occurrences of pattern. Consider text="ababa" and pattern "aba". There are clearly two matches, the first starts at index 0 and the second at index 2. Using the interface above, I only get the one at index 0.
The point is that I can not just drop all matched characters. I may still need them.
So, I think the interface could look like this: string pattern = "acbdef"; string text = "aacdefbcdefabccef"; boost::amatches m; boost::approx e( hamming_distance, pattern, 1 ); boost::approx_search(text, m, e); while(m.next()) { cout << m.str() << endl; } // prints aacdef, fbcdef abccef
Therefore, boost:approx_search should return an interator-like structure.
I see. In that case try to use an iterator interface that plays more nicely with the STL. Maybe something closer to regex_iterator. Specifically look at the ForwardIterator concept. Maybe something like: string pattern = "acbdef"; string text = "aacdefbcdefabccef"; boost::approx_pattern ap( hamming_distance, pattern, 1 ); boost::sapprox_match_iterator begin( text, ap ); boost::sapprox_match_iterator end; std::for_each( begin, end, []( const boost::sapprox_match_result& what ) { cout << what.str() << " starting at " << what.position() << endl; } );

Dne 18.4.2013 05:59, Michael Marcin napsal(a):
On 4/17/2013 4:54 PM, Jan Strnad wrote:
Dne 17.4.2013 20:00, Michael Marcin napsal(a):
On 4/17/2013 12:40 PM, Jan Strnad wrote:
Finally, I would implement functions template <typename TRange> find_iterator<TRange> approximate_find (TRange text, TRange pattern, approximate_distance<TRange>& distance);
template <typename TRange> find_iteratorstd::size_t approximate_find_index (TRange text, TRange pattern, approximate_distance<TRange>& distance);
These would return all approximate matches of pattern int text. The first one return the approximate string itself, the later one returns its index.
These functions are for finding substrings in the text?
This seems to me like how the regex_search interface works. Is it possible to make the interface similar?
Something like
string pattern = "acbdef"; string text = "aacdefbcdefabccef"; boost::amatch m; boost::approx e( hamming_distance, pattern, 1 ); while( boost::approx_search( text, m, e ) { cout << m.str() << endl; text = m.suffix().str(); } // prints aacdef, fbcdef abccef
Although I could be seeing similarities where they don't exist.
Well, the problem with this interface is that may not find all occurrences of pattern. Consider text="ababa" and pattern "aba". There are clearly two matches, the first starts at index 0 and the second at index 2. Using the interface above, I only get the one at index 0.
The point is that I can not just drop all matched characters. I may still need them.
So, I think the interface could look like this: string pattern = "acbdef"; string text = "aacdefbcdefabccef"; boost::amatches m; boost::approx e( hamming_distance, pattern, 1 ); boost::approx_search(text, m, e); while(m.next()) { cout << m.str() << endl; } // prints aacdef, fbcdef abccef
Therefore, boost:approx_search should return an interator-like structure.
I see.
In that case try to use an iterator interface that plays more nicely with the STL. Maybe something closer to regex_iterator. Specifically look at the ForwardIterator concept.
Maybe something like:
string pattern = "acbdef"; string text = "aacdefbcdefabccef"; boost::approx_pattern ap( hamming_distance, pattern, 1 ); boost::sapprox_match_iterator begin( text, ap ); boost::sapprox_match_iterator end; std::for_each( begin, end, []( const boost::sapprox_match_result& what ) { cout << what.str() << " starting at " << what.position() << endl; } );
This seems as the best choice to me right now.

On Apr 17, 2013, at 10:40 AM, Jan Strnad
So, to sum it up: I would like to implement following approximate distance functions as an extension of Boost String Algorithm library:
[ snip]
Do you have any further suggestions? Is there anybody willing to mentor this project?
Jan -- I am the maintainer of the Boost String Algorithm library. While I do not have time this summer to be a mentor (typified by the fact that I am replying to a 10 day old email), I would be glad to look over your design/code and offer advice. -- Marshall Marshall Clow Idio Software mailto:mclow.lists@gmail.com A.D. 1517: Martin Luther nails his 95 Theses to the church door and is promptly moderated down to (-1, Flamebait). -- Yu Suzuki

----- Original Message -----
On Apr 17, 2013, at 10:40 AM, Jan Strnad
wrote: So, to sum it up: I would like to implement following approximate distance functions as an extension of Boost String Algorithm library:
I have been wondering if these distance functions should be specifically part of string algorithms, or more general algorithms. For example, edit distance is really a function on any two sequences, not just sequences of characters. Rather like std::sort() is an algorithm that can be applied to a sequence of any kind of object.

Dne 29.4.2013 03:07, Erik Erlandson napsal(a):
I have been wondering if these distance functions should be specifically part of string algorithms, or more general algorithms.
For example, edit distance is really a function on any two sequences, not just sequences of characters. Rather like std::sort() is an algorithm that can be applied to a sequence of any kind of object.
Well, I depends on the distance itself. Hamming, Levenstein and their relatives can be used with any kind of object (as long as it implements proper == operator). To compute Delta and Gamma distances I must be able to compute distance between two non-equal objects. This is easy for characters, but may not be so easy for "regular" objects. I'm not sure about Jaro and Jaro-Winkler distances right now, but I believe it is the same case as Hamming distance -- generalization should be possible. Hope this answers your question. Jan Strnad

On 4/28/2013 9:07 PM, Erik Erlandson wrote:
----- Original Message -----
On Apr 17, 2013, at 10:40 AM, Jan Strnad
wrote: So, to sum it up: I would like to implement following approximate distance functions as an extension of Boost String Algorithm library:
I have been wondering if these distance functions should be specifically part of string algorithms, or more general algorithms.
For example, edit distance is really a function on any two sequences, not just sequences of characters. Rather like std::sort() is an algorithm that can be applied to a sequence of any kind of object.
The Introduction of Boost String Algorithms Library states: Important note: In this documentation we use term string to designate a sequence of characters stored in an arbitrary container. A string is not restricted to std::basic_string and character does not have to be char or wchar_t, although these are most common candidates. Consult the design chapter to see precise specification of supported string types. see: http://www.boost.org/doc/libs/1_53_0/doc/html/string_algo.html#string_algo.i... Which refers to: http://www.boost.org/doc/libs/1_53_0/doc/html/string_algo/design.html Jeff

Dne 28.4.2013 18:32, Marshall Clow napsal(a):
I am the maintainer of the Boost String Algorithm library.
While I do not have time this summer to be a mentor (typified by the fact that I am replying to a 10 day old email), I would be glad to look over your design/code and offer advice.
Thanks. I would be quite useful. Jan Strnad

Dne 12.4.2013 11:52, Jan Strnad napsal(a):
Hi, I'm Jan Strnad, currently a Master's student at CTU in Prague, Czech Republic. This summer, I would like to participate in GSOC and I believe that approximate string matching is a good match for me, since my background i this area is quite strong.
Now more specifically. I would like to implement approximate string matching algorithm(s) based on the NFA. I would like to support various approximate distances, such as Hamming distance [1], Levenstein distance [2], Damerau-Levenstein distance [3], Delta distance, Gamma distance and finally (delta, gamma) distance. Sorry no explanation of the last three found on the Web, but I can provide one if interested.
In the matter of NFA implementation, I would probably choose dynamic programming approach, but I'm thinking of shift-or algorithm as well.
Can I ask you for any kind of thoughts or feedback regarding this topic?
Best regards, Jan Strnad
[1] http://en.wikipedia.org/wiki/Hamming_distance [2] http://en.wikipedia.org/wiki/Levenstein_distance [3] http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
Hi, I've posted my GSOC proposal on-line. Available at: http://webdev.fit.cvut.cz/~strnaj11/gsoc/ Do you have any tips/recommendations/criticism ...? Thanks. Jan Strnad
participants (7)
-
Adam D. Walling
-
Bjorn Reese
-
Erik Erlandson
-
Jan Strnad
-
Jeff Flinn
-
Marshall Clow
-
Michael Marcin