[gsoc-2013]Approximate string matching

Hello, I'm Yu Jiang, currently a master student from Tsinghua University in China. I'm attracted by the idea of "approximate string matching" since it's highly related about my research and also my interest. I have seen in a previous discussion, another student proposed to implement some basic matching functions such as "edit distance" for boost. He said by using dynamic programming, the time complexity is O(m*n). However, if we give a threshold T and only report the exact edit distance if it's no larger than T, the complexity can be O(T * min(m,n)). I think sometimes we just need to decide whether the edit distance is small enough. Furthermore, I'm also interested in approximate string search. That is, given a query string word Q and a candidate word set S, find all the strings in S who have the edit distance no larger than a threshold T with the query string. This is very useful in many area such as data integration, and I have strong knowledge in fast algorithms to solve this. So, is it compatible for gsoc 2013 and for boost? Looking forward to your opinions. Thanks, Yu

----- Original Message -----
From: "Yu Jiang"

2013/4/23 Erik Erlandson
How one would design the interface for edit_distance<> is an interesting question. In the basic general case, edit_distance<> is a function of two sequences, and a cost matrix for insertion/deletion/substitution/equality. In boost/template world, the "equality" test would of course be a template parameter, with the expected default. These days, I'd expect to be able to provide sequences as either begin/end iterators, or ranges. The cost matrix could be either provided as numeric values or functionals, with defaults to: 1/1/1/0.
Thanks for your advice! I'm working on writing a proposal. Looking forward to your further advice on my proposal.

----- Original Message -----
2013/4/23 Erik Erlandson
How one would design the interface for edit_distance<> is an interesting question. In the basic general case, edit_distance<> is a function of two sequences, and a cost matrix for insertion/deletion/substitution/equality. In boost/template world, the "equality" test would of course be a template parameter, with the expected default. These days, I'd expect to be able to provide sequences as either begin/end iterators, or ranges. The cost matrix could be either provided as numeric values or functionals, with defaults to: 1/1/1/0.
Thanks for your advice! I'm working on writing a proposal. Looking forward to your further advice on my proposal.
A few more thoughts: *) Cost functionals should take parameters, probably something like: struct substitution_cost { double operator()(const T& a, const T& b) const { // return some distance metric between a and b that represents // cost of substitution. In the classic case, this just returns 1 // if a != b, and zero if a == b. } } struct insertion_cost { double operator()(const T& a) const { // cost of deleting element (a), classic case is also just 1. } } *) Using the above framework, substitution and equality are really the same. For example, if you are substituting (a) for (b), and it so happens that a == b, then the substitution cost may very well return as zero (but it's up to you!). So really there are only three edit operations: insertion, deletion and substitution, where substituting two equal elements will commonly be defined to return zero for the cost. *) Also notice that in this framework, you don't need to supply a separate equality operator, since that logic is now subsumed by substitution cost functional. *) I can think of three plausible ways to supply the edit operation cost-functionals. The first is of course just three template parameters. This might be slightly inelegant if you want to use defaults for the first two, but alter the 3rd. The second is to supply a single object that provides all three: struct costs { double insertion(const T& a) const { ... } double deletion(const T& a) const { ... } double substitution(const T& a, const T& b) const { ... } } The third is to use a keyword argument scheme. Boost provides this, or at least it did in the past. The one time I tried it, I personally didn't care for it, but that was several years ago and if it's widely used in the community then it should probably be considered. I'd defer to community opinions. *) Edit distance can either just return the distance and nothing else, or it can optionally return the actual sequence of edit operations and op-costs that yielded the distance measure. Returning just the cost allows for more internal efficiencies, but it would probably be useful to supply variations that also return the actual edit sequence.

A few more thoughts:
*) Cost functionals should take parameters, probably something like:
struct substitution_cost { double operator()(const T& a, const T& b) const { // return some distance metric between a and b that represents // cost of substitution. In the classic case, this just returns 1 // if a != b, and zero if a == b. } }
struct insertion_cost { double operator()(const T& a) const { // cost of deleting element (a), classic case is also just 1. } }
......
The third is to use a keyword argument scheme. Boost provides this, or at
least it did in the past. The one time I tried it, I personally didn't care for it, but that was several years ago and if it's widely used in the community then it should probably be considered. I'd defer to community opinions.
Thanks for your further thoughts! I agree with you. *) Edit distance can either just return the distance and nothing else, or
it can optionally return the actual sequence of edit operations and op-costs that yielded the distance measure. Returning just the cost allows for more internal efficiencies but it would probably be useful to supply variations that also return the actual edit sequence.
Seems interesting! But it may be hard to design the interface? I have no idea on this.. By the way, I have posted my proposal for this idea in this mail list, would you like to read it? Thanks!
participants (3)
-
Erik Erlandson
-
Erik Erlandson
-
Yu Jiang