Re: [boost] [gsoc 2013] Approximate string matching

12 Apr 2013

      Dne 12.4.2013 19:50, Michael Marcin napsal(a):
...
On 4/12/13 4:52 AM, Jan Strnad wrote:
...
Hi, I'm Jan Strnad, currently a Master's student at CTU in Prague, Czech
Republic. This summer, I would like to participate in GSOC and I believe
that approximate string matching is a good match for me, since my
background i this area is quite strong.
Now more specifically. I would like to implement approximate string
matching algorithm(s) based on the NFA. I would like to support various
approximate distances, such as Hamming distance [1], Levenstein distance
[2], Damerau-Levenstein distance [3], Delta distance, Gamma distance and
finally (delta, gamma) distance. Sorry no explanation of the last three
found on the Web, but I can provide one if interested.
In the matter of NFA implementation, I would probably choose dynamic
programming approach, but I'm thinking of shift-or algorithm as well.
Can I ask you for any kind of thoughts or feedback regarding this topic?
Best regards,
Jan Strnad
[1] http://en.wikipedia.org/wiki/Hamming_distance
[2] http://en.wikipedia.org/wiki/Levenstein_distance
[3] http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
Sounds very useful!
I guess this would fit in the string algorithm library?
That seems to me reasonable as well.
...
What would the interface look like?
Is it as simple as:
namespace boost {
namespace algorithm {
template< typename Range1T, typename Range2T >
std::size_t levenstein_distance( Range1T first, Range2T second );
}
}
Sure, that would be possible, but I was originally aiming at something 
little bit different. Basically, I want to design an algorithm that 
would find all (or just the first) occurrences of a given pattern with 
given distance in some string.

So to your question, simple interface could look like this:
namespace boost
{
   namespace algorithm
   {
   	template <typename T>
   	class approximate_distance;  	

     template <typename TRange>  	
     find_iterator<TRange> approximate_find
	(TRange text,
          approximate_distance<TRange>& model);
   }
}

And a simple usage:

string pattern = "acbdef";
approximate_distance<string>* model =
   new hamming_distance<string>(pattern, 1);
find_iterator<string> result = approximate_find("aacdefbcdefabccef");
while(result())
{
   cout << * result << endl;
   ++result;
}
// prints aacdef, fbcdef and abccef