2013/4/27 Vadim Stadnik
On Sat, Apr 27, 2013 at 3:00 PM, Yu Jiang
wrote: 2013/4/27 Erik Erlandson
Here is a prototype for the interface: Part A. basic similarity metrics (use edit distance as an example) (1) int edit_distance(const Range1T&, const Range2T&); // return edit distance. (2) int edit_distance(const Range1T&, const Range2T&, int threshold); // return edit distance if it is no larger than threshold, else return -1 or threshold + 1. (3) int edit_distance(const Range1T&, const Range2T&, equal_func); // use uqual_func to determine whether two elements are equal or not.
For these first three, I'd typically expect them to return an unsigned integer. The ideal type might be the size_type of implied sequence containers, if it can be acquired, since the maximum possible distance
is a
function of the maximum supported sequence size.
Thanks, you're right. For edit distance, return size_type is really a better choice. For some other similarity metrics whose result is a real number in [0,1], I will choose double as the return type.
For those who use STL distance is signed integer. If you choose unsigned integer, then I suggest reconsider function names to avoid potential confusion for users. For example, is it "deviation" that you name "distance"?
Thanks! I would prefer not to change the function name since in such area "edit distance" is used everywhere so user may be familiar with this word. IMO, STL distance returns signed integer because the order of the two parameters matters. For this edit distance function, usually the order of the two strings may not affect the result unless in (4). So I may prefer to return a unsigned integer.
class searcher { void insert(IDType id, const string &word); // insert a word to the index, id is used as an identifier of this word. void erase(IDType id); // erase a word from the index, id should be
Part B. string similarity search and join algorithms, focus on edit distance the
same one used in insertion process. vector<Result> search(const string &query, int threshold); // performing similarity search, which will return all the strings which are similar to the query string. Result may be a pair of the id and real edit distance. };
This searcher class has the feel of some kind of container class, and maybe ought to be designed with an eye toward satisfying one of the standard container interfaces.
Thanks, I will consider it carefully. Currently I'm not sure about whether I can support one of the standard container interfaces. Probably I can. With the candidate strings are inserted and removed, I should update the index for them.
IMO, container should be parameterized to allow users choosing the best performing container. Strings based on augmented data structures offer a wide set of very efficient search and update operations, including logarithmic time splice and split. They can give significant performance improvement against strings based on basic data structures, such as std::string, std::vector<char>, std::list<char> etc.
Sorry for this. It seems I didn't describe my idea clearly. I'd like to do something like this: S is a large container of strings and when the user performing searching, we can return all the strings in S which are similar to the given string(maybe the edit distance between them is less than 2). So here I need to build a index first. In fact I want to supply some functions to manipulate the index, since the user may want to change the strings in S slightly. So maybe I'll make the "searcher" class look like a container, but there isn't a real container in it. Thanks.