On Sat, Apr 27, 2013 at 3:00 PM, Yu Jiang
2013/4/27 Erik Erlandson
Here is a prototype for the interface: Part A. basic similarity metrics (use edit distance as an example) (1) int edit_distance(const Range1T&, const Range2T&); // return edit distance. (2) int edit_distance(const Range1T&, const Range2T&, int threshold); // return edit distance if it is no larger than threshold, else return -1 or threshold + 1. (3) int edit_distance(const Range1T&, const Range2T&, equal_func); // use uqual_func to determine whether two elements are equal or not.
For these first three, I'd typically expect them to return an unsigned integer. The ideal type might be the size_type of implied sequence containers, if it can be acquired, since the maximum possible distance
is a
function of the maximum supported sequence size.
Thanks, you're right. For edit distance, return size_type is really a better choice. For some other similarity metrics whose result is a real number in [0,1], I will choose double as the return type.
For those who use STL distance is signed integer. If you choose unsigned integer, then I suggest reconsider function names to avoid potential confusion for users. For example, is it "deviation" that you name "distance"?
class searcher { void insert(IDType id, const string &word); // insert a word to the index, id is used as an identifier of this word. void erase(IDType id); // erase a word from the index, id should be
Part B. string similarity search and join algorithms, focus on edit distance the
same one used in insertion process. vector<Result> search(const string &query, int threshold); // performing similarity search, which will return all the strings which are similar to the query string. Result may be a pair of the id and real edit distance. };
This searcher class has the feel of some kind of container class, and maybe ought to be designed with an eye toward satisfying one of the standard container interfaces.
Thanks, I will consider it carefully. Currently I'm not sure about whether I can support one of the standard container interfaces. Probably I can. With the candidate strings are inserted and removed, I should update the index for them.
IMO, container should be parameterized to allow users choosing the best performing container. Strings based on augmented data structures offer a wide set of very efficient search and update operations, including logarithmic time splice and split. They can give significant performance improvement against strings based on basic data structures, such as std::string, std::vector<char>, std::list<char> etc. Regards, Vadim Stadnik