Aho-Corasick String Search (for searching multiple patterns) Interface
Hi,
In some of my previous mails I suggested that the Rabin-Karp and
Aho-Corasick string search algorithms be implemented as part of the Boost
algorithm library, primarily because these algorithms perform well when
searching for multiple patterns.
The Aho-Corasick algorithm works by building a trie (a tree in which each
node corresponds to a character) of the pattern strings and traversing the
trie to search for the patterns in a given text. Additionally, Aho-Corasick
introduces the concept of a "failure pointer/failure node", which is the
node to move to when there is a mismatch, so the search can continue
without rescanning the text.
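To make the trie and failure-node idea concrete, here is a minimal sketch.
It is only an illustration, not the proposed Boost interface: the names
Node, AhoCorasickSketch and search are hypothetical, it works on
std::string only, and it reports matches as (start offset, pattern index)
pairs rather than iterators.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; a sketch, not the proposed interface.
struct Node {
    std::map<char, int> next;  // child nodes, keyed by character
    int fail = 0;              // failure node: where to go on a mismatch
    std::vector<int> out;      // indices of patterns ending at this node
};

class AhoCorasickSketch {
public:
    explicit AhoCorasickSketch(const std::vector<std::string>& patterns)
        : patterns_(patterns), nodes_(1) {  // node 0 is the root
        // Step 1: insert every non-empty pattern into the trie.
        for (std::size_t i = 0; i < patterns_.size(); ++i) {
            if (patterns_[i].empty()) continue;
            int cur = 0;
            for (char c : patterns_[i]) {
                auto it = nodes_[cur].next.find(c);
                if (it == nodes_[cur].next.end()) {
                    int child = static_cast<int>(nodes_.size());
                    nodes_[cur].next[c] = child;  // link before growing
                    nodes_.emplace_back();
                    cur = child;
                } else {
                    cur = it->second;
                }
            }
            nodes_[cur].out.push_back(static_cast<int>(i));
        }
        // Step 2: compute failure nodes breadth-first, so that a node's
        // failure node (always shallower) is finished before the node.
        std::queue<int> q;
        for (const auto& kv : nodes_[0].next) q.push(kv.second);
        while (!q.empty()) {
            int u = q.front();
            q.pop();
            for (const auto& kv : nodes_[u].next) {
                char c = kv.first;
                int v = kv.second;
                int f = nodes_[u].fail;
                while (f != 0 && nodes_[f].next.count(c) == 0)
                    f = nodes_[f].fail;
                auto it = nodes_[f].next.find(c);
                nodes_[v].fail = (it != nodes_[f].next.end()) ? it->second : 0;
                // Patterns ending at the failure node also end here.
                const std::vector<int>& inherited = nodes_[nodes_[v].fail].out;
                nodes_[v].out.insert(nodes_[v].out.end(),
                                     inherited.begin(), inherited.end());
                q.push(v);
            }
        }
    }

    // Returns (start offset in text, pattern index) for every occurrence.
    std::vector<std::pair<std::size_t, int>> search(const std::string& text) const {
        std::vector<std::pair<std::size_t, int>> hits;
        int cur = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            char c = text[i];
            // On a mismatch, follow failure nodes instead of restarting.
            while (cur != 0 && nodes_[cur].next.count(c) == 0)
                cur = nodes_[cur].fail;
            auto it = nodes_[cur].next.find(c);
            if (it != nodes_[cur].next.end()) cur = it->second;
            for (int p : nodes_[cur].out)
                hits.emplace_back(i + 1 - patterns_[p].size(), p);
        }
        return hits;
    }

private:
    std::vector<std::string> patterns_;
    std::vector<Node> nodes_;
};
```

The text is scanned once, regardless of how many patterns there are. A
std::map per node keeps the sketch short; a real implementation would
probably pick a more compact child table.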
For more information:
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
The interface I propose is as follows.
1. The patterns should be passed in as strings within a *container*,
because these algorithms are primarily used for multiple-pattern
searching. The user can pass pointers/iterators to the first and last
pattern. The final output returned is a container holding each pattern
together with iterators/pointers to its occurrences (there are multiple
patterns, so just returning an iterator/pointer to where a pattern first
occurs in the text will not suffice).
2. Just like the string searching algorithms already implemented in
Boost, there can be an object-based interface as well as a
procedure-based interface. In the object-based interface, the trie is
constructed and the failure nodes are computed in the constructor (when
the object is created). The user can then search the text using
operator().
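As a rough picture of the input and output shapes described in point 1,
here is a sketch of what a procedure-based call might look like. The names
match_map and find_all_patterns are hypothetical, and the body is a
placeholder that runs one std::search pass per pattern; it is NOT
Aho-Corasick and is only there to make the return type concrete.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical result type: each pattern mapped to iterators pointing at
// the start of every occurrence of that pattern in the text.
template <typename TextIter>
using match_map = std::map<std::string, std::vector<TextIter>>;

// Hypothetical procedure-based interface. Patterns are assumed to be
// std::string elements in the range [pat_first, pat_last).
template <typename PatIter, typename TextIter>
match_map<TextIter> find_all_patterns(PatIter pat_first, PatIter pat_last,
                                      TextIter text_first, TextIter text_last) {
    match_map<TextIter> result;
    for (PatIter p = pat_first; p != pat_last; ++p) {
        // Skip empty and duplicate patterns.
        if (p->empty() || result.count(*p) != 0) continue;
        std::vector<TextIter>& hits = result[*p];
        TextIter pos = text_first;
        while (true) {
            // Placeholder: naive per-pattern scan, not Aho-Corasick.
            pos = std::search(pos, text_last, p->begin(), p->end());
            if (pos == text_last) break;
            hits.push_back(pos);
            ++pos;  // step forward so overlapping occurrences are found
        }
    }
    return result;
}
```

A caller would pass pats.begin(), pats.end(), text.begin(), text.end() and
then look up each pattern in the returned map. Patterns that never occur
still get an entry, with an empty list of occurrences.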
template <typename patIter>
class aho_corasick {
public:
aho_corasick(patIter first, patIter last): pat_first(first),
pat_last(last), head(std::shared_ptr
meghana madhyastha