tokenizer for overlapping delimiters?
Dear boost devs,

I'm writing here because I've been coding a tokenizer that I believe, given its generality, could be an addition to Boost. I'm asking for a judgement on whether the idea has room in Boost or not.

The interface I'm proposing is a function of two arguments, a string (call it text) and a set of strings (call it key-terms), that returns a vector<string> (call it result) which fulfils 4 constraints:

1. a string in the result can only be either a key-term or a string between two key-terms;
2. the concatenation of the result is always the original text;
3. a key-term containing other key-terms has priority over the latter;
4. a key-term overlapping another has priority based on its position in the text.

A tokenizer that divides a string by delimiters is a subset of this interface whose key-terms are the delimiters. This case is covered by Boost.Tokenizer, where the key-terms are non-overlapping. The critical addition here is the ability to deal with overlapping key-terms.

A common use case of overlapping key-terms is when you have key-terms that you want to treat as single tokens but that overlap with common delimiters. A practical example: tokenize a string (with words separated by spaces) and guarantee that both "United States of America" and "United States of Brazil" are interpreted as single tokens.

The non-triviality appears because such a feat requires storing which key-terms are consuming a previous sub-string, and knowing how to backtrack in case the match fails (e.g. "United States of " is common to both terms above, but once the next letter appears, one or both can be discarded as potential matches).

Some examples in pseudo-code (note how they fulfil constraints 1-4):

tokenize("the end", {}) --> {"the end"}
tokenize("the end", {" "}) --> {"the", " ", "end"}
tokenize("foo-bar", {"foo", "foo-bar"}) --> {"foo-bar"}
tokenize("the end", {"the e", " ", "end"}) --> {"the e", "nd"}
tokenize("foo-tars ds", {"foo", "foo-bar", "-tar"}) --> {"foo", "-tar", "s ds"}

As a proof of concept, I've implemented this interface and respective test cases, which you can find at https://github.com/jorgecarleitao/tokenize. Any change is possible to accommodate Boost standards: this can be generalized to arbitrary sequences, to arbitrary types, to use iterators, be documented, better tested, etc.

But before anything else, I would like to ask for an opinion on whether this is sufficiently general and useful to be considered for Boost.

Thank you for your time,
Best regards,
Jorge
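For concreteness, here is a minimal sketch of the proposed signature (illustrative only, not the implementation in the repository: instead of tracking partial matches and backtracking, it simply rescans each position for the longest key-term):

#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Naive reference version of the proposed interface. At each position it
// prefers the longest key-term matching there (constraint 3) and commits to
// earlier matches over later overlapping ones (constraint 4); everything
// between matches becomes an in-between token (constraints 1 and 2).
std::vector<std::string> tokenize(const std::string& text,
                                  const std::set<std::string>& key_terms)
{
    std::vector<std::string> result;
    std::string pending;           // text accumulated between key-terms
    std::size_t i = 0;
    while (i < text.size()) {
        std::string best;          // longest key-term starting at i, if any
        for (const std::string& k : key_terms)
            if (k.size() > best.size() && text.compare(i, k.size(), k) == 0)
                best = k;
        if (best.empty()) {
            pending += text[i++];  // no key-term starts here; keep accumulating
        } else {
            if (!pending.empty()) { result.push_back(pending); pending.clear(); }
            result.push_back(best);
            i += best.size();
        }
    }
    if (!pending.empty()) result.push_back(pending);
    return result;
}

This sketch reproduces the five pseudo-code examples above; a real implementation would avoid the per-position rescan by matching incrementally, which is where the bookkeeping described in the "non-triviality" paragraph comes in.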
Hi Jorge,

Can you please provide some ideas about the use cases? I seem to get your point, but I cannot visualize its usefulness in the construction of compilers and similar things. Also, shouldn't generic tokenizers take regex patterns?

It would be great if you could elaborate in greater detail.

Best wishes,
Ganesh Prasad
Hi Ganesh,
I have two main use cases in mind:
1. Text interpretation, which is what I've been using it for. The motivation is that some composite words define terms that the user may want to use as tokens. At some level of the interpretation, the analyser must treat composite words as single tokens, which is exactly what this interface does.

2. Search indexes: traditionally, search indexes (e.g. Xapian, Sphinx) consider single words to be independent tokens and store their relative positions in the text to allow exact matches of composite words (e.g. they allow queries like "the PHRASE end" to say that the query wants the two words adjacent). Such a query is naturally less efficient than having the composite word indexed: firstly because it requires storing the positional information, and secondly because it requires more than a single lookup in the inverted index. With this interface, composite words are indexed without indexing their individual words, which is useful when the individual words are not informative for a search but the composite word is. (See the sketch after this list.)
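A hypothetical illustration of the difference for use case 2 (document ids, contents and container choices are invented for this example; they are not taken from Xapian or Sphinx):

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

int main()
{
    // Word-level tokens: the index must keep (document, position) pairs so a
    // phrase query such as "United PHRASE States PHRASE of PHRASE America"
    // can verify adjacency after several lookups.
    std::map<std::string, std::vector<std::pair<int, int>>> positional_index = {
        {"United",  {{1, 0}}},
        {"States",  {{1, 1}}},
        {"of",      {{1, 2}}},
        {"America", {{1, 3}}}};

    // Composite key-term as a single token: one lookup and no positions.
    std::map<std::string, std::set<int>> composite_index = {
        {"United States of America", {1}}};

    for (int doc : composite_index["United States of America"])
        std::cout << "match in document " << doc << '\n';
    (void)positional_index;  // kept only to contrast what must be stored
}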
I agree that regexes are relevant here. I'm not sure regexes can fulfil the 4 constraints above, as there may be no single way to define what the best match is when two key-terms are greedy regexes. E.g., should tokenize("foo bar", {"fo. ba", "fo.*"}) return {"foo ba", "r"} or {"foo bar"}? I could try to formalize the constraints and the algorithm to include regexes.
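To make that ambiguity concrete, here is a small, self-contained std::regex illustration (independent of the proposal) showing that each greedy pattern claims a different extent of the same text:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    const std::string text = "foo bar";
    std::smatch m;

    std::regex_search(text, m, std::regex("fo. ba"));
    std::cout << '"' << m.str() << "\"\n";  // "foo ba"  -> would leave "r"

    std::regex_search(text, m, std::regex("fo.*"));
    std::cout << '"' << m.str() << "\"\n";  // "foo bar" -> consumes everything
}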
Regards,
Jorge