Hi Jorge,
Can you please provide some ideas about the use-cases? I seem to get your
point, but I cannot visualize its usefulness in the construction of compilers
and similar things. Also, should generic tokenizers not take regex patterns?
It would be great if you could elaborate in greater detail.
Best Wishes
Ganesh Prasad
On 23 April 2015 at 16:00, Jorge Cardoso Leitão wrote:
Dear boost devs,
I'm writing here because I've been coding a tokenizer that I believe, given its generality, could be an addition to Boost. I'm asking for a judgement on whether the idea has a place in Boost or not.
The interface I'm proposing is a function of two arguments, a string (call it text) and a set of strings (call it key-terms), that returns a vector<string> (call it result) which fulfils 4 constraints:
1. a string in the result can only be either a key-term or a string between two key-terms;
2. the concatenation of the result is always the original text;
3. a key-term containing other key-terms has priority over the latter;
4. a key-term overlapping another has priority based on its position in the text.
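In C++ terms, the signature would be along these lines (only a sketch; the exact containers and types are open to discussion):

    #include <set>
    #include <string>
    #include <vector>

    // Proposed interface: split `text` into tokens, treating every element of
    // `key_terms` as an indivisible token, subject to constraints 1-4 above.
    std::vector<std::string> tokenize(std::string const& text,
                                      std::set<std::string> const& key_terms);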
A tokenizer that divides a string by delimiters is a special case of this interface, where the key-terms are the delimiters. This is what Boost.Tokenizer covers, where the key-terms are *non-overlapping*. The critical addition here is the ability to deal with *overlapping key-terms*.
A common use case of overlapping key-terms arises when you have terms that you want to treat as single tokens, but they overlap with common delimiters. A practical example:
tokenize a string (with words separated by spaces) and guarantee that both `"United States of America"` and `"United States of Brazil"` are interpreted as single tokens.
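With both country names and the single space as key-terms, the behaviour these constraints call for would be (in the pseudo-code notation used in the examples further below; the input sentence is just an illustration):

    tokenize("visit the United States of Brazil",
             {" ", "United States of America", "United States of Brazil"})
        --> {"visit", " ", "the", " ", "United States of Brazil"}

since, by constraint 3, the full country name has priority over the spaces it contains.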
The non-triviality comes from the fact that such a feat requires keeping track of which key-terms are still compatible with the sub-string read so far, and knowing how to backtrack when a match fails (e.g. "United States of " is a common prefix of both terms above, but as soon as the next letter appears, one or both can be discarded as potential matches).
Some examples in pseudo-code (note how they fulfil constraints 1-4):

    tokenize("the end", {}) --> {"the end"}
    tokenize("the end", {" "}) --> {"the", " ", "end"}
    tokenize("foo-bar", {"foo", "foo-bar"}) --> {"foo-bar"}
    tokenize("the end", {"the e", " ", "end"}) --> {"the e", "nd"}
    tokenize("foo-tars ds", {"foo", "foo-bar", "-tar"}) --> {"foo", "-tar", "s ds"}
As a proof of concept, I've implemented this interface and the respective test cases, which you can find at https://github.com/jorgecarleitao/tokenize. Any change is possible to accommodate Boost standards: it can be generalized to arbitrary sequences and arbitrary types, made to work with iterators, documented, better tested, etc.
But before anything else, I would like to ask for an opinion on whether this is sufficiently general and useful to be considered for Boost.
Thank you for your time,
Best regards,
Jorge