Hi Jorge,
Can you please provide some ideas about the use-cases? I seem to get your
point, but I cannot visualize its usefulness in the construction of compilers
and similar things. Also, should generic tokenizers not take regex patterns?
It would be great if you could elaborate in greater detail.
Best Wishes
Ganesh Prasad
On 23 April 2015 at 16:00, Jorge Cardoso Leitão wrote:
Dear boost devs,
I'm writing here because I've been coding a tokenizer that I believe, given its generality, could be an addition to Boost. I'm asking for a judgement on whether the idea has a place in Boost or not.
The interface I'm proposing is a function of two arguments, a string (call it text) and a set of strings (call it key-terms), that returns a vector<string> (call it result) which fulfils 4 constraints:
1. a string in the result can only be either a key-term or a string between two key-terms;
2. the concatenation of the result is always the original text;
3. a key-term containing other key-terms has priority over the latter;
4. a key-term overlapping another has priority based on its position in the text.
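In C++ terms, the signature would be along these lines (only a sketch; the exact containers and types are open to discussion):

    #include <set>
    #include <string>
    #include <vector>

    // Proposed interface: split `text` into tokens, treating every element of
    // `key_terms` as an indivisible token, subject to constraints 1-4 above.
    std::vector<std::string> tokenize(std::string const& text,
                                      std::set<std::string> const& key_terms);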
A tokenizer that divides a string by delimiters is a special case of this interface, where the key-terms are the delimiters. This is what Boost.Tokenizer covers, where the key-terms are *non-overlapping*. The critical addition here is the ability to deal with *overlapping key-terms*.
A common use case of overlapping key-terms arises when you have terms that you want to treat as single tokens, but they overlap with common delimiters. A practical example:
tokenize a string (with words separated by spaces) and guarantee that both `"United States of America"` and `"United States of Brazil"` are interpreted as single tokens.
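With both country names and the single space as key-terms, the behaviour these constraints call for would be (in the pseudo-code notation used in the examples further below; the input sentence is just an illustration):

    tokenize("visit the United States of Brazil",
             {" ", "United States of America", "United States of Brazil"})
        --> {"visit", " ", "the", " ", "United States of Brazil"}

since, by constraint 3, the full country name has priority over the spaces it contains.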
The non-triviality comes from the fact that such a feat requires keeping track of which key-terms are still compatible with the sub-string read so far, and knowing how to backtrack when a match fails (e.g. "United States of " is a common prefix of both terms above, but as soon as the next letter appears, one or both can be discarded as potential matches).
Some examples in pseudo-code (note how they fulfil constraints 1-4):

    tokenize("the end", {}) --> {"the end"}
    tokenize("the end", {" "}) --> {"the", " ", "end"}
    tokenize("foo-bar", {"foo", "foo-bar"}) --> {"foo-bar"}
    tokenize("the end", {"the e", " ", "end"}) --> {"the e", "nd"}
    tokenize("foo-tars ds", {"foo", "foo-bar", "-tar"}) --> {"foo", "-tar", "s ds"}
As a proof of concept, I've implemented this interface and the respective test cases, which you can find at https://github.com/jorgecarleitao/tokenize. Any change is possible to accommodate Boost standards: it can be generalized to arbitrary sequences and arbitrary types, made to work with iterators, documented, better tested, etc.
But before anything else, I would like to ask for an opinion on whether this is sufficiently general and useful to be considered for Boost.
Thank you for your time,
Best regards,
Jorge