[Spirit] Looking for a little Qi guidance for Unicode parsing
Hello,
I am turning a corner in my JSON parser. It supports ASCII through and
through, but now I want to support Unicode, apparently UTF-8, as
required by the JSON standard. From what I can tell, this does not
affect the entire grammar, just Strings.
Looking for a little guidance on how to approach that issue, the
elements involved, etc. Such as, are we talking about C++
std::wstring? I have also seen std::u32string referenced in some
forums.
To begin with, and this may be a somewhat naive impression, would the
characters translate not to unsigned char or char, but rather to
std::wstring::value_type or std::u32string::value_type? Things like
that come to mind when approaching the issue.
Additionally, how should I handle symbol tables such as the one for
escape characters, e.g. starting from:
struct escapes_t : qi::symbols
On Sun, Jan 27, 2019 at 11:05 Michael Powell via Boost-users <boost-users@lists.boost.org> wrote:
// Template arguments assumed to be <char, char>; they are missing from the archived message.
struct escapes_t : qi::symbols<char, char>
{
    escapes_t()
    {
        this->add
            ("\\b", '\b')
            ("\\f", '\f')
            ("\\n", '\n')
            ("\\r", '\r')
            ("\\t", '\t')
            ("\\v", '\v')
            ("\\\\", '\\')
            ("\\/", '/')
            ("\\'", '\'')
            ("\\\"", '"')
        ;
    }
} char_esc;
And on from there.
Thanks!
Best regards,
Michael W Powell
_______________________________________________
Boost-users mailing list
Boost-users@lists.boost.org
https://lists.boost.org/mailman/listinfo.cgi/boost-users
The answer to your question is a bit more complicated than you might expect. In short, std::string is capable of representing Unicode text, because there is a difference between the binary representation (bits and bytes) and the meaning (code points): a std::string can simply hold the UTF-8 encoded bytes. It would probably be illuminating for you to watch a talk called “Unicode in C++” by James McNellis (https://m.youtube.com/watch?v=tOHnXt3Ycfo).
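To make the bytes-versus-code-points distinction concrete, here is a minimal sketch in plain C++ with no Spirit involved. The function names (append_utf8, combine_surrogates, count_code_points) are hypothetical helpers invented for illustration, not part of Boost or of the parser discussed above. It assumes the parser has already turned each \uXXXX escape into a number and only needs to store the result in a std::string as UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append one Unicode code point to a UTF-8 encoded std::string.
// Returns false for values that are not Unicode scalar values
// (surrogates, or anything above U+10FFFF) and appends nothing.
inline bool append_utf8(std::string& out, char32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
    if (cp > 0x10FFFF) return false;

    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

// Combine a UTF-16 surrogate pair, as written by two consecutive
// \uXXXX escapes in JSON, into a single code point.
inline char32_t combine_surrogates(char32_t hi, char32_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);  // high (leading) surrogate
    assert(lo >= 0xDC00 && lo <= 0xDFFF);  // low (trailing) surrogate
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

// Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes the input is already well-formed UTF-8.
inline std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}
```

With helpers along these lines, the single-character escapes in the symbol table can stay as char values, since they are all ASCII and ASCII is unchanged in UTF-8. Only the \uXXXX form needs extra handling. Note also that JSON itself (RFC 8259) defines only the escapes \" \\ \/ \b \f \n \r \t and \uXXXX, so \v and \' in the table above are extensions.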
-- Travis Göckel +1.720.234.9330
participants (2)
- Michael Powell
- Travis Gockel