On 9/23/18 6:37 PM, Zach Laine via Boost wrote:
On Sun, Sep 23, 2018 at 4:57 AM Andrey Semashev via Boost <boost@lists.boost.org> wrote:
On 9/23/18 7:45 AM, Zach Laine via Boost wrote:
I think a Unicode library is very much needed in Boost.
Out of curiosity, it looks like you implemented Unicode algorithms yourself. Why not use a specialized library, like ICU?
It's partly a question of the size of ICU, which is several megabytes, whereas Boost.Text is only 1.2-2MB depending on your compiler.
I built HEAD of ICU just now, and here are the resulting .so's:
-rwxrwxr-x 1 tzlaine tzlaine  26M Sep 23 10:29 ./lib/libicudata.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 3.6M Sep 23 10:28 ./lib/libicui18n.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine  65K Sep 23 10:28 ./lib/libicuio.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine  66K Sep 23 10:28 ./lib/libiculx.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 234K Sep 23 10:28 ./lib/libicutu.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 2.2M Sep 23 10:28 ./lib/libicuuc.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 5.3K Sep 23 10:28 ./stubdata/libicudata.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine  83K Sep 23 10:28 ./tools/ctestfw/libicutest.so.62.1
So, I don't know how many of those you need, but if you require data (and you do!), 26MB is a lot. Note that I put collation data into headers, so your runtime memory footprint might be much larger than 1.2-2MB, but the minimum requirement is still only that small. Requiring the user to pay more than this minimum is a classic "Don't pay for what you don't use" violation.
Runtime memory footprint is actually more important. If I have 10 processes running on the machine that use ICU, then I'm only paying its price once, while in your case I would be paying it 10 times. Given that ICU is rather well adopted, this is not an unrealistic benefit. So, if you are not using ICU, you may want to consider whether at least some of the runtime data can be put in constant sections of a shared library.
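A minimal sketch of the suggestion above, assuming an ELF platform such as Linux: a `const`, constant-initialized table that contains no pointers typically lands in a read-only section (`.rodata`), and the read-only pages of a shared library are file-backed, so processes that load the same library can share one physical copy. The `sketch` namespace, the `weight_entry` layout, and the table values are all made up for illustration; this is not the data format of Boost.Text or ICU.

```cpp
// collation_data.cpp: hypothetical translation unit built into a shared library.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

namespace sketch {

// Plain integers only: no pointers, so no load-time relocations are needed
// and the table can stay in a read-only, shareable section.
struct weight_entry {
    std::uint32_t code_point;
    std::uint16_t primary_weight;
};

// Made-up values for illustration, sorted by code_point.
const weight_entry weight_table[] = {
    {0x0041, 100},
    {0x0042, 101},
    {0x0043, 102},
    {0x00E9, 140},
};

// Returns the primary weight for cp, or 0 if cp is not in the table.
std::uint16_t primary_weight(std::uint32_t cp)
{
    auto const first = std::begin(weight_table);
    auto const last = std::end(weight_table);

    // Binary search below relies on the table being sorted.
    assert(std::is_sorted(first, last,
        [](weight_entry const & a, weight_entry const & b) {
            return a.code_point < b.code_point;
        }));

    auto const it = std::lower_bound(first, last, cp,
        [](weight_entry const & e, std::uint32_t value) {
            return e.code_point < value;
        });
    return (it != last && it->code_point == cp) ? it->primary_weight : 0;
}

} // namespace sketch
```

The same table defined in a header and compiled into each executable would still be read-only, but each distinct executable would carry its own copy; only processes mapping the same file share the pages.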
ICU also has, IMO, a poor (too complicated and confusing) API; there are way too many types and functions, and the types that are emphasized are often the wrong ones, like UTF-16 strings. The algorithms should be C++-style algorithms if this is something we're going to standardize.
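To illustrate what "C++-style algorithms" could mean here, below is a rough sketch of a UTF-8 to UTF-32 transcoder written in the shape of a standard algorithm: it takes an iterator pair plus an output iterator and is independent of any particular string type. The name `utf8_to_utf32` and the `sketch` namespace are hypothetical and are not taken from Boost.Text or ICU. Error handling is simplified: each malformed sequence is replaced with U+FFFD, which does not exactly follow the replacement practice recommended by the Unicode Standard.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

namespace sketch {

constexpr char32_t replacement = 0xFFFD;

// Decodes UTF-8 code units in [first, last) and writes code points to out.
// Returns the advanced output iterator, like std::copy does.
template <typename InputIt, typename OutputIt>
OutputIt utf8_to_utf32(InputIt first, InputIt last, OutputIt out)
{
    while (first != last) {
        auto const lead = static_cast<unsigned char>(*first);
        ++first;

        char32_t cp = 0;
        int trailing = 0;   // number of continuation bytes expected
        char32_t min = 0;   // smallest value allowed (rejects overlong forms)

        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trailing = 1; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trailing = 2; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trailing = 3; min = 0x10000; }
        else { *out++ = replacement; continue; }  // stray continuation or invalid lead

        bool ok = true;
        for (; trailing > 0; --trailing) {
            if (first == last) { ok = false; break; }           // truncated sequence
            auto const b = static_cast<unsigned char>(*first);
            if ((b & 0xC0) != 0x80) { ok = false; break; }      // leave b for the next pass
            cp = (cp << 6) | (b & 0x3F);
            ++first;
        }

        // Reject overlong encodings, out-of-range values, and surrogates.
        if (!ok || cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = replacement;

        // Invariant: only Unicode scalar values are ever written.
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        *out++ = cp;
    }
    return out;
}

} // namespace sketch
```

Because the interface is just iterators, the same function works on a `std::string`, a `std::vector<char>`, or a raw buffer, and the output can go to `std::back_inserter` of any container of `char32_t`.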
Its API could be wrapped inside your library so that users never have to interface with it directly. Nevertheless, thanks for the answer, and I still think a Unicode library like yours is very much needed.