Re: [boost] Boost MySQL Review Starts Today

15 May 2022

      .

On Fri, 13 May 2022 at 04:25, Alan de Freitas via Boost
<boost@lists.boost.org> wrote:
...
Hi! Here's my review.
Hi Alan, thank you very much for taking your time to write your review.
...
Boost.MySql is an impressive library. I would like to thank Rubén for that.
It implements the complete protocol from the ground up with async
functionalities and certainly involved a lot of work. The documentation is
extensive, goal-oriented, and helpful. The async operations support
cancellation and the examples with coroutines are beautiful. During this
review, I was impressed at how easy it works and I've noticed a few points
in the API that have also been brought up in previous reviews and could be
improved. Some other issues could also be better highlighted in the
documentation, which would avoid many problems. The overall recommendation
in this review is acceptance conditional on some fixes.
Thanks a lot.
...
Others have questioned the benefit of the library when compared to sqlpp11
or any wrapper around the C API. The main difference is other libraries are
high-level but this is a discussion still worth having from the point of
view of users. I was thinking about the transition cost from Boost.MySQL to
any other SQL database, since many applications have the
requirement/necessity of allowing different SQL databases. In sqlpp11, the
user can just change the backend. The user could use Boost.MySQL as an
sqlpp11 backend and that would have the same effect. However, I think Rubén
mentioned this is not possible at some point. I'm not sure whether this is
just for the async functionality. In the same line, I wonder if a library
for Postgres or Sqlite would be possible with a similar API, which could
also solve the problem, although I'm not sure anyone would be willing to
implement that. If we did, we could have the convenience of sqlpp11 and the
ASIO async functionalities of Boost.Mysql for other DBs.
It is indeed possible. My point was that sqlpp11 does not provide an async
interface. You can surely write a sqlpp11 backend for MySQL using
the proposed Boost.MySQL library. I haven't done it, but having a quick
look at the interface, it should just work. However, you won't get the most
of Boost.MySQL, because you will be using just sync code.

It is possible to write a Postgres connector with a similar approach.
I indeed wrote a POC available here:
https://github.com/anarthal/postgres-asio
The protocols are different, though, so if you follow my approach
of providing primitives easy to use, yet close to the protocol,
the APIs will be a little different. The Postgres protocol allows
finer grain control over statement execution, AFAIR.

I don't know much about SQLite - but being just a library
and not a protocol, the connector would have to wrap libsqlite.
I can't tell you whether exposing an Asio interface for it
is possible or not.
...
The library really provides ease-of-use, when we consider what it provides
and how low-level it is. However, unlike in other libraries like
Boost.Beast, Boost.MySql users might not be sold into the Asio way of doing
things. Applications that require access to databases might be making
sparse database requests where the Asio asynchronous model is not as
useful. Highlighting these differences in the docs is important. Asio takes
some time to learn, and I guess for a user not used to Asio, already
understanding Asio does not sound like the ease of use. The docs could
focus on the protocol before moving on to the asynchronous functionality.
What actions do you think we can take here? Adding a section in the intro
(https://anarthal.github.io/mysql/mysql/intro.html) about "knowing Asio
is a requirement to use this library"? Do you think an overview on the protocol
(handshake, query, metadata and row packets) would really be useful
for the user at that point?
...
I'm also a little worried about the maintainability of the library and
protocol changes and how this could impact boost as a whole. Should we
really announce it as compatible with MariaDB? What direction would the
library take if they diverge? How often does the protocol change or is
extended? Is Ruben going to stick around if the protocol changes? How hard
would it be for someone to understand the code and implement extensions?
Can a user be sure it's always going to provide the same features and be as
reliable as the C API? I don't have the answer to these questions, but it's
something that got me wondering. I guess this kind of question is going to
come up for any library that is related to a protocol.
The protocol it implements is the "client/server" protocol, which was in
place before the MariaDB fork. From what I know, this is a pretty old
protocol and remains pretty stable. MySQL is separately developing a
new X-protocol`(which this library does *NOT* implement). Docs here:
https://dev.mysql.com/doc/dev/mysql-server/latest/page_mysqlx_protocol.html
MariaDB does not implement it and does not plan to do it. I don't
really think MySQL is going to develop the client/server protocol a lot.
I wouldn't say they will deprecate it in the next years, either, as
there are millions of clients relying on the client/server protocol,
and there aren't many connectors for the X protocol yet (no C
client AFAIK). Does that answer part of your questions?
...
I don't know if the name "MySql" can be used for the library, as it belongs
to Oracle. I'm not saying it can't. I'm really saying I don't know. I'm not
a lawyer and I don't understand the implications here. But this should be
considered, investigated, and evidence should be provided. The library is
also compatible with MariaDB and the name "MySql" might not reflect that.
Maybe there's a small probability it might be compatible with some other
similar DB protocol derived from MySql in the future?
I'm as stuck as you here. What would be the best path here?
For now, I've tried to contact a former user in the Boost ML who works
at Oracle, to see whether he can provide me some guidance here.
Any ideas are welcome.
...
- No two asynchronous MySql query reads can happen concurrently. While this
still has value among other Asio operations, like a server that needs the
DB eventually, the user needs to be careful about that. Maybe it would be
safer if all MySql operations were on some special kind of strand. Or maybe
the library could provide some "mysql::transaction_strand" functionality to
help ensure this invariant for individual queries in the future.
- A second implication is that some applications might find the
asynchronous functionalities in Boost.Mysql not as useful as asynchronous
functionalities in other protocols, like the ones in Boost.Beast. This
depends on how their applications are structured. Since this is the main
advantage over the C API, these users may question the value of the library
and the documentation should discuss this more explicitly.
- These implications could become irrelevant if the library provides some
kind of functionality to enable a non-blocking mode. I have no idea how the
MySql client achieves that.
At the end of the day, I'd say this is similar to HTTP (half-duplex).
The client sends a request and gets a response. I think the docs here
are not doing a great job, and that the interface may be a little limiting
(not allowing the possibility of pipelining). This is something brought up
in other reviews, too, and tracked by
https://github.com/anarthal/mysql/issues/76
and https://github.com/anarthal/mysql/issues/75.
Do you think such pipelining API would mitigate your concerns?
...
I'm worried about the lack of possibility of reusing memory for the
results, as the interface depends on vector. This is not the usual Asio
pattern. These vectors look even weirder in the asynchronous callbacks:
This is also an issue for resultsets, and tracked by
https://github.com/anarthal/mysql/issues/59.
...
- People have their containers/buffers and I would assume reading into some
kind of existing row buffer would be the default interface, as is the case
with other Asio read functions. In other words, read_many and read_all
should work more like read_one.
- Not returning these vectors is the common pattern in Asio: the initiating
function receives a buffer for storage and the callback returns how many
elements were read. Note that the buffer size already delimits how many
elements we should read.
- If we return vectors, async operations would need to instantiate the
vector with the custom allocator for the operation. The callback wouldn't
use std::vector<T> then. I would be std::vector<T,
very_long_allocator_name> for every query callback, which is inconvenient.
Using only std::vector<T> in the callback is not only inconvenient but
dangerous because it would be the wrong allocator and everything would need
to be copied.
Brought up by another user during review, tracked by
https://github.com/anarthal/mysql/issues/58. I think the best thing
here is just passing the vectors by lvalue reference.
...
- One problem I see without the vectors is that each row would still need
to allocate memory internally because of their dynamic size. I don't know
the best solution for that second problem unless we have some sort of
static size row/tuple, which does make sense since the user should know the
number of columns at compile time.
Brought up by everyone, so I will give this high priority. Tracked by
https://github.com/anarthal/mysql/issues/60.
...
The library provides read_one, read_all, and read_many, but no "read_some".
Besides the implication of a buffer to the initiating function, I would
imagine reading as much as possible and parsing that to be the usual best
strategy in this kind of an operation since we have a stream and we don't
know how much data is available to read yet. Unfortunately, read_one,
read_all, and read_many don't allow that. Unless read_many can already read
less than the max number of rows in intermediary calls, which I don't think
is the intention. It might also be useful to replicate the naming
conventions Asio uses to read some or read all of the data.
Not sure if I follow you. What's the interface you propose here?
The connection object already holds an internal buffer for requests.
The library currently performs boost::asio::read call per packet,
but I'm working in optimizing that to use read_some instead.
With that approach, read_one may get the row from the buffer
without performing any read at all. Is that what you mean?
...
If "value" is intended to encapsulate the variant type,
"value::to_variant" seems dangerous. The variant supports lots of types,
which indicates the probability the internal design might need to change to
fit new user demands is high. This could also need to change according to
protocol extensions, and changing any of these types would break the API.
If "value" already represents what it needs to represent,
"value::to_variant" may not need to be exposed unless there's a lot of user
demand for that.
I exposed it to avoid having to implement a visit function. Having
such a visit function would have the same problems. I think
visiting a variant-like is one of the most common things. It make
sense not to expose the function if we end up with an alternate
tuple-like interface, though.
...
There is also an existential value/reference problem with "value":
- The internal variant type includes a string_view whose string is owned by
the row. This seems kind of dangerous or at least misleading. We have a
reference type that's called "value", while this is more like a const
"field view" or something.
- At the same time, it's not a view either because I assume this is
mutable, and updating numbers would not update the row.
- The other values in the variant are value types, which makes "value" both
a value and a reference type depending on the column type.
- A reference type seems appropriate here because we don't want to copy all
the data when iterating the results, but we need some other pattern for
"value". A const view is especially useful because we might be able to keep
value types for numbers, etc... Encapsulating the variant is also helpful
here.
- Calling a type "value" is a bad indication anyway. Any value is a "value".
If we go down the compile-time parsing strategy, together with the buffering
I described above, making it owning may be the best option. Any naming
ideas? field_value? field? sql_value?
...
As others have mentioned, "value::get_std_optional" is unusual. It may not
look problematic at first but it has a number of problems:
First of all, I'd like to know your thoughts about the conversion
functionality in value
(https://anarthal.github.io/mysql/mysql/values.html#mysql.values.conversions).
Do you think it's helpful for the user? If it wasn't because of conversions,
we could replace value::get_optional and value::get_std_optional for
something similar to variant2::get_if, returning a pointer to the
underlying value,
avoiding the problem altogether. This could make even more sense if we make
the value owning, as value::get_optional<std::string> would end up copying
the underlying string, which doesn't make a lot of sense.
...
If I understood the documentation correctly, it's also problematic to use
an empty optional for both a null value and an invalid result. They are
simply not the same thing. Libraries always differentiate an invalid state
from null in a field. This is especially relevant because the documentation
warns us against using the "is_convertible_to" pattern over "get_optional"
because "is_convertible_to" is unnecessarily slow. By the way, it would be
nice to explain why "// WARNING!! Inefficient, do NOT do" is true. I
wouldn't expect that to cost more than creating the optional value.
The distinction can be made with something like:

value v = /* get the value */
if (v.is_null()) {
    // handle the NULL case
} else {
    double d = v.get<double>(); // will throw on type mismatch
}

So maybe what we're lacking is something in the documentation
to address how to handle NULL values effectively. I've raised
https://github.com/anarthal/mysql/issues/77
to address this issue.

Calling value::is_convertible_to, then value::get will perform the conversion
twice, while value::get_optional will do it just once. It isn't
horribly inefficient,
though - maybe this section is causing more confusion than aid.

Considering all this, I'd make the following to the value class:
1) Make it owning.
2) Remove conversions.
3) Remove get_optional and get_std_optional, and provide a get_if
function returning a pointer to the internal value, as variant2::get_if does.
4) Make value::get return a reference instead of a value.
5) Remove value::to_variant

I've summarized these points in https://github.com/anarthal/mysql/issues/85.
...
The documentation states that the behavior of relational operators may
change in the future. What do other libraries do about these comparisons?
We probably don't want something like PHP, where "20" == 20 so that we need
some kind of "20" === 20 in the future, but the documentation implies we
also don't want "value(std::int64_t(200)) > value(std::uint64_t(200))" to
fail. The library makes the whole thing particularly tricky with number
types because not checking the proper number type in the variant would
return invalid/null even when the number could be stored in a variable of
another number type perfectly fine, which is even worse when both invalid
and null are presented by the same empty optional.
Thanks to conversions, value(int64_t(200)).get_optional<uint64_t>() succeeds
and returns a non-empty optional. The main use I see for inequality operators
is being able to store values in containers such as std::set, so I don't think
we should change the behavior in the future. Raised
https://github.com/anarthal/mysql/issues/80.
...
About the "resultset":
- If "resultset" is a table, the "set" part of the name might be a little
misleading in C++ as the results are in a deterministic sequence, even
though this is what MySQL may call a similar concept. However, I imagine
there's no equivalent concept in MySql for this specific
Asio-async-lazy-fetching functionality itself.
- This class is more like a "table", "frame", "vector", "list", or just
"result". However, after all the steps of "connect", "prepare", and
"execute", the name "result" gives the impression that the results are
finally there, but this is just another stream object like a "result
fetcher" or something.
I'm thinking on calling it query_operation or query_result.
...
- Since the protocol is strictly sequential, it might be useful to not even
have a resultset class at all. The connection could send the query and have
some read function, like the usual read/write functions in Asio can operate
on the same stream.
Resultsets also hold whether they're complete or not, and some extra
info like affected_rows and warning_count, which become available once
the resultset has been completely read (or from the beginning, if the
query was an INSERT, for example). Docs:
https://anarthal.github.io/mysql/mysql/resultsets.html#mysql.resultsets.comp...
If we implement pipeline mode, it will have to hold some state specific
to the operation (i.e. a sequence number that the protocol dictates - although
that's internal). It is also possible to make resultset (or whatever name we
come up with) not an I/O object, and move the read functions to the connection.
...
Minor issues:
- Maybe it's not clear from the examples, but why would the user need
connection::handshake when connection::connection already does it?
If your Stream object does not satisfy the SocketStream requirements
(https://anarthal.github.io/mysql/mysql/ref/boost__mysql__SocketStream.html),
you won't be able to use connection::connect or connection::close.
This would be the case for Windows named pipes.
Docs here:
https://anarthal.github.io/mysql/mysql/other_streams.html#mysql.other_stream...
...
- Shouldn't `tcp_ssl_connection::connect` accept an EndpointSequence to be
more like other Asio functions.
I think it should accept both. Tracked by
https://github.com/anarthal/mysql/issues/72
...
- The names "tcp_ssl_prepared_statement", "tcp_ssl_<component>" are quite
repetitive and verbose in the documentation. Maybe "tcp_ssl" could become a
namespace? Isn't that what Asio does for the ssl related objects? Some
names might be reduced if there's no alternative. For instance,
"prepared_statement" might become "statement" since there's no
"unprepared_statement".
Would having type aliases under connection (i.e. connection::resultset_type)
alleviate your concerns?
...
- Wouldn't some form of `statement::execute` with variadic args (different
name perhaps) be a simpler alternative to
`statement::execute(boost::mysql::make\_values(...))`. Or maybe
statement::operator(), as Phil suggested. Or some
`statement::execute_with(...)`. In the variadic args, the optional err/info
can go into some specified positions, since they're not ambiguous with the
field variant types. This is a little tricky but definitely useful,
especially if suggestions of customization points are implemented in the
future.
Definitely. https://github.com/anarthal/mysql/issues/74 tracks it.
How would variadic operator() work with async functions? The convention
is having the _async suffix, but that's not possible with an operator overload.
...
## Implementation
...
What is your evaluation of the implementation?
I haven't analyzed the implementation very profoundly. I skimmed through
the source code and couldn't find anything problematic. It would be useful
if the experts could inspect the Asio composed ops more deeply.
CMakeLists.txt:
- I believe the CMakeLists.txt script is not in the format of other boost
libraries in boost/libs so it won't work with the super-project as it is.
Not aware there was a standard format. Could you please point
me out to the relevant documentation?
...
- The example/CMakeLists.txt script refers to
BOOST_MYSQL_INTEGRATION_TESTS. I don't think examples can be considered
integration tests.
They are from the perspective that they require a real server to run.
...
Examples:
- The examples are very nice. Especially the one with coroutines. They are
also very limited. They are all about the same text queries, which
shouldn't even be used in favor of prepared statements.
Tracked by https://github.com/anarthal/mysql/issues/81. May be worth
reconsidering when we tackle https://github.com/anarthal/mysql/issues/69
...
- Many examples about continuation styles are not very useful because this
is more of an Asio feature than a library feature. The library feature, so
to speak, is supporting Asio tokens properly. The continuation styles could
be exemplified in the exposition with some small snippets for users not
used to Asio without the documentation losing any value.
I may move them into a separate section, instead of having all of them
in a single section.
...
- The page about the docker container should specify that the username and
password are "root" and "".
Raised https://github.com/anarthal/mysql/issues/82
...
Tests:
Some unit tests take a ***very*** long time. Enough to make coffee and a
sandwich. And they seem to not be adding a lot of value in terms of
coverage. For instance, "mysql/test/unit/detail/protocol/date.cpp(72):
info: check '1974- 1-30' has passed" going through all possible dates
multiple times took a long time.
True. They may have been useful during development but they don't
give us a lot as regression tests. Raised
https://github.com/anarthal/mysql/issues/83
...
The documentation is complete. The main points that differentiate the
library are
- it's a complete rewrite of the protocol,
- it's low-level and
- it's based on Boost.Asio
The documentation should emphasize these points as much as possible,
especially the first one. This should be in the introduction, the
motivation, slogans, logos, and wherever people can see it easily.
Has been noted in several other reviews.
https://github.com/anarthal/mysql/issues/70 tracks it.
...
The documentation should also provide arguments and evidence that
these design goals are a good idea, as often discussed when the topic is
the value of this library. Why is it worth rewriting the protocol? To what
use cases are such a low-level library useful? Why should a person who
already uses other libraries or the C API care about Asio now? Something
that should also be highlighted is the difference between the library and
other higher-level libraries, in particular, naming names.
https://github.com/anarthal/mysql/issues/50 tracks it.
...
Minor issues:
- There's no link in the documentation to the protocol specification. It
would be interesting to know what the reference specification is. Or
whether the protocol was inferred somehow. Is there any chance this
protocol might change? What about divergences between MySql and MariaDB?
How stable is the protocol? For what range of versions does it work? What's
the policy when it changes?
Raised https://github.com/anarthal/mysql/issues/84.
FYI, this is the specification:
https://dev.mysql.com/doc/dev/mysql-server/latest/PAGE_PROTOCOL.html.
Some mechanics are better explained here:
https://dev.mysql.com/doc/internals/en/client-server-protocol.html
(although it's an old page).
...
- Some links are broken (for instance, linking to
https://anarthal.github.io/boost-mysql/index.html).
Just fixed it in develop.
...
- "All async operations in this library support per-operation
cancellation". It's important to highlight this is per operation in the
Asio sense of an operation but not in the MySql sense of an operation
because the MySql connection is invalid after that.
Note added in develop.
...
- "Boost.MySql has been tested with the following versions of MySQL".
MariaDB is not a version of MySql.
Fixed in develop.
...
- Prepared statements should come first in the examples, to highlight them
as the default pattern.
Will be tackled as part of https://github.com/anarthal/mysql/issues/81
...
- The documentation refers to topics that haven't been explained yet. Maybe
"value" could be explained after "row", and "row" could be explained after
"resultset" and "resultset" after "queries".
The basic concepts have been covered in the tutorial, which goes first.
So at this point the user should know the *basic* concept of what a value,
a row and a resultset is. The rest of the sections go deeper in these concepts.
Did you find section order counter-intuitive, even reading the tutorial first?
...
- The section "Text Queries" is quite small in comparison to other
sections. It could include some examples and snippets like other sections
do.
I'll probably expand it as part of https://github.com/anarthal/mysql/issues/69,
because there will be more stuff to cover.
...
- "The following completion tokens can be used in any asyncrhonous
operation within Boost.Mysql" -> "Any completion token..."
Fixed in develop
...
- "When they fail, they throw a boost::system::system_error exception".
Don't these functions just set the proper error_code, as usual with Asio
and Beast?
No. In Asio, you always get two overloads for sync operations. Let's take
basic_stream_socket::read_some, for instance:
https://www.boost.org/doc/libs/1_79_0/doc/html/boost_asio/reference/basic_st...
https://www.boost.org/doc/libs/1_79_0/doc/html/boost_asio/reference/basic_st...
Overload 1 throws boost::system::system_error, while overload 2 returns the
error by setting the appropriate error_code. This library is
consistent with that.
...
- The "MySQL to C++ mapping reference" section should be using a table.
I'd say this is personal preference. I will explore the tabular format
when I have the chance.
...
- A small subsection on transactions would be helpful even if there's no
library functionality to help with that.
https://github.com/anarthal/mysql/issues/65 tracks it.
...
- The documentation should include some comparisons that are not obvious to
potential users. C/C++ APIs. The advantages of the Asio async model.
Benchmarks if possible.
https://github.com/anarthal/mysql/issues/50
...
...
Are there any immediate improvements that could be made after acceptance,
if acceptance should happen?
While it's important to have a general variant type for row values, a
simpler interface for tuples of custom types would be very welcome and
would simplify things by a lot, while also avoiding allocations, since
columns always have the same types. This feature is too obvious since users
almost always know their column types at compile time and this demand is
too recurrent in applications to ignore.
...
Do you think the library should be accepted as a Boost library? Be sure
to say this explicitly so that your other comments don't obscure your
overall opinion.
I believe it should be conditionally accepted with the same conditions
stated in other reviews: allowing for memory reuse in the read_* functions
and fixing the "value" type.
Could you please be a little more explicit on your second point?
Would adding a struct/tuple interface
(https://github.com/anarthal/mysql/issues/60)
and making it owning (https://github.com/anarthal/mysql/issues/85), which would
in turn tackle all the optional stuff, be enough?

Kind regards,
Ruben.

Re: [boost] Boost MySQL Review Starts Today

Ruben Perez