Re: [boost] [url] Andrzej's notes
On Mon, Aug 22, 2022 at 11:48 PM Andrzej Krzemienski
Like in this example from the docs: https://www.boost.org/doc/libs/1_80_0/libs/beast/doc/html/beast/quick_start/...
http::requesthttp::string_body req{http::verb::get, target, version}; req.set(http::field::host, host); req.set(http::field::user_agent, BOOST_BEAST_VERSION_STRING);
At no point do I have to see or provide a full URL.
"target" is the URL there, it came from the command line. It is a relative-ref.
* modifiers which take un-encoded inputs have a wide contract: all input strings are valid - however the url might need to reallocate memory to encode the result
Does set_port() taking a string_view fall into that category? Unlike other parts, port has special requirements on the string contents that cannot be satisfied by pct-encoding.
Yeah, we have a problem here. It sounds like we need to design an extra set functions. Maybe: url_base& url_base::set_port( string_view ); // throws result<void> url_base::try_set_port( string_view ); // returns result what do you think about that?
If you used result<>, you would lose the ability to chain the setters.
Yeah... well, I think I'm OK with that.
As an alternative, you could say that function set_port() has a precondition: the input string has to represent a number, the caller is responsible for that, and set_port() performs no validation, and therefore throws nothing. But I guess that would violate one of the design goals of the library: "securely, including the case where the inputs come from untrusted sources".
My original thinking was that untrusted sources would only be presented to parse functions. But our dialog has convinced me that we should treat the parameters to modification functions as untrusted as well. True, in some situations we will be performing unavoidable, needless re-validation. But I think it is the right tradeoff, as offering the stronger invariant has more value for users. Besides, there are workarounds for assembling a URL which trade back performance in exchange for weaker invariants.
Just to clarify what I mean: https://godbolt.org/z/qY3Y76fd9
urls::string_view s = "https://path?id=42&id=43"; urls::url r = urls::parse_uri( s ).value();
Should the above input string not cause the invariant to be broken? (because param 'id' appears twice.)
No, because the query is just a string. The interpretation of the query as "params" (an ampersand delimited list of name=value pairs) is an HTTP thing (and it has also spread to some other domains). When the query is used this way, duplicates are allowed. So the invariant is preserved in this case. It is always a valid URL. It might not be valid for custom schemes though. I could invent the "boost" scheme which requires that keys used in query parameters are unique. But there is no way for Boost.URL to enforce this (see my previous message regarding "mailto"). Users are provided with the tools to further specialize the generic URLs into custom schemes in a way that can preserve scheme-specific invariants. Thanks
On Tue, Aug 23, 2022 at 2:30 PM Vinnie Falco
If you used result<>, you would lose the ability to chain the setters.
Yeah... well, I think I'm OK with that.
If you chain the setters then how can you look at the result to see if it failed? Anyway.. I thought of a solution: result< url_base& > url_base::set_port( string_view s ); now you can write url().set_port( "8080" )->set_encoded_host( "www.example.com" )->set_path( "/" ) Regards
Vinnie Falco wrote:
On Tue, Aug 23, 2022 at 2:30 PM Vinnie Falco
wrote: If you used result<>, you would lose the ability to chain the setters.
Yeah... well, I think I'm OK with that.
If you chain the setters then how can you look at the result to see if it failed? Anyway.. I thought of a solution:
result< url_base& > url_base::set_port( string_view s );
now you can write
url().set_port( "8080" )->set_encoded_host( "www.example.com" )->set_path( "/" )
That's not going to work because op-> is unchecked.
wt., 23 sie 2022 o 23:32 Vinnie Falco
On Tue, Aug 23, 2022 at 2:30 PM Vinnie Falco
wrote: If you used result<>, you would lose the ability to chain the setters.
Yeah... well, I think I'm OK with that.
If you chain the setters then how can you look at the result to see if it failed? Anyway.. I thought of a solution:
result< url_base& > url_base::set_port( string_view s );
now you can write
url().set_port( "8080" )->set_encoded_host( "www.example.com" )->set_path( "/" )
Actually, I thought about something like: error_code ec; url.set_port(s1, ec).set_encoded_host(s2, ec).set_path(s3, ec); or: error_code ec1, ec2, ec3; url.set_port(s1, ec1).set_encoded_host(s2, ec2).set_path(s3, ec3); And then some operations would not set their part upon invalid input and record this in the error code. But the library seems to be moving away from error codes. Regards, &rzej;
On 24/08/2022 09:30, Vinnie Falco wrote:
Does set_port() taking a string_view fall into that category? Unlike other parts, port has special requirements on the string contents that cannot be satisfied by pct-encoding.
Yeah, we have a problem here. It sounds like we need to design an extra set functions. Maybe:
url_base& url_base::set_port( string_view ); // throws
result<void> url_base::try_set_port( string_view ); // returns result
what do you think about that?
Why is it a string at all? Shouldn't it be exclusively a uint16_t? Are there any cases where anything else is valid?
On Tue, Aug 23, 2022 at 4:23 PM Gavin Lambert via Boost
url_base& url_base::set_port( string_view ); // throws result<void> url_base::try_set_port( string_view ); // returns result
Why is it a string at all? Shouldn't it be exclusively a uint16_t?
Its a good question. Every member function in the reference has a BNF section which shows the grammar, and a Specification section which provides a link to the relevant specifications document (RFC3986 in this case): https://master.url.cpp.al/url/ref/boost__urls__url_base/set_port/overload2.h... The RFC says: port = *DIGIT That means that "80", "443", and "8080" are examples of valid ports. "65535" is a valid port. But so is "4294967295" or even "999999999999999999". In fact "0" is a valid port, as well as "00" or any number of leading zeroes, followed by any number of digits (including no digits). And the empty string "" is a valid port. Boost.URL recognizes the distinction between having a port which is the empty string, and not having a port at all, by providing the function remove_port() and the function has_port(). Calling set_port("") is semantically different from calling remove_port(), and results in a different URL. This is per the RFC grammar: authority = [ userinfo "@" ] host [ ":" port ] Because this library is capable of representing ALL URLs, it is necessary for the interface to allow the caller to interact with the port as a string which is valid according to the grammar. But of course people need to treat it like an integer so we have overloads for setting the port as a sixteen bit unsigned number, and retrieving the port as the same. https://master.url.cpp.al/url/ref/boost__urls__url/port_number.html https://master.url.cpp.al/url/ref/boost__urls__url/set_port/overload1.html Regards
śr., 24 sie 2022 o 01:37 Vinnie Falco via Boost
On Tue, Aug 23, 2022 at 4:23 PM Gavin Lambert via Boost
wrote: url_base& url_base::set_port( string_view ); // throws result<void> url_base::try_set_port( string_view ); // returns
result
Why is it a string at all? Shouldn't it be exclusively a uint16_t?
Its a good question. Every member function in the reference has a BNF section which shows the grammar, and a Specification section which provides a link to the relevant specifications document (RFC3986 in this case):
< https://master.url.cpp.al/url/ref/boost__urls__url_base/set_port/overload2.h...
The RFC says:
port = *DIGIT
That means that "80", "443", and "8080" are examples of valid ports. "65535" is a valid port. But so is "4294967295" or even "999999999999999999". In fact "0" is a valid port, as well as "00" or any number of leading zeroes, followed by any number of digits (including no digits). And the empty string "" is a valid port. Boost.URL recognizes the distinction between having a port which is the empty string, and not having a port at all, by providing the function remove_port() and the function has_port(). Calling set_port("") is semantically different from calling remove_port(), and results in a different URL. This is per the RFC grammar:
authority = [ userinfo "@" ] host [ ":" port ]
Because this library is capable of representing ALL URLs, it is necessary for the interface to allow the caller to interact with the port as a string which is valid according to the grammar.
Is this the only reason? I would think that the primary reason is performance: usually you obtain a URL as a string, so the port-string is just sitting there. Then you may want to set it in another URL, so it will be a string again. But of
course people need to treat it like an integer so we have overloads for setting the port as a sixteen bit unsigned number, and retrieving the port as the same.
https://master.url.cpp.al/url/ref/boost__urls__url/port_number.html < https://master.url.cpp.al/url/ref/boost__urls__url/set_port/overload1.html
Oh, it wasn't clear from the docs. So, it looks like you are saying "I know it's an integer port number, you know it's an integer port number, but I need to keep it as string because other schemes may use it for their own murky schemes, like: "ak://www.boost.org" -- identifies one resource "ak://www.boost.org:" -- identifies another resource This convinces me even further that I might want a type like http_url with a tighter invariant: 1. The port is an optional int 2. The query part doesn't have duplicate keys. Regards, &rzej;
Le 2022-08-24 10:03, Andrzej Krzemienski via Boost a écrit :
śr., 24 sie 2022 o 01:37 Vinnie Falco via Boost
napisał(a):
This convinces me even further that I might want a type like http_url with a tighter invariant: 1. The port is an optional int 2. The query part doesn't have duplicate keys.
Unless i'm mistaken, it is perfectly legal to have duplicate keys in the query part, even for the http/https scheme. And IIRC that's been (and probably still is) used by several web frameworks to represent arrays in GET request. So that would not fit for an http_url invariant. Regards, Julien
On Wed, Aug 24, 2022 at 1:04 AM Andrzej Krzemienski
This convinces me even further that I might want a type like http_url with a tighter invariant:
Uhh.. I don't know about that. The request-target in HTTP is not exactly a URL. From rfc7230: request-target = origin-form / absolute-form / authority-form / asterisk-form origin-form is a relative-ref which has no scheme or authority. There's no port in it. absolute-form is the same as absolute-URI which can have an authority. authority-form is just an authority, can have a port, and is technically not even a URL. That's why Boost.URL provides the type authority_view and the function parse_authority. And asterisk-form is literally just a single asterisk character '*' which is also not a URL. Boost.URL does not provide a type or parser for this because, well come on man you can parse that yourself :) So what would http_url be? It can't be all of these things.. unless maybe you're thinking to make it a variant. I feel that the current model of providing url_view, url, and static_url as general purpose containers which faithfully implement rfc3986, while leaving it to the user to build up tighter invariants or other scheme-specific interfaces to the general container, strikes a reasonable balance. On a side note, it is because the specifics of the grammars matter so much for where they are used that I have been strict in the naming of parse functions. It is essential to use the right terms or else more confusion results. Did you know that the request-target was sometimes not a URL, and has a grammar that is dependent on the context (e.g. GET versus CONNECT)? I did not... Users will need to know these things in order to produce correct programs, and Boost.URL does its part by being precise. Thanks
On 24/08/2022 20:03, Andrzej Krzemienski wrote:
śr., 24 sie 2022 o 01:37 Vinnie Falco napisał(a):
Because this library is capable of representing ALL URLs, it is necessary for the interface to allow the caller to interact with the port as a string which is valid according to the grammar.
I don't really think that's a good reason. The grammar specifically
states that the value must be numeric, and while it doesn't explicitly
place a limit on its range, it's not really reasonable to use it for
anything other than a TCP/UDP port number, which *is* explicitly limited
to 16-bit values. While URLs do occasionally get used for non-Internet
protocols, I can't think of a case where a larger port number would be used.
As such, I do think it's reasonable for it to fail parsing if someone
tries to use an out-of-range port number, at least until someone
demonstrates a (non-arbitrary) case where a larger range is needed.
Regarding 0 being potentially ambiguous, I don't think that's an issue
either, since it's not a valid TCP/UDP port number. And you do have
.has_port() if someone really wants to disambiguate that case. You
could potentially change it to optional
Is this the only reason? I would think that the primary reason is performance: usually you obtain a URL as a string, so the port-string is just sitting there. Then you may want to set it in another URL, so it will be a string again.
This is perhaps a better reason, and there are some other network libraries that do take the port as a string (or as a full authority string including hostname and colon). And if you're just disassembling, modifying, and reassembling the URL to pass on as a string then you may have no need of converting to integer. But then, in those cases you'd never call .port() anyway. So I don't think it would cost anything to remove the string accessors from the public API, even if it keeps the internal string representation and only parses to int on-demand. And it would avoid some complications with .set_port(s) and invalid input. Or perhaps make .port() and .set_port(n) use uint16_t but retain a string getter (but not setter) .port_string() for the corner cases.
2. The query part doesn't have duplicate keys.
I definitely disagree with this one. Repeated keys in the query are quite commonplace to represent array values or other multi-selects. Having said that, it may be useful to present (both for read and write) the params as a multi-map-like structure with unique keys and multiple values. This can't be the primary interface (since it can only be used when the order of the keys is not important, which is commonly the case but not *always* the case) but it would be a more convenient one for most usage. Aside: at the bottom of https://master.url.cpp.al/url/containers/query.html the examples 3 & 4 are both labelled "a key with no value", which is not what #4 shows. There is also no example for "no key and no value", which was stated to be a possible result in the preceding paragraphs. The example query string suggests that perhaps that should have appeared in between 3 and 4.
On Wed, Aug 24, 2022 at 4:44 PM Gavin Lambert via Boost
śr., 24 sie 2022 o 01:37 Vinnie Falco napisał(a):
Because this library is capable of representing ALL URLs, it is necessary for the interface to allow the caller to interact with the port as a string which is valid according to the grammar.
While URLs do occasionally get used for non-Internet protocols, I can't think of a case where a larger port number would be used.
I can't either, but when building this library we have done our best to refrain from subjective interpretations of the grammar, which states unambiguously: port = *DIGIT This means zero or more digits. TCP/IP was already quite well established for decades before rfc3986 was written, so I have to think that if they wanted the port to be limited to only what is possible with TCP/IP they would have stated so explicitly. Port zero for example, is an invalid port number, but it is allowed by this grammar. Some of the grammars in the spec are explicit when it comes to numeric limits, for example a dec-octet is limited to 0-255: dec-octet = DIGIT ; 0-9 / %x31-39 DIGIT ; 10-99 / "1" 2DIGIT ; 100-199 / "2" %x30-34 DIGIT ; 200-249 / "25" %x30-35 ; 250-255 had the authors intended to restrict the port they would have written it this way. I have been reluctant to put my own spin on interpreting the RFC if for no other reason, that I do not have sufficient field experience with the countless number of published and unpublished schemes which are currently in use. A conservative design choice follows the specification to the letter.
As such, I do think it's reasonable for it to fail parsing if someone tries to use an out-of-range port number
This is weird because you're saying that we should not accept valid productions according to the grammar?
So I don't think it would cost anything to remove the string accessors from the public API, even if it keeps the internal string representation and only parses to int on-demand. And it would avoid some complications with .set_port(s) and invalid input.
There's a problem here. What if we get url_view u( "//example.com:00" ); Is this a valid URL? According to the RFC it is. we parse it, and return 0 from port_number(). But you want to take away the port string modifiers. One of the principles of this library, is that the API allows the user to create any possible valid URL. That is, given a valid URL string, there exists a finite sequence of calls to the library that will produce the string. Taking away the port string modifiers leaves the library in the questionable position that users cannot create a URL that the parser thinks is valid. Regards
wt., 23 sie 2022 o 23:30 Vinnie Falco
On Mon, Aug 22, 2022 at 11:48 PM Andrzej Krzemienski
wrote: Like in this example from the docs:
https://www.boost.org/doc/libs/1_80_0/libs/beast/doc/html/beast/quick_start/...
http::requesthttp::string_body req{http::verb::get, target,
version};
req.set(http::field::host, host); req.set(http::field::user_agent, BOOST_BEAST_VERSION_STRING);
At no point do I have to see or provide a full URL.
"target" is the URL there, it came from the command line. It is a relative-ref.
Ok, then maybe I do not understand how Beast works, I thought req.set(http::field::host, host); was setting the host. Anyway, other than this example, I did not have any other interface in mind.
* modifiers which take un-encoded inputs have a wide contract: all input strings are valid - however the url might need to reallocate memory to encode the result
Does set_port() taking a string_view fall into that category? Unlike other parts, port has special requirements on the string contents that cannot be satisfied by pct-encoding.
Yeah, we have a problem here. It sounds like we need to design an extra set functions. Maybe:
url_base& url_base::set_port( string_view ); // throws
result<void> url_base::try_set_port( string_view ); // returns result
what do you think about that?
If you used result<>, you would lose the ability to chain the setters.
Yeah... well, I think I'm OK with that.
As an alternative, you could say that function set_port() has a precondition: the input string has to represent a number, the caller is responsible for that, and set_port() performs no validation, and therefore throws nothing. But I guess that would violate one of the design goals of the library: "securely, including the case where the inputs come from untrusted sources".
My original thinking was that untrusted sources would only be presented to parse functions. But our dialog has convinced me that we should treat the parameters to modification functions as untrusted as well. True, in some situations we will be performing unavoidable, needless re-validation. But I think it is the right tradeoff, as offering the stronger invariant has more value for users. Besides, there are workarounds for assembling a URL which trade back performance in exchange for weaker invariants.
According to the design-by-contract theory, putting a precondition on class members (like, sting_view must represent an int) does not weaken the class invariant. It is just that the invariant is enforced through the new precondition rather than runtime validation. "I guarantee the invariant as long as users guarantee that they satisfy the preconditions." Of course, this may not work for your case, as you may want to allow for a use case when the user calls set_port() with an argument obtained from an untrusted source. Regards, &rzej;
Just to clarify what I mean: https://godbolt.org/z/qY3Y76fd9
urls::string_view s = "https://path?id=42&id=43"; urls::url r = urls::parse_uri( s ).value();
Should the above input string not cause the invariant to be broken? (because param 'id' appears twice.)
No, because the query is just a string. The interpretation of the query as "params" (an ampersand delimited list of name=value pairs) is an HTTP thing (and it has also spread to some other domains). When the query is used this way, duplicates are allowed. So the invariant is preserved in this case. It is always a valid URL. It might not be valid for custom schemes though. I could invent the "boost" scheme which requires that keys used in query parameters are unique. But there is no way for Boost.URL to enforce this (see my previous message regarding "mailto"). Users are provided with the tools to further specialize the generic URLs into custom schemes in a way that can preserve scheme-specific invariants.
Thanks
participants (5)
-
Andrzej Krzemienski
-
Gavin Lambert
-
Julien Blanc
-
Peter Dimov
-
Vinnie Falco