Boost.URL -- some notes
Hi Everyone,
I am trying to do a review of Boost.URI library (I usually find the official ten-day review period to be too short), and there are a number of interesting things I came across that I thought I would mention. The docs say:
The library requires Boost and a compiler supporting at least C++11.
Even though the library is a candidate for a Boost library, I understand
that it is now offered as a "stand alone" version, and this is what the
docs describe. I tried to use it with the latest MinGW Distro on Windows (
https://nuwen.net/mingw.html), which uses GCC 11.2 and Boost 1.77. Without
success. This is because Boost.URL relies on the component
boost::system::result<T>, which is present in Boost.System only since
version 1.78:
https://www.boost.org/doc/libs/1_79_0/libs/system/doc/html/system.html#chang...
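First, this is news to me that we have `result<T>` in Boost.System, which
has an overlap with result... Second, I recommend that Boost.URL docs say that it requires Boost 1.78 or higher.
Next, we read:
Aliases for standard types, such as string_view
https://master.url.cpp.al/url/ref/boost__urls__string_view.html, use their Boost equivalents.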
After reading this, I expected that Boost.URL would use boost::string_view
from Boost.Utility library:
https://www.boost.org/doc/libs/1_79_0/libs/utility/doc/html/utility/utilitie...
But instead, it uses boost::core::string_view, which is an implementation
detail from Boost.Core library:
https://github.com/CPPAlliance/url/blob/master/include/boost/url/string_view...
Again, this is news for me that Boost has two implementations of
string_view. Why? Second, I do not think that Boost.URL should rely on the
implementation details of Boost.Core. A better alternative would be to use
the official boost::string_view from Boost.Utility. Or is there a good
reason not to?
Next, the section on the parsers (
https://master.url.cpp.al/url/parsing/url.html) describes the function
parse_uri() which returns result<url_view>. What strikes me is this difference: URI (Identifier) in the function name, and URL (Locator) in the return type. I always used the terms URL and URI interchangeably. But now that I see them used in this way in a well designed library, it looks disturbing. The quoted rfc3986 ( https://datatracker.ietf.org/doc/html/rfc3986#section-1.1.3) says that an URL is a subset of URI. Now, the name `parse_uri` implies that it will recognize any URI, but on the other hand it is impossible that the result will fit into a url_view, because not every URI is an URL.
The synopsis for parse_uri ( https://master.url.cpp.al/url/ref/boost__urls__parse_uri.html) says:
Exception safety: throws nothing.
And the line below it says that the function throws std::length_error when the input is too long. It looks like a bug in specs. Later we read:
Return value: A result containing the view to the URL, or an error code if
the parsing was unsuccessful.
Which is not precise enough to give me the answer to the URI-vs-URL question. When can a parsing be non-successful? Is it only because it was not conformant to the grammar? The synopsis says "This function parses a string according to the URI grammar below", but is it a URI grammar or a URL grammar actually? Maybe the "return value" section should say instead:
Return value: A result containing the view to the URL, or an error code if
the contents of `s` were not conformant with the above grammar.
That is, any other reason for not being successful (if any resources needed to be allocated and failed) may still be reported via exceptions.
Now, there is probably a good explanation to the URI vs URL discrepancy. I think it would be good if it was placed in the docs, so that the users don't get confused.
While this might look like a list of complaints, I really appreciate the efforts the authors put in creating this library and its documentation. The documentation is really high quality, way higher than the average you will find in GitHub. And this is actually because of this high quality that I am able to spot and report these issues.
Regards,
&rzej;
On Sat, Jun 4, 2022 at 5:23 PM Andrzej Krzemienski wrote:
Aliases for standard types, such as string_view
https://master.url.cpp.al/url/ref/boost__urls__string_view.html, use their Boost equivalents.
After reading this, I expected that Boost.URL would use boost::string_view from Boost.Utility library: https://www.boost.org/doc/libs/1_79_0/libs/utility/doc/html/utility/utilitie...
But instead, it uses boost::core::string_view, which is an implementation detail from Boost.Core library: https://github.com/CPPAlliance/url/blob/master/include/boost/url/string_view...
Again, this is news for me that Boost has two implementations of string_view. Why? Second, I do not think that Boost.URL should rely on the implementation details of Boost.Core. A better alternative would be to use the official boost::string_view from Boost.Utility. Or is there a good reason not to?
The mailing list thread in which Core's string_view was discussed is:
https://lists.boost.org/Archives/boost/2021/10/252092.php
Originally it was not meant to be in the 'detail' namespace.
Glen
On Sat, Jun 4, 2022 at 2:23 PM Andrzej Krzemienski via Boost wrote:
I am trying to do a review of Boost.URI library
Great! But umm.... err..., well - it's called Boost.URL ;)
I tried to use it with the latest MinGW Distro on Windows ( https://nuwen.net/mingw.html), which uses GCC 11.2 and Boost 1.77.
Yeah you need the latest Boost. Until the library is actually accepted, it is written for the tip of the develop branch of the superproject. Note that this goes for all our in-development libraries.
Second, I recommend that Boost.URL docs say that it requires Boost 1.78 or higher.
That's not unreasonable. We develop the documentation as-if the library is already accepted, to minimize the changes that must be made post-acceptance. You should open an issue as that is the best way to motivate change: https://github.com/CPPAlliance/url/issues
Aliases for standard types, such as string_view
https://master.url.cpp.al/url/ref/boost__urls__string_view.html, use their Boost equivalents.
After reading this, I expected that Boost.URL would use boost::string_view from Boost.Utility library: https://www.boost.org/doc/libs/1_79_0/libs/utility/doc/html/utility/utilitie...
But instead, it uses boost::core::string_view, which is an implementation detail from Boost.Core library: https://github.com/CPPAlliance/url/blob/master/include/boost/url/string_view...
Yeah, this documentation was written before we started using Core's string_view. It will need to be updated in Boost.URL, Boost.JSON, Boost.Beast, and Boost.HTTP.Proto. Newly opened issues are the best way to motivate change:
https://github.com/boostorg/beast/issues
https://github.com/boostorg/beast/json
https://github.com/CPPAlliance/url/issues
https://github.com/CPPAlliance/http_proto
Again, this is news for me that Boost has two implementations of string_view. Why?
Yeah, so Peter has convinced me that offering two versions of every one of our libraries is not a great idea. By that I mean, that offering a macro that lets the user configure the library for either std::string_view or boost::string_view is detrimental. Because this produces two distinct linkable libraries that each have their own diverging ABIs (or is it APIs?). This unnecessary friction is a constant source of complaints.
Peter's vision is that Boost evolves so that its types are more compatible with their std equivalents. For example boost::core::string_view will be more easily converted implicitly in places where the user expects such conversions to take place. We couldn't do this in Boost.Utility's string view because the author is philosophically opposed to making this change. There's some discussion here:
https://github.com/boostorg/utility/issues/40
https://github.com/boostorg/utility/pull/51
Next, the section on the parsers ( https://master.url.cpp.al/url/parsing/url.html) describes the function parse_uri() which returns result<url_view>. What strikes me is this difference: URI (Identifier) in the function name, and URL (Locator) in the return type. I always used the terms URL and URI interchangeably.
About that. So, the library uses the term "URL" to mean any of the provided containers, e.g. url_view, url, static_url. The term "URI" always refers to the specific BNF syntax found in the relevant RFC.
But now that I see them used in this way in a well designed library, it looks disturbing. The quoted rfc3986 ( https://datatracker.ietf.org/doc/html/rfc3986#section-1.1.3) says that an URL is a subset of URI.
The decision that I have made is to just ignore the RFC's guidance on what URL means, and instead use the term as it has become popularly known. I believe that the distinction between URL and URI is just not recognized by the general public and in particular the wide audience to which Boost.URL applies.
No one asks you for your URI, but everyone asks you for your URL. People put URLs into the address bar. No one says "type this URI into the address bar." The address bar accepts non-http schemes such as mailto and file. These are technically URIs (see: https://en.wikipedia.org/wiki/Mailto). But no one calls them that. A Google search for "URL" produces fifteen times more results than a search for "URI" although you would think that URIs would be more common since they are a superset of URLs. Go figure :)
Therefore I have chosen to use the less technically correct but the more marketable term "URL" in the key places where it matters: the name of the library and the name of the container. Or to put it a different way
url u;
Looks a hell of a lot better than
uri u;
The synopsis for parse_uri ( https://master.url.cpp.al/url/ref/boost__urls__parse_uri.html) says:
Exception safety: throws nothing.
And the line below it says that the function throws std::length_error when the input is too long. It looks like a bug in specs. Later we read:
Return value: A result containing the view to the URL, or an error code if
the parsing was unsuccessful.
Yep this needs an open issue :) https://github.com/CPPAlliance/url/issues
Which is not precise enough to give me the answer to the URI-vs-URL question. When can a parsing be non-successful? Is it only because it was not conformant to the grammar? The synopsis says "This function parses a string according to the URI grammar below", but is it a URI grammar or a URL grammar actually?
Actually this is covered by the docs :) see table 1.1: https://master.url.cpp.al/url/parsing/url.html
Now, there is probably a good explanation to the URI vs URL discrepancy. I think it would be good if it was placed in the docs, so that the users don't get confused.
Yes we could use a blurb which explains that the library settles on the name URL to refer to containers: https://github.com/CPPAlliance/url/issues
While this might look like a list of complaints, I really appreciate the efforts the authors put in creating this library and its documentation. The documentation is really high quality, way higher than the average you will find in GitHub. And this is actually because of this high quality that I am able to spot and report these issues.
Hey thanks!!! Yeah there's of course going to be the usual rogues gallery of doc mistakes, missing explanations, etc... We appreciate your investigation of the library and the accompanying reports as they will help us provide the last bits of polish needed to make this great! Regards
On 5/06/2022 10:29, Vinnie Falco wrote:
The decision that I have made is to just ignore the RFC's guidance on what URL means, and instead use the term as it has become popularly known. I believe that the distinction between URL and URI is just not recognized by the general public and in particular the wide audience to which Boost.URL applies. No one asks you for your URI, but everyone asks you for your URL. People put URLs into the address bar. No one says "type this URI into the address bar." The address bar accepts non-http schemes such as mailto and file. These are technically URIs (see: https://en.wikipedia.org/wiki/Mailto). But no one calls them that.
file is also an URL protocol, although you're correct that mailto is not. I don't think anyone actually types mailto: addresses into the address bar on a browser, though (nor can I think of any other URIs that someone might manually type, except for things like about:config that are browser-specific). You're correct that the wider public doesn't really understand the distinction, but it does seem a bit weird that you're mixing the terms. Perhaps you should just use URL everywhere, if you don't like URIs?
On Jun 6, 2022, at 5:10 PM, Gavin Lambert via Boost wrote:
I don't think anyone actually types mailto: addresses into the address bar on a browser,
I just tried it (because why not)
mailto:boost@mirality.co.nz
* Google Chrome - launched my mail program and created a new message
* Safari - asked if I wanted to let the page create an email message and (when I said “Allow”) did so.
* Brave - asked if I wanted to let the page create an email message and (when I said “Yes”) did so.
-- Marshall
On 7/06/2022 12:21, Marshall Clow wrote:
On Jun 6, 2022, at 5:10 PM, Gavin Lambert wrote:
I don't think anyone actually types mailto: addresses into the address bar on a browser,
I just tried it (because why not)
mailto:boost@mirality.co.nz
* Google Chrome - launched my mail program and created a new message
* Safari - asked if I wanted to let the page create an email message and (when I said “Allow”) did so.
* Brave - asked if I wanted to let the page create an email message and (when I said “Yes”) did so.
I was not claiming that it wouldn't work, just that people don't actually type it in as a general rule. Browsers will of course support it, because it's a standard thing to support for hyperlinks in pages (at least for people who don't mind addresses being harvested by bots).
On Mon, Jun 6, 2022 at 8:21 PM Marshall Clow wrote:
On Jun 6, 2022, at 5:10 PM, Gavin Lambert wrote:
I don't think anyone actually types mailto: addresses into the address bar on a browser,
I just tried it (because why not)
mailto:boost@mirality.co.nz
* Google Chrome - launched my mail program and created a new message
* Safari - asked if I wanted to let the page create an email message and (when I said “Allow”) did so.
* Brave - asked if I wanted to let the page create an email message and (when I said “Yes”) did so.
Typing mailto:user@domain.com?subject=Hello in Netscape Navigator would open Eudora and save me a few key presses. Glen
On Mon, Jun 6, 2022 at 5:10 PM Gavin Lambert via Boost wrote:
You're correct that the wider public doesn't really understand the distinction, but it does seem a bit weird that you're mixing the terms. Perhaps you should just use URL everywhere, if you don't like URIs?
Yes, that is what I have done almost everywhere. The exception is when documentation or interface refers explicitly to grammar, for example in the function parse_uri:
https://master.url.cpp.al/url/ref/boost__urls__parse_uri.html
The Syntax Components section of the RFC uses the term URI as the label for the BNF production grammar:
https://datatracker.ietf.org/doc/html/rfc3986#section-3
Therefore to keep keyword searches and documentation sensible and aligned with the library, I always use the term URI in this context. This also applies to compound terms such as URI-reference:
https://master.url.cpp.al/url/ref/boost__urls__parse_uri_reference.html
https://datatracker.ietf.org/doc/html/rfc3986#section-4.1
It wouldn't make sense to rename this to "URL-reference" as users would not find it scanning the RFC or doing a keyword search in the RFC document. They would also not find it on a precise Google search. Compare:
https://www.google.com/search?q=%2B%22uri-reference%22
with
https://www.google.com/search?q=%2B%22url-reference%22
Thanks
On Tue, 7 Jun 2022, 03:10 Gavin Lambert via Boost wrote:
file is also an URL protocol, although you're correct that mailto is not. I don't think anyone actually types mailto: addresses into the address bar on a browser, though (nor can I think of any other URIs that someone might manually type, except for things like about:config that are browser-specific).
There's no such thing as "URL protocol". The distinction between a URI and a URL is purely in the way it's intended to be used. Every URL is a URI, because it Uniquely Identifies a thing. Some URIs are also URLs, because you can Locate things with them. Any URI scheme can be used for a URL, and I know for a fact that you can register handlers for custom URI schemes in at least Windows, Linux, and Android.
And on the other hand, http scheme is usually associated with the idea of URL. But e.g. XML namespaces are identified by URIs, often those are http URIs, and there's absolutely no guarantee that the URI can be used to retrieve some document from the Internet.
On Tue, 7 Jun 2022 at 13:21, Дмитрий Архипов via Boost wrote:
On Tue, 7 Jun 2022, 03:10 Gavin Lambert via Boost wrote:
file is also an URL protocol, although you're correct that mailto is not. I don't think anyone actually types mailto: addresses into the address bar on a browser, though (nor can I think of any other URIs that someone might manually type, except for things like about:config that are browser-specific).
There's no such thing as "URL protocol".
No, it is not. It [URL] contains one though.
From [1], which is one of the best explanations I've come across:
"The difference between a URI and a URL is that a URI can be just a name by itself, or a name [*what* - ML] with a *protocol* that tells you *how* to reach it - which is a URL."
[1] https://danielmiessler.com/study/difference-between-uri-url/
Best regards,
-- Mateusz Loskot, http://mateusz.loskot.net
On 2022-06-07 13:29, Mateusz Loskot via Boost wrote:
From [1], which is one of the best explanations I've come across:
[1] https://danielmiessler.com/study/difference-between-uri-url/
I wouldn't pay too much attention to this article, which reflects the author's point of view more than any academic consensus or the intent of the people who actually wrote the RFC. It also reads the RFC wording poorly, for example confusing “can be classified as” and “is”.
For those who can read French (I apologize to the others), https://www.bortzmeyer.org/3986.html gives a nice overview, from someone who is actually involved in the IETF.
To get back to the topic, I find Boost.URL's choice regarding URL/URI to be a very reasonable one.
Regards,
Julien
On Tue, Jun 7, 2022 at 7:41 AM Julien Blanc via Boost wrote:
To get back to the topic, I find Boost.URL's choice regarding URL/URI to be a very reasonable one.
Thank you. To throw some virtual chum into the mailing list waters, I would point out that the key innovations and design choices of this library are:
1. Constructed url and url_view objects are always in encoded form
2. Constructed url and url_view objects are always syntactically valid
3. Mutations on a url object always leave it in a syntactically valid state
4. Modification of a url and removal of url-encoding can be performed without allocating memory
And it is also worth mentioning that the request-target in the HTTP request-line
request-line = method SP request-target SP HTTP-version CRLF
is not actually a URL, but this production:
request-target = origin-form / absolute-form / authority-form / asterisk-form
asterisk-form is just "*" so users can just use op== to detect that. origin-form and absolute-form are relative-ref and absolute-URI respectively, and may be parsed using the functions boost::urls::parse_relative_ref and boost::urls::parse_absolute_uri. authority-form however, is special, because the syntax is not valid for what the url_view and url containers allow. The authority-form is the authority part of a valid URL, which is not a valid URL when it appears by itself:
authority = [ userinfo "@" ] host [ ":" port ]
Therefore, to facilitate the parsing of request-target in HTTP messages, the type authority_view is provided:
https://master.url.cpp.al/url/ref/boost__urls__authority_view.html
And a corresponding parsing function:
https://master.url.cpp.al/url/ref/boost__urls__authority_view/parse_authorit...
Thanks
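The four request-target forms listed above can be told apart before handing the string to a grammar-specific parser. The following is only a rough sketch using the standard library: `target_form` and `classify_request_target` are hypothetical names, not part of Boost.URL, and the "://" test is a heuristic that assumes hierarchical schemes such as http and https. It performs no grammar validation; that would remain the job of the parse functions named in the message.

```cpp
#include <optional>
#include <string_view>

// Hypothetical names for illustration; not part of Boost.URL.
enum class target_form { asterisk, origin, absolute, authority };

// Heuristic dispatch on the shape of an HTTP request-target.
// It only decides which grammar to try next and validates nothing.
inline std::optional<target_form>
classify_request_target(std::string_view t)
{
    if (t.empty())
        return std::nullopt;            // no form allows an empty target
    if (t == "*")
        return target_form::asterisk;   // asterisk-form is exactly "*"
    if (t.front() == '/')
        return target_form::origin;     // origin-form starts with an absolute path
    if (t.find("://") != std::string_view::npos)
        return target_form::absolute;   // assumption: scheme followed by "//"
    return target_form::authority;      // e.g. "example.com:443" in CONNECT
}
```

A caller would then switch on the returned form and pass the same string to the matching parser, for example the authority parser for `target_form::authority`.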
From [1], which is one of the best explanations I've come across
I've been trying to synthesize an explanation of this in the documentation. One useful resource I found is: https://datatracker.ietf.org/doc/html/rfc3305#section-2.3 The confusion is everywhere.
People who are well-versed in URI matters tend to use "URL" and "URI" in ways that seem to be interchangeable.
-- Alan Freitas https://github.com/alandefreitas
On Tue, Jun 7, 2022 at 8:17 AM Alan de Freitas via Boost wrote:
https://datatracker.ietf.org/doc/html/rfc3305#section-2.3
The confusion is everywhere.
Even the various authors of the specifications can't get it straight :) Hopefully this is sufficient evidence that trying to maintain the distinction between URL and URI (and URN lol) is a losing battle not worth fighting. Thanks
On 7/06/2022 23:21, Дмитрий Архипов wrote:
There's no such thing as "URL protocol".
I was using an abbreviation. A more correct phrasing might be "an URI scheme that denotes a protocol that uses the URL format" but that's much more of a mouthful. The parsing-level distinction between the two is generically obvious and does not require recognition of the specific scheme; if the scheme colon is immediately followed by one or more slashes then it's an URL, otherwise it is not. But this is off-topic. (Well, ok, *technically* the distinction is even more vague than this, and it's possible to have an 'URL' that doesn't lead with slashes -- it just needs to semantically represent a 'location' rather than a 'name/id'. But for practical purposes the above rule suffices, especially in the context of an URI parser, since it must treat the content as opaque if it does not have leading slashes.)
And on the other hand, http scheme is usually associated with the idea of URL. But e.g. XML namespaces are identified by URIs, often those are http URIs, and there's absolutely no guarantee that the URI can be used to retrieve some document from the Internet.
They're still URLs, regardless of whether they resolve to a valid web resource or not. (They're also URIs, of course.)
On Wed, 8 Jun 2022, 03:07 Gavin Lambert via Boost wrote:
The parsing-level distinction between the two is generically obvious and does not require recognition of the specific scheme; if the scheme colon is immediately followed by one or more slashes then it's an URL, otherwise it is not. But this is off-topic.
This is wrong, there's absolutely no syntactic distinction between URIs and URLs. The distinction is in their purpose. Consult this section: https://datatracker.ietf.org/doc/html/rfc3986#section-1.1.3 The two slashes mean that URI contains the authority part. One slash means it doesn't. What it means for the URI is essentially decided by the scheme or the application authors.
And on the other hand, http scheme is usually associated with the idea of URL. But e.g. XML namespaces are identified by URIs, often those are http URIs, and there's absolutely no guarantee that the URI can be used to retrieve some document from the Internet.
They're still URLs, regardless of whether they resolve to a valid web resource or not. (They're also URIs, of course.)
How could they be URLs, when they aren't supposed to be used to Locate anything, and the defining characteristic of being a URL is that it describes how to get the thing?
On 9/06/2022 08:40, Дмитрий Архипов wrote:
The parsing-level distinction between the two is generically obvious and does not require recognition of the specific scheme; if the scheme colon is immediately followed by one or more slashes then it's an URL, otherwise it is not. But this is off-topic.
This is wrong, there's absolutely no syntactic distinction between URIs and URLs. The distinction is in their purpose. Consult this section:
Which I did clarify further down. I know this is the Internet, but you don't need to pick every little nit.
The two slashes mean that URI contains the authority part. One slash means it doesn't. What it means for the URI is essentially decided by the scheme or the application authors.
Containing any leading slashes means that the URI contains a path, which means that it represents a location, which automatically makes it an URL and not an URN. As I also said, while it is technically possible for a non-slash-leading URI to semantically represent a location and thus also be an URL, this is atypical and for most intents and purposes can and should be ignored as a useless distinction. Especially in the context of a generic parser.
How could they be URLs, when they aren't supposed to be used to Locate anything, and the defining characteristic of being a URL is that it describes how to get the thing?
Any URI that uses the http protocol is (by definition) an URL, because it is representing a location. It does not stop being a location just because there is no valid resource at the server claiming to be the authority for that location at any particular point in time (or if there is no such server); that just makes it an unreachable location.
On Wed, Jun 8, 2022 at 4:18 PM Gavin Lambert via Boost wrote:
...URI...URL...URN.
So.. have you seen the Boost.URL Reference Card? https://master.url.cpp.al/url/help_card.html Thanks
On 9/06/2022 12:28, Vinnie Falco wrote:
So.. have you seen the Boost.URL Reference Card?
That's a decent reference for URL parsing (although it's then odd that the method names state "uri", which is how this discussion started), but it's non-obvious from that how it parses non-URL URIs. From the RFC, I'm assuming you put the non-scheme component into the "path", but then the only valid operations are to get and set it as a single opaque unit, and it's *also* unclear from that reference card which of the path methods will do those things without trying to interpret its contents in some way. (I assume encoded_path/set_encoded_path.) (Side note: docs missing for https://master.url.cpp.al/url/ref/boost__urls__url/set_path.html )
On Wed, Jun 8, 2022 at 6:13 PM Gavin Lambert via Boost wrote:
That's a decent reference for URL parsing (although it's then odd that the method names state "uri", which is how this discussion started), but it's non-obvious from that how it parses non-URL URIs.
This is all explained in the docs: https://master.url.cpp.al/url/parsing/url.html (plus the soon to be merged documentation changes which came about from these discussions)
From the RFC, I'm assuming you put the non-scheme component into the "path", but then the only valid operations are to get and set it as a single opaque unit
Right, if you want to treat the path (or query) as a single unit then the corresponding members of url_view and url are used (those in the aqua and lavender boxes). However if you want to deal with the path as a const or modifiable range of segments then you use the containers returned by the functions encoded_segments() or segments().
and it's *also* unclear from that reference card which of the path methods will do those things without trying to interpret its contents in some way. (I assume encoded_path/set_encoded_path.)
Functions with the word "set" are mutating, otherwise they are not. Functions with the word "encoded" always return percent-encoded strings. This is how things are stored natively. We should add this somewhere to the docs if it is not already there.
The Reference Card assumes some familiarity with the library and consolidates the bulk of the APIs of the library into one page for convenience. I'm experimenting with some new documentation ideas.
Thanks
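To make the encoded versus decoded distinction concrete, here is a small standard-library sketch of percent-decoding. `pct_decode` and `hex_value` are hypothetical helper names, and this is not how Boost.URL implements it; it only illustrates what separates the string an "encoded" accessor would hand back from its decoded counterpart.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Value of one hexadecimal digit, or -1 if c is not a hex digit.
inline int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: replaces every "%XX" escape with the byte it names.
// Returns nullopt when an escape is truncated or not hexadecimal.
inline std::optional<std::string> pct_decode(std::string_view s)
{
    std::string out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        if (s[i] != '%') { out.push_back(s[i]); continue; }
        if (i + 2 >= s.size()) return std::nullopt;   // fewer than two chars follow '%'
        int hi = hex_value(s[i + 1]);
        int lo = hex_value(s[i + 2]);
        if (hi < 0 || lo < 0) return std::nullopt;    // not a valid escape
        out.push_back(static_cast<char>(hi * 16 + lo));
        i += 2;                                       // skip the two hex digits
    }
    return out;
}
```

For instance, the encoded path "/my%20file.txt" decodes to "/my file.txt", while the encoded form is what would be kept as-is in storage.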
On 9/06/2022 14:57, Vinnie Falco wrote:
Right, if you want to treat the path (or query) as a single unit then the corresponding members of url_view and url are used (those in the aqua and lavender boxes). However if you want to deal with the path as a const or modifiable range of segments then you use the containers returned by the functions encoded_segments() or segments().
The [encoded_]segments view is *technically* illegal for non-URL URIs (or at least non-sensible), though I suppose there's no particular harm if you just don't use it (or only use it if the data format happens to coincidentally be slash-separated, assuming it won't add stray leading slashes).
Functions with the word "set" are mutating, otherwise they are not. Functions with the word "encoded" always return percent-encoded strings. This is how things are stored natively. We should add this somewhere to the docs if it is not already there.
It would be nice to explicitly call out that the "encoded" form is the original unmodified text (and not, for example, parsed and then re-encoded). I've only really skimmed the docs at this point so I certainly might have overlooked that, but what I saw seemed to imply this but didn't really state it explicitly.
The Reference Card assumes some familiarity with the library and consolidates the bulk of the APIs of the library into one page for convenience. I'm experimenting with some new documentation ideas.
There's an example of an URN being parsed to scheme and path in the RFC; it might be nice to add that (or perhaps a mailto:) to your diagram for clarity. Although perhaps also including query and fragment parts too, since those are legal parts of the grammar.
You might find this link explains some of the mysteries of the UR* thingys ...URI...URL...URN... XXX P
-----Original Message-----
From: Boost On Behalf Of Vinnie Falco via Boost
Sent: 9 June 2022 01:28
To: boost@lists.boost.org List
Cc: Vinnie Falco; Gavin Lambert
Subject: Re: [boost] Boost.URL -- some notes
On Wed, Jun 8, 2022 at 4:18 PM Gavin Lambert via Boost wrote:
...URI...URL...URN.
So.. have you seen the Boost.URL Reference Card?
https://master.url.cpp.al/url/help_card.html
Thanks
In a message of Thursday, 9 June 2022, 02:18:07 MSK, Gavin Lambert via Boost wrote:
Containing any leading slashes means that the URI contains a path, which means that it represents a location, which automatically makes it an URL and not an URN.
Can you direct me to an IETF or a W3C document that says that? The only official text that I know of that distinguishes URIs and URLs is the section from URI RFC (already linked in this thread several times) and it explicitly says that the difference between a URI and a URL is not in syntax, but in intent.
Furthermore, the RFC calls "path" the part between (optional) "//" authority and (optional) "?" query. It explicitly says that path is required, but it can be empty. As such me@server.tld in mailto:me@server.tld is path. Is it a URL by your definition? Conversely, the path for http://example.com is empty. Is it still a URL?
Any URI that uses the http protocol is (by definition) an URL, because it is representing a location.
Again, can you direct me to a document that spells out this feature of the http scheme?
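The component split described in this message can be checked with the regular expression given in RFC 3986, Appendix B. The sketch below wraps that expression with std::regex; `uri_parts` and `split_uri` are hypothetical names chosen for illustration and are unrelated to Boost.URL. The expression only separates components and does not validate them.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical holder for the five generic components.
// The path is always present but may be empty; the others are optional.
struct uri_parts
{
    std::optional<std::string> scheme, authority, query, fragment;
    std::string path;
};

// Splits a URI reference with the regular expression from RFC 3986, Appendix B.
inline std::optional<uri_parts> split_uri(const std::string& s)
{
    static const std::regex re(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
    std::smatch m;
    if (!std::regex_match(s, m, re))
        return std::nullopt;                       // e.g. input containing a newline
    uri_parts p;
    if (m[1].matched) p.scheme = m[2].str();
    if (m[3].matched) p.authority = m[4].str();    // set only when "//" is present
    p.path = m[5].str();
    if (m[6].matched) p.query = m[7].str();
    if (m[8].matched) p.fragment = m[9].str();
    return p;
}
```

With this split, mailto:me@server.tld has no authority and the path me@server.tld, while http://example.com has the authority example.com and an empty path, which are the two cases raised above.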
Andrzej Krzemienski wrote:
Next, we read:
Aliases for standard types, such as string_view
https://master.url.cpp.al/url/ref/boost__urls__string_view.html, use their Boost equivalents.
After reading this, I expected that Boost.URL would use boost::string_view from Boost.Utility library: https://www.boost.org/doc/libs/1_79_0/libs/utility/doc/html/utility/utilitie... tring_view.html
But instead, it uses boost::core::string_view, which is an implementation detail from Boost.Core library: https://github.com/CPPAlliance/url/blob/master/include/boost/url/string_view.hpp
Again, this is news for me that Boost has two implementations of string_view. Why? Second, I do not think that Boost.URL should rely on the implementation details of Boost.Core. A better alternative would be to use the official boost::string_view from Boost.Utility. Or is there a good reason not to?
boost::core::string_view is convertible from/to std::string_view, which allows user code to pass std::string_view to a library's functions and store their return values into a std::string_view. This is a feature users request _very_ often nowadays since C++17 is common.
We have proposed that these conversions be added to boost::string_view, but there were concerns that they won't work well in practice because of creating a mess of ambiguities. Which they kind of do, but it can be made to work.
So boost::core::string_view is that boost::string_view, we have it because we can now try using it in library APIs and see if it's worth the bother. (So far it seems to be.)
Where we go from here is undetermined yet. Maybe we integrate it into boost::string_view and retire the Core one, maybe we undelete the documentation of the Core one and make it official. I'd prefer to see it used in a released Boost library or two first, for a release or two. If it works, we decide what to do with it.
There's been a discussion here about it: https://lists.boost.org/archives/boost/2021/10/252092.php
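The interoperability described here can be illustrated with a toy view type that converts implicitly in both directions. This is a minimal C++17 sketch, assuming nothing about the real boost::core::string_view beyond what the message says; `lib_string_view` and `scheme_of` are hypothetical names.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a library-provided view type.
// It is NOT boost::core::string_view; it only mimics the two conversions.
class lib_string_view
{
public:
    constexpr lib_string_view() noexcept = default;

    // Implicit conversion FROM std::string_view (callers can pass one directly).
    constexpr lib_string_view(std::string_view sv) noexcept
        : p_(sv.data()), n_(sv.size()) {}

    // Implicit conversion TO std::string_view (callers can store results in one).
    constexpr operator std::string_view() const noexcept
    {
        return std::string_view(p_, n_);
    }

    constexpr std::size_t size() const noexcept { return n_; }

private:
    const char* p_ = nullptr;
    std::size_t n_ = 0;
};

// Hypothetical library function whose interface uses only the library's type.
// Returns the text before the first ':' or an empty view if there is none.
inline lib_string_view scheme_of(lib_string_view s)
{
    std::string_view sv = s;
    std::size_t pos = sv.find(':');
    if (pos == std::string_view::npos)
        return lib_string_view();
    return lib_string_view(sv.substr(0, pos));
}
```

User code can then write `std::string_view out = scheme_of(in);` with `in` being a std::string_view, without mentioning the library's type at all. Passing a string literal directly would need two user-defined conversions and does not compile with this toy type, which hints at the kind of friction a production-quality type has to smooth over with extra constructors.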
Hi Andrzej, Thanks for reviewing the library.
I recommend that Boost.URL docs say that it requires Boost 1.78 or higher.
Definitely. I'll look at the issue in the next days. https://github.com/CPPAlliance/url/issues/184
A better alternative would be to use the official boost::string_view from Boost.Utility. Or is there a good reason not to?
As others have noted, core::string_view is convertible to std::string_view, which is becoming more and more important. A string_view that is not convertible to std::string_view is problematic. Others have already shared some relevant links.
Now, the name `parse_uri` implies that it will recognize any URI,
It does. URLs and URIs have the same fields. The distinction is only relevant for URNs, which would have some subcomponents we don't consider.
but on the other hand it is impossible that the result will fit into a url_view, because not every URI is an URL.
This is possible because the url_view has all the necessary fields. Maybe for the same reason, the distinction between URL and URI is becoming more and more pointless. For instance, JavaScript calls everything a URL.
The synopsis for parse_uri ( https://master.url.cpp.al/url/ref/boost__urls__parse_uri.html) says:
Exception safety: throws nothing.
And the line below it says that the function throws std::length_error when the input is too long. It looks like a bug in specs.
Definitely. https://github.com/CPPAlliance/url/issues/185
When can a parsing be non-successful? Is it only because it was not conformant to the grammar?
Yes.
The synopsis says "This function parses a string according to the URI grammar below", but is it a URI grammar or a URL grammar actually?
We should probably try to better explain the difference between URI, URL, and URNs in the docs. There's some content but it's probably not enough.
This is naturally confusing because people use URI and URL interchangeably. But then they see URL is a subset of URIs and assume a URL cannot represent any URI. But this is incorrect, and it's precisely the reason people use URI and URL interchangeably.
In fact, the distinction between absolute-URI, relative-ref, URI, and URI-reference is much more relevant. The distinction between URLs and URIs is not that relevant because a URL has all fields required by a URI. Only URNs consider some URI subcomponents to represent extra fields.
So the class is called URL because that's what everyone calls it. And all algorithms are called parse_<component>, where <component> is exactly the name as it happens in the grammar. Thus, we have parse_absolute_uri, parse_relative_ref, parse_uri, and parse_uri_reference, which is what the spec calls them.
That is, any other reason for not being successful (if any resources needed to be allocated and failed) may still be reported via exceptions.
These algorithms don't allocate memory.
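As an illustration of a parser that reports grammar failures through its return value and neither throws nor allocates, here is a standard-library sketch for the RFC 3986 scheme production only. `sketch_result` and `parse_scheme` are hypothetical names; the variant merely stands in for a result type and is not boost::system::result.

```cpp
#include <cctype>
#include <string_view>
#include <system_error>
#include <variant>

// Hypothetical stand-in for a result type: either a value or an error code.
template <class T>
using sketch_result = std::variant<T, std::error_code>;

// scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, section 3.1)
// Returns a view into the input on success, an error code otherwise.
// Nothing is allocated and nothing is thrown.
inline sketch_result<std::string_view> parse_scheme(std::string_view s) noexcept
{
    const std::error_code bad = std::make_error_code(std::errc::invalid_argument);
    if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0])))
        return bad;                                  // must start with a letter
    for (char c : s.substr(1))
    {
        const auto u = static_cast<unsigned char>(c);
        if (!std::isalnum(u) && c != '+' && c != '-' && c != '.')
            return bad;                              // character outside the grammar
    }
    return s;                                        // whole input is a valid scheme
}
```

Because the successful alternative is just a view into the caller's buffer, the only failure mode is non-conformance to the grammar, which matches the "Yes" answer given earlier in this reply.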
Now, there is probably a good explanation to the URI vs URL discrepancy. I think it would be good if it was placed in the docs, so that the users don't get confused.
There are some mentions of that in the docs, but we could create a section to discuss the distinction between them more explicitly and provide examples.
Regards, &rzej;
Thanks again! -- Alan Freitas https://github.com/alandefreitas
participants (11)
- Alan de Freitas
- Andrzej Krzemienski
- Gavin Lambert
- Glen Fernandes
- Julien Blanc
- Marshall Clow
- Mateusz Loskot
- pbristow@hetp.u-net.com
- Peter Dimov
- Vinnie Falco
- Дмитрий Архипов