On Wed, Oct 13, 2021 at 7:10 AM Phil Endecott via Boost
Have you looked at the RFC 3986 errata at https://www.rfc-editor.org/errata_search.php?rfc=3986&rec_status=0
ruh roh.. I have not gotten to that yet but yeah. I see problems :) Not difficult problems, but there are relevant errata that must be fixed before any review. Thanks!
? For example, I note that your rule for relative-part on the "Parsing" page doesn't seem to include the path-abempty alternative added in erratum 5428.
Thanks I need to update the javadocs and exposition. Although, I think the parsing is still correct despite not showing path-abempty. I will add more tests.
I would be interested to see how compile time, run time, and lines-of-code compare between your implementation, a Spirit parser, and a regexp.
Someone volunteered to do that so I might explore it at one point.
You also have the choice of (a) parsing and storing pointers to each component as I believe you do,
Right, the library keeps a small table of offsets to various parts. For the modifiable containers I plan to also keep a table for the segments and query parameters, to allow O(1) access of indexed elements (right now its linear). I think this is a good balance for accessing parts of the URL and helping speed up mutation operations.
(b) could be better if you're mostly just passing around complete URLs, storing them in containers, etc.
Well, in url the table adds overhead. If you don't want that overhead then... just store the URL string :) and parse it again when you need it, this is equivalent to your (b), and less impact on the library.
There's also the question of worst-case complexity; does your parser backtrack? If so, is there a pathological input that causes it to take exponential time?
I have done no profiling or optimization to speak of except some minor design work to facilitate the use of character-based SIMD algorithms later. I don't have a monolithic "parser" per-se, there are classes called "bnf elements" which can be used in the generic bnf::parse function. These elements call parse on child elements recursively. I do backtrack. For example, URI-reference is defined as: URI-reference = URI / relative-ref The class uri_reference_bnf implements this by first attempting to parse the scheme and colon, and if that fails then it backtracks and attempts parsing a "relative-ref." You can see this here: https://github.com/CPPAlliance/url/blob/680f6db547ffc3367e1ba93bae29e4c9278f... I'm not really sure how this can be improved, you need to know if there's a scheme and that can only be determined by scanning forward until reaching a colon or an invalid character for scheme. In theory this could scan the entire input. Thus I have not put any effort into making this better.
Can you add a complexity guarantee to the docs? (I posted about this in the context of regular expression libraries a few weeks ago.)
Maybe... I did it for JSON because performance is critical and the difference between a poorly performing library and a greatly performing library is massive. I'm not sure that URL parsing is the same. URLs tend to be short and it is sort of difficult to really get bad performance out of parsing them even if you go a character at a time. Unlike JSON you don't need to convert between base 2 and base 10 for numbers (which is a headache). Mutations always give the strong exception safety guarantee and it also guarantees only one memmove, I will probably document that. I can give you a better answer in the future when it is at the point where benchmarks are explored.
auto url = parse_url(str);
The problem is that url is a view of the string parameter, but nothing warns the user about that and...
Well, it is documented profusely. I agree there is some danger here but with great danger comes great power. Or something about great responsibility?
..now url is dangling. Am I right, or is there something that prevents this?
No you are right, in the statement below `uv` references the string and does not take ownership: url_view uv = parse_uri( "https://example.com/" ).value(); However, url is constructible from url_view and url DOES make a copy with ownership, so this statement also works and doesn't have the problem of dangling: url u = parse_uri( "https://example.com/" ).value();
I believe that functions that returns views should be named so that that is obvious, and in cases like this the function with the simplest name should be the safer version that returns a value-type.
These algorithms only return views, because there are a few different owning URL containers and they can all be constructed from the views. Overloading on return type doesn't really work and some containers may have template parameters or allocators which makes APIs to parse into them awkward. If we're debating the name though, it means we have accepted the general design of the library :)
Also I note that parse_uri() returns a result<> and doesn't throw. My preference would be to name that e.g. try_parse_uri() and for the simpler-named function to be a wrapper that returns a plain uri or throws on error. My rationale is to make the simplest-named functions provide the easiest-to-use functionality, so that *teaching C++ to a Javascript programmer is not so embarrassing*, while retaining the more performant API variants for those who need them.
Well the whole point of boost::system::result (the new type that Boost.URL uses) is to cut down on those extra names and overloads. Your suggestion would keep the additional complexity of result and also increase the number of overloads which isn't sounding like a clear cut win to me... although, this "result" return type is sort of new.
(Is it parse_uri() or parse_url()? I thought you had decided on URL.)
I use the term URL generally, but I use the precise wording when referring to grammar. "parse_uri" uses the URI grammar specified in rf3986, modulo any errata fixes which I have yet to apply.
"pct" doesn't immediately mean "percent" to me.
The grammar says "pct-encode", I followed that. I think there is sufficient use of the abbreviation that it is justified but I could be convinced otherwise if there is new data.
It doesn't look like your url type is comparable. I think it should be. The comparisons need to be careful about case-sensitivity.>
And it should be hashable.
You can track those issues here: https://github.com/CPPAlliance/url/issues/64 https://github.com/CPPAlliance/url/issues/65
I recently had to implement request signing for AWS. This requires canonicalisation of URLs. There are some subtleties in percent encoding: you need to be able to control exactly which characters are escaped (I think you allow this); you need to be able to control whether + is used to escape space (do you?)
Yes and yes.
and you need to be able to control whether the hex characters are upper or lower case (do you?).
But no to this... we can add it though, there's a "pct_encode_opts" structure: https://master.url.cpp.al/url/ref/boost__urls__pct_encode_opts.html
This code also needed to sort the query parameters by key; that raises the question of whether the query is a sequence or an associative container of key-value pairs, and if an associative container, what the comparison function is.
The library treats the query parameters as a ForwardRange (url_view) or RandomAccessRange (url) of key/value pairs, allowing for duplicate keys. The functions find/find_next can be used to visit all values with the same key. If you are using the "encoded" params container (by calling encoded_params()) then comparisons are bitwise. If you are using the "plain" params container (by calling params()) which decodes everything for you, then the comparison is made using notional percent-decoding. That is, your decoded key is compared against each container key as-if it had percent-decoding applied.
I would hope that the path would be similar to a std::filesystem::path i.e. it should use the same vocabulary where possible (or could even actually be a std::filesystem::path, on C++17).
OOF I hope never to see this. std::filesystem is kind of a mess. These are the path containers: https://master.url.cpp.al/url/ref/boost__urls__segments.html https://master.url.cpp.al/url/ref/boost__urls__segments_encoded.html https://master.url.cpp.al/url/ref/boost__urls__segments_view.html https://master.url.cpp.al/url/ref/boost__urls__segments_encoded_view.html "view" containers are read-only and reference the character buffer without ownership, these containers are obtained by calling functions on url_view. The other containers allow modification (the iterators are still read-only, because a proxy-assignment implementation proved to be clumsy) and these containers are obtained by calling functions on url.
Conversion between std::filesystem::path and file: URLs would be nice to have.
Yeah something like that is possible, falling back to boost::filesystem on C++11.
In your table 1.7, "Return the type of host specified, if any." should I think say "Return the host, if any."
hmm, yeah - the function "host_type()" is missing from that list (it returns the type of host).
Your percent-decoding methods return strings. I'm a bit surprised that you've not tried to return lazy decoding views, at least as an option.
Now that is a very interesting idea! I didn't think of it. But, its hard to see the use case, what would you do with a lazy view other than unroll it into a linear buffer? Thanks