[beast] Formal review

I am going to focus exclusively on the HTTP API in the review below because that needs the most attention. Furthermore, to keep this review within reasonable bounds, I am omitting issues already reported by others.

I. DESIGN
---------

Beast offers a well-designed WebSocket API along with a minimal and deficient HTTP API. Beast claims to offer low-level building blocks that can be used to produce higher-level HTTP server and client libraries. The low-level approach makes Beast minimal. I do not have a problem with a minimal API, other than that it diminishes the usefulness of the library. Although this can be solved by adding more functionality over time, if I were to evaluate Beast at its current feature level, I would only choose it if I needed WebSocket. This, however, is not a precondition for library acceptance.

But the low-level approach also means the abstractions must be right. The abstractions that Beast currently offers are deficient, which is best shown by the poor support for HTTP chunking. Chunking can and should be represented more naturally by an HTTP API.

HTTP chunking splits a message into several parts, where each part can contain metadata (chunk extensions) and data (the chunk). The metadata is a set of key-value pairs, just like the HTTP headers. During the review period I have described several use cases for HTTP chunking that do require metadata (e.g. radio metadata), as well as the preservation of chunk boundaries (e.g. event-based publish-subscribe channels).

Conceptually, a normal HTTP message has the following structure:

  message = start-line metadata* body?

A chunked HTTP message can be described in a similar manner as:

  message = start-line metadata* ( metadata* chunk? )*

The Beast message container is designed for the structure of the normal HTTP message, and assumes that chunks are appended to the body. The chunk metadata does not fit into this model, and therefore Beast discards it by default. The repetitive structure of chunked messages is not accommodated by the Beast message container. This means that we have to write our own chunk decorator to generate chunk metadata, and our own custom parser to receive chunk metadata.

The chunk decorator inserts metadata via a callback mechanism. Its design is poor for two reasons. First, the user must keep track of context to ensure that the correct metadata is applied to the correct chunks. Second, the user must write the metadata in correct HTTP syntax. The latter may sound trivial, but does anybody here know if " ;key=value\r\n" is correctly formatted? (notice the initial whitespace.) The answer is that it is correct according to RFC 2616, incorrect according to RFC 7230, and correct according to the RFC 7230 errata. The best approach for handling this situation is to never insert whitespace, but to always accept whitespace from others. Users should not have to be bothered with such detail.

Receiving chunk metadata rather than having it discarded by default requires a custom parser. The Beast message parser is a push parser (SAX) where callbacks are invoked for each parsed message part. This means that the user must keep track of which metadata belongs to which chunk. The user must also parse the metadata manually. Beast does not even check if the syntax is correct. The custom parser is also necessary if the user needs to receive individual chunks rather than having them appended to the body. I would have expected that a low-level API returns the next message part as soon as it is identified. The current collect-entire-message design is a higher-level design.

In summary:

1. Beast should not discard anything by default, but should faithfully forward all information received.
2. Beast should handle all low-level syntax and not leave parts of it to the user.
3. Beast should not break the end-to-end principle unless allowed by the user.

As a way forward I am going to sketch another simple message model that does encompass HTTP chunking. Conceptually it looks like this (I am going to ignore that, say, metadata may require different syntax depending on context):

  message = start-line section*
  section = metadata* body-part?

Building an API around this model should offer read/write functions that handle partial bodies (whether chunked or progressively downloaded). The current collect-entire-message API could be built on top of that. With this approach there is no need to mess around with custom parsers. This design does require state information to be able to select the proper syntax for both metadata and data.

On another topic, the HTTP status codes can be roughly categorized as instructions (the 1xx and 2xx series) and errors (the 4xx and 5xx series). I recommend that you create two error categories and pass status codes as error_code (a sketch follows at the end of this review). Notice that I have no comment about whether or not they should be stored in the message container. What I really want to achieve is to enable the user to easily propagate HTTP errors in the same way that non-HTTP errors (including beast::http::error) are propagated.

II. IMPLEMENTATION
------------------

I read the implementation to understand the design, so I did not look for QA issues. My overall impression is that much of the code is well-written, but there are parts that are convoluted and difficult to verify. There are too many goto statements for my taste; I do not mind the goto pattern used in the parser, but the state machines in read_some_op and serializer look more like micro-optimization. Furthermore, "case do_body + 1:" looks like an awful hack.

The dynamic buffers are useful even by themselves.

The code contains several foreign files with different licensing conditions:

 * include/beast/core/detail/base64.hpp
 * include/beast/zlib/zlib.hpp
 * include/beast/websocket/detail/utf8_checker.hpp
 * include/beast/http/detail/basic_parser.hpp

All these include both the BSL and the original license. This should be checked by Boost legal counsel.

III. DOCUMENTATION
------------------

The documentation is generally well-written (and it looks like the DynamicBuffer section got a little help from the Networking TS.) The Reference is missing some concepts, e.g. SyncReadStream.

IV. NAMING
----------

The library name must change. "Boost.Beast" sounds like a bad joke. I do not have strong opinions about the new name.

V. VERDICT
----------

I recommend to REJECT the library because the HTTP API provides the wrong low-level abstractions as explained in the design section above.
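Returning to the error_code recommendation above, a minimal sketch of one of the two proposed categories might look like this (illustrative names, built on Boost.System; this is not an existing Beast API):

#include <boost/system/error_code.hpp>
#include <string>

// Hypothetical category for the 4xx/5xx series; a second category
// would cover the 1xx/2xx "instruction" series.
class http_status_category : public boost::system::error_category
{
public:
    const char* name() const noexcept override { return "http.status"; }
    std::string message(int ev) const override
    {
        switch (ev) {
        case 403: return "Forbidden";
        case 404: return "Not Found";
        case 500: return "Internal Server Error";
        default:  return "HTTP status " + std::to_string(ev);
        }
    }
};

inline const boost::system::error_category& http_status_cat()
{
    static http_status_category cat;
    return cat;
}

// A 4xx/5xx status received from the peer becomes an error_code that
// propagates through the same channels as beast::http::error values:
inline boost::system::error_code make_status_error(int status)
{
    return boost::system::error_code(status, http_status_cat());
}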

On Sun, Jul 9, 2017 at 1:57 PM, Bjorn Reese via Boost
...
I'll state for the record that I reached out to you and the mailing list a few times since Beast was announced in June 2016, offering any interested parties the opportunity to participate in the design of the library. No one showed up.

You are correct that Beast could offer more control over the chunking and better facilities for interacting with extensions. This was by choice; I did not invest heavily in that part of the library because it is a feature unlikely to see significant use in the near term. That is not to say that I won't address it in the future. As others can attest, I am very responsive to issues and timely with providing updates and improvements to the library. So it is not a stretch of the imagination to think that all of your points will be resolved.

So that no one reading Bjorn's review has any misunderstandings: Beast DOES decode chunked message bodies correctly and provides the decoded payload to the caller, and can do so incrementally.

What I find interesting is the decision to focus the review on just one aspect of the library, the handling of chunks for niche use cases. Bjorn, you are obviously accomplished and knowledgeable with respect to HTTP and implementations of said protocol, and now you know some Beast, so I pose the question: can you offer any concrete solutions on how Beast's interfaces may be adjusted (with a minimum of disturbance to existing use) in order to achieve the results you desire? Thanks

It is quite usual for really useful feedback to not arrive until a library is prepared for formal review. Your users before a review are by definition already fans, and won't give you really useful feedback. It takes a lot of effort to give high quality feedback, and many of us can't spare the time until we know the thing we are reviewing is supposed to be finished. I know that is frustrating, but it is what it is. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

On 10-07-17 00:13, Vinnie Falco via Boost wrote:
it is a feature unlikely to see significant use in the near term.
I don't think this is how a Boost design review can be driven. You get one chance to do the right thing, or you cripple the library either due to legacy stuff or due to continual breaking changes.
Seth

On Sun, Jul 9, 2017 at 3:13 PM, Vinnie Falco
For example, I'm doing a mock-up of what a solution to the chunk encoding problem you described might look like, and I have questions which would be simple for an expert like yourself to answer and save me a lot of time (and help create a better product). Here's an example:

* Should the new chunk interface require that the chunk exist in memory as a single ConstBufferSequence?
* Or is it necessary to support a system where a chunk can be sent using a smaller buffer, periodically refilled?

On Mon, Jul 10, 2017 at 5:09 PM, Vinnie Falco via Boost
You can use utf_traits: http://www.boost.org/doc/libs/1_64_0/libs/locale/doc/html/structboost_1_1loc... decode does the job. Artyom
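For reference, a validation loop along those lines might look like this (a sketch against boost::locale::utf; the function name is mine):

#include <boost/locale/utf.hpp>

bool is_valid_utf8(char const* p, char const* e)
{
    using namespace boost::locale::utf;
    while (p != e) {
        // decode() advances p and returns the code point,
        // or the sentinel values illegal / incomplete.
        code_point c = utf_traits<char>::decode(p, e);
        if (c == illegal || c == incomplete)
            return false;
    }
    return true;
}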

On Mon, Jul 10, 2017 at 7:47 AM, Artyom Beilis via Boost
The utf-8 validation of text websocket message payloads is a critical bottleneck. The best websocket implementations apply significant optimizations to this operation, recognizing that the vast majority of inputs are low-ASCII (no code point larger than one byte), for example JSON data. With collaboration from Ripple employees, the code in Beast was developed as a high performance utf8-validation function. Preliminary tests indicate that switching to Boost.Locale away from Beast's well optimized and well tested utf8 validation function would incur a 67,400% performance penalty. I hope you understand that I might be reluctant to switch.

Benchmark results:

beast.benchmarks.utf8_checker
beast:  2,016,637,738 char/s
beast:  1,921,062,599 char/s
beast:  1,939,159,018 char/s
locale:     3,053,539 char/s
locale:     2,989,265 char/s
locale:     3,060,962 char/s
Longest suite times:
17.8s beast.benchmarks.utf8_checker
17.8s, 1 suite, 1 case, 1 test total, 0 failures
The program '[75300] benchmarks.exe' has exited with code 0 (0x0).

Code: https://github.com/vinniefalco/Beast/commit/3df7de8ce2e8f797722118b9d7512662...

On Mon, Jul 10, 2017 at 8:15 AM, Vinnie Falco
With collaboration from Ripple employees, the code in Beast was developed as a high performance utf8-validation function.
Apologies, I forgot to credit Peter Dimov who improved it further just before the Beast formal review. Thanks

Vinnie Falco wrote:
The utf-8 validation of text websocket message payloads is a critical bottleneck.
I thought I'd have a look at that. Your code has a fast-path that
attempts to check 8 bytes at a time by masking with 0x8080808080808080.
I agree that that is probably a useful optimisation which a compiler is
unlikely to apply by itself. Your implementation
(
https://github.com/vinniefalco/Beast/blob/6f88f0182d461e9cabe55032f966c9ca4f...
) does this:
if((*reinterpret_cast<std::size_t const*>(in) & mask) != 0)

On 07/10/17 19:57, Phil Endecott via Boost wrote:
I wouldn't be so quick to agree. Modern compilers tend to be able to vectorize simple loops quite efficiently.
C++ allows chars and unsigned chars (which std::uint8_t presumably is) to alias other types, so unless the compiler is able to prove that the pointers refer to an array of something other than std::size_t, it should not apply optimizations based on strict aliasing rules. The code above performs pointer alignment, which strictly speaking has unspecified behavior because it assumes pointer representation. It would be cleaner to use std::align for that.

BTW, the fast loop ends on the first byte >= 0x80 and then the validation continues in the slow loop until the end of input. It might be better to resume the fast loop if validation of the "vector" passes. Otherwise, for example, an early comment in the input could ruin the performance. (I don't know how probable such a case would be.)

Andrey Semashev wrote:
Not this one (I've tried it).
My understanding is that you can reinterpret_cast FROM other types TO chars, but not FROM chars TO other types. But there are surely people here who understand this stuff better than me.
Yes, I also noticed that; quite a common case in en_GB would be to have mostly-ASCII with occasional currency symbols.

Vinnie Falco wrote:
Wrong question. It's likely that it works on current compilers on current platforms. But if I'm right that it is undefined behaviour then it might stop working with some future compiler update somewhere.
Yes, I believe that's the right thing to do.
Regarding the relative performance of different UTF-8 checking code:
I've done a quick test by hacking some of my own UTF-8 decoding code to
just validate, and I see a factor of about 4 improvement between the
byte-at-a-time and an optimised version that checks the top bits of 8 bytes
at a time. The code is below. I measure a throughput of about 1.6 GB/second
for ASCII-only input, on my ARMv8 test system. I'd be interested to hear how
that compares to the other implementations; a very quick test suggests that
it might be a bit quicker than the Beast code but I'd like someone else to
confirm or deny. (It might also be wrong - I've not tested it with non-ASCII
input.)
Note that this is only about 50 lines of code; Beast's utf8_checker.hpp is
maybe 5 times as long. Code follows.
Regards, Phil.
template <typename ITER>
bool is_valid_utf8(ITER i, ITER end)
{
// Check if range is valid and complete UTF-8.
// (Maybe incomplete input, i.e. ending in the middle of a multi-byte character, needs
// to be reported differently?)
while (i != end) {
// If i is suitably aligned, do a fast word-at-a-time check for ASCII characters.
// FIXME this only works if ITER is a contiguous iterator; it needs a "static if".
const char* p = &(*i);
const char* e = p + (end-i); // I don't think &(*end) is allowed because it appears to dereference end.
unsigned long int w; // Should be 32 bits on 32-bit processor and 64 bits on 64-bit processor.
if (reinterpret_cast
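A self-contained sketch of the same word-at-a-time idea (this is an illustration of the technique, not Phil's original code; it also resumes the fast loop after each multi-byte character, as Andrey suggested earlier):

#include <cstdint>
#include <cstring>

// Validate [p, e) as UTF-8, using a word-at-a-time fast path for
// runs of low-ASCII bytes.
inline bool is_valid_utf8_sketch(const char* p, const char* e)
{
    typedef std::uintptr_t word;                  // register-sized on common platforms
    const word mask = static_cast<word>(0x8080808080808080ull);

    while (p != e) {
        // Fast path: while p is word-aligned and a full word remains,
        // test the top bit of every byte at once; all zero means ASCII.
        while (reinterpret_cast<std::uintptr_t>(p) % sizeof(word) == 0
               && static_cast<std::size_t>(e - p) >= sizeof(word)) {
            word w;
            std::memcpy(&w, p, sizeof(w));        // sidesteps the aliasing debate
            if (w & mask)
                break;                            // a non-ASCII byte is in this word
            p += sizeof(word);
        }
        if (p == e)
            break;
        unsigned char b = static_cast<unsigned char>(*p++);
        if (b < 0x80)
            continue;                             // single-byte character
        int n;                                    // number of continuation bytes
        if      ((b & 0xE0) == 0xC0) n = 1;       // 110xxxxx: 2-byte character
        else if ((b & 0xF0) == 0xE0) n = 2;       // 1110xxxx: 3-byte character
        else if ((b & 0xF8) == 0xF0) n = 3;       // 11110xxx: 4-byte character
        else return false;                        // stray continuation or invalid lead
        while (n--) {
            if (p == e)
                return false;                     // truncated character
            if ((static_cast<unsigned char>(*p++) & 0xC0) != 0x80)
                return false;                     // not a 10xxxxxx continuation
        }
        // Like the code above, this does not reject overlong encodings or
        // surrogates; see the follow-up messages for that caveat.
    }
    return true;
}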

On Mon, Jul 10, 2017 at 5:34 PM, Phil Endecott via Boost
That hurts 32-bit ARM. So I am faced with a choice: penalize an existing platform today to benefit some possible future platform? Hmmmm......let me think about that.... ...probably not such a good idea! And the 32-bit ARM users might revolt at such a change. I've heard from a couple of users running Beast on constrained hardware.
Note that this is only about 50 lines of code; Beast's utf8_checker.hpp is maybe 5 times as long. Code follows.
Nice, this is great! I like where you are headed with your function and thanks for investing the time to write it. Perhaps the Beast utf8 validator could be improved; there's nothing more satisfying than removing lines of code!

There's just an eensy teensy problem: the Beast validator is an "online" algorithm. It works with chunks of the entire input sequence at a time, sequentially, so there could be a code point that is split across a buffer boundary. All of that extra code you see in Beast is designed to handle that case, by saving the bytes at the end for when it gets called again (after the validator returns it will never see that buffer again). I admit that there is a surprisingly large amount of code required just to handle this case. The good news is that those extra lines only execute in the special case where the code point is split. The bulk of the loop works on the parts of the buffer where code points can't possibly be split. And the unit test is exhaustive: it tries all possible code points and split positions.

But who knows? I have never claimed to be a great coder, I consider myself average at best, so it's entirely possible that this could all be done in far fewer lines. Maybe you can update your function to handle this case? I am always happy to accept improvements into Beast. You might need to turn the function into a class to save those bytes at the end. Thanks

Vinnie Falco wrote:
I think that's an issue with whatever compiler you're using, not the architecture; I've just done a quick test with arm-linux-gnueabihf-g++-6 6.3.0 and I get about a 5% speedup by using memcpy.
Yes, I did notice that but it wasn't clear that it was actually being used.
I admit that there is surprisingly large amount of code required just to handle this case.
The following code is totally untested.
template <typename ITER>
bool is_valid_utf8(ITER i, ITER end, uint8_t& pending)
{
// Check if range is valid and complete UTF-8.
// pending is used to carry state about an incomplete multi-byte character
// from one call to the next. It should be zero initially and is zero on return if
// the input is not mid-character. After submitting the last chunk the caller
// should check both the return value and pending==0.
// Skip bytes pending from last buffer.
// The number of 1s at the most significant end of the first byte of a multi-byte
// character indicates the total number of bytes in the character. pending is
// this byte, shifted to allow for the number of bytes already seen.
while (pending & 0x80) {
uint8_t b = *i++;
pending = pending<<1;
if ((b & 0xc0) != 0x80) return false; // Must be a 10xxxxxx continuation byte.
if (i == end) return true;
}
pending = 0;
while (i != end) {
// If i is suitably aligned, do a fast word-at-a-time check for ASCII characters.
// FIXME this only works if ITER is a contiguous iterator; it needs a "static if".
const char* p = &(*i);
const char* e = p + (end-i); // I don't think &(*end) is allowed because it appears to dereference end.
unsigned long int w; // Should be 32 bits on 32-bit processor and 64 bits on 64-bit processor.
if (reinterpret_cast

Phil Endecott wrote:
No, it's an issue with ARM32 not allowing unaligned loads. The memcpy code, at best, uses four single-byte loads instead of one 32-bit load. __builtin_assume_aligned doesn't help. https://godbolt.org/g/iC9X35 I suspect that all architectures that don't have unaligned loads will suffer similarly. Myself, I'd go with the reinterpret_cast for the time being. It's indeed _technically UB_, but it works.

On 11/07/2017 14:49, Peter Dimov via Boost wrote:
Be aware that some recent ARM CPUs no longer penalise unaligned loads. I vaguely remember that the Cortex A15 mostly does not (except sometimes); the Cortex A57 definitely does not. This may explain confounding results: a Cortex A9 most definitely punishes unaligned loads badly, as do any of the lower-end ARM CPUs, even very new ones. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Peter Dimov wrote:
That link only seems to redirect to about:blank for me, for some reason.
For this code:
unsigned long f(const uint8_t* p)
{
#if 1
    unsigned long w;
    memcpy(&w, p, sizeof(w));
    return w;
#else
    return *reinterpret_cast<const unsigned long*>(p);
#endif
}
Myself, I'd go with the reinterpret_cast for the time being. It's indeed _technically UB_, but it works.
Well, whatever. This is the "Beast" review. My opinion is still that the code is over-optimised, and that this has made it more likely to be buggy and insecure - and the performance benefits may not be as great as claimed. Regards, Phil.

On Tue, Jul 11, 2017 at 6:40 AM, Phil Endecott via Boost
Ah, I see what you did - very nice! This is almost the same as the current version but smaller. I like the way you run the code-point-at-a-time loop until reaching alignment, and then run the mask loop. This is a better way of organizing the code, thanks. I've imported the code into Beast to plug it into the tests and benchmarks, but it fails the tests: https://github.com/vinniefalco/Beast/tree/utf8 https://github.com/vinniefalco/Beast/commit/29cd9d6a9e1e7217798be06ded299ffa... I tried to figure out what's wrong with the routine but my grasp of utf8 is not quite as good as yours. Thanks

On Tue, Jul 11, 2017 at 10:21 AM, Phil Endecott via Boost
I was hoping one of those links would tell me what exactly failed....
https://travis-ci.org/vinniefalco/Beast/jobs/252431572#L6391 e.g. https://github.com/vinniefalco/Beast/blob/29cd9d6a9e1e7217798be06ded299ffac5...

Vinnie Falco wrote:
OK, I think that is an "overlong encoding" issue, where the UTF-8 would decode to a codepoint that should be represented by a shorter sequence. The UTF-8 decoder that I used as a starting point for this happily accepts such input. There are probably other issues. For example, I don't think the "skip pending" loop does the right thing when the buffer ends and the pending bytes end at the same time. I wasn't expecting you to actually try to use this code! Regards, Phil.

On 07/11/17 03:34, Phil Endecott via Boost wrote:
You can cast both ways. The casted-from-char pointer can be used as long as the underlying object, in the C++ object model, matches the pointed-to type. In other words, this is fine:

std::size_t* p1 = new std::size_t[10];
char* p2 = reinterpret_cast< char* >(p1);
std::size_t* p3 = reinterpret_cast< std::size_t* >(p2);
assert(p1 == p3);
p3[0] = 10; // ok

In Beast's case we (and the compiler) cannot tell whether the pointers refer to std::size_t objects. For the compiler that means it has to assume the objects exist and thus prevent any optimizations that would contradict it. The cases where the compiler cannot tell whether the object exists or not are kind of a grey area in the standard because it doesn't clarify that, but in practice it must work. Otherwise it would be the end of the world, because we use C APIs everywhere and C cannot create a C++ object currently. There were multiple discussions about this on the std mailing lists.

Note that when you _can_ tell there is an object, and the type of the object is different from the pointee, then this is still UB:

double* p1 = new double[10];
char* p2 = reinterpret_cast< char* >(p1);
std::size_t* p3 = reinterpret_cast< std::size_t* >(p2);
p3[0] = 10; // UB

On 11/07/2017 21:30, Andrey Semashev wrote:
I'm not an expert, but I believe that strictly according to the standard it is UB to cast from a char* to any other pointer type, regardless. (It's also important to note that the standard only permits casting to "char*" -- not "unsigned char*" or "uint8_t*".) Having said that, as long as you are careful to only do so with POD types (or at least types that cannot have a vtable) and only when the alignment requirements of the type are met, then most compilers should probably let you get away with it, because it's far too useful functionality to ban outright. In the example above, since you're starting with a "real" type and then casting to char* and back you're guaranteed to meet the alignment requirements, so it should be reasonably safe. You can still run afoul of strict aliasing if you pass both pointers to other methods, however (but passing only one should be safe). Hopefully the compiler is smart enough to notice that they're aliased as long as you stay within the same method.

On Wed, Jul 12, 2017 at 2:33 AM, Gavin Lambert via Boost
It's not, according to [expr.static.cast]/13 and [expr.reinterpret.cast]/7.
(It's also important to note that the standard only permits casting to "char*" -- not "unsigned char*" or "uint8_t*".)
As noted above, you can cast between any of these types. It is what you can do with the resulting pointers that is restricted. Character types (and std::byte since C++17 - which, BTW, has unsigned char as the base type) are special because storage and object representation are expressed with these types. See [intro.object]/3, [basic.types]/4. There are also multiple other places where the spec allows to obtain a pointer to storage expressed as a pointer to one of the character types or std::byte.
The nature of the type (POD or not) is not significant - all objects obey the same object model.
If you break strict aliasing rules you are in the UB land. The compiler is not required to notice anything - it simply follows the assumption that the rules are not violated. The end result is likely misbehaving code.

(snip)
(It's also important to note that the standard only permits casting to "char*" - - not "unsigned char*" or "uint8_t*".)
Are you sure that unsigned char isn't OK? I thought that, out of the char family, only signed char wasn't. (And uint8_t might theoretically be a problem, because an implementation doesn't necessarily have to implement uint8_t as a typedef of unsigned char.)
I didn't follow the discussion closely enough so I might be mistaken, but isn't this a function that takes a char* as input (from user code)? If that is so, using size_ts to read the chars could be an aliasing violation. Namely if the user thinks that he can pass e.g. a pointer to an int array, because the function takes a char pointer, and char may alias... Then he'd have one part of the application access a memory location with size_t, which was written by another part of the application using int. A real example that causes a real bug here would probably be very hard to construct. But I don't think this should be ignored.

Also I wouldn't feel very good about using memcpy for this and similar optimizations - I've seen it come out slow (= a real call to memcpy) too often in the past.

Since this is a common problem with a common optimization, maybe it's time to introduce aliasing_load and aliasing_store helper functions (e.g. in Boost.Utility)? That way any improvements (speed, reliability) would benefit all users and not just one library.

Or, a different idea: wouldn't a compiler fence like std::atomic_signal_fence before the aliasing read be enough? I'm almost sure that this doesn't make it OK with the standard, but I can't imagine what kind of optimization a compiler might possibly do to break things if there are fences before all aliasing loads/after all aliasing stores. (Granted: this is a bad argument. Maybe the standard should provide std::aliasing_fence xD).

Regards, Paul
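A minimal sketch of what such helpers might look like (the names follow the proposal above; the memcpy-based implementation is the conventional approach, and these are not existing Boost facilities):

#include <cstring>
#include <type_traits>

// Load a T from memory that may be aliased by objects of another type.
// memcpy has defined behaviour here, and compilers typically lower a
// fixed-size memcpy to a single load when the target supports it.
template <class T>
T aliasing_load(const void* address)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "aliasing_load requires a trivially copyable type");
    T value;
    std::memcpy(&value, address, sizeof(T));
    return value;
}

template <class T>
void aliasing_store(void* address, T value)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "aliasing_store requires a trivially copyable type");
    std::memcpy(address, &value, sizeof(T));
}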

On 07/12/17 11:19, Groke, Paul via Boost wrote:
That would be a problem, yes. Although the use case where a UTF-8 string is represented with an array of ints seems unlikely to me. I think the best solution would be to mark size_t with __attribute__((__may_alias__)) and use it instead of plain size_t.

#if defined(__GNUC__)
#define BEAST_MAY_ALIAS __attribute__((__may_alias__))
#else
#define BEAST_MAY_ALIAS
#endif

typedef std::size_t BEAST_MAY_ALIAS pack_t;

const pack_t* p = reinterpret_cast< const pack_t* >(string);

Not sure if that makes sense with regard to Beast, but in general ... the UTF-8 string could have been produced by a something-to-UTF-8 encoder, which might also use some kind of fast path utilizing a "pack" type for writing the UTF-8 bits.
Given the tools currently available in std/Boost, I agree. (BTW: someone recently proposed adding - among others - BOOST_MAY_ALIAS. Do you know if there has been any progress there?). I still think that some "nicer" way of doing this would be cool. And the more I think about it, the more I'm liking the "aliasing_barrier()" idea. Hmm...

On 07/12/17 13:18, Groke, Paul via Boost wrote:
BTW: someone recently proposed adding - among others - BOOST_MAY_ALIAS. Do you know if there has been any progress there?
Last I remember it was proposed as part of this PR: https://github.com/boostorg/config/pull/82 That PR was not accepted, but mostly due to many unrelated changes, some of which were more controversial than others. I don't remember any more recent attempts, but maybe I missed some. I agree that such a macro would be useful in Boost.Config, as I know this attribute is used in multiple places in Boost. I'll prepare a PR.
I still think that some "nicer" way of doing this would be cool. And the more I think about it, the more I'm liking the "aliasing_barrier()" idea. Hmm...
I don't think this approach is the way to go. First, because it would be complicated to use. You would have to guard _every_ context where the aliasing pointer is dereferenced with a pair of barriers, and in some contexts that is difficult to do (think operator-> or passing the pointed-to value as a function argument). Besides, it's just too easy to miss a place where the barrier is needed.

Second, the barrier prevents code reordering, possibly CSE and maybe other kinds of optimization, even if the code is unrelated to the guarded pointer.

Third, I'm not totally sure that the barrier would be effective in all cases anyway. If aliasing_barrier is implemented through atomic_signal_fence then the compiler may still assume that some private data like temporaries and values on the stack cannot be affected by a signal handler, and reorder some code across the barrier.

I think the attribute is the closest we can get. We could also provide a helper like this:

template< typename T >
T BOOST_MAY_ALIAS* alias(T*);

Although the caller still needs to know the type T BOOST_MAY_ALIAS* (unless he can use auto for that), so I'm not sure how useful that would be.

Cool, thanks :)
Hm. Maybe you're right. Have to think about this some more.
It could also be used like this:

T* ptr ...
for (...) {
    T value = *alias(ptr);
    ptr++;
}

Another possibility would be something like

template <class T>
T aliasing_read(void* address) {...}

template <class T>
void aliasing_write(void* address, prevent_deduction<T>::type value) {...}

which is what we use in our code. (I had to try very hard to resist the urge to simply call those "peek" and "poke".)

Andrey Semashev wrote:
I think the best solution would be to mark size_t with __attribute__((__may_alias__)) and use it instead of plain size_t.
No, the problem with this code is (realistically speaking) not aliasing, it's object lifetime. I don't think that any compiler is so strict as to break it though; not yet, and maybe not ever. (There's another source of _technically UB_, padding bits and trap representations, but we can also reasonably assume that they don't matter nowadays. Incidentally, unsigned char is the best type with which to access object representations, although all three narrow character types (and now std::byte) work.)

Isn't that kind-of the same thing? The aliasing rules more or less say that you must only use some char variant or a type that's compatible with the most derived type of an object at that location (which implies that the object is alive). Or did I completely misinterpret the standard/fail to remember correctly?
That should be easy to solve by excluding the optimization on platforms where CHAR_BIT != 8 or SIZE_MAX != 256 * sizeof(size_t).

Groke, Paul wrote:
Kind of. The undefined-ness comes from two separate places in the standard, so it's not the same in theory. And it's not the same in practice in general (although not here) because we don't want accesses to our fake size_t to alias everything in the universe, as this would cause spills and reloads and would defeat the optimization.

On Mon, Jul 10, 2017 at 9:57 AM, Phil Endecott via Boost
So you're reinterpret_casting a char* to a size_t* and then dereferencing it. Isn't that undefined behaviour?
Which platform do you think this doesn't work on? I have to apologize again for the earlier misleading comments. The benchmarks were measuring the wrong thing. The actual figure is more in line with 20,000%. The reinterpret_cast<> can be trivially changed to std::memcpy:

std::size_t temp;
std::memcpy(&temp, in, sizeof(temp));
if((temp & mask) != 0)

This hurts MSVC a bit, gcc/clang/MIPS not at all, but ARM takes quite a hit: https://godbolt.org/g/HFxxub (Thanks to Peter for working up the CE ARM example)

Updated benchmarks (run on MSVC with optimizations)

With reinterpret_cast:

beast.benchmarks.utf8_checker
beast:  1,977,629,044 char/s
beast:  1,474,907,121 char/s
beast:  1,930,979,173 char/s
beast:  1,899,078,149 char/s
beast:  1,893,740,588 char/s
locale:    81,966,125 char/s
locale:    82,200,658 char/s
locale:    81,802,070 char/s
locale:    82,103,369 char/s
locale:    81,802,149 char/s
Longest suite times:
2.7s beast.benchmarks.utf8_checker
2.7s, 1 suite, 1 case, 1 test total, 0 failures
The program '[79364] benchmarks.exe' has exited with code 0 (0x0).

With std::memcpy:

beast.benchmarks.utf8_checker
beast:  1,124,515,969 char/s
beast:  1,336,074,093 char/s
beast:  1,494,183,562 char/s
beast:  1,506,365,044 char/s
beast:  1,533,419,187 char/s
locale:    75,457,683 char/s
locale:    81,358,140 char/s
locale:    80,413,657 char/s
locale:    81,635,114 char/s
locale:    67,234,619 char/s
Longest suite times:
3.0s beast.benchmarks.utf8_checker
3.0s, 1 suite, 1 case, 1 test total, 0 failures
The program '[82220] benchmarks.exe' has exited with code 0 (0x0).

Interesting...
How did you get these numbers...
My small benchmark (see below)
Gives
For: https://en.wikipedia.org/wiki/War_and_Peace
Beast 0.0555949ms
Locale 0.110325ms
For: https://he.wikipedia.org/wiki/%D7%9E%D7%9C%D7%97%D7%9E%D7%94_%D7%95%D7%A9%D7...
Beast 0.0542079ms
Locale 0.0772768ms
For: https://zh.wikipedia.org/wiki/%E6%88%B0%E7%88%AD%E8%88%87%E5%92%8C%E5%B9%B3
Beast 0.0849873ms
Locale 0.0930336ms
using gcc 5.4.0 Ubuntu 64 bit.
I mean Beast is somewhat faster but by no means by the magnitude you show
Have you run your benchmark using release mode?
It looks for me you are running into premature optimization.
Artyom
----------------------------------
#include
}

On Mon, Jul 10, 2017 at 2:06 PM, Artyom Beilis via Boost
It looks for me you are running into premature optimization.
Hmm, no, I don't think so. Correct utf8 validation is a bottleneck for every websocket program; I have used a profiler, so this comes from measurement, not opinion. It looks like you are using source inputs that contain high-ASCII characters. In this case, Beast switches to the "slow" algorithm, which is similar to what Locale does. Try using an input file that consists only of low-ASCII characters. Your results are quite different from mine, even with std::memcpy:

beast:  1,124,515,969 char/s
beast:  1,336,074,093 char/s
beast:  1,494,183,562 char/s
beast:  1,506,365,044 char/s
beast:  1,533,419,187 char/s
locale:    75,457,683 char/s
locale:    81,358,140 char/s
locale:    80,413,657 char/s
locale:    81,635,114 char/s
locale:    67,234,619 char/s

Ubuntu VM:

beast.benchmarks.utf8_checker
beast:  2894806032 char/s
beast:  2874126708 char/s
beast:  2890616214 char/s
beast:  2017890885 char/s
beast:  2785087614 char/s
locale:  574731777 char/s
locale:  571439694 char/s
locale:  242245477 char/s
locale:  511534158 char/s
locale:  574121386 char/s

Travis https://travis-ci.org/vinniefalco/Beast/jobs/252155928#L1334

beast:  1155900653 char/s
beast:  1146058480 char/s
beast:  1162309551 char/s
beast:  1151093660 char/s
beast:  1159334387 char/s
locale:  218684840 char/s
locale:  220357048 char/s
locale:  208476005 char/s
locale:  224853783 char/s
locale:  209990002 char/s

On every machine I try, locale performs more poorly on all-low-ASCII inputs by at least a factor of 5. Code: https://github.com/vinniefalco/Beast/blob/da7946b6e5f8bda225ff122984e945b9e0...

On Sun, Jul 9, 2017 at 1:57 PM, Bjorn Reese via Boost
Beast already works this way. See: http://vinniefalco.github.io/stage/beast/review/beast/ref/beast__http__async... Callers can also provide their own fixed size buffers and read the message body progressively, chunked or otherwise, as shown below. This feature was in the review branch but the documentation page is new: http://vinniefalco.github.io/beast/beast/using_http/parser_stream_operations...
Consider:

read(stream, buffer, res, ec);

After reading a complete response message, does `ec` evaluate to `true` or `false`? How does one distinguish between, say, ECONNABORTED and "403 Forbidden"? Requests do not have a status, so any errors generated receiving them would be "real" errors. Responses would be treated differently under your scheme. Every call site would have to check "is this a real error?" I don't think this is a good idea.
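To illustrate the ambiguity, a sketch of the call site under the proposed scheme might look like this (using Beast's read, plus the hypothetical http_status_cat from the sketch at the end of Bjorn's review):

boost::system::error_code ec;
http::response<http::string_body> res;
http::read(stream, buffer, res, ec);

if (ec) {
    // Under the proposed scheme this branch fires for transport errors
    // (e.g. ECONNABORTED) *and* for 4xx/5xx statuses mapped to error_code,
    // so every call site must tell them apart:
    if (ec.category() == http_status_cat()) {
        // "HTTP error": a complete, well-formed response arrived
    } else {
        // real I/O error: the response may be incomplete
    }
}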
The Reference is missing some concepts, e.g. SyncReadStream.
Did you see this page? http://vinniefalco.github.io/stage/beast/review/beast/concept/streams.html SyncReadStream is a Boost.Asio concept. Thanks

On Sun, Jul 9, 2017 at 1:57 PM, Bjorn Reese via Boost
That's purely your opinion, and not a very insightful one at that, because break and goto are effectively the same thing, except that in Beast's more complex composed operation state machines: break; indicates that the state machine will run for another loop (transition to a new state), while goto upcall; is used to annotate that the composed operation is terminating. All composed operations must 1. eventually call the final handler, and 2. deallocate all temporary allocations before invoking the final handler (the "deallocation-before-invocation" guarantee). The rules for composed operations are strict, and the code is written in a style that minimizes the possibility of errors by expressing intent clearly. The subject is important enough that the documentation includes an entire section of helpful tips on how to avoid common mistakes when writing composed operations (bottom of page): http://vinniefalco.github.io/stage/beast/review/beast/using_networking/writi...
Furthermore, "case do_body + 1:" looks like an awful hack.
Again subjective, and indicative more of a less than thorough understanding of the implementation. Named case labels denote "subroutines" within the state machine while increments represent line numbers. This keeps the code organized and makes things easier when making changes. Quite the opposite of an awful hack: beast websocket composed operations have *intense*, comprehensive unit tests. They use coroutines and manually advance the io_service one completion at a time to set up the state of a client/server pair in order to exercise every code path. That code is here: https://github.com/vinniefalco/Beast/blob/review/test/websocket/stream.cpp#L... Beast websocket has a very special feature: it allows a pending asynchronous read to auto-respond to pings and close frames (which perform socket writes), even while an asynchronous write is pending. It also allows an asynchronous ping to be sent even while both an asynchronous write and an asynchronous read are pending! This is no trivial implementation (pending composed operations are "parked" in a separate object and resumed later). It's all very well tested with great coverage. Does this coverage look like an "awful hack"? https://codecov.io/gh/vinniefalco/Beast/src/review/include/beast/websocket/i... The mischaracterization of Beast code is probably an honest mistake on your part. Thanks
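For readers following along, a skeleton of the convention being described might look like the following (illustrative names, not Beast's actual code):

#include <boost/system/error_code.hpp>
#include <cstddef>
#include <functional>

using boost::system::error_code;

// Named case labels are "subroutines"; '+ 1' is the next "line"
// within one. break runs another state; goto upcall terminates
// and delivers the final handler.
struct read_op_sketch
{
    enum { do_header, do_body };               // do_body + 1 is also a state

    int state_ = do_header;
    std::function<void(error_code, std::size_t)> handler_;

    void operator()(error_code ec, std::size_t bytes_transferred)
    {
        for (;;) {
            switch (state_) {
            case do_header:
                if (ec)
                    goto upcall;               // terminating: deliver the error
                state_ = do_body;
                break;                         // loop again: next state

            case do_body:
                // ... initiate the body read here; in real code we would
                // return and re-enter from the completion handler ...
                state_ = do_body + 1;
                break;

            case do_body + 1:                  // next "line" of do_body
                goto upcall;                   // done: invoke the final handler
            }
        }
    upcall:
        handler_(ec, bytes_transferred);       // called exactly once
    }
};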

On Wed, Jul 12, 2017 at 5:40 PM, Edward Diener via Boost
Boost has two state machine libraries. Did you look at either of them for your uses ?
No. The thought of introducing a dependency on a library dedicated only to large scale state machines never crossed my mind. Have you seen the state machine in question? Here's a typical example: https://github.com/vinniefalco/Beast/blob/ceeb354424bb00474bbf713450b1d03de1... Thanks

On 7/12/2017 8:43 PM, Vinnie Falco via Boost wrote:
I want to make a general point: You can always use library X, even if the library encompasses much more than you need in your own library or application. I have never understood why software developers are afraid of creating a dependency on some other library when that library solves some software problem which makes some other software module easier to code or understand. I see no advantage in developers rolling their own solution when some other library has already solved the particular software problem in an elegant and useful manner.

I neither know nor can judge whether your particular solution is better/worse than using a ready-made library. I was just surprised when you said that the thought never crossed your mind, considering that you described your code as a state machine and Boost has two state machine libraries, one fairly low level and one at a higher level. But if you think using either library would have been overkill, that is understandable. Still, I myself would have at least investigated the other libraries, if indeed you are implementing a state machine in your code as you say.
Thanks

On Wed, Jul 12, 2017 at 8:15 PM, Edward Diener via Boost
I've used Boost.Statechart in the past, and the amount of additional infrastructure in terms of types, lines, and the required splitting of code into separate functions is just not worth it. The current implementation is straightforward if a bit unorthodox, but that is a consequence of the inversion of flow of control inherent in implementing composed asynchronous operations using callbacks. For this particular purpose, I think introducing another library's types and methodology is a net negative. Asynchronous programming is tricky, and it's even more tricky for a library to protect users from all possible mistakes when using callbacks (tricky as in, hardly possible at all). Even seasoned Asio programmers sometimes make mistakes and need to get a breakpoint in the implementation (of Asio, or Beast). When that happens I don't want them to see Boost.Statechart in the innards of a composed operation callback. Beast does use other parts of Boost, just not for state machines. It uses Boost.Intrusive for example (GREAT library!!) Thanks

On July 12, 2017 8:43:14 PM EDT, Vinnie Falco via Boost
Seeing that code, it appears that there's no need for goto at all. Instead, have that function call another and then do the upcall work when the other function returns. Where you use goto, you'd just return. (I viewed the code using my phone, so I may have missed something, but it certainly looked like goto upcall was quite unnecessary.) -- Rob (Sent from my portable computation device.)

On Sun, Jul 9, 2017 at 1:57 PM, Bjorn Reese via Boost
Hey buddy! I've got a pull request sitting in the queue which addresses the chunked output issues you raised; would you care to have a look before it gets merged? You sure do sound like an expert; I'm sure the ever-growing masses of Beast users would be very grateful if you vetted the design and weighed in as you so graciously did during the formal review! Here's the branch: https://github.com/vinniefalco/Beast/pull/649 Thanks!!!
participants (12)
- Andrey Semashev
- Artyom Beilis
- Bjorn Reese
- Edward Diener
- Gavin Lambert
- Groke, Paul
- Niall Douglas
- Peter Dimov
- Phil Endecott
- Rob Stewart
- Seth
- Vinnie Falco