[iostreams] problem with gzip_compressor
Hello.
I've been experiencing the following problem with the gzip_compressor filter: it always writes data down the stream, regardless of whether any data has been written to the filtering_stream.
Consider the following program:
--------------------------------------------------------
#include <iostream>
#include <fstream>
<snip>
--------------------------------------------------------
If compiled as "test", the following behavior is observed:
<snip>
That is, it always adds 8 bytes to the file, despite the fact that nothing was written to the stream. To top things off, the data inside the file isn't even recognizable by gzip:
<snip>
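The program itself was snipped above; a minimal sketch of the kind of program being described (the file name, open mode, and exact chain setup here are assumptions, since the original listing is not preserved) could look like this:
--------------------------------------------------------
#include <fstream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>

namespace io = boost::iostreams;

int main()
{
    // Open the sink in append mode so that repeated runs keep growing the file.
    std::ofstream file("test.gz", std::ios_base::app | std::ios_base::binary);

    io::filtering_ostream out;
    out.push(io::gzip_compressor());
    out.push(file);

    // Nothing is ever written to 'out'; the chain is closed automatically
    // when 'out' is destroyed at the end of main.
    return 0;
}
--------------------------------------------------------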
gzip_compressor works as follows: The first time you write data to it, it writes the gzip header information to the downstream Sink, and then writes the beginning of the compressed data. When the filter is closed, it writes any compressed data that has been buffered, plus the gzip footer, which consists of a checksum and the length of the uncompressed data.
In the above example, the filter is automatically closed at the end of main; this causes the gzip footer to be written. But since no data was ever compressed, the gzip header has never been written.
I guess this is a bug of some sort. What behavior would you expect in this case? It seems to me it would make the most sense to output data in the gzip format representing a 0-length file.
--
Jonathan Turkanis
www.kangaroologic.com
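For contrast, a minimal sketch of the write-then-close lifecycle described above (file name and data are arbitrary): once something has been written, closing the chain, here by letting it go out of scope, produces a complete gzip stream with header, compressed data, and footer.
--------------------------------------------------------
#include <fstream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>

namespace io = boost::iostreams;

int main()
{
    std::ofstream file("ok.gz", std::ios_base::binary);

    {
        io::filtering_ostream out;
        out.push(io::gzip_compressor());
        out.push(file);

        out << "some data\n";   // first write: the gzip header goes downstream
    }                           // chain destroyed: buffered data is flushed and
                                // the 8-byte footer (checksum + length) written

    return 0;                   // ok.gz can now be decompressed with gzip -d
}
--------------------------------------------------------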
On 10/30/2005 10:34 PM, Jonathan Turkanis wrote:
gzip_compressor works as follows: The first time you write data to it, it writes the gzip header information to the downstream Sink, and then writes the beginning of the compressed data. When the filter is closed, it writes any compressed data that has been buffered, plus the gzip footer, which consists of a checksum and the length of the uncompressed data.
In the above example, the filter is automatically closed at the end of main; this causes the gzip footer to be written. But since no data was ever compressed, the gzip header has never been written.
I guess this is a bug of some sort. What behavior would you expect in this case? It seems to me it would make the most sense to output data in the gzip format representing a 0-length file.
That would also make sense to me, but it would be inconsistent with the bzip2_compressor behavior, which doesn't write any footer if there was no header.
And it would also make it impossible to simply open a file in append mode without writing any data to it. This could be fixed if gzip_compressor were seekable. Would it be possible to implement this? Is this an intended feature?
--
Tiago de Paula Peixoto
Tiago de Paula Peixoto wrote:
On 10/30/2005 10:34 PM, Jonathan Turkanis wrote:
gzip_compressor works as follows: The first time you write data to it, it writes the gzip header information to the downstream Sink, and then writes the beginning of the compressed data. When the filter is closed, it writes any compressed data that has been buffered, plus the gzip footer, which consists of a checksum and the length of the uncompressed data.
In the above example, the filter is automatically closed at the end of main; this causes the gzip footer to be written. But since no data was ever compressed, the gzip header has never been written.
I guess this is a bug of some sort. What behavior would you expect in this case? It seems to me it would make the most sense to output data in the gzip format representing a 0-length file.
That would also make sense to me, but it would be inconsistent with the bzip2_compressor behavior, which doesn't write any footer if there was no header.
I can't really change the behavior of bzip2, since it's just a wrapper around libbz2, whereas with gzip I implemented the header and footers myself. I wouldn't worry too much about consistency, since this is a corner case.
And it would also make it impossible to simply open a file in append mode without writing any data to it.
I don't follow. What do you want to be able to do?
This could be fixed if gzip_compressor were seekable. Would it be possible to implement this?
The only way I can see to implement this would be to buffer all i/o and only compress or decompress it when the stream is closed. This could be implemented as an adapter that would work with almost any filter, so I wouldn't want to build it into gzip. I'll put this on my list of possibilities for 1.34.
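A much-simplified, output-only illustration of this buffering idea, hard-wired to gzip rather than written as a general filter adapter (the class name and interface are invented): writes accumulate in memory and are compressed in a single pass when the object is destroyed, so nothing is written at all if no data arrived.
--------------------------------------------------------
#include <fstream>
#include <sstream>
#include <string>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>

namespace io = boost::iostreams;

// Hypothetical helper: buffers all output in memory and compresses it in one
// pass on destruction. An empty buffer means the file is never touched.
class deferred_gzip_sink {
public:
    explicit deferred_gzip_sink(const std::string& path) : path_(path) {}

    std::ostream& stream() { return buffer_; }

    ~deferred_gzip_sink()
    {
        const std::string data = buffer_.str();
        if (data.empty())
            return;                              // no writes: leave the file alone

        std::ofstream file(path_.c_str(), std::ios_base::binary);
        io::filtering_ostream out;
        out.push(io::gzip_compressor());
        out.push(file);
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
    }                                            // chain closed: header, data, footer

private:
    std::string        path_;
    std::ostringstream buffer_;
};
--------------------------------------------------------
Since the whole uncompressed content lives in the in-memory buffer, seeking within it before the final compression pass would also be straightforward, which is what would make an adapter of this kind effectively seekable.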
Is this an intended feature?
--
Jonathan Turkanis
www.kangaroologic.com
On 10/31/2005 04:04 PM, Jonathan Turkanis wrote:
In the above example, the filter is automatically closed at the end of main; this causes the gzip footer to be written. But since no data was ever compressed, the gzip header has never been written.
I guess this is a bug of some sort. What behavior would you expect in this case? It seems to me it would make the most sense to output data in the gzip format representing a 0-length file.
That would also make sense to me, but it would be inconsistent with the bzip2_compressor behavior, which doesn't write any footer if there was no header.
I can't really change the behavior of bzip2, since it's just a wrapper around libbz2, whereas with gzip I implemented the header and footers myself. I wouldn't worry too much about consistency, since this is a corner case.
Well, either way is fine for me personally, as long as the resulting file is a valid gzip/bzip2 file (which isn't the case with gzip in 1.33.0). Although, strictly speaking, a zero-length file is neither a gzip nor a bzip2 file, most people will be able to cope with it nevertheless. So I don't feel strongly about it either way.
But people may still expect (as I did) that switching between gzip_compressor and bzip2_compressor would maintain this same invariant. So I would prefer having both write nothing to the stream in this case to having them behave differently (since bzip2 can't be changed easily).
Would you find it too ugly/wrong to modify gzip_compressor to delay the writing of the header until some data is sent?
And it would also make it impossible to simply open a file in append mode without writing any data to it.
I don't follow. What do you want to be able to do?
Well, suppose a program keeps a log file which is gzipped. Every time the program runs, and opens the log file in append mode, some data gets written to the file, even if the program exits without logging any information, which would make the file grow continuously, albeit slowly. Of course, the obvious workaround would be to delay the opening of the logfile until there's some data to be written. But that may be less convenient and/or intuitive.
I realize that this may not be smart to start with, since writing small chunks to a compressed file in this way sometimes makes the file much larger than if it were uncompressed. That's why I said that the ideal solution would be to be able to open the file, push it into a filtering_stream with bzip2_compressor, and then seek to the end, so that the header and footer would appear only at the beginning and at the end of the file, and not between the chunks that were written between opens. I'm just not sure how easy/possible it is to implement that.
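A minimal sketch of that workaround (the class and its members are invented for illustration): the file is opened, and the compressor pushed, only when the first message is actually logged, so a run that logs nothing leaves the file untouched.
--------------------------------------------------------
#include <fstream>
#include <string>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>

namespace io = boost::iostreams;

// Hypothetical lazy logger: the gzip header is only ever written if at least
// one message is logged, because the chain is built on first use.
class lazy_gzip_log {
public:
    explicit lazy_gzip_log(const std::string& path) : path_(path) {}

    void write(const std::string& message)
    {
        if (out_.empty()) {      // first message: open the file and build the chain
            file_.open(path_.c_str(), std::ios_base::app | std::ios_base::binary);
            out_.push(io::gzip_compressor());
            out_.push(file_);
        }
        out_ << message << '\n';
    }

private:
    std::string           path_;
    std::ofstream         file_;
    io::filtering_ostream out_;   // destroyed before file_, so the footer is
                                  // flushed while the file is still open
};
--------------------------------------------------------
Note that each run which does log something still adds its own gzip header and footer to the file; this only avoids the degenerate footer-with-no-header output.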
This could be fixed if gzip_compressor were seekable. Would it be possible to implement this?
The only way I can see to implement this would be to buffer all i/o and only compress or decompress it when the stream is closed. This could be implemented as an adapter that would work with almost any filter, so I wouldn't want to build it into gzip. I'll put this on my list of possibilities for 1.34.
So the entire uncompressed file would be in memory? Doesn't the gzip/bzip2 interface provide a more efficient alternative? Not even to seek only forward?
--
Tiago de Paula Peixoto
On 10/31/2005 07:37 PM, Tiago de Paula Peixoto wrote:
Would you find it too ugly/wrong to modify gzip_compressor to delay the writing of the header until some data is sent?
No, that's wrong. What I meant to say was: modify gzip_compressor to write the footer only if the header has already been written. Sorry for the confusion.
--
Tiago de Paula Peixoto
Tiago de Paula Peixoto wrote:
On 10/31/2005 04:04 PM, Jonathan Turkanis wrote:
In the above example, the filter is automatically closed at the end of main; this causes the gzip footer to be written. But since no data was ever compressed, the gzip header has never been written.
I guess this is a bug of some sort. What behavior would you expect in this case? It seems to me it would make the most sense to output data in the gzip format representing a 0-length file.
That would also make sense to me, but it would be inconsistent with the bzip2_compressor behavior, which doesn't write any footer if there was no header.
I can't really change the behavior of bzip2, since it's just a wrapper around libbz2, whereas with gzip I implemented the header and footers myself. I wouldn't worry too much about consistency, since this is a corner case.
Well, either way is fine for me personally, as long as the resulting file is a valid gzip/bzip2 file (which isn't the case with gzip in 1.33.0). Although, strictly speaking, a zero-length file is neither a gzip nor a bzip2 file, most people will be able to cope with it nevertheless. So I don't feel strongly about it either way.
But people may still expect (as I did) that switching between gzip_compressor and bzip2_compressor would maintain this same invariant. So I would prefer having both write nothing to the stream in this case to having them behave differently (since bzip2 can't be changed easily).
Since I can't easily produce similar behavior for all the compression filters, maybe I should specify in the docs that the output of the compression filters is well-defined only if some data is written.
Would you find it too ugly/wrong to modify gzip_compressor to delay the writing of the header until some data is sent?
It's easy to do (when rephrased ;-) ), but I'm not sure it makes that much sense. If you're just compressing and decompressing, it's easy to treat the case of an empty file specially. But if you have a long chain of filters with a compressor or decompressor in the middle, things could get messy.
And it would also make it impossible to simply open a file in append mode without writing any data to it.
I don't follow. What do you want to be able to do?
Well, suppose a program keeps a log file which is gzipped. Every time the program runs, and opens the log file in append mode, some data gets written to the file, even if the program exits without logging any information, which would make the file grow continuously, albeit slowly. Of course, the obvious workaround would be to delay the opening of the logfile until there's some data to be written. But that may be less convenient and/or intuitive.
This sounds difficult to implement, since when you open the log for appending you have to find a way to restore the compressor to the state it was in when it finished compressing the existing data. The only way I know how to do this would be to decompress the data, then compress it again.
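A sketch of that decompress-then-recompress approach (the function name and error handling are invented, and it assumes the whole uncompressed contents fit in memory):
--------------------------------------------------------
#include <fstream>
#include <sstream>
#include <string>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/copy.hpp>

namespace io = boost::iostreams;

// Append 'extra' to a gzip file by decompressing what is already there and
// recompressing everything in one pass, so the result has a single header
// and footer instead of one pair per append.
void append_to_gzip(const std::string& path, const std::string& extra)
{
    std::string existing;
    {
        std::ifstream file(path.c_str(), std::ios_base::binary);
        if (file && file.peek() != std::char_traits<char>::eof()) {
            io::filtering_istream in;
            in.push(io::gzip_decompressor());
            in.push(file);
            std::ostringstream uncompressed;
            io::copy(in, uncompressed);
            existing = uncompressed.str();
        }
    }

    std::ofstream file(path.c_str(), std::ios_base::binary | std::ios_base::trunc);
    io::filtering_ostream out;
    out.push(io::gzip_compressor());
    out.push(file);
    out << existing << extra;
}                                  // chain closed on return: header, data, footer
--------------------------------------------------------
This rewrites the whole file on every append, of course, which is exactly the cost of not being able to restore the compressor's state.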
This could be fixed if gzip_compressor were seekable. Would it be possible to implement this?
The only way I can see to implement this would be to buffer all i/o and only compress or decompress it when the stream is closed. This could be implemented as an adapter that would work with almost any filter, so I wouldn't want to build it into gzip. I'll put this on my list of possibilities for 1.34.
So the entire uncompressed file would be in memory? Doesn't the gzip/bzip2 interface provide a more efficient alternative? Not even to seek only forward?
The zlib API docs are here: http://www.gzip.org/zlib/manual.html. If you can see a way to do this I'll definitely consider it.
--
Jonathan Turkanis
www.kangaroologic.com
participants (2)
- Jonathan Turkanis
- Tiago de Paula Peixoto