Tiago de Paula Peixoto wrote:
On 10/31/2005 04:04 PM, Jonathan Turkanis wrote:
In the above example, the filter is automatically closed at the end of main; this causes the gzip footer to be written. But since no data was ever compressed, the gzip header has never been written.
I guess this is a bug of some sort. What behavior would you expect in this case? It seems to me it would make the most sense to output data in the gzip format representing a 0-length file.
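For reference, a gzip member representing a 0-length file is small enough to write out by hand. The sketch below follows the RFC 1952 layout around an empty deflate stream; the function names are hypothetical and are not part of Boost.Iostreams, and the MTIME and OS values chosen here (0 and "unknown") are assumptions that a real compressor may set differently.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a Boost.Iostreams API): the 20 bytes of a gzip
// member that encodes zero bytes of input.
inline std::array<std::uint8_t, 20> empty_gzip_member() {
    return {{
        0x1f, 0x8b,             // ID1, ID2: gzip magic number
        0x08,                   // CM: deflate
        0x00,                   // FLG: no optional fields
        0x00, 0x00, 0x00, 0x00, // MTIME: no timestamp available (assumed)
        0x00,                   // XFL: no extra flags
        0xff,                   // OS: unknown (assumed)
        0x03, 0x00,             // deflate: one final fixed-Huffman block
                                // containing only the end-of-block code
        0x00, 0x00, 0x00, 0x00, // CRC32 of the empty input
        0x00, 0x00, 0x00, 0x00  // ISIZE: uncompressed length 0
    }};
}

// Sanity check of the framing: 10-byte header, 2-byte body, 8-byte footer.
inline void check_empty_gzip_member() {
    const auto m = empty_gzip_member();
    assert(m.size() == 20);
    assert(m[0] == 0x1f && m[1] == 0x8b && m[2] == 0x08);
}
```

So even with no input, a well-formed member needs both a header and a footer.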
That would also make sense to me, but it would be inconsistent with the bzip2_compressor behavior, which doesn't write any footer if there was no header.
I can't really change the behavior of bzip2, since it's just a wrapper around libbz2, whereas with gzip I implemented the header and footers myself. I wouldn't worry too much about consistency, since this is a corner case.
Well, either way is fine for me personally, as long as the resulting file is a valid gzip/bzip2 file (which isn't the case with gzip in 1.33.0). Although, strictly speaking, a zero-length file is neither a gzip nor a bzip2 file, most people will be able to cope with it nevertheless. So I don't feel strongly about it either way.
But people may still expect (as I did) that switching between gzip_compressor and bzip2_compressor would preserve this invariant. So I would prefer having both write nothing to the stream in this case to having them behave differently (since bzip2 can't be changed easily).
Since I can't easily produce similar behavior for all the compression filters, maybe I should specify in the docs that the output of the compression filters is well-defined only if some data is written.
Would you find it too ugly/wrong to modify gzip_compressor to delay writing the header until some data is sent?
It's easy to do (when rephrased ;-) ), but I'm not sure it makes that much sense. If you're just compressing and decompressing, it's easy to treat the case of an empty file specially. But if you have a long chain of filters with a compressor or decompressor in the middle, things could get messy.
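A minimal sketch of the delayed-header idea, to make the proposal concrete. This is not the gzip_compressor implementation: the class name lazy_framed_writer is hypothetical, the strings "HDR" and "FTR" are placeholders for the real gzip header and footer, and the payload is passed through uncompressed.

```cpp
#include <cassert>
#include <string>

// Hypothetical illustration only. The header is emitted on the first
// non-empty write, and the footer is emitted on close only if a header
// was written, so an unused writer leaves the sink untouched.
class lazy_framed_writer {
public:
    explicit lazy_framed_writer(std::string& sink) : sink_(sink) {}

    void write(const std::string& data) {
        if (data.empty()) return;       // nothing to frame yet
        if (!header_written_) {
            sink_ += "HDR";             // placeholder for the gzip header
            header_written_ = true;
        }
        sink_ += data;                  // placeholder for compressed output
    }

    void close() {
        if (header_written_) {
            sink_ += "FTR";             // placeholder for the gzip footer
            header_written_ = false;    // makes a second close() a no-op
        }
    }

private:
    std::string& sink_;
    bool header_written_ = false;
};
```

With this scheme the output is either empty or a complete header/body/footer sequence, never a footer on its own.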
And it would also make it impossible to simply visit a file in append mode without writing any data to it.
I don't follow. What do you want to be able to do?
Well, suppose a program keeps a log file which is gzipped. Every time the program runs and opens the log file in append mode, some data gets written to the file, even if the program exits without logging any information. This makes the file grow continuously, albeit slowly. Of course, the obvious workaround would be to delay opening the log file until there's some data to be written. But that may be less convenient and/or intuitive.
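The workaround mentioned above, opening the log only when there is something to write, could be sketched roughly as follows. The class lazy_log is hypothetical and uses a plain std::ofstream; in the gzipped case the same deferral would presumably apply to whatever stream holds the compression filter.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <utility>

// Hypothetical helper: the file is opened in append mode on the first
// write, so a run that logs nothing never touches the file.
class lazy_log {
public:
    explicit lazy_log(std::string path) : path_(std::move(path)) {}

    void write(const std::string& line) {
        if (!out_.is_open())
            out_.open(path_, std::ios::app | std::ios::binary);
        out_ << line << '\n';
    }

    // True once the underlying file has actually been opened.
    bool opened() const { return out_.is_open(); }

private:
    std::string path_;
    std::ofstream out_;
};
```

The cost is that open failures surface at the first write rather than at start-up, which may be part of why this feels less intuitive.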
This sounds difficult to implement, since when you open the log for appending you have to find a way to restore the compressor to the state it was in when it finished compressing the existing data. The only way I know how to do this would be to decompress the data, then compress it again.
This could be fixed if gzip_compressor were seekable. Is it possible to implement this?
The only way I can see to implement this would be to buffer all i/o and only compress or decompress it when the stream is closed. This could be implemented as an adapter that would work with almost any filter, so I wouldn't want to build it into gzip. I'll put this on my list of possibilities for 1.34.
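A rough sketch of what such a buffering adapter might look like, using only the standard library. Everything here is an assumption for illustration: the class name buffered_transform_writer, the std::function-based transform standing in for a real filter, and the std::string sink. It is not a Boost.Iostreams component.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical adapter: all writes go to an in-memory buffer, which makes
// seeking trivial. The transform (e.g. a compressor) runs once, on close.
class buffered_transform_writer {
public:
    using transform = std::function<std::string(const std::string&)>;

    buffered_transform_writer(std::string& sink, transform f)
        : sink_(sink), f_(std::move(f)) {}

    // Overwrite or extend the buffer at the current position.
    void write(const std::string& data) {
        if (pos_ + data.size() > buf_.size())
            buf_.resize(pos_ + data.size());
        buf_.replace(pos_, data.size(), data);
        pos_ += data.size();
    }

    // Seeking only moves the position inside the untransformed buffer;
    // positions past the end are clamped to the end.
    void seek(std::size_t pos) { pos_ = std::min(pos, buf_.size()); }

    // Apply the transform to the whole buffer and flush it to the sink.
    void close() {
        if (closed_) return;
        sink_ += f_(buf_);
        closed_ = true;
    }

private:
    std::string& sink_;
    transform f_;
    std::string buf_;
    std::size_t pos_ = 0;
    bool closed_ = false;
};
```

The trade-off is visible in the data members: the entire untransformed payload lives in buf_ until close() is called.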
So the entire uncompressed file would be in memory? Doesn't the gzip/bzip2 interface provide a more efficient alternative? Not even to seek only forward?
The zlib API docs are here: http://www.gzip.org/zlib/manual.html. If you can see a way to do this I'll definitely consider it.
--
Jonathan Turkanis
www.kangaroologic.com