Issue21872

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/66071

classification

Title:	LZMA library sometimes fails to decompress a file
Type:	behavior	Stage:	commit review
Components:	Library (Lib)	Versions:	Python 3.9, Python 3.8, Python 3.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Esa.Peuha, Jeffrey.Kintscher, akira, gregory.p.smith, josh.r, kenorb, malin, maubp, miss-islington, nadeem.vawda, peremen, serhiy.storchaka, vnummela
Priority:	normal	Keywords:	patch

Created on 2014-06-25 18:28 by vnummela, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description
Archive.zip	vnummela,2014-06-25 18:28	Example lzma-compressed files, a good one and a bad one
more_bad_lzma_files.zip	vnummela,2014-07-01 18:17	15 more example files that fail lzma decompression
decompress-example-files.py	akira,2014-11-21 08:15
02h_ticks.bi5	kenorb,2015-09-28 18:40	http://www.dukascopy.com/datafeed/EURUSD/2014/00/22/02h_ticks.bi5
failed_files_more.zip	peremen,2017-12-24 17:51	2 more failing files
fix-bug.diff	malin,2019-06-05 05:06
test_bad_files.py	malin,2019-06-18 09:33

Pull Requests
URL	Status	Linked
PR 14048	merged	malin,2019-06-13 09:42
PR 16054	merged	miss-islington,2019-09-12 14:20
PR 16055	merged	miss-islington,2019-09-12 14:21

Messages (20)
msg221566 -(view)	Author: Ville Nummela (vnummela)	Date: 2014-06-25 18:28
Python lzma library sometimes fails to decompress a file, even though the file does not appear to be corrupt.

Originally discovered with OS X 10.9 / Python 2.7.7 / backports.lzma. Now also reproduced on OS X / Python 3.4 / lzma; please see https://github.com/peterjc/backports.lzma/issues/6 for more details.

Two example files are provided, a good one and a bad one. Both are compressed using the older lzma algorithm (not xz). An attempt to decompress the 'bad' file raises "EOFError: Compressed file ended before the end-of-stream marker was reached."

The 'bad' file appears to be ok, because
- a direct call to XZ Utils processes the files without complaints
- the decompressed files' contents appear to be ok.

The example files contain tick data and have been downloaded from the Dukascopy bank's historical data feed service. The service is well known for its high data quality and is utilised by multiple analysis SW platforms. Thus I think it is unlikely that a file integrity issue on their end would have gone unnoticed.

The error occurs relatively rarely; only around 1 - 5 times per 1000 downloaded files.
msg221583 -(view)	Author: Josh Rosenberg (josh.r)*	Date: 2014-06-25 23:49
Just to be clear, when you say "1 - 5 times per 1000 downloaded files", have you confirmed that redownloading the same file a second time produces the same error? Just making sure we've ruled out corruption during transfer over the network; small errors might make it past one decompressor with minimal effect in the midst of a huge data file, while a more stringent error checking decompressor would reject them.
msg221597 -(view)	Author: Serhiy Storchaka (serhiy.storchaka)*	Date: 2014-06-26 08:00
>>> import lzma
>>> f = lzma.open('22h_ticks_bad.bi5')
>>> len(f.read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/lzma.py", line 310, in read
    return self._read_all()
  File "/home/serhiy/py/cpython/Lib/lzma.py", line 251, in _read_all
    while self._fill_buffer():
  File "/home/serhiy/py/cpython/Lib/lzma.py", line 225, in _fill_buffer
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

This is similar to issue1159051. We need a way to say "read as much as possible without error and raise EOFError only on next read".
msg221599 -(view)	Author: Ville Nummela (vnummela)	Date: 2014-06-26 08:20
My stats so far:

As of writing this, I have attempted to decompress about 5000 downloaded files (two years of tick data). 25 'bad' files were found within this lot.

I re-downloaded all of them, plus about 500 other files, as the minimum lot the server supplies is 24 hours / files at a time.

I compared all these 528 file pairs using hashlib.md5 and got identical hashes for all of them.

I guess what I should do next is to go through the decompressed data and look for suspicious anomalies, but unfortunately I don't have the tools in place to do that quite yet.
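For readers who want to repeat this kind of check, here is a minimal sketch of comparing two downloads of the same file with hashlib.md5. The helper names `md5_of_file` and `same_content` are hypothetical and were not part of the original report; the chunk size is an arbitrary assumption.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hash to the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Identical digests for the original and the re-downloaded copy make corruption in transit an unlikely explanation, which is the question raised in msg221583.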
msg221784 -(view)	Author: Esa Peuha (Esa.Peuha)	Date: 2014-06-28 13:05
This code

import _lzma

with open('22h_ticks_bad.bi5', 'rb') as f:
    infile = f.read()

for i in range(8191, 8195):
    decompressor = _lzma.LZMADecompressor()
    first_out = decompressor.decompress(infile[:i])
    first_len = len(first_out)
    last_out = decompressor.decompress(infile[i:])
    last_len = len(last_out)
    print(i, first_len, first_len + last_len, decompressor.eof)

prints this

8191 36243 45480 True
8192 36251 45473 False
8193 36253 45475 False
8194 36260 45480 True

It seems to me that this is a subtle bug in liblzma; if the input stream to the incremental decompressor is broken at the wrong place, the internal state of the decompressor is corrupted. For this particular file, it happens when the break occurs after reading 8192 or 8193 bytes, and lzma.py happens to use a buffer of 8192 bytes. There is nothing wrong with the compressed file, since lzma.py decompresses it correctly if the buffer size is set to almost any other value.
msg222052 -(view)	Author: Ville Nummela (vnummela)	Date: 2014-07-01 18:17
Uploading a few more 'bad' lzma files for testing.
msg231466 -(view)	Author: Akira Li (akira)*	Date: 2014-11-21 07:13
@Esa changing the buffer size helps with some "bad" files but lzma module still fails on some files.

I've uploaded decompress-example-files.py script that demonstrates it.
msg231467 -(view)	Author: Akira Li (akira)*	Date: 2014-11-21 08:15
If lzma._BUFFER_SIZE is less than 2048 then all example files are decompressed successfully (at least lzma module produces the same results as xz utility).
msg251784 -(view)	Author: (kenorb)	Date: 2015-09-28 18:40
The same happens with this attached file. It fails with Python 3.5 (small buffers like 128, 255, 1023, etc.), but it seems to work in Python 3.4 with lzma._BUFFER_SIZE = 1023. So it looks like something regressed.
msg309005 -(view)	Author: Shinjo Park (peremen)	Date: 2017-12-24 17:51
Hi, I think I encountered this bug with Ubuntu 17.10 / Python 3.6.3. The same error was triggered by Python's LZMA library, while the xz command line tool can extract the problematic file. I am not sure whether the bug is also present in 3.7/3.8. I am attaching the problematic archives; they should contain UTF-16LE encoded text.
msg344530 -(view)	Author: Jeffrey Kintscher (Jeffrey.Kintscher)*	Date: 2019-06-04 07:07
I adapted the example in msg221784:

import _lzma

with open('22h_ticks_bad.bi5', 'rb') as f:
    infile = f.read()

for i in range(1, 9000):
    decompressor = _lzma.LZMADecompressor()
    first_out = decompressor.decompress(infile[:i])
    first_len = len(first_out)
    last_out = decompressor.decompress(infile[i:])
    last_len = len(last_out)
    # Report only the split points where the stream did not reach EOF.
    if not decompressor.eof:
        print(i, first_len, first_len + last_len, decompressor.eof)

which outputs this using both 3.7.3 and 3.8.0a3+ on macOS 10.14.4:

648 2682 45479 False
1834 7442 45479 False
2766 11667 45473 False
2767 11668 45474 False
3591 15428 45473 False
5051 21743 45473 False
5052 21745 45475 False
5589 24387 45475 False
5590 24388 45476 False
6560 28823 45476 False
6561 28824 45477 False
7327 32325 45474 False
8192 36251 45473 False
8193 36253 45475 False
8368 37283 45475 False
8369 37285 45477 False

So, yes, still an active bug.
msg344668 -(view)	Author: Ma Lin (malin)*	Date: 2019-06-05 05:06
fix-bug.diff fixes this bug, I will submit a PR after thoroughly understanding the problem.
msg345491 -(view)	Author: Ma Lin (malin)*	Date: 2019-06-13 10:03
I wrote a review guide in PR 14048.
msg345971 -(view)	Author: Ma Lin (malin)*	Date: 2019-06-18 09:33
I investigated this problem.

Here are the toggle conditions:
- The format is FORMAT_ALONE, this is the legacy .lzma container format.
- The file's header recorded "Uncompressed Size".
- The file doesn't have "End of Payload Marker" or "End of Stream Marker".

Otherwise, liblzma's internal state doesn't hold any bytes that can be output.

Good news is:
- lzma module's default compressing format is FORMAT_XZ, not FORMAT_ALONE.
- Even FORMAT_ALONE files generated by lzma module (underlying xz library) always have "End of Payload Marker".
- Maybe FORMAT_ALONE format is becoming outdated in the world.

The attached file test_bad_files.py tests the `DecompressReader.read(size=-1)` function [1] with different max_length values (from -1 to 1000, excluding 0), to ensure that the needs_input mechanism works properly.

Usage: modify the `DIR` variable to point to the bad files' folder.

[1] https://github.com/python/cpython/blob/v3.8.0b1/Lib/_compression.py#L72-L111
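As a rough illustration of the second condition, the sketch below reads the 13-byte header of a legacy .lzma (FORMAT_ALONE) file and reports whether it records an uncompressed size. It assumes the usual .lzma header layout (1 properties byte, a 4-byte little-endian dictionary size, an 8-byte little-endian uncompressed size in which all bits set means "unknown"). The function name `parse_alone_header` is hypothetical and is not part of the lzma module. The header alone cannot show whether an end marker is present, so this checks only one of the conditions listed above.

```python
import struct

# In the legacy .lzma header, an uncompressed size with all bits set
# means "unknown"; such a stream must be terminated by an end marker.
UNKNOWN_SIZE = 0xFFFFFFFFFFFFFFFF


def parse_alone_header(data):
    """Parse the 13-byte FORMAT_ALONE header from the start of `data`."""
    if len(data) < 13:
        raise ValueError("a .lzma header is 13 bytes long")
    # "<" selects little-endian with no padding: 1 + 4 + 8 = 13 bytes.
    props, dict_size, size = struct.unpack("<BIQ", data[:13])
    return {
        "props": props,
        "dict_size": dict_size,
        "uncompressed_size": None if size == UNKNOWN_SIZE else size,
    }
```

A file for which `uncompressed_size` is not None records its size in the header, so it meets the second condition. Files written by the lzma module itself are expected to report None here.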
msg345972 -(view)	Author: Ma Lin (malin)*	Date: 2019-06-18 09:34
toggle conditions -> trigger conditions
msg352176 -(view)	Author: Gregory P. Smith (gregory.p.smith)*	Date: 2019-09-12 14:20
New changeset 4ffd05d7ec47cfd0d7fc95dce851633be9663255 by Gregory P. Smith (animalize) in branch 'master':
bpo-21872: fix lzma library decompresses data incompletely (GH-14048)
https://github.com/python/cpython/commit/4ffd05d7ec47cfd0d7fc95dce851633be9663255
msg352183 -(view)	Author: miss-islington (miss-islington)	Date: 2019-09-12 14:41
New changeset 824407f76e211a2a19c94d5cb1f39fc422ab62ee by Miss Islington (bot) in branch '3.8':
bpo-21872: fix lzma library decompresses data incompletely (GH-14048)
https://github.com/python/cpython/commit/824407f76e211a2a19c94d5cb1f39fc422ab62ee
msg352185 -(view)	Author: miss-islington (miss-islington)	Date: 2019-09-12 14:41
New changeset a3c53a1b45b05bcb69660eac5a271443b37ecc42 by Miss Islington (bot) in branch '3.7':
bpo-21872: fix lzma library decompresses data incompletely (GH-14048)
https://github.com/python/cpython/commit/a3c53a1b45b05bcb69660eac5a271443b37ecc42
msg352200 -(view)	Author: Gregory P. Smith (gregory.p.smith)*	Date: 2019-09-12 15:25
thanks!
msg352405 -(view)	Author: Ma Lin (malin)*	Date: 2019-09-14 04:31
Some memos:

1, In liblzma, these missing bytes were copied inside `dict_repeat` function:

    788 case SEQ_COPY:
    789     // Repeat len bytes from distance of rep0.
    790     if (unlikely(dict_repeat(&dict, rep0, &len))) {

See liblzma's source code (xz-5.2 branch):
https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/lzma/lzma_decoder.c

2, Above replies said xz's command line tools can extract the problematic files successfully.

This is because xz checks `if (avail_out == 0)` first, then checks `if (avail_in == 0)`.
See `uncompress` function in this source code (xz-5.2 branch):
https://git.tukaani.org/?p=xz.git;a=blob;f=src/xzdec/xzdec.c;hb=refs/heads/v5.2

This check order just avoids the problem.
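A rough Python analogue of that check order is sketched below, using the public `lzma.LZMADecompressor` API (`decompress(data, max_length)`, `needs_input` and `eof`, available since Python 3.5). It is neither the code from PR 14048 nor xzdec's loop. The function name `decompress_in_chunks` and both size parameters are assumptions made for illustration. The point is that buffered output is drained before any more input is requested.

```python
import lzma


def decompress_in_chunks(data, chunk_size=8192, max_length=4096):
    """Decompress a single stream, draining output before feeding input."""
    decompressor = lzma.LZMADecompressor()
    pieces = []
    pos = 0
    while not decompressor.eof:
        if decompressor.needs_input:
            # Only ask for more input once the decompressor says so.
            if pos >= len(data):
                raise EOFError("compressed data ended before end of stream")
            chunk = data[pos:pos + chunk_size]
            pos += len(chunk)
        else:
            # Output is still pending internally: feed nothing, just drain.
            chunk = b""
        pieces.append(decompressor.decompress(chunk, max_length))
    return b"".join(pieces)
```

With a Python version that includes the fix, this loop should return the complete data regardless of where the chunk boundaries fall. It handles a single stream only and ignores any trailing data.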

History
Date	User	Action	Args
2022-04-11 14:58:05	admin	set	github: 66071
2019-09-14 04:31:45	malin	set	messages: +msg352405
2019-09-12 15:25:19	gregory.p.smith	set	status: open -> closed resolution: fixed messages: +msg352200 stage: patch review -> commit review
2019-09-12 14:41:33	miss-islington	set	messages: +msg352185
2019-09-12 14:41:14	miss-islington	set	nosy: +miss-islington messages: +msg352183
2019-09-12 14:21:01	miss-islington	set	pull_requests: +pull_request15678
2019-09-12 14:20:51	miss-islington	set	pull_requests: +pull_request15677
2019-09-12 14:20:41	gregory.p.smith	set	nosy: +gregory.p.smith messages: +msg352176
2019-06-18 09:34:31	malin	set	messages: +msg345972
2019-06-18 09:33:12	malin	set	files: +test_bad_files.py messages: +msg345971
2019-06-13 10:03:43	malin	set	messages: +msg345491 versions: + Python 3.8, Python 3.9, - Python 2.7, Python 3.4, Python 3.5, Python 3.6
2019-06-13 09:42:35	malin	set	stage: patch review pull_requests: +pull_request13910
2019-06-05 05:06:15	malin	set	files: +fix-bug.diff keywords: +patch messages: +msg344668
2019-06-04 07:07:01	Jeffrey.Kintscher	set	messages: +msg344530 versions: + Python 3.7
2019-06-03 05:12:08	malin	set	nosy: +malin
2019-06-01 07:55:22	Jeffrey.Kintscher	set	nosy: +Jeffrey.Kintscher
2017-12-24 17:51:00	peremen	set	files: +failed_files_more.zip versions: + Python 3.6 nosy: +peremen messages: +msg309005
2015-09-28 18:40:52	kenorb	set	files: +02h_ticks.bi5 nosy: +kenorb messages: +msg251784
2014-11-21 08:15:30	akira	set	files: -decompress-example-files.py
2014-11-21 08:15:19	akira	set	files: +decompress-example-files.py messages: +msg231467
2014-11-21 07:26:34	akira	set	files: -decompress-example-files.py
2014-11-21 07:26:29	akira	set	files: +decompress-example-files.py
2014-11-21 07:13:51	akira	set	files: +decompress-example-files.py nosy: +akira messages: +msg231466
2014-11-19 10:37:14	maubp	set	nosy: +maubp
2014-07-01 18:17:55	vnummela	set	files: +more_bad_lzma_files.zip messages: +msg222052
2014-06-28 13:05:57	Esa.Peuha	set	nosy: +Esa.Peuha messages: +msg221784
2014-06-26 08:20:46	vnummela	set	messages: +msg221599
2014-06-26 08:00:23	serhiy.storchaka	set	nosy: +serhiy.storchaka messages: +msg221597 versions: + Python 3.5
2014-06-25 23:49:30	josh.r	set	nosy: +josh.r messages: +msg221583
2014-06-25 18:28:55	vnummela	create