Solid 7z file: archive.read_bytes(am.name)[12:21] == b'somebyte' raises py7zr.exceptions.CrcError #2

Closed
kokutoukiritsugu opened this issue Jul 18, 2024 · 14 comments · Fixed by #3

@kokutoukiritsugu

kokutoukiritsugu commented Jul 18, 2024

Traceback (most recent call last):
  File "D:\cdg\cdg\cdg_search.py", line 78, in check_file_is_enced1
    if archive.read_bytes(am.name)[12:21] == b'somebyte':
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\cdg\cdg\Python312\Lib\site-packages\pydantic\validate_call_decorator.py", line 60, in wrapper_function
    return validate_call_wrapper(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\cdg\cdg\Python312\Lib\site-packages\pydantic\_internal\_validate_call.py", line 96, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\cdg\cdg\Python312\Lib\site-packages\archivefile\_core.py", line 685, in read_bytes
    data = self.extract(member, destination=tmpdir.name).read_bytes()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\cdg\cdg\Python312\Lib\site-packages\pydantic\validate_call_decorator.py", line 60, in wrapper_function
    return validate_call_wrapper(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\cdg\cdg\Python312\Lib\site-packages\pydantic\_internal\_validate_call.py", line 96, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\cdg\cdg\Python312\Lib\site-packages\archivefile\_core.py", line 507, in extract
    self._handler.extract(path=destination, targets=[member])
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 1012, in extract
    self._extract(path, targets, return_dict=False, recursive=recursive)
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 629, in _extract
    self.worker.extract(
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 1253, in extract
    self.extract_single(
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 1341, in extract_single
    raise e
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 1338, in extract_single
    self._extract_single(fp, files, path, src_end, q, skip_notarget)
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 1375, in _extract_single
    self._check(fp, just_check, src_end)
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 1432, in _check
    raise CrcError(crc32, f.crc32, f.filename)
py7zr.exceptions.CrcError: (2426214743, 2390100170, 'asdf.pdf')

Solid 7z archives trigger this error.
Non-solid 7z archives work without problems.
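
For readers unfamiliar with this error: the traceback ends in `raise CrcError(crc32, f.crc32, f.filename)`, which presumably carries the checksum computed from the extracted bytes, the checksum stored in the archive, and the member name. The following is a minimal sketch of that kind of check using only the standard library. `CrcMismatch` and `check_crc` are hypothetical names for illustration and are not part of py7zr or archivefile.

```python
import zlib


class CrcMismatch(Exception):
    """Hypothetical stand-in for py7zr.exceptions.CrcError."""


def check_crc(data: bytes, stored_crc: int, filename: str) -> None:
    # Compute CRC-32 over the bytes that decompression actually produced.
    computed = zlib.crc32(data) & 0xFFFFFFFF
    if computed != stored_crc:
        # Argument order mirrors the raise statement in the traceback:
        # (computed, stored, filename). That reading is an assumption.
        raise CrcMismatch(computed, stored_crc, filename)


payload = b"somebyte" * 4
good_crc = zlib.crc32(payload)

# Correct bytes pass silently.
check_crc(payload, good_crc, "asdf.pdf")

try:
    # Bytes that differ from what was archived fail the check,
    # for example when a stream is decoded from the wrong position.
    check_crc(payload[1:], good_crc, "asdf.pdf")
except CrcMismatch as exc:
    mismatch = exc.args
    print("CRC mismatch:", mismatch)
```

A mismatch like the one reported here means the bytes handed to the check were not the bytes that were archived. It does not by itself say whether the archive or the reading code is at fault.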

@Ravencentric
Owner

Is it an archivefile issue or a py7zr issue? Can you check whether you get the same error when using py7zr on its own?

@kokutoukiritsugu
Author

py7zr on its own has no problem:

with py7zr.SevenZipFile(full_file_path) as archive1:
    for fname, bio in archive1.readall().items():
        print(f'{fname}: {bio.read(21)}')

asdf_10.pdf: b'b\x14#eb\x00\x9e\x01\x00\x00\x00\x01somebyte'
asdf_3.pdf: b'b\x14#eW\x00\xa9\x01\x00\x00\x00\x01somebyte'
asdf_4.pdf: b'b\x14#el\x00\x94\x01\x00\x00\x00\x01somebyte'
asdf_5.pdf: b'b\x14#ek\x00\x95\x01\x00\x00\x00\x01somebyte'
asdf_6.pdf: b'b\x14#eh\x00\x98\x01\x00\x00\x00\x01somebyte'
asdf_7.pdf: b'b\x14#e_\x00\xa1\x01\x00\x00\x00\x01somebyte'
asdf_8.pdf: b'b\x14#eq\x00\x8f\x01\x00\x00\x00\x01somebyte'
asdf_9.pdf: b'b\x14#ep\x00\x90\x01\x00\x00\x00\x01somebyte'

@Ravencentric
Owner

If you can give me steps to reproduce this, I can probably look into fixing it.

@kokutoukiritsugu
Author

Just compress some files with 7-Zip and check the solid option.

Then use read_bytes(am.name)[12:21].

@Ravencentric
Owner

I've added solid 7z files to test_data (8cccb95) and added read tests (2e56f58). As you can see, the tests pass without issues, and I cannot reproduce this on my end. Unless you give me concrete reproduction steps, I cannot help you any further.

@Ravencentric closed this as not planned on Jul 21, 2024

@kokutoukiritsugu
Author

The problematic 7z file is inside this zip file:
3月1_2.zip

@Ravencentric reopened this on Jul 23, 2024

@Ravencentric
Owner

I'll take another look

@Ravencentric
Owner

@kokutoukiritsugu I fixed it in #3. It would be nice if you could test it and let me know before I merge and release.

@kokutoukiritsugu
Author

It works now, but it is slow when the 7z file contains a lot of files:

with archivefile.ArchiveFile(apb, 'r') as archive:
    for name in archive.get_names():
        if archive.get_member(name).is_file:
            check_archive_enc1(apb, name, archive.read_bytes(name)[12:21])

vs

with py7zr.SevenZipFile(apb) as archive:
    for name, bio in archive.read().items():
        if not name.endswith("/"):
            check_archive_enc1(apb, name, bio.read(21)[12:21])

@Ravencentric
Owner

You can do for member in archive.get_members() there. Being slower than the dedicated library is expected, because archivefile is a wrapper after all. But if you can time it, that would give me an idea of how slow it actually is.
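
For anyone wanting to collect such timings, here is a minimal sketch of a timing helper built on `time.perf_counter`. `time_call` and `workload` are hypothetical names; the placeholder workload would be swapped for the archivefile or py7zr loop being compared.

```python
from time import perf_counter
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = perf_counter()
        fn()
        best = min(best, perf_counter() - start)
    return best


def workload() -> int:
    # Placeholder work; replace with the archive-reading loop under test.
    return sum(range(100_000))


elapsed = time_call(workload)
print(f"workload: {elapsed:.6f} s")
```

Taking the best of several runs reduces noise from disk caching and other processes, which matters when the two numbers being compared are close.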

@kokutoukiritsugu
Author

I tried it using for member...

6.49850606918335 vs 0.5358200073242188

Yes, read_bytes is not well suited to a wrapper.
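
One plausible contributor to a gap like this, assuming each read_bytes call extracts its member separately (the traceback at the top of this issue shows read_bytes calling extract into a temporary directory): in a solid archive the members share one compressed stream, so reaching a given member means decompressing everything stored before it. The sketch below simulates that effect with `zlib` only. It is not the 7z format and does not use archivefile or py7zr; `make_solid`, `read_one`, and `read_all` are hypothetical helpers.

```python
import zlib

# (name, offset, size) of each member within the uncompressed stream
Index = list[tuple[str, int, int]]


def make_solid(files: dict[str, bytes]) -> tuple[bytes, Index]:
    """Compress all files as a single stream, similar in spirit to a solid block."""
    index: Index = []
    offset = 0
    for name, data in files.items():
        index.append((name, offset, len(data)))
        offset += len(data)
    return zlib.compress(b"".join(files.values())), index


def read_one(blob: bytes, index: Index, wanted: str) -> tuple[bytes, int]:
    """Per-member read: decompress from the start up to the end of the member.

    Returns the member's bytes and the number of bytes that were decompressed.
    """
    for name, offset, size in index:
        if name == wanted:
            end = offset + size
            out = zlib.decompressobj().decompress(blob, end)
            return out[offset:end], len(out)
    raise KeyError(wanted)


def read_all(blob: bytes, index: Index) -> tuple[dict[str, bytes], int]:
    """Single pass: decompress once and slice out every member."""
    out = zlib.decompress(blob)
    return {name: out[o:o + s] for name, o, s in index}, len(out)


files = {f"asdf_{i}.pdf": f"file{i}".encode() * 200 for i in range(8)}
blob, index = make_solid(files)

per_member_total = 0
for name in files:
    data, work = read_one(blob, index, name)
    assert data == files[name]
    per_member_total += work

all_data, single_pass_total = read_all(blob, index)
print(per_member_total, single_pass_total)
```

In this toy model the per-member approach decompresses a number of bytes that grows roughly quadratically with the member count, while the single pass grows linearly. Whether that is what dominates the timings above has not been measured here.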

@Ravencentric
Owner

Ravencentric commented Jul 25, 2024

That's slower than I expected. Anyway, that's something I'll look into, but it's not really an immediate goal. I'll close this issue when I merge #3.

@kokutoukiritsugu
Author

OK, thanks very much!

@Ravencentric
Owner

Ravencentric commented Jul 27, 2024

#4 is pretty much a complete rewrite, which does end up speeding things up a bit:

from time import perf_counter

import archivefile
import py7zr

file = "3月1.7z"

start = perf_counter()
with archivefile.ArchiveFile(file) as archive:
    for member in archive.get_members():
        if member.is_file:
            archive.read_bytes(member)
print(perf_counter() - start)

start = perf_counter()
with py7zr.SevenZipFile(file) as archive:
    for name, bio in archive.read().items():
        if not name.endswith("/"):
            bio.read()
print(perf_counter() - start)

ArchiveFile: 0.013419300004898105
SevenZipFile: 0.007313699999940582

Although it will never beat the underlying library for obvious reasons, I think I'm happy with the minor improvements.
