Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_subprocess.py stdout reading may corrupt UTF-8 characters, and then fail when decodes the data #573

Closed
dnso86 opened this issue Mar 23, 2023 · 4 comments · Fixed by #572
Labels
bug Something isn't working component:dep-sources Dependency sources duplicate This issue or pull request already exists

Comments

@dnso86
Copy link

dnso86 commented Mar 23, 2023

Bug description

_subprocess.py may not read the whole contents of the stdout and stderr pipes. Attempting to decode the resulting incomplete stdout data as UTF-8 might lead to an UnicodeDecodeError.

Reproduction steps

mock_pip.py

Generates output with a two-byte UTF-8 character split "in half" at the buffer boundary.

import random
import string
import sys

from nose.tools import assert_raises

BUFFER_LENGTH = 4096


def random_text(length):
    characters = string.ascii_lowercase + string.digits
    return ''.join(random.choice(characters) for i in range(length)).encode()

if __name__ == '__main__':
    # COPYRIGHT SIGN (U+00A9)
    two_byte_utf8 = b'\xc2\xae'

    assert two_byte_utf8.decode('utf-8')

    buffer_length_text = random_text(BUFFER_LENGTH)
    buffer_length_plus_one_byte = random_text(BUFFER_LENGTH - 1) + two_byte_utf8

    # It is valid UTF-8
    assert buffer_length_plus_one_byte.decode('utf-8')

    # One iteration of while wouldn't read it all
    assert len(buffer_length_plus_one_byte) > BUFFER_LENGTH

    buffer_length_broken_utf_8 = buffer_length_plus_one_byte[:BUFFER_LENGTH]
    one_byte_broken_utf_8 = buffer_length_plus_one_byte[BUFFER_LENGTH:]

    with assert_raises(UnicodeDecodeError):
        buffer_length_broken_utf_8.decode('utf-8')

    with assert_raises(UnicodeDecodeError):
        one_byte_broken_utf_8.decode('utf-8')

    sys.stdout.buffer.write(buffer_length_text)
    sys.stdout.flush()

    sys.stdout.buffer.write(buffer_length_broken_utf_8)
    sys.stdout.flush()

    # This will not be read
    sys.stdout.buffer.write(one_byte_broken_utf_8)
    sys.stdout.flush()

subprocess_run_isolated.py

Minimal reproducible example based on _subprocess.py:

import subprocess

if __name__ == '__main__':
    process = subprocess.Popen(
        [
            'python3',
            'mock_pip.py',
        ],
        bufsize=0,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )

    terminated = False
    stdout = b""
    stderr = b""

    while not terminated:
        terminated = process.poll() is not None
        # NOTE(ww): Buffer size chosen arbitrarily here and below.
        stdout += process.stdout.read(4096)  # type: ignore
        stderr += process.stderr.read(4096)  # type: ignore

    if process.returncode != 0:
        pass

    stdout.decode("utf-8")

Running subprocess_run_isolated.py will lead to UnicodeDecodeError most of the times.

Expected behavior

It should read everything from the pipes prior to attempting .decode("utf-8"). It should also be able to handle decoding errors whatsoever, if the called process terminates unexpectedly, leaving possibly truncated output.

Screenshots and logs

A typical error message is:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 8191: unexpected end of data

Platform information

  • OS name and version: Fedora 37
  • pip-audit version (pip-audit -V): v2.5.1, v2.5.2
  • Python version (python -V or python3 -V): Python 3.11.2
  • pip version (pip -V or pip3 -V): Python 3.11.2

Additional context

Sporadic UnicodeDecodeErrors popped up during repeated, similar pip-audit (...) --fix runs.

@dnso86 dnso86 added the bug-candidate Might be a bug. label Mar 23, 2023
@woodruffw
Copy link
Member

Thanks for the detailed report @dnso86! I believe this is a dupe of #569, and that #572 should resolve this.

Could you give the changes in that PR a try and let me know if they resolve the crash for you?

@woodruffw woodruffw added bug Something isn't working component:dep-sources Dependency sources duplicate This issue or pull request already exists and removed bug-candidate Might be a bug. labels Mar 23, 2023
@woodruffw
Copy link
Member

(More generally, this indicates that our loop-and-poll technique isn't completely sound. We should probably simplify it; the fix in #572 is more of a temporary fix.)

@dnso86
Copy link
Author

dnso86 commented Mar 23, 2023

@woodruffw super fast response, thanks! :) #572 should fix it for now I believe. I'll test it in the original environment and report back if there is still something. Keep up the good work on pip-audit 💪 .

@woodruffw
Copy link
Member

Thanks a ton for testing, and for the kind words!

I've also filed #574 to track a longer-term fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working component:dep-sources Dependency sources duplicate This issue or pull request already exists
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants