_subprocess.py stdout reading may corrupt UTF-8 characters, and then fail when decodes the data #573

dnso86 · 2023-03-23T19:47:11Z

Bug description

_subprocess.py may not read the whole contents of the stdout and stderr pipes. Attempting to decode the resulting incomplete stdout data as UTF-8 can raise a UnicodeDecodeError.

Reproduction steps

mock_pip.py

Generates output with a two-byte UTF-8 character split "in half" at the buffer boundary.

import random
import string
import sys

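from unittest import TestCase

# Stdlib stand-in for nose.tools.assert_raises (nose is no longer maintained).
assert_raises = TestCase().assertRaises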

BUFFER_LENGTH = 4096


def random_text(length):
    characters = string.ascii_lowercase + string.digits
    return ''.join(random.choice(characters) for i in range(length)).encode()

if __name__ == '__main__':
    # REGISTERED SIGN (U+00AE)
    two_byte_utf8 = b'\xc2\xae'

    assert two_byte_utf8.decode('utf-8')

    buffer_length_text = random_text(BUFFER_LENGTH)
    buffer_length_plus_one_byte = random_text(BUFFER_LENGTH - 1) + two_byte_utf8

    # It is valid UTF-8
    assert buffer_length_plus_one_byte.decode('utf-8')

    # A single read(BUFFER_LENGTH) call cannot consume all of it
    assert len(buffer_length_plus_one_byte) > BUFFER_LENGTH

    buffer_length_broken_utf_8 = buffer_length_plus_one_byte[:BUFFER_LENGTH]
    one_byte_broken_utf_8 = buffer_length_plus_one_byte[BUFFER_LENGTH:]

    with assert_raises(UnicodeDecodeError):
        buffer_length_broken_utf_8.decode('utf-8')

    with assert_raises(UnicodeDecodeError):
        one_byte_broken_utf_8.decode('utf-8')

    sys.stdout.buffer.write(buffer_length_text)
    sys.stdout.flush()

    sys.stdout.buffer.write(buffer_length_broken_utf_8)
    sys.stdout.flush()

    # This trailing byte is what the polling loop leaves unread
    sys.stdout.buffer.write(one_byte_broken_utf_8)
    sys.stdout.flush()

subprocess_run_isolated.py

Minimal reproducible example based on _subprocess.py:

import subprocess

if __name__ == '__main__':
    process = subprocess.Popen(
        [
            'python3',
            'mock_pip.py',
        ],
        bufsize=0,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )

    terminated = False
    stdout = b""
    stderr = b""

    while not terminated:
        terminated = process.poll() is not None
        # NOTE(ww): Buffer size chosen arbitrarily here and below.
        stdout += process.stdout.read(4096)  # type: ignore
        stderr += process.stderr.read(4096)  # type: ignore

    if process.returncode != 0:
        pass

    stdout.decode("utf-8")

Running subprocess_run_isolated.py raises UnicodeDecodeError most of the time.

Expected behavior

It should read everything from the pipes before attempting .decode("utf-8"). It should also handle decoding errors gracefully when the called process terminates unexpectedly and leaves possibly truncated output.
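One way to get this behavior is sketched below, assuming the caller does not need to stream output while the child is still running. `run_and_decode` is a hypothetical helper name used only for illustration, not pip-audit's actual code. It relies on `Popen.communicate()`, which reads both pipes until EOF, and on `errors="replace"`, which substitutes U+FFFD for undecodable bytes instead of raising.

```python
import subprocess
import sys


def run_and_decode(args):
    """Run a command, drain both pipes to EOF, and decode leniently.

    Hypothetical helper for illustration; not pip-audit's actual code.
    """
    process = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # communicate() reads stdout and stderr until EOF and waits for exit,
    # so no bytes are left behind in either pipe.
    stdout, stderr = process.communicate()
    # errors="replace" maps undecodable bytes to U+FFFD instead of raising.
    return (
        process.returncode,
        stdout.decode("utf-8", errors="replace"),
        stderr.decode("utf-8", errors="replace"),
    )


if __name__ == "__main__":
    # The child writes 4095 ASCII bytes followed by a two-byte UTF-8
    # character, so the character straddles a 4096-byte boundary.
    child = (
        "import sys; "
        "sys.stdout.buffer.write(b'a' * 4095 + b'\\xc2\\xae'); "
        "sys.stdout.flush()"
    )
    code, out, err = run_and_decode([sys.executable, "-c", child])
    print(code, len(out), ascii(out[-1]))
```

Because the whole stream is collected before decoding, a multi-byte character that straddles a read boundary is decoded intact. The lenient error handler only matters when the child really did emit truncated or invalid bytes.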

Screenshots and logs

A typical error message is:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 8191: unexpected end of data
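The same message can be reproduced without any subprocess, which may help confirm that the failure comes from truncation rather than from the child's output itself. This is a minimal sketch: the 8191-byte ASCII prefix is chosen only to match the position in the message above, and the slice stands in for a reader that stops one byte early.

```python
# REGISTERED SIGN (U+00AE) encodes to two bytes in UTF-8: b"\xc2\xae".
data = b"a" * 8191 + "\u00ae".encode("utf-8")

# Simulate a reader that stops after 8192 bytes, cutting the character in half.
truncated = data[:8192]

try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    # The lone lead byte 0xc2 sits at index 8191 with no continuation byte.
    message = str(exc)
    print(message)
```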

Platform information

OS name and version: Fedora 37
pip-audit version (pip-audit -V): v2.5.1, v2.5.2
Python version (python -V or python3 -V): Python 3.11.2
pip version (pip -V or pip3 -V): Python 3.11.2

Additional context

Sporadic UnicodeDecodeErrors popped up during repeated, similar pip-audit (...) --fix runs.


woodruffw · 2023-03-23T19:50:04Z

Thanks for the detailed report @dnso86! I believe this is a dupe of #569, and that #572 should resolve this.

Could you give the changes in that PR a try and let me know if they resolve the crash for you?

woodruffw · 2023-03-23T19:52:23Z

(More generally, this indicates that our loop-and-poll technique isn't completely sound. We should probably simplify it; the fix in #572 is more of a temporary fix.)
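To illustrate why substitution alone is only a stopgap, here is a small sketch that uses the standard `errors="replace"` handler on a truncated byte string. It is not the code from #572, just the general technique its title names. The decode no longer raises, but the character that was cut off is still lost.

```python
complete = b"ok " + "\u00ae".encode("utf-8")  # b"ok \xc2\xae"
truncated = complete[:-1]                     # b"ok \xc2", last byte never read

# Strict decoding (the default) raises on the dangling lead byte.
try:
    truncated.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding succeeds, replacing the dangling byte with U+FFFD.
lenient = truncated.decode("utf-8", errors="replace")

print(strict_failed, ascii(lenient))
```

The replacement character marks where data went missing, so reading the pipes to EOF is still needed to recover the original text.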

dnso86 · 2023-03-23T19:57:09Z

@woodruffw super fast response, thanks! :) #572 should fix it for now I believe. I'll test it in the original environment and report back if there is still something. Keep up the good work on pip-audit 💪 .

woodruffw · 2023-03-23T19:57:56Z

Thanks a ton for testing, and for the kind words!

I've also filed #574 to track a longer-term fix.

dnso86 added the bug-candidate label on Mar 23, 2023

woodruffw added the bug, component:dep-sources, and duplicate labels and removed the bug-candidate label on Mar 23, 2023

This was referenced on Mar 23, 2023:

_subprocess: perform invalid UTF-8 substitution #572 (Merged)

From version 2.5.0+ there looks to be an change in behavior that breaks github actions for pip-audit #569 (Closed)

Subprocess: fix stream handling #574 (Open)

woodruffw closed this as completed in #572 Mar 23, 2023
