Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix line ending handling in reverse_readfile/readline across OS, and not skipping empty lines #712

Open
wants to merge 100 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 95 commits
Commits
Show all changes
100 commits
Select commit Hold shift + click to select a range
e2952c4
add l_end arg to reverse_readfile
DanielYang59 Sep 4, 2024
619a38d
rstrip remove hard coded trailing white space
DanielYang59 Sep 4, 2024
6cbea60
add default value for l_end
DanielYang59 Sep 4, 2024
0de9696
add l_end to reverse_readline
DanielYang59 Sep 4, 2024
f114afe
continue CI test jobs upon failure
DanielYang59 Sep 4, 2024
deb1ad7
bump codecov to v4
DanielYang59 Sep 4, 2024
c4a845c
tweak docstring
DanielYang59 Sep 4, 2024
afbe573
var name and docstring tweak
DanielYang59 Sep 4, 2024
a4c4fe3
add helper function to get line ending and unit test
DanielYang59 Sep 5, 2024
86ea01b
add bzip2 and gzip file support and test
DanielYang59 Sep 5, 2024
c7ec2de
sort import and tweak docstring
DanielYang59 Sep 5, 2024
ae125c3
use unix line ending \n as default if empty
DanielYang59 Sep 5, 2024
343e0db
fix docstring
DanielYang59 Sep 5, 2024
d288e0d
update unit test
DanielYang59 Sep 5, 2024
064c064
remove accidental comment sign
DanielYang59 Sep 5, 2024
30624aa
use if after return
DanielYang59 Sep 5, 2024
3084123
tweak docstring
DanielYang59 Sep 5, 2024
1639725
reset pointer to fix test in windows
DanielYang59 Sep 6, 2024
dfe553d
encode in linux
DanielYang59 Sep 6, 2024
3b7a811
reset pointer in win only
DanielYang59 Sep 6, 2024
b291707
yield str in windows
DanielYang59 Sep 6, 2024
aeba5a7
use Iterator as return type
DanielYang59 Sep 6, 2024
1d7adda
assert iterator return str
DanielYang59 Sep 6, 2024
e6312a6
strict usage of rstrp(line_ending)
DanielYang59 Sep 6, 2024
7911e0a
better assert no error
DanielYang59 Sep 6, 2024
6cb5349
decode before strip, but tests are failing
DanielYang59 Sep 6, 2024
79f07f1
add test_file_with_empty_lines for readfile, Win still not working
DanielYang59 Sep 6, 2024
d2ea989
allow zipped file directly into get_line_end, and add test
DanielYang59 Sep 9, 2024
6803c7b
remove seemingly unused file
DanielYang59 Sep 9, 2024
351452a
update test file to include line end for the last line
DanielYang59 Sep 9, 2024
6f8baf9
skip the first match of line ending char
DanielYang59 Sep 9, 2024
74a451f
fix test for zipped format
DanielYang59 Sep 9, 2024
ef454b3
drop test for legacy MacOS \r
DanielYang59 Sep 9, 2024
3516c27
make capture of mmap valueerror more narrow
DanielYang59 Sep 9, 2024
5e1344c
use platform.system for windows check
DanielYang59 Sep 9, 2024
283a383
WIP: test for \r\n is failing for some reason
DanielYang59 Sep 9, 2024
ace9264
fix \r\n location, but it looks silly, need opt
DanielYang59 Sep 9, 2024
9e92bb9
add TODO tag
DanielYang59 Sep 9, 2024
6a8f8b5
finally fixed, avoid newline otherwise \n get overwrite by \r\n causi…
DanielYang59 Sep 9, 2024
f69c205
fix line ending for text mode
DanielYang59 Sep 9, 2024
28e6cc4
drop test of \r altogether
DanielYang59 Sep 9, 2024
bf3b90a
remove manual counter
DanielYang59 Sep 9, 2024
a03b301
copy unit test, to be fixed
DanielYang59 Sep 9, 2024
138b756
update warn msg upon empty file
DanielYang59 Sep 10, 2024
29dc50d
tweak docstring
DanielYang59 Sep 10, 2024
8f94964
add comments
DanielYang59 Sep 10, 2024
3804727
Fix line ending handling in reverse readline
DanielYang59 Sep 13, 2024
76c0243
significantly increase test
DanielYang59 Sep 13, 2024
3b68d14
add warning for overly small mem size
DanielYang59 Sep 13, 2024
cd3bbd1
update unit test
DanielYang59 Sep 13, 2024
36a02f4
add test for illegal usage
DanielYang59 Sep 13, 2024
6457922
use tuple for instance check for now
DanielYang59 Sep 13, 2024
14dad79
avoid repeated len(l_end) call
DanielYang59 Sep 13, 2024
306a8f1
tweak unit test names
DanielYang59 Sep 13, 2024
168bc42
supress newline convert
DanielYang59 Sep 13, 2024
dcdfde5
pre-commit auto-fixes
pre-commit-ci[bot] Sep 13, 2024
f255121
support bufferedreader and clarify m_file type
DanielYang59 Sep 14, 2024
57157ef
clarify and test missing last l_end char
DanielYang59 Sep 14, 2024
4276032
clean up docstring
DanielYang59 Sep 14, 2024
33257c5
clarify myth around blk size and max mem
DanielYang59 Sep 14, 2024
f75abd3
fix skip last l_end (temporarily)
DanielYang59 Sep 14, 2024
709ed17
fix skip first l_end match
DanielYang59 Sep 14, 2024
4799c24
update test big file, but it's failing and need to fix
DanielYang59 Sep 14, 2024
702e6ca
tweak comment
DanielYang59 Sep 14, 2024
d0a0fda
update TODO list
DanielYang59 Sep 15, 2024
5af7145
fix incorrect buffer refill position
DanielYang59 Sep 15, 2024
384c231
suppress OS l_end translate
DanielYang59 Sep 15, 2024
6102263
suppress unrelated warnings
DanielYang59 Sep 15, 2024
e9c22ed
check file pointer reset
DanielYang59 Sep 15, 2024
0be5a85
track benchmark script in case we need it someday
DanielYang59 Sep 15, 2024
edfa0cb
track benchmark script in case we need it someday
DanielYang59 Sep 15, 2024
bdda9f8
Merge branch 'readline-line-ending' of github.com:DanielYang59/monty …
DanielYang59 Sep 15, 2024
8714cb4
simplify test script, make env install manual
DanielYang59 Sep 15, 2024
c236434
Merge branch 'readline-line-ending' of github.com:DanielYang59/monty …
DanielYang59 Sep 15, 2024
174d940
save test log on wsl2
DanielYang59 Sep 15, 2024
1f7476e
pre-commit auto-fixes
pre-commit-ci[bot] Sep 15, 2024
0d2af60
also track test file create time
DanielYang59 Sep 16, 2024
7cddd29
get obj size as getsize is slow in win
DanielYang59 Sep 16, 2024
0b478dd
pre-commit auto-fixes
pre-commit-ci[bot] Sep 16, 2024
3053e87
update builtin readline test not to read entire file
DanielYang59 Sep 16, 2024
097ae65
remove outdated test log
DanielYang59 Sep 16, 2024
cf542a8
update test log on windows
DanielYang59 Sep 16, 2024
8dc894e
pre-commit auto-fixes
pre-commit-ci[bot] Sep 16, 2024
e90da64
test on Ubuntu 22.04 WSL2
DanielYang59 Sep 18, 2024
df0288b
pre-commit auto-fixes
pre-commit-ci[bot] Sep 18, 2024
0ea6828
remove dup test script
DanielYang59 Sep 18, 2024
8db6bb4
Merge branch 'readline-line-ending' of github.com:DanielYang59/monty …
DanielYang59 Sep 18, 2024
f354756
clear finished TODO tag
DanielYang59 Sep 19, 2024
96c9108
tweak var name
DanielYang59 Sep 19, 2024
e1786ef
fix missing newline char in comment
DanielYang59 Sep 19, 2024
f90074c
guard warning filter with context manager
DanielYang59 Sep 19, 2024
27b28a6
add type annotation
DanielYang59 Sep 19, 2024
1cdc75b
put tag into condition branch
DanielYang59 Sep 21, 2024
e4940e0
untrack benchmark script and results
DanielYang59 Sep 22, 2024
279cd4c
Merge branch 'master' into readline-line-ending
shyuep Oct 21, 2024
6613d33
fix typo in test var name
DanielYang59 Oct 22, 2024
3baea3a
remove merge issue
DanielYang59 Oct 22, 2024
61f4aff
Merge branch 'master' into readline-line-ending
DanielYang59 Oct 22, 2024
f1fd669
revise comment
DanielYang59 Oct 22, 2024
d70e80e
suppress mypy errors
DanielYang59 Oct 22, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ jobs:
strategy:
fail-fast: false
max-parallel: 20
fail-fast: false
matrix:
os: [ubuntu-latest, macos-14, windows-latest]
python-version: ["3.9", "3.12"]
Expand All @@ -30,6 +31,6 @@ jobs:
run: pytest --cov=monty --cov-report html:coverage_reports tests

- name: Upload coverage reports to Codecov
uses: codecov/codecov-action@v3
uses: codecov/codecov-action@v4
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
274 changes: 191 additions & 83 deletions src/monty/io.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,16 +9,21 @@
import errno
import gzip
import io
import lzma
import mmap
import os
import subprocess
import time
import warnings
from pathlib import Path
from typing import TYPE_CHECKING
from typing import TYPE_CHECKING, Literal, cast

try:
import lzma
except ImportError:
lzma = None # type: ignore[assignment]

if TYPE_CHECKING:
from typing import IO, Generator, Union
from typing import IO, Iterator, Union


def zopen(filename: Union[str, Path], *args, **kwargs) -> IO:
Expand All @@ -41,6 +46,7 @@ def zopen(filename: Union[str, Path], *args, **kwargs) -> IO:

_name, ext = os.path.splitext(filename)
ext = ext.upper()

if ext == ".BZ2":
return bz2.open(filename, *args, **kwargs)
if ext in {".GZ", ".Z"}:
Expand All @@ -50,7 +56,64 @@ def zopen(filename: Union[str, Path], *args, **kwargs) -> IO:
return open(filename, *args, **kwargs)


def reverse_readfile(filename: Union[str, Path]) -> Generator[str, str, None]:
def _get_line_ending(
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
file: str
| Path
| io.TextIOWrapper
| io.BufferedReader
| gzip.GzipFile
| bz2.BZ2File,
) -> Literal["\r\n", "\n"]:
"""Helper function to get line ending of a file.

This function assumes the file has a single consistent line ending.

WARNING: as per the POSIX standard, a line is: "A sequence of zero or
more non-<newline> characters plus a terminating <newline> char.",
as such this func might fail if the only line misses a terminating character.
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html

Returns:
"\n": Unix line ending.
"\r\n": Windows line ending.

Raises:
ValueError: If line ending is unknown.

Warnings:
If file is empty, "\n" would be used as default.
"""
if isinstance(file, (str, Path)):
with zopen(file, "rb") as f:
first_line = f.readline()
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
elif isinstance(file, io.TextIOWrapper):
first_line = file.buffer.readline()
elif isinstance(file, (io.BufferedReader, gzip.GzipFile, bz2.BZ2File)):
first_line = file.readline()
else:
raise TypeError(f"Unknown file type {type(file).__name__}")

# Reset pointer to start of file if possible
if hasattr(file, "seek"):
file.seek(0)

# Return Unix "\n" line ending as default if file is empty
if not first_line:
warnings.warn("File is empty, return Unix line ending \n.", stacklevel=2)
return "\n"

if first_line.endswith(b"\r\n"):
return "\r\n"
if first_line.endswith(b"\n"):
return "\n"

# It's likely the line is missing a line ending for the first line
raise ValueError(f"Unknown line ending in line {repr(first_line)}.")

DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved

def reverse_readfile(
filename: Union[str, Path],
) -> Iterator[str]:
"""
A much faster reverse read of file by using Python's mmap to generate a
memory-mapped file. It is slower for very small files than
Expand All @@ -63,108 +126,154 @@ def reverse_readfile(filename: Union[str, Path]) -> Generator[str, str, None]:
Yields:
Lines from the file in reverse order.
"""
try:
with zopen(filename, "rb") as file:
if isinstance(file, (gzip.GzipFile, bz2.BZ2File)):
for line in reversed(file.readlines()):
yield line.decode("utf-8").rstrip(os.linesep)
else:
# Get line ending
l_end = _get_line_ending(filename)
len_l_end = len(l_end)

with zopen(filename, "rb") as file:
if isinstance(file, (gzip.GzipFile, bz2.BZ2File)):
for line in reversed(file.readlines()):
# "readlines" would keep the line end character
yield line.decode("utf-8")

DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
else:
try:
filemap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
n = len(filemap)
while n > 0:
i = filemap.rfind(os.linesep.encode(), 0, n)
yield filemap[i + 1 : n].decode("utf-8").rstrip(os.linesep)
n = i
except ValueError:
warnings.warn("trying to mmap an empty file.", stacklevel=2)
return
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved

except ValueError:
return
file_size = len(filemap)
while file_size > 0:
# Find line segment start and end positions
seg_start_pos = filemap.rfind(l_end.encode(), 0, file_size)
seg_end_pos = file_size + len_l_end

# The first (originally) line doesn't have an ending character at its head
if seg_start_pos == -1:
yield (filemap[:seg_end_pos].decode("utf-8"))

# Skip the first match (the last line ending character)
elif file_size != len(filemap):
yield (
filemap[seg_start_pos + len_l_end : seg_end_pos].decode("utf-8")
)
file_size = seg_start_pos


def reverse_readline(
m_file, blk_size: int = 4096, max_mem: int = 4000000
) -> Generator[str, str, None]:
m_file: io.BufferedReader | io.TextIOWrapper | gzip.GzipFile | bz2.BZ2File,
blk_size: int = 4096,
max_mem: int = 4_000_000,
) -> Iterator[str]:
"""
Generator function to read a file line-by-line, but backwards.
This allows one to efficiently get data at the end of a file.
Read a file backwards line-by-line, and behave similarly to
the file.readline function. This allows one to efficiently
get data from the end of a file.

Read file forwards and reverse in memory for files smaller than the
max_mem parameter, or for gzip files where reverse seeks are not supported.
Supported file stream formats:
- TextIOWrapper (text mode) | BufferedReader (binary mode)
- gzip/bzip2 file stream

Files larger than max_mem are dynamically read backwards.
Cases where file would be read forwards and reversed in RAM:
- If file size is smaller than RAM usage limit (max_mem).
- Gzip files, as reverse seeks are not supported.

Reference:
Based on code by Peter Astrand <[email protected]>, using modifications
by Raymond Hettinger and Kevin German.
http://code.activestate.com/recipes/439045-read-a-text-file-backwards
-yet-another-implementat/
Based on code by Peter Astrand <[email protected]>, using
modifications by Raymond Hettinger and Kevin German.
http://code.activestate.com/recipes/439045-read-a-text-
file-backwards-yet-another-implementat/

Args:
m_file (File): File stream to read (backwards)
blk_size (int): The buffer size. Defaults to 4096.
max_mem (int): The maximum amount of memory to involve in this
operation. This is used to determine when to reverse a file
in-memory versus seeking portions of a file. For bz2 files,
this sets the maximum block size.
m_file: File stream to read (backwards).
blk_size (int): The block size to read each time in bytes.
Defaults to 4096.
max_mem (int): Threshold to determine when to reverse a file
in-memory versus reading blocks of a file each time.
For bz2 files, this sets the block size.

Returns:
Generator that yields lines from the file. Behave similarly to the
file.readline() function, except the lines are returned from the back
of the file.
Yields:
Lines from the back of the file.

Raises:
TypeError: If m_file is the name of the file (expect file stream).

Warnings:
If max_mem is smaller than blk_size.
"""
# Check if the file stream is a bit stream or not
is_text = isinstance(m_file, io.TextIOWrapper)

try:
file_size = os.path.getsize(m_file.name)
except AttributeError:
# Bz2 files do not have name attribute. Just set file_size to above
# max_mem for now.
file_size = max_mem + 1

# If the file size is within our desired RAM use, just reverse it in memory
# GZip files must use this method because there is no way to negative seek
# For windows, we also read the whole file.
if file_size < max_mem or isinstance(m_file, gzip.GzipFile) or os.name == "nt":
# Check for illegal usage
if isinstance(m_file, (str, Path)):
raise TypeError("expect a file stream, not file name")

# Generate line ending
l_end: Literal["\r\n", "\n"] = _get_line_ending(m_file)
len_l_end: Literal[1, 2] = cast(Literal[1, 2], len(l_end))

# Bz2 files do not have "name" attribute, just set to max_mem for now
if hasattr(m_file, "name"):
file_size: int = os.path.getsize(m_file.name)
else:
file_size = max_mem

# If the file size is within desired RAM limit, just reverse it in memory.
# Gzip files must use this method because there is no way to negative seek.
if file_size < max_mem or isinstance(m_file, gzip.GzipFile):
for line in reversed(m_file.readlines()):
yield line.rstrip()
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
yield line if isinstance(line, str) else cast(bytes, line).decode("utf-8")

else:
# RAM limit should be greater than block size,
# as file is read into RAM one block each time.
if max_mem < blk_size:
warnings.warn(f"{max_mem=} smaller than {blk_size=}", stacklevel=2)

# For bz2 files, seek is expensive. It is therefore in our best
# interest to maximize the block size within RAM usage limit.
if isinstance(m_file, bz2.BZ2File):
# for bz2 files, seeks are expensive. It is therefore in our best
# interest to maximize the blk_size within limits of desired RAM
# use.
blk_size = min(max_mem, file_size)

buf = ""
m_file.seek(0, 2)
lastchar = m_file.read(1) if is_text else m_file.read(1).decode("utf-8")
# Check if the file stream is text (instead of binary)
is_text: bool = isinstance(m_file, io.TextIOWrapper)

trailing_newline = lastchar == os.linesep
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
buffer: str = ""
m_file.seek(0, 2)
skipped_1st_l_end: bool = False

while True:
newline_pos = buf.rfind(os.linesep)
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
pos = m_file.tell()
if newline_pos != -1:
# Found a newline
line = buf[newline_pos + 1 :]
buf = buf[:newline_pos]
if pos or newline_pos or trailing_newline:
line += os.linesep
yield line

elif pos:
# Need to fill buffer
toread = min(blk_size, pos)
m_file.seek(pos - toread, 0)
l_end_pos: int = buffer.rfind(l_end)
# Pointer position (also size of remaining file)
pt_pos: int = m_file.tell()

# Line ending found within buffer
if l_end_pos != -1:
line = buffer[l_end_pos + len_l_end :]
buffer = buffer[:l_end_pos] # buffer doesn't include l_end

# Skip first match (the last line ending)
if skipped_1st_l_end:
yield line + l_end
else:
skipped_1st_l_end = True

# Line ending not in current buffer, load next block into the buffer
elif pt_pos > 0:
to_read: int = min(blk_size, pt_pos)
m_file.seek(pt_pos - to_read)
if is_text:
buf = m_file.read(toread) + buf
buffer = cast(str, m_file.read(to_read)) + buffer
else:
buf = m_file.read(toread).decode("utf-8") + buf
m_file.seek(pos - toread, 0)
if pos == toread:
buf = os.linesep + buf
buffer = cast(bytes, m_file.read(to_read)).decode("utf-8") + buffer

# Move pointer forward
m_file.seek(pt_pos - to_read)

else:
# Start-of-file
# Add a l_end to the start of file
if pt_pos == to_read:
buffer = l_end + buffer

# Start of file
else: # l_end_pos == -1 and pt_post == 0
return


Expand Down Expand Up @@ -266,8 +375,7 @@ def get_open_fds() -> int:
"""
Get the number of open file descriptors for current process.

Warnings:
Will only work on UNIX-like OS-es.
Warning, this will only work on UNIX-like OS.

Returns:
int: The number of open file descriptors for current process.
Expand Down
2 changes: 1 addition & 1 deletion tests/test_files/3000_lines.txt
Original file line number Diff line number Diff line change
Expand Up @@ -2997,4 +2997,4 @@
2997
2998
2999
3000
3000
Binary file modified tests/test_files/3000_lines.txt.bz2
Binary file not shown.
Binary file modified tests/test_files/3000_lines.txt.gz
Binary file not shown.
Binary file removed tests/test_files/3000lines.txt.gz
Binary file not shown.
Loading
Loading