How to read last line of a gzip file in S3 efficiently? #769
I am trying to read the last line of a gzip file in S3. This is my current approach:

```python
from smart_open import open
# ...
last_line = None
with open(
    f"s3://{source}",
    encoding="utf-8",
    mode="r",
    transport_params=dict(client=s3_client),
) as file:
    for line in file:
        last_line = line
print(last_line)
```

However, since this file is extremely big, it takes a while. I tried seeking backwards from the end instead:

```python
import os

from smart_open import open
# ...
with open(
    "s3://my-bucket/my-file.gz",
    encoding="utf-8",
    mode="r",
    transport_params=dict(client=s3_client),
) as file:
    file.seek(-2, os.SEEK_END)
    while file.read(1) != b'\n':
        file.seek(-2, os.SEEK_CUR)
    last_line = file.readline()
print(last_line)
```

This gives an error, which does make sense to me. I am wondering whether there is a more efficient method? Thanks! 😃
No, you can't do that because of how gzip works. You need to decompress the entire block (in this case, the entire file) in order to be able to access the contents of the block. One work-around is to compress your file so that it consists of multiple gzip blocks, for example:
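Here's a sketch (file names and contents are illustrative): concatenating independently compressed gzip members produces a single valid gzip file, so you can build a multi-block file like this:

```python
import gzip

# Compress three chunks independently and concatenate the results.
# Concatenated gzip members are themselves a valid gzip stream, so
# any standard decompressor will read all three in sequence.
parts = [b"first part\n", b"second part\n", b"last line\n"]
with open("file.gz", "wb") as f:
    for part in parts:
        f.write(gzip.compress(part))

# Python's gzip module decompresses all members back-to-back:
print(gzip.decompress(open("file.gz", "rb").read()).decode())
```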
file.gz now consists of three blocks. You can seek to the start of each block and begin decompressing from there.
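To make the seek-and-decompress idea concrete, here's a local sketch (file names are illustrative; with S3 you would record the member offsets when writing the object, e.g. in a small index file):

```python
import gzip

# Write three independent gzip members, recording the byte offset
# where each member starts.
parts = [b"first\n", b"second\n", b"last line\n"]
offsets = []
with open("multi.gz", "wb") as f:
    for part in parts:
        offsets.append(f.tell())
        f.write(gzip.compress(part))

# Reading the last line now only touches the final member: seek
# directly to its offset and decompress that block alone, instead
# of decompressing the whole file.
with open("multi.gz", "rb") as f:
    f.seek(offsets[-1])
    last_block = gzip.decompress(f.read())
print(last_block.splitlines()[-1].decode())  # prints "last line"
```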