How to read last line of a gzip file in S3 efficiently? #769
I am trying to read the last line of a gzip file in S3. This is my current approach:

```python
from smart_open import open
# ...
last_line = None
with open(
    f"s3://{source}",
    encoding="utf-8",
    mode="r",
    transport_params=dict(client=s3_client),
) as file:
    for line in file:
        last_line = line
print(last_line)
```

However, since this file is extremely big, it takes a while. I tried seeking backwards from the end instead:

```python
import os

from smart_open import open
# ...
with open(
    "s3://my-bucket/my-file.gz",
    encoding="utf-8",
    mode="r",
    transport_params=dict(client=s3_client),
) as file:
    file.seek(-2, os.SEEK_END)
    while file.read(1) != b'\n':
        file.seek(-2, os.SEEK_CUR)
    last_line = file.readline()
print(last_line)
```

This gives an error, which does make sense to me. I am wondering whether there is a more efficient method? Thanks! 😃
No, you can't do that because of how gzip works. You need to decompress the entire block (in this case, the entire file) in order to be able to access the contents of the block. One work-around is to compress your file so that it consists of multiple gzip blocks, for example:
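Here's a sketch (file names and contents are illustrative): concatenating independently compressed gzip members produces a single valid gzip file, so you can build a multi-block file like this:

```python
import gzip

# Compress three chunks independently and concatenate the results.
# Concatenated gzip members are themselves a valid gzip stream, so
# any standard decompressor will read all three in sequence.
parts = [b"first part\n", b"second part\n", b"last line\n"]
with open("file.gz", "wb") as f:
    for part in parts:
        f.write(gzip.compress(part))

# Python's gzip module decompresses all members back-to-back:
print(gzip.decompress(open("file.gz", "rb").read()).decode())
```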
file.gz now consists of three blocks. You can seek to the start of each block and begin decompressing from there.
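To make the seek-and-decompress idea concrete, here's a local sketch (file names are illustrative; with S3 you would record the member offsets when writing the object, e.g. in a small index file):

```python
import gzip

# Write three independent gzip members, recording the byte offset
# where each member starts.
parts = [b"first\n", b"second\n", b"last line\n"]
offsets = []
with open("multi.gz", "wb") as f:
    for part in parts:
        offsets.append(f.tell())
        f.write(gzip.compress(part))

# Reading the last line now only touches the final member: seek
# directly to its offset and decompress that block alone, instead
# of decompressing the whole file.
with open("multi.gz", "rb") as f:
    f.seek(offsets[-1])
    last_block = gzip.decompress(f.read())
print(last_block.splitlines()[-1].decode())  # prints "last line"
```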