Modern Perl bindings to libarchive
use 5.020;
use Archive::Libarchive qw( :const );
my $r = Archive::Libarchive::ArchiveRead->new;
$r->support_filter_all;
$r->support_format_all;
$r->open_filename("archive.tar", 10240) == ARCHIVE_OK
or die $r->error_string;
my $e = Archive::Libarchive::Entry->new;
say $e->pathname while $r->next_header($e) == ARCHIVE_OK;
This module provides a Perl object-oriented interface to the libarchive library. The libarchive
library is the API used to implement bsdtar, the default tar implementation on a number of operating
systems, including FreeBSD, macOS and Windows. It can also be installed on most Linux distributions.
But wait, there is more: libarchive supports a number of formats, compressors and filters
transparently, so it can be useful as a universal archiver/extractor. Supported formats include:
- various tar formats, including the oldest forms and the newest extensions
- zip
- ISO 9660 (CD-ROM image format)
- gzip
- bzip2
- uuencoded files
- shell archive (shar)
- ... and many many more
There are a number of "simple" interfaces around this distribution, which are worth considering if you do not need the full power and configurability that this distribution provides.
-
Archive::Libarchive::Peek
Provides an interface for listing and retrieving entries from an archive without extracting them to the local filesystem.
-
Archive::Libarchive::Extract
Provides an interface for extracting arbitrary archives of any format/filter supported by libarchive.
-
Archive::Libarchive::Unwrap
Decompresses / unwraps files that have been compressed or wrapped in any of the filter formats supported by libarchive.
This distribution is split up into several classes that correspond to libarchive classes. Probably
the best place to start when learning how to use this module is the "EXAMPLES" section below, but you
can also look at the main class documentation for the operation that you are interested in:
-
Archive => Archive::Libarchive::ArchiveRead
Class for reading from archives.
-
Archive => Archive::Libarchive::ArchiveWrite
Class for creating new archives.
-
Archive => ArchiveRead => Archive::Libarchive::DiskRead
Class for reading file entries from a local filesystem.
-
Archive => ArchiveWrite => Archive::Libarchive::DiskWrite
Class for writing file entries to a local filesystem.
-
Archive::Libarchive::Entry
Class representing file metadata of a file inside an archive, or in the local filesystem.
-
Archive::Libarchive::EntryLinkResolver
This is the libarchive link resolver API.
-
Archive => Archive::Libarchive::Match
This is the libarchive match API.
This module attempts to provide comprehensive bindings to the libarchive library. For more details on
the history and alternatives to this project see the "HISTORY" section below. All recent versions of
libarchive should be supported, although some methods are only available when you have the most recent
version of libarchive installed. For methods not available on older versions please consult
Archive::Libarchive::API, which will list these methods as (optional). If you need to support
older versions of libarchive while still exploiting the newer methods on newer versions, you can use
the can method to check if they are available. If you need the latest version of libarchive, and
your system provides an older version, then you can force a share install of Alien::Libarchive3:
env ALIEN_INSTALL_TYPE=share cpanm Alien::Libarchive3
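The availability check described above might look like the following minimal sketch. It uses archive_libzstd_version, which this document marks as (optional), and it assumes that can resolves top-level functions in the Archive::Libarchive package the same way it resolves methods on the classes.

```perl
use 5.020;
use Archive::Libarchive;

my $message;

# can returns a code reference when the function exists in this build,
# or a false value when the installed libarchive is too old to provide it.
if(my $code = Archive::Libarchive->can('archive_libzstd_version')) {
  my $version = $code->();
  $message = defined $version
    ? "zstd version: $version"
    : "libarchive was built without zstd";
} else {
  $message = "this libarchive does not provide archive_libzstd_version";
}

say $message;
```

The same pattern should apply to optional methods on the class instances, for example by calling can on an Archive::Libarchive::ArchiveRead object before using a method listed as (optional).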
The main functionality of this module is implemented in the classes listed above, but this module
also provides a few top level non-object-oriented functions. These functions are not exported
by default, but they can be requested using the usual Exporter interface, either individually, or
with the :func or :all tags (the latter will also import constants).
# archive_bzlib_version
my $string = archive_bzlib_version();
The bzlib version that libarchive was built with. This will return undef if the library was
not found at build time.
# archive_liblz4_version
my $string = archive_liblz4_version();
The liblz4 version that libarchive was built with. This will return undef if the library was
not found at build time.
# archive_liblzma_version
my $string = archive_liblzma_version();
The liblzma version that libarchive was built with. This will return undef if the library was
not found at build time.
# archive_libzstd_version (optional)
my $string = archive_libzstd_version();
The zstd version that libarchive was built with. This will return undef if the library was
not found at build time.
# archive_version_details
my $string = archive_version_details();
Detailed textual name/version of the library and its dependencies. This has the form:
libarchive x.y.z zlib/a.b.c liblzma/d.e.f ... etc ...
The list of libraries described here will vary depending on how libarchive was compiled.
# archive_version_number
my $int = archive_version_number();
The libarchive version expressed as an integer. This will be the major, minor and patch
levels each using up to three digits, so 3.5.1 will be 3005001.
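As a small illustration of that encoding, the integer can be split back into its components with plain arithmetic. The helper name split_version_number below is hypothetical and not part of Archive::Libarchive; the sketch uses the 3.5.1 example from the text rather than calling the library.

```perl
use 5.020;

# Split a version integer of the form returned by archive_version_number
# into its major, minor and patch components (three digits each).
sub split_version_number {
  my($number) = @_;
  my $major = int($number / 1_000_000);
  my $minor = int($number / 1_000) % 1_000;
  my $patch = $number % 1_000;
  return ($major, $minor, $patch);
}

# 3005001 is the encoding of 3.5.1 given above
say join '.', split_version_number(3005001);
```

A comparison such as archive_version_number() >= 3005000 works directly on the integer, so splitting is only needed for display.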
# archive_version_string
my $string = archive_version_string();
The libarchive version as a string.
# archive_zlib_version
my $string = archive_zlib_version();
The zlib version that libarchive was built with. This will return undef if the library was
not found at build time.
# versions
my %versions = Archive::Libarchive->versions();
This returns a hash of libarchive and Archive::Libarchive versions and dependency versions. This
may be useful in a test report diagnostic.
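A short sketch of how these functions might be used together, assuming the :func export tag described above. The keys of the hash returned by versions are not listed in this document, so the sketch simply prints whatever it contains.

```perl
use 5.020;
use Archive::Libarchive qw( :func );

# Top level functions imported through the :func tag
say "libarchive: ", archive_version_string();
say "zlib:       ", archive_zlib_version() // 'not available';

# Dump every entry of the versions hash, for example in a test diagnostic
my %versions = Archive::Libarchive->versions;
for my $name (sort keys %versions) {
  say "$name => ", $versions{$name} // 'undef';
}
```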
These examples are translated from the libarchive C examples, which can be found on the libarchive project wiki (https://github.com/libarchive/libarchive/wiki).
The main Archive::Libarchive API is based around two basic types of classes. The Archive::Libarchive::Archive class serves as a basis for all archive objects. The Archive::Libarchive::Entry class represents the header or metadata for files stored inside an archive (or as we will see later, files on disk).
The basic life cycle of an archive instance is:
-
Create one using its new constructor
The constructor does not take any arguments; instead you will configure it in the next step.
-
Configure it using "support" or "set" calls
Support calls allow Archive::Libarchive to decide when to use a feature; "set" calls enable the feature unconditionally.
-
"Open" a particular data source
This can be using callbacks for a custom source, or one of the pre-canned data sources supported directly by Archive::Libarchive.
-
Iterate over the contents
Ask alternately for "header" or entry/file metadata (which is represented by an Archive::Libarchive::Entry instance), and entry/file content.
-
Finish by calling "close"
This will be called automatically if the archive instance falls out of scope.
Writing an archive is very similar, except that you provide the "header" and content data to Archive::Libarchive instead of asking for them.
Here is a very basic example that simply opens a file and lists the contents of the archive.
use 5.020;
use Archive::Libarchive qw( ARCHIVE_OK );
my $r = Archive::Libarchive::ArchiveRead->new;
$r->support_filter_all;
$r->support_format_all;
my $ret = $r->open_filename("archive.tar", 10240);
if($ret != ARCHIVE_OK) {
exit 1;
}
my $e = Archive::Libarchive::Entry->new;
while($r->next_header($e) == ARCHIVE_OK) {
say $e->pathname;
$r->read_data_skip;
}
Note that the open_filename method inspects the file before deciding how to handle the block size. If the filename provided refers to a tape device, for example, it will use exactly the block size you specify. For other devices, it may adjust the requested block size in order to obtain better performance.
Note that the call to read_data_skip here is not actually necessary, since Archive::Libarchive will invoke it automatically if you request the next header without reading the data for the last entry.
The module Archive::Libarchive::Peek also provides similar functionality to this example in a simple, less powerful interface.
There are several variants of the open methods. The "filename" variant used above is intended to be simple to use in the common case of reading a file from disk, but you may find the "memory" variant useful in other cases.
use 5.020;
use Path::Tiny qw( path );
use Archive::Libarchive qw( ARCHIVE_OK );
my $r = Archive::Libarchive::ArchiveRead->new;
$r->support_filter_all;
$r->support_format_all;
my $buffer = path('archive.tar')->slurp_raw;
my $ret = $r->open_memory(\$buffer);
if($ret != ARCHIVE_OK) {
exit 1;
}
my $e = Archive::Libarchive::Entry->new;
while($r->next_header($e) == ARCHIVE_OK) {
say $e->pathname;
$r->read_data_skip;
}
There are also variants to read from an already-opened file descriptor, a libc FILE pointer, or a Perl file handle.
Sometimes, none of the packaged open methods will work for you. In that case, you can use the lower-level open
method, which accepts a number of callbacks. For this example we will use the open, read and close callbacks.
use 5.020;
use Archive::Libarchive qw( :const );
my $r = Archive::Libarchive::ArchiveRead->new;
$r->support_filter_all;
$r->support_format_all;
my $fh;
$r->open(
open => sub {
open $fh, '<', 'archive.tar' or return ARCHIVE_FATAL;
binmode $fh;
return ARCHIVE_OK;
},
read => sub {
my(undef, $ref) = @_;
my $size = read $fh, $$ref, 512;
return $size;
},
close => sub {
close $fh;
return ARCHIVE_OK;
},
) == ARCHIVE_OK or die $r->error_string;
my $e = Archive::Libarchive::Entry->new;
while(1) {
my $ret = $r->next_header($e);
last if $ret == ARCHIVE_EOF;
die $r->error_string if $ret < ARCHIVE_WARN;
warn $r->error_string if $ret != ARCHIVE_OK;
say $e->pathname;
}
$r->close;
For the full power of read callbacks see the open method's documentation.
When writing to an archive the Archive::Libarchive::ArchiveWrite class also has its own open method and callbacks.
The "raw" format handler treats arbitrary binary input as a single-element archive. This allows you to get the
output of a libarchive filter chain, including files with multiple encodings, such as gz.uu files:
use 5.020;
use Archive::Libarchive;
my $r = Archive::Libarchive::ArchiveRead->new;
$r->support_filter_all;
$r->support_format_raw;
$r->open_filename("hello.txt.uu");
$r->next_header(Archive::Libarchive::Entry->new);
my $buffer;
while($r->read_data(\$buffer) > 0) {
print $buffer;
}
$r->close;
Note that the "raw" format is not enabled by the support_format_all method on Archive::Libarchive::ArchiveRead. Also note that the "raw" format handler does not recognize or accept empty files. If you specifically want to be able to read empty files, you'll need to also invoke the support_format_empty method on Archive::Libarchive::ArchiveRead.
The module Archive::Libarchive::Unwrap also provides similar functionality to this example in a simple, less powerful interface.
The following is a very simple example of using Archive::Libarchive to write a group of files into a tar archive. This is a little more complex than the read examples above because the write example actually does something with the file bodies.
use 5.020;
use Archive::Libarchive;
use Path::Tiny qw( path );
my $w = Archive::Libarchive::ArchiveWrite->new;
$w->set_format_pax_restricted;
$w->open_filename("outarchive.tar");
path('.')->visit(sub {
my $path = shift;
return if $path->is_dir;
my $e = Archive::Libarchive::Entry->new;
$e->set_pathname("$path");
$e->set_size(-s $path);
$e->set_filetype('reg');
$e->set_perm( oct('0644') );
$w->write_header($e);
$w->write_data(\$path->slurp_raw);
}, { recurse => 1 });
$w->close;
Note that:
-
filetype
The filetype methods take either a string code, or an integer constant with the AE_IF prefix. When returning a filetype code, they will return a dualvar with both. The code reg/AE_IFREG is the code for a regular file (not a directory, symlink or other special filetype).
-
gzip
If you wanted to write a gzipped tar archive, you would just add a call to the add_filter_gzip method on Archive::Libarchive::ArchiveWrite, and append .gz to the output filename.
-
pax restricted
The "pax restricted" format is a tar format that uses pax extensions only when absolutely necessary. Most of the time, it will write plain ustar entries. This is the recommended tar format for most uses. You should explicitly use ustar format only when you have to create archives that will be readable on older systems; you should explicitly request pax format only when you need to preserve as many attributes as possible.
-
reusing entry instance
This example creates a fresh Archive::Libarchive::Entry instance for each file. For better performance, you can reuse the same entry instance by using the clear method to erase it after each use.
-
required properties
Size, file type and pathname are all required properties here. You can also use the copy_stat method to copy all information from file to the archive entry, including file type. To get even more complete information, look at the Archive::Libarchive::DiskRead class, which provides an easy way to get more extensive file metadata (including ACLs and extended attributes on some systems) than using stat. It also works on platforms such as Windows where stat either doesn't exist or is broken.
-
calling close
The close method will be called implicitly when the archive instance falls out of scope. However, the close call returns an error code, which may be useful for catching errors.
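Several of the notes above (the gzip filter, reusing one entry instance, and checking the result of close) can be combined in one sketch. It is only an illustration under a few assumptions: the output name outarchive.tar.gz is arbitrary, only the regular files directly inside the current directory are archived, and the file list is collected before the archive is opened so that the output file is not added to itself.

```perl
use 5.020;
use Archive::Libarchive qw( ARCHIVE_OK );
use Path::Tiny qw( path );

# Collect the input files first, before the output file exists
my @files = grep { $_->is_file } path('.')->children;

my $w = Archive::Libarchive::ArchiveWrite->new;
$w->set_format_pax_restricted;
$w->add_filter_gzip;                      # compress the tar stream with gzip
$w->open_filename("outarchive.tar.gz") == ARCHIVE_OK
  or die $w->error_string;

my $e = Archive::Libarchive::Entry->new;  # one entry instance, reused below
for my $path (@files) {
  my $data = $path->slurp_raw;
  $e->clear;                              # erase the previous file's metadata
  $e->set_pathname("$path");
  $e->set_size(length $data);
  $e->set_filetype('reg');
  $e->set_perm( oct('0644') );
  $w->write_header($e) == ARCHIVE_OK
    or die $w->error_string;
  $w->write_data(\$data);
}

# close returns a status code, so check it rather than relying on scope exit
$w->close == ARCHIVE_OK
  or die $w->error_string;
```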
Archive::Libarchive includes an Archive::Libarchive::DiskWrite class that works very much like Archive::Libarchive::ArchiveWrite, except that it constructs objects on disk, instead of adding them to an archive. This class knows how to construct directories, regular files, symlinks, hard links and other types of disk objects. Here is a very simple example showing how you could use it to create a regular file on disk:
use 5.020;
use Archive::Libarchive qw( :const );
my $dw = Archive::Libarchive::DiskWrite->new;
$dw->disk_set_options(ARCHIVE_EXTRACT_TIME);
my $text = "Hello World!\n";
my $e = Archive::Libarchive::Entry->new;
$e->set_pathname("hello.txt");
$e->set_filetype('reg');
$e->set_size(length $text);
$e->set_mtime(time);
$e->set_perm(oct('0644'));
$dw->write_header($e);
$dw->write_data(\$text);
$dw->finish_entry;
Note that if you set a size in the entry instance, Archive::Libarchive::DiskWrite will enforce that size. If you try to write more than the size set in the entry content, your writes will be truncated; if you write fewer bytes than you promised, the file will be extended with zero bytes.
The pattern above can also be used to reconstruct directories, device nodes, and FIFOs. The same idea also works for restoring symlinks and hardlinks, but you do have to initialize the entry a little differently:
-
symlinks
Symlinks have a file type lnk/AE_IFLNK and require a target to be set with the set_symlink method.
-
hardlinks
Hardlinks require a target to be set with the set_hardlink method; if this is set, the regular filetype is ignored. If the entry describing a hardlink has a size, you must be prepared to write data to the linked files. If you don't want to overwrite the file, leave the size unset.
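For the symlink case, a minimal sketch might look like this. The names hello-link.txt and hello.txt are illustrative only, and the sketch assumes a platform and filesystem where the current user is allowed to create symlinks.

```perl
use 5.020;
use Archive::Libarchive qw( :const );

my $dw = Archive::Libarchive::DiskWrite->new;

my $e = Archive::Libarchive::Entry->new;
$e->set_pathname("hello-link.txt");   # hypothetical name of the new link
$e->set_filetype('lnk');              # string form of AE_IFLNK
$e->set_symlink("hello.txt");         # the path that the link points to

# No size is set and no data is written, since a symlink has no content
my $ret = $dw->write_header($e);
die  $dw->error_string if $ret <  ARCHIVE_WARN;
warn $dw->error_string if $ret != ARCHIVE_OK;

$dw->finish_entry;
$dw->close;
```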
Using the facilities described above, you can extract most archives to disk by simply copying entries from an Archive::Libarchive::ArchiveRead instance to an Archive::Libarchive::DiskWrite instance.
use 5.020;
use Archive::Libarchive qw( :const );
my $tarball = 'archive.tar';
my $r = Archive::Libarchive::ArchiveRead->new;
$r->support_format_all;
$r->support_filter_all;
my $dw = Archive::Libarchive::DiskWrite->new;
$dw->disk_set_options(
ARCHIVE_EXTRACT_TIME | ARCHIVE_EXTRACT_PERM | ARCHIVE_EXTRACT_ACL | ARCHIVE_EXTRACT_FFLAGS
);
$dw->disk_set_standard_lookup;
$r->open_filename($tarball) == ARCHIVE_OK
or die "unable to open $tarball @{[ $r->error_string ]}";
my $e = Archive::Libarchive::Entry->new;
while(1) {
my $ret = $r->next_header($e);
last if $ret == ARCHIVE_EOF;
if($ret < ARCHIVE_OK) {
if($ret < ARCHIVE_WARN) {
die "header read error on $tarball @{[ $r->error_string ]}";
} else {
warn "header read warning on $tarball @{[ $r->error_string ]}";
}
}
$ret = $dw->write_header($e);
if($ret < ARCHIVE_OK) {
if($ret < ARCHIVE_WARN) {
die "header write error on disk @{[ $dw->error_string ]}";
} else {
warn "header write warning disk @{[ $dw->error_string ]}";
}
}
if($e->size > 0)
{
my $buffer;
my $offset;
while(1) {
$ret = $r->read_data_block(\$buffer, \$offset);
last if $ret == ARCHIVE_EOF;
if($ret < ARCHIVE_OK) {
if($ret < ARCHIVE_WARN) {
die "file read error on member @{[ $e->pathname ]} @{[ $r->error_string ]}";
} else {
warn "file read warning on member @{[ $e->pathname ]} @{[ $r->error_string ]}";
}
}
$ret = $dw->write_data_block(\$buffer, $offset);
if($ret < ARCHIVE_OK) {
if($ret < ARCHIVE_WARN) {
die "file write error on member @{[ $e->pathname ]} @{[ $dw->error_string ]}";
} else {
warn "file write warning on member @{[ $e->pathname ]} @{[ $dw->error_string ]}";
}
}
}
}
$ret = $dw->finish_entry;
if($ret < ARCHIVE_OK) {
if($ret < ARCHIVE_WARN) {
die "finish error on disk @{[ $dw->error_string ]}";
} else {
warn "finish warning disk @{[ $dw->error_string ]}";
}
}
}
$r->close;
$dw->close;
You could create an archive by going the other way: copying entries from an Archive::Libarchive::DiskRead instance to an Archive::Libarchive::ArchiveWrite instance.
The module Archive::Libarchive::Extract also provides similar functionality to this example in a simple, less powerful interface.
This module provides all of the constants used by libarchive. These typically
are prefixed either ARCHIVE_ or AE_ and can be imported into your code
individually, or en masse using the :const export tag. They will also be imported
if you use the :all export tag to import everything.
The complete list of available constants is listed in Archive::Libarchive::API.
The most common constants are the status codes returned by most functions:
-
ARCHIVE_EOF
Returned only from next_header and read_data_block from the Archive::Libarchive::ArchiveRead class when you reach the end of a structure.
-
ARCHIVE_OK
The operation completed successfully.
-
ARCHIVE_WARN
The operation completed with some surprises. You may want to report the issue to your user. The error_string method on most classes will return a suitable text message; the errno method on most classes returns an associated system errno value. (Since not all errors are caused by failing system calls, this is not always meaningful).
-
ARCHIVE_FAILED
The operation failed. In particular, this means that further operations on this entry are impossible. This is returned, for example, if you try to write an entry type that's not supported by this archive format. Recovery usually consists of simply going on to the next entry.
-
ARCHIVE_FATAL
The archive object itself is no longer usable, typically because of an I/O failure or memory allocation failure.
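The repeated status checks in the extraction example could be folded into one small helper, sketched below. The name check_status is hypothetical and not part of Archive::Libarchive. The sketch follows the convention used in the examples above, where anything below ARCHIVE_WARN is treated as fatal; code that wants to skip an entry on ARCHIVE_FAILED and continue would need to handle that case separately.

```perl
use 5.020;
use Archive::Libarchive qw( ARCHIVE_OK ARCHIVE_WARN );

# check_status is a hypothetical helper. $object is any archive instance
# that provides error_string, and $what is a short label for the message.
sub check_status {
  my($ret, $object, $what) = @_;
  return $ret if $ret >= ARCHIVE_OK;         # ARCHIVE_OK or ARCHIVE_EOF
  my $message = "$what: @{[ $object->error_string ]}";
  die "$message\n" if $ret < ARCHIVE_WARN;   # ARCHIVE_FAILED or ARCHIVE_FATAL
  warn "$message\n";                         # a warning: report and carry on
  return $ret;
}
```

With such a helper, a call like check_status($r->next_header($e), $r, 'header read') returns the status so the caller can still test it against ARCHIVE_EOF.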
I started working with libarchive in order to experiment with FFI. To that end I implemented bindings
for libarchive using both XS and FFI to compare and contrast the process. It was the basis for my
first YAPC::NA talk back in 2014.
When I was working on the XS and FFI implementations I recognized that some degree of automation would be
required, mainly because libarchive is a C API of hundreds of methods, and new methods are being
added all the time. I also wanted both implementations to use the same test suite, since their interfaces
should be identical. While this work was useful, and I even ended up using both versions in production at
a previous $work, the tools that I chose to automate managing the large number of methods, and the common
test suite made both modules quite difficult to maintain.
I also think the interface that I chose was wrong. I opted to provide a very thin layer over libarchive,
to avoid as much object-oriented overhead as possible. I intended to one day make an object-oriented
layer over this thin layer to make it easier to use, but I never found the time to do this. I think a better
approach would have been to bite the bullet and provide only an object-oriented interface, because the ease of
using a library that automatically frees its pointers when an object falls out of scope is worth the
performance penalty of object-oriented invocation.
I did, however, learn a lot about XS and FFI, and I started to think about what would make FFI easier in Perl. At the time the only viable FFI on CPAN was FFI::Raw, and I contributed a number of enhancements and fixes to that project, and even got it working on Strawberry Perl. But I was starting to crave a better experience writing FFI bindings in Perl.
BULK88 was in the audience for a DC / Baltimore version of my Never Need to Write XS talk and he pointed
me to a feature in XS that would make FFI calls much faster than what was possible in FFI::Raw. Using
the any_ptr it is possible to remove method calls from an FFI interface, which, due to their dynamic
nature, are slower than non-method subroutine calls.
I was losing faith in FFI::Raw being tenable or performant for large APIs, so I gathered up my ideas
of what would make a better FFI experience in Perl and the any_ptr feature that BULK88 had shown me and
I started working on a prototype FFI library. I gave a talk at the Pittsburgh workshop based on the
work of that prototype.
I didn't release that prototype, because I kept hoping that FFI would catch fire and someone else
would write a killer FFI for Perl. Since it didn't seem to be happening I re-worked my prototype
into what eventually became FFI::Platypus. I wrote lots of bindings for Perl using Platypus,
and I always had the idea that I would circle back to my FFI bindings for libarchive
(Archive::Libarchive::FFI) and rework it using Platypus instead of FFI::Raw. The problem is
that the project has since atrophied, and the problems with the dual module and automation
tools that I chose made this not really a viable enterprise.
The next thing that FFI needs in Perl is some good tools to introspect C and generate bindings automatically.
There are lots of challenges in this area. One is that exactly what a function signature means (assuming
you can even introspect that) can be ambiguous. For example a char could either be an 8-bit integer value
(it could even be signed or unsigned depending on architecture) or it could be a single character.
A pointer int * could actually be used by the callee as an array. There are lots of things that are
unsafe about C, and a ton of corner cases because of the way the C pre-processor works, but if we can
surmount these challenges then it would be very useful, because even when two different non-C languages are
trying to talk to each other, they are usually using the C ABI to do it. This sort of drives me crazy
but it is the way the world works, at least today.
I've been working on some low-level tools that I'm hoping we can build on to do some of this introspection.
Const::Introspect::C is able to extract #define constants from a C header file, and Clang::CastXML
uses the castxml project to extract a model of the functions and structs in a C header file. I'm hoping
with a middle layer these modules could be used to write an h2ffi tool similar to h2xs. I've had
a number of false starts writing this middle layer, so I've decided to write some custom introspection
with libarchive, which is a very FFI-friendly library, and one that I am familiar with, but that also
has some interesting challenges and edge cases. I'm hoping this work will help design a more general middle
layer that will be usable for other libraries.
At the same time, I've decided to fix some of the design flaws of my original XS and FFI implementations.
There really isn't a good way of doing this with the original implementations so I'm deprecating them in
favor of this one. I feel confident that the overall experience of using this library should be much
better than using one of the older ones. I also think this one will be more easily maintainable, because
I am using castxml, and I've created a reference build of libarchive using docker, which should
ensure that the code generation is done consistently.
-
Archive::Libarchive::Peek
Provides an interface for listing and retrieving entries from an archive without extracting them to the local filesystem.
-
Archive::Libarchive::Extract
Provides an interface for extracting arbitrary archives of any format/filter supported by libarchive.
-
Archive::Libarchive::Unwrap
Decompresses / unwraps files that have been compressed or wrapped in any of the filter formats supported by libarchive.
-
Archive::Libarchive::API
This contains the full and complete API for all of the Archive::Libarchive classes. Because libarchive has hundreds of methods, the main documentation pages elsewhere only contain enough to be useful, and not to overwhelm.
-
Archive::Libarchive::Archive
The base class of all archive classes. This includes some common error reporting functionality among other things.
-
Archive::Libarchive::ArchiveRead
This class is used for reading from archives.
-
Archive::Libarchive::ArchiveWrite
This class is for creating new archives.
-
Archive::Libarchive::DiskRead
This class is for reading Archive::Libarchive::Entry objects from disk so that they can be written to Archive::Libarchive::ArchiveWrite objects.
-
Archive::Libarchive::DiskWrite
This class is for writing Archive::Libarchive::Entry objects to disk that have been read from Archive::Libarchive::ArchiveRead objects.
-
Archive::Libarchive::Entry
This class represents a file in an archive, or on disk.
-
Archive::Libarchive::EntryLinkResolver
This class exposes the libarchive link resolver API.
-
Archive::Libarchive::Match
This class exposes the libarchive match API.
-
Dist::Zilla::Plugin::Libarchive
Build Dist::Zilla based dist tarballs with libarchive instead of the built in Archive::Tar.
-
Alien::Libarchive3
If a suitable system libarchive can't be found, then this Alien will be installed to provide it.
-
The libarchive project home page.
-
https://github.com/libarchive/libarchive/wiki
The libarchive project wiki.
-
https://github.com/libarchive/libarchive/wiki/ManualPages
Some of the libarchive man pages are listed here.
Graham Ollis
This software is copyright (c) 2021,2022 by Graham Ollis.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.