Archive::Libarchive

Modern Perl bindings to libarchive

SYNOPSIS

use 5.020;
use Archive::Libarchive qw( :const );

my $r = Archive::Libarchive::ArchiveRead->new;
$r->support_filter_all;
$r->support_format_all;
$r->open_filename("archive.tar", 10240) == ARCHIVE_OK
  or die $r->error_string;

my $e = Archive::Libarchive::Entry->new;
say $e->pathname while $r->next_header($e) == ARCHIVE_OK;

DESCRIPTION

This module provides a Perl object-oriented interface to the libarchive library. The libarchive library is the API used to implement bsdtar, the default tar implementation on a number of operating systems, including FreeBSD, macOS and Windows. It can also be installed on most Linux distributions. But wait, there is more: libarchive supports a number of formats, compressors and filters transparently, so it can be useful as a universal archiver/extractor. Supported formats include:

  • various tar formats, including the oldest forms and the newest extensions
  • zip
  • ISO 9660 (CD-ROM image format)
  • gzip
  • bzip2
  • uuencoded files
  • shell archive (shar)
  • ... and many many more

There are a number of "simple" interfaces around this distribution, which are worth considering if you do not need the full power and configurability that this distribution provides.

  • Archive::Libarchive::Peek

    Provides an interface for listing and retrieving entries from an archive without extracting them to the local filesystem.

  • Archive::Libarchive::Extract

    Provides an interface for extracting arbitrary archives of any format/filter supported by libarchive.

  • Archive::Libarchive::Unwrap

    Decompresses / unwraps files that have been compressed or wrapped in any of the filter formats supported by libarchive.

This distribution is split up into several classes that correspond to libarchive classes. Probably the best place to start when learning how to use this module is the "EXAMPLES" section below, but you can also take a look at the main class documentation for the operation that you are interested in:

This module attempts to provide comprehensive bindings to the libarchive library. For more details on the history and alternatives to this project see the "HISTORY" section below. All recent versions of libarchive should be supported, although some methods are only available when you have the most recent version of libarchive installed. For methods not available on older versions please consult Archive::Libarchive::API, which will list these methods as (optional). If you need to support both older versions of libarchive and exploit the newer methods on newer versions of libarchive you can use the can method to check if they are available. If you need the latest version of libarchive, and your system provides an older version, then you can force a share install of Alien::Libarchive3:

env ALIEN_INSTALL_TYPE=share cpanm Alien::Libarchive3
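
As a rough sketch of the can check described above, the snippet below probes for archive_libzstd_version, which the FUNCTIONS section marks as (optional). It assumes that an optional function is simply absent from the Archive::Libarchive package when the installed libarchive does not provide it; the same pattern should apply to optional methods on the other classes.

```perl
use 5.020;
use Archive::Libarchive;

# can() returns a code reference when the function exists, or a false
# value when this build of libarchive does not provide it.
my $zstd_version = Archive::Libarchive->can('archive_libzstd_version');

if($zstd_version) {
  # The function itself may still return undef if zstd was not found
  # when libarchive was built.
  say 'zstd: ', $zstd_version->() // 'not found at build time';
} else {
  say 'archive_libzstd_version is not provided by this libarchive';
}
```

Calling through the code reference avoids a compile-time dependency on a function that may not exist.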

FUNCTIONS

The main functionality of this module is implemented in the classes listed above, but this module also provides a few top-level non-object-oriented functions. These functions are not exported by default, but they can be requested using the usual Exporter interface, either individually, or with the :func or :all tags (the latter will also import constants).

archive_bzlib_version

# archive_bzlib_version
my $string = archive_bzlib_version();

The bzlib version that libarchive was built with. This will return undef if the library was not found at build time.

archive_liblz4_version

# archive_liblz4_version
my $string = archive_liblz4_version();

The liblz4 version that libarchive was built with. This will return undef if the library was not found at build time.

archive_liblzma_version

# archive_liblzma_version
my $string = archive_liblzma_version();

The liblzma version that libarchive was built with. This will return undef if the library was not found at build time.

archive_libzstd_version

# archive_libzstd_version (optional)
my $string = archive_libzstd_version();

The zstd version that libarchive was built with. This will return undef if the library was not found at build time.

archive_version_details

# archive_version_details
my $string = archive_version_details();

Detailed textual name/version of the library and its dependencies. This has the form:

  • libarchive x.y.z zlib/a.b.c liblzma/d.e.f ... etc ...

The list of libraries described here will vary depending on how libarchive was compiled.

archive_version_number

# archive_version_number
my $int = archive_version_number();

The libarchive version expressed as an integer. This will be the major, minor and patch levels each using up to three digits, so 3.5.1 will be 3005001.
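
To make the encoding concrete, here is a small sketch that splits such an integer back into its parts. The helper name split_version_number is made up for this illustration and is not part of Archive::Libarchive; it only assumes the three-digits-per-component layout described above.

```perl
use 5.020;

# Hypothetical helper (not part of Archive::Libarchive): split a version
# integer into major, minor and patch, three decimal digits per component.
sub split_version_number {
  my($int) = @_;
  my $major = int($int / 1_000_000);
  my $minor = int($int / 1_000) % 1_000;
  my $patch = $int % 1_000;
  return ($major, $minor, $patch);
}

say join '.', split_version_number(3005001);   # prints 3.5.1
```

In practice you would pass the result of archive_version_number() to the helper, or compare the integer directly (for example against 3005000) when all you need is a minimum version check.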

archive_version_string

# archive_version_string
my $string = archive_version_string();

The libarchive version as a string.

archive_zlib_version

# archive_zlib_version
my $string = archive_zlib_version();

The zlib version that libarchive was built with. This will return undef if the library was not found at build time.

versions

my %versions = Archive::Libarchive->versions();

This returns a hash of libarchive and Archive::Libarchive versions and dependency versions. This may be useful in a test report diagnostic.

EXAMPLES

These examples are translated from the libarchive C examples, which can be found here:

List contents of archive stored in file

The main Archive::Libarchive API is based around two basic types of classes. The Archive::Libarchive::Archive class serves as a basis for all archive objects. The Archive::Libarchive::Entry class represents the header or metadata for files stored inside an archive (or as we will see later, files on disk).

The basic life cycle of an archive instance is:

  • Create one using its new constructor

    The constructor does not take any arguments, instead you will configure it in the next step.

  • Configure it using "support" or "set" calls

    Support calls allow Archive::Libarchive to decide when to use a feature; "set" calls enable the feature unconditionally.

  • "Open" a particular data source

    This can be using callbacks for a custom source, or one of the pre-canned data sources supported directly by Archive::Libarchive.

  • Iterate over the contents

    Ask alternately for "header" or entry/file metadata (which is represented by an Archive::Libarchive::Entry instance), and entry/file content.

  • Finish by calling "close"

    This will be called automatically if the archive instance falls out of scope.

Writing an archive is very similar, except that you provide the "header" and content data to Archive::Libarchive instead of asking for them.

Here is a very basic example that simply opens a file and lists the contents of the archive.

use 5.020;
use Archive::Libarchive qw( ARCHIVE_OK );

my $r = Archive::Libarchive::ArchiveRead->new;
$r->support_filter_all;
$r->support_format_all;

my $ret = $r->open_filename("archive.tar", 10240);
if($ret != ARCHIVE_OK) {
  exit 1;
}

my $e = Archive::Libarchive::Entry->new;
while($r->next_header($e) == ARCHIVE_OK) {
  say $e->pathname;
  $r->read_data_skip;
}

Note that the open_filename method inspects the file before deciding how to handle the block size. If the filename provided refers to a tape device, for example, it will use exactly the block size you specify. For other devices, it may adjust the requested block size in order to obtain better performance.

Note that the call to read_data_skip here is not actually necessary, since Archive::Libarchive will invoke it automatically if you request the next header without reading the data for the last entry.

The module Archive::Libarchive::Peek also provides similar functionality to this example in a simple, less powerful interface.

List contents of archive stored in memory

There are several variants of the open methods. The "filename" variant used above is intended to be simple to use in the common case of reading a file from disk, but you may find the "memory" variant useful in other cases.

use 5.020;
use Path::Tiny qw( path );
use Archive::Libarchive qw( ARCHIVE_OK );

my $r = Archive::Libarchive::ArchiveRead->new;
$r->support_filter_all;
$r->support_format_all;

my $buffer = path('archive.tar')->slurp_raw;

my $ret = $r->open_memory(\$buffer);
if($ret != ARCHIVE_OK) {
  exit 1;
}

my $e = Archive::Libarchive::Entry->new;
while($r->next_header($e) == ARCHIVE_OK) {
  say $e->pathname;
  $r->read_data_skip;
}

There are also variants to read from an already-opened file descriptor, a libc FILE pointer, or a Perl file handle.

List contents of archive with custom read functions

Sometimes, none of the packaged open methods will work for you. In that case, you can use the lower-level open method, which accepts a number of callbacks. For this example we will use the open, read and close callbacks.

use 5.020;
use Archive::Libarchive qw( :const );

my $r = Archive::Libarchive::ArchiveRead->new;
$r->support_filter_all;
$r->support_format_all;

my $fh;

$r->open(
  open => sub {
    open $fh, '<', 'archive.tar';
    binmode $fh;
    return ARCHIVE_OK;
  },
  read => sub {
    my(undef, $ref) = @_;
    my $size = read $fh, $$ref, 512;
    return $size;
  },
  close => sub {
    close $fh;
    return ARCHIVE_OK;
  },
) == ARCHIVE_OK or die $r->error_string;

my $e = Archive::Libarchive::Entry->new;
while(1) {
  my $ret = $r->next_header($e);
  last if $ret == ARCHIVE_EOF;
  die $r->error_string if $ret < ARCHIVE_WARN;
  warn $r->error_string if $ret != ARCHIVE_OK;
  say $e->pathname;
}

$r->close;

For the full power of read callbacks, see the open method's documentation.

When writing to an archive the Archive::Libarchive::ArchiveWrite class also has its own open method and callbacks.

A universal decompressor / defilter-er

The "raw" format handler treats arbitrary binary input as a single-element archive. This allows you to get the output of a libarchive filter chain, including files with multiple encodings, such as gz.uu files:

use 5.020;
use Archive::Libarchive;

my $r = Archive::Libarchive::ArchiveRead->new;
$r->support_filter_all;
$r->support_format_raw;
$r->open_filename("hello.txt.uu");
$r->next_header(Archive::Libarchive::Entry->new);

my $buffer;
while($r->read_data(\$buffer) > 0) {
  print $buffer;
}

$r->close;

Note that the "raw" format is not enabled by the support_format_all method on Archive::Libarchive::ArchiveRead. Also note that the "raw" format handler does not recognize or accept empty files. If you specifically want to be able to read empty files, you'll need to also invoke the support_format_empty method on Archive::Libarchive::ArchiveRead.

The module Archive::Libarchive::Unwrap also provides similar functionality to this example in a simple, less powerful interface.

A basic write example

The following is a very simple example of using Archive::Libarchive to write a group of files into a tar archive. This is a little more complex than the read examples above because the write example actually does something with the file bodies.

use 5.020;
use Archive::Libarchive;
use Path::Tiny qw( path );

my $w = Archive::Libarchive::ArchiveWrite->new;
$w->set_format_pax_restricted;
$w->open_filename("outarchive.tar");

path('.')->visit(sub {
  my $path = shift;

  return if $path->is_dir;

  my $e = Archive::Libarchive::Entry->new;
  $e->set_pathname("$path");
  $e->set_size(-s $path);
  $e->set_filetype('reg');
  $e->set_perm( oct('0644') );
  $w->write_header($e);
  $w->write_data(\$path->slurp_raw);

}, { recurse => 1 });

$w->close;

Note that:

  • filetype

    The filetype methods take either a string code, or an integer constant with the AE_IF prefix. When returning a filetype code, they will return a dualvar with both. The code reg / AE_IFREG is the code for a regular file (not a directory, symlink or other special filetype).

  • gzip

    If you wanted to write a gzipped tar archive, you would just add a call to the add_filter_gzip method on Archive::Libarchive::ArchiveWrite, and append .gz to the output filename.

  • pax restricted

    The "pax restricted" format is a tar format that uses pax extensions only when absolutely necessary. Most of the time, it will write plain ustar entries. This is the recommended tar format for most uses. You should explicitly use ustar format only when you have to create archives that will be readable on older systems; you should explicitly request pax format only when you need to preserve as many attributes as possible.

  • reusing entry instance

    This example creates a fresh Archive::Libarchive::Entry instance for each file. For better performance, you can reuse the same entry instance by using the clear method to erase it after each use.

  • required properties

    Size, file type and pathname are all required properties here. You can also use the copy_stat method to copy all information from file to the archive entry, including file type. To get even more complete information, look at the Archive::Libarchive::DiskRead class, which provides an easy way to get more extensive file metadata―including ACLs and extended attributes on some systems―than using stat. It also works on platforms such as Windows where stat either doesn't exist or is broken.

  • calling close

    The close method will be called implicitly when the archive instance falls out of scope. However, the close call returns an error code, which may be useful for catching errors.
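
The notes on gzip, reusing an entry instance and calling close can be combined into one sketch. The file names and contents below are made up for the illustration; the example writes two small in-memory strings rather than walking a directory, and it assumes that open_filename and write_header return ARCHIVE_OK on success in the same way as the read calls shown earlier.

```perl
use 5.020;
use Archive::Libarchive qw( ARCHIVE_OK );

# Example data only: file name => content
my %files = (
  'hello.txt' => "Hello World!\n",
  'bye.txt'   => "Goodbye!\n",
);

my $w = Archive::Libarchive::ArchiveWrite->new;
$w->set_format_pax_restricted;
$w->add_filter_gzip;                     # gzip the tar stream
$w->open_filename('outarchive.tar.gz') == ARCHIVE_OK
  or die $w->error_string;

# One entry instance, erased with clear before each use
my $e = Archive::Libarchive::Entry->new;
for my $name (sort keys %files) {
  $e->clear;
  $e->set_pathname($name);
  $e->set_size(length $files{$name});
  $e->set_filetype('reg');
  $e->set_perm( oct('0644') );
  $w->write_header($e) == ARCHIVE_OK
    or die $w->error_string;
  $w->write_data(\$files{$name});
}

# Close explicitly so that a failure while flushing is not lost
$w->close == ARCHIVE_OK
  or die $w->error_string;
```

Whether reusing the entry is measurably faster will depend on how many entries you write; for a handful of files a fresh instance per file is just as reasonable.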

Constructing objects on disk

Archive::Libarchive includes an Archive::Libarchive::DiskWrite class that works very much like Archive::Libarchive::ArchiveWrite, except that it constructs objects on disk, instead of adding them to an archive. This class knows how to construct directories, regular files, symlinks, hard links and other types of disk objects. Here is a very simple example showing how you could use it to create a regular file on disk:

use 5.020;
use Archive::Libarchive qw( :const );

my $dw = Archive::Libarchive::DiskWrite->new;
$dw->disk_set_options(ARCHIVE_EXTRACT_TIME);

my $text = "Hello World!\n";

my $e = Archive::Libarchive::Entry->new;
$e->set_pathname("hello.txt");
$e->set_filetype('reg');
$e->set_size(length $text);
$e->set_mtime(time);
$e->set_mode(oct('0644'));

$dw->write_header($e);
$dw->write_data(\$text);
$dw->finish_entry;

Note that if you set a size in the entry instance, Archive::Libarchive::DiskWrite will enforce that size. If you try to write more than the size set in the entry, your writes will be truncated; if you write fewer bytes than you promised, the file will be extended with zero bytes.

The pattern above can also be used to reconstruct directories, device nodes, and FIFOs. The same idea also works for restoring symlinks and hardlinks, but you do have to initialize the entry a little differently:

  • symlinks

    Symlinks have a file type lnk / AE_IFLNK and require a target to be set with the set_symlink method.

  • hardlinks

    Hardlinks require a target to be set with the set_hardlink method; if this is set, the regular filetype is ignored. If the entry describing a hardlink has a size, you must be prepared to write data to the linked files. If you don't want to overwrite the file, leave the size unset.
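
A minimal sketch of the symlink case might look like the following. The link name and target are example values, the snippet assumes a platform and filesystem that support symlinks, and it assumes write_header returns ARCHIVE_OK on success as in the other examples.

```perl
use 5.020;
use Archive::Libarchive qw( ARCHIVE_OK );

my $dw = Archive::Libarchive::DiskWrite->new;

my $e = Archive::Libarchive::Entry->new;
$e->set_pathname('hello-link.txt');   # the link to create (example name)
$e->set_filetype('lnk');              # lnk / AE_IFLNK marks a symlink
$e->set_symlink('hello.txt');         # the target the link points at

# A symlink has no body, so there is no write_data call here
$dw->write_header($e) == ARCHIVE_OK
  or die $dw->error_string;
$dw->finish_entry;
$dw->close;
```

The target does not need to exist when the link is created, which matches how symlinks behave on most systems.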

A complete extractor

Using the facilities described above, you can extract most archives to disk by simply copying entries from an Archive::Libarchive::ArchiveRead instance to an Archive::Libarchive::DiskWrite instance.

use 5.020;
use Archive::Libarchive qw( :const );

my $tarball = 'archive.tar';

my $r = Archive::Libarchive::ArchiveRead->new;
$r->support_format_all;
$r->support_filter_all;

my $dw = Archive::Libarchive::DiskWrite->new;
$dw->disk_set_options(
  ARCHIVE_EXTRACT_TIME | ARCHIVE_EXTRACT_PERM | ARCHIVE_EXTRACT_ACL | ARCHIVE_EXTRACT_FFLAGS
);
$dw->disk_set_standard_lookup;

$r->open_filename($tarball) == ARCHIVE_OK
  or die "unable to open $tarball @{[ $r->error_string ]}";

my $e = Archive::Libarchive::Entry->new;
while(1) {
  my $ret = $r->next_header($e);
  last if $ret == ARCHIVE_EOF;
  if($ret < ARCHIVE_OK) {
    if($ret < ARCHIVE_WARN) {
      die "header read error on $tarball @{[ $r->error_string ]}";
    } else {
      warn "header read warning on $tarball @{[ $r->error_string ]}";
    }
  }

  $ret = $dw->write_header($e);
  if($ret < ARCHIVE_OK) {
    if($ret < ARCHIVE_WARN) {
      die "header write error on disk @{[ $dw->error_string ]}";
    } else {
      warn "header write warning disk @{[ $dw->error_string ]}";
    }
  }

  if($e->size > 0)
  {
    my $buffer;
    my $offset;
    while(1) {

      $ret = $r->read_data_block(\$buffer, \$offset);
      last if $ret == ARCHIVE_EOF;
      if($ret < ARCHIVE_OK) {
        if($ret < ARCHIVE_WARN) {
          die "file read error on member @{[ $e->pathname ]} @{[ $r->error_string ]}";
        } else {
          warn "file read warning on member @{[ $e->pathname ]} @{[ $r->error_string ]}";
        }
      }

      $ret = $dw->write_data_block(\$buffer, $offset);
      if($ret < ARCHIVE_OK) {
        if($ret < ARCHIVE_WARN) {
          die "file write error on member @{[ $e->pathname ]} @{[ $dw->error_string ]}";
        } else {
          warn "file write warning on member @{[ $e->pathname ]} @{[ $dw->error_string ]}";
        }
      }
    }
  }

  $ret = $dw->finish_entry;
  if($ret < ARCHIVE_OK) {
    if($ret < ARCHIVE_WARN) {
      die "finish error on disk @{[ $dw->error_string ]}";
    } else {
      warn "finish warning disk @{[ $dw->error_string ]}";
    }
  }
}

$r->close;
$dw->close;

You could create an archive by going the other way by copying entries from an Archive::Libarchive::DiskRead instance to an Archive::Libarchive::ArchiveWrite instance.

The module Archive::Libarchive::Extract also provides similar functionality to this example in a simple, less powerful interface.

CONSTANTS

This module provides all of the constants used by libarchive. These typically are prefixed either ARCHIVE_ or AE_ and can be imported into your code individually, or en masse using the :const export tag. They will also be imported if you use the :all export tag to import everything.

The complete list of available constants is listed in Archive::Libarchive::API.

The most common constants are the return or status codes from most functions:

  • ARCHIVE_EOF

    Returned only from next_header and read_data_block on the Archive::Libarchive::ArchiveRead class when you reach the end of a structure.

  • ARCHIVE_OK

    The operation completed successfully.

  • ARCHIVE_WARN

    The operation completed with some surprises. You may want to report the issue to your user. The error_string method on most classes will return a suitable text message; the errno method on most classes returns an associated system errno value. (Since not all errors are caused by failing system calls, this is not always meaningful.)

  • ARCHIVE_FAILED

    The operation failed. In particular, this means that further operations on this entry are impossible. This is returned, for example, if you try to write an entry type that's not supported by this archive format. Recovery usually consists of simply going on to the next entry.

  • ARCHIVE_FATAL

    The archive object itself is no longer usable, typically because of an I/O failure or memory allocation failure.
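
The extractor example repeats the same die-or-warn test after every call. One way to factor that out is a small helper along the lines below. The name check_status is hypothetical and not part of Archive::Libarchive; the sketch assumes the object passed in has an error_string method, and it relies on the same numeric ordering the examples use, where anything below ARCHIVE_WARN is a hard failure.

```perl
use 5.020;
use Archive::Libarchive qw( :const );

# Hypothetical helper (not part of Archive::Libarchive).
#   $ret  - status code returned by a call
#   $obj  - the object the call was made on (needs error_string)
#   $what - short label used in the message
sub check_status {
  my($ret, $obj, $what) = @_;

  # Success and end-of-structure are passed straight back to the caller
  return $ret if $ret == ARCHIVE_OK || $ret == ARCHIVE_EOF;

  my $message = "$what: " . ($obj->error_string // 'unknown error');

  # ARCHIVE_FAILED and ARCHIVE_FATAL sort below ARCHIVE_WARN
  die "$message\n" if $ret < ARCHIVE_WARN;

  # Anything else is a warning: report it and carry on
  warn "$message\n";
  return $ret;
}
```

A call such as check_status($r->next_header($e), $r, 'header read') then returns the status, so the caller can still test for ARCHIVE_EOF to leave its loop. Note that this treats ARCHIVE_FAILED as fatal, as the extractor does; a more forgiving tool could skip to the next entry instead.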

HISTORY

I started working with libarchive in order to experiment with FFI. To that end I implemented bindings for libarchive using both XS and FFI to compare and contrast the process. It was the basis for my first YAPC::NA talk back in 2014.

When I was working on the XS and FFI implementations I recognized that some degree of automation would be required, mainly because libarchive is a C API with hundreds of methods, and new methods are being added all the time. I also wanted both implementations to use the same test suite, since their interfaces should be identical. While this work was useful, and I even ended up using both versions in production at a previous $work, the tools that I chose to automate managing the large number of methods, and the common test suite, made both modules quite difficult to maintain.

I think also the interface that I chose was wrong. I opted to provide a very thin layer over libarchive, to avoid as much object-oriented overhead as possible. I intended to one day make an object-oriented layer over this thin layer to make it easier to use, but I never found the time to do this. I think a better approach would have been to bite the bullet and provide only an object-oriented interface, because the ease of using a library that automatically frees its pointers when an object falls out of scope is worth the performance penalty of object-oriented invocation.

I did, however, learn a lot about XS and FFI, and I started to think about what would make FFI easier in Perl. At the time the only viable FFI on CPAN was FFI::Raw, and I contributed a number of enhancements and fixes to that project, and even got it working on Strawberry Perl. But I was starting to crave a better experience writing FFI bindings in Perl.

BULK88 was in the audience for a DC / Baltimore version of my Never Need to Write XS talk and he pointed me to a feature in XS that would make FFI calls much faster than what was possible in FFI::Raw. Using the any_ptr it is possible to remove method calls from an FFI interface, which, due to their dynamic nature, are slower than non-method subroutine calls.

I was losing faith in FFI::Raw being tenable or performant for large APIs, so I gathered up my ideas of what would make a better FFI experience in Perl and the any_ptr feature that Bulk had shown me and I started working on a prototype FFI library. I gave a talk at the Pittsburgh workshop based on the work of that prototype.

I didn't release that prototype, because I kept hoping that FFI would catch fire and someone else would write a killer FFI for Perl. Since it didn't seem to be happening I re-worked my prototype into what eventually became FFI::Platypus. I wrote lots of bindings for Perl using Platypus, and I always had the idea that I would circle back to my FFI bindings for libarchive (Archive::Libarchive::FFI) and rework it using Platypus instead of FFI::Raw. The problem is that the project has since atrophied, and the problems with the dual module and automation tools that I chose made this not really a viable enterprise.

The next thing that FFI needs in Perl is some good tools to introspect C and generate bindings automatically. There are lots of challenges in this area. One is that exactly what a function signature means (assuming you can even introspect it) can be ambiguous. For example, a char could either be an 8-bit integer value (it could even be signed or unsigned depending on architecture) or it could be a single character. A pointer int * could actually be used by the callee as an array. There are lots of things that are unsafe about C, and a ton of corner cases because of the way the C pre-processor works, but if we can surmount these challenges then it would be very useful, because even when two different non-C languages are trying to talk to each other, they are usually using the C ABI to do it. This sort of drives me crazy but it is the way the world works, at least today.

I've been working on some low-level tools that I'm hoping we can build on to do some of this introspection. Const::Introspect::C is able to extract #define constants from a C header file, and Clang::CastXML uses the castxml project to extract a model of the functions and structs in a C header file. I'm hoping with a middle layer these modules could be used to write a h2ffi tool similar to h2xs. I've had a number of false starts writing this middle layer, so I've decided to write some custom introspection with libarchive, which is a very FFI-friendly library, and one that I am familiar with, but that also has some interesting challenges and edge cases. I'm hoping this work will help design a more general middle layer that will be usable for other libraries.

At the same time, I've decided to fix some of the design flaws of my original XS and FFI implementations. There really isn't a good way of doing this with the original implementations so I'm deprecating them in favor of this one. I feel confident that the overall experience of using this library should be much better than using one of the older ones. I also think this one will be more easily maintainable, because I am using castxml, and I've created a reference build of libarchive using docker, which should ensure that the code generation is done consistently.

SEE ALSO

AUTHOR

Graham Ollis

COPYRIGHT AND LICENSE

This software is copyright (c) 2021,2022 by Graham Ollis.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.