summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorMichał Górny <mgorny@gentoo.org>2018-11-17 12:17:11 +0100
committerUlrich Müller <ulm@gentoo.org>2018-12-08 10:34:59 +0100
commit2795fb710f678f36e558db708fae7b248914f159 (patch)
tree5fabc190b892804dda20968b25b9d8e86a4e5fa2 /glep-0078.rst
parentglep-0075: Fix a typo. (diff)
downloadglep-2795fb710f678f36e558db708fae7b248914f159.tar.gz
glep-2795fb710f678f36e558db708fae7b248914f159.tar.bz2
glep-2795fb710f678f36e558db708fae7b248914f159.zip
glep-0078: GLEP draft, 'Gentoo binary package container format'
Signed-off-by: Michał Górny <mgorny@gentoo.org> Signed-off-by: Ulrich Müller <ulm@gentoo.org> Bug: https://bugs.gentoo.org/672672
Diffstat (limited to 'glep-0078.rst')
-rw-r--r--glep-0078.rst575
1 files changed, 575 insertions, 0 deletions
diff --git a/glep-0078.rst b/glep-0078.rst
new file mode 100644
index 0000000..edb4129
--- /dev/null
+++ b/glep-0078.rst
@@ -0,0 +1,575 @@
+---
+GLEP: 78
+Title: Gentoo binary package container format
+Author: Michał Górny <mgorny@gentoo.org>
+Type: Standards Track
+Status: Draft
+Version: 1
+Created: 2018-11-15
+Last-Modified: 2018-11-30
+Post-History: 2018-11-17
+Content-Type: text/x-rst
+---
+
+Abstract
+========
+
+This GLEP proposes a new binary package container format for Gentoo.
+The current tbz2/XPAK format is shortly described, and its deficiences
+are explained. Accordingly, the requirements for a new format are set
+and a gpkg format satisfying them is proposed. The rationale for
+the design decisions is provided.
+
+
+Motivation
+==========
+
+The current Portage binary package format
+-----------------------------------------
+
+The historical ``.tbz2`` binary package format used by Portage is
+a concatenation of two distinct formats: header-oriented compressed .tar
+format (used to hold package files) and trailer-oriented custom XPAK
+format (used to hold metadata) [#MAN-XPAK]_. The format has already
+been extended incompatibly twice.
+
+The first time, support for storing multiple successive builds of binary
+package for a single ebuild version has been added. This feature relies
+on appending additional hyphen, followed by an integer to the package
+filename. It is disabled by default (preserving backwards
+compatibility) and controlled by ``binpkg-multi-instance`` feature.
+
+The second time, support for additional compression formats has been
+added. When format other than bzip2 is used, the ``.tbz2`` suffix
+is replaced by ``.xpak`` and Portage relies on magic bytes to detect
+compression used. For backwards compatibility, Portage still defaults
+to using bzip2; compression program can be switched using
+``BINPKG_COMPRESS`` configuration variable.
+
+Additionally, there have been minor changes to the stored metadata
+and file storage policies. In particular, behavior regarding
+``INSTALL_MASK``, controllable file compression and stripping has
+changed over time.
+
+
+The advantages of tbz2/XPAK format
+----------------------------------
+
+The tbz2/XPAK format used by Portage has three interesting features:
+
+1. **Each binary package is fully contained within a single file.**
+ While this might seem unnecessary, it makes it easier for the user
+ to transfer binary packages without having to be concerned about
+ finding all the necessary files to transfer.
+
+2. **The binary packages are compatible with regular compressed
+ tarballs, most of the time.** With notable exceptions of historical
+ versions of pbzip2 and the recent zstd compressor, tbz2/XPAK packages
+ can be extracted using regular tar utility with a compressor
+ implementation that discards trailing garbage.
+
+3. **The metadata is uncompressed, and can be efficiently accessed
+ without decompressing package contents.** This includes
+ the possibility of rewriting it (e.g. as a result of package moves)
+ without the necessity of repacking the files.
+
+
+Transparency problem with the current binary package format
+-----------------------------------------------------------
+
+Notwithstanding its advantages, the tbz2/XPAK format has a significant
+design fault that consists of two issues:
+
+1. **The XPAK format is a custom binary format with explicit use
+ of binary-encoded file offsets and field lengths.** As such, it is
+ non-trivial to read or edit without specialized tools. Such tools
+ are currently implemented separately from the package manager,
+ as part of the portage-utils toolkit, written in C [#PORTAGE-UTILS]_.
+
+2. **The tarball compatibility feature relies on obscure feature of
+ ignoring trailing garbage in compressed files**. While this is
+ implemented consistently in most of the compressors, this feature
+ is not really a part of specification but rather traditional
+ behavior. Given that the original reasons for this no longer apply,
+ new compressor implementations are likely to miss support for this.
+
+Both of the issues make the format hard to use without dedicated tools,
+or when the tools misbehave. This impacts the following scenarios:
+
+A. **Using binary packages for system recovery.** In case of serious
+ breakage, it is really preferable that the format depends on as few
+ tools a possible, and especially not on Gentoo-specific tools.
+
+B. **Inspecting binary packages in detail exceeding standard package
+ manager facilities.**
+
+C. **Modifying binary packages in ways not predicted by the package
+ manager authors.** A real-life example of this is working around
+ broken ``pkg_*`` phases which prevent the package from being
+ installed.
+
+
+OpenPGP extensibility problem
+-----------------------------
+
+There are at least three obvious ways in which the current format could
+be extended to support OpenPGP signatures, and each of them has its own
+distinct problem:
+
+1. **Adding a detached signature.** This option is non-intrusive but
+ causes the format to no longer be contained in a single file.
+
+2. **Wrapping the package in OpenPGP message format.** This would use
+ a standard format and make verification and unpacking relatively
+ easy. However, it would break backwards compatibility and add
+ explicit dependency on OpenPGP implementation in order to unpack
+ the package.
+
+3. **Adding OpenPGP signature as extra XPAK member.** This is
+ the clever solution. It implies strengthening the dependency
+ on custom tooling, now additionally necessary to extract
+ the signature and reconstruct the original file to accommodate
+ verification.
+
+
+Goals for a new container format
+--------------------------------
+
+All of the above considered, the new format should combine
+the advantages of the existing format and at the same time address its
+deficiencies whenever possible. Furthermore, since a format replacement
+is taking place it is worthwhile to consider additional goals that could
+be satisfied with little change.
+
+The following obligatory goals have been set for a replacement format:
+
+1. **The packages must remain contained in a single file.** As a matter
+ of user convenience, it should be possible to transfer binary
+ packages without having to use multiple files, and to install them
+ from any location.
+
+2. **The file format must be entirely based on common file formats,
+ respecting best practices, with as little customization as necessary
+ to satisfy the requirements.** The format should be transparent
+ enough to let user inspect and manipulate it without special tooling
+ or detailed knowledge.
+
+3. **The file format must provide support for OpenPGP signatures.**
+ Preferably, it should use standard OpenPGP message formats.
+
+4. **The file format must allow for efficient metadata updates.**
+ In particular, it should be possible to update the metadata without
+ having to recompress package files.
+
+Additionally, the following optional goals have been noted:
+
+A. **The file format should account for easy recognition both through
+ filename and through contents.** Preferably, it should have distinct
+ features making it possible to detect it via file(1).
+
+B. **The file format should provide for partial fetching of binary
+ packages.** It should be possible to easily fetch and read
+ the package metadata without having to download the whole package.
+
+C. **The file format should allow for metadata compression.**
+
+D. **The file format should make future extensions easily possible
+ without breaking backwards compatibility.**
+
+
+Specification
+=============
+
+The container format
+--------------------
+
+The gpkg package container is an uncompressed .tar achive whose filename
+should use ``.gpkg.tar`` suffix.
+
+The archive contains a number of files, stored in a single directory
+whose name should match the basename of the package file. However,
+the implementation must be able to process an archive where
+the directory name is mismatched. There should be no explicit archive
+member entry for the directory.
+
+The package directory contains the following members, in order:
+
+1. The package format identifier file ``gpkg-1`` (required).
+
+2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
+ (optional).
+
+3. The metadata archive ``metadata.tar${comp}``, optionally compressed
+ (required).
+
+4. A signature for the filesystem image archive:
+ ``image.tar${comp}.sig`` (optional).
+
+5. The filesystem image archive ``image.tar${comp}``, optionally
+ compressed (required).
+
+It is recommended that relative order of the archive members is
+preserved. However, implementations must support archives with members
+out of order.
+
+The container may be extended with additional members in the future.
+The implementations should ignore unrecognized members and preserve
+them across package updates.
+
+
+Permitted .tar format features
+------------------------------
+
+The tar archives should use either the POSIX ustar format or a subset
+of the GNU format with the following (optional) extensions:
+
+- long pathnames and long linknames,
+
+- base-256 encoding of large file sizes.
+
+Other extensions should be avoided whenever possible.
+
+
+The package identifier file
+---------------------------
+
+The package identifier file serves the purpose of identifying the binary
+package format and its version.
+
+The implementations must include a package identifier file named
+``gpkg-1``. The filename includes package format version;
+implementations should reject packages which do not contain this file
+as unsupported format.
+
+The file can have any contents. Normally, it should be empty.
+
+Furthermore, this file should be included in the .tar archive
+as the first member. This makes it possible to use it as an additional
+magic at a fixed location that can be used by tools such as file(1)
+to easily distinguish Gentoo binary packages from regular .tar archives.
+
+
+The metadata archive
+--------------------
+
+The metadata archive stores the package metadata needed for the package
+manager to process it. The archive should be included at the beginning
+of the binary package in order to make it possible to read it out of
+partially fetched binary package, and to avoid fetching the remaining
+part of the package if not necessary.
+
+The archive contains a single directory called ``metadata``. In this
+directory, the individual metadata keys are stored as files. The exact
+keys and metadata format is outside the scope of this specification.
+
+The package manager may need to modify the package metadata. In this
+case, it should replace the metadata archive without having to alter
+other package members.
+
+The metadata archive can optionally be compressed. It can also be
+supplemented with a detached OpenPGP signature.
+
+
+The image archive
+-----------------
+
+The image archive stores all the files to be installed by the binary
+package. It should be included as the last of the files in the binary
+package container.
+
+The archive contains a single directory called ``image``. Inside this
+directory, all package files are stored in filesystem layout, relative
+to the root directory.
+
+The image archive can optionally be compressed. It can also be
+supplemented with a detached OpenPGP signature.
+
+
+Archive member compression
+--------------------------
+
+The archive members outlined above support optional compression using
+one of the compressed file formats supported by the package manager.
+The exact list of compression types is outside the scope of this
+specification.
+
+The implementations must support archive members being uncompressed,
+and must support using different compression types for different files.
+
+When compressing an archive member, the member filename should be
+suffixed using the standard suffix for the particular compressed file
+type (e.g. ``.bz2`` for bzip2 format).
+
+
+OpenPGP member signatures
+-------------------------
+
+The archive members support optional OpenPGP signatures.
+The implementations must allow the user to specify whether OpenPGP
+signatures are to be expected in remotely fetched packages.
+
+If the signatures are expected and the archive member is unsigned, the
+package manager must reject processing it. If the signature does not
+verify, the package manager must reject processing the corresponding
+archive member. In particular, it must not attempt decompressing
+compressed members in those circumstances.
+
+The signatures are created as binary detached OpenPGP signature files,
+with filename corresponding to the member filename with ``.sig`` suffix
+appended.
+
+The exact details regarding creating and verifying signatures, as well
+as maintaining and distributing keys are outside the scope of this
+specification.
+
+
+Rationale
+=========
+
+Package formats used by other distributions
+-------------------------------------------
+
+The research on the new package format included investigating
+the possibility of reusing solutions from other operating system
+distributions. While reusing a foreign package format would be
+interesting, the differences in Gentoo metadata structure would prevent
+any real compatibility. Some degree of compatibility might be achieved
+through adapting the Gentoo metadata, however the costs of such
+a solution would probably outweigh its usefulness.
+
+Debian and its derivates are using the .deb package format. This is
+a nested archive format, with the outer archive being of ar format,
+and containing nested tarballs of control information (metadata)
+and data [#DEB-FORMAT]_.
+
+Red Hat, its derivates and some less related distributions are using
+the RPM format. It is a custom binary format, storing metadata directly
+and using a trailer cpio archive to store package files.
+
+Arch Linux is using xz-compressed tarballs (suffixed ``.pkg.tar.xz``)
+as its binary package format. The tarballs contain package files
+on top-level, with specially named dotfiles used for package metadata.
+OpenPGP signatures are stored as detached ``.sig`` files alongside
+packages.
+
+Exherbo is using the pbins format. In this format, the binary package
+metadata is stored in repository alike ebuilds, and the binary package
+files are stored separately and downloaded alike source tarballs.
+
+
+Nested archive format
+---------------------
+
+The basic problem in designing the new format was how to embed multiple
+data streams (metadata, image) into a single file. Traditionally, this
+has been done via using two non-conflicting file formats. However,
+while such a solution is clever, it suffers in terms of transparency.
+
+Therefore, it has been established that the new format should really
+consist of a single archive format, with all necessary data
+transparently accessible inside the file. Consequently, it has been
+debated how different parts of binary package data should be stored
+inside that archive.
+
+The proposal to continue storing image data as top-level data
+in the package format, and store metadata as special directory in that
+structure has been discarded as a case of in-band signalling.
+
+Finally, the proposal has been shaped to store different kinds of data
+as nested archives in the outer binary package container. Besides
+providing a clean way of accessing different kinds of information, it
+makes it possible to add separate OpenPGP signatures to them.
+
+
+Inner vs. outer compression
+---------------------------
+
+One of the points in the new format debate was whether the binary
+package as a whole should be compressed vs. compressing individual
+members. The first option may seem as an obvious choice, especially
+given that with a larger data set, the compression may proceed more
+effectively. However, it has a single strong disadvantage: compression
+prevents random access and manipulation of the binary package members.
+
+While for the purpose of reading binary packages, the problem could be
+circumvented through convenient member ordering and avoiding disjoint
+reads of the binary package, metadata updates would either require
+recompressing the whole package (which could be really time consuming
+with large packages) or applying complex techniques such as splitting
+the compressed archive into multiple compressed streams.
+
+This considered, the simplest solution is to apply compression to
+the individual package members, while leaving the container format
+uncompressed. It provides fast random access to the individual members,
+as well as capability of updating them without the necessity of
+recompressing other files in the container.
+
+This also makes it possible to easily protect compressed files using
+standard OpenPGP detached signature format. All this combined,
+the package manager may perform partial fetch of binary package, verify
+the signature of its metadata member and process it without having to
+fetch the potentially-large image part.
+
+
+Container and archive formats
+-----------------------------
+
+During the debate, the actual archive formats to use were considered.
+The .tar format seemed an obvious choice for the image archive since
+it is the only widely deployed archive format that stores all kinds
+of file metadata on POSIX systems. However, multiple options for
+the outer format has been debated.
+
+Firstly, the ZIP format has been proposed as the only commonly supported
+format supporting adding files from stdin (i.e. making it possible to
+pipe the inner archives straight into the container without using
+temporary files). However, this format has been clearly rejected
+as both not being present in the system set, and being trailer-based
+and therefore unusable without having to fetch the whole file.
+
+Secondly, the ar and cpio formats were considered. The former is used
+by Debian and its derivative binary packages; the latter is used by Red
+Hat derivatives. Both formats have the advantage of having less
+historical baggage than .tar, and having less overhead. However, both
+are also rather obscure (especially given that ar is actually provided
+by GNU binutils rather than as a stand-alone archiver), considered
+obsolete by POSIX and both have file size limitations smaller than .tar.
+
+Thirdly, SquashFS was another interesting option. Its main advantage is
+transparent compression support and ability to mount as a filesystem.
+However, it has a significant implementation complexity, including mount
+management and necessity of fallback to unsquashfs. Since the image
+needs to be writable for the pre-installation manipulations, using it
+via a mount would additionally require some kind of overlay filesystem.
+Using it as top-level format has no real gain over a pipeline with tar,
+and is certainly less portable. Therefore, there does not seem to be
+a benefit in using SquashFS.
+
+All that considered, it has been decided that there is no purpose
+in using a second archive format in the specification unless it has
+significant advantage to .tar. Therefore, .tar has also been used
+as outer package format, even though it has larger overhead than other
+formats (mostly due to padding).
+
+
+.tar portability issues
+-----------------------
+
+The modern .tar dialects could be considered dirty extensions
+of the original .tar format. Three variants may be considered
+of interest: POSIX ustar, pax (newer POSIX standard) and GNU tar.
+All three formats are supported by GNU tar, whose presence on systems
+used to create binary packages could be relied on. Therefore,
+the portability concerns are related mostly to being able to read
+and modify binary packages in scenarios of GNU tar being unavailable.
+
+For the purpose of this specification, detailed research on portability
+of individual tar features has been conducted. The research concluded:
+
+ Judging by the test results, the most portability could be
+ achieved by:
+
+ - using strict POSIX ustar format whenever possible,
+
+ - using GNU format for long paths (that do not fit in ustar format),
+
+ - using base-256 (+ pax if already used) encoding for large files,
+
+ - using pax (+ octal or base-256) for high-range/precision
+ timestamps and user/group identifiers,
+
+ - using pax attributes for extended metadata and/or volume label.
+
+It has been determined that for the purpose of binary package we really
+only need to be concerned about long paths and huge files. Therefore,
+the above was limited to the three first points and a guideline was
+formed from them.
+
+Debian has a similar guideline for the inner tar of their package
+format [#DEB-FORMAT]_.
+
+
+Member ordering
+---------------
+
+The member ordering is explicitly specified in order to provide for
+trivially reading metadata from partially fetched archives.
+By requiring the metadata archive to be stored before the image archive,
+the package manager may stop fetching after reading it and save
+bandwidth and/or space.
+
+
+Detached OpenPGP signatures
+---------------------------
+
+The use of detached OpenPGP signatures is to provide authenticity checks
+for binary packages. Covering the complete members with signatures
+provide for trivial verification of all metadata and image contents
+respectively, without having to invent custom mechanisms for combining
+them. Covering the compressed archives helps to prevent zipbomb
+attacks. Covering the individual members rather than the whole package
+provides for verification of partially fetched binary packages.
+
+
+Format versioning
+-----------------
+
+The format is versioned through an explicit file, with the version
+stored in the filename. If the format changes incompatibly,
+the filename changes and old implementations do not recognize it
+as a valid package.
+
+Previously, the format tried to avoid an explicit file for this purpose
+and used volume label instead. However, the use of label has been
+renounced due to unforeseen portability issues.
+
+
+Backwards Compatibility
+=======================
+
+The format does not preserve backwards compatibility with the tbz2
+packages. It has been established that preserving compatibility with
+the old format was impossible without making the new format even worse
+than the old one was.
+
+For example, adding any visible members to the tarball would cause
+them to be installed to the filesystem by old Portage versions. Working
+around this would require some kind of awful hacks that would oppose
+the goal of using simple and transparent package format.
+
+
+Reference Implementation
+========================
+
+The proof-of-concept implementation of binary package format converter
+is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
+create packages in the new format for early inspection.
+
+
+References
+==========
+
+.. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
+ packages
+ (https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)
+
+.. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
+ written in C
+ (https://packages.gentoo.org/packages/app-portage/portage-utils)
+
+.. [#DEB-FORMAT] deb(5) — Debian binary package format
+ (https://manpages.debian.org/unstable/dpkg-dev/deb.5.en.html)
+
+.. [#TAR-PORTABILITY] Michał Górny, Portability of tar features
+ (https://dev.gentoo.org/~mgorny/articles/portability-of-tar-features.html)
+
+.. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
+ to gpkg binpkg format
+ (https://github.com/mgorny/xpak2gpkg)
+
+
+Copyright
+=========
+This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
+Unported License. To view a copy of this license, visit
+http://creativecommons.org/licenses/by-sa/3.0/.