cakelab

Delta Service

Background

Related Work

xproxy: an xdelta-based proxy.

Debian Package Management System

Debian uses a straightforward package management system. It consists of a repository (the Debian Archive) of available packages, usually accessible via HTTP(S); a set of tools (e.g. apt and dpkg) for retrieval and package management on the client machine; and a cache which stores retrieved packages locally.

The Debian Archive provides a list of all available packages (Packages.gz or Packages.xz) with package locations and checksums. Package file names include a version specifier. Package naming and versioning have to follow the rules of the Debian Policy, which allows the client to determine the order of versions for a given package.
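The version ordering mentioned above can be illustrated with a small comparison function. This is a minimal sketch of the ordering rules as described in the Debian Policy (epoch first, then upstream version, then revision; a tilde sorts before everything, letters before other symbols), not the code used by dpkg. The function names are made up for this example.

```python
import re
from functools import cmp_to_key


def _order(ch):
    # Tilde sorts before everything (even the end of the string),
    # letters sort before all other non-digit characters.
    if ch == "~":
        return -1
    if ch.isalpha():
        return ord(ch)
    return ord(ch) + 256


def _compare_fragment(a, b):
    # Alternately compare a non-digit run and a digit run until both are used up.
    while a or b:
        text_a = re.match(r"\D*", a).group()
        text_b = re.match(r"\D*", b).group()
        a, b = a[len(text_a):], b[len(text_b):]
        for i in range(max(len(text_a), len(text_b))):
            oa = _order(text_a[i]) if i < len(text_a) else 0
            ob = _order(text_b[i]) if i < len(text_b) else 0
            if oa != ob:
                return -1 if oa < ob else 1
        num_a = re.match(r"\d*", a).group()
        num_b = re.match(r"\d*", b).group()
        a, b = a[len(num_a):], b[len(num_b):]
        na, nb = int(num_a or 0), int(num_b or 0)
        if na != nb:
            return -1 if na < nb else 1
    return 0


def _split(version):
    # [epoch:]upstream[-revision]; missing epoch and revision count as 0.
    epoch, colon, rest = version.partition(":")
    if not colon:
        epoch, rest = "0", version
    upstream, hyphen, revision = rest.rpartition("-")
    if not hyphen:
        upstream, revision = rest, "0"
    return int(epoch), upstream, revision


def compare_versions(a, b):
    """Return -1, 0 or 1 if version a is older than, equal to or newer than b."""
    ea, ua, ra = _split(a)
    eb, ub, rb = _split(b)
    if ea != eb:
        return -1 if ea < eb else 1
    return _compare_fragment(ua, ub) or _compare_fragment(ra, rb)


versions = ["1.10-1", "1:0.5-1", "1.9-2", "1.10~rc1-1"]
ordered = sorted(versions, key=cmp_to_key(compare_versions))
```

With such a function, a client can decide which cached package is the most recent predecessor of a requested version.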

A Debian binary package is an ar(1) archive which contains two compressed tar(1) archives at the top level: control.tar and data.tar. The control.tar archive contains a set of so-called control files, which provide detailed information about the package, checksums for its files, and scripts that perform necessary tasks during installation or removal of the package. The data.tar archive contains the files that have to be copied to the local file system in order to install the software. A third top-level member, debian-binary, comes first in the archive and contains the version number of the package format (currently 2.0).
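To make the container layout concrete, the following sketch lists the top-level members of an ar(1) archive by walking its fixed 60-byte member headers. It only covers the simple case of short member names, which is what a binary package uses. The helper make_member and the tiny in-memory sample archive are made up for illustration and do not contain real package data.

```python
import io

AR_MAGIC = b"!<arch>\n"


def list_ar_members(stream):
    """Return (name, size) for every top-level member of an ar(1) archive."""
    if stream.read(8) != AR_MAGIC:
        raise ValueError("not an ar archive")
    members = []
    while True:
        header = stream.read(60)
        if len(header) < 60:
            break  # end of archive
        if header[58:60] != b"`\n":
            raise ValueError("corrupt member header")
        # Bytes 0-15 hold the name, bytes 48-57 the decimal size.
        name = header[0:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii").strip())
        members.append((name, size))
        # Member data is padded to an even number of bytes.
        stream.seek(size + (size % 2), io.SEEK_CUR)
    return members


def make_member(name, payload):
    # Hypothetical helper: builds one ar member for the sample below.
    header = f"{name:<16}{0:<12}{0:<6}{0:<6}{'100644':<8}{len(payload):<10}"
    padding = b"\n" if len(payload) % 2 else b""
    return header.encode("ascii") + b"`\n" + payload + padding


sample = (AR_MAGIC
          + make_member("debian-binary", b"2.0\n")
          + make_member("control.tar.xz", b"xxx")   # placeholder bytes
          + make_member("data.tar.xz", b"yy"))      # placeholder bytes
members = list_ar_members(io.BytesIO(sample))
```

For a real package, the same function could be called on a file opened in binary mode.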

The cache (/var/cache/apt/archives) stores packages in the state in which they were received from the source. After retrieval, checksum checks ensure that the packages are intact. Packages usually remain in the cache until a new package version is installed, but they can also be removed manually.
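The integrity check can be sketched as follows: the expected size and checksum come from the package's entry in the package list, and the received bytes are compared against them. The sketch assumes an entry with Size and SHA256 fields; the sample entry and the payload are made up, and the function names are hypothetical.

```python
import hashlib


def parse_stanza(text):
    """Parse one 'Key: value' entry of a package list into a dict."""
    fields = {}
    for line in text.splitlines():
        # Continuation lines (leading whitespace) are ignored in this sketch.
        if line[:1] in (" ", "\t") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def is_intact(data, stanza):
    """Check received package bytes against the listed size and checksum."""
    if len(data) != int(stanza["Size"]):
        return False
    return hashlib.sha256(data).hexdigest() == stanza["SHA256"].lower()


# Made-up payload and entry, for illustration only.
payload = b"not a real package"
entry = parse_stanza(
    "Package: hello\n"
    "Version: 2.10-2\n"
    "Filename: pool/main/h/hello/hello_2.10-2_amd64.deb\n"
    f"Size: {len(payload)}\n"
    f"SHA256: {hashlib.sha256(payload).hexdigest()}\n"
)
ok = is_intact(payload, entry)
tampered = is_intact(payload + b"!", entry)
```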

Public Package Archives

Older versions of Debian packages are publicly available in two major archives: the Debian Snapshot Archive and Ubuntu Launchpad.

Analysis

Architecture Approaches

There are basically two different approaches to consider. Both rely on a so-called Delta Service, which provides patches for the transition from an old package version available in the local cache to a new version of the same package. Those patches can be downloaded and applied to unpacked Debian packages. The result of applying a patch is the unpacked content of the new version of that package. The two architecture approaches differ in how they further process this unpacked content.

Approach 1: Providing Patched and Repackaged Packages

This approach relies on a service on the local machine, which acts as a proxy between the package management tools and the Debian Archive. On a request for a Debian package, the proxy checks whether an old version of the package is available in the cache. If so, it requests a patch from the Delta Service. The older version from the cache is unpacked into a temporary directory and the patch is applied to that directory, resulting in the unpacked content of the new package. This directory is then carefully repacked to create exactly the same package (with the same checksum) as the original package in the Debian Archive. The result is returned to APT, which can then proceed with its usual task.
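The request handling of such a proxy can be sketched as a single decision function. This is only an illustration of the control flow described above, under the assumption that the proxy falls back to a full download whenever the delta path fails or yields a different checksum. All names (serve_package, fetch_patch, rebuild, fetch_full) are hypothetical, and unpacking, patching and repacking are hidden behind the rebuild callable.

```python
import hashlib


def serve_package(name, expected_sha256, old_package,
                  fetch_patch, rebuild, fetch_full):
    """Return (package_bytes, path_taken) for a request from APT.

    old_package: bytes of an older cached version, or None.
    fetch_patch: callable(name) -> patch bytes or None (Delta Service).
    rebuild:     callable(old_bytes, patch) -> repacked package bytes.
    fetch_full:  callable(name) -> package bytes from the archive.
    """
    if old_package is not None:
        patch = fetch_patch(name)
        if patch is not None:
            candidate = rebuild(old_package, patch)
            # Only hand out the rebuilt package if it is byte-identical
            # to the original, as judged by its checksum.
            if hashlib.sha256(candidate).hexdigest() == expected_sha256:
                return candidate, "delta"
    # No old version, no patch, or checksum mismatch: download normally.
    return fetch_full(name), "full"
```

Because the function always returns either a verified rebuilt package or the package from the original source, the package management tools do not need to know that a delta was involved.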

Approach 2: Providing Patched Package Content

In this approach, the local package management tools are extended by so-called Delta Plugins. They intercept package retrieval and redirect it to retrieve appropriate patch files instead. As in the other approach, the patch is applied to a directory with the unpacked content of the old package from the cache. The difference is that the unpacked content is not repackaged before installation. Instead, the Delta Plugins hand the directory to the package management tools in its unpacked form for further processing and installation. While the package management tools perform their task on the patched package content, the Delta Plugins start repackaging the package (with low priority) in order to store it in the cache later.

Difference Between Both Approaches

The first approach has a significant performance disadvantage because it repacks packages before they can be installed. Repacking spends most of its effort on compression, and LZMA in particular, the new standard compression method, is known to consume a lot of processing power. On network connections with a throughput of 10 Mbit/s or more, LZMA compression can even take longer than downloading the full package. Here the second approach has the advantage that unpacked content can be installed immediately after patching. Compression is only needed if the unpacked content contains compressed files that were worth patching, and in most cases those files are not compressed with LZMA.
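The trade-off can be expressed as a simple time estimate: the first approach pays for the patch download plus recompression, while a plain download pays only for the full package. The sketch below is a back-of-the-envelope model; the package sizes and the compression throughput are made-up example values, not measurements, and patch application time is ignored.

```python
def download_seconds(size_bytes, link_mbit_per_s):
    """Transfer time for size_bytes over a link of the given throughput."""
    return size_bytes * 8 / (link_mbit_per_s * 1_000_000)


def compression_seconds(unpacked_bytes, compress_mb_per_s):
    """Time to recompress the unpacked content at the given throughput."""
    return unpacked_bytes / (compress_mb_per_s * 1_000_000)


def repacking_pays_off(full_size, patch_size, unpacked_size,
                       link_mbit_per_s, compress_mb_per_s):
    """True if patch download plus repacking beats the full download."""
    delta_path = (download_seconds(patch_size, link_mbit_per_s)
                  + compression_seconds(unpacked_size, compress_mb_per_s))
    return delta_path < download_seconds(full_size, link_mbit_per_s)


# Hypothetical package: 10 MB compressed, 1 MB patch, 40 MB unpacked,
# recompressed at an assumed 2 MB/s.
slow_link = repacking_pays_off(10_000_000, 1_000_000, 40_000_000, 1, 2)
fast_link = repacking_pays_off(10_000_000, 1_000_000, 40_000_000, 10, 2)
```

Under these assumed numbers, repacking is faster than a full download on the 1 Mbit/s link (28 s against 80 s) but slower on the 10 Mbit/s link (20.8 s against 8 s).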

The main advantage of the first approach is that it does not interfere with the basic functionality of the package management tools. The delta proxy can act as a fully functional, locally available Debian Archive providing ordinary Debian packages, which can be checked for integrity in the usual way. And in case anything goes wrong, it can still download the actual package from its original source. This makes the first approach highly reliable and trustworthy. In contrast, the plugins in the second approach have to be integrated into the existing package management tools and require a new method for integrity checks on downloaded and patched package content.

Key Topics to be Further Analysed

Identifying Derivative Packages
A newer package is derived from an older package (1:1 relation), for example a newer version of the same package from an updated project. Which package is an update of which older package may be unknown. Thus, a generic service needs a way to identify derivatives of packages.
Patch Creation and Application
Creating a patch for the transition from an older package version to a newer one, applying the patch to an existing package, and checking the result.
Patch Servicing
Managing the creation, storage and servicing of patches for multiple different versions. Servicing mainly refers to the communication protocol between servers and clients of the delta service. Besides requesting patches and deciding whether patching is advantageous, it also includes fallback strategies and error prevention based on client feedback (e.g. in case of invalid patches).

Identifying Derivative Packages

Updated packages can have the same name as their predecessors or a different one, for example when the file name contains the version number. The relationship between older and newer packages is of course known to the package provider, but not necessarily to the client which initiates the download. A simple example is a manual download through a web browser: even though the person at the PC knows that the packages are related, the web browser does not.

Version Database
A database known to the delta service lists the sequence of versions of each package.
Standardised Versioning Pattern
A standardised naming scheme for package derivatives exists and is known to the delta service.
File and Delta Properties Evaluation
Based on pattern matching (e.g. same domain name, path, file name, extension etc. in a URL), the service guesses a relationship to known packages. Based on a size comparison between selected candidates, it then decides whether patching is more efficient than downloading the new package. This is especially easy if the updated package has the same path and name.
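The last option, combined with the Debian file naming pattern name_version_arch.deb, can be sketched as follows. The code guesses that two URLs refer to versions of the same package if host, directory, package name and architecture match, and then applies a simple size rule. The function names, the example.org URLs and the 20 % minimum saving are assumptions made for this illustration.

```python
import posixpath
import re
from urllib.parse import urlsplit

DEB_NAME = re.compile(r"^(?P<name>[^_]+)_(?P<version>[^_]+)_(?P<arch>[^_]+)\.deb$")


def package_key(url):
    """Return (host, directory, name, arch) for a .deb URL, else None."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    match = DEB_NAME.match(filename)
    if match is None:
        return None
    return (parts.netloc, directory, match["name"], match["arch"])


def find_candidates(new_url, known_urls):
    """Known packages that look like other versions of the requested one."""
    key = package_key(new_url)
    if key is None:
        return []
    return [u for u in known_urls if u != new_url and package_key(u) == key]


def patch_is_worthwhile(patch_size, full_size, min_saving=0.2):
    """Prefer the patch only if it saves at least min_saving of the download."""
    return patch_size <= full_size * (1 - min_saving)


base = "http://example.org/debian/pool/main/"
requested = base + "h/hello/hello_2.10-3_amd64.deb"
known = [
    base + "h/hello/hello_2.10-2_amd64.deb",
    base + "h/hello/hello_2.10-2_i386.deb",
    base + "b/bash/bash_5.1-2_amd64.deb",
]
candidates = find_candidates(requested, known)
```

A real service would probably combine such a guess with a version database or the package list, since matching on URLs alone cannot tell which candidate is the closest predecessor.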

Holger Machens, 02-Jan-2021