
Generating Binary Diffs Using Stream Compression

homac, 02-Jul-2015 15:39:21

This was just a test during some research on binary diff tools. The two scripts in the archive below allow creating and applying patch files for binaries and text files. The approach performs surprisingly well considering its simplicity. A comparison to the mainstream binary diff tools bsdiff and xdelta showed that it often generates up to 20% smaller patch files, especially when the differences between the files are more frequent.

The original idea came up when I reviewed the algorithm used by Colin Percival's bsdiff, which is customised to deal with updates of executables. His main concern was that changes in source code lead to changes in the code length of functions, which in turn shift the position of every function located behind the changed one in the compiled executable. Hence, all calls to all these functions have to be updated, which results in many small differences, each a modification of a function call target (pointer) from a distinct value A (old position) to a new value B (new position).

Like all diff algorithms, his algorithm starts with the usual identification of the longest common subsequences of both input files. It then analyses the 8-byte prefixes and postfixes of the common subsequences to extract minor changes due to pointer updates, and stores them in an additional file so the pointers can be patched separately from the actual diff (all in one pass).

The identification of longest common subsequences relies on algorithms also used in compression tools such as gzip, bzip2, etc. The difference is that compression tools search for matches within the same file, while diff tools search for matches between two different files. Compression tools evaluate matches based on their potential reduction of the compressed result and assign, for example, Huffman codes to them accordingly. Thereby, they create a code table which maps code(X) to X and vice versa, where X stands for some byte sequence found at different locations in the file.
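To make this connection concrete, the following small experiment (with made-up file names, and using gzip rather than the xz of the later scripts) shows that a stream compressor encodes data it has already seen almost for free — the modified file costs only a few hundred extra bytes when it follows the original in the same stream:

```shell
# Illustration: a stream compressor encodes already-seen data almost for
# free, via back-references into the earlier part of the stream.
set -e

# 20 KB of incompressible data as the "original" file (small enough to
# stay inside gzip's 32 KiB match window).
head -c 20480 /dev/urandom > org.dat

# "Modified" file: a copy of the original with a small change appended.
cp org.dat new.dat
printf 'a small modification' >> new.dat

ORG=$(gzip -n -c org.dat | wc -c)                 # original, compressed alone
BOTH=$(cat org.dat new.dat | gzip -n -c | wc -c)  # both files in one stream
NEW=$(gzip -n -c new.dat | wc -c)                 # modified, compressed alone

# The increment for new.dat inside the combined stream is tiny, because
# its bytes are encoded as references back into org.dat.
echo "alone: $NEW bytes, as increment: $((BOTH - ORG)) bytes"
```

The increment (BOTH - ORG) is essentially what the method below extracts as a patch.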

Based on this knowledge I thought that a modification of such a function pointer would actually just be a modification of such a code table used to decode a compressed file. This initial idea led to a completely different and more promising idea, which I will analyse later, but to evaluate the basics I kept following this track.

Following this track, I looked for a way to easily create a diff between the code table of the original file and the code table of the modified file. At first I thought that would require a lot of coding, but then I noticed that I could get an almost equivalent effect by concatenating both files and running them through a stream compression tool, which updates its code table in-band. The resulting compressed stream contains, from the start, the compressed original file (and its code table), followed by the updates to the code table and the construction information for the modified file, encoded against the code table built from the original file. Thus, the remaining portion of the stream contains all the information needed to reconstruct the modified file given the original file. Hence, the "patch" consists of the tail of the compressed stream, that's all.

The patch is applied to the original file by the following procedure. It starts by creating a compressed stream of the original file alone. Appending the formerly generated patch file (the tail) to this stream results in a stream identical to the one we generated during the creation of the patch file. Thus, decompressing this stream produces the original file with the modified file appended. Finally, we just remove the original file from the start of the decompressed stream and get the modified file as the result.
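The procedure above can be sketched in a few lines of shell. This is not the original scripts: it assumes gzip -n as the stream compressor (whose output is deterministic for a given input, unlike the xz behaviour discussed below), so a recorded prefix length stands in for the extra fix-up patch; the demo input files are stand-ins for real data.

```shell
# Sketch of the create/apply procedure, assuming gzip -n as the
# (deterministic) stream compressor. Demo files stand in for real data.
set -e
printf 'hello world, this is the original file\n' > org.dat
sed 's/original/modified/' org.dat > new.dat

# --- create the patch ---
gzip -n -c org.dat > org.gz
cat org.dat new.dat | gzip -n -c > both.gz

# Length of the common prefix of the two compressed streams; cmp reports
# the first differing byte, or nothing if org.gz is a pure prefix.
P=$(cmp org.gz both.gz 2>/dev/null | awk '{print $5}' | tr -d ',')
[ -n "$P" ] || P=$(( $(wc -c < org.gz) + 1 ))

echo $((P - 1)) > patch.len          # how much of both.gz org.dat reproduces
tail -c +"$P" both.gz > patch.tail   # the "patch": the tail of the stream

# --- apply the patch ---
gzip -n -c org.dat > org2.gz                     # identical to org.gz
head -c "$(cat patch.len)" org2.gz > reassembled.gz
cat patch.tail >> reassembled.gz                 # byte-identical to both.gz
gzip -d -c reassembled.gz > combined.dat         # org.dat + new.dat
tail -c +"$(( $(wc -c < org.dat) + 1 ))" combined.dat > result.dat

cmp result.dat new.dat && echo "round trip OK"
```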

This is basically what the two scripts do. The first compression tool I found which supports stream compression almost in the required way was xz (the successor of lzma). Unfortunately, xz puts some information (a few bytes) at the start and the end of the stream which differs from run to run (I didn't investigate further, but it might be a length and a time stamp). To fix this, the patch-creation script generates a small additional patch file which is applied to the combined stream (xz of the original file plus the tail) to incorporate those compression-tool-dependent changes. This additional patch is created by a diff between the compressed original file and the combined stream (compressed original plus tail). Internally this uses a conversion from binary to hexadecimal (xxd) and the (text) diff tool (and patch and xxd -r later). This method is not recommended for binary diffs in general, since the patch files are significantly larger and it costs a lot of processing power, but it does the job of fixing the compression-tool-specific changes in this test.
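The hex-dump detour can be sketched as follows. File names here are made up for the demo; in the real scripts the two binaries would be the stream reassembled from the original file and the stream captured when the patch was created:

```shell
# Minimal sketch of the hex-dump fix-up: dump both binaries with xxd,
# take a text diff, and later re-apply it with patch and xxd -r.
set -e

# Two binaries differing in a few bytes (demo stand-ins).
printf 'ABCDEFGHIJKLMNOP' > rebuilt.bin
printf 'ABCDxFGHIJKLMNOPQR' > captured.bin

xxd rebuilt.bin  > rebuilt.hex
xxd captured.bin > captured.hex
diff rebuilt.hex captured.hex > fixup.patch || true  # diff exits 1 on differences

# Apply: hex-dump the rebuilt stream, patch the dump, reverse it.
xxd rebuilt.bin > work.hex
patch -s work.hex fixup.patch
xxd -r work.hex > captured2.bin

cmp captured2.bin captured.bin && echo "fix-up reproduces the captured stream"
```

Because xxd emits a fixed 16 bytes per line with explicit offsets, byte substitutions stay line-aligned in the dump, which is why a plain line-oriented diff can carry them.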

As initially mentioned, I did a few tests to compare the resulting patch size to those of bsdiff and xdelta. I was slightly surprised to see that this method in most cases generated smaller patches than xdelta and bsdiff. But the reduction in size is inconsistent and subject to further research. It is possible that the smaller patch size is partly the result of a better compression algorithm in xz compared to those used by xdelta and bsdiff. bsdiff performs best with executables where the modifications are rather small; the case for which it was tailored.

The worst patch size was consistently achieved on compressed files, where the resulting patch file was almost the size of the compressed file. But this can be explained: compression tools use encodings based on codes of differing length which are not aligned to byte boundaries. Thus, a small difference in the input leads to a different bit alignment in the compressed file, which alters all bytes behind the change. This is bad for all diff tools, because they interpret the input based on byte boundaries. The compression tool xz, which I'm using in the script, behaves this way as well, so my script is affected too. The fix here might be to interpret the input bitwise (I'm not sure about this yet). xdelta supports a mode where it decompresses compressed files to circumvent this issue, but this is a non-generic solution, meaning there are compression methods it does not support. Another issue I've noticed is that executables seem to contain modifications even worse than the pointer changes Percival considered. But this is really subject to further research, and I hope I can present a better solution to this problem soon.
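The misalignment effect is easy to reproduce (again with gzip and made-up file names): one small change in the input shifts the bit alignment and the Huffman tables, so nearly every byte of the compressed stream changes.

```shell
# Illustration: a tiny change in the input alters almost every byte of
# the compressed output, which defeats byte-oriented diff tools.
set -e

seq 1 2000 > a.txt
sed 's/^1500$/xxxx/' a.txt > b.txt   # one small change in the middle

gzip -n -c a.txt > a.gz
gzip -n -c b.txt > b.gz

SIZE=$(wc -c < a.gz)
DIFFERING=$(cmp -l a.gz b.gz 2>/dev/null | wc -l)
echo "$DIFFERING of $SIZE compressed bytes differ after a one-line change"
```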


Archive with scripts


The bash scripts in the archive use the following command line tools:

  • xz: Compression tool
  • diff: Finds differences between text files.
  • xxd: Conversion between binary and hexadecimal representation
  • Common Unix tools such as tail, tar, cat, awk, etc. which should be on your system already.


Please note: These scripts are intended for research purposes only and not for any kind of production use! I'm not responsible if you destroy your kitchen with them!

Generating a patch file

  • Original file: org.dat
  • Modified file: new.dat
  • Resulting patch file: patch
    > org.dat new.dat patch

Applying a patch file

  • Original file: org.dat
  • Patch file: patch
  • Patched file: result.dat
    > org.dat patch result.dat


The state of the art hasn't changed since Percival published his bsdiff algorithm. Tools to generate and apply patches to binaries still fall into three categories:

  1. Deltas of Byte Sequences: Tools which apply the same techniques used for text comparison to binaries. Examples are xdelta and LibXDiff (which is related to xdelta).
  2. Deltas for Executables: Tools which consider pointer changes in executables resulting from sparse modifications in source code. Examples are bsdiff and Google's Courgette. The method here has not changed since bsdiff and still does some kind of replacement for pointers.
  3. Deltas for Compressed Archives: Tools which consider that the input is in some compressed format and contains multiple (mostly text) files. Examples are xdelta, debdelta, deltarpm and the yum-presto plugin. The method here is still to uncompress the input and apply a usual diff tool to the result.

The first category uses a generic approach but fails in several cases where the modifications are not in the manner of text changes: unique insertions, deletions, and alterations at distinct places in the data stream.

The second category performs well on single executables, dealing with the pointer changes. As soon as there are, for example, multiple independent binaries concatenated, or the repeated modification is larger than a pointer, these tools fail.

The third category deals well with archives compressed with a known (i.e. supported) compression algorithm which contain uncompressed files. Obviously these tools fail on data which is not known to be compressed (such as many document and image formats today), uses a compression algorithm they do not support, or contains files which are themselves compressed.

Thus, there is no generic diff algorithm which can reliably produce patches smaller than a compressed archive of the modified files. But our knowledge about entropy tells us that the size of a patch file always has to be less than or equal to the size of the compressed modified file. So, there is still some work to do.

