Generating Binary Diffs Using Stream Compression
homac, 02-Jul-2015 15:39:21
This was just a test during some research on binary diff tools. The original idea came up when I reviewed the algorithm used by Colin Percival's bsdiff.
Like all diff algorithms, bsdiff starts with the usual identification of the longest common subsequences of both input files. It then analyses the 8-byte prefixes and suffixes of the common subsequences to extract minor changes caused by pointer modifications, and stores these in an additional file so the pointers can be patched separately from the actual diff (all in one pass).
The identification of longest common subsequences relies on algorithms also used in compression tools such as gzip, bzip2, etc. The difference is that compression tools search for matches within the same file, while diff tools search for matches across two different files. Compression tools evaluate matches based on their potential to shrink the compressed result and assign, for example, Huffman codes to them accordingly. Thereby they create a code table, which maps code(X) to X and vice versa, where X stands for some byte sequence found at different locations in the file.
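To illustrate the match-based redundancy such compressors exploit, here is a small Python sketch with zlib standing in for any of them (the data is made up): compressing a file together with a near-duplicate copy costs barely more than compressing the file alone, because the second copy is encoded as back-references into the first.

```python
import random
import zlib

# Made-up example data: 8 KiB of pseudo-random bytes standing in for
# binary content that is hardly compressible on its own.
original = random.Random(0).randbytes(8192)

single  = zlib.compress(original)
doubled = zlib.compress(original + original)

# The appended copy lies inside the compressor's match window and is
# encoded almost entirely as back-references to the first copy, so the
# doubled input compresses to barely more than the single one.
print(len(single), len(doubled))
```

The same mechanism is what the diff trick below relies on: anything already seen in the stream can be referenced instead of stored again.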
Based on this knowledge I thought that modifying such a pointer would effectively just modify the code table used to decode a compressed file. This initial idea led to a completely different and more promising idea, which I will analyse later, but to evaluate the basics I kept following this track.
Following this track, I looked for a way to easily create a diff between the code table of the original file and the code table of the modified file. At first I thought that would require a lot of coding, but then I noticed that I could get almost the same effect by concatenating both files and running them through a stream compression tool, which modifies its own code table in-band. The resulting compressed stream contains the compressed original file (and its code table) at the start, followed by modifications to the code table and construction information encoded against the code table built from the original file. Thus, the remaining portion of the stream contains all the information needed to reconstruct the modified file from the original file. Hence, the "patch" consists of the tail of the compressed stream; that's all.
This patch is applied to the original file by the following procedure. It starts by creating a compressed stream of the original file. Appending the previously generated patch file (the tail) to this stream yields a stream identical to the one generated when creating the patch. Decompressing it therefore produces the original file with the modified file appended. Finally, we simply remove the original file from the start of the decompressed stream and are left with the modified file.
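The whole procedure can be sketched in a few lines of Python, using zlib's stream interface as a stand-in for the compression tool the scripts actually use (the function names `make_patch` and `apply_patch` are mine, not the scripts'):

```python
import zlib

def make_patch(original: bytes, modified: bytes) -> bytes:
    """Compress original followed by modified as one stream;
    the patch is the tail of that stream."""
    comp = zlib.compressobj()
    # Compress the original and sync-flush, so this head of the stream
    # can later be reproduced byte-for-byte from the original alone.
    head = comp.compress(original) + comp.flush(zlib.Z_SYNC_FLUSH)
    # Everything after that point encodes the modified file, mostly as
    # back-references into the original -- this tail is the patch.
    tail = comp.compress(modified) + comp.flush(zlib.Z_FINISH)
    return tail

def apply_patch(original: bytes, patch: bytes) -> bytes:
    """Re-create the compressed head from the original, append the
    patch, decompress, and strip the original from the front."""
    comp = zlib.compressobj()
    head = comp.compress(original) + comp.flush(zlib.Z_SYNC_FLUSH)
    both = zlib.decompress(head + patch)
    return both[len(original):]
```

The key requirement is that the compressor is deterministic, so that `apply_patch` regenerates exactly the same head of the stream from the original file alone.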
This is basically what xdiff.sh and xpatch.sh do, built on the first compression tool I found that supports stream compression almost in the way required.
As initially mentioned, I ran a few tests comparing the resulting patch sizes to those of bsdiff and xdelta. I was slightly surprised to see that this method generated smaller patches than xdelta and bsdiff in most cases. The reduction in size is inconsistent, however, and subject to further research. It is also possible that the smaller patches are partly the result of a better compression algorithm in the underlying stream compression tool.
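The basic effect is easy to reproduce on synthetic data. The following is only an illustration with made-up inputs, not a reproduction of the bsdiff/xdelta benchmarks: for a small change, the tail of the combined stream is far smaller than a compressed archive of the whole modified file.

```python
import random
import zlib

rng = random.Random(0)
original = rng.randbytes(1 << 14)          # 16 KiB of incompressible "binary" data
modified = bytearray(original)
modified[100:104] = b"\x00\x01\x02\x03"    # a small, pointer-sized change
modified = bytes(modified)

# Patch: tail of the stream that compresses original followed by modified.
comp = zlib.compressobj()
comp.compress(original)
comp.flush(zlib.Z_SYNC_FLUSH)
patch = comp.compress(modified) + comp.flush(zlib.Z_FINISH)

# Baseline: shipping the modified file as a compressed archive.
archive = zlib.compress(modified)

print(len(patch), len(archive))
```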
DOWNLOAD: Archive with scripts
Please note: These scripts are intended for research purposes, not for any kind of production use! I'm not responsible if you destroy your kitchen with them!
Generating a patch file
> xdiff.sh org.dat new.dat patch
Applying a patch file
> xpatch.sh org.dat patch result.dat
FURTHER NOTES ON BINARY DIFF/DELTA/PATCH TOOLS
The state of the art hasn't changed since Percival published his bsdiff algorithm. Tools to generate and apply patches to binaries still split into three categories:
The first category takes a generic approach but fails in several cases where the modifications are not text-like changes, i.e. unique insertions, deletions, and alterations at distinct places in the data stream.
The second category performs well on single executables by handling pointer changes. But as soon as, for example, multiple independent binaries are concatenated, or a repeated modification is larger than a pointer, these tools fail.
The third category deals well with archives that are compressed with a known (i.e. supported) compression algorithm and contain uncompressed files. Obviously these tools fail on data that is not known to them to be compressed (such as many document and image formats today), on archives using an unsupported compression algorithm, or on archives containing files that are themselves compressed.
Thus, there is no generic diff algorithm that reliably produces patches smaller than a compressed archive of the modified files. But our knowledge about entropy tells us that the size of a patch file never has to exceed the size of the compressed modified file: in the worst case, the patch simply contains the whole new file. So there is still some work to do.