Meta Diff Tool

Introduction

Updates are annoying because they occur quite frequently and occupy various resources of the user's PC for a significant amount of time. Very often a user has to decide whether to ignore the update notification and take the security risk of using vulnerable/buggy software, or to wait for the update to complete. While waiting for an update, the user may wonder why the download for a "bugfix" has almost the same size as the initial download of the entire software. A rough analysis of updates from various software distributors reveals that most of those updates are in fact complete software downloads, some of them not even compressed or compressed with weak methods. So the user ends up downloading a bunch of redundant data.

Network bandwidth and connectivity are still an issue due to outdated infrastructure (example: central Europe), limited bandwidth on undersea broadband cables and overloaded network nodes (switches) in between. Thus, a significant number of users still sit behind connections of 10 Mbit/s and less. Furthermore, an estimate of the potential savings through patches revealed that even fibre networks would not be enough to make patches obsolete.

The ideal solution to this update debacle are actual patches, where the patch contains only those data portions which are not yet on the local machine, accompanied by instructions on how to modify the old data to achieve the new software state. As a result, a patch usually has just a fraction of the size of the entire software. Now, software developers/distributors obviously know about patches, but they also have good reasons to avoid them.
But what are the advantages of patches, or rather, what are the issues of not using them?
All those issues arising from updates with excessive download times add up and cause frustration on the user side. User frustration obviously has negative effects on the customer relationship in general. It affects the acceptance of bug-fixes and of bugs in general. Everybody knows that software may have bugs (since developers are human) and should be happy when they are fixed. But since every fix involves a disproportionately large update, users are rather frustrated about it, and developers have to add at least something to make up for it. Excessive update times also lower the willingness to download software in general. Online gaming companies which, for example, rely on micro-transactions have an interest in users being willing to update in order to access their service. But also, for example, when making a quick purchase of a casual mini-game late in the evening on a weekend, people consider the time they will have to spend waiting for the download. Since users do not actually measure the time and calculate the expected download time, they estimate it based on their experience. And their experience will be the worse, the more time they have recently spent waiting for updates.

This breakdown of the current state provides a lot of good arguments to reconsider the current update distribution process. The goal of this project is to tackle the existing issues of patch creation, distribution and application by developing a procedure and a framework which allow seamless integration into an existing update distribution process, with minimal or no effort. This will mainly be achieved by the introduction of a so-called delta service, which detects updates, generates deltas on client requests and distributes them when ready. Another major topic of this project is the development of a tool which assists binary diff tools when dealing with large and compressed software packages, which usually produce bad results in patch creation. Not part of this project is the development of a diff tool: algorithms for generic binary diffs have been researched in the past, and the current state is good enough to provide significant advantages when used properly.

Issues of Binary Delta Calculation and Application

To understand the issues of binary diff tools and why those issues cannot be fixed by 'add more processors' or 'add more memory', we have to have a look at the task of binary delta calculation. The overall process is quite simple: to create a patch, a diff tool analyses the differences in two files.

Binary diff tools work quite similarly to compression tools. The difference between the two is that binary diff tools look for similarities in two files, while compression tools look for similarities in one file. A simple compression algorithm could first gather all patterns found in the file; when finding the same pattern at a subsequent location in the file, it can replace it by a reference to the former location in the same file. This may lead to the following idea: let's say we have an older version of a file already on the local machine; then the compressor could look for known patterns in that old file and encode the new file mostly as references into it, so only the genuinely new portions have to be transmitted, which is exactly what a binary delta is.

The limited view range of a "common subsequence algorithm" working on a file leads to a very common issue. Let's say we have created two tar archives with exactly the same content. The ordering of files in a tar file depends on the order found in the directory node of the file system containing them. Thus, we can end up having two tar balls with the same content but in a different order.
Now, due to the limited view range of the delta algorithm, the patch ends up containing data which is already present in the old archive, just at a different position.

Another, more obvious issue is compression. Using a different version of the same compression library, or even an entirely different compression tool than before, causes any similarity that may have existed in the original data to vanish. There are approaches for diff tools which consider uncompressing the data before analysing the content, but they have issues repacking it afterwards. Command line arguments such as the compression level, method or memory utilisation are usually not stored in the compressed file, because they are unnecessary at decompression time. Also, the compression method may have changed, or header data may vary with each run (e.g. a time stamp).

Beyond those issues there are other, more domain-specific issues. Take executables, for example: executables contain lots of static addresses, especially for function calls. When changing just one line of code in a function, all code addresses behind this particular change will be offset by the number of machine instructions the compiler generates for the added code line.

The various issues of delta creation and application depict the need for one particular improvement: a tool which manages and controls the delta creation and application process; a tool which increases the similarity of data streams by normalising its content through unpacking, uncompressing and ordering; a tool which is capable of detecting the steps required to properly repack formerly unpacked and uncompressed data, to produce the exact same and not just similar output. This would be some kind of meta-diff tool, which has its very own challenges, mostly driven by the lack of processing time, which will be discussed in another section.
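To illustrate the normalisation idea behind such a meta-diff tool, the following sketch rewrites a tar archive with its members sorted by name and with volatile metadata cleared, so that two archives with identical content but different member order become byte-identical before diffing. It is a simplified illustration, not the project's implementation; the function name normalize_tar is made up for this example.

```python
import io
import tarfile

def normalize_tar(src_path: str, dst_path: str) -> None:
    """Rewrite a tar archive with members sorted by name and volatile
    metadata (mtime, owner) cleared, so that archives with the same
    content normalise to the same byte stream."""
    with tarfile.open(src_path, "r:*") as src:
        members = sorted(src.getmembers(), key=lambda m: m.name)
        data = {m.name: src.extractfile(m).read() for m in members if m.isfile()}
    with tarfile.open(dst_path, "w") as dst:
        for m in members:
            m.mtime = 0            # drop varying time stamps
            m.uid = m.gid = 0      # drop owner information
            m.uname = m.gname = ""
            if m.isfile():
                dst.addfile(m, io.BytesIO(data[m.name]))
            else:
                dst.addfile(m)
```

The same normalisation has to be undone when repacking, which is exactly where the repacking analysis described further below comes in.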
Naive First Approaches

As explained in the previous section, the main reasons for binary diff tools to create patches with too much redundancy are reordered content which falls outside the view range of the diff algorithm, and compression applied to the packaged data.

Thus, an obvious approach is to unpack and uncompress the data of the old and the new package and create a diff over the resulting directories and files. To apply the patch, the old package is unpacked again, and the resulting set of files is patched and finally repacked to assemble the new package.

[Figure: naive patch creation process]
[Figure: naive patch application process]

To explain why this approach is not sufficient to compete with simple downloads of full packages, we have to have a closer look at each step of the process and the data that is generated throughout.
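For orientation before going through the individual steps, the following sketch illustrates the core of the naive diff step: walking the two unpacked trees and determining which files would have to be shipped. It is a simplified illustration, not the project's implementation; a real tool would additionally run a binary diff over each changed file pair instead of just listing it.

```python
import filecmp
import os

def collect_changed_files(old_dir: str, new_dir: str):
    """Walk two unpacked directory trees and report which files a naive
    patch would have to ship: files added in the new tree and files
    whose content changed."""
    changed, added = [], []
    for root, _dirs, files in os.walk(new_dir):
        rel = os.path.relpath(root, new_dir)
        for name in files:
            new_file = os.path.join(root, name)
            old_file = os.path.join(old_dir, rel, name)
            if not os.path.exists(old_file):
                added.append(os.path.join(rel, name))
            elif not filecmp.cmp(old_file, new_file, shallow=False):
                changed.append(os.path.join(rel, name))
    return changed, added
```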
Process Outline

Unpacking

Unpacking a package means identifying the proper tool and running it to get the unpacked set of files. Since a package can contain other packages, this process has to be recursive. Thus, after unpacking the first level, the unpacked result has to be traversed to identify and unpack contained packages. For the convenience of further processing, the unpacked files of each package are stored in a directory with the name of the package, at the location of the package in the directory tree of the unpacked top-level package, as shown in the following example.

Example

Let's say archive.tar.gz is a gzip-compressed archive containing a directory dir with files a and b. Unpacking then produces the following directory structure:
archive.tar.gz : gzip
|
+-- archive.tar : tar
    |
    +-- dir : dir
        |
        +-- a : file
        +-- b : file
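A minimal sketch of such a recursive unpacking step might look like the following. It is a simplified illustration assuming Python, handling only gzip and tar; the output directory gets an ".unpacked" suffix instead of the plain package name used in the example above, purely to keep the sketch short.

```python
import gzip
import os
import shutil
import tarfile

def unpack(path: str):
    """Unpack a single package file into '<path>.unpacked' and return that
    directory, or None if the file is not a known package type. Only gzip
    and tar are handled here; a real tool dispatches on many more formats."""
    out_dir = path + ".unpacked"
    if os.path.isdir(out_dir):                 # already unpacked earlier
        return None
    if path.endswith(".gz"):
        os.makedirs(out_dir)
        inner = os.path.join(out_dir, os.path.basename(path)[:-3])
        with gzip.open(path, "rb") as src, open(inner, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return out_dir
    if tarfile.is_tarfile(path):
        os.makedirs(out_dir)
        with tarfile.open(path, "r:") as tar:
            tar.extractall(out_dir)
        return out_dir
    return None                                # plain file, nothing to do

def unpack_recursively(path: str) -> None:
    """Unpack a package, then traverse the result and unpack any
    contained packages as well."""
    out_dir = unpack(path)
    if out_dir is None:
        return
    for root, _dirs, files in os.walk(out_dir):
        for name in files:
            unpack_recursively(os.path.join(root, name))
```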
Unpacking for Repacking

To be able to create an identical copy of a package that was unpacked during the process, various information about the original package is required: file order, type and attributes are usually required to repack archives. File type and attributes can differ once unpacked (such as owner == root) and have to be recorded from the package (e.g. an archive) which contained them.

To obtain this repacking information, an analysis of the original package is required. This analysis involves repacking for comparison, not only for verification. For example, to determine an unknown compression level, we can compress the uncompressed file with different compression levels and compare the results to the original package until we find an exact match. Or, in cases where a tool adds varying information on each run (such as a time stamp in a header), we can identify and copy that particular data portion, to patch it in when repacking.

The given examples of package analysis procedures are very basic and can be made much more efficient. But they show that the effort to obtain the required repacking information can be quite significant (up to several minutes, depending on the type and size of the data). Fortunately, package analysis has to be done only once, and only when creating the patch. Thus, the patch application time is not affected by it.
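As an illustration of this kind of analysis, a brute-force search for an unknown gzip compression level might look roughly like this. It is a sketch only: it assumes a minimal gzip header without an embedded file name and simply skips the first 10 header bytes to ignore the varying time stamp; a real analysis would parse the header and compare only the deflate stream and trailer.

```python
import gzip

def detect_gzip_level(original_gz_path: str, uncompressed_path: str):
    """Try every gzip compression level on the known plain data and
    compare the result to the original .gz file. Returns the matching
    level, or None if no level reproduces it (e.g. a different tool or
    library version was used)."""
    with open(original_gz_path, "rb") as f:
        original = f.read()
    with open(uncompressed_path, "rb") as f:
        plain = f.read()
    for level in range(1, 10):
        candidate = gzip.compress(plain, compresslevel=level, mtime=0)
        # Skip the 10-byte fixed gzip header (it contains the time stamp
        # and the level-dependent XFL byte) and compare the rest.
        if candidate[10:] == original[10:]:
            return level
    return None
```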
Delta Calculation

There are two approaches to calculate the delta over those unpacked directories: a stream delta, which concatenates the files of each package, sorted by name, into one stream and runs the diff tool over the two streams, and a recursive delta, which walks the directory tree and creates an individual delta for each file.

Both approaches have advantages and disadvantages. Due to the sorting by file name, the stream delta implicitly avoids cases where file names were changed only slightly, such as a version number suffix or a change from lower case to upper case. And even if the file name changed entirely, there is still a chance that the file is inside the view range of the diff algorithm when processing the streams. But there are of course cases where the position of files in the streams has changed due to insertions and deletions and thereby moved beyond the view range. In this particular case, the recursive delta produces better results. But, as an analysis has shown, individual deltas for each file are on average a disadvantage, because of the redundant information added by them (e.g. the names of the original files and checksums). Also, the number of file system operations is significantly higher, which makes recursive deltas slower.
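The stream delta boils down to a simple preprocessing step. The following sketch (a simplified illustration, not the project's implementation) builds such a sorted stream for one unpacked package; the two streams built from the old and the new package would then be handed to a binary diff tool such as bsdiff or xdelta.

```python
import os

def build_stream(unpacked_dir: str) -> bytes:
    """Concatenate all files of an unpacked package into one stream,
    sorted by their relative path, as input for a single binary diff run
    ("stream delta"). A "recursive delta" would instead invoke the diff
    tool once per file pair."""
    paths = []
    for root, _dirs, files in os.walk(unpacked_dir):
        for name in files:
            full = os.path.join(root, name)
            paths.append((os.path.relpath(full, unpacked_dir), full))
    chunks = []
    for _rel, full in sorted(paths):
        with open(full, "rb") as f:
            chunks.append(f.read())
    return b"".join(chunks)
```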
Main Disadvantages of These Approaches

The goal is to compete with downloads of full updates. Thus, patch download and application time has to be lower than the download time of a full update. While the download time mainly depends on the network bandwidth, the patch application time depends mainly on the runtime performance of the underlying file system. This, however, depends on the following factors:

Thus, compression during repacking is the main bottleneck in the whole patching process. That makes a sequential patching process lose the competition against full updates when the patch is close to 80% of the size of the whole package. Therefore, the next approach aims for more parallelism. While a sequential process waits for the patch download to complete before even starting to work, a parallelised process can already start while the download is running and be almost finished when the last bit of the patch arrives. Thus, the additional patching effort, in terms of time, is close to zero. The processing effort will of course be higher, but not considerably so, because the rather slow network I/O suspends the process from time to time.

Sequential vs. Parallel Patching

Evaluation of the runtime performance of the sequential prototypes introduced in the previous section provided devastating results for patches of Debian packages. Further analysis revealed compression and decompression as the main bottleneck, which led to a new, parallel approach. The parallel approach can optimise the process down to the point that everything runs truly parallel to the download of the patch. Therefore, the compression effort only comes into account when it is slower than the download. Because the compression run which compresses the whole package after patching requires the longest time, we only need to consider this last compression run; all the other compressed files in compressed packages can be handled in parallel. Based on the results of the preliminary analysis and the fact that decompression and compression are the main bottleneck, we can approximate the patching effort for a sequential and a parallel patch process as follows:
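The concrete approximation formulas are not reproduced here. As a rough sketch (an assumption made for illustration, not the original model), let r be the patch size ratio (patch size divided by package size), S the package size in MiB, t_dl the download time per MiB, and t_dec, t_comp the per-MiB decompression and compression times; then

\[
T_{full} \approx S \cdot t_{dl}, \qquad
T_{seq}  \approx S \cdot (r \cdot t_{dl} + t_{dec} + t_{comp}), \qquad
T_{par}  \approx S \cdot \max(r \cdot t_{dl},\; t_{comp})
\]

where nested compressed packages multiply the compression term of the sequential process (the two and three compression runs mentioned below), and the parallel process is bounded by whichever takes longer, the patch download or the final compression run.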
Compression and decompression time was measured in a separate analysis on recent hardware (Intel(R) Core(TM) i7-3770K CPU @ 3.50 GHz, 8 MiB cache) using
All times refer to the time in seconds required to process 1 MiB of uncompressed file size; thus, for decompression, the output size counts, not the input size. It has to be mentioned here that the maximum and average compression times are for the most part negatively influenced by

Download time was approximated by the maximum bandwidth for 6, 10 and 100 Mbit/s (1.33, 0.80 and 0.08 s/MiB, respectively). Based on these times, the patching time was approximated for the different (de-)compression times (min/avg/max), different network bandwidths and different patch size ratios (patch size relative to the full package size).

Best Case

These graphs consider minimum values for the (de-)compression times. At 6 and 10 Mbit/s network bandwidth, both approaches are very close to each other and clearly outperform a full update. At 100 Mbit/s, a sequential process can only compete with a full update if the patch size is less than 40% of the package size. The parallel approach can save up to 50% of the time compared to a full download. At that point it hits the compression time boundary, where the compression requires more time than the patch download. This boundary only exists if a package is compressed; otherwise, smaller patches can improve the patch time even further.
Average Case

These graphs consider average values for the (de-)compression times, and the sequential process was assumed to have to compress packages two times (compressed file in a compressed package). A sequential patching process can only compete with full updates if patch sizes are below 55% of the update size at 6 Mbit/s and below 30% at 10 Mbit/s. In contrast, a parallel patching process still outperforms a full update and saves up to 75% of the time at 6 and 10 Mbit/s. However, at 100 Mbit/s there is no longer any time advantage in using patches if the packages are compressed.
Worst Case

These graphs consider maximum values for the (de-)compression times, and the sequential process was assumed to have to compress packages three times (compressed file in a compressed package). The sequential process can no longer compete with a full download at any network bandwidth, but the parallel process still can: at 6 Mbit/s it can improve update times by up to 38% before it hits the compression time limit at a patch size ratio of 50%. At 10 Mbit/s it achieves only up to 15% time saving, and at 100 Mbit/s it is always slower than a full update. Again, this applies only to compressed packages and if
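To make the shape of these curves tangible, the model sketched above can be evaluated numerically. The download times per MiB are the ones given in the text; the decompression and compression times below are hypothetical placeholders, not the measured values.

```python
# Evaluate the sketched time model for one MiB of package data.
T_DL = {"6 Mbit/s": 1.33, "10 Mbit/s": 0.80, "100 Mbit/s": 0.08}  # s/MiB
T_DEC, T_COMP = 0.02, 0.35   # assumed seconds per MiB, placeholder values

def t_full(t_dl):             # full download of the package
    return t_dl

def t_seq(r, t_dl):           # download patch, then decompress, patch, compress
    return r * t_dl + T_DEC + T_COMP

def t_par(r, t_dl):           # everything overlapped with the download
    return max(r * t_dl, T_COMP)

for name, t_dl in T_DL.items():
    for r in (0.2, 0.5, 0.8):  # patch size ratios
        print(f"{name}, ratio {r:.0%}: full={t_full(t_dl):.2f}s "
              f"seq={t_seq(r, t_dl):.2f}s par={t_par(r, t_dl):.2f}s")
```

With the assumed compression time, the output reproduces the qualitative picture described above: at 6 and 10 Mbit/s the parallel process stays below the full download over a wide range of patch size ratios, while at 100 Mbit/s the final compression run alone already exceeds the time of a full download.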
Holger Machens, 02-Jan-2021