License: GPL v3
Last Update: 01.10.2014
The evaluation framework provides the functionality required to configure, run and evaluate measurement runs in an automated fashion. Its most important parts are the management of measurement data and the automated calculation of the number of measurement runs required to fulfil a given accuracy constraint, such as a confidence interval.
Note: The framework is still a work in progress and contains some bad design decisions and hacks, but I felt someone might be interested in it, because the idea itself is actually not bad.
Evaluating software manually is a really annoying process. First you write a tiny shell script that automates the execution of certain measurement runs with different parameters. Then you gather the results and analyse them using visualisation software such as gnuplot or a spreadsheet processor. During the analysis you notice that the accuracy is too low, so you increase the number of measurement runs and restart your script, generate the graphs again, and now you see some misbehaviour you need to fix in your software. You fix it, and because all measurements are now obsolete, you have to rerun and analyse all of them. Changing parameters changes the accuracy too, so you might have to do all of this once again, and then you spot another problem, and so on.
Because it is so annoying, you end up writing a whole bunch of scripts which automate most of the tasks, such as cleaning up and generating graphs for analysis. You never intended these scripts to grow to this size, which results in ugly, chaotic code that is not reusable. So, the next time you have to evaluate another piece of software, you either force yourself through the ugly scripts of your last evaluation or you start over from scratch, which again costs you a lot of time!
I simply got angry about this nerve-wracking process and started writing a framework which automates the following tasks:
- Automated accuracy control: You define a certain accuracy constraint (e.g. some confidence interval) on some statistical value (e.g. the arithmetic mean of a measure), which is used to determine the number of measurement runs required for this particular measure. The system then repeats measurement runs for this measure until the constraint is satisfied.
- Statistics: Statistics are also required by the accuracy control, so the system calculates them for you.
- Management of Results: The system keeps raw measurement results for each measure. If you just change the accuracy constraint and rerun the evaluation it will consider all already available results during accuracy control.
- Visualisation: This part is not fully implemented; for now it supports the generation of gnuplot scripts.
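The accuracy control described above can be sketched as follows. This is an illustration of the idea only, not the framework's actual API; the class and method names are hypothetical, and a fixed normal-approximation quantile is used for the confidence interval.

```java
import java.util.List;

// Sketch of the accuracy-control idea: repeat measurement runs until the
// half-width of the confidence interval around the arithmetic mean of a
// measure drops below a requested accuracy bound.
public class AccuracyCheck {
    // z-quantile for a 95% confidence level (normal approximation,
    // adequate once enough runs are available)
    static final double Z_95 = 1.96;

    static double mean(List<Double> samples) {
        double sum = 0;
        for (double s : samples) sum += s;
        return sum / samples.size();
    }

    // half-width of the confidence interval: z * s / sqrt(n)
    static double halfWidth(List<Double> samples) {
        double m = mean(samples);
        double var = 0;
        for (double s : samples) var += (s - m) * (s - m);
        var /= samples.size() - 1;              // sample variance
        return Z_95 * Math.sqrt(var / samples.size());
    }

    /** true if more runs are needed: either the minimum run count is not
     *  reached yet, or the interval is still wider than the requested
     *  accuracy. */
    static boolean needsMoreRuns(List<Double> samples, int minRuns, double accuracy) {
        if (samples.size() < minRuns) return true;
        return halfWidth(samples) > accuracy;
    }

    public static void main(String[] args) {
        List<Double> samples = List.of(10.0, 10.2, 9.8, 10.1, 9.9);
        System.out.println(needsMoreRuns(samples, 10, 0.5)); // minimum run count not reached
        System.out.println(needsMoreRuns(samples, 5, 0.5));  // interval narrow enough
    }
}
```

The framework applies this kind of check per measure, scheduling additional measurement runs for exactly those measures whose constraint is not yet satisfied.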
Note: The source code is poorly documented! The best way to start is to look into the package org.cakelab.eval.mlatency, which contains the implementation of a specific evaluation, using mlatency, that I developed on top of this framework.
The following sections provide a very basic user manual covering the build, configuration and running of the measurements, and the analysis of the results, to evaluate memory/cache access latency in highly concurrent scenarios using bin/ef.jar, which contains the entire framework including the main class for the memory latency scenarios.
The test scenarios are for the most part defined in the source code (see src/org/cakelab/eval/mlatency/Main.java).
Parameters that might change depending on the evaluated runtime environment or the required accuracy can be configured in "ef.cfg", located in the current working directory during execution (pwd).
"ef.cfg" contains a JSON object with the following members:
||A JSON array with the amount of threads to be iterated through for each measurement scenario. Example: numThreads : [1,2,4,8], will iterate over runs with 1, 2, 4 and 8 threads.
||The minimum runs (int) required for a scenario before the evaluation frameworks is allowed to stop repeating measurements for the scenario based on the achieved accuracy.
||The requested accuracy of measurements depending on the applied constraint. Per default it is a confidence interval.
||The path to the benchmarks executable (
mlatency in our case).
||The location of the output directory to receive all data from the evaluation.
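A configuration following the members above might look like the sketch below. Apart from numThreads, which appears in the example above, the member names here are hypothetical placeholders; the actual names are defined by the framework's configuration code.

```json
{
    "numThreads" : [1, 2, 4, 8],
    "minRuns"    : 10,
    "accuracy"   : 0.05,
    "benchmark"  : "path/to/mlatency",
    "output"     : "path/to/output"
}
```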
To run the evaluation framework controlling the scenarios for mlatency, execute:
> java -jar ef.jar
The output directory contains all measurement results and generated plots.
data: All raw data received from the benchmark in each measurement run. The data directory contains one sub-directory for each different configuration of the benchmark; its name consists of the set of command line arguments passed to the benchmark. Inside each of these is one sub-directory per measurement run with this configuration, named after the timestamp of that run. Each of those sub-directories contains the output the benchmark wrote to stdout and stderr. Raw data from a particular measurement run can be reused to evaluate different scenarios (see below).
results: This directory contains the gathered results for the given scenarios. For each evaluated scenario there is one sub-directory with the name of the scenario. The results are statistics determined from the raw data received through the measurement runs (see above). A scenario can contain results from measurement runs with different configurations (e.g. number of threads). In our case it contains the averages for readers (readers.result) and writers (writers.result) separately, and for both together (avg.result). The content of the result files depends on the scenario and the evaluation goal. In most cases the first column contains the number of threads and the following column the determined average; in other cases the first column contains the amount of memory accessed.
plots: This directory contains the gnuplot scripts to generate graphs from the files in the results directory for the different scenarios. Use
> gnuplot <script>.dem
to generate the graph.
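The scripts themselves are generated by the framework, but for orientation, a script plotting the result files described above could look roughly like this sketch (the scenario name and file layout are hypothetical placeholders):

```gnuplot
# hypothetical sketch of a generated plot script
set xlabel "number of threads"
set ylabel "average latency"
plot "../results/scenario/avg.result"     using 1:2 with linespoints title "avg", \
     "../results/scenario/readers.result" using 1:2 with linespoints title "readers", \
     "../results/scenario/writers.result" using 1:2 with linespoints title "writers"
pause -1
```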