The evaluation framework provides the functionality required to configure, run and evaluate measurement runs in an automated fashion. Its most important parts are the management of measurement data and the automated calculation of the number of measurement runs required to fulfil a given accuracy constraint, such as a confidence interval.
Note: The framework is still a work in progress and contains some bad design decisions and hacks, but I felt someone might be interested in it, because the idea itself is actually not bad.
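To give an idea of how the automated calculation of required measurement runs can work, here is a rough, hypothetical sketch of a stopping rule based on a minimum run count and a relative confidence interval width. It is not the framework's actual code; the class and method names are made up:

    import java.util.ArrayList;
    import java.util.List;

    /* Hypothetical sketch of an accuracy-based stopping rule. */
    public class StoppingRule {
        private final int minRuns;             // corresponds to minRuns in ef.cfg
        private final double accuracy;         // corresponds to accuracy in ef.cfg, e.g. 0.05
        private static final double Z = 1.96;  // ~95% confidence, normal approximation
        private final List<Double> samples = new ArrayList<>();

        public StoppingRule(int minRuns, double accuracy) {
            this.minRuns = minRuns;
            this.accuracy = accuracy;
        }

        public void addSample(double measuredValue) {
            samples.add(measuredValue);
        }

        /* True once enough runs exist and the confidence interval is narrow enough. */
        public boolean satisfied() {
            int n = samples.size();
            if (n < Math.max(minRuns, 2)) return false;
            double mean = samples.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            double variance = samples.stream()
                    .mapToDouble(v -> (v - mean) * (v - mean))
                    .sum() / (n - 1);
            double halfWidth = Z * Math.sqrt(variance / n);
            return halfWidth <= accuracy * mean;  // relative interval width vs. requested accuracy
        }
    }

The framework repeats measurement runs for a scenario until a criterion of this kind is satisfied (see minRuns and accuracy below).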
Evaluating software manually is a really annoying process. First you write a tiny shell script which automates the execution of certain measurement runs with different parameters. Then you gather the results and analyse them using visualisation software such as gnuplot or some spreadsheet processor. During analysis you see that the accuracy is too low, so you increase the number of measurement runs, restart your script and generate the graphs again; now you see that there is some misbehaviour you need to fix in your software. You fix it, and because all measurements are now obsolete, you have to rerun and analyse all of them. Changing parameters changes the accuracy too, so you might have to do this once again, then you spot another problem, and so on.
Because it is so annoying, you end up writing a whole bunch of scripts which automate most of the tasks, such as cleaning up and generating graphs for analysis. You never intended these scripts to grow to this size, which results in ugly, chaotic code that is not reusable. So, the next time you have to evaluate another piece of software, you either force yourself through the ugly scripts of the last evaluation or you start over from scratch, which costs you a lot of time again!
I simply got angry about this nerve-wracking process and started writing a framework which automates these tasks.
Note: The source code is poorly documented! The best way to start is to look into the package org.cakelab.eval.mlatency, which contains the implementation of a specific evaluation using mlatency that I developed on top of this framework.
The following sections provide a very basic user manual, which covers building, configuring and running the measurements, as well as analysing the results, to evaluate the memory/cache access latency in highly concurrent scenarios using the mlatency benchmarks.
> make
Results in bin/ef.jar, which contains the entire framework with the main class to evaluate memory latency scenarios.
The test scenarios are for the most part defined in the source code (see src/org/cakelab/eval/mlatency/Main.java).
Parameters that might change depending on the evaluated runtime environment or the required accuracy can be configured in "ef.cfg", located in the current working directory during execution (pwd).
"ef.cfg" contains a JSON object with the following members:
| type | name | description |
| --- | --- | --- |
| int[] | numThreads | A JSON array with the thread counts to iterate through for each measurement scenario. Example: numThreads : [1,2,4,8] will iterate over runs with 1, 2, 4 and 8 threads. |
| int | minRuns | The minimum number of runs required for a scenario before the evaluation framework is allowed to stop repeating measurements for that scenario based on the achieved accuracy. |
| float | accuracy | The requested accuracy of the measurements, depending on the applied constraint. By default this is a confidence interval. |
| string | executable | The path to the benchmark executable (mlatency in our case). |
| string | outdir | The location of the output directory which receives all data from the evaluation. |
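An ef.cfg could therefore look roughly like the following (shown as strict JSON); the concrete values are only placeholders and have to be adapted to your setup:

    {
        "numThreads" : [1, 2, 4, 8],
        "minRuns" : 10,
        "accuracy" : 0.05,
        "executable" : "./mlatency",
        "outdir" : "./output"
    }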
To run the evaluation framework controlling the scenarios for mlatency, type:
> java -jar ef.jar
The output directory contains all measurement results and generated plots (an example layout is sketched after the following descriptions).
data
: All raw data received from the benchmark in each measurement run. This directory contains one sub-directory for each different configuration of the benchmark; the name of each sub-directory consists of the set of all command line arguments provided to the benchmark. Inside each of these is one sub-directory for each particular measurement run with this configuration, named after the timestamp of that run. Each of those sub-directories contains the output the benchmark wrote to stdout and stderr. Raw data from a particular measurement run can be used to evaluate different scenarios (see below).
results
: This directory contains the gathered results for the given scenarios. For each evaluated scenario there is one sub-directory with the name of the scenario. The results are statistics determined from the raw data received through the measurement runs (see above). A scenario can contain results from measurement runs with different configurations (e.g. number of threads). In our case it contains the averages for readers (readers.result) and writers (writers.result) separately, and for both together (avg.result). The content of the result files depends on the scenario and the evaluation goal. In most cases the first column contains the number of threads and the following column the determined average; in other cases the first column contains the amount of memory accessed.
plots
: This directory contains the gnuplot scripts to generate graphs from the files in the results directory for the different scenarios. Use

> gnuplot <script>.dem

to generate the graph.
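For orientation, the output directory of a single evaluation could look roughly like the following; the configuration, timestamp and scenario directory names are only placeholders:

    <outdir>/
        data/
            <benchmark arguments>/
                <timestamp>/        (stdout/stderr captured from this run)
        results/
            <scenario>/
                readers.result
                writers.result
                avg.result
        plots/
            <script>.dem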