scieee Science in your language
[en] (orig)
Using application benchmark call graphs
to quantify and improve the practical
relevance of microbenchmark suites
Martin Grambow
1
, Christoph Laaber
2
, Philipp Leitner
3
and
David Bermbach
1
1Mobile Cloud Computing Research Group, TU Berlin & Einstein Center Digital Future, Berlin,
Germany
2Department of Informatics, University of Zurich, Zurich, Switzerland
3Software Engineering Division, Chalmers | University of Gothenburg, Gothenburg, Sweden
ABSTRACT
Performance problems in applications should ideally be detected as soon as they
occur, i.e., directly when the causing code modication is added to the code
repository. To this end, complex and cost-intensive application benchmarks or
lightweight but less relevant microbenchmarks can be added to existing build
pipelines to ensure performance goals. In this paper, we show how the practical
relevance of microbenchmark suites can be improved and veried based on the
application ow during an application benchmark run. We propose an approach to
determine the overlap of common function calls between application and
microbenchmarks, describe a method which identies redundant microbenchmarks,
and present a recommendation algorithm which reveals relevant functions that are
not covered by microbenchmarks yet. A microbenchmark suite optimized in this way
can easily test all functions determined to be relevant by application benchmarks
after every code change, thus, signicantly reducing the risk of undetected
performance problems. Our evaluation using two time series databases shows that,
depending on the specic application scenario, application benchmarks cover
different functions of the system under test. Their respective microbenchmark suites
cover between 35.62% and 66.29% of the functions called during the application
benchmark, offering substantial room for improvement. Through two use cases
removing redundancies in the microbenchmark suite and recommendation of yet
uncovered functionswe decrease the total number of microbenchmarks and
increase the practical relevance of both suites. Removing redundancies can
signicantly reduce the number of microbenchmarks (and thus the execution time as
well) to ~10% and ~23% of the original microbenchmark suites, whereas
recommendation identies up to 26 and 14 newly, uncovered functions to
benchmark to improve the relevance.
By utilizing the differences and synergies of application benchmarks and
microbenchmarks, our approach potentially enables effective software performance
assurance with performance tests of multiple granularities.
Subjects Databases, Distributed and Parallel Computing, Software Engineering
Keywords Benchmarking, Time series databases, Microbenchmarking, Performance testing
How to cite this article Grambow M, Laaber C, Leitner P, Bermbach D. 2021. Using application benchmark call graphs to quantify and
improve the practical relevance of microbenchmark suites. PeerJ Comput. Sci. 7:e548 DOI 10.7717/peerj-cs.548
Submitted 14 October 2020
Accepted 25 April 2021
Published 28 May 2021
Corresponding author
Martin Grambow,
Academic editor
Muhammad Aleem
Additional Information and
Declarations can be found on
page 26
DOI 10.7717/peerj-cs.548
Copyright
2021 Grambow et al.
Distributed under
Creative Commons CC-BY 4.0
INTRODUCTION
With the continuously increasing complexity of software systems, the interest in reliable
and easy-to-use test and evaluation mechanisms has grown as well. While a variety of
techniques, such as unit and integration testing, already exists for the validation of
functional requirements of an application, mechanisms for ensuring non-functional
requirements, e.g., performance, are used more sparingly in practice (Ameller et al., 2012;
Caracciolo, Lungu & Nierstrasz, 2014). Besides live testing techniques such as canary
releases (Schermann, Cito & Leitner, 2018), developers and researchers usually resort to
benchmarking, i.e., the execution of an articially generated workload against the system
under test (SUT), to study and analyze non-functional requirements in articial
production(-near) conditions.
While application benchmarks are the gold standard and very powerful as they
benchmark complete systems, they are hardly suitable for regular use in continuous
integration pipelines due to their long execution time and high costs (Bermbach et al.,
2017;Bermbach & Tai, 2014). Alternatively, less complex and therefore less costly
microbenchmarks could be used, which are also easier to integrate into build pipelines due
to their simpler setup (Laaber & Leitner, 2018). However, a simple substitution can be
dangerous: on the one hand, it is not clear to what extent a microbenchmark suite covers
the functions used in production; on the other hand, often only a complex application
benchmark is suitable for evaluating complex aspects of a system. To link both benchmark
types, we introduce the term practical relevance which refers to the extent to which a
microbenchmark suite targets code segments that are also invoked by application
benchmarks.
In this paper, we aim to determine, quantify, and improve the practical relevance of a
microbenchmark suite by using application benchmarks as a baseline. In real setups,
developers often do not have access to a (representative) live system, e.g., generally-
available software such as database systems are used by many companies which install and
deploy their own instances and, consequently, the softwares developers usually do not
have access to the custom installations and their production traces and logs. In addition,
software is used differently by each customer which results in different load proles as well
as varying congurations. Thus, it is often reasonable to use one or more application
benchmarks as the next accurate proxy to simulate and evaluate a representative articial
production system. The execution of these benchmarks for each code change is very
expensive and time-consuming, but a light-weight microbenchmark suite that has proven
to be practically relevant could replace them to some degree.
To this end, we analyze the called functions of a reference run, which can be (an excerpt
from) a production system or an application benchmark, and compare them with the
functions invoked by microbenchmarks to determine and quantify a microbenchmark
suites practical relevance. If every called function of the reference run is also invoked by at
least one microbenchmark, we consider the respective microbenchmark suite as 100%
practically relevant as the suite covers all functions used in the baseline execution. Based on
this information, we devise two optimization strategies to improve the practical relevance
Grambow et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.548 2/32
of microbenchmark suites according to a reference run: (i) a heuristic for removing
redundancies in an existing microbenchmark suite and (ii) a recommendation algorithm
which identies uncovered but relevant functions.
In this regard, we formulate the following research questions:
RQ1 How to determine and quantify the practical relevance of microbenchmark suites?
Software source code in an object-oriented system is organized in classes and
functions. At runtime, executed functions call other classes and functions, which leads to a
program ow that can be depicted as a call graph. This graph represents which functions
call which other functions and adds additional meta information such as the duration
of the executed function. If these graphs are available for a reference run and the respective
microbenchmark suite, it is possible to compare the ow of both graphs and quantify to
which degree the current microbenchmark suite reects the use in the reference run, or
rather the real usage in production. Our evaluation with two well-known time series
databases shows that their microbenchmark suites cover about 40% of the functions called
during application benchmarks. The majority of the functionality used by an application
benchmark, our proxy for a production application, is therefore uncovered by the
microbenchmark suites of our study objects.
RQ2 How to reduce the execution runtime of microbenchmark suites without affecting
their practical relevance?
If there are many microbenchmarks in a suite, they are likely to have redundancies and
some functions will be benchmarked by multiple microbenchmarks. By creating a new
subset of the respective microbenchmark suite without these redundancies, it is possible
to achieve the same coverage level with fewer microbenchmarks, which signicantly
reduces the overall runtime of the microbenchmark suite. Applying this optimization as
part of our evaluation shows that this can reduce the number of microbenchmarks by 77%
to 90%, depending on the application and benchmark scenario.
RQ3 How to increase the practical relevance within cost efciency constraints?
If the microbenchmark suites coverage is not sufcient, the uncovered graph of the
application benchmark can be used to locate functions which are highly relevant for
practical usage. We present a recommendation algorithm which provides a fast and
automated way to identify these functions that should be covered by microbenchmarks.
Our evaluation shows that an increase in coverage from the original 40% to up to 90% with
only three additional microbenchmarks is theoretically possible. An optimized
microbenchmark suite could, e.g., serve as initial and fast performance smoke test in
continuous integration or deployment (CI/CD) pipelines or for developers who need a
quick performance feedback for their recent changes.
After applying both optimizations, it is possible to cover a maximum portion of an
application benchmark with a minimum suite of microbenchmarks which has several
advantages. First of all, this helps to identify important functions that are relevant in
practice and ensures that their performance is regularly evaluated via microbenchmarks.
Instead of a suite that checks rarely used functions, code sections that are relevant for
practical use are evaluated frequently. Second, microbenchmarks evaluating functions that
are already implicitly covered by other microbenchmarks are selectively removed,
Grambow et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.548 3/32
Advertisement
achieving the same practical relevance with as few microbenchmarks as possible while
reducing the runtime of the total suite. Furthermore, the effort for the creation of
microbenchmarks is minimized because the microbenchmarks of the proposed
functions will cover a large part of the application benchmark call graph and fewer
microbenchmarks are necessary. Developers will still have to design and implement
performance tests, but the identication of highly relevant functions for actual operation is
facilitated and functions that implicitly benchmark many further relevant functions are
pointed out, thus covering a broad call graph. Ultimately, the optimized microbenchmark
suite can be used in CI/CD pipelines more effectively: It is possible to establish a CI/CD
pipeline which, e.g., executes the comparatively simple and short but representative
microbenchmark suite after each change in the code. The complex and cost-intensive
application benchmark can then be executed more sparsely, e.g., for each major release. In
this sense, the application benchmark remains as the gold standard revealing all
performance problems, while the optimized microbenchmark suite is an easy-to-use and
fast heuristic which offers a quick insight into performance yet with obviously lower
accuracy.
It is our hope that this study contributes to the problem of performance testing as part
of CI/CD pipelines and enables a more frequent validation of performance metrics to
detect regressions sooner. Our approach can give targeted advice to developers to improve
the effectiveness and relevance of their microbenchmark suite. Throughout the rest of the
paper, we will always use an application benchmark as the reference run but our approach
can, of course, also use other sources as a baseline.
Contributions:
An approach to determine and quantify the practical relevance of a microbenchmark
suite.
An adaptation of the Greedy-based algorithm proposed by Chen & Lau (1998) to
remove redundancies in a microbenchmark suite.
A recommendation strategy inspired by Rothermel et al. (1999) for new
microbenchmarks which aims to cover large parts of the application benchmarks
function call graph.
An empirical evaluation which analyzes and applies the two optimizations to the
microbenchmark suites of two large open-source time series databases.
Paper Structure: After summarizing relevant background information in
Background, we present our approach to determine, quantify, and improve
microbenchmark suites in Approach. Next, we evaluate our approach by applying the
proposed algorithms to two open-source time series databases in Empirical Evaluation
and discuss its strength and limitations in Discussion. Finally, we outline related work
in Related Workand conclude in Conclusion.
BACKGROUND
This section introduces related background information, in particular this comprises
benchmarking and time series databases.
Grambow et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.548 4/32
Benchmarking
Benchmarking aims to determine quality of service (QoS) by stressing a system under test
(SUT) in a standardized way while observing its reactions, usually in a test or staging
environment (Bermbach et al., 2017;Bermbach, Wittern & Tai, 2017). To provide relevant
results, benchmarks must meet certain requirements such as fairness, portability, or
repeatability (Huppler, 2009;Bermbach et al., 2017;Bermbach, Wittern & Tai, 2017;
Folkerts et al., 2013). In this paper, we deal with two different kinds of benchmarks:
application benchmarks, which evaluate a complete application system, and
microbenchmarks, which evaluate individual functions or methods. Functional testing
as well as monitoring are not a focus of this work, but are of course closely related
(Bermbach, Wittern & Tai, 2017).
Application benchmark
In a so-called application benchmark, the SUT is rst set up and brought into a dened
initial state, e.g., using warmup requests or by inserting initial data. Next, an evaluation
workload is sent to the SUT and the relevant metrics are collected. This method is on
the one hand very powerful, because many relevant aspects and conditions can be
simulated in a dened testbed, but it is very expensive and time-consuming on
the other hand; not only in the preparation but also in the periodic execution.
The evaluation of an entire system involves several crucial tasks to nally come up
with a relevant comparison and conclusion, especially in dynamic cloud environments
(Bermbach et al., 2017;Bermbach, Wittern & Tai, 2017;Grambow, Lehmann & Bermbach,
2019). During the design phase, it is necessary to think in detail about the specic
requirements of the benchmark and its objectives. While dening (and generating) the
workload, many aspects must be taken into account to ensure that the requirements of the
benchmark are not violated and to guarantee a relevant result later on (Huppler, 2009;
Bermbach et al., 2017;Bermbach, Wittern & Tai, 2017;Folkerts et al., 2013). This is
especially difcult in dynamic cloud environments, because it is hard to reproduce results
due to performance variations inherent in cloud systems, random uctuations, and
other cloud-specic characteristics (Lenk et al., 2011;Difallah et al., 2013;Folkerts et al.,
2013;Rabl et al., 2010). To set up an SUT, all components have to be dened and initialized
rst. This can be done with the assistance of automation tools (e.g., Hasenburg et al., 2019;
Hasenburg, Grambow & Bermbach, 2020). However, automation tools still have to be
congured rst, which further complicates the setup of application benchmarks. During
the benchmark run, all components have to be monitored to ensure that there is no
bottleneck inside the benchmarking system, e.g., to avoid quantifying the resources of
the benchmarking clients machine instead of the maximum throughput of the SUT.
Finally, the collected data needs to be transformed into relevant insights, usually in a
subsequent ofine analysis (Bermbach, Wittern & Tai, 2017). Together, these factors
imply that a really continuous application benchmarking, e.g., applied to every code
change, will usually be prohibitively expensive in terms of time but also in monetary cost.
Grambow et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.548 5/32
Advertisement
Loading more pages...