Using application benchmark call graphs to quantify and improve the practical relevance of microbenchmark suites [original]

Using application benchmark call graphs

to quantify and improve the practical

relevance of microbenchmark suites

Martin Grambow

, Christoph Laaber

, Philipp Leitner

and

David Bermbach

1Mobile Cloud Computing Research Group, TU Berlin & Einstein Center Digital Future, Berlin,

Germany

2Department of Informatics, University of Zurich, Zurich, Switzerland

3Software Engineering Division, Chalmers | University of Gothenburg, Gothenburg, Sweden

ABSTRACT

Performance problems in applications should ideally be detected as soon as they

occur, i.e., directly when the causing code modiﬁcation is added to the code

repository. To this end, complex and cost-intensive application benchmarks or

lightweight but less relevant microbenchmarks can be added to existing build

pipelines to ensure performance goals. In this paper, we show how the practical

relevance of microbenchmark suites can be improved and veriﬁed based on the

application ﬂow during an application benchmark run. We propose an approach to

determine the overlap of common function calls between application and

microbenchmarks, describe a method which identiﬁes redundant microbenchmarks,

and present a recommendation algorithm which reveals relevant functions that are

not covered by microbenchmarks yet. A microbenchmark suite optimized in this way

can easily test all functions determined to be relevant by application benchmarks

after every code change, thus, signiﬁcantly reducing the risk of undetected

performance problems. Our evaluation using two time series databases shows that,

depending on the speciﬁc application scenario, application benchmarks cover

different functions of the system under test. Their respective microbenchmark suites

cover between 35.62% and 66.29% of the functions called during the application

benchmark, offering substantial room for improvement. Through two use cases—

removing redundancies in the microbenchmark suite and recommendation of yet

uncovered functions—we decrease the total number of microbenchmarks and

increase the practical relevance of both suites. Removing redundancies can

signiﬁcantly reduce the number of microbenchmarks (and thus the execution time as

well) to ~10% and ~23% of the original microbenchmark suites, whereas

recommendation identiﬁes up to 26 and 14 newly, uncovered functions to

benchmark to improve the relevance.

By utilizing the differences and synergies of application benchmarks and

microbenchmarks, our approach potentially enables effective software performance

assurance with performance tests of multiple granularities.

Subjects Databases, Distributed and Parallel Computing, Software Engineering

Keywords Benchmarking, Time series databases, Microbenchmarking, Performance testing

How to cite this article Grambow M, Laaber C, Leitner P, Bermbach D. 2021. Using application benchmark call graphs to quantify and

improve the practical relevance of microbenchmark suites. PeerJ Comput. Sci. 7:e548 DOI 10.7717/peerj-cs.548

Submitted 14 October 2020

Accepted 25 April 2021

Published 28 May 2021

Corresponding author

Martin Grambow,

[email protected]

Academic editor

Muhammad Aleem

Additional Information and

Declarations can be found on

page 26

DOI 10.7717/peerj-cs.548

2021 Grambow et al.

Distributed under

Creative Commons CC-BY 4.0

INTRODUCTION

With the continuously increasing complexity of software systems, the interest in reliable

and easy-to-use test and evaluation mechanisms has grown as well. While a variety of

techniques, such as unit and integration testing, already exists for the validation of

functional requirements of an application, mechanisms for ensuring non-functional

requirements, e.g., performance, are used more sparingly in practice (Ameller et al., 2012;

Caracciolo, Lungu & Nierstrasz, 2014). Besides live testing techniques such as canary

releases (Schermann, Cito & Leitner, 2018), developers and researchers usually resort to

benchmarking, i.e., the execution of an artiﬁcially generated workload against the system

under test (SUT), to study and analyze non-functional requirements in artiﬁcial

production(-near) conditions.

While application benchmarks are the gold standard and very powerful as they

benchmark complete systems, they are hardly suitable for regular use in continuous

integration pipelines due to their long execution time and high costs (Bermbach et al.,

2017;Bermbach & Tai, 2014). Alternatively, less complex and therefore less costly

microbenchmarks could be used, which are also easier to integrate into build pipelines due

to their simpler setup (Laaber & Leitner, 2018). However, a simple substitution can be

dangerous: on the one hand, it is not clear to what extent a microbenchmark suite covers

the functions used in production; on the other hand, often only a complex application

benchmark is suitable for evaluating complex aspects of a system. To link both benchmark

types, we introduce the term practical relevance which refers to the extent to which a

microbenchmark suite targets code segments that are also invoked by application

benchmarks.

In this paper, we aim to determine, quantify, and improve the practical relevance of a

microbenchmark suite by using application benchmarks as a baseline. In real setups,

developers often do not have access to a (representative) live system, e.g., generally-

available software such as database systems are used by many companies which install and

deploy their own instances and, consequently, the software’s developers usually do not

have access to the custom installations and their production traces and logs. In addition,

software is used differently by each customer which results in different load proﬁles as well

as varying conﬁgurations. Thus, it is often reasonable to use one or more application

benchmarks as the next accurate proxy to simulate and evaluate a representative artiﬁcial

production system. The execution of these benchmarks for each code change is very

expensive and time-consuming, but a light-weight microbenchmark suite that has proven

to be practically relevant could replace them to some degree.

To this end, we analyze the called functions of a reference run, which can be (an excerpt

from) a production system or an application benchmark, and compare them with the

functions invoked by microbenchmarks to determine and quantify a microbenchmark

suite’s practical relevance. If every called function of the reference run is also invoked by at

least one microbenchmark, we consider the respective microbenchmark suite as 100%

practically relevant as the suite covers all functions used in the baseline execution. Based on

this information, we devise two optimization strategies to improve the practical relevance

Grambow et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.548 2/32

of microbenchmark suites according to a reference run: (i) a heuristic for removing

redundancies in an existing microbenchmark suite and (ii) a recommendation algorithm

which identiﬁes uncovered but relevant functions.

In this regard, we formulate the following research questions:

RQ1 How to determine and quantify the practical relevance of microbenchmark suites?

Software source code in an object-oriented system is organized in classes and

functions. At runtime, executed functions call other classes and functions, which leads to a

program ﬂow that can be depicted as a call graph. This graph represents which functions

call which other functions and adds additional meta information such as the duration

of the executed function. If these graphs are available for a reference run and the respective

microbenchmark suite, it is possible to compare the ﬂow of both graphs and quantify to

which degree the current microbenchmark suite reﬂects the use in the reference run, or

rather the real usage in production. Our evaluation with two well-known time series

databases shows that their microbenchmark suites cover about 40% of the functions called

during application benchmarks. The majority of the functionality used by an application

benchmark, our proxy for a production application, is therefore uncovered by the

microbenchmark suites of our study objects.

RQ2 How to reduce the execution runtime of microbenchmark suites without affecting

their practical relevance?

If there are many microbenchmarks in a suite, they are likely to have redundancies and

some functions will be benchmarked by multiple microbenchmarks. By creating a new

subset of the respective microbenchmark suite without these redundancies, it is possible

to achieve the same coverage level with fewer microbenchmarks, which signiﬁcantly

reduces the overall runtime of the microbenchmark suite. Applying this optimization as

part of our evaluation shows that this can reduce the number of microbenchmarks by 77%

to 90%, depending on the application and benchmark scenario.

RQ3 How to increase the practical relevance within cost efﬁciency constraints?

If the microbenchmark suite’s coverage is not sufﬁcient, the uncovered graph of the

application benchmark can be used to locate functions which are highly relevant for

practical usage. We present a recommendation algorithm which provides a fast and

automated way to identify these functions that should be covered by microbenchmarks.

Our evaluation shows that an increase in coverage from the original 40% to up to 90% with

only three additional microbenchmarks is theoretically possible. An optimized

microbenchmark suite could, e.g., serve as initial and fast performance smoke test in

continuous integration or deployment (CI/CD) pipelines or for developers who need a

quick performance feedback for their recent changes.

After applying both optimizations, it is possible to cover a maximum portion of an

application benchmark with a minimum suite of microbenchmarks which has several

advantages. First of all, this helps to identify important functions that are relevant in

practice and ensures that their performance is regularly evaluated via microbenchmarks.

Instead of a suite that checks rarely used functions, code sections that are relevant for

practical use are evaluated frequently. Second, microbenchmarks evaluating functions that

are already implicitly covered by other microbenchmarks are selectively removed,

Grambow et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.548 3/32

achieving the same practical relevance with as few microbenchmarks as possible while

reducing the runtime of the total suite. Furthermore, the effort for the creation of

microbenchmarks is minimized because the microbenchmarks of the proposed

functions will cover a large part of the application benchmark call graph and fewer

microbenchmarks are necessary. Developers will still have to design and implement

performance tests, but the identiﬁcation of highly relevant functions for actual operation is

facilitated and functions that implicitly benchmark many further relevant functions are

pointed out, thus covering a broad call graph. Ultimately, the optimized microbenchmark

suite can be used in CI/CD pipelines more effectively: It is possible to establish a CI/CD

pipeline which, e.g., executes the comparatively simple and short but representative

microbenchmark suite after each change in the code. The complex and cost-intensive

application benchmark can then be executed more sparsely, e.g., for each major release. In

this sense, the application benchmark remains as the gold standard revealing all

performance problems, while the optimized microbenchmark suite is an easy-to-use and

fast heuristic which offers a quick insight into performance yet with obviously lower

accuracy.

It is our hope that this study contributes to the problem of performance testing as part

of CI/CD pipelines and enables a more frequent validation of performance metrics to

detect regressions sooner. Our approach can give targeted advice to developers to improve

the effectiveness and relevance of their microbenchmark suite. Throughout the rest of the

paper, we will always use an application benchmark as the reference run but our approach

can, of course, also use other sources as a baseline.

Contributions:

An approach to determine and quantify the practical relevance of a microbenchmark

suite.

An adaptation of the Greedy-based algorithm proposed by Chen & Lau (1998) to

remove redundancies in a microbenchmark suite.

A recommendation strategy inspired by Rothermel et al. (1999) for new

microbenchmarks which aims to cover large parts of the application benchmark’s

function call graph.

An empirical evaluation which analyzes and applies the two optimizations to the

microbenchmark suites of two large open-source time series databases.

Paper Structure: After summarizing relevant background information in

“Background”, we present our approach to determine, quantify, and improve

microbenchmark suites in “Approach”. Next, we evaluate our approach by applying the

proposed algorithms to two open-source time series databases in “Empirical Evaluation”

and discuss its strength and limitations in “Discussion”. Finally, we outline related work

in “Related Work”and conclude in “Conclusion”.

BACKGROUND

This section introduces related background information, in particular this comprises

benchmarking and time series databases.

Grambow et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.548 4/32

Benchmarking

Benchmarking aims to determine quality of service (QoS) by stressing a system under test

(SUT) in a standardized way while observing its reactions, usually in a test or staging

environment (Bermbach et al., 2017;Bermbach, Wittern & Tai, 2017). To provide relevant

results, benchmarks must meet certain requirements such as fairness, portability, or

repeatability (Huppler, 2009;Bermbach et al., 2017;Bermbach, Wittern & Tai, 2017;

Folkerts et al., 2013). In this paper, we deal with two different kinds of benchmarks:

application benchmarks, which evaluate a complete application system, and

microbenchmarks, which evaluate individual functions or methods. Functional testing

as well as monitoring are not a focus of this work, but are of course closely related

(Bermbach, Wittern & Tai, 2017).

Application benchmark

In a so-called application benchmark, the SUT is ﬁrst set up and brought into a deﬁned

initial state, e.g., using warmup requests or by inserting initial data. Next, an evaluation

workload is sent to the SUT and the relevant metrics are collected. This method is on

the one hand very powerful, because many relevant aspects and conditions can be

simulated in a deﬁned testbed, but it is very expensive and time-consuming on

the other hand; not only in the preparation but also in the periodic execution.

The evaluation of an entire system involves several crucial tasks to ﬁnally come up

with a relevant comparison and conclusion, especially in dynamic cloud environments

(Bermbach et al., 2017;Bermbach, Wittern & Tai, 2017;Grambow, Lehmann & Bermbach,

2019). During the design phase, it is necessary to think in detail about the speciﬁc

requirements of the benchmark and its objectives. While deﬁning (and generating) the

workload, many aspects must be taken into account to ensure that the requirements of the

benchmark are not violated and to guarantee a relevant result later on (Huppler, 2009;

Bermbach et al., 2017;Bermbach, Wittern & Tai, 2017;Folkerts et al., 2013). This is

especially difﬁcult in dynamic cloud environments, because it is hard to reproduce results

due to performance variations inherent in cloud systems, random ﬂuctuations, and

other cloud-speciﬁc characteristics (Lenk et al., 2011;Difallah et al., 2013;Folkerts et al.,

2013;Rabl et al., 2010). To set up an SUT, all components have to be deﬁned and initialized

ﬁrst. This can be done with the assistance of automation tools (e.g., Hasenburg et al., 2019;

Hasenburg, Grambow & Bermbach, 2020). However, automation tools still have to be

conﬁgured ﬁrst, which further complicates the setup of application benchmarks. During

the benchmark run, all components have to be monitored to ensure that there is no

bottleneck inside the benchmarking system, e.g., to avoid quantifying the resources of

the benchmarking client’s machine instead of the maximum throughput of the SUT.

Finally, the collected data needs to be transformed into relevant insights, usually in a

subsequent ofﬂine analysis (Bermbach, Wittern & Tai, 2017). Together, these factors

imply that a really continuous application benchmarking, e.g., applied to every code

change, will usually be prohibitively expensive in terms of time but also in monetary cost.

Grambow et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.548 5/32

Loading more pages...