
Citation: Eversberg, L.; Lambrecht, J. Combining Synthetic Images and Deep Active Learning: Data-Efficient Training of an Industrial Object Detection Model. J. Imaging 2024, 10, 16. https://doi.org/10.3390/jimaging10010016

Academic Editor: Guanghui Wang

Received: 28 November 2023

Revised: 29 December 2023

Accepted: 4 January 2024

Published: 6 January 2024

Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Journal of

Imaging

Article

Combining Synthetic Images and Deep Active Learning: Data-Efficient Training of an Industrial Object Detection Model

Leon Eversberg * and Jens Lambrecht

Industry Grade Networks and Clouds, Faculty IV Electrical Engineering and Computer Science,

Technische Universität Berlin, Straße des 17. Juni 135, 10623 Berlin, Germany; [email protected]

*Correspondence: leon.eversber[email protected]

Abstract:

Generating synthetic data is a promising solution to the challenge of limited training data for

industrial deep learning applications. However, training on synthetic data and testing on real-world

data creates a sim-to-real domain gap. Research has shown that the combination of synthetic and

real images leads to better results than those that are generated using only one source of data. In this

work, the generation of synthetic training images via physics-based rendering is combined with deep

active learning for an industrial object detection task to iteratively improve model performance over

time. Our experimental results show that synthetic images improve model performance, especially at

the beginning of the model’s life cycle with limited training data. Furthermore, our implemented

hybrid query strategy selects diverse and informative new training images in each active learning

cycle, which outperforms random sampling. In conclusion, this work presents a workflow to train

and iteratively improve object detection models with a small number of real-world images, leading

to data-efficient and cost-effective computer vision models.

Keywords:

active learning; computer vision; data efficiency; deep active learning; deep learning;

image synthesis; industrial application; object detection; synthetic images; turbine blade

1. Introduction

Deep learning has become a key technology for solving real-world industrial problems

using artificial intelligence. However, deep learning models often require large-scale

datasets to achieve adequate performance. Limited data remains one of the major challenges

for industrial applications of deep learning [

]. As a solution for computer vision tasks,

synthetic images can be generated and used as training data. Generating synthetic images

has many advantages compared to collecting and manually annotating real-world images.

Synthetic images are fast and cheap to generate. They can be used to balance out real-

world dataset biases [

]. Furthermore, they can be used in situations where there are

privacy concerns surrounding the usage of real-world images [

]. Additionally, they have

pixel-perfect annotations without the potential for human error [4].

However, using synthetic images to train computer vision models and then testing

them on real-world images creates a domain gap that continues to be a challenge in this

field of research [

]. Research has shown that the combination of synthetic and real images

outperforms the use of a single data source [

–

]. But how can real-world training images

be efficiently selected for combination with the generated synthetic images? In this work,

we propose to solve this problem with strategies from the field of active learning (AL).

AL uses the current machine learning model to efficiently select data for the next iteration

of training.

This paper builds upon our previous work to generate training images via physics-

based rendering for industrial object detection (OD) tasks [

] and makes the following

new contributions:


• A workflow is presented to efficiently train industrial object detection models by automatically generating synthetic training images based on 3D models and then using deep active learning to iteratively improve the model with reduced annotation cost.

• Different deep active learning query strategies are investigated on a collected industrial dataset for a real-world object detection use case.

• Multiple deep active learning cycles are compared to a single cycle with an equivalent amount of manually labeled training images.

The remainder of this paper is structured as follows: Section 2 provides a summary of prior work on synthetic images and deep active learning for object detection tasks. In Section 3, the methodology of this paper is presented. Our results for synthetic versus real images and different deep active learning (DAL) query strategies are presented in Section 4. Lastly, Section 5 outlines the limitations of our study and summarizes our primary findings.

2. Related Works

2.1. Using Synthetic Images to Train Computer Vision Models

Generating synthetic training data is a promising solution to the data-hungry nature

of modern deep learning models. However, training models on a source domain of

synthetic images and testing them on a target domain of real images leads to a domain

gap, which remains one of the biggest challenges in this field [

]. In order to overcome

the domain gap, different approaches have been used. A simple strategy is to copy objects

from real images and then paste them onto random background images to create new

images [

]. For industrial applications, available 3D models can be used to train object

detection models [

]. Domain randomization is an approach where training images are

randomized to such an extent that the trained model is supposed to see real images as

just another variation of the synthetic training data [

]. The concept of photorealism

is another approach, where the goal is to create highly realistic images using physics-

based rendering [

]. Physics-based rendering uses the ray-tracing algorithm to

follow the path of light rays through the virtual scene as they bounce off objects in the

scene [

]. Domain adaptation is a third approach to bridging the domain gap. This

technique attempts to make the source domain and the target domain as similar as possible

through image transformations. Synthetic images can be transformed closer to the target

domain using generative adversarial networks [

–

]. Alternatively, image filters can be

used to transform both source and target images to an intermediate domain [24,25].

2.2. Deep Active Learning

AL is a subfield of machine learning that attempts to maximize the performance of

a machine learning model with the least amount of annotated data. The key idea behind

AL is that the model selects the data from which it learns [

]. In traditional AL, most

algorithms query only one sample at a time, which is inefficient for modern deep learning.

Therefore, DAL uses a batch-based query strategy to select the $k$ most useful samples from a large unlabeled pool of data $U$ for annotation to reduce labeling cost while maintaining

performance [

]. To select optimal query samples, unlabeled data are fed into the model to

generate features. Given these features, a query strategy attempts to find an optimal batch

of samples. The selected samples are annotated by the oracle, e.g., a human annotator, and are then added to the labeled training set $L$. Given the updated labeled training set, a new model can be trained. This DAL cycle is depicted in Figure 1. The first iteration of the DAL cycle requires an initial model to be trained on the initial labeled training set $L_0$.
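The cycle described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `run_dal`, `score`, `oracle`, and `train` are hypothetical placeholders, and the sketch assumes that a higher score marks a more useful sample.

```python
from typing import Callable, Dict, List, Tuple

Sample = str  # hypothetical: an image identifier
Label = str   # hypothetical: the annotation for that image


def run_dal(
    unlabeled: List[Sample],
    labeled: Dict[Sample, Label],
    k: int,
    cycles: int,
    score: Callable[[Sample], float],
    oracle: Callable[[Sample], Label],
    train: Callable[[Dict[Sample, Label]], None],
) -> Tuple[List[Sample], Dict[Sample, Label]]:
    """Run a fixed number of DAL cycles; return the remaining pool and the labeled set."""
    unlabeled = list(unlabeled)
    labeled = dict(labeled)
    for _ in range(cycles):
        # Query: rank the pool by score and take the k highest-scoring samples.
        # In practice the score would come from the most recently trained model.
        batch = sorted(unlabeled, key=score, reverse=True)[:k]
        # Annotate: the oracle (e.g., a human annotator) labels the batch.
        for sample in batch:
            labeled[sample] = oracle(sample)
            unlabeled.remove(sample)
        # Update: train a new model on the enlarged labeled training set.
        train(labeled)
    return unlabeled, labeled


toy_scores = {"img_a": 0.1, "img_b": 0.9, "img_c": 0.5, "img_d": 0.7}
remaining, training_set = run_dal(
    ["img_a", "img_b", "img_c", "img_d"], {}, k=2, cycles=1,
    score=toy_scores.__getitem__,
    oracle=lambda sample: "turbine_blade",
    train=lambda data: None,
)
```

With the toy scores above, one cycle moves the two highest-scoring images from the pool into the labeled set.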


[Figure 1: flowchart. Labeled training set $L_0$ → initial training → deep learning model $M$; unlabeled pool $U$ → select $k$ samples according to the query strategy → oracle annotates selected samples → labeled training set $L$ → update dataset and train new model.]

Figure 1. Deep active learning cycle. The large unlabeled pool $U$ is used as input for the current deep learning model. Based on the extracted features, a query strategy selects a batch of $k$ optimal samples for annotation, which can then be used in the next training iteration. Figure based on [27].

Query strategies can be classified into the following three categories: uncertainty-

based query strategies, diversity-based query strategies, and hybrid strategies that combine

uncertainty and diversity [

]. Uncertainty-based query strategies, such as least confidence,

margin sampling, and entropy, select samples that are difficult to predict by the current

model [

]. Diversity-based strategies select batches of unlabeled data samples that are

representative of the unlabeled pool. This includes clustering algorithms such as the well-

known KMeans algorithm [

] and selecting data samples from a small core set that tries

to represent the full dataset distribution [

]. Lastly, hybrid strategies attempt to select

samples that balance diversity and uncertainty. Example algorithms include BADGE [

], Exploitation–Exploration [

], and DBAL [

]. Zhan et al. [

] implemented 17 different

query strategies for DAL and compared them across 7 datasets for image classification.

They found unsatisfactory results for diversity-based strategies compared to uncertainty-

based strategies and hybrid strategies. Based on their evaluation, they recommend trying

uncertainty-based query strategies first for new tasks.
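The three uncertainty measures named above can be written down for a single predicted class distribution. The sketch below uses the common textbook definitions, oriented so that a higher value means a more uncertain prediction; the function names are illustrative, and `margin_uncertainty` assumes at least two classes.

```python
import math
from typing import Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """One minus the probability of the most likely class."""
    return 1.0 - max(probs)


def margin_uncertainty(probs: Sequence[float]) -> float:
    """One minus the gap between the two most likely classes (needs >= 2 classes)."""
    first, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (first - second)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


example = [0.5, 0.3, 0.2]
scores = (least_confidence(example), margin_uncertainty(example), entropy(example))
```

All three measures grow as the prediction becomes less peaked, but they weigh the tail of the distribution differently: least confidence looks only at the top class, margin at the top two, and entropy at every class.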

2.3. Deep Active Learning for Object Detection

While AL is traditionally used for classification tasks, the DAL cycle can also be

used on OD tasks to reduce annotation costs. Because OD models can produce multiple

detections per image, an aggregation method has to be used in order to compute a single

score per image as input to the query strategy [

]. Brust et al. [

] trained a YOLO OD

model [

] on the PASCAL VOC 2012 dataset [

] with DAL using margin sampling as

an uncertainty-based query strategy. In their experimental evaluation, they compared the

aggregation methods sum, maximum, and average to aggregate the uncertainty scores

from multiple bounding box detections. They concluded that, overall, the sum was the best

aggregation method for their data. Haussmann et al. [

] also compared different query

strategies on a large-scale OD dataset including cars, pedestrians, bicycles, traffic signs,

and traffic lights. As a model, they used a one-stage object detector based on a UNet [

]. They found that uncertainty-based query strategies and diversity-based strategies both

performed better than random sampling. Furthermore, they found that letting the query

strategy choose from a combined dataset consisting of the unlabeled pool $U$ and the labeled set $L$ outperforms choosing from $U$ alone while reducing labeling costs.

As described in Section 2.2, before running the first DAL iteration, an initial model

has to be trained. Usually, the initial model is trained by randomly selecting a first batch of samples as $L_0$ [

]. However, randomly sampling a small training set can lead to

low initial model performance. Furthermore, randomly sampling a large initial training set

increases the annotation cost, which is contrary to the goal of DAL. Therefore, in this work,


we propose to train the initial model using synthetically generated images that include

automatically generated annotations.

2.4. Combining Deep Active Learning with Synthetic Images

Peng et al. [

] combined synthetic images with DAL in surgical instrument segmenta-

tion. For each DAL cycle, they query the most informative training images according to the

uncertainty-based query strategy Bayesian active learning by disagreement (BALD) [

]

and then manually label them. Next, they generate additional synthetic images via copy-

and-paste based on the selected images. The authors conclude that combining synthetic

images with deep active learning for image segmentation results in improved performance,

especially with limited labeled data. Similarly, query strategies are used in [

] to select

a limited amount of relevant synthetic images to improve the available real training dataset.

Wang et al. [

] combined AL and synthetic images for weakly-supervised OD. They

generated synthetic training images via copy-and-paste from a few manually annotated

images to train an initial base model. The synthetic images are used in the initial iteration,

and weakly labeled images are used in subsequent iterations to train a teacher–student

OD model.

Our proposed method uses available industrial 3D models to automatically generate

training images via physics-based rendering for an initial OD model. During deployment,

large amounts of unlabeled images can be collected. Given an unlabeled pool of images,

DAL is used to efficiently fine-tune the next model iteration on a small number of manually

labeled images.

3. Materials and Methods

The overall methodology of our approach is summarized in Figure 2. First, a synthetic training dataset $L_0^S$ is automatically generated according to Section 3.1, based on a given 3D model. With these synthetic images, an initial model $M_{0,S}$ is trained, which can then be used for the first DAL cycle with a collected pool of unlabeled real images $U$ (Section 3.2). The model chooses $k$ real training images according to the DAL query strategy from Section 3.4. These images are labeled and added to the labeled training set $L$. Given the previous model and the selected training images, a new model is fine-tuned according to Section 3.3, and the DAL cycle can be repeated in the next iteration $t$.

[Figure 2: flowchart. Initial pre-training with synthetic images: new object detection task → 3D model → automatically generate a synthetic training dataset $L_0^S$ ($t = 0$). DAL cycle with real images: collected unlabeled pool $U$ of real images → DAL query strategy based on features for all $x$ in $U$, $L$ → label selected $k$ images → labeled real images $L$ → fine-tune model $M_t$ on $L$ → $t = t + 1$.]

Figure 2. Proposed workflow to train and improve a data-efficient OD model throughout its life cycle.

3.1. Generating a Synthetic Training Dataset

The open-source 3D creation software Blender is a popular tool amongst many researchers to generate synthetic training images for computer vision tasks, e.g., [ – ].

Blender utilizes a path tracing rendering engine called Cycles for producing physically-

based renders and can be automated using its Python API.

As described in more detail in our previous work [

], Blender v2.93 is used to

automatically generate synthetic training images for a turbine blade detection task. In [

], various strategies for generating images were compared, including different lighting,

background, object texture, additional foreground objects, and bounding box computation.

Based on these results, a virtual camera is created for each scene and one of the three

turbine blade models shown in Figure 3 is added with a randomized position. For the

turbine blade models, a realistic-looking material texture is sampled from a pre-defined

set of texture images that are either gray or dark blue. Furthermore, up to three distractor

objects are added with a randomly selected material texture from a pool of texture images.

For each virtual scene, a high dynamic range image is randomly sampled for image-based

lighting. After rendering the scene, a random image from the COCO dataset [

] is added

to the image background. Thus, we generate an automatically annotated synthetic training

dataset consisting of 5000 different images for our generic turbine blade detection task. As

an example, a Blender scene and the resulting annotated image are shown in Figure 4. Our

code for generating synthetic training data based on 3D models is publicly available on

GitHub (https://github.com/ignc-research/blender-gen, accessed on 28 December 2023).
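The per-scene randomization described above can be illustrated as a configuration sampler. The sketch below is not taken from the blender-gen repository and does not use the Blender Python API; `SceneConfig`, `sample_scene`, the asset names, and the position range are all hypothetical, and the actual rendering step is left out.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class SceneConfig:
    blade_model: str
    blade_texture: str
    blade_position: Tuple[float, ...]
    distractor_textures: List[str]
    hdri: str
    background: str


def sample_scene(
    rng: random.Random,
    blade_models: Sequence[str],
    blade_textures: Sequence[str],
    distractor_textures: Sequence[str],
    hdris: Sequence[str],
    backgrounds: Sequence[str],
    max_distractors: int = 3,
    position_range: float = 0.5,  # hypothetical half-width of the placement volume
) -> SceneConfig:
    """Draw one randomized scene description; rendering is out of scope here."""
    n_distractors = rng.randint(0, max_distractors)  # up to three distractor objects
    return SceneConfig(
        blade_model=rng.choice(blade_models),
        blade_texture=rng.choice(blade_textures),
        blade_position=tuple(rng.uniform(-position_range, position_range) for _ in range(3)),
        distractor_textures=[rng.choice(distractor_textures) for _ in range(n_distractors)],
        hdri=rng.choice(hdris),            # image-based lighting
        background=rng.choice(backgrounds),  # e.g., a random COCO image
    )


rng = random.Random(0)
configs = [
    sample_scene(
        rng,
        blade_models=["blade_1", "blade_2", "guide_vane"],
        blade_textures=["gray", "dark_blue"],
        distractor_textures=["texture_a", "texture_b"],
        hdris=["hdri_a"],
        backgrounds=["coco_a", "coco_b"],
    )
    for _ in range(200)
]
```

Seeding the generator makes a generated dataset reproducible, which is convenient when comparing image generation strategies.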

Figure 3. Three different industrial turbine blade models were used to generate synthetic training images. (a) Turbine blade 3D model 1. (b) Turbine blade 3D model 2. (c) Guide vane 3D model.

Figure 4. Synthetic data generation example. (a) Blender scene with a turbine blade and an additional distractor object. The box shows the camera view. (b) Generated image with bounding box annotation in green.

3.2. Real Dataset of Our Industrial Object Detection Use Case

We collected 1300 images at 1080p resolution from two Microsoft Azure Kinect cameras on

an industrial workbench from our previous work [

] over several days. The images were

collected from two different camera angles. Each image contains a minimum of one and

a maximum of three turbine blades. Example images are depicted in Figure 5. Tools and

additional objects on the workbench create a moderate amount of clutter. We randomly

split the collected data into a pool of 1000 training images and 300 validation images.


Figure 5. Annotated example images from the collected dataset. (a) Top view with three turbine blades on the table. (b) Side view with a clamped turbine blade. (c) Top view with a turbine blade in hand.

3.3. Object Detection Model Training Details

For our object detection model, we used the Faster R-CNN [

] implementation from

MMDetection [

], which uses a feature pyramid network [

] based on a ResNet-50

backbone [

] and is pre-trained on the Microsoft COCO dataset [

]. We trained all our

models with stochastic gradient descent with an input image size of 640 × 360, a batch size of 4, a learning rate of 0.00001, a momentum factor of 0.9, and a weight decay factor of 0.0001 [

]. To increase data efficiency, we used data augmentation during training. We

used the library Albumentations [

] for online data augmentation, where we randomly

performed flipping, color jitter, Gaussian noise, Gaussian blur, shifting, and scaling on

training images. Augmenting training images is particularly useful when fine-tuning the

model with small query batches of real images.

We trained all our models on an Nvidia GeForce RTX 3090 GPU until the average

precision (AP) metric converged on the validation set. The AP metric is widely used to evaluate the performance of an object detection model. It computes the area under the precision-recall curve for a given threshold $T$ and ranges from zero to one. Specifically, we use COCO’s AP@[0.5:0.95], which evaluates AP at 10 different thresholds $T = [0.5, 0.55, \ldots, 0.95]$ on the bounding box intersection over union and averages the results into one single metric. A mathematical definition of AP@[0.5:0.95] can be found in [56].
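To make the role of the thresholds concrete, the following sketch computes the intersection over union (IoU) of two boxes and lists the ten COCO thresholds. It does not compute AP itself, which additionally requires ranking detections and integrating the precision-recall curve; the box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0


# The ten IoU thresholds 0.5, 0.55, ..., 0.95 (rounded to avoid float drift).
COCO_IOU_THRESHOLDS: List[float] = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou_value: float) -> int:
    """Number of thresholds at which a detection with this IoU counts as a match."""
    return sum(iou_value >= t for t in COCO_IOU_THRESHOLDS)


def average_over_thresholds(ap_per_threshold: Sequence[float]) -> float:
    """AP@[0.5:0.95] as the mean of the per-threshold AP values."""
    return sum(ap_per_threshold) / len(ap_per_threshold)
```

A loosely localized box with an IoU of 0.72, for example, counts as a match at the five thresholds from 0.5 to 0.7 but not at the stricter ones, which is why AP@[0.5:0.95] rewards precise localization more than AP at a single threshold of 0.5.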

3.4. Deep Active Learning Pipeline

Based on the comparative survey of DAL query strategies from Zhan et al. [

], we

implemented an uncertainty-based query strategy and a hybrid query strategy. For our

experiments, a pre-trained model is needed to complete one DAL cycle. For experiments

with real images only, a publicly available Faster R-CNN base model pre-trained on the COCO dataset was used. For experiments with synthetic images as described in Section 3.1, the COCO base model was fine-tuned on a labeled training set $L_0^S$ of 5000 synthetic images for 85 epochs, resulting in an average precision of AP@[0.5:0.95] = 0.555 for the synthetic base model $M_S$.

3.4.1. Uncertainty-Based Query Strategy

Considering the results from Brust et al. [

], we chose maximum margin sampling

with the sum aggregation method as our uncertainty-based query strategy. In maximum margin sampling, an informativeness score $s_{\text{margin}}$ for a detected object $x_d$ is calculated according to Equation (1), where $P(\hat{y}_1 \mid x_d)$ is the predicted probability of the class with the highest confidence and $P(\hat{y}_2 \mid x_d)$ is the predicted probability of the second most confident class.

$$s_{\text{margin}}(x_d) = 1 - \left[ P(\hat{y}_1 \mid x_d) - P(\hat{y}_2 \mid x_d) \right] \tag{1}$$

Because an image $x$ can contain multiple detections $D$, an aggregation method is required to combine multiple detections into one score. The sum aggregation method $a_{\text{sum}}(x)$ simply computes the sum over all detections in an image according to Equation (2).

$$a_{\text{sum}}(x) = \sum_{d \in D} s_{\text{margin}}(x_d) \tag{2}$$


If the OD model returns zero detections for an image, then $a_{\text{sum}}(x)$ is set to zero. Intuitively, the uncertainty-based query strategy described in Algorithm 1 will select samples $x$ with multiple uncertain detections per image.
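Equations (1) and (2) translate directly into code. The sketch below assumes that each detection is available as a list of class probabilities; how the second most confident class is obtained from a particular detector is not specified here, so treating a single-entry list as having a runner-up probability of zero is an assumption of this example.

```python
from typing import List, Sequence


def margin_score(class_probs: Sequence[float]) -> float:
    """Equation (1): 1 - [P(y1 | x_d) - P(y2 | x_d)] for one detection."""
    ranked = sorted(class_probs, reverse=True)
    p_first = ranked[0]
    # Assumption: with only one probability given, the runner-up is taken as 0.
    p_second = ranked[1] if len(ranked) > 1 else 0.0
    return 1.0 - (p_first - p_second)


def sum_aggregate(detections: List[Sequence[float]]) -> float:
    """Equation (2): sum of margin scores over all detections of one image.

    An image without detections receives a score of zero.
    """
    return float(sum(margin_score(probs) for probs in detections))


confident_image = [[0.95, 0.05]]
cluttered_image = [[0.6, 0.4], [0.55, 0.45], [0.7, 0.3]]
image_scores = (sum_aggregate(confident_image), sum_aggregate(cluttered_image))
```

Because the scores are summed rather than averaged, an image with several uncertain detections ranks above an image with a single confident one, which matches the intuition stated above.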

Algorithm 1 Maximum margin sampling

Input: Unlabeled pool of images $U$, empty labeled training set $L$, query batch size $k$, pre-trained model $M_S$

Output: Fine-tuned model $M$

    $t = 1$
    loop
        Obtain informativeness score $a_{\text{sum}}(x)$ for every image $x \in \{U, L\}$
        if an image $x$ has no detections then
            Set $a_{\text{sum}}(x) = 0$
        end if
        Select and label the top $k$ images with the highest scores, add them to $L$
        Fine-tune object detection model $M_t$ on labeled training set $L$
        $t = t + 1$
    end loop
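The selection step of Algorithm 1 reduces to a top-$k$ query over per-image scores. A minimal sketch, assuming the aggregated scores are already available in a dictionary keyed by a hypothetical image identifier:

```python
import heapq
from typing import Dict, List


def select_top_k(image_scores: Dict[str, float], k: int) -> List[str]:
    """Return the k image identifiers with the highest informativeness score."""
    return heapq.nlargest(k, image_scores, key=image_scores.get)


# Images without detections carry a score of zero and are therefore selected last.
scores = {"img_a": 0.0, "img_b": 2.1, "img_c": 0.8, "img_d": 1.5}
query_batch = select_top_k(scores, k=2)
```

`heapq.nlargest` avoids sorting the whole pool, which only matters for pools much larger than the 1000 images used here.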

3.4.2. Hybrid Query Strategy

As a hybrid query strategy, we chose the diverse mini-batch active learning (DBAL)

algorithm from Zhdanov [

]. As described in Algorithm 2, DBAL first filters out training

images with a low informativeness score by using a pre-filter factor $\beta$. To this end, the top $\beta k$ images are selected for further processing. In our experiments, $\beta = 2$ was used. Then, diverse samples are selected from the remaining $\beta k$ images with weighted KMeans++

clustering [

], where the weights are represented by the maximum margin informativeness scores. By selecting the image closest to each of the $k$ cluster centers, the selected training images are expected to be more diverse.

In order to perform clustering, feature vectors that represent the training images are required. We use the last feature map of size (256, 90, 160) from the feature pyramid network model

[

] and perform global average pooling to convert the feature map

to a one-dimensional feature vector of size 256. These feature vectors are then used for

weighted KMeans++ clustering.
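Global average pooling collapses each channel of a feature map to its mean, so a map of shape (256, 90, 160) becomes a vector of length 256. The dependency-free sketch below represents the feature map as nested lists in channel, height, width order; in practice the same operation would be applied to a tensor, and the function name is illustrative.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def global_average_pool(feature_map: FeatureMap) -> List[float]:
    """Average every channel over its spatial positions."""
    pooled = []
    for channel in feature_map:
        n_values = len(channel) * len(channel[0])
        pooled.append(sum(sum(row) for row in channel) / n_values)
    return pooled


# Toy feature map with 2 channels of size 2 x 2.
toy_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
feature_vector = global_average_pool(toy_map)
```

The pooled vector no longer encodes where in the image a feature occurred, only how strongly it responded overall, which keeps the clustering input small and independent of object position.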

Algorithm 2 DBAL

Input: Unlabeled pool of images $U$, empty labeled training set $L$, query batch size $k$, pre-filter factor $\beta$, pre-trained model $M_S$

Output: Fine-tuned model $M$

    $t = 1$
    loop
        Obtain informativeness score $a_{\text{sum}}(x)$ for every image $x \in \{U, L\}$
        if an image $x$ has no detections then
            Set $a_{\text{sum}}(x) = 0$
        end if
        Pre-filter to top $\beta k$ informative images
        Cluster $\beta k$ images to $k$ clusters with weighted KMeans++
        Select and label $k$ images closest to the cluster centers, add them to $L$
        Fine-tune the object detection model $M_t$ on labeled training set $L$
        $t = t + 1$
    end loop
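The query step of Algorithm 2 can be sketched without external libraries. This is an illustrative reimplementation under stated assumptions, not the authors' code: `dbal_select` and its parameters are hypothetical names, the seeding samples candidates proportionally to weight times squared distance, and a fixed number of Lloyd iterations is used instead of a convergence test. A library implementation of weighted k-means could be substituted for steps 2 and 3.

```python
import random
from typing import Dict, List, Sequence

Vector = Sequence[float]


def _sqdist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dbal_select(features: Dict[str, Vector], scores: Dict[str, float],
                k: int, beta: int = 2, iters: int = 20, seed: int = 0) -> List[str]:
    """Pick k informative and diverse images from the scored pool."""
    rng = random.Random(seed)
    # Step 1: pre-filter to the beta * k images with the highest scores.
    cand = sorted(scores, key=scores.get, reverse=True)[: beta * k]
    if len(cand) <= k:
        return cand
    weight = {c: max(scores[c], 1e-12) for c in cand}  # keep weights positive
    # Step 2: KMeans++-style seeding, weighted by the informativeness score.
    first = rng.choices(cand, weights=[weight[c] for c in cand])[0]
    centers = [list(features[first])]
    while len(centers) < k:
        d2 = [weight[c] * min(_sqdist(features[c], m) for m in centers) for c in cand]
        pick = rng.choices(cand, weights=d2)[0] if sum(d2) > 0 else rng.choice(cand)
        centers.append(list(features[pick]))
    # Step 3: Lloyd iterations with score-weighted centroid updates.
    for _ in range(iters):
        groups: List[List[str]] = [[] for _ in centers]
        for c in cand:
            nearest = min(range(k), key=lambda i: _sqdist(features[c], centers[i]))
            groups[nearest].append(c)
        for j, members in enumerate(groups):
            if members:
                total = sum(weight[c] for c in members)
                centers[j] = [
                    sum(weight[c] * features[c][t] for c in members) / total
                    for t in range(len(centers[j]))
                ]
    # Step 4: take the not-yet-chosen candidate closest to each cluster center.
    chosen: List[str] = []
    for m in centers:
        remaining = [c for c in cand if c not in chosen]
        chosen.append(min(remaining, key=lambda c: _sqdist(features[c], m)))
    return chosen


toy_features = {"a": (0.0, 0.0), "b": (0.0, 0.1), "c": (10.0, 10.0),
                "d": (10.0, 10.1), "e": (5.0, 5.0)}
toy_scores = {"a": 1.0, "b": 0.9, "c": 0.8, "d": 0.7, "e": 0.1}
picked = dbal_select(toy_features, toy_scores, k=2, beta=2)
```

In the toy example, pure top-$k$ selection would return the near-duplicates `a` and `b`, whereas the clustering step returns one image from each of the two groups, and the low-scoring image `e` is removed by the pre-filter.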

4. Results

Using the described methodology from Section 3, we trained multiple OD models

by combining synthetic data and DAL. As training data, we used either only real training


images (R) or we used the synthetically pre-trained model $M_S$ and then fine-tuned it on real images (S+R). For DAL query strategies, we implemented the two described algorithms from Sections 3.4.1 and 3.4.2. Additionally, we implemented a random sampling strategy as a baseline, which shuffles the unlabeled pool of images and then selects a batch of $k$ training images. We ran each random strategy three times using different random seeds.
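A seeded version of this baseline is easy to reproduce. The following sketch assumes the pool is a list of image identifiers; the function name and the seed values are illustrative and are not the seeds used in the experiments.

```python
import random
from typing import List, Sequence


def random_query(pool: Sequence[str], k: int, seed: int) -> List[str]:
    """Shuffle a copy of the pool with a fixed seed and return the first k images."""
    shuffled = list(pool)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:k]


pool = [f"img_{i:04d}" for i in range(1000)]
batches = {seed: random_query(pool, k=25, seed=seed) for seed in (1, 2, 3)}
```

Using a dedicated `random.Random` instance per call keeps the baseline independent of any other random state in the training code.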

4.1. Combining Synthetic Images and Deep Active Learning for One DAL Cycle

First, we ran experiments for Algorithms 1 and 2 and for random sampling for one DAL cycle with different query batch sizes $k$. Results for the different DAL query strategies are shown in Figure 6. All numerical results can be found in Table A1 in Appendix A.

[Figure 6: line chart. x-axis: batch size of queried training images $k$ (0 to 200); y-axis: AP@[0.5:0.95] (0.45 to 0.8); series: R Random, S+R Random, S+R Maximum Margin, S+R DBAL ($\beta = 2$).]

Figure 6. Results for the first DAL cycle with different query strategies. R Random: Baseline model using random sampling and only real images. S+R Random: Synthetic base model fine-tuned on real images with random sampling. S+R Maximum Margin: Synthetic base model fine-tuned on real images with Algorithm 1. S+R DBAL: Synthetic base model fine-tuned on real images with Algorithm 2.

Using synthetic training images for model pre-training always outperformed using

only real images. In fact, the difference between using synthetic images and not using

synthetic images is much greater than the difference between the different query strategies.

The results show that the importance of synthetic images increases as the number of

labeled training images decreases. For $k = 10$, the model pre-trained on a synthetic

dataset (S+R Random) increased the AP@[0.5:0.95] by 30.5% compared to the baseline

model trained only on real images (R Random).

The hybrid query strategy DBAL has a higher AP than the random query strategy for all batch sizes $k$ and shows the best performance overall. The chart shows that DAL query strategies are most useful with a small number of training images selected from a bigger pool of unlabeled data. The largest improvement over random sampling is at $k = 25$, where S+R DBAL increased the AP by 4.5% in comparison to S+R Random. In other words, using 25 real training images with S+R DBAL yielded equivalent AP results to randomly selecting about 50 training images. For large batch sizes with $k \geq 100$, neither DAL query strategy yielded a meaningful improvement in model performance over random sampling in the first DAL cycle. As $k$ approaches the total number of images in $U$, all query strategies must converge eventually. As shown by the standard error, selecting training images randomly yields varying AP values due to dependence on the random seed. Therefore, employing DAL minimizes the chance of selecting an unfavorable random seed.
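The spread across random seeds can be quantified from the per-seed values in Table A1. The sketch below uses the three R Random results for 10 real training images; it assumes the standard error is computed from the sample standard deviation, which may differ from the convention used for the error bars in the chart.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error (sample standard deviation divided by sqrt(n))."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error


# AP@[0.5:0.95] of R Random with 10 real training images, seeds 1 to 3 (Table A1).
ap_mean, ap_se = mean_and_standard_error([0.479, 0.507, 0.507])
```

For these three runs the mean is about 0.498, matching the average reported in Table A1, with a standard error of roughly 0.009.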

Figure 7 shows the top five images selected from the unlabeled pool $U$ by the initial model $M_{0,S}$ according to the different query strategies in the first DAL cycle. As expected from Equations (1) and (2), maximum margin sampling and DBAL both select images from the unlabeled pool $U$ with many false positive detections with high uncertainty.

Figure 7. Top five training images for the initial model $M_{0,S}$ from the unlabeled pool $U$ according to the different query strategies. Bounding box predictions are displayed in blue, including the turbine blade class confidence value. Best viewed with zoom. (a) Top five training images according to S+R Random. (b) Top five training images according to S+R Maximum Margin. (c) Top five training images according to S+R DBAL.

4.2. Multiple Deep Active Learning Cycles

Based on our findings in Section 4.1, we opted for DBAL as our query strategy with a fixed batch size of $k = 25$. Starting with the synthetic base model $M_S$, the model was iteratively fine-tuned for eight DAL cycles according to Algorithm 2. At each cycle, the labeled training set $L$ was extended by the 25 selected samples $x \in \{U, L\}$, based on the feature vectors from the previously trained model. Figure 8 compares the results for DBAL with up to eight DAL cycles to the previous results for a single cycle. Numerical results can be found in Table A1 in Appendix A.


[Figure 8: line chart. x-axis: number of used training images (20 to 200); y-axis: AP@[0.5:0.95] (0.66 to 0.82); series: S+R Random: 1 Cycle, S+R DBAL ($\beta = 2$): 1 Cycle, S+R DBAL ($\beta = 2$): 8 Cycles.]

Figure 8. Results for one DAL cycle with varying batch sizes $k$ compared to eight DAL cycles with a fixed batch size of $k = 25$.

The results presented in Figure 8 show that running DBAL for multiple DAL cycles yields better OD performance compared to running only a single cycle with an equivalent number of training images. For instance, a single cycle of DBAL with 150 labeled images performed the same as running four cycles of DBAL with 25 new images each time, which requires at most 100 labeled images. Qualitative results on validation images are depicted in Figure 9, which shows the iterative learning of the model over the course of multiple DBAL cycles. False positive detections are reduced and the confidence values of turbine blade detections increase with each new cycle.

Figure 9. Qualitative results from S+R DBAL on validation images throughout multiple active learning cycles. Bounding box predictions are displayed in blue, including the turbine blade class confidence value. Best viewed with zoom. (a) Results from the initial model trained only on synthetic images. (b) Results after one cycle with real images. (c) Results after three cycles with real images. (d) Results after five cycles with real images.


5. Discussion and Conclusions

To summarize, this work combined the generation of synthetic training images with

DAL in order to train industrial OD models with minimal manual annotations. The base

model is initially trained on automatically generated synthetic images and subsequently

fine-tuned in each DAL cycle with real images. The synthetic base model enables early

deployment, while unlabeled real training images can be collected over time. To ensure

data efficiency, the DAL query strategy selects a limited batch of images for training from

a larger pool of unlabeled images. On our turbine blade detection dataset, we found that

using synthetic images for pre-training improved model performance, especially when

the number of real training images was small. Additionally, the hybrid query strategy

DBAL outperformed uncertainty-based maximum margin sampling and random sampling

for small batch sizes. Furthermore, running multiple DAL cycles with a small batch size

performed better than running only one cycle with an equivalent number of training images.

Utilizing DAL can either increase model performance with the same amount of data, or

provide the same performance with fewer data compared to randomly selecting training

images. Additionally, employing DAL minimizes the risk of selecting an unfavorable batch

of training images by chance.

Our findings are limited by our specific industrial use case of a turbine blade detection

model. However, the presented methodology is not restricted to turbine blades and can

be applied to any object. In future work, we plan to apply our approach to new industrial

applications and datasets. For both of our implemented DAL query strategies, we used

maximum margin as an informativeness score combined with the sum aggregation method.

Choosing an alternative informativeness score and aggregation method could lead to

different results. For our experiments with multiple DAL cycles in Section 4.2, we did not change the unlabeled pool of images $U$. However, during real-world deployment of an OD model, it is possible to collect new images over time. A steady increase in $U$ will provide the DAL query strategy with a larger selection of images to choose from.

As a next step, we would like to train and iteratively improve multiple OD models

using the developed workflow over a longer period of time on the shop floor. Future

work should incorporate best practices from the machine learning operations (MLOps)

paradigm [

] to automatically train and test new models and to ensure that each model

update performs better than the previous model. Automatic triggering of a new DAL cycle

could be initiated through continuous model monitoring. For instance, this could occur

when a specific amount of new data in $U$ has been collected, a certain time period has passed, a dataset shift is detected [59], or model performance declines on key metrics.
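As an illustration of such monitoring, the four trigger conditions could be combined in a single check. Everything in this sketch is hypothetical: the `MonitorState` fields, the function name, and the default thresholds are placeholders chosen for the example, not values proposed or evaluated in this work.

```python
from dataclasses import dataclass


@dataclass
class MonitorState:
    new_images_in_pool: int       # images added to U since the last DAL cycle
    days_since_last_cycle: float
    dataset_shift_detected: bool  # output of a separate shift detector
    reference_ap: float           # AP of the deployed model at release time
    current_ap: float             # AP on a recent labeled monitoring set


def should_trigger_dal_cycle(
    state: MonitorState,
    min_new_images: int = 500,    # hypothetical threshold
    max_days: float = 30.0,       # hypothetical threshold
    max_ap_drop: float = 0.02,    # hypothetical threshold
) -> bool:
    """Return True if any of the monitoring conditions calls for a new DAL cycle."""
    return (
        state.new_images_in_pool >= min_new_images
        or state.days_since_last_cycle >= max_days
        or state.dataset_shift_detected
        or (state.reference_ap - state.current_ap) > max_ap_drop
    )


quiet = MonitorState(40, 3.0, False, reference_ap=0.78, current_ap=0.78)
degraded = MonitorState(40, 3.0, False, reference_ap=0.78, current_ap=0.70)
decisions = (should_trigger_dal_cycle(quiet), should_trigger_dal_cycle(degraded))
```

Combining the conditions with a logical OR means that any single signal starts a cycle; a deployment that wants fewer retraining runs could require several signals at once instead.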

Author Contributions:

Conceptualization, J.L. and L.E.; methodology, L.E.; software, L.E.; data curation,

L.E.; writing—original draft preparation, L.E.; writing—review and editing, J.L.; visualization,

L.E.; supervision, J.L.; funding acquisition, J.L. All authors have read and agreed to the published

version of the manuscript.

Funding:

This work is part of the project MRO 2.0—Maintenance, Repair and Overhaul and was

supported in part by the European Regional Development Fund (ERDF) under grant number

ProFIT-10167454. We acknowledge support by the German Research Foundation and the Open

Access Publication Fund of TU Berlin.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement:

The industrial turbine blade data are not publicly available due to

protection of intellectual property.

Acknowledgments:

We would like to thank our MRO 2.0 project partners Siemens Energy, Gestalt

Robotics, and Fraunhofer Institute for Production Systems and Design Technology.

Conflicts of Interest: The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AL active learning

AP average precision

DAL deep active learning

DBAL diverse mini-batch active learning

MLOps machine learning operations

OD object detection

Appendix A

Numerical results from the experiments in Section 4 are documented in Table A1.

Table A1.

Numerical results for all experiments. R: Only real training images were used; S+R: Synthetic

base model fine-tuned with real training images.

Strategy              Number of Real    AP@[0.5:0.95]   AP@[0.5:0.95]   AP@[0.5:0.95]   (Average)
                      Training Images   Random Seed 1   Random Seed 2   Random Seed 3   AP@[0.5:0.95]
-----------------------------------------------------------------------------------------------------
R Random *            10                0.479           0.507           0.507           0.498
                      25                0.578           0.582           0.601           0.587
                      50                0.686           0.665           0.645           0.665
                      100               0.733           0.732           0.720           0.728
                      150               0.754           0.761           0.751           0.755
                      200               0.765           0.771           0.763           0.766
S+R Random *          0                 -               -               -               0.555
                      10                0.636           0.655           0.660           0.650
                      25                0.671           0.692           0.698           0.687
                      50                0.724           0.718           0.720           0.721
                      100               0.757           0.765           0.757           0.760
                      150               0.778           0.782           0.778           0.779
                      200               0.782           0.788           0.790           0.787
S+R Max. Margin       0                 -               -               -               0.555
                      10                -               -               -               0.668
                      25                -               -               -               0.712
                      50                -               -               -               0.725
                      100               -               -               -               0.753
                      150               -               -               -               0.773
                      200               -               -               -               0.776
S+R DBAL              0                 -               -               -               0.555
                      10                -               -               -               0.666
                      25                -               -               -               0.718
                      50                -               -               -               0.735
                      100               -               -               -               0.763
                      150               -               -               -               0.782
                      200               -               -               -               0.791
S+R DBAL (8 cycles)   0                 -               -               -               0.555
                      25                -               -               -               0.718
                      50                -               -               -               0.747
                      75                -               -               -               0.768
                      100               -               -               -               0.778
                      125               -               -               -               0.791
                      150               -               -               -               0.796
                      175               -               -               -               0.798
                      200               -               -               -               0.800

* Random sampling strategies were repeated with three different random seeds.
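
Table A1 can also be read in terms of data efficiency, that is, how many real training images a strategy needs to reach a given AP. The following sketch uses the average AP values of three strategies copied from Table A1; the helper `images_needed` is a hypothetical name introduced for illustration, and it only considers the training set sizes that were actually evaluated.

```python
from typing import Dict, Optional

# AP@[0.5:0.95] per number of real training images, copied from Table A1
# (average over seeds for the random strategies, single run for DBAL).
R_RANDOM: Dict[int, float] = {
    10: 0.498, 25: 0.587, 50: 0.665, 100: 0.728, 150: 0.755, 200: 0.766,
}
SR_RANDOM: Dict[int, float] = {
    0: 0.555, 10: 0.650, 25: 0.687, 50: 0.721, 100: 0.760, 150: 0.779, 200: 0.787,
}
SR_DBAL_8: Dict[int, float] = {
    0: 0.555, 25: 0.718, 50: 0.747, 75: 0.768, 100: 0.778,
    125: 0.791, 150: 0.796, 175: 0.798, 200: 0.800,
}


def images_needed(results: Dict[int, float], target_ap: float) -> Optional[int]:
    """Smallest evaluated training set size whose AP reaches target_ap."""
    reached = [n for n, ap in results.items() if ap >= target_ap]
    return min(reached) if reached else None


# Target: the AP that random sampling on real images reaches with 200 images.
target = R_RANDOM[200]
needed_sr_random = images_needed(SR_RANDOM, target)
needed_dbal = images_needed(SR_DBAL_8, target)
```

For this target of 0.766, the sketch returns 150 images for S+R Random and 75 images for S+R DBAL (8 cycles). Because only the evaluated training set sizes are considered and the DBAL values stem from a single run, these numbers should be read as a coarse indication rather than an exact break-even point.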

References

1. Gupta, C.; Farahat, A. Deep Learning for Industrial AI: Challenges, New Methods and Best Practices. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, New York, NY, USA, 6–10 July 2020. [CrossRef]

2. Torralba, A.; Efros, A.A. Unbiased look at dataset bias. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), Colorado Springs, CO, USA, 20–25 June 2011; IEEE: New York, NY, USA, 2011. [CrossRef]

3. Coyner, A.S.; Chen, J.S.; Chang, K.; Singh, P.; Ostmo, S.; Chan, R.V.P.; Chiang, M.F.; Kalpathy-Cramer, J.; Campbell, J.P. Synthetic Medical Images for Robust, Privacy-Preserving Training of Artificial Intelligence: Application to Retinopathy of Prematurity Diagnosis. Ophthalmol. Sci. 2022, 2, 100126. [CrossRef] [PubMed]

4. Northcutt, C.; Athalye, A.; Mueller, J. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021), Virtual, 6–14 December 2021; Vanschoren, J., Yeung, S., Eds.; Curran: Red Hook, NY, USA, 2021; Volume 1.

5. Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: New York, NY, USA, 2017. [CrossRef]

6. Lambrecht, J.; Kästner, L. Towards the Usage of Synthetic Data for Marker-Less Pose Estimation of Articulated Robots in RGB Images. In Proceedings of the 2019 19th International Conference on Advanced Robotics (ICAR), Belo Horizonte, Brazil, 2–6 December 2019; IEEE: New York, NY, USA, 2019. [CrossRef]

7. Nowruzi, F.E.; Kapoor, P.; Kolhatkar, D.; Hassanat, F.A.; Laganiere, R.; Rebut, J. How much real data do we actually need: Analyzing object detection performance using synthetic and real data. arXiv 2019, arXiv:1907.07061. [CrossRef]

8. Movshovitz-Attias, Y.; Kanade, T.; Sheikh, Y. How Useful Is Photo-Realistic Rendering for Visual Learning? In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; pp. 202–217. [CrossRef]

9. de Melo, C.M.; Rothrock, B.; Gurram, P.; Ulutan, O.; Manjunath, B. Vision-Based Gesture Recognition in Human-Robot Teams Using Synthetic Data. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Virtual, 24 October 2020–24 January 2021; pp. 10278–10284. [CrossRef]

10. Yang, X.; Fan, X.; Wang, J.; Lee, K. Image Translation Based Synthetic Data Generation for Industrial Object Detection and Pose Estimation. IEEE Robot. Autom. Lett. 2022, 7, 7201–7208. [CrossRef]

11. Eversberg, L.; Lambrecht, J. Generating Images with Physics-Based Rendering for an Industrial Object Detection Task: Realism versus Domain Randomization. Sensors 2021, 21, 7901. [CrossRef] [PubMed]

12. Schraml, D. Physically based synthetic image generation for machine learning: A review of pertinent literature. In Proceedings of the Photonics and Education in Measurement Science 2019, Jena, Germany, 17–19 September 2019; Proc. SPIE: Bellingham, WA, USA, 2019; Volume 11144. [CrossRef]

13. Georgakis, G.; Mousavian, A.; Berg, A.; Kosecka, J. Synthesizing Training Data for Object Detection in Indoor Scenes. In Proceedings of the Robotics: Science and Systems XIII. Robotics: Science and Systems Foundation, Cambridge, MA, USA, 12–16 July 2017. [CrossRef]

14. Dwibedi, D.; Misra, I.; Hebert, M. Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017. [CrossRef]

15. Gorschlüter, F.; Rojtberg, P.; Pöllabauer, T. A Survey of 6D Object Detection Based on 3D Models for Industrial Applications. J. Imaging 2022, 8, 53. [CrossRef] [PubMed]

16. Tremblay, J.; Prakash, A.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C.; To, T.; Cameracci, E.; Boochoon, S.; Birchfield, S. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018. [CrossRef]

17. Prakash, A.; Boochoon, S.; Brophy, M.; Acuna, D.; Cameracci, E.; State, G.; Shapira, O.; Birchfield, S. Structured Domain Randomization: Bridging the Reality Gap by Context-Aware Synthetic Data. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, Canada, 20–24 May 2019; IEEE: New York, NY, USA, 2019. [CrossRef]

18. Hodan, T.; Vineet, V.; Gal, R.; Shalev, E.; Hanzelka, J.; Connell, T.; Urbina, P.; Sinha, S.N.; Guenter, B. Photorealistic Image Synthesis for Object Instance Detection. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; IEEE: New York, NY, USA, 2019. [CrossRef]

19. Jabbar, A.; Farrawell, L.; Fountain, J.; Chalup, S.K. Training Deep Neural Networks for Detecting Drinking Glasses Using Synthetic Images. In Neural Information Processing; Springer International Publishing: Cham, Switzerland, 2017; pp. 354–363. [CrossRef]

20. Pharr, M.; Jakob, W.; Humphreys, G. Physically Based Rendering: From Theory to Implementation, 3rd ed.; Morgan Kaufmann: Burlington, MA, USA, 2016.

21. Shrivastava, A.; Pfister, T.; Tuzel, O.; Susskind, J.; Wang, W.; Webb, R. Learning From Simulated and Unsupervised Images Through Adversarial Training. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2242–2251. [CrossRef]

22. Sankaranarayanan, S.; Balaji, Y.; Jain, A.; Lim, S.N.; Chellappa, R. Learning From Synthetic Data: Addressing Domain Shift for Semantic Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3752–3761. [CrossRef]

23. Peng, X.; Saenko, K. Synthetic to Real Adaptation with Generative Correlation Alignment Networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: New York, NY, USA, 2018. [CrossRef]

24. Rojtberg, P.; Pollabauer, T.; Kuijper, A. Style-transfer GANs for bridging the domain gap in synthetic pose estimator training. In Proceedings of the 2020 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), Virtual, 14–18 December 2020; IEEE: New York, NY, USA, 2020. [CrossRef]

25. Su, Y.; Rambach, J.; Pagani, A.; Stricker, D. SynPo-Net—Accurate and Fast CNN-Based 6DoF Object Pose Estimation Using Synthetic Training. Sensors 2021, 21, 300. [CrossRef] [PubMed]

26. Settles, B. Active Learning Literature Survey; Computer Sciences Technical Report 1648; University of Wisconsin: Madison, WI, USA, 2009.

27. Ren, P.; Xiao, Y.; Chang, X.; Huang, P.Y.; Li, Z.; Gupta, B.B.; Chen, X.; Wang, X. A Survey of Deep Active Learning. ACM Comput. Surv. 2021, 54, 1–40. [CrossRef]

28. Zhan, X.; Wang, Q.; Huang, K.-H.; Xiong, H.; Dou, D.; Chan, A.B. A Comparative Survey of Deep Active Learning. arXiv 2022, arXiv:2203.13450. [CrossRef]

29. Wang, D.; Shang, Y. A new active labeling method for deep learning. In Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN), Beijing, China, 6–11 July 2014; IEEE: New York, NY, USA, 2014. [CrossRef]

30. Sener, O.; Savarese, S. Active Learning for Convolutional Neural Networks: A Core-Set Approach. In Proceedings of the 2018 International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.

31. Ash, J.T.; Zhang, C.; Krishnamurthy, A.; Langford, J.; Agarwal, A. Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. In Proceedings of the 2020 International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020.

32. Yin, C.; Qian, B.; Cao, S.; Li, X.; Wei, J.; Zheng, Q.; Davidson, I. Deep Similarity-Based Batch Mode Active Learning with Exploration-Exploitation. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; IEEE: New York, NY, USA, 2017. [CrossRef]

33. Zhdanov, F. Diverse mini-batch Active Learning. arXiv 2019, arXiv:1901.05954. [CrossRef]

34. Li, Y.; Fan, B.; Zhang, W.; Ding, W.; Yin, J. Deep active learning for object detection. Inf. Sci. 2021, 579, 418–433. [CrossRef]

35. Brust, C.A.; Käding, C.; Denzler, J. Active Learning for Deep Object Detection. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP), Prague, Czech Republic, 25–27 February 2019; SciTePress: Setúbal, Portugal, 2019; pp. 181–190. [CrossRef]

36. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.

37. Everingham, M.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2009, 88, 303–338. [CrossRef]

38. Haussmann, E.; Fenzi, M.; Chitta, K.; Ivanecky, J.; Xu, H.; Roy, D.; Mittel, A.; Koumchatzky, N.; Farabet, C.; Alvarez, J.M. Scalable Active Learning for Object Detection. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; IEEE: New York, NY, USA, 2020. [CrossRef]

39. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [CrossRef]

40. Peng, H.; Lin, S.; King, D.; Su, Y.H.; Bly, R.A.; Moe, K.S.; Hannaford, B. Reducing Annotating Load: Active Learning with Synthetic Images in Surgical Instrument Segmentation. arXiv 2021, arXiv:2108.03534. [CrossRef]

41. Houlsby, N.; Huszár, F.; Ghahramani, Z.; Lengyel, M. Bayesian Active Learning for Classification and Preference Learning. arXiv 2011, arXiv:1112.5745. [CrossRef]

42. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [CrossRef]

43. Niemeijer, J.; Mittal, S.; Brox, T. Synthetic Dataset Acquisition for a Specific Target Domain. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Paris, France, 2–6 October 2023; pp. 4055–4064.

44. Wang, Y.; Ilic, V.; Li, J.; Kisačanin, B.; Pavlovic, V. ALWOD: Active Learning for Weakly-Supervised Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 6459–6469.

45. Denninger, M.; Sundermeyer, M.; Winkelbauer, D.; Olefir, D.; Hodan, T.; Zidan, Y.; Elbadrawy, M.; Knauer, M.; Katam, H.; Lodhi, A. BlenderProc: Reducing the Reality Gap with Photorealistic Rendering. In Proceedings of the Robotics: Science and Systems (RSS), Virtual, 12–16 July 2020.

46. Dirr, J.; Gebauer, D.; Yao, J.; Daub, R. Automatic Image Generation Pipeline for Instance Segmentation of Deformable Linear Objects. Sensors 2023, 23, 3013. [CrossRef] [PubMed]

47. Druskinis, V.; Araya-Martinez, J.M.; Lambrecht, J.; Bøgh, S.; de Figueiredo, R.P. A Hybrid Approach for Accurate 6D Pose Estimation of Textureless Objects From Monocular Images. In Proceedings of the 2023 IEEE 28th International Conference on Emerging Technologies and Factory Automation (ETFA), Sinaia, Romania, 12–15 September 2023; pp. 1–8. [CrossRef]

48. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision—ECCV 2014; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [CrossRef]

49. Eversberg, L.; Lambrecht, J. Evaluating digital work instructions with augmented reality versus paper-based documents for manual, object-specific repair tasks in a case study with experienced workers. Int. J. Adv. Manuf. Technol. 2023, 127, 1859–1871. [CrossRef]

50. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28, pp. 91–99.

51. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155. [CrossRef]

52. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [CrossRef]

53. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [CrossRef]

54. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.

55. Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125. [CrossRef]

56. Padilla, R.; Passos, W.L.; Dias, T.L.B.; Netto, S.L.; da Silva, E.A.B. A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit. Electronics 2021, 10, 279. [CrossRef]

57. Arthur, D.; Vassilvitskii, S. k-means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035.

58. Kreuzberger, D.; Kühl, N.; Hirschl, S. Machine Learning Operations (MLOps): Overview, Definition, and Architecture. IEEE Access 2023, 11, 31866–31879. [CrossRef]

59. Moreno-Torres, J.G.; Raeder, T.; Alaiz-Rodríguez, R.; Chawla, N.V.; Herrera, F. A unifying view on dataset shift in classification. Pattern Recognit. 2012, 45, 521–530. [CrossRef]

Disclaimer/Publisher’s Note:

The statements, opinions and data contained in all publications are solely those of the individual

author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to

people or property resulting from any ideas, methods, instructions or products referred to in the content.