S-MSRRS5000: A Simulated Dataset Highlighting the Challenges of Data Obtained From Multiple Spatially Resolved Reflection Spectroscopy [original]

S-MSRRS5000: A Simulated Dataset Highlighting

the Challenges of Data Obtained From Multiple

Spatially Resolved Reflection Spectroscopy

Birk Martin Magnussen ∗, Maik Jessulat†, Claudius Stern‡and Bernhard Sick†

∗biozoom services GmbH

34121 Kassel, Germany

†Intelligent Embedded Systems

Universit¨

at Kassel

34121 Kassel, Germany

‡FOM Hochschule f¨

ur Oekonomie & Management

Hochschulzentrum Kassel

34117 Kassel, Germany

e-mail: [email protected], {mjessulat, bsick}@uni-kassel.de, [email protected]

Abstract—Optical sensors based on spectroscopy are occasion-

ally used in consumer healthcare and wellness applications. This

includes applications such as measuring the concentration of

cutaneous carotenoids using sensors based on multiple spatially

resolved reflection spectroscopy (MSRRS). Processing the data

yielded from MSRRS-based sensors poses unique challenges.

When using machine learning for data processing, specialized

models such as continuous feature networks are required to

achieve good results. However, due to privacy issues of medical

data, data availability is low, hindering model development. In

this article, a simulated dataset is introduced, highlighting the

challenges of data from MSRRS-based sensors. Furthermore, the

underlying principles used for the simulation will be discussed.

Finally, several model architectures including continuous feature

networks are trained on the dataset, demonstrating the various

challenges.

Index Terms—dataset, simulated data, reflection spectroscopy

I. INTRODUCTION

Multiple spatially resolved reflection spectroscopy

(MSRRS) is a technology for analyzing biological tissue. One

common use-case for sensors based on MSRRS is measuring

the concentration of cutaneous carotenoids in humans [1].

MSRRS-based sensors consist of multiple light emitters

and light detectors. These emitters and detectors are then

arranged in a matrix. When a measuring sample is placed

on this detector matrix, the light emitted from an emitter

will pass through the sample before being detected by the

light detector. The measured brightness at the light detector

then depends on the wavelength of the light, the distance

between emitter and detector, as well as the optical properties

of the sample. By analyzing the brightness observed for all

Supported by the state Hessen through funding as part of the Distr@l

research grant 22 0041 2B.

emitter-detector pairs, it is possible to make predictions about

the measurement sample [2], such as the concentration of

cutaneous carotenoids.

This analysis of the observed brightness values can be per-

formed using machine learning. For this purpose, specialized

neural network architectures have been developed, such as

continuous feature networks [3], [4]. These continuous feature

networks are designed to tackle the difficulties of data yielded

by MSRRS-based sensors.

At the core of the difficulty of processing MSRRS-based

data is the irregularity of the data. The light emitters used

in MSRRS-based sensors have discrete wavelengths. As a

result, the observed spectrum is not sampled continuously, but

rather at discrete sampling points. Furthermore, these points

are not distributed equally but are at irregular intervals. A

Emitter wavelength (nm)

Emitter-Detector distance (mm)

450 520 590 660 730 800 870

Fig. 1. An example of the structure of data from MSRRS-based sensors [4].

similar situation is observed with the distance of emitter-

detector pairs. An example of this type of irregular data can

be seen in Figure 1.

In addition to the irregularity, any MSRRS-based sensor can

be afflicted by manufacturing inaccuracies. This can result in

the wavelengths of the light emitters shifting slightly between

individual sensors. Any machine learning model operating

on data from MSRRS-based sensors needs to be able to

compensate for these fluctuations.

When trying to predict properties or substances in human

tissue, the required training data is composed of medical

data. Due to privacy concerns, such data cannot be made

publicly available. Therefore, a simulated dataset called S-

MSRRS5000 is presented.

II. THE DATASET AND ITS CHALLENGES

For the creation of S-MSRRS5000, a virtual MSRRS-

based sensor with 8 detectors and 32 emitters is assumed.

S-MSRRS5000 is made up of four sets of 5000 simulated

measurements each. Each measurement consists of a series

of 256 data points. One data point always corresponds to

one emitter-detector pair of the virtual sensor. For each data

point, the observed brightness at the detector is listed. In

addition, the wavelength of the emitter and the emitter-detector

distance are listed. Together, the 256 data points make up one

measurement. For each such measurement, three ground truth

labels are available.

1) The concentration of cutaneous beta-carotene in the

simulated sample in milligrams per liter.

2) The concentration of hemoglobin in the blood of the

sample in grams per liter.

3) The oxygenation level of the blood in the sample as a

factor from 0 to 1.

The ground truth labels for each set are stored in a file

called dataset_meta.json. This JSON file contains a

single top-level array. Within are multiple JSON objects, each

representing one simulated measurement, with a key for each

label and a key called name containing the filename of the

measurement data itself. The file containing the measurement

data is a CSV file with a list of data points. Each data point

is listed with the wavelength of the emitted light, the emitter-

detector distance, and the measured intensity.

The four sets available in the dataset represent different

challenges of MSRRS-based data in ascending difficulty. The

challenges of the different sets are as follows:

1) 1_clean: Clean Data Only The first set represents

the ideal case. In this set, the virtual MSRRS-based

sensor is perfectly accurate. Similarly, the measurement

sample is simulated to have no noise or other factors

negatively impacting the data. The only factors with a

variable influence on the simulated brightness values

are substances with labels available as ground truth.

This set is thus useful to understand the structure of

MSRRS-based data and to establish a baseline for model

performance.

2) 2_noise: Noisy Samples The second set increases

the prediction difficulty by introducing measurement

noise. In addition to the substances with labels available

as ground truth, further randomized noise contributes to

the light absorption within the simulated sample. Fur-

thermore, the carotenoid concentration is simulated with

localized fluctuations and is no longer homogeneous

throughout the entirety of the sample.

3) 3_wavelength_shift: Inconsistent Emitter

Wavelengths The third set introduces production

inaccuracies of the sensor in addition to the noise of

the second set. In this set, the emitter wavelengths of

each measurement are no longer identical. Instead, they

are simulated to fluctuate in a small region around the

nominal target wavelength that is used in the previous

sets.

4) 4_missing_data: Missing Data From Detectors

The fourth set expands on the difficulty of the third

set. In addition to the sample noise and inconsistent

emitter wavelength, the fourth set allows for inoperable

detectors. As a result, it is no longer guaranteed that

all data points are available in a measurement. Instead,

some data points may be replaced by NaN-values. This

is to simulate detectors that might have to be disabled

due to them exceeding tolerances or being otherwise

defective.

III. SIMULATION BACKGROUND

The simulation of the brightness detected by the light

detectors is a highly simplified simulation. It does not aim to

accurately recreate the measurement values as when measuring

real human tissue but rather aims at recreating some of the core

challenges of real MSRRS-based data.

At its core, the simulation traces the light paths from every

light emitter to every light detector as semicircles. These

semicircles are split into equidistant segments. Figure 2 shows

an example of these light paths for two emitters and one

detector.

Starting from the light emitter, the light attenuation over

the light path segments is calculated. In this simulation,

attenuation is comprised of two components. First, since the

light emitter is not emitting a focused beam of light, the light

attenuates based on the distance to the emitter with the inverse-

square law. Given the distance to the light emitter dand the

length of the light path segment ls, the ratio between the

light intensity I0before the light path segment and the light

Fig. 2. Traced light paths split into equidistant segments of two emitters to

one light detector.

intensity Iafter the light path segment can be described as

follows:

=d2

(d+ls)2(1)

The second component is the absorption of light by the

sample, calculated using the Beer-Lambert law. The Beer-

Lambert law describes the light attenuation due to the ab-

sorbance of light of the chemicals. Given the absorptivity a

of a chemical, optical beam length b, and the concentration

of the absorbing chemical c, the Beer-Lambert law allows to

calculate the logarithmic ratio between the light intensity I0

before the light path segment and the light intensity Iafter

the light path segment can be described as follows [5]:

−log (︃I

I0)︃=a·b·c(2)

The absorptivity ais a wavelength-dependent material con-

stant. The optical path length bcan be calculated as the product

of the light path segment length and the refractive index of

human skin (approximated as 1.37 [6]).

This dataset simulates the to-be-measured tissue as three

primary components. First, the tissue is assumed to consist of

a base of water. Additionally, the tissue contains hemoglobin

as found in blood, both oxygenated and not oxygenated.

Finally, a given concentration of carotenoids contributes to

the absorption of light.

For the base of water, the absorptivity used by the simulation

is based on the measurements of Kou, Labrie, and Chylek [7].

For the hemoglobin contained in the blood within the

simulated tissue, measurements compiled by Prahl [8] were

used as the absorptivity values in the simulation. The amount

of hemoglobin present for absorption is calculated as a com-

bination of the amount of hemoglobin per volume of blood,

and the ratio of volume of blood to volume of total tissue.

The volume of blood present in human limbs averages 5.6%

of the tissue volume [9]. For the concentration of hemoglobin

in human blood, values are sampled from a distribution based

on measurements from Kim et al. [10] and stored as labels in

the ground truth data. Similarly, for the ratio of oxygenated to

non-oxygenated blood, a value is sampled from a distribution

based on measurements from Epstein and Haghenbeck [11]

and similarly stored as a label in the ground truth data.

For the carotenoids to be predicted by a given machine

learning algorithm, a concentration is sampled from a distri-

bution of the beta-carotene concentrations in humans based

on measurements from Matsumoto et al. [12] and stored as a

label in the ground truth data. Data compiled by Prahl [13]

on the absorptivity spectrum of beta-carotene was used as the

basis for the simulation.

For the noise in set two and beyond, OpenSimplex-based

noise [14] is used. A three-dimensional noise map is used to

vary the localized concentration of carotenoids by up to ±15%.

A second, four-dimensional is used as well. The first three

dimensions represent the position in the simulated space while

the fourth dimension represents the current wavelength of

light. This noise map is used to add localized and wavelength-

dependent background absorptivity noise, representing other

substances present in the tissue.

The light emitter wavelength shift in the third and fourth

sets is sampled from a Gaussian distribution around the target

wavelength for the light emitter with a standard deviation of

2.5 nm.

In the fourth set, the chance that any single detector is

disabled for the simulation is fixed at 10%. This ratio is higher

than expected for real MSRRS-based sensors but serves as a

suitable worst-case scenario challenge for a prediction model.

IV. EXAMPLE USAGE AND RESULTS

For testing, three different machine learning models were

trained on each of the four sets within the S-MSRRS5000

dataset. The three models include a simple linear regres-

sion model, a multi-layer feed-forward neural network, and

a continuous feature network as proposed in Magnussen,

Stern, and Sick [3], [4]. For this experiment, each set of the

dataset was split into 3500 training measurements, and 1500

validation measurements. For both methods based on neural

networks, the 1500 validation measurements were further split

into 500 testing measurements and only 1000 final validation

measurements. The testing measurements were used to select

the best-performing model during training, while the validation

measurements are the basis for the final score of the model.

The training was repeated ten times for each model and each

set, with randomized training, testing, and validation data splits

for each repetition.

The linear regression model and the multi-layer feed-

forward network are not able to deal with missing input

data for the fourth set in S-MSRRS5000. Instead, the miss-

ing values were imputed by linearly interpolating values of

comparable wavelength and emitter-detector distances. The

continuous feature network is capable of handling missing

input data without imputation and was thus given the input

data of the fourth set unmodified.

The results shown in Table I include the root mean square

error (RMSE) and the coefficient of determination (r2) be-

tween the predicted results and the ground truth labels of

the dataset. The data is averaged over the ten runs, with the

respective standard deviation given for each quality metric.

It is to be noted that only training instances that yielded

a positive coefficient of determination contributed to the re-

spective quality metrics. Instead, the % fail metric indicates

the percentage of runs that yielded a negative coefficient of

determination, indicating models with an unusable training

result.

From the data in Table I, it can be observed that calculating

the concentration of carotenoids in a simplified simulation

such as used for S-MSRRS5000 is a simple task if no noise

is present. All models trained are able to achieve a very high

coefficient of determination. However, especially the introduc-

tion of fluctuations in the wavelength of the light emitters

has a significant impact on the linear regression model and

TABLE I

THE TRAINING ACCURACY OF DIFFERENT MODELS ON THE VARIOUS CHALLENGES OF THE S-MSRRS5000 DATASET.

linear regression multi-layer feed-forward network continuous feature network

RMSE r2% fail RMSE r2% fail RMSE r2% fail

clean data 0.0±0.01.0±0.0 0% 0.02±0.01.0±0.0 0% 0.03±0.0 0.99±0.0 0%

noisy data 0.03±0.00.99±0.0 0% 0.04±0.0 0.98±0.0 0% 0.05±0.01 0.97±0.01 0%

wavelength shift 0.26±0.0 0.29±0.01 40% - - 100% 0.11±0.01 0.85±0.03 0%

missing data 0.27±0.01 0.15±0.01 60% - - 100% 0.16±0.02 0.60±0.08 0%

the multi-layer feed-forward network. The multi-layer feed-

forward network was not able to achieve a positive coefficient

of determination for a single training repetition for the third

and fourth sets. The continuous feature network however is

capable of taking the exact wavelength and emitter-detector

distance for each input brightness into account. As a result, the

continuous feature network is able to keep a high coefficient of

determination and a low root mean square error for the third

set as well. A similar effect can be observed for the fourth

set, where the missing data significantly reduces the accuracy

of the linear regression model, whereas the continuous feature

network is able to deal with the missing data and still keeps

a comparatively high coefficient of determination.

These results, especially including the ability of the con-

tinuous feature network to achieve a high performance due to

being able to compensate for inconsistent wavelengths and for

missing data is consistent with recent findings on real data [3],

[4].

V. CONCLUSION

The S-MSRRS5000 dataset is a simulated dataset, op-

timized to show the structure and challenges of MSRRS-

based data. This article discusses the contents of the S-

MSRRS5000 dataset and the four challenge sets making up

the dataset. Furthermore, the relation of each challenge set to

the challenges of real data is highlighted.

This article also goes into detail on how the dataset is

simulated. This includes the underlying physical principles,

as well as the sources for the spectral data used.

Finally, the dataset is used to train different example models

on each of the challenge sets. The results from these experi-

ments are consistent with results on real data.

REFERENCES

[1] M. E. Darvin, B. Magnussen, J. Lademann, and W. K¨

ocher, “Multiple

spatially resolved reflection spectroscopy for in vivo determination of

carotenoids in human skin and blood,” Laser Physics Letters, vol. 13,

no. 9, p. 095601, Aug. 2016.

[2] B. Magnussen, C. Stern, and W. K¨

ocher, “Vorrichtung und verfahren

zur bestimmung einer konzentration in einer probe,” European Patent

EP 3 013 217 B1, 2016.

[3] B. M. Magnussen, C. Stern, and B. Sick, “Utilizing continuous kernels

for processing irregularly and inconsistently sampled data with position-

dependent features,” in Proceedings of The Nineteenth International

Conference on Autonomic and Autonomous Systems, C. Behn, Ed.,

IARIA. ThinkMind, Mar. 2023, pp. 49–53.

[4] B. M. Magnussen, C. Stern, and B. Sick, “Continuous feature networks:

A novel method to process irregularly and inconsistently sampled data

with position-dependent features,” International Journal On Advances

in Intelligent Systems, vol. 16, no. 3&4, pp. 43–50, 2023.

[5] D. F. Swinehart, “The beer-lambert law,” Journal of Chemical Education,

vol. 39, no. 7, p. 333, Jul. 1962.

[6] H. Ding, J. Q. Lu, W. A. Wooden, P. J. Kragel, and X.-H. Hu,

“Refractive indices of human skin tissues at eight wavelengths and

estimated dispersion relations between 300 and 1600 nm,” Physics in

Medicine & Biology, vol. 51, no. 6, p. 1479, Mar. 2006.

[7] L. Kou, D. Labrie, and P. Chylek, “Refractive indices of water and ice

in the 0.65- to 2.5-µm spectral range,” Applied Optics, vol. 32, no. 19,

pp. 3531–3540, Jul. 1993.

[8] S. Prahl. (1999) Optical absorption of hemoglobin. [Online]. Available:

https://omlc.org/spectra/hemoglobin/index.html

[9] L. M. Karpeles and R. L. Huff, “Blood volume of representative portions

of the musculoskeletal system in man,” Circulation Research, vol. 3,

no. 5, pp. 483–489, 1955.

[10] S. K. Kim, H. S. Kang, C. S. Kim, and Y. T. Kim, “The prevalence

of anemia and iron depletion in the population aged 10 years or older,”

The Korean Journal of Hematology, vol. 46, no. 3, pp. 196–199, 2011.

[11] C. D. Epstein and K. T. Haghenbeck, “Bedside assessment of tissue

oxygen saturation monitoring in critically ill adults: An integrative

review of the literature,” Critical Care Research and Practice, vol. 2014,

May 2014.

[12] M. Matsumoto, H. Suganuma, S. Shimizu, H. Hayashi, K. Sawada,

I. Tokuda, K. Ihara, and S. Nakaji, “Skin carotenoid level as an

alternative marker of serum total carotenoid concentration and vegetable

intake correlates with biomarkers of circulatory diseases and metabolic

syndrome,” Nutrients, vol. 12, no. 6, 2020.

[13] S. Prahl. (2017) Beta-carotene. [Online]. Available: https://omlc.org/

spectra/PhotochemCAD/html/041.html

[14] KdotJPG. (2019) Opensimplex 2. [Online]. Available: https://github.

com/KdotJPG/OpenSimplex2