Diving Deep into the Data: A Review of Deep Learning Approaches and Potential Applications in Foodomics [original]

foods

Review


Lisa-Carina Class 1,2,†, Gesine Kuhnen 1,3,† , Sascha Rohn 2,3 and Jürgen Kuballa 1,*





Citation: Class, L.-C.; Kuhnen, G.; Rohn, S.; Kuballa, J. Diving Deep into the Data: A Review of Deep Learning Approaches and Potential Applications in Foodomics. Foods 2021, 10, 1803. https://doi.org/10.3390/foods10081803

Academic Editors: Domenico Montesano, Gabriele Rocchetti, Alessandra Bordoni and Francesco Capozzi

Received: 29 June 2021

Accepted: 2 August 2021

Published: 4 August 2021

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1 GALAB Laboratories GmbH, Am Schleusengraben 7, 21029 Hamburg, Germany; [email protected] (L.-C.C.); [email protected] (G.K.)

2 Hamburg School of Food Science, Institute of Food Chemistry, University of Hamburg, Grindelallee 117, 20146 Hamburg, Germany

3 Department of Food Chemistry and Analysis, Institute of Food Technology and Food Chemistry, Technische Universität Berlin, Gustav-Meyer-Allee 25, 13355 Berlin, Germany; r[email protected]

*Correspondence: juer[email protected]

† Both authors contributed equally to this work.

Abstract: Deep learning is a trending field in bioinformatics. So far, it is mostly known for image processing and speech recognition, but it also shows promising possibilities for data processing in food analysis, especially in foodomics. Accordingly, deep learning approaches are increasingly being used. This review presents an introduction to deep learning in the context of metabolomics and proteomics, focusing on the prediction of shelf-life, food authenticity, and food quality. Apart from the direct food-related applications, this review summarizes deep learning for peptide sequencing and its relevance to food analysis. The review further focuses on mass spectrometry (MS)-based approaches. As a result of the constant development and improvement of analytical devices, as well as more complex holistic research questions, especially with food as a diverse and complex matrix, there is a need for more effective methods for data processing. Deep learning might meet this need and offers the prospect of dealing with the vast amount and complexity of data.

Keywords: deep learning; machine learning; metabolomics; food authenticity; food fraud; shelf-life; peptide sequencing; mass spectrometry

1. Introduction

New challenges and the development or improvement of analytical methods in recent years have created a need for new approaches that enable a holistic evaluation of food products [ ]. Foodomics is a term that is not yet well defined; it is used to unite the analytical technologies and disciplines of the omics-cascade with research questions in food and nutrition [ ]. Genomics, proteomics, and metabolomics are part of the so-called omics-cascade, which comprises disciplines, technologies, and methodologies that are commonly used to describe the whole profile of food compounds [ ]. Figure 1 schematically illustrates this cascade. These applications are often also called high-throughput technologies, as they produce a substantial amount of data [ – ]. Every discipline provides different information about the composition of the target. As the amount of analytical data collected is steadily increasing, a new approach is the integration of a mathematical/bioinformatic point of view into foodomics. The first discipline of the omics-cascade is genomics, which focuses on the investigation of the entire genome, meaning all genetic material of an organism, based on the four-base DNA code with its 64 codons, which allows a countless number of sequences. Because it adapts little to exogenous influence factors, the genome is predominantly stable and offers a valuable tool for the differentiation and identification of species.
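The number of 64 codons follows from simple combinatorics: three positions with four possible bases each give 4^3 = 64 triplets. As a minimal illustration (the variable names are chosen here for the example only), the triplets can be enumerated in Python:

```python
from itertools import product

BASES = "ACGT"  # the four DNA bases

# A codon is an ordered triplet of bases, so there are 4 ** 3 combinations.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(len(codons))  # 64
print(codons[:4])   # ['AAA', 'AAC', 'AAG', 'AAT']
```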


Figure 1. Scheme of the omics-cascade.

Transcriptomics is the next discipline in the omics-cascade and describes the analysis of the transcriptome, which includes mRNA, non-coding RNA, as well as small RNA in an organism [ ]. Unlike the genome, the transcriptome is not stable and is therefore almost unsuitable for analytical applications in food investigations because of its difficult dynamics [ ]. The transcriptome leads to proteins. Proteomics describes the study of all proteoforms in a defined biological system [ ]. By now, it seems almost clear that the originally estimated number of genes (~100,000) is much higher than the number identified so far (~20,300) [ ]. Moreover, this means that at the protein level, the variation is much higher, owing to the encoded proteins as well as all kinds of subsequent modifications [ ]. As with other biological matrices, the proteome of a food product can be investigated with two strategies. On the one hand, the so-called top-down approach focuses on the characterization of intact proteins [ – ]. In contrast, the bottom-up approach usually focuses on the peptides resulting from a proteolytic digestion of the proteins [ ]. Both targets, proteins and peptides, are subsequently analyzed with mass spectrometry [ ]. Metabolomics is the next discipline in the omics-cascade and focuses on the identification and quantification of the whole metabolome [ ]. This includes substrates and products of metabolic pathways and is directly associated with the so-called phenotype [ ]. The metabolome is therefore heavily influenced by all kinds of exogenous factors. However, the metabolome enables an even more pronounced fingerprint of a system, as many compounds can be taken into account for an evaluation and the status of a product at a certain timepoint can be estimated [4–7,14,16].

In the following, this review focuses primarily on proteomics and metabolomics, because those approaches are mainly mass spectrometry (MS)-based and are the most prominent disciplines for differentiating food products with regard to identification (‘authentication’), but also for characterizing the influence factors that alter the phenotype (‘status’). The MS-based methodologies offer a basis for many chemometric applications, especially with regard to food analysis.

MS is the primary analytical technique for proteomics and metabolomics. Like chromatography, it is a separation technique, but it separates ions according to their mass-to-charge ratio (m/z). In nearly all cases, however, MS-based investigations of food are coupled with liquid chromatography (LC) or gas chromatography (GC), where the analytes are additionally separated, with a certain clean-up, before entering the mass spectrometer [ ]. There are two types of instruments for those applications: on the one hand, low-resolution mass spectrometry (LRMS), often with a triple quadrupole as the analyzer, and on the other hand, high-resolution mass spectrometry (HRMS) with time-of-flight or similar analyzers [12,18,19].

To describe the composition of food, HRMS technologies are preferred, as these instruments provide the most efficient results when facing the different challenges related to the structure and quantity of molecules in natural products as well as processed food products [17,18].


HRMS also makes it possible to separate isobaric molecules [ ]. Another impressive aspect of this technology is that HRMS enables screening and quantification without a reference standard. In this context, a reference standard is the exact substance that is the target of an investigation. LRMS needs such references to identify the analytes. With HRMS, suspect screening and non-targeted applications can be performed without references because of the precise mass-to-charge ratio and the generation of further values such as the collision cross section (CCS) [ ]. For these reasons, HRMS is often considered the superior choice for non-targeted analyses of complex matrices [12,18,19,21,22].
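To make the separation of isobaric molecules more tangible, the following minimal Python sketch compares three small species that share the nominal mass 28 but differ in their exact (monoisotopic) mass. The choice of species is purely illustrative and not taken from the reviewed studies, the masses refer to neutral molecules (an ion differs slightly by the electron mass and any adduct), and the resolving power is estimated with the simple definition R = m/Δm.

```python
# Monoisotopic masses in unified atomic mass units (standard values, rounded).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def exact_mass(formula):
    """Sum the monoisotopic masses of all atoms in an elemental formula."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def required_resolving_power(mass_a, mass_b):
    """Estimate R = m / delta_m needed to separate two neighbouring masses."""
    return ((mass_a + mass_b) / 2) / abs(mass_a - mass_b)


# Illustrative species with the same nominal mass (28).
species = {
    "CO": {"C": 1, "O": 1},
    "N2": {"N": 2},
    "C2H4": {"C": 2, "H": 4},
}
masses = {name: exact_mass(formula) for name, formula in species.items()}

for name, mass in masses.items():
    print(f"{name:5s} nominal mass 28, exact mass {mass:.4f}")

r_needed = required_resolving_power(masses["CO"], masses["N2"])
print(f"Resolving power needed to separate CO and N2: about {r_needed:.0f}")
```

An instrument that only resolves unit masses reports all three species at m/z 28, whereas a sufficiently high resolving power yields three distinct signals.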

However, these different omics disciplines produce large amounts of data. Consequently, chemometric tools need to be developed and applied. The goal is to deal effectively with the large amount of data in order to accomplish different tasks such as predicting shelf-life, identifying food fraud, proving food authenticity, and evaluating food quality in general [ ]. Novel deep learning approaches might provide an adequate tool for answering holistic food-related questions in the future and might offer even more possibilities than traditional applications.

2. Chemometrics, Artificial Intelligence, and Machine Learning

Before dealing with this emerging topic, some terms and concepts need to be introduced to enable strategies and protocols for chemometrics. The term chemometrics itself has been defined by the International Union of Pure and Applied Chemistry (IUPAC) as “The science of relating measurements made on a chemical system or process to the state of the system via application of mathematical or statistical methods.” [ ]. This definition combines the disciplines of chemistry, mathematics, and computer science, with the focus on the generated data.

Artificial Intelligence (AI) is an expression many stumble across regularly, but it often remains unclear what it really means or comprises. IUPAC defined AI as “The capability of a machine to perform human-like intelligence functions, such as learning, adapting, reasoning and self-correction.” [ ]. Machine Learning (ML) is considered a subclass of AI, covering the methods of detecting and learning. Patterns are detected in the data, learning optimizes the model, and the result enables decision-making or the prediction of future outcomes [ ]. Deep Learning (DL), in turn, is a specific form of ML, which is presented in more detail in the next section. The relationship between AI, ML, and DL is shown in Figure 2. A further term that is often mentioned in relation to the others is Data Mining (DM). DM is a step of knowledge discovery in databases (KDD). The tools used for DM are often ML tools. DM and KDD generally focus on knowledge discovery, including storage of and access to data as well as visualization, and thus provide a workflow for an evaluation process [29].

Figure 2. Relationship of artificial intelligence, machine learning, and deep learning [30].


ML methods are categorized into supervised, unsupervised, and reinforcement learning. Supervised learning is based on labeling the training sets with the desired output. It is the most commonly used technique. Common tasks in supervised learning are regression and classification. While classification provides a categorical output, regression provides real-valued outputs. Unsupervised learning can be used for the discovery of patterns in datasets (‘fingerprints’) where no labeling of the data is given in advance. Reinforcement learning is based on reward or punishment signals [28].
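The difference between the two supervised tasks can be illustrated with a minimal Python sketch based on k-nearest neighbors, one of the models mentioned in this review. All feature values, the labels "authentic" and "adulterated", and the shelf-life values are invented for this example; the point is only that the same labeled-data principle yields a category in one case and a real number in the other.

```python
import math
from collections import Counter


def nearest(train, x, k):
    """Return the k training items closest to x (Euclidean distance)."""
    return sorted(train, key=lambda item: math.dist(item[0], x))[:k]


def knn_classify(train, x, k=3):
    """Classification: majority vote over the labels of the k neighbours."""
    labels = [label for _, label in nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, x, k=3):
    """Regression: mean of the real-valued targets of the k neighbours."""
    values = [value for _, value in nearest(train, x, k)]
    return sum(values) / len(values)


# Hypothetical labeled training sets: (feature vector, desired output).
train_cls = [
    ((1.0, 1.0), "authentic"), ((1.2, 0.9), "authentic"), ((0.9, 1.1), "authentic"),
    ((3.0, 3.2), "adulterated"), ((3.1, 2.9), "adulterated"), ((2.8, 3.0), "adulterated"),
]
train_reg = [((1.0,), 10.0), ((2.0,), 8.0), ((3.0,), 6.0), ((4.0,), 4.0), ((5.0,), 2.0)]

print(knn_classify(train_cls, (1.1, 1.0)))  # categorical output
print(knn_regress(train_reg, (3.0,)))       # real-valued output (e.g., days of shelf-life)
```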

Some issues should be considered when ML methods are applied. The training of supervised models is based on labeled data, which are processed and fitted to improve performance. Accordingly, the efficiency of the model depends on the dataset used. Limitations are datasets that are too small, non-representative data, and insufficient data quality. Overfitting is a further problem during training. Overfitting means that the model performs well on the training data but has difficulties generalizing or adapting the observed patterns to new data. It can have different causes. One is that the model has been trained on the same data with too many repetitions. Another is a model that is too complex for the given task. Moreover, overfitting can be caused by datasets that do not sufficiently represent the general population. Underfitting, on the other hand, results from models that are too simple for a rather complex task [31,32].
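Overfitting can be demonstrated with a small, fully synthetic Python example. It assumes a one-dimensional feature, a true rule "class 1 if x > 5", and two deliberately mislabeled training points; none of these values stem from real food data. A model that memorizes the training set (1-nearest neighbor) is compared with a much simpler single-threshold model.

```python
def accuracy(predict, data):
    """Fraction of samples for which the prediction equals the label."""
    return sum(predict(x) == y for x, y in data) / len(data)


def fit_one_nn(train):
    """Memorizing model: return the label of the closest training point."""
    def predict(x):
        return min(train, key=lambda item: abs(item[0] - x))[1]
    return predict


def fit_threshold(train):
    """Simple model: one threshold chosen to minimize the training errors."""
    xs = sorted(x for x, _ in train)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(t):
        return sum((x > t) != bool(y) for x, y in train)

    best = min(candidates, key=errors)
    return best, (lambda x: int(x > best))


# Synthetic data; the labels at x = 3 and x = 7 are intentionally wrong (noise).
train = [(1, 0), (2, 0), (3, 1), (4, 0), (4.5, 0), (5.5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]
test = [(2.6, 0), (3.4, 0), (4.8, 0), (5.2, 1), (6.6, 1), (7.4, 1)]

one_nn = fit_one_nn(train)
threshold, simple = fit_threshold(train)

print("1-NN      train/test accuracy:", accuracy(one_nn, train), accuracy(one_nn, test))
print("threshold train/test accuracy:", accuracy(simple, train), accuracy(simple, test))
```

In this constructed case, the memorizing model reproduces every training label, including the noisy ones, and therefore fails on most of the new samples, whereas the simpler model accepts two training errors and generalizes correctly.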

Some commonly used chemometric and ML models in foodomics are principal component analysis (PCA) [ ], partial least squares discriminant analysis (PLS-DA) [ ], support vector machines (SVM) [ ], random forest (RF) [ ], decision tree (DT) [ ], and k-nearest neighbors (kNN) [37,38].

3. Deep Learning

Conventional machine learning systems need manual feature extraction, whereas DL systems learn these features from the training data [ ]. DL is the learning of representations of data through layers of neurons [ ]. These neurons are structured in the form of neural networks (NN). As the name indicates, NN consist of artificial neurons. The first NN were based on computational concepts that are structurally similar to biological neurons and on how they might work together as a network [ ]. Artificial neurons receive one or more inputs, which can activate a neuron to give an output. Figure 3 shows the structure of a simple NN, also called a multilayer perceptron (MLP) [30].

Figure 3. Architecture of a feedforward neural network with four input neurons, three output neurons, and (here, as an example, three) hidden layers [40].

It is a feedforward neural network (FNN) that consists of three main components: the neurons (also called nodes), the connections between these neurons, and the layers. The layers, and thereby the neurons, are divided into input, hidden, and output [ ]. The input neurons receive the raw data. Each neuron in the hidden layers, whose number can vary, takes the weighted sum of the outputs of the preceding layer as its input. This input passes through an activation function, which generates an output according to the value of the input [ ]. DL is not identical to NN, although the two terms are often used in a similar way.
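The forward pass described here can be sketched in a few lines of Python. The layout follows Figure 3 (four inputs, three hidden layers, three outputs); the width of five neurons per hidden layer, the sigmoid activation, the random untrained weights, and the input values are assumptions made only for this illustration.

```python
import math
import random


def sigmoid(z):
    """Activation function mapping any real input to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Create one fully connected layer with random weights and zero biases."""
    return {
        "weights": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
        "biases": [0.0] * n_out,
    }


def forward(network, inputs):
    """Pass the inputs through every layer: weighted sum, bias, activation."""
    activations = inputs
    for layer in network:
        activations = [
            sigmoid(sum(w * a for w, a in zip(weights, activations)) + bias)
            for weights, bias in zip(layer["weights"], layer["biases"])
        ]
    return activations


rng = random.Random(0)
sizes = [4, 5, 5, 5, 3]  # input, three hidden layers (width assumed), output
network = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

outputs = forward(network, [0.2, 0.7, 0.1, 0.9])  # four hypothetical input values
print(outputs)  # three values, one per output neuron
```

Because the weights are random and untrained, the three output values carry no meaning yet; they only become useful once the weights and biases have been fitted to labeled data.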

The differentiation between deep and shallow NN is based on the number of hidden layers, although no defined number marks the boundary [ ]. NN are categorized into feedforward NN (FNN) and recurrent NN (RNN). FNN are acyclic, as the data stream goes straight through each layer from input to output [ ]. A simple type of FNN was described above and in Figure 3; a special kind of FNN that should be pointed out is the convolutional neural network (ConvNet, CNN). It consists of one or more convolutional layers followed by a pooling layer. While the convolutional layer detects features in the input matrix, the pooling layer reduces the dimension. After the convolutional and pooling layers, fully connected layers are usually applied as well [ ]. Recurrent neural networks (RNN), on the other hand, are distinguished by feedback connections. Due to the feedback connections, the output of a neuron is influenced not only by the current input from the neurons in the preceding layer, but also by inputs from previous timepoints. The so-called hidden state of the neurons provides a kind of memory of the previous timepoints. Therefore, RNN are well suited for processing sequential data [ ]. The long short-term memory (LSTM) is a variant of RNN that copes with the loss of long-term information in conventional RNN. The LSTM features so-called memory cells with three gates: input, output, and forget gate. These enable the cells to extract important input and retain it over the long term [43].
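The interplay of a convolutional layer and a pooling layer can be illustrated on a one-dimensional signal, which may be thought of as a short stretch of a spectrum. In the following Python sketch, the intensity values and the peak-detecting kernel are hypothetical and fixed by hand, whereas a real CNN would learn its kernels during training; a ReLU activation is assumed between the two steps.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal and compute the weighted sum at each position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Keep positive responses and set negative ones to zero."""
    return [max(0, v) for v in values]


def max_pool(values, size=2):
    """Reduce the dimension by keeping the maximum of each non-overlapping window."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


signal = [0, 0, 1, 3, 1, 0, 0, 2, 0, 0]  # hypothetical intensities with two peaks
kernel = [-1, 2, -1]                     # hand-made kernel that responds to local maxima

feature_map = conv1d(signal, kernel)
pooled = max_pool(relu(feature_map))

print(feature_map)  # [-1, -1, 4, -1, -1, -2, 4, -2]
print(pooled)       # [0, 4, 0, 4]
```

The convolution responds strongly at the two peak positions, and the pooling step halves the length of the feature map while preserving these responses.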

Supervised learning is the most widely used learning technique in DL [ ]. Supervised, unsupervised, and reinforcement learning have already been described above. Supervised learning uses backpropagation. As already mentioned, labeled datasets are needed. The data are split into training and testing data, with the larger part used for training the model. Initially, the output for a small batch of data is calculated, then the error in comparison with the desired output is determined, and subsequently the weights and biases are adjusted. This is repeated multiple times until the error function reaches a local minimum [39,44].
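The training loop outlined above can be sketched for the simplest possible case, a single sigmoid neuron, where backpropagation reduces to one application of the chain rule. The synthetic data (class 1 if x > 0), the learning rate, the batch size, and the number of epochs are assumptions chosen for this Python illustration; a deep network repeats the same gradient step for every layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(w, b, data):
    """Mean binary cross-entropy between the neuron output and the labels."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def train(train_set, lr=0.1, epochs=300, batch_size=3):
    """Mini-batch gradient descent on one weight and one bias."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for start in range(0, len(train_set), batch_size):
            batch = train_set[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, y in batch:
                error = sigmoid(w * x + b) - y  # gradient of the loss w.r.t. the weighted sum
                grad_w += error * x
                grad_b += error
            w -= lr * grad_w / len(batch)  # adjust the weight
            b -= lr * grad_b / len(batch)  # adjust the bias
    return w, b


# Synthetic labeled data, split into a larger training part and a smaller test part.
xs = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
samples = [(x, 1 if x > 0 else 0) for x in xs]
test_set = samples[3::4]
train_set = [s for i, s in enumerate(samples) if i % 4 != 3]

initial_loss = mean_loss(0.0, 0.0, train_set)
w, b = train(train_set)
final_loss = mean_loss(w, b, train_set)
test_accuracy = sum(int(sigmoid(w * x + b) > 0.5) == y for x, y in test_set) / len(test_set)

print(f"loss before training: {initial_loss:.3f}, after training: {final_loss:.3f}")
print(f"accuracy on held-out test data: {test_accuracy:.2f}")
```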

For the application and programming of DL algorithms, the most popular programming languages are Python [ ], R [ ], and MATLAB [ ]. Frameworks like TensorFlow [ ] and PyTorch [ ] can be used; thanks to the simplified integration of models and the various available tutorials, they make the application more accessible, even with limited background knowledge in informatics. Last but not least, the improvement of graphics processing units has contributed to the success of DL in recent years by accelerating the training of NN [39,41].

4. Food Fraud and Food Authenticity

Food Fraud, Food Crime, Food Adulteration, and Food Terrorism are just a few terms to describe different food safety issues that have been especially associated with authentic food, which refers to a certain (production) technology, origin, or other specificities. The European Commission describes food fraud as “any suspected intentional action by business or individuals for the purpose of deceiving purchasers and gaining undue advantages therefrom, in violation of rules referred to in Article 1 (2) of Regulation (EU) 2017/625 (the agri-food chain legislation)”. According to the Food and Drug Administration, food fraud includes economically motivated adulteration and the concept of food counterfeiting [ ]. This means that as soon as food law is violated intentionally, in most cases to achieve an economic or financial benefit, it is a matter of food fraud, and consumers are at risk of being cheated or even of having their health endangered [ ]. There are different ways to commit food fraud, as shown in Figure 4. One subtopic is the non-approved enhancement of food products. An example is the so-called “melamine scandal 2008” in China, where melamine, a nitrogen-rich compound, was added to infant formula and other food materials to mimic proteins and thereby raise the apparent protein content [ – ]. Another subtopic of food
