Studying the Potential of Multi-Target Classification to Characterize Combinations of Classes with Skewed Distribution

Arne Schneck∗, Sven Kalle∗†, Rüdiger Pryss‡, Winfried Schlee†, Thomas Probst‡§, Berthold Langguth†, Michael Landgrebe†, Manfred Reichert‡, Myra Spiliopoulou∗

∗Otto-von-Guericke Univ. Magdeburg, †University Hospital Regensburg, ‡Univ. Ulm, §Georg-August Univ. Göttingen

Abstract—The identification of subpopulations with particu-

lar characteristics with respect to a disease is important for

personalized diagnostics and therapy design. For some diseases,

the outcome is described by more than one target variable. An

example is tinnitus: the perceived loudness of the phantom signal

and the level of distress caused by it are both relevant targets

for diagnosis and therapy. In this work, we study the potential of

multi-target classification for the identification of those screening

variables, which separate best among the different subpopula-

tions of patients, paying particular attention to subpopulations

with discordant value combinations of loudness and distress.

We analyse the screening data of 1344 tinnitus patients from

the University Hospital Regensburg, including questions from 7

questionnaires, and report on the performance of our workflow

in target separation and in ranking the questionnaires’ variables

on their discriminative power.

Index Terms—multi-target classification on skewed data; tin-

nitus handicap; tinnitus loudness; medical mining

I. INTRODUCTION

During patient screening, physicians use agreed-upon ques-

tionnaires and medical tests to capture symptoms and assess-

ments that associate with the outcome. The results of this

screening process are then used for diagnostics and person-

alized therapy design. There are diseases that pose particular

challenges to this process, especially those with comorbidities

or with unclear physiopathological mechanisms: extensive

assessments are needed for diagnostics, and a complex outcome,

consisting of more than one target variable, must be

considered for therapy design. In this study, we propose a

mining workflow for multi-target classification and for the

characterization of assessments with respect to their predictive

power towards an outcome consisting of multiple targets. We

report on our results on screening data of tinnitus patients,

studying the combination of tinnitus loudness and handicap

as multi-targeted outcome.

Tinnitus is a medical condition characterized by the phan-

tom perception of sound in one or both ears. In [1], Baguley

et al. report a prevalence of 10-15%. The recent review of

Elgoyhen et al. highlights the large patient heterogeneity as

one of the major reasons for inconsistent results in studies

on tinnitus [2]. An example of heterogeneity concerns the

interplay between loudness of the tinnitus signal and handicap

caused by tinnitus: in their study [3], Hiller and Goebel

show that “loudness and annoyance are discrepant” in some

cases, since there are patients whose everyday life is impaired

although loudness is low, while other patients do not feel

disturbed although their tinnitus signal is loud. Understanding

what characterizes patients with such a discordant combination

of loudness and handicap is important for personalized therapy

design, but also to gain insights into the pathophysiological

mechanisms of tinnitus.

Tinnitus loudness refers to the subjective loudness of

the tinnitus perception, as rated by the patient. The Tinnitus

Sample Case History Questionnaire (TSCHQ) [4] contains an

explicit question on the scale of tinnitus loudness, as well as

further questions on the nature of the perceived signal. The

assessment of tinnitus handicap is the subject of the Tinnitus

Impairment Questionnaire (TBF12) [5] (12 questions), but

questionnaires associated with mental well-being are also

of relevance. They include the Major Depression Inventory

(MDI) [6] and the World Health Organization Quality of

Life (WHOQOL) questionnaire [7]. In this study, we use the

tinnitus loudness question (Q11) of TSCHQ [4] as one target

variable, ”TLoudness”, and the aggregate value of the 25

questions in the Tinnitus Handicap Inventory (THI) [8] as

second target variable ”THandicap”.

Our approach is a mining workflow with several steps. To

learn models for the two targets of tinnitus loudness and

handicap we use multi-target classification, preceded by a

target discretization task and an oversampling task to account

for infrequent combinations of loudness and handicap values.

To assess the importance of specific questions/assessments for

target separation, our approach encompasses a task of variable

ranking, whereby we generate several models and count the

occurrences of a variable in a model.

The paper is organized as follows. In section II we give a

short overview of related research. In section III we describe

the data used in our analysis. In section IV we present our

approach and report on the experimental results in section V.

We close the paper with a summary and some open questions

in section VI.

II. STATE OF THE ART

The screening protocols for the diagnosis of a disease

encompass questionnaires and clinical examinations. There are

several questionnaires for tinnitus diagnostics, including the

Tinnitus Sample Case History Questionnaire (TSCHQ) [4],

the Tinnitus Handicap Inventory (THI) [8] and the Tinnitus

Questionnaire (TQ) [9]. Questionnaires differ in purpose: for

example, THI focusses on assessing the handicap caused by

tinnitus, while TSCHQ records anamnesis, medication and

loudness as well. There is overlap among the questionnaires,

but also inside a questionnaire. For example, more than one

of the TQ questions addresses the effects of tinnitus on sleep.

Agreement or disagreement among answers to strongly correlated

questions can shed light on how a patient experiences

the disease. Hence, we do not skip questions before learning,

but identify questions that contribute to class separation in a

learned model.

Class separation with respect to more than one classifi-

cation variable is studied by ”multi-target” (or multi-output)

classification algorithms. Early algorithms include [10]–[12].

Random forests [13] have been shown to be promising for

multi-target classification, to the effect that more elaborate

algorithms emerged over the years. A recent overview can

be found in [14].

The use of ensembles implies that the contribution of each

variable to class separation becomes less clear. This is further

exacerbated in random forests, since each tree is learned on

a different subset of the original feature space. In [13], [15],

Leo Breiman already investigates this problem and proposes

measures that quantify the importance of a variable for class

separation. In [16], Louppe et al. provide an elaborate quantification

of variable importance. They concentrate on Breiman’s

”Mean Decrease Impurity” (MDI) but their estimations gener-

alize for further impurity measures. The main emphasis of [16]

is in providing a reliable estimation of variable importance,

i.e. as the number of randomized trees goes towards infinity.

Moreover, they provide estimates both for fully grown trees,

i.e. after the end of the learning phase, and for trees grown to

depth $q \le p$, where $p$ is a tree’s full depth. In our work, we

use a much simpler computation of variable importance within

a finite set of random trees, without generalization guarantees.

To perform this simplification, we restrict model induction so

that all randomized trees are learned on the complete feature

space instead of learning each tree in a subspace.

Advances on variable ranking also include methods that

assess the relevance of variables before model learning, aiming

to prune non-predictive variables and to identify correlated

ones. A recent example can be found in [17]. The a priori

exclusion of screening assessments that are known to be

overlapping or correlated is not desirable in our example, since

we want to identify those among the correlated questions that

contribute most to separation. Hence, our workflow encompasses

the task of variable ranking after learning, whereby

we perform ranking on the variables learned by a number of

multi-target classifiers.

III. MATERIALS

We use a sample of 1344 tinnitus patients from the Uni-

versity Hospital Regensburg and consider exclusively the

assessments of the first screening. The screening encompasses

answers to several questionnaires, including the Tinnitus

Handicap Inventory (THI) [8], which is a 25-item

questionnaire and is the most widely used instrument for

measuring the tinnitus-associated handicap in the daily life

of the patient. The Tinnitus Questionnaire (TQ) [9] is another

questionnaire containing 52 items; it is frequently used for

tinnitus research in Germany. The Tinnitus Sample Case His-

tory Questionnaire (TSCHQ) [4] is an assessment instrument

with 35 questionnaire items to record demographic variables

and clinical characteristics. The Tinnitus Impairment Ques-

tionnaire (TBF12) [5] is a short questionnaire to measure

the tinnitus-related distress; it contains 12 items. The small

Tinnitus Severity (TS) questionnaire consists of 6 items, used

to measure different aspects of the tinnitus-related distress on a

numeric rating scale. The Major Depression Inventory (MDI)

[6] is a standard instrument to assess depressive symptoms; it

contains 12 items. The World Health Organization Quality of

Life (WHOQOL) is an internationally validated questionnaire

to measure the quality of life; it encompasses 26 items [7]. In

addition to the questionnaires, an Audiological Examination

was also performed to assess the hearing ability of the patients

with an audiogram.

The outcome is described by two variables. We derive them

by discretizing the TSCHQ variable loudnessdescriptiontext_screen

and the THI variable THI_totalscore_screen.

The former variable is a measure for the subjective loudness

of the tinnitus perception, as rated by the patient. It

ranges between 1 and 100 in steps of 5 units. The variable

THI_totalscore_screen is the aggregate value of the 25

questions in the Tinnitus Handicap Inventory (THI) question-

naire and ranges between 0 and 100. Values closer to 100

indicate higher loudness, resp. handicap. We split each value

range into two bins, the bin ”LOW” containing the values

in [0,50), the bin ”HIGH” containing the higher values in

[50,100]. The resulting discrete variables TLoudness and

THandicap are binary. Of the four combinations in total, two

are discordant, namely low loudness with high handicap (also

denoted as L−H+ hereafter) and high loudness with low

handicap (denoted as L+H−).
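The discretization described above can be illustrated with a minimal sketch. The function names and the example scores are hypothetical and serve only as illustration; the bin boundaries [0,50) and [50,100] are those stated in the text, and the combination labels are written with ASCII signs.

```python
def discretize(score, threshold=50):
    """Map a score to 'LOW' for [0, 50) or 'HIGH' for [50, 100]."""
    return "HIGH" if score >= threshold else "LOW"


def combination(loudness, handicap):
    """Encode the pair (TLoudness, THandicap) as e.g. 'L-H+'."""
    l_part = "L+" if discretize(loudness) == "HIGH" else "L-"
    h_part = "H+" if discretize(handicap) == "HIGH" else "H-"
    return l_part + h_part


# Hypothetical (loudness, handicap) scores for four patients, illustration only
patients = [(30, 20), (80, 10), (45, 76), (50, 50)]
combos = [combination(l, h) for l, h in patients]

# The two discordant combinations
discordant = [c for c in combos if c in ("L-H+", "L+H-")]
```

In this toy input, the second and third patients fall into the discordant combinations, while a score of exactly 50 is placed into the HIGH bin.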

From this dataset we removed all patients for whom one or

both of the target variables had no value. As the next filtering step,

we projected away the following variables: variables with evident

logical errors, variables with undiscretized dates, variables

with missing values for more than 5% of the patients. Patients

with missing values in one of the retained variables were also

removed, as a final filtering step. The remaining dataset consists

of 629 patients described by 97 variables. The distribution of

the targets is shown in Table I.

IV. OUR APPROACH

Our approach for multi-target classification builds upon

the following model of the learning problem.

TABLE I
DISTRIBUTION OF THE TARGET VARIABLES IN THE PATIENT SAMPLE

                   TLoudness: LOW   TLoudness: HIGH
THandicap: LOW           69              239
THandicap: HIGH          20              301

Let $T = \{T_1, \ldots, T_m\}$ be the set of targets and let $L_{T_i} = \{C_{i,1}, \ldots, C_{i,l_i}\}$ be the set of class labels for the target $T_i$. Let $P_m$ be the set of all combinations of labels from the $m$ targets. Further, let $s = \{s_1, \ldots, s_m\} \in P_m$ be a combination of labels from the targets, i.e. $s_i \in L_{T_i}$ for each $i = 1 \ldots m$. We define as learning focus (or simply focus) the set $S \subseteq P_m$ of label combinations that are of particular interest for the application. For the tinnitus application, $S$ consists of the two discordant combinations of TLoudness and THandicap, namely L−H+ and L+H−. In Table I we see that L−H+ is infrequent (20 patients), while L+H− is frequent (239 patients).
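For the two binary targets of this study, the set of label combinations and the focus can be enumerated directly. The following sketch uses hypothetical variable names; the counts are those of Table I, keyed as (TLoudness, THandicap).

```python
from itertools import product

# Class labels per target (both targets are binary in this study)
labels = {"TLoudness": ["LOW", "HIGH"], "THandicap": ["LOW", "HIGH"]}

# P_m: all combinations of labels over the m = 2 targets
P_m = list(product(*labels.values()))

# Focus S: the two discordant combinations (TLoudness, THandicap)
S = {("LOW", "HIGH"), ("HIGH", "LOW")}

# Patient counts from Table I
counts = {
    ("LOW", "LOW"): 69,
    ("HIGH", "LOW"): 239,
    ("LOW", "HIGH"): 20,
    ("HIGH", "HIGH"): 301,
}

n_total = sum(counts.values())         # all patients in the filtered sample
n_focus = sum(counts[s] for s in S)    # patients whose labels are in the focus
```

The two focus combinations together cover 259 of the 629 patients, but very unevenly (20 vs. 239), which motivates the oversampling task of the workflow.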

Our first objective is to build a set of models that separate

well both with respect to $P_m$ and with respect to the focus $S$.

Our second objective is to derive from these models a set of

variables with high contribution to the classification of those

instances, whose labels are in the focus S.

A. Outline of our mining workflow

Our approach towards the two objectives of classification

and identification of predictive variables encompasses the following

tasks:

1) Bin construction for the target combinations and oversampling
2) Multi-target classification
3) Assessment of model quality
4) Assessment of a variable’s importance
5) Construction of “good” models and variable ranking over those models

The first task encompasses partitioning of the training sample

into bins, where each bin covers one combination of values

of the target variables. Since some bins may be substantially

smaller than others (cf. data distribution among the four

combinations in Table I), we perform oversampling to derive

equisized bins.
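The first task can be sketched as follows. The text does not specify the oversampling method, so this sketch assumes plain random duplication of instances until every bin reaches the size of the largest bin; the function name and the toy data are hypothetical.

```python
import numpy as np


def oversample_to_equal_bins(X, combos, rng=None):
    """Duplicate randomly chosen instances so that every target
    combination (bin) has as many instances as the largest bin."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X)
    combos = np.asarray(combos)
    bin_labels, bin_sizes = np.unique(combos, return_counts=True)
    target_size = bin_sizes.max()
    parts = []
    for label, size in zip(bin_labels, bin_sizes):
        idx = np.flatnonzero(combos == label)
        # Draw the missing instances with replacement from the same bin
        extra = rng.choice(idx, size=target_size - size, replace=True)
        parts.append(np.concatenate([idx, extra]))
    idx_all = np.concatenate(parts)
    return X[idx_all], combos[idx_all]


# Toy data, illustration only: four bins of sizes 3, 5, 1 and 6
X = np.arange(15).reshape(-1, 1)
combos = ["L-H-"] * 3 + ["L+H-"] * 5 + ["L-H+"] * 1 + ["L+H+"] * 6
X_bal, combos_bal = oversample_to_equal_bins(X, combos, rng=0)
```

All original instances are retained and only the smaller bins receive duplicates, so a bin with a single instance ends up consisting of copies of that instance.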

In the following, we describe the subsequent tasks of our

workflow.

B. Multi-target classification core

For class separation we use random forests (RF), as proposed in [13]. For a training set $D$ over a feature space $F$, this algorithm induces multiple CART-based decision trees [18], whereby each tree is learned on $|D|$ instances, randomly drawn with replacement from $D$. RF considers a random choice of variables from $F$ when inducing each tree. In our approach, however, we force RF to consider the whole of $F$ during tree induction, so that all variables in $F$ are considered with equal prior probability during variable ranking.

We consider two RF-based algorithms for multi-target classification. The first one is a scikit-learn implementation [19] of a multi-target classification algorithm on the basis of random trees¹, proposed in [14]. This algorithm, denoted as MTRF hereafter, builds a single classification model for the $m$ target variables, namely an ensemble of random trees. The second algorithm is an RF-based variant of the “Label Powerset” algorithm proposed in [20], denoted as LPRF hereafter. This algorithm learns one target variable, the values of which are the combinations of values of the $m$ target variables in $P_m$.
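A minimal sketch of the two variants with scikit-learn's RandomForestClassifier is given below. It is an illustration under assumptions, not the exact experimental code: `max_features=None` is assumed to be one way of making every split consider the whole feature space, the helper names are hypothetical, the targets are encoded as 0 (LOW) and 1 (HIGH), and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_mtrf(X, Y, n_trees=100, seed=0):
    """MTRF-style: one forest for all targets; Y has shape (n_samples, m)."""
    forest = RandomForestClassifier(
        n_estimators=n_trees,
        max_features=None,  # consider all features at every split
        random_state=seed,
    )
    return forest.fit(X, Y)


def fit_lprf(X, Y, n_trees=100, seed=0):
    """LPRF-style: one forest on a single target whose values are
    the combinations of the m target values (label powerset)."""
    combined = np.array(["|".join(str(v) for v in row) for row in Y])
    forest = RandomForestClassifier(
        n_estimators=n_trees, max_features=None, random_state=seed
    )
    return forest.fit(X, combined)


def predict_lprf(forest, X):
    """Decode predicted combinations back into one column per target."""
    return np.array(
        [[int(v) for v in c.split("|")] for c in forest.predict(X)]
    )


# Synthetic illustration data (not the patient sample)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.column_stack([X[:, 0] > 0, X[:, 1] > 0]).astype(int)

mtrf = fit_mtrf(X, Y, n_trees=20)
lprf = fit_lprf(X, Y, n_trees=20)
Y_mtrf = mtrf.predict(X)        # one column per target
Y_lprf = predict_lprf(lprf, X)  # decoded to the same layout
```

Both variants return predictions in the same layout, so the same quality functions can be applied to either of them.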

C. Quantification of model quality

To ensure that our mining workflow produces models that separate well across all targets, we distinguish between global quality and focus quality of a model. The global quality of model $M$ is an $m$-dimensional vector $q_{global}(M)$, the $i$th element of which is the accuracy value achieved by $M$ for the $i$th target. The focus quality is the $m$-dimensional vector $q_{focus}(M)$ encompassing the recall values for the combinations in the focus $S$, i.e. the ratio of the number of hits for the focus combinations to the number of instances in the focus; the $i$th element of this vector represents the recall value achieved by $M$ for the $i$th target. Although we define the quality vectors on the basis of accuracy, resp. recall, any other quality function, e.g. the F-measure, could be used instead.

Since the instances belonging to classes in $S$ may make up only a small portion of the population, we consider two user-defined thresholds, $\tau_{global}$ and $\tau_{focus}$. A model is “good” if each element of its global quality vector exceeds $\tau_{global}$ and each element of its focus quality vector exceeds $\tau_{focus}$.
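The two quality vectors and the “good” test can be sketched as follows. The sketch assumes one possible reading of the focus quality: the instances are restricted to those whose true label combination lies in the focus, and the fraction of correct predictions is computed per target. Function names and example labels are hypothetical.

```python
import numpy as np


def global_quality(Y_true, Y_pred):
    """Accuracy per target (one vector element per column)."""
    return (Y_true == Y_pred).mean(axis=0)


def focus_quality(Y_true, Y_pred, focus):
    """Per-target hit ratio, restricted to instances whose true
    label combination is in the focus (assumed interpretation).
    Requires at least one focus instance in Y_true."""
    in_focus = np.array(
        [tuple(int(v) for v in row) in focus for row in Y_true]
    )
    return (Y_true[in_focus] == Y_pred[in_focus]).mean(axis=0)


def is_good(Y_true, Y_pred, focus, tau_global, tau_focus):
    """A model is good if every element of both vectors exceeds its threshold."""
    return bool(
        np.all(global_quality(Y_true, Y_pred) > tau_global)
        and np.all(focus_quality(Y_true, Y_pred, focus) > tau_focus)
    )


# Toy labels, illustration only: columns are (TLoudness, THandicap), 0 = LOW, 1 = HIGH
focus = {(0, 1), (1, 0)}
Y_true = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [0, 1]])
Y_pred = np.array([[0, 1], [1, 1], [1, 1], [0, 0], [1, 1]])

q_global = global_quality(Y_true, Y_pred)
q_focus = focus_quality(Y_true, Y_pred, focus)
```

In this toy example, three of the five instances are in the focus, and each target is predicted correctly for two of them, so the focus quality is lower than the global quality.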

D. Assessing a variable’s importance

To assess the importance of a variable for the separation among the classes in $P_m$ and in the focus $S$, our mining workflow generates a series of models $G$. Informally, a variable is deemed to be important if it is used by many models in $G$. Since a model consists of trees, a variable that is used to split the root node or another node close to the root has more influence on class separation than a variable that is used in a split close to the leaf nodes. Hence, our scoring function for a variable’s importance takes into account the position of a variable in each tree that used this variable for splitting.

Let $s \in S$ be a combination of target variable values from the focus $S$, let $M \in G$ be a model and $T \in M$ be a tree induced as part of $M$. We denote as $f(s, T)$ the set of those nodes in $T$ which contain a split that involves $s$, i.e. a split that separates the instances belonging to $s$ from those belonging to other target combinations. We identify the variables used in the splits of the nodes in $f(s, T)$ and compute for each such variable $v$ its importance for $s$ over all $T \in M$. To do so, we combine two scoring functions, $avF(v, s, G)$ and $avH(v, s, G)$, defined as follows.

¹ scikit-learn 0.17 RF implementation, http://scikit-learn.org/stable/modules/ensemble.html#random-forests, accessed on Feb. 10, 2017.

For a variable $v$, we define its average frequency with respect to $s$ in the set of models $G$ as:

$$avF(v, s, G) = \frac{\sum_{M \in G} \sum_{T \in M} \sum_{x \in f(s,T)} split(x, v)}{\sum_{M \in G} \sum_{T \in M} 1} \qquad (1)$$

where $split(x, v)$ acquires the value 1 if node $x$ is split on $v$ and zero otherwise. Larger values are better.

For a variable $v$, we define the average height (tree layer) in which it appears as:

$$avH(v, s, G) = \frac{\sum_{M \in G} \sum_{T \in M} \sum_{x \in f(s,T)} split(x, v) \cdot l(x, T)}{\sum_{M \in G} \sum_{T \in M} \sum_{x \in f(s,T)} split(x, v)} \qquad (2)$$

where $l(x, T)$ refers to the position/layer of the tree where $x$ is located, divided by the total height of the tree $T$. The root of the tree is at layer 1, a leaf at a layer equal to the tree height. The closer the node $x$ is to the root of $T$, the more important is the variable on which $x$ is split. Hence, smaller values of $avH()$ are better.

On the basis of those two functions, we define the importance of a variable $v$ in a set of models $G$ towards a focus combination $s$ as:

$$importance(v, s, G) = avF(v, s, G) - w \cdot avH(v, s, G) \qquad (3)$$

where the contribution of frequently used variables is penalized if the location of these variables is close to the leaves of the trees in the models of $G$. The weight $w$ regulates the influence of $avH()$. In our work, we have set $w = 0.5$.
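Equations (1) to (3) can be traced on a small example. The sketch below assumes that the node sets $f(s, T)$ have already been determined for one focus combination $s$; each tree is represented by a hypothetical dictionary holding its height and the (variable, layer) pairs of the nodes in $f(s, T)$. The variable names are illustrative labels only.

```python
from collections import defaultdict


def importance_scores(G, w=0.5):
    """Compute avF - w * avH per variable over all trees of all models in G."""
    n_trees = sum(len(model) for model in G)   # denominator of Eq. (1)
    n_splits = defaultdict(int)                # numerator of Eq. (1)
    rel_layer_sum = defaultdict(float)         # numerator of Eq. (2)
    for model in G:
        for tree in model:
            for var, layer in tree["focus_nodes"]:
                n_splits[var] += 1
                rel_layer_sum[var] += layer / tree["height"]
    scores = {}
    for var, k in n_splits.items():
        av_f = k / n_trees
        av_h = rel_layer_sum[var] / k
        scores[var] = av_f - w * av_h          # Eq. (3)
    return scores


def top_n(scores, n=10):
    """Variables with the n highest importance scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# One model with two trees, illustration only (layer 1 is the root)
G = [[
    {"height": 4, "focus_nodes": [("THI_Q10", 1), ("TQ_Q7", 2)]},
    {"height": 2, "focus_nodes": [("THI_Q10", 1)]},
]]
scores = importance_scores(G)
ranking = top_n(scores, n=10)
```

Here THI_Q10 is used in both trees at the root (avF = 1.0, avH = 0.375), giving an importance of 0.8125, while TQ_Q7 is used once at layer 2 of a tree of height 4 (avF = 0.5, avH = 0.5), giving 0.25.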

This function allows us to either extract all variables with

higher importance scores than a threshold, or to select the

variables with the top-N scores. In section V we choose the

second option and return the top $N = 10$ variables for the

focus combinations L−H+ and L+H−.

E. Variable ranking for a choice of models

To identify the variables that have the highest contribution

to class separation, we stepwise generate a number of models.

However, instead of considering models of arbitrary quality,

we discard models, the quality of which is below threshold,

and continue the model generation until a user-defined number

$n$ of good models is reached. They constitute the set of models

$G$ that is input to the importance function.

To create this $G$ for each of the two classification algorithms

MTRF and LPRF, we perform a sequence of runs. In each run, we

place the instances into two bins, whereby we oversample the

minority classes, according to the first step of our workflow.

We first use one bin for learning and the other one for

evaluation, and then switch the bins. Hence, each run outputs

two models, the quality of which is evaluated against the two

quality thresholds. To increase diversity among the runs, we

shuffle the instances, i.e. we assign the instances to two bins

randomly without replacement. We continue generating pairs

of models until the user-defined number of ”good” models n

is reached.
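The run loop can be sketched as follows. The callbacks `fit` and `evaluate` are hypothetical placeholders for model induction and for the test against the two quality thresholds; since the text does not fix where oversampling is applied relative to the split, the sketch leaves it to the `fit` callback. The cap `max_runs` is an added safeguard, not part of the described procedure.

```python
import numpy as np


def collect_good_models(X, Y, fit, evaluate, n_good=20, max_runs=1000, seed=0):
    """Generate pairs of models on shuffled half-splits until
    n_good models pass the quality test."""
    rng = np.random.default_rng(seed)
    good, n_induced = [], 0
    half = len(X) // 2
    for _ in range(max_runs):
        # Shuffle: assign instances to two bins randomly without replacement
        perm = rng.permutation(len(X))
        bins = (perm[:half], perm[half:])
        # Learn on one bin and evaluate on the other, then switch
        for train, test in (bins, bins[::-1]):
            model = fit(X[train], Y[train])
            n_induced += 1
            if evaluate(model, X[test], Y[test]):
                good.append(model)
                if len(good) == n_good:
                    return good, n_induced
    return good, n_induced


# Toy illustration with stand-in callbacks (not the actual classifiers)
X = np.arange(10).reshape(-1, 1)
Y = np.arange(10) % 2
fit = lambda X_train, Y_train: {"n_train": len(X_train)}
evaluate = lambda model, X_test, Y_test: True  # accepts every model
good, n_induced = collect_good_models(X, Y, fit, evaluate, n_good=3)
```

The second return value corresponds to the total number of models induced until the requested number of good models was reached.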

V. EXPERIMENTS

A. Experimental design

We evaluate our approach on the sample described in

section III, i.e. for two target variables. We set the focus

on the combinations of LOW TLoudness and HIGH

THandicap, denoted as L−H+, and of HIGH TLoudness

and LOW THandicap, denoted as L+H−.

For LPRF, we set the threshold for global quality $\tau_{global}$ to 0.8 and the quality threshold for the (two) focus combinations of targets $\tau_{focus}$ also to 0.8. For MTRF, we set the corresponding values to 0.88, since it turned out that models created by MTRF predict better and we only want to retain the best models. We set the number of models with quality higher than the thresholds to $n = 20$.

B. Results

The performance values over the $n = 20$ models are shown in Table II. The second column shows the number of models induced until 20 models with quality above the thresholds were created.

TABLE II
GLOBAL QUALITY AND FOCUS QUALITY, AVERAGED OVER THE SELECTED 20 MODELS FOR EACH ALGORITHM

        Number of models   Global quality, avg (variance)   Focus quality, avg (variance)
MTRF    20 of 142          0.9060 (0.0005)                  0.9323 (0.0005)
LPRF    20 of 112          0.8190 (0.0003)                  0.8524 (0.0021)

For each of the combinations L−H+ and L+H−, we sorted the variables used by LPRF on importance and similarly for MTRF. To select and compare these sets, we have set the performance threshold for MTRF higher than for LPRF. In Table III, we show the top-10 variables for L−H+, which is the least frequent combination in our data (20 patients, cf. Table I). In Table IV, we similarly show the top-10 variables for L+H−, which is rather frequent in our data (239 patients). Variables considered important by both algorithms are listed in row Both, while variables found important by only one algorithm are listed in separate rows MTRF and LPRF. Qi denotes the $i$th question of the respective questionnaire.

TABLE III
THE TOP-10 IMPORTANT VARIABLES FOR THE COMBINATION L−H+

Found by   Count   Variables
Both       8       THI: {Q10, Q12, Q13, Q16, Q17}, TQ: {Q7, Q10, Q15}
MTRF       2       THI: {Q1, Q23}
LPRF       2       THI: {Q21}, TQ: {Q39}

TABLE IV
THE TOP-10 IMPORTANT VARIABLES FOR THE COMBINATION L+H−

Found by   Count   Variables
Both       7       THI: {Q10, Q12, Q13, Q16, Q17}, TQ: {Q7, Q15}
MTRF       3       THI: {Q1, Q15, Q25}
LPRF       3       THI: {Q7, Q14, Q21}

C. Discussion

Our workflow shows very good accuracies for MTRF and LPRF. LPRF induced fewer models than MTRF in order to build the $n = 20$ good models, but this may be attributed to the higher thresholds we used for MTRF. The higher quality of MTRF is not completely unexpected, as it learns all targets separately, while LPRF learns combinations. This should be evaluated in more detail though, e.g. by using confidence intervals on the achieved accuracies. The similarity of both approaches is underlined by the agreement about the variables characterizing each of the focus combinations.

The top-10 important variables for the two focus combinations

L−H+ (infrequent) and L+H− (frequent) come mostly

from the Tinnitus Handicap Inventory (THI). This is not

surprising, since THandicap is derived from the aggregate

score of THI. It is more of interest to check which questions

are among the top-10: they refer to frustration (Q10), pleasures

and responsibilities (Q12, resp. Q13), stress in social relations

(Q17), rather than to difficulties in hearing people (THI:Q2),

anger (Q3) or confusion (Q4).

Despite the correlation of the target THandicap with

THI, there are four highly discriminative questions from the

Tinnitus Questionnaire (TQ) among the top-10 in L−H+ (cf.

Table III), though not in L+H−. TQ contains 52 questions,

which are formulated as statements. For example, Q15 states

that the tinnitus signal is loud most of the time; the patients

answer with “Agree”, “Disagree” or “Partially”. The questions

overlap: Q7 states that the tinnitus signal is rather faint.

Q10 states that the tinnitus sound is unpleasant, while Q39

is on feeling depressed. The occurrence of these questions

in Table III indicates that the answers of the patients are very

discriminative for L−H+, while the other patients answer these

four questions in a way that does not allow distinguishing

between L+H− and the remaining two classes.

As in THI, the TQ questions present in Table III are on

feeling annoyed or distressed. Questions on feeling angry,

having difficulties hearing others, etc., are not adequately

discriminative to reach the top-10 positions. This indicates that

the patients experience handicap in very different forms, no

form being highly prevalent.

There are some constraints in these results. MTRF and LPRF

are conceptually similar algorithms, so the variability of their

findings is not large. Moreover, there has been no correction

for oversampling: this affects the reliability of the global/focus

quality computations, and thus the choice of the $n$ models for

subpopulation characterization. Further, the algorithms learned

only over 50% of the data, thus the overall model quality

may have been lower than possible. Finally, the ranking of

the variables has not been tested statistically. Nonetheless,

the lists of the top-10 variables are in agreement with expert

insight on which questionnaire questions are informative for

the combination of tinnitus loudness and handicap.

VI. CONCLUSION

We presented a mining workflow for multi-target classi-

fication and identification of discriminative variables during

patient screening, and we have reported our preliminary results

on the classification of screening records of tinnitus patients.

Our results for two target variables indicate that the approach

can build good models and identify discriminative variables

that agree with expert insight. Since the screening involves a

very large number of questions from semantically overlapping

questionnaires, the identification of discriminative questions

can help the physicians focus on specific answers for diagnosis

and therapy design.

Our first steps of future work are on the alleviation of

some of the identified shortcomings, namely correction for

oversampling, induction of random trees on subsets of the

feature space (subspaces) and usage of the variable ranking

estimates of [16], enhancement of the variable ranking mech-

anism for global quality vs focus quality with appropriate

statistical testing, and experiments with more than two target

variables.

REFERENCES

[1] D. Baguley, D. McFerran, and D. Hall, “Tinnitus,” The Lancet, vol. 382,

no. 9904, pp. 1600–1607, 2013.

[2] A. Elgoyhen, B. Langguth, D. De Ridder, and S. Vanneste, “Tinnitus:

perspectives from human neuroimaging,” Nature Rev Neurosci, vol. 16,

pp. 632–642, Sept. 2015.

[3] W. Hiller and G. Goebel, “When tinnitus loudness and annoyance

are discrepant: audiological characteristics and psychological profile,”

Audiology and Neurotology, vol. 12, no. 6, pp. 391–400, 2007.

[4] B. Langguth, R. Goodey, A. Azevedo, A. Bjorne, A. Cacace, A. Crocetti,

L. Del Bo, D. De Ridder, I. Diges, T. Elbert et al., “Consensus

for tinnitus patient assessment and treatment outcome measurement:

Tinnitus research initiative meeting, regensburg, july 2006,” Progress

in brain research, vol. 166, pp. 525–536, 2007.

[5] K. V. Greimel, M. Leibetseder, J. Unterrainer, and K. Albegger, “Can

tinnitus be measured? methods for assessment of tinnitus-specific dis-

ability and presentation of the tinnitus disability questionnaire,” Hno,

vol. 47, no. 3, p. 196, 1999.

[6] P. Bech, N.-A. Rasmussen, L. R. Olsen, V. Noerholm, and

W. Abildgaard, “The sensitivity and specificity of the major depression

inventory, using the present state examination as the index of diagnostic

validity,” Journal of affective disorders, vol. 66, no. 2, pp. 159–164,

2001.

[7] S. M. Skevington, M. Lotfy, and K. A. O’Connell, “The world health

organization’s whoqol-bref quality of life assessment: psychometric

properties and results of the international field trial. a report from the

whoqol group,” Quality of life Research, vol. 13, no. 2, pp. 299–310,

2004.

[8] C. Newman, G. Jacobson, and J. Spitzer, “Development of the tinnitus

handicap inventory,” Archives of Otolaryngology–Head & Neck Surgery,

vol. 122, no. 2, pp. 143–148, 1996.

[9] R. S. Hallam, “TQ – manual of the tinnitus questionnaire – revised and

updated, 2008,” http://www.richardhallam.co.uk, 2009.

[10] D. Demšar, S. Džeroski, T. Larsen, J. Struyf, J. Axelsen, M. B. Pedersen, and P. H. Krogh, “Using multi-objective classification to model communities of soil microarthropods,” Ecological Modelling, vol. 191, no. 1, pp. 131–143, 2006.

[11] J. Struyf and S. Džeroski, “Constraint based induction of multi-objective regression trees,” in International Workshop on Knowledge Discovery in Inductive Databases. Springer, 2005, pp. 222–233.

[12] D. Kocev, C. Vens, J. Struyf, and S. Džeroski, “Ensembles of multi-objective decision trees,” in European Conference on Machine Learning. Springer, 2007, pp. 624–631.

[13] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp.

5–32, 2001.

[14] G. Louppe, “Understanding random forests: From theory to prac-

tice,” Ph.D. dissertation, University of Liege, Belgium, 10 2014,

arXiv:1407.7502.

[15] L. Breiman, “Manual on setting up, using, and understanding random

forests v3. 1,” 2002.

[16] G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts, “Understanding

variable importance in forests of randomized trees,” in Advances in

Neural Information Processing Systems, 2013, pp. 431–439.

[17] Q. Zou, J. Zeng, L. Cao, and R. Ji, “A novel features ranking metric with

application to scalable visual and bioinformatics data classification,”

Neurocomputing, vol. 173, pp. 346–354, 2016.

[18] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification

and regression trees. CRC press, 1984.

[19] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,

O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander-

plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-

esnay, “Scikit-learn: Machine learning in Python,” Journal of Machine

Learning Research, vol. 12, pp. 2825–2830, 2011.

[20] G. Tsoumakas and I. Katakis, “Multi-label classification: An overview,”

International Journal of Data Warehousing and Mining, vol. 3, no. 3,

2006, the label powerset algorithm is called PT3.