Entity-level stream classification: exploiting entity similarity to label the future observations referring to an entity [original]

International Journal of Data Science and Analytics

https://doi.org/10.1007/s41060-019-00177-1

REGULAR PAPER

Entity-level stream classification: exploiting entity similarity to label

the future observations referring to an entity

Vishnu Unnikrishnan1·Christian Beyer1·Pawel Matuszyk1·Uli Niemann1·Rüdiger Pryss2·Winfried Schlee3·

Eirini Ntoutsi4·Myra Spiliopoulou1

Received: 2 January 2019 / Accepted: 7 February 2019

Abstract

Stream classification algorithms traditionally treat arriving instances as independent. However, in many applications, the

arriving examples may depend on the “entity” that generated them, e.g. in product reviews or in the interactions of users

with an application server. In this study, we investigate the potential of this dependency by partitioning the original stream

of instances/“observations” into entity-centric substreams and by incorporating entity-specific information into the learning

model. We propose a k-nearest-neighbour-inspired stream classification approach, in which the label of an arriving obser-

vation is predicted by exploiting knowledge on the observations belonging to this entity and to entities similar to it. For the

computation of entity similarity, we consider knowledge about the observations and knowledge about the entity, potentially

from a domain/feature space different from that in which predictions are made. To distinguish between cases where this

knowledge transfer is beneficial for stream classification and cases where the knowledge on the entities does not contribute to

classifying the observations, we also propose a heuristic approach based on random sampling of substreams using kRandom

Entities (kRE). Our learning scenario is not fully supervised: after acquiring labels for the initial mobservations of each entity,

we assume that no additional labels arrive and attempt to predict the labels of near-future and far-future observations from

that initial seed. We report on our findings from three datasets.

Keywords Stream classification ·kNN ·Entity similarity

We use the terms observations and instances interchangeably, since

we speak of a time series of property values that exist at all times but

are observed at different points in time.

BVishnu Unnikrishnan

[email protected]

Christian Beyer

christian.beyer@ovgu.de

Pawel Matuszyk

[email protected]

Uli Niemann

[email protected]

Rüdiger Pryss

ruediger[email protected]

Winfried Schlee

[email protected]urg.de

Eirini Ntoutsi

[email protected]er.de

Myra Spiliopoulou

[email protected]

1 Introduction

Most data sources analysed for decision making are of a

dynamic nature. Purchases, reviews and ratings are subject

to changes in customers’ attitudes and in products’ popu-

larities. User interaction with mobile apps and social media

reflects the people’s changes in interests, preferences and

mood. Stream classification algorithms treat streaming data

as data/observations that arrive independently. This seems

oversimplifying, since the sales of a specific product or the

recordings of a patient interacting with an mHealth app

certainly depend on the associated entity—the product or

patient. So, it is reasonable to expect that information on the

1Otto-von-Guericke Univeristy Magdeburg, Magdeburg,

Germany

2University Ulm, Ulm, Germany

3University Hospital Regensburg, Regensburg, Germany

4Leibniz University Hannover, Hannover, Germany

123

International Journal of Data Science and Analytics

specific data-generating entity and on other entities (prod-

ucts, patients) that the query entity is similar to can improve

model quality. For instance, predicting future blood glucose

levels for a diabetic patient may benefit from considering

other, similar patients with diabetes. This approach is also

potentially useful given that most real-world applications

have to deal with data sparsity issues, and inferring infor-

mation from the neighbourhood may improve the quality of

predictions. Towards this end, we propose a neighbourhood-

based approach that trains entity-specific stream classifiers

and exploits similarity among entities for learning, rather

than similarity among the individual instances comprising

them, i.e. the entity is assumed to have both static and

dynamic information associated with it, and the static infor-

mation is used to compute the similarity neighbourhood,

while information from similar entities is exploited to make

predictions in the dynamic space.

There are abundant advances in stream classification, as

reflected in recent surveys like [6,13], where major chal-

lenges are identified, including the non-stationarity of the

data-generating process and the need to learn and adapt the

model efficiently. Although ensembles are seen as a very

competitive learning paradigm for stream classification [13],

the idea of learning entity-specific models has received little

attention in stream learning [3], in contrast to the obvious

entity-centric learning in time series analysis (where each

time series, usually generated by an entity, is treated sep-

arately from those generated by others). In our work, we

attempt to bring the two fields closer by exploiting the his-

tory of an entity and that of entities similar to it to predict the

labels of the arriving observations.

Similarityamongentitiesistraditionallyexploitedintime-

series analysis, and many extensions have been proposed

since the introduction of the first principles (cf. [28] for one

of the early papers). The core idea is to compare the series of

observations appertaining to different entities, thereby using

dynamic time warping and its variants to align time series of

different lengths [11]. However, this idea does not transfer

to the context of entity-centric learning on a stream for two

reasons. First, there is no reason to suppose that the obser-

vations on two independent entities will arrive at the same

speed. For example, if during a period of 24 months a patient

(entity) has been inspected by a physician twice (two obser-

vations), while another patient has been inspected 24 times

(e.g. once per month), it seems inappropriate to align these

two time series. Second, time series are finite, so each obser-

vation in them can be labelled, while streams are infinite.

Hence, the assumption of a constantly available oracle that

supplies fresh labels on demand must be questioned. Solu-

tions should therefore be developed that take into account the

fact that different entities are tracked at different rates, and

that labels may not always be available.

In this work, we propose an entity-centric stream classifi-

cation method that predicts the labels of the observations

belonging to an entity without requiring that the streams

of the individual entities have similar speeds, and without

demanding that fresh labels are made available at any time

by an oracle. To achieve this, we transfer similarity informa-

tion among entities from a static domain, and we learn on an

initial, small set of labelled observations, assuming that no

labels arrive thereafter. We investigate how far in the future

of each entity we can predict, given this limited amount of

labelled data.

Our approach encompasses the following components:

an entity-level classifier for numerical data (i.e. a regres-

sor), a method for computing k-nearest neighbours (kNN),

a baseline method for computing predictions using kRan-

dom Entities (kRE) and comparing the learners in order to

reject irrelevant domains (from the transfer), and a method

for building a predictive subspace for a domain. We investi-

gate our approach on three datasets: patients interacting with

an mHealth app, product reviews, and sensor data on air qual-

ity. Since the predicted labels in this study are numeric, kNN

predictions are evaluated on RMSE and compared against

that from the kRE model.

It is to be noted that our proposed method does not

use kNN as a classifier, and more as a tool mitigating

label sparsity for an entity by expanding to its neighbour-

hood. Our approach also differs from kNN-based regression,

because the feature space in which kNN is computed is not

the feature space seen by the classifier making the predic-

tions.

The paper is organised as follows. The next Sect. 2dis-

cusses related work, and the following Sect. 3describes the

proposed workflow. In Sect. 4we present our experiments

and discuss our findings, before closing in Sect. 5with a

summary and outlook.

2 Related work

In our recent work [3], we propose an entity-centric learn-

ingapproachforopinionateddocumentstreams.Wecompare

theperformance of aclassifiertrainedon all opinionatedtexts

to that of classifiers that see only the data associated to the

entity itself, and show that in some cases even basic entity-

centric models achieve competitive performance. However,

this approach ignores all variables except the document label

(prediction using only timestamp, and ‘without reading con-

tent’ of the arriving instances). Here, we build upon this

conceptualmodel butexploitfurther timestampeddata,while

wedonotassumelabelavailabilitybeyondthefirstfewobser-

vations per entity.

123

International Journal of Data Science and Analytics

2.1 kNN-based predictions in time series data

The task of prediction over a sequence of observations for

an entity is addressed in the context of time series analysis,

where kNN is used to identify an entity’s nearest neigh-

bours and exploit them for learning. In [2], Ban et al. use

a kNN-based regression method to predict future values

of stock prices. The fundamental assumption is based on

economic intuition from [5] that within-industry returns cor-

relate highly compared to cross-industry returns. A two-tier,

multivariate regression method for financial time series fore-

casting is used to exploit this fact. The first step computes

the nearest neighbour for each entity (companies belong-

ing to S&P500, in this case) based on degree of correlation

over historical data, and the next step builds a multivariate

regressor to predict future values using each of the neigh-

bours identified.

This two-step approach of first identifying neighbours and

then learning on them is an idea that is very close to what

we propose. However, the chief difference in our approach

is that unlike in [2], we do not assume that all time series

have equal lengths. In addition, the dataset in [2] subsets the

S&P500 to select stocks from four sectors, while we do not

subset the original set of time series.

Several papers ([18,19,26], and [17]) assess the efficacy

of kNN-based systems in forecasting electricity prices. The

worksofboth[1] and [19] adapt a method very similar to

that in [2] for prediction of electricity prices (in the UK and

Spain, respectively). For this prediction task, each day’s elec-

tricity price information is stored as a 24-value time series

(one observation per hour). The neighbours are discovered

with kNN using Euclidean distance. Since all time series

are aligned and of equal length, more sophisticated distance

functionsarenotnecessary.Themethodologyin[1] improves

on the shortcomings of [19], where the time series are pre-

processed to exclude weekends and holidays. This is done

by using multiple regression to forecast prices while accom-

modating additional variables like temperature, cloudcover,

wind chill, etc., and dummy variables to flag holidays and

weekends.

2.2 Time series prediction with deep learning

algorithms

Time series prediction with help of deep learning algorithms

is enjoying increasing attention. This invites the question

of whether simplistic kNN-based approaches are still an

appropriate first choice for predicting the label of the next

observation of an entity. In their “Review of Unsupervised

Feature Learning and Deep Learning for Time Series Model-

ing”[15],Längkvistetal.state that “Timeseriesdataconsists

of sampled data points taken from a continuous, real-valued

process over time.” This rather obvious statement is reflected

in modelling and analysis of time series of observations gen-

erated by a set of entities, e.g. flow of crowds from different

residual units inside a city [29], or flow of vehicles [21].

In our research context, the statement of [15]isnot

guaranteed to hold: it would be rather surprising if the opin-

ions on all products belonging to the same category (e.g.

watches) would adhere to the same data-generating process.

Similarly, patients with the same disease but with differ-

ent static characteristics like birthdate, sex, and time since

disease onset, and with different comorbidities, should not

be assumed to generate observations according to the same

overarching process. Hence, an exploitation of the static

information on the entities seems essential next to the times-

tamped data. Studies on the use of clinical recordings, e.g.

for predicting the outcome of an intervention [25], take also

patient static data into account, including phenotypes and lab

tests.

Given the availability of both static and timestamped data,

it seems most appropriate to use advanced machine learning

algorithms that can exploit both. However, the streams of the

individual entities may vary substantially on the number of

observations perentity.Thiscorresponds totimeseries of dif-

ferent lengths per time frame, since streams are endless time

series. The authors of [4] use recurrent neural networks for

predicting time series with missing values within each time

frame. This approach seems promising for entities that have a

comparable total number of observations. However, we con-

centrate on predictions after acquiring a minimal number of

observations per entity, without making assumptions on the

time frame in which they arrive.

2.3 Label prediction under infinite verification

latency

Conventional stream classification algorithms assume that an

oracle (e.g. a human expert) provides a label for each arriving

observation. The temporary unavailability of the oracle, i.e.

the verification latency leads to semi-supervised and active

learning solutions for stream classification [14]. If there are

no new labels at all, i.e. in the case of infinite verifica-

tion latency, semi-supervised stream classification methods

are employed, see, e.g. [7,27]. These methods have been

designed for a single stream of observations though, not for

entity-centric streams.

In the domain of time series prediction, semi-supervised

algorithms are less widespread. Short-term prediction in the

stock exchange market with help of semi-supervised algo-

rithms is proposed in [12]. In our work, we do not transfer

derived labels, as in semi-supervised learning, but similarity

information from another domain, and consider this infor-

mation for short-term and long-term predictions.

123

International Journal of Data Science and Analytics

2.4 Short- and long-range forecasting

Recent work has also called into attention the need for

short- and long-term forecasts for time series under various

constraints like missing values, etc. [8] proposes forecast-

ing methods for cross-sectional observations of multiple

multivariate time series and makes the distinction between

‘Continuous Approach’es and ‘Distance Approach’es. Con-

tinuous methods use already forecasted values to make

forecasts farther in the future (like ARIMA), and distance

methods do not. The latter is handled by varying the ‘target

cross section’ (the time at which the cross-sectional observa-

tion of all time series are made) while holding the rest of the

data constant. This is close to what we propose in this work,

but is subtly different from our approach, since our work

focuses on using all variables except the class label from the

multivariate time series when making a short- or long-range

forecast.

3 Neighbourhood-assisted predictions

Thisworkpostulatesthatwhile making predictionsforobser-

vations of an entity, observations of entities that fall in its

neighbourhoodcan be leveraged toimprove predictions.This

involves three steps: finding the neighbours of an entity,

choosing how to learn models on data of the entity and its

neighbours, and combining each model’s prediction to cre-

ate a final prediction. As already described in Sect. 1,we

generalise the notion of an entity here to something that has

both static properties (like the age, gender of a patient) and

dynamic properties (like daily blood sugar levels). This has

the consequence that neighbour discovery can work on a set

of attributes that are outside the dynamic attribute space. In

the following parts of this section, Sect. 3.1 provides a more

formal definition of the problem we aim to solve. Section

3.2 describes how the similarity between two entities can

be defined in a feature space other than that where the pre-

dictions are made. Section 3.3 describes the various options

available for dealing with timestamps while combining data

from individual entities. Subsequently, Sects. 3.4 and 3.5

describe how predictions are made from entity data aug-

mentedwithneighbourhoods,andhowtheseneighbourhoods

can be pruned to exclude false neighbours.

3.1 Formalisation of the prediction problem

We study a stream of instances linked to predefined entities,

and we investigate the problem of predicting the labels of

arriving observations, given static information on the enti-

ties, next to the observations themselves. We formalise this

problem as follows.

Assume Ebeasetof entities,the staticpropertiesofwhich

are expressed in attribute space Do, while the observations

on each entity e∈Econstitute an “entity-centric” stream Te

over the domain-specific attribute space Dp. An observation

arrives at a time point tj, so that the total number of obser-

vations at tjis j. For an entity e, the number of observations

seen until tjconstitutes the “entity length” (or “length”) of

eat tj,length(e,tj)and is upper bounded by jand lower

bounded by zero.

The supervised learning goal over this stream would be to

label the observation arriving at tj, given the past observa-

tionsand afixedsetof labels L.Rather,assume thattheoracle

which provides the labels delivers an initial set of labels,

namely for the first mobservations for each entity. After this

seed of labelled observations, no more labels arrive, i.e. the

oracle is no longer available. Then, the classification prob-

lem is as follows: for each entity eand observation oe,jon

e, with j>m, predict the label of oe,j.

In this study, we concentrate on numerical labels only.

Our approach does not propagate labels, as in typical semi-

supervised stream learning, but rather focuses on predicting

the labels for j>mgiven the first mlabels per entity and

the static information on the entities from Do. In our exper-

iments, we investigate how prediction performance changes

as jmoves from m+1 towards values much larger than m.

For our prediction task, we consider following steps: (a)

computing an entity’s neighbourhood in domain Do—with

kNN; (b) using this neighbourhood to inform the model of an

entity e, by finding a suitable way to augment this data; (c)

using the model for prediction—in the near future and in the

far future,requiringaspecification of what isnearandwhat is

far in the future. The following parts of this section detail the

various components that the various subtasks detailed above.

3.2 Borrowing an entity’s neighbours from another

domain

For a given domain Dowith feature space F, we specify

the notion of similarity between entities, compute an entity’s

nearest neighbours and verify whether these neighbours lead

to better performance than randomly selected entities. This

means that for each entity e, the nearest-neighbour compu-

tation is performed on attribute space Do, but supplemented

with aggregated information from the time series attribute

space Dp. For example, if predictions are being made for

blood sugar values (a variable in Dpspace), then the average

of the training-set blood sugar value for the entity in question

can be added as one of the features in Do. Ideally, this would

mitigate the problem that kNN can compute neighbourhoods

in Dowith no regard to the values of prediction interest in

Dp. These steps to include information from Doin the neigh-

bourhood computation can be seen as borrowing the notion

of similarity from domain Doto improve the predictions in

123

International Journal of Data Science and Analytics

domain Dp. The use of aggregated information helps miti-

gate the problems created by computing similarities for time

series of very different lengths, and/or those with large gaps.

We explore the simplest case by including the aggregated

value only for the variable of prediction interest from Dp(i.e.

onlyone of the variablesfrom Dp,theone for which wemake

predictions is included along with Do). The nearest neigh-

bours of an entity are computed using the kNN algorithm.

We only use Euclidean distance in this work, and the distance

between two entities is defined as the total distance between

the values for each of their relevant attributes weighting all

attributes equally. As already discussed, the attributes used

may be a subset of the total attributes available. The optimal

value of kis discovered experimentally. However, Euclidean

distance can be replaced with other distance functions.

3.2.1 Selecting a predictive subspace over Do

An optional step before applying kNN would be to select

from the list of available features in Do, the ones most suited

for the prediction task for the variable of interest in Dp.The

feature selection step is guided either by human intuition,

or is discovered in a way that maximises performance. It is

clear that manual feature selection is only feasible in cases

that Dodoes not have a large number of variables, and the

dependencies between variables in Doand Dpare intuitive

(e.g. the inclusion of stress as a variable that affects tinnitus).

We propose a solution that simultaneously optimises the

selected features and evaluates the quality of the selected fea-

ture space, using an evolutionary algorithm. Each individual

in the evolutionary algorithm is a Boolean vector Vi, with

i∈1...|Do|. The fitness of each individual is the perfor-

mance of the model built on the neighbourhood defined by

the features that had Vi=True,i∈1...|Do|.

3.2.2 Testing the relevance of Doby building a model from

random entities

The main assumption of this work is that for a set of entities

E, each of which consists of static, unchanging attributes

in an attribute space Do, and a sequence of timestamped

observations oi1...oit of arbitrary length in attribute space

Dp, the ‘neighbourhood’ of an entity in Docan inform on

the future observations of the entity (in space Dp). Most

existing work optimises for the neighbourhood size k,but

does not verify whether the specific neighbourhood chosen

isresponsibleforthelowerrors.Wethereforepropose anovel

method for evaluating the contribution of the neighbourhood

towards prediction—for a given kand particular entity ep,

we pick kRandom Entities er1...erk, and build a regression

model on each, as detailed in Sect. 3.4. It is expected that this

will show the degree to which the neighbourhood improves

prediction.

3.3 Time alignment in Dp

The kNN step provides a list of similar entities for each entity

e. However, a model can be learned on the entities in the

neighbourhood of eeither by time-aligning each entity’s first

observation to time 0, or to preserve the global time in the

dataset as is. This can help understand if the global time in the

dataset can inform on the tendencies of an entity, or whether

the model parameters depend only on an entity’s own past

(relative time w.r.t. entity), and not the exact time at which

the observations were made.

To investigate the effect of the global, dataset-level

‘clock’, we investigate two variants of treating the obser-

vations oz∈ep∈E. The first variant is a ‘global clock’,

which uses this timestamp of each observation oias-is, with-

out any modification. The second variant gives each entity a

‘local clock’, which begins with the first observation arriv-

ing for that entity, i.e. we pre-process the observation time

stamps so aligned_timestamp(oz)=timestamp(oz)−

min_timestamp(ep). The details of how a model can be

built on data from either variant is detailed in the following

Sect. 3.4.

3.4 Creating augmented regressors

Once the kNN step is complete, the prediction for an entity

epcan be made using the entities ei,i∈1...k, where

ei∈kNN(ep)and i= por in case of the random base-

line ei∈kRE(ep)and i= p. For each ei(with an arbitrary

number of observations t), we have a sequence of observa-

tions oi1...oit in the feature space Dp, over which a model

can be learned. In this early exploratory work, we use the first

mobservations oifrom entity eito train a linear regression

model. We also investigate two ways to combine the data

from the entities in the neighbourhood.

Model augmentation Thefirst method trains a linear regres-

sion model mion each of the entities ei∈kNN(ep), and

averages the parameters me,slope and me,intercept for each of

the models to create a final model. The RMSE is then com-

puted on predictions made for test observations for e.This

methodis analogoustolearning the tendenciesforeach entity

in the neighbourhood (including the entity itself), and aver-

aging all the tendencies to make a prediction for the entity in

question. This method is referred to as model augmentation.

Data augmentation for regression Since our time series can

differ inlength andhavelargegaps,wealsoproposeamethod

toretain model qualityinthecase that therearetoo fewobser-

vations to learn a reliable model on entity epand/or entities

ei∈kNN(ep). We propose pooling all observations into a

123

International Journal of Data Science and Analytics

set oall such that oall =oi∀o∈ei. Since we train one

common model on the pooled observations from the entire

neighbourhood, the problems with gaps and unequal lengths

can be mitigated to some extent. This method is called data

augmentation. In case of the random baseline, everything

stays the same but we use ei∈kRE(ep)instead.

Model Retraining In this early work, we always build our

linear regression models on the first mobservations belong-

ing to ep. The regression models are not updated after a new

observation belonging to eparrives,butwedoplantoinves-

tigate in this direction in the future.

3.5 Neighbourhood-pruning

The idea that the kNN algorithm might not always find the

optimal neighbourhood, thereby leading to suboptimal pre-

dictions is described in [19]. Though it is possible to examine

data in Dpto exclude two entities eiand ejthat are in close

proximity in the Dospace, in this work we attempt to use

domain knowledge to guide the kNN algorithm by excluding

certain entities from each other’s neighbourhoods. Experts

can exploit information in Doto inform the kNN algorithm

that entities eiand ejshould be excluded from each others’

neighbourhoods even though their data in Doare very simi-

lar. For example, the medical expert can provide the intuition

that people with type 1 and type 2 diabetes are fundamen-

tallydifferent,eventhoughtheymaybothhavesimilarlyhigh

values for blood sugar.

In this work, we use neighbourhood pruning on the

mHealth dataset, where we exploit the discrepancy between

tinnitus loudness and tinnitus annoyance. Tinnitus loudness

is the severity of the symptom, while annoyance or distress

is the psychological effect it has on the patient. It has been

noted by in [9] that there is a subset of patients where the

loudness of tinnitus and the psychological annoyance that the

patientsuffers showarediscordant.Thedegreeofpsycholog-

icalannoyancedue to tinnitus is measuredusing the Mini-TQ

questionnaire, which consists of a 12-question questionnaire

with answers in the scale of 0–2 (higher values are worse). To

measure the loudness level of the tinnitus symptom, Hiller

andGoebel use the GlockhoffandLindblom’sloudness grad-

ing system. They have noted that of the tinnitus patients,

about a third have low annoyance scores in spite of high

loudness levels. In addition, they have found a ‘specific psy-

chologicalprofile’that characteriseshigh-annoyancetinnitus

sufferers, for example, not feeling low/depressed, not feeling

like a victim of their noises, etc., were all predictors of low

annoyance even in patients of high loudness [9].

This work investigates if the ‘psychological profiles’

found by Hiller and Goebels may be found in the mHealth

dataset. To this effect, we investigate whether the neighbour-

hoods in the kNN algorithm may be restricted to subgroups

that may be similar in their perception of annoyance given a

particular loudness.

We do not investigate this case for the AQI and Amazon

datasetsbecausetheconceptofwhatconstitutesafalse neigh-

bour is less clearly defined. However, in the abstract sense,

it is possible to exlude certain products from each others’

neighbourhoods. For example, a cheap knock-off of a luxury

watch should not be allowed to be in the neighbourhood of

the original product, or two air quality sensors should not be

in each others’ neighbourhoods if separated by a relatively

small distance but fall on different sides of a mountain range

trapping ambient pollution on one side.

4 Experiments

Goal of our experiments is to study the performance of

our method for the prediction of the labels of observations.

We concentrate on studying the predictive power achieved

through the exploitation of entity similarity, and hence focus

on prediction rather than adaption. We partition each dataset

into an early part used for learning (first 60% of each entity’s

time series) and the subsequent part used for testing.

As part of the evaluation, we perform near and far predic-

tions, predicting the labels of the first N%, and respectively,

the last N% of the test data. In a real scenario, the first N%

of the observations would correspond to the first time win-

dow, after which the classifier would need to be adapted.

The far predictions are used for gauging model quality, since

we expect that a model would perform better in the near

future than in the far future. All predictions are measured by

root-mean-squared error (RMSE) against the true values, and

performance improvement from kNN is compared against

the kRandom Entities (kRE) introduced in Sect. 3.2.2.The

RMSE reported in the experiments is computed over all the

predictions made for each observation in each entity’s test

trajectory. The RMSE may be computed only for the first

part of this test trajectory, as in the case of the first N%, or

only for the last part of the test trajectory, as in the case of

the last N%.

4.1 Datasets

We consider three datasets: one from the domain of environ-

ment monitoring, in which the entities are sensors; one from

the domain of mHealth, where the entities are patients inter-

acting with a mobile app; one from e-commerce, where the

entities are products on which opinions have been submitted.

For each of the datasets, we remove entities that have too

few observations to learn, and then build a train and test set

123

International Journal of Data Science and Analytics

from each of the trajectories (time series) of the remaining

entities. For each entity eremaining in the dataset after the

filtering step, we use the first 60% of the data for training the

augmented regressor, and the remaining data for testing.

AQI—yearly EPA carbon monoxide datasets: These

are public domain dedicated datasets collected by the US

Environmental Protection Agency. The two datasets used

in this study are the daily and the yearly carbon monox-

ide summaries, containing the daily and yearly observations,

respectively. The daily carbon monoxide (CO) summaries

contain daily observed information for mean CO levels at

different sites located all across theUS.Inadditionto CO lev-

els, various other metrics are also available (PM2.5, PM10,

etc.), though this work focuses on predictions for CO lev-

els only. However, the ‘Air Quality Index’ (AQI) and the

daily Max_Observed_CO_Value information are also used

at the daily level. The selection of attributes was done in

a way that balances relevance to the prediction variable

as well as data quality. The dataset consists of 200 enti-

ties, with 577482 observations for the period from 1990

to 2017. The prediction variable is the mean CO values

observed in ppm as measured daily, and has the following

values: average =0.742, min =0.004, max =8.461 and

standard_deviation =0.519.

Attribute Space Dofor kNN computation: The yearly CO

summary dataset gives the average CO levels observed at

these sites over the whole year. In addition to the yearly aver-

age observed CO, we also use the latitude and longitude of

the measuring site, the standard deviation of the measure-

ment, the first and second max values encountered, as well

as 90th and 50th percentile values.

Feature selection: For both the daily and the yearly

datasets, we removed several variables to focus only on those

associated with CO levels. Whenever a particular measuring

site had more than one device measuring CO levels, only

measurementsfromthe primary devicewereused. It wasalso

seen that several observations were possible within the day,

withaverageCOlevelscomputedovervaryingtimeintervals.

In this case, we kept only the average CO levels measured

over the longest time. The attribute space Dptherefore

has 3 variables (the timestamp, AQI, and max_CO_Value).

The daily CO level dataset contains data between 1990 and

2017, while for the yearly summary we only keep data from

1989.

Figure 1shows the distribution of entity lengths in the

AQI dataset. Note that while there is indeed a large range of

lengths, the distribution of #entities for varying entity lengths

is not as skewed as in the other datasets.

mHealth dataset: The mHealth dataset is not public

domain. It contains data on patients interacting with an

mHealth app called TrackYourTinnitus, to which they enter

data several times a day. The feature space of the entity tra-

jectories contains seven variables, one of which captures the

Fig. 1 AQI: #Entities (Y-axis) for various entity lengths (X-axis)

Fig. 2 mHealth: #Entities (Y-axis) for various entity lengths (X-axis)

patient’s overall state of health at the moment of the observa-

tion. This variable is numeric with a value between [0,1], and

we use it as target variable. For patient similarity, we use an

additional dataset that contains 31 variables on the patients

themselves (including sociodemographics and symptoms of

the disease). We use this second dataset as Do. The data

from this application has already been used to create mean-

ingful insights, like in [22], which has studied the difference

between prospective and retrospective ratings on the Track-

YourTinnitus mHealth platform.

Attribute Space Dofor kNN computation: The attribute

space for the kNN computation is quite large, and the dis-

covery of the best subspace over which the kNN could be

built was done using a genetic algorithm, as discussed in

3.2.1.

Feature Selection: From the list of features available in

Dp, all features were used to train the regression model.

All entities with less than five observations were removed,

leaving a total of 516 entities in the dataset. A histogram of

entity lengths is given in Fig. 2.

Amazon—the tools & home improvement dataset: This

is a subset of the Amazon product reviews dataset introduced

123

International Journal of Data Science and Analytics

in [20], including only reviews for products belonging to

the category ‘Tools & Home Improvement’. Only the review

star rating, review timestamp, review text, and productID

information is exploited from this dataset. The productID

and review text combine to form Dowith the timestamp and

star rating forming the Dp. The star rating is the prediction

variable, and takes values 1, 2, 3, 4 or 5.

The minimum cutoff for the Amazon dataset was set at

least 2 reviews. As can be seen in Fig. 3, the skew in entity

lengths is rather extreme in the Amazon dataset, with the

25th percentile, 50th percentile, and the 100th percentile

being 2, 4 and 4770, respectively. There are 139508 enti-

ties in the dataset after filtering, with a mean entity length of

12.9. In addition, [3] has also noted that the most reviews in

the dataset cluster towards the last timestamps, along with an

increased bias towards high ratings.

Attribute Space Dofor kNN computation: Since the Ama-

zon dataset provides no clear set of variables in Doregarding

the static properties of the product, we use an aggregated

version of the information in the domain Dp. Toward this

end, we build a paragraph vector model as described in

[16] on the textual reviews for each product. The paragraph

vector model creates an embedding space not only for the

words in the review, but also for the documents that those

words are a part of. This comes with the great advantage

that the documents (products generating the reviews) and

the words are described in the same latent space. For this

work, we train a doc2vec model using the gensim library

[23].

Feature selection in Dpis unnecessary because only the

timestamp is used to predict the arriving ratings. While this

approach (of predicting an arriving rating based only on

the timestamp) might seem unrealistic from the point of

view of the dataset, the reader is encouraged to remember

that the goal of this work is to assess the degree to which

knowing the entity influences the predictability of its future

observations.

Fig. 3 Amazon: #Entities (Y-axis) for various entity lengths (X-axis)

4.2 Experimental runs and results

As introduced in Sect. 3, the main point is to evaluate the per-

formance of a regressor augmented with entities belonging

to the kneighbourhood of an entity. However, as discussed in

Sect.3.2.2, weneedto firstverifyhow theneighbourhoodcan

augment the computed kNN, and whether this augmentation

should be done with absolute or relative timestamps.

4.2.1 Data v/s model augmentation

Figures 4,5and 6show the RMSE values achieved by the

entity-centric models using the two approaches to augment

kNN data to improve predictions. It can be seen that the

data augmentation performs much better than the model aug-

mentation for the Amazon and AQI datasets, while for the

mHealth dataset the two methods are fairly similar.

FromFig. 4,wesee that theRMSEdrops rapidlyaswe add

neighbours,butincreasesagainas more and more neighbours

are added. The curve for the model-augmented predictions is

Fig. 4 mHealth: RMSE for data augmentation (blue) v/s model aug-

mentation (green) (color figure online)

Fig. 5 Amazon: RMSE for data augmentation (blue) v/s model aug-

mentation (green) (color figure online)

123

International Journal of Data Science and Analytics

Fig. 6 AQI: RMSE for data augmentation (blue) v/s model augmenta-

tion (green) (color figure online)

only slightly worse than the curve for data-augmented pre-

dictions. This indicates that there might be no dataset-level

tendencies in any direction, i.e. all entities have roughly the

same slope. It is also possible that the kNN successfully

identifies neighbouring entities of similar tendency. But this

requires further investigation for the concrete dataset.

For the Amazon and AQI datasets, however, the data-

augmented method shows almost the opposite tendency

(though for AQI the quality improves as we add more and

more neighbours). It is also to be noted that the graphs show

results for up to 50 neighbours, which is a much larger pro-

portion of the total number of entities in the AQI dataset

(which has 200 entities), as compared to the Amazon dataset.

It is therefore possible that the Amazon dataset would have

shown the same initially upward, but subsequently decreas-

ing tendency for large neighbourhoods if the neighbourhood

size were expanded to thousands of entities. This test with

such a large neighbourhood size was not performed due to

performance considerations. It is also to be noted that the

results shown for the mHealth and the AQI datasets are on

the Z-score normalised value of the prediction variable as

in the training dataset. This is because unlike in the Ama-

zon five-star rating system, a fixed-magnitude deviation can

have different implications for the entity. For example, in the

mHealth dataset, a patient feeling worse by 0.2 points may

or may not be twice as bad as a patient feeling worse by 0.1

points.

4.2.2 Global v/s local time

Foreachoftheentitiesek∈kNN(ep), k∈{1...k},wehave

a set of timestamped observations. However, the timestamps

can be defined in two ways, one that considers each entity to

have a separate clock (local time), beginning at the time when

the first observation on that entity is recorded. The other case

Fig. 7 mHealth: RMSE for Global time (green) v/s Local-time (blue)

(color figure online)

Fig. 8 Amazon: RMSE for Global time (green) v/s Local-time (blue)

(color figure online)

is the default case of using the timestamps as they are (global

time), without modifying them in any way.

We perform this experiment only for the mHealth and the

Amazon dataset. We skip the AQI dataset, because no align-

ment is needed: almost all AQI entities have observations

at the beginning of time in the dataset (1990). The differ-

ent lengths are a consequence of the fact that not all entities

have recordings for as long into the future (from 1990) as

others.

Figures 7and 8show that time aligning each entity to 0

makes predictions worse, though the effect is much larger

in the Amazon dataset, where the difference between the

youngest and oldest entities is much larger. Another possible

reason for this result is that most entities are very short (see

Fig. 2and 3). When short entities are time aligned before

learningaregressor on them, this causes the slopes to become

steeper than if entities with the same amount of variation (but

far apart in absolute time) use the global clock.

123

International Journal of Data Science and Analytics

4.3 Comparison with kRE baseline

Since the past two experiments suggest that non-aligned

(global time), data-augmented regressors outperform the

other configurations, these two methods were compared

against the kRE baseline introduced in Sect. 3.2.2. The base-

line method numbers reported for the AQI and mHealth

datasets are averages computed over 30 runs. The size of the

Amazon dataset implied very slow computations, so we per-

formed only 5 runs. However, this does not appear to be a big

drawback, since the variation exhibited for each k was very

small (with minimum, mean and maximum std. deviations

for RMSE of 0.00022, 0.00046 and 0.00082. respectively).

Theresultsof running these experimentson the three datasets

are shown in Figs. 9,10 and 11.

The results on the Amazon dataset (Fig. 9) are partic-

ularly remarkable. This is the only dataset for which the

RMSE begins at low values and continuously increases, as

we increase the neighbourhood size. It is quite likely that

the very large number of very short entities (of length = 2,

for example—see Fig. 3) are best predicted without consid-

ering any neighbours, and the addition of each neighbour

distracts the local model with noise. Also, range of values k

over which the current experiments have been conducted are

a tiny fraction of the total number of entities in the dataset

(100,000+).However, itcan also be seen thatthough the kNN

model does have a high RMSE at the beginning, the RMSE

drops steadily as we add more neighbours. The drop in the

RMSE (compared to the rise in the case of random entities)

seems to suggest that the neighbours that are being added are

informative with respect to the true tendency.

The behaviour of RMSE might be associated to the use

of word embeddings. Since the study of the role of word

embeddings is beyond the scope of our work, we did not

compare different word embeddings. Rather, we selected the

doc2vec model [16] by checking whether the kNN of a doc-

Fig. 9 Amazon: Data-Augmented Regressor (blue) v/s kRE Baseline

(red) (color figure online)

Fig. 10 mHealth: Data-Augmented Regressor (blue) v/s kRE (red)

Baseline (color figure online)

Fig. 11 AQI: data-augmented regressor (blue) v/s kRE (red) baseline

(color figure online)

ument were mostly from the same category of products. So,

the high RMSE values might indicate that doc2vec was a

suboptimal choice for this analysis.

In the mHealth dataset, the kNN and the kRE baseline

performances are very similar, suggesting either that the

neighbourhood computation task has failed to identify fea-

turesin Dothatimprovepredictability in Dp, orthatthestatic

attributes of the respondents are not predictive of disease

evolution. However, the deviation of the kNN away from the

random beyond neighbourhood sizes of 15 suggests that the

attributes on which neighbourhoods are computed do have

an impact on predictions.

In the AQI dataset, it can be seen that the kNN does indeed

help to predict future values of an entity. It also appears that

the increase in neighbourhood size does not affect prediction

quality to a very large extent. Given that it has been observed

in the dataset that almost all entities have coinciding obser-

vations in the early parts of the time series (almost all entities

have observations starting from 1990, the beginning of the

dataset), this could suggest that all entities in the dataset have

123

International Journal of Data Science and Analytics

similarslopes,andaddingmoreneighboursdoesnotimprove

the models as much, at least for the simple models that we

have considered in this work.

It is also important to note two interesting observations

that hold true across all three datasets: It can be seen that for

all three datasets, predictions based on using entity informa-

tion alone (neighbourhood size of 0) was always worse than

those based on a neighbourhood. This means that neighbour-

hoods are useful in improving predictions. Unexpectedly,

RMSE decreases even for kRandom Entities. A possible

explanation is that there are entities which ‘run with the

crowd’, and this tendency helps even to improve predictions

on an entity eeven by using krandomly selected entities,

since they can inform on the global tendency. The Amazon

dataset is the only one we tested where we see an upward

slope for kRandom Entities from the very beginning.

4.4 Predictive performance in the near and the far

future

To lookdeeperat thekNNresults, webrokedown theRMSEs

by predictions for observations near in the future and pre-

dictions far in the future (first x% and last x%ofthetest

trajectory). Various bounds (10%, 20% and 50%) were tried

for near and far on the test data (which are the last 40%

of the observations belonging to an entity). On Figs. 12,13

we show the results only for near = first 10%, and far =

last 10%. The RMSE curves for the first 20% observations

showed a similar trend. The RMSE curves for the 50% of the

observations had also the same trend but were closer to each

other.

In the mHealth dataset, the near-RMSE is lower than the

far-RMSE (Fig. 12). This is quite expected, since the health

condition of a patient for some time period is more likely to

be similar to the patient’s condition just before that period.

Fig. 12 mHealth: RMSE (Y-axis) versus neighbourhood size (X-axis)

for near (green) v/s far predictions (red) (mean RMSE in blue) (color

figure online)

Fig. 13 AQI: RMSE (Y-axis) versus neighbourhood size (X-axis) for

near (green) v/s far predictions (red) (mean RMSE in blue) (color figure

online)

Fig. 14 Amazon: RMSE (Y-axis) versus neighbourhood size (X-axis)

for near (green) v/s far predictions (red) (mean RMSE in blue) (color

figure online)

Figure 12 shows that the near- and far- RMSEs have the same

trends with increasing k.

The mHealth dataset is the only one where the near-

RMSE is lower than the far-RMSE. The difference is most

remarkable for the AQI dataset: Fig. 13 shows that as the

neighbourhood size increases, the near-RMSE curve quickly

deteriorates compared to the far-RMSE. The mean of the

RMSE follows the same trend as the far-RMSE. A possible

explanation is that the near-RMSE refers to a time period

in which the recordings were more noisy or there were less

sensors in total (and more far apart).

In the AQI dataset, the curves of near-RMSE and far-

RMSEcrossclosetozero,i.e.forverylowvaluesofk.Forthe

Amazon dataset (Fig. 14), the two curves do not even cross

for small neighbourhoods: the near-RMSE drops, but the

far-RMSE begins at much lower values and then increases,

though very slowly. The two curves seem to approach each

other asymptotically. In [3], it has been reported that the rat-

ings in this dataset exhibit a long-term tendency, which might

123

International Journal of Data Science and Analytics

indicate that the far-RMSE is the result of many observations

with similar labels.

4.5 Neighbourhood pruning: using medical

intuition to exclude‘false neighbours’

The idea that some of the neighbours computed by the k-

nearest neighbours algorithm might not contribute positively

topredictionismentionedin[19].Inordertokeeponlyneigh-

bours that help improve the predictions, the authors suggest

a method that excludes ‘false neighbours’. However, the cri-

teria on which a neighbour may be declared ‘false’ are not

clearly defined in [19]. For our problem specification, one

way to decide if a neighbour is indeed negatively affecting

model performance is to check whether the neighbour of

entity xin static space Dois also a neighbour of the time

series of xin Dp. To do so, we need to observe the time

series for a while before deciding whether the time series of a

neighbour in Dodiverges from the given entity. As explained

already in the earlier sections, the time series for individual

entities may be too short, making it necessary to wait too

long before such a test becomes reliable.

In light of these difficulties, this work focuses on exploit-

ing expert knowledge regarding entities in the Dodomain

to identify ‘false neighbours’, which can then be excluded

from the kNN output for a particular entity. More precisely,

we build upon the findings of [9] on the association between

tinnitus loudness and distress, not simply to investigate how

distress can be predicted among similar patients, as done

in Sect. 3.4, but also to refine patient neighbourhoods and

to exclude patients that exhibit dissimilar distress despite

similar loudness levels. As already explained in Sect. 3.5,

restricting neighbourhoods computed by kNN to groups

of participants that have the same combination of loud-

ness/annoyance scores may be beneficial in the discovery

of subgroups.

The approach used by Hiller & Goebels in [9] can-

not be applied directly to our problem, since they measure

‘loudness’ using variables that are not part of the mHealth

dataset. However, the mHealth dataset does collect a vari-

able regarding registration-time tinnitus loudness. We cluster

the response values for this variable into three groups using

kMeans. The clusters are not well separated, but the three

clusters that are thus discovered amongst the respondents

form groups of participants with high, medium and low tin-

nitus loudness as shown in Fig. 15. Unlike for loudness, the

variable used by Hiller and Goebels to measure distress does

exist in the mHealth dataset—as in [9], we use the total score

computed by Mini-TQ, the ‘tfsum’ variable, for our analysis:

we varied the number of clusters but fixed them at 2. Figure

16 shows the two groups of patients discovered by kMeans

on the ‘tfsum’ variable. The minimum and maximum pos-

sible values are 0 and 24, respectively. As with loudness,

Fig. 15 Clusters for loudness measurements: black (low loudness),

green (moderate loudness) and red (high loudness) (color figure online)

Fig. 16 Clusters for distress measured as total ’tfsum’: green (low dis-

tress) and red (high distress) (color figure online)

Table 1 Number of participants and their average loudness and distress

levels per group

N Avg. Distress Avg. Loudness

Group1 97 18.3 82.2

Group2 168 17.0 54.4

Group3 52 15.7 26.8

Group4 35 9.2 77.1

Group5 83 8.1 53.0

Group6 81 7.4 28.1

the clusters are not well separated, so subgroups of low- and

high-distress participants correspond to cuts in continuous

strata.

On this basis, we create subgroups of participants that are

more similar in their loudness and distress ratings using the

following steps:

– Create six groups, and assign each participant into one

group, depending on his/her loudness and distress clus-

ter. The six possible groups are: High distress + High

Loudness, High Distress + Moderate Loudness, High

Distress + Low Loudness, Low Distress + High Loud-

ness, Low Distress + Moderate Loudness, Low Distress

+ Low Loudness

– Perform the classification task with kNN restricted

to finding neighbours from within each participant’s

group. Compare the performance of the neighbourhood-

augmented models within the groups to the global model

without restricting kNN to the group.

Table 1shows the number of participants and the mean

loudness and distress levels for each of the groups. It is

noted that our workflow has identified ‘discrepant’ partic-

ipants with both high loudness and low distress as well as

123

International Journal of Data Science and Analytics

Fig. 17 RMSE’s for pruned neighbourhood-augmented predictions

for six participant groups in the mHealth Dataset: Neighbourhoods

restricted to groups of participants of similar (Distress =<L/H>,

Loudness =<L/M/H>)

low loudness and high distress values in agreement with [9].

The low distress, high loudness group is the smaller of the

two, with only 35 participants. It is also seen that while Hiller

& Goebel found about a third of tinnitus patients to be dis-

cordant, only about 16% of the participants in the mHealth

dataset showed this peculiarity. While this could be a short-

coming of the workflow design, it is also possible that the

mHealth application is more accessible to tinnitus sufferers

who did not go to the hospital ([9] refers to hospital data)

Figure 17 shows the predictions in Dpachieved after

segmenting the patients on loudness+distress in Do.Two

conclusions, in particular, can be drawn from the results. It

can be seen that in four of the six groups, expert knowledge

guided intuition contributed to improving prediction qual-

ity. In addition, among the groups of patients described in

Hiller & Goebel [9] as discrepant in their loudness-distress

measurements, one of the corresponding discrepant partici-

pant groups (high distress, low loudness) is improved to the

greatest extent by the neighbourhood pruning process in the

mHealth dataset. In contrast, the predictability in Group4

(low distress, high loudness) is the lowest. This suggests

that of the two possible ways in which a patient could be

‘discrepant’, the ones that are distressed to a high degree

even with a relatively lower loudness levels are somehow

more similar to each other than those of the opposite group

(low distress in spite of high loudness). In other words, it

could be that participants who find ways do deal with their

symptom find their own different ways to deal with it, while

those that are easily distressed are more psychologically

alike. The RMSE of the global model on Figure 17 indi-

catesthat theimprovementsonsomegroupsarecancelledout

by the low predictability on Group 4. Hence the refinement

into segments is necessary in order to improve performance.

Indeed, the idea of eliminating false neighbours does have a

small positive impact on predictive performance, with four

of the six groups showing small improvements in predic-

tive performance compared to the global model that does not

exclude neighbours. Since the neighbourhood computations

are already (non-randomly) restricted to be within group, we

do not compare these results against rNN.

5 Summary and future work

The goal of the study was to investigate whether a k-nearest

neighbours constructed on a feature space outside the pre-

diction space of an entity can guide a labelling process for

future observations of an entity. Towards this end, we pro-

posed a k-nearest-neighbour method that can work on a static

attribute space Do, and transfer the knowledge about entity

similarities to label future instances of the entity in dynamic

domain Dp. The efficacy of the proposed method is evalu-

ated against a baseline which challenges the assumption that

Doyields ‘transferable information’ (by choosing kRandom

Entities).

Two methods of combining information from the related

entities were observed, and it was seen that a data-augmented

method had the best performance. However, more work

is needed on a wider range of datasets, before a general

conclusion can be reached regarding the selection of the aug-

mentation type with respect to dataset characteristics. It is

also to be noted that this early exploratory work only consid-

ers the case of linear regression models, and the performance

of more complex models such as polynomial regression or

ARIMA may be affected to different extents by data augmen-

tation(for example,thefactthatdataaugmentationforcefully

aligns all time series can affect ARIMA models in the case

of two out-of-sync time series with repeating patterns). Sur-

prisingly, it was found that depending on the dataset and

its average tendencies, even choosing random entities can

improve the prediction quality. This means that even a ran-

dom entity transfers some level of knowledge about the

dataset-level tendency. In other words, it can be argued that a

random entity is able to mitigate the sparsity in the available

labels. More work needs to be done to verify whether this is

generallytrueacrossalldatasets,andifyes,whetheraheuris-

tically discoverable kexists that can be used for entities (new,

privacy-protected, etc.) that do not have easily computable

neighbourhoods.

In this first study on predicting observation labels in an

entity-centric way, we did not consider model adaption.

Albeit our results show that it is possible to make even far-

future predictions by using solely an initial seed of labelled

observations per entity. In addition, we also see that it is pos-

sible to incorporate domain-specific expert information to

improve predictions by restricting the kNN computation to

123

International Journal of Data Science and Analytics

groups of entities in a way that excludes ‘false neighbours’

as suggested in [19]. The adaption of the model with means

of semi-supervised or active learning is our next planned

task. For this, we intend to build upon our recent works on

semi-supervised sentiment classification [10] and on active

learning for sentiment classification with an irregularlyavail-

able oracle [24], both of which are designed for conventional

streamsthough.Ourcurrent approachwasnot optimisedwith

respect to training time and runtime. We want to investi-

gate the impact of dataset characteristics on training time,

and to optimise the cost of retraining and adaption if labels

can be acquired through an irregularly available oracle or

derived with self-learning. The effects of neighbourhood size

onmodelqualityandexecutiontimeneedalsofurtherinvesti-

gation. In our current experiments, the number of neighbours

did not exceed 50, whereupon the error values seemed rather

stable. Since the number of entities is much smaller than the

number of observations, it is worth investigating whether it

would be beneficial to set the neighbourhood size relative to

the number of entities in the dataset. Several avenues also

remain for possible improvements to the core method. The

poor gains in predictive performance from the kNN in the

Amazon dataset is also possibly due to a poorly trained word

embedding model. The evaluation process for the paragraph

embedding can possibly be improved by considering a Jac-

card coefficient-like measure for the number of neighbours

a product has from the same product subcategory.

Acknowledgements Work of Authors 1 and 2 was partially supported

by the German Research Foundation (DFG) within the DFG-project

OSCAR Opinion Stream Classification with Ensembles and Active

Learners. The last two authors are the project’s principal investigators.

References

1. Al-qahtani, F.H.: Multivariate k-Nearest Neighbour Regression for

Time Series data—a novel Algorithm for Forecasting UK Electric-

ity Demand Multivariate KNN Regression for Time Series. Neural

Networks (IJCNN), The 2013 International Joint Conference on

pp 228–235 (2013)

2. Ban, T., Zhang, R., Pang, S., Sarrafzadeh, A., Inoue, D.: Referen-

tial kNN regression for financial time series forecasting. In: Lee,

M., Hirose, A., Hou, Z.G., Kil, R.M. (eds.) Neural Information

Processing, pp. 601–608. Springer, Heidelberg (2013)

3. Beyer, C., Niemann, U., Unnikrishnan, V., Ntoutsi, E.,

Spiliopoulou, M.: Predicting document polarities on a stream with-

out reading their contents. In: Proceedings of the Symposium on

Applied Computing (SAC) (2018)

4. Che, Z., Purushotham, S., Cho, K., Sontag, D., Liu, Y.: Recurrent

neural networks for multivariate time series with missing values.

Sci. Reports 8(1), 6085 (2018)

5. Clarke, R.N.: SICs as Delineators of Economic Markets. J.

Bus. 62(1), 17–31, (1989) https://ideas.repec.org/a/ucp/jnlbus/

v62y1989i1p17-31.html. Accessed 1 Feb 2018

6. Ditzler, G., Roveri, M., Alippi, C., Polikar, R.: Learning in nonsta-

tionary environments: a survey. IEEE Comput. Intell. Mag. 10(4),

12–25 (2015)

7. Dyer, K.B., Capo, R., Polikar, R.: Compose: A semisupervised

learning framework for initially labeled nonstationary streaming

data. IEEE Trans. Neural Netw. Learn. Syst. 25(1), 12–26 (2014)

8. Hartmann, C., Ressel, F., Hahmann, M., Habich, D., Lehner, W.:

Csar: the cross-sectional autoregression model for short and long-

range forecasting. Int. J. Data Sci. Anal. (2019). https://doi.org/10.

1007/s41060-018-00169-7

9. Hiller, W., Goebel, G.: When tinnitus loudness and annoyance are

discrepant: audiological characteristics and psychological profile.

Audiol. Neurotol. 12(6), 391–400 (2007)

10. Iosifidis,V.,Ntoutsi,E.:Largescalesentimentlearningwithlimited

labels. In: Proceedings of the 23rd ACM SIGKDD International

Conference on Knowledge Discovery and Data Mining, ACM, pp

1823–1832 (2017)

11. Keogh, E.J., Pazzani, M.J.: Scaling up dynamic time warping

for datamining applications. In: Proceedings of the Sixth ACM

SIGKDD International Conference on Knowledge Discovery and

Data Mining, ACM, pp 285–289 (2000)

12. Kia, A.N., Haratizadeh, S., Shouraki, S.B.: A hybrid supervised

semi-supervised graph-based model to predict one-day ahead

movement of global stock markets and commodity prices. Expert

Syst. Appl. 105, 159–173 (2018)

13. Krawczyk, B., Minku, L.L., Gama, J., Stefanowski, J., Wo´zniak,

M.: Ensemble learning for data stream analysis: a survey. Inf.

Fusion 37, 132–156 (2017)

14. Krempl, G., Žliobaite, I., Brzezi´nski, D., Hüllermeier, E., Last,

M., Lemaire, V., Noack, T., Shaker, A., Sievi, S., Spiliopoulou,

M., et al.: Open challenges for data stream mining research. ACM

SIGKDD Explorations Newslett. 16(1), 1–10 (2014)

15. Längkvist, M., Karlsson, L., Loutfi, A.: A review of unsupervised

featurelearninganddeeplearning fortime-seriesmodeling.Pattern

Recogn. Lett. 42, 11–24 (2014)

16. Le, Q., Mikolov, T.: Distributed representations of sentences and

documents. In: International Conference on Machine Learning, pp

1188–1196 (2014)

17. Lora, A.T., Santos, J.R., Santos, J.R., Ramos, J.L.M., Expósito,

A.G.: Electricity market price forecasting: Neural networks versus

weighted-distance k nearest neighbours. In: International Confer-

ence on Database and Expert Systems Applications, Springer, pp

321–330 (2002)

18. Lora,A.T.,Santos, J.M.R., Riquelme, J.C.,Expósito, A.G., Ramos,

J.L.M.: Time-series prediction: Application to the short-term elec-

tric energy demand. Current Topics in Artificial Intelligence pp

577–586 (2004)

19. Lora, A.T., Santos, J.M.R., Exposito, A.G., Ramos, J.L.M., San-

tos, J.C.R.: Electricity market price forecasting based on weighted

nearest neighbors techniques. IEEE Trans. Power Syst. 22(3),

1294–1301 (2007)

20. McAuley, J., Yang, A.: Addressing complex and subjective

product-related queries with customer reviews. In: Proceedings

of the 25th International Conference on World Wide Web, Inter-

national World Wide Web Conferences Steering Committee, pp

625–635 (2016)

21. Polson, N.G., Sokolov, V.O.: Deep learning for short-term traffic

flow prediction. Transportation Research Part C: Emerging Tech-

nologies 79, (2017)

22. Pryss, R., Probst, T., Schlee, W., Schobel, J., Langguth, B., Neff,

P., Spiliopoulou, M., Reichert, M.: Prospective crowdsensing ver-

sus retrospective ratings of tinnitus variability and tinnitus-stress

associations based on the trackyourtinnitus mobile platform. Int. J.

DataSci.Anal. (2018). https://doi.org/10.1007/s41060-018-0111-

23. ˇ

Reh˚uˇrek, R., Sojka, P.: Software Framework for Topic Modelling

with Large Corpora. In: Proceedings of the LREC 2010 Workshop

on New Challenges for NLP Frameworks, ELRA, Valletta, Malta,

pp 45–50 (2010)

123

International Journal of Data Science and Analytics

24. Serrao, E., Spiliopoulou, M.: Active stream learning with an oracle

ofunknownavailability forsentiment prediction. In:2nd Int. Work-

shop on Interactive Adaptive Learning (IAL2018) at ECML PKDD

2018, Dublin, Ireland, accepted in July 2018, to appear (2018)

25. Suresh, H., Hunt, N., Johnson, A., Celi, L.A., Szolovits, P., Ghas-

semi, M.: Clinical intervention prediction and understanding with

deep neural networks. In: Machine Learning for Healthcare Con-

ference, pp 322–337 (2017)

26. Troncoso Lora, A., Riquelme, J.C., Martínez Ramos, J.L.,

Riquelme Santos, J.M., Gómez Expósito, A.: Influence of kNN-

based load forecasting errors on optimal energy production. In:

Pires, F.M., Abreu, S. (eds.) Progress in Artificial Intelligence, pp.

189–203. Springer, Heidelberg (2003)

27. Wagner, T., Guha, S., Kasiviswanathan, S.P., Mishra, N.: Semi-

supervisedlearning ondatastreams viatemporal label propagation.

In: International Conference on Machine Learning, pp 5082–5091

(2018)

28. Yakowitz, S.: Nearest-neighbour methods for time series analysis.

J. time Series Anal. 8(2), 235–247 (1987)

29. Zhang, J., Zheng, Y., Qi, D.: Deep spatio-temporal residual

networks for citywide crowd flows prediction. In: AAAI, pp 1655–

1661 (2017)

Publisher’s Note Springer Nature remains neutral with regard to juris-

dictional claims in published maps and institutional affiliations.

123