Generic concept for integrating voice assistance into
smart therapeutic interventions
Jens Scheible1, Fabian Hofmann1, Manfred Reichert1, R¨
udiger Pryss2, Marc Schickler1
1Institute of Databases and Information Systems, Ulm University, Germany
2Institute of Clinical Epidemiology and Biometry, University of W¨
urzburg, Germany
1{jens.scheible, fabian-1.hofmann, marc.schickler, manfred.reichert}@uni-ulm.de
2ruediger.pryss@uni-wuerzburg.de
Abstract—Therapeutic Interventions (TIs) play an important
role in modern medical and psychological treatments, but their
integration into the digital world still shows deficits, e.g., in
the integration of the auditory interface. Initiatives to integrate
this interface into existing Internet- and Mobile-Based Inter-
ventions (IMIs) are largely focused on a small group of Voice
Assistants (VAs) and their specific capabilities. To mitigate these
drawbacks, the presented concept seamlessly integrates arbitrary
VAs into the treatment process of TIs. To this end, an architecture
- including a discussion of relevant requirements - is presented
that, on the one hand, uses VAs as the only point of contact with
patients and, on the other hand, provides a comprehensive web-
based backend for Healthcare Providers (HCPs). Based on the
architecture, a proof-of-concept implementation using Amazon
Alexa is presented. Finally, it is discussed that the scenario
addressed and the solution presented have great potential, but
still need a lot of work and technical considerations.
Index Terms—conversational agents, smart assistance, thera-
peutic interventions, voice interface
I. INTRODUCTION
NOWADAYS, Therapeutic Interventions (TIs) play an es-
sential role in medical and psychological treatment. Their
scope of application ranges from medication administration to
complex therapeutic homework [1]. The demand for digital
solutions has therefore risen sharply in recent years and has led
to numerous, often multimodal concepts and developments [2],
[3]. However, these developments have mainly focused on
visual user interfaces provided via smartphones, tablets or
wearables. With the rapid development of Voice Assistants
(VAs) capabilities, their importance has now increased, cre-
ating a completely new way of interaction with patients.
In general, it does not seem trivial to reconcile the two
worlds of VAs and TIs, especially considering the complexity
of both fields [4]. To be more precise, TIs consist of a wide
range of different methods and tasks that are tailored to the
individual needs of individual patients. VAs, on the other
hand, are offered by different companies such as Amazon
or Google; in addition, many research approaches exist, for
which the available solutions are characterized by a large
heterogeneity. Consequently, the difficulty in bringing them
together lies not in digitizing TIs, but in creating a uniform
and comparable interface for any VA for TIs. In order to make
auditory interfaces available through VAs in the context of
TIs, we believe that further research is needed. Above all, one
aspect is particularly important: a developed concept for the
integration of VAs into TIs must not lose its genericity with
respect to a provider of VAs. Furthermore, possible concepts
should not only focus on patients, but also integrate therapists,
domain specialists and scientists in a suitable way.
To enable this, from a technical point of view, the digital
delivery of interventions is considered as the starting point for
our approach. Therefore, VAs are considered as agents that
only access treatment data. In addition, typical methods of
language dialogue systems, such as Natural Language Under-
standing (NLU) and Natural Language Generation (NLG) [5],
are used to leverage their generated information to better
enable more personalized user interactions. In this process,
agents are orchestrated by a server. The contribution of this
work is to present a (1) generic concept for the integration of
VAs in therapeutic contexts. Within this concept, a correspond-
ing architecture is presented. Furthermore, the (2) challenges
and limitations of such an architecture are discussed and a
(3) prototypical implementation based on this architecture is
presented.
The remainder of this paper is organized as follows: Sec-
tion II discusses related work. Key aspects of TIs are presented
in Section III, while requirements for the proposed approach
are presented in Section IV. The development of the approach
and its implementation are discussed in Sections V and VI. A
discussion of the results, benefits, and limitations is presented
in Section VII. Finally, Section VIII concludes the paper with
a summary and outlook.
II. RELATED WORK
Previous approaches integrating a voice interface into Ther-
apeutic Interventions (TIs), mostly concentrate on single
Voice Assistants (VAs). According to an up-to-date (2020)
market share1,Amazon Alexa [6] (34%) and Google Assis-
tant [7] (43%), still occupy a prominent position in the world
of VAs. Due to the ability to bundle a number of voice com-
mands into so-called skills that have, for example, a common
dialog context, as well as the ability to easily integrate other
systems easily, Alexa is very popular in the present context. Of
note, the developer API of Alexa Skills Kit3was launched in
2015, while for Google Assistant,Google Actions, in 20172. Of
further note, existing platforms for Internet- and Mobile-Based
Interventions (IMIs) [8]–[10], which are important technical
foundations for TIs, support patients as well as Health Care
Providers (HCPs) during the overall intervention progress.
The resulting setting must therefore be carefully considered
in the present context if VAs are to be used for TIs. The
following sections first introduce a number of platforms that
implement therapeutic interventions in the eHealth context
and briefly highlight their capabilities. Afterwards, identified
approaches that integrate the auditory interface into the digital
interventions will be presented.
A. Therapeutic Intervention Platforms
The eSano platform [8], as an example of IMIs, offers
treatment for mental disorders and chronic somatic diseases
independent of time and place. The platform consists of a
central REST API that connects a database with web-based
systems to flexibly create interventions. Patients can perform
their treatments via various cross-platform applications. The
applications support diaries, questionnaires, and Ecological
Momentary Assessments (EMAs), among others. The authors
of [9], in turn, provide a messaging system that patients
can use to communicate with their therapists. For example,
patients can book appointments or provide feedback on on-
going treatments. In addition to this asynchronous type of
communication, real-time chat functionality has been added to
enable live communication via text, voice or video. Because
of the features available to patients and HCPs, both platforms
provide a solid starting point for integrating the auditory
interface. However, it needs to be clarified to what extent the
platform APIs are sufficient as interfaces that VAs are able to
interact with.
Electronic Health Records (EHRs), stored in a centralized
medical cloud, are a key feature of [10]. The authors describe
how the collection of physiological data can be integrated into
their system, e.g., by collecting electrocardiological data via
wearables. A connected smartphone provides synchronization
and allows healthcare professionals to draw conclusions about,
for example, daily activity via a web interface, which supports
clinical decision making.
B. Enabling Interventions for VAs
Yun et al. [2] evaluated the integration of voice interfaces
in Caring for Caregivers Online (COCO)3. COCO is a plat-
form focusing on private caregivers of family members with
chronic health conditions. The majority (89%) of participants
in the survey conducted are familiar with VAs, and the fact
that a voice interface provides more flexibility in retrieving
information or instructions for practices during activity favors
VA use. The authors also cite high levels of user satisfaction,
such as not being distracted by a smartphone while interacting
with the VA.
1https://www.statista.com/statistics/789633/worldwide-digital-assistant-
market-share/ (Accessed: 2022-04-06)
2https://www.theverge.com/2017/5/17/15648538/google-actions-android-
ios-phones-third-party-app-assistant-io-2017 (Accessed: 2022-04-06)
3https://press.aboutamazon.com/news-releases/news-release-
details/amazon-introduces-alexa-skills-kit-free-sdk-developers/ (Accessed:
2022-04-07)
[2] evaluate the degree of User Experience (UX) with
the help of user testing methods and could reveal challenges
in the following three categories: (1) Accessibility: Since
users prefer learning a new system in combination with a
Graphical User Interfaces (GUIs), keeping track of the system
progress and controlling is rather hard only using spoken
language interfaces. (2) Efficiency: Voice as a volatile form
of information should be accompanied by a reduced cognitive
load. Presented information should not be too detailed, nor
should it lose too much information content. (3) Satisfaction:
Users prioritize a high level of correctness over speed of
response in this context.
Similar to [2], the authors of [11] integrated a voice
interface into an existing IMI platform: Nurse AMIE [12].
The system supports women with metastatic breast cancer
through a self-administered intervention solution that includes
psychoeducation and mindfulness meditations. It technically
relies on Alexa and the corresponding Amazon infrastructure
(i.e., AWS Lambda Service, AWS DynamoDB). Previous
approaches provide proven interventions using conventional
web- or mobile based interfaces. Underlying infrastructures
start from those of single manufacturers, potentially limiting
the use of different VAs [11], to completely decoupled dis-
tributed systems [8]–[10]. As the auditory interface is on the
rise4and gives new possibilities in a more natural human com-
puter interaction, we think this field of research is promising.
Different approaches already try to tackle challenges like e.g.,
integrating distributed systems into multiple VAs [10].
However, the aforementioned approaches lack their generic
character, as they focus on specific VAs or user groups. The
concept presented here, on the other hand, attempts to include
all stakeholders in order to develop a more general approach
to integrating VAs into the TI domain. VAs are therefore
considered as interchangeable voice interfaces.
III. THERAPEUTIC INTERVENTIONS
The starting point of any therapeutic intervention is the
occurrence of a psychological or physical complaint, which
leads to consultation with a corresponding specialist. In the
further course, a diagnosis can be made using various methods.
This diagnosis can in turn serve as the basis for planning any
necessary treatments [1]. Traditional therapeutic interventions
are based on one or more therapy sessions [1]. These serve
to refine the initial diagnosis and to test possible therapeutic
problem-solving approaches together with the patient. The
treatment itself consists mostly of both inpatient and outpatient
measures [13]. During a supervised session, for example, the
patient performs exercises under the guidance of a therapist.
However, to achieve optimal therapy results, exercises must be
continued outside of these sessions. Exercises that the patient
performs independently outside of the sessions are referred to
as therapeutic homework [13] and aim to further deepen his
or her knowledge and understanding of the therapy.
3https://coco.health (Accessed: 2022-04-06)
4https://www.thinkwithgoogle.com/consumer-insights/consumer-
trends/voice-technology-purchasing/ (Accessed: 2022-04-07)
Consult
Expert
At home During the therapy session Outside the therapy session
Patient Patient Therapist Patient
Mental
complaints
Diagnose
Planning
Doing
Homework
Perform
Exercises
Go
home
Defining new
homework
Iterative sessions
Physical
complaints
Depression
Listlessness
Fear
Back pain
Migraine
Sprain Medication
Report
Progress
Fig. 1: Traditional therapeutic intervention
However, the success of a therapeutic measure depends
not only on the methodology used. Numerous studies show
that patient adherence (also compliance) is an essential factor
for the success of a therapy. This is because, despite correct
indications, problems can occur when performing therapeutic
exercises [1]. For example, various challenges may arise in
the application of therapeutic homework, such as:
•Misunderstandings in oral communication
•Inappropriate difficulty levels of the tasks
•Lack of motivation
The situation is further complicated by the fact that home-
work carried out in analog form is difficult to monitor.
Digital support of therapeutic processes therefore offers clearly
identifiable advantages on both the therapist and patient sides.
Therapists could make ad-hoc adjustments in the course of
therapy and patients could benefit from even more individual-
ized care.
IV. REQUIREMENTS
After an initial introduction to Therapeutic Interventions
(TI), the following section focuses on the requirements that
arise in such a context. When integrating Voice Assistants
(VAs) into the TI domain, therapists and patients can be
considered as the key stakeholders. In addition, to meet the
requirements of a generic approach, other stakeholders such
as domain specialists or scientists are also considered. Since
patients and therapists directly influence therapy outcomes,
whereas scientists or domain experts usually influence
therapy more indirectly, both perspectives are considered
separately. However, the focus of our approach is on the
former perspective. Finally, Table I summarizes identified
requirements, grouped by the main implementation steps.
Providing the best possible care to a patient is undoubtedly
the main goal of any treatment. Ensuring this goal during a
therapy session depends on the therapist’s skills [3]. Other
parts, however, such as homework, take place outside a session
and are therefore within the responsibility of the patient [13].
The out-of-session scenario is particularly well suited to the
use of a VA, since the VA can serve as a source of information
here 1. However, since speech is a volatile form of informa-
tion [2], it should be supplemented by a visual representation.
This, in turn, can be used to balance the level of detail and rel-
evance of individual pieces of information. For example, a VA
could provide a visual overview of current progress or answer
simple questions auditorily. To increase a patient’s adherence,
it is also beneficial to gain the user’s trust by making the
conversations as natural as possible 4. This could be achieved
by enriching conversations with historical patient data, current
contextual information, or possible predictions 5. Thus, the
context awareness of a VA agent depends on the available
contextual data and the orchestration of the server.
Another important perspective is that of the therapist, who
is involved in the planning, implementation and evaluation
of a particular treatment. The involvement of a VA at this
point therefore means supporting an entire process chain. The
coordination of session appointments by a VA is therefore only
one essential area of application. In this scenario, an assistant
could also monitor the course of treatment, inform the therapist
of any kind of deterioration, or simply allow therapists to
access current patient data 1. However, it should be kept
in mind that just because something can be augmented by
a VA does not mean that this is the appropriate modality.
For example, consider the visualization of the vast amounts
of data generated in the course of therapy. Here, it must
be doubted whether a VA is the right modality for the data
volumes. Also, the visual support of ad-hoc adjustments of
the patient’s therapeutic homework may usually be clearer
on a large desktop application window. Regardless of the
implementation chosen, however, consistency and timeliness
of treatment data should also be a key requirement. In addition,
different user roles with corresponding authorization levels are
already emerging in this scenario 3. This is an aspect that
must also be taken into account when integrating a VA into
the TI domain.
The last perspective considered is that of researchers and
professionals. While therapists and patients are actively in-
volved in a particular treatment, researchers and other pro-
fessionals tend to take a passive role. Thus, their main
interest is initially focused on the data collected [3]. For
example, scientists could access the raw data to gain new
insights or discover new therapies. This could be supported
by making data available in standardized, well-known formats
to increase interoperability with external analysis tools 7.
Professionals could process this data for any type of busi-
ness transaction. This, in turn, will lead to new, previously
unknown requirements in the future. Therefore, the underlying
system architecture should be as modular as possible 6,
and also highly extensible. Eventually, new data-processing
or -evaluating components could be extended, VAs could be
replaced, or new ones could be added.
V. ARCHITECTURE
After evaluating some key requirements, the following chap-
ter presents our architectural approach for flexible integration
of Voice Assistants (VAs) into the therapeutic context.
As mentioned before, the VAs can be considered as agents
controlled by a central server. Therefore, the proposed archi-
No. Title Description
A.) Interaction Model
1Data access VA gives access to most recent information
about treatment.
2Presentation According to hardware-limitations, the VA
presents answers in an audio-visual manner.
3User roles User roles allow a limitation of access for
single features.
B.) Conversational Flow
4Flow control Contextual information throughout a dialog in-
fluences its outcome.
5Personalization VA is aware of historic data and current con-
textual information.
C.) Data and Request Processing
6Modularity The modular and extendable system structure
benefits further adaptions.
7Interoperability Internal formats are based on state-of-the-art
standards to provide the highest degree of in-
teroperability.
TABLE I: Summary of identified requirements, grouped by
A.) Interaction Model, B.) Conversational Flow and C.) Data
and Request Processing of Voice Assistants (VAs).
tecture design is based on a well-known client-server model.
The main business logic is initially centralized on a single
server. Accordingly, the main focus of this approach is on the
internal server architecture.
First, the various existing clients must be evaluated in order
to know what a server must be capable of to interact with
these clients. For example, the server must first and foremost
be able to cope with a heterogeneous landscape of different
VAs. But, as described earlier, it must also be considered
that for some applications the voice interface may not be
the appropriate implementation. Therefore, desktop or web
applications should also be considered as possible clients.
To manage this large number of endpoints, we propose to
divide them into voice assistants and other client devices.
However, the focus is clearly on the former. The server must
therefore provide two separate interfaces to meet the different
requirements of the individual device groups.
Due to the variety of VAs currently available, it is not
possible to consider all voice assistance implementations.
Therefore, according to the latest statistics in [14], only a
subset is considered, covering most of the market. Since some
of them focus only on the Asian market, such as Baidu, Xiaomi
or Alibaba, they are not further considered. In addition, some
manufacturers, such as Apple, only offer limited access to the
assistant’s features, so they are not considered either. Ulti-
mately, the architectural design focuses on Google’s Assistant
and Amazon’s Alexa as reference VA implementations. These
two assistants are also particularly suitable due to their large
market share in [14], extensive documentation [7] [15], and
large developer community.
The first interface to be defined is the VA facing side of
the architecture. After it is known what kind of assistants the
server has to support, the VA interface can be designed. Our
two references VA Google Assistant and Amazon Alexa are
Google
Assistant
API
Amazon
Alexa
API
Microsoft
Cortana
API
...
Voice Assistant Generalizer
Business Logic
Machine Learning
Artificial Intelligence
Therapist
Scientist
End-User Voice Assistant
Therapist
Patient
Backend Professionals
Data Analysis/
Monitoring
amazon alexa
Google Assistant
Hey Cortana
Permission
Management
...
Fig. 2: Schematic description of the architecture
based on the definition of a so-called interaction model [15],
[16]. These models describe the main interactions between
a user and the assistant. This includes, for example, what a
user or patient can say to express certain intentions, which in
turn can be analyzed by the conversational agent via Natural
Language Understanding (NLU). In the case of reference
VAs, this process is provided via an encapsulated, vendor-
specific cloud service. Thus, only the result of the NLU, the
identified intent of the user, can be accessed. Therefore, the
main task of the server is to understand these intentions and
provide personalized responses. Generalizing a user’s intention
represented by VA requests is not a trivial task. Especially
when considering the previously mentioned heterogeneous
feature set of the individual assistants. Therefore, to ensure
that the server is able to handle all incoming requests from
the VAs, it is necessary to define possible user interactions in
advance. This could be done by modeling a unified interaction
model as a basis for structuring the server interface.
To keep the proprietary parts of the architecture as lean as
possible, the incoming VA requests need to be translated into
an internal, vendor-independent format. This could be done
by using an intermediate layer that acts as an intermediary
between the internal logic and the incoming VA requests.
Therefore, each assistant gets its own communication inter-
face that implements the basic vendor-specific communication
logic. The request and its fulfillment is delegated to the me-
diator. Subsequently, the mediator invokes the actual business
logic and generates the corresponding response.
To keep the mediator logic as simple as possible, we propose
to use established and well-known data formats that can be
interpreted by the assistant itself and therefore do not need to
be translated. Any kind of proprietary standard, such as Alexa
Presentation Language (APL), would increase complexity and
decrease interoperability. In addition, this scenario involves
two types of modalities: the voice interface and some type of
visual interface such as touchscreens. Therefore, separate data
formats should be used for the different interfaces. So for the
generation of speech responses, Speech Synthesis Markup Lan-
guage (SSML) seems to be suitable. To be more precise, this
XML-based markup language will be supported by the Google
Assistant [17], Alexa [18] and other assistants1. This format
can be used to describe, how a VA should generate an auditory
1https://dueros.baidu.com/dbp (Accessed: 2022-04-11)
response, with regard to emphasis, pauses and pronunciation2.
The visual responses, on the other hand, are based on typical
web technologies such as HTML, JavaScript and CSS. Thus,
if the VA’s hardware has a graphical interface, the responses
can be displayed as web applications. Moreover, the use of
these technologies greatly improves the interoperability of the
system. This is because the visual response can be interpreted
using any type of web browser, regardless of which end device
is used.
Once it has been determined how the data is to be formatted,
the internal architecture structure must be designed. With
regard to the generic claim of this approach, the actual business
logic should be implemented as modular as possible in order
to be easily extensible and changeable. We therefore propose
to use a so-called microservice architecture pattern. Here, the
server is structured as a combination of several encapsulated
software components that interact via a clearly defined com-
munication paradigm [19]. This allows certain components to
be developed, maintained and extended independently, which
in turn increases the maintainability and modularity of the
system [19]. However, it should also be kept in mind that the
use of this design pattern can also increase the complexity
of the system. Especially if some components are distributed
across multiple network nodes. Finally, the decision may also
be influenced by the experience of the designer. Therefore,
for the above reasons, we propose to use the microservice
architecture pattern to maximize the modularity of the system.
Another important component of this architectural approach
is the second server interface, through which all other requests
to access patient data are handled. Therefore, it is mainly in-
tended for professionals such as Healthcare Providers (HCPs)
or scientists. It allows these users to access raw, unprocessed
data. This data can then be further processed for specific ques-
tions or analyzed to gain new insights into certain diseases.
However, it must always be remembered that this personalized
health data should be treated with the utmost sensitivity.
Finally, we also recommend implementing appropriate rights
management that regulates all data access. This includes, for
example, multiple user roles with different authorizations. To
protect patient privacy, data should be anonymized prior to
access. Ultimately, then, the main focus of this interface must
be on providing easy and responsible access to well-protected
data.
VI. PROOF OF CONCEPT
After the elicitation of essential aspects of the underlying
architecture, the following chapter deals with the prototypical
implementation of our concept. In order to get a better
understanding of the users’ needs, the support of patients with
therapeutic homework was chosen as a reference scenario.
Furthermore, a selected Voice Assistant (VA), Amazon Alexa,
was utilized for the initial implementation. However, since this
is a generic approach, Alexa-specific aspects and limitations
are highlighted.
2https://www.w3.org/TR/speech-synthesis11/ (Accessed: 2022-04-11)
Since the server is considered the centerpiece of our ar-
chitecture, this component was implemented first. It consists
of our two previously designed interfaces, one handling voice
interactions and the other allowing scientists to access patient
data. Both interfaces, as well as the core of the architecture,
are based on the JavaScript server framework NodeJS3. On the
one hand, NodeJS is particularly suitable for dealing with the
previously mentioned web technologies for visually supported
response generation. On the other hand, both Google and
Amazon already offer SDKs for their assistants, which are
also available for NodeJS. Thus, both interfaces could easily be
implemented using the same language, which in turn increases
maintainability. Thanks to the microservice architecture pat-
tern, further developments are flexibly possible. This repre-
sents a major advantage, especially when considering that data
processing and evaluation could be done in a more suitable
language such as Python. To this end, all components have to
fulfill only one requirement to ensure communication: They all
have to implement the same communication paradigm, which
is based on the well-known REST principles.
MY THERAPY
Welcome, John.
Let’s start with today’s exercise.
Try, "Alexa, start my homework.“
Fig. 3: Exemplary personalized visual VA frontend
After implementing the core functionalities on the server,
the next important component of our presented architecture
is the audio-visual VA front-end. As mentioned earlier, our
proposed approach uses only established, standardized for-
mats. Therefore, Speech Synthesis Markup Language (SSML)
was used to implement the auditory part, while the visual
components are based on HTML, CSS and JavaScript (see
Figure 3). However, during the implementation it turned out
that Google Assistant only supports a subset of the SSML
specification [17]. Consequently, only a selection of SSML
tags supported by both reference assistants could be used.
Otherwise, the mediator would have to distinguish between
the different vendors when generating an assistant’s auditory
response. This in turn would increase the complexity of our
intermediate layer. The biggest challenge was therefore the
definition of this SSML tag subset.
The last component to be presented here is the dashboard
for professionals. This is a kind of reference implementa-
tion of a web front-end based on the server interface for
the professionals mentioned previously. Since it is only a
prototype, the security mechanisms described earlier have not
3https://nodejs.org/en/ (Accessed: 2022-04-11)
been implemented. What has been implemented, however, is a
visual dashboard interface for therapists and other Healthcare
Providers (HCPs). This web application could be used, for
example, to make ad-hoc adjustments to a patient’s treatment
plan, receive feedback that a user might give to their assistant,
or monitor the progress of a therapy. The underlying server
interface was also built on REST principles to meet the
requirements of contemporary technologies. In addition, the
web application was implemented using the JavaScript web
framework ReactJS.
VII. DISCUSSION
In the following section, we discuss how our proposed
architectural design meets the requirements from Table I.
The POC has shown that the use of widely used data formats
such as Speech Synthesis Markup Language (SSML) increases
interoperability with different clients. However, due to the
varying degrees of support for these formats, the off-the-shelf
interoperability is not as far as we had anticipated. Therefore,
this can be seen as another challenge for generalizing speech
agent interactions.
Another important aspect of the architecture is the natu-
ral conversation flow described earlier. By using contextual
information and historical data, we tried to make the user
interactions as natural as possible. However, the use of inter-
action models made the conversations less flexible. Since this
is a measure to reduce server complexity, this is considered a
tradeoff. However, to enable a truly natural and personalized
conversation, further development is needed.
The last important aspect of this discussion is the lack of
transparency of the Natural Language Understanding (NLU)
cloud services mentioned. This means that it is difficult to
understand how a user’s intention is detected. This raises
questions about data protection and possible dependencies in
particular.
VIII. SUMMARY AND OUTLOOK
As seen in section VII, there are some non-negligible
challenges that can arise in the generic integration of speech
assistance into the therapeutic context. This paper therefore
presented an approach for the corresponding technical im-
plementation (see section V). It illustrates the complexity
of designing, planning and implementing such a system.
Furthermore, it is shown that generalizing speech interactions
to understand the user’s intentions and provide personalized
healthcare support is one of the main challenges in such
scenarios. Especially when considering the differences within
the functional areas of the assistants. Finally, it was mentioned
that the lack of transparency of current Natural Language
Processing (NLP) cloud services increasingly raises questions
regarding the protection of personal health data. However, as
has been shown recently, vendors such as Amazon are already
aware of this [20]. In summary, there is great potential for
developments in the topic area covered.
REFERENCES
[1] M. Schickler, R. Pryss, M. Stach, J. Schobel, W. Schlee, T. Probst,
B. Langguth, and M. Reichert, “An it platform enabling remote
therapeutic interventions,” in 2017 IEEE 30th International Symposium
on Computer-Based Medical Systems (CBMS). IEEE, 2017, pp.
111–116. [Online]. Available: https://doi.org/10.1109/CBMS.2017.78
[2] Y. Liu, L. Wang, W. R. Kearns, L. Wagner, J. Raiti, Y. Wang, and
W. Yuwen, Integrating a Voice User Interface into a Virtual Therapy
Platform. New York, NY, USA: Association for Computing Machinery,
2021. [Online]. Available: https://doi.org/10.1145/3411763.3451595
[3] M. Schickler, R. Pryss, J. Schobel, and M. Reichert, “Supporting
remote therapeutic interventions with mobile processes,” in 2017 IEEE
International Conference on AI Mobile Services (AIMS), 2017, pp.
30–37. [Online]. Available: https://doi.org/10.1109/AIMS.2017.13
[4] A. Ermolina and V. Tiberius, “Voice-controlled intelligent personal
assistants in health care: International delphi study,” J Med Internet
Res, vol. 23, no. 4, p. e25312, Apr 2021. [Online]. Available:
https://doi.org/10.2196/25312
[5] H. Chen, X. Liu, D. Yin, and J. Tang, “A survey on dialogue systems:
Recent advances and new frontiers,” Acm Sigkdd Explorations Newslet-
ter, vol. 19, no. 2, pp. 25–35, 2017.
[6] (2022) Amazon alexa voice ai — alexa developer official site. [Online].
Available: https://developer.amazon.com/en-US/alexa
[7] (2022) Google assistant — google developers. [Online]. Available:
https://developers.google.com/assistant
[8] R. Kraft, A. R. Idrees, L. Stenzel, T. Nguyen, M. Reichert, R. Pryss,
and H. Baumeister, “esano–an ehealth platform for internet-and mobile-
based interventions,” in 2021 43rd Annual International Conference of
the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE,
2021, pp. 1997–2002.
[9] G. Vlaescu, A. Alasj¨
o, A. Miloff, P. Carlbring, and G. Andersson,
“Features and functionality of the iterapi platform for internet-based
psychological treatment,” Internet Interventions, vol. 6, pp. 107–114,
2016.
[10] D. Dojchinovski, A. Ilievski, and M. Gusev, “Interactive home healthcare
system with integrated voice assistant,” in 2019 42nd International
Convention on Information and Communication Technology, Electronics
and Microelectronics (MIPRO). IEEE, 2019, pp. 284–288.
[11] L. Qiu, B. Kanski, S. Doerksen, R. Winkels, K. H. Schmitz, and
S. Abdullah, “Nurse amie: Using smart speakers to provide supportive
care intervention for women with metastatic breast cancer,” in Extended
Abstracts of the 2021 CHI Conference on Human Factors in Computing
Systems, 2021, pp. 1–7.
[12] K. H. Schmitz, X. Zhang, R. Winkels, E. Schleicher, K. Mathis,
S. Doerksen, L. Cream, J. Rosenberg, R. Kass, M. Farnan, P. Halpin-
Murphy, R. Suess, D. Zucker, and M. Hayes, “Developing “Nurse
AMIE”: A tablet-based supportive care intervention for women with
metastatic breast cancer,” Psycho-Oncology, vol. 29, no. 1, pp. 232–236,
Jan. 2020. [Online]. Available: https://onlinelibrary.wiley.com/doi/10.
1002/pon.5301
[13] M. S. Broder, “Making optimal use of homework to enhance your
therapeutic effectiveness,” Journal of rational-emotive and cognitive-
behavior therapy, vol. 18, no. 1, pp. 3–18, 2000. [Online]. Available:
https://doi.org/10.1023/A:1007778719729
[14] F. Laricchia, “Global smart speaker market share 2021,” Mar
2022. [Online]. Available: https://www.statista.com/statistics/792604/
worldwide-smart-speaker-market-share/
[15] (2022) Create the interaction model for your skill.
[Online]. Available: https://developer.amazon.com/en-US/docs/alexa/
custom-skills/create-the-interaction-model-for-your-skill.html
[16] (2021, 08) Build conversation models. [Online]. Available: https:
//developers.google.com/assistant/conversational/build/conversation
[17] (2022, 03) Ssml (dialogflow). [Online]. Available: https://developers.
google.com/assistant/df-asdk/ssml
[18] (2022) Speech synthesis markup language (ssml) reference.
[Online]. Available: https://developer.amazon.com/en-US/docs/alexa/
custom-skills/speech-synthesis-markup-language-ssml-reference.html
[19] I. Nadareishvili, R. Mitra, M. McLarty, and M. Amundsen, Microservice
architecture: aligning principles, practices, and culture. O’Reilly
Media, Inc., 2016.
[20] (2022) Voice for health and wellness. [Online].
Available: https://developer.amazon.com/en-US/alexa/alexa-skills-kit/
get-deeper/custom-skills/healthcare-skills