SOFTWARE Open Access
Automatically exposing OpenLifeData via SADI semantic Web Services
Alejandro Rodríguez González¹, Alison Callahan², José Cruz-Toledo³, Adrian Garcia¹, Mikel Egaña Aranguren⁴, Michel Dumontier² and Mark D Wilkinson¹*
Abstract
Background: Two distinct trends are emerging with respect to how data is shared, collected, and analyzed within the bioinformatics community. First, Linked Data, exposed as SPARQL endpoints, promises to make data easier to collect and integrate by moving towards the harmonization of data syntax, descriptive vocabularies, and identifiers, as well as providing a standardized mechanism for data access. Second, Web Services, often linked together into workflows, normalize data access and create transparent, reproducible scientific methodologies that can, in principle, be re-used and customized to suit new scientific questions. Constructing queries that traverse semantically-rich Linked Data requires substantial expertise, yet traditional RESTful or SOAP Web Services cannot adequately describe the content of a SPARQL endpoint. We propose that content-driven Semantic Web Services can enable facile discovery of Linked Data, independent of their location.
Results: We use a well-curated Linked Dataset - OpenLifeData - and utilize its descriptive metadata to automatically configure a series of more than 22,000 Semantic Web Services that expose all of its content via the SADI set of design principles. The OpenLifeData SADI services are discoverable via queries to the SHARE registry and easy to integrate into new or existing bioinformatics workflows and analytical pipelines. We demonstrate the utility of this system through comparison of Web Service-mediated data access with traditional SPARQL, and note that this approach not only simplifies data retrieval, but simultaneously provides protection against resource-intensive queries.
Conclusions: We show, through a variety of different clients and examples of varying complexity, that data from the myriad OpenLifeData can be recovered without any need for prior-knowledge of the content or structure of the SPARQL endpoints. We also demonstrate that, via clients such as SHARE, the complexity of federated SPARQL queries is dramatically reduced.
Keywords: OpenLifeData, Bio2RDF, SADI, Semantic web services, SPARQL, SHARE, Sentient knowledge explorer, Galaxy
Background
Data integration is an ongoing challenge for biological informaticians, and is often a study unto itself, with numerous research groups worldwide approaching the problem from a variety of perspectives [1]. Integration is difficult for a variety of reasons, generally broken into the three core issues of syntax, structure, and semantics [2]. In addition, assigning and using unique identifiers for data items and concepts is an essential requirement in biology and elsewhere, and forms an equally disruptive barrier to successful integration [3]. Syntactic barriers include issues such as binary or textual format, and free-text or structured text; structural barriers involve such things as flat-file formats, and XML Schema; semantic barriers include inconsistent naming, naming conflicts (multiple things with the same name, or multiple names for the same thing) or insufficiently defined names; and finally identification issues involve non-unique identifiers, identifiers that can only be interpreted within a particular scope (e.g. in the context of a given database), non-opaque identifiers, and unstable or unpredictable identifiers.
The Semantic Web Initiative [4] has recently emerged with technologies and frameworks aimed at solving at least some of these problems. In particular, the Resource Description Framework (RDF [5]) is an entity-relationship data model that is, in principle, machine-readable and capable of representing any concept or data entity. RDF also proposes several approved syntaxes, aimed at maximizing machine-readability. Importantly, a query language has been developed for RDF - the SPARQL Protocol and RDF Query Language (SPARQL [6]) - and a protocol for exploring and retrieving the RDF stored in “SPARQL endpoints” on the Web is now well-established and, in our experience, highly consistent from implementation to implementation.

* Correspondence: [email protected]
¹ Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid, Madrid, Spain
Full list of author information is available at the end of the article

JOURNAL OF BIOMEDICAL SEMANTICS
© 2014 González et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
González et al. Journal of Biomedical Semantics 2014, 5:46
http://www.jbiomedsem.com/content/5/1/46
With RDF as its core, the Linked Data initiative [7] proposes several best practices that dramatically improve the discoverability and integration of data on the Web. First, all data entities and relationships must be identified by a Uniform Resource Identifier (URI), which guarantees uniqueness on a global scale. Second, URIs should resolve to data and metadata using the most common Web protocol, HTTP. Third, URIs should resolve to useful/informative information, and resource providers should offer this information in a variety of syntaxes that can be selected by HTTP content-negotiation; in particular, Linked Data resources should provide a means of retrieving the data in the form of RDF. Finally, this RDF should contain labeled links between that piece of data, and other pieces of data also identified by resolvable URIs, where the links indicate the relationship between the two data elements and are, themselves, resolvable URIs. With the ability to retrieve, share, and re-use these relationship definitions, we begin to move towards the “semantic” aspect of the Semantic Web.
Attempts to unify semantics have long been a focus of biomedicine. The medical world has engaged for centuries in the development of nosologies for naming and classifying diseases. Within the bioinformatics community, ontologies have become widely adopted in the past decade, with the most prominent of these being the Gene Ontology [8]. While such ontologies generally focus on consistent and sensible human-readable names, they have dedicated less attention to the unique identification of the concept - names in ontologies are not guaranteed to be globally unique, nor are concepts guaranteed to be uniquely named, and can appear in multiple ontologies. However, as these ontologies became encoded using the rules of Linked Data, aspects of these problems were also solved. Concepts became globally and uniquely identified by resolvable URIs, and shared concepts could be referred to by URI from one ontology to another. Moreover, modern ontologies’ use of the Web Ontology Language (OWL) [9] description logic to define the meaning of the “links” in Linked Data - effectively, the precise nature of the relationship between one data entity and another - enabled machines to automatically traverse these linkages in a meaningful way.
Two key issues remained problematic, however, even with Linked Data. First, there was no widely-used mechanism to ensure the stability and predictability of URIs representing data and concepts - for example, there was no way to predict the URI of the Protein Data Bank record for the Arabidopsis UFO protein, and even if this URI were determined, it might not be the same from one day to the next. As a result, individual Linked Data resources could not reliably link out to other Linked Data resources, because the URIs were unpredictable and unstable. Data tended to remain “siloed” even in the Linked Data world because links generally pointed inward, rather than outward, as a result of this instability and unpredictability. Second, the structure of what was returned when the URI of a piece of data was resolved was also not sufficiently predictable, and not consistent from site to site, even for the same type of data. While Linked Data is a significant improvement over XML Schema with respect to the predictability of its data structures, there were still no guidelines for how to arrange the relationships between pieces of data, or even what those relationships could/should be. It was these remaining problems that became the focus of the Bio2RDF project.
Bio2RDF is an open source project that uses Semantic Web technologies to create a sustainable infrastructure for publishing biological data in a manner that eases the task of data integration [10-12]. Bio2RDF scripts convert heterogeneously formatted data (e.g. flat-files, tab-delimited files, dataset specific formats, SQL, XML etc.) into RDF. Bio2RDF follows a set of basic conventions to generate and provide Linked Data which are guided by Tim Berners-Lee’s design principles and a set of community-established guidelines and practices. Specifically, entities, their attributes and relationships are named using a simple convention to produce Internationalized Resource Identifiers (IRIs) that are highly predictable in their structure, while statements are articulated using the lightweight semantics of RDF Schema (RDFS) and Dublin Core. Bio2RDF, however, did not reliably implement all of the requirements for “well behaved” linked data, such as HTTP content-negotiation, and had somewhat limited expressivity in its relationships as a result of using the semantics of RDF Schema. OpenLifeData provides customized services over Bio2RDF SPARQL endpoints. Its goal is to provide alternative user interfaces and application programming interfaces to Linked Open Data beyond what Bio2RDF currently does. OpenLifeData enriches Bio2RDF’s RDFS semantics to OWL expressivity, implements rich HTTP content-negotiation, and utilizes query-rewriting to resolve OpenLifeData IRIs and SPARQL queries against the Bio2RDF SPARQL endpoints.
OpenLifeData data is accessed by users either via the Web, through resolution of a URI to an HTML-representation of its data content in their browser, or by the submission of a SPARQL query to one of the OpenLifeData endpoints. While the data-types and relationships within each endpoint can be determined by manual exploration of the endpoint, SPARQL queries must nevertheless be constructed manually, and then posed against the appropriate endpoint(s). Extracting OpenLifeData Linked Data, therefore, remains a non-trivial task for even experienced bioinformaticians.
The 2014 release of OpenLifeData (based on Release 3 of Bio2RDF) developed a scheme to provide a pre-computed summary, or index, of the contents of each OpenLifeData SPARQL endpoint in order to reduce the computational load required for exploratory queries and enable new applications. Summary metrics were pre-computed, including: number of triples; number of unique subjects; number of unique predicates; number of unique objects; list and frequency of unique types; list and frequency of unique predicates; list and frequency of unique subject, predicate-unique object tuples; list and frequency of instances of subject type, predicate, and instances of unique object type; and finally number of links to other datasets. These indexes make it easier to determine the structure and content of each OpenLifeData endpoint, and moreover, the structures are highly consistent from endpoint to endpoint.
Semantic Automated Discovery and Integration (SADI) [13] is a set of design principles for exposing Web Services in a manner that simplifies their integration with other Semantic Web resources. Described simply, SADI Services are Web-based tools that consume a particular type of data, and return another type of data that is explicitly related to that input. For example, you could send DNA sequences to a SADI blastx service, and it would give you back Protein sequences that are connected to the original DNA sequence by the “hasProteinHomologyTo” relationship. Expressed more concretely, SADI services consume and produce RDF data, where instances of an input OWL class, represented in RDF, are submitted to the service by HTTP POST, and RDF instances of an output OWL class are returned in response. The constraint SADI places on these data is that the output class must be a specialization of the input class such that the input instances are related to the new service-generated data nodes through ontologically-defined relations. The result of chaining SADI services together, therefore, is an unbroken network of well-formed and ontologically-grounded Linked Data, which can be explored and traversed using standard tools such as SPARQL.
SHARE (Semantic Health And Research Environment) [14,15] is a SADI client that combines: a registry of the input and output OWL classes of all known SADI services, a service discovery and invocation API, an automated workflow design and enactment engine, and a logical reasoner. While other components of SHARE are discussed in detail in the previously-referenced papers, it is relevant to this manuscript that service discovery is achieved by indexing all known SADI services in the SHARE registry, such that input types, output types, and the properties that link them, are all rapidly searchable. This registry is made publicly available as a SPARQL endpoint, where the data model of the registry follows that of the myGrid serviceDescription [16] ontological class.
The similarity between the input-type, property, output-type “signature” of a SADI Web Service, and the subject-type, predicate, object-type indexes of the OpenLifeData endpoints provides a natural mechanism through which these two initiatives could be combined, such that OpenLifeData becomes discoverable and accessible via SADI. At the NBDC/DBCLS BioHackathon 2013 we proposed that it should be possible to automatically generate (a) formalized definitions of SADI services, (b) SPARQL queries to retrieve the service-appropriate data from the OpenLifeData endpoints, and (c) the SADI service code to serve that data, all by simply parsing the OpenLifeData indexes. This manuscript describes the realization of that vision. For the remainder of the manuscript we will use the short name ‘OpenLifeData2SADI’ as a convenient way of referring to the project as a whole.
Implementation
OpenLifeData’s content summaries are provided as RDF [17]. We utilize the Jena [18] Java libraries to parse the [Subject type - predicate - Object type] (SPO) triple patterns in these indexes, and additional indexes created specifically for this project, to generate sets of three configuration files used by OpenLifeData2SADI to serve each data-type within OpenLifeData. The first file contains two OWL ontological classes, describing the input and output data for the service. These ontologies are published on the Web such that the input and output class URIs are resolvable through HTTP GET. The second file is a summary containing the URIs of the input and output OWL Classes for that service, the human-readable class names, and the URI and name of the RDF predicate that links the two classes. Finally, a third file is generated that contains a SPARQL query template that, when filled-in with data and executed against the appropriate OpenLifeData endpoint, retrieves the output data appropriate for that service. We now describe in additional detail how each of these steps is undertaken.
Parsing the indexes
Each OpenLifeData dataset is served from its own SPARQL endpoint, and contains data within a specific namespace (e.g. ‘sgd’ for Saccharomyces Genome Database, or ‘ncbigene’ for the NCBI Gene). The content of each endpoint has been pre-indexed, using VoID (Vocabulary of Interlinked Datasets), where the index captures all unique data-type/predicate/data-type triples for that endpoint. For example, one of the index triples for the HGNC endpoint is “Gene Symbol/x-omim/Gene”. The Java collector first parses the information provided in the OpenLifeData indexes to obtain two parameters: SPARQL Endpoint URL and Namespace - effectively, the location of each dataset, and the domain/scope of that dataset. In principle, each endpoint could be interrogated to retrieve all SPO patterns by executing the following SPARQL query:
This would be sufficient to gather all information necessary to create SADI services that output resource nodes (URIs); however, at this time, OpenLifeData does not index the large component of data that exists as literal values (numbers and strings). As such, to be fully comprehensive, we execute an iterative set of queries over each endpoint which gathers all subject-types, then the predicates associated with each subject-type, and finally the object type that is connected by each predicate, including the cases where the object is a literal value. To further enrich the semantics, we then do federated queries over multiple end-points in an attempt to determine more specific details about the object types. For example, the omim dataset includes links to entities in the hgnc dataset, but considers all of these to be “Resources” - a generic term for something that exists in another dataset. Through our federated queries, we can determine that these hgnc “Resources” represent, for example, Genes, or SNPs, and thereby we are able to construct semantically richer descriptions of what the SADI services will consume/produce.
The queries we execute (in template form) are as follows:
Get Subject-types
Get Predicate-types
Get Object-types
Get Data-types
Federated Query for object types
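
The iterative harvesting described above can be pictured as three nested loops. The following is a minimal sketch in Python, not the project's Java/Jena collector: the SPARQL strings are generic patterns written for this illustration (they are assumptions, not the authors' templates), and `run_query` is a hypothetical callable that sends one query to an endpoint and returns the bound values as a list of strings. Literal-valued objects and the federated look-ups are deliberately left out.

```python
from typing import Callable, Iterator, List, Tuple

# Illustrative query patterns only (assumptions, not the published templates).
SUBJECT_TYPES = "SELECT DISTINCT ?stype WHERE { ?s a ?stype }"
PREDICATES = "SELECT DISTINCT ?p WHERE {{ ?s a <{stype}> ; ?p ?o }}"
OBJECT_TYPES = (
    "SELECT DISTINCT ?otype WHERE {{ ?s a <{stype}> ; <{pred}> ?o . "
    "?o a ?otype }}"
)


def harvest_spo(
    run_query: Callable[[str], List[str]]
) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject type, predicate, object type) patterns for one endpoint."""
    for stype in run_query(SUBJECT_TYPES):
        # Predicates used by instances of this subject type.
        for pred in run_query(PREDICATES.format(stype=stype)):
            # Types of the resources reached through that predicate.
            query = OBJECT_TYPES.format(stype=stype, pred=pred)
            for otype in run_query(query):
                yield (stype, pred, otype)
```

Passing the query function in as a parameter keeps the loop structure independent of any particular SPARQL client, so the same sketch can be exercised against a stub or a real endpoint.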
Configuration file creation
After retrieving all SPO patterns for each endpoint, OpenLifeData2SADI then builds the files needed to automatically configure the SADI Service; each SPO triple pattern becomes its own Service, where the service consumes data of the ‘Subject’ type, and returns all triples from that endpoint matching the SPO pattern for that Subject. For each Service, three configuration files must be created:
Input and output ontology classes
Using the Java OWL API [19] we create ontology classes based on the pattern of each SPO in each endpoint; these classes describe the OWL properties required for/provided by the Input and Output of the service respectively.
In OpenLifeData URIs the class/predicate identifier and namespace are separated either by the hash (#) or colon (:) characters. Since we intend that OpenLifeData2SADI services should “make sense” to both machines and humans, an attempt is made to construct a human-readable name for each class and property. The code first attempts to resolve the URI to retrieve its rdfs:label, and this label, if available, is used as the human-readable class/property name in the final configuration file for that service. If no label can be retrieved, the hash or colon separator is used to split a name from the rest of the URI and this is used as the human-readable name. While not entirely successful, this is our best attempt at automatically building services that have accurate human-readable descriptions.
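
The fallback rule for naming can be sketched in a few lines of Python. This is an illustration of the rule as described, not the project's Java code; the function name is hypothetical, the label is assumed to have been fetched elsewhere (no network access is performed here), and skipping the `://` of the URI scheme before looking for a colon is a choice made for this sketch.

```python
from typing import Optional


def readable_name(uri: str, label: Optional[str] = None) -> str:
    """Prefer a retrieved label; otherwise split on the last '#' or ':'."""
    if label:
        return label
    # Ignore the scheme separator so 'http:' is not mistaken for a namespace.
    start = uri.find("://") + 3 if "://" in uri else 0
    tail = uri[start:]
    cut = max(tail.rfind("#"), tail.rfind(":"))
    # With no separator at all, fall back to the full URI.
    return tail[cut + 1:] if cut >= 0 else uri
```

For an illustrative URI such as `http://bio2rdf.org/hgnc_vocabulary:Gene-Symbol` the function returns `Gene-Symbol`. A URI that carries a port number would defeat this heuristic, which is consistent with the caveat that the approach is not entirely successful.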
The input class (generically called ‘Subject_Class’ in this discussion) is defined in OWL, simply, as the rdf:type of the Subject of the SPO triple, as defined by OpenLifeData. It contains no other axioms or restrictions. The class representing the output of the service is then defined as the Subject_Class with an additional property defined by the Predicate of the SPO triple, where the range of that predicate is defined by the Object data-type component of the SPO triple. This is represented in Manchester Syntax as follows:
Logically, therefore, the Service output is a subclass of the Service input (Subject_Class), as is typical of all SADI services. A similar approach is taken for OpenLifeData predicates with Literal value ranges. The resulting ontology is then saved to the local filestore with the naming convention ./<namespace>/<subject_predicate_object>.owl and this is published on the Web such that the URIs in that ontology resolve correctly.
In a second phase, the process above is duplicated, but in this second iteration the OWL inverse (owl:inverseOf) of the Predicate is used, and Subject and Object are reversed. This allows us to automatically create SADI services that traverse the OpenLifeData in either orientation, and thus behave in a manner akin to conventional SPARQL, where either Subject or Object may be bound in a constraint clause of the query.
Configuration file
This file contains parameters required to properly configure the SADI service such that it (a) serves the appropriate data using the appropriate descriptors, and (b) provides its own metadata in a form that is comprehensible to humans. The Java code that creates these configuration files requires a single argument - the root URL of the final location of the ontologies (created above) on the Web. The configuration file contains the following parameters:
INPUTCLASS_NAME: The name of the input class after removing the namespace. In cases where the class name is opaque or numerical, an attempt is made to resolve the class URI to its full OWL-RDF definition, and retrieve the “label” property, such that the class name is human-readable.
INPUTCLASS_URI: The URI of the input class.
OUTPUTCLASS_NAME: The name of the output class. This is the same for all services, but conflicts are avoided since each output class name exists in a unique namespace (ontology); every output class is named “#ServiceOutput”.
OUTPUTCLASS_URI: The Web-resolvable URI of the output class. This URI is generated by the concatenation of the root URL, the namespace of the OpenLifeData dataset, the path and name of the ontology file, and the generic class-name “#ServiceOutput”.
PREDICATE_NAME: The name of the predicate. As with the input class name, an attempt is made to retrieve the human-readable label of the predicate if it appears that the predicate is somehow opaque or numerical.
PREDICATE_URI: The URI of the predicate.
ORIGINAL_ENDPOINT: The URL of the original endpoint indexed by OpenLifeData.
GENERIC_ENDPOINT: The endpoint that should be queried by the SADI service using SPARQL. OpenLifeData is duplicated in several locations; the preferred location to query would be the value of this field.
OUTPUT_CLASS: The rdf:type of the data that will be added during the service execution.
The resulting file is written to the local filestore in the same folder as the ontology file, with the naming convention ./<namespace>/<subject_predicate_object>.cfg.
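
To make the parameter list concrete, the sketch below assembles the nine values for one SPO pattern in Python. It is an illustration under stated assumptions rather than the project's Java generator: the function name, the `{"name": ..., "uri": ...}` input shape, and the exact way the root URL, namespace and file name are joined with slashes are all hypothetical; only the parameter names and the `<subject_predicate_object>.owl` / “#ServiceOutput” conventions come from the text.

```python
from typing import Dict

OUTPUT_LOCAL_NAME = "#ServiceOutput"  # generic output class name from the text


def service_config(
    root_url: str,
    namespace: str,
    subject: Dict[str, str],
    predicate: Dict[str, str],
    obj: Dict[str, str],
    original_endpoint: str,
    generic_endpoint: str,
) -> Dict[str, str]:
    """Assemble the configuration parameters for one SPO pattern.

    subject, predicate and obj are dicts with "name" and "uri" keys.
    """
    # File stem follows the <subject_predicate_object> naming convention.
    stem = f"{subject['name']}_{predicate['name']}_{obj['name']}"
    # Assumed layout: <root URL>/<namespace>/<stem>.owl
    ontology_url = f"{root_url.rstrip('/')}/{namespace}/{stem}.owl"
    return {
        "INPUTCLASS_NAME": subject["name"],
        "INPUTCLASS_URI": subject["uri"],
        "OUTPUTCLASS_NAME": OUTPUT_LOCAL_NAME,
        "OUTPUTCLASS_URI": ontology_url + OUTPUT_LOCAL_NAME,
        "PREDICATE_NAME": predicate["name"],
        "PREDICATE_URI": predicate["uri"],
        "ORIGINAL_ENDPOINT": original_endpoint,
        "GENERIC_ENDPOINT": generic_endpoint,
        "OUTPUT_CLASS": obj["uri"],
    }
```

Because every service receives its own ontology file, the shared “#ServiceOutput” fragment never collides: the file URL in front of it is what makes each output class URI unique.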
SPARQL query file
The third file generated by OpenLifeData2SADI contains the SPARQL query that should be executed within the business logic of the SADI Web service. The content of this query is service specific, but follows the pattern:
where <PREDICATE_NAMESPACE> is replaced with the namespace of the predicate provided by the SPO, and the <predicate> element is replaced with the local name of the predicate (the component after the ‘#’ or ‘:’ character). %VAR is left in the query template, and will be substituted by the SADI service at run-time, based on the input data.
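
The two substitution stages (predicate at generation time, %VAR at run-time) can be sketched as follows. This is a hedged illustration in Python: the query skeleton is a hypothetical stand-in written for this example and is not the template used by the project, and both function names are invented. Only the `%VAR` placeholder and the namespace/local-name split are taken from the description above.

```python
# Hypothetical skeleton; the doubled braces are literal braces for str.format.
QUERY_SKELETON = "SELECT ?obj WHERE {{ %VAR <{namespace}{local}> ?obj }}"


def make_template(predicate_uri: str) -> str:
    """Generation time: fix the predicate, leave %VAR for later."""
    cut = max(predicate_uri.rfind("#"), predicate_uri.rfind(":"))
    namespace, local = predicate_uri[: cut + 1], predicate_uri[cut + 1:]
    return QUERY_SKELETON.format(namespace=namespace, local=local)


def fill_template(template: str, input_uri: str) -> str:
    """Run time: substitute the URI of one incoming individual for %VAR."""
    # Refuse characters that could break out of the <...> IRI delimiters.
    if any(ch in input_uri for ch in '<>"{}|^`\\') or any(
        ch.isspace() for ch in input_uri
    ):
        raise ValueError(f"not a safe IRI: {input_uri!r}")
    return template.replace("%VAR", f"<{input_uri}>")
```

Validating the input before substitution is a design choice of this sketch: since the service builds queries from client-supplied data, rejecting malformed IRIs keeps a caller from injecting extra query clauses.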
In the case of the SPARQL query, there is no difference between the ‘forward’ Predicate and the inverse predicate. Inverse predicates do not exist in the OpenLifeData SPARQL endpoints, but rather are simply defined in the OWL logic that defines the entities and relationships in those endpoints. As such, we rely on logical reasoning to determine that an inverse invocation can be solved equally well by a ‘forward’ query; thus the query that serves both forward and inverse services is identical.
SADI service implementation
To serve the OpenLifeData data, a single Perl script using the standard SADI::Simple code libraries acts as the SADI Service Daemon for all services. The script listens for HTTP calls to URLs of the form:
In this URL, SADI is the name of the OpenLifeData2SADI Service script, while the additional path information (namespace and service name) are used as keys to access the configuration file and SPARQL query file appropriate for that service, as described above. The SADI Perl script parses these files, and configures itself to be capable of:
HTTP GET:
Returning the complete service interface definition, represented as an owl:Individual of the mygrid ontology ServiceDescription Class, as per the SADI design patterns.
HTTP POST:
Parsing the input data, which arrives in RDF syntax as owl:Individuals of that service’s Input OWL Class.
Executing the SPARQL query, extracted from the configuration files, against the correct OpenLifeData endpoint for that service, using each of the incoming owl:Individuals to fill the query variables for that particular invocation.
Constructing owl:Individuals compliant with the class definition of that service’s Output OWL Class, and passing this data back to the caller.
This is all accomplished using the normal SADI service template [13]. The key difference is that the Service’s interface template retrieves its values from a dynamic look-up of data from the configuration files, rather than being hard-coded into the service.
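
The dynamic look-up can be illustrated by mapping the extra path information to the two per-service files. The sketch below is in Python rather than the Perl of the actual script, and several details are assumptions made for illustration: the path shape `/<namespace>/<service name>`, the `.sparql` extension for the query file, and the function name. Only the `.cfg` naming and the use of namespace and service name as keys come from the text.

```python
from pathlib import PurePosixPath
from typing import Tuple


def service_files(path_info: str, base_dir: str = ".") -> Tuple[str, str]:
    """Map '/<namespace>/<service>' to (config file path, query file path)."""
    parts = [p for p in path_info.split("/") if p]
    # Exactly two keys are expected; reject attempts to climb directories.
    if len(parts) != 2 or any(p in (".", "..") for p in parts):
        raise ValueError(f"expected /<namespace>/<service>, got {path_info!r}")
    namespace, service = parts
    folder = PurePosixPath(base_dir) / namespace
    return str(folder / f"{service}.cfg"), str(folder / f"{service}.sparql")
```

With one script resolving every service this way, adding a new service amounts to writing its files into the right folder; no service-specific code is needed.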
Service registration
Two scripts were written to automate the registration and deregistration of the full suite of OpenLifeData2SADI services. The registration code and deregistration code are available in the Perl folder of the GitHub project (see “Availability and Requirements” section). They operate by querying all of the configuration files (for registration) or all of the existing SHARE registry entries (for deregistration) and triggering the registry to call GET on each service endpoint. The registry functions by creating a service if it finds a valid service description document at that endpoint, or deregistering a service if it does not. Therefore, in the case of registration, the SADI script should be installed on the designated service endpoint first, in order to respond to the registry calls. In the case of deregistration, the SADI Service code should be removed prior to running the deregistration script.
Workflows of Bio2RDF services
The establishment of the OpenLifeData2SADI suite of services made it possible to more easily explore the interconnections between OpenLifeData endpoints. In order to generate an exhaustive list of these connections, to assist third-parties in building novel exploration tools, the following query was issued which creates a list of all valid service-output to service-input pairs within the set of OpenLifeData services (note that the PREFIX directives in this example are shared for all queries in this manuscript, and will not be repeated in later examples):
Since this, in principle, represents the complete set of potential workflow connections that could be constructed within these services, we chose to formally represent the output of this query as an abstract workflow template, using the Open Provenance Model for Workflows (OPMW) Abstract Template ontology [20,21]. Those interested in generating a copy of this abstract template for their own exploration can simply execute the OpenLifeData2SADI2OPMW.pl script in the GitHub project, which will generate a copy based on the contents of the public SHARE registry. A copy generated at the time of writing is also available in the project’s Git repository (see “Availability” section).
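
The pairing logic behind that list of connections can be sketched without a registry. The Python below is an illustration only: it works on an in-memory mapping instead of the SHARE SPARQL endpoint, matches types by exact string equality (no reasoning over subclasses), and excludes a service chaining to itself; all three are simplifications chosen for this sketch, and the type names in the usage note are invented.

```python
from typing import Dict, List, Tuple


def chainable_pairs(services: Dict[str, Tuple[str, str]]) -> List[Tuple[str, str]]:
    """List (producer, consumer) pairs of service names.

    services maps a service name to (input type, output object type).
    A pair is valid when the producer's output type is the consumer's input.
    """
    pairs = []
    for producer, (_, out_type) in services.items():
        for consumer, (in_type, _) in services.items():
            if producer != consumer and out_type == in_type:
                pairs.append((producer, consumer))
    return sorted(pairs)
```

Given made-up types such as `{"Gene-Symbol_x-omim": ("hgnc:Gene-Symbol", "omim:Gene"), "Gene_x-mgi": ("omim:Gene", "mgi:Gene")}`, the function reports that the first service can feed the second.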
Provenance
Provenance of data is becoming increasingly important as datasets get larger, more dispersed over the Web, and as data gathering and analyses become more automated. The OpenLifeData2SADI project has selected the NanoPublication [22] conventions and model for passing provenance information to the client, along with the results of their service invocation. As with all SADI services, this is achieved through normal HTTP content negotiation. If the client passes an “Accept: application/n-quads” HTTP header, the OpenLifeData2SADI service will respond by returning three named graphs, constructed according to the NanoPublication specifications. One graph contains the service output; the second contains the metadata describing the service (for example, its name, description, and URL); and the third describes the date and time the NanoPublication was generated.
Results and discussion
At this time there are more than 22,000 OpenLifeData2SADI services from 26 independent endpoints, and more will be generated as OpenLifeData expands into new data-types. These services are discoverable through simple queries against the SHARE registry, or through a variety of client applications. We now demonstrate the utility of the OpenLifeData2SADI application by a series of walkthroughs, where the process of discovery, execution, and chaining-together of SADI-wrapped OpenLifeData services is described in more detail and compared to the interrogation of OpenLifeData directly via SPARQL.
We will start with a small fragment of RDF data representing a HUGO Gene Nomenclature Committee (HGNC) Gene Symbol:
Discovery of OpenLifeData2SADI services
Discovery of services is generally accomplished by executing a SPARQL query against the SHARE registry [23]. Discovery of the OpenLifeData2SADI services can be accomplished by a wide variety of query structures, but in this example we will query for services that consume OpenLifeData HGNC Gene Symbols and have “approved-name” somewhere in the service’s descriptive text. The query is:
This returns a single result, which is the URL of the service “hgnc_vocabulary_Gene-Symbol_hgnc_vocabulary_approved-name_string”.
Invocation of OpenLifeData2SADI services
Invocation of a discovered OpenLifeData data retrieval service simply consists of sending the data to the service endpoint using HTTP POST. This can be accomplished with widely available tools such as Unix ‘curl’. Below, the sample HGNC Gene Symbol record described earlier is in the file sampledata_hgnc.rdf. Curl is then used to invoke the service, as follows:
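
An equivalent request can also be prepared programmatically. The sketch below uses Python's standard `urllib.request` and only builds the request object; the function name is hypothetical, the service URL must be supplied by the caller (for instance the one returned by the discovery query), and treating the `.rdf` file as RDF/XML with an `application/rdf+xml` content type is an assumption of this sketch.

```python
import urllib.request


def build_invocation(
    service_url: str,
    rdf_path: str,
    accept: str = "application/rdf+xml",
) -> urllib.request.Request:
    """Prepare an HTTP POST carrying the input RDF to a SADI service."""
    with open(rdf_path, "rb") as handle:
        body = handle.read()
    return urllib.request.Request(
        service_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/rdf+xml",  # assumed input syntax
            "Accept": accept,
        },
    )


# Sending it (requires network access and a real service URL):
# with urllib.request.urlopen(build_invocation(url, "sampledata_hgnc.rdf")) as r:
#     print(r.read().decode("utf-8"))
```

Passing `accept="application/n-quads"` would request the NanoPublication form of the output described in the Provenance section.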
The result of this service invocation is the output data, containing the approved name from the OpenLifeData HGNC endpoint:
Client applications
We do not expect that our users will typically discover or access OpenLifeData2SADI services via SPARQL queries or the command-line. More commonly, the same discovery and invocation interactions presented in their raw form above are presented to the user graphically via one of the SADI plug-ins to client applications; nevertheless, discovery and invocation happen the same way as described above, regardless of the client. We believe that this simple standardization provides a very low barrier-to-adoption for new users and tool-developers who wish to gain access to the myriad OpenLifeData resources.
There is a wide range of graphical clients capable of executing SHARE registry queries in response to the user’s contextual needs, or in some cases, fully automatically. We will now present several of these applications, showing how OpenLifeData services can be accessed and chained-together within these diverse clients.
The services we will use for this demonstration are:
(1) Gene-Symbol_approved-name
(2) Gene-Symbol_x-omim
(3) Gene_gene-function
(4) Gene_article
(5) Gene_x-mgi
(6) Gene_x-uniprot
Services (1) and (2) link an HGNC resource to its approved name and a linked OMIM entry, services (3), (4), and (5) link an OMIM resource to a gene function description, its associated PubMed entries, and its associated Mouse Genome Informatics (MGI) Gene, while service (6) links an MGI gene identifier to its associated UniProt identifier. The template workflow connecting these services in a biologically-meaningful way is shown in Figure 1.
IO Informatics Knowledge Explorer
The SADI plug-in to the IO Informatics Knowledge Explorer [24,25] (KE) provides menu-driven access to the SHARE registry through a context menu that appears when right-clicking a piece of biological data on the KE canvas. In Figures 2A and B we show the same sample data from the examples above, loaded into the Knowledge Explorer. A right click reveals the “Find SADI Services” menu option, which then initiates a search based on the data-type that was selected. Here we have chosen the “approved-name” service from the resulting services menu by clicking the selection box. In Figure 2C the approved name for HGNC:7 has been added as new information to the canvas. Figure 2D shows the final result after a series of OpenLifeData2SADI services have been executed, following the workflow path in Figure 1.
SHARE
The SHARE client [14] is one of several SADI clients capable of chaining multiple services together. We will utilize this client to emphasize the fact that, as a result of exposing OpenLifeData data as SADI services, it is no longer necessary to know which data exists in which of the 26+ OpenLifeData SPARQL endpoints. In this use case we imagine that a researcher has studied some human condition, has narrowed-down to a specific gene list of interest, and now wants to know more about those genes, their functions, and whether or not the proteins might be suitable drug targets based on known protein information from their respective Mouse homologues. Diagrammatically, the workflow is as shown in Figure 2 (using the service numbers and starting-data from above).
The SHARE interface is at http://dev.biordf.org/cardioSHARE. SHARE exposes SADI Web Services as if they were combined into a single, global, SPARQL endpoint. The SHARE SPARQL query that will invoke the workflow from Figure 2 is:
Note that it was not necessary to know which endpoint contained which data elements, nor to use “service” queries to federate over these endpoints. This is important when considering the complex structure of federated SPARQL queries, where it is necessary to know the location of the endpoint, and in some cases, the named-graph that must be queried. For example, the equivalent SPARQL query over the OpenLifeData endpoints would be as follows:
As such, we believe that OpenLifeData2SADI makes the exploration across the more than 20 OpenLifeData data endpoints considerably more straightforward.
Galaxy
The Galaxy [26] workflow environment is very popular among life scientists, yet to date, we know of no Galaxy workflow that accesses OpenLifeData or Bio2RDF data. This is likely due to the lack of life science tools and services that deal with RDF-formatted data at all, and the lack of a straightforward template for mapping data between a workflow and a SPARQL query (and back again). The SADI Galaxy plugin [27,28] provides SADI services as normal Galaxy tools [29], thus making it straightforward to chain OpenLifeData services together in the Galaxy environment. Figure 3 shows the same workflow as above, created within the Galaxy workbench. In order to reproduce the workflow, it is necessary to create, for yourself, a user on our Galaxy server [30] and import the history and workflow [31,32]. The first item of the history can be used as the input for the workflow to reproduce the results reported here.

Figure 1 A workflow of OpenLifeData2SADI services, numbered as in the list of services above, and the output data that will result.
Limitations and scalability
The use of SADI to expose data in SPARQL endpoints clearly adds a certain amount of overhead with respect to both execution-time and computational load; however, it is difficult to directly compare the two scenarios because (a) speed and load depend on the client, and web service clients are significantly different from one another, and from SPARQL clients; (b) the time (and knowledge) required to manually construct each desired SPARQL query, compared to the automated dynamic discovery of appropriate SADI services, is not considered in a head-to-head comparison of their respective execution times; and (c) it is considerably easier to optimize the execution plan of a SPARQL query, versus a service workflow. Nevertheless, a direct comparison of the “federated” query entered into the SHARE client (above) versus the equivalent federated SPARQL query entered into the Virtuoso web-based query interface, showed execution times of 34-39 seconds for SHARE compared to 2-3 seconds for Virtuoso. Thus, while the overhead of
Figure 2 Discovery and invocation of OpenLifeData2SADI services using the SADI plugin to the Sentient Knowledge Explorer. A. Data nodes respond to a right-click with a context menu item “Find SADI Services”. B. A set of services capable of consuming nodes of that type are discovered and presented in a menu-like manner. C. The result of selecting the “approved-name” service from the menu. D. The output after iteratively invoking all 6 of the services from the example service list (effectively, manually executing the workflow in Figure 1).