Interactive Cloud-Based Platform for Parallelized Machine Learning of Astronomical Big Data

Author: Koza, Jakub
Publisher: Zenodo
DOI: 10.5281/zenodo.17537158
Source: https://zenodo.org/records/17537158/files/F8-DP-2017-Koza-Jakub-thesis.pdf
CZECH TECHNICAL UNIVERSITY IN PRAGUE
FACULTY OF INFORMATION TECHNOLOGY
ASSIGNMENT OF MASTER’S THESIS
Title: Interactive Cloud-Based Platform for Parallelized Machine Learning of Astronomical Big
Data
Student: Bc. Jakub Koza
Supervisor: RNDr. Petr Škoda, CSc.
Study Programme: Informatics
Study Branch: Web and Software Engineering
Department: Department of Software Engineering
Validity: Until the end of summer semester 2017/18
Instructions
The goal is the integration of currently independent components, such as VO-Cloud, Jupyter, Spark, HDFS,
and machine learning libraries in a flexible environment. The resulting system will allow the web-controlled
data acquisition, job scheduling and interactive visualisation of results to a number of independent users
conducting machine learning experiments on a parallel computing infrastructure.
1) Make a survey of feasible integration techniques.
2) Analyse and design the workflow to query and acquire data, to preview and clean them, to apply
pre-processing and optional dimensionality reduction, to send them to cloud for machine learning, and to return
the results to visualisation in Jupyter in a seamless and scriptable way.
3) Design the required missing modules and control logic.
4) Realise the platform and try to make it easily portable using virtual environment as Docker.
5) Discuss user experience and performance of your solution and suggest future improvements.
References
Will be provided by the supervisor.
Ing. Michal Valenta, Ph.D.
Head of Department
prof. Ing. Pavel Tvrdík, CSc.
Dean
Prague January 28, 2017
Czech Technical University in Prague
Faculty of Information Technology
Department of Software Engineering
Master’s thesis
Interactive Cloud-Based Platform for
Parallelized Machine Learning of
Astronomical Big Data
Bc. Jakub Koza
Supervisor: RNDr. Petr Škoda, CSc.
9th May 2017
Acknowledgements
I would like to thank my supervisor, RNDr. Petr Škoda, CSc., for his help and
for giving me this opportunity, and Jana Doležalová for long-lasting support.
This research was supported by the grant COST LD-15113 of the Ministry of
Education Youth and Sports of the Czech Republic.

Declaration
I hereby declare that the presented thesis is my own work and that I have
cited all sources of information in accordance with the Guideline for adhering
to ethical principles when elaborating an academic final thesis.
I acknowledge that my thesis is subject to the rights and obligations
stipulated by the Act No. 121/2000 Coll., the Copyright Act, as amended. In
accordance with Article 46(6) of the Act, I hereby grant a nonexclusive
authorization (license) to utilize this thesis, including any and all computer
programs incorporated therein or attached thereto and all corresponding
documentation (hereinafter collectively referred to as the “Work”), to any and all
persons that wish to utilize the Work. Such persons are entitled to use the
Work in any way (including for-profit purposes) that does not detract from its
value. This authorization is not limited in terms of time, location and
quantity. However, all persons that makes use of the above license shall be obliged
to grant a license at least in the same scope as defined above with respect to
each and every work that is created (wholly or in part) based on the Work, by
modifying the Work, by combining the Work with another work, by including
the Work in a collection of works or by adapting the Work (including
translation), and at the same time make available the source code of such work at
least in a way and scope that are comparable to the way and scope in which
the source code of the Work is made available.
In Prague on 9th May 2017 . . . . . . . . . . . . . . . . . . . . .
Czech Technical University in Prague
Faculty of Information Technology
© 2017 Jakub Koza. All rights reserved.
This thesis is school work as defined by Copyright Act of the Czech Republic.
It has been submitted at Czech Technical University in Prague, Faculty of
Information Technology. The thesis is protected by the Copyright Act and its
usage without author’s permission is prohibited (with exceptions defined by the
Copyright Act).
Citation of this thesis
Koza, Jakub. Interactive Cloud-Based Platform for Parallelized Machine
Learning of Astronomical Big Data. Master’s thesis. Czech Technical University
in Prague, Faculty of Information Technology, 2017.
Abstrakt (translated from Czech)
VO-CLOUD is a distributed system that provides users with storage space and
computing power for creating computationally demanding astronomical
experiments through a web browser interface. The goal of this Master’s thesis is to
design and implement new components and to integrate them into the
VO-CLOUD system in order to add the ability to visualise files of astronomical
spectra, to use the Jupyter Notebook technology, which provides users with an
environment for interactive experimenting, and to use a Hadoop computing
cluster together with the Apache Spark technology.
Keywords VO-CLOUD, Virtual Observatory, Hadoop, Spark, Jupyter,
Docker, Java EE, UWS, astroinformatics
Abstract
The VO-CLOUD is a distributed system capable of providing users with a
storage and computability to conduct astronomical experiments in a web based
environment. The aim of this Master’s thesis is to design and implement
additional components and to integrate them to the VO-CLOUD system in order to
add capabilities to visualise astronomical spectra files, to provide users with

Introduction
Research of the night sky is nowadays not focused only on data acquisition
using big astronomical telescopes that regularly produce big amounts of
data. The crucial part of the research is to actually unearth significant
information inside those data. VO-CLOUD is a distributed system that has been
developed to help astronomers with exactly this part. It allows astronomers to
acquire data from big astronomical archives, execute preprocessing and data
mining jobs on distributed workers and visualize the final results of the
specific data mining method to the user. However, the problem is that currently
there is no way to visualise or explore data that are already stored on the
VO-CLOUD server. The visualization is critical because the astronomers should
be able to review the state of spectra in every stage of spectra processing.
The aim of this thesis is to analyse the present workflow and deployment of
the VO-CLOUD server and to design a solution that would allow a user to
visualize astronomical spectra inside a web browser application and explore
them easily using an integrated Jupyter web application. Further, the presented
thesis examines ways in which the VO-CLOUD server can be extended to
allow a user to enqueue jobs that use the Apache Spark framework for large-scale
data processing and the Hadoop Distributed File System (HDFS) as a storage for
this kind of jobs. Lastly, the possibility of involving the Docker
container platform technology is examined in order to facilitate the deployment
of certain parts of the VO-CLOUD system.
Chapter 1
Technology overview
VO-CLOUD is a complicated system that adopts many concepts and
technologies that the reader should understand before reading this Master’s thesis. Also,
the task of the work is to integrate additional technologies into the already created
system. This chapter is dedicated to the explanation of these concepts and
technologies that will be used later on in the text.
1.1 Virtual Observatory
The Virtual Observatory (VO) concept is nowadays very popular among the
astronomy community. Whereas in the past astronomers had to wait even a
couple of months to access the telescope, today they can practically instantly
access the data they want using the concept of VO. Virtual Observatory addresses
challenges such as data management, analysis, distribution and
interoperability [1].
“The VO is a system in which the vast astronomical archives and
databases around the world, together with analysis tools and
computational services, are linked together into an integrated
facility.” [1]
The VO concept and additional associated technologies and
recommendations have been developed by the International Virtual Observatory Alliance
(IVOA). The IVOA is an organisation with a mission to “facilitate the
international coordination and collaboration necessary for the development and
deployment of the tools, systems and organizational structures necessary to
enable the international utilization of astronomical archives as an integrated
and interoperating virtual observatory.” [2]
The VO-CLOUD system is tightly connected with the concept of Virtual
Observatory. It allows a user to obtain data from remote services
implementing the VO principles using special VO protocols, store the data in the
provided storage, preprocess them to an appropriate format and apply a
specific data mining method. All of this can be operated easily within a user’s
web browser.
1.2 Java EE
The whole current solution of the VO-CLOUD system is built upon the Java
Platform, Enterprise Edition (Java EE). The Java EE platform
is an extension of the Java SE platform (Standard Edition) which provides
the core functionality of the Java programming language. Java EE
enriches the Java SE platform with additional concepts and technologies that are
mostly used in server multi-tiered environments and makes the development
of Java server applications much easier.
“The aim of the Java EE platform is to provide developers with
a powerful set of APIs while shortening development time,
reducing application complexity, and improving application
performance.” [3]
Unlike the Java SE platform, where every built application can be executed
directly on the Java Virtual Machine (JVM), the environment where every
Java application is running, Java EE applications are usually deployed into
an environment that supports all Java EE technologies that the application
intends to utilize. This environment is called a Java EE server. The Java EE
server is an application that implements APIs from the Java EE platform and
provides the standard Java EE services [3]. There are many implementations
of the Java EE server. The reference implementation, originally started by
Sun Microsystems and nowadays developed by Oracle Corporation, is an
open-source server called GlassFish¹. Of the other implementations, some are
open-source and others are commercial. The one
that is necessary to mention here is the open-source server WildFly², originally
developed by JBoss and now continuously developed by Red Hat. WildFly
is the server on which the VO-CLOUD system is currently running.
The Java EE specification contains many technologies that should simplify
development of server-side applications. The following sections are dedicated
to the explanation of Java EE technologies that are related to the VO-CLOUD
system.
¹ https://glassfish.java.net/
² http://wildfly.org/

1.2.1 Java Persistence API
The Java Persistence API (JPA) is a technology that considerably simplifies
usability of relational databases inside Java EE applications using a principle
called Object-Relational Mapping (ORM).

@Entity
public class UserAccount implements Serializable {
    private static final long serialVersionUID = 1L;

    @Id @GeneratedValue(strategy = GenerationType.AUTO)
    private Long id;

    @Column(unique = true, nullable = false)
    @Pattern(regexp = "[a-zA-Z]+([a-zA-Z0-9]" +
                      "[.-]?)+[a-zA-Z0-9]")
    private String username;

    @NotNull
    private String passwordHash;

    public UserAccount() {}

    // getters, setters, equals, hashCode methods
}
Figure 1.1: Fragment of simple user account Entity JPA class
“The Java Persistence API (JPA) is a Java standards-based
solution for persistence. Persistence uses an object relational mapping
approach to bridge the gap between an object-oriented model and
a relational database.” [3]
A programmer implementing an application using a relational database does
not have to have any knowledge of Structured Query Language (SQL), a
language that is used for querying and manipulating data inside relational
databases. The JPA framework does everything for him. A programmer simply
implements a Java class (in terms of JPA called an Entity class) and annotates
it and its attributes with special Java annotations. The framework creates a
mapping between these objects and tables inside a relational database.
Figure 1.1 demonstrates a fragment of a simple Entity class representing a user
account. The class will be mapped to the table useraccount inside the relational
database containing exactly 3 columns: id, username, passwordhash. The
JPA framework provides a special class EntityManager that provides an API for
communication with the relational database using the above mentioned Entity
classes. If a programmer requires more complicated database queries he can
also use the Java Persistence Query Language (JPQL), a simple string-based
language similar to SQL used to query entities and their relationships [3].
One great advantage of using the JPA framework is the fact that an
application does not have to know any information about the database itself. The
application only specifies the name of a so-called Persistence Unit. The
Persistence Unit is configured on the Java EE server and the configuration consists
of items such as database connection URL, login credentials, query timeouts,
connection drivers and many others. The principle of pulling configuration
from the application to the server ensures portability of the application. In fact, an
application on one server can use for example a PostgreSQL³ relational database
and it can also be redeployed without recompilation to a server utilizing
a MySQL⁴ relational database.
1.2.2 Java Servlet Technology
The Java Servlet Technology is very important in the Java EE specification
because many other technologies are built upon it.
“A servlet is a Java programming language class used to extend
the capabilities of servers that host applications accessed by means
of a request-response programming model. Although servlets can
respond to any type of request, they are commonly used to extend
the applications hosted by web servers. For such applications, Java
Servlet technology defines HTTP-specific servlet classes.” [3]
Servlets have a very simple lifecycle. The lifecycle is controlled by the web
container of the Java EE server where the servlet has been deployed. When
a request is mapped to a servlet, the container performs the following steps in
order to serve a response. [3]
1. If the container does not contain an instance of the servlet, the container:
a) loads the servlet class if it has not been done already,
b) creates an instance of the servlet class,
c) calls the servlet’s method init to perform servlet initialization.
2. Calls the service method of the servlet instance with two method
parameters representing the servlet request object and the servlet response object.
The container can also decide that a servlet instance is no longer necessary and
remove it from the container. Before it does so it finalizes the servlet by calling
the method destroy.
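The lifecycle steps above can be pictured with a small container simulation in plain Java. This is only a sketch: MiniServlet, MiniContainer and LifecycleDemo are hypothetical names invented for illustration, not part of the real javax.servlet API, and a real web container additionally deals with class loading, threading and configuration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for javax.servlet.Servlet (illustration only).
interface MiniServlet {
    void init();
    String service(String request);
    void destroy();
}

// Hypothetical container that drives the lifecycle described in the text.
final class MiniContainer {
    private final Map<String, Supplier<MiniServlet>> factories = new HashMap<>();
    private final Map<String, MiniServlet> instances = new HashMap<>();

    void register(String path, Supplier<MiniServlet> factory) {
        factories.put(path, factory);
    }

    // Step 1: create and initialise the servlet on first use.
    // Step 2: call its service method with the request.
    String handle(String path, String request) {
        MiniServlet servlet = instances.get(path);
        if (servlet == null) {
            Supplier<MiniServlet> factory = factories.get(path);
            if (factory == null) {
                throw new IllegalArgumentException("No servlet mapped to " + path);
            }
            servlet = factory.get();
            servlet.init();
            instances.put(path, servlet);
        }
        return servlet.service(request);
    }

    // The container may drop an instance; it calls destroy() before doing so.
    void unload(String path) {
        MiniServlet servlet = instances.remove(path);
        if (servlet != null) {
            servlet.destroy();
        }
    }
}

final class LifecycleDemo {
    // Records the order of lifecycle calls for two requests and one unload.
    static List<String> trace() {
        List<String> log = new ArrayList<>();
        MiniContainer container = new MiniContainer();
        container.register("/hello-world", () -> new MiniServlet() {
            public void init() { log.add("init"); }
            public String service(String request) {
                log.add("service");
                return "Hello, world!";
            }
            public void destroy() { log.add("destroy"); }
        });
        container.handle("/hello-world", "GET");
        container.handle("/hello-world", "GET");
        container.unload("/hello-world");
        return log;
    }

    public static void main(String[] args) {
        System.out.println(trace()); // [init, service, service, destroy]
    }
}
```

The trace shows that init runs once, service runs once per request, and destroy runs only when the container discards the instance.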
From the implementation point of view the Java Servlet Technology is
implemented in packages javax.servlet and javax.servlet.http. The first
package contains interface Servlet that every servlet class must implement.
The most important methods of this interface are aforementioned methods
init, service and destroy [4]. The first package also contains one of the
classes implementing the Servlet interface – GenericServlet. This class can be
used to implement a generic service – protocol independent servlet.

@WebServlet("/hello-world")
public class HelloServlet extends HttpServlet {
    private static final long serialVersionUID = 1L;

    public void doGet(HttpServletRequest req,
                      HttpServletResponse res)
            throws ServletException, IOException {
        res.setContentType("text/plain");
        PrintWriter out = res.getWriter();
        out.println("Hello, world!");
        out.close();
    }
}
Figure 1.2: Servlet code fragment example

³ https://www.postgresql.org/
⁴ https://www.mysql.com/
The most important subclass of GenericServlet is HttpServlet from
package javax.servlet.http that provides an abstract implementation of
the HTTP protocol [4]. The method service in the implementation of this class
delegates requests to one of the methods doXXX where XXX is one of the methods
of the HTTP protocol (GET, HEAD, OPTIONS, POST, PUT, TRACE or DELETE). Figure
1.2 demonstrates an example of a simple “Hello, World!” servlet application.
The application returns the string “Hello, world!” whenever the HTTP GET method
is called on the servlet’s endpoint (for example when a web browser connects to
the servlet’s endpoint URL).
It is important to understand what the servlet’s endpoint actually is and
how to specify it. As can be seen in the example 1.2, the class
HelloServlet is annotated with the WebServlet annotation. The value of this
annotation specifies a path relative to the path of the deployed Java EE
application. For instance, if the servlet application is deployed on the URL
address http://example.org/app, the servlet endpoint from the example 1.2
is http://example.org/app/hello-world. The WebServlet annotation can
also be replaced by utilizing a configuration file web.xml, where deployment
configurations are specified in XML format. An example of such a
configuration of HelloServlet from the example 1.2 can be seen in the figure 1.3.
<?xml version="1.0" encoding="UTF-8"?>
<web-app version="3.0" ...>
  <servlet>
    <servlet-name>HelloServlet</servlet-name>
    <servlet-class>servlet.HelloServlet</servlet-class>
  </servlet>
  <servlet-mapping>
    <servlet-name>HelloServlet</servlet-name>
    <url-pattern>/hello-world</url-pattern>
  </servlet-mapping>
</web-app>
Figure 1.3: Servlet mapping configuration

1.2.3 JavaServer Faces
JavaServer Faces (JSF) is an important technology of the Java EE platform
that focuses on simplification of web user interface development. It is built
upon Java Servlet technology. In contrast to Java Servlet technology where
servlet class contains implementation of both presentation and behavioral part
of user interface, JavaServer Faces framework splits these parts to different
units.
“One of the greatest advantages of JavaServer Faces technology is
that it offers a clean separation between behaviour and
presentation for web applications.” [3]
The implementation of the web user interface of a Java EE application using the
JavaServer Faces framework consists of two different types of files: XHTML
files and so-called Managed Beans. XHTML files represent the presentation part
of the user interface – the visual side of one page in standardized XML format [3].
There are two different types of XML tags that can be used inside the XHTML
format: standard HTML tags and special JSF tags. Whereas HTML tags
have no special meaning for the JSF framework and they are mostly passed
directly to the client’s web browser, the JSF tags add additional functionality
beyond the static HTML pages. They allow data changes, actions and
events of the page to be bound to Java methods specified in Managed Beans using a special
syntax called Expression Language [5]. JSF tags can represent any web view
component from a single text field to a complicated data table offering sorting
and filtering functionality. For example the following fragment of XHTML code
represents a simple input text field of the HTML input form:
<h:inputText value="#{foo.username}" />
The #{...} syntax is in fact the Expression Language that binds the value of
this input field to the username field of the Managed Bean named foo. The
binding has two functionalities:
• When a page is being rendered to a client the value of the input field
is set to the value of the username field in the foo Managed Bean.
• When the input form is filled and submitted back to a server by the client
the new value of this input field is stored in the Managed Bean.
The set of JSF tags is easily extendable by using additional XML
namespaces in the root tag of the XHTML document [6]. Using this principle, one
can use additional extended components that are not available in the pure JSF
framework. The VO-CLOUD system uses one of such popular JSF extensions –
the open source framework named PrimeFaces⁵.
A Managed Bean by definition [3] is a simple Java class that must fulfil
the following rules:
• It must have a parameterless constructor so that it can be
instantiated at any time without parameters.
• It must have a defined name that is available to Expression Language.
Managed beans usually have a default name inferred from the name of the
Java class, however it can be renamed using XML descriptor files or
Java annotations.
• It must define a scope.
Choosing the right scope for every Managed Bean is an important part
of application design. “Scope defines how application data persists and is
shared.” [3] The two most important scopes are request and session scopes.
Data inside Managed Beans annotated with request scope survive only for
a single client’s HTTP request, whereas data from session scoped beans are
saved to a session storage of the specific client and they survive multiple HTTP
requests. The drawback of the session scoped approach is the fact that session
storage consumes memory of a server and it complicates scalability of the
application because it requires replication of the session storage to additional
server instances.
1.2.4 Enterprise JavaBeans
Enterprise JavaBeans (EJB) is a powerful technology that is also part of the
Java EE specification. An enterprise bean is a component that runs inside the EJB
container, a runtime environment in the Java EE server [3]. It is important
to note that not every Java EE server implements the EJB container
and so an application written with the EJB functionality cannot be deployed to
such servers. For example the Apache Tomcat⁶ server supports many Java EE
technologies, however it is only a servlet-based server and so there is no support
for EJB components. Parts of the VO-CLOUD system use EJB, thus it is
necessary to deploy them to an EJB enabled server such as the earlier mentioned
GlassFish or WildFly server.

⁵ https://www.primefaces.org
⁶ http://tomcat.apache.org
2. Analysis of the current solution
2.1. Architecture
2.1.1 Master server
The Master server is a web application completely written in the Java EE platform.
Responsibilities of the Master server are the following:
1. Provide a web GUI for communicating with the experimenting user’s web
browser application.
2. Provide storage where users can save their data and use them for further
experiments and provide a web interface to manage the storage.
3. Allow users to upload new files to the Master server’s storage directly
from the user’s device.
4. Allow users to download new files to the Master server’s storage from
the passed HTTP or FTP resource URL.
5. Allow users to download new files from spectra archives using the special
astronomical protocols SSAP and DataLink.
6. On the user’s request enqueue a new computational job to the Universal
worker, await the job’s completion, download results back and present them
to the user.
7. Allow the user to abort a currently running job.
The whole web user interface is written using JavaServer Faces (JSF)
technology. The Master server also requires a database for persisting multiple pieces
of information, for example user accounts and the execution jobs that have
been created by individual users. VO-CLOUD uses the Java Persistence API
(JPA) framework for utilizing the persistence storage. Moreover, for
simplification of transaction and security management the VO-CLOUD Master server
utilizes Enterprise JavaBeans technology. This means that the Master server
has to be deployed on a Java EE server containing an EJB container and thus
supporting EJB technology.
Users communicate only with the user interface of the Master server.
Every user that wants to work with the VO-CLOUD system must be authenticated
and thus one of the Master server’s additional responsibilities is to offer a
registration form for new users. When a user logs in to the system, he can perform a set
of operations depending on his authorization level – the user role. VO-CLOUD
distinguishes between the three following user roles:
• USER – A user with this role has read-only access to the VO-CLOUD’s
storage and can create new jobs from the set of non-restricted job types.
• MANAGER – A user with this role has, in addition to the USER role, write
permissions to the system’s storage and he can also create jobs of a
restricted type.
• ADMIN – A user with this role has, in addition to the MANAGER role, permissions
to view jobs of all users in the system, to change users’ settings (e.g.
set a new password or change a user role) and to change the configuration of
available Universal workers.
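Because each role extends the one before it, the three roles can be modelled as an ordered enumeration. The following is a minimal sketch, not the thesis implementation: the Operation names and the "minimum role" mechanism are assumptions made for illustration and only mirror the permissions listed above.

```java
// Roles ordered from least to most privileged, as listed in the text.
enum UserRole {
    USER, MANAGER, ADMIN;

    // True when this role has at least the privileges of the other role.
    boolean includes(UserRole other) {
        return compareTo(other) >= 0;
    }
}

// Hypothetical operation names; each records the lowest role allowed to use it.
enum Operation {
    READ_STORAGE(UserRole.USER),
    CREATE_NON_RESTRICTED_JOB(UserRole.USER),
    WRITE_STORAGE(UserRole.MANAGER),
    CREATE_RESTRICTED_JOB(UserRole.MANAGER),
    VIEW_ALL_JOBS(UserRole.ADMIN),
    CHANGE_USER_SETTINGS(UserRole.ADMIN),
    CONFIGURE_WORKERS(UserRole.ADMIN);

    private final UserRole minimumRole;

    Operation(UserRole minimumRole) {
        this.minimumRole = minimumRole;
    }

    boolean allowedFor(UserRole role) {
        return role.includes(minimumRole);
    }
}

final class RoleDemo {
    public static void main(String[] args) {
        System.out.println(Operation.WRITE_STORAGE.allowedFor(UserRole.USER));    // false
        System.out.println(Operation.WRITE_STORAGE.allowedFor(UserRole.MANAGER)); // true
        System.out.println(Operation.READ_STORAGE.allowedFor(UserRole.ADMIN));    // true
    }
}
```

Relying on declaration order keeps the sketch short; an explicit permission set per role would be the safer choice if the roles ever stopped being strictly nested.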
The storage of the VO-CLOUD system is directly mapped to the filesystem
where VO-CLOUD has been deployed – it has a tree structure of files and folders.
There are in total five ways to get data to the system’s storage:
1. A user can directly upload files from his local device through
VO-CLOUD’s web user interface.
2. A user can command the server to download a file or files from remote
locations using the FTP or HTTP protocol. By using this method the server can
download multiple files if the passed location points to a folder on an FTP server
or to a directory listing served over HTTP.
3. A user can command the server to download astronomical spectra from
VO databases using the protocols SSAP and DataLink. These two
protocols have been developed as IVOA recommendations. SSAP is basically
a protocol that allows querying astronomical spectra fulfilling specified
filter conditions and it returns a list of spectra together with
metadata [10]. These spectra can be either directly downloaded, or, if the VO
service supports it, the DataLink protocol can be used to apply
additional spectra transformations on the service provider’s side before a
download [11].
4. A user can command the server to store an output of any computational
job to the system’s storage.
5. VO-CLOUD’s storage can also be modified by directly modifying the folder
structure on the server where the Master server has been deployed
(e.g. by connecting to the server directly through the SSH protocol and
modifying data using terminal commands).
The Master server’s user interface also offers operations to download selected files
from the server’s storage to the user’s device, to delete selected files, to create new
folders, to rename files, etc. Note that operations changing the storage’s state
can be issued only by users that have the user role ADMIN or MANAGER.
The Master server provides functionality for a user to create a new computational
job. Every job is represented by a job type and a configuration in JSON data
format. The job type is in fact the choice of a specific preprocessing or data mining
application. Job types are divided into two categories:
• Non-restricted jobs – Jobs that can be created and executed by any
logged in user.
• Restricted jobs – Jobs that can be created and executed only by users
with higher permissions because to fully utilize their potential it is
necessary to have write permissions to the system’s storage.
There are currently three types of jobs that can be executed on the
VO-CLOUD system: Preprocessing and the Random Decision Forest (RDF) method
that have been implemented by Andrej Palička in his Bachelor’s thesis [12]
and the Self-Organizing Maps (SOM) method that has been developed by Lukáš
Lopatovský in his Bachelor’s thesis [13]. The Preprocessing job type takes spectra
stored in the system’s storage and preprocesses them to the format that is
an input for the RDF and SOM job types. Due to the necessity to store data from
the preprocessing job type back to the VO-CLOUD’s storage the preprocessing
job type is set as a restricted job. RDF and SOM job types are non-restricted.
In fact, multiple workers supporting a single specific job type can be
configured. The Master server selects from the list the one worker that is the least loaded
and delegates the computational job to it through the worker’s UWS interface.
After the execution of a job has started, the Master server periodically checks
the worker’s job phase through the UWS API and when the execution stops, the
Master server downloads results from the worker and commands the worker to delete
the results on its side. A user can view the execution phase of every created job.
Jobs’ phases are directly mapped to phases of the UWS pattern (see figure 1.5).
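The worker selection described above can be sketched as a filter followed by a minimum search. The text does not say how load is measured, so this sketch assumes the number of currently running jobs; WorkerInfo, WorkerSelector and the example URLs are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical view of a configured worker. "runningJobs" stands in for load,
// which is an assumption: the text does not define the load metric.
final class WorkerInfo {
    final String url;
    final String jobType;
    final int runningJobs;

    WorkerInfo(String url, String jobType, int runningJobs) {
        this.url = url;
        this.jobType = jobType;
        this.runningJobs = runningJobs;
    }
}

final class WorkerSelector {
    // Picks the worker with the fewest running jobs among those that
    // support the requested job type; empty when no worker supports it.
    static Optional<WorkerInfo> leastLoaded(List<WorkerInfo> workers, String jobType) {
        return workers.stream()
                .filter(w -> w.jobType.equals(jobType))
                .min(Comparator.comparingInt(w -> w.runningJobs));
    }

    public static void main(String[] args) {
        List<WorkerInfo> workers = Arrays.asList(
                new WorkerInfo("http://worker-a.example.org/uws", "som", 3),
                new WorkerInfo("http://worker-b.example.org/uws", "som", 1),
                new WorkerInfo("http://worker-c.example.org/uws", "rdf", 0));
        // Prints http://worker-b.example.org/uws
        System.out.println(leastLoaded(workers, "som").get().url);
    }
}
```

Returning an Optional makes the "no worker configured for this job type" case explicit instead of hiding it behind a null.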
2.1.2 Universal worker
Computational jobs are not executed by the Master server itself but they are
delegated to computational components of the distributed system – generic
workers. A worker has the following responsibilities in the VO-CLOUD system:
1. It provides a UWS service that the Master server communicates with.
2. It parses the JSON configuration of every new job and downloads all
necessary files from the Master server that are listed in the configuration
and are therefore necessary for computation.
3. It passes the JSON configuration together with downloaded files to a specific
preprocessing or data mining application.
4. It stores computed results until the Master server downloads them.
The worker is a relatively simple web application written in the Java EE
platform. As has already been stated in its responsibilities it must expose
a UWS interface that the Master server communicates with. The Java Servlet
Technology is involved in the UWS service implementation. No additional Java
EE technologies (especially EJB) are used in the worker’s implementation
and thus it can be deployed on a lightweight Java EE server that has no EJB
container (e.g. Tomcat).
In the original implementation of the VO-CLOUD system (originally named
VO-KOREL [14]) a separate application had to be implemented for each
individual type of worker. Every new preprocessing or data mining method
also required a new worker application implementation. Moreover, every server
where such workers have been deployed can have different settings, i.e., the
path to the preprocessing or data mining application could be different. It
was necessary to build an individual package for each server and worker type.
The source codes of these applications were almost identical with the exception of
a few lines of code and configuration strings. This approach was detrimental
to maintenance as every minor change in the source code required multiple
recompilations and deployments.
This approach was changed as the result of Jakub Koza’s Bachelor’s thesis
that brought a new concept called the Universal worker.
“A universal worker is a new type of the servlet based application
that is used instead of all other worker application types. The
idea is to deploy only one instance of universal worker
application on one computer worker node where multiple computational
executable applications are supported.” [5]
The Universal worker is configured using an XML configuration file that matches
the XSD schema specially created for the Universal worker concept [5]. The schema
can be seen in appendix C. The Universal worker uses multiple job list collections
instead of only one – one job list per worker XML tag configured in the
XML configuration file. A fragment of such a worker’s configuration can be
seen in figure 2.1.
As can be seen in the example 2.1 the most important part of the worker’s
configuration is actually a specification of a process call. Whenever some
job should be started on the Universal worker, the worker creates a
new working directory. All files that are necessary for the job execution are
downloaded to this directory. Also the JSON configuration that was passed
as a job parameter is saved into this directory as a file. Finally, the process
specified in the XML configuration is executed in this directory and a path
to the configuration JSON file is passed as a parameter to this process (this
is caused by the last command tag in the XML configuration containing the special
substitution sequence ${config-file}).
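The substitution step can be illustrated with a few lines of plain Java. This is a sketch under stated assumptions, not the worker's actual code: CommandTemplate is a hypothetical helper, the template mirrors the command tags of figure 2.1 (the script name is as read from that figure), and the working directory and JSON file name are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CommandTemplate {
    // Replaces every ${name} placeholder that has a value in the map;
    // placeholders without a value are left untouched.
    static List<String> expand(List<String> template, Map<String, String> values) {
        List<String> command = new ArrayList<>();
        for (String part : template) {
            String expanded = part;
            for (Map.Entry<String, String> entry : values.entrySet()) {
                expanded = expanded.replace("${" + entry.getKey() + "}", entry.getValue());
            }
            command.add(expanded);
        }
        return command;
    }

    public static void main(String[] args) {
        List<String> template = Arrays.asList(
                "python3",
                "${binaries-location}/run_preprocessing.py",
                "${config-file}");
        Map<String, String> values = new HashMap<>();
        values.put("binaries-location", "/usr/local/workers/preprocessing");
        // Hypothetical working directory and file name chosen for this example.
        values.put("config-file", "/tmp/job-42/config.json");
        System.out.println(expand(template, values));
    }
}
```

Keeping the command as a list of arguments, rather than one shell string, avoids quoting problems when a path contains spaces.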
Workers have been specifically designed to constitute the distributed part
of the VO-CLOUD system. There can be multiple workers on multiple
machines. If there are more workers for a single job type, the VO-CLOUD system
automatically chooses the one that is the least loaded. In the matter of
deployment it is essential that VO-CLOUD’s Master server has network
visibility to individual workers in order to communicate with their UWS interface.
However, workers do not have to be exposed to users’ devices at all. If it is
expected that workers should be able to download data from VO-CLOUD’s
Master server the visibility must be bidirectional.
...
<ns:worker>
  <ns:identifier>preprocessing</ns:identifier>
  <ns:description>Preprocessing</ns:description>
  <ns:restricted>true</ns:restricted>
  <ns:binaries-location>/usr/local/workers/preprocessing</ns:binaries-location>
  <ns:exec-command>
    <ns:command>python3</ns:command>
    <ns:command>${binaries-location}/run_preprocessing.py</ns:command>
    <ns:command>${config-file}</ns:command>
  </ns:exec-command>
</ns:worker>
...
Figure 2.1: Universal worker configuration fragment
2.1.3 Specific preprocessing or data mining application
VO-CLOUD's Universal worker component would be useless without an application that is capable of preprocessing the passed data or unearthing new relevant information from them. As has already been stated, there are currently three such applications – Preprocessing, SOM and RDF. All of them are written in the Python programming language and their behaviour can be altered by changing an input JSON configuration. These applications are simply called as a new process from the Universal worker component. Although Python is used for all three job types, there is no limitation on the technology used, i.e., any process that can be executed on the worker's host system can be used for the purposes of the Universal worker component.
The Universal worker redirects the standard output stream and the standard error stream of the application's process to its own temporary files, which are afterwards passed to the Master server together with the results. The exit status code of the process is also passed back to the Master server. If the status code is equal to zero, the process is considered to have ended successfully and the job's phase is set to the COMPLETED state. Otherwise, the job's phase is set to the ERROR state. A user can go through the standard output and error files on the Master server to uncover the reason for the process failure.
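The redirection of both streams and the mapping of the exit status to a job phase can be sketched as follows. This is a minimal sketch and not the worker's real implementation (the workers are deployed on a Java EE server): the function name and the log file names are assumptions, only the phase names COMPLETED and ERROR come from the text.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_job(command, workdir):
    """Run command in workdir; return (phase, stdout_path, stderr_path)."""
    workdir = Path(workdir)
    out_path = workdir / "stdout.log"   # hypothetical file names
    err_path = workdir / "stderr.log"
    with open(out_path, "wb") as out, open(err_path, "wb") as err:
        result = subprocess.run(command, cwd=workdir, stdout=out, stderr=err)
    # A zero exit status means success, anything else is a failure.
    phase = "COMPLETED" if result.returncode == 0 else "ERROR"
    return phase, out_path, err_path


with tempfile.TemporaryDirectory() as tmp:
    ok_phase, ok_out, _ = run_job(
        [sys.executable, "-c", "print('done')"], tmp)
    ok_text = ok_out.read_text().strip()
    bad_phase, _, bad_err = run_job(
        [sys.executable, "-c", "import sys; sys.exit('boom')"], tmp)
    bad_text = bad_err.read_text().strip()
```

The second call exits with a non-zero status, so its phase is ERROR and the message written to the error stream ends up in the error file.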
Applications executed by the Universal worker can also create a visualization output that the Master server can present to the user. It was designed this way because every job type can require a different type of visualization. Visualization is optional and there are two types of visualization that the Master server can utilize:
• Static visualization – The process produces one or more static images that are placed directly in the working directory. These images are presented directly to the user in the web interface. The Master server supports the following image formats: PNG, JPEG, GIF.
• Dynamic visualization – The process produces a simple web application. To do so, the process must produce an index.html file as the starting point of the web application, and this file must be placed directly in the working directory. The Master server renders the content of this file inside the HTML IFRAME tag, which allows one web page to run inside another. The process can also produce additional HTML files and link them via standard relative hypertext links. The pages can also contain JavaScript code for additional scripting capabilities. Using this approach, the computational application can, for example, render a complicated clickable visualization with spectra rendering capabilities.
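How a server could tell the two visualization types apart from the content of a working directory can be sketched as follows. This is a hedged illustration: the function name is hypothetical, and giving index.html precedence over images is an assumption that the text does not confirm.

```python
import tempfile
from pathlib import Path

# Suffixes of the supported image formats (PNG, JPEG, GIF).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def detect_visualization(workdir):
    """Classify the visualization output found directly in workdir."""
    workdir = Path(workdir)
    index = workdir / "index.html"
    if index.is_file():
        return "dynamic", [index]
    images = sorted(p for p in workdir.iterdir()
                    if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES)
    if images:
        return "static", images
    return "none", []


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "plot.PNG").write_bytes(b"")
    static_kind, static_files = detect_visualization(tmp)
    (Path(tmp) / "index.html").write_text("<html></html>")
    dynamic_kind, _ = detect_visualization(tmp)
```

Only files placed directly in the working directory are inspected, which matches the placement rule stated for both visualization types.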
2.2 Deployment
Before continuing, it is important to explain the deployment of the current solution, as this is the state that is going to be extended. The VO-CLOUD system is currently deployed on two servers at the Stellar Department of the Astronomical Institute of the Czech Academy of Sciences in Ondřejov. These servers are named vocloud-dev and betelgeuse. Whereas vocloud-dev is only a virtual server with a relatively small amount of computational resources, betelgeuse is a powerful physical server with 12 CPU cores supporting Hyper-Threading technology (24 virtual CPU cores) and 128 GiB of RAM. However, unlike the vocloud-dev server, the betelgeuse server is not publicly available, mainly for security reasons. Therefore, there is an Nginx⁷ reverse proxy server deployed on the vocloud-dev server. The reverse proxy simply forwards all incoming HTTP/HTTPS requests whose URI starts with /vocloud-betelgeuse to the Java EE server hosted on the betelgeuse server. It also redirects the URI / to the previous one, so the VO-CLOUD system is available at the URL https://vocloud-dev.asu.cas.cz.
The whole VO-CLOUD system is currently deployed on the betelgeuse server on the Java EE server called WildFly⁸. There is a single Master server instance and a single Universal worker instance deployed on the WildFly server. The Universal worker is configured so that it provides a job execution service for the Preprocessing, RDF and SOM job types. The Master server requires a relational database for its functionality. There is a PostgreSQL⁹ database that is also deployed on the betelgeuse server inside a Docker container. Docker¹⁰ is a very powerful technology that considerably simplifies the deployment of application components on different machines. This technology is explained in detail in the following chapter because it is crucial for the purposes of this work.
⁷ http://nginx.org
⁸ http://wildfly.org
⁹ https://www.postgresql.org
[Figure 2.2: Deployment diagram. The web browser on the user's client system connects over HTTP/HTTPS to the Nginx reverse proxy server on the vocloud-dev server, which forwards the requests over HTTP/HTTPS to the WildFly Java EE server on the betelgeuse server. WildFly hosts the Master server and the Universal Worker, which communicate over UWS. The Universal Worker executes the RDF, SOM and Preprocessing binaries, and the Master server accesses the vocloud-schema in the PostgreSQL DB, which runs in a Docker container, over SQL.]
The deployment diagram of the current solution can be seen in figure 2.2.
¹⁰ https://www.docker.com
2.3 Workflow example
For the sake of completeness, let us describe a scenario of a user's communication with the VO-CLOUD system. The user in this scenario needs to download data from a VO archive, to apply preprocessing to them and to apply the Self-Organizing Maps (SOM) method to find similarities between the passed astronomical spectra [13]. As the user intends to use the restricted job type – Preprocessing – he must have either the MANAGER user role or the ADMIN user role.
1. User logs into the VO-CLOUD system using his username and password.
2. User navigates through VO-CLOUD's storage tree structure and selects or creates a directory where the spectra from the VO archive should be downloaded.
3. User clicks the Append new files using SSAP button, fills in all necessary input parameters and configures the DataLink protocol settings (if the DataLink protocol is supported by the VO archive and if the user wants to use it).
4. User commits the download request. The progress of this request and possible errors can be seen on a dedicated page.
5. When the download is completed, user continues to the page for Preprocessing job creation. He either creates a JSON configuration file from scratch or selects one of the pre-created configurations. The files that should be preprocessed and the preprocessing parameters are specified in the JSON configuration.
6. User starts a preprocessing job. Progress can be seen on a dedicated page.
7. After the preprocessing job is completed, user opens the details of the job and checks that the expected output is there.
8. User then stores the output of the preprocessing job in the system's storage in order to have it as a source for additional experiments.
9. In a similar way to the preprocessing job, user creates a new SOM job and selects as its input the output of the preprocessing job stored in the system's storage.
10. When the SOM job is completed, user opens the details of the job. The SOM job type provides an interactive visualization that the user can explore in depth and that can help him to unearth new interesting information.
Chapter 3
Requirements analysis
This chapter is dedicated to the description of all functional and non-functional requirements demanded of the new version of the VO-CLOUD system. The fulfilment of these requirements is the aim of this Master's thesis. Before diving into these requirements, it is important to explain all new technologies that are going to be involved in the new version of the system.
3.1 New technologies
The basis of this work is in fact the integration of many new technologies into the existing solution of the VO-CLOUD system in order to simplify scientific work with the system and to extend its capabilities. The comprehension of these technologies is crucial for the practical part of this thesis, thus this section introduces them to the reader.
3.1.1 Docker
One of the most important technologies used in the practical part is Docker. Docker has already been mentioned in section 2.2, which explains VO-CLOUD's deployment, because the database server of the currently deployed solution uses a Docker container. Docker is a software container platform where a piece of software is packaged into isolated containers [15]. The containers are functionally very similar to virtual machines, however, containers in the Docker platform do not bundle a full operating system, but only the software libraries and settings required to make the software work as needed [15].
“Docker containers running on a single machine share that machine's operating system kernel; they start instantly and use less compute and RAM. Images are constructed from filesystem lay-
FR 14 The Jupyter Notebook environment must be available only to users with either the MANAGER or the ADMIN user role.
FR 15 The Jupyter Notebook environment must provide a writable directory where users can create their own files and Jupyter Notebooks.
FR 16 Every user's writable Jupyter Notebook directory must be isolated from all other users.
FR 17 Users should not have to authenticate again to access the Jupyter Notebook environment. Authentication should be straightforward when the user is already authenticated in the VO-CLOUD's Master server.
FR 18 There must be a possibility to create new worker types utilizing the Apache Spark technology – Spark workers.
FR 19 Spark workers must be able to download files from the Master server's storage directly to the specified path in HDFS.
FR 20 Spark workers must have a defined set of default parameters that are passed to the Spark job.
FR 21 The set of Spark job parameters should be optionally configurable by the user in the JSON configuration during the Spark job creation.
FR 22 The output of a Spark job should be optionally downloadable from the HDFS back to the VO-CLOUD system.
FR 23 The Master server must allow the user to browse the HDFS. The user should be able to modify the HDFS in the same way as the Master server's storage.
FR 24 The plotter must be able to plot spectra files stored in the HDFS.
FR 25 Files stored inside HDFS must be visible and available for experimenting from the integrated Jupyter Notebook environment in the same way as the files from the Master server's storage.
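One way to read FR 20 and FR 21 together is that the worker's default parameters are overridden by whatever the user supplies in the JSON configuration. The sketch below illustrates that reading only: the "spark_params" key and the default values are hypothetical, while --executor-memory and --num-executors are existing spark-submit options used here as examples.

```python
import json

# Hypothetical worker defaults (FR 20).
DEFAULT_SPARK_PARAMS = {"executor-memory": "2g", "num-executors": "4"}


def spark_params(job_config_json, defaults=DEFAULT_SPARK_PARAMS):
    """Merge defaults with the optional user overrides (FR 21)."""
    config = json.loads(job_config_json)
    merged = dict(defaults)
    # "spark_params" is an assumed key name for the user's overrides.
    merged.update(config.get("spark_params", {}))
    return merged


def to_arguments(params):
    """Turn the merged parameters into command line arguments."""
    arguments = []
    for name in sorted(params):
        arguments += [f"--{name}", str(params[name])]
    return arguments


params = spark_params('{"spark_params": {"executor-memory": "8g"}}')
arguments = to_arguments(params)
```

A configuration without the override key simply yields the defaults, so the parameters remain optional for the user.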
3.3 Non-functional requirements
NFR 1 For security reasons, the whole VO-CLOUD system must secure its communication using the HTTPS protocol. Incoming connections over the HTTP protocol must be redirected to HTTPS connections.
NFR 2 Newly implemented modules should use Docker technology in order to make deployment more straightforward.
NFR 3 The source code of the VO-CLOUD system must be published under an Open Source license and it must be publicly available in a public repository.
NFR 4 The new Spark worker type must be able to run on the same application server as the Master server as well as on an application server on a different machine.
Chapter 4
Realisation
There are three main goals in the extension of the current solution of the VO-CLOUD system:
• Implementation of a spectra plotter
• Integration of the Jupyter Notebook environment
• Integration of HDFS and Apache Spark
The following sections of this chapter explain each of these goals in detail.
4.1 Astronomical spectra plotting capability
The current version of the VO-CLOUD system was originally designed not to differentiate between the types of files saved in the Master server's storage. It was the user's responsibility to know what the respective files actually represent. Workers of the VO-CLOUD system have been designed in exactly the same way. They basically take a JSON configuration containing the list of files that they should download from the Master server. Then they pass the downloaded files and the same JSON configuration file to some computational process. The process is actually the element that should know what files it is working with. If desirable, the process can create a visualization output that the Master server can present to the user.
For instance, the Preprocessing job type currently deployed on the VO-CLOUD system processes the passed astronomical spectra files and produces a CSV file containing the preprocessing output. It also produces a simple web application – an interactive plotter of the output spectra that utilizes dygraphs¹², an open source JavaScript charting library. The VO-CLOUD system simply takes the Preprocessing binaries as a black box that requires a JSON configuration and some input files, then collects everything that Preprocessing produces and presents it back to the user as a set of files and an in-browser application. Currently, VO-CLOUD does not have to know anything about file types at any time.
¹² http://dygraphs.com
[Figure 4.1: Example of astronomical spectrum plot of star 31 Pegasi. Horizontal axis: wavelength [Ångström], from 6300 to 6700; vertical axis: flux, from 1.0 to 4.0.]
Although the VO-CLOUD system has been designed generally for any kind of data processing, nowadays it is used especially for the processing of astronomical spectra files.
Astronomical spectrum file – A file containing a record of an astronomical spectrum together with additional metadata describing when the spectrum was recorded, under what conditions, how it was processed, etc. The astronomical spectrum is a very important concept in stellar astronomy, as it is a record of the electromagnetic radiation coming from the observed object. An astronomical spectrum can be easily visualized as the radiative energy, the so-called flux, plotted as a function of wavelength [25]. An example of a plotted spectrum can be seen in figure 4.1.
Although the VO-CLOUD system's storage mostly contains only astronomical spectra files, there is currently no way to visualize them other than by sending them to a specific job that is able to plot them, or by downloading them to the user's device and visualizing them in some other application (e.g. Spectral Analysis Tool (SPLAT)¹³). These approaches are not suitable, as experimenting astronomers often need to visualize a selected set of spectra to check that the spectra wavelengths are correctly cut after the preprocessing phase, that their values are correctly normalized to the [0.0, 1.0] interval and so on. Therefore a new functional requirement on the VO-CLOUD system emerged – to plot as many astronomical spectra file types stored in the system's storage as possible.
4.1.1 Core visualisation problems statement
Although the visualisation of astronomical spectra files seems really straightforward, the opposite is true. The fact that VO-CLOUD is a web application brings some problems that complicate the visualisation. This section is dedicated to the description of these problems.
Multiple spectra file formats
Astronomical spectra can be stored in multiple specialized formats, and even within these formats there can be several ways in which spectra are stored. The desired spectra visualiser should naturally support as many of these options as possible. Based on the functional requirements, the visualiser should support the following spectra file formats:
• Flexible Image Transport System (FITS) – A format of astronomical spectra files with the extension .fit or .fits that was designed in order to facilitate the interchange of astronomical image data between observatories. “A FITS file consists of a sequence of one or more header and data units (HDUs) optionally followed by special record. The structure of a FITS file is based on blocks with a length of 2880 8-bit bytes (23040 bits)” [26]. There are several ways to store an astronomical spectrum inside a FITS file. The most straightforward way is to store two vectors of the same length – a vector of x-axis values defining the wavelength points and a vector of y-axis values defining the actual flux value at the respective wavelength point (“respective” means that the point has the same index in its vector). Another way is to write only a vector of flux values and to describe the wavelength axis in the FITS metadata, e.g., to define the wavelength of the first point and the step to the next wavelength point in either linear or logarithmic scaling. However, the problem is that the names of the metadata keys are usually not standardized, and moreover some FITS files do not even fully meet the FITS specification, thus it is almost impossible to implement a visualiser supporting every astronomical spectrum FITS file.
¹³ http://star-www.dur.ac.uk/~pdraper/splat/
• VOTable – The VOTable format is a recommendation developed by the IVOA organisation and it is newer than the FITS specification. “The VOTable format is an XML standard for the interchange of data represented as a set of tables.” [27] The table in this context contains an unordered set of rows and each row contains a sequence of cells. The VOTable format can be utilized in many ways (e.g. it is used in the already mentioned SSAP protocol as a carrier of the SSAP query results), however, it is mainly utilized as a format of astronomical spectra files, as it can carry spectra data as well as metadata. There are two main ways to store spectra inside VOTable format files:
  – TABLEDATA – The two vectors of an astronomical spectrum are mapped to rows with two columns – wavelength and flux pairs – using table XML elements (TR and TD elements – the same as in HTML).
  – BINARY – The wavelength and flux vectors are serialized in a binary format that is intended to be easy for parsers to read. It is basically a sequence of cells serialized as a sequence of bytes [27]. Every cell belongs to some row and some column, and the size of the cell in bytes is defined by the column's data type. Column data types and additional metadata are specified at the beginning of the VOTable file.
• CSV – Whereas FITS and VOTable files represent a single astronomical spectrum, sometimes it is desirable to have multiple astronomical spectra stored inside a single file. One of the most useful formats for this purpose is the simple Comma-Separated Values (CSV) format. The CSV file contains each spectrum on a single row, and for each spectrum it contains multiple values separated by a comma (other characters such as a space or a semicolon are sometimes used, however, a comma is the most common). The first value of a spectrum row contains a spectrum identifier (e.g. the name of the original file the spectrum has been taken from) and the rest of the row values represent the flux vector of the spectrum. The CSV spectra file does not contain any header row. The great advantage of this approach is that the CSV spectra file can be easily split into multiple smaller CSV files. It is important to note that the CSV spectra files contain no information about the wavelength vector. The CSV spectra file is usually the output of some preprocessing method that takes multiple spectra as its input and preprocesses them into a single CSV spectra file. The input spectra are interpolated to the same wavelength values and the resulting wavelength vector is exported as a meta.xml file – a file in VOTable format containing a row with the wavelength vector. Data mining methods do not usually need the meta.xml file as they mostly work only with the spectra flux vectors. The meta.xml file serves especially for the purposes of visualisation and additional preprocessing methods.
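The second way of storing a spectrum in a FITS file, where only the flux vector is stored and the wavelength axis is described by the first value and a step, can be sketched as follows. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, the values would have to be read from whichever metadata keys a given file uses, and for the logarithmic case it is assumed that the first value and the step are given in log10 units.

```python
def wavelength_axis(first_value, step, count, logarithmic=False):
    """Rebuild the wavelength vector for `count` flux values.

    linear:      w[i] = first_value + i * step
    logarithmic: log10(w[i]) = first_value + i * step  (assumption)
    """
    values = [first_value + index * step for index in range(count)]
    if logarithmic:
        return [10.0 ** value for value in values]
    return values


linear = wavelength_axis(6300.0, 0.5, 4)
logarithmic = wavelength_axis(3.0, 0.5, 3, logarithmic=True)
```

The returned vector has the same length as the flux vector, so the two can be paired index by index exactly as in the two-vector representation.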
The spectra visualiser must support plotting multiple astronomical spectra together. The user can select either one file or multiple files to plot. The selected files may have different formats. A CSV file must be plottable by itself (in this case it should be assumed that the wavelength vector contains the values of the function f(x) = x, where x is the index of a flux vector value) or with the correct wavelength values if a meta.xml file is specified.
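Reading a CSV spectra file with the index fallback described above can be sketched as follows. The function name is hypothetical, and the wavelength vector is passed in as a plain list; parsing it out of a meta.xml VOTable file is left out of this sketch.

```python
import csv
import io


def read_csv_spectra(text, wavelengths=None, delimiter=","):
    """Parse header-less CSV spectra into {identifier: (axis, flux)}."""
    spectra = {}
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        identifier = row[0]
        flux = [float(value) for value in row[1:]]
        if wavelengths is None:
            # No meta.xml given: assume f(x) = x, i.e. use the index.
            axis = list(range(len(flux)))
        else:
            if len(wavelengths) != len(flux):
                raise ValueError(f"length mismatch for {identifier}")
            axis = list(wavelengths)
        spectra[identifier] = (axis, flux)
    return spectra


text = "a.fits,1.0,0.5,0.25\nb.fits,0.1,0.2,0.3\n"
by_index = read_csv_spectra(text)
by_wavelength = read_csv_spectra(text, wavelengths=[6300.0, 6300.5, 6301.0])
```

Because every row is self-contained, the same function works unchanged on a file that has been split into smaller CSV files.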
Data volume problem
There is a big difference between visualising a single astronomical spectrum file with a size of approximately 50 kB and visualising multiple spectra (stored as multiple files or as a single big CSV file) with a total size in megabytes. If the spectra to be visualised were really small, the best solution would be to transfer all necessary data to the client's web browser and visualise them using JavaScript code. The problem is that the data are usually too big for the web browser's JavaScript interpreter to handle. Moreover, users with limited network connectivity would have to wait a very long time, because web browsers usually wait until all data have been downloaded before they pass them to the JavaScript code – there is no way to visualise spectra continually as new data are being downloaded.
A better way to solve the data volume problem is to generate an image of the plot on the server side and then send it to the user's web browser, which simply shows the image. This solution has a few advantages:
• The client code has no responsibility for differentiating between multiple spectra file formats.
• There are no requirements on the computational capabilities of the user's device.
• There is no need to transfer a large amount of data from the server to the user's device – only one image of the desired quality.
This approach seems to be better than the first one, however, it also has a great disadvantage. The ability to zoom into the resulting image is limited by the image quality. Images would either have to be unnecessarily large, or the quality after zooming would be unacceptable. It is necessary to find a compromise between the first and the second approach mentioned in this section in order to implement a sufficient astronomical spectra visualiser.
Technology integration problem
As has already been explained earlier in this section 4.1.1, the spectra visualisation must take place at least partially on the server side. To do so, it is important to implement mainly the following two components:
• A parsing module for all possible astronomical spectra file types that takes a set of files as its input and returns a set of wavelength vector and flux vector pairs – one pair for each astronomical spectrum.
• A plotting module that takes the output of the previous module and plots all spectra into a plot image.
The problem is that there are no libraries implemented in the Java language that are able to parse the FITS or VOTable spectra file formats. Whereas it would be a relatively easy task to implement a VOTable parser in Java, as a VOTable is basically an XML document and Java offers technologies for straightforward XML document parsing, the FITS format parsing would have to be implemented entirely from scratch. Also, Java offers almost no tools with plotting capabilities. Plotting libraries written in Java are mostly targeted at desktop applications and do not fit the purposes of a web application.
On the other hand, the Python programming language seems like the right way to go. Parsing of both the FITS and VOTable spectra file formats can be implemented easily using the Astropy Python project – a community effort to develop a core package for astronomy using the Python programming language and improve usability, interoperability, and collaboration between astronomy Python packages [28]. Python also offers an excellent plotting library named Matplotlib.
“Matplotlib is a 2D graphics package used for Python for application development, interactive scripting, and publication-quality image generation across user interfaces and operating systems.” [29]
A solution implemented in the Python programming language would be straightforward to write, however, the VO-CLOUD system is implemented in the Java programming language. It is important to decide whether it is better to implement the spectra visualiser in Python and make its integration with the current solution more difficult, or to implement it in Java, which makes the integration trivial but requires the parsing and plotting modules to be implemented from scratch.
4.1.2 Solution
After considering all the problems stated above, I eventually decided to implement the whole astronomical spectra visualiser in the Python language as a new web application and then integrate this application into the current solution of the VO-CLOUD system. By using the Python programming language many things have been simplified, as it is possible to delegate many application responsibilities to the libraries that this application utilizes.
The application was named spectraviewer and it was implemented as a web server application utilizing a Python package named Tornado – a web framework and asynchronous network library [30]. The data volume problem 4.1.1 has been solved smoothly by utilizing the WebSocket protocol. WebSocket is a protocol that starts with an HTTP handshake and then upgrades the connection to provide bidirectional communication between a server and a client [31]. After a WebSocket connection between a client and a server has been established, the client can send a message to the server and the server can send a message to the client. Because the Tornado server is built on asynchronous principles, it is very easy to implement the WebSocket protocol using the Tornado Python package. Every WebSocket protocol event (connection opened, connection closed, message received) on the server side triggers a method call on the respective tornado.websocket.WebSocketHandler class instance. A new instance of this class is created for every new incoming WebSocket connection.
The application works in the following way:
1. Client sends the list of spectra he would like to visualise.
2. Application parses the listed spectra and saves the plot figure inside its temporary key-value storage – the key is a randomly generated unique identifier and the value is the figure itself.
3. Application responds to the client with the HTML template containing the JavaScript client code and the storage identifier.
4. Client renders the HTML template and creates a new WebSocket connection to the specific server endpoint, passing the storage identifier as an argument.
5. Server links the newly received WebSocket connection with the figure stored inside the storage.
6. Server sends a message containing an image of the figure to the WebSocket connection.
7. Client shows the received image in the page.
8. Client can use panning or zooming tools on the image. When he does so, the parameters of the expected transformation are sent to the server through the WebSocket connection.
9. Server applies the desired transformation to the linked figure and sends back a new image through the WebSocket connection.
10. Client can repeat steps 8 and 9.
11. When the client closes the page, the WebSocket connection is closed and the server removes the relevant figure from the temporary key-value storage.
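The storage side of the steps above can be sketched without any web framework. This is a minimal sketch, not the application's actual code: the FigureStore class and the zoom helper are hypothetical, and a plain dictionary with axis limits stands in for a Matplotlib figure. In a Tornado handler, the three parts of the walk-through would presumably be driven by the open, on_message and on_close callbacks of tornado.websocket.WebSocketHandler.

```python
import uuid


class FigureStore:
    """In-memory key-value storage for figures awaiting a connection."""

    def __init__(self):
        self._figures = {}

    def put(self, figure):
        """Store a figure under a random unique identifier (step 2)."""
        key = uuid.uuid4().hex
        self._figures[key] = figure
        return key

    def get(self, key):
        """Look up the figure linked to a connection (step 5)."""
        return self._figures.get(key)

    def remove(self, key):
        """Forget the figure once the connection is closed (step 11)."""
        self._figures.pop(key, None)


def zoom(x_limits, factor, center):
    """Shrink the visible interval around center by factor (step 9)."""
    low, high = x_limits
    return (center - (center - low) / factor,
            center + (high - center) / factor)


store = FigureStore()
key = store.put({"x_limits": (6300.0, 6700.0)})    # stand-in for a figure
figure = store.get(key)                            # connection opened
figure["x_limits"] = zoom(figure["x_limits"], 2.0, 6500.0)  # zoom message
new_limits = figure["x_limits"]
store.remove(key)                                  # connection closed
```

After each transformation the server would render the figure again and send the new image, so only the transformation parameters travel from the client to the server.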
1. The Jupyter Notebook server instance forwards the encrypted cookie to the Hub for authorization.
2. If the cookie is valid, the Hub responds with the user's username.
3. If the user is the owner of the Jupyter Notebook server instance, access is allowed.
4. If the username is wrong or the cookie is invalid, the user is redirected to /hub/login.
4.2.2.2 JupyterHub Spawner
Each Jupyter Notebook server instance is started by the Hub subsystem through an object called Spawner. The Spawner object has the following responsibilities [32]:
• Start the Jupyter Notebook server process.
• Poll whether the process is still running.
• Stop the process when necessary.
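The three responsibilities can be illustrated with a toy class that manages an ordinary local process. This is only a sketch and does not inherit from or reproduce the JupyterHub Spawner base class: the ProcessSpawner name is hypothetical. The convention that poll returns None while the process is running and an exit status otherwise mirrors the one used by JupyterHub spawners.

```python
import subprocess
import sys


class ProcessSpawner:
    """Toy spawner with the three responsibilities listed above."""

    def __init__(self, command):
        self.command = command
        self.process = None

    def start(self):
        """Start the server process."""
        self.process = subprocess.Popen(self.command)

    def poll(self):
        """Return None while running, otherwise the exit status."""
        if self.process is None:
            return 0  # never started, so nothing is running
        return self.process.poll()

    def stop(self):
        """Stop the process if it is still running."""
        if self.process is not None and self.process.poll() is None:
            self.process.terminate()
            self.process.wait(timeout=10)


# A sleeping Python process stands in for a Jupyter Notebook server.
spawner = ProcessSpawner(
    [sys.executable, "-c", "import time; time.sleep(30)"])
spawner.start()
running = spawner.poll() is None
spawner.stop()
stopped = spawner.poll() is not None
```

A Docker based spawner fulfils the same three responsibilities, only it starts, inspects and stops a container instead of a local process.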
There are many implementations of the JupyterHub spawner object. The default one is called LocalProcessSpawner. This spawner implementation works only on UNIX systems, as it spawns new server instances as processes running under the UNIX system user whose name matches the one authenticated in the JupyterHub authentication process. There are cases where this solution could be sufficient, however, in this case there is no mapping between UNIX system users and users inside the VO-CLOUD system.
The crucial implementation of the JupyterHub Spawner that is utilized in the VO-CLOUD–JupyterHub integration is named DockerSpawner. This spawner implementation starts, for each authenticated user, a Docker container that packages the whole Jupyter Notebook server. The utilization of Docker containers also has a great advantage. The environment inside the running Docker container is isolated from the hosting system, therefore this approach deals very effectively with the system isolation problem 4.2.1. If the Jupyter Notebook server had badly configured access permissions, in the worst case a user could break only the server in his own container. Other Docker containers and the hosting system are inaccessible from the inside of the Docker container.
4.2.2.3 JupyterHub Authenticator
The JupyterHub Authenticator is another important object in the Hub subsystem. Its responsibility is to provide authentication capabilities to the JupyterHub server. In practice, the Authenticator is any Python class that inherits from the class jupyterhub.auth.Authenticator. It consists of a single method, authenticate, which basically takes the username and the password of the user who is trying to authenticate. If the user's credentials are correct, the method must return the user's username. Otherwise, the method must return the special Python value None.
In order to integrate the JupyterHub into the VO-CLOUD system, it was necessary to design and implement the authentication process. The authentication workflow is as follows:
1. User connects to the VO-CLOUD's Master server and logs in with his credentials.
2. In order to transition to the Jupyter Notebook environment, user clicks the Jupyter button in the Master server's user interface.
3. Master server generates a new random token, links the token with the user's username and saves it temporarily in the in-memory storage.
4. Master server sends the token to the user's web browser.
5. User's web browser sends an HTTP POST request to the JupyterHub's login endpoint /hub/login. The POST request contains two parameters – the username identical to the user's username on the Master server and the token.
6. JupyterHub delegates the authentication task to the authenticator's implementation – the VocloudAuthenticator.
7. The VocloudAuthenticator sends an HTTP POST request to the Master server's token checking endpoint. It passes the token as a POST parameter.
8. If the token is valid and not expired, the Master server returns the username of the user account linked with this token and invalidates the token.
9. The VocloudAuthenticator checks that the username received from the Master server matches the one received from the user's web browser.
10. If the usernames match, it returns the username to the JupyterHub.
11. User is now authenticated to the JupyterHub.
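The token handling in the steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the TokenStore class, the 30 second lifetime and the function signature are hypothetical, the real token storage lives in the Java Master server, and the HTTP POST request of step 7 is replaced here by a direct method call.

```python
import secrets
import time


class TokenStore:
    """In-memory one-time tokens linked to usernames (steps 3 and 8)."""

    def __init__(self, lifetime_seconds=30.0, clock=time.monotonic):
        self._lifetime = lifetime_seconds   # assumed token lifetime
        self._clock = clock
        self._tokens = {}  # token -> (username, expiry time)

    def issue(self, username):
        """Generate a random token and link it with the username."""
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (username, self._clock() + self._lifetime)
        return token

    def redeem(self, token):
        """Return the linked username once; None if unknown or expired."""
        entry = self._tokens.pop(token, None)   # invalidate on first use
        if entry is None:
            return None
        username, expires_at = entry
        return username if self._clock() <= expires_at else None


def authenticate(store, username, token):
    """Steps 7 to 10: return the username on success, otherwise None."""
    owner = store.redeem(token)
    if owner is not None and owner == username:
        return username
    return None


store = TokenStore()
token = store.issue("alice")
first = authenticate(store, "alice", token)     # token accepted once
second = authenticate(store, "alice", token)    # replayed token rejected
wrong = authenticate(store, "bob", store.issue("alice"))  # name mismatch
```

Removing the token on the first lookup, whether or not the check succeeds, is what makes a captured token useless for a replay.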
There is no other way to authenticate to the JupyterHub than to transition from the VO-CLOUD's Master server using the provided token. Every generated token is valid only for a limited amount of time and it is invalidated as soon as it is used. From a security point of view, this solution is significantly better than the solution explained in section 4.2.1, as there is no way to utilize a potentially intercepted token, since it is valid only for a very short period of time. Security cookies are also difficult to exploit, as they contain encrypted information identifying the user's web browser and device.
[Figure 4.3: JupyterHub solution Docker deployment. On the betelgeuse server, Docker runs the JupyterHub Docker container (Proxy, Hub, VocloudAuthenticator and DockerSpawner) with its own Docker volume, and one Jupyter Notebook Docker container with its own Docker volume for each of the users user 1, user 2 and user 3. The notebook containers are spawned by the DockerSpawner. The Master server runs in the WildFly Java EE server. The connections in the diagram are labelled HTTP and HTTPS.]
The implementation of the VO-CLOUD's Master server has been extended to support the new functionality of the authentication token endpoint. It was implemented as a very simple RESTful service. The in-memory token storage has been implemented as a Singleton EJB bean.
4.2.2.4 Deployment
The Docker technology has been used both for the Jupyter Notebook server instances and for the whole JupyterHub itself. This solution is interesting because the JupyterHub Docker container needs to spawn additional Docker containers with the Jupyter Notebook server instances on the same server where the JupyterHub container itself is deployed, but not inside the JupyterHub's container. Moreover, in order to have access to the Master server's storage and jobs directory, it is necessary to mount these directories into every one of the individual Jupyter Notebook Docker containers. Every Jupyter Notebook Docker container has its own working directory that is backed by the filesystem of the hosting system.
In order to be able to start a new Docker container from the inside of another Docker container, but not inside that Docker container itself, it is necessary to mount the Docker socket file /var/run/docker.sock from the hosting server into the spawning Docker container. The Docker socket is basically the endpoint through which clients communicate with the Docker daemon process – a Docker container with the mounted Docker socket gains the ability to control the Docker daemon in the same way as it can be done directly from the hosting server.
The deployment of the JupyterHub solution can be seen in figure 4.3. The whole deployment solution has been implemented as a project named vocloud-jupyterhub, and the deployment consists of adjusting the .env configuration file and invoking two commands:
make
docker-compose up -d
The first command creates all necessary Docker networks and volumes and then builds the Jupyter Notebook server Docker image and the JupyterHub image. The second command starts the JupyterHub container in detached mode and exposes its web interface on the TCP port specified in the .env configuration file.
Since JupyterHub is running on the betelgeuse server, it was necessary to expose it as well in the reverse proxy on the vocloud-dev server. A small improvement was also made – when a user accesses the /hub/login URI of the JupyterHub server using the HTTP GET method, the reverse proxy sends back a redirect pointing to the VO-CLOUD's Master server login page. Now, when a user logs out from the JupyterHub server or accesses some JupyterHub resource without authentication, he is automatically redirected to the VO-CLOUD's login page.
4.2.3 Summary
Every user now has access to his own instance of the Jupyter Notebook server that is started on demand as a Docker container. Every user has his own working directory, isolated from all other users, where he can create new files and Notebook documents. This working directory also contains two read-only directories mounted from the hosting system – the Master server's storage and jobs directory. Users have direct read access to all files saved in these directories without the need to copy them to some other location. Users have access to the terminal window feature of the Jupyter Notebook environment, however, this terminal has access only to the specific Docker container, as it is isolated from the hosting system. Users can use the terminal window to install additional Python packages that are not provided by default, however, the Jupyter Notebook Docker image should already be provided with all necessary Python packages such as Matplotlib, Astropy, NumPy, pandas and many others.
The user can transition to his Jupyter Notebook environment from the VO-CLOUD system by clicking a single button – the whole authentication process is done automatically in the background. When the user logs out from the Jupyter Notebook environment, he is automatically redirected back to the VO-CLOUD system.
4. Realisation
4.3 Apache Spark and HDFS integration
The final goal of this Master’s thesis is to find a way to integrate the
VO-CLOUD system with the Hadoop infrastructure in order to be able to utilize
the distributed file system HDFS and to start Apache Spark jobs using the
Hadoop YARN scheduler. The current solution of the VO-CLOUD computational
workers is usable; however, the usability of workers is limited by two
factors:
• The set of input data must always be downloaded again from the Master
server’s storage to the computational worker for every individual
job. This approach enables the deployment of workers on additional
separated devices.
• The worker’s computational task is always executed on a single device
of the specific worker. The computational capability is limited by the CPU
and memory resources of the single device.
While the current solution of the VO-CLOUD system can be easily applied
to the processing of a limited amount of data, the need has emerged to be
able to process the whole astronomical spectra archive LAMOST-DR1. The
Large Sky Area Multi-Object Fibre Spectroscopic Telescope (LAMOST) is
a meridian active reflecting Schmidt telescope located in Xinglong Station
of the National Astronomical Observatory in China [33]. Data Release 1 (DR1)
of this telescope’s observations comprises 2,202,000 astronomical spectra
files encoded in FITS format. Every astronomical spectrum file takes up
approximately 90 kB of disk space and the whole uncompressed archive
in total takes up 189 GiB of disk space. It is unrealistic to use the current
computational workers for processing the whole spectra archive,
as the whole archive would have to be stored in the VO-CLOUD’s storage
and it would also have to be downloaded to a worker for every computational
job. It is necessary to design a better solution – utilize the capabilities of
Apache Spark and the Apache Hadoop infrastructure.
4.3.1 Hadoop deployment
Firstly, it was necessary to deploy the Hadoop infrastructure to the servers in
the Stellar Department of the Astronomical Institute of the Czech Academy of
Sciences in Ondřejov, where the VO-CLOUD system is also running. It was
decided that the Hadoop computational cluster would consist of two servers:
• betelgeuse – The server where the VO-CLOUD system is currently
deployed. It has 12 CPU cores supporting the Hyper-Threading technology
(24 virtual CPU cores) and 128 GiB of RAM memory.
• antares – The server with 8 CPU cores and 24 GiB of RAM memory.
[Figure 4.4: Hadoop cluster deployment on Ondřejov servers. The betelgeuse
server runs the NameNode, DataNode, ResourceManager and NodeManager
processes; the antares server runs a DataNode and a NodeManager process.]
As has already been explained in section 3.1.3, the Hadoop Distributed
File System (HDFS) is comprised of NameNode and DataNode processes.
The DataNode process is basically the component that saves the data
blocks in the device’s filesystem. The NameNode is the controlling component
that has information about all files stored in the HDFS, their data blocks and
where these blocks are saved. In this case, the DataNode process is running on
both the betelgeuse and antares servers and the NameNode controlling process
is running only on the betelgeuse server.
The Hadoop YARN has been deployed very similarly. It consists of two
processes:
• ResourceManager – The YARN scheduler and resource managing process
that has information about all available NodeManager processes.
• NodeManager – The process that can receive computational work from
the ResourceManager.
The betelgeuse server runs both processes, whereas the antares server runs
only the NodeManager process.
The diagram of the whole Hadoop infrastructure deployment can be seen
in figure 4.4.
4.3.2 Apache Spark
The installation of Apache Spark was very simple – it was only necessary
to download the Apache Spark binaries and to properly define environment
variables to correctly point to the path of the Apache Hadoop installation
directory.
All jobs that are expected to be executed inside the Apache Spark
environment are submitted using the spark-submit script that is bundled with
the Apache Spark installation package. Multiple parameters can be passed to
the spark-submit script. The most important are:
• --master – Defines where the Spark job should be running. In order to
have the job managed by the Hadoop YARN scheduler, it is necessary
to pass yarn as a value of this parameter.
• --deploy-mode – Defines where the execution driver should run. The
driver is the application that orchestrates the job execution on individual
executors. There are two options that can be passed to this parameter:
– client – The driver should run on the side of the device where
the spark-submit script has been executed. This option is picked
when it is necessary to instantly see the progress of the job’s execution.
– cluster – The driver should run on any device in the cluster. The
cluster’s resource manager simply picks the most suitable cluster
node for this task.
• --num-executors – The count of computational executors. This option
is only used together with --master yarn. In fact, the Hadoop
YARN allocates its resources to computational containers named
executors. Every executor can run only on a single cluster node; however,
multiple executors can run on a single node. Each executor requires
an allocated specific amount of CPU cores and a specific amount
of RAM memory. The Spark computation on YARN can start when
the desired amount of executors has been started with all required
resources.
• --executor-cores – The number of CPU cores allocated per executor.
• --executor-memory – The amount of RAM memory allocated per
executor.
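The parameters above can be assembled into a spark-submit command line as in the following minimal sketch. The default values mirror the example worker configuration in Appendix F; the application path and its argument are hypothetical and serve only as an illustration.

```python
def spark_submit_args(application, app_args, master="yarn",
                      deploy_mode="client", num_executors=5,
                      executor_cores=3, executor_memory="4g"):
    """Return the argv list of a spark-submit call for a YARN-managed job."""
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
        application,   # the submitted application comes after all options
        *app_args,     # arguments handed over to the application itself
    ]


# Hypothetical application and argument, used only for illustration.
command = spark_submit_args("/opt/jobs/preprocess.py", ["/tmp/job.json"])
print(" ".join(command))
```

All options have to precede the application path, because everything after it is passed to the application rather than interpreted by spark-submit.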
4.3.3 Small files problem
As has already been explained in section 3.1.3, the HDFS is ineffective
at storing a large number of small files, because it uses large data blocks (e.g.
128 MB). It works significantly better with a smaller number of big files.
In order to be able to execute Spark computational jobs over the
LAMOST-DR1 spectra archive, at first it is necessary to copy the whole archive
to the HDFS to make the data available on all cluster nodes. The problem is
that the LAMOST-DR1 archive is comprised of millions of small spectra files
that the HDFS cannot handle.
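The scale of the problem can be estimated with a short calculation. The file count and the archive size are taken from the text; the 128 MiB block size is an assumption based on the example value above, and the sketch relies on the fact that a file smaller than one block still needs its own block entry.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2

file_count = 2_202_000       # spectra in LAMOST-DR1 (from the text)
archive_bytes = 189 * GIB    # uncompressed archive size (from the text)
block_bytes = 128 * MIB      # assumed HDFS block size

# Average size of one spectrum file in KiB.
average_file_kib = archive_bytes / file_count / 1024

# Stored one by one, every small file occupies its own block entry.
blocks_as_small_files = file_count

# Stored as ideally packed large files, the same data needs far fewer blocks.
blocks_if_merged = math.ceil(archive_bytes / block_bytes)

print(round(average_file_kib, 1))
print(blocks_as_small_files, blocks_if_merged)
```

Under these assumptions the average file is about 90 KiB, which agrees with the stated file size, and the NameNode would have to track 2,202,000 block entries instead of 1,512.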
In order to solve the problem, it is necessary to find a way to merge
multiple small files together to make a big file. In some problem instances
this task is trivial. For instance, it is easy to merge multiple CSV files
together by appending them one after another. However, some file formats are
not mergeable, as they have a complex structure. The FITS format is
unfortunately one of these formats; therefore, it is necessary to find a better
way to merge astronomical spectra FITS files together.
4.3.3.1 SequenceFile
SequenceFile seems to be a good solution to the small files problem. It is a
flat file consisting of binary key/value pairs, and methods for its reading and
writing are part of the Hadoop API [34]. In this context, multiple astronomical
spectra files would be merged into a single SequenceFile, where the key would
be the name of the original spectrum file and the value would be the content of
the spectrum file itself. SequenceFile stores the key/value pairs serialized in
a binary format one after another. It does not offer the ability to quickly find
a desired key (i.e. file name), as it has no index of the keys stored inside the
SequenceFile; however, this functionality is not even required in this case,
as all spectra need to be processed.
Unfortunately, there is a serious problem that makes the deployment of
the SequenceFile format almost impossible. Apache Spark jobs can be
implemented in three programming languages – Java, Scala and Python. The
SequenceFile API methods have been programmed in the Java language and
these methods utilize the Java Serialization mechanism. This API can be used
naturally in Java and also in Scala, as Scala runs on the Java Virtual Machine
and can call any Java API. The problem is the Python programming language,
for it does not have the Serialization mechanism from the Java language and thus
it naturally has no implementation of the SequenceFile format. It is expected
that the Python programming language could be used for programming a Spark
job; therefore, it is necessary to find a better way of merging spectra files.
4.3.3.2 Apache Avro
Apache Avro is a data serialization system that can be utilized to solve the
small files problem. Avro relies on schemas that are written in the JSON
format. Every file that was written in the Avro format contains, apart from
the data, the JSON schema itself. Data in the Avro format are saved in a
compact binary format derived from the JSON format and they are stored as a
sequence of rows. Every row represents one record with the format matching
the defined Avro schema. Avro has APIs written in many languages
including Java, Scala and Python. [35]
    {
      "type": "record",
      "name": "FitsFiles",
      "aliases": ["Fits"],
      "fields": [
        {"name": "name", "type": "string", "doc": "Fits file name"},
        {"name": "content", "type": "bytes",
         "doc": "Binary content of the fits file"}
      ]
    }
Figure 4.5: Apache Avro schema JSON
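    from avro.datafile import DataFileWriter
    from avro.io import DatumWriter
    ...
    # schema and filenames are prepared in the elided part of the tool
    writer = DataFileWriter(open("output.avro", "wb"),
                            DatumWriter(), schema, codec="deflate")
    for f in filenames:
        # read the whole spectrum file as raw bytes
        with open(f, "rb") as fd:
            content = fd.read()
        # one Avro record per spectrum: file name and binary content
        writer.append({"name": f, "content": content})
    writer.close()
    ...
Figure 4.6: Fragment of code serializing spectra files to the Avro format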
Let’s illustrate the Avro serialization format on the current problem. It is
basically necessary to achieve the same functionality as in the SequenceFile
format. The designed schema can be seen in figure 4.5. The schema contains
the definition of two fields (i.e. two columns) – the first (name) specifies the
name of the original spectrum file and the second (content) specifies the binary
content of the file itself. The only action that remains is to use this schema to
serialize spectra from the LAMOST-DR1 archive to the Avro format and to push
the Avro files to the HDFS. A simple tool named spectra-avro-serializer
has been implemented for this purpose in the Python language. The most
important fragment of code of this tool, which takes the schema and spectra
files and serializes them to a single Avro file, can be seen in figure 4.6.
The whole archive has been processed by the spectra-avro-serializer
tool. Instead of 2,202,000 small spectra files, the archive now consists of
only 1,169 files in the serialized Avro format. Moreover, as can be seen in
figure 4.6, the deflate compression codec has been utilized to make the
archive even smaller. Instead of 189 GiB it takes up only 85.7 GiB of storage
space. All newly created Avro files have been moved to the HDFS and can
now be used as a data source of any Apache Spark job.
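The effect of the serialization can be summarised with a few derived figures. All input numbers are taken from the text above; the derived values are plain arithmetic, not measurements of the actual files.

```python
original_files = 2_202_000   # FITS files before serialization
avro_files = 1_169           # Avro files after serialization
original_gib = 189.0         # uncompressed archive size in GiB
compressed_gib = 85.7        # size of the deflate-compressed Avro files in GiB

# How many spectra one Avro file holds on average.
spectra_per_avro_file = original_files / avro_files

# Average size of one Avro file in MiB.
average_avro_file_mib = compressed_gib * 1024 / avro_files

# Fraction of the original storage space that is still needed.
size_ratio = compressed_gib / original_gib

print(round(spectra_per_avro_file), round(average_avro_file_mib, 1),
      round(size_ratio, 3))
```

On average one Avro file holds roughly 1,880 spectra and takes about 75 MiB, and the archive shrinks to about 45 % of its original size.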
4.3.4 VO-CLOUD integration
One of the goals of this work is to find a way to integrate the Spark job
submitting feature utilizing the HDFS into the existing solution of the
VO-CLOUD system. Currently there is only one implemented tool that utilizes
Apache Spark – the preprocessing tool named vocloud_spark_import.
However, more tools are expected in the future that would utilize
the output of the preprocessing tool and produce significant results. Therefore,
it is necessary to design the integration solution in a general way to allow an
easy adoption of new Spark job types.
Every Spark job method is expected to work in the following way:
• The binaries of the Spark job (hereinafter Application) are prepared in
some specific directory on the server.
• The Application is written in either the Java, Scala or Python
programming language.
• The Application expects to be executed with exactly one parameter –
the path to the JSON configuration file.
• The JSON configuration contains parameters that define the behaviour of
the Application. It is provided by a user.
• Every individual Application could require a different set of the
parameters passed to spark-submit.
• A user can amend the spark-submit parameters.
• The Application takes as an input data stored inside the HDFS.
• The Application produces output to the HDFS.
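The calling convention listed above can be sketched as a minimal Application entry point. This is only an illustration: the function names are hypothetical, and the configuration keys dataset and output are borrowed from the example job configuration in Appendix F rather than prescribed by the design.

```python
import json


def load_job_config(argv):
    """Read the JSON configuration whose path is the single parameter."""
    if len(argv) != 2:
        raise ValueError("expected exactly one parameter: the JSON config path")
    with open(argv[1], encoding="utf-8") as fd:
        return json.load(fd)


def main(argv):
    config = load_job_config(argv)
    # Both values are expected to be HDFS locations (hypothetical keys).
    dataset = config["dataset"]
    output = config["output"]
    # A real Application would create its Spark context here, read the
    # dataset from the HDFS and write the results to the output location.
    return dataset, output
```

A script built this way would call main(sys.argv) at start-up, so that spark-submit only has to append the configuration path as the last argument.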
In the current state of deployment, the VO-CLOUD system is deployed
on a server that is also a part of the Hadoop cluster. This generally does
not have to be true, as the cluster could theoretically run on a different set of
servers. The integration solution must be designed generally to also meet this
requirement.
4.3.4.1 Spark Worker
The solution has been designed as a new type of the VO-CLOUD Worker
named Spark Worker. Whereas the Universal Worker focuses on a general
execution of processes, the Spark Worker is deeply focused on an execution

Bibliography
[1] Hanisch, R.; Quinn, P. International Virtual Observatory Alliance
[online]. The IVOA, [cit. 2017-04-27]. Available from:
http://www.ivoa.net/about/TheIVOA.pdf
[2] IVOA. What is the IVOA [online]. [cit. 2017-04-27]. Available from:
http://ivoa.net/about/what-is-ivoa.html
[3] Oracle. Java Platform, Enterprise Edition; The Java EE Tutorial;
Release 7 [online]. September 2014, [cit. 2014-05-05]. Available from:
https://docs.oracle.com/javaee/7/JEETT.pdf
[4] Oracle. Java(TM) EE 7 Specification APIs [online]. [cit. 2017-04-28].
Available from: http://docs.oracle.com/javaee/7/api/
[5] Koza, J. Design and implementation of a distributed platform for
data mining of big astronomical spectra archives. Bachelor’s thesis,
Czech Technical University in Prague, Faculty of Information Technology,
Prague, 2015, doi:10.5281/zenodo.44641.
[6] WWW Consortium. Extensible Markup Language (XML) 1.0 (Fifth
Edition) [online]. November 2008, [cit. 2017-04-29]. Available from:
http://www.w3.org/TR/REC-xml/REC-xml-20081126-review.html
[7] Fielding, R. T.; Taylor, R. N. Principled Design of the Modern Web
Architecture. ACM Trans. Internet Technol., May 2002: pp. 115–150,
ISSN 1533-5399, doi:10.1145/514183.514185. Available from:
http://doi.acm.org/10.1145/514183.514185
[8] Harrison, P.; Rixon, G. IVOA Recommendation: Universal Worker
Service Pattern Version 1.0. ArXiv e-prints, October 2011, 1110.0510.
Available from: http://adsabs.harvard.edu/abs/2011arXiv1110.0510H
[9] Coulouris, G.; Dollimore, J.; et al. Distributed Systems: Concepts and
Design (5th Edition). Pearson, 2011, ISBN 0132143011.
[10] Tody, D.; Dolensky, M.; et al. IVOA Recommendation: Simple
Spectral Access Protocol Version 1.1. ArXiv e-prints, March
2012, 1203.5725. Available from:
http://adsabs.harvard.edu/abs/2012arXiv1203.5725T
[11] Laurent, M.; Bonnarel, F.; et al. IVOA Recommendation: DataLink
Protocol Version 1.0 [online]. The IVOA, May 2013, [cit. 2017-04-30].
Available from:
http://www.ivoa.net/documents/Notes/DataLink/20130502/NOTE-DataLinkProposal-1.0-20130502.pdf
[12] Palička, A. Application of Random Decision Forests in
Astroinformatics. Bachelor’s thesis, Czech Technical University in Prague,
Faculty of Information Technology, Prague, 2014.
[13] Lopatovský, L. Application of Self-Organizing Maps in
Astroinformatics. Bachelor’s thesis, Czech Technical University in Prague,
Faculty of Information Technology, Prague, 2014.
[14] Mrkva, L. VO-KOREL, server for astronomical cloud computing.
Bachelor’s thesis, Czech Technical University in Prague, Faculty of
Information Technology, Prague, 2012.
[15] Docker Inc. What is Docker [online]. [cit. 2017-05-02]. Available from:
https://www.docker.com/what-docker
[16] Docker Inc. What is a Container [online]. [cit. 2017-05-02]. Available
from: https://www.docker.com/what-container
[17] The Apache Software Foundation. What is Apache Hadoop? [online].
[cit. 2017-05-02]. Available from: http://hadoop.apache.org/
[18] The Apache Software Foundation. HDFS Architecture Guide [online].
[cit. 2017-05-02]. Available from:
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
[19] White, T. The Small Files Problem [online]. February 2009,
[cit. 2017-05-02]. Available from:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
[20] The Apache Software Foundation. Spark Overview [online].
[cit. 2017-05-04]. Available from: http://spark.apache.org/docs/latest/
[21] The Apache Software Foundation. Machine Learning Library (MLlib)
Guide [online]. [cit. 2017-05-04]. Available from:
http://spark.apache.org/docs/1.6.3/mllib-guide.html
[22] Penchikala, S. Big Data Processing with Apache Spark [online].
[cit. 2017-05-04]. Available from:
https://www.infoq.com/articles/apache-spark-introduction
[23] Jupyter Team. The Jupyter Notebook [online]. 2015, [cit. 2017-05-04].
Available from:
http://jupyter-notebook.readthedocs.io/en/latest/notebook.html
[24] Jupyter Team. What is the Jupyter Notebook? [online]. 2015,
[cit. 2017-05-04]. Available from:
http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html
[25] Tennyson, J. Astronomical Spectroscopy: An Introduction to the Atomic
and Molecular Physics of Astronomical Spectra (Imperial College Press
Advanced Physics Texts). Imperial College Press, 2005, ISBN 1860945139.
[26] Allen, S.; Wells, D. MIME Sub-type Registrations for Flexible Image
Transport System (FITS). RFC 4047, RFC Editor, April 2005.
[27] Ochsenbein, F.; Williams, R.; et al. IVOA Recommendation: VOTable
Format Definition Version 1.3. 2011, arXiv:1110.0524.
[28] Astropy Collaboration; Robitaille, T. P.; et al. Astropy: A community
Python package for astronomy. Astronomy and Astrophysics, volume 558,
Oct. 2013: A33, doi:10.1051/0004-6361/201322068, 1307.6212.
[29] Hunter, J. Matplotlib: A 2D graphics environment. Computing
in Science and Engineering, volume 9, no. 3, 2007:
pp. 99–104, doi:10.1109/MCSE.2007.55, cited By 1106. Available from:
https://www.scopus.com/inward/record.uri?eid=2-s2.0-34247493236&doi=10.1109%2fMCSE.2007.55&partnerID=40&md5=29e85ef102f6f3e89c7c074bcf360684
[30] The Tornado Authors. Tornado Documentation; Release 4.5.1
[online]. April 2017, [cit. 2017-05-06]. Available from:
https://media.readthedocs.org/pdf/tornado/stable/tornado.pdf
[31] Fette, I.; Melnikov, A. The WebSocket Protocol. RFC 6455, RFC Editor,
December 2011. Available from:
http://www.rfc-editor.org/rfc/rfc6455.txt
[32] Project Jupyter team. JupyterHub Documentation; Release 0.7.2
[online]. February 2017, [cit. 2017-05-07]. Available from:
https://media.readthedocs.org/pdf/jupyterhub/stable/jupyterhub.pdf
[33] National Astronomical Observatories. LAMOST Telescope [online].
2012, [cit. 2017-05-08]. Available from:
http://www.lamost.org/public/instrument?locale=en
[34] The Apache Software Foundation. Apache Hadoop Main 2.7.3
API [online]. 2016, [cit. 2017-05-08]. Available from:
https://hadoop.apache.org/docs/stable/api/
[35] The Apache Software Foundation. Apache Avro(TM) 1.8.1
Documentation [online]. 2016, [cit. 2017-05-08]. Available from:
https://avro.apache.org/docs/current/
Appendix A
Acronyms
API Application Programming Interface
CPU Central Processing Unit
CSV Comma-Separated Values
DB DataBase
DR1 Data Release 1
EE Enterprise Edition
EJB Enterprise JavaBean
FITS Flexible Image Transport System
FR Functional Requirement
FTP File Transfer Protocol
GUI Graphical User Interface
HDFS Hadoop Distributed File System
HDUs Header and Data Units
HTML HyperText Markup Language
HTTP HyperText Transfer Protocol
HTTPS HyperText Transfer Protocol Secure
IVOA International Virtual Observatory Alliance
JPA Java Persistence API
JPQL Java Persistence Query Language
JSF JavaServer Faces
JSON JavaScript Object Notation
LAMOST Large Sky Area Multi-Object Fibre Spectroscopic Telescope
NFR Non-functional Requirement
ORM Object-Relational Mapping
RAM Random Access Memory
RDF Random Decision Forests
REST Representational State Transfer
SOM Self-Organizing Maps
SPLAT Spectral Analysis Tool
SQL Structured Query Language
SSAP Simple Spectral Access Protocol
SSH Secure Shell
TCP Transmission Control Protocol
URI Uniform Resource Identifier
URL Uniform Resource Locator
UWS Universal Worker Service
VO Virtual Observatory
XHTML Extensible HyperText Markup Language
XML Extensible Markup Language
XSD XML Schema Definition
YARN Yet Another Resource Negotiator
Appendix B
Contents of enclosed DVD
readme.txt ................ the file with DVD contents description
src ....................... the directory of source codes
  impl .................... implementation sources
    repositories.txt ...... the file containing list of GitHub repositories
    spectra-avro-serializer ... Avro serializer tool sources
    spectraviewer ......... spectraviewer tool sources
    vocloud ............... VO-CLOUD master server and workers sources
    vocloud-authenticator ... VocloudAuthenticator sources
    vocloud-jupyterhub .... vocloud-jupyterhub sources
    vocloud_spark_import .. vocloud_spark_import sources
  thesis .................. the directory of LaTeX source codes of the thesis
text ...................... the thesis text directory
  thesis.pdf .............. the thesis text in PDF format
  zzp.txt ................. the thesis task in a plain text format
Appendix C
Universal worker XML configuration file schema
    <?xml version="1.0" encoding="utf-8"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://vocloud.ivoa.cz/universal/schema"
        xmlns:tns="http://vocloud.ivoa.cz/universal/schema"
        elementFormDefault="qualified">
      <xsd:complexType name="worker">
        <xsd:sequence>
          <xsd:element name="identifier" type="xsd:token"/>
          <xsd:element name="description" type="xsd:string"/>
          <xsd:element name="restricted" type="xsd:boolean" default="false"/>
          <xsd:element name="binaries-location" type="xsd:string"/>
          <xsd:element name="exec-command" type="tns:command-list"/>
        </xsd:sequence>
      </xsd:complexType>
      <xsd:complexType name="command-list">
        <xsd:sequence>
          <xsd:element name="command" type="xsd:string" maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:complexType>
      <xsd:element name="uws-settings">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="vocloud-server-address" type="xsd:anyURI"/>
            <xsd:element name="local-address" type="xsd:anyURI"/>
E. Master server README file
• Click Save
• Now click on the newly created Login module
• Click on Module Options
• Add the following key=value pairs:
– dsJndiName = java:jboss/datasources/vocloud
– principalsQuery = select pass from useraccount where username=?
– rolesQuery = select groupName, 'Roles' from useraccount where username=?
– hashAlgorithm = SHA-256
– hashEncoding = hex
8. Create master server’s vocloud.war package
• Navigate to the VO-CLOUD’s master server application’s directory
• Execute mvn package
• Package should be now created in target/vocloud.war
9. Deploy vocloud.war package to the WildFly server
• Log in to the WildFly’s admin console
• Navigate to section Deployments
• Click Add
• Select vocloud.war file
• Submit
• Enable the newly deployed application
VO-CLOUD master server should now be running at
http://localhost:8080/vocloud
10. Create admin account
• Open VO-CLOUD master server application in web browser
• Click Register
• Register a new account with username admin
This account now has administrator privileges.

Appendix F
Spark worker README file
F.1 Spark worker
F.1.1 Requirements
• JDK 7+
• Java application server supporting Java servlet technology (Tomcat,
WildFly, ...)
• Maven tool (if building is necessary)
• Spark deployable application for each Spark worker type
F.1.2 Install guide
For instance I will use Debian amd64 with WildFly 8.2 application server,
JDK 8 and Maven 3.1.
1. Install JDK 8
• Download JDK from
http://www.oracle.com/technetwork/java/javase/downloads/index.html
as an archive, for example jdk-8u45-linux-x64.tar.gz
• Extract archive to /usr/lib/jvm
• Set up environment variables for Java – add these lines to the end
of /etc/profile:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.45
export PATH=$JAVA_HOME/bin:$PATH
2. Install WildFly 8.2.0
• Download zip from http://wildfly.org/downloads/
• Extract archive to /usr/local
• In the newly extracted wildfly directory execute bin/add-user.sh
and set up a new WildFly administering user.
3. Start WildFly by executing bin/standalone.sh. Server should successfully
start. If everything went OK:
• Server is running on http://localhost:8080/
• Admin console on http://localhost:9990/
4. Configure spark-worker configuration file (optional step if you want a
different configuration than the one in the pre-built archive)
• Download sources of spark-worker
• Go to src/main/resources/
• Adjust uws-config.xml file
• Go back to sources root
• Execute command mvn package
• Worker is compiled and the deployable archive is created in
target/spark-worker.war
5. Deploy spark worker to WildFly
• Open WildFly admin console on http://localhost:9990/
• Login with the credentials of administrating user
• Navigate to Deployments section
• Click Add
• Select deployable spark-worker.war archive
• Click OK
• Enable the newly deployed application
UWS service should now be running on
http://localhost:8080/spark-worker/uws
Note: This is only a description of the spark-worker application, which serves
as the mediator between the master server and the spark-submit script. In order
to make a worker fully functional you have to set proper configuration values
in the UWS configuration file matching your running Spark instance.
F.1.3 Configuration file description
Configuration of the Spark worker is defined by the XML file containing all
necessary information for the Spark worker deployment. The schema of the
XML configuration file is specified by an XSD file located in
src/main/resources/configSchema.xsd.
Let us explain the configuration file format on the example:
    <?xml version="1.0" encoding="utf-8"?>
    <ns:uws-settings
        xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
        xmlns:ns='http://vocloud.ivoa.cz/spark/schema'
        xsi:schemaLocation='http://vocloud.ivoa.cz/spark/schema configSchema.xsd'>
      <ns:vocloud-server-address>http://localhost:8080/vocloud-betelgeuse</ns:vocloud-server-address>
      <ns:local-address>http://localhost:8080</ns:local-address>
      <ns:spark-executable>/opt/spark/bin/spark-submit</ns:spark-executable>
      <ns:hadoop-default-fs>hdfs://betelgeuse:9000</ns:hadoop-default-fs>
      <ns:max-jobs>4</ns:max-jobs>
      <ns:description>Spark UWS worker</ns:description>
      <ns:environment>
        <HADOOP_CONF_DIR>/opt/hadoop/etc/hadoop</HADOOP_CONF_DIR>
      </ns:environment>
      <ns:submit-params>
        <conf name="spark.driver.maxResultSize">12g</conf>
        <conf name="spark.yarn.executor.memoryOverhead">4096</conf>
        <master>yarn</master>
        <driver-memory>4g</driver-memory>
        <deploy-mode>client</deploy-mode>
        <num-executors>5</num-executors>
        <executor-cores>3</executor-cores>
        <executor-memory>4g</executor-memory>
      </ns:submit-params>
      <ns:workers>
        <ns:worker>
          <ns:identifier>spark-preprocessing</ns:identifier>
          <ns:description>Spark preprocessing</ns:description>
          <ns:submit-params>
            <packages>com.databricks:spark-avro_2.10:2.0.1</packages>
            <py-files>
              /home/hadoop/workflow-test/preprocessing/vocloud_spark_import/dist/vocloud_spark_preprocess-0.1.0-py2.7.egg
            </py-files>
          </ns:submit-params>
          <ns:submit-target>
            /home/hadoop/workflow-test/preprocessing/vocloud_spark_import/bin/vocloud_preprocess.py
          </ns:submit-target>
        </ns:worker>
      </ns:workers>
    </ns:uws-settings>
• vocloud-server-address [optional] – Specifies the URL address of the
deployed vocloud server. This URL is necessary when the worker needs
to download some data from the vocloud server. Note that in order to
do so you will have to arrange network visibility from the worker to the
master server and vice versa.
• local-address – Hostname URL of the worker server from the master
server’s point of network view.
• spark-executable – Path to the spark-submit script on the filesystem.
• hadoop-default-fs – URL locator of the HDFS filesystem.
• max-jobs – Maximum count of jobs that this worker allows to be run
concurrently. Note that the Spark execution manager (e.g. YARN) can have
additional restrictions on the count of jobs and on resource requirements.
• description – Description of this UWS worker.
• environment [optional] – The sequence of optional tags setting the
environment variables to be passed to the spark-submit script. In this
instance the HADOOP_CONF_DIR variable is set to be able to use the
--master yarn parameter properly.
• submit-params [optional] – This complex tag can be either in the root
uws-settings tag or in the worker tag (see later). It specifies implicit
parameters to be passed to spark-submit. Parameters from the
root tag can be overridden by the parameters specified in the worker tag,
and both parameter specifications can be overridden by the parameters
specified in the job’s configuration file. Parameters are specified in the
following format:
<param-name>param-value</param-name>
This statement is translated to --param-name param-value in the
spark-submit script. Note: the <conf> tag has a special form:
<conf name="conf-name">conf-value</conf>
that is translated to --conf conf-name=conf-value. There can be
multiple <conf> tags.
• workers – Contains a sequence of <worker> tags.
• worker – Contains the configuration of a single worker type instance. It
contains the following tags:
– identifier – Identification of the worker. Must not contain a space
character.
– description – Description of the worker.
– submit-params – Same as in the root tag.
– submit-target – Path to the file that should be passed to the
spark-submit script.
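The translation and override rules for submit-params can be illustrated with a short sketch. It is not the Spark worker's actual implementation: the function names are hypothetical, XML namespaces are left out of the fragment for brevity, and the job-level override is a made-up value.

```python
import xml.etree.ElementTree as ET


def read_submit_params(element):
    """Split a <submit-params> element into plain parameters and <conf> entries."""
    params, confs = {}, {}
    if element is None:
        return params, confs
    for child in element:
        value = (child.text or "").strip()
        if child.tag == "conf":
            confs[child.get("name")] = value
        else:
            params[child.tag] = value
    return params, confs


def build_arguments(*layers):
    """Merge the layers in order (later layers win) and emit the arguments."""
    params, confs = {}, {}
    for layer_params, layer_confs in layers:
        params.update(layer_params)
        confs.update(layer_confs)
    arguments = []
    for name, value in confs.items():
        arguments += ["--conf", f"{name}={value}"]
    for name, value in params.items():
        arguments += [f"--{name}", value]
    return arguments


# Simplified fragment: namespaces are omitted to keep the sketch short.
CONFIG = """
<uws-settings>
  <submit-params>
    <conf name="spark.driver.maxResultSize">12g</conf>
    <master>yarn</master>
    <num-executors>5</num-executors>
  </submit-params>
  <worker>
    <submit-params>
      <num-executors>3</num-executors>
    </submit-params>
  </worker>
</uws-settings>
"""

root = ET.fromstring(CONFIG)
root_layer = read_submit_params(root.find("submit-params"))
worker_layer = read_submit_params(root.find("worker/submit-params"))
job_layer = ({"executor-cores": "4"}, {})  # hypothetical job-level override

arguments = build_arguments(root_layer, worker_layer, job_layer)
print(" ".join(arguments))
```

In this sketch the worker-level value of num-executors replaces the root-level one, while the root-level master and conf entries are inherited unchanged.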
F.1.4 Job configuration
The following JSON is an example of the Spark job configuration.
    {
      "download_files": [
        {
          "urls": [
            "vocloud://DATA/allspec-ond700-prep/prep.csv",
            "vocloud://DATA/allspec-ond700-prep/prep2.csv"
          ],
          "folder": "/user/test/input1/"
        }, {
          "urls": ["vocloud://DATA/folder/s .csv"],
          "folder": "/user/test/input2/"
        }
      ],
      "spark_params": {
        "num-executors": "2",
        "executor-cores": "4",
        "conf": {
          "spark.driver.maxResultSize": "12g",
          "spark.yarn.executor.memoryOverhead": "4096"
        }
      },
      "job_config": {
        "dataset": "hdfs:///user/workflow-test/lof-input/preprocessed.csv",
        "min_pts": 15,
        "output": "hdfs:///user/workflow-test/output/lof_kepler-out.csv"
      },
      "copy_output": [
        {
          "path": "/user/workflow-test/output/lof_kepler-out.csv",
          "output_name": "preprocessed.csv",
          "merge_parts": true
        }
      ]
    }
Most of the configuration JSON file is optional. The only mandatory part
is the job_config object that specifies the configuration file for the Spark
application. The content of this object will be written to a temporary file
and its path will be passed to the spark-submit script as the last parameter.
If the configuration does not contain the copy_output item, the whole
configuration file is considered as the config for the spark-submit script – in
this case it would be:
    {
      "dataset": "hdfs:///user/workflow-test/lof-input/preprocessed.csv",
      "min_pts": 15,
      "output": "hdfs:///user/workflow-test/output/lof_kepler-out.csv"
    }
• download_files – Specifies files that should be downloaded from the
vocloud filesystem (or some other URL) and saved to the HDFS at the
specified path before the Spark job itself is executed. It must contain an
array where each item is an object containing two mandatory items:
– urls – Array of strings containing the remote file paths. It supports
the http/https protocols, and if the path has the scheme vocloud the
files are downloaded from the vocloud’s filesystem. Note: in order to
do so it is necessary that the worker has the path to the vocloud
server properly set and the server is directly visible on the network.
– folder – Target path on HDFS where the files specified in the urls
part should be saved. The save fails if the path already exists.
Note: In order to be able to download files into the HDFS it is necessary
that the worker application has write permission to the HDFS properly
set up. This is usually done by adding the user under which the worker
application is started to the supergroup group.
• spark_params – Allows the user to override any parameters passed to the
spark-submit script. It contains a JSON object where each item "name":
"value" is translated to the parameter --name value. The only
exception is an item named conf that, if present, must contain an additional
JSON object where each item "name": "value" is translated to --conf
name=value. Parameters here can override the default ones specified in
the XML configuration file.
• job_config – Specifies the configuration of the Spark job itself. See
above.
• copy_output – Allows the user to obtain files from the HDFS back to the
vocloud. It must contain a JSON array of JSON objects containing the
following items:
– path – Path to the file or folder on the HDFS.
– output_name [optional] – Name of the copied file or directory. If
not present, the worker tries to derive the file/folder name from the
path parameter.
– merge_parts [optional] – Spark jobs usually produce results as a
folder containing part-xxx files. If this item is set to true, the
worker merges these parts together to produce a single file. The
default value is false.
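The effect of merge_parts can be sketched as follows. This is only an illustration on a local directory: the real worker operates on the HDFS, and the function name as well as the assumption that all part files start with the prefix part- are hypothetical.

```python
import os
import shutil


def merge_parts(folder, target_path):
    """Concatenate all part files of a result folder into a single file.

    Files are appended in lexicographic order of their names; anything
    that does not start with "part-" (e.g. a _SUCCESS marker) is skipped.
    """
    part_names = sorted(
        name for name in os.listdir(folder) if name.startswith("part-")
    )
    with open(target_path, "wb") as merged:
        for name in part_names:
            with open(os.path.join(folder, name), "rb") as part:
                shutil.copyfileobj(part, merged)
    return part_names
```

Sorting by name keeps the parts in the order in which Spark numbered them; a header line repeated in every part would have to be handled separately.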