How to make your messy data usable?
Diana Pilvar
Data Manager
Institute of Computer Science
University of Tartu
Training coordinator
ELIXIR-Estonia
[email protected]
General Information
Please write your name for the attendance check
If you have questions, you can ask them right away
Learning outcomes
Describe spreadsheet best practices
Compare Excel and OpenRefine
Apply transforms (cell editing, column editing, transposing) in OpenRefine
Write simple expressions in OpenRefine
Match your dataset with that of an external source
Poll
What does a good reusable spreadsheet look like?
When you re-used somebody else’s tables, how easy was it to understand them?
What kind of mistakes have you found in tables?
Spreadsheet best practices
Based on:
Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989
https://datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes#common-spreadsheet-errors
Principle 1: Be Consistent
DO USE | DO NOT USE (in the same project)
consistent codes for categorical variables | “Male”, “M” and “male”
consistent fixed code for any missing values | “NA” and sometimes “-” and “ ”
consistent variable names | “Glucose_10wk” and “gluc_10weeks” and “10 week glucose”
consistent subject identifiers | “153”, “mouse153”, “Mouse153”, “mouse-153”
consistent data layout in multiple files | different data layouts
consistent file names | “Serum_batch1_date” and “Batch2_serum_date”
consistent format for all dates (YYYY-MM-DD) | “8/1/2015” and “8-1-15”
consistent phrases in your notes | “dead” and “Deceased”
Be careful about extra spaces within cells | blank cell ≠ single space; “male” ≠ “ male ”
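The consistency rules above can also be applied after the fact. The following is a minimal Python sketch, not part of the original slides; the lookup tables SEX_CODES and MISSING_CODES and the function name normalize_sex are made up for illustration and would have to match your own data dictionary.

```python
# Hypothetical lookup table: every accepted spelling maps to one agreed code.
SEX_CODES = {"male": "M", "m": "M", "female": "F", "f": "F"}
# Hypothetical set of spellings that all mean "no value recorded".
MISSING_CODES = {"", "-", "na", "n/a"}

def normalize_sex(cell):
    """Return one consistent code ('M', 'F' or 'NA') for a raw cell value."""
    cleaned = cell.strip().lower()   # " male " and "Male" both become "male"
    if cleaned in MISSING_CODES:
        return "NA"                  # a single fixed code for missing values
    # Unknown spellings are returned trimmed so they stay visible for review.
    return SEX_CODES.get(cleaned, cell.strip())

raw_column = ["Male", "M", " male ", "-", " ", "female"]
clean_column = [normalize_sex(cell) for cell in raw_column]
print(clean_column)
```

Stripping whitespace before the lookup is what makes “male” and “ male ” end up as the same code.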
Examples
Good Name | Good Alternative | Avoid
Max_temp_C | MaxTemp | Maximum Temp (°C)
Precipitation_mm | Precipitation | precmm
Mean_year_growth | MeanYearGrowth | Mean growth/year
sex | sex | M/F
weight | weight | w.
cell_type | CellType | Cell Type
Observation_01 | first_observation | 1st Obs
Principle 2: Use meaningful file names
Try not to use spaces
They make programming harder: the analyst needs to surround every name in quotation marks
Prefer underscores _ and hyphens -
Avoid special characters ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' " |
They often have special meanings in programming languages
Keep it short, but meaningful
Avoid using the word “final”
There will always be a “final_2”
Principle 3: Write Dates as YYYY-MM-DD
global ISO 8601 standard
Microsoft Excel and dates*
Different export problems on Mac and Windows
Solutions: store year, month and day in separate columns, or as YYYYMMDD, or as text with a leading apostrophe ('YYYYMMDD)
Not everything is a date: Format → Cells, choose “Text”
https://xkcd.com/1179/
* https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data.html
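As an illustration of the YYYY-MM-DD rule, here is a small Python sketch that rewrites a few known date spellings into ISO 8601. It is not from the original slides: the list KNOWN_FORMATS is an assumption, and in particular it reads “8/1/2015” in US order (month first), which you must confirm for your own data before converting anything.

```python
from datetime import datetime

# Assumed input formats; "%m/%d/%Y" reads 8/1/2015 as 1 August (US order).
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%m-%d-%y"]

def to_iso(text):
    """Return the date as YYYY-MM-DD, or None if no known format fits."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # this format did not fit, try the next one
    return None       # leave unparseable values for manual review

for raw in ["8/1/2015", "8-1-15", "2015-08-01", "yesterday"]:
    print(raw, "->", to_iso(raw))
```

Returning None instead of guessing keeps ambiguous values visible, so they can be checked by hand.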
Raise a hand if…
You open an Excel file and start typing and nothing happens, and then you select a cell and you can start typing. Where did all of that initial text go?
Well, sometimes it got entered into some random cell, to be discovered later during data analysis.
Principle 9: Do Not Use Font Color or Highlighting as Data
Suspicious data or data that should be ignored
add another column with an indicator variable (e.g., “trusted” or “outlier” with values TRUE or FALSE)
Principle 10: Make Backups
Principle 11: Use Data Validation to Avoid Errors
Check for each column
Do values fall into the expected range?
Example: normal systolic blood pressure range is 90-140 mmHg; values of 300 or 40 are not expected
A list of possible values
Cities in a country
Clinical diagnosis
Text of the expected length
Postal code
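The three kinds of check listed above (expected range, list of possible values, expected text length) can be sketched in Python as follows. This is an illustration only: the column names, the city list and the 5-digit postal code rule are assumptions, not part of the original slides.

```python
# Assumed example rules; the real limits depend on your own data dictionary.
ALLOWED_CITIES = {"Tartu", "Tallinn", "Narva"}   # hypothetical list of values
SYSTOLIC_RANGE = (90, 140)                        # expected range in mmHg
POSTAL_CODE_LENGTH = 5                            # assumed 5-digit postal codes

def validate(row):
    """Return a list of problems found in one row (empty list means valid)."""
    problems = []
    low, high = SYSTOLIC_RANGE
    if not low <= row["systolic"] <= high:
        problems.append("systolic outside expected range")
    if row["city"] not in ALLOWED_CITIES:
        problems.append("city not in list of possible values")
    code = row["postal_code"]
    if len(code) != POSTAL_CODE_LENGTH or not code.isdigit():
        problems.append("postal code has unexpected length or characters")
    return problems

print(validate({"systolic": 120, "city": "Tartu", "postal_code": "51009"}))
print(validate({"systolic": 300, "city": "tartu", "postal_code": "5100"}))
```

Note that the second row fails the city check only because of capitalisation, which ties back to the “be consistent” principle.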
Principle 12: Save the Data in Plain Text Files
Keep a copy in .csv or .tsv
Not pretty, but can be opened with a variety of different programs
Poll
Have you been using these best practices?
Yes, all of them
Yes, some of these practices
I have heard of them, but implementation is hard
No, never heard of them, haven’t used them
Will you try to use them now?
Yes
I will try (no promises)
No (I like my way better)
Summary
Be consistent
Write dates like YYYY-MM-DD
No empty cells, no merged cells
Put just one thing in a cell
Organize the data as a single rectangle
Create a data dictionary
Do not include calculations in the raw data files
Do not use font color or highlighting as data
Choose good file names
Make backups
Use data validation to avoid data entry errors
Save the data in plain text files.
Based on:
Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989
https://datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes#common-spreadsheet-errors
Any ques ions?
Previous name Google Refine, created by Metaweb Technologies, Inc
Acquired by Google in 2010, renamed OpenRefine in October 2012
Available in more than 15 languages (not in Estonian)
Keeps your data private until you are ready to share it
Can be run using the GUI or the command line
Practical lesson on OpenRefine
Exercises I
Reordering and renaming columns
Sorting data
Filtering data
Clustering
Undo/redo
Exporting data
Based on materials (introduction) developed by Owen Stephens on behalf of the British Library. CC-BY 4.0 license.
http://www.meanboyfriend.com/overdue_ideas/wp-content/uploads/2014/11/Introduction-to-OpenRefine-handout-CC-BY.pdf
Similar things done in https://librarycarpentry.org/lc-open-refine/index.html
OpenRefine Installation
https://openrefine.org/download
Install OpenRefine version 3.9.5 (or a newer version).
There are versions for Windows, Linux and Mac OS X, with additional instructions on how to download them.
Windows and Linux need Java to be installed before installing OpenRefine, although there is also a Windows version with Java included.
Mac users need to give permission to the program
Extra guidance: https://openrefine.org/docs/manual/installing
Running OpenRefine
Open the program
A command line window opens - ignore it
Put http://127.0.0.1:3333/ in the web browser
Compatible browsers
Google Chrome
Chromium
Opera
Microsoft Edge
Safari
Minor rendering and performance issues on other browsers such as Firefox.
Internet Explorer is not supported.
File formats accepted by OpenRefine
comma-separated values (CSV) or tab-separated values (TSV)
Text files
Fixed-width columns
JSON
XML
OpenDocument spreadsheet (ODS)
Excel spreadsheet (XLS or XLSX)
PC-Axis (PX)
MARC
RDF data (JSON-LD, N3, N-Triples, Turtle, RDF/XML)
Wikitext
What it looks like
Increasing memory allocation
The default is 1 gigabyte (GB) of memory (1024 MB).
Increase it if you…
feel that OpenRefine is running slowly, or you are getting “out of memory” errors
Have more than one million total cells
Have an input file size of more than 50 megabytes (MB)
Have more than 50 rows per record in records mode
A good practice is to start with no more than 50% of whatever memory is left over after the estimated usage of your operating system
Detailed instructions for Windows, Mac and Linux
https://openrefine.org/docs/manual/installing#increasing-memory-allocation
Creating a project
Click ‘Create Project’
Choose ‘Get Data from This Computer’
Click ‘Choose Files’
Locate the file called ‘Practice_dataset.csv’
Click ‘Next’
Importing data
Test which version works for you
Columns are separated by
Custom → ;
commas (CSV)
Parse next 1 line(s) as column headers
Store blank rows
UTF-8
Saving
OpenRefine saves all of your actions (everything you can see in the Undo/Redo panel).
It doesn’t save
facets
filters
view
number of rows showing
sorting
column collapsing
Autosaving happens every five minutes by default
To save current facets and filters, click Permalink. The project will reload with a different URL, which you can then copy and save elsewhere.
TASK: Undo and re-order
Let’s keep the shelfmark column after all, but reorder the columns again!
If you have undone something, it asks for confirmation to rewrite history
Sorting
Column → Sort...
Sort by publication year, smallest to largest
Now we have a Sort drop-down menu
Unlike in Excel, sorts in OpenRefine are temporary: if you remove the sort, the data will go back to its original ‘unordered’ state.
‘Sort’ drop-down menu
reverse the sort order
remove existing sorts
make sorts permanent
You can sort on multiple columns at the same time.
TASK: Sort author name from A → Z.
See what happened
Try Reverse sort
Remove the Author sort
Go to the Undo/Redo tab
Notice how sorting is absent
Reorder rows permanently by Publication year
Go to the Undo/Redo tab
Sorting can now be undone
Text filter
Column drop-down menu → Text filter
Look at the sidebar on the left
Try typing London
It displays only the rows containing that particular phrase
Try other cities:
Cambridge
After you are done with this facet/filter, close it, or it will affect your future analysis
Facet
Place of publication → Facet → Text facet
Sort by count
Use case
overview of the data in a project
Easy to notice typos/inconsistencies
Click on include. See what it does
You can also change values directly from here using Edit
❗ Data validation principle
Facet
Year of publication → Facet → Text facet
Sort by count
Notice there are brackets where there shouldn’t be
Clustering
The clustering function enables you to find similar values across a facet and merge them together.
Example: “New York” and “new york”
More information about the algorithms: https://openrefine.org/docs/technical-reference/clustering-in-depth
❗ Be consistent principle
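To give an intuition for how key-collision clustering can group “New York” and “new york”, here is a simplified Python sketch of a fingerprint-style keying function. It is an approximation written for this lesson, not OpenRefine’s actual implementation (which also handles accents and other normalisation); the function names fingerprint and cluster are made up for illustration.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Build a simplified key: trim, lowercase, drop punctuation, sort unique words."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values that share a key; return only groups with different spellings."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]

print(cluster(["New York", "new york", "York, New", "London"]))
```

Values whose keys collide are candidates for merging; a human still decides which spelling to keep.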
Task
Year of publication → Edit cells → Cluster and edit - click on Cluster
Try multiple methods and keying functions!
Check if the numeric facet now has all numerical values
If not
Try clustering
Try editing in the text facet
Don’t forget to transform to numbers again
Check the Place of Publication too.
❗ Be consistent principle
❗ Data validation principle
Exporting the workflow
OpenRefine saves every change you make.
JSON (JavaScript Object Notation)
export the JSON script and apply it to other data files
❗ Backups principle
Reproducibility
Exporting the table
In the right corner there is the Export button
Importing the workflow
If you have
multiple files to clean
they all have the same type of errors
they have the same column names
Save the JSON script, open a new file in OpenRefine, paste the script and run it.
This gives you a quick way to clean all of your related data.
Poll
Have you understood things so far?
What remains confusing?
Tip: Ask advice from AI chatbots, they know some OpenRefine! Be careful with GREL commands.
A small break
Practical lesson on OpenRefine
Exercises II
GREL-based transformations
Extensions
Reconciliation
Transforms in OpenRefine
clean
correct
codify
extend your data
NB! Transforms are one click away, no need to write code.
GREL syntax
General Refine Expression Language.
In GREL, functions can use either of these two syntaxes:
functionName(value, options)
value.functionName(options)
where value means the values in the current column
Full notation | Dot notation
functionName(value, options) | value.functionName(options)
trim(value) | value.trim()
length(trim(value)) | value.trim().length()
https://docs.openrefine.org/manual/grel
GREL syntax
Example | Description
cells.FirstName | Access the cell in the column named “FirstName” of the current row
cells["First Name"] | Access the cell in the column called “First Name” of the current row
row.columnNames[4] | Will return the name of the fifth column
Dot notation can be used to access the member fields of variables
For referring to column names that contain spaces, use square brackets instead of dot notation
Square brackets are also used to get substrings and sub-arrays, and single items from arrays
https://docs.openrefine.org/manual/grel
GREL functions
String functions
Length of a string as a number
Take any value type (string, number, date, boolean, error, null) and give a string version of that value
Can use rounding for numbers
Test if a string starts/ends with a certain letter or contains it
Case conversion
Remove leading and trailing whitespace or specified letters (trimming)
Substrings
Find and replace
String parsing and splitting
Encoding and hashing
Detect language
https://docs.openrefine.org/manual/grelfunctions
GREL functions
Boolean functions
Logical operators to evaluate conditions, with the output being True/False
Format-based functions
Jsoup XML and HTML parsing
URI parsing (when given a link)
Array functions
Size of an array
Creating a sub-array
Checking if an array contains the desired string
Reverse an array
Sort an array
Sum an array
Join the items in the array with a separator and return them all as one string
Duplicate removal
https://docs.openrefine.org/manual/grelfunctions
GREL functions
Date functions
Current time according to your system clock
Convert an object to a date
Given two dates, return a number indicating the difference in a given time unit
Change a date by the given amount in the given unit of time
Return part of a date
Math functions
Other functions
Count a given value
Cross-information
https://docs.openrefine.org/manual/grelfunctions
Transformations
Dot notation | Full notation | Description
value.toUppercase() | toUppercase(value) | converts the current value to uppercase
value.toLowercase() | toLowercase(value) | converts the current value to lowercase
value.toTitlecase() | toTitlecase(value) | converts the current value to titlecase (i.e. each word starts with an uppercase character, and all other characters are converted to lowercase)
value.trim() | trim(value) | removes any “whitespace” characters (e.g. spaces, tabs) from the start or end of the current value
value.substring(number from, optional number to) | substring(value, number from, optional number to) | returns the part of the current value that starts at character index (number from) and ends just before index (number to)
value.replace("string to find", "replacement string") | replace(value, "string to find", "replacement string") | finds “string to find” in the current value and replaces it with “replacement string”
"string" + value | "string" + value | adds (concatenates) the text “string” to the front of the current value
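For readers who know Python better than GREL, the transformations in the table above correspond roughly to the following plain Python string operations. This is a comparison sketch, not OpenRefine code; the sample value is made up, and str.title() can differ from toTitlecase() in edge cases such as apostrophes.

```python
value = "  the PICKWICK papers  "      # made-up example cell value

upper = value.upper()                  # like value.toUppercase()
lower = value.lower()                  # like value.toLowercase()
trimmed = value.strip()                # like value.trim()
title = trimmed.title()                # roughly like value.toTitlecase()
first_three = trimmed[0:3]             # like value.substring(0, 3)
replaced = trimmed.replace("papers", "club")   # like value.replace("papers", "club")
prefixed = "Title: " + trimmed         # like "Title: " + value

print(title)
print(first_three)
print(replaced)
print(prefixed)
```

As in GREL, none of these calls changes the original value; each returns a new string.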
TASK: Harmonize author names
Put the names in Title Case:
Use Facets (to see what is wrong)
Transform
GREL: value.toTitlecase()
Python: return value.title()
❗ Be consistent principle
Answer: Title Case
Use Facets and the GREL expression value.toTitlecase() to put the author names in Title Case
Let’s look at the “Author” column (Text facet)
We see several things:
Some names are written FIRSTNAME LASTNAME
Some names are written LASTNAME, FIRSTNAME
Some have one name in capital letters
Some have all names in capital letters
Click the dropdown menu on the “Author” column
Choose Edit cells → Transform…
In the Expression box type value.toTitlecase()
Click OK
❗ Be consistent principle
TASK: Using Boolean functions to fix author names
A crude test → looking for commas
Author → Facet → Custom text facet...
In the Expression box type value.contains(",") (GREL) or return bool("," in value) (Python)
The ‘contains’ function outputs a Boolean value, so the facet contains ‘false’ and ‘true’.
The Python one gives 0 and 1: 0 meaning false, 1 meaning true
❗ Be consistent principle
TASK: Using Boolean functions to fix author names
Include the “true” values in the boolean facet
On the “Author” column, use the dropdown menu and select Edit cells → Transform
Expression: value.split(", ") (GREL) or return value.split(", ") (Python)
Include a space after the comma inside the split expression to avoid extra spaces in your author name later
See how this creates an array with two members in each row in the Preview column
UNFORTUNATELY, arrays cannot appear directly in an OpenRefine cell any more
So if you apply the command, nothing happens visually
❗ Be consistent principle
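Because an array cannot be stored in a cell, the split parts have to be joined back into a string in the same expression. The Python sketch below shows the idea outside OpenRefine; it is an illustration under the assumption that names containing a comma are written LASTNAME, FIRSTNAME, and the function name first_last is made up. Inside OpenRefine’s Python/Jython editor the same logic would end with a return statement acting on value.

```python
def first_last(name):
    """Rewrite 'LASTNAME, FIRSTNAME' as 'Firstname Lastname' in title case."""
    if "," not in name:
        # No comma: assume the name is already in FIRSTNAME LASTNAME order.
        return name.strip().title()
    # Split on the first comma only, then trim stray spaces from both parts.
    last, first = [part.strip() for part in name.split(",", 1)]
    return (first + " " + last).title()

for author in ["DICKENS, Charles", "charles DICKENS", "Austen,Jane"]:
    print(author, "->", first_last(author))
```

Trimming each part after the split means it does not matter whether the comma was followed by a space.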
Regular expressions (Regex)
To find both the ‘s’ and ‘z’ spellings of ‘organize/organise’:
/organi.e/
Specify exact numbers of repetitions or a max/min number
Use curly brackets:
/a{2}/ Matches ‘aa’
/a{2,4}/ Matches any of ‘aa’, ‘aaa’, ‘aaaa’
http://www.meanboyfriend.com/overdue_ideas/wp-content/uploads/2014/11/Introduction-to-OpenRefine-handout-CC-BY.pdf
https://docs.openrefine.org/manual/expressions#expressions
Regular expressions
These can be combined with ‘repetition’ operators, which allow you to say how many times a character or pattern is repeated.
Repetition character | Meaning | Explanation/Example
* | The preceding character/expression can be repeated any number of times (including 0) | /.*/ Any text string at all (any character repeated any number of times)
+ | The preceding character/expression can be repeated one or more times | /head\s+rest/ Matches “head rest” (one space), “head  rest” (two spaces), but not “headrest”
? | The preceding character/expression can be repeated 1 or 0 times | /colou?r/ Matches both words “color” and “colour”
{X} | The preceding character/expression can be repeated X number of times | /a{2}/ Matches the letter “a” appearing twice (“aa”)
{X,Y} | The preceding character/expression can be repeated between X and Y times | /a{2,4}/ Matches the letter “a” appearing a minimum of two times or a maximum of four times (“aa”, “aaa”, “aaaa”)
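The repetition operators can be tried out in plain Python with the built-in re module. The sketch below is an illustration added for this lesson; the helper name matches is made up, and it uses re.fullmatch so that the whole text has to fit the pattern.

```python
import re

def matches(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# . stands for any single character, so both spellings match
print(matches(r"organi.e", "organise"), matches(r"organi.e", "organize"))
# \s+ needs at least one whitespace character
print(matches(r"head\s+rest", "head  rest"), matches(r"head\s+rest", "headrest"))
# u? makes the letter u optional
print(matches(r"colou?r", "color"), matches(r"colou?r", "colour"))
# {2,4} allows two to four repetitions
print(matches(r"a{2,4}", "aaa"), matches(r"a{2,4}", "aaaaa"))
```

Note that in Python the pattern is written as a raw string (r"...") and is not wrapped in forward slashes.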
GREL-supported regex
Wrap regex between a pair of forward slashes (/). For example, in
value.replace(/\s+/, " ")
the regular expression is \s+, and the syntax used in the expression wraps it with forward slashes (/\s+/).
\s any whitespace character (spaces, tabs, newlines, etc.)
X+ X occurring one or more times
Do not use slashes to wrap regular expressions outside of a GREL expression.
On the GREL functions page, functions that support regex will indicate that with a “p” for “pattern.”
https://docs.openrefine.org/manual/grel
https://docs.openrefine.org/manual/grelfunctions
Jython-supported regex
Jython is an implementation of the Python programming language designed to run on Java
Interact with regular expressions via the built-in re module in Python
Python code that depends on C bindings will not work in OpenRefine, which uses Java / Jython only. Since Jython is essentially Java, you can also import Java libraries and utilize those.
Expressions must have a return statement
import re
return re.sub(r"\s+", " ", value)
https://docs.openrefine.org/manual/jythonclojure
https://docs.openrefine.org/manual/expressions#jython-supported-regex
Other functions listed here: https://www.pythontutorial.net/python-regex/python-regular-expressions/
How-To: https://docs.python.org/3/howto/regex.html
Same command in GREL
value.replace(/\s+/, " ")
Clojure-supported regex
Clojure is a dialect of the Lisp programming language on the Java platform
Clojure treats code as data and has a Lisp macro system
Clojure regexes are host language regexes. On the Java Virtual Machine (including OpenRefine) you're using Java regexes. In ClojureScript, it's JavaScript regexes.
Regex patterns can be compiled at read-time via the #"pattern" reader macro, or at run time with re-pattern
(clojure.string/replace value #"\s+" " ")
https://docs.openrefine.org/manual/jythonclojure
https://docs.openrefine.org/manual/expressions#clojure-supported-regex
Same command in GREL
value.replace(/\s+/, " ")
Same command in Jython
import re
return re.sub(r"\s+", " ", value)
Regex in OpenRefine
https://github.com/OpenRefine/OpenRefine/wiki/Recipes This page collects OpenRefine recipes, small workflows and code fragments that show you how to achieve specific things with OpenRefine.
Regex examples on your cheat sheet
Here are 2 cheat sheets with regex:
https://code4libtoronto.github.io/2018-10-12-access/GoogleRefineCheatSheets.pdf
https://datenschule.de/files/downloads/workshops/CheatSheet-Open-Refine.pdf
If this is your first time working with regex, I recommend this testing and learning tool: https://regexr.com/ (supports JavaScript & PHP/PCRE RegEx)
Another testing tool: https://regex101.com/
TASK: Extracting dates of publication
Work only with records where “Year of Publication” is blank
“Place of Publication” → Edit column → Add column based on this column
Use the “match” function with a regular expression to find where the “Place of Publication” ends with four digits
Tips:
[-1] use the last part of the array
/ / the regex goes inside these
.*(\d{4}).* the regex
.* any text string at all (any character repeated any number of times)
() group these
\d digits
{4} match 4 of the preceding token
Solution: Extracting dates of publication
Work only with records where “Year of Publication” is blank
Edit cells → Common transforms → To number
Facet → Numeric facet → tick Blank only
“Place of Publication” → Edit column → Add column based on this column
Use the “match” function with a regular expression to find where the “Place of Publication” ends with four digits
Option 1
GREL: value.split(",")[-1]
Python: return value.split(",")[-1]
Keeps any square brackets [] around the year
Option 2
GREL: value.match(/.*(\d{4}).*/).join("")
Python:
import re
return "".join(re.findall(r"(\d{4})", value))
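The regex solution can be tested outside OpenRefine with a few made-up values. The Python sketch below is an illustration only; the function name extract_year and the sample strings are assumptions, and it returns the last four-digit group it finds, which mirrors what the greedy .*(\d{4}).* pattern captures.

```python
import re

def extract_year(place):
    """Return the last group of four digits in the text, or None if there is none."""
    years = re.findall(r"\d{4}", place)
    return years[-1] if years else None

for place in ["London, 1866", "London, [1866]", "London"]:
    print(place, "->", extract_year(place))
```

Unlike the split-on-comma option, this drops any square brackets around the year.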
TASK: Extracting dates of publication
Move the year from “new Date of Publication” into the “Year of Publication” column
Remove the “new DoP” column
Solution: Extracting dates of publication
Move the year from “new Date of Publication” into the “Year of Publication” column
Year of publication → Edit column → Join columns
Choose the 2 columns to be joined
Tick: Write result in selected column
NB! It writes into the column whose drop-down menu you selected at the start. So if you chose the new date column, it will just rewrite the contents into the new column.
Remove the “new DoP” column
New Date of Publication → Edit column → Remove this column
Additional features
Extensions
Extensions have been created by the OpenRefine community to add functionality or provide convenient shortcuts for common uses of OpenRefine. They might be out of date - look at the latest compatible version!
List of extensions: https://openrefine.org/extensions
Selection of useful extensions:
GeoJSON Export
Adds a Graphical User Interface (GUI) that allows you to export OpenRefine data to the GeoJSON format. Supports latitude/longitude coordinates.
FAIR metadata
Supports FAIR metadata by integrating with FAIR Data Point to store your data and export to FAIR.
Stats extension for Google Refine 2.5+
Computes elementary statistics on column data.
Reconciling
Reconciliation is matching your dataset with that of an external source
The external dataset must offer a web service that conforms to the Reconciliation Service API standards
Reconciliation is semi-automated:
OpenRefine matches your cell values as best it can
Human judgment is required to review and approve the results
Typos, whitespace, and extraneous characters will have an effect on the results
Clean and cluster your data before reconciliation
https://docs.openrefine.org/manual/reconciling
Reconciling
You may wish to reconcile in order to:
Fix spelling or variations in proper names
Clean up manually-entered subject headings against authorities
Link your data to an existing dataset
Add to an editable platform such as Wikidata
See whether entities in your project appear in some specific list, such as the Panama Papers.
https://docs.openrefine.org/manual/reconciling
Reconciliation sources
Current list of reconcilable authorities: https://reconciliation-api.github.io/testbench/#/
Further list of sources on the wiki: https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
Ways that you can reconcile against a local dataset: https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources#local-services
You can reconcile against the entire dataset or only the contributions from certain institutions: https://refine.codefork.com/
Reconciling with Wikibase: https://docs.openrefine.org/manual/wikibase/reconciling
Extensions can add reconciliation services, and can also add enhanced reconciliation capacities.
How-to reconcile
Chosen column’s dropdown menu → Reconcile → Start reconciling
If you want to reconcile only some cells in that column, first use filters and facets to isolate them
Reconciliation window
Wikidata is the default service
To add another service, click Add Standard Service... and paste in the URL of a service
You should see the name of the service appear in the list of Services if the URL is correct
https://docs.openrefine.org/manual/reconciling#getting-started
How-to reconcile
Choose “types” (categories)
You can reconcile batches against different types
Time-consuming process, especially with large datasets.
Start with a small test batch
If the cell was successfully matched, it displays text as a single dark blue link.
You should not have to check it manually.
If there is no clear match, one or more candidates are displayed, together with their reconciliation score, with the text in light blue links. You will need to select the correct one.
https://docs.openrefine.org/manual/reconciling#getting-started
How-to reconcile
For each matching decision you make, you have two options:
Match this cell only (one checkmark)
Use the same identifier for all other cells containing the same original string (two checkmarks).
“Preview entities” feature
For matched values the underlying cell value has not been altered - the cell is storing both the original string and the matched entity link at the same time.
https://docs.openrefine.org/manual/reconciling#getting-started
Automatic reconciliation facets
Reconcile → Facets
OpenRefine automatically creates two facets when you reconcile some cells
Numeric facet for “best candidate's score”
Approve them all in bulk by using Reconcile → Actions → Match each cell to its best candidate
Judgment facet
Lets you filter for the cells that haven't been matched
https://docs.openrefine.org/manual/reconciling#reconciliation-facets
Reconciliation facets
Useful for doing successive reconciliation attempts
The information is held in the cells themselves
https://docs.openrefine.org/manual/reconciling#reconciliation-facets
Task: Rodent dataset
Reconcile scientific names with the Encyclopedia of Life (EOL)
Firstly, it gets a standard form of the name or label for the entity.
Secondly, it gets an ID for the entity - in this case a page and numeric id for the scientific name in EOL. This is hidden in the default view, but can be extracted:
In the scientificName column use the dropdown menu to choose Reconcile > Add entity identifiers column...
Give the column the name “EOL-ID”
This will create a new column that contains the EOL ID for the matched entity
Reconcile country, state, and counties against Wikidata
https://datacarpentry.github.io/OpenRefine-ecology-lesson/06-reconciliation.html
BONUS Tasks
Open dataset Ask_a_manager_salary_survey.csv
https://oscarbaruffa.com/messy/
Give good names to variables
Using Facets, find outliers
Example: young but with significant work experience (more than their age)
Harmonise country, county and industry
Replace blanks and 0-s in Additional monetary compensation
Reformat Timestamp
Split race into multiple columns
Fix anything else you see that is messy
Take away message
Excel can do data cleaning, but it will take extra steps and the workflow will not be recorded.
Excel has 477 functions, but it pales in comparison with all the Python and Java libraries you can import with Jython.
In Excel you can use macros written in Visual Basic for Applications (VBA) and write custom functions
Microsoft considers macros a security risk
[Figure: subjective graph depicting the relation between ease of use and the variety of problems a tool can solve.]
Poll
Will you use OpenRefine for messy data cleaning?
Take away message
OpenRefine was made for dealing with messy data
Is open source and free to use
Workflows are reproducible
Keeps your data private
Common transformations allow you to do advanced data cleaning without writing code
Custom transformations can be written in the expression editor
You can combine regexes, functions and controls
OpenRefine supports many extensions that enhance its functionality
You can reconcile data with external sources
Feedback
https://forms.gle/SeZqpXQYB6sG8JG7A
References: ELIXIR and OpenRefine
ELIXIR-Estonia https://elixir.ut.ee/
Subscription list: News about courses and events organised by ELIXIR Estonia https://lists.ut.ee/wws/subscribe/elixir.news?previous_action=edit_list_request
OpenRefine User Manual https://docs.openrefine.org/
OpenRefine cheat sheets https://github.com/OpenRefine/OpenRefine/wiki/Recipes
Introduction to OpenRefine by Owen Stephens ([email protected]) on behalf of the British Library in July 2014 http://www.meanboyfriend.com/overdue_ideas/wp-content/uploads/2014/11/Introduction-to-OpenRefine-handout-CC-BY.pdf
List of tutorials and resources developed outside the OpenRefine user manual
https://github.com/OpenRefine/OpenRefine/wiki/External-Resources
https://openrefine.org/external_resources
Tutorial on reconciliation in OpenRefine https://www.youtube.com/watch?v=q8 deyuNQ
References: Excel limitations and Regex
Excel limitations
https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3
Regex tutorials
https://www.regular-expressions.info/
https://www.codeproject.com/Articles/939/An-Introduction-to-Regular-Expressions
Regex cheat sheets
https://datenschule.de/files/downloads/workshops/CheatSheet-Open-Refine.pdf
https://code4libtoronto.github.io/2018-10-12-access/GoogleRefineCheatSheets.pdf
Regex testing tools
https://regexr.com/
https://regex101.com/
References: Regex
Python regex
https://www.pythontutorial.net/python-regex/python-regular-expressions/
https://docs.python.org/3/howto/regex.html
Jython regex
https://www.jython.org/jython-old-sites/docs/library/re.html
Learning Jython
https://wiki.python.org/jython/LearningJython
Clojure regex tutorial
https://ericnormand.me/mini-guide/clojure-regex
Thank you!