Enhancing complex XML documents with linguistic annotation
Anonymous submission
Abstract
When building corpora from annotated XML documents, the compilers are usually confronted with the incapability of most tools of linguistic analysis and parsing (tokenizers, lemmatizers, PoS-taggers, etc.) to process more than just plain text input. Various single-purpose solutions have been created for this purpose. We tried to develop a general set of scripts to assist with the task of enriching documents containing complex XML annotation with linguistic annotation generated by automatic analyzers. We present the challenges we met and the solutions we chose, discussing their advantages, disadvantages and limits.
Keywords: annotation, TEI, XML, corpus compilation
1. Objectives
When building corpora from XML documents with more or less complex annotation, the compilers are usually confronted with the incapability of most tools of linguistic analysis and parsing (tokenizers, lemmatizers, PoS-taggers, syntactic parsers, etc.) to process more than just plain text input. In addition, their output is also seldom compatible with XML. Various ad-hoc solutions have been created for this purpose at different institutions and within different projects. Within the project, we tried to develop a more general solution: a set of scripts to support merging of complex XML documents with linguistic annotation generated by automatic analyzers.
This task is usually hampered by several factors, such as the requirements (and limits) of the XML format and mutually conflicting requirements of annotation of different aspects of the texts. The more complex the XML annotation of the source, the more difficult is the process of combining it with additional annotation produced by tools of linguistic analysis.
2. Basic situation and requirements on the tagger
The basic situation we had to solve at the beginning was not much more complex than the situation of most corpus compilers – as long as the source is just a simple XML document containing simple text elements such as paragraphs of plain text, the task is rather trivial: extract the contents of the text elements, process them by the analyzer and replace them with the analyzed and annotated output, possibly also converted into an XML format.
But our situation was still slightly more difficult: the paragraphs often contain text annotated with spans indicating highlighted contents that we wanted to keep. As the easiest solution, we decided to remove all the XML annotation, keep track of its original position in the form of a stand-off annotation for a later reconstruction, and process the isolated plain text contents. This solution was loosely inspired by the PAULA XML format developed by the Potsdam University and the Humboldt University in Berlin,[1] which uses a similar basic principle for complex multi-layer stand-off annotation of texts. Thus, the plain text contents were extracted from the XML source and all information about the original annotation was kept in a separate file in the form of a JSON list of tags and their attributes.
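
The split into plain text and stand-off annotation can be illustrated by a minimal sketch. It is not the actual implementation: the function name split_standoff, the regular expressions and the layout of the JSON records are assumptions made for this example only, and comments, CDATA sections and character entities are ignored.

```python
import json
import re

# Simplified detection of XML tags (assumption: double-quoted attributes only).
TAG = re.compile(r'<(/?)([A-Za-z_][\w.-]*)((?:\s+[\w:.-]+\s*=\s*"[^"]*")*)\s*(/?)>')
ATTR = re.compile(r'([\w:.-]+)\s*=\s*"([^"]*)"')

def split_standoff(xml):
    """Return (plain_text, tags); every tag records its offset in the plain text."""
    parts, tags, pos, last = [], [], 0, 0
    for m in TAG.finditer(xml):
        chunk = xml[last:m.start()]
        parts.append(chunk)
        pos += len(chunk)
        closing, name, attrs, empty = m.groups()
        kind = "close" if closing else ("empty" if empty else "open")
        tags.append({"pos": pos, "kind": kind, "name": name,
                     "attrs": dict(ATTR.findall(attrs))})
        last = m.end()
    parts.append(xml[last:])
    return "".join(parts), tags

plain, tags = split_standoff('<p>Some <hi rend="bold">highlighted</hi> text.</p>')
print(plain)                        # Some highlighted text.
print(json.dumps(tags, indent=1))   # the stand-off list, ready to be saved
```

Because the positions are counted in characters of the extracted plain text, any later change of that text (including its white-space) would invalidate them.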
Since taggers commonly do not produce an output compatible with XML, it must usually be converted before being used as a replacement of the original text contents. However, we decided not to create an output directly from the output of the tagger for several reasons. Instead, we match the resulting tokens with the original plain text extracted from the XML in order to create an additional layer of annotation of the original contents. The most important reason is the simple fact that the taggers often discard all information about the presence of white-space between the tokens.[2] Not only do we want to keep the original white-space, but any changes in white-space would also break the recovery of the original XML annotation, which is stored by means of exact positions defined by the number of characters in the original text they span (including white-space).
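
A minimal sketch of such matching is given below, assuming that the tagger returns the tokens in their original form and order. The function name align_tokens is hypothetical and is not taken from the tools.

```python
def align_tokens(text, tokens):
    """Locate each token in the original text; return (start, end) offsets."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.find(tok, pos)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after offset {pos}")
        skipped = text[pos:start]
        if skipped.strip():
            # Only white-space may be left uncovered by the tokens.
            raise ValueError(f"unmatched text {skipped!r} before {tok!r}")
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

text = "Hello,  world !"
spans = align_tokens(text, ["Hello", ",", "world", "!"])
print(spans)   # [(0, 5), (5, 6), (8, 13), (14, 15)]
```

The original white-space is never touched: it simply remains between the recorded spans, so the character positions of the original XML annotation stay valid.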
Another issue is the occasional need of the taggers to normalize the contents to some degree.[3] The most common case is the normalization of punctuation symbols, such as various typographical quotation marks defined by the Unicode standard, which usually represent the same (or similar) function from the linguistic point of view, irrespective of their particular visual form.
Matching the output of the taggers with the original source text may sometimes be a complex task. The basic requirement on the tagger to output the original string of each annotated token (before any normalization) can’t always be fulfilled, especially if the tagger is developed and maintained by a different institution with different goals and priorities. Therefore, we implemented basic mechanisms to be able to match the resulting tokens with the original text strings also by means of additional replacement or transformation rules using regular expressions, applied either to the output of the tagger or to the source text. These can help to identify the original source form of a normalized token as long as the output of the tagger is deterministic and predictable. In this way, taggers with somewhat normalized output may also be used.
In any case, the tagger is expected to always produce some output for every non-white-space sequence of characters in the source text. On the other hand, white-space characters may also be part of the annotated tokens in the output, if desired.
[1] See Dipper (2005), Chiarcos et al. (2008) or https://www.sfb632.uni-potsdam.de/paula.html
[2] The UD tagger is rather an exception in this case.
[3] In some cases, the taggers even tend to correct "mistakes" in the source text.
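
The replacement rules mentioned in this section might look as follows. This is only a sketch under stated assumptions: the rule format, the names RULES, source_pattern and match_normalized, and the choice of quotation marks are invented for the example and do not describe the configuration syntax of the tools.

```python
import re

# Hypothetical rules: (pattern for the tagger's token, regex for the source text).
RULES = [
    (re.compile(r'^"$'), r'["\u201c\u201d\u201e\u00ab\u00bb]'),
    (re.compile(r"^'$"), r"['\u2018\u2019\u201a]"),
]

def source_pattern(token):
    """Return a regex matching the possible source forms of a tagger token."""
    for token_re, source_re in RULES:
        if token_re.match(token):
            return re.compile(source_re)
    return re.compile(re.escape(token))

def match_normalized(text, tokens):
    """Return (start, end, original_form) for every token of the tagger."""
    spans, pos = [], 0
    for tok in tokens:
        m = source_pattern(tok).search(text, pos)
        if m is None or text[pos:m.start()].strip():
            raise ValueError(f"cannot match token {tok!r} at offset {pos}")
        spans.append((m.start(), m.end(), m.group(0)))
        pos = m.end()
    return spans

text = "\u201eHello\u201c , he said."
result = match_normalized(text, ['"', "Hello", '"', ",", "he", "said", "."])
print(result[0], result[2])   # the original typographical quotes are recovered
```

Such rules only work if the normalization performed by the tagger is deterministic, as required above.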
3. XML document vs. its plain text contents
One important difference between the information provided by the XML annotation and the extracted pure text contents quickly materialized into an issue which required a solution: while XML delimits basic textual units just by means of tags, plain text can only do this by means of line-breaks, which delimit paragraphs and usually present a kind of hard breaks the tagger is not supposed to cross in the process of sentence segmentation.[4] In XML, however, the line-breaks can be present just everywhere – without marking any structural breaks – just for practical reasons. On the other hand, they may not be present between the XML elements at all. Due to this trivial difference, the plain text may contain random line-breaks anywhere in the text flow, but it may also contain no line-breaks between text elements such as paragraphs (sometimes no white-space separators at all), when the XML annotation is just mechanically removed.
The simplest solution is to put a requirement on the input XML, so that it must already be preprocessed in a way that there are always line-breaks present between the textual XML elements, but nowhere within the body of the text elements themselves.
Since we wanted to make the tools more flexible and independent of the input, we integrated a solution of this problem as well. This required a significant change of the tools: to make them context aware. Since XML is a generic meta-format not specifying the names and meaning of the tags used, information about the particular XML input must be provided by means of external configuration. The minimal requirement for such processing is providing it with a list of XML element names which should be treated as basic text elements usually separated by line-breaks in the visual presentation of the text. Consequently, any "technical" line-breaks within their contents can be eliminated and new provisional line-breaks can be inserted between the corresponding blocks of plain text extracted, where necessary. The information about any added line-breaks can be stored together with the information about the removed XML annotation, in order to remove such provisional line-breaks again in the last stage, when reconstructing the final XML.[5]
[4] In our texts, we usually don’t want the parser to span sentences across boundaries of items in a list or (possibly) verses in a poem. This is a decision depending on a particular type of texts or a particular project and its goals, of course.
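
The context-aware treatment of line-breaks can be sketched as follows. The example is a simplification under several assumptions: the function name extract_text is hypothetical, the text elements are assumed not to be nested in each other, and the inline mark-up is simply dropped instead of being stored as stand-off annotation.

```python
import re

def extract_text(xml, text_elements):
    """Return (plain_text, added_breaks) for the configured text elements."""
    names = "|".join(re.escape(n) for n in sorted(text_elements))
    block = re.compile(rf"<({names})\b[^>]*>(.*?)</\1>", re.S)
    text, added_breaks = "", []
    for m in block.finditer(xml):
        content = re.sub(r"<[^>]+>", "", m.group(2))          # drop inline mark-up
        content = re.sub(r"\s*\n\s*", " ", content).strip()   # "technical" breaks
        if text:
            added_breaks.append(len(text))   # offset of the provisional line-break
            text += "\n"
        text += content
    return text, added_breaks

xml = "<body><p>First\n  paragraph.</p><p>Second <hi>one</hi>.</p></body>"
text, added = extract_text(xml, {"p", "head", "l", "item"})
print(repr(text), added)   # 'First paragraph.\nSecond one.' [16]
```

The list of offsets in added_breaks is what allows the provisional line-breaks to be removed again when the final XML is reconstructed.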
4. Extending the capabilities to process complex XML
With the desire to process more complex XML documents such as TEI,[6] the context-aware mode became a necessity needing even further development. Besides the possibility to list the names of text elements to be extracted and processed, there is also a need to specify elements to be completely excluded from the extraction, including all their contents. This concerns for example the <teiHeader> element containing metadata, sometimes also enclosed in common textual elements such as <p> that should not be analyzed in this case. Another typical candidate for exclusion is the TEI element <foreign>.
The combination of a list of text elements and a list of excluded elements is already powerful enough for most common cases, but in order to make the possibilities even more flexible, we also implemented a simple XPath-like syntax for the specification of textual elements (or excluded elements), so that they may also be specified by means of constraints on their attribute values or nesting within particular ancestors. This allows for a much more detailed specification than a plain list of element names.[7]
[5] We want to have the resulting XML as similar to the original document as possible. Yet we decided not to keep the information about line-breaks removed from the text elements and not to recover them. Since the textual contents will be enhanced by the annotation produced by the analyzer, the purely practical purpose of the original formatting (e.g. using line-breaks for easier manual editing) would usually become pointless anyway.
[6] See TEI Consortium (2025) or https://tei-c.org
[7] Since the beginning, there was a difficult decision to be made: whether to use a fully-fledged XML parser for the extraction of the text contents, or a custom simple detection of the XML mark-up. The latter option was chosen as easier to implement and completely satisfactory for the original simple type of documents. Also, the priority to efficiently process large documents would require deployment of the more complex SAX parser instead of the simple DOM method. The former option would later easily enable the support of a full specification of elements based on the XPath syntax, but it would still make the rest of the process more difficult in many other aspects. Therefore the idea to rewrite the scripts by deploying an XML parser was dismissed again.
5. Conversion of the tagger output into XML annotation
As explained above, the output from a linguistic analyzer is not just directly converted into XML. Instead, it is matched with the plain text data extracted from the original XML, and another (independent) stand-off description of an additional layer of annotation spans and attributes is created, similar to the description of the original XML annotation extracted in the first step. This conversion must be adapted to the particular output format of the analyzer. We created a configurable parser for any vertical-based format common to most taggers, including a preprocessor for the more specific CoNLL-U format commonly used by the Universal Dependencies tagger. This parser creates <s> spans from sentence segmentation (usually delimited simply by empty lines in the vertical format) and <w> spans for the tokens identified by the tagger. Additional token attributes, such as a lemma, PoS or morphological tags, are converted into the attributes of the token span. The attribute names can be customized in the configuration; selected attributes may also be completely omitted from the final output, if desired.
The CoNLL-U format produced by the UD tagger has some additional features that need special treatment. The most important distinction from a simple vertical format is the use of two-level tokenization, where subtokens can be created for tokens representing two syntactic words. We implemented two ways of dealing with such subtokens, which do not necessarily always match any identifiable part of the original text string:
• attribute values of the subtokens are concatenated (using a specified separator) into the corresponding attribute values of the primary token; this solution is suitable for search engines not supporting multi-level tokenization, such as the vertical-based CQP engines; e.g. the annotation of the English token "can’t" may result in the element <w id="1|2" synword="ca|n’t" lemma="can|not" upos="AUX|PART">can’t</w>
• virtual subtokens – XML elements with a configurable name and without any textual contents – are created as children of the primary token to keep their own attributes separately; this solution is suitable for XML-aware search engines such as TEITOK[8]; e.g. the annotation of the English token "Can’t" may result in a structure such as <w id="1-2" synword="Can’t" lemma="_" upos="_">Can’t<dtok form="Ca" id="1" synword="Ca" lemma="can" upos="AUX"/><dtok form="n’t" id="2" synword="n’t" lemma="not" upos="PART"/></w>
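
The first of the two options above (concatenation of subtoken attributes) can be sketched as follows. The function name tokens_to_xml is hypothetical, only the first four CoNLL-U columns (ID, FORM, LEMMA, UPOS) are used, and the attribute names are copied from the example in the text; empty nodes and comment handling beyond skipping are left out.

```python
from xml.sax.saxutils import escape, quoteattr

def tokens_to_xml(conllu_lines, sep="|"):
    """Convert CoNLL-U token lines into <w> elements, joining subtokens."""
    rows = [l.split("\t") for l in conllu_lines if l and not l.startswith("#")]
    out, i = [], 0
    while i < len(rows):
        tid, form, lemma, upos = rows[i][:4]
        if "-" in tid:                                   # multiword token, e.g. "1-2"
            first, last = map(int, tid.split("-"))
            subs = rows[i + 1 : i + 2 + last - first]    # its syntactic words
            attrs = {
                "id": sep.join(r[0] for r in subs),
                "synword": sep.join(r[1] for r in subs),
                "lemma": sep.join(r[2] for r in subs),
                "upos": sep.join(r[3] for r in subs),
            }
            i += 1 + len(subs)
        else:
            attrs = {"id": tid, "synword": form, "lemma": lemma, "upos": upos}
            i += 1
        attr_str = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        out.append(f"<w{attr_str}>{escape(form)}</w>")
    return out

lines = ["1-2\tcan’t\t_\t_", "1\tca\tcan\tAUX", "2\tn’t\tnot\tPART", "3\tgo\tgo\tVERB"]
elements = tokens_to_xml(lines)
print(elements[0])
# <w id="1|2" synword="ca|n’t" lemma="can|not" upos="AUX|PART">can’t</w>
```

The second option would instead emit the subtokens as empty child elements of the primary token, keeping their attributes separate.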
The CoNLL-U format can also carry other levels of additional annotation. Lately, we added support for CoNLL-U enriched with the multi-layer annotation of named entities added by the NameTag parser.[9]
For taggers with an output incompatible with the common vertical format, a special parser (conversion tool) would have to be written, or the output would have to be converted into a vertical format first.
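
For orientation, reading a simple vertical format into sentences and tokens may be sketched as follows. The function name parse_vertical and the default column names are assumptions for this example; in the actual tools, the attribute names come from the configuration.

```python
def parse_vertical(lines, columns=("word", "lemma", "pos")):
    """Group tab-separated token lines into sentences split by empty lines."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():              # an empty line closes the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(columns, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences

vertical = ["It\tit\tPRON", "works\twork\tVERB", "", "Fine\tfine\tADJ"]
sentences = parse_vertical(vertical)
print(len(sentences), sentences[0][1]["lemma"])   # 2 work
```

Each sentence would then become an <s> span and each token a <w> span, after the tokens have been matched with the original plain text.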
6. Merging the annotations and composing the resulting XML
The crucial stage of the whole process (see Fig. 1) is the final reconstruction of the original XML document enriched with additional layers of annotation produced by the analyzers. This process has two steps: 1) merging the original and the added annotation into one single annotation hierarchy; 2) insertion of the merged XML mark-up back into the stream of plain text contents. Both steps are currently integrated in a single script.
The combination of two independent annotations is not a trivial task, since it has to comply with the requirements of the XML standard[10] and possibly also fulfil specific demands of the particular data processed. The new annotation is added layer by layer (from the topmost elements – sentences – to their terminal children – tokens) into the original XML annotation. Any conflicts need to be resolved according to the project-specific priorities. In case a newly added element crosses its parent’s span borders, the parent must be split and reopened beyond the end of the new span. Splitting original spans by new annotation layers complies well with the commonly conflicting annotation of highlighted text spans, which have a lower priority and lower demand of continuity than the added linguistic annotation with a more important structural function, and they can thus be interrupted. However, in case some original structural element is unexpectedly broken by the new layers of annotation, the scripts can also issue a warning on demand.
Figure 1: Workflow of the enhancement process (original XML document → split → plain text contents and original stand-off annotation; plain text contents → tagger (linguistic analysis) → match → new stand-off annotation; both annotations → merge → merged XML document)
[8] See Janssen (2016) or http://teitok.corpuswiki.org
[9] Also called "CoNLL-U+NE". See https://lindat.mff.cuni.cz/services/nametag/
[10] https://www.w3.org/TR/xml/
The conversion offers some additional configurable functions for convenience: interrupted spans such as highlighting of text don’t usually need to be reopened between sentences as well,[11] and elements with the same name and exactly the same span can be automatically merged into a single element with their attributes combined, if desired.
The process of merging the independent annotations thus tries to do its best within the limits of the XML standard. However, there are cases where priorities may change. Usually, each new layer of non-overlapping spans is preferably inserted into the old spans as their children. This works well within the scope of the secondary (added) annotation, but its relation to the primary (original) annotation may be more complex, if the original annotation includes more than just highlighting spans or emphasis. For example, TEI supports annotation of deleted (corrected) contents which may appear at the very beginning or end of a token and which should also stay within the final span of the token, despite the fact they shouldn’t be sent to the tagger – this feature is frequently used e.g. in corrected texts in learner corpora. In that case, the preference can be set to include such annotation within the new annotation spans wherever possible.
For example, learner corpora may contain a corrected sentence such as "I will<del>e</del> write an essay.", where the original form "wille" was corrected to "will" by deleting the final letter "e" – and this corrected form should also be presented to the tagger for correct linguistic analysis. For this purpose, we can add the element <del> to the list of excluded contents not to be extracted into the plain text for tagging. However, the process of merging the original and the added annotation would then still put the del-span outside of the token span: <w>will</w><del>e</del>. In such a case, we can also change the nesting priority of the element <del> to obtain the desired result: <w>will<del>e</del></w>.
[11] Highlighting only the intermittent space between sentences is usually rather pointless.
7. Supplemental utilities
For further convenience, we also provide three additional scripts outside of the core scope of the tools:
• a full-fledged command-line client for the UDPipe API[12] integrated with a client for NameTag to produce complete CoNLL-U(+NE) output from plain text files
• a customizable tool for conversion of the merged XML documents into a vertical format suitable for indexing by CQP-like search engines; it also offers solutions to the typical issues with XML features not supported by these engines, such as removal of XML annotation from the token string, flattening of nested elements with the same name or insertion of additional "glue" elements between tokens originally not separated by space
• a wrapper to run the whole pipeline automatically on a batch of files, showing a progress bar and cleaning up all temporary files when finished; it supports processing in multiple threads in order to improve efficiency
The wrapper script does not implement a more complex utilization of multiprocessing, as this task can probably be better implemented by running the process in multiple batches by dedicated external tools such as GNU parallel.[13]
8. Configuration and customization of the process
As indicated before, virtually all aspects of the processing – extraction of text contents for analysis, conversion of the tagger output and names of all the newly created XML elements and their attributes – are configurable both using command line options of the individual scripts and a common configuration file. The configuration file may contain multiple profiles (preconfigured sets of settings) for processing of different data sets. A particular profile may then be selected by a single command line option.
[12] https://lindat.mff.cuni.cz/services/udpipe/
[13] See Tange (2018) or https://www.gnu.org/software/parallel/
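
The interplay of profiles and command line options can be sketched with the Python standard library. Everything specific in this example is an assumption: the INI layout, the option names (--profile, --token-element) and the setting names are hypothetical and do not reproduce the actual configuration format of the tools.

```python
import argparse
import configparser

# Hypothetical configuration: common defaults plus one profile for TEI data.
CONFIG = """
[DEFAULT]
sentence_element = s
token_element = w

[tei]
text_elements = p head l item
exclude_elements = teiHeader foreign
"""

def load_settings(argv, config_text=CONFIG):
    """Combine the selected profile with overriding command line options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--profile", default="DEFAULT")
    parser.add_argument("--token-element")
    args = parser.parse_args(argv)
    cfg = configparser.ConfigParser()
    cfg.read_string(config_text)
    settings = dict(cfg[args.profile])     # profile values including the defaults
    if args.token_element:                 # the command line has the last word
        settings["token_element"] = args.token_element
    return settings

settings = load_settings(["--profile", "tei", "--token-element", "tok"])
print(settings["text_elements"], settings["token_element"])   # p head l item tok
```

A single option thus selects a whole set of settings, while individual options can still adjust details for a particular run.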
9. Availability and development
The scripts are implemented in Python (without any unnecessary extensions) and they are available at GitHub under the GNU GPL 2.0 license, including detailed documentation and practical examples.
The tools have also been integrated into the VELD platform for reproducible text processing developed at the Austrian Centre for Digital Humanities and Cultural Heritage (2024).
A more thorough testing with various types of data is still needed. We will also be grateful for any feedback and suggestions for further improvements or possibly other extensions covering more types of input data, output formats or integration of other linguistic analyzers.
10. Acknowledgements
[Project support – TBA]
We also want to thank Maarten Janssen, the author of TEITOK, for very useful feedback and suggestions of many important improvements necessary for processing of more advanced TEI XML documents or less common types of data.
11. References
Austrian Centre for Digital Humanities and Cultural Heritage. 2024. VELD: Versioned executable logic and data, a design pattern for reproducible and flexible workflows.
Christian Chiarcos, Stefanie Dipper, Michael Götze, Ulf Leser, Anke Lüdeling, Julia Ritz, and Manfred Stede. 2008. A flexible framework for integrating annotations from different tools and tag sets. TAL, 49:217–246.
Stefanie Dipper. 2005. XML-based stand-off representation and exploitation of multi-level linguistic annotation. In Berliner XML Tage 2005, Humboldt-Universität zu Berlin, 12. bis 14. September 2005, pages 39–50.
Maarten Janssen. 2016. TEITOK: Text-faithful annotated corpora. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 4037–4043, Portorož, Slovenia. European Language Resources Association (ELRA).
Ole Tange. 2018. GNU Parallel 2018. Ole Tange.
TEI Consortium. 2025. TEI P5: Guidelines for electronic text encoding and interchange.