Manual Review Process for the Biodata Resource Inventory

Author: Imker, Heidi J.; Schackart III, Kenneth E.
Publisher: Zenodo
DOI: 10.5281/zenodo.7768363
Source: https://zenodo.org/records/7768363/files/biodata_resource_inventory_manual_review_process.pdf
Purpose: This document accompanies the Biodata Resource Inventory project and provides
guidance for reviewing the preliminary inventory of predicted resources that results from the
Natural Language Processing tasks. This is the “review” step highlighted in yellow:
[Figure: pipeline diagram with the “review” step highlighted in yellow]
In the 2022 inventory (Schackart, Imker, and Cook (2023) doi: 10.5281/zenodo.7767794) and
for any subsequent updates to the inventory, a curator reviews flagged records prior to
augmenting the inventory with additional metadata and finalizing the inventory. The flags
indicate that human judgment should be used to determine if low probability records should
be removed from the inventory and if potential duplicate records should be merged within
the inventory. The file for manual review comes from the Biodata Resource Inventory pipeline
as a CSV. In this file, there are three columns that indicate a record has been flagged for
manual review: duplicate_urls, duplicate_names, and low_prob.
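Before opening the file in a spreadsheet, it can help to know how many records each flag covers. The following is a minimal sketch, not part of the project's pipeline, that counts flagged rows per column using only the Python standard library. It assumes that a record is flagged when its cell in a flag column is non-empty; the actual encoding of the flags is not stated here, and the sample rows and the ID column are invented for illustration.

```python
import csv
import io

# Flag column names as given in the text above.
FLAG_COLUMNS = ("duplicate_urls", "duplicate_names", "low_prob")


def count_flags(csv_file):
    """Count rows with a non-empty value in each flag column.

    Assumption: a non-empty cell means "flagged for manual review".
    """
    counts = {column: 0 for column in FLAG_COLUMNS}
    for row in csv.DictReader(csv_file):
        for column in FLAG_COLUMNS:
            if (row.get(column) or "").strip():
                counts[column] += 1
    return counts


# Made-up sample standing in for the pipeline's CSV output.
sample = io.StringIO(
    "ID,best_name,duplicate_urls,duplicate_names,low_prob\n"
    "1,ExampleDB,,,\n"
    "2,OtherDB,x,,\n"
    "3,OtherDB,x,x,x\n"
)
flag_counts = count_flags(sample)
print(flag_counts)
```

For a real file, the `io.StringIO` sample would be replaced by `open(path, newline="", encoding="utf-8")`.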
STEP 1: Set up manual review spreadsheet
1. Open the CSV file in MS Excel and immediately use Save As to append “_review_V1” to
the file name. If work is done over multiple days, save a new version each day
so that you can go back in case of mistakes.
2. Data -> Filter to be able to sort all columns (including new) without scrambling. THIS IS
ESSENTIAL.
3. Create drop-downs for consistent, structured values for review columns:
   a. Create a new sheet/tab and name it “Review Values”
   b. On this sheet, create a table of options exactly as below (changing any values
   will break the script that processes the manual review results):
   review_low_prob | review_dup_urls                     | review_dup_names
   remove          | merge on record with best name prob | merge all "dup name" IDs
   do not remove   | do not merge                        | do not merge
                   | conflicting record(s) to be removed | merge only:
                   |                                     | conflicting record(s) to be removed
   c. Lock the Review Values tab to prevent accidental changes (right click on tab
   name at bottom -> Protect Sheet…-> allow users to select locked and unlocked
   cells -> OK)
Drafted: HJI and KES | Reviewed CEC | Updated: 2023-01-04, HJI
   d. Back in the manual review sheet/tab, highlight the entire column for
   “review_low_prob”
      i. Data -> Validation -> Allow; select list; highlight the “remove” and “do not
      remove” cells for “review_low_prob” on the Review Values tab
      ii. Check that only those values are available now for that column
      iii. Repeat for “review_dup_urls” and “review_dup_names”
4. Optional: Conditional format best_name_prob to be green = high, red = low
STEP 2: Review flags for low_prob using “review_low_prob” column
1. Sort by low_prob
2. Review each, referring to text column as needed
   a. If extracted name is erroneous, select “remove” from dropdown
      i. Guidelines:
         ● Allow either full name or short name/abbreviation - as long as
         correct, retain in the inventory
         ● Do not remove records for names missing a general type, e.g.
         “database” or “DB” or “catalog” or “data hub” or “library”
            ○ e.g. do not remove: “tautomeric” when actual name is just
            “tautomeric database”
            ○ e.g. remove: “HGD” when actual name is “HGD mutation
            database”
         ● Flag for removal when any key part of the name was missing, e.g.
         “GlycositeAtlas” (predicted) vs “N-GlycositeAtlas” (actual)
      ii. Note reason for removal in “review_notes_low_prob”
         ● FALSE POS: CLASS
         ● FALSE POS: INCORRECT NAME
         ● FALSE POS: PARTIAL NAME
         ● FALSE POS: URL scramble
      iii. When complete, double check numbers between “low_prob” and
      “review_low_prob” to make sure all have been reviewed.
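The final double check in this step can also be done outside Excel. The sketch below is one possible way to list flagged rows that still lack a review value; it is not the project's own script. It assumes a non-empty cell in the flag column means "flagged", and the rows and the ID column are invented for illustration.

```python
def unreviewed_rows(rows, flag_column, review_column):
    """Return rows that are flagged but have no review value yet.

    Assumption: a non-empty flag cell means the row needs review.
    """
    return [
        row
        for row in rows
        if (row.get(flag_column) or "").strip()
        and not (row.get(review_column) or "").strip()
    ]


# Made-up rows, shaped like the output of csv.DictReader.
rows = [
    {"ID": "1", "low_prob": "", "review_low_prob": ""},
    {"ID": "2", "low_prob": "x", "review_low_prob": "remove"},
    {"ID": "3", "low_prob": "x", "review_low_prob": ""},
]
missing = unreviewed_rows(rows, "low_prob", "review_low_prob")
print([row["ID"] for row in missing])
```

The same helper could be pointed at the duplicate_urls / review_dup_urls and duplicate_names / review_dup_names pairs.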
STEP 3: Review flags for duplicate URLs using “review_dup_urls” column
1. Sort by extracted_url so duplicate URLs line up
2. Review each, referring to text column as needed
   i. Guidelines:
      ● If URLs refer to the same resource and merging the records on the record
      with best name probability will not result in an erroneous name, select
      “merge on record with best name prob”
      ● If URLs DO NOT refer to the same resource OR merging the records will
      result in an erroneous name overriding a correct name, select “do not
      merge”
      ● If the record is flagged and removed because of the evaluation in Step 2
      above, select “conflicting record(s) to be removed”
   ii. Note any odd cases in “review_notes_dup_url”
   iii. When complete, double check numbers between “duplicate_urls” and
   “review_dup_urls” to make sure all have been reviewed.
STEP 4: Review flags for duplicate names using the “review_dup_names” column
1. Sort by best_name so duplicate names line up
2. Review each, referring to text column as needed
   i. Guidelines:
      ● If predicted names are accurate and apply to all of the records (just
      variation between URLs), select “merge all "dup name" IDs”
         ○ The “main” URL will be either the one associated with the most
         recent publication or, in the case of a tie for newest publication
         date, the one associated with the highest name probability
      ● If predicted names are accurate for only some of the records, select
      “merge only:” and put the IDs to be merged in the
      “review_notes_dup_names” column (comma separated) with no other
      text, e.g. “26481361, 31647100”
      ● For any records that should not be merged (e.g. different resources or
      merging would cause some other issue), select “do not merge”
      ● If the record is flagged and removed because of the evaluation in Step 2
      above, select “conflicting record(s) to be removed”
   ii. When complete, double check numbers between “duplicate_names” and
   “review_dup_names” to make sure all have been reviewed.
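Two of the guidelines in this step are mechanical enough to sketch in code: reading the comma-separated IDs entered for “merge only:”, and choosing the “main” URL among records to be merged. The example below is an illustration under stated assumptions, not the pipeline's implementation. It assumes IDs are purely numeric, as in the example “26481361, 31647100”; the `publication_date` field name and the sample records are hypothetical.

```python
from datetime import date


def parse_merge_ids(notes):
    """Split a comma-separated notes cell into a list of ID strings.

    Assumption: IDs consist of digits only, so any other text is rejected.
    """
    ids = [part.strip() for part in notes.split(",")]
    if not all(part.isdigit() for part in ids):
        raise ValueError(f"expected only comma-separated IDs, got: {notes!r}")
    return ids


def pick_main_record(records):
    """Pick the record with the newest publication date.

    Ties on date are broken by the highest name probability.
    """
    return max(records, key=lambda r: (r["publication_date"], r["best_name_prob"]))


merge_ids = parse_merge_ids("26481361, 31647100")
print(merge_ids)

# Made-up records; "publication_date" is a hypothetical field name.
records = [
    {"ID": "101", "publication_date": date(2019, 5, 1), "best_name_prob": 0.95},
    {"ID": "102", "publication_date": date(2021, 3, 1), "best_name_prob": 0.80},
    {"ID": "103", "publication_date": date(2021, 3, 1), "best_name_prob": 0.97},
]
main = pick_main_record(records)
print(main["ID"])
```

Rejecting any non-ID text mirrors the instruction to enter the IDs “with no other text”.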
STEP 5: Add the reviewed inventory into directory for further processing
1. Save the reviewed inventory as a CSV file
2. Add this file to the directory which will process the reviewed inventory to remove/merge
records.
3. Run the post-review pipeline. Scripts within the pipeline will first check that all flagged
records have been reviewed and contain valid review values; if not, errors will indicate
where there are issues.
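To catch problems before running the pipeline, a reviewer could run a similar check locally. The sketch below is a stand-in for that kind of validation, not the pipeline's actual script. The allowed values are copied from the Review Values table in Step 1; treating a non-empty flag cell as "flagged" and the wording of the error messages are assumptions.

```python
# Allowed review values, copied from the "Review Values" table in Step 1.
ALLOWED_VALUES = {
    "review_low_prob": {"remove", "do not remove"},
    "review_dup_urls": {
        "merge on record with best name prob",
        "do not merge",
        "conflicting record(s) to be removed",
    },
    "review_dup_names": {
        'merge all "dup name" IDs',
        "do not merge",
        "merge only:",
        "conflicting record(s) to be removed",
    },
}

# Each flag column paired with the column that holds its review decision.
FLAG_TO_REVIEW = {
    "low_prob": "review_low_prob",
    "duplicate_urls": "review_dup_urls",
    "duplicate_names": "review_dup_names",
}


def find_review_errors(rows):
    """Return one message per flagged cell whose review value is missing or invalid.

    Assumption: a non-empty flag cell means the row was flagged.
    """
    errors = []
    for number, row in enumerate(rows, start=1):
        for flag_column, review_column in FLAG_TO_REVIEW.items():
            flagged = (row.get(flag_column) or "").strip()
            value = (row.get(review_column) or "").strip()
            if flagged and value not in ALLOWED_VALUES[review_column]:
                errors.append(
                    f"row {number}: {review_column} has invalid value {value!r}"
                )
    return errors


# Made-up rows, shaped like the output of csv.DictReader.
rows = [
    {"low_prob": "x", "review_low_prob": "remove"},
    {"low_prob": "x", "review_low_prob": ""},
    {"duplicate_urls": "x", "review_dup_urls": "merge"},
]
errors = find_review_errors(rows)
for message in errors:
    print(message)
```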