
IdolSongsJp Corpus: A Multi-Singer Song Corpus in the Style of Japanese Idol Groups

Author: Hitoshi Suda; Junya Koguchi; Shunsuke Yoshida; Tomohiko Nakamura; Satoru Fukayama; Jun Ogata
Publisher: Zenodo
DOI: 10.5281/zenodo.17706547
Source: https://zenodo.org/records/17706547/files/000075.pdf
IDOLSONGSJP CORPUS: A MULTI-SINGER SONG CORPUS IN THE STYLE OF JAPANESE IDOL GROUPS
Hitoshi Suda (1), Junya Koguchi (2), Shunsuke Yoshida (3), Tomohiko Nakamura (1), Satoru Fukayama (1), Jun Ogata (1)
(1) National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
(2) Meiji University, Tokyo, Japan
(3) The University of Tokyo, Tokyo, Japan
[email protected], [email protected]
ABSTRACT
Japanese idol groups, comprising performers known as “idols,” are an indispensable part of Japanese pop culture. They frequently appear in live concerts and television programs, entertaining audiences with their singing and dancing. Similar to other J-pop songs, idol group music covers a wide range of styles, with various types of chord progressions and instrumental arrangements. These tracks often feature numerous instruments and employ complex mastering techniques, resulting in high signal loudness. Additionally, most songs include a song division (utawari) structure, in which members alternate between singing solos and performing together. Hence, these songs are well-suited for benchmarking various music information processing techniques such as singer diarization, music source separation, and automatic chord estimation under challenging conditions. Focusing on these characteristics, we constructed a song corpus titled IdolSongsJp by commissioning professional composers to create 15 tracks in the style of Japanese idol groups. This corpus includes not only mastered audio tracks but also stems for music source separation, dry vocal tracks, and chord annotations. This paper provides a detailed description of the corpus, demonstrates its diversity through comparisons with real-world idol group songs, and presents its application in evaluating several music information processing techniques.
1. INTRODUCTION
Processing music audio signals is one of the major directions in the music information retrieval (MIR) field [1, 2]. The techniques include beat tracking [3, 4], fundamental frequency estimation [5, 6], automatic musical chord estimation [7–9], music source separation (MSS) [10, 11], and automatic lyrics transcription (ALT) [12, 13].
© H. Suda, J. Koguchi, S. Yoshida, T. Nakamura, S. Fukayama, and J. Ogata. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: H. Suda, J. Koguchi, S. Yoshida, T. Nakamura, S. Fukayama, and J. Ogata, “IdolSongsJp Corpus: A Multi-Singer Song Corpus in the Style of Japanese Idol Groups”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Training and evaluating MSS techniques in a supervised manner requires a corpus consisting of ground-truth stems, i.e., isolated acoustic signals categorized by instrument type. Such corpora include MUSDB18 [14, 15] and MoisesDB [16], which have been widely used for training and evaluating source separation methods [17]. On the other hand, some studies have shown that these corpora tend to exhibit lower loudness levels compared to commercially available tracks and have evaluated realistic performance on extended corpora by amplifying existing tracks to match commercial loudness levels [18]. However, some of the tracks in these corpora still feature fewer instruments and simpler mastering effects compared to contemporary commercial tracks, which makes practical performance evaluation challenging. A more realistic assessment of MSS for contemporary songs requires constructing a music corpus that closely emulates the characteristics of these tracks.
In the field of music information processing, various techniques for multi-singer songs have also been explored, such as the separation of overlapping vocal signals [19]. Several corpora have been constructed for the study of such techniques, such as the jaCappella corpus and MedleyVox [20, 21]. In songs where singer combinations change between sections, key challenges include identifying which parts are sung by which singers and at what time. In Japanese idol groups, such structures are referred to as song division (utawari in Japanese), which is intentionally designed to enhance both the appeal of the song and the individuality of each idol [22–24]. Such information is vital not only for music appreciation but also for applications such as music video production and live concert arrangements. Similarly, for K-pop dance groups, line distribution videos that visualize song division structures have been widely shared on platforms such as TikTok and YouTube. The automatic recognition of these structures from music signals is known as singer diarization [25]. Several studies have proposed specialized methods tailored to J-pop songs and have also constructed a dedicated corpus [23, 24]. In summary, these studies suggest that focusing on multi-singer songs will further advance the field of music information processing.
Another important aspect of music information processing is active music listening, which involves interactive interfaces that enable listeners to better understand and appreciate music with the help of music understanding techniques [26]. Studies on active music listening interfaces require preparing target songs and evaluating them through user interactions with the proposed interfaces. While using commercial tracks for these evaluations is desirable, obtaining permission from individual copyright holders (e.g., record companies) is often challenging. Similarly, some research-oriented song corpora, such as the RWC Music Database [27], also require permission for public distribution, and conducting online evaluation experiments poses copyright concerns. Thus, to advance research in this area, it is essential to construct a music corpus whose license explicitly permits research use and public sharing.
This study aims to construct a high-loudness corpus of multi-singer songs that can be distributed online and is accessible to both researchers and ordinary listeners. To address this challenge, we focus on Japanese idol group songs, which are well-suited for music information processing applications owing to their complex instrumental and vocal arrangements, commercial-level loudness, and diverse musical styles. In addition, these idol group songs are indispensable elements of Japanese pop culture. For instance, at the 2024 Japan Record Awards (footnote 1), FRUITS ZIPPER’s “NEW KAWAII” received the Best Piece Award, and Cho Tokimeki♥Sendenbu’s “Saijokyu ni Kawaii no!” received the Best Lyrics Award. Given their cultural and technical importance, we commissioned professional creators to compose 15 songs in the style of Japanese idol groups and constructed a corpus titled IdolSongsJp. These tracks are mastered to a loudness level comparable to commercial songs and feature realistic song division (utawari) structures and diverse chord progressions. Furthermore, the corpus includes not only mastered tracks but also stems designed for source separation, dry solo vocal tracks for each singer, and musical chord annotations, enabling a wide range of evaluations in music information processing. To advance studies on music listening applications, we made this corpus publicly distributable for non-commercial use. This paper presents the aim and detailed description of the corpus and demonstrates the diversity of the songs by comparing them with real-world idol group songs. This paper also shows the application of the corpus by evaluating several fundamental MIR techniques.
2. IDOLSONGSJP CORPUS: A MULTI-SINGER
CORPUS IN THE JAPANESE IDOL GROUP STYLE
We constructed a novel corpus titled IdolSongsJp, comprising 15 multi-singer songs in the style of Japanese idol groups. This corpus features 10 female and 8 male singers, all of whom are professionals or semi-professionals, and each of the 15 songs features a unique combination of singers. All creators are professionals with experience providing songs for real idol groups. Each individual contributed to no more than three songs to ensure a diverse range of creative styles across the corpus. Each song was designed to exhibit distinctive characteristics that reflect the diverse styles observed in Japanese idol group songs. Table 1 presents the list of songs along with their keywords and core concepts. This corpus is available at https://huggingface.co/datasets/imprt/idol-songs-jp.
Footnote 1: https://www.tbs.co.jp/recordaward/
The songs exhibit several characteristics typical of Japanese idol songs, as described below:
• Song division (utawari) structures. In this structure, the vocal arrangements vary across different sections of each song. Different singers take turns line by line, with some sections sung solo and others performed by multiple singers simultaneously. In addition, the songs feature countermelodies, harmonies, and choral sections (e.g., “oohs” and “ahhs”), which are characteristic of J-pop songs.
• Various music styles. This corpus comprises a diverse range of music styles, including pop songs as well as genres such as UK Garage (M05), ballads (F05 and M06), rock (M03), and dance music (F08).
• Various lyrical themes. While love songs (e.g., F02, F04, M04, M06) are common in idol songs, some songs feature alternative themes, such as self-introduction (F01) and topics related to idol groups (F06).
• High loudness. The mastering process targeted a loudness of −7 LUFS, similar to that of commercial idol group songs. Since some creators release low-loudness versions for online platforms, such as Spotify and YouTube, this corpus also includes tracks mastered to a target loudness of −9 LUFS.
• Reflection of Japanese idol culture. At idol group concerts, audiences frequently cheer and shout in response to the performances. Specifically, tracks F01, F07, and M01 incorporate cheers, shouts, and chants (referred to as calls and mixes in Japanese [28]) into their recordings.
This corpus comprises several types of data that are suitable for evaluating various music information processing techniques. Figure 1 shows the data types organized according to the song production process. All audio signals were sampled at 48 kHz and encoded in 32-bit floating-point format to make dithering unnecessary. The following list shows the included data:
• Stems. These stems were created by linearly mixing instrumental signals according to their respective categories and applying bus effects. The category definitions were derived from MoisesDB [16].
• Stems for MSS. This corpus employs four categories: drums, bass, vocals, and other. These signals were created by linearly mixing the corresponding stems and applying mastering effects.
• Dry vocal tracks. The corpus contains 414 individual dry vocal tracks, including lead melodies, countermelodies, harmonies, choruses, and chants. Each singer recorded the whole song, and final vocals in the mastered tracks were trimmed based on the designated song division structures. The tracks were processed using specialized software (e.g., Melodyne (footnote 2)) to adjust pitch and timing. These processed tracks can be utilized to train and evaluate singing voice synthesis techniques.
• Master bus signals without limiters. These signals were produced by linearly mixing the stems and applying mastering effects except the final limiter.
• Mastered tracks at −7 LUFS and −9 LUFS. For maximization and limiting, we used the Modern preset of FabFilter Pro-L 2 (footnote 3). The loudness of the tracks mastered to a target of −7 LUFS ranges between −7.1 LUFS and −7.0 LUFS, based on the ITU-R BS.1770-3 standard [29]. The tracks have been designed and mastered based on the −7 LUFS version.
• Off-vocal tracks and minus-one vocal tracks. These signals were generated using the same process as for the mastered tracks, except that the vocal signals were excluded. Minus-one vocal tracks include backing harmonies and choruses. The corpus includes these versions in the same three mastering types as described above.
• Solo versions. Each of the 95 solo version tracks was mastered by mixing instrumental signals with the corresponding solo vocal signal, meaning that each track features the vocal performance of only one specific singer. Therefore, the corpus can also be utilized as a solo song corpus. Each track is available in the same three mastering types as described above.
• Key and chord annotations in Harte’s shorthand notation [30] with time information. These annotations are based on the consensus of at least two hired expert annotators.
Footnote 2: https://www.celemony.com/en/melodyne/what-is-melodyne
Footnote 3: https://www.fabfilter.com/products/pro-l-2-limiter-plug-in
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
ID  | Title                     | Tempo   | No. of singers | No. of stems | Keywords and core concepts
----+---------------------------+---------+----------------+--------------+---------------------------
F01 | Introjuice                | 186 bpm | 7              | 15           | Self-introduction; audience interaction (calls and mixes)
F02 | Aoharu Syndrome           | 190 bpm | 6              | 11           | Rapidly unfolding lyrical narrative
F03 | Awake                     | 175 bpm | 5              | 12           | Frequent key changes (11 times)
F04 | Awanai Kagi               | 123 bpm | 8              | 13           | Dance music with acoustic piano and guitar; questioning lyrics
F05 | Kaerimichi                | 90 bpm  | 6              | 14           | Heartbreak ballad; metaphorical lyrics; no multi-singer sections
F06 | Ima, Sekai wa Kagayaiteru | 132 bpm | 4              | 12           | Topics related to idol groups; metaphorical lyrics
F07 | Illuminyati               | 160 bpm | 7              | 11           | Coined word; wordplay; audience interaction (chants)
F08 | tic-tac-toe               | 120 bpm | 9              | 10           | Vocal doubling; vocoder effects; rap; code-switching with English
M01 | Liberty wing              | 135 bpm | 8              | 15           | Introduction of other members; audience interaction (calls)
M02 | trick star                | 134 bpm | 7              | 17           | Augmented chords; high stem count
M03 | Answer                    | 185 bpm | 6              | 12           | Rock; ateji (alternate kanji pronunciations)
M04 | Uraomote Docchi?          | 157 bpm | 7              | 10           | Release cut piano; Vocaloid style
M05 | Deep breath               | 130 bpm | 5              | 8            | UK Garage; rap; code-switching with English; vocal chops
M06 | Soredemo, Sukidayo        | 78 bpm  | 4              | 11           | Heartbreak ballad; ad-libbed vocals and piano
M07 | “Suki” no Aizu            | 138 bpm | 6              | 12           | Three members singing an octave lower; long spoken lines
Table 1. Songs in the IdolSongsJp corpus. The prefix of each song ID (F or M) indicates the gender of the singers.
[Figure 1 (diagram): instrument tracks (e.g., Kick, Snare, Hi-hat, E.Gtr. (Left), E.Gtr. (Right)) and vocal tracks (e.g., Singer 1 Lead, Singer 2 Harmony) pass through track effects and bus effects to form the stems (e.g., drums, eg, fx, piano, perc_atonal, vocal_lead, vocal_bg). Master effects yield the master bus without limiter (no_limit, −9 LUFS); Pro-L 2 at +0 dB yields the mastered −9 LUFS track, and Pro-L 2 at +2 dB yields the mastered −7 LUFS track. The stems for music source separation (drums, bass, vocals, other) pass through master effects and Pro-L 2.]
Figure 1. Overview of the production process of songs in the IdolSongsJp corpus. Data types included in the corpus are highlighted in red.
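As a rough illustration of how the two mastered versions listed above relate, the gain that separates a −9 LUFS track from a −7 LUFS track can be sketched as a plain decibel calculation. This is a simplification under stated assumptions: it treats the change as a purely linear gain with no limiting, which is not how the limiter in the actual mastering chain behaves, and the function names are hypothetical rather than part of the corpus tooling.

```python
def gain_db_to_target(measured_lufs: float, target_lufs: float) -> float:
    """Gain in dB that would move a measured loudness to a target loudness.

    Assumption: a linear gain of g dB shifts integrated loudness by g LU,
    ignoring limiter behavior and gating edge cases.
    """
    return target_lufs - measured_lufs


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


if __name__ == "__main__":
    gain = gain_db_to_target(measured_lufs=-9.0, target_lufs=-7.0)
    print(f"gain: {gain:+.1f} dB, amplitude factor: {db_to_linear(gain):.4f}")
```

Under these assumptions, the 2 LU difference corresponds to a 2 dB gain, i.e., an amplitude factor of roughly 1.26. In the real chain the limiter prevents clipping, so the mastered −7 LUFS signal is not simply a scaled copy of the −9 LUFS signal.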
The authors retain all copyrights to the corpus. This corpus is available free of charge for non-commercial research and entertainment purposes. No prior consent is required for such uses. Any commercial use of the corpus requires prior permission from the authors. Users may rearrange, parody, and apply machine learning techniques to the corpus, provided that the creators’ moral rights are upheld. To protect the rights associated with software instruments and commercial sample libraries, sampling the instrumental tracks to create unrelated content or to train machine learning models is prohibited.
Figure 2. Visualization of music embeddings extracted from real-world idol group songs and songs from our corpus. Each gray dot represents a real-world idol group song, while red and blue dots indicate songs from our corpus featuring female and male vocals, respectively.
3. COMPARISON WITH REAL-WORLD IDOL
GROUP SONGS
The IdolSongsJp corpus is designed to capture the diverse styles of idol group music. This section demonstrates the style diversity of the songs in our corpus by comparing their music embeddings to those derived from real-world idol group songs.
We processed 4,483 publicly available preview tracks performed by 234 female idol groups, which we obtained from a music subscription service. To mitigate the effects of vocal characteristics and variations in the number of singers, we extracted the accompaniment signals using the fine-tuned model of Hybrid Transformer Demucs (HT Demucs) [11, 17]. We then extracted 512-dimensional embeddings from these signals using a contrastive language–audio pretraining (CLAP) model [31]. Subsequently, we reduced the dimensionality of these embeddings to two using uniform manifold approximation and projection (UMAP) [32]. We also extracted embeddings from the songs in our corpus and reduced their dimensionality using the same UMAP parameters as previously described. The CLAP model was music_audioset_epoch_15_esc_90.14 (footnote 4), which is designed for music signals. In this comparison, all real-world songs were performed by female groups, as these songs are generally more accessible and well-organized online than those by male groups. Nonetheless, this section treats all songs in the IdolSongsJp corpus uniformly since vocal characteristics were mitigated.
Figure 2 shows the results. The embedding space reveals a continuous distribution of songs without forming distinct clusters. The figure shows that the songs in our corpus are broadly distributed across the embedding space, indicating a wide variety of styles. Tracks located near the periphery of the embedding space exhibit unique styles, while those near the center tend to have more common idol group characteristics. This suggests that the corpus includes both distinctive songs (e.g., F05, F06, F08) and more typical ones (e.g., F01, M04). Moreover, the absence of data points in certain regions of the embedding space, such as the upper right region, suggests directions for further expansion of the corpus.
Footnote 4: https://huggingface.co/lukewys/laion_clap
4. APPLICATION 1: MUSIC SOURCE
SEPARATION
As mentioned in Section 2, the IdolSongsJp corpus includes stem signals, which facilitate the evaluation of MSS techniques. In this section, we evaluate the performance of the fine-tuned model of HT Demucs [11, 17], which separates music signals into four stems: bass, drums, vocals, and other. We evaluated the performance using three types of input signals, all of which are provided in the corpus:
1. Linear summation of stems without mastering effects. The stems are summed linearly without any additional mastering processing.
2. Mastered tracks at −9 LUFS. For these tracks, the same mastering effects as those used in producing the mastered tracks were applied to the ground-truth stems.
3. Mastered tracks at −7 LUFS. The only difference from condition 2 is the gain parameter applied to the final limiter.
Since mastering effects include not only maximizers and limiters but also equalizers, stereo imaging plug-ins, and other processing plug-ins, the final mixed signals differ considerably from the raw stems. To address this discrepancy, we applied the same mastering effects to the individual stems in conditions 2 and 3 so that the acoustic conditions of the reference stems approximately matched those of the mastered tracks. Note that since most mastering effects are nonlinear, simply summing the processed stems does not reproduce the final mastered tracks. These effects were applied to each stem solely to approximate the overall acoustic conditions, such as loudness and stereo balances. In this evaluation, signal-to-distortion ratio (SDR) was used as the evaluation metric.
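For readers unfamiliar with the metric, a minimal sketch of SDR is given here. It uses the simplest common definition, the ratio of reference energy to error energy in decibels, for a single-channel signal. The text does not state which SDR variant or toolkit was used, so this definition, the function name, and the `eps` stabilizer are assumptions for illustration only.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB for two equal-length sample sequences.

    Assumed definition: 10 * log10(sum(s^2) / sum((s - s_hat)^2)).
    `eps` avoids division by zero and log of zero for silent or perfect cases.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


if __name__ == "__main__":
    reference = [1.0, -1.0, 1.0, -1.0]
    estimate = [0.9, -0.9, 0.9, -0.9]  # 10% amplitude error on every sample
    print(f"SDR: {sdr_db(reference, estimate):.2f} dB")
```

Because the error term compares the estimate with the reference stem sample by sample, the choice of reference matters: this is why the mastering effects were also applied to the reference stems in conditions 2 and 3.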
Figure 3 shows the separation results. For condition 1 (linear summation of stems), the separation performance was comparable to that observed with MUSDB18-HQ [15], demonstrating the effectiveness of the HT Demucs model. In contrast, when processing the mastered signals, the separation performance deteriorated, particularly for drums and vocals, and was lower than previously reported for existing high-loudness music corpora [18]. One possible explanation is that certain mastering effects are designed to smooth, glue, and saturate the tracks and induce interactions between stems, thereby complicating MSS. Notably, the mastered signals of track M05, “Deep breath,” exhibited the lowest separation performance across all stems. Because this song is based on UK Garage, its sound design, especially the bass, differs considerably from the typical one found in pop and rock music. These findings suggest that enhancing the diversity of genres and effects in the training tracks may be necessary to further improve the performance of MSS techniques. Furthermore, the results indicate potential limitations in the current categorization of instruments, suggesting that target source separation techniques [33] will be required for more effective applications.
Figure 3. Signal-to-distortion ratios (SDR) of separated signals from the IdolSongsJp corpus, obtained using the fine-tuned model of HT Demucs [11, 17]. Dots indicate outliers that lie more than 1.5 times the interquartile range away from the quartiles.
5. APPLICATION 2: AUTOMATIC CHORD
ESTIMATION
As mentioned in Section 2, the IdolSongsJp corpus contains musical chord annotations provided by expert annotators. Figure 4 shows the occurrence rates of chord qualities, excluding chord roots, inversions, tensions, and non-chorded sections. The proportion of major and minor chords is 50%, which is significantly lower than the 65% observed in the McGill Billboard corpus [34], a reference corpus used in the Music Information Retrieval Evaluation eXchange (MIREX) competition. In contrast, the corpus includes a large number of tetrads and pentads (e.g., seventh and ninth chords), as well as diminished (dim) and augmented (aug) chords. Therefore, effective chord estimation for this corpus requires a model with a large chord vocabulary capable of covering these diverse chord types.
We applied and evaluated two open-source chord estimation methods on the IdolSongsJp corpus. The first method is based on multitask classification of chord attributes (e.g., root, quality, bass, and seventh) using bi-directional long short-term memory (Bi-LSTM) networks (footnote 5) [35]. The second is an end-to-end method based on bi-directional Transformers (footnote 6) [36]. Both methods can handle recognition tasks involving a large chord vocabulary. Figure 5 shows the chord estimation results. For root note estimation and major/minor chord estimation, both methods achieved accuracies exceeding 80%, demonstrating their effectiveness on this corpus. However, the recognition accuracy for tetrads was below 60% for both methods, indicating limitations in handling a large chord vocabulary. In the figure, the MIREX4 column represents the recognition accuracy for chords consisting of four or more notes, where at least four component notes must be correctly identified. In this case, the accuracy was below 30% for both methods, indicating considerable room for improvement in recognizing chords with complex structures. In summary, chord estimation involving an extended chord vocabulary, such as tetrads and beyond, remains challenging for current methods.
Figure 4. Distribution of chord qualities in the IdolSongsJp corpus, represented using Harte’s shorthand notation [30]. Chord roots, inversions, tension notes, and non-chorded segments are excluded. The 7sus4 chord, which is not defined in this notation or in the open-source evaluation package mir_eval, is composed of (1, 4, 5, ♭7).
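To make the 7sus4 definition (1, 4, 5, ♭7) concrete, the sketch below expands scale degrees into pitch classes for a given root. The degree table, note spellings, and function names are illustrative assumptions; they are not taken from the corpus annotations or from mir_eval.

```python
# Semitone offsets above the root for common scale degrees ("b" = flat, "#" = sharp).
DEGREE_TO_SEMITONE = {
    "1": 0, "b2": 1, "2": 2, "b3": 3, "3": 4, "4": 5,
    "b5": 6, "5": 7, "#5": 8, "6": 9, "b7": 10, "7": 11,
}

# One possible spelling of the twelve pitch classes (flats chosen arbitrarily).
NOTE_NAMES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

# The 7sus4 quality as described in the text: (1, 4, 5, b7).
SEVENTH_SUS4 = ("1", "4", "5", "b7")


def chord_pitch_classes(root_pc, degrees):
    """Return the pitch classes (0-11) of a chord built on `root_pc`."""
    return [(root_pc + DEGREE_TO_SEMITONE[d]) % 12 for d in degrees]


def chord_note_names(root_pc, degrees):
    """Return note names for the chord, using the NOTE_NAMES spelling."""
    return [NOTE_NAMES[pc] for pc in chord_pitch_classes(root_pc, degrees)]


if __name__ == "__main__":
    print("C7sus4:", chord_note_names(0, SEVENTH_SUS4))
    print("G7sus4:", chord_note_names(7, SEVENTH_SUS4))
```

A four-note quality such as this one is exactly the kind of chord counted in the MIREX4 column, where all four component notes have to be identified.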
Footnote 5: https://github.com/music-x-lab/ISMIR2019-Large-Vocabulary-Chord-Recognition
Footnote 6: https://github.com/jayg996/BTC-ISMIR19
Figure 5. Chord estimation accuracies using two open-source methods [35, 36]. Evaluation metrics are based on the Music Information Retrieval Evaluation eXchange (MIREX) competition and the implementation in the open-source package mir_eval. The MIREX4 column indicates the recognition accuracy for chords consisting of four or more notes, where at least four component notes must be correctly identified. Dots indicate outliers that lie more than 1.5 times the interquartile range away from the quartiles.
6. APPLICATION 3: AUTOMATIC LYRICS
TRANSCRIPTION
As described in Sections 2 and 3, the songs in the IdolSongsJp corpus cover a wide range of musical styles with a variety of genres, tempos, and lyrical themes. Therefore, the corpus serves as a benchmark dataset for various natural language processing tasks involving lyrics in diverse musical contexts. For example, this section evaluates the performance of automatic lyrics transcription (ALT) using existing automatic speech recognition (ASR) techniques.
We utilized two ASR methods to perform ALT. The first method uses Whisper with its large model [37]. The language tag for Japanese was provided as input to its Transformer decoder. The second method is based on a Conformer model [38] that accepts Hidden-Unit BERT (HuBERT) [39] features. For the HuBERT model, we used kushinada-hubert-large (footnote 7) [40]. The Conformer model was trained on LaboroTVSpeech, a large-scale Japanese speech corpus derived from television broadcasts [41]. Both methods were designed for general ASR and were not specifically tailored to ALT. In this evaluation, we used three types of input signals: (1) mastered tracks at −7 LUFS, (2) vocal signals separated from the mastered tracks using the fine-tuned model of HT Demucs, and (3) raw lead vocal stems without any mastering effects.
Table 2 presents the character error rates (CERs) of the transcription results. The performance of the Whisper model remained consistent across all three input conditions. This consistency, as reported also in previous studies [42], demonstrates the robustness of Whisper against accompaniment signals and degradation caused by MSS. In contrast, the results indicate that the Conformer model with HuBERT features was more negatively affected by accompaniment signals, despite having been trained on data that included music TV programs. The two systems demonstrated different error tendencies: the Whisper model tended to produce insertion errors, while the Conformer model frequently generated deletion errors, with insertion errors occurring rarely. Moreover, the Whisper model often produced hallucinated outputs, such as “thank you for your watching.” Conversely, the Conformer model tended to skip over fast segments and failed to recognize English phrases accurately, which contributed to its frequent deletion errors.
Footnote 7: https://huggingface.co/imprt/kushinada-hubert-large
Method / input             | S [%] | D [%] | I [%] | CER [%]
---------------------------+-------+-------+-------+--------
Whisper [37]               |       |       |       |
  Mastered tracks          | 10.3  | 8.7   | 21.8  | 40.7
  +Demucs                  | 8.1   | 15.7  | 20.0  | 43.8
  Lead vocals              | 7.9   | 14.1  | 17.7  | 39.7
HuBERT+Conformer [38–40]   |       |       |       |
  Mastered tracks          | 13.4  | 59.8  | 2.5   | 75.6
  +Demucs                  | 16.3  | 26.1  | 5.1   | 47.5
  Lead vocals              | 15.1  | 21.5  | 5.2   | 41.8
Table 2. Character error rates (CERs) of automatic lyrics transcription. Columns S, D, and I represent substitution, deletion, and insertion error rates, respectively. The CER column shows the total character error rate, the sum of these three error types. Rows labeled Mastered tracks correspond to tracks mastered at −7 LUFS; +Demucs indicates separated vocal signals obtained from these mastered tracks using HT Demucs; and Lead vocals indicates the raw lead vocal stems without mastering effects.
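The decomposition of CER into substitution, deletion, and insertion rates reported in Table 2 can be sketched with a standard edit-distance alignment. This is a minimal illustration, not the scoring tool used for the reported numbers: the function name is hypothetical, no text normalization is applied, and when several minimum-cost alignments exist the split between S, D, and I depends on the tie-breaking rule (here, Python tuple ordering).

```python
def cer_breakdown(reference: str, hypothesis: str) -> dict:
    """Character error rates in percent, split into S, D, and I.

    Each table cell holds (edits, subs, dels, ins) for aligning
    reference[:i] with hypothesis[:j] at minimum total edit count.
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must not be empty")
    table = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, k = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                candidates = [(e, s, d, k)]          # match, no cost
            else:
                candidates = [(e + 1, s + 1, d, k)]  # substitution
            e, s, d, k = table[i - 1][j]
            candidates.append((e + 1, s, d + 1, k))  # deletion
            e, s, d, k = table[i][j - 1]
            candidates.append((e + 1, s, d, k + 1))  # insertion
            table[i][j] = min(candidates)            # fewest edits first
    edits, subs, dels, ins = table[n][m]
    return {
        "S": 100.0 * subs / n,
        "D": 100.0 * dels / n,
        "I": 100.0 * ins / n,
        "CER": 100.0 * edits / n,
    }


if __name__ == "__main__":
    print(cer_breakdown("abcde", "abXde"))  # one substitution
    print(cer_breakdown("abcd", "abd"))     # one deletion
```

Because rates are normalized by the reference length, many insertions (as with hallucinated output) can push CER up even when few reference characters are missed, which matches the contrast between the two systems described above.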
7. CONCLUSION
We constructed the IdolSongsJp corpus, a novel multi-singer song corpus in the style of Japanese idol groups. The corpus includes not only mastered tracks but also stems for evaluating MSS techniques, dry vocal tracks, solo versions of all song–singer pairs, and chord annotations provided by expert annotators. The songs cover a wide range of musical styles observed in Japanese idol groups, encompassing diverse genres, tempos, song division patterns, and lyrical themes. Therefore, the corpus serves as a realistic resource not only for general tasks, such as MSS and automatic chord estimation, but also for song-specific applications, such as audience interaction enhancement and piano arrangement generation.
The IdolSongsJp corpus explicitly focuses on song division (utawari) structures, making it well-suited for multi-singer tasks such as singer diarization and multi-pitch detection. Future directions include applying this corpus to the development and evaluation of such tasks specific to multi-singer songs. Furthermore, extending this approach to popular music genres from other regions, such as Korean and Chinese pop, may further broaden the applicability of music information processing techniques, particularly those for multi-singer songs.
8. ACKNOWLEDGMENTS
This research was partially supported by the AIST policy-based budget project “R&D on Generative AI Foundation Models for the Physical Domain.” The authors would like to acknowledge Mr. Takizawa (AIST) for his support in the speech recognition evaluation.
9. ETHICS STATEMENT
This corpus includes instrument stems and dry vocal tracks, which may be used for music generation and related applications. Instrument stems may be utilized to create new musical content through music generation techniques or sampling, potentially infringing on the rights of these sources. Vocal signals may also enable synthetic speech that is unethical or impersonates others. To address these concerns, such uses are explicitly prohibited under the license provided at the distribution repository.
10. REFERENCES
[1] M. Mülle , D. P. W. Ellis, A. Klapu i, and G. Richa d,
“Signal p ocessing o music analysis,” IEEE jou nal
o selec ed opics in signal p ocessing, ol. 5, no. 6,
pp. 1088–1110, 2011.
[2] H. Pu wins, B. Li, T. Vi anen, J. Schlu e , S.-Y. Chang,
and T. Saina h, “Deep lea ning o audio signal p o-
cessing,” IEEE jou nal o selec ed opics in signal p o-
cessing, ol. 13, no. 2, pp. 206–219, 2019.
[3] M. Go o, “An audio-based eal- ime bea acking sys-
em o music wi h o wi hou d um-sounds,” Jou nal
o new music esea ch, ol. 30, no. 2, pp. 159–171,
2001.
[4] L. Jean, “E icien empo and bea acking in audio
eco dings,” Jou nal o he Audio Enginee ing Socie y.
Audio Enginee ing Socie y, ol. 51, pp. 226–233, 2003.
[5] M. Go o, “A obus p edominan -F0 es ima ion me hod
o eal- ime de ec ion o melody and bass lines in CD
eco dings,” in P oc. 2000 IEEE In e na ional Con-
e ence on Acous ics, Speech, and Signal P ocessing
(ICASSP), ol. 2, 2000, pp. 757–760.
[6] J. Salamon, E. Gomez, D. P. W. Ellis, and G. Richa d,
“Melody ex ac ion om polyphonic music signals:
App oaches, applica ions, and challenges,” IEEE Sig-
nal P ocessing Magazine, ol. 31, no. 2, pp. 118–134,
2014.
[7] H.-T. Cheng, Y.-H. Yang, Y.-C. Lin, I.-B. Liao, and
H. H. Chen, “Au oma ic cho d ecogni ion o music
classi ica ion and e ie al,” in P oc. 2008 IEEE In e -
na ional Con e ence on Mul imedia and Expo, 2008,
pp. 1505–1508.
[8] N. Boulange -Lewandowski, Y. Bengio, and P. Vin-
cen , “Audio cho d ecogni ion wi h ecu en neu al
ne wo ks,” in P oc. 14 h In e na ional Socie y o Mu-
sic In o ma ion Re ie al Con e ence (ISMIR 2013),
2013.
[9] F. Ko zeniowski and G. Widme , “A ully con olu-
ional deep audi o y model o musical cho d ecogni-
ion,” in P oc. 2016 IEEE 26 h In e na ional Wo kshop
on Machine Lea ning o Signal P ocessing (MLSP),
2016, pp. 1–6.
[10] T. Vi anen, “Monau al sound sou ce sepa a ion by
nonnega i e ma ix ac o iza ion wi h empo al con i-
nui y and spa seness c i e ia,” IEEE T ansac ions on
Audio, Speech, and Language P ocessing, ol. 15,
no. 3, pp. 1066–1074, 2007.
[11] A. Dé ossez, “Hyb id spec og am and wa e o m
sou ce sepa a ion,” in P oc. ISMIR 2021 Music Demix-
ing Wo kshop, 2021, pp. 1–11.
[12] A. Mesa os and T. Vi anen, “Au oma ic ecogni ion o
ly ics in singing,” EURASIP Jou nal on Audio, Speech,
and Music P ocessing, ol. 2010, no. 1, pp. 1–11,
2010.
[13] X. Gao, C. Gup a, and H. Li, “Au oma ic ly ics an-
sc ip ion o polyphonic music wi h ly ics-cho d mul i-
ask lea ning,” IEEE/ACM T ansac ions on Audio,
Speech, and Language P ocessing, ol. 30, pp. 2280–
2294, 2022.
[14] Z. Ra ii, A. Liu kus, F.-R. S ö e , S. I. Mimilakis,
and R. Bi ne , “The MUSDB18 co pus o music
sepa a ion,” 2017. [Online]. A ailable: h ps://doi.o g/
10.5281/zenodo.1117372
[15] ——, “MUSDB18-HQ - an uncomp essed e sion
o MUSDB18,” 2019. [Online]. A ailable: h ps:
//doi.o g/10.5281/zenodo.3338373
[16] I. Pe ei a, F. A aújo, F. Ko zeniowski, and R. Vogl,
“MoisesDB: A da ase o sou ce sepa a ion beyond 4-
s ems,” in P oc. 24 h In e na ional Con e ence on Mu-
sic In o ma ion Re ie al Con e ence (ISMIR 2023),
2023, pp. 619–626.
[17] S. Roua d, F. Massa, and A. Dé ossez, “Hyb id ans-
o me s o music sou ce sepa a ion,” in P oc. 2023
IEEE In e na ional Con e ence on Acous ics, Speech
and Signal P ocessing (ICASSP), 2023, pp. 1–5.
[18] C.-B. Jeon and K. Lee, “Towards robust music source separation on loud commercial music,” in Proc. 23rd International Society for Music Information Retrieval Conference (ISMIR 2022), 2022.
[19] D. Petermann, P. Chandna, H. Cuesta, J. Bonada, and E. Gómez, “Deep learning based source separation applied to choir ensembles,” in Proc. 21st International Society for Music Information Retrieval Conference (ISMIR 2020), 2020.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[20] T. Nakamura, S. Takamichi, N. Tanji, S. Fukayama, and H. Saruwatari, “jaCappella Corpus: A Japanese a cappella vocal ensemble corpus,” in Proc. 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[21] C.-B. Jeon, H. Moon, K. Choi, B. S. Chon, and K. Lee, “MedleyVox: An evaluation dataset for multiple singing voices separation,” in Proc. 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[22] Y. Okada, Ed., Living with Idols (in Japanese). Pot Publishing, 2013.
[23] H. Suda, D. Saito, S. Fukayama, T. Nakano, and M. Goto, “Singer diarization for polyphonic music with unison singing,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1531–1545, 2022.
[24] H. Suda, S. Yoshida, T. Nakamura, S. Fukayama, and J. Ogata, “FruitsMusic: A real-world corpus of Japanese idol-group songs,” in Proc. 25th International Society for Music Information Retrieval Conference (ISMIR 2024), 2024.
[25] M. Thlithi, C. Barras, J. Pinquier, and T. Pellegrini, “Singer diarization: Application to ethnomusicological recordings,” in Proc. 5th International Workshop on Folk Music Analysis (FMA 2015), 2015, pp. 124–125.
[26] M. Goto, “Active music listening interfaces based on signal processing,” in Proc. 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, 2007, pp. IV–1441–IV–1444.
[27] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, “RWC music database: Popular, classical and jazz music databases,” in Proc. 3rd International Conference on Music Information Retrieval (ISMIR 2002), 2002.
[28] W. Xie, “Japanese “idols” in trans-cultural reception: the case of AKB48,” in The Art of Reception, J. Bracker and A.-K. Hubrich, Eds., 2021, pp. 371–399.
[29] Radiocommunication Sector of International Telecommunication Union (ITU-R), “Recommendation ITU-R BS.1770-3: Algorithms to measure audio programme loudness and true-peak audio level,” 2012.
[30] C. Harte, M. B. Sandler, S. A. Abdallah, and E. Gómez, “Symbolic representation of musical chords: A proposed syntax for text annotations,” in Proc. 6th International Conference on Music Information Retrieval (ISMIR 2005), 2005, pp. 66–71.
[31] Y. Wu, K. Chen, T. Zhang, Y. Hui, M. Nezhurina, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” arXiv [cs.SD], 2022.
[32] L. McInnes, J. Healy, N. Saul, and L. Großberger, “UMAP: Uniform manifold approximation and projection,” Journal of Open Source Software, vol. 3, no. 29, p. 861, 2018.
[33] X. Liu, H. Liu, Q. Kong, X. Mei, J. Zhao, Q. Huang, M. D. Plumbley, and W. Wang, “Separate what you describe: Language-queried audio source separation,” in Proc. Interspeech 2022, 2022, pp. 1801–1805.
[34] J. A. Burgoyne, J. Wild, and I. Fujinaga, “An expert ground truth set for audio chord recognition and music analysis,” in Proc. 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.
[35] J. Jiang, K. Chen, W. Li, and G. Xia, “Large-vocabulary chord transcription via chord structure decomposition,” in Proc. 20th International Society for Music Information Retrieval Conference (ISMIR 2019), 2019.
[36] J. Park, K. Choi, S. Jeon, D. Kim, and J. Park, “A bi-directional Transformer for musical chord recognition,” in Proc. 20th International Society for Music Information Retrieval Conference (ISMIR 2019), 2019.
[37] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. 40th International Conference on Machine Learning (ICML’23), 2022, pp. 28 492–28 518.
[38] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech 2020, 2020, pp. 5036–5040.
[39] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[40] D. Takizawa, T. Nakamura, H. Suda, and S. Fukayama, “Automatic speech recognition of Japanese dialects using large-scale self-supervised learning models (in Japanese),” in Proc. 2025 Spring Meeting of the Acoustical Society of Japan, 2025.
[41] S. Ando and H. Fujihara, “Construction of a large-scale Japanese ASR corpus on TV recordings,” in Proc. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6948–6952.
[42] O. Cífka, H. Schreiber, L. Miner, and F.-R. Stöter, “Lyrics transcription for humans: A readability-aware benchmark,” in Proc. 25th International Society for Music Information Retrieval Conference (ISMIR 2024), 2024.