Coloring Music: Bridging Music and Color Palettes for Graphic Design

Author: Takayuki Nakatsuka; Masahiro Hamasaki; Masataka Goto
Publisher: Zenodo
DOI: 10.5281/zenodo.17706337
Source: https://zenodo.org/records/17706337/files/000009.pdf
COLORING MUSIC: BRIDGING MUSIC AND COLOR PALETTES
FOR GRAPHIC DESIGN
Takayuki Nakatsuka, Masahiro Hamasaki, Masataka Goto
National Institute of Advanced Industrial Science and Technology (AIST), Japan
{takayuki.nakatsuka, masahiro.hamasaki, m.goto}@aist.go.jp
ABSTRACT
This paper explores the relationship between music and the color palettes used for designing their corresponding music cover images, providing a comprehensive analysis that bridges auditory and visual expression. Our findings reveal a relationship between musical pieces and certain colors, suggesting that the color palettes used in cover image design are carefully selected to reflect the auditory experience. Building on these findings, we propose a framework that estimates appropriate color palettes for musical pieces to support selecting colors for cover image design. Using a large private dataset of 582,894 pairs of a musical piece and its corresponding cover image from various music genres, our framework leverages deep learning techniques to train our color palette estimator. We demonstrate the effectiveness of our proposed framework in graphic design by showcasing an application that generates cover images using the estimated color palettes from given musical pieces.
1. INTRODUCTION
In multimodal music understanding, both music and their corresponding music cover images play a crucial role. For instance, Oramas et al. successfully improved music genre classification accuracy by incorporating image features in addition to audio features [1]. In addition, Lībeks and Turnbull showed that cover images involve distinct features that can be used to predict music genre tags [2]. These studies suggested that a cover image embodies the essence of its corresponding music content, thereby establishing that analyzing these images yields a deeper understanding of the music. This study focuses on the colors used in cover images and analyzes their relationship with music.
The colors used in cover images tend to empirically reflect the characteristics of the corresponding music style. As illustrated in Fig. 1, different music genres display distinctive characteristics in the colors used in the cover images. As colors are closely linked to cultural contexts [3], emotions [4, 5], and the ability to attract visual attention [6], cover images contribute to the promotion of music content and enhance the overall music appreciation experience [7]. Therefore, this relationship between music and the colors used in cover images has been the subject of several studies [8–10]. However, these studies have mainly focused on genres, not on musical pieces.

[Figure 1: three panels of album cover search results, labeled "Death metal", "Country", and "Electronic".]
Figure 1. Example results of Google Search with the text queries “{music genre} music album covers,” where we used the music genres ‘Death metal,’ ‘Country,’ and ‘Electronic.’ Music cover images of each genre are characterized by the colors used in cover image design: dark colors for ‘Death metal,’ brownish colors for ‘Country,’ and vivid colors for ‘Electronic.’

© T. Nakatsuka, M. Hamasaki, and M. Goto. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: T. Nakatsuka, M. Hamasaki, and M. Goto, “Coloring Music: Bridging Music and Color Palettes for Graphic Design”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
This paper first investigates the preferred colors for designing cover images across multiple genres in our preliminary study (Section 4) and further explores the relationship between musical pieces and the colors used in their corresponding cover images based on our proposed framework (Section 5). In this study, we focus on not only a representative color but also color palettes used in cover images because they play a crucial role in graphic design [11–13], shedding light on the deliberate selection process of colors that reflect the essence of the music content.
Based on our findings that a relationship exists between musical pieces and the colors used in their corresponding cover images, we propose a framework to estimate appropriate color palettes for musical pieces. The key technical aspects of our framework are how to extract color palettes from cover images and how to estimate color palettes for musical pieces. For a color palette extraction method, we employ data-driven color manifolds [14], which are useful in arranging the colors as a color palette. For a color palette estimator, we train a deep neural network to estimate an appropriate color palette for each musical piece. In this training, we leverage a pretrained audio model (contrastive language-audio pretraining (CLAP) [15] or AudioToken [16]) as an audio feature extractor to extract a distinctive feature from each musical piece. This framework bridges musical pieces and their corresponding cover images using color palettes.
To demonstrate the effectiveness of our framework, we present an example application that generates cover images using the estimated color palettes from given musical pieces to support creating visually appealing cover images.
2. RELATED WORK
Several studies have investigated the relationship between music and color. Wells argued that there is a correlation between music and color based on the principle of complementarity [17]. Furthermore, Pesek et al. suggested that since music and emotions are closely related (e.g., [18, 19]), as well as emotions and colors (e.g., [4, 5]), there exists a relationship between music and color mediated by emotions [20]. However, these studies have only partially elucidated the relationship between music and color, as they analyzed this relationship using a limited number of colors. Therefore, in this study, we use the colors used in music cover images that embody a musical essence [1, 2] as the basis of our analysis.
In research exploring the colors used in cover images, previous studies have focused on specific genres (classical [8] and metal [9]). Seker [8] discovered that the colors used in cover images of classical music predominantly favor neutral colors. Friconnet [9] found that cover images of metal music tend to use darker colors than those of other genres, with a preference for black and orange. Although these studies provide insights into the colors used in cover images of specific genres, no studies have explored which color values are preferred for specific musical pieces of various genres.
Additionally, color themes used in designing cover images have been studied [10]. Dorochowicz and Kostek [10] analyzed cover images across multiple genres with respect to basic color analysis rules such as seasonal colors (e.g., spring (warm and bright), summer (cool and soft), autumn (warm and soft), and winter (cool and bright)) and degrees of brightness (e.g., light, medium, and dark). While their findings provide valuable insights into the color characteristics of each genre, they focus on a limited number of color palettes based on the basic color analysis rules.
In this paper, we investigate the relationship between musical pieces and the color palettes used in their corresponding cover images and explore the application of this relationship in cover image design.
[Figure 2: three rows labeled "Input images", "k-means (clustering method)", and "Data-driven color manifolds".]
Figure 2. Comparison of color palette extraction methods. Given the input image (top row), the data-driven color manifolds (bottom row) extract a color palette from the image in consecutive color order, while k-means (middle row) extracts a color palette from the image in random color order.
3. COLOR EXTRACTION
To extract a representative color or color palettes from music cover images, we leverage data-driven color manifolds [14], a technique which aims to acquire color samples from images and learn a lower-dimensional manifold of the acquired color samples. The learned manifold reflects the distribution of colors in cover images, compressing areas of the color space that are less commonly used and expanding those that are more frequently utilized.
The technique involves several steps, starting with the acquisition of color samples from cover images. For successful color manifold learning, a sufficient number of samples (over 10k) must be obtained from each image. Note that we utilized all samples from cover images resized to 224 px × 224 px, amounting to over 50k samples. These samples are then used to estimate the density of each color in the cover images, with a focus on identifying and preserving the most important colors. A self-organizing map [21], which is used to reduce dimensionality, is then applied to derive the one-dimensional or two-dimensional color manifolds. We utilize the one-dimensional color manifold to extract color palettes from cover images. In practice, we calculate a discrete color manifold, which consists of $M \in \mathbb{N}$ colors, to use the derived color manifold as a color palette. All hyperparameter values related to density estimation and dimensionality reduction were taken from [14], except for the smoothness parameter, which we set to 1.
The advantage of this technique over clustering methods such as k-means [22] is that the color palette extracted by the data-driven color manifolds has a meaningful ordering, where the order of colors is determined by the derived one-dimensional color manifold and thus results in consecutiveness, while the color palette extracted by a clustering method has a random ordering (see Fig. 2). When using a color palette consisting of multiple colors in graphic design, the color palette with a continuous color order based on the data-driven color manifolds is intuitive and easy to use.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

[Figure 3: three-dimensional RGB histograms (axes R, G, and B, each from 0 to 1) in six panels: Classical and Stage & Screen (annotated "Low saturation"), Reggae and Latin (annotated "Warm colors"), Rock and Electronic (annotated "Wide variations").]
Figure 3. Visualization of a representative color used in music cover images by music genre. The larger the circle in the visualization, the more frequently the circle’s color appears in the cover images.
4. PRELIMINARY STUDY
This section describes our preliminary study that aims to analyze the preferred colors for designing music cover images across multiple genres by leveraging color palettes extracted from these images.
4.1 Experimental Setup
4.1.1 Dataset
We randomly collected 3,887 cover images (each image is an RGB image) for the experiments. We assigned genre tags to each image based on the grouping of genres and styles in Discogs¹, in which various music is organized into 15 genres and styles (‘Blues,’ ‘Brass & Military,’ ‘Children’s,’ ‘Classical,’ ‘Electronic,’ ‘Folk, World, & Country,’ ‘Funk / Soul,’ ‘Hip-Hop,’ ‘Jazz,’ ‘Latin,’ ‘Non-Music,’ ‘Pop,’ ‘Reggae,’ ‘Rock,’ and ‘Stage & Screen’). A total of 5,150 genre tags were assigned to 3,887 cover images, which means an average of 343.3 images per genre.

¹ The grouping of genres and styles in Discogs is available at https://support.discogs.com/hc/en-us/articles/360005055213-Database-Guidelines-9-Genres-Styles.
Table 1. List of representative colors most frequently used in music cover images of each music genre, excluding grayscale colors. (The color swatch column of the original table is not reproduced here.)

Music genre               RGB value
Blues                     (148,135,102)
Brass & Military          (110,101,74)
Children’s                (139,178,241)
Classical                 (105,132,128)
Electronic                (72,36,36)
Folk, World, & Country    (108,101,68)
Funk / Soul               (111,109,73)
Hip-Hop                   (73,36,36)
Jazz                      (181,145,109)
Latin                     (146,112,110)
Non-Music                 (165,127,156)
Pop                       (110,73,73)
Reggae                    (168,132,68)
Rock                      (72,36,36)
Stage & Screen            (174,172,106)
4.1.2 Implementation details
For representing colors in color manifolds, we utilized an RGB color space, which is a widely used additive color model. We resized all of the cover images into 224 px × 224 px and normalized their RGB values to [0, 1]. Then, for the purpose of this preliminary study, we simply extracted one representative color (i.e., $M = 1$) from the resized images using the data-driven color manifolds [14] as described in Section 3. Note that we extracted more colors to form the color palettes in Section 5.4. We used all of the pixels in the resized images as color samples. The constructed color manifold is divided into eight bins, and the samples are discretized into these bins, enabling their visualization as a histogram.
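As a rough illustration of the discretization step, the sketch below bins normalized RGB samples into a three-dimensional histogram and picks the most frequent non-grayscale bin. It assumes eight equal-width bins per RGB channel and treats a bin as grayscale when its three indices are equal; both choices, and all function names, are illustrative assumptions rather than the authors' procedure, which operates on the learned color manifold.

```python
from collections import Counter


def rgb_histogram(pixels, bins=8):
    """Count pixels per (r, g, b) bin; channel values are floats in [0, 1]."""
    def bin_index(value):
        # Clamp so that value == 1.0 falls into the last bin.
        return min(int(value * bins), bins - 1)

    return Counter((bin_index(r), bin_index(g), bin_index(b)) for r, g, b in pixels)


def is_grayscale_bin(bin_rgb):
    """Assumption: a bin counts as grayscale when all three indices match."""
    return bin_rgb[0] == bin_rgb[1] == bin_rgb[2]


def most_frequent_color_bin(histogram, skip_grayscale=True):
    """Return the most populated bin, optionally ignoring grayscale bins."""
    for bin_rgb, _count in histogram.most_common():
        if not (skip_grayscale and is_grayscale_bin(bin_rgb)):
            return bin_rgb
    return None


# Toy input: five black pixels and four orange-like pixels.
pixels = [(0.0, 0.0, 0.0)] * 5 + [(1.0, 0.5, 0.1)] * 3 + [(0.9, 0.55, 0.05)]
histogram = rgb_histogram(pixels)
dominant = most_frequent_color_bin(histogram)
```

Skipping grayscale bins mirrors the motivation given for Table 1, where text and background colors would otherwise dominate the counts.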
4.2 Results
Fig. 3 represents three-dimensional histograms of the R, G, and B values in the RGB color space. As shown in Fig. 3, trends in the distribution of colors differ by genre: for example, ‘Classical’ and ‘Stage & Screen’ music cover images have low saturation, and ‘Reggae’ and ‘Latin’ music cover images tend to use warm colors such as red and yellow. It can also be observed that genres such as ‘Pop’ and ‘Rock’ feature a wide variety of colors in their cover images. These genres have more diverse subcategories and styles than other genres, resulting in such color variation. Note that due to text on cover images and background colors, grayscale colors are prominently displayed in the histogram. Therefore, Table 1 lists the representative colors most frequently used in cover images, excluding grayscale colors. As shown in Table 1, the most frequently used colors vary by music genre. Our proposed color palette estimation framework is designed based on this insight.
5. COLOR PALETTE ESTIMATION FRAMEWORK
Using the close relationship between musical pieces and colors used in their corresponding music cover images, we propose a framework designed to estimate appropriate color palettes for musical pieces. Fig. 4 shows an overview of our framework. We utilize an audio feature extractor and a color palette estimator to estimate a color palette for each musical piece. The audio feature extractor extracts audio features from musical pieces. To leverage recent advancements in audio models for downstream tasks, we use a pretrained audio model as the audio feature extractor. Then, the color palette estimator takes the extracted audio features as input and estimates appropriate color palettes for musical pieces.
To train the color palette estimator, we construct a large private dataset of musical pieces and their corresponding cover images. Training also requires the ground-truth color palettes that the estimator should reproduce. Therefore, we extract the ground-truth color palette from each cover image by leveraging the data-driven color manifolds [14] described in Section 3 as our color palette extraction method.
5.1 Audio Feature Extractor
Audio models trained on large datasets have demonstrated their capabilities in downstream tasks [15, 23]. For example, the outputs of the final layer of an audio model are utilized in classification tasks, while audio embeddings are used in generative tasks. In our approach, we leverage the pretrained audio model $F$ to extract audio features from musical pieces.
Let $A = \{a_n \in \mathbb{R}^{T}\}_{n=1}^{N}$ be a set of musical pieces, where $T$ is the length of each musical piece and $N$ is the number of musical pieces. Next, let $Z = \{z_n \in \mathbb{R}^{d}\}_{n=1}^{N}$ be a set of audio features, where $d$ is the number of dimensions of each audio feature. The audio feature $z_n$ can be extracted from the musical piece $a_n$ by using the pretrained audio model $F$ as follows:

$$z_n = F(a_n). \tag{1}$$

We utilize the pretrained audio model with all of its trainable parameters fixed.
5.2 Color Palette Estimator
To estimate color palettes from the extracted audio features $z_n$, we propose the color palette estimator $G$, which consists of three linear layers with a GELU function [24]. Let $C = \{c_n \in \mathbb{R}^{3 \times M}\}_{n=1}^{N}$ be a set of color palettes, where $M$ is the number of colors in each color palette and each color is represented by a set of three numerical values (i.e., RGB values). The color palette $c_n$ can be estimated from the audio feature $z_n$ by using the color palette estimator $G$ as follows:

$$c_n = G(z_n) = \sigma\left(W_3\,\mathrm{GELU}\left(W_2\,\mathrm{GELU}(W_1 z_n + b_1) + b_2\right) + b_3\right), \tag{2}$$

where $\sigma$ is a sigmoid function. The parameters of the color palette estimator $G$ are defined by $W_1 \in \mathbb{R}^{h \times d}$, $W_2 \in \mathbb{R}^{h \times h}$, $W_3 \in \mathbb{R}^{3 \times M \times h}$, $b_1 \in \mathbb{R}^{h}$, $b_2 \in \mathbb{R}^{h}$, and $b_3 \in \mathbb{R}^{3 \times M}$, where the dimension of the hidden layer $h$ is set to 768. While training, a dropout with a probability of 0.2 is applied to each output of the GELU functions.

[Figure 4: diagram with two panels. Training procedure: musical piece → audio feature extractor (fixed) → color palette estimator → estimated color palette; music cover image → color palette extraction → ground-truth color palette; the estimated and ground-truth palettes are compared by the loss $L_{\mathrm{MSE}}$. Color palette estimation: musical piece → audio feature extractor (fixed) → color palette estimator (fixed) → estimated color palette.]
Figure 4. Overview of our proposed color palette estimation framework. (Training procedure) We start with an original pair of a musical piece and its corresponding music cover image. The musical piece is processed by a fixed audio model to extract its audio feature, and then a color palette estimator is trained to estimate a color palette from the audio feature. For this training, the ground-truth color palette is extracted from the cover image by using the color palette extraction method. We use the mean squared error (MSE) loss function to optimize our color palette estimator. (Color palette estimation) After training, the color palette estimator can be used to estimate appropriate color palettes for musical pieces.
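To make the shape of Eq. (2) concrete, the following is a minimal, dependency-free sketch of the estimator's forward pass at inference time (dropout is therefore omitted). It is not the authors' PyTorch implementation: the toy dimensions, the random placeholder weights, the helper names, and the layout of the 3M outputs as consecutive RGB triplets are all assumptions made for illustration.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, vector):
    """Compute W @ vector + bias, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def init_linear(rng, n_out, n_in):
    """Random placeholder parameters (untrained), for shape checking only."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def estimate_palette(z, params, num_colors):
    """Eq. (2): sigmoid(W3 GELU(W2 GELU(W1 z + b1) + b2) + b3)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    hidden = [gelu(x) for x in linear(w1, b1, z)]
    hidden = [gelu(x) for x in linear(w2, b2, hidden)]
    flat = [sigmoid(x) for x in linear(w3, b3, hidden)]
    # Assumed layout: the 3*M outputs are M consecutive RGB triplets.
    return [tuple(flat[3 * i: 3 * i + 3]) for i in range(num_colors)]


rng = random.Random(0)
d, h, M = 16, 32, 5  # toy sizes; the paper uses d = h = 768
params = [init_linear(rng, h, d), init_linear(rng, h, h), init_linear(rng, 3 * M, h)]
z = [rng.gauss(0.0, 1.0) for _ in range(d)]  # stand-in for an audio feature z_n
palette = estimate_palette(z, params, M)
```

Because the final activation is a sigmoid, every estimated channel lies in (0, 1), which matches the [0, 1] normalization of the ground-truth RGB values.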
5.3 Experimental Setup
5.3.1 Dataset
The large private dataset for training our color palette estimator contains music audio excerpts (each excerpt is a 30 s audio preview for trial listening, with a 44.1 kHz sampling rate) and their corresponding cover images (each image is an RGB image). The excerpts and their cover images are limited to single tracks, i.e., an original pair of an excerpt and its corresponding cover image is unique. The dataset contains 582,894 pairs of an excerpt and its corresponding cover image by 115,113 artists. We randomly split the dataset into training, validation, and test sets with an eight-one-one ratio (i.e., 466,316 pairs for the training set and 58,289 pairs each for the validation and test sets) and with no artists overlapping across these sets.
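One way to obtain a split with no artist overlap is to assign whole artists, rather than individual pairs, to each subset. The sketch below does this greedily; the function name, the greedy rule, and the data layout are hypothetical, and the paper does not describe how its split was actually produced.

```python
import random
from collections import defaultdict


def split_by_artist(pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (artist_id, item) pairs into three lists with disjoint artists."""
    by_artist = defaultdict(list)
    for artist, item in pairs:
        by_artist[artist].append((artist, item))

    artists = sorted(by_artist)
    random.Random(seed).shuffle(artists)

    targets = [ratio * len(pairs) for ratio in ratios]
    splits = [[], [], []]
    for artist in artists:
        # Place the whole artist into the split furthest below its target size.
        index = max(range(3), key=lambda i: targets[i] - len(splits[i]))
        splits[index].extend(by_artist[artist])
    return splits


# Toy data: 200 artists with one to five excerpts each.
rng = random.Random(1)
pairs = [(f"artist_{a}", f"excerpt_{a}_{k}")
         for a in range(200) for k in range(rng.randint(1, 5))]
train, valid, test = split_by_artist(pairs)
```

With many small artists the resulting sizes approach the requested ratio, but they are only approximate because each artist is kept intact.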
5.3.2 Implementation Details
As described in Section 4.1.2, we utilized the RGB color space for color representation, resized the cover images into 224 px × 224 px, and normalized their RGB values to [0, 1]. We used a single NVIDIA A6000 GPU to train the color palette estimator. Our implementation was based on PyTorch [25]. We used a mini-batch size of 2,048. To train the color palette estimator, we used the Adam optimizer [26] with a learning rate of $1.0 \times 10^{-4}$. We calculated the mean squared error (MSE) loss function $L_{\mathrm{MSE}}$ between the estimated and ground-truth color palettes to optimize the parameters of the color palette estimator.
5.4 Experimental Settings
To clarify which audio feature extractor would be most effective in estimating the color palette, we conducted comparative experiments to investigate the importance of this selection. Additionally, as a reference, we used color palettes composed of random colors.
5.4.1 Audio feature extractor
To extract audio features from musical pieces, we compared two audio models: CLAP [15] and AudioToken [16].
To use CLAP [15], we set the parameters of pretrained models available at HuggingFace’s Transformers [27] (i.e., “laion/clap-htsat-fused”). Each musical piece was converted to a mel spectrogram through a CLAP feature extractor, and the CLAP audio model used the spectrogram as input. We fixed all parameters of the model. By using the model, we can obtain a 768-dimensional feature vector for each musical piece.
We also used AudioToken [16], which consists of a bidirectional encoder representation from audio transformers (BEATs) [23] model and an embedder [16] model. We set the parameters of pretrained models available at official GitHub repositories². All parameters of the models were fixed. By employing the models, we can obtain a 768-dimensional feature vector for each musical piece.
5.4.2 Number of Colors in Color Palette
We used the number of colors $M = \{1, 2, 3, 4, 5\}$ in the color palettes for the experiments. To extract the color palettes from the cover images, we used the data-driven color manifolds [14] as described in Section 3.
5.5 Evaluation Metric for Comparative Experiments
In our comparative experiments, we used the minimum color difference model (MCDM) [28], which is practically designed to evaluate the color difference between two color palettes. The MCDM compares the two color palettes, each consisting of $M$ colors, to determine their average color difference. First, the colors in the color palettes are converted from RGB to CIELAB [29]. Then, the MCDM calculates a CIELAB color difference between each color in one palette and all colors in the other palette, identifying the closest color match for each and recording the minimum differences. While multiple variants of CIELAB color difference exist, we here adopted the CIE1976 color difference [30] for evaluation. This process is repeated for every color in the first palette, resulting in $M$ color difference values, which are then averaged to obtain a mean value, denoted as $m_1$. The same process is repeated for the second palette, finding the closest matches in the first palette and averaging the $M$ minimum differences to obtain another mean value, $m_2$. Finally, the average of $m_1$ and $m_2$ gives the overall color difference between the two palettes. The lower the MCDM score, the closer the two color palettes. We leverage this MCDM to compare a color palette estimated with each experimental setting and a ground-truth color palette.

² The pretrained models are available at https://github.com/microsoft/unilm/tree/master/beats for the BEATs model and https://github.com/guyyariv/AudioToken for the embedder model.

Table 2. Results of the MCDM score on the test set of our dataset. A lower MCDM score indicates a closer match between the estimated color palettes and the ground-truth color palettes.

Audio feature extractor    M    MCDM score
CLAP                       1    28.37
                           2    25.66
                           3    22.68
                           4    21.72
                           5    21.17
AudioToken                 1    28.40
                           2    25.69
                           3    22.66
                           4    21.71
                           5    21.14
(Random)                   1    69.29
                           2    68.09
                           3    66.35
                           4    65.90
                           5    65.50
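The MCDM computation described in Section 5.5 can be sketched in a few lines of standard-library Python. The sketch assumes that the RGB values are sRGB with a D65 white point when converting to CIELAB; the paper does not state the RGB working space, so the conversion constants and all function names here are assumptions.

```python
import math

# D65 reference white (assumption; the paper does not state the white point).
WHITE = (0.95047, 1.0, 1.08883)


def srgb_to_lab(rgb):
    """Convert an sRGB color with channels in [0, 1] to CIELAB."""
    def to_linear(c):
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (to_linear(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        delta = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = (f(v / w) for v, w in zip((x, y, z), WHITE))
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))


def delta_e76(lab1, lab2):
    """CIE1976 color difference: Euclidean distance in CIELAB."""
    return math.dist(lab1, lab2)


def mean_min_difference(source, target):
    """Average, over source colors, of the distance to the closest target color."""
    return sum(min(delta_e76(s, t) for t in target) for s in source) / len(source)


def mcdm(palette_a, palette_b):
    """Average of the two directed mean minimum differences (m1 and m2)."""
    lab_a = [srgb_to_lab(c) for c in palette_a]
    lab_b = [srgb_to_lab(c) for c in palette_b]
    m1 = mean_min_difference(lab_a, lab_b)
    m2 = mean_min_difference(lab_b, lab_a)
    return (m1 + m2) / 2.0
```

Because each direction uses the closest match rather than a position-wise comparison, the score does not depend on the order of colors within either palette.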
5.6 Results
Table 2 presents the results of the MCDM score under each experimental setting. As shown in Table 2, our proposed framework achieves a much lower (i.e., better) MCDM score compared to random color palettes, with an improvement of over 40 points. This demonstrates the effectiveness of our framework. Additionally, these results support that there is a relationship between musical pieces and the color palettes used for designing their corresponding cover images because our framework succeeds in training the color palette estimator.
Regarding the selection of each experimental setting, there is virtually no performance difference between the CLAP and AudioToken audio models: as shown in Table 2, their MCDM scores differ by at most 0.03 for every value of $M$. This suggests that either audio model can be selected based on the intended application. As these results demonstrate, our framework effectively estimates appropriate color palettes for musical pieces.
[Figure 5: two examples, each showing the input musical piece, the estimated color palette (three colors, numbered ①②③), the text prompt, and the images generated with and without colors. Musical piece labels: "Ti /Simi" and "Cold Ones/Relaxing Piano Music".
Text prompt (first example): "music cover image, a serene lakeside at night under a full moon, a cellist plays soulfully on a small wooden dock, with swans swimming gently nearby."
Text prompt (second example): "music cover image, an outdoor rock festival in a mountainous area, the stage is set against a backdrop of towering mountains, the audience cheers as the band plays under the open sky at sunset."]
Figure 5. Example results of music cover image generation using the estimated color palettes. The upper left of each example shows the title and artist name of the input musical piece, the lower left shows the estimated color palette, the upper right shows the input text prompt, and the lower right shows the images generated using the input text prompt with/without the estimated color palette. In each text prompt, the colored words indicate objects being specified with that color by a user by leveraging the estimated color palette.
6. APPLICATION
Our framework can estimate color palettes appropriate for musical pieces, opening new doors in designing visual content for music content. In the context of generative AI, color palettes serve as an important means of controllability in image synthesis. Providing color hints allows users to align a generative model’s outputs with design intentions, thereby highlighting the importance of color palettes as compact and interpretable information in human–AI collaborative creativity [31, 32]. We showcase the potential of our framework through an example application for creating music cover images. This application leverages a cutting-edge rich-text-to-image model proposed by Ge et al. [33, 34], which enables precise color rendering on target objects or regions in image generation. By integrating our framework with this rich-text-to-image model, we can enhance the graphic design of cover images with color palettes. Instead of using text prompts such as “red”, “blue”, or “green”, this application can render the generated images with specified color values, such as RGB values or Hex color codes. As described in Section 1, the advantage of being able to use specified color values is that color characteristics can be reflected in the cover image.
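Since the estimated palettes are RGB values normalized to [0, 1], a small conversion step is needed before they can be supplied as Hex color codes. The helper below is a minimal sketch of that conversion; the function names are hypothetical and are not part of the rich-text-to-image tooling.

```python
def rgb_to_hex(rgb):
    """Convert an RGB color with channels in [0, 1] to a hex color code."""
    # Scale to 0..255, round to the nearest integer, and clamp out-of-range values.
    channels = [min(255, max(0, round(c * 255))) for c in rgb]
    return "#{:02x}{:02x}{:02x}".format(*channels)


def palette_to_hex(palette):
    """Convert every color of a palette, preserving the palette order."""
    return [rgb_to_hex(color) for color in palette]


# Example: the 'Rock' entry of Table 1, (72, 36, 36), expressed in [0, 1].
rock_hex = rgb_to_hex((72 / 255, 36 / 255, 36 / 255))
```

Preserving the palette order matters here because the palettes extracted with data-driven color manifolds have a meaningful, consecutive ordering.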
To use this application based on the rich-text-to-image model, a user needs to prepare a text prompt (i.e., a plain text prompt) that illustrates the intended cover images, in addition to the color palette that is estimated by using our color palette estimation framework. The user then specifies the objects’ or regions’ colors using the color palette, and the resulting rich text prompt is given to the rich-text-to-image model in JSON format³. The rich-text-to-image model utilizes a general text-to-image model⁴ to generate initial images, which are then refined through coloring. The color attributes used in the rich-text prompts control the colors of target objects or regions within the initial images, thereby improving the visual fidelity of the generated results.

³ Example rich text prompts are available at https://github.com/SongweiGe/rich-text-to-image. A plain text prompt can be converted into a rich text prompt in the JSON format by using the translator available at https://rich-text-to-image.github.io/rich-text-to-json.html.
⁴ We used Stable Diffusion XL available at https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
The example results in Fig. 5 highlight the precision of color matching in cover images generated by the text-to-image model using text prompts and color palettes. A single text is displayed beneath the two generated images, as it is shared by both the plain text prompt and the rich text prompt. In the rich text prompt, the colored words indicate that those objects are specified with the corresponding colors, whereas in the plain text prompt, such color information is not provided. These examples demonstrate the potential of our color palette estimation framework, promising a wide array of applications, including music video creation and performance lighting design.
7. CONCLUSION
Our study identifies a relationship between auditory and visual expression, specifically through the colors chosen for music cover images. Our findings suggest that there is a deliberate, meaningful selection of color palettes that reflect the essence of the music they express.
Our proposed framework, which utilizes the large private dataset encompassing a variety of genres, employs deep learning techniques to estimate appropriate color palettes for cover images. This bridges the fields of music computing and graphic design, especially for designers seeking to encapsulate the auditory experience of music in visual form. Our framework can streamline the design process by automating or assisting color selection.
Moreover, the example application of our framework, as demonstrated through the generation of cover images using the color palettes, underscores its effectiveness and potential impact on graphic design for music content. It offers a novel approach to designing cover images that are both aesthetically pleasing and deeply connected to a musical piece.
8. ACKNOWLEDGMENTS
This work was supported in part by JST CREST Grant Number JPMJCR20D4 and JSPS KAKENHI Grant Number 22K18017, Japan.
9. REFERENCES
[1] S. Oramas, F. Barbieri, O. Nieto, and X. Serra, “Multimodal deep learning for music genre classification,” Transactions of the International Society for Music Information Retrieval, vol. 1, no. 1, pp. 4–22, 2018.
[2] J. Lībeks and D. Turnbull, “You can judge an artist by an album cover: Using images for music annotation,” IEEE MultiMedia, vol. 18, no. 4, pp. 30–37, 2011.
[3] C. Ware, Information visualization: perception for design. Morgan Kaufmann, 2019.
[4] P. Valdez and A. Mehrabian, “Effects of color on emotions,” Journal of Experimental Psychology: General, vol. 123, no. 4, pp. 394–409, 1994.
[5] L.-C. Ou, M. R. Luo, A. Woodcock, and A. Wright, “A study of colour emotion and colour preference. Part I: Colour emotions for single colours,” Color Research & Application, vol. 29, no. 3, pp. 232–240, 2004.
[6] U. Ansorge and S. I. Becker, “Contingent capture in cueing: the role of color search templates and cue-target color relations,” Psychological Research, vol. 78, pp. 209–221, 2014.
[7] M. Vad, “The album cover,” Journal of Popular Music Study, vol. 33, no. 3, pp. 11–15, 2021.
[8] C. Seker, “New classics: The analysis of classical music album covers’ digital age characteristics,” European Scientific Journal, pp. 163–174, 2017.
[9] G. Friconnet, “A k-means clustering and histogram-based colorimetric analysis of metal album artworks: The colour palette of metal music,” Metal Music Studies, vol. 9, no. 1, pp. 77–100, 2023.
[10] A. Dorochowicz and B. Kostek, “Relationship between album cover design and music genres,” in Proceedings of the 2019 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), 2019, pp. 93–98.
[11] J. Itten, The elements of color. John Wiley & Sons, 1970, vol. 4.
[12] T. L. Stone, S. Adams, and N. Morioka, Color design workbook: A real world guide to using color in graphic design. Rockport Pub, 2008.
[13] Y. Li and A. Sheopuri, “Creative design of color palettes for product packaging,” in Proceedings of the 2015 IEEE International Conference on Multimedia and Expo (ICME), 2015, pp. 1–6.
[14] C. H. Nguyen, T. Ritschel, and H.-P. Seidel, “Data-driven color manifolds,” ACM Transactions on Graphics (TOG), vol. 34, no. 2, pp. 1–9, 2015.
[15] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[16] G. Yariv, I. Gat, L. Wolf, Y. Adi, and I. Schwartz, “Adaptation of text-conditioned diffusion models for audio-to-image generation,” in Proceedings of the 24th Annual Conference of the International Speech Communication Association (Interspeech), 2023, pp. 5446–5450.
[17] A. W. Wells, “Music and visual color: A proposed correlation,” Leonardo, vol. 13, pp. 101–107, 1980.
[18] J.-C. Wang, Y.-H. Yang, H.-M. Wang, and S.-K. Jeng, “The acoustic emotion gaussians model for emotion-based music annotation and retrieval,” in Proceedings of the 20th ACM International Conference on Multimedia (MM), 2012, pp. 89–98.
[19] J. De Berardinis, A. Cangelosi, and E. Coutinho, “The multiple voices of musical emotions: Source separation for improving music emotion recognition models and their interpretability,” in Proceedings of the 21st Conference of the International Society for Music Information Retrieval Conference (ISMIR), 2020, pp. 310–317.
[20] M. Pesek, P. Godec, M. Poredos, G. Strle, J. Guna, E. Stojmenova, M. Pogacnik, and M. Marolt, “Introducing a dataset of emotional and color responses to music,” in Proceedings of the 15th Conference of the International Society for Music Information Retrieval Conference (ISMIR), 2014, pp. 355–360.
[21] T. Kohonen, “The self-organizing map,” Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.
[22] Y.-C. Hu and M.-G. Lee, “K-means-based color palette design scheme with the use of stable flags,” Journal of Electronic Imaging, vol. 16, no. 3, p. 033003, 2007.
[23] S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, and F. Wei, “Beats: Audio pre-training with acoustic tokenizers,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023, pp. 5178–5193.
[24] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016.
[25] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), vol. 32, 2019, pp. 8024–8035.
[26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015, pp. 1–13.
[27] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP-SD), 2020, pp. 38–45.
[28] Q. Pan and S. Westland, “Comparative evaluation of color differences between color palettes,” in Proceedings of the 26th IS&T Color and Imaging Conference (CIC), vol. 26, 2018, pp. 110–115.
[29] T. Smith and J. Guild, “The cie colorimetric standards and their use,” Transactions of the Optical Society, vol. 33, no. 3, p. 73, 1931.
[30] A. R. Robertson, “The CIE 1976 color-difference formulae,” Color Research & Application, vol. 2, no. 1, pp. 7–11, 1977.
[31] V. Bozic, A. Djelouah, Y. Zhang, R. Timofte, M. Gross, and C. Schroers, “Versatile vision foundation model for image and video colorization,” in Proceedings of the ACM SIGGRAPH 2024 Conference Papers, 2024, pp. 1–11.
[32] J. Yun, S. Lee, M. Park, and J. Choo, “iColoriT: Towards propagating local hints to the right region in interactive colorization by leveraging vision transformer,” in Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 1787–1796.
[33] S. Ge, T. Park, J.-Y. Zhu, and J.-B. Huang, “Expressive text-to-image generation with rich text,” in Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 7545–7556.
[34] S. Ge, T. Park, J.-Y. Zhu, and J.-B. Huang, “Expressive text-to-image generation and editing with rich text,” International Journal of Computer Vision (IJCV), vol. 133, no. 7, pp. 4604–4622, 2025.