A Causal Framework for Cross-Cultural Generalizability

Dominik Deffner1,2,3, Julia M. Rohrer4, and Richard McElreath1

1Department of Human Behavior, Ecology and Culture, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany; 2Science of Intelligence Excellence Cluster, Technical University Berlin, Berlin, Germany; 3Center for Adaptive Rationality, Max Planck Institute for Human Development, Berlin, Germany; and 4Department of Psychology, Leipzig University, Leipzig, Germany

General Article

Advances in Methods and Practices in Psychological Science, July-September 2022, Vol. 5, No. 3, pp. 1–18

DOI: 10.1177/25152459221106366 (https://doi.org/10.1177/25152459221106366)

Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).

Corresponding Author: Dominik Deffner, Center for Adaptive Rationality, Max Planck Institute for Human Development, Berlin, Germany

Email: [email protected]

Received 10/11/21; Revision accepted 4/21/22

Abstract

Behavioral researchers increasingly recognize the need for more diverse samples that capture the breadth of human experience. Current attempts to establish generalizability across populations focus on threats to validity, constraints on generalization, and the accumulation of large, cross-cultural data sets. But for continued progress, we also require a framework that lets us determine which inferences can be drawn and how to make informative cross-cultural comparisons. We describe a generative causal-modeling framework and outline simple graphical criteria to derive analytic strategies and implied generalizations. Using both simulated and real data, we demonstrate how to project and compare estimates across populations and further show how to formally represent measurement equivalence or inequivalence across societies. We conclude with a discussion of how a formal framework for generalizability can assist researchers in designing more informative cross-cultural studies and thus provides a more solid foundation for cumulative and generalizable behavioral research.

Keywords

cross-cultural research, generalizability, WEIRD-samples problem, causal inference, poststratification, open data, open materials

The behavioral and social sciences have been criticized for relying excessively on WEIRD samples in which most participants are Western, educated, and from industrialized, rich, and democratic countries (Apicella et al., 2020; Henrich, 2020; Henrich et al., 2010; Muthukrishna et al., 2020). Research has established substantial cross-cultural variation in key psychological domains, such as thinking styles (e.g., Masuda & Nisbett, 2001; Nisbett & Miyamoto, 2005), economic preferences (e.g., Falk et al., 2018; Gächter & Schulz, 2016), personality structure (e.g., Smaldino et al., 2019), and moral judgments (e.g., Awad et al., 2020; Curtin et al., 2020), and has furthermore demonstrated that WEIRD subjects often represent outliers among present-day societies (Apicella et al., 2020). These findings make it clear that broad, unqualified generalizations about human psychology based on WEIRD samples alone are rarely justified.

Fortunately, behavioral scientists increasingly acknowledge the problem. Cross-cultural psychologists and anthropologists are making progress in documenting variation in psychological phenomena (Apicella et al., 2020). In addition to long-term fieldwork and experimental comparisons across societies, large-scale collaborative projects have started compiling extensive data sets addressing cross-cultural variation and commonality in domains such as music (Mehr et al., 2019), social perception (Jones et al., 2021), and economic (Henrich et al., 2001) and moral decision-making (Awad et al., 2018). Accompanying the surge in cross-cultural studies, researchers increasingly consider the historical and political contexts of their work as well as its ethical ramifications (e.g., Broesch et al., 2020; Clancy & Davis, 2019; Ghai, 2021; Urassa et al., 2021).

New data bring new problems. How can valid comparisons and conclusions be derived from cross-cultural samples? Just as there are many ways to misinterpret data from a single society, there are even more ways to misinterpret differences or similarities between societies. In each case, one must first generalize from each sample to each population before valid comparisons can be made between populations. This is a generalizability problem on a global scale.

Methodologists have long discussed the importance of generalizability or “external validity” and its relationship to other kinds of validity (internal, statistical-conclusion, and construct validity; e.g., Berkowitz & Donnerstein, 1982; Calder et al., 1983; Campbell, 1957; Cook & Campbell, 1976; Winer, 1999). Researchers trained in psychology and other behavioral sciences may be familiar with catalogs of threats to validity that describe prototypical problematic situations (Matthay & Glymour, 2020). These lists can grow rapidly. For external validity alone, Shadish et al. (2002) distinguished five types of threats that include interactions of the causal relationship of interest with specific units, settings, mediators, outcomes, and treatment variations.

Because of these threats, there have been reasonable calls for constraint. Yarkoni (2022), for instance, argued that poor alignment between verbal hypotheses and quantitative inference lies at the heart of many of psychology’s problems; narrow and seemingly arbitrary operationalizations of broad constructs invalidate the intended generalizations. As a remedy against unwarranted and often implicit generalizations, researchers have proposed adding mandatory “constraints on generality” (COG) statements to all empirical articles (Simons et al., 2017; Tiokhin et al., 2019). By specifying sample characteristics and assessing the sample’s representativeness of wider populations, such COG statements are meant to discipline authors to explicitly state intended generalizations and thereby improve transparency.

These steps toward a more global and generalizable science are overdue. However, under the current framework, with its emphasis on threats to validity, constraints, and the accumulation of cross-cultural samples, only limited progress can be made. Lists of threats are devices that raise awareness of inferential problems, but they are not also solutions. They do not spell out which inferences are warranted and under which assumptions. This leads to the impression that any claim that goes beyond the precise operationalization, population, and historical context of a study overgeneralizes.

From this perspective, it is understandable that researchers are eager to collect rich data sets just to describe what is “out there” (e.g., Barrett, 2020; Rozin, 2001). But even this is not possible without an explicit framework that licenses generalization. In a cross-cultural context, even “mere description” and simple comparisons rely on usually implicit assumptions that permit moving from sample to population and across populations.

Threats and constraints forbid inference; we require a framework that also licenses inference. Such a framework would inform researchers about the assumptions underlying potential generalizations, assist them in the design of empirical studies, and show them how to construct appropriate statistical procedures. Such a framework already exists and has sparked a “causal revolution” (Pearl, 2018) in computer science and machine learning, but it is not a standard part of training in the behavioral and social sciences. This framework depends on transparent, generative models of research. One key idea is that generalizability does not depend on the presence of sample differences per se or on raw statistical associations. The conditions that license generalization and comparison with other populations depend on the causal relations between variables and the exact mechanisms by which populations differ.

Many cross-cultural scientists already pay close attention to concerns of causal inference and comparison without use of a formal framework (e.g., Norenzayan & Heine, 2005; Pollet et al., 2014). For these researchers, a formalized framework can provide a vocabulary to articulate their concerns and work toward solutions in a more systematic manner.

For instance, many researchers will share the intuition that the demographic breakdown and other relevant factors should be somehow standardized across groups to eliminate potential confounds. A standard approach to dealing with such threats to validity and cross-cultural comparison is to condition on (i.e., adjust or “control” for) any potential confounds such as age, income, or methodological differences by, for example, including such variables as predictors in multiple regression (conditioning on a variable means analyzing the values of other variables for a given, constant value of the conditioned variable). But it is not enough to mechanically control for a set of variables that may vary across populations. One reason is that not all controls are good: adding variables can bias inference as much as it can correct it (Cinelli et al., 2020). An important example is “collider bias,” in which a spurious association between two variables arises when a third variable, which is jointly caused by those variables, is included as a control. As we show below, which variables act as confounds depends on the assumed causal structure and the specific research question. A formal generative framework lets us logically deduce which variables we should, and should not, control for in any cross-cultural comparison. Going beyond the question of which variables to include, it also helps us derive the appropriate statistical estimates that actually align with the scientific goal at hand. Coefficients and parameters themselves are valid measures of difference or causal effect in only the simplest models (Morgan & Winship, 2015; Rohrer & Arslan, 2021). Knowing a cause means that we can predict the consequences of an intervention (Asteriou & Hall, 2015; Athey & Imbens, 2016; Greene, 2000; Morgan & Winship, 2015; Woodward, 2005), and most causal questions require the construction of “marginal” effects, in which we average the effect of interest over the influence of all other important variables to find out how a dependent variable would change if we intervened on the independent variable. Such “poststratification,” that is, reweighting of model estimates to answer specific causal questions, becomes even more complicated when comparisons are made between societies (Oganisian & Roy, 2021).
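The gap between a coefficient and a marginal effect can be made concrete with a small calculation. The following Python sketch is an illustration only and is not taken from the original analyses: it assumes a hypothetical logistic model with a binary treatment and a binary covariate z, and all coefficient values and population shares are invented. The coefficients are identical in both populations, yet the average effect on the probability scale differs because the populations differ in the share of z.

```python
import math


def inv_logit(v):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-v))


# Hypothetical coefficients, assumed identical in both populations.
INTERCEPT, B_TREATMENT, B_COVARIATE = -3.0, 1.0, 3.0


def average_effect(share_z1):
    """Average effect of treatment on P(y = 1), weighting z by its population share."""
    effect = 0.0
    for z, weight in ((0, 1.0 - share_z1), (1, share_z1)):
        p_treated = inv_logit(INTERCEPT + B_TREATMENT + B_COVARIATE * z)
        p_control = inv_logit(INTERCEPT + B_COVARIATE * z)
        effect += weight * (p_treated - p_control)
    return effect


# Two hypothetical populations that differ only in the share of z = 1.
effect_pop_a = average_effect(share_z1=0.2)
effect_pop_b = average_effect(share_z1=0.8)
print(round(effect_pop_a, 3), round(effect_pop_b, 3))
```

The same coefficient therefore implies different average effects in the two populations, which is why the reweighting step has to be tied to an explicit target population.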

In short, there is no universally valid procedure for cross-cultural inference. For each inferential problem, we have to start with a generative causal model that lets us determine the role variables play in the analysis and how to construct statistical summaries that are logically derived from transparent research goals.

In the rest of this article, we outline a formal framework for cross-cultural generalizability based on recent advances in the fields of causal inference and data fusion (Bareinboim & Pearl, 2016; Lundberg et al., 2021; Pearl, 2015; Pearl & Bareinboim, 2014). We apply these established formal tools to commonplace questions in cross-cultural research: (a) description of cultural variation, (b) comparison of causal effects identified through experiments, and (c) measurement equivalence or inequivalence of latent constructs. To help researchers adopt this approach, we provide example causal diagrams and statistical analyses using simulated and real-world cross-cultural data. Finally, we discuss how our framework can assist researchers in planning targeted cross-cultural comparisons and designing more informative studies.

A Causal Framework

A causal framework for cross-cultural research requires us to state (a) what we want to know, that is, the estimand; (b) a generative model of the evidence, that is, a causal model of how the observed data came into existence; (c) a generative model of how populations may differ; and (d) a tailored estimation strategy that allows us to learn from data. We first develop these requirements in general terms. In later sections, we discuss specific examples.

Theoretical and empirical estimands

The starting point for any empirical analysis is the theoretical estimand. This is the target of the analysis derived from theory (for an excellent introduction, see Lundberg et al., 2021). A theoretical estimand consists of a unit-specific quantity and a target population. It is defined outside of any statistical model, not in terms of, for example, regression coefficients. We may simply be interested in the mean of a variable in a certain population (e.g., probability that individual i chooses the prosocial option in a dictator game, averaged over all individuals i in the target population), or we may be interested in the average treatment effect of some independent variable on an outcome in a certain population (e.g., effect of norm prime on probability that individual i chooses the prosocial option, averaged over all individuals i in the target population; examples inspired by House et al., 2020; see below).

Once the theoretical estimand is set, we need to link it to an empirical estimand. While the theoretical estimand might contain unobservable quantities such as counterfactuals (“What would have been true under different circumstances?”), the empirical estimand is defined solely in terms of observed data. We cannot observe the average probability of prosocial choice for the whole population; however, we can try to estimate it from a sample. We also cannot observe individual-level causal effects, but we may estimate their average by considering observed differences between randomized experimental conditions.
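The link between the two kinds of estimand can be illustrated by simulation, because only in a simulation are both potential outcomes of each individual known. The Python sketch below is a hypothetical illustration, not part of the original analyses: the population size, the probability ranges, and the 50/50 randomization are all assumptions. It computes the average individual-level effect of a norm prime (the theoretical estimand) and compares it with the difference in observed choice rates between randomized conditions (the empirical estimand).

```python
import random

random.seed(1)

N = 20_000  # hypothetical population size

# Each individual has two potential outcomes: the probability of choosing the
# prosocial option without and with a norm prime. Only one is ever realized.
population = []
for _ in range(N):
    p_without = random.uniform(0.1, 0.6)
    p_with = p_without + random.uniform(0.0, 0.3)
    population.append((p_without, p_with))

# Theoretical estimand: the average individual-level effect. It requires both
# potential outcomes per person, so it is computable only in a simulation.
true_ate = sum(p1 - p0 for p0, p1 in population) / N

# Empirical estimand: difference in observed choice rates between randomized arms.
treated, control = [], []
for p0, p1 in population:
    if random.random() < 0.5:
        treated.append(1 if random.random() < p1 else 0)
    else:
        control.append(1 if random.random() < p0 else 0)

estimated_ate = sum(treated) / len(treated) - sum(control) / len(control)
print(round(true_ate, 3), round(estimated_ate, 3))
```

Under randomization the two numbers agree up to sampling error, even though no individual-level effect is ever observed.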

In the context of cross-cultural research, the distinction between theoretical and empirical estimands encourages researchers to explicitly spell out assumptions about how theoretical constructs (e.g., prosociality) can be operationalized in comparable ways across societies (“construct validity”). This issue of measurement equivalence or inequivalence and bias is further discussed in the “Generalizing Latent Constructs: Measurement equivalence or inequivalence” section.

Directed acyclic graphs

A valid link between theoretical and empirical estimand requires causal assumptions. Generative models embody causal assumptions, and there are many forms these models can take. One popular approach is directed acyclic graphs (DAGs). This approach is accessible thanks to its graphical nature, it can be used to develop an intuitive understanding of inferential obstacles, and it can alert researchers to inferential opportunities they had not considered. There are other suitable ways to spell out assumptions (e.g., psychological process models; Farrell & Lewandowsky, 2018), and not all generative models can be formalized with the help of DAGs. But DAGs provide a pragmatic starting point and can be extended to include commonplace issues such as measurement error and missing data (see McElreath, 2020, Chapter 15).

Multiple comprehensive yet accessible introductions to DAGs are available (Elwert, 2013; Pearl et al., 2016; Pearl & Mackenzie, 2018; Rohrer, 2018); thus, we focus only on the essentials. In DAGs, nodes represent variables, and arrows represent causal effects. For example, Figure 1a captures a set of assumptions regarding the associations between age, prosociality, reputation, and the outcome of a dictator game. The arrows indicate causal effects that may take any functional form, which includes any possible interaction between variables that jointly affect another variable. Individual paths can be identified by traveling along the arrows connecting any pair of variables. These paths can be broken down into fundamental structures (see Box 1) that determine whether a given path transmits an association between variables and whether the association is causal or noncausal.
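The path logic can be sketched in a few lines of code. The following Python example is a minimal illustration under stated assumptions, not a general-purpose tool: it encodes the arrows described for Figure 1a as an edge list, uses the hypothetical label "U" for the unobserved common cause of prosociality and reputation, enumerates every path between prosociality and dictator-game (DG) choice, and checks path by path whether an association is transmitted for a given conditioning set.

```python
# Directed edges (cause, effect) as described for Figure 1a.
# "U" is a hypothetical label for the unobserved common cause.
EDGES = [
    ("Age", "Prosociality"), ("Age", "Reputation"), ("Age", "DG Choice"),
    ("U", "Prosociality"), ("U", "Reputation"),
    ("Prosociality", "DG Choice"), ("Norm Prime", "DG Choice"),
]
EDGE_SET = set(EDGES)


def neighbors(node):
    """Nodes connected to `node` by an arrow in either direction."""
    return {b for a, b in EDGES if a == node} | {a for a, b in EDGES if b == node}


def descendants(node):
    """All nodes reachable from `node` by following arrows forward."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for a, b in EDGES:
            if a == current and b not in found:
                found.add(b)
                stack.append(b)
    return found


def all_paths(start, end, path=None):
    """Enumerate simple paths between two nodes, ignoring arrow direction."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    found = []
    for nxt in neighbors(start):
        if nxt not in path:
            found.extend(all_paths(nxt, end, path))
    return found


def is_open(path, conditioned):
    """Check whether a path transmits an association given a conditioning set."""
    conditioned = set(conditioned)
    for left, mid, right in zip(path, path[1:], path[2:]):
        is_collider = (left, mid) in EDGE_SET and (right, mid) in EDGE_SET
        if is_collider:
            # A collider blocks unless it or one of its descendants is conditioned on.
            if not (({mid} | descendants(mid)) & conditioned):
                return False
        elif mid in conditioned:
            # Conditioning on a mediator or confounder blocks the path.
            return False
    return True


paths = sorted(all_paths("Prosociality", "DG Choice"))
for p in paths:
    print(" - ".join(p), "| open:", is_open(p, []), "| open given Age:", is_open(p, ["Age"]))
```

Under these assumptions the path through Reputation stays closed because Reputation is a collider, and conditioning on Age leaves only the direct arrow from prosociality to DG choice open.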

Suppose we were interested in the causal influence of prosociality on dictator-game choice in the population from which we randomly drew our sample. If we are willing to assume that the depicted DAG is a causal DAG, which means that it includes all common causes of any pair of variables (Elwert, 2013), we can algorithmically derive which variables need to be conditioned on (see Box 1) to identify the causal effect of interest. In this particular example, the answer is easy. There is only one open noncausal path (see Box 1) between prosociality and dictator-game choice: Prosociality ← Age → Dictator-game choice. Because age is a common cause of prosociality and dictator-game choice, some of the association between both variables is due to this noncausal path. This path can be blocked by conditioning on age (again, see Box 1). Thus, we have discovered a way to link the theoretical estimand (the effect of prosociality on dictator-game choice in our population) to an actual empirical estimand that we can estimate from observable data. If, instead, our theoretical estimand was the effect of the norm prime, no conditioning would be necessary for causal identification: Because the norm prime has been randomized, no backdoor paths can exist (no arrows point into the randomized variable). The simple mean difference between experimental groups would be an empirical estimand that corresponds to the theoretical estimand under the assumptions embodied in the DAG (taking into account other variables that influence the outcome may still be helpful to improve precision).

[Figure 1. Two-panel diagram, (a) and (b), each showing the nodes Age, Prosociality, Reputation, DG Choice, and Norm Prime.]

Fig. 1. (a) A simple directed acyclic graph capturing the following assumptions: Age has a direct causal effect on an individual’s prosociality, their reputation within their community, and the outcome of the dictator game (DG). Prosociality and reputation share an unobserved common cause. Prosociality in turn affects the individual’s choice in the dictator game, which is also affected by the randomized norm prime. (b) Selection diagram using selection nodes S to represent the assumption that populations differ both in their age distribution and the effect of norm primes on the choice in the dictator game.

Box 1. Elementary Causal Structures

Any path connecting two variables can be broken down into three fundamental causal structures: chains, forks, and inverted forks (Elwert, 2013; Rohrer, 2018).

Chains: X → M → Y. The chain transmits a causal association between X and Y. If we condition on M (the mediator; e.g., through statistical adjustment, sample stratification, or by design), the transmission of the association is blocked.

Forks: X ← C → Y. The fork transmits a noncausal association between X and Y. If we condition on C (the confounder), the transmission of the association is blocked.

Inverted forks: X → L ← Y. The inverted fork transmits no association. If we condition on L (the collider), a noncausal association between X and Y is transmitted.

A path between X and Y is said to be d-separated if it contains a confounder or mediator that has been conditioned on or a collider that has not been conditioned on (Pearl, 1988). This implies that the path will not transmit any association; it is “blocked.” For a statistical procedure to recover a causal association, it must be designed to block any noncausal paths. For example, in the directed acyclic graph below, we wish to measure the causal association between X and Y. There are, however, two noncausal paths that also connect X to Y. The first is X ← C → Y. This is a confounder path, and we close it by conditioning on C. The second noncausal path is X ← A → C ← B → Y. In this path, the variable C is not a confound but rather a collider. As a result, this path would normally be closed. But conditioning on C to close the first path opens the second path. Therefore, we must also condition on A or B to close this second path. Thus, a procedure that measures the association between X and Y, stratified by C and B (or A; but using B also increases precision [Pearl et al., 2016] and may thus be preferable), would measure the causal effect of X on Y.

[Diagram: directed acyclic graph with the nodes A, B, C, X, and Y connected by the paths described above.]
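The worked example in Box 1 can be checked by simulation. The Python sketch below is illustrative only: it assumes the arrows A → X, A → C, B → C, B → Y, C → X, C → Y, and X → Y (read off the two noncausal paths named in the box), binary variables, and invented probabilities, with a true effect of X on P(Y = 1) fixed at 0.2. It then compares the association between X and Y without adjustment, stratified by C alone, and stratified by C and B.

```python
import random
from collections import defaultdict

random.seed(7)
TRUE_EFFECT = 0.2  # assumed causal effect of X on P(Y = 1)


def draw(p):
    """Bernoulli draw with success probability p."""
    return 1 if random.random() < p else 0


# Assumed arrows: A -> X, A -> C, B -> C, B -> Y, C -> X, C -> Y, X -> Y.
rows = []
for _ in range(100_000):
    a, b = draw(0.5), draw(0.5)
    c = draw(0.1 + 0.4 * a + 0.4 * b)
    x = draw(0.05 + 0.6 * a + 0.3 * c)
    y = draw(0.05 + TRUE_EFFECT * x + 0.2 * c + 0.5 * b)
    rows.append({"A": a, "B": b, "C": c, "X": x, "Y": y})


def stratified_effect(data, strata_vars):
    """Difference in mean Y between X = 1 and X = 0, averaged over strata."""
    cells = defaultdict(lambda: {0: [], 1: []})
    for r in data:
        cells[tuple(r[v] for v in strata_vars)][r["X"]].append(r["Y"])
    total, weighted = 0, 0.0
    for groups in cells.values():
        if groups[0] and groups[1]:
            n = len(groups[0]) + len(groups[1])
            diff = sum(groups[1]) / len(groups[1]) - sum(groups[0]) / len(groups[0])
            weighted += n * diff
            total += n
    return weighted / total


unadjusted = stratified_effect(rows, [])
given_c = stratified_effect(rows, ["C"])
given_c_and_b = stratified_effect(rows, ["C", "B"])
print(round(unadjusted, 3), round(given_c, 3), round(given_c_and_b, 3))
```

In this setup the unadjusted contrast overstates the effect, stratifying by C alone understates it because the collider path has been opened, and stratifying by C and B comes close to the assumed value of 0.2.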

Our DAG is, of course, incomplete and possibly wrong, in particular when it comes to the nodes that have not been experimentally manipulated. But an incomplete model is still an improvement over no model at all. In the absence of causal assumptions, whether in a DAG or otherwise, no analysis can be scientifically justified. Even an unrealistic DAG can help identify specific problems as well as implicit assumptions underlying more casually drawn causal inferences. Furthermore, such graphs make it easier to contrast the implications of different sets of assumptions that often lie at the heart of scientific disagreements. Throughout this article, we use DAGs in this spirit, as a pragmatic tool to communicate assumptions and improve inference.

Selection diagrams and generalizability

DAGs can be extended to address generalizability through the use of selection diagrams (Pearl & Bareinboim, 2014). When researchers consider multiple populations, selection diagrams allow them to precisely define the local mechanisms by which populations are assumed to differ, as represented by “selection nodes.” Selection nodes are not variables but, rather, indicate which nodes have culture-specific distributions or causal relationships.

Returning to our previous example, in Figure 1b, we added two selection nodes. The S → Age node may indicate that populations are characterized by different age distributions, and the other S node may indicate that the populations differ in the weight individuals give to norm primes when making decisions in the dictator game (recall that in a DAG, all variables that jointly affect another variable may interact). The absence of selection nodes in such graphs is of equal importance. It represents the assumption that certain mechanisms are the same across populations. For instance, the diagram in Figure 1b implies that the development of prosociality with age does not vary among study populations. As shown below, it is this assumed invariance of certain mechanisms that makes generalizations possible.

Once we have a causal selection diagram, we can determine the scope for generalizability using logical rules. We can deduce when and how we can use data from one population to estimate a target quantity in another population, which is the central goal of the literature on transportability and data fusion (Bareinboim & Pearl, 2016; Cinelli & Pearl, 2021; Pearl, 2015; Pearl & Bareinboim, 2014). These logical rules can be compressed in most contexts to a set of simple graphical criteria, allowing us to perform the logic with our eyes (see “Applying the Causal Framework” section).

In cross-cultural settings, the research question often does not directly concern transport. Rather than transporting an estimate from one population to another, we have data sampled from multiple populations and want to make sense of the resulting numbers to learn more about whether, how, and why people differ from one another. However, such cross-cultural comparisons are still indirect exercises in transport because to compare distributions or causal effects in different populations, we must calculate what those distributions or effects would be if we changed the population.
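As a simple illustration of what such a calculation looks like, consider the special case in which populations are assumed to differ only in their age distribution, that is, a single selection node points into Age and the mechanism generating the outcome Y is shared. The notation below is an assumption made for this sketch (starred quantities refer to the target population, unstarred quantities to the study population) and is not taken from the original article.

```latex
% Expected outcome in the target population, obtained by combining
% age-specific expectations from the study population with the
% age distribution of the target population.
\[
  E^{*}[Y] \;=\; \sum_{a} E[\,Y \mid \mathrm{Age} = a\,]\; P^{*}(\mathrm{Age} = a)
\]
```

The age-specific expectations are learned where data exist, and only the weights are swapped for those of the target population. If a selection node also pointed into Y, the first factor would no longer be shared and this simple reweighting would not be licensed.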

Estimation: multilevel regression with poststratification

After establishing the logic of a generalization, one must actually compute it. For explicit generalization from sample to population and comparison across populations, we used multilevel regression with poststratification, a statistical technique that adjusts for differences between a sample population and a target population (Gao et al., 2021; Gelman & Little, 1997; Wang et al., 2015). In a first step, the model uses partial pooling to obtain robust estimates for each “cell” (combination of attributes that we want to condition on; e.g., age/gender groups) taking into account information gained from other cells (Gelman & Hill, 2006; McElreath, 2020). For the data examples below, we used Gaussian processes to obtain estimates for each gender and age group while treating age as a continuous dimension; similar ages were expected to be similar in terms of their prosocial tendencies. In the second step (the poststratification), estimates for all cells are reweighted using the relative frequencies of individuals per cell in the target population (for detailed explanation and model equations, see Appendix A in the Supplemental Material available online; for Stan [Carpenter et al., 2017] code used to implement all analyses, see the GitHub repository: https://github.com/DominikDeffner/Cross-Cultural-Generalizability).
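The two steps can be sketched with plain arithmetic. The Python example below is a deliberately crude stand-in and not the authors' Stan implementation: instead of a multilevel model with Gaussian processes, it shrinks each cell proportion toward the grand mean using an assumed pseudo-sample size, and it reports point estimates only, so no uncertainty is propagated. All counts and target-population shares are hypothetical.

```python
# Hypothetical sample: (gender, age group) -> (number of prosocial choices, sample size).
sample = {
    ("f", "young"): (12, 40), ("f", "old"): (3, 4),
    ("m", "young"): (9, 45), ("m", "old"): (6, 11),
}
# Hypothetical share of each cell in the target population (sums to 1).
target_shares = {
    ("f", "young"): 0.20, ("f", "old"): 0.30,
    ("m", "young"): 0.20, ("m", "old"): 0.30,
}

PRIOR_STRENGTH = 10  # assumed pseudo-sample size pulling cells toward the grand mean

total_successes = sum(k for k, _ in sample.values())
total_n = sum(n for _, n in sample.values())
grand_mean = total_successes / total_n


def pooled_estimate(successes, n):
    """Shrink a cell proportion toward the grand mean; small cells shrink more."""
    return (successes + PRIOR_STRENGTH * grand_mean) / (n + PRIOR_STRENGTH)


# Step 1: partially pooled estimate per cell.
cell_estimates = {cell: pooled_estimate(k, n) for cell, (k, n) in sample.items()}

# Step 2: poststratification, i.e., reweight cells by their target-population share.
poststratified = sum(target_shares[cell] * est for cell, est in cell_estimates.items())

raw_sample_mean = grand_mean
print(round(raw_sample_mean, 3), round(poststratified, 3))
```

The cell with only four observations is pulled strongly toward the grand mean, and the poststratified estimate differs from the raw sample mean because older individuals are underrepresented in this hypothetical sample relative to the target population.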

Multilevel regression with poststratification enables us to learn from data and to project or “generalize” results to populations beyond the study sample in a principled way. Which population to use for poststratification depends on the theoretical estimand, the target of inference, and causal assumptions about the data-generating process. Compared with more informal reweighting procedures, multilevel regression with poststratification propagates uncertainty through all steps of analysis and is thus particularly suited for the small samples common in cross-cultural research.

Note that although the use of multilevel regression is not logically required (there are other estimation approaches), the use of poststratification is. The DAGs we describe below mandate poststratification as a logical consequence of their structure. Informal reweighting is only sometimes equivalent to this approach. In every case, the proper way to reweight estimates is a consequence of causal assumptions.

Applying the Causal Framework

To illustrate our approach, we used a large-scale cross-cultural project on societal diversity in prosocial behavior as an empirical case study (House et al., 2020). The researchers administered a binary-choice version of the dictator game as a measure of costly sharing to 255 adults and 833 children from eight populations spanning foragers, small-scale horticulturalists, and urban communities (for demographic composition of samples, see Appendix D, Fig. S3, in the Supplemental Material). Participants were asked to choose between a “self-maximizing” option in which they would keep two rewards and a “prosocial” option in which they would keep one reward and give one to an anonymous peer. Children from six societies were divided into three experimental conditions in which they viewed a short video with normative information before making their choices. These norm primes communicated which behavior was preferable (“Generous,” “Both OK,” or “Selfish”). We used this rich data set because it exemplifies the state of the art in experimental cross-cultural research and excels with respect to research transparency.

Generalizing description: cross-cultural comparisons and demographic standardization

A basic aim of cross-cultural research is to describe cultural variation. In the simplest case, we might want to compare the prevalence of some institution or behavior across societies. This seemingly innocuous task of “pure description” may actually refer to a number of different research questions that call for different procedures. The example we provide is simplified and focuses on demography, but the point is not about demography. The same logic applies to all comparisons in which populations differ in any known background factors.

Drawing out the causal assumptions. Samples from different sites often differ in terms of their demographic profiles (here, their age and gender distribution), and these demographic variables might in turn affect the distribution of the trait of interest.

How should researchers deal with these differences? The answer depends on the processes that generated the observed disparities. Demographic disparities among samples may result from (a) differences in the actual populations from which the samples are taken or (b) sampling procedures that differ among sites. For example, if we observe that a sample from one site is on average younger than a sample from a second site, this may be because the underlying population is indeed younger. Alternatively, the difference could also result from a comparison of a relatively young convenience sample collected at one site with a full community sample at another site.

These scenarios are depicted in Figure 2. In this figure, an observed outcome Y is influenced by both unobserved cultural factors C and sample composition D. The sample composition is in turn influenced by the true demography P and sampling procedures E (for “experimenter”).

Fig. 2. Different sources of demographic disparities among study samples. Prosociality Y is caused by demography D and unobserved cultural factors C. The sample demography D is caused by population demography P and sampling procedures E. Selection nodes S indicate mechanisms by which populations differ. In addition to latent cultural factors, societies can differ in terms of (a) population demography or (b) sampling procedures.

If disparities arise from population differences (Fig. 2a), we can directly compare samples as long as our goal is to simply describe population differences in the focal trait Y regardless of whether they arise from demography or from cultural factors. Adjustment is necessary, however, if we are interested in different comparisons. For example, we may be interested in the counterfactual (i.e., hypothetical) distribution of the trait under comparable demographic profiles: If the two sites had comparable age and gender distributions, would we still observe differences in the trait of interest? This way, researchers could, for instance, isolate the influence of different cultural factors C while holding constant demographic distributions. Note that such counterfactual comparisons might also correspond to a more substantive theoretical estimand (i.e., the distribution of a trait under a hypothetical intervention that moved individuals to another population; Lundberg et al., 2021).

If disparities among sites arise from different sampling

procedures E (Fig. 2b), even the purely descriptive ques-

tion of observable population differences requires demo-

graphic adjustment because sample demographics are

systematically biased compared with the population of

interest. For example, if the gender of the researcher

influenced the gender of voluntary participants, then

any differences between societies could be due to a mix

of cultural, demographic, and sampling differences. In

this case, even large samples do not accurately describe

the target populations, and we need to poststratify using

information about the population from which samples

are taken.
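As a hedged illustration, the poststratification step described above can be written as a weighted average of cell-level estimates. The notation is an illustrative assumption, not the notation of the Supplemental Material: j indexes demographic cells (e.g., age-by-gender categories), \hat{\theta}_j is the model-based estimate of the outcome in cell j, and N_j is the number of individuals in cell j in the target population.

```latex
\hat{\theta}^{\mathrm{PS}} = \frac{\sum_{j=1}^{J} N_j \, \hat{\theta}_j}{\sum_{j=1}^{J} N_j}
```

Taking N_j from the population from which the sample was drawn yields a descriptive estimate for that population; taking N_j from a different population yields the counterfactual comparison under that population's demographic profile.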

Another scenario, not illustrated in Figure 2, is when

a sample is selected on the outcome variable Y itself. For

example, if prosocial individuals are more likely to cooperate with the researcher, this is selection on the outcome. In this case, there may be no way to generalize

from sample to population and therefore no way to com-

pare populations. This is perhaps the starkest example

of how description depends on causal assumptions.
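To see why selection on the outcome is so damaging, consider the following minimal sketch. The function name and all participation probabilities are hypothetical; the function computes the expected share of prosocial individuals among participants when willingness to participate depends on prosociality itself.

```python
def observed_share(true_share, p_join_prosocial, p_join_other):
    """Expected share of prosocial individuals among study participants."""
    joined_prosocial = true_share * p_join_prosocial
    joined_other = (1.0 - true_share) * p_join_other
    return joined_prosocial / (joined_prosocial + joined_other)

# Hypothetical case 1: half the population is prosocial, and prosocial
# individuals are twice as likely to take part in the study.
selected = observed_share(0.5, p_join_prosocial=0.8, p_join_other=0.4)

# Hypothetical case 2: two thirds of the population are prosocial, and
# participation is unrelated to prosociality.
unselected = observed_share(2 / 3, p_join_prosocial=0.5, p_join_other=0.5)

print(round(selected, 3), round(unselected, 3))
```

Both cases produce the same observed share of about two thirds, although the true population shares differ (one half vs. two thirds). Without outside knowledge of the participation process, the sample alone cannot distinguish them.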

We turn to real empirical data in the next subsection.

However, knowing how to simulate data to validate an

analytical strategy is also useful. For a walk-through on

a complete simulated data example in which we know

the true generative process, see Appendix B in the Sup-

plemental Material. We used multilevel regression with

poststratification for the situation in which populations

differ in their demographic profile (see Fig. S1, left, in

the Supplemental Material) and the complementary situ-

ation in which demographic profiles of the populations

are identical but genders are sampled unequally because

of differences in local sampling procedures (see Fig. S1,

right, in the Supplemental Material). In the first case,

unadjusted empirical estimates accurately recover true

population values, but poststratification can be used for

counterfactual comparisons. In the second case, only

poststratified estimates accurately recover true popula-

tion values.
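The contrast between the two cases can be sketched with a few lines of Python. This is a simplified illustration, not the analysis in Appendix B: raw cell rates stand in for multilevel model estimates, and all counts and population shares are hypothetical.

```python
# Hypothetical sample: cell -> (number of participants, number choosing prosocially)
sample = {"women": (70, 49), "men": (30, 9)}

def unadjusted(sample):
    """Raw proportion of prosocial choices in the sample."""
    n = sum(size for size, _ in sample.values())
    k = sum(hits for _, hits in sample.values())
    return k / n

def poststratified(sample, population_shares):
    """Weight each cell's observed rate by that cell's share in the target population."""
    return sum(
        population_shares[cell] * hits / size
        for cell, (size, hits) in sample.items()
    )

# Case 1: the population itself is 70% women, so the sample is representative.
population_case_1 = {"women": 0.7, "men": 0.3}
# Case 2: the population is balanced, but sampling over-represented women.
population_case_2 = {"women": 0.5, "men": 0.5}

print(unadjusted(sample))
print(poststratified(sample, population_case_1))
print(poststratified(sample, population_case_2))
```

In the first case the unadjusted and poststratified values agree (about 0.58). In the second case only the poststratified value (about 0.50) matches the balanced population, while the unadjusted value remains at 0.58.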

Empirical example. We now turn to our empirical case

study on prosociality across societies. Figure 3 shows a

comparison between two actual populations included in

House et al. (2020): Tanna Island in Vanuatu (left) and

Berlin in Germany (right). These societies have very differ-

ent demographic profiles and sample compositions. Here,

we were immediately confronted with a pragmatic con-

cern: For many populations, no fine-grained demographic

information is available. Therefore, we had to use the

demography of all of Vanuatu instead of only Tanna. This

highlights how collecting basic descriptive information

about study populations is a crucial first step for any cross-

cultural inference.

We divided data into 20 age categories spanning 5

years each and used Gaussian process multilevel regres-

sion with poststratification (for gender- and age-specific

model estimates, see Appendix D, Fig. S4, in the Supple-

mental Material). For Tanna, poststratification to either

the demographic profile of Vanuatu or of Berlin leaves

estimates unchanged (Fig. 3, bottom). This is because

there was only a weak effect of age in this sample and

the gender distribution was balanced. For the Berlin

sample, on the other hand, adjusting for the demo-

graphic population profile of Berlin substantially

increased the expected amount of prosociality. This is

because older individuals in Berlin tended to be more

prosocial in their choices and House et al. (2020) focused

their data collection on children, which resulted in a

much younger sample compared with the underlying

population. Drawing the counterfactual comparison for

Berlin individuals under the demographic profile of

Vanuatu slightly increased the estimate.

How does this compare with the standard approach

in which researchers report raw age- and gender-specific

estimates for each sample, thereby “controlling” for any

differences? The parameter estimates are necessary, but

they are not enough. First, the claim that conditioning

on age and gender controls for sample differences

depends on causal assumptions, as we explained in the

previous sections. Second, the distribution of population

differences depends not only on the parameters but also

on the distributions of age and gender in each target

population. A difference in parameters can look large but have little impact on population differences for two reasons: the relevant age-gender categories may be too rare to make a large difference, and sizable differences on the parameter (e.g., logit) scale may result in minor differences on the outcome (e.g., probability) scale. Only by

poststratifying to the outcome scale and to the relevant

target population can behavioral differences be com-

pared (Oganisian & Roy, 2021; Rohrer & Arslan, 2021).
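The point about scales can be checked with a short calculation. The baseline values and the effect size below are hypothetical and chosen only to illustrate how the same logit-scale difference maps onto different probability-scale differences.

```python
import math

def inv_logit(x):
    """Map a value from the logit scale to the probability scale."""
    return 1.0 / (1.0 + math.exp(-x))

effect = 1.0  # hypothetical difference on the logit scale

# Same logit difference, evaluated at two hypothetical baselines
diff_mid = inv_logit(0.0 + effect) - inv_logit(0.0)    # baseline probability 0.5
diff_high = inv_logit(4.0 + effect) - inv_logit(4.0)   # baseline probability about 0.98

print(round(diff_mid, 3), round(diff_high, 3))
```

The same one-unit logit difference corresponds to roughly 23 percentage points near a probability of 0.5 but only about 1 percentage point near 0.98, which is why comparisons are best made on the outcome scale.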

Although these examples have been simplified, they

highlight the general concern. To accurately describe

the prevalence of a trait and compare it across societies,

we need to carefully define our theoretical estimand—

consisting of unit-specific quantity and target population—

and make assumptions about the processes that generate

observed disparities in demography or any other poten-

tially significant variable. After a target population is set,

refined statistical procedures, such as multilevel regression

with poststratification, allow us to generalize observed

outcomes to other populations conditional on causal

assumptions.


Generalizing experimental results: transportability of causal effects

Many hypotheses in cross-cultural research concern not

only the prevalence of a certain trait across societies

but also the causal effect of an independent variable

(“exposure,” “treatment”) on a dependent variable (“out-

come”). In our example, we were interested in the

causal effect of experimental norm primes on prosocial

choices in the dictator game (House etal., 2020). Using

10 864200246810

1−5

11−15

21−25

31−35

41−45

51−55

61−65

71−75

81−85

91−95

Tanna (Vanuatu)

Population

Age Class

10 864200246

Berlin (Germany)

Share of Population per Age Class and Gender [%]

Male

Female

30 20 10 0 10 20 30

1−5

11−15

21−25

31−35

41−45

51−55

61−65

71−75

81−85

91−95

Sample

Age Class

30 20 10 0 10 20 30

Number of Individuals per Age Class and Gender

Density

15 Empirical Estimate

Poststratified to Sample Population

Poststratified to Other Population

0.0 0.2 0.4 0.6 0.8 1.00.0 0.2 0.4 0.6 0.8 1.0

Probability of Choosing Prosocial Option

Fig. 3. Data example for demographic standardization comparing prosociality among two populations, (left)

Tanna, Vanuatu, and (right) Berlin, Germany. The top row shows demographic profiles of Vanuatu (UN Depart-

ment of Economic and Social Affairs, World population prospects 2019) and Berlin (Mikrozensus 2020, Amt für

Statistik Berlin-Brandenburg); the middle row shows demographic characteristics of study participants from both

sites in House et al. (2020); the bottom row shows posterior distributions for probability to choose prosocial option

from multilevel regression with poststratification analyses. Blue curves show empirical (unadjusted) estimates,

yellow curves are poststratified to be representative of the population from which the sample was drawn, and

gray curves are poststratified to demographic profile of other population.

Advances in Methods and Practices in Psychological Science 5(3) 9

the transportability framework from causal inference

(Pearl, 2015; Pearl & Bareinboim, 2014), we show how

causal thinking can be leveraged to generalize and com-

pare causal effects across populations (for formal

“S-admissibility” criterion, see Appendix C in the Sup-

plemental Material).

Figure 4 shows selection diagrams for different scenarios

varying in terms of scope and procedures for generaliza-

tions. They encode different sets of assumptions about the

local mechanisms that cause populations to differ. We con-

sider a situation in which normative social information X

and age A jointly cause choice in the dictator game Y. Note that

these DAGs represent the “pretreatment” situation, which

means X has not yet been experimentally set to a particular

value. After X is manipulated through norm-prime videos,

all arrows entering X (i.e., all “backdoor” paths) are deleted

because the experimentalist is now the sole cause of X.

This allows us to estimate the causal effect from observed

group differences. An experiment is necessary because we

assume unobserved confounds—represented by dashed

arrows—that influence both normative social information

and prosocial choices (e.g., societies that strongly empha-

size prosociality may be structured such that normative

information is salient but also encourage prosociality

through other means).

Differences in independent/treatment variable. In

Figure 4a, populations differ in the distribution of norma-

tive social information X. This could mean, for instance,

that in some societies, individuals frequently encounter

cultural narratives highlighting the importance of proso-

ciality in their everyday life. As we have just shown, treat-

ment randomization used in the experimental study

cancels out such differences. As a consequence, the causal

effect X → Y is directly transportable or generalizable to

other populations. In general, all selection nodes pointing

into the independent variable (or other arrows that are

removed in the X-manipulated graph) can be ignored

(Pearl & Bareinboim, 2014).

Differences in effect modifiers. The scenario depicted

in Figure 4b is more interesting. Here, we assume popula-

tion differences in age. Because age modifies (or “moder-

ates”) the effect of normative information on choices (i.e.,

the effect of norm primes is assumed to be different for

different ages) and the age distribution varies across pop-

ulations, we cannot simply generalize the observed causal

effect from one population to another. However, if age

is assumed to affect the influence of norm primes in the

same way across populations, we can estimate the age-specific effect of X on Y from experimental data and generalize by adjusting for the age distribution of the target population.

Fig. 4. Scenarios for transportability of causal effects across populations. (a) Normative social information X, which is assumed to differ among populations, causes choice in dictator game Y; age A modifies effect of X on Y but is invariant across populations; there are also unmeasured confounds between X and Y (indicated by dashed line). (b) Effect-modifier A varies among populations. (c) Age itself is unobserved, but we get to measure reported age R as a proxy. It is assumed that the way people report their ages varies across societies but the underlying age distribution is the same. (d) Ages are reported in the same way across populations, but there are population differences in age distribution. (e) Response in mediator variable, norm activation N, varies across societies. (f) Populations vary in response of outcome variable Y to treatment X. Note that scenarios c, d, and e are described in detail in Appendix C in the Supplemental Material available online.
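Under the assumptions of Figure 4b (age-specific effects are shared across populations and only the age distribution differs), this adjustment takes the form of the standard transport formula (Pearl & Bareinboim, 2014). In the sketch below, P denotes distributions in the study population and P^{*} distributions in the target population; the symbol names are chosen for illustration.

```latex
P^{*}(y \mid \mathrm{do}(x)) = \sum_{a} P(y \mid \mathrm{do}(x), a) \, P^{*}(a)
```

The age-specific causal effects P(y | do(x), a) come from the experiment, whereas the weights P^{*}(a) require only observational information about the age distribution of the target population.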

The transport approach not only allows principled

claims about the generalization of causal effects to new

populations, it can also be employed to compare esti-

mates from multiple populations from which experimen-

tal data are available. To determine whether observed

group differences in causal effects reflect “real” cultural

differences (i.e., differences we cannot, yet, explain

through other variables) or are due to sampling variation

or differences in known effect modifiers, researchers

need to make explicit assumptions about the causal pro-

cesses that generate the data.

Figure 5 shows a data example for the transport of

age-specific causal effects across populations (House

etal., 2020). Because of experimental manipulation, the

causal effect of norm primes X on prosocial choices Y,

our estimand, can be estimated from the difference in

the probability of choosing the prosocial option between the two

experimental conditions (“Generous” vs. “Selfish”). Dark

colors show empirical estimates of this causal effect from

six different societies included in House et al. (2020).

Across all societies, posterior densities lay well above 0.

This means that individuals who watched the “Generous”

prime video were substantially more likely to choose

the prosocial option in the dictator game compared with

individuals who watched the “Selfish” prime video; the

strongest effect was observed in the sample from Phoe-

nix, Arizona, United States.

To adjust for differences in the age distribution as a

potential effect modifier, we estimated age-specific causal

effects in each society and, as an example, adjusted esti-

mates to the demographic profile of the Wichí in Argentina.

Transparent colors in Figure 5 show such counterfactual

estimates for the effect of norm primes in each society

assuming it had the same demographic composition as the

sample from the Wichí. Although estimates remained

largely unchanged for most societies, the effect for Phoenix

became substantially smaller and more uncertain. This is

because in Phoenix, age strongly modifies the effect of

norm primes: Younger children were more influenced by

norm primes than older children. The Phoenix sample is, on average, almost 3 years younger than the Wichí sample, so estimates of the causal effect need to be adjusted to apply correctly to the Wichí demographic situation. On the basis of a comparison of naive empirical estimates, researchers might have wrongly concluded that norm primes have a particularly strong effect in Phoenix for some age-invariant cultural reason; transported estimates instead suggest that the larger effect is attributable to (potentially culturally determined) effect modification in combination with the younger sample. Adjustment for potential effect modifiers such as age, therefore, allows researchers to compare causal effects on an equal footing.

[Figure 5. “Transport” of Causal Effects Across Populations. Panels: Berlin (GER), La Plata (ARG), Phoenix (USA), Pune (IND), Shuar (ECU), Wichí (ARG). Horizontal axis: Effect of Norm Prime on Prosocial Choices. Legend: Empirical Estimates, Transported to the Wichí.]

Fig. 5. Data example for transport of causal effects across societies. Empirical estimates (dark colors) and estimates transported to the Wichí in Argentina (transparent colors) for causal effect of norm primes (“Generous” vs. “Selfish”) on prosocial choices in the dictator game in six different societies included in House et al. (2020). Estimates are calculated as the age-specific differences in the probability of choosing the prosocial option between the two conditions, averaged over the age distribution in the target population.
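The logic of this adjustment can be sketched in a few lines. All numbers below are hypothetical and do not come from House et al. (2020); the age groups, effect sizes, and age distributions are chosen only to mimic a younger study sample in which the effect declines with age.

```python
# Hypothetical age-specific effects of the norm prime: difference in the
# probability of a prosocial choice ("Generous" minus "Selfish")
effect_by_age = {"4-6": 0.40, "7-9": 0.25, "10-12": 0.10}

# Hypothetical age distributions (shares sum to 1)
study_ages = {"4-6": 0.6, "7-9": 0.3, "10-12": 0.1}   # younger study sample
target_ages = {"4-6": 0.2, "7-9": 0.3, "10-12": 0.5}  # older target population

def average_effect(effect_by_age, age_shares):
    """Average age-specific effects over a given age distribution."""
    return sum(effect_by_age[age] * age_shares[age] for age in effect_by_age)

empirical = average_effect(effect_by_age, study_ages)
transported = average_effect(effect_by_age, target_ages)

print(round(empirical, 3), round(transported, 3))
```

With these illustrative values, the average effect shrinks from about 0.33 in the younger study sample to about 0.21 once it is averaged over the older target age distribution, even though the age-specific effects themselves are unchanged.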

To aid understanding, most examples have been rela-

tively straightforward, so some researchers might won-

der what they gain from this causal approach compared

with more informal ways to standardize and compare

estimates across groups. Building up from those funda-

mental units, in Appendix C in the Supplemental Mate-

rial, we describe more complicated situations in which

implied generalizations and transport formulas could

hardly be obtained by intuition alone.

In particular, Appendix C in the Supplemental Mate-

rial introduces scenarios in which we did not observe

the true effect modifier, biological age, but only some

proxy, such as reported age R, which is observed to vary

across populations (Figs. 4c and 4d). Because different

scenarios will generate identical data distributions, the

correct procedure will depend solely on causal assump-

tions. In Appendix C in the Supplemental Material, we

further discuss situations in which a mechanism mediat-

ing the effect of X on Y differs among societies (Fig. 4e),

which requires a more sophisticated—yet algorithmically

derivable—generalization formula.

“Impossibility” of generalizations. Finally, if a selec-

tion node is pointing directly into outcome variable Y (Fig.

4f), no generalizations are possible because there is no

immediate way to account for the source of disparity

among populations (for “S-admissibility” criterion, see

Appendix C in the Supplemental Material). This would be

the case if unobserved population differences directly

modify the effect (e.g., Oyserman & Lee, 2008, found that

individualism-collectivism primes do not function in com-

parable ways across societies) or if the form of age modifi-

cation varies between sites. However, even such “impossible”

cases might allow generalizations and comparisons if

researchers make additional assumptions, for example, if

we have additional knowledge about the mechanisms caus-

ing the outcome variable and if only some of these differ

among populations (for an example analyzing effects of

Vitamin A supplementation on childhood mortality, see

Cinelli & Pearl, 2021).

These examples demonstrate that the generalizability

of experimental effects does not depend on the presence

of population differences per se but on the exact

mechanisms by which populations differ. While some

differences—especially those concerning the indepen-

dent variable—are inconsequential for intended gener-

alizations, differences concerning effect modifiers or

mediators require statistical adjustment. Differences in

the immediate mechanisms causing the outcome render

generalizations difficult or even impossible. Such “real”

cross-cultural differences may be the result of society-

level factors directly influencing the trait of interest, and

they present irreducible obstacles to generalization.

Whether “real” cultural differences exist or whether they

must eventually be explained away by other mechanisms

is a topic beyond the scope of this article.

Generalizing latent constructs: measurement equivalence or inequivalence

In all examples so far, we assumed that researchers can

readily observe and measure the variables of interest.

However, many (cross-cultural) psychologists are par-

ticularly interested in the comparison of latent con-

structs that are not directly observable. For example,

researchers typically do not want to learn about dictator-

game choices per se but about the underlying psycho-

logical constructs (e.g., “prosociality”) that are assumed

to generate the observed choices (for potential impacts

of cultural context on economic game choices, see e.g.,

Bond etal., 1982; Lesorogol, 2007; Leung & Bond, 1984;

Pisor etal., 2020). In this section, we briefly demon-

strate how causal selection diagrams can be used to

represent common issues of measurement equivalence

or inequivalence in cross-cultural studies. Note that this

is just a sketch; doing justice to this issue would require

a whole article.

Methodologists have long discussed whether and how

data generated in cross-cultural research can be inter-

preted in terms of the presumed underlying processes

and constructs. The “equivalence and bias” framework,

for instance, differentiates between construct equiva-

lence, metric equivalence, and scalar equivalence (e.g.,

Van de Vijver & Leung, 2021; Van de Vijver & Tanzer,

2004). Direct comparisons of measurements across soci-

eties are justified only if the underlying construct, mea-

surement units, and scale origin are equivalent across

societies (i.e., full scalar equivalence).

But we can also approach the problem from a genera-

tive perspective. The measurement process can naturally

be represented as a causal model of observed item or

test scores (Bandalos, 2018; Borsboom etal., 2004). Note

that there are alternative models in which constructs are

not seen as common causes of manifest variables but as

network structures (Borsboom etal., 2021) or organizing

principles (Sijtsma, 2006) that connect such variables;

however, the implications of such models for generaliz-

ability are beyond the scope of this article.


For Figure 6a, we assume that an individual’s choice

in the dictator game Y is caused by a latent psychologi-

cal factor P (for “Prosociality”) and unobserved sources

of random “error” E. Measurement equivalence in this

framework then requires that (a) only the distribution

of the latent factor P might vary across communities, (b)

P influences Y in the same way everywhere, and (c)

there are also no population differences in the unob-

served error sources E. These conditions are fulfilled in

Figure 6a, so in this case, we would be justified to com-

pare game choices as indicators of latent “prosociality”

across communities.

In Figure 6b, choices in the dictator game reflect not only prosociality and random error but also the degree

of market integration M. People who engage more in mar-

ket activities might be more likely to give a reward to an

anonymous peer simply because they are more used to

interacting and trading with unknown others, not because

they are more prosocial. In this case, choices in the dicta-

tor game are not equivalent measures of the latent factor

in different societies because they also include the influ-

ence of market integration that varies across societies.

Nonetheless, following the logic on generalizing descrip-

tion (see “Generalizing Description: Cross-Cultural Com-

parisons and Demographic Standardization” section), if

we have data on market integration for each society, we

can use poststratification to adjust for different levels of

this variable and arrive at valid comparisons of prosociality

and its causes across societies.

Finally, Figure 6c shows a scenario in which the selec-

tion node directly points into dictator-game choice Y.

This comprises situations of construct inequivalence in

which the latent construct itself is not comparable with

respect to its influence on manifest behavior but also

cases in which the influence of market integration or of

unobserved error sources differs among societies. Mir-

roring the impossibility of transport with selection nodes

pointing directly into outcome Y (see Fig. 4f), in any

such case, generalizations and comparisons about latent

factors are unwarranted (unless additional assumptions

are made). Because there is no way to statistically

account for different sources of variation of observed

choices Y, we cannot identify the unique influence of

the latent state P in equivalent ways across

communities.

Using the Causal Framework for Principled Study Design

A causal framework is not only useful for analysis but

also aids research design. To connect research designs

to selection diagrams, we considered three stereotyped

cases: a “maximally diverse” sampling strategy, a “proxy

control” approach using phylogenetic distance or shared

history, and a “regional comparative” approach that

explicitly designs for local causal identification of the

mechanisms by which populations differ. We explain

each in turn.

A common approach in cross-cultural study design is

to aim for maximally diverse populations. If effects can

reliably be found across diverse societies, the reasoning

goes, researchers are justified in assuming cross-cultural

invariance or even universality; differences among sam-

ples are interpreted as evidence for either the influence

of observed or unobserved cultural factors or method-

ological differences. By comparing geographically and

culturally distant societies, this approach addresses

“Galton’s problem,” which describes the pitfalls of draw-

ing inferences from cross-cultural data that are autocor-

related because of shared cultural and historical roots

(Naroll, 1965). This rationale guided the construction of

the widely used “Standard Cross-Cultural Sample”

(Murdock & White, 1969). Figure 7a encodes a scenario

in which researchers lack substantive theory on the fac-

tors causing a trait Y that varies cross-culturally. Thus,

only a selection node is pointing into Y. Because there is no way to separate sources of population differences from the trait itself, there are no theoretical grounds to predict how the trait might vary across populations. In such exploratory scenarios, it is advisable to sample many culturally distinct societies to approach a representative sample of the full range of variation (for description of cultural variation, see “Generalizing Description: Cross-Cultural Comparisons and Demographic Standardization” section). In general, when there is a selection node pointing directly into the outcome variable, researchers must incorporate relatively diverse populations because there are relevant but unknown variables causing population differences. However, potential dimensions of variation across settings, individuals, and societies are effectively infinite. They can never be sampled exhaustively, which reflects the classic problem of induction (Hume, 1739/2003; Sloman & Lagnado, 2005). In addition, although this approach reduces the chance that cross-cultural similarity is due to recent shared influences, it is not a general solution to causal inference because any similarity between distant societies could still be due to unobserved variables.

Fig. 6. Causal representation of measurement equivalence or inequivalence across societies. (a) Choice in dictator game Y is caused by latent psychological factor Prosociality P, which varies across populations, and unobserved sources of random error E. (b) Choice in dictator game Y is also influenced by (population-specific) degree of market integration M. (c) The influence of prosociality P, error E, or market integration M on observed choices differs among societies.

A generalizable understanding of a given phenome-

non, therefore, cannot be based only on the accumula-

tion of data but requires the theory-driven testing of

causal assumptions. How even the most rudimentary

causal theory helps increase generalizability can be seen

in Figure 7b, in which researchers have identified an

explanatory variable X. If researchers can find an identification strategy to estimate the causal effect X → Y, they can leverage this causal knowledge to enhance

generalizability following the transport approach out-

lined in “Generalizing Experimental Results: Transport-

ability of Causal Effects” section. The problem is that

unobserved cultural variables C, which differ between

populations, influence both X and Y and thus confound

the causal effect. One approach is to try to model the

covariation among populations that arises from such

unobserved confounds. Variables such as geographic,

linguistic, or cultural distance P can be used as proxies

to control for unmeasured common causes of similarity.

The notion is that populations closer in space or cultural

history share more unmeasured common causes. This

can permit causal investigation of, for example, ecologi-

cal and demographic factors in otherwise opportunistic

collections of societies. Various cultural and linguistic

phylogenetic methods try to implement this strategy (for

detailed examples, see McElreath, 2020, Section 14.5).

This approach makes strong causal assumptions about

the nature of confounding and our ability to measure

shared history. However, strong assumptions are always

necessary in observational settings. What is important is

that the assumptions are transparent and logically con-

nected to data analysis.
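One way to express the idea that closer populations share more unmeasured causes, sketched below under stated assumptions, is to let the covariance among population-level effects decay with pairwise distance, for example through a squared-exponential kernel of the kind used in Gaussian process models. The function name, the parameters eta_sq and rho_sq, and the distance values are illustrative assumptions and not part of any analysis reported here.

```python
import math

def distance_covariance(distances, eta_sq=1.0, rho_sq=0.5):
    """Squared-exponential covariance: nearby populations covary strongly.

    eta_sq is the maximum covariance; rho_sq controls how fast it decays.
    """
    n = len(distances)
    return [
        [eta_sq * math.exp(-rho_sq * distances[i][j] ** 2) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical pairwise distances among three populations (arbitrary units)
distances = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.5],
    [3.0, 2.5, 0.0],
]

cov = distance_covariance(distances)
for row in cov:
    print([round(value, 3) for value in row])
```

With these illustrative values, the neighboring first and second populations receive a covariance of about 0.61, whereas the distant first and third populations receive about 0.01. In an actual analysis, such a matrix would define the prior covariance of population-specific effects, and eta_sq and rho_sq would be estimated from the data.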

Finally, Figure 7c shows a scenario in which research-

ers have developed more mechanistic theory including

additional variables lying on the causal paths between

selection nodes and outcome; this provides more prin-

cipled expectations about the mechanisms generating

population disparities. Specifically, there is an intermedi-

ate variable Z mediating the effect of X on Y and another

variable W that modifies the effect of Z on Y. If this DAG

is assumed, there is no selection node pointing into Y

anymore, and thus researchers can explain all population

differences in the focal trait on the basis of the joint causal

effects of other variables. A research design that attempts

to address causation directly is the “regional comparative”

approach. In this approach, researchers explicitly target

closely related societies that differ only in key variables

of interest (Boas, 1896; Johnson, 1991). By holding other

factors constant, such “quasi-experimental” comparisons

among regional populations or subpopulations allow

researchers to isolate the effect of a variable of interest

and facilitate causal inference. This strategy is similar to

difference-in-difference (Lechner, 2011) and regression-

discontinuity designs (Imbens & Lemieux, 2008; Lee &

Lemieux, 2010). A classic example of the approach is the

Culture and Ecology in East Africa Project that compared

samples from four different ethnic groups, each of which

comprised neighboring pastoralist and horticulturalist communities in different but adjacent ecologies (Edgerton, 1971; Goldschmidt, 1965). Although differences among ethnic groups are hard to interpret, differences between neighboring communities in each ethnic group are arguably due to local ecological and economic differences (for more recent examples, see Glowacki & Molleman, 2017; Mattison et al., 2016).

Fig. 7. Different causal scenarios for study design. (a) Unobserved factors cause cross-cultural variation in outcome variable Y. (b) X is a cause of Y, and unobserved cultural variables C that differ between populations influence both X and Y; phylogenetic relationships P influence C.
represents unobserved common causes.

To summarize, because of the problem of induction,

generalizability can never be determined through the

accumulation of cross-cultural data alone and requires

the development of formal theory to accompany and

guide cross-cultural data collection (Muthukrishna &

Henrich, 2019). The maximally informative research

design depends on the state of mechanistic understand-

ing of the phenomenon of interest. By explicitly stating

and refining the causal assumptions underlying popula-

tion differences, researchers can target maximally infor-

mative cross-cultural comparisons and generate results

that are not only grounded in theory but also generaliz-

able beyond the immediate study samples.

Conclusions

More diverse samples are urgently needed, but they

bring forth new conceptual challenges for description,

generalization, and comparison. The accumulation of

large cross-cultural data sets in combination with lists

of threats to validity allows only limited progress. What

is needed in addition is a structural-causal-modeling

framework. An explicit causal framework empowers

researchers by providing a way to plan cross-cultural

comparisons, implement and justify analyses, and deter-

mine which interpretations are warranted under which

sets of assumptions. It also provides a powerful way to

critically and fairly evaluate the studies of others and to

formally represent sources of disagreement. An effective

critique should aim for the same causal clarity as an

effective study. When an original study lacks causal clar-

ity, an effective critique may identify which causal model

is implied by the analysis and subsequently assess the

plausibility of specific elements.

Researchers in various fields already apply methods

that address some of the concerns we discussed above.

For example, political scientists and sociologists apply

demographic standardization (e.g., Kitagawa decomposition) to estimate effects of interventions for counterfactual populations (Acharya et al., 2016; Ciocca Eller & DiPrete, 2018; Kitagawa, 1955; Mize, 2016; Preston et al., 2000; Ross et al., 2021; Storer et al., 2020). Anthropologists calculate age-corrected values to standardize across populations (Borgerhoff Mulder et al., 2009; Jaeggi et al., 2021; Mattison et al., 2016; Rowan et al., 2021).

Economists calculate average treatment effects and mar-

ginal effects that can take into account effect modification

by demographic variables (Asteriou & Hall, 2015; Athey

& Imbens, 2016; Greene, 2000; Morgan & Winship, 2015),

and the Heckman correction is applied to account for

nonrandom sample selection (Heckman, 1976, 1979;

Puhani, 2000). And even simple regression controls can

account for population differences in background factors

in some limited situations.
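
The demographic standardization mentioned above can be illustrated with a two-component Kitagawa decomposition, which splits the difference between two crude rates into a part due to differing stratum-specific rates and a part due to differing population composition. The following is only a minimal Python sketch with made-up numbers; the function names `crude_rate` and `kitagawa_decomposition`, the two age strata, and all input values are hypothetical and are not taken from the studies cited.

```python
def crude_rate(rates, shares):
    """Overall rate: stratum-specific rates weighted by composition shares."""
    return sum(rates[s] * shares[s] for s in rates)


def kitagawa_decomposition(rates_a, shares_a, rates_b, shares_b):
    """Split crude_rate(A) - crude_rate(B) into a rate and a composition part."""
    strata = list(rates_a)
    # Difference in stratum rates, weighted by the average composition.
    rate_part = sum(
        (rates_a[s] - rates_b[s]) * (shares_a[s] + shares_b[s]) / 2
        for s in strata
    )
    # Difference in composition, weighted by the average stratum rate.
    composition_part = sum(
        (shares_a[s] - shares_b[s]) * (rates_a[s] + rates_b[s]) / 2
        for s in strata
    )
    return rate_part, composition_part


# Made-up numbers for two hypothetical populations with two age strata.
rates_a = {"young": 0.8, "old": 0.4}
shares_a = {"young": 0.75, "old": 0.25}
rates_b = {"young": 0.7, "old": 0.3}
shares_b = {"young": 0.25, "old": 0.75}

rate_part, composition_part = kitagawa_decomposition(
    rates_a, shares_a, rates_b, shares_b
)
total_gap = crude_rate(rates_a, shares_a) - crude_rate(rates_b, shares_b)
print(rate_part, composition_part, total_gap)
```

In this toy example, roughly one third of the gap in crude rates (about 0.1 of 0.3) reflects different stratum-specific rates, and roughly two thirds (about 0.2) reflects different age composition. The two parts sum to the total gap by construction.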

The framework we champion, poststratification and transport based on causal graphs, goes beyond these partial solutions. It is explicit about the target of inference and the assumptions that justify the analysis; it logically derives statistical procedures from a generative causal model. Therefore, it is more general and unifies a large number of inferential concerns (e.g., confounding, selection bias, standardization, generalization) in a common framework. Likewise, the estimation strategy that we propose, multilevel regression with poststratification, is very flexible. It allows researchers to project estimates to arbitrary target populations and can account for any number of variables and functional relationships between them. In contrast, simply including age and gender as covariates in multiple regression assumes that all relationships are linear and estimates population differences holding covariates constant at an arbitrary level. Under the right circumstances, this standard approach might tell us something about differences between observed samples, but it does not enable us to generalize findings to the sample populations (in case of sampling differences) or to other populations in a reasoned way.
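
To make the poststratification step concrete, the following minimal Python sketch reweights stratum-specific sample means by the demographic composition of a target population. It deliberately omits the multilevel regression: stratum estimates here are raw sample means, whereas in the approach described above they would come from a fitted multilevel model. The data, the age strata, and the function names `stratum_means` and `poststratify` are hypothetical and serve illustration only.

```python
from collections import defaultdict


def stratum_means(sample):
    """Mean outcome within each stratum of a list of (stratum, outcome) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for stratum, outcome in sample:
        totals[stratum] += outcome
        counts[stratum] += 1
    return {s: totals[s] / counts[s] for s in totals}


def poststratify(means, target_counts):
    """Weight stratum means by the stratum sizes of the target population."""
    missing = set(target_counts) - set(means)
    if missing:
        raise ValueError(f"no sample data for strata: {sorted(missing)}")
    total = sum(target_counts.values())
    return sum(means[s] * n / total for s, n in target_counts.items())


# Hypothetical sample in which younger participants are overrepresented.
sample = [("young", 0.8)] * 6 + [("old", 0.4)] * 2
# Hypothetical target population with equal numbers in both age groups.
target_counts = {"young": 500, "old": 500}

sample_mean = sum(outcome for _, outcome in sample) / len(sample)
estimate = poststratify(stratum_means(sample), target_counts)
print(sample_mean, estimate)
```

In this toy example, the raw sample mean (about 0.7) overstates the expected outcome in the target population (about 0.6) because the sample contains three times as many younger as older participants. Poststratification removes this discrepancy only under the assumption that age group is the sole relevant difference between sample and target.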

All the scenarios we presented were clearly oversimplified. An explicit causal-inference framework makes it (at times painfully) transparent how strong the assumptions are that we need to arrive at substantive conclusions and how little we collectively know about many real-world phenomena. But this is no reason to embrace the status quo, which often avoids causal language (Grosz et al., 2020): assumptions do not disappear just because we ignore them. Cross-cultural research is daunting, and strong conclusions require strong methods for data collection, description, and analysis. A structural causal framework encourages researchers to explicitly spell out their assumptions, removing verbal ambiguity and facilitating communication, and it calls for a cumulative approach to science as one study’s findings become the scaffolding assumptions of the next.

Transparency

Action Editor: Mijke Rhemtulla

Editor: Daniel J. Simons

Author Contributions

D. Deffner and R. McElreath conceived the project. D. Deffner

wrote the simulations and performed the analyses for the

data examples. D. Deffner, J. M. Rohrer, and R. McElreath

wrote the manuscript. Conceptualization: D. Deffner, J. M.

Rohrer, R. McElreath. Data curation: D. Deffner. Formal

analysis: D. Deffner. Investigation: D. Deffner, J. M. Rohrer,

R. McElreath. Methodology: D. Deffner, J. M. Rohrer,

R. McElreath. Software: D. Deffner. Supervision: R. McElreath.

Visualization: D. Deffner. Writing-original draft: D. Deffner,

J. M. Rohrer, and R. McElreath. All of the authors approved

the final manuscript for submission.

Declaration of Conflicting Interests

The authors declare that there were no conflicts of interest

with respect to the authorship or the publication of this

article.

Funding

This work has been funded by the Max Planck Society.

Open Practices

Open Data: https://github.com/DominikDeffner/Cross-Cultural-Generalizability

Open Materials: https://github.com/DominikDeffner/Cross-Cultural-Generalizability

Preregistration: not applicable

All data and materials have been made publicly available via GitHub and can be accessed at https://github.com/DominikDeffner/Cross-Cultural-Generalizability. This article has received the badges for Open Data and Open Materials. More information about the Open Practices badges can be found at http://www.psychologicalscience.org/publications/badges.

ORCID iD

Dominik Deffner https://orcid.org/0000-0002-1649-3861

Acknowledgments

We thank members of the Department for Human Behavior, Ecology, and Culture and the Department of Comparative Cultural Psychology at the Max Planck Institute for Evolutionary Anthropology in Leipzig for constructive discussions and criticisms that helped improve this article.

Supplemental Material

Additional supporting information can be found at http://journals.sagepub.com/doi/suppl/10.1177/25152459221106366

References

Acharya, A., Blackwell, M., & Sen, M. (2016). Explaining causal

findings without bias: Detecting and assessing direct

effects. American Political Science Review, 110(3), 512–529.

Apicella, C., Norenzayan, A., & Henrich, J. (2020). Beyond

WEIRD: A review of the last decade and a look ahead to

the global laboratory of the future. Evolution and Human

Behavior, 41(5), 319–329.

Asteriou, D., & Hall, S. G. (2015). Applied econometrics.

Macmillan International Higher Education.

Athey, S., & Imbens, G. (2016). Recursive partitioning for

heterogeneous causal effects. Proceedings of the National

Academy of Sciences, 113(27), 7353–7360.

Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A.,

Bonnefon, J.-F., & Rahwan, I. (2018). The moral machine

experiment. Nature, 563(7729), 59–64.

Awad, E., Dsouza, S., Shariff, A., Rahwan, I., & Bonnefon, J.-F.

(2020). Universals and variations in moral decisions made

in 42 countries by 70,000 participants. Proceedings of the

National Academy of Sciences, USA, 117(5), 2332–2337.

Bandalos, D. L. (2018). Measurement theory and applications

for the social sciences. Guilford Publications.

Bareinboim, E., & Pearl, J. (2016). Causal inference and the

data-fusion problem. Proceedings of the National Academy

of Sciences, USA, 113(27), 7345–7352.

Barrett, H. C. (2020). Deciding what to observe: Thoughts for

a post-WEIRD generation. Evolution and Human Behavior,

41(5), 445–453.

Berkowitz, L., & Donnerstein, E. (1982). External validity is

more than skin deep: Some answers to criticisms of labora-

tory experiments. American Psychologist, 37(3), 245–257.

https://doi.org/10.1037/0003-066X.37.3.245

Boas, F. (1896). The limitations of the comparative method of

anthropology. Science, 4(103), 901–908.

Bond, M. H., Leung, K., & Wan, K. C. (1982). How does cul-

tural collectivism operate? The impact of task and main-

tenance contributions on reward distribution. Journal of

Cross-Cultural Psychology, 13(2), 186–200.

Borgerhoff Mulder, M., Bowles, S., Hertz, T., Bell, A., Beise, J.,

Clark, G., Fazzio, I., Gurven, M., Hill, K., Hooper, P. L.,

Irons, W., Kaplan, H., Leonetti, D., Low, B., Marlowe, F.,

McElreath, R., Naidu, S., Nolin, D., Piraino, P., . . .

Wiessner, P. (2009). Intergenerational wealth transmission

and the dynamics of inequality in small-scale societies.

Science, 326(5953), 682–688. https://doi.org/10.1126/science.1178336

Borsboom, D., Deserno, M. K., Rhemtulla, M., Epskamp, S.,

Fried, E. I., McNally, R. J., Robinaugh, D. J., Perugini, M.,

Dalege, J., Costantini, G., Isvoranu, A.-M., Wysocki, A. C.,

van Borkulo, C. D., van Bork, R., & Waldorp, L. J. (2021).

Network analysis of multivariate data in psychological

science. Nature Reviews Methods Primers, 1, Article 58.

https://doi.org/10.1038/s43586-021-00055-w

Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004).

The concept of validity. Psychological Review, 111(4),

1061–1071. https://doi.org/10.1037/0033-295X.111.4.1061

Broesch, T., Crittenden, A. N., Beheim, B. A., Blackwell, A. D.,

Bunce, J. A., Colleran, H., Hagel, K., Kline, M., McElreath, R.,

Nelson, R. G., Pisor, A. C., Prall, S., Pretelli, I., Purzycki, B.,

Quinn, E. A., Ross, C., Scelza, B., Starkweather, K.,

Stieglitz, J., & Mulder, M. B. (2020). Navigating cross-cul-

tural research: Methodological and ethical considerations.

Proceedings of the Royal Society B: Biological Sciences,

287(1935), Article 20201245. https://doi.org/10.1098/rspb.2020.1245

Calder, B. J., Phillips, L. W., & Tybout, A. M. (1983). Beyond exter-

nal validity. Journal of Consumer Research, 10(1), 112–114.

Campbell, D. T. (1957). Factors relevant to the validity of

experiments in social settings. Psychological Bulletin,

54(4), 297–312. https://doi.org/10.1037/h0040950

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B.,

Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A.

(2017). Stan: A probabilistic programming language. Journal

of Statistical Software, 76(1), 1–32. https://doi.org/10.18637/jss.v076.i01

Cinelli, C., Forney, A., & Pearl, J. (2020). A crash course in

good and bad controls. SSRN. http://dx.doi.org/10.2139/ssrn.3689437

Cinelli, C., & Pearl, J. (2021). Generalizing experimental results

by leveraging knowledge of mechanisms. European Journal

of Epidemiology, 36, 149–164. https://doi.org/10.1007/s10654-020-00687-4

Ciocca Eller, C., & DiPrete, T. A. (2018). The paradox of per-

sistence: Explaining the Black-White gap in bachelor’s

degree completion. American Sociological Review, 83(6),

1171–1214.

Clancy, K. B., & Davis, J. L. (2019). Soylent is people, and

WEIRD is white: Biological anthropology, whiteness, and

the limits of the WEIRD. Annual Review of Anthropology,

48, 169–186. https://doi.org/10.1146/annurev-anthro-102218-011133

Cook, T., & Campbell, D. (1976). The design and conduct of

quasi-experiments and true experiments in field settings.

In M. D. Dunnette (Ed.), Handbook of industrial and

organizational psychology (pp. 223–326). Rand McNally.

Curtin, C. M., Barrett, H. C., Bolyanatz, A., Crittenden, A. N.,

Fessler, D. M., Fitzpatrick, S., Gurven, M., Kanovsky, M.,

Kushnick, G., Laurence, S., Pisor, A., Scelza, B., Stich, S.,

von Rueden, C., & Henrich, J. (2020). Kinship intensity and the

use of mental states in moral judgment across societies.

Evolution and Human Behavior, 41(5), 415–429. https://doi.org/10.1016/j.evolhumbehav.2020.07.002

Edgerton, R. B. (1971). The individual in cultural adapta-

tion: A study of four East African peoples. University of

California Press.

Elwert, F. (2013). Graphical causal models. In S. L. Morgan

(Ed.), Handbook of causal analysis for social research (pp.

245–273). Springer.

Falk, A., Becker, A., Dohmen, T., Enke, B., Huffman, D., &

Sunde, U. (2018). Global evidence on economic preferences.

The Quarterly Journal of Economics, 133(4), 1645–1692.

Farrell, S., & Lewandowsky, S. (2018). Computational modeling

of cognition and behavior. Cambridge University Press.

Gächter, S., & Schulz, J. F. (2016). Intrinsic honesty and the

prevalence of rule violations across societies. Nature,

531(7595), 496–499.

Gao, Y., Kennedy, L., Simpson, D., & Gelman, A. (2021).

Improving multilevel regression and poststratification

with structured priors. Bayesian Analysis, 16(3), 719–744.

https://doi.org/10.1214/20-BA1223

Gelman, A., & Hill, J. (2006). Data analysis using regression

and multilevel/hierarchical models. Cambridge University

Press.

Gelman, A., & Little, T. C. (1997). Poststratification into many

categories using hierarchical logistic regression. Survey

Methodology, 23(2), 127–135.

Ghai, S. (2021). It’s time to reimagine sample diversity and

retire the WEIRD dichotomy. Nature Human Behaviour,

5(8), 971–972. https://doi.org/10.1038/s41562-021-01175-9

Glowacki, L., & Molleman, L. (2017). Subsistence styles shape

human social learning strategies. Nature Human Behaviour,

1, Article 0098. https://doi.org/10.1038/s41562-017-0098

Goldschmidt, W. (1965). Theory and strategy in the study

of cultural adaptability. American Anthropologist, 67(2),

402–408.

Greene, W. H. (2000). Econometric analysis (4th ed., Inter-

national ed.). Prentice Hall.

Grosz, M. P., Rohrer, J. M., & Thoemmes, F. (2020). The taboo

against explicit causal inference in nonexperimental psy-

chology. Perspectives on Psychological Science, 15(5),

1243–1255.

Heckman, J. J. (1976). The common structure of statistical mod-

els of truncation, sample selection and limited dependent

variables and a simple estimator for such models. In S. V.

Berg (Ed.), Annals of economic and social measurement

(Vol. 5, no. 4, pp. 475–492). NBER. http://www.nber.org/chapters/c10491

Heckman, J. J. (1979). Sample selection bias as a specification

error. Econometrica: Journal of the Econometric Society,

47(1), 153–161. https://doi.org/10.2307/1912352

Henrich, J. (2020). The WEIRDest people in the world: How

the West became psychologically peculiar and particularly

prosperous. Farrar, Straus and Giroux.

Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E.,

Gintis, H., & McElreath, R. (2001). In search of homo eco-

nomicus: Behavioral experiments in 15 small-scale societ-

ies. American Economic Review, 91(2), 73–78.

Henrich, J., Heine, S. J., & Norenzayan, A. (2010). Beyond

WEIRD: Towards a broad-based behavioral science.

Behavioral and Brain Sciences, 33(2–3), 111–135. https://doi.org/10.1017/S0140525X10000725

House, B. R., Kanngiesser, P., Barrett, H. C., Broesch, T.,

Cebioglu, S., Crittenden, A. N., Erut, A., Lew-Levy, S.,

Sebastian-Enesco, C., Smith, A. M., Yilmaz, S., & Silk, J. B.

(2020). Universal norm psychology leads to societal diver-

sity in prosocial behaviour and development. Nature

Human Behaviour, 4(1), 36–44. https://doi.org/10.1038/s41562-019-0734-z

Hume, D. (2003). A treatise of human nature. Dover Publications.

(Original work published 1739).

Imbens, G. W., & Lemieux, T. (2008). Regression discontinu-

ity designs: A guide to practice. Journal of Econometrics,

142(2), 615–635.

Jaeggi, A. V., Blackwell, A. D., von Rueden, C., Trumble, B. C.,

Stieglitz, J., Garcia, A. R., Kraft, T. S., Beheim, B. A.,

Hooper, P. L., Kaplan, H., & Gurven, M. (2021). Do wealth

and inequality associate with health in a small-scale subsistence

society? eLife, 10, Article e59437. https://doi.org/10.7554/eLife.59437

Johnson, A. (1991). Regional comparative field research.

Behavior Science Research, 25(1–4), 3–22.

Jones, B. C., DeBruine, L. M., Flake, J. K., Liuzza, M. T., Antfolk, J.,

Arinze, N. C., Ndukaihe, I. L., Bloxsom, N. G., Lewis, S. C.,

Foroni, F., Willis, M. L., Cubillas, C. P., Vadillo, M. A.,

Turiegano, E., Gilead, M., Simchon, A., Saribay, S. A.,

Owsley, N. C., Jang, C., . . . Coles, N. A. (2021). To which

world regions does the valence–dominance model of

social perception apply? Nature Human Behaviour, 5(1),

159–169. https://doi.org/10.1038/s41562-020-01007-2

Kitagawa, E. M. (1955). Components of a difference between

two rates. Journal of the American Statistical Association,

50(272), 1168–1194.

Lechner, M. (2011). The estimation of causal effects by differ-

ence-in-difference methods. Now.

Lee, D. S., & Lemieux, T. (2010). Regression discontinuity

designs in economics. Journal of Economic Literature, 48(2),

281–355.

Lesorogol, C. K. (2007). Bringing norms in: The role of con-

text in experimental dictator games. Current Anthropology,

48(6), 920–926.

Leung, K., & Bond, M. H. (1984). The impact of cultural col-

lectivism on reward allocation. Journal of Personality

and Social Psychology, 47(4), 793–804. https://doi.org/10.1037/0022-3514.47.4.793

Lundberg, I., Johnson, R., & Stewart, B. M. (2021). What is

your estimand? Defining the target quantity connects sta-

tistical evidence to theory. American Sociological Review,

86(3), 532–565.

Masuda, T., & Nisbett, R. E. (2001). Attending holistically

versus analytically: Comparing the context sensitivity of

Japanese and Americans. Journal of Personality and Social

Psychology, 81(5), 922–934. https://doi.org/10.1037//0022-3514.81.5.922

Matthay, E. C., & Glymour, M. M. (2020). A graphical catalog

of threats to validity: Linking social science with epidemiology.

Epidemiology, 31(3), 376–384. https://doi.org/10.1097/EDE.0000000000001161

Mattison, S. M., Beheim, B., Chak, B., & Buston, P. (2016).

Offspring sex preferences among patrilineal and matrilin-

eal Mosuo in Southwest China revealed by differences in

parity progression. Royal Society Open Science, 3(9), Article

160526. https://doi.org/10.1098/rsos.160526

McElreath, R. (2020). Statistical rethinking: A Bayesian course

with examples in R and Stan. CRC Press.

Mehr, S. A., Singh, M., Knox, D., Ketter, D. M., Pickens-

Jones, D., Atwood, S., Lucas, C., Jacoby, N., Egner, A. A.,

Hopkins, E. J., Howard, R. M., Hartshorne, J. K., Jennings,

M. V., Simson, J., Bainbridge, C. M., Pinker, S., O’Donnell,

T. J., Krasnow, M. M., & Glowacki, L. (2019). Universality

and diversity in human song. Science, 366(6468), Article

eaax0868. https://doi.org/10.1126/science.aax0868

Mize, T. D. (2016). Sexual orientation in the labor market.

American Sociological Review, 81(6), 1132–1160.

Morgan, S. L., & Winship, C. (2015). Counterfactuals and

causal inference. Cambridge University Press.

Murdock, G. P., & White, D. R. (1969). Standard cross-cultural

sample. Ethnology, 8(4), 329–369.

Muthukrishna, M., Bell, A. V., Henrich, J., Curtin, C. M.,

Gedranovich, A., McInerney, J., & Thue, B. (2020). Beyond

Western, educated, industrial, rich, and democratic (WEIRD)

psychology: Measuring and mapping scales of cultural and

psychological distance. Psychological Science, 31(6), 678–

701. https://doi.org/10.1177/0956797620916782

Muthukrishna, M., & Henrich, J. (2019). A problem in theory.

Nature Human Behaviour, 3(3), 221–229.

Naroll, R. (1965). Galton’s problem: The logic of cross-cultural

analysis. Social Research, 32(4), 428–451.

Nisbett, R. E., & Miyamoto, Y. (2005). The influence of culture:

Holistic versus analytic perception. Trends in Cognitive

Sciences, 9(10), 467–473.

Norenzayan, A., & Heine, S. J. (2005). Psychological univer-

sals: What are they and how can we know? Psychological

Bulletin, 131(5), 763–784. https://doi.org/10.1037/0033-2909.131.5.763

Oganisian, A., & Roy, J. A. (2021). A practical introduction

to Bayesian estimation of causal effects: Parametric and

nonparametric approaches. Statistics in Medicine, 40(2),

518–551.

Oyserman, D., & Lee, S. W. (2008). Does culture influence

what and how we think? Effects of priming individualism

and collectivism. Psychological Bulletin, 134(2), 311–342.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems:

Networks of plausible inference. Morgan Kaufmann.

Pearl, J. (2015). Generalizing experimental findings. Journal

of Causal Inference, 3(2), 259–266.

Pearl, J. (2018). Theoretical impediments to machine learning

with seven sparks from the causal revolution. arXiv. https://doi.org/10.48550/arXiv.1801.04016

Pearl, J., & Bareinboim, E. (2014). External validity: From do-

calculus to transportability across populations. Statistical

Science, 29(4), 579–595. https://doi.org/10.1214/14-STS486

Pearl, J., Glymour, M., & Jewell, N. P. (2016). Causal inference

in statistics: A primer. John Wiley & Sons.

Pearl, J., & Mackenzie, D. (2018). The book of why: The new

science of cause and effect. Basic Books.

Pisor, A. C., Gervais, M. M., Purzycki, B. G., & Ross, C. T. (2020).

Preferences and constraints: The value of economic games

for studying human behaviour. Royal Society Open Science,

7(6), Article 192090. https://doi.org/10.1098/rsos.192090

Pollet, T. V., Tybur, J. M., Frankenhuis, W. E., & Rickard, I. J.

(2014). What can cross-cultural correlations teach us about

human nature? Human Nature, 25(3), 410–429.

Preston, S., Heuveline, P., & Guillot, M. (2000). Demography:

Measuring and modeling population processes. Wiley-

Blackwell.

Puhani, P. (2000). The Heckman correction for sample selec-

tion and its critique. Journal of Economic Surveys, 14(1),

53–68.

Rohrer, J. M. (2018). Thinking clearly about correlations and

causation: Graphical causal models for observational

data. Advances in Methods and Practices in Psychological

Science, 1(1), 27–42.

Rohrer, J. M., & Arslan, R. C. (2021). Precise answers to vague

questions: Issues with interactions. Advances in Methods

and Practices in Psychological Science, 4(2). https://doi.org/10.1177/25152459211007368

Ross, C. T., Winterhalder, B., & McElreath, R. (2021). Racial

disparities in police use of deadly force against unarmed

individuals persist after appropriately benchmarking shoot-

ing data on violent crime rates. Social Psychological and

Personality Science, 12(3), 323–332.

Rowan, C. J., Eskander, M. A., Seabright, E., Rodriguez, D. E.,

Linares, E. C., Gutierrez, R. Q., Adrian, J. C., Cummings, D.,

Beheim, B., Tolstrup, K., Achrekar, A., Kraft, T., Michalik,

D. E., Miyamoto, M. I., Allam, A. H., Wann, L. S., Narula, J.,

Trumble, B. C., Stieglitz, J., . . . Gurven, M. D. (2021). Very

low prevalence and incidence of atrial fibrillation among

Bolivian forager-farmers. Annals of Global Health, 87(1),

Article 18. https://doi.org/10.5334/aogh.3252

Rozin, P. (2001). Social psychology and science: Some lessons

from Solomon Asch. Personality and Social Psychology

Review, 5(1), 2–14.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002).

Experimental and quasi-experimental designs for generalized

causal inference. Houghton Mifflin.

Sijtsma, K. (2006). Psychometrics in psychological research:

Role model or partner in science? Psychometrika, 71(3),

451–455.

Simons, D. J., Shoda, Y., & Lindsay, D. S. (2017). Constraints

on generality (COG): A proposed addition to all empiri-

cal papers. Perspectives on Psychological Science, 12(6),

1123–1128. https://doi.org/10.1177/1745691617708630

Sloman, S. A., & Lagnado, D. (2005). The problem of induction.

In K. J. Holyoak & R. G. Morrison (Eds.), The Cambridge

handbook of thinking and reasoning (pp. 95–116). Cengage

Learning.

Smaldino, P. E., Lukaszewski, A., von Rueden, C., & Gurven, M.

(2019). Niche diversity can explain cross-cultural differ-

ences in personality structure. Nature Human Behaviour,

3(12), 1276–1283.

Storer, A., Schneider, D., & Harknett, K. (2020). What explains

racial/ethnic inequality in job quality in the service sector?

American Sociological Review, 85(4), 537–572.

Tiokhin, L., Hackman, J., Munira, S., Jesmin, K., & Hruschka, D.

(2019). Generalizability is not optional: Insights from a

cross-cultural study of social discounting. Royal Society

Open Science, 6(2), Article 181386. https://doi.org/10.1098/rsos.181386

Urassa, M., Lawson, D. W., Wamoyi, J., Gurmu, E., Gibson,

M. A., Madhivanan, P., & Placek, C. (2021). Cross-cultural

research must prioritize equitable collaboration. Nature

Human Behaviour, 5, 668–671. https://doi.org/10.1038/s41562-021-01076-x

Van de Vijver, F., & Leung, K. (2021). Methods and data

analysis for cross-cultural research (Vol. 116). Cambridge

University Press.

Van de Vijver, F., & Tanzer, N. K. (2004). Bias and equiva-

lence in cross-cultural assessment: An overview. European

Review of Applied Psychology, 54(2), 119–135.

Wang, W., Rothschild, D., Goel, S., & Gelman, A. (2015).

Forecasting elections with non-representative polls. Inter-

national Journal of Forecasting, 31(3), 980–991.

Winer, R. S. (1999). Experimentation in the 21st century: The

importance of external validity. Journal of the Academy

of Marketing Science, 27(3), Article 349. https://doi.org/10.1177/0092070399273005

Woodward, J. (2005). Making things happen: A theory of causal

explanation. Oxford University Press.

Yarkoni, T. (2022). The generalizability crisis. Behavioral and

Brain Sciences, 45, Article E1. https://doi.org/10.1017/S0140525X20001685