Introduction

In the property listing task (PLT), a subject sample is asked to list semantic properties associated with a given concept (e.g., for DOG, someone may produce barks and wags its tail). The PLT is widely used in psychology and related fields for basic and applied research (in cognitive psychology, e.g., Wu & Barsalou, 2009; in marketing, e.g., Hough & Ferraris, 2010; in social psychology, e.g., Walker & Hennig, 2004; in neuropsychology, e.g., Perri et al., 2012). The PLT is often used to collect semantic properties across whole semantic fields (e.g., artifacts, concrete concepts, abstract concepts). In those cases where the PLT is applied to many concepts, results are published as conceptual property norms (CPNs, e.g., Devereux et al., 2014; Kremer & Baroni, 2011; Lenci et al., 2013; McRae et al., 2005; Montefinese et al., 2013; Vivas et al., 2017). These CPNs allow different concepts to be characterized by their associated semantic properties and their corresponding frequency counts (Cree & McRae, 2003) and are useful for other researchers who resort to CPNs to find carefully controlled experimental stimuli (e.g., Bruffaerts et al., 2019; McRae et al., 1999). Though other tools are available to explore the structure of semantic memory and its implementation in the brain (Binder et al., 2016), CPNs offer a particularly rich source of information (Bruffaerts et al., 2019).

The usefulness of the PLT and CPNs notwithstanding, only recently has there been a proposal for taking sample representativeness into account (Canessa et al., 2021). Before that proposal, representativeness had not been a requirement for CPNs. In what follows, we discuss why representativeness is relevant and offer a way of formally conceptualizing it for CPNs. Finally, we describe a statistical package in R that researchers can use to determine the sample size necessary to achieve a certain level of representativeness. We illustrate the package’s usefulness by applying it to empirical and synthetic datasets.

Consequences of not taking CPN representativeness into account

Not knowing whether a CPN is representative introduces several limitations that have been overlooked in the literature. Recall that CPNs are not experiments, where internal validity (related to the logic of experimental design) would be the validity criterion. For a CPN study, external validity (related to sample representativeness) seems the appropriate validity criterion. When researchers use a CPN to obtain normed stimuli, or when they use values associated with concepts and properties for control purposes, those raw values are at best unbiased point estimates of the true population values, not the population values themselves. Consider any measure computed from matrices of property frequency counts, such as property dominance and concept similarity: these are all estimates of population parameters (i.e., one would expect variations in those estimates from one study to another). As with any measurement that depends on a sample's representativeness, if it is not possible to determine that representativeness relative to the relevant population, several consequences ensue. Here, we discuss three of them; others may be possible.

  1.

    Perhaps the most important consequence is that lacking information about representativeness hinders a researcher’s ability to make claims about how well her specific results generalize to the population. Values obtained from a sample are only estimators of the relevant population parameters. Surprisingly, this problem has only recently been acknowledged relative to CPNs. If a cognitive researcher wants to draw conclusions from semantic properties collected from a sample of subjects, and if she does not know whether her data is representative, then she has only a limited capacity to generalize her results.

  2.

    Another consequence of not being able to ascertain representativeness is that researchers cannot formally make decisions about sample size (i.e., how many participants will list properties for a given concept). This probably explains why researchers collecting CPN data have instead reached an implicit consensus about sample size. Perusing the literature, one finds that they have implicitly agreed that somewhere between 20 and 30 participants is a reasonable number (e.g., Cree & McRae, 2003; Devereux et al., 2014; Lenci et al., 2013; McRae et al., 2005; Montefinese et al., 2013). Sample sizes are thus justified solely by convention. At this point, the reader may wonder whether effect sizes could be used to determine sample size. However, although CPNs may serve as input to experiments by providing carefully controlled stimuli, CPNs are not experiments themselves, so no effect size can be determined for them.

  3.

    A third consequence concerns replicability. There is currently no good way to compare CPNs. As matters stand, each CPN is considered unique, replications are not attempted, and comparisons between CPNs are at best tentative, informal, or very general (for further discussion, see Canessa et al., 2021).

Conceptualizing CPN representativeness

Arguably, part of the reason for not including representativeness in CPN studies is that it is unclear what representativeness might mean for a CPN. In our previous work (Canessa et al., 2021), we offered the following account. This explanation draws heavily on work in ecology (mainly Chao & Chiu, 2016), which we think offers striking parallels to CPNs.

Imagine an ecologist who collects insect species specimens by putting traps at different spots in a jungle patch. After a predefined period has elapsed, the researcher opens her traps and does a frequency count of the different species. Importantly, the researcher would want to know whether the species that were collected are representative of the jungle patch population. By representative, what is meant here is whether the species that were obtained cover a sufficiently large percentage of the unknown total number of species that populate the jungle patch. By being able to estimate coverage, the researcher would also know whether generalizing from her sample to the population is warranted, whether a larger sample is necessary, and whether her study is comparable to other similar studies (these are the three problems we highlight above).

This imagined ecological study closely parallels a CPN study. The jungle patch is like a single concept. Subjects producing lists of semantic properties for a given concept are like traps collecting species specimens. Frequency counts of semantic properties are like frequency counts of different species. Many more semantic properties exist for a concept than those sampled in our imaginary CPN study, just as many more insect species existed in our imaginary jungle patch. The analogy can be carried even further, as similar conclusions may be drawn for ecological and cognitive data. In ecology, a researcher may be interested in a variable called “species richness” (i.e., the total count of unique species that characterize an ecosystem; Bellwood et al., 2004). In cognitive research, a researcher may be interested in a variable called “semantic richness” (SR). Regarding SR, here we conceptualize it as the total count of unique semantic properties that characterize a concept (i.e., those produced at least once in the CPN study). In cognitive psychology research, SR has been conceptualized as the average number of properties produced by a sample of subjects (Pexman et al., 2007; Pexman et al., 2008). We prefer to use our definition because it is closely related to the equations discussed in the following section. Note, however, that for real PLT data in which participants typically produce lists that include some partially shared properties and some idiosyncratic properties, both quantities will be correlated (i.e., larger individual lists generally imply a larger set of unique properties).

Thus, our proposal is that a CPN study is representative if the semantic properties captured for the concepts included in the CPN cover a sufficiently large proportion of the unknown total number of semantic properties that would be reached if an increasingly larger number of participants were included in the study. In what follows, we offer a formalization of these ideas, but the analogy proposed here may give the reader the necessary intuitions to follow the rest of this manuscript.

A possible concern regarding our proposed analogy is that, because human knowledge varies with experience and context whereas the jungle patch in our analogy has a fixed set of species, the analogy does not hold. We believe this concern is resolved by noting that two different levels of analysis are involved. One level is that of individuals (or traps), where variability is expected: for individuals, due to experience and context; for traps, due to, e.g., trap placement. The other level is that of the population (or jungle patch), where a more or less fixed set of semantic features (or species) can be assumed. Note that this limited-set-of-properties assumption must hold for CPNs, given that dimension-reduction techniques (clustering, PCA, etc.) applied to these kinds of data often produce meaningful structures (e.g., animals, plants, fruits, artifacts, vehicles), which would be impossible if conceptual contents were not limited and shared (e.g., Cree & McRae, 2003).

Formalizing coverage

Chao and Chiu (2016), in the context of species richness estimation in ecology, developed a mathematical model that formalizes the abovementioned concept of coverage in ecological studies and can also be used to estimate other parameters in such studies. As Chao and Chiu (2016) discuss, the model is more general than the species richness estimation problem and can thus be applied in other disciplines. In the context of CPN studies, we have already established a close parallel to the ecological mathematical model (Canessa et al., 2021). Thus, by adopting the same simplifications used in deriving the Chao and Chiu (2016) model, we may straightforwardly apply it to CPN studies. Those simplifications are extensively discussed and justified in the context of CPN studies in Canessa et al. (2021); here we only list them and offer summary justifications, referring the interested reader to that work for an in-depth discussion. To apply the Chao and Chiu (2016) model, we need to assume the following:

  a)

    Each property listed by participants for a concept in a CPN study has a constant incidence probability, i.e., the probability that a given property for a given concept is produced by a participant is the same for all participants.

  b)

    Properties are independent (i.e., a property's detectability is independent of whether other properties are detected).

  c)

    The number of properties associated with a given concept in the population is finite.

These assumptions warrant some justification, which we provide next. The constant incidence probability assumption may be rephrased as stating that there are no participant-specific effects. This means that each participant is equally representative of the population and that there are no systematic and important differences among participants related to the listing process. This is in fact a routine assumption in CPN studies when participants are randomly selected, and hence we can argue that it is a reasonably close approximation.

Regarding the independence of properties, this also seems a reasonably close approximation. First, consider that this is a generally accepted simplification: computing cosine similarity between featural descriptions in CPN data, for example, already assumes that all properties are orthogonal (e.g., McRae et al., 2005). Though independence probably does not hold completely at the level of each individual participant (i.e., properties might be correlated, such that evoking a given property affects the probability of evoking the following one), interfeature correlations have not been found to be frequent in CPN studies (De Deyne et al., 2019). Furthermore, even if a substantial number of participants produced correlated properties, the property independence assumption only requires that different people produce different pairs of correlated properties (e.g., someone produces “turbine” right after “wing”, while someone else produces “propeller” right after “wing”). Accumulating data across many individuals, as done in CPNs, renders properties approximately independent across participants.

Finally, consider the assumption that the number of properties in the population is finite. We believe this is reasonable because property production is confined to the time span in which the PLT is carried out. If property lists were collected over a long period of time, many factors (e.g., creativity, cultural change, conceptual drift) could make the total list of properties grow indefinitely, which would be a problem. However, it seems reasonable to assume that the total number of properties accessible to participants for report at any given moment may be very large but is still finite.

According to Chao and Chiu (2016), these three assumptions allow one to characterize the frequency count for each property by a binomial distribution. Based on that distribution, and using the standard method of moments and an asymptotic approach (Chao & Chiu, 2016, and references therein), these authors derive an estimate of the representativeness of the sample used in a study, which they label coverage. Coverage is defined in general terms as the fraction of the total number of properties in the population that is captured in the total sample of T participants for a given concept. More formally, coverage is defined as the fraction of the total incidence probabilities of the reported properties that are in the reference sample (Chao & Chiu, 2016). Coverage can be estimated for each concept by Eq. (1):

$$\hat{C}(T)=1-\frac{Q_1}{\mathcal{U}}\left[\frac{Q_1\ \left(T-1\right)}{Q_1\ \left(T-1\right)+2{Q}_2}\right]$$
(1)

where Q1 is the number of properties reported by exactly one participant and Q2 is the number of properties reported by exactly two participants (respectively, the number of singletons and the number of doubletons), and \(\mathcal{U}\) is the total number of properties listed by all participants for a concept.
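To make Eq. (1) concrete, the estimator can be sketched in a few lines of code (shown here in Python rather than R, since the computation is language-agnostic; the function name is ours for illustration, not the package's). Plugging in the figures reported later for concept C1 of the synthetic dataset (T = 9, U = 19, Q1 = 3, Q2 = 5) recovers its coverage of roughly 0.89:

```python
def coverage_hat(q1: int, q2: int, u: int, t: int) -> float:
    """Estimated sample coverage C_hat(T) for one concept, Eq. (1).

    q1: number of singletons (properties listed by exactly one participant)
    q2: number of doubletons (properties listed by exactly two participants)
    u:  total number of property tokens listed for the concept
    t:  number of participants who listed properties for the concept
    """
    return 1.0 - (q1 / u) * (q1 * (t - 1)) / (q1 * (t - 1) + 2 * q2)

# Concept C1 of the synthetic dataset: T = 9, U = 19, Q1 = 3, Q2 = 5
print(round(coverage_hat(3, 5, 19, 9), 2))  # 0.89
```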

Furthermore, the same logic used in deriving Eq. (1) allows us to estimate the coverage expected if the sample size were increased by t* participants:

$$\hat{C}\left(T+{t}^{\ast}\right)=1-\frac{Q_1}{\mathcal{U}}{\left[\frac{Q_1\ \left(T-1\right)}{Q_1\ \left(T-1\right)+2{Q}_2}\right]}^{\left({t}^{\ast }+1\right)}\kern0.5em 0\le {t}^{\ast}\le 2T$$
(2)

From Eq. (2), we can solve for t* and estimate the number of additional participants needed to obtain a certain target coverage (\({\hat{C}}_{target}\)):

$${t}^{\ast }= ceiling\left[\frac{\ln \left(\frac{\mathcal{U}}{Q_1}\left[1-{\hat{C}}_{target}\right]\right)}{\ln \left(\frac{\left(T-1\right){Q}_1}{\left(T-1\right){Q}_1+2{Q}_2}\right)}-1\right]\ 0<{t}^{\ast}\le 2T$$
(3)

where the ceiling function returns the smallest integer greater than or equal to its argument.
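Eq. (3) can be sketched the same way (again in Python, with an illustrative function name; the constraint 0 < t* ≤ 2T must still be checked by the caller). For concept C1 of the synthetic dataset (T = 9, U = 19, Q1 = 3, Q2 = 5), a target coverage of 0.90 yields t* = 1, matching the example discussed later:

```python
import math

def t_star(q1: int, q2: int, u: int, t: int, c_target: float) -> int:
    """Additional participants needed to reach coverage c_target, Eq. (3).

    The caller must verify 0 < t_star <= 2*t; outside that range the
    asymptotic approximation behind Eqs. (2)-(3) is not trusted.
    """
    num = math.log((u / q1) * (1.0 - c_target))
    den = math.log((t - 1) * q1 / ((t - 1) * q1 + 2 * q2))
    return math.ceil(num / den - 1.0)

# Concept C1: one extra participant suffices to reach 90% coverage
print(t_star(3, 5, 19, 9, 0.90))  # 1
```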

As already explained, Chao and Chiu's mathematical model was developed to estimate the species richness in an ecological study, i.e., the number of species present in a given environment. In the context of CPN studies, the equivalent of species richness is the semantic richness S of a concept (i.e., the total count of unique semantic properties that characterize it, as defined in the previous section). To estimate S, the model needs only the same quantities used in Eq. (1), namely Q1, Q2, and T. Here it is useful to note that, more generally, singletons and doubletons are two of the incidence-based frequency counts (Q0, Q1, Q2, …, QT), where Qk corresponds to the number of properties that are reported by exactly k participants, k = 0, 1, …, T. The unobserved frequency count Q0 represents the number of properties not reported by any of the T participants. Thus, an estimate of S is given by Eq. (4):

$$\hat{S}=\left\{\begin{array}{c}{S}_{obs}+A\ \frac{Q_1^2}{2{Q}_2}\kern2.25em if\ {Q}_2>0\\ {}{S}_{obs}+A\ \frac{Q_1\ \left({Q}_1-1\right)}{2}\kern0.75em if\ {Q}_2=0\end{array}\right.$$
(4)

where \(A=\frac{\left(T-1\right)}{T}\) and Sobs corresponds to the observed semantic richness, i.e., the observed count of unique properties for a concept.
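A minimal sketch of Eq. (4) (Python, illustrative name; this is the incidence-based Chao2-type estimator), including the Q2 = 0 branch:

```python
def s_hat(s_obs: int, q1: int, q2: int, t: int) -> float:
    """Estimated semantic richness S_hat, Eq. (4)."""
    a = (t - 1) / t
    if q2 > 0:
        return s_obs + a * q1 ** 2 / (2 * q2)
    # Bias-corrected form used when there are no doubletons
    return s_obs + a * q1 * (q1 - 1) / 2

# Concept C1: S_obs = 10, Q1 = 3, Q2 = 5, T = 9
print(round(s_hat(10, 3, 5, 9), 2))  # 10.8
```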

The estimator's variance can be approximated by Eq. (5):

$$\hat{\operatorname{var}}\big(\hat{S}\big)=\left\{\begin{array}{c}{Q}_2\left[\frac{A}{2}{\left(\frac{Q_1}{Q_2}\right)}^2+{A}^2\ {\left(\frac{Q_1}{Q_2}\right)}^3+\frac{A^2}{4}\ {\left(\frac{Q_1}{Q_2}\right)}^4\right]\kern2.75em if\ {Q}_2>0\\ {}A\ \frac{Q_1\ \left({Q}_1-1\right)}{2}+{A}^2\frac{Q_1\ {\left(2{Q}_1-1\right)}^2}{4}-{A}^2\ \frac{Q_1^4}{4\hat{S}}\kern0.75em if\ {Q}_2=0\end{array}\right.$$
(5)

From this variance, we can calculate the standard deviation (SD) of \(\hat{S}\) (i.e., \(SD(\hat{S})=\sqrt{\hat{\operatorname{var}}\big(\hat{S}\big)}\)), and the confidence interval for S can be computed by Eq. (6):

$$95\% CI\ for\ S=\left[{S}_{obs}+\frac{\hat{S}-{S}_{obs}}{D},{S}_{obs}+\left(\ \hat{S}-{S}_{obs}\right)\ D\ \right]$$
(6)

where

$$D=\exp\left[1.96\ \sqrt{\ln \left(1+\frac{\hat{\operatorname{var}}\big(\hat{S}\big)}{{\left(\hat{S}-{S}_{obs}\right)}^2}\right)}\ \right]$$
(7)

Note that using Eq. (6) assumes that \(\ln \left(\hat{S}-{S}_{obs}\right)\) is approximately normally distributed, i.e., \(\left(\hat{S}-{S}_{obs}\right)\) follows an approximate log-normal distribution. Also see that \(\left(\hat{S}-{S}_{obs}\right)=\hat{Q_0}\) corresponds to the rightmost summand in Eq. (4).
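Eqs. (5)-(7) can be sketched as follows (Python, illustrative names; note that the Q2 = 0 variance branch needs \(\hat{S}\) itself as an input). Running it on concept C1's figures (S_obs = 10, \(\hat{S}\) = 10.8) gives an asymmetric interval, as expected under the log-normal assumption:

```python
import math

def var_s_hat(s_hat_val: float, q1: int, q2: int, t: int) -> float:
    """Approximate variance of S_hat, Eq. (5)."""
    a = (t - 1) / t
    if q2 > 0:
        r = q1 / q2
        return q2 * (a / 2 * r ** 2 + a ** 2 * r ** 3 + a ** 2 / 4 * r ** 4)
    return (a * q1 * (q1 - 1) / 2
            + a ** 2 * q1 * (2 * q1 - 1) ** 2 / 4
            - a ** 2 * q1 ** 4 / (4 * s_hat_val))

def ci_95(s_obs: int, s_hat_val: float, var_val: float):
    """Log-normal 95% CI for S, Eqs. (6)-(7)."""
    q0_hat = s_hat_val - s_obs  # estimated number of unobserved properties
    d = math.exp(1.96 * math.sqrt(math.log(1 + var_val / q0_hat ** 2)))
    return s_obs + q0_hat / d, s_obs + q0_hat * d

# Concept C1: S_obs = 10, S_hat = 10.8, Q1 = 3, Q2 = 5, T = 9
v = var_s_hat(10.8, 3, 5, 9)   # ~1.78
lo, hi = ci_95(10, 10.8, v)
print(round(v, 2), round(lo, 1), round(hi, 1))
```

Note how the interval is much wider above \(\hat{S}\) than below it, which is why the CIs in Fig. 1 are asymmetric.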

Given that Eq. (3) estimates the increment t* to the original sample size T needed to obtain a desired coverage \({\hat{C}}_{target}\), Eq. (8) allows us to estimate the corresponding \(\hat{S}\):

$$\hat{S}\left(T+{t}^{\ast}\right)={S}_{obs}+\hat{Q_0}\left[1-{\left(1-\frac{Q_1}{T\ \hat{Q_0}+{Q}_1}\right)}^{t^{\ast }}\right]\ 0<{t}^{\ast}\le 2T$$
(8)

where \(\hat{Q_0}=\hat{S}-{S}_{obs}\) and, as already noted, corresponds to the rightmost summand in Eq. (4). In the next section, we present how we implemented formulae (1) to (8) in an R package and explain how that package calculates the inputs to those expressions, i.e., Q1, Q2, T, Sobs, and \(\mathcal{U}\).
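Before moving on, Eq. (8) can be sketched as well (Python, illustrative name). For concept C1 with one extra participant (t* = 1, as computed for a 0.90 target coverage), the expected semantic richness rises only slightly above S_obs:

```python
def s_hat_plus(s_obs: int, s_hat_val: float, q1: int, t: int, t_extra: int) -> float:
    """Expected semantic richness after adding t_extra participants, Eq. (8)."""
    q0_hat = s_hat_val - s_obs  # estimated number of unobserved properties
    return s_obs + q0_hat * (1 - (1 - q1 / (t * q0_hat + q1)) ** t_extra)

# Concept C1: S_obs = 10, S_hat = 10.8, Q1 = 3, T = 9, t* = 1
print(round(s_hat_plus(10, 10.8, 3, 9, 1), 2))  # 10.24
```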

Implementation of equations in R

Transparency and openness

The CPNCoverageAnalysis R package is available from the Comprehensive R Archive Network (CRAN; https://cran.r-project.org/). To download the package along with its accompanying files, go to https://CRAN.R-project.org/package=CPNCoverageAnalysis.

R package’s main functions

In this section, we describe the CPNCoverageAnalysis package’s two main functions (generate_norms and estimate_participant). The function generate_norms receives the properties mentioned by all participants in a CPN for each concept and calculates all the figures that describe the CPN (i.e., Q1 [no. of singletons], Q2 [no. of doubletons], T [no. of participants who listed properties], Sobs [observed semantic richness, SR], \(\mathcal{U}\) [total no. of properties listed], \(\hat{S}\) [SR estimate], standard deviation [SD] of \(\hat{S}\), 95% CI for S, and \(\hat{C}(T)\) [estimated coverage]). Note that, to facilitate reference to the R code, in this section we use the variable names employed in the code, e.g., Q1 instead of \(Q_1\), S_hat instead of \(\hat{S}\), etc.

The second function, estimate_participant, receives the estimates generated by generate_norms and a value between 0 and 1 indicating the target coverage required by the user. Using those inputs, it returns estimates for t_star and S_hat_star.

The input data for the generate_norms function is a matrix with n rows and three columns, where n is the number of properties mentioned by all participants for all the concepts in the CPN. To explain the code and the dataset, we start by loading the synthetic dataset called data_test, a data frame with 65 rows and 3 columns (i.e., in this synthetic CPN, a total of 65 properties were listed for the concepts).

figure a

Each row of the dataset corresponds to one property mentioned for a specific concept by a specific participant. The first column of the data frame contains the participant ID, the second column contains the concept ID (any ID consisting of ASCII characters or integer numbers works), and the third contains the property. For example, the first row has the values 1 C1 p1, meaning that participant 1 mentioned property p1 for concept C1. The second row is 1 C1 p2, because the same person mentioned a second property for the same concept C1. The whole synthetic dataset consists of 10 participants, three concepts, and 17 different properties.

The function generate_norms calculates all figures necessary to estimate t_star and S_hat_star from a dataset. For each concept contained in the dataset, the function computes values for Q1, Q2, T, S_obs, and U; estimates values for S_hat, sd_S_hat, CI_l, CI_U, and C_t; and returns a data frame with those values. The function starts by extracting each concept contained in the dataset and preprocesses the concepts and properties by trimming leading and trailing white spaces and changing all characters to upper case. Once the data is clean, it generates an empty data frame for all the concepts and calculates the corresponding abovementioned values for each concept.

To calculate the necessary values, the function iterates over each concept. Given a concept, it extracts all mentioned properties and calculates the number of participants who listed at least one property for the concept (T). Then it calculates the number of unique properties (S_obs) and the number of times that each property is mentioned across all participants. To calculate U, Q1, and Q2, a binary matrix of size properties × participants is created, where a value of 1 means that property i was listed by participant j. Using this matrix, U is simply the sum of all entries, and a property frequency vector (the number of times each property is mentioned) can be computed. From this vector, the function calculates Q1 and Q2. Once these values are available, the function calculates C_T, S_hat, sd_S_hat, CI_l, and CI_U, corresponding to Eqs. (1), (4), (5), and (6)–(7), respectively. The following example shows the estimated figures for the synthetic dataset.
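The counting steps just described can be sketched as follows (in Python rather than R, with illustrative names; the actual package operates on the whole data frame at once). A set of unique (participant, property) pairs plays the role of the binary incidence matrix:

```python
from collections import Counter

def concept_counts(rows, concept):
    """Compute (T, S_obs, U, Q1, Q2) for one concept from long-format
    (participant, concept, property) rows, mirroring the steps described
    for generate_norms (names here are illustrative, not the package's)."""
    # Unique (participant, property) incidences = the 0/1 matrix entries
    pairs = {(p, prop.strip().upper()) for p, c, prop in rows if c == concept}
    t = len({p for p, _ in pairs})             # participants listing >= 1 property
    u = len(pairs)                             # U = sum of all 1s in the matrix
    freq = Counter(prop for _, prop in pairs)  # property frequency vector
    s_obs = len(freq)                          # unique properties
    q1 = sum(1 for n in freq.values() if n == 1)  # singletons
    q2 = sum(1 for n in freq.values() if n == 2)  # doubletons
    return t, s_obs, u, q1, q2

# Tiny made-up example (not the data_test dataset)
rows = [(1, "C1", "p1"), (1, "C1", "p2"), (2, "C1", "p1"), (2, "C1", "p3")]
print(concept_counts(rows, "C1"))  # (2, 3, 4, 2, 1)
```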

figure b

As can be observed, there are three different concepts. For concept C1, nine people mentioned at least one property (T = 9). In total, 19 properties were mentioned (U = 19), of which 10 were unique (S_obs = 10). Three properties were mentioned by exactly one participant (Q1 = 3), while five properties were mentioned by exactly two participants (Q2 = 5). Recall that the remaining values are calculated from the preceding figures and correspond to the estimated semantic richness (S_hat), with its respective standard deviation (sd_S_hat) and 95% confidence interval (CI_l and CI_U), and the attained coverage (C_T).

The function estimate_participant estimates the additional number of participants needed to obtain a target coverage defined by the user, along with the corresponding estimated semantic richness. For this purpose, the function receives as inputs est_norms (the estimated norms from function generate_norms) and target_cover (the target coverage defined by the user: a value between 0.0 and 1.0). Using these inputs, the function calculates, for all concepts, the additional number of participants t_star using Eq. (3) (\({\hat{C}}_{target}\) in Eq. (3) equals target_cover), along with the respective estimated semantic richness S_hat_star, defined by Eq. (8). The following example shows these values for a target coverage of 0.90 using the synthetic dataset.

figure c

In this example, we can observe that C1 needs at least one more participant to increase its coverage to 90%. Recall that its original coverage was C_T = 0.89, according to the previous code example. In contrast, C2 does not need more participants. This value is not surprising, given that the target coverage is lower than C2’s originally calculated coverage (C_T = 0.91). Finally, for concept C3, given that Q2 = 0, Eq. (3) cannot be used to calculate t_star, and thus the function outputs the corresponding warning (“Q2 = 0, cannot calculate t_star”). Note also that when the calculated t_star exceeds 2T [see Eqs. (2) and (3)], the function delivers the respective warning (“t_star > 2T, t_star = 2T”) and shows S_hat_star and estimated coverage calculated using t_star = 2T.

Use of the R package employing real CPN data

To illustrate the use of the R package, we resort to data collected in a typical norming study conducted by the authors in previous research. In those norms, participants wrote down short phrases associated with each given concept, following a procedure used in Recchia and Jones (2012). Participants (N = 100) were all native Chilean Spanish speakers. Each received 10 abstract concepts, selected randomly from a pool of 27 possible concepts. Because concepts were randomly selected, we ended up with a different number of participants for each concept (mean number of participants = 36.6, min = 22, max = 52). Overall, our participants produced a total of 5457 token responses.

Properties were coded by a trained coder who selected valid and invalid responses (e.g., cue repetitions, property repetitions, metacognitive comments, and off-topic comments). Valid responses (4941) were coded into 729 response types. A second coder independently recoded all valid responses, which allowed us to estimate reliability. As recommended by Bolognesi et al. (2017) we computed Cohen’s kappa (Cohen, 1960) as a reliability estimate. Our reliability coefficient showed a substantial level of agreement (kappa = .76). Note that in the present work we use the CPN data coded in English, so that it is amenable to be used by a wider international audience. In Canessa et al. (2021), which we cite in other parts of this paper, we used the original Spanish coded CPN data, resulting in very slight differences in the calculated figures due to slang.

The input file containing the coded properties follows the prescribed format (see the "Implementation of equations in R" section) and can be downloaded from https://CRAN.R-project.org/package=CPNCoverageAnalysis. Table 1 presents the R package output; to generate those figures, run the code in the Appendix. Note that estimated sample coverage values (\(\hat{C}(T)\)) were only modest (see Table 1), suggesting that our norming study was far from exhausting the available information. Importantly, inspection of Table 1 shows that differences in coverage are not only a function of sample size (T), but also of property distributions (i.e., the values of Q1 and Q2). In general, longer tails (larger Q1 values) are associated with lower coverage. Figure 1 depicts the 95% semantic richness (SR) CIs corresponding to the figures shown in Table 1. Note that those CIs are not symmetric around the point estimate because \(\hat{S}-{S}_{obs}\) is assumed to follow a log-normal distribution.

Table 1 Results from R package corresponding to Q1 (no. of singletons), Q2 (no. of doubletons), T (no. of participants who listed properties), Sobs (observed semantic richness, SR),\(\mathcal{U}\)(total no. of properties listed),\(\hat{S}\)(estimate of SR), SD of\(\hat{S}\), 95% CI for S,\(\hat{C}(T)\)(estimated coverage), t* (increase in sample size to obtain a certain coverage) and\(\hat S\left(T+t^\ast\right)\)(estimated SR when adding t* participants)
Fig. 1
figure 1

Point estimates for semantic richness (\(\hat{S}\)) and corresponding 95% CI for each of the 27 concepts in CPN

To illustrate how coverage can affect data interpretation, note that in Fig. 1 many 95% CIs overlap. Thus, it is advisable not to take semantic richness estimates (\(\hat{S}\)) at face value. Take for example the REASON and HAPPINESS pair of concepts, for which the same number of participants listed properties (T = 37). Based on the point estimate of semantic richness (Sobs) in Table 1, and given that both concepts have the same T, one would be tempted to conclude that the first concept is richer than the second concept (i.e., 96 > 88). However, as shown in Fig. 1, those concepts’ CIs overlap, suggesting that their point estimates are not statistically different. This is not surprising, given that SR and other such variables are all random variables, which is why computing CIs, as discussed in the current work, is important. Additionally, note that the coverage achieved by REASON and HAPPINESS is different (i.e., 64% and 70%, respectively), though they were sampled using the same number of participants (T). Thus, solely standardizing by T does not guarantee per se that one can draw sound conclusions.

As discussed earlier, a common practice in CPN studies is standardizing sample size (T) across concepts. However, as Table 1 and Fig. 1 illustrate, although Sobs is influenced by sample size (i.e., increasing the number of participants increases the probability of additional properties being produced), sample size operates in conjunction with the distribution of properties in the population. Thus, when sample sizes are standardized across concepts, Sobs becomes only a rough estimator of the true SR. This limits the precision with which we may compare different concepts along that same dimension.

Yet another example of the risks researchers may incur when ignoring coverage in the analysis of CPN data is the following, which we describe only cursorily here (the interested reader can find details in Canessa et al., 2021). Using our own CPN data, we searched for evidence of a relation between a concept’s mean list length (i.e., the mean number of properties produced by participants for a given concept) and its associated properties’ mean dominance (i.e., the mean frequency of those properties that are produced in response to the cueing concept). CPN data has shown that as concepts’ mean list length increases, concepts’ mean property dominance decreases (Canessa & Chaigneau, 2020; Chaigneau et al., 2018; Montefinese et al., 2013; Ruts et al., 2004). Furthermore, it has been shown that the relation’s functional form is hyperbolic (i.e., \(d={b}_0+\frac{b_1}{s}\), where d = dominance, s = mean list length, and b0 and b1 = coefficients estimated from data) (Canessa & Chaigneau, 2020). Given that this relation has been repeatedly found, we would expect it to replicate on any CPN. Our own CPN data with different coverage can be used to illustrate the perils of not taking coverage into account.

Given that in our data concepts present different coverage, we separated concepts into two groups, one with lower coverage data (as may happen if sample size is standardized without regard to property frequency distribution), and one with higher coverage data. To do so, we divided our CPN’s 27 concepts by their calculated coverage \(\hat{C}\), thus producing a lower coverage group (less than 67.1%) and a higher coverage group (67.1% or greater) (see Table 1, and note that we present \(\hat{C}(T)\) without decimals to improve readability). The respective mean coverage values for each group are 60.9% and 71.3%, which are significantly different, t(25) = 6.64, p < .001.

Assume now that each group represents a different study with the goal of testing the inverse relation between mean dominance and mean list length across concepts. To this end, for each group of concepts, values needed for the hyperbolic equation were computed (i.e., d and s), and the b0 and b1 coefficients were estimated by using Ordinary Least Squares (OLS). The regression equation using the lower coverage concepts exhibits the hyperbolic form, but suggests that average dominance increases with average list length, d = 12.141 – 35.186 / s, R2 = 0.503, F(1,11) = 11.124, p = .007. Note that this study would have concluded that concepts for which people list a large number of properties are also concepts with overall high dominance properties, something that could be possible from a purely empirical point of view, but cannot be reconciled with prior literature (Canessa & Chaigneau, 2020; Chaigneau et al., 2018; Montefinese et al., 2013; Ruts et al., 2004). In stark contrast, the same study with the higher coverage concepts produced a curve that replicates previous findings in the literature. The OLS procedure yields a significant hyperbolic regression that inversely relates d and s, d = −1.326 + 39.679 / s, R2 = 0.286, F(1,12) = 4.798, p = .049. This result is evidently consistent with prior literature. Results from our case study suggest that coverage matters, and not taking it into account may lead to erroneous and misleading results. Note that although the difference in coverage between both groups is rather small (10.4%), it still has an important impact on results.

Returning to Table 1, Eqs. (3) and (8) can be used to estimate the additional sampling effort necessary to achieve a certain desired coverage and the corresponding \(\hat{S}\). Equation (3) estimates the number of extra participants for each concept (t*) that would produce a similar coverage across concepts. For example, if we want to reach a coverage for each concept similar to the highest one already attained (78% for GRATITUDE), Table 1 shows the corresponding additional number of participants (t*) that we need for each concept. Note that t* ranges from as small as 1 (for PROFIT) to as high as 105 (for GUILT). In Table 1 we reported a rounded value of 78% for GRATITUDE’s \(\hat{C}\) (a more precise value is 0.7773747). However, given the ceiling function, using Eq. (3) with \({\hat{C}}_{target}\) = 0.78 produces a t* for GRATITUDE equal to 2 (in contrast to the value of 0 in Table 1). Thus, to make values in Table 1 more readily understandable, we used \({\hat{C}}_{target}\) = 0.77 to calculate t* for all concepts. Equation (8) allows us to estimate the expected semantic richness derived from adding t* participants to the sample for each concept (last column in Table 1). Finally, note that when Eq. (3) gives a t* above 2T, the actual t* that must be used is 2T, and that is the value to be entered into Eqs. (2) and (8) to calculate \(\hat{C}\left(T+{t}^{\ast}\right)\) and \(\hat{S}\left(T+{t}^{\ast}\right)\) (see footnote to Table 1).
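For readers who want to see the computation end to end, the following Python sketch implements incidence-based coverage and the extrapolated sample size in the form given by Chao & Jost (2012), on which these estimators are based; the paper's exact Eqs. (1)–(3) appear in an earlier section and may differ in detail, and the numbers in the example call are invented rather than taken from Table 1. Variable names (Q1, Q2, T, U) follow the paper's notation.

```python
import math

def coverage_hat(Q1, Q2, T, U):
    """Estimated sample coverage C-hat(T) for incidence data (Chao & Jost, 2012).
    Q1/Q2: numbers of singleton/doubleton properties; T: participants; U: total tokens."""
    if Q1 == 0:
        return 1.0  # no singletons: the sample is treated as complete
    A = (T - 1) * Q1 / ((T - 1) * Q1 + 2 * Q2)
    return 1.0 - (Q1 / U) * A

def t_star(Q1, Q2, T, U, C_target):
    """Additional participants needed to reach C_target, capped at 2T
    (extrapolation beyond doubling the sample is considered unreliable)."""
    A = (T - 1) * Q1 / ((T - 1) * Q1 + 2 * Q2)
    t = math.log(U * (1 - C_target) / Q1) / math.log(A) - 1
    return min(max(math.ceil(t), 0), 2 * T)

# Invented example: 30 participants, 200 property tokens, 20 singletons, 10 doubletons
print(coverage_hat(20, 10, 30, 200))   # ~0.903
print(t_star(20, 10, 30, 200, 0.95))   # 20 more participants for 95% coverage
```

Note how t* grows sharply as the target coverage approaches 1: each extra percentage point of coverage costs disproportionately many participants when singletons are abundant.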

Strategies to determine sample size by considering coverage

To avoid the abovementioned problems of ignoring coverage and treating values derived from CPN data as population parameters, we recommend that the sample size for each concept in a CPN be established so that approximately equal coverage is achieved across concepts. To do so, we propose two possible strategies, which we explain in the next two subsections.

Two-stage sampling procedure

First, we propose here that Eqs. (1) through (8) provide researchers with an informed means of deciding which concepts to select for the additional effort necessary to match their coverage (a more in-depth discussion of coverage standardization can be found in Chao & Jost, 2012, and Rasmussen & Starr, 1979). In a nutshell, we propose that researchers use a two-stage sampling procedure. In the first stage, researchers should conduct a PLT for each concept with a small number of participants (judging from the literature, 10 or perhaps 15 participants per concept could suffice). With those data, it is possible to use Eq. (1) to estimate the current coverage, and Eq. (3) to estimate the t* additional participants necessary for achieving a desired coverage \({\hat{C}}_{target}\). For example, if with our own CPN we wanted to reach a coverage for each concept similar to the highest coverage already attained (78% for GRATITUDE), Table 1 shows the corresponding additional number of participants for each concept (t*) that we would need. Thus, the researcher now has an informed means of deciding whether to apply the additional effort to match the coverage for all concepts, or just for some of them (perhaps the ones that are theoretically more interesting).

An anonymous reviewer expressed concerns regarding eliminating concepts from a CPN. It might happen that a researcher follows the easiest alternative and simply removes concepts with low coverage after the first sampling stage. For example, in our own CPN, given that ANXIETY would require 58 additional participants to match the highest coverage of 78%, we might have been tempted to immediately eliminate that concept. Thus, the reviewer suggested that establishing a predefined minimum coverage might be useful to lessen that problem. For example, one might tentatively establish a minimum of 60% coverage for all concepts in a CPN. Although such a threshold would be an interesting guideline, we cannot currently define a minimum coverage. However, based on our experience with our own CPNs, we think that a realistic minimum coverage should lie in the 50% to 60% range. We believe that as more CPN studies routinely calculate and report coverage, we will obtain a better estimate of the minimum feasible coverage that researchers should aim to achieve.

In addition to aiming for similar coverage, a researcher would also want reliable SR estimates (\(\hat{S}\)). For example, in our CPN, note in Fig. 1 that for the sample sizes used, S shows large CIs, suggesting that with the current sample sizes there are no significant differences between most concepts’ SRs. In contrast, a good example of what a researcher should aim for is given by the comparison of PROFIT and DANGER, which show small (and, in this case, non-overlapping) CIs and similar coverage (77% and 70%, respectively). Thus, the ideal case would be to jointly assess how many more participants per concept are needed to attain a certain coverage, t*, per Eq. (3), and also the CI width of the corresponding SR estimator \(\hat{S}\left(T+{t}^{\ast}\right)\). However, perusing expressions (1) to (8) shows that no equation exists to estimate the variance of \(\hat{S}\left(T+{t}^{\ast}\right)\), and hence one cannot easily compute the corresponding CI. Consequently, there is no sure and easy way to handle this, forcing the researcher to use his/her best judgment. Two elements available for that judgment are the current width of the S estimate’s CI and the value of the estimated SR when adding t* participants, i.e., \(\hat{S}\left(T+{t}^{\ast}\right)\) per Eq. (8).

First, if for a given concept the current coverage is high (not too far removed from the desired coverage) and the current CI for S is conveniently narrow, then the researcher can be reasonably confident that increasing sample size would pay off as expected (i.e., that the result of the additional sampling effort would be an adequate coverage with a reliable \(\hat{S}\) estimate). In our CPN, the concept PROFIT serves as an example: it currently exhibits 77% coverage and has a relatively narrow CI for S (80.4 to 173.4), where by “relatively narrow” we mean that it does not overlap with other concepts’ CIs for the parameters of interest.

Second, assuming that a researcher wants to find statistically significant differences, he/she can compare the estimated new SR \(\hat{S}\left(T+{t}^{\ast}\right)\) among those concepts to be contrasted, and assess approximately whether those SR estimators are sufficiently different. If the values for \(\hat{S}\left(T+{t}^{\ast}\right)\) are indeed different enough, then the additional sampling effort might be useful. Of course, given that one does not have an estimate of the corresponding CIs, a judgment call is needed. For example, in our CPN, if the researcher is interested in comparing the SR of PROFIT, THOUGHT, and REASON, they might judge that the comparisons between PROFIT and THOUGHT and between PROFIT and REASON might be informative, given that the corresponding \(\hat{S}\left(T+{t}^{\ast}\right)\) are 62.0, 187.2, and 155.9, respectively. On the other hand, the comparison between THOUGHT and REASON might prove uninformative.

Incremental sampling procedure

A more general strategy is to use true incremental sampling (i.e., increasing sample size one participant at a time). Beginning with a small sample of participants, maybe 10 or 15, the sample size can be increased by one participant at a time until the desired coverage and a conveniently small CI for SR are reached. We believe this may be an attractive strategy, given that it tends to optimize sampling effort. Increasing the sample size one participant at a time assumes we can compute coverage dynamically and in real time, and decide accordingly when to stop collecting. Because in typical CPN studies whole phrases need to be coded into property types prior to any analysis, dynamic coverage computations might be costly (i.e., incremental sampling would entail solving the problem of how to perform whole-phrase incremental coding). We are currently working on this problem by implementing an automated coding tool based on machine learning techniques. A different coding procedure, such as the bag-of-words approach, which is amenable to automation (Buchanan et al., 2020), may also enable true incremental sampling.
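If per-participant coding were automated, the stopping rule just described could be as simple as the following Python sketch (the package itself is in R; draw_participant stands in for running one more PLT trial and coding its responses, and the long-tailed property pool in the usage example is synthetic):

```python
import random
from collections import Counter

def coverage_from_lists(lists):
    """Incidence-based coverage estimate from per-participant property lists
    (same Chao & Jost-style estimator used throughout; one incidence per
    participant per property)."""
    counts = Counter(p for plist in lists for p in set(plist))
    Q1 = sum(1 for c in counts.values() if c == 1)   # singletons
    Q2 = sum(1 for c in counts.values() if c == 2)   # doubletons
    T, U = len(lists), sum(counts.values())
    if Q1 == 0:
        return 1.0
    A = (T - 1) * Q1 / ((T - 1) * Q1 + 2 * Q2)
    return 1.0 - (Q1 / U) * A

def sample_until(target, draw_participant, start=15, cap=200):
    """Collect `start` participants, then add one at a time until the
    coverage target (or the participant cap) is reached."""
    lists = [draw_participant() for _ in range(start)]
    while coverage_from_lists(lists) < target and len(lists) < cap:
        lists.append(draw_participant())
    return lists

# Synthetic usage: a Zipf-like property pool stands in for a real concept
random.seed(1)
pool = [f"p{i}" for i in range(60)]
weights = [1 / (i + 1) for i in range(60)]
draw = lambda: random.choices(pool, weights=weights, k=random.randint(5, 9))
data = sample_until(0.80, draw)
print(len(data), round(coverage_from_lists(data), 3))
```

The cap plays the role of the 2T extrapolation limit: past some point, adding participants is no longer a sensible way to chase the target.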

An additional step might also be carried out as part of the described procedure, especially if the researcher needs to shorten the data collection effort. Similar to what was explained for the two-stage sampling procedure, instead of increasing one participant at a time, the researcher can look at t* (the additional number of participants needed to obtain the desired coverage) and increase the sample size for each concept in steps larger than one subject. For example, in our CPN, given that PROFIT needs only one more participant to attain the desired coverage, it would be sensible to increase subjects one at a time. On the other hand, for DECISION, which needs 61 additional participants, it might be more sensible to increase sample size in, for example, five-subject steps. Note that because coverage estimates may change during data collection (Q1 and Q2 can change as additional participants complete the PLT), the proposed assessment of sample size increments is only tentative. Thus, as the estimated coverage changes dynamically, the step size should be adjusted accordingly.
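This batch-size heuristic can be made explicit. The rule below is only one possibility, not the package's; the cap of five subjects per step is an arbitrary assumption matching the example above:

```python
def next_step(t_star, max_step=5):
    """Choose how many participants to add next: one at a time when the
    coverage target is near (small t*), larger capped batches when it is far."""
    return max(1, min(t_star, max_step))

print(next_step(1))    # PROFIT-like case: add one participant at a time
print(next_step(61))   # DECISION-like case: add five at a time
```

Because t* is re-estimated after every batch, the step size shrinks automatically as the target coverage is approached.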

Finally, the researcher should also take into account the initial width of the concepts’ SR CI to assess whether she might be able to obtain not only the desired coverage, but also a relatively narrow CI. If those estimates show that the CIs are too large and continue being so after participants are added, then the researcher might finish sampling for those concepts and even perhaps remove them from the study. This suggestion is similar to the one presented in the last paragraph of the previous subsection on the two-stage sampling procedure.

Given that the procedure outlined in this subsection is practical in nature and subject to many considerations that may change during data collection, we need a dynamic CPN study to illustrate it. However, no published CPN study records and reports all the necessary dynamical details (i.e., how the data change over time as participants list properties), presumably because such details are deemed unimportant. In the same vein, we did not record the dynamical details of data collection for our own CPN. Thus, in the next section, we use a CPN simulator to emulate the data collection effort and illustrate the incremental sampling approach.

CPN simulator and data collection using incremental sampling

To illustrate the incremental sampling procedure, we implemented a simulator (function property_simulator), which models a probability distribution from which properties are sampled in a PLT. Bear in mind that the simulator's goal is to serve as an example of the incremental sampling procedure; it thus does not need to accurately model any real probability distribution, but only to provide a meaningful representation of the sampling procedure. To develop the simulator, we obtained the empirical sample probability distribution of the properties for a single concept. Then, we calculated the mean (mnp) and standard deviation (sdnp) of the number of properties listed by the CPN’s participants. Finally, given that we have only a sample from the true unobserved probability distribution of the properties, and we know that the true distribution should exhibit long tails (Chaigneau et al., 2018), we added new properties with frequency equal to 1 to the empirical sample distribution, so as to lengthen its tails and obtain Q1 and Q2 values similar to those originally calculated from our CPN data. After adjusting the distribution, we can sample the number of properties (NP ≥ 1) from a normal distribution \(N(mnp,{sdnp}^2)\).

The simulator is implemented by the function property_simulator, which samples properties from an artificial distribution. For this purpose, the function receives as inputs: orig_data (a data frame with the ID, concept, and properties for a single concept), new_words (the number of words with frequency 1 to be added to the empirical distribution), and number_subjects (the number of artificial subjects who will list properties for the concept). Using these inputs, the function estimates the mean (mnp) and standard deviation (sdnp) of the number of properties (NP) listed by the CPN’s participants. Then, it adds new_words properties with frequency 1 and estimates the empirical distribution. Finally, it generates the new data frame with the properties sampled for each subject. For this purpose, it samples the number of properties (NP ≥ 1) for each subject from a \(N(mnp,{sdnp}^2)\) and then samples NP properties from the estimated distribution. The following example shows the properties sampled for two subjects, using concept C1 of the test data and 20 new words.
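The description above can be condensed into a short sketch. The following Python analogue of property_simulator (the actual function is in R; details such as drawing distinct properties per subject and the naming of the padded singletons are our assumptions) pads the empirical distribution with frequency-1 properties and samples per-subject lists of normally distributed length:

```python
import random
from collections import Counter

def property_simulator(orig_lists, new_words, number_subjects, seed=None):
    """Python sketch of the simulator: orig_lists holds one property list per
    original participant; new_words singletons are added to lengthen the tail."""
    rng = random.Random(seed)
    # Mean and SD of the number of properties (NP) listed per participant
    lengths = [len(p) for p in orig_lists]
    mnp = sum(lengths) / len(lengths)
    sdnp = (sum((l - mnp) ** 2 for l in lengths) / (len(lengths) - 1)) ** 0.5
    # Empirical frequency distribution, padded with frequency-1 "new" properties
    freqs = Counter(p for plist in orig_lists for p in plist)
    for i in range(new_words):
        freqs[f"new{i + 1}"] = 1
    props, weights = zip(*freqs.items())
    subjects = []
    for _ in range(number_subjects):
        np_i = max(1, round(rng.gauss(mnp, sdnp)))   # NP >= 1
        chosen = set()
        while len(chosen) < min(np_i, len(props)):   # distinct properties per subject
            chosen.add(rng.choices(props, weights=weights, k=1)[0])
        subjects.append(sorted(chosen))
    return subjects
```

Each returned element is one artificial subject's property list; simulating the successive stages of a PLT then amounts to repeated calls with an increasing number_subjects.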

[Code listing (figure d in the original): example call to property_simulator showing the properties sampled for two subjects.]

The above results show previously existing properties (p5, p3, p10, p9, and p2) and new properties (added with frequency one) with ID numbers 1, 3, 5, and 20.

To illustrate the incremental sampling procedure using the CPN simulator applied to our own CPN data collection, we obtained the empirical sample property probability distribution for three concepts in our CPN: DECISION (representing a concept with a low initial coverage of 54%), HOPE (with a medium initial coverage of 68%), and GRATITUDE (with the highest initial coverage of 78%). Recall from our previous discussion that we want concepts with similar and relatively high coverage, and that the highest obtained coverage in our CPN was 78% for GRATITUDE (see Table 1). Hence, for illustrating the incremental sampling procedure we set a target of 80% coverage for all three concepts. Given that we want to show the incremental sampling procedure as if we were conducting a new CPN, we started the three-concept CPN entirely anew. To begin, the simulator added 80 new words to the empirical distribution to emulate the long tails of unobserved properties, and then we simulated a PLT with 15 participants for each concept (first stage). We then increased the number of participants until we reached our desired coverage (second to fourth stages), obtaining the results shown in Table 2. To replicate these results, please refer to the Appendix.

Table 2 Results from R package corresponding to Q1 (no. of singletons), Q2 (no. of doubletons), T (no. of participants who listed properties), Sobs (observed semantic richness, SR),\(\mathcal{U}\) (total no. of properties listed), \(\hat{S}\) (estimate of SR), SD of \(\hat{S}\), 95% CI for S, \(\hat{C}(T)\) (estimated coverage), t* (increase in sample size to obtain 80% coverage), and \(\hat{S}\left(T+{t}^{\ast}\right)\) (estimated SR when adding t* participants) for the simulated CPN and three selected concepts

From Table 2, first stage, we can see that the number of additional participants (t*) needed to obtain the 80% target coverage ranges from 17 to 30. Thus, we decided to conduct the second stage, adding 20 subjects (T = 35). Now we see that DECISION increased its coverage to 74.4% and the other two concepts to about 60%. If the researcher does not mind doing the PLT with many more participants (i.e., she has the time and resources to do so), she might conduct a third stage adding 40 new participants, a round number close to t* for GRATITUDE and HOPE. However, remember that our goal is to obtain similar coverage among concepts. Hence, we might consider adding 10 participants (rounded t* for DECISION) (T = 45) to see whether we obtain the 80% coverage for DECISION, and also to gauge how close the coverage of GRATITUDE and HOPE gets to 80%. Additionally, note that the CIs calculated for the second stage imply that we will perhaps not get statistically significant differences in semantic richness (SR) among the concepts. According to \(\hat{S}\)(T + t*) (estimated SR when adding t* participants), we might hope that at least DECISION could show a statistically significant difference in SR from HOPE.

From the third stage, we can see that the three concepts are now closer to attaining the 80% coverage and that the differences in SR among the concepts are becoming smaller. Here we face two possibilities: add the corresponding t* participants to each concept (to obtain similar coverage among the three concepts), or increase the target coverage above 80%, hoping that the difference in SR between HOPE and the other two concepts reaches significance. We decided to take the first action. Hence, in the fourth stage, we added five participants to DECISION (increasing its T to 50), 19 participants to GRATITUDE (increasing its T to 64), and 20 participants to HOPE (increasing its T to 65) (see Table 2, fourth stage). As Table 2 shows, we reached an approximately equal coverage of 82% among the three concepts. However, and as expected, the differences in SR among the three concepts did not attain statistical significance. The same procedure can be applied to all 27 concepts in our CPN; we omit those analyses due to space limitations. We believe that the example clearly shows the advantage of standardizing coverage and using incremental sampling in a CPN. Additionally, the calculations needed to execute the incremental sampling process (e.g., coverage, SR, t*) are simple to perform using the developed R package.

Conclusions

Throughout this paper we have argued that researchers building CPNs should set sample sizes so that a similar coverage is achieved among the concepts considered in the CPN. By attaining a similar coverage, sounder comparisons among the CPN’s concepts can be made along many dimensions (semantic richness [SR], property dominance, etc.). Additionally, standardizing by coverage ensures that results of different CPNs may be compared, because they share an approximately similar representativeness of the properties that describe each concept. Hence, by routinely calculating and reporting coverage in CPNs, researchers will be better able to see whether the results of those CPNs are comparable and/or generalizable to other types of studies. This may also allow for the replication of CPNs and possibly enable studies that are now infeasible, such as comparing semantic representations across different groups (societal, cultural, age, ethnic, etc.). Of course, standardizing coverage is more difficult than setting an equal sample size for each of the concepts considered in a CPN (i.e., using an equal number of participants for listing properties for each concept). However, as shown by our analyses, it is worth the extra effort. The cases presented (comparing SR among concepts of the same CPN, and analyzing relations between different variables of a CPN) are but two illustrations of the deleterious effects and unsound conclusions that may ensue when coverage is not considered in analyses of CPN data.

To facilitate this task, we have developed an R package that performs the necessary calculations and that can effectively guide the researcher's sampling effort. We believe that this package is simple to use, and its inputs are the same data that researchers already collect in CPNs. Hence, researchers can readily apply the package when conducting their CPN studies, from data collection to data analysis. Moreover, by using incremental sampling, the time and resources needed to collect the properties that describe each concept may be used more efficiently. As our example of the incremental sampling procedure showed, one can continuously and dynamically adjust the number of participants listing properties for each concept, so as to achieve a desired coverage while minimizing sampling effort (i.e., using only the necessary number of participants per concept). Notably, during that same process, other considerations may inform the researcher's decisions regarding sample size. For example, he/she might drop concepts from the CPN if the calculated values of variables of interest do not achieve the sought characteristics (e.g., SRs that do not reach statistically significant differences, concepts that do not attain the desired coverage). All in all, we believe that using the recommended procedures and analyses that our package enables is a first step in the right direction: formalizing CPN studies and furthering the applicability and generalizability of their results.