Skip to main content
Log in

Fitting the truncated negative binomial distribution to count data

A comparison of estimators, with an application to groundfishes from the Mauritanian Exclusive Economic Zone

  • Published:
Environmental and Ecological Statistics Aims and scope Submit manuscript

Abstract

Modeling empirical distributions of repeated counts with parametric probability distributions is a frequent problem when studying species abundance. One must choose a family of distributions which is flexible enough to take into account very diverse patterns and possess parameters with clear biological/ecological interpretations. The negative binomial distribution fulfills these criteria and was selected for modeling counts of marine fish and invertebrates. This distribution depends on a vector \(\left( K,\mathfrak {P}\right) \) of parameters, and ranges from the Poisson distribution (when \(K\rightarrow +\infty \)) to Fisher’s log-series, when \(K\rightarrow 0\). Moreover, these parameters have biological/ecological interpretations which are detailed in the literature and in this study. We compared three estimators of K, \(\mathfrak {P}\) and the parameter \(\alpha \) of Fisher’s log-series, following the work of Rao CR (Statistical ecology. Pennsylvania State University Press, University Park, 1971) on a three-parameter unstandardized variant of the negative binomial distribution. We further investigated the coherence underlying parameter values resulting from the different estimators, using both real count data collected in the Mauritanian Exclusive Economic Zone (MEEZ) during the period 1987–2010 and realistic simulations of these data. In the case of the MEEZ, we first built homogeneous lists of counts (replicates), by gathering observations of each species with respect to “typical environments” obtained by clustering the sampled stations. The best estimation of \(\left( K,\mathfrak {P}\right) \) was generally obtained by penalized minimum Hellinger distance estimation. Interestingly, the parameters of most of the correctly sampled species seem compatible with the classical birth-and-dead model of population growth with immigration by Kendall (Biometrika 35:6–15, 1948).

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6

Similar content being viewed by others

References

  • Aidoo EN, Mueller U, Goovarets P, Hyndes GA (2015) Evaluation of geostatistical estimators and their applicability to characterise the spatial patterns of recreational fishing catch rates. Fish Res 168:20–32

    Article  PubMed  PubMed Central  Google Scholar 

  • Anscombe FJ (1950) Sampling theory of the negative binomial and logarithmic series distributions. Biometrika 36:358–382

    Article  Google Scholar 

  • Basu A, Shioya H, Park C (2011) Statistical inference. The minimum distance approach, vol 120., Monographs on statistics and applied probabilityChapman & Hall/CRC, Boca Raton

    Google Scholar 

  • Beran RJ (1977) Minimum Hellinger distance estimates for parametric models. Ann Stat 5:445–463

    Article  Google Scholar 

  • Bergerard P, Domain F, Richer de Forges B (1983) Evaluation par chalutage des ressources démersales du plateau continental mauritanien. Bull. Centr. Nat. Rech. Océanogr.et des Pêches, Nouadhibou 11:217–290

  • Bliss CI, Fisher RA (1953) Fitting the negative binomial distribution to biological data. Biometrics 9:176–200

    Article  Google Scholar 

  • Boswell MT, Patil GP (1970) Chance mechanisms generating the negative binomial distribution. In: Patil GP (ed) Random counts in models and structures, vol 1. Pensylvania State University Press, University Park, pp 3–22

    Google Scholar 

  • Boswell MT, Patil GP (1971) Chance mechanisms generating logarithmic series distribution used in the analysis of number of species and individuals. In: Patil G, Pielou E, Waters W (eds) Statistical ecology, vol 1. Pennsylvania State University Press, University Park, pp 99–130

    Google Scholar 

  • Chen J, Rubin H (1986) Bounds for the difference between median and mean of Gamma and Poisson distributions. Stat Probab Lett 4:281–283

    Article  Google Scholar 

  • Cohen JE (1972) Markov population processes as models of primate social and population dynamics. Theor Popul Biol 3(2):119–134

    Article  CAS  PubMed  Google Scholar 

  • Dabo-Niang S, Ferraty F, Vieu P (2007) On the using of modal curves for radar waveforms classication. Comput Stat Data Anal 51(10):4878–4890

    Article  Google Scholar 

  • Dabo-Niang S, Yao AF, Pischedda L, Cuny P, Gilbert F (2010) Spatial mode estimation for functional random fields with application to bioturbation problem. Stoch Environ Res Risk Assess 24(4):487–497

    Article  Google Scholar 

  • Dabo-Niang S, Hamdad L, Ternynck C, Yao AF (2014) A kernel spatial density estimation allowing for the analysis of spatial clustering. Application to Monsoon Asia Drought Atlas data. Stoch Environ Res Risk Assess 28:2075–2099

    Article  Google Scholar 

  • Di Salvo F, Ruggieri M, Plaia A (2015) Functional principal components analysis for multivariate multidimensional environmental data. Environ Ecol Stat 22(4):739–757

    Article  Google Scholar 

  • Diserud OH (2001) Detecting changes in diversity in a fluctuating environment based on simulation of stochastic processes. Oceanol Acta 24(5):505–517

    Article  Google Scholar 

  • Domain F (1986) Evaluation par chalutage des ressources démersales du plateau continental mauritanien. In: Josse E et Garcia S (eds). Description et évaluation des ressources halieutiques de la ZEE mauritanienne. Rapport du Groupe de travail CNROP/FAO/ORSTOM, Nouadhibou, Mauritanie, 16–27 septembre 1985, COPACE/PACE Séries 86/37: 248–273

  • Elliot JM (1979) Some methods for the statistical analysis of samples of benthic invertebrates, 2nd edn. Freshwater Biological Association Scientific Publication No. 25, Ambleside

  • Ferraty F, Vieu P (2006) Nonparametric functional data analysis. Theory and practice, Springer series in statistics. Springer, New York

    Google Scholar 

  • Fisher RA, Corbet A, Williams CB (1943) The relation between the number of species and the number of individuals in a random sample of an animal population. J Anim Ecol 12:42–58

    Article  Google Scholar 

  • Gaertner JC, Mazouni N, Sabatier R, Millet B (1999) Spatial structure and habitat associations of demersal assemblages in the Gulf of Lions: a multicompartimental approach. Mar Biol 135:199–208

    Article  Google Scholar 

  • Gasser T, Hall P, Presnell B (1998) Nonparametric estimation of the mode of a distribution of random curves. J R Stat Soc B 60(4):681–691

    Article  Google Scholar 

  • Gibbs AL, Su FE (2002) On choosing and bounding probability metrics. Int Stat Rev 70(3):419–435

    Article  Google Scholar 

  • Hooten MB, Wikle C (2008) A hierarchical Bayesian non-linear spatio-temporal model for the spread of invasive species with application to the Eurasian Collared-Dove. Environ Ecol Stat 15:59–70

    Article  Google Scholar 

  • Johnson AF, Jenkins SR, Hiddink JG, Hinz H (2013) Linking temperate demersal fish species to habitat: scales, patterns and future directions. Fish Fish 14:256–280

    Article  Google Scholar 

  • Kendall DG (1948) On some modes of population growth leading to R. A. Fisher’s logarithmic series distribution. Biometrika 35:6–15

    Article  CAS  PubMed  Google Scholar 

  • Kunin WE, Gaston KJ (1997) The biology of rarity. Causes and consequence of rare-common differences. Chapman & Hall, London

    Google Scholar 

  • Lamouroux N, Capra H, Pouilly M, Souchon Y (1999) Fish habital preferences in large streams of southern France. Freshw Biol 42:673–687

    Article  Google Scholar 

  • Lewin W-C, Freyhof J, Huckstorf V, Mehner T, Wolter C (2010) When no catches matter: coping with zeros in environmental assessments. Ecol Indic 10:572–583

    Article  Google Scholar 

  • Magurran AE (2005) Species abundance distributions: pattern or processes? Funct Ecol 19:177–181

    Article  Google Scholar 

  • Mandal A, Basu A (2013) Minimum disparity estimation: improved efficiency through inliers modification. Comput Stat Data Anal 64:71–86

    Article  Google Scholar 

  • Manté C, Claudet J, Rebzani-Zahaf C (2003) Fairly processing rare and common species in multivariate analysis of ecological series. Application to macrobenthic communities from Algiers harbour. Acta Biotheor 51:277–294

    Article  PubMed  Google Scholar 

  • Manté C, Durbec JP, Dauvin JC (2005) A functional data-analytic approach to the classification of species according to their spatial dispersion. Application to a marine macrobenthic community from the Bay of Morlaix (Western English Channel). J Appl Stat 32(8):831–840

    Article  Google Scholar 

  • Nielsen JR, Kristensen K, Lewy P, Bastardie F (2014) a statistical model for estimation of fish density including correlation in size, space, time and between species from research survey data. PloS One 9(6):1–15

    Google Scholar 

  • O’Neil MF, Faddy MJ (2003) Use of binary and truncated negative binomial modelling in the analysis of recreational catch data. Fish Res 60:471–477

    Article  Google Scholar 

  • Quenouille MH (1949) A relation between the logarithmic. Poisson and negative binomial series. Biometrics 5(2):162–164

    Article  CAS  PubMed  Google Scholar 

  • Rao CR (1962) Efficient estimates and optimum inference procedures in large samples (with discussion). J R Stat Soc, Ser B (Methodol) 24(1):46–72

    Google Scholar 

  • Rao CR (1971) Some comments on the logarithmic series distribution in the analysis of insect trap data. In: Patil G, Pielou E, Waters W (eds) Statistical ecology, vol 1. Pennsylvania State University Press, University Park, pp 131–142

    Google Scholar 

  • Serfling R (2004) Nonparametric multivariate descriptive measures based on spatial quantiles. J Stat Plan Inference 123:259–278

    Article  Google Scholar 

  • Simpson DG (1987) Minimum Hellinger distance estimation for the analysis of count data. J Am Stat Assoc 82(399):802–807

    Article  Google Scholar 

  • Taylor LR, Kempton RA, Woiwod IP (1976) Diversity statistics and the log-series model. J Anim Ecol 45(1):255–272

    Article  Google Scholar 

  • Taylor LR, Woiwod IP, Perry JN (1979) The negative binomial as a dynamic ecological model for aggregation, and the density dependence of K. J Anim Ecol 48:289–304

    Article  Google Scholar 

  • Vaudor L, Lamouroux N, Olivier JM (2011) Comparing distribution models for small samples of overdispersed counts of freshwater fish. Acta Oecol 37(3):170–178

    Article  Google Scholar 

  • Wang Y (1996) Estimation problems for the two-parameters negative binomial distribution. Stat Probab Lett 26:113–114

    Article  Google Scholar 

  • Warton DI (2005) Many zeros does not mean zero inflation: comparing the goodness-of-fit of parametric models to multivariate abundance data. Environmetrics 16:275–289

    Article  Google Scholar 

  • Watterson GA (1974) Models for the logarithmic species abundance distributions. Theor Popul Biol 6:217–250

    Article  CAS  PubMed  Google Scholar 

  • Williams CB (1944) Some applications of the logarithmic series and the index of diversity to ecological problems. J Ecol 32(1):1–44

    Article  Google Scholar 

  • Williams CB (1947) The logarithmic series and its application to biological problems. J Ecol 34(2):253–272

    Article  Google Scholar 

  • Williams CB (1952) Sequences of wet and of dry days considered in relation to the logarithmic series. Q J R Meteorol Soc 73(335):91–96

    Article  Google Scholar 

  • Wyshak G (1974) Algorithm AS 68: a program for estimating the parameters of the truncated negative binomial. J R Stat Soc, Ser C (Appl Stat) 23(1):87–91

    Google Scholar 

Download references

Acknowledgments

We thank the Mauritanian Institute of Oceanographic Research and Fisheries (IMROP) and the Department of Cooperation and Cultural Action of the Embassy of France in Mauritania for their support for this study. We also thank all scientists who contributed to field surveys and data collection, Jean-Pierre Durbec for helpful discussions, and anonymous reviewers for their constructive comments. We are very grateful to Starrlight Augustine for greatly improving the English text.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Claude Manté.

Additional information

Handling Editor: Pierre Dutilleul.

Appendices

Appendix 1: Determination of a threshold for the truncated Hellinger distance

While results about the asymptotic distribution of our estimators abound, nothing is known about the distribution of the goodness-of-fit index [see Formula (9)]

$$\begin{aligned} d_{H}^{T}\left( TNBD\left( \hat{K},\mathfrak {\hat{P}}\right) ,\,TNBD\left( K,\mathfrak {P}\right) \right) \end{aligned}$$
(10)

where \(\left( \hat{K},\mathfrak {\hat{P}}\right) \) is an estimate of \(\left( K,\mathfrak {P}\right) \). In order to determine from \(d_{H}^{T}\) the species which were correctly fitted, we performed a Monte Carlo study. It consisted in generating a sample of the statistics (10) for each one of the three estimators used, from a population of “negative binomial species” similar to the genuine population of the C4 habitat considered as a reference structure. This study is detailed hereunder.

1.1 The reference distribution of \(\left( K,\mathfrak {P}\right) \)

We plotted on Fig. 7 the minimum \(PHD_{h}\) estimates of the vector \(\left( K,\mathfrak {P}\right) \) associated with the species collected in the C4 habitat. About 35.6 % of the species were associated with very small values of the first parameter (\(\hat{K}\le e^{-10}\)); discarding these species, we could fit a bi-dimensional log-normal distribution of parameters \(\left( \mu _{B},\Sigma _{B}\right) \) to the remaining vectors of estimates, whose confidence ellipsoids are also represented on Fig. 7. Neither \(Log\left( \hat{K}\right) \) nor \(Log\left( \mathfrak {\hat{P}}\right) \) strictly obeyed a normal distribution, but this model was retained for the sake of simplicity, since the corresponding 95 % confidence region widely covers the data (see Fig. 7). As for the discarded species, we postulated that \(Log\left( \mathfrak {\hat{P}}\right) \) could be considered as obeying some Gaussian distribution, \(\mathcal {N}\left( \mu _{S},\sigma _{S}\right) \).

Fig. 7
figure 7

Fit of the estimated parameters for the C4 data. The ellipsoids correspond to 50 and 95 % confidence regions for the reference distribution, \(\mathcal {N}\left( \mu _{B},\Sigma _{B}\right) \)

1.2 Generating a “population” consistent with the reference distribution

To build a sample of \(d_{H}^{T}\left( TNBD\left( \hat{K},\mathfrak {\hat{P}}\right) ,\,TNBD\left( K,\mathfrak {P}\right) \right) \) for counts having the same overall characteristics as the C4 data, we generated random counts of 300 “NB species”, whose random parameters obeyed the mixture distribution

$$\begin{aligned} \mathcal {M}:=0.356\,\mathcal {U}\left( \left[ e^{-12},e^{-8}\right] \right) \otimes \mathcal {LN}\left( \mu _{S},\sigma _{S}\right) +0.644\,\mathcal {LN}\left( \mu _{B},\Sigma _{B}\right) \end{aligned}$$

where \(\left( \mu _{S},\sigma _{S}\right) =\left( -0.701062,2.68525\right) \) and \(\left( \mu _{B},\Sigma _{B}\right) \) were estimated from the C4 data. In practice, the parameters \(\left( k,p\right) \) of each species were first drawn according to \(\mathcal {M}\); then a sample of \(\beta =3000\) (or \(\beta =6000\) when \(k\le e^{-8}\)) counts obeying \(NBD\left( k,p\right) \) were drawn. The simulated data were then processed the same way as the MEEZ ones.

Fig. 8
figure 8

In black: 50 and 95 % confidence ellipsoids  for the reference distribution \(\mathcal {N}\left( \mu _{B},\Sigma _{B}\right) \); in gray: same confidence ellipsoids for the distribution \(\mathcal {N}\left( \widehat{\mu _{B}},\widehat{\Sigma _{B}}\right) \) obtained from the mixture distribution \(\mathcal {M}\). Dots correspond to the parameters of the “NB species”, estimated by PHD \(_{h}\)

On Fig. 8, we superimposed to the parameters of the species (estimated by \(PHD_{h}\)), confidence ellipsoids of the reference distribution \(\mathcal {LN}\left( \mu _{B},\Sigma _{B}\right) \) and of the distribution \(\mathcal {LN}\left( \widehat{\mu _{B}},\widehat{\Sigma _{B}}\right) \), whose parameters were estimated from the independent draws of \(\mathcal {M}\). It is worth noting that in this case, there was no significant difference between the empirical distribution of \(\left( \hat{K},\mathfrak {\hat{P}}\right) \) and the reference distribution \(\mathcal {LN}\left( \mu _{B},\Sigma _{B}\right) \) (P values: Cramer–Von Mises =0.544087; Pearson \(\chi ^{2}\) = 0.523489).

1.3 Results

About 30 % (93) of the “species” were observed \({<}6\) times, and discarded. The goodness-of-fit density estimates for the remaining ones are plotted on Fig. 9. The three estimators perform equally well in the case of genuine TNB distributions, while the goodness-of-fit by ULSD is very different. Among the 207 remaining species, only 3 % (6) were better fitted by ULSD, while the best estimator for TNBD was \(PHD_{h}\) (185 species: 89 % of the total), followed by MLE (10 species) and pseudo-MLE (6 species). Thus almost all “aggregative species” were discarded due to their rarity.

Fig. 9
figure 9

Simulations: density estimates of the four goodness-of-fit criteria

The quantile of order 0.95 of the goodness-of-fit associated with \(PHD_{h}\) was 0.531096; consequently, we considered that \(d_{H}^{T} = 0.53\) is an approriate threshold value that would not be exceeded by genuine negative binomial species. This threshold was used in Sect. 6.

Appendix 2: The Williams–Rao’s condition and the estimation of \(\beta \)

Notice that the equality \(\alpha =K\,\beta \) is implicit in Eq. (4), but that it not a constraint in the root finding of the pseudo-MLE system (6). Consequently, this relationship must be considered as a sign of consistency of the estimation of \(\left( K,\mathfrak {P,\alpha }\right) \). If in addition, the first condition of (5) is fulfilled, we should also have: \(\dfrac{\alpha _{LS}}{K\,\beta }\approx \dfrac{\alpha _{Rao}}{K\,\beta }\approx 1\). We investigated the validity of this relationship by performing 50 Monte Carlo experiments with “negative binomial species” random draws, in each one of four typical cases:

  • the “mean” one: \(\left( K,\mathfrak {P}\right) =\left( 1.193,73.15\right) \) is the mean of the bivariate Log-normal distribution fitting the C4 habitat non-aggregative species (see Sect. 5.2 and Appendix 1)

  • the “common” one: \(\left( K,\mathfrak {P}\right) =\left( 0.7767,14.43\right) \) is the spatial median (Serfling 2004) of the parameters of the simulated “NB species”; in this case, we chose \(\beta =10^{5}\) as the sample size of each one of the 50 simulations

  • an “aggregative” case: \(\left( K,\mathfrak {P}\right) =\left( 0.0001,14.43\right) \), with \(\beta =10^{7}\)

  • a “bell-shaped” case: \(\left( K,\mathfrak {P}\right) =\left( 10,14.43\right) \), with \(\beta =10^{4}\).

In these four cases, the best fit was obtained with \(PHD_{h}\), and we observed that \(\alpha _{LS}\) and \(\alpha _{Rao}\) could be considered as normally distributed (according to the Cramer-von Mises test), and that the mean of \(\alpha _{Rao}\) was always close to \(K\,\beta \) (T test), while the relationship \(\dfrac{\alpha _{LS}}{K\,\beta }\approx 1\) was verified only in the aggregative case (see Fig. 12). The equality \(\alpha _{LS}=K\,\beta \) was unacceptable in the “common” case (see Fig. 11), as well as in the bell-shaped one (Fig. 10) and in the “mean” case (not shown).

Fig. 10
figure 10

The Williams–Rao’s condition in the bell-shaped case

Fig. 11
figure 11

The Williams–Rao’s condition in the “common” case

Fig. 12
figure 12

The Williams–Rao’s condition in the aggregative case—the blue density is \(\dfrac{\alpha _{Rao}}{K\,\beta }\); the yellow one is \(\dfrac{\alpha _{LS}}{K\,\beta }\) (Color figure online)

From another side, dealing with real data, we often have to face an additional estimation task: because of the dubious nature of the collected zeros (if there are), the true value of \(\beta \) is unknown and we only know the number \(\beta ^{+}\) of strictly positive counts. Then, it is classical to estimate \(\beta \) by:

$$\begin{aligned} \beta ^{\left( \hat{K},\mathfrak {\hat{P}}\right) }:=\dfrac{\beta ^{+}}{\left( 1-1/\left( 1+\mathfrak {\hat{P}}\right) ^{\hat{K}}\right) } \end{aligned}$$
(11)

where \(\left( \hat{K},\mathfrak {\hat{P}}\right) \) is an estimation of the TNBD parameters, obtained either from pseudo-MLE or \(PHD_{h}\). The bell-shaped case is problem-free, because the probability of zero is extremely weak.

Let us now examine the most interesting case: the “common” one. On Fig. 13, we plotted on the left panel KDE estimates of the 50 values of the expression (11) obtained with \(\left( \hat{K},\mathfrak {\hat{P},\hat{\alpha }}\right) =\left( K_{Rao},\mathfrak {P}_{Rao},\alpha _{Rao}\right) \) and divided by \(\beta \): they were very close to 1. On the right panel, we plotted the values of (11) obtained with \(\left( \hat{K},\mathfrak {\hat{P}}\right) =\left( K_{PHD_{h}},\mathfrak {P}_{PHD_{h}}\right) \). Thus, this figure shows that all estimates of \(\beta \) are excellent in the “common” case; as a consequence, the results above concerning the Williams–Rao’s condition (see Fig. 11) stay valid in this case.

Fig. 13
figure 13

Estimating \(\beta \): the “common” case

Quite similar results were obtained in the “mean” case, as the reader can see on Fig. 14.

Fig. 14
figure 14

Estimating \(\beta \): the “mean” case

Things are very different in the aggregative situation, as the reader can see on Fig. 15: the estimator (11) based on \(\left( K_{Rao},\mathfrak {P}_{Rao},\alpha _{Rao}\right) \) strongly underestimates or overestimates \(\beta \). More precisely, in about 40 % of the samples, \(\beta \) was highly overestimated, which is natural, since \(\lim _{\hat{K}\rightarrow 0}\beta ^{\left( \hat{K},\mathfrak {\hat{P}}\right) }=\dfrac{\beta ^{+}}{\hat{K}\,\ln \left( 1+\mathfrak {\hat{P}}\right) }\). As for the estimator (11) associated with \(\left( \hat{K},\mathfrak {\hat{P}}\right) =\left( K_{PHD_{h}},\mathfrak {P}_{PHD_{h}}\right) \), it always underestimated \(\beta \). Consequently, when K is very small, the consistency of the estimations of the parameters of \(UNBD\left( K,\mathfrak {P,\alpha }\right) \) is questionable and the log-series model is probably more sound—even if the fit is not quite as good as with \(UNBD\left( K,\mathfrak {P,\alpha }\right) \). This is undoubtedly the meaning of Fig. 12.

Fig. 15
figure 15

Estimating \(\beta \) in the aggregative case; the 5 % upper values were dropped

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Manté, C., Kidé, S.O., Yao-Lafourcade, AF. et al. Fitting the truncated negative binomial distribution to count data. Environ Ecol Stat 23, 359–385 (2016). https://doi.org/10.1007/s10651-016-0343-1

Download citation

  • Received:

  • Revised:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10651-016-0343-1

Keywords

Navigation