Skip to main content
Log in

Scalable and exact sampling method for probabilistic generative graph models

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Interest in modeling complex networks has fueled the development of multiple probabilistic generative graph models (PGGMs). PGGMs are statistical methods that model the network distribution and match common characteristics of real world networks. Recently, scalable sampling algorithms for well known PGGMs, made the analysis of large-scale, sparse networks feasible for the first time. However, it has been demonstrated that these scalable sampling algorithms do not sample from the original underlying distribution, and sometimes produce very unlikely graphs. To address this, we extend the algorithm proposed in Moreno et al. (in: IEEE 14th international conference on data mining, pp 440–449, 2014) for a single model and develop a general solution for a broad class of PGGMs. Our approach exploits the fact that PGGMs are typically parameterized by a small set of unique probability values—this enables fast generation via independent sampling of groups of edges with the same probability value. By sampling within groups, we remove bias due to conditional sampling and probability reallocation. We show that our grouped sampling methods are both provably correct and efficient. Our new algorithm reduces time complexity by avoiding the expensive rejection sampling step previously necessary, and we demonstrate its generality, by outlining implementations for six different PGGMs. We conduct theoretical analysis and empirical evaluation to demonstrate the strengths of our algorithms. We conclude by sampling a network with over a billion edges in 95 s on a single processor.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3

Similar content being viewed by others

Notes

  1. In this paper we will use graphs and networks interchangeably.

  2. In the worst case scenario, \(\varTheta \) is a \(N_v\times N_v\) matrix that explicitly represents \({\mathcal {P}}\) (\(\varTheta \equiv {\mathcal {P}}\)).

  3. Note that \({\mathcal {M}}_{p}\) includes the time needed to calculate probability values, which we referred to earlier in Sect. 2.2.1 as \({\mathcal {M}}_{c}\).

  4. In the worst case, lookups/insertions are \(O(N_v)\) for hash tables (when the hash function puts all entries are inserted into the same bucket). To avoid this worst case scenario, one can use a balanced binary tree (e.g., red-black trees) for the implementation, which is \(O(\log N_v)\).

References

  • Ahmadi B, Kersting K, Hadiji F (2010) Lifted belief propagation: Pairwise marginals and beyond. In: Proceedings of the 5th European workshop on probabilistic graphical models

  • Aicher C, Jacobs AZ, Clauset A (2015) Learning latent block structure in weighted networks. J Complex Netw 3(2):221–248

    Article  MathSciNet  Google Scholar 

  • Barabasi A, Albert R (1999) Emergence of scaling in random networks. Science 286:509–512

    Article  MathSciNet  Google Scholar 

  • Benson AR, Riquelme C, Schmit S (2014) Learning multifractal structure in large networks. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1326–1335

  • Bu Z, Xia Z, Wang J, Zhang C (2013a) A last updating evolution model for online social networks. Physica A 392(9):2240–2247

    Article  Google Scholar 

  • Bu Z, Xia Z, Wang J, Zhang C (2013b) A last updating evolution model for online social networks. Physica A 392(9):2240–2247

    Article  Google Scholar 

  • Bui HH, Huynh TN, Riedel S (2013) Automorphism groups of graphical models and lifted variational inference. In: Proceedings of the 29th conference on uncertainty in artificial intelligence, UAI’13, pp 132–141

  • Choi J, Guzman-Rivera A, Amir E (2011) Lifted relational kalman filtering. In: Proceedings of the 22nd international joint conference on artificial intelligence—3, IJCAI’11, pp 2092–2099

  • Chung F, Lu L (2002) The average distances in random graphs with given expected degrees. Proc Nat Acad Sci 99(25):15,879–15,882

    Article  MathSciNet  Google Scholar 

  • de Salvo R, Natarajan S, Bui H, Shavlik J, Russell S (2009) Anytime lifted belief propagation. In: Working notes if the international workshop on statistical relational learning, vol 9

  • Deijfen M, van den Esker H, van der Hofstad R, Hooghiemstra G (2009) A preferential attachment model with random initial degrees. Arkiv för Matematik 47:41–72

    Article  MathSciNet  Google Scholar 

  • Devroye L (1980) Generating the maximum of independent identically distributed random variables. Comput Math Appl 6(3):305–315

    Article  MathSciNet  Google Scholar 

  • Diestel R (1997) Graph Theory. No. 173 in Graduate Texts in Mathematics, Springer

  • Dorogovtsev S, Mendes J, Samukhin A (2000) Structure of growing networks with preferential linking. Phys Rev Lett 85:4633–4636

    Article  Google Scholar 

  • Erdös P, Rényi A (1959) On random graphs, i. Publicationes Mathematicae (Debrecen) 6:290–297

    MathSciNet  MATH  Google Scholar 

  • Erdos P, Renyi A (1960) On the evolution of random graphs. Publ Math Inst Hungarian Acad Sci 5:17–60

    MathSciNet  MATH  Google Scholar 

  • Flaxman A, Frieze A, Vera J (2004) A geometric preferential attachment model of networks. In: 3rd International workshop of algorithms and models for the web-graph, WAW, pp 44–55

  • Frank O, Strauss D (1986) Markov graphs. J Am Stat Assoc 81(395):832–842

    Article  MathSciNet  Google Scholar 

  • Gleich DF, Owen AB (2012) Moment-based estimation of stochastic kronecker graph parameters. Internet Math 8(3):232–256

    Article  MathSciNet  Google Scholar 

  • Gogate V, Domingos P (2010) Exploiting logical structure in lifted probabilistic inference. In: Workshops at the 24th AAAI conference on artificial intelligence

  • Gogate V, Domingos PM (2012) Probabilistic theorem proving. CoRR arXiv:1202.3724

  • Golosovsky M, Solomon S (2012) Stochastic dynamical model of a growing citation network based on a self-exciting point process. Phys Rev Lett 109(098):701

    Google Scholar 

  • Harris K (2008) The National Longitudinal Study of Adolescent health, Waves I to III, 1994–2002 [machine-readable data file and documentation]. University of North Carolina at Chapel Hill, Chapel Hill

    Google Scholar 

  • Holland P, Leinhardt S (1981) An exponential family of probability distributions for directed graphs. J Am Stat Assoc 76(373):33–50

    Article  MathSciNet  Google Scholar 

  • Holland PW, Laskey KB, Leinhardt S (1983) Stochastic blockmodels: first steps. Soc Netw 5(2):109–137

    Article  MathSciNet  Google Scholar 

  • Jackson M, Rogers B (2007) Meeting strangers and friends of friends: how random are social networks? Am Econ Rev 97(3):890–915

    Article  Google Scholar 

  • Jaimovich A, Meshi O, Friedman N (2007) Template based inference in symmetric relational Markov random fields

  • Karrer B, Newman MEJ (2009) Random graph models for directed acyclic networks. Phys Rev E 80(046):110

    Google Scholar 

  • Karrer B, Newman MEJ (2010) Random graphs containing arbitrary distributions of subgraphs. Phys Rev E 82(066):118

    MathSciNet  Google Scholar 

  • Kersting K (2012) Lifted probabilistic inference. In: Proceedings of the 20th European conference on artificial intelligence

  • Kersting K, Ahmadi B, Natarajan S (2009) Counting belief propagation. In: Proceedings of the 25th conference on uncertainty in artificial intelligence, pp 277–284

  • Kersting K, El Massaoudi Y (2010) Informed lifting for message-passing. In: Proceedings of the 24th conference on artificial intelligence (AAAI)

  • Kim M, Leskovec J (2010) Multiplicative attribute graph model of real-world networks. Algorithms and Models for the Web-Graph, Springer, Berlin Heidelberg, Lecture Notes in Computer Science 6516:62–73

  • Kolda TG, Pinar A, Plantenga T, Seshadhri C (2014) A scalable generative graph model with community structure. SIAM J Sci Comput 36(5):C424–C452

    Article  MathSciNet  Google Scholar 

  • Krapivsky P, Redner S (2002) A statistical physics perspective on web growth. Comput Netw 39(3):261–276

    Article  Google Scholar 

  • Kumar R, Raghavan P, Rajagopalan S, Sivakumar D, Tomkins A, Upfal E (2000) Stochastic models for the web graph. In: Proceedings of the 41st annual symposium on foundations of computer science, 2000, pp 57–65

  • Leskovec J, Chakrabarti D, Kleinberg J, Faloutsos C, Ghahramani Z (2010) Kronecker graphs: an approach to modeling networks. J Mach Learn Res 11:985–1042

    MathSciNet  MATH  Google Scholar 

  • Leskovec J, Faloutsos C (2007) Scalable modeling of real graphs using Kronecker multiplication. In: Proceedings of the 24th international conference on machine learning, pp 497–504

  • Leskovec J, Krevl A (2014) SNAP datasets: Stanford large network dataset collection. http://snap.stanford.edu/data

  • Massey FJ (1951) The Kolmogorov–Smirnov test for goodness of fit. J Am Stat Assoc 46:68–78

    Article  Google Scholar 

  • McCallum AK, Nigam K, Rennie J, Seymore K (2000) Automating the construction of internet portals with machine learning. Inf Retr 3(2):127–163

    Article  Google Scholar 

  • Mladenov M, Ahmadi B, Kersting K (2012) Lifted linear programming. In: International conference on artificial intelligence and statistics, pp 788–797

  • Moreno S, Kirshner S, Neville J, Vishwanathan S (2010) Tied Kronecker product graph models to capture variance in network populations. In: The 48th annual Allerton conference on communication, control, and computing, pp 1137–1144

  • Moreno S, Neville J (2013) Network hypothesis testing using mixed Kronecker product graph models. In: IEEE 13th international conference on data mining, pp 1163–1168

  • Moreno S, Neville J, Kirshner S (2013) Learning mixed Kronecker product graph models with simulated method of moments. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1052–1060

  • Moreno S, Pfeiffer III J, Kirshner S, Neville J (2014) A scalable method for exact sampling from kronecker family models. In: IEEE 14th international conference on data mining, pp 440–449

  • Pfeiffer JJ III, Neville J, Bennett PN (2014) Active exploration in networks: using probabilistic relationships for learning and inference. In: Proceedings of the 23rd ACM international conference on information and knowledge management, pp 639–648

  • Pinar A, Seshadhri C, Kolda TG (2011) The similarity between stochastic Kronecker and Chung-Lu graph models. CoRR arXiv:1110.4925

  • Poole D (2003) First-order probabilistic inference. In: Proceedings of the 18th international joint conference on artificial intelligence, pp 985–991

  • Pearl J (ed) (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, Burlington

    MATH  Google Scholar 

  • Price D (1976) A general theory of bibliometric and other cumulative advantage processes. J Am Soc Inf Sci 27(5):292–306

    Article  Google Scholar 

  • Robins G, Snijders T, Wang P, Handcock M, Pattison P (2007) Recent developments in exponential random graph (p*) models for social networks. Soc Netw 29:192–215

    Article  Google Scholar 

  • Sen P, Deshpande A, Getoor L (2009) Bisimulation-based approximate lifted inference. In: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pp 496–505

  • Seshadhri C, Kolda TG, Pinar A (2012) Community structure and scale-free collections of Erdős-Rényi graphs. Phys Rev E 85(5):056109

    Article  Google Scholar 

  • Sheldon R (2002) A first course in probability. Pearson Education, London

    MATH  Google Scholar 

  • Simon H (1955) On a class of skew distribution functions. Biometrika 42(3–4):425–440

    Article  MathSciNet  Google Scholar 

  • Singla P, Domingos PM (2008) Lifted first-order belief propagation. In: Proceedings of the 23rd conference on artificial intelligence, vol 8, pp 1094–1099

  • Snijders T (2002) Markov chain Monte Carlo estimation of exponential random graph models. J Soc Struct 3(2):1–40

    Google Scholar 

  • Snijders T, Pattison P, Robins G, Handcock M (2004) New specifications for exponential random graph models. Sociol Methodol 36:99–153

    Article  Google Scholar 

  • Strauss D, Ikeda M (1990) Pseudolikelihood estimation for social networks. J Am Stat Assoc 85(409):204–212

    Article  MathSciNet  Google Scholar 

  • Tadic B (2001) Dynamics of directed graphs: the world-wide web. Physica A 293(1–2):273–284

    Article  Google Scholar 

  • Van den Broeck G, Choi A, Darwiche A (2012) Lifted relax, compensate and then recover: from approximate to exact lifted probabilistic inference. In: Proceedings of the 28th conference on uncertainty in artificial intelligence (UAI)

  • Van den Broeck G, Davis J (2012) Conditioning in first-order knowledge compilation and lifted probabilistic inference. In: Proceedings of the 26th conference on artificial intelligence, AAAI’12, pp 1961–1967

  • Van den Broeck G, Niepert M (2015) Lifted probabilistic inference for asymmetric graphical models. In: Proceedings of the 29th conference on artificial intelligence, pp 3599–3605

  • Van den Broeck G (2013) Lifted inference and learning in statistical relational models. PhD thesis, Arenberg Doctoral School, Faculty of Engineering

  • Voss J (ed) (2013) An introduction to statistical computing: a simulation-based approach. Wiley, Hoboken

    Google Scholar 

  • Wang P, Robins G, Pattison P, Lazega E (2013) Exponential random graph models for multilevel networks. Soc Netw 35(1):96–115

    Article  Google Scholar 

  • Wasserman S, Anderson C (1987) Stochastic a posteriori blockmodels: construction and assessment. Soc Netw 9(1):1–36

    Article  MathSciNet  Google Scholar 

  • Wasserman S, Pattison PE (1996) Logit models and logistic regression for social networks: I. An introduction to Markov graphs and p*. Psychometrika 61:401–425

    Article  MathSciNet  Google Scholar 

  • Watts D, Strogatz S (1998) Collective dynamics of ’small-world’ networks. Nature 393:440–42

    Article  Google Scholar 

Download references

Acknowledgements

We thank David Gleich for input on earlier versions of this work. Sebastian Moreno acknowledges the support of “CONICYT + PAI/ Concurso nacional de apoyo al retorno de investigadores/as desde el extranjero, convocatoria 2014 + folio 82140043”. This research is also supported by NSF under contract numbers IIS-1149789, IIS-1618690, IIS-1546488, and CCF-0939370. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation hereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements either expressed or implied, of the National Science Foundation or the U.S. Government.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Sebastian Moreno.

Additional information

Responsible editor: Kristian Kersting.

Sebastian Moreno acknowledges the support of “CONICYT + PAI/Concurso nacional de apoyo al retorno de investigadores/as desde el extranjero, convocatoria 2014 + folio 82140043”.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Moreno, S., Pfeiffer, J.J. & Neville, J. Scalable and exact sampling method for probabilistic generative graph models. Data Min Knowl Disc 32, 1561–1596 (2018). https://doi.org/10.1007/s10618-018-0566-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-018-0566-x

Keywords

Navigation