Abstract
Interest in modeling complex networks has fueled the development of multiple probabilistic generative graph models (PGGMs). PGGMs are statistical methods that model the network distribution and match common characteristics of real world networks. Recently, scalable sampling algorithms for well known PGGMs, made the analysis of large-scale, sparse networks feasible for the first time. However, it has been demonstrated that these scalable sampling algorithms do not sample from the original underlying distribution, and sometimes produce very unlikely graphs. To address this, we extend the algorithm proposed in Moreno et al. (in: IEEE 14th international conference on data mining, pp 440–449, 2014) for a single model and develop a general solution for a broad class of PGGMs. Our approach exploits the fact that PGGMs are typically parameterized by a small set of unique probability values—this enables fast generation via independent sampling of groups of edges with the same probability value. By sampling within groups, we remove bias due to conditional sampling and probability reallocation. We show that our grouped sampling methods are both provably correct and efficient. Our new algorithm reduces time complexity by avoiding the expensive rejection sampling step previously necessary, and we demonstrate its generality, by outlining implementations for six different PGGMs. We conduct theoretical analysis and empirical evaluation to demonstrate the strengths of our algorithms. We conclude by sampling a network with over a billion edges in 95 s on a single processor.
Similar content being viewed by others
Notes
In this paper we will use graphs and networks interchangeably.
In the worst case scenario, \(\varTheta \) is a \(N_v\times N_v\) matrix that explicitly represents \({\mathcal {P}}\) (\(\varTheta \equiv {\mathcal {P}}\)).
Note that \({\mathcal {M}}_{p}\) includes the time needed to calculate probability values, which we referred to earlier in Sect. 2.2.1 as \({\mathcal {M}}_{c}\).
In the worst case, lookups/insertions are \(O(N_v)\) for hash tables (when the hash function puts all entries are inserted into the same bucket). To avoid this worst case scenario, one can use a balanced binary tree (e.g., red-black trees) for the implementation, which is \(O(\log N_v)\).
References
Ahmadi B, Kersting K, Hadiji F (2010) Lifted belief propagation: Pairwise marginals and beyond. In: Proceedings of the 5th European workshop on probabilistic graphical models
Aicher C, Jacobs AZ, Clauset A (2015) Learning latent block structure in weighted networks. J Complex Netw 3(2):221–248
Barabasi A, Albert R (1999) Emergence of scaling in random networks. Science 286:509–512
Benson AR, Riquelme C, Schmit S (2014) Learning multifractal structure in large networks. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1326–1335
Bu Z, Xia Z, Wang J, Zhang C (2013a) A last updating evolution model for online social networks. Physica A 392(9):2240–2247
Bu Z, Xia Z, Wang J, Zhang C (2013b) A last updating evolution model for online social networks. Physica A 392(9):2240–2247
Bui HH, Huynh TN, Riedel S (2013) Automorphism groups of graphical models and lifted variational inference. In: Proceedings of the 29th conference on uncertainty in artificial intelligence, UAI’13, pp 132–141
Choi J, Guzman-Rivera A, Amir E (2011) Lifted relational kalman filtering. In: Proceedings of the 22nd international joint conference on artificial intelligence—3, IJCAI’11, pp 2092–2099
Chung F, Lu L (2002) The average distances in random graphs with given expected degrees. Proc Nat Acad Sci 99(25):15,879–15,882
de Salvo R, Natarajan S, Bui H, Shavlik J, Russell S (2009) Anytime lifted belief propagation. In: Working notes if the international workshop on statistical relational learning, vol 9
Deijfen M, van den Esker H, van der Hofstad R, Hooghiemstra G (2009) A preferential attachment model with random initial degrees. Arkiv för Matematik 47:41–72
Devroye L (1980) Generating the maximum of independent identically distributed random variables. Comput Math Appl 6(3):305–315
Diestel R (1997) Graph Theory. No. 173 in Graduate Texts in Mathematics, Springer
Dorogovtsev S, Mendes J, Samukhin A (2000) Structure of growing networks with preferential linking. Phys Rev Lett 85:4633–4636
Erdös P, Rényi A (1959) On random graphs, i. Publicationes Mathematicae (Debrecen) 6:290–297
Erdos P, Renyi A (1960) On the evolution of random graphs. Publ Math Inst Hungarian Acad Sci 5:17–60
Flaxman A, Frieze A, Vera J (2004) A geometric preferential attachment model of networks. In: 3rd International workshop of algorithms and models for the web-graph, WAW, pp 44–55
Frank O, Strauss D (1986) Markov graphs. J Am Stat Assoc 81(395):832–842
Gleich DF, Owen AB (2012) Moment-based estimation of stochastic kronecker graph parameters. Internet Math 8(3):232–256
Gogate V, Domingos P (2010) Exploiting logical structure in lifted probabilistic inference. In: Workshops at the 24th AAAI conference on artificial intelligence
Gogate V, Domingos PM (2012) Probabilistic theorem proving. CoRR arXiv:1202.3724
Golosovsky M, Solomon S (2012) Stochastic dynamical model of a growing citation network based on a self-exciting point process. Phys Rev Lett 109(098):701
Harris K (2008) The National Longitudinal Study of Adolescent health, Waves I to III, 1994–2002 [machine-readable data file and documentation]. University of North Carolina at Chapel Hill, Chapel Hill
Holland P, Leinhardt S (1981) An exponential family of probability distributions for directed graphs. J Am Stat Assoc 76(373):33–50
Holland PW, Laskey KB, Leinhardt S (1983) Stochastic blockmodels: first steps. Soc Netw 5(2):109–137
Jackson M, Rogers B (2007) Meeting strangers and friends of friends: how random are social networks? Am Econ Rev 97(3):890–915
Jaimovich A, Meshi O, Friedman N (2007) Template based inference in symmetric relational Markov random fields
Karrer B, Newman MEJ (2009) Random graph models for directed acyclic networks. Phys Rev E 80(046):110
Karrer B, Newman MEJ (2010) Random graphs containing arbitrary distributions of subgraphs. Phys Rev E 82(066):118
Kersting K (2012) Lifted probabilistic inference. In: Proceedings of the 20th European conference on artificial intelligence
Kersting K, Ahmadi B, Natarajan S (2009) Counting belief propagation. In: Proceedings of the 25th conference on uncertainty in artificial intelligence, pp 277–284
Kersting K, El Massaoudi Y (2010) Informed lifting for message-passing. In: Proceedings of the 24th conference on artificial intelligence (AAAI)
Kim M, Leskovec J (2010) Multiplicative attribute graph model of real-world networks. Algorithms and Models for the Web-Graph, Springer, Berlin Heidelberg, Lecture Notes in Computer Science 6516:62–73
Kolda TG, Pinar A, Plantenga T, Seshadhri C (2014) A scalable generative graph model with community structure. SIAM J Sci Comput 36(5):C424–C452
Krapivsky P, Redner S (2002) A statistical physics perspective on web growth. Comput Netw 39(3):261–276
Kumar R, Raghavan P, Rajagopalan S, Sivakumar D, Tomkins A, Upfal E (2000) Stochastic models for the web graph. In: Proceedings of the 41st annual symposium on foundations of computer science, 2000, pp 57–65
Leskovec J, Chakrabarti D, Kleinberg J, Faloutsos C, Ghahramani Z (2010) Kronecker graphs: an approach to modeling networks. J Mach Learn Res 11:985–1042
Leskovec J, Faloutsos C (2007) Scalable modeling of real graphs using Kronecker multiplication. In: Proceedings of the 24th international conference on machine learning, pp 497–504
Leskovec J, Krevl A (2014) SNAP datasets: Stanford large network dataset collection. http://snap.stanford.edu/data
Massey FJ (1951) The Kolmogorov–Smirnov test for goodness of fit. J Am Stat Assoc 46:68–78
McCallum AK, Nigam K, Rennie J, Seymore K (2000) Automating the construction of internet portals with machine learning. Inf Retr 3(2):127–163
Mladenov M, Ahmadi B, Kersting K (2012) Lifted linear programming. In: International conference on artificial intelligence and statistics, pp 788–797
Moreno S, Kirshner S, Neville J, Vishwanathan S (2010) Tied Kronecker product graph models to capture variance in network populations. In: The 48th annual Allerton conference on communication, control, and computing, pp 1137–1144
Moreno S, Neville J (2013) Network hypothesis testing using mixed Kronecker product graph models. In: IEEE 13th international conference on data mining, pp 1163–1168
Moreno S, Neville J, Kirshner S (2013) Learning mixed Kronecker product graph models with simulated method of moments. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1052–1060
Moreno S, Pfeiffer III J, Kirshner S, Neville J (2014) A scalable method for exact sampling from kronecker family models. In: IEEE 14th international conference on data mining, pp 440–449
Pfeiffer JJ III, Neville J, Bennett PN (2014) Active exploration in networks: using probabilistic relationships for learning and inference. In: Proceedings of the 23rd ACM international conference on information and knowledge management, pp 639–648
Pinar A, Seshadhri C, Kolda TG (2011) The similarity between stochastic Kronecker and Chung-Lu graph models. CoRR arXiv:1110.4925
Poole D (2003) First-order probabilistic inference. In: Proceedings of the 18th international joint conference on artificial intelligence, pp 985–991
Pearl J (ed) (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, Burlington
Price D (1976) A general theory of bibliometric and other cumulative advantage processes. J Am Soc Inf Sci 27(5):292–306
Robins G, Snijders T, Wang P, Handcock M, Pattison P (2007) Recent developments in exponential random graph (p*) models for social networks. Soc Netw 29:192–215
Sen P, Deshpande A, Getoor L (2009) Bisimulation-based approximate lifted inference. In: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pp 496–505
Seshadhri C, Kolda TG, Pinar A (2012) Community structure and scale-free collections of Erdős-Rényi graphs. Phys Rev E 85(5):056109
Sheldon R (2002) A first course in probability. Pearson Education, London
Simon H (1955) On a class of skew distribution functions. Biometrika 42(3–4):425–440
Singla P, Domingos PM (2008) Lifted first-order belief propagation. In: Proceedings of the 23rd conference on artificial intelligence, vol 8, pp 1094–1099
Snijders T (2002) Markov chain Monte Carlo estimation of exponential random graph models. J Soc Struct 3(2):1–40
Snijders T, Pattison P, Robins G, Handcock M (2004) New specifications for exponential random graph models. Sociol Methodol 36:99–153
Strauss D, Ikeda M (1990) Pseudolikelihood estimation for social networks. J Am Stat Assoc 85(409):204–212
Tadic B (2001) Dynamics of directed graphs: the world-wide web. Physica A 293(1–2):273–284
Van den Broeck G, Choi A, Darwiche A (2012) Lifted relax, compensate and then recover: from approximate to exact lifted probabilistic inference. In: Proceedings of the 28th conference on uncertainty in artificial intelligence (UAI)
Van den Broeck G, Davis J (2012) Conditioning in first-order knowledge compilation and lifted probabilistic inference. In: Proceedings of the 26th conference on artificial intelligence, AAAI’12, pp 1961–1967
Van den Broeck G, Niepert M (2015) Lifted probabilistic inference for asymmetric graphical models. In: Proceedings of the 29th conference on artificial intelligence, pp 3599–3605
Van den Broeck G (2013) Lifted inference and learning in statistical relational models. PhD thesis, Arenberg Doctoral School, Faculty of Engineering
Voss J (ed) (2013) An introduction to statistical computing: a simulation-based approach. Wiley, Hoboken
Wang P, Robins G, Pattison P, Lazega E (2013) Exponential random graph models for multilevel networks. Soc Netw 35(1):96–115
Wasserman S, Anderson C (1987) Stochastic a posteriori blockmodels: construction and assessment. Soc Netw 9(1):1–36
Wasserman S, Pattison PE (1996) Logit models and logistic regression for social networks: I. An introduction to Markov graphs and p*. Psychometrika 61:401–425
Watts D, Strogatz S (1998) Collective dynamics of ’small-world’ networks. Nature 393:440–42
Acknowledgements
We thank David Gleich for input on earlier versions of this work. Sebastian Moreno acknowledges the support of “CONICYT + PAI/ Concurso nacional de apoyo al retorno de investigadores/as desde el extranjero, convocatoria 2014 + folio 82140043”. This research is also supported by NSF under contract numbers IIS-1149789, IIS-1618690, IIS-1546488, and CCF-0939370. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation hereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements either expressed or implied, of the National Science Foundation or the U.S. Government.
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Kristian Kersting.
Sebastian Moreno acknowledges the support of “CONICYT + PAI/Concurso nacional de apoyo al retorno de investigadores/as desde el extranjero, convocatoria 2014 + folio 82140043”.
Rights and permissions
About this article
Cite this article
Moreno, S., Pfeiffer, J.J. & Neville, J. Scalable and exact sampling method for probabilistic generative graph models. Data Min Knowl Disc 32, 1561–1596 (2018). https://doi.org/10.1007/s10618-018-0566-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-018-0566-x