Workshop Schedule, Friday Dec 7th, 2012

Morning session:

7:30 Opening remarks
7:40 Dave Hunter, Penn State University: Statistical modeling of networks: ERGMs and beyond.
20 min invited talk
8:00 Ulrik Brandes, University of Konstanz: Algorithmic Implications of a Representational Theory of Centrality Measurement.
30 min invited talk with 10 min discussion led by Carter Butts
8:40 Approximating partition functions for exponential-family random graph models
Wen Pu, Eyal Amir, Dorothy Espelage (University of Illinois at Urbana-Champaign)
15 min contributed talk
8:55 Graph partitioning via balanced label propagation
Johan Ugander (Cornell University), Lars Backstrom (Facebook)
15 min contributed talk
9:10 Coffee break and poster setup
9:30 Poster session (14 posters).
10:30 Break and official end of poster session

Afternoon session:

15:30 Pavel Krivitsky, Penn State, and Mark Handcock, UCLA: Partial observations and missing information with social network data.
20 min invited talk
15:50 David Kempe, USC: Low-distortion Inference of Latent Similarities from a Multiplex Social Network.
30 min invited talk with 10 min discussion led by David Mount
16:30 Graphlet decomposition of a weighted network
H. A. Soufiani and E. M. Airoldi (Harvard).
15 min contributed talk
16:45 Bringing survival analysis to information propagation
Manuel Gomez Rodriguez (MPI/Stanford), Bernhard Schölkopf (MPI)
15 min contributed talk
17:00 Coffee break
17:20 David Eppstein, University of California, Irvine: Windows into Relational Events: Data Structures for Contiguous Subsequences of Edges
20 min invited talk with 10 min discussion led by Dave Hunter
17:50 Carter Butts, University of California, Irvine: Challenges and Directions in Modeling Large-scale Networks
30 min invited talk with 10 min discussion led by Ulrik Brandes
18:30 End of workshop


Abstracts for invited talks

Algorithmic Implications of a Representational Theory of Centrality Measurement

Ulrik Brandes, University of Konstanz

Recently, I proposed a fundamentally different approach to the analysis of centrality in networks. It is inspired by conjoint measurement theory and intended to narrow the gap between substantive theory and quantitative network analysis. In this talk, I will briefly review the approach and point out some interesting combinatorial and algorithmic problems arising from it.
Statistical modeling of networks: ERGMs and beyond

Dave Hunter, Penn State University

This talk discusses the notion of statistical models for networks, in particular the exponential-family random graph model (ERGM) framework. After briefly explaining some issues of parameter estimation for these models, we consider two related extensions--model-based clustering for large networks and event-based models--and their application to large networks of over a hundred thousand nodes.
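
For reference (a standard definition, not specific to this talk), an ERGM places an exponential-family distribution over graphs y, with a vector of sufficient statistics g(y):

    P_\theta(Y = y) = \frac{\exp\{\theta^\top g(y)\}}{\kappa(\theta)},
    \qquad
    \kappa(\theta) = \sum_{y' \in \mathcal{Y}} \exp\{\theta^\top g(y')\}

The normalizing constant \kappa(\theta) sums over all graphs on the given node set, which is what makes estimation challenging at scale.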
Low-distortion Inference of Latent Similarities from a Multiplex Social Network

David Kempe, University of Southern California

What can a social network tell us about the underlying latent "social structure," the way in which individuals are similar or dissimilar? Much of social network analysis is, implicitly or explicitly, predicated on the assumption that individuals tend to be more similar to their friends than to strangers. Having explicit access to similarity information, instead of merely the noisy signal provided by the presence or absence of edges, could improve analysis significantly. We study the following natural question: Given a social network - reflecting the underlying social distances between its nodes - how accurately can we reconstruct the social structure?

It is tempting to model similarities and dissimilarities as distances, so that the social structure is a metric space. However, observed social networks are usually multiplex, in the sense that different edges originate from similarity in one or more of a number of different categories, such as geographical proximity, professional interactions, kinship, or hobbies. Since close proximity in even one category makes the presence of edges much more likely, an observed social network is more accurately modeled as a union of separate networks. In general, it is a priori not known which network a given edge comes from. While generative models based on a single metric space have been analyzed previously, a union of several networks individually generated from metrics is structurally very different from (and more complex than) networks generated from just one metric.

We begin to address this reconstruction problem formally. The latent "social structure" consists of several metric spaces. Each metric space gives rise to a "distance-based random graph," in which edges are created according to a distribution that depends on the underlying metric space and makes long-range edges less likely than short ones. For a concrete model, we consider Kleinberg's small-world model and some variations thereof. The observed social network is the union of these graphs. All edges are unlabeled, in the sense that the existence of an edge does not reveal which random graph it comes from. Our main result is a near-linear time algorithm which reconstructs from this unlabeled union each of the individual metrics with provably low distortion. (Joint work with Ittai Abraham, Shiri Chechik, and Aleksandrs Slivkins.)
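
For context, a textbook fact about the model named above (not a claim from the talk): in Kleinberg's small-world model, each node u of a grid receives, in addition to its lattice edges, a long-range edge to a node v chosen with probability proportional to an inverse power of grid distance,

    \Pr[u \to v] \propto d(u, v)^{-r},

so the exponent r controls how strongly the observed edges reflect the underlying metric.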
Windows into Relational Events: Data Structures for Contiguous Subsequences of Edges

David Eppstein, University of California, Irvine

We consider the problem of analyzing social network data sets in which the edges of the network have timestamps, and we wish to analyze the subgraphs formed from edges in contiguous subintervals of these timestamps. We provide data structures for these problems that use near-linear preprocessing time, linear space, and sublogarithmic query time to handle queries that ask for the number of connected components, number of components that contain cycles, number of vertices whose degree equals or is at most some predetermined value, number of vertices that can be reached from a starting set of vertices by time-increasing paths, and related queries.

Abstracts for workshop papers

Graphlet decomposition of a weighted network

Hossein Azari Soufiani, Edoardo M. Airoldi, Department of Statistics, Harvard University

We introduce the graphlet decomposition of a weighted network, which encodes a notion of social information based on social structure. We develop a scalable inference algorithm which combines EM with Bron-Kerbosch in a novel fashion, for estimating the parameters of the model underlying graphlets using one network sample. We explore some theoretical properties of the graphlet decomposition, including computational complexity, redundancy, and expected accuracy. We demonstrate graphlets on synthetic and real data, analyzing messaging patterns on Facebook and criminal associations in the 19th century.
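
The clique-enumeration subroutine named above is the classical Bron-Kerbosch algorithm; a minimal sketch of its basic (unpivoted) form, assuming the graph is given as a dict adj mapping each node to the set of its neighbors:

    def bron_kerbosch(R, P, X, adj, out):
        # R: current clique; P: candidate extensions; X: already-processed nodes
        if not P and not X:
            out.append(set(R))            # R is a maximal clique
            return
        for v in list(P):
            bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
            P.remove(v)
            X.add(v)

    # usage: enumerate the maximal cliques of a small graph
    adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: set()}
    cliques = []
    bron_kerbosch(set(), set(adj), set(), adj, cliques)   # -> [{1, 2, 3}, {4}]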
How Does Maximum Likelihood Estimation for the p_1 Model Scale for Large Sparse Networks?

Alessandro Rinaldo, Department of Statistics, Carnegie Mellon University
Sonja Petrovic, Department of Statistics, Pennsylvania State University
Stephen E. Fienberg, Department of Statistics, Machine Learning Department, and Heinz School, Carnegie Mellon University

We study the crucial problem of MLE existence for the directed random graph model proposed by Holland and Leinhardt, known as the p_1 model, which describes dyadic interactions in a social network. We show that necessary and sufficient conditions for the existence of the MLE are based on polyhedral conditions, which offer a systematic way to characterize, in a combinatorial fashion, the sample points leading to a nonexistent MLE and the resulting non-estimability of the probability parameters. While the specifics of nonexistence are best understood for networks of relatively small size, our results have important implications for large networks, since current algorithms and software for fitting p_1 and other exponential random graph models (ERGMs) have no simple mechanism for detecting nonexistence and identifying non-estimable parameters.
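
For reference, the standard Holland-Leinhardt parametrization (stated here as background; the paper's notation may differ): dyads are independent, and for each pair (i, j),

    \log P(Y_{ij}=0, Y_{ji}=0) = \lambda_{ij}
    \log P(Y_{ij}=1, Y_{ji}=0) = \lambda_{ij} + \theta + \alpha_i + \beta_j
    \log P(Y_{ij}=0, Y_{ji}=1) = \lambda_{ij} + \theta + \alpha_j + \beta_i
    \log P(Y_{ij}=1, Y_{ji}=1) = \lambda_{ij} + 2\theta + \alpha_i + \alpha_j + \beta_i + \beta_j + \rho

where \alpha and \beta capture expansiveness and attractiveness of nodes, \rho captures reciprocity, and \lambda_{ij} normalizes each dyad.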
Learning the mechanisms that govern the spread of infectious disease through large populations by way of contact is of great interest to public health. An attractive approach is to jointly model the network of contacts between population members and the spread of infectious disease through that network. A network-based approach can capture the paths along which infectious diseases spread and account for the possible presence of super-spreaders. We discuss the formidable statistical and computational challenges arising from the presence of incomplete data and develop scalable models and methods based on exponential-family random graph models with Dirichlet process priors and SEIR models.
Mass Preserving Exponential Random Graph Model

Lin Yuan, School of ECE, Purdue University
Sergey Kirshner, Department of Statistics, Purdue University
Robert Givan, School of ECE, Purdue University

Exponential random graph models (ERGMs) are commonly used for modeling network data. However, they often suffer from degeneracy, manifested in the learned model placing very little mass on or near the observed network(s). We propose a mass preserving ERGM which, in addition to matching the mean statistics for the features used with ERGMs, also ensures that the resulting model places at least a predetermined amount of mass on graphs similar to the observed graphs. The resulting model is thus directly resistant to the degeneracy of assigning near-zero probability to the observations. This claim is further confirmed by our experiments on several social networks of small to moderate size (16 to 1000+ nodes).
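
One way to read the "mass preserving" constraint, as an illustrative formalization only (the paper's exact construction may differ): maximize the likelihood subject to a floor on the probability of a neighborhood of the observed graph,

    \max_\theta \ \log P_\theta(y_{\text{obs}})
    \quad \text{subject to} \quad
    P_\theta\big(d(Y, y_{\text{obs}}) \le \epsilon\big) \ge \delta,

for some graph distance d and thresholds \epsilon and \delta.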
Windows into Relational Events: Data Structures for Contiguous Subsequences of Edges

Michael J. Bannister, Department of Computer Science, University of California, Irvine
Christopher DuBois, Department of Statistics, University of California, Irvine
David Eppstein, Department of Computer Science, University of California, Irvine
Padhraic Smyth, Department of Computer Science, University of California, Irvine

We consider the problem of analyzing social network data sets in which the edges of the network have timestamps, and we wish to analyze the subgraphs formed from edges in contiguous subintervals of these timestamps. We provide data structures for these problems that use near-linear preprocessing time, linear space, and sublogarithmic query time to handle queries that ask for the number of connected components, number of components that contain cycles, number of vertices whose degree equals or is at most some predetermined value, number of vertices that can be reached from a starting set of vertices by time-increasing paths, and related queries.
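
To make the query family concrete, here is a naive per-query baseline for one of the queries (number of connected components within a time window), using a union-find pass over the window's edges; the paper's data structures exist precisely to avoid this linear per-query cost. A sketch, assuming edges is a list of (timestamp, u, v) triples sorted by timestamp and times is the parallel list of timestamps:

    from bisect import bisect_left, bisect_right

    def components_in_window(edges, times, t0, t1):
        # window is the closed interval [t0, t1]
        lo, hi = bisect_left(times, t0), bisect_right(times, t1)
        parent, comps = {}, 0

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        for _, u, v in edges[lo:hi]:
            for w in (u, v):
                if w not in parent:
                    parent[w] = w
                    comps += 1                  # new vertex starts as its own component
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                comps -= 1                      # union merges two components
        return comps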
Statistical Detection of Network-Level Outliers

Sebastian Moreno, Department of Computer Science, Purdue University
Jennifer Neville, Department of Computer Science, Purdue University

The search for anomalies or outliers in large-scale network data has important security applications and can help to uncover interesting behavior. Four different types of anomalies can be discovered in static and dynamic network data: nodes, edges, small subgraphs, and/or larger (sub) networks. To date, most of the research in this area has focused on identifying anomalous nodes, links, or small subgraphs in static networks. In dynamic network data, research has focused on detecting outlier links and anomalous network evolution, where multiple, temporal samples of the network provide sufficient data for analysis. However, to our knowledge, there are few algorithms that search for network-level anomalies in large-scale static networks. This is due to the difficulty of obtaining a set of networks to use for learning statistical models of normal behavior, coupled with model limitations that make it difficult to capture natural variations in the population. However, recent work on mixed Kronecker Product Graph Models (mKPGMs) has provided a statistical model that is capable of learning accurate models of network characteristics and network populations. Using this approach, we propose a novel outlier algorithm to detect anomalous networks in static network populations, with a model-based approach that can learn the underlying distribution of the data from a single network. We illustrate the strengths of our algorithm using synthetic and real world network datasets, showing that the majority of outliers are detected using only a single network as training data.
Graph partitioning via balanced label propagation

Johan Ugander, Cornell University
Lars Backstrom, Facebook

Partitioning graphs at scale is a key challenge for any application that involves distributing a graph across disks, machines, or data centers. Graph partitioning is a very well studied problem with a rich literature, but existing algorithms typically cannot scale to billions of edges, or cannot provide guarantees about partition sizes. In this work we introduce an efficient algorithm, balanced label propagation, for precisely partitioning massive graphs while greedily maximizing edge locality, the number of edges that are assigned to the same shard of a partition. By combining the computational efficiency of label propagation — where nodes are iteratively relabeled to the same ‘label’ as the plurality of their graph neighbors — with the guarantees of constrained optimization — guiding the propagation by a linear program constraining the partition sizes — our algorithm efficiently partitions graphs with billions of edges. We evaluate our algorithm for its partitioning performance on the Facebook social graph, and also study its performance impact when partitioning Facebook’s ‘People You May Know’ service (PYMK). In a live deployment, we observed average query times and average network traffic levels that were 37.1% and 50.5%, respectively, of those under the previous naive random sharding.
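
To make the mechanics concrete, here is a simplified single relabeling pass (a greedy stand-in for the paper's linear-program step, with a hypothetical balanced_pass helper; the real algorithm solves an LP over relocation counts per shard pair):

    from collections import Counter

    def balanced_pass(adj, label, capacity):
        # adj: node -> set of neighbors; label: node -> current shard id;
        # capacity: shard id -> maximum shard size
        moves = []
        for v, nbrs in adj.items():
            counts = Counter(label[u] for u in nbrs)
            if not counts:
                continue
            target, votes = counts.most_common(1)[0]
            gain = votes - counts[label[v]]   # extra co-located edges if v moves
            if target != label[v] and gain > 0:
                moves.append((gain, v, target))
        size = Counter(label.values())
        # apply the highest-gain moves first, respecting shard capacities
        for gain, v, target in sorted(moves, key=lambda m: m[0], reverse=True):
            if size[target] < capacity[target]:
                size[label[v]] -= 1
                size[target] += 1
                label[v] = target
        return label

Iterating such passes until no capacity-respecting positive-gain moves remain mirrors the iterative relabeling described above.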
Improved Generalization Bounds for Large-scale Structured Prediction

Ben London, University of Maryland College Park
Bert Huang, University of Maryland College Park
Lise Getoor, University of Maryland College Park

We derive new generalization bounds for collective inference on network data, specifically when the training set consists of a single, large structured example. This setting is common in computer vision and network analysis. In such scenarios, existing bounds do not suggest generalization, which contradicts overwhelming empirical evidence. We propose a new theory of how collective models generalize, identifying weak dependence, hypothesis complexity, and collective stability as sufficient conditions. When satisfied, we obtain uniform convergence rates with significantly faster decay than previous bounds, particularly when each structured example is sufficiently large and the number of training examples is small or constant.
Approximating Partition Functions for Exponential-Family Random Graph Models

Wen Pu, Department of Computer Science, University of Illinois at Urbana-Champaign, IL, USA
Eyal Amir, Department of Computer Science, University of Illinois at Urbana-Champaign, IL, USA
Dorothy Espelage, Department of Educational Psychology, University of Illinois at Urbana-Champaign, IL, USA

Exponential family random graph models are widely used to characterize empirical social network structures. However, the intractability of their partition functions (normalizing constants) renders maximum likelihood estimation a hard task. In this paper, we introduce a linear-time, deterministic approximation to the log partition function for these models. Compared to existing methods, the proposed approximation provides a stable yet scalable alternative for inference. Our technical contribution is two-fold. First, we prove an upper bound on the size of the sufficient-statistics space, yielding an approximation to the partition function via an optimization problem over that space. Second, we develop an asymptotic lower-bound estimate for the optimization problem by leveraging the concentration of measure in that space, which only requires searching through a small set of states. We present theoretical analysis and experiments on real and synthetic datasets showing the effectiveness of the new method.
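
The idea of moving the sum from graphs to statistics can be stated generically (background, not the paper's bound): because g maps the 2^{\binom{n}{2}} possible graphs into a far smaller set S of statistic values, the partition function regroups as

    \kappa(\theta) = \sum_{y} \exp\{\theta^\top g(y)\} = \sum_{s \in S} N(s)\, \exp\{\theta^\top s\},

where N(s) counts the graphs whose statistics equal s; approximating \kappa(\theta) then reduces to reasoning about the dominant terms of this much smaller sum.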
Modeling Scientific Impact with Citation Influence Regression

James Foulds, University of California, Irvine
Padhraic Smyth, University of California, Irvine

When reviewing the scientific literature in a specific subject area, it would be useful to have automatic tools that identify the most influential scientific articles as well as how ideas propagate between articles. Bibliometric measures based on citation counts, such as impact factors, provide some indication of the influence of an article or the prestige of its publication venue. However, citations can occur for different reasons and may not always indicate the transfer of ideas, so that citation counts alone can be misleading. In this paper we develop latent variable probabilistic models for inferring influence in scientific corpora. The models operate on a collection of documents embedded in a citation graph with articles as nodes and citations as edges, where the latent topics of cited papers influence the prior distribution over topics in citing papers. We show how the proposed models can be used to automatically determine the degree of influence of scientific articles, and their influence along the edges of the citation graph.
Bringing Survival Analysis to Information Propagation

Manuel Gomez Rodriguez, MPI for Intelligent Systems and Stanford University
Bernhard Schölkopf, MPI for Intelligent Systems

We propose to model information propagation over implicit or unobserved networks using survival theory. We develop additive and multiplicative risk models under which the network inference problem from cascades can be solved efficiently by exploiting convexity. We show that two previous continuous-time models used in network inference from cascades are particular cases of our additive risk model. We show that the multiplicative model allows previously infected nodes to have either a positive or a negative effect on the instantaneous risk of infection of a node, in contrast with previous approaches, which typically consider only positive effects.
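
For background, the two standard hazard forms that the abstract's terminology echoes (the paper's exact parametrization may differ): writing \lambda(t \mid x) for the instantaneous infection risk of a node at time t given covariates x(t),

    \text{additive (Aalen):} \quad \lambda(t \mid x) = \beta_0(t) + \beta^\top x(t)
    \text{multiplicative (Cox):} \quad \lambda(t \mid x) = \lambda_0(t) \exp\{\beta^\top x(t)\}

Here x(t) can encode which neighbors are already infected; a negative coefficient lowers the hazard, which is how an already-infected node can exert a negative effect on the risk of infection.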
Community Detection in the Labelled Stochastic Block Model

Simon Heimlicher, University of Massachusetts, Amherst
Marc Lelarge, INRIA-ENS
Laurent Massoulie, INRIA-Microsoft Research Joint Centre

We consider the problem of community detection from observed interactions between individuals, in the context where multiple types of interaction are possible. We use labelled stochastic block models to represent the observed data, where labels correspond to interaction types. Focusing on a two-community scenario, we conjecture a threshold for the problem of reconstructing the hidden communities in a way that is correlated with the true partition. To substantiate the conjecture, we prove that the given threshold correctly identifies a transition in the behaviour of belief propagation from insensitive to sensitive. We further prove that the same threshold corresponds to the transition in a related inference problem on a tree model from infeasible to feasible. Finally, numerical results using belief propagation for community detection give further support to the conjecture.
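
Generically, a labelled stochastic block model can be described as follows (a standard formulation, included for context): each node u receives a hidden community \sigma_u, and each pair \{u, v\} independently receives

    \Pr[\text{edge } \{u,v\}] = p_{\sigma_u \sigma_v},
    \qquad
    \Pr[\text{label } \ell \mid \text{edge}] = \mu_{\sigma_u \sigma_v}(\ell),

so both the presence of an interaction and its type carry information about the hidden partition.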
Fitting Dynamic Network Models to Static Network Data

Pavel N. Krivitsky, Penn State University Department of Statistics

It is often of interest to model dynamics of social network evolution for networks, such as those of sexual partnerships, for which the only available data are a snapshot at one time point, if that. Given a plausible model for a network evolving over time, such an observation may be approximated by a draw from the stationary distribution of the network evolution process. However, for nontrivial models for network evolution, this distribution is only defined recursively, so fitting models to the available data calls for a Generalized Method of Moments (GMM) approach, and the moments, in turn, must be obtained by simulating from the process. In this work, we describe the computational challenges associated with this problem, and propose and demonstrate ways of addressing them.
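
Schematically, the generic simulated-moments criterion behind the last step (included to unpack the abstract, not quoted from the paper): choose parameters so that statistics of networks simulated from the process's stationary distribution match the observed statistics,

    \hat{\theta} = \arg\min_\theta \big\| \bar{g}_{\text{sim}}(\theta) - g(y_{\text{obs}}) \big\|_W^2,

where \bar{g}_{\text{sim}}(\theta) averages statistics over long simulation runs and W is a weighting matrix.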
Fast and Scalable Online CCA for Realtime Impact Analysis of Social Media Data

Felix Biessmann, Berlin Institute of Technology
Frank C. Meinecke, Berlin Institute of Technology
Matthias L. Jugel, Berlin Institute of Technology
Mikio Braun, Berlin Institute of Technology

The dynamics of temporal dependencies in time series of web graphs can be used to study the influence of single web sources on other web sources. Previous approaches to the analysis of temporal dynamics in web graph data were either based on simple and manually tuned heuristics or not designed for online applications with massive amounts of data. Here we propose a simple but efficient and robust online learning approach to canonical correlation analysis (CCA). We show that the algorithm converges to the optimal canonical correlations and canonical variates. Using online CCA for canonical trend analysis we can a) assess the impact of a single node on all other nodes, b) explore the temporal dynamics of this impact, and c) interpret the features (e.g. in bag-of-words space) that gave rise to an information cascade. We provide preliminary results showing that we can efficiently estimate canonical trends and thus assess the impact of single users in real time on large data streams obtained from the social network Twitter. Our results represent a first step beyond simple heuristics and towards automated, content-based real-time impact analysis for large-scale data such as social network activity.
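
To fix notation, the batch quantities the online algorithm is said to converge to can be computed directly; a minimal numpy sketch of classical (regularized) CCA, not the authors' online update rule, with an invented helper name:

    import numpy as np

    def cca(X, Y, reg=1e-6):
        # X: (n, p) and Y: (n, q) hold n paired observations in their rows
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
        n = X.shape[0]
        Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # regularized covariances
        Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
        Cxy = X.T @ Y / n
        # singular values of the whitened cross-covariance are the
        # canonical correlations
        Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
        Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
        U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
        A = Wx.T @ U      # canonical directions for X (columns)
        B = Wy.T @ Vt.T   # canonical directions for Y (columns)
        return s, A, B

An online variant would maintain running estimates of these covariances (or of the directions themselves) and update them per sample instead of recomputing the factorization.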