Bruno Ribeiro

Bruno Ribeiro

Associate Professor of Computer Science

Purdue University

Field: Machine Learning



Hall of Data Science and AI 3142
Department of Computer Science
Purdue University
West Lafayette, IN
ribeirob@purdue.edu
C.V. [pdf]

-

Current Students:

PhD

Undegraduate

Former Students:

PhD

Undergraduate

`

Select Publications (see complete list at Google Scholar):

2025

  1. Kaiyuan Zhang, Siyuan Cheng, Guangyu Shen, Bruno Ribeiro, Shengwei An, Pin-Yu Chen, Xiangyu Zhang, Ninghui Li, CENSOR: Defense Against Gradient Inversion via Orthogonal Subspace Bayesian Sampling, NDSS, 2025 [link].
    Abstract []
    Federated learning collaboratively trains a neural network on a global server, where each local client receives the current global model weights and sends back parameter updates (gradients) based on its local private data. The process of sending these model updates may leak client's private data information. Existing gradient inversion attacks can exploit this vulnerability to recover private training instances from a client's gradient vectors. Recently, researchers have proposed advanced gradient inversion techniques that existing defenses struggle to handle effectively. In this work, we present a novel defense tailored for large neural network models. Our defense capitalizes on the high dimensionality of the model parameters to perturb gradients within a subspace orthogonal to the original gradient. By leveraging cold posteriors over orthogonal subspaces, our defense implements a refined gradient update mechanism. This enables the selection of an optimal gradient that not only safeguards against gradient inversion attacks but also maintains model utility. We conduct comprehensive experiments across three different datasets and evaluate our defense against various state-of-the-art attacks and defenses.

2024

  1. Krishna Sri Ipsit Mantri, Xinzhi Wang, Carola-Bibiane Scönlieb, Bruno Ribeiro, Beatrice Bevilacqua, Moshe Eliasof, DiGRAF: Diffeomorphic Graph-Adaptive Activation Function, NeurIPS, 2024 [link].
    Abstract []
  2. Mikhail Galkin, Jincheng Zhou, Bruno Ribeiro, Jian Tang, Zhaocheng Zhu, A Foundation Model for Zero-shot Logical Query Reasoning, NeurIPS 2024 [link].
    Abstract []
  3. Shirley Wu, Kaidi Cao, Bruno Ribeiro, James Zou, Jure Leskovec, GraphMETRO: Mitigating Complex Graph Distribution Shifts via Mixture of Aligned Experts, NeurIPS 2024 [link].
    Abstract []
  4. Abd AlRhman AlQiam, Yuanjun Yao, Zhaodong Wang, Satyajeet Singh Ahuja, Ying Zhang, Sanjay Rao, Bruno Ribeiro, Mohit Tawarmalani, Transferable Neural WAN TE for Changing Topologies, ACM SIGCOMM 2024 [link].
    Abstract []
  5. Mai ElkadyG, Thu BuiG, Bruno Ribeiro, David I. Inouye, Vertical Validation: Evalu- ating Implicit Generative Models for Graphs on Thin Support Regions, UAI 2024.
    Abstract []
    There has been a growing excitement that implicit graph generative models could be used to design or discover new molecules for medicine or material design. Because these molecules have not been discovered, they naturally lie in unexplored or scarcely supported regions of the distribution of known molecules. However, prior evaluation methods for implicit graph generative models have focused on validating statistics computed from the thick support (e.g., mean and variance of a graph property). Therefore, there is a mismatch between the goal of generating novel graphs and the evaluation methods. To address this evaluation gap, we design a novel evaluation method called Vertical Validation (VV) that systematically creates thin support regions during the train-test splitting procedure and then reweights generated samples so that they can be compared to the held-out test data. This procedure can be seen as a generalization of the standard train-test procedure except that the splits are dependent on sample features. We demonstrate that our method can be used to perform model selection if performance on thin support regions is the desired goal. As a side benefit, we also show that our approach can better detect overfitting as exemplified by memorization.
  6. Jiacheng Li, Ninghui Li, Bruno Ribeiro, MIST: Defending Against Membership Inference Attacks Through Membership-Invariant Subspace Training, USENIX Security 2024 [link].
    Abstract []
    In Member Inference (MI) attacks, the adversary try to determine whether an instance is used to train a machine learning (ML) model. MI attacks are a major privacy concern when using private data to train ML models. Most MI attacks in the literature take advantage of the fact that ML models are trained to fit the training data well, and thus have very low loss on training instances. Most defenses against MI attacks therefore try to make the model fit the training data less well. Doing so, however, generally results in lower accuracy. We observe that training instances have different degrees of vulnerability to MI attacks. Most instances will have low loss even when not included in training. For these instances, the model can fit them well without concerns of MI attacks. An effective defense only needs to (possibly implicitly) identify instances that are vulnerable to MI attacks and avoids overfitting them. A major challenge is how to achieve such an effect in an efficient training process. Leveraging two distinct recent advancements in representation learning: counterfactually-invariant representations and subspace learning methods, we introduce a novel Membership-Invariant Subspace Training (MIST) method to defend against MI attacks. MIST avoids overfitting the vulnerable instances without significant impact on other instances. We have conducted extensive experimental studies, comparing MIST with various other state-of-the-art (SOTA) MI defenses against several SOTA MI attacks. We find that MIST outperforms other defenses while resulting in minimal reduction in testing accuracy.
  7. S Chandra Mouli, Muhammad Alam, Bruno Ribeiro, MetaPhysiCa: Improving OOD Robustness in Physics-informed Machine Learning, ICLR 2024 (spotlight) [link].
    Abstract []
    A fundamental challenge in physics-informed machine learning (PIML) is the design of robust PIML methods for out-of-distribution (OOD) forecasting tasks. These OOD tasks require learning-to-learn from observations of the same (ODE) dynamical system with different unknown ODE parameters, and demand accurate forecasts even under out-of-support initial conditions and out-of-support ODE parameters. In this work we propose a solution for such tasks, which we define as a meta-learning procedure for causal structure discovery (including invariant risk minimization). Using three different OOD tasks, we empirically observe that the proposed approach significantly outperforms existing state-of-the-art PIML and deep learning methods.
  8. Beatrice Bevilacqua, Moshe Eliasof, Eli Meirom, Bruno Ribeiro, Haggai Maron, Efficient Subgraph GNNs by Learning Effective Selection Policies, ICLR 2024 [link].
    Abstract []
    Subgraph GNNs are provably expressive neural architectures that learn graph representations from sets of subgraphs. Unfortunately, their applicability is hampered by the computational complexity associated with performing message passing on many subgraphs. In this paper, we consider the problem of learning to select a small subset of the large set of possible subgraphs in a data-driven fashion. We first motivate the problem by proving that there are families of WL-indistinguishable graphs for which there exist efficient subgraph selection policies: small subsets of subgraphs that can already identify all the graphs within the family. We then propose a new approach, called Policy-Learn, that learns how to select subgraphs in an iterative manner. We prove that, unlike popular random policies and prior work addressing the same problem, our architecture is able to learn the efficient policies mentioned above. Our experimental results demonstrate that Policy-Learn outperforms existing baselines across a wide range of datasets.

2023

  1. Jianfei Gao, Yangze Zhou, Jincheng Zhou, Bruno Ribeiro, Double Equivariance for Inductive Link Prediction for Both New Nodes and New Relation Types , arXiv:2302.01313 [link].
    Abstract []
    The task of inductive link prediction in discrete attributed multigraphs (e.g., knowledge graphs, multilayer networks, heterogeneous networks, etc.) generally focuses on test predictions with solely new nodes but not both new nodes and new relation types. In this work, we formally define the task of predicting (completely) new nodes and new relation types in test as a doubly inductive link prediction task and introduce a theoretical framework for the solution. We start by defining the concept of double permutation-equivariant representations that are equivariant to permutations of both node identities and edge relation types. We then propose a general blueprint to design neural architectures that impose a structural representation of relations that can inductively generalize from training nodes and relations to arbitrarily new test nodes and relations without the need for adaptation, side information, or retraining. We also introduce the concept of distributionally double equivariant positional embeddings designed to perform the same task. Finally, we empirically demonstrate the capability of the two proposed models on a set of novel real-world benchmarks, showcasing average relative performance gains of 39.65% on predicting new relations types compared to baselines.
  2. Leonardo Cotta, Beatrice Bevilacqua, Nesreen Ahmed, Bruno Ribeiro, Causal Lifting and Link Prediction , Proceedings of the Royal Society A, 2023 [link].
    Abstract []
    Existing causal models for link prediction assume an underlying set of inherent node factors ---an innate characteristic defined at the node's birth--- that governs the causal evolution of links in the graph. In some causal tasks, however, link formation is path-dependent: The outcome of link interventions depends on existing links. Unfortunately, these existing causal methods are not designed for path-dependent link formation, as the cascading functional dependencies between links (arising from path dependence) are either unidentifiable or require an impractical number of control variables. To overcome this, we develop the first causal model capable of dealing with path dependencies in link prediction. In this work we introduce the concept of causal lifting, an invariance in causal models of independent interest that, on graphs, allows the identification of causal link prediction queries using limited interventional data. Further, we show how structural pairwise embeddings exhibit lower bias and correctly represent the task's causal structure, as opposed to existing node embeddings, \textit{e.g.}, graph neural network node embeddings and matrix factorization. Finally, we validate our theoretical findings on three scenarios for causal link prediction tasks: knowledge base completion, covariance matrix estimation and consumer-product recommendations.
  3. Chandan Botha, Jianfei Gao, Sanjay Rao, Bruno Ribeiro, Veritas: Answering Causal Queries from Video Streaming Traces , ACM SIGCOMM, 2023 [pdf].
    Abstract []
    In this paper, we consider the task of answering what-if questions in the context of adaptive bit rate (ABR) video streaming without access to randomized control trials (RCTs) (e.g., no A/B testing) -- i.e., given recorded data of an existing deployed system, what would be the performance impact if we changed its design. Our work makes three contributions. First, we show the problem is challenging since data may only be available for a single ABR algorithm without RCTs, and since it is necessary to deal with the cascading effects that past ABR decisions have on future decisions. Next we present Veritas, the first framework that tackles causal reasoning for video streaming without requiring data collected through RCTs. Integral to Veritas is an easy-to-interpret domain-specific ML model that relates the latent stochastic process (intrinsic bandwidth that the video session can achieve) to actual observations (download times), while exploiting counterfactual queries via abduction using the observed TCP states (e.g., congestion window) for blocking the cascading dependencies. Third, we evaluate Veritas's ability to accurately answer a wide range of what-if questions using emulation experiments, and data of real video sessions from Puffer. The results show that (i) Veritas accurately tackles a wider range of what-if questions (e.g., change of buffer size or video quality) that existing approaches cannot; (ii) Veritas without RCT training data achieves performance comparable or better than a recent parallel approach that requires RCT data; and (iii) in many scenarios Veritas achieves accuracy close to an ideal oracle.
  4. Jiacheng Li, Ninghui Li, Bruno Ribeiro, Effective passive membership inference attacks in federated learning against overparameterized models , ICLR, 2023 [pdf].
    Abstract []
    This work considers the challenge of performing membership inference attacks in a federated learning setting ---for image classification--- where an adversary can only observe the communication between the central node and a single client (a passive white-box attack). Passive attacks are one of the hardest-to-detect attacks, since they can be performed without modifying how the behavior of the central server or its clients, and assumes *no access to private data instances*. The key insight of our method is empirically observing that, near parameters that generalize well in test, the gradient of large overparameterized neural network models statistically behave like high-dimensional independent isotropic random vectors. Using this insight, we devise two attacks that are often little impacted by existing and proposed defenses. Finally, we validated the hypothesis that our attack depends on the overparametrization by showing that increasing the level of overparametrization (without changing the neural network architecture) positively correlates with our attack effectiveness.

2022

  1. Yun Seong Nam, Jianfei Gao, Chandan Bothra, Ehab Ghabashneh, Sanjay Rao, Bruno Ribeiro, Jibin Zhan, Hui Zhang, Xatu: Richer Neural Network Based Prediction for Video Streaming , SIGMETRICS, 2022 [pdf].
    Abstract []
    The performance of Adaptive Bitrate (ABR) algorithms for video streaming depends on accurately predicting the download time of video chunks. Existing prediction approaches (i) assume chunk download times are dominated by network throughput; and (ii) apriori cluster sessions (e.g., based on ISP and CDN) and only learn from sessions in the same cluster. We make three contributions. First, through analysis of data from real-world video streaming sessions, we show (i) apriori clustering prevents learning from related clusters; and (ii) factors such as the Time to First Byte (TTFB) are key components of chunk download times but not easily incorporated into existing prediction approaches. Second, we propose Xatu, a new prediction approach that jointly learns a neural network sequence model with an interpretable automatic session clustering method. Xatu learns clustering rules across all sessions it deems relevant, and models sequences with multiple chunk-dependent features (e.g., TTFB) rather than just throughput. Third, evaluations using the above datasets and emulation experiments show that Xatu significantly improves prediction accuracies by 23.8% relative to CS2P (a state-of-the-art predictor). We show Xatu provides substantial performance benefits when integrated with multiple ABR algorithms including MPC (a well studied ABR algorithm), and FuguABR (a recent algorithm using stochastic control) relative to their default predictors (CS2P and a fully connected neural network respectively). Further, Xatu combined with MPC outperforms Pensieve, an ABR based on deep reinforcement learning.
  2. Yangze Zhou, Gitta Kutyniok, Bruno Ribeiro, OOD Link Prediction Generalization Capabilities of Message-Passing GNNs in Larger Test Graphs, NeurIPS, 2022 [pdf].
    Abstract []
    This work provides the first theoretical study on the ability of graph Message Passing Neural Networks (gMPNNs) -- such as Graph Neural Networks (GNNs) -- to perform inductive out-of-distribution (OOD) link prediction tasks, where deployment (test) graph sizes are larger than training graphs. We first prove non-asymptotic bounds showing that link predictors based on permutation-equivariant (structural) node embeddings obtained by gMPNNs can converge to a random guess as test graphs get larger. We then propose a theoretically-sound gMPNN that outputs structural pairwise (2-node) embeddings and prove non-asymptotic bounds showing that, as test graphs grow, these embeddings converge to embeddings of a continuous function that retains its ability to predict links OOD. Empirical results on random graphs show agreement with our theoretical results.
  3. S Chandra Mouli, Yangze Zhou, Bruno Ribeiro Bias Challenges in Counterfactual Data Augmentation, UAI 2022 Workshop on Causal Representation Learning, 2022 [pdf].
    Abstract []
    This work provides the first theoretical study on the ability of graph Message Passing Neural Networks (gMPNNs) -- such as Graph Neural Networks (GNNs) -- to perform inductive out-of-distribution (OOD) link prediction tasks, where deployment (test) graph sizes are larger than training graphs. We first prove non-asymptotic bounds showing that link predictors based on permutation-equivariant (structural) node embeddings obtained by gMPNNs can converge to a random guess as test graphs get larger. We then propose a theoretically-sound gMPNN that outputs structural pairwise (2-node) embeddings and prove non-asymptotic bounds showing that, as test graphs grow, these embeddings converge to embeddings of a continuous function that retains its ability to predict links OOD. Empirical results on random graphs show agreement with our theoretical results.
  4. Jianfei Gao, Bruno Ribeiro, On the Equivalence Between Temporal and Static Equivariant Graph Representations, ICML, 2022 [pdf].
    Abstract []
    This work formalizes the associational task of predicting node attribute evolution in temporal graphs from the perspective of learning equivariant representations. We show that node representations in temporal graphs can be cast into two distinct frameworks:(a) The most popular approach, which we denote as time-and-graph, where equivariant graph (eg, GNN) and sequence (eg, RNN) representations are intertwined to represent the temporal evolution of node attributes in the graph; and (b) an approach that we denote as time-then-graph, where the sequences describing the node and edge dynamics are represented first, then fed as node and edge attributes into a static equivariant graph representation that comes after. Interestingly, we show that time-then-graph representations have an expressivity advantage over time-and-graph representations when both use component GNNs that are not most-expressive (eg, 1-Weisfeiler-Lehman GNNs). Moreover, while our goal is not necessarily to obtain state-of-the-art results, our experiments show that time-then-graph methods are capable of achieving better performance and efficiency than state-of-the-art time-and-graph methods in some real-world tasks, thereby showcasing that the time-then-graph framework is a worthy addition to the graph ML toolbox.
  5. S. Chandra Mouli, Bruno Ribeiro, Asymmetry Learning for Counterfactually-invariant Classification in OOD Tasks , ICLR, 2022 (Oral) [pdf].
    Abstract []
    Generalizing from observed to new related environments (out-of-distribution) is central to the reliability of classifiers. However, most classifiers fail to predict label $Y$ from input $X$ when the change in environment is due a (stochastic) input transformation $T^\text{te} \circ X'$ not observed in training, as in training we observe $T^\text{tr} \circ X'$, where $X'$ is a hidden variable. This work argues that when the transformations in train $T^\text{tr}$ and test $T^\text{te}$ are (arbitrary) symmetry transformations induced by a collection of known $m$ equivalence relations, the task of finding a robust OOD classifier can be defined as finding the simplest causal model that defines a causal connection between the target labels and the symmetry transformations that are associated with label changes. We then propose a new learning paradigm, {\em asymmetry learning}, that identifies which symmetries the classifier must break in order to correctly predict $Y$ in both train and test. {\em Asymmetry learning} performs a causal model search that, under certain identifiability conditions, finds classifiers that perform equally well in-distribution and out-of-distribution. Finally, we show how to learn counterfactually-invariant representations with {\em asymmetry learning} in two simulated physics tasks \update{and six image classification tasks.}
  6. Carlos HC Teixeira, Mayank Kakodkar, Vinícius Dias, Wagner Meira, Bruno Ribeiro Sequential stratified regeneration: MCMC for large state spaces with an application to subgraph count estimation, Data Mining and Knowledge Discovery, 2022 [pdf].
    Abstract []
    This work considers the general task of estimating the sum of a bounded function over the edges of a graph, given neighborhood query access and where access to the entire network is prohibitively expensive. To estimate this sum, prior work proposes Markov chain Monte Carlo (MCMC) methods that use random walks started at some seed vertex and whose equilibrium distribution is the uniform distribution over all edges, eliminating the need to iterate over all edges. Unfortunately, these existing estimators are not scalable to massive real-world graphs. In this paper, we introduce Ripple, an MCMC-based estimator that achieves unprecedented scalability by stratifying the Markov chain state space into ordered strata with a new technique that we denote sequential stratified regenerations. We show that the Ripple estimator is consistent, highly parallelizable, and scales well. We empirically evaluate our method by...
  7. Lisette Espín-Noboa, Fariba Karimi, Bruno Ribeiro, Kristina Lerman, Claudia Wagner, Explaining classification performance and bias via network structure and sampling technique, Applied Network Science, 2022 [pdf].
    Abstract []
    Social networks are very important carriers of information. For instance, the political leaning of our friends can serve as a proxy to identify our own political preferences. This explanatory power is leveraged in many scenarios ranging from business decision-making to scientific research to infer missing attributes using machine learning. However, factors affecting the performance and the direction of bias of these algorithms are not well understood. To this end, we systematically study how structural properties of the network and the training sample influence the results of collective classification. Our main findings show that (i) mean classification performance can empirically and analytically be predicted by structural properties such as homophily, class balance, edge density and sample size, (ii) small training samples are enough for heterophilic networks to achieve high and unbiased classification performance, even with imperfect model estimates, (iii) homophilic networks are more prone to bias issues and low performance when group size differences increase, (iv) when sampling budgets are small, partial crawls achieve the most accurate model estimates, and degree sampling achieves the highest overall performance. Our findings help practitioners to better understand and evaluate their results when sampling budgets are small or when no ground-truth is available.

2021

  1. Leonardo Cotta, Christopher Morris, Bruno Ribeiro, Reconstruction for Powerful Graph Representations , NeurIPS 2021 [pdf] [code].
    Abstract []
    Graph neural networks (GNNs) have limited expressive power, failing to represent many graph classes correctly. While more expressive graph representation learning (GRL) alternatives can distinguish some of these classes, they are significantly harder to implement, may not scale well, and have not been shown to outperform well-tuned GNNs in real-world tasks. Thus, devising simple, scalable, and expressive GRL architectures that also achieve real-world improvements remains an open challenge. In this work, we show the extent to which graph reconstruction -- reconstructing a graph from its subgraphs -- can mitigate the theoretical and practical problems currently faced by GRL architectures. First, we leverage graph reconstruction to build two new classes of expressive graph representations. Secondly, we show how graph reconstruction boosts the expressive power of any GNN architecture while being a (provably) powerful inductive bias for invariances to vertex removals. Empirically, we show how reconstruction can boost GNN's expressive power -- while maintaining its invariance to permutations of the vertices -- by solving seven graph property tasks not solvable by the original GNN. Further, we demonstrate how it boosts state-of-the-art GNN's performance across nine real-world benchmark dataset.
  2. Beatrice Bevilacqua, Yangze Zhou, Bruno Ribeiro, Size-Invariant Graph Representations for Graph Classification Extrapolations , ICML 2021 (long talk) [pdf] [code].
    Abstract []
    In general, graph representation learning methods assume that the test and train data come from the same distribution. In this work we consider an underexplored area of an otherwise rapidly developing field of graph representation learning: The task of out-of-distribution (OOD) graph classification, where train and test data have different distributions, with test data unavailable during training. Our work shows it is possible to use a causal model to learn approximately invariant representations that better extrapolate between train and test data. Finally, we conclude with synthetic and real-world dataset experiments showcasing the benefits of representations that are invariant to train/test distribution shifts.
  3. Mengyue Hang, Jennifer Neville, Bruno Ribeiro, A Collective Learning Framework to Boost GNN Expressiveness , ICML 2021 [pdf].
    Abstract []
    Graph Neural Networks (GNNs) have recently been used for node and graph classification tasks with great success, but GNNs model dependencies among the attributes of nearby neighboring nodes rather than dependencies among observed node labels. In this work, we consider the task of inductive node classification using GNNs in supervised and semi-supervised settings, with the goal of incorporating label dependencies. Because current GNNs are not universal (i.e., most-expressive) graph representations, we propose a general collective learning approach to increase the representation power of any existing GNN. Our framework combines ideas from collective classification with self-supervised learning, and uses a Monte Carlo approach to sampling embeddings for inductive learning across graphs. We evaluate performance on five real-world network datasets and demonstrate consistent, significant improvement in node classification accuracy, for a variety of state-of-the-art GNNs.
  4. S Chandra Mouli and Bruno Ribeiro, Neural Networks for Learning Counterfactual G-Invariances from Single Environments , ICLR 2021 [pdf].
    Abstract []
    Despite ---or maybe because of--- their astonishing capacity to fit data, neural networks are believed to have difficulties extrapolating beyond training data distribution. This work shows that, for extrapolations based on finite transformation groups, a model's inability to extrapolate is unrelated to its capacity. Rather, the shortcoming is inherited from a learning hypothesis: Examples not explicitly observed with infinitely many training examples have underspecified outcomes in the learner’s model. In order to endow neural networks with the ability to extrapolate over group transformations, we introduce a learning framework counterfactually-guided by the learning hypothesis that any group invariance to (known) transformation groups is mandatory even without evidence, unless the learner deems it inconsistent with the training data.
  5. Mohsen Minaei, S Chandra Mouli, Mainack Mondal, Bruno Ribeiro, Aniket Kate, Deceptive Deletions for Protecting Withdrawn Posts on Social Media Platform, NDSS 2021 [pdf].
    Abstract []
    Over-sharing poorly-worded thoughts and personal information is prevalent on online social platforms. In many of these cases, users regret posting such content. To retrospectively rectify these errors in users' sharing decisions, most platforms offer (deletion) mechanisms to withdraw the content, and social media users often utilize them. Ironically and perhaps unfortunately, these deletions make users more susceptible to privacy violations by malicious actors who specifically hunt post deletions at large scale. The reason for such hunting is simple: deleting a post acts as a powerful signal that the post might be damaging to its owner. Today, multiple archival services are already scanning social media for these deleted posts. Moreover, as we demonstrate in this work, powerful machine learning models can detect damaging deletions at scale. Towards restraining such a global adversary against users' right to be forgotten, we introduce Deceptive Deletion, a decoy mechanism that minimizes the adversarial advantage. Our mechanism injects decoy deletions, hence creating a two-player minmax game between an adversary that seeks to classify damaging content among the deleted posts and a challenger that employs decoy deletions to masquerade real damaging deletions. We formalize the Deceptive Game between the two players, determine conditions under which either the adversary or the challenger provably wins the game, and discuss the scenarios in-between these two extremes. We apply the Deceptive Deletion mechanism to a real-world task on Twitter: hiding damaging tweet deletions. We show that a powerful global adversary can be beaten by a powerful challenger, raising the bar significantly and giving a glimmer of hope in the ability to be really forgotten on social platforms.

2020

  1. Leonardo Cotta, Carlos H. C. Teixeira, Ananthram Swami, Bruno Ribeiro, Unsupervised Joint k-node Graph Representations with Compositional Energy-Based Models, NeurIPS 2020 [pdf].
    Abstract []
    Existing Graph Neural Network (GNN) methods that learn inductive unsupervised graph representations focus on learning node and edge representations by predicting observed edges in the graph. Although such approaches have shown advances in downstream node classification tasks, they are ineffective in jointly representing larger k-node sets, k > 2. We propose MHM-GNN, an inductive unsupervised graph representation approach that combines joint k-node representations with energy-based models (hypergraph Markov networks) and GNNs. To address the intractability of the loss that arises from this combination, we endow our optimization with a loss upper bound using a finite-sample unbiased Markov Chain Monte Carlo estimator. Our experiments show that the unsupervised MHM-GNN representations of MHM-GNN produce better unsupervised representations than existing approaches from the literature.
  2. Amit Sheoran, Sonia Fahmy, Matthew Osinski, Chunyi Peng, Bruno Ribeiro, Jia Wang, Experience: towards automated customer issue resolution in cellular networks, MOBICOM 2020 [link].
    Abstract []
    Cellular service carriers often employ reactive strategies to assist customers who experience non-outage related individual service degradation issues (e.g., service performance degradations that do not impact customers at scale and are likely caused by network provisioning issues for individual devices). Customers need to contact customer care to request assistance before these issues are resolved. This paper presents our experience with PACE (ProActive customer CarE), a novel, proactive system that monitors, troubleshoots and resolves individual service issues, without having to rely on customers to first contact customer care for assistance. PACE seeks to improve customer experience and care operation efficiency by automatically detecting individual (non-outage related) service issues, prioritizing repair actions by predicting customers who are likely to contact care to report their issues, and proactively triggering actions to resolve these issues. We develop three machine learning-based prediction models, and implement a fully automated system that integrates these prediction models and takes resolution actions for individual customers. We conduct a large-scale trace-driven evaluation using real-world data collected from a major cellular carrier in the US, and demonstrate that PACE is able to predict customers who are likely to contact care due to non-outage related individual service issues with high accuracy. We further deploy PACE into this cellular carrier network. Our field trial results show that PACE is effective in proactively resolving non-outage related individual customer service issues, improving customer experience, and reducing the need for customers to report their service issues.
  3. P. C. Sruthi, Sanjay Rao, Bruno Ribeiro, Pitfalls of data-driven networking: A case study of latent causal confounders in video streaming, ACM SIGCOMM 2020 Workshop on Network Meets AI & ML (NetAI 2020) [pdf].
    Abstract []
    This paper motivates the need to support counterfactual reasoning (i.e., answer "what-if" questions about events that did not occur) when collecting network data. We focus on video streaming -- e.g., given logs of a video session, a video publisher may ask whether a user would continue to experience no rebuffering events if the lowest quality video choice were eliminated. We discuss potential pitfalls related to counterfactual reasoning, and argue that dynamic network state (e.g., bandwidth) serves as a confounding yet hidden (latent) feature that complicates such analyses. We illustrate the challenges, and present preliminary methods to address them using concrete examples. Our evaluations show that existing approaches, including randomized trials (collecting data from an algorithm that selects bitrates randomly), are by themselves inadequate for counterfactual reasoning related to video streaming, and must be supplemented by techniques that explicitly infer latent features.
  4. Balasubramaniam Srinivasan, Bruno Ribeiro, On the Equivalence between Positional Node Embeddings and Structural Graph Representations, ICLR 2020 [pdf].
    Abstract []
    This work provides the first unifying theoretical framework for node embeddings and structural graph representations, bridging methods like matrix factorization and graph neural networks. Using invariant theory, we show that the relationship between structural representations and node embeddings is analogous to that of a distribution and its samples. We prove that all tasks that can be performed by node embeddings can also be performed by structural representations and vice-versa. We also show that the concept of transductive and inductive learning is unrelated to node embeddings and graph representations, clearing another source of confusion in the literature. Finally, we introduce new practical guidelines to generating and using node embeddings, which fixes significant shortcomings of standard operating procedures used today.
    Errata []
    This version corrects some typos in the definitions, specially that of \Sigma, it should be \Sigma_n.
  5. Jianfei Gao, Mohamed A. Zahran, Amit Sheoran, Sonia Fahmy, Bruno Ribeiro, Infinity Learning: Learning Markov Chains from Aggregate Steady-State Observations, AAAI 2020 [pdf].
    Abstract []
    We consider the task of learning a parametric Continuous Time Markov Chain (CTMC) sequence model without examples of sequences, where the training data consists entirely of aggregate steady-state statistics. Making the problem harder, we assume that the states we wish to predict are unobserved in the training data. Specifically, given a parametric model over the transition rates of a CTMC and some known transition rates, we wish to extrapolate its steady state distribution to states that are unobserved. A technical roadblock to learn a CTMC from its steady state has been that the chain rule to compute gradients will not work over the arbitrarily long sequences necessary to reach steady state ---from where the aggregate statistics are sampled. To overcome this optimization challenge, we propose $\infty$-SGD, a principled stochastic gradient descent method that uses randomly-stopped estimators to avoid infinite sums required by the steady state computation, while learning even when only a subset of the CTMC states can be observed. We apply $\infty$-SGD to a real-world testbed and synthetic experiments showcasing its accuracy, ability to extrapolate the steady state distribution to unobserved states under unobserved conditions (heavy loads, when training under light loads), and succeeding in difficult scenarios where even a tailor-made extension of existing methods fails.

2019

  1. Changping Meng, Jiasen Yang, Bruno Ribeiro, Jennifer Neville, HATS: A Hierarchical Sequence-Attention Framework for Inductive Set-of-Sets Embeddings, KDD, 2019 [pdf].
    Abstract []
    In many complex domains, the input data are often not suited for the typical vector representations used in deep learning models. For example, in relational learning and computer vision tasks, the data are often better represented as sets (e.g., the neighborhood of a node, a cloud of points). In these cases, a key challenge is to learn an embedding function that is invariant to permutations of the input. While there has been some recent work on principled methods for learning permutation-invariant representations of sets, these approaches are limited in their applicability to set-of-sets (SoS) tasks, such as subgraph prediction and scene classification. In this work, we develop a deep neural network framework to learn inductive SoS embeddings that are invariant to SoS permutations. Specifically, we propose HATS, a hierarchical sequence model with attention mechanisms for inductive set-of-sets embeddings. We develop stochastic optimization and inference methods for learning HATS, and our experiments demonstrate that HATS achieves superior performance across a wide range of set-of-sets tasks
  2. Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak Rao, Bruno Ribeiro, Relational Pooling for Graph Representations, ICML, 2019 [pdf].
    Abstract []
    This work generalizes graph neural networks (GNNs) beyond those based on the Weisfeiler-Lehman (WL) algorithm, graph Laplacians, and graph diffusion kernels. Our approach, denoted Relational Pooling (RP), draws from the theory of finite partial exchangeability to provide a framework with maximal representation power for graphs. RP can work with existing graph representation models, and somewhat counterintuitively, can make them even more powerful than the original WL isomorphism test. Additionally, RP is the first theoretically sound framework to use architectures like Recurrent Neural Networks and Convolutional Neural Networks for graph classification. RP also has graph kernels as a special case. We demonstrate improved performance of novel RP-based graph representations over current state-of-the-art methods on a number of tasks.
  3. Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak Rao, Bruno Ribeiro, Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs, ICLR, 2019 [pdf].
    Abstract []
    We consider a simple and overarching representation for permutation-invariant functions of sequences (or set functions). Our approach, which we call Janossy pooling, expresses a permutation-invariant function as the average of a permutation-sensitive function applied to all reorderings of the input sequence. This allows us to leverage the rich and mature literature on permutation-sensitive functions to construct novel and flexible permutation-invariant functions. If carried out naively, Janossy pooling can be computationally prohibitive. To allow computational tractability, we consider three kinds of approximations: canonical orderings of sequences, functions with k-order interactions, and stochastic optimization algorithms with random permutations. Our framework unifies a variety of existing work in the literature, and suggests possible modeling and algorithmic extensions. We explore a few in our experiments, which demonstrate improved performance over current state-of-the-art methods.
  4. Leonardo Teixeira, Brian Jalaian, Bruno Ribeiro, Are Graph Neural Networks Miscalibrated?, Learning and Reasoning with Graph-Structured Representations Workshop @ ICML (Spotlight), 2019 [pdf].
    Abstract []
    Graph Neural Networks (GNNs) have proven to be successful in many classification tasks, outperforming previous state-of-the-art methods in terms of accuracy. However, accuracy alone is not enough for high-stakes decision making. Decision makers want to know the likelihood that a specific GNN prediction is correct. For this purpose, obtaining calibrated models is essential. In this work, we analyze the calibration of state-of-the-art GNNs on multiple datasets. Our experiments show that GNNs can be calibrated in some datasets but also badly miscalibrated in others, and that state-of-the-art calibration methods are helpful but do not fix the problem.
  5. Guangyu Li, Yong Liu, Bruno Ribeiro, Hao Ding, On group popularity prediction in event-based social networks, IEEE Transactions on Network Science and Engineering, 2019 [link].
    Abstract []
    Event-based social networks (EBSN) have recently emerged as an important complement to online social networks. They enjoy the advantages of both online social networks and offline social communities: offline social events can be conveniently organized online, and users interact with each other face-to-face in the organized offline events. Although previous work has shown that member and structural features are important to the future popularity of groups in EBSN, it is not yet clear how different member roles and the interplay between them contribute to group popularity. In this paper, we study a real-world dataset from Meetup a popular EBSN platform and propose a deep neural network based method to predict the popularity of new Meetup groups. Our method uses group-level features specific to event-based social networks, such as time and location of events in a group, as well as the structural features internal to a group, such as the inferred member roles in a group and social substructures among members. Empirically, our approach reduces the nRMSE of the popularity prediction (measured in RSVPs) of a group's future events by up to 12%, against the state-of-the-art baselines. Through case studies, our method also identifies member and structure patterns that are most predictive of a group's future popularity. Our study provides new understanding about what makes a group successful in EBSN.
  6. Fabricio Murai, Bruno Ribeiro, Don Towlsey, Pinghui Wang, Characterizing directed and undirected networks via multidimensional walks with jumps, ACM Transactions on Knowledge Discovery from Data (TKDD), 2019 [pdf].
    Abstract []
    Estimating distributions of node characteristics (labels) such as number of connections or citizenship of users in a social network via edge and node sampling is a vital part of the study of complex networks. Due to its low cost, sampling via a random walk (RW) has been proposed as an attractive solution to this task. Most RW methods assume either that the network is undirected or that walkers can traverse edges regardless of their direction. Some RW methods have been designed for directed networks where edges coming into a node are not directly observable. In this work, we propose Directed Unbiased Frontier Sampling (DUFS), a sampling method based on a large number of coordinated walkers, each starting from a node chosen uniformly at random. It is applicable to directed networks with invisible incoming edges because it constructs, in real-time, an undirected graph consistent with the walkers trajectories, and due to the use of random jumps which prevent walkers from being trapped. DUFS generalizes previous RW methods and is suited for undirected networks and to directed networks regardless of in-edges visibility. We also propose an improved estimator of node label distributions that combines information from the initial walker locations with subsequent RW observations. We evaluate DUFS, compare it to other RW methods, investigate the impact of its parameters on estimation accuracy and provide practical guidelines for choosing them. In estimating outdegree distributions, DUFS yields significantly better estimates of the head of the distribution than other methods, while matching or exceeding estimation accuracy of the tail. Last, we show that DUFS outperforms uniform node sampling when estimating distributions of node labels of the top 10% largest degree nodes, even when sampling a node uniformly has the same cost as RW steps.
  7. S Chandra Mouli, Leonardo Teixeira, Jennifer Neville, Bruno Ribeiro, Deep Lifetime Clustering, arXiv:1910.00547, 2019 [pdf].
    Abstract []
    The goal of lifetime clustering is to develop an inductive model that maps subjects into K clusters according to their underlying (unobserved) lifetime distribution. We introduce a neural-network based lifetime clustering model that can find cluster assignments by directly maximizing the divergence between the empirical lifetime distributions of the clusters. Accordingly, we define a novel clustering loss function over the lifetime distributions (of entire clusters) based on a tight upper bound of the two-sample Kuiper test p-value. The resultant model is robust to the modeling issues associated with the unobservability of termination signals, and does not assume proportional hazards. Our results in real and synthetic datasets show significantly better lifetime clusters (as evaluated by C-index, Brier Score, Logrank score and adjusted Rand index) as compared to competing approaches.

2018

  1. Pedro Savarese, Mayank Kakodar, Bruno Ribeiro, From Monte Carlo to Las Vegas: Improving Restricted Boltzmann Machine Training through Stopping Sets, AAAI, 2018 [arXiv:1711.08442] [pdf].
    Abstract []
    We propose a Las Vegas transformation of Markov Chain Monte Carlo (MCMC) estimators of Restricted Boltzmann Machines (RBMs). We denote our approach {\em Markov Chain Las Vegas} (MCLV). MCLV gives statistical guarantees in exchange for random running times. MCLV uses a stopping set built from the training data and has maximum number of Markov chain steps $K$ (referred as MCLV-$K$). We present a MCLV-$K$ gradient estimator (LVS-$K$) for RBMs and explore the correspondence and differences between LVS-$K$ and Contrastive Divergence (CD-$K$), with LVS-$K$ significantly outperforming CD-$K$ training RBMs over the MNIST dataset, indicating MCLV to be a promising direction in learning generative models.
  2. Jason Meng, Chandra S. Mouli, Bruno Ribeiro, Jennifer Neville, Subgraph Pattern Neural Networks for High-Order Graph Evolution Prediction, AAAI, 2018 [src + data][pdf].
    Abstract []
    In this work we generalize traditional node/link prediction tasks in dynamic heterogeneous networks, to consider joint prediction over larger $k$-node induced subgraphs. Our key insight is to incorporate the unavoidable dependencies in the training observations of induced subgraphs into both the input features and the model architecture itself via high-order dependencies. The strength of the representation is its invariance to isomorphisms and varying local neighborhood sizes, while still being able to take node/edge labels into account, and %enabling inductive facilitating inductive reasoning (i.e., generalization to unseen portions of the network). Empirical results show that our proposed method significantly outperforms other state-of-the-art methods designed for static and/or single node/link prediction tasks. In addition, we show that our method is scalable and learns interpretable parameters.
  3. Carlos Teixeira, Leonardo Cotta, Bruno Ribeiro, and Wagner Meira Jr., Graph Pattern Mining and Learning through User-defined Relations, ICDM, 2018 [pdf].
    Abstract []
    In this work we propose R-GPM, a parallel computing framework for graph pattern mining (GPM) through a user-defined subgraph relation. More specifically, we enable the computation of statistics of patterns through their subgraph classes, generalizing traditional GPM methods. R-GPM provides efficient estimators for these statistics by employing a MCMC sampling algorithm combined with several optimizations. We provide both theoretical guarantees and empirical evaluations of our estimators in application scenarios such as stochastic optimization of deep high-order graph neural network models and pattern (motif) counting. We also propose and evaluate optimizations that enable improvements of our estimators accuracy, while reducing their computational costs in up to 3-orders-of-magnitude. Finally,we show that R-GPM is scalable, providing near-linear speedups on 44 cores in all of our tests.
  4. Zahaib Akhtar, Yun Seong Nam, Ramesh Govindan, Sanjay Rao, Jessica Chen, Ethan Katz-Bassett, Bruno Ribeiro, Jibin Zhan, and Hui Zhang, Oboe: auto-tuning video ABR algorithms to network conditions, SIGCOMM, 2018 [link].
    Abstract []
    Most content providers are interested in providing good video delivery QoE for all users, not just on average. State-of-the-art ABR algorithms like BOLA and MPC rely on parameters that are sensitive to network conditions, so may perform poorly for some users and/or videos. In this paper, we propose a technique called Oboe to auto-tune these parameters to different network conditions. Oboe pre-computes, for a given ABR algorithm, the best possible parameters for different network conditions, then dynamically adapts the parameters at run-time for the current network conditions. Using testbed experiments, we show that Oboe significantly improves BOLA, MPC, and a commercially deployed ABR. Oboe also betters a recently proposed reinforcement learning based ABR, Pensieve, by 24% on average on a composite QoE metric, in part because it is able to better specialize ABR behavior across different network states.
  5. Guangyu Li, Yong Liu, Bruno Ribeiro and Hao Ding, On Group Popularity Prediction in Event-Based Social Networks, ICWSM, 2018 [link].
    Abstract []
    Although previous work has shown that member and structural features are important to the future popularity of groups in EBSN, it is not yet clear how different member roles and the interplay between them contribute to group popularity. In this paper, we study a real-world dataset from Meetup — a popular EBSN platform — and propose a deep neural network based method to predict the popularity of new Meetup groups. Our method uses group-level features specific to event-based social networks, such as time and location of events in a group, as well as the structural features internal to a group, such as the inferred member roles in a group and social substructures among members. Empirically, our approach reduces the RMSE of the popularity prediction (measured in RSVPs) of a group’s future events by up to 12%, against the state-of-the-art baselines.
  6. Mohamed S Hassan, Bruno Ribeiro, Walid G Aref, SBG-sketch: a self-balanced sketch for labeled-graph stream summarization, ICSSDM, 2018 [arXiv:1711.08442] [pdf].
  7. Pinghui Wang, Junzhou Zhao, Bruno Ribeiro, John C.S. Lui, Don Towsley, and Xiaohong Guan, Practical Characterization of Large Networks Using Neighborhood Information, Knowledge and Information Systems 2018 (accepted).
    Abstract []
    Characterizing large online social networks (OSNs) through node querying is a challenging task. OSNs often impose severe constraints on the query rate, hence limiting the sample size to a small fraction of the total network. Various ad-hoc subgraph sampling methods have been proposed, but many of them give biased estimates and no theoretical basis on the accuracy. In this work, we focus on developing sampling methods for OSNs where querying a node also reveals partial structural information about its neighbors. Our methods are optimized for NoSQL graph databases (if the database can be accessed directly), or utilize Web API available on most major OSNs for graph sampling. We show that our sampling method has provable convergence guarantees on being an unbiased estimator, and it is more accurate than current state-of-the-art methods. We characterize metrics such as node label density estimation and edge label density estimation, two of the most fundamental network characteristics from which other network characteristics can be derived. We evaluate our methods on-the-fly over several live networks using their native APIs. Our simulation studies over a variety of offline datasets show that by including neighborhood information, our method drastically (4-fold) reduces the number of samples required to achieve the same estimation accuracy of state-of-the-art methods.

2017

  1. Fabricio Murai, Diogo Rennó, Bruno Ribeiro, Gisele L. Pappa, Don Towsley, Krista Gile, Selective Harvesting over Networks, Data Mining and Knowledge Discovery, 2017 [arXiv:1703.05082].
    Abstract []
    Active search (AS) on graphs focuses on collecting certain labeled nodes (targets) given global knowledge of the network topology and its edge weights under a query budget. However, in most networks, nodes, topology and edge weights are all initially unknown. We introduce selective harvesting, a variant of AS where the next node to be queried must be chosen among the neighbors of the current queried node set; the available training data for deciding which node to query is restricted to the subgraph induced by the queried set (and their node attributes) and their neighbors (without any node or edge attributes). Therefore, selective harvesting is a sequential decision problem, where we must decide which node to query at each step. A classifier trained in this scenario suffers from a tunnel vision effect: without recourse to independent sampling, the urge to query promising nodes forces classifiers to gather increasingly biased training data, which we show significantly hurts the performance of AS methods and standard classifiers. We find that it is possible to collect a much larger set of targets by using multiple classifiers, not by combining their predictions as an ensemble, but switching between classifiers used at each step, as a way to ease the tunnel vision effect. We discover that switching classifiers collects more targets by (a) diversifying the training data and (b) broadening the choices of nodes that can be queried next. This highlights an exploration, exploitation, and diversification trade-off in our problem that goes beyond the exploration and exploitation duality found in classic sequential decision problems. From these observations we propose D3TS, a method based on multi-armed bandits for non-stationary stochastic processes that enforces classifier diversity, matching or exceeding the performance of competing methods on seven real network datasets in our evaluation.
  2. Jiasen Yang, Bruno Ribeiro, Jennifer Neville, Stochastic Gradient Descent for Relational Logistic Regression via Partial Network Crawls, UAI Seventh International Workshop on Statistical Relational AI, 2017 [arXiv:1703.07716].
    Abstract []
    Research in statistical relational learning has produced a number of methods for learning relational models from large-scale network data. While these methods have been successfully applied in various domains, they have been developed under the unrealistic assumption of full data access. In practice, however, the data are often collected by crawling the network, due to proprietary access, limited resources, and privacy concerns. Recently, we showed that the parameter estimates for relational Bayes classifiers computed from network samples collected by existing network crawlers can be quite inaccurate, and developed a crawl-aware estimation method for such models (Yang, Ribeiro, and Neville, 2017). In this work, we extend the methodology to learning relational logistic regression models via stochastic gradient descent from partial network crawls, and show that the proposed method yields accurate parameter estimates and confidence intervals.
  3. Jiasen Yang, Bruno Ribeiro, Jennifer Neville, Should We Be Confident in Peer Effects Estimated from Partial Crawls of Social Networks?, ICWSM (poster), 2017 [pdf].
    Abstract []
    Research in social network analysis and statistical relational learning has produced a number of methods for learning relational models from large-scale network data. Unfortunately, these methods have been developed under the unrealistic assumption of full data access. In practice, however, the data are often collected by crawling the network, due to proprietary access, limited resources, and privacy concerns. While prior studies have examined the impact of network crawling on the structural characteristics of the resulting samples, this work presents the first empirical study designed to assess the impact of widely used network crawlers on the estimation of peer effects. Our experiments demonstrate that the estimates obtained from network samples collected by existing crawlers can be quite inaccurate, unless a significant portion of the network is crawled. Meanwhile, motivated by recent advances in partial network crawling, we develop crawl-aware relational methods that provide accurate estimates of peer effects with statistical guarantees from partial crawls.
  4. Kun Tu, Bruno Ribeiro, Ananthram Swami and Don Towsley, Temporal Clustering in time-varying Networks with Time Series Analysis, NIPS 2017 Time Series Workshop.
    Abstract []
  5. Mohamed S. Hassan, Bruno Ribeiro, Walid G. Aref, SBG-Sketch: A Self-Balanced Sketch for Labeled-Graph Stream Summarization, (preprint) [arXiv:1709.06723].
    Abstract []
    Applications in various domains rely on processing graph streams, e.g., communication logs of a cloud-troubleshooting system, road-network traffic updates, and interactions on a social network. A labeled-graph stream refers to a sequence of streamed edges that form a labeled graph. Label-aware applications need to filter the graph stream before performing a graph operation. Due to the large volume and high velocity of these streams, it is often more practical to incrementally build a lossy-compressed version of the graph, and use this lossy version to approximately evaluate graph queries. Challenges arise when the queries are unknown in advance but are associated with filtering predicates based on edge labels. Surprisingly common, and especially challenging, are labeled-graph streams that have highly skewed label distributions that might also vary over time. This paper introduces Self-Balanced Graph Sketch (SBG-Sketch, for short), a graphical sketch for summarizing and querying labeled-graph streams that can cope with all these challenges. SBG-Sketch maintains synopsis for both the edge attributes (e.g., edge weight) as well as the topology of the streamed graph. SBG-Sketch allows efficient processing of graph-traversal queries, e.g., reachability queries. Experimental results over a variety of real graph streams show SBG-Sketch to reduce the estimation errors of state-of-the-art methods by up to 99%.

2016

  1. Flavio Figueiredo, Bruno Ribeiro, Jussara Almeida, Christos Faloutsos, TribeFlow: Mining & Predicting User Trajectories, ACM International Conference on World Wide Web (WWW'16), 2016
    Version with corrected eq. (1): [pdf] [src + data].
    Earlier technical report at [arXiv:1511.01032].
    Abstract []
    Which song will Smith listen to next? Which restaurant will Alice go to tomorrow? Which product will John click next? These applications have in common the prediction of user trajectories that are in a constant state of flux over a hidden network (e.g. website links, geographic location). What users are doing now may be unrelated to what they will be doing in an hour from now. Mindful of these challenges we propose TribeFlow, a method designed to cope with the complex challenges of learning personalized predictive models of non-stationary, transient, and time-heterogeneous user trajectories. TribeFlow is a general method that can perform next product recommendation, next song recommendation, next location prediction, and general arbitrary-length user trajectory prediction without domain-specific knowledge. TribeFlow is more accurate and up to 413x faster than top competitors.
  2. Bo Jiang, Daniel R. Figueiredo, Bruno Ribeiro, Don Towsley, On the Duration and Intensity of Competitions in Nonlinear Polya Urn Processes with Fitness, ACM SIGMETRICS, 2016 [pdf] Best Paper Award.
    Abstract []
  3. Konstantin Avrachenkov, Bruno Ribeiro, Jithin K. Sreedharan, Inference in OSNs via Lightweight Partial Crawls, ACM SIGMETRICS, 2016.
    Final version [pdf]
    Earlier version arXiv:1510.05407, 2015 [pdf].
    Abstract []
  4. Flavio Figueiredo, Bruno Ribeiro, Jussara M. Almeida, Christos Faloutsos, A Summary of the TribeFlow Model for Music Discovery Applications, ICML 2016 @ Machine Learning for Music Discovery Workshop (invited).
  5. Flavio Figueiredo, Bruno Ribeiro, Christos Faloutsos, Nazareno Andrade, Jussara M. Almeida, Mining Online Music Listening Trajectories, ISMIR: 17th International Symposium/Conference on Music Information Retrieval, 2016 (accepted).
    Final version [pdf]
    Abstract []
    Understanding the listening habits of users is a valuable undertaking for musicology researchers, artists, consumers and online businesses alike. With the rise of Online Mu- sic Streaming Services (OMSSs), large amounts of user behavioral data can be exploited for this task. In this pa- per, we present SWIFT-FLOWS, an approach that mod- els user listening habits in regards to how user attention transitions between artists. SWIFT-FLOWS combines re- cent advances in trajectory mining, coupled with mod- ulated Markov models as a means to capture both how users switch attention from one artist to another, as well as how users fixate their attention in a single artist over short or large periods of time. We employ SWIFT-FLOWS on OMSSs datasets showing that it provides: (1) semantically meaningful representation of habits; (2) accurately models the attention span of users.

2015

  1. Bruno Ribeiro, Minh Hoang, Ambuj Singh, Beyond Models: Forecasting Complex Network Processes Directly from Data, ACM International Conference on World Wide Web (WWW'15), 2015 [pdf] [Tech Report] [code + data] [pptx].
    Abstract []
    Complex network phenomena - such as information cascades in online social networks - are hard to fully observe, model, and forecast. In forecasting, a recent trend has been to forgo the use of parsimonious models in favor of models with increasingly large degrees of freedom that are trained to learn the behavior of a process from historical data. Extrapolating this trend into the future, eventually we would renounce models all together. * But is it possible to forecast the evolution of a complex stochastic process directly from the data without a model? * In this work we show that model-free forecasting is possible. We present SED, an algorithm that forecasts process statistics based on relationships of statistical equivalence using two general axioms and historical data. To the best of our knowledge, SED is the first method that can perform axiomatic, model-free forecasts of complex stochastic processes. Our simulations using simple and complex evolving processes and tests performed on a large real-world dataset show promising results.
  2. Bruno Ribeiro, Christos Faloutsos, Modeling Website Popularity Competition in the Attention-Activity Marketplace, ACM International Conference on Web Search and Data (WSDM'15), 2015 [arXiv:1403.0600] [pptx].
    Abstract []
    How does a new startup drive the popularity of competing websites into oblivion like Facebook famously did to MySpace? This question is of great interest to academics, technologists, and financial investors alike. In this work we exploit the singular way in which Facebook wiped out the popularity of MySpace, Hi5, Friendster, and Multiply to guide the design of a new popularity competition model. Our model provides new insights into what Nobel Laureate Herbert A. Simon called the "marketplace of attention," which we recast as the attention-activity marketplace. Our model design is further substantiated by user-level activity of 250,000 MySpace users obtained between 2004 and 2009. The resulting model not only accurately fits the observed Daily Active Users (DAU) of Facebook and its competitors but also predicts their fate four years into the future.
  3. Bo Jiang, Liyuan Sun, Daniel R Figueiredo, Bruno Ribeiro and Don Towsley, On the duration and intensity of cumulative advantage competitions, J. Stat., 2015 (accepted) [pdf].
    Abstract []
    Network growth can be framed as a competition for edges among nodes in the network. As with various other social and physical systems, skill (fitness) and luck (random chance) act as fundamental forces driving competition dynamics. In the context of networks, cumulative advantage (CA) - the rich-get-richer effect — is seen as a driving principle governing the edge accumulation process. However, competitions coupled with CA exhibit non-trivial behavior and little is formally known about duration and intensity of CA competitions. By isolating two nodes in an ideal CA competition, we provide a mathematical understanding of how CA exacerbates the role of luck in detriment of skill. We show, for instance, that when nodes start with few edges, an early stroke of luck can place the less skilled in the lead for an extremely long period of time, a phenomenon we call ‘struggle of the fittest’. We prove that duration of a simple skill and luck competition model exhibit power-law tails when CA is present, regardless of skill dierence, which is in sharp contrast to the exponential tails when fitness is distinct but CA is absent. We also prove that competition intensity is always upper bounded by an exponential tail, irrespective of CA and skills. Thus, CA competitions can be extremely long (infinite mean, depending on fitness ratio) but almost never very intense. The theoretical results are corroborated by extensive numerical simulations. Our findings have important implications to competitions not only among nodes in networks but also in contexts that leverage socio-physical models embodying CA competitions.

2014

  1. Pinghui Wang, John C.S. Lui, Bruno Ribeiro, Don Towsley, Junzhou Zhao, Xiaohong Guan, Efficiently Estimating Motif Statistics of Large Networks, ACM Transactions on Knowledge Discovery from Data (TKDD), Sept. 2014 [pdf].
    Older ArXiv version arXiv:1306.5288.
    Abstract []
    Exploring statistics of locally connected subgraph patterns (also known as network motifs) has helped researchers better understand the structure and function of biological and online social networks (OSNs). Nowadays the massive size of some critical networks – often stored in already overloaded relational databases – effectively limits the rate at which nodes and edges can be explored, making it a challenge to accurately discover subgraph statistics. In this work, we propose sampling methods to accurately estimate subgraph statistics from as few queried nodes as possible. We present sampling algorithms that efficiently and accurately estimate subgraph properties of massive networks. Our algorithms require no pre-computation or complete network topology information. At the same time, we provide theoretical guarantees of convergence. We perform experiments using widely known data sets, and show that for the same accuracy, our algorithms require an order of magnitude less queries (samples) than the current state-of-the-art algorithms.
  1. Ting-Kai Huang, Bruno Ribeiro, Harsha V. Madhyastha, Michalis Faloutsos, The Socio-monetary Incentives of Online Social Network Malware Campaigns, ACM Conference on Online Social Networks (COSN'14), 2014 [pdf] [slides.pdf] [slides.pptx].
    Abstract []
    Online social networks (OSNs) offer a rich medium of malware propagation. Unlike other forms of malware, OSN malware campaigns direct users to malicious websites that hijack their accounts, posting malicious messages on their behalf with the intent of luring their friends to the malicious website, thus triggering word-of-mouth infections that cascade through the network compromising thousands of accounts. * But how are OSN users lured to click on the malicious links? *
    In this work, we monitor 3.5 million Facebook accounts and explore the role of pure monetary, social, and combined socio-monetary psychological incentives in OSN malware campaigns. Among other findings we see that the majority of the malware campaigns rely on pure social incentives. However, we also observe that malware campaigns using socio-monetary incentives infect more accounts and last longer than campaigns with pure monetary or social incentives. The latter suggests the efficiency of an epidemic tactic surprisingly similar to the mechanism used by biological pathogens to cope with diverse gene pools.
  1. Flavio Figueiredo, Jussara M Almeida, Yasuko Matsubara, Bruno Ribeiro, Christos Faloutsos, Revisit Behavior in Social Media: The Phoenix-R Model and Discoveries, (ECML/PKDD 2014) [arXiv:1405.1459].
    Abstract []
    How many listens will an artist receive on a online radio? How about plays on a YouTube video? How many of these visits are new or returning users? Modeling and mining popularity dynamics of social activity has important implications for researchers, content creators and providers. We here investigate and model the effect of revisits on popularity. A revisit can be defined as a repeated play (song or video) or a re-tweet of the same hashtag over time. Using four datasets of social activity, with up to tens of millions media objects (e.g., YouTube videos, Twitter hashtags or LastFM artists), we show the effect of revisits in the popularity evolution of such objects. Secondly, we propose the Phoenix-R model which captures the popularity dynamics of individual objects. Phoenix-R has the desired properties of being: (1) parsimonious, being based on the minimum description length principle, and achieving lower root mean squared error than state-of-the-art baselines; (2) applicable, the model is effective for predicting future popularity values of time series.
  1. Bruno Ribeiro, Modeling and Predicting the Growth and Death of Membership-based Websites, WWW 2014 [pdf] [arXiv:1307.1354] [slides.pptx].

    Carnegie Mellon University Press Release: In Age of Information Overload, Ability To Sustain Attention Determines Success | CMU SCS Press Release

    How the Model Works commentary (last updated Feb 6 2014)

    In the news: CNET | The Times of India | Pittsburgh Post Gazette | ACM TechNews | Phys.org news | Futurity.org | The Connectivist | The Tartan | Business Standard | Pittsburgh Business Times | Business Insider (I never said that "website audiences that grew via word-of-mouth are much better off in the long-run for retaining and engaging users", in fact my work has a number of counter-examples to this claim) | e! Science News | Network World | Science Daily News | TechAdvisor | MIT Technology Review: A roundup of the most interesting innovations coming out of research centers and universities | NSF's Facebook Page

    Abstract []
    Driven by outstanding success stories of Internet startups such as Facebook and The Huffington Post, recent studies have thoroughly described their growth. These highly visible online success stories, however, overshadow an untold number of similar ventures that fail. The study of website popularity is ultimately incomplete without general mechanisms that can describe both successes and failures. In this work we present six years of the daily number of users (DAU) of twenty-two membership-based websites -- encompassing online social networks, grassroots movements, online forums, and membership-only Internet stores -- well balanced between successes and failures. We then propose a combination of reaction-diffusion-decay processes whose resulting equations seem not only to describe well the observed DAU time series but also provide means to roughly predict their evolution. This model allows an approximate automatic DAU-based classification of websites into self-sustainable v.s. unsustainable and whether the startup growth is mostly driven by marketing & media campaigns or word-of-mouth adoptions.
  1. (alphabetical) K. Avrachenkov, P. Basu, G. Neglia, B. Ribeiro*, and D. Towsley, Pay Few, Influence Most: Online Myopic Network Covering, IEEE NetSciCom Workshop 2014 Best Paper Award [pdf] [slides.pptx].
    *Corresponding author.
    Older version as Technical Report arXiv.
    Abstract []
    Efficient marketing or awareness-raising campaigns seek to recruit a small number, $w$, of influential individuals -- where $w$ is the campaign budget -- that are able to cover the largest possible target audience through their social connections. To date, most of the related literature on maximizing this network cover assumes that the social network topology is known. Even in such a case the coverage problem is NP-hard. In practice, the network topology is generally unknown and needs to be discovered on-the-fly. In this paper we assume that the topology is gradually discovered thanks to recruited individuals disclosing their social connections. Our goal is to provide an efficient online algorithm that recruits individuals so as to maximize the size of target audience covered by the campaign.
    We analyze the performance of a variety of online myopic algorithms (i.e. that do not have a priori information on the topology) currently used to sample and search large networks, including the Flickr and YouTube networks. We also propose a new greedy online algorithm, Maximum Expected Uncovered Degree (MEUD). Our proposed algorithm greedily maximizes the expected size of the cover, but it requires the degree distribution to be known. For a class of random power law networks we show that MEUD simplifies into a straightforward procedure, denoted as MOD because it requires only the knowledge of the Maximum Observed Degree. We substantiate our analytical results with extensive simulations and show that MOD significantly outperforms all analyzed myopic algorithms.
  1. Bo Jiang, Liyuan Sun, Daniel R. Figueiredo, Bruno Ribeiro, Don Towsley, On the duration and intensity of cumulative advantage competitions and the struggle of the fittest, [arXiv:1402.4536].
    Abstract []
    At the heart of competition for resources lie two simple principles: "survival of the fittest", which reflects how intrinsic fitnesses of competing agents influence their ability to gather resources; and "cumulative advantage", which reflects how accumulated resources help gather even more resources. In this letter we show that competitions embodying just the essence of these two principles exhibit a variety of surprising behaviors with respect to their duration, as measured by how long it takes for an indisputable winner to emerge, and intensity, as measured by how many times agents have equal amount of resources. We demonstrate that when the fitnesses of the agents are different, long-lasting competitions continue to occur in spite of a dramatic decrease in their intensity and the certainty that the fitter agent will win. We also unveil an unexpected phenomenon that we call "struggle of the fittest", where long-lasting competitions become more probable when one agent is slightly fitter than when agents are equally fit. These results have important implications for our understanding of competitions and various applications that leverage such models.
  1. Peng Xia, Kun Tu, Bruno Ribeiro, Hua Jiang, Xiaodong Wang, Cindy Chen, Benyuan Liu, Don Towsley, Who is Dating Whom: Characterizing User Behaviors of a Large Online Dating Site in Social Network Analysis – Community Detection and Evolution , Springer [arXiv:1401.5710].

    MIT Technology Review (Emerging Technology From the arXiv): Data Mining Reveals the Surprising Behavior of Users of Dating Websites

    Abstract []
    Online dating sites have become popular platforms for people to look for potential romantic partners. It is important to understand users' dating preferences in order to make better recommendations on potential dates. The message sending and replying actions of a user are strong indicators for what he/she is looking for in a potential date and reflect the user's actual dating preferences. We study how users' online dating behaviors correlate with various user attributes using a large real-world dateset from a major online dating site in China. Many of our results on user messaging behavior align with notions in social and evolutionary psychology: males tend to look for younger females while females put more emphasis on the socioeconomic status (e.g., income, education level) of a potential date. In addition, we observe that the geographic distance between two users and the photo count of users play an important role in their dating behaviors. Our results show that it is important to differentiate between users' true preferences and random selection. Some user behaviors in choosing attributes in a potential date may largely be a result of random selection. We also find that both males and females are more likely to reply to users whose attributes come closest to the stated preferences of the receivers, and there is significant discrepancy between a user's stated dating preference and his/her actual online dating behavior. These results can provide valuable guidelines to the design of a recommendation engine for potential dates.
  1. Kun Tu, Bruno Ribeiro, H. Jiang, X. Wang, David Jensen, Benyuan Liu, Don Towsley, Online Dating Recommendations: Matching Markets and Learning Preferences, 5th International Workshop on Social Recommender Systems (SRS 2014) with WWW 2014 [arXiv:1401.8042].
    Abstract []
    Recommendation systems for online dating have recently attracted much attention from the research community. In this paper we proposed a two-side matching framework for online dating recommendations and design an LDA model to learn the user preferences from the observed user messag- ing behavior and user profile features. Experimental results using data from a large online dating website shows that two-sided matching improves significantly the rate of suc- cessful matches by as much as 45%. Finally, using simulated matchings we show that the the LDA model can correctly capture user preferences.
  1. Yeon-sup Lim, Bruno Ribeiro, Don Towsley, Classifying Latent Infection States in Complex Networks, SIMPLEX Workshop 2014 (@ WWW'14) [arXiv:1402.0013 ].
    Abstract []
    Algorithms for identifying the infection states of nodes in a network are crucial for understanding and containing infec- tions. Often, however, only a relatively small set of nodes have a known infection state. Moreover, the length of time that each node has been infected is also unknown. This miss- ing data – infection state of most nodes and infection time of the unobserved infected nodes – poses a challenge to the study of real-world cascades.
    In this work, we develop techniques to identify the latent infected nodes in the presence of missing infection time-and- state data. Based on the likely epidemic paths predicted by the simple susceptible-infected epidemic model, we propose a measure (Infection Betweenness) for uncovering these un- known infection states. Our experimental results using ma- chine learning algorithms show that Infection Betweenness is the most effective feature for identifying latent infected nodes.
  1. James Atwood, Bruno Ribeiro, Don Towsley, Efficient Network Generation Under General Preferential Attachment, SIMPLEX Workshop 2014 (@ WWW'14) [arXiv:1403.4521].
    James's source code.
    Abstract []
    Preferential attachment (PA) models of network structure are widely used due to their explanatory power and conceptual simplicity. PA models are able to account for the scale-free degree distribution observed in many real-world large networks through the remarkably simple mechanism of sequentially introducing nodes that attach preferentially to nodes with high degree. The ability to efficiently generate instances from PA models is a key asset in understanding both the models themselves and the real networks that they represent. Surprisingly, little attention has been paid to the problem of efficient instance generation. In this paper, we show that the complexity of generating network instances from a PA model depends on the preference function of the model, provide efficient data structures that work under any preference function, and present empirical results from an implementation based on these data structures. We demonstrate that, by indexing growing networks with a simple augmented heap, we can implement a network generator which scales many orders of magnitude beyond existing capabilities (106 – 108 nodes). We show the utility of an efficient and general PA network generator by investigating the consequences of varying the preference functions of an existing model. We also provide “quicknet”, a freely-available open-source implementation of the methods described in this work.

2013

  1. Bruno Ribeiro, Nicola Perra, and Andrea Baronchelli, Quantifying the Effect of Temporal Resolution on Time-varying Networks, Scientific Reports, 2013 [pdf]. arXiv:1211.7052.
    Abstract []
    Time-varying networks describe a wide array of systems whose constituents and interactions evolve in time. These networks are defined by an ordered stream of interactions between nodes. However, they are often represented as a sequence of static networks, resulting from aggregating all edges and nodes appearing at time intervals of size $\Delta t$. In this work we investigate the consequences of this procedure. In particular, we address the impact of an arbitrary $\Delta t$ on the description of a dynamical process taking place upon a time-varying network. We focus on the elementary random walk, and put forth a mathematical framework that provides exact results in the context of synthetic activity driven networks. Remarkably, the equations turn out to also describe accurately the behavior observed on real datasets. Our results provide the first analytical description of the bias introduced by time integrating techniques, and represent a step forward in the correct characterization of dynamical processes on time-varying graphs.
  1. Peng Xia, Bruno Ribeiro, Cindy Chen, Benyuan Liu, Don Towsley, A Study of User Behavior on an Online Dating Site (short paper), ASONAM 2013.
    Abstract []
  1. Ting-Kai Huang, Md Sazzadur Rahman, Harsha V. Madhyastha, Michalis Faloutsos, and Bruno Ribeiro An Analysis of Socware Cascades in Online Social Networks, WWW 2013 [pdf].
    Abstract []
    Online social networks (OSNs) have become a popular new vector for distributing malware and spam, which we refer to as socware. Unlike email spam, which is sent by spammers directly to intended victims, socware cascades through OSNs as compromised users spread it to their friends. In this paper, we analyze data from the walls of roughly 3 million Facebook users over five months, with the goal of developing a better understanding of socware cascades. We study socware cascades to understand: (a) their spatio-temporal properties, (b) the underlying motivations and mechanisms, and (c) the social engineering tricks used to con users. First, we identify an evolving trend in which cascades appear to be throttling their rate of growth to evade detection, and thus, lasting longer. Second, our forensic investigation into the infrastructure that supports these cascades shows that, surprisingly, Facebook seems to be inadvertently enabling most cascades; 44% of cascades are disseminated via Facebook applications. At the same time, we observe large groups of synergistic Facebook apps (more than 144 groups of size 5 or more) that collaborate to support multiple cascades. Lastly, we find that hackers rely on two social engineering tricks in equal measure—luring users with free products and appealing to users’ social curiosity—to enable socware cascades. Our findings present several promising avenues towards reducing socware on Facebook, but also highlight associated challenges.
  1. Fabricio Murai, Bruno Ribeiro, Don Towsley, and Krista Gile, Characterizing Branching Processes from Sampled Data, SIMPLEX Workshop 2013 (@ WWW'13). arXiv:1302.5847.
    Abstract []
    Branching processes model the evolution of populations of agents that randomly generate offsprings. These processes, more patently the Galton-Watson process, are widely used to model biological, social, cognitive, and technological phenomena, such as the diffusion of ideas, knowledge, chain letters, viruses, and the evolution of humans through their Y-chromosome DNA or mitochondrial RNA. A practical challenge of modeling real phenomena using a Galton-Watson process is the offspring distribution, which must be measured from the population. In most cases, however, directly measuring the offspring distribution is unrealistic due to lack of resources or the death of agents. So far, researchers have relied on informed guesses to guide their choice of offspring distribution. In this work we propose a collection of methods to estimate the offspring distribution from real sampled data. Using a small sampled fraction of the agents and instrumented with the identity of the ancestors of the sampled agents, we show that accurate offspring distribution estimates can be obtained by sampling as little as 14% of the population.
  1. F. Murai, B. Ribeiro, D. Towsley, P. Wang, On Set Size Distribution Estimation and the Characterization of Large Networks via Sampling, IEEE JSAC special issue on Network Science, 2013. [arXiv:1209.0736].
    Abstract []
    In this work we study the set size distribution estimation problem, where elements are randomly sampled from a collection of non-overlapping sets and we seek to recover the original set size distribution from the samples. This problem has applications to capacity planning, network theory, among other areas. Examples of real-world applications include characterizing in-degree distributions in large graphs and uncovering TCP/IP flow size distributions on the Internet. We demonstrate that it is hard to estimate the original set size distribution. The recoverability of original set size distributions presents a sharp threshold with respect to the fraction of elements that remain in the sets. If this fraction remains below a threshold, typically half of the elements in power-law and heavier-than-exponential-tailed distributions, then the original set size distribution is unrecoverable. We also discuss practical implications of our findings.

2012

  1. D. Figueiredo, P. Nain, B. Ribeiro*, E. de Souza e Silva, and D. Towsley, Characterizing Continuous Time Random Walks on Time-varying Graphs, SIGMETRICS 2012. *Corresponding author. [pdf]. Technical Report arXiv:1112.5762.
    Abstract []
    In this paper we study the behavior of a continuous time random walk (CTRW) on a stationary and ergodic time varying dynamic graph. We establish conditions under which the CTRW is a stationary and ergodic process. In general, the stationary distribution of the walker depends on the walker rate and is difficult to characterize. However, we characterize the stationary distribution in the following cases: i) the walker rate is significantly larger or smaller than the rate in which the graph changes (time-scale separation), ii) the walker rate is proportional to the degree of the node that it resides on (coupled dynamics), and iii) the degrees of node belonging to the same connected component are identical (structural constraints). We provide examples that illustrate our theoretical findings.
  1. Bruno Ribeiro, Prithwish Basu, Don Towsley, Multiple Random Walks to Uncover Short Paths in Power Law Networks, IEEE INFOCOM NetSciCom 2012. [pdf].
    Abstract []
    Developing simple distributed algorithms to allow nodes to perform topology discovery and message routing using incomplete topological information is a problem of great interest in network science. Consider the following routing problem in the context of a large power law network $G$. A small number of nodes want to exchange messages over short paths on $G$. These nodes do not have access to the topology of $G$ but are allowed to crawl the network subject to a budget constraint. Only crawlers whose paths cross are allowed to exchange topological information. In this work we study the use of random walks (RWs) to crawl $G$. We show that RWs have the ability to find short paths and that this bears no relation to the paths that they take. Instead, it relies on two properties of RWs on power law networks:
    • The RW ability to observe a sizable fraction of the network edges;
    • The near certainty that two distinct RW sample paths cross after a small percentage of the nodes have been visited.
      • We show promising simulation results on several real world networks.
  1. Bruno Ribeiro, Don Towsley On the Estimation Accuracy of Degree Distributions from Graph Sampling, IEEE Conference on Decision and Control (invited),2012.[pdf].
    This is a new version of the old TechReport UM-CS-2011-036. [pdf].
    Abstract []
    Estimating characteristics of large graphs via sampling is a vital part of the study of complex networks. In this work we present an in-depth study of the Mean Squared Error (MSE) of sampling methods such as independent random vertex (RV) and random edge (RE) sampling and crawling methods such as random walks (RWs), a.k.a. RDS, and the a Metropolis-Hastings algorithm whose target distribution is to uniformly sample vertices (MHRWu). This paper provides an upper bound for the MSE of a stationary RW as a function of the MSE of RE and the absolute value of the second most dominant eigenvalue of the RW transition probability matrix. We see that RW and RV sampling are optimal in respect to different weighted MSE optimizations and show when RW is preferable to RV sampling. Finally, we present an approximation to the MHRWu MSE. We evaluate the accuracy of our approximations and bound using simulations on large real world graphs.
  1. Bruno Ribeiro, Pinghui Wang, Fabricio Murai, and Don Towsley, Sampling Directed Graphs with Random Walks, INFOCOM 2012 [pdf].
    A version is available as UMass CMPSCI Technical Report UMCS-2011-031 [pdf].
    Errata []
    The statement of Theorem 3.1 is incorrect in the camera-ready version of our INFOCOM 2012 paper. It should read
    Theorem 3.1: \hat\pi(s_i) is an *asymptotically* unbiased estimator of \pi(s_i).
    Abstract []
    Despite recent efforts to characterize complex networks such as citation graphs or online social networks (OSNs), little attention has been given to developing tools that can be used to characterize directed graphs in the wild, where no pre-processed data is available. The presence of hidden incoming edges but observable outgoing edges poses a challenge to characterize large directed graphs through crawling. Unless we can crawl the entire graph or the directed graph edges are highly symmetrical, hidden incoming edges induce unknown biases in the sampled nodes. In this work we propose a random walk sampling algorithm that is less prone these biases. The driving principle behind our random walk is to construct, in real-time, an undirected graph from the directed graph in a way that is consistent with the sample path followed by the algorithm walking on either graph. We also study out-degree and in-degree distribution estimation. Out-degrees are visible to the walker while in-degrees are hidden (latent). This makes for strikingly different estimation accuracies of in- and outdegree distributions. We show that our algorithm can accurately estimate out-degree distributions and show that no algorithm can accurately estimate unbiased in-degree distributions unless the directed graph is highly symmetrical.

2011

  1. Y. Lim, D. S. Menasche, B. Ribeiro, D. Towsley, and P. Basu, Online estimating the k central nodes of a network. IEEE Network Science Workshop (NSW), 2011 [pdf].
    Abstract []
    Estimating the most influential nodes in a network is a fundamental problem in network analysis. Influential nodes may be important spreaders of diseases in biological networks, key actors in terrorist networks, or marketing targets in social networks. By identifying such central nodes, one can devise efficient strategies for prevention of diseases or crime and efficient marketing strategies. The goal of this paper is to estimate the k most central nodes in a network through parsimonious sampling. Centrality determines the relative importance of a particular node within the network. Conventional measures of node centrality include degree, betweenness, and closeness.
  1. Bruno Ribeiro, Daniel Figueiredo, Edmundo de Souza e Silva, and Don Towsley, Characterizing Dynamic Graphs with Continuous-time Random Walks, SIGMETRICS 2011 (Generalized to more general graph processes in our SIGMETRICS 2012 paper) [pdf].
    Abstract []
    In this paper we study the steady state behavior of continuous-time random walks (CTRW) on Markov dynamic networks. We consider two types of CTRWs: one that walks at a constant rate (CTRW-C) and another that walks with a rate proportional to the vertex degree (CTRW-D). We derive closed-form analytical expressions for the steady state (SS) distribution of these walkers. For CTRW-C we obtain the approximate SS distribution for either a very fast or very slow walker. We show that the behavior of CTRW-C and CTRW-D is strikingly different. Surprisingly, the steady state distribution of the fixed rate walker depends on the walker rate, which is not the case for the degree proportional walker. Such findings have direct implication on the design of algorithms to measure dynamic networks.

2010

  1. Konstantin Avrachenkov, Bruno Ribeiro, and Don Towsley, Improving Random Walk Estimation Accuracy with Uniform Restarts, 7th Workshop on Algorithms and Models for the Web Graph (WAW 2010), Sept 2010.
    A version is available as INRIA Technical Report no 7394 [pdf].
    Abstract []
    This work proposes and studies the properties of a hybrid sampling scheme that mixes independent uniform node sampling and random walk (RW)-based crawling. We show that our sampling method combines the strengths of both uniform and RW sampling while minimizing their drawbacks. In particular, our method increases the spectral gap of the random walk, and hence, accelerates convergence to the stationary distribution. The proposed method resembles PageRank but unlike PageRank preserves time-reversibility. Applying our hybrid RW to the problem of estimating degree distributions of graphs shows promising results.
  1. William Gauvin, Bruno Ribeiro, Benyuan Liu, Don Towsley, Jie Wang, Measurement and Gender-Specific Analysis of User Publishing Characteristics on MySpace, IEEE Network special issue on Online Social Networks, vol 24, number 5, Oct 2010.
    Abstract []
    Online social networks have become popular platforms for people to make connections, share information, and interact with each other. In an online social network, user publishing activities such as sending messages and posting photos represent online interactions between friends that involve the use of network and system resources. As more and more businesses use online social networks as a means to promote their brand names a good understanding of user publishing characteristics is important not only for capacity planning (server scaling estimation, network bandwidth provisioning, and performance tuning), but also for marketing analysis and security measures of online social networks. Recently there have been many efforts to measure and analyze various characteristics of online social networks. Most of these studies have focused on the network graph properties such as node degree distribution, clustering coefficient, and connected components. In this work, we measure and analyze the gender-specific user publishing characteristics on MySpace. Our results show that there are recognizable patterns with respect to profile attributes, user interactions, temporal usages, and blog contents for users of different genders. In particular, gender neutral profiles (most of them are music bands, TV shows, and other commercial sites) differ significantly from normal male and female profiles for several publishing patterns.
  1. Bruno Ribeiro and Don Towsley, Estimating and Sampling Graphs with Multidimensional Random Walks, ACM SIGCOMM Internet Measurement Conference, Nov, 2010. [src] [pdf]. IMC 2010 slides [pptx].
    Abstract []
    Estimating characteristics of large graphs via sampling is a vital part of the study of complex networks. Current sampling methods such as (independent) random vertex and random walks are useful but have drawbacks. Random vertex sampling may require too many resources (time, bandwidth, or money). Random walks, which normally require fewer resources per sample, can suffer from large estimation errors in the presence of disconnected or quasi-disconnected subgraphs. In this work we propose a new $m$-dimensional random walk that uses $m$ dependent random walkers. We prove that the proposed sampling method, which we call Frontier sampling, has all of the nice sampling properties of a regular random walk. At the same time, our simulations over large real world graphs show that, in the presence of disconnected or quasi-disconnected subgraphs, Frontier sampling exhibits lower estimation errors than regular random walks. We also argue that Frontier sampling is more suitable to sample power-law graphs than random vertex sampling.
    Errata []
    Table 1: Size of LCC for Internet RLT is 190,914 not 609,066.
  1. Bruno Ribeiro, William Gauvin, Benyuan Liu, and Don Towsley, On MySpace Account Spans and Double Pareto-Like Distribution of Friends, IEEE Infocom 2010 Network Science Workshop, March, 2010. [pdf]
    An older version is available as UMass CMPSCI Technical Report UM-CS-2010-001 [pdf].
    Abstract []
    In this work we study the activity span of MySpace accounts and its connection to the distribution of the number of friends. The activity span is the time elapsed since the creation of the account until the user's last login time. We observe exponentially distributed activity spans. We also observe that the distribution of the number of friends over accounts with the same activity span is well approximated by a lognormal with a fairly light tail. These two findings shed light into the puzzling (yet unexplained) inflection point (knee) in the distribution of friends in MySpace when plotted in log-log scale. We argue that the inflection point resembles the inflection point of Reed's (Double Pareto) Geometric Brownian Motion with Exponential Stopping Times model. We also present evidence against the Dunbar number hypothesis of online social networks, which argues, without proof, that the inflection point is due to the Dunbar number (a theoretical limit on the number of people that a human brain can sustain active social contact with). While we answer many questions, we leave many others open.

2009 & ealier work

  1. Bruno Ribeiro, Daniel Figueiredo, and Don Towsley, Herding BitTorrent Traffic away from Expensive ISP Links, Jan, 2009. A version is available as UMass CMPSCI Technical Report UM-CS-2008-029 [pdf].
    Abstract []
    The steady increase in access link capacity of broadband subscribers coupled with the significant increase in traffic, partially due to popular peer-to-peer (P2P) file sharing applications, has had an adverse economic impact on last-mile Internet Service Providers (ISPs). Reducing the cost associated with P2P traffic has become a major issue for ISPs worldwide and has triggered several proposals. However, previous works advocate for changes to P2P software or require application-specific packet identification and manipulation. This paper presents a distinct approach. We leverage on the observation that traffic costs can be reduced by diverting traffic away from expensive links and into cheaper links. We focus on BitTorrent (BT) traffic and propose a change to the ISP's access router packet schedulers in order to divert traffic away from links labeled as expensive by the ISP. Our proposed packet scheduler is stateless. Our approach requires no change to applications, is blind to packet payload, and only requires changes to access routers of ISPs that desire to reduce their BT traffic costs. Our simulation and experimental results indicate that our approach can successfully herd BT traffic towards cheaper links (up to 50% reduction), without sacrificing user-level performance.
  1. Bruno Ribeiro, Tao Ye, and Don Towsley, A Resource-minimalist Flow Size Histogram Estimator, ACM SIGCOMM Internet Measurement Conference, Oct, 2008. A version is available as UMass CMPSCI Technical Report 29-08 [pdf].
    The IMC 2008 slides: [ppt]
    Description []
    This work presents a data stream algorithm for coarse-grained Internet flow size histogram estimation.
    This algorithm is a fast sub-optimal estimator that also has a small memory footprint.
    The memory savings comes from the application of a probabilistic counting technique proposed by
    Robert Morris in 1978 and from our hash folding technique.
    In this work we present two tools that are possibly of interest in their own:
    • hash folding (multiplexing two or more hash tables into the space of one table)
    • semi-probabilistic sampling of packets in a flow.
    Errata []
    Section 3.1.2 has a typo: " ... its ownership bit changed from one to zero, g_j is decremented by one" should be " ... its ownership bit changed from one to zero, g_0 is decremented by one". g_0 counts the number of virtual counters in the sketch and is decremented when a virtual counter is lost (happens when the ownership bit changes from one to zero). Thanks to Chia-Mu Yu for point out this typo.
  1. Bruno Ribeiro, Weifeng Chen, Gerome Miklau, and Don Towsley Analyzing Privacy in Enterprise Packet Trace Anonymization, Proceedings of the 15th Annual Network & Distributed System Security Symposium (NDSS), Feb, 2008 [pdf] and [ppt]
    An extended version is currently available as UMass CMPSCI Technical Report TR 48-07. (BibTeX) [pdf].
    The source code of the software used in this work is available at [src].
    Abstract []
    In this work we show (and present the formal background of) an attack on anonymized IP addresses of (full and partial) prefix-preserved
    anonymized traces. This attack that combines the use of external information (fingerprints) and the matching
    contrains imposed by prefix preservation.
    Perhaps most importantly, we develop analysis tools that allow data publishers to quantify the worst-case
    vulnerability of their trace given assumptions about the adversary's external information.
  1. Bruno Ribeiro, Don Towsley, Tao Ye, and Jean Bolot, Fisher Information of Sampled Packets: an Application to Flow Size Estimation, ACM SIGCOMM Internet Measurement Conference (IMC 2006), Rio de Janeiro, Brazil, October 2006. (BibTeX)
    The camera-ready version and the IMC'06 slides are available (also an errata) here [pdf][ppt] (recommended).
    An older version is also available as UMass CMPSCI Technical Report 06-37 [pdf]
    Also available is a tar.gz file with the source code that computes the Cramer-Rao bounds found in the paper. It needs R 2.4 or newer: [src].
    The source code to numerically obtain the MLE can be found at [src]. It contains the version of wnlib that I used. Wnlib needs csh to compile.
    Description []
    In this work we look at the amount of statistical information, more specifically the flow size distribution, obtained from probabilistic packet sampling.
    We focus primarily on TCP flows and ask if it is possible to extract enough information from a
    sampled packet stream in order to recover its original flow size distribution. We look at single and multiple
    monitor cases.
    This work also provides some insights of which TCP/IP fields can be used to improve flow size estimation.
    We also provide tools to pre-compute the number of samples needed to achieve a given confidence interval.
  1. Bruno Ribeiro, Edmundo de Souza e Silva, and Don Towsley, On the Efficiency of Path Diversity for Continuous Media Applications, A version is currently available as UMass CMPSCI Technical Report 05-19 [pdf].
  2. Description []
    Path diversity is beneficial for multimedia application in the presence of loss.
    It ameliorates the effect of end-to-end burst losses by spreading such losses.
    Multimedia applications using path diversity are known to achieve better error performance over the single path case.
    Packet interleaving is an older and more widespread technique that also spread burst losses.
    To the best of my knowledge, this work is the first to provide evidence (and a mathematical proof :) that packet interleaving can
    also be used to achieve good error performance on multimedia applications.
  1. Weifeng Chen, Yong Huang, Bruno Ribeiro, Kyoungwon Suh, Honggang Zhang, Edmundo de Souza e Silva, Jim Kurose, and Don Towsley, Exploiting the IPID field to infer network path and end-system characteristics, Passive and Active Measurement Workshop (PAM'05), Boston, March 2005. [pdf] (BibTeX)
  2. Description []
    This work shows the many uses of the IPID field for network measurements.
    Among many other applications, it presents a technique to measure end-to-end one way delay differences that does
    not require the end host to be instrumented.
    This means that two hosts A and B can measure their one-way delay difference to a
    Windows 2000/XP end host C without the need to instrument host C -- or without the knowledge or consent of host C :-O .
  1. S. C. Coutinho, Bruno Ribeiro, On Holomorphic Foliations without Algebraic Solutions, Experimental Mathematics, v.10, n.4, p.529 - 536, 2001. [pdf]
  2. Description []
    This work uses Groebner basis through the Buchberger algorithm to find if an holomorphic foliation has an algebraic solution.