Pearson's chi-squared test, from 1900, is the standard statistical tool for "hypothesis testing on distributions": namely, given samples from an unknown distribution
q that may or may not equal a hypothesis distribution p, we want to return "yes" if q = p and "no" if q is far from p. While the chi-squared test is easy to use, it has been known for a while that it is not "data efficient": it does not make the best use of its data. Precisely, for accuracy ε and confidence δ, and given n samples from the unknown distribution q, a tester should return "yes" with probability at least 1−δ when q = p, and "no" with probability at least 1−δ when q is ε-far from p. The challenge is to find a tester with the best tradeoff between ε, δ, and n. We introduce a new tester, efficiently computable and easy to use, which we hope will replace the chi-squared tester in practical use. Our tester is found via a new non-convex optimization framework that essentially seeks to "find the tester whose Chernoff bounds on its performance are as good as possible". This tester is optimal, in that the number of samples needed by the tester is within a 1+o(1) factor of the samples needed by any tester, even non-linear testers (for the setting: accuracy ε, confidence δ, and hypothesis p). We complement this algorithmic framework with matching lower bounds saying, essentially, that "our tester is instance-optimal, even to 1+o(1) factors, to the degree that Chernoff bounds are tight". Our overall non-convex optimization framework extends well beyond the current problem and is of independent interest.
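For reference, here is a minimal sketch of the classical Pearson chi-squared identity tester that the abstract above takes as its baseline; this is the standard textbook test, not the new tester proposed above, and the significance level alpha and the asymptotic chi-squared threshold are conventional choices (the example assumes the hypothesis p has full support).

```python
import numpy as np
from scipy.stats import chi2

def pearson_identity_test(samples, p, alpha=0.05):
    """Classical Pearson chi-squared test of the hypothesis that `samples`
    (integers in 0..len(p)-1) were drawn from the distribution p.
    Returns True for 'yes, consistent with p' and False for 'no'."""
    p = np.asarray(p, dtype=float)          # hypothesis distribution (full support assumed)
    n = len(samples)
    counts = np.bincount(samples, minlength=len(p))
    expected = n * p
    stat = np.sum((counts - expected) ** 2 / expected)
    # Asymptotic null distribution: chi-squared with len(p) - 1 degrees of freedom.
    return stat <= chi2.ppf(1 - alpha, df=len(p) - 1)

rng = np.random.default_rng(0)
p = np.full(10, 0.1)
print(pearson_identity_test(rng.choice(10, size=500, p=p), p))
```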
We study the size of a neural network needed to approximate the maximum function over
d inputs, in the most basic setting of approximating with respect to the L_2 norm, for continuous distributions, for a network that uses ReLU activations. We provide new lower and upper bounds on the width required for approximation across various depths. Our results establish new depth separations between depth 2 and 3, and depth 3 and 5 networks, as well as providing a depth O(log(log(d))) and width O(d) construction which approximates the maximum function, significantly improving upon the depth requirements of the best previously known bounds for networks with linearly-bounded width. Our depth separation results are facilitated by a new lower bound for depth 2 networks approximating the maximum function over the uniform distribution, assuming an exponential upper bound on the size of the weights. Furthermore, we are able to use this depth 2 lower bound to provide tight bounds on the number of neurons needed to approximate the maximum by a depth 3 network. Our lower bounds are of potentially broad interest as they apply to the widely studied and used max function, in contrast to many previous results that base their bounds on specially constructed or pathological functions and distributions.
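As background for the depth and width tradeoffs above: the maximum of two inputs is computed exactly by a one-hidden-layer (depth 2) ReLU network with three hidden units, and composing this construction pairwise yields an exact network of depth O(log d) and width O(d) on d inputs. This standard construction (not taken from the paper) is the kind of baseline against which the improved approximation bounds are measured:
\[ \max(x_1,x_2) = \mathrm{ReLU}(x_1-x_2) + \mathrm{ReLU}(x_2) - \mathrm{ReLU}(-x_2), \qquad \max(x_1,\dots,x_d) = \max\bigl(\max(x_1,\dots,x_{\lceil d/2\rceil}),\ \max(x_{\lceil d/2\rceil+1},\dots,x_d)\bigr). \]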
There is growing interest in improving our algorithmic understanding of fundamental statistical problems such as mean estimation, driven by the goal of understanding the fundamental limits of what we can extract from limited and valuable data. The state of the art results for mean estimation in
R are 1) the optimal sub-Gaussian mean estimator by [Lee and Valiant, 2022], attaining the optimal sub-Gaussian error constant for all distributions with finite but unknown variance, and 2) the analysis of the median-of-means algorithm by [Bubeck, Cesa-Bianchi and Lugosi, 2013] and a matching lower bound by [Devroye, Lerasle, Lugosi, and Oliveira, 2016], characterizing the big-O optimal errors for distributions that have tails heavy enough that only a 1+α moment exists for some α < 1. Both of these results, however, are optimal only in the worst case. Motivated by the recent effort in the community to go "beyond the worst-case analysis" of algorithms, we initiate the fine-grained study of the mean estimation problem. Given a distribution p, assuming only that it has a finite mean and absent any additional assumptions, we show how to construct a distribution q such that p and q are impossible to distinguish with n samples with probability 1−δ, yet the means of p and q are well-separated. The construction of q preserves the density of p up to a factor of 2, thus preserving the existence of all moments. In particular, specializing to the second moment, the variance of q is at most twice the variance of p. The main consequence of our result is that, under the finite but unknown variance assumption, no reasonable estimator can expect to asymptotically achieve better than the sub-Gaussian error rate, up to constant factors, matching the worst-case optimality result of [Lee and Valiant, 2022]. Further, we introduce a new definitional framework to analyze the fine-grained optimality of algorithms, which we call "neighborhood optimality", interpolating between the unattainably strong "instance optimality" and the trivially weak admissibility/Pareto optimality definitions. As an application of the new framework, we show that the standard median-of-means algorithm is neighborhood optimal, up to constant factors. It is an open question to find a neighborhood-optimal estimator without constant factor slackness.
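For concreteness, a minimal sketch of the median-of-means estimator analyzed by [Bubeck, Cesa-Bianchi and Lugosi, 2013]; the grouping rule below (roughly log(1/δ) groups) is the standard textbook choice rather than anything specific to the results above.

```python
import numpy as np

def median_of_means(samples, delta):
    """Split the data into about log(1/delta) groups, average each group,
    and return the median of the group means."""
    k = max(1, int(np.ceil(np.log(1.0 / delta))))   # number of groups (standard choice)
    groups = np.array_split(np.random.permutation(samples), k)
    return float(np.median([g.mean() for g in groups]))

rng = np.random.default_rng(0)
heavy_tailed = rng.pareto(2.5, size=10_000)         # finite variance, heavy right tail
print(median_of_means(heavy_tailed, delta=0.01))
```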
Location estimation is one of the most basic questions in parametric statistics. Suppose we have a known distribution density
f, and we get n i.i.d. samples from f(x − λ) for some unknown shift λ. The task is to estimate λ to high accuracy with high probability. The maximum likelihood estimator (MLE) is known to be asymptotically optimal as n → ∞, but what is possible for finite n? In this paper, we give two location estimators that are optimal under different criteria: 1) an estimator that has minimax-optimal estimation error subject to succeeding with probability 1−δ, and 2) a confidence interval estimator which, subject to its output interval containing λ with probability at least 1−δ, has the minimum expected squared interval width among all shift-invariant estimators. The latter construction can be generalized to minimizing the expectation of any loss function on the interval width.
We consider 1-dimensional location estimation, where we estimate a parameter
λ from n samples λ + η_i, with each η_i drawn i.i.d. from a known distribution f. For fixed f the maximum-likelihood estimate (MLE) is well-known to be optimal in the limit as n → ∞: it is asymptotically normal with variance matching the Cramér-Rao lower bound of 1/(n·I(f)), where I(f) is the Fisher information of f. However, this bound does not hold for finite n, or when f varies with n. We show for arbitrary f and n that one can recover a similar theory based on the Fisher information of a smoothed version of f, where the smoothing radius decays with n.
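In symbols (the notation, and the Gaussian choice of smoothing kernel below, are ours and meant only to illustrate the quantities involved; the paper's precise smoothing is not reproduced here), the Fisher information of the location family generated by f, the Cramér-Rao bound for unbiased estimators, and a smoothed density of radius r are
\[ \mathcal{I}(f) = \int \frac{f'(x)^2}{f(x)}\,dx, \qquad \mathrm{Var}(\hat{\lambda}) \ge \frac{1}{n\,\mathcal{I}(f)}, \qquad f_r = f * \mathcal{N}(0, r^2), \]
with the finite-n theory phrased in terms of I(f_r) for a smoothing radius r that decays with n.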
We address the problem of mean estimation in very high dimensions, in the high probability regime parameterized by failure probability
δ. For a distribution with covariance Σ, let its "effective dimension" be d_eff = tr(Σ)/||Σ||, the ratio of the trace of Σ to its largest eigenvalue. For the regime where the effective dimension is large relative to log(1/δ), we show the first algorithm whose sample complexity is optimal to within a 1+o(1) factor. The algorithm has a surprisingly simple structure: 1) re-center the samples using a known sub-Gaussian estimator, 2) carefully choose an easy-to-compute positive integer k and then remove the k samples farthest from the origin and 3) return the sample mean of the remaining samples. The core of the analysis relies on a novel vector Bernstein-type tail bound, showing that under general conditions, the sample mean of a bounded high-dimensional distribution is highly concentrated around a spherical shell.
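The following is a hypothetical sketch of the three-step procedure described above, with a coordinate-wise median-of-means as a stand-in for the "known sub-Gaussian estimator" used for re-centering, and an illustrative (not the paper's carefully chosen) value of k.

```python
import numpy as np

def median_of_means(X, k):
    """Coordinate-wise median of k group means: a crude sub-Gaussian re-centering estimator."""
    groups = np.array_split(np.random.permutation(len(X)), k)
    return np.median([X[g].mean(axis=0) for g in groups], axis=0)

def trimmed_mean_estimator(X, delta):
    """Sketch: 1) re-center, 2) drop the k samples farthest from the origin, 3) average the rest."""
    n = len(X)
    center = median_of_means(X, k=max(1, int(np.ceil(np.log(1 / delta)))))
    Y = X - center                                           # step 1: re-centered samples
    k = int(np.ceil(np.log(1 / delta)))                      # illustrative choice of k only
    keep = np.argsort(np.linalg.norm(Y, axis=1))[: n - k]    # step 2: remove the k farthest
    return center + Y[keep].mean(axis=0)                     # step 3: mean of the rest

rng = np.random.default_rng(0)
X = rng.standard_t(df=3, size=(2000, 50))                    # heavy-tailed data in 50 dimensions
print(trimmed_mean_estimator(X, delta=0.01))
```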
We settle the fundamental problem of estimating the mean of a real-valued distribution, presenting a novel estimator with sub-Gaussian convergence: intuitively, "our estimator, on any distribution, is as accurate as the sample mean is for the Gaussian distribution of matching variance." Crucially, in contrast to prior works, our estimator does not require prior knowledge of the variance, and works across the entire gamut of distributions with bounded variance, including those without any higher moments. Parameterized by the sample size n, the failure probability
δ, and the variance σ², our estimator is accurate to within σ·(1+o(1))·sqrt(2 ln(1/δ)/n), tight up to the 1+o(1) factor. Our estimator construction and analysis gives a framework generalizable to other problems, tightly analyzing a sum of dependent random variables by viewing the sum implicitly as a 2-parameter ψ-estimator, and constructing bounds using mathematical programming and duality techniques.
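To make the sub-Gaussian benchmark explicit: if X_1, …, X_n were drawn from a Gaussian of variance σ², the sample mean would satisfy
\[ \Pr\left[\,|\bar{X}-\mu| \ge \sigma\sqrt{\frac{2\ln(1/\delta)}{n}}\,\right] \le \delta, \]
and the estimator above matches this error, up to the 1+o(1) factor, for every distribution with finite variance.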
Given a mixture between two populations of coins, "positive" coins that each have---unknown and potentially different---bias
at least 1/2 + Δ, and "negative" coins with bias at most 1/2 − Δ, we consider the task of estimating the fraction ρ of positive coins to within additive error ε. We achieve an upper and lower bound of Θ((ρ/ε²Δ²) log(1/δ)) samples for a 1−δ probability of success, where crucially, our lower bound applies to all fully-adaptive algorithms. Thus, our sample complexity bounds have tight dependence for every relevant problem parameter. A crucial component of our lower bound proof is a decomposition lemma showing how to assemble partially-adaptive bounds into a fully-adaptive bound, which may be of independent interest: though we invoke it for the special case of Bernoulli random variables (coins), it applies to general distributions. We present simulation results to demonstrate the practical efficacy of our approach for realistic problem parameters for crowdsourcing applications, focusing on the "rare events" regime where ρ is small. The fine-grained adaptive flavor of both our algorithm and lower bound contrasts with much previous work in distributional testing and learning.
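A small simulation in the spirit of the setting above; the estimator below is the naive non-adaptive baseline (flip every coin the same number of times and classify it by its empirical bias), not the paper's adaptive algorithm, and the parameter values are arbitrary.

```python
import numpy as np

def naive_fraction_estimate(biases, flips_per_coin, rng):
    """Non-adaptive baseline: estimate the fraction of positive coins by
    classifying each coin according to its empirical bias."""
    heads = rng.binomial(flips_per_coin, biases)
    return float(np.mean(heads / flips_per_coin > 0.5))

rng = np.random.default_rng(0)
rho, gap = 0.05, 0.2                                  # fraction of positive coins, bias gap
n_coins = 10_000
positive = rng.random(n_coins) < rho
biases = np.where(positive, 0.5 + gap, 0.5 - gap)
print(naive_fraction_estimate(biases, flips_per_coin=50, rng=rng))
```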
This chapter considers the challenge of saying as much as possible about a probability distribution given a limited number of samples. Traditionally, work has focused on either developing algorithms that are optimal in an asymptotic sense as the amount of data goes to infinity, or developing algorithms that are optimal in a worst-case sense when parameterized by relevant quantities such as the support size. This chapter, by contrast, considers two standard settings---learning a discrete distribution from samples, and testing whether a set of samples was drawn from a specific distribution---and develops algorithms that are near optimal on every instance.
We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of n elements with corresponding data values x_1, …, x_n. We observe the values for a "sample" set S and wish to estimate some statistic of the values for a "target" set T, where T could be the entire population. Crucially, we assume that the sets S and T are drawn according to some known distribution D over pairs of subsets of the n elements. A given estimation algorithm is evaluated based on its "worst-case, expected error", where the expectation is with respect to the distribution D from which the sample and target sets are drawn, and the worst-case is with respect to the data values x_1, …, x_n. Within this framework, we give an efficient algorithm for estimating the target mean that returns a weighted combination of the sample values (where the weights are functions of the distribution D and the sample and target sets S, T), and show that the worst-case expected error achieved by this algorithm is at most a constant multiplicative factor worse than the optimal of such algorithms. The algorithm and proof leverage a surprising connection to the Grothendieck problem. We also extend these results to the linear regression setting where each datapoint is not a scalar but a labeled vector (x_i, y_i). This framework, which makes no distributional assumptions on the data values but rather relies on knowledge of the data collection process via the distribution D, is a significant departure from the typical statistical estimation framework and introduces a uniform analysis for the many natural settings where membership in a sample may be correlated with data values, such as when individuals are recruited into a sample through their social networks as in "snowball/chain" sampling or when samples have chronological structure as in "selective prediction".
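A minimal sketch of the estimator family and evaluation criterion described above; the data, the collection process, and the naive 1/|S| weight rule below are illustrative placeholders, not the paper's optimized weights (which come out of the Grothendieck connection).

```python
import numpy as np

def weighted_estimate(x, sample, weights):
    """The estimator family considered above: a weighted combination of the observed values."""
    return sum(weights[i] * x[i] for i in sample)

def expected_sq_error(x, draw_sets, weight_rule, trials=5_000, seed=0):
    """Monte Carlo estimate of the expected squared error for fixed data x, with
    (sample, target) pairs drawn from the known collection process; the framework
    then takes the worst case of this quantity over the data values x."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        S, T = draw_sets(rng)
        truth = np.mean([x[i] for i in T])
        est = weighted_estimate(x, S, weight_rule(S, T))
        errs.append((est - truth) ** 2)
    return float(np.mean(errs))

# Illustrative setup: each element joins the sample independently, the target is everyone.
n = 100
x = np.linspace(0.0, 1.0, n)
def draw_sets(rng):
    S = [i for i in range(n) if rng.random() < 0.3]
    return (S if S else [0]), list(range(n))
naive_weights = lambda S, T: {i: 1.0 / len(S) for i in S}
print(expected_sq_error(x, draw_sets, naive_weights))
```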
We consider networks, trained via stochastic gradient descent to minimize
L2 loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term corresponding to the sum over the data points, of the squared norm of the gradient of the model with respect to the parameter vector, evaluated at each data point. This holds for networks of any connectivity, width, depth, and choice of activation function. We interpret this implicit regularization term for three simple settings: matrix sensing, two-layer ReLU networks trained on one-dimensional data, and two-layer networks with sigmoid activations trained on a single datapoint. For these settings, we show why this new and general implicit regularization effect drives the networks towards "simple" models.
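In symbols (notation ours: f_θ denotes the model with parameters θ, and x_1, …, x_m are the training inputs), the implicit regularization term described above is
\[ R(\theta) = \sum_{i=1}^{m} \bigl\|\nabla_\theta f_\theta(x_i)\bigr\|_2^2 . \]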
Considering the vorticity formulation of the Euler equations, we partition the kinetic energy into its contribution from each pair of interacting vortices. We call this contribution the "interaction energy". We show that each contribution satisfies a reciprocity relation on triples of vortices: A's action on B changes the interaction energy between B and C in an equal and opposite way to the effect of C's action on B on the interaction energy between A and B. This result is a curiously detailed accounting of energy flow, as contrasted to standard pointwise conservation laws in fluid dynamics. This result holds for all triples of points A,B,C in two dimensions; and in 3 dimensions for all points A,C, and all closed vorticity streamlines B. We show this result in 3 dimensions as a consequence of an interaction energy flow around B that is a function only of the triple
(A, B, C), a result which may be of independent interest.
We show that a class of statistical properties of distributions, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a sublinear sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most k distinct elements, these properties can be estimated accurately using a sample of size O(k/log k). For these estimation tasks, this performance is optimal, to constant factors. Complementing these theoretical results, we also demonstrate that our estimators perform exceptionally well, in practice, for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. The key step in our approach is to first use the sample to characterize the "unseen" portion of the distribution — effectively reconstructing this portion of the distribution as accurately as if one had a logarithmic factor larger sample. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: We seek to estimate the shape of the unobserved portion of the distribution. This work can be seen as introducing a robust, general, and theoretically principled framework that, for many practical applications, essentially amplifies the sample size by a logarithmic factor; we expect that it may be fruitfully used as a component within larger machine learning and statistical analysis systems.
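For contrast, the Good-Turing scheme mentioned above addresses only the total probability mass of the unseen portion: given n samples, it estimates the missing mass as
\[ \widehat{M}_0 = \frac{F_1}{n}, \]
where F_1 is the number of domain elements observed exactly once, whereas the approach above goes on to reconstruct the shape of the unseen portion of the distribution.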
We consider the problem of verifying the identity of a distribution: Given the description of a distribution over a discrete finite or countably infinite support, p=(p1, p2,...), how many samples (independent draws) must one obtain from an unknown distribution, q, to distinguish, with high probability, the case that p=q from the case that the total variation distance (L1 distance) ||p-q||1≥ε? We resolve this question, up to constant factors, on an instance by instance basis: there exist universal constants c, c' and a function f(p, ε) on the known distribution p and error parameter ε such that our tester distinguishes p=q from ||p-q||1≥ε using f(p, ε) samples with success probability >2/3, but no tester can distinguish p=q from ||p-q||1≥cε when given c' f(p, ε) samples.
The analysis of our very simple testing algorithm involves several hairy inequalities. To facilitate this analysis, we give a complete characterization of a general class of inequalities - generalizing Cauchy-Schwarz, Hölder's inequality, and the monotonicity of Lp norms. Our characterization is constructive and algorithmic, leveraging linear programming to prove or refute an inequality, which would otherwise have to be investigated, through trial and error, by hand. We hope the computational nature of our characterization will be useful to others, and facilitate analyses like the one here.
We introduce a definition of security that applies to databases that maintain dynamic information on users such as financial account information, medical records, etc. Such a database system is secure in between accesses provided 1) users can efficiently access their data, and 2) while a user is not accessing their data, the user's information is information theoretically secure to malicious agents. We propose a realization of such a database system and prove that a user's stored information, in between times when it is being legitimately accessed, is information theoretically secure both to adversaries who interact with the database in the prescribed manner, as well as to adversaries who have installed a virus that controls the entire internet-facing server that stores the database. We stress that the security guarantee is information theoretic and everlasting: it relies neither on unproved hardness assumptions, nor on the assumption that the adversary is computationally or storage bounded.
The central idea behind our design of an information theoretically secure database system is the construction of a "re-randomizing database" that periodically changes the internal representation of the information that is being stored. To ensure security, these remappings of the representation of the data must be made sufficiently often in comparison to the amount of information that is being communicated from the database between remappings and the amount of local memory on the database server that a virus may preserve during the remappings. While this changing representation provably foils the ability of an adversary to glean information, it can be accomplished in a manner transparent to the legitimate users, preserving how database users access their data.
The core of the proof of the security is a new analysis of a simple and explicit locally computable extractor, based on a hypercontractivity inequality, optimized for the relatively unexplored parameter regime of "nearly full min-entropy" where the min-entropy of the
n-bit input string is close to the maximum possible value of n.
Star-convexity is a significant relaxation of the notion of convexity that allows for functions that do not have (sub)gradients at most points, and may even be discontinuous everywhere except at the global optimum. We introduce a polynomial time algorithm for optimizing the class of star-convex functions, under no Lipschitz or other smoothness assumptions whatsoever, and no restrictions except exponential boundedness on a region about the origin, and Lebesgue measurability. The algorithm's performance is polynomial in the requested number of digits of accuracy and the dimension of the search domain. This contrasts with the previous best known algorithm of Nesterov and Polyak [2006], which has exponential dependence on the number of digits of accuracy, but only n^ω dependence on the dimension n (where ω is the matrix multiplication exponent), and which further requires Lipschitz second differentiability of the function.
Despite a long history of successful gradient-based optimization algorithms, star-convex optimization is a uniquely challenging regime because 1) gradients and/or subgradients often do not exist; and 2) even in cases when gradients exist, there are star-convex functions for which gradients provably provide no information about the location of the global optimum. We thus bypass the usual approach of relying on gradient oracles and introduce a new randomized cutting plane algorithm that relies only on function evaluations. Our algorithm essentially looks for structure at all scales, since, unlike with convex functions, star-convex functions do not necessarily display simpler behavior on smaller length scales. Thus, while our cutting plane algorithm refines a feasible region of exponentially decreasing volume by iteratively removing "cuts", unlike for the standard convex case, the structure to efficiently discover such cuts may not be found within the feasible region: our novel star-convex cutting plane approach discovers cuts by sampling the function exponentially far outside the feasible region.
We emphasize that the class of star-convex functions we consider is as unrestricted as possible: the class of Lebesgue measurable star-convex functions has theoretical appeal, introducing to the domain of polynomial-time algorithms a huge class with many interesting pathologies. We view our results as a step forward in understanding the scope of optimization techniques beyond the garden of convex optimization and local gradient-based methods.
A review of analyses based upon anti-parallel vortex structures suggests that structurally stable dipoles with eroding circulation may offer a path to the study of vorticity growth in solutions of Euler's equations in R^3. We examine here the possible formation of such a structure in axisymmetric flow without swirl, leading to maximal growth of vorticity as t^{4/3}. Our study suggests that the optimizing flow giving the t^{4/3} growth mimics an exact solution of Euler's equations representing an eroding toroidal vortex dipole which locally conserves kinetic energy. The dipole cross-section is a perturbation of the classical Sadovskii dipole having piecewise constant vorticity, which breaks the symmetry of closed streamlines. The structure of this perturbed Sadovskii dipole is analysed asymptotically at large times, and its predicted properties are verified numerically. We also show numerically that if mirror symmetry of the dipole is not imposed but axial symmetry maintained, an instability leads to breakup into smaller vortical structures.
As new proposals aim to sequence ever larger collections of humans, it is critical to have a quantitative framework to evaluate the statistical power of these projects. We developed a new algorithm, UnseenEst, and applied it to the exomes of 60,706 individuals to estimate the frequency distribution of all protein-coding variants, including rare variants that have not been observed yet in the current cohorts. Our results quantified the number of new variants that we expect to identify as sequencing cohorts reach hundreds of thousands of individuals. With 500K individuals, we find that we expect to capture 7.5% of all possible loss-of-function variants and 12% of all possible missense variants. We also estimate that 2,900 genes have loss-of-function frequency of <0.00001 in healthy humans, consistent with very strong intolerance to gene inactivation.
We consider the following basic learning task: given independent draws from an unknown distribution over a discrete support, output an approximation of the distribution that is as accurate as possible in L1 distance (equivalently, total variation distance, or "statistical distance"). Perhaps surprisingly, it is often possible to "de-noise" the empirical distribution of the samples to return an approximation of the true distribution that is significantly more accurate than the empirical distribution, without relying on any prior assumptions on the distribution. We present an instance optimal learning algorithm which optimally performs this de-noising for every distribution for which such a de-noising is possible. More formally, given n independent draws from a distribution p, our algorithm returns a labelled vector whose expected distance from p is equal to the minimum possible expected error that could be obtained by any algorithm, even one that is given the true unlabeled vector of probabilities of distribution p and simply needs to assign labels - up to an additive subconstant term that is independent of p and goes to zero as n gets large. This somewhat surprising result has several conceptual implications, including the fact that, for any large sample from a distribution over discrete support, prior knowledge of the rates of decay of the tails of the distribution (e.g. power-law type assumptions) is not significantly helpful for the task of learning the distribution. As a consequence of our techniques, we also show that given a set of n samples from an arbitrary distribution, one can accurately estimate the expected number of distinct elements that will be observed in a sample of any size up to n log n. This sort of extrapolation is practically relevant, particularly to domains such as genomics where it is important to understand how much more might be discovered given larger sample sizes, and we are optimistic that our approach is practically viable.
We formulate a notion of evolvability for functions with domain and range that are real-valued vectors, a compelling way of expressing many natural biological processes. We show that linear and fixed-degree polynomial functions are evolvable in the following dually-robust sense: There is a single evolution algorithm that, for all convex loss functions, converges for all distributions. It is possible that such dually-robust results can be achieved by simpler and more-natural evolution algorithms. Towards this end, we introduce a simple and natural algorithm that we call wide-scale random noise and prove a corresponding result for the L2 metric. We conjecture that the algorithm works for a more general class of metrics.
We study the question of closeness testing for two discrete distributions. More precisely, given samples from two distributions p and q over an n-element set, we wish to distinguish whether p=q versus p is at least ε-far from q, in either L1 or L2 distance. Batu et al. [2000] gave the first sub-linear time algorithms for these problems, which matched the lower bounds of Valiant [2011] up to a logarithmic factor in n, and a polynomial factor of ε.
In this work, we present simple testers for both the L1 and L2 settings, with sample complexity that is information-theoretically optimal, to constant factors, both in the dependence on n, and the dependence on ε; for the L1 testing problem we establish that the sample complexity is Θ(max{n^{2/3}/ε^{4/3}, n^{1/2}/ε^2}).
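A sketch of the collision-style statistic underlying closeness testers of this kind; the precise statistic, normalization, and accept/reject threshold used in the paper may differ, but under Poissonized sample sizes the quantity below is an unbiased estimator of m^2 * ||p-q||_2^2.

```python
import numpy as np
from collections import Counter

def closeness_statistic(sample_p, sample_q):
    """Sum over the domain of (X_i - Y_i)^2 - X_i - Y_i, where X_i, Y_i are the
    counts of element i in the two samples; its expectation (with Poissonized
    sample sizes m) is m^2 * ||p - q||_2^2."""
    cp, cq = Counter(sample_p), Counter(sample_q)
    return sum((cp[i] - cq[i]) ** 2 - cp[i] - cq[i] for i in set(cp) | set(cq))

rng = np.random.default_rng(0)
n, m = 1000, 20_000
p = np.full(n, 1.0 / n)                       # uniform distribution
q = p * np.repeat([1.2, 0.8], n // 2)         # a perturbed version of p
same = closeness_statistic(rng.choice(n, m, p=p), rng.choice(n, m, p=p))
diff = closeness_statistic(rng.choice(n, m, p=p), rng.choice(n, m, p=q))
print(same, diff)                             # the second value should be noticeably larger
```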
Recently, we showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a sublinear sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most n distinct elements, these properties can be estimated accurately using a sample of size O(n/log n). We propose a novel modification of this approach and show: 1) theoretically, this estimator is optimal (to constant factors, over worst-case instances), and 2) in practice, it performs exceptionally well for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. Perhaps unsurprisingly, the key step in our approach is to first use the sample to characterize the "unseen" portion of the distribution. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the shape of the unobserved portion of the distribution. This approach is robust, general, and theoretically principled; we expect that it may be fruitfully used as a component within larger machine learning and data analysis systems.
We give highly efficient algorithms, and almost matching lower bounds, for a range of basic statistical problems that involve testing and estimating the L1 (total variation) distance between two k-modal distributions p and q over the discrete domain {1,...,n}. For each of four problems we give sub-logarithmic sample algorithms, and show that our algorithms have optimal sample complexity up to additive poly(k) and multiplicative polylog log n + polylog k factors.
As our main conceptual contribution, we introduce a new reduction-based approach for distribution-testing problems that lets us obtain all the above results in a unified way. Roughly speaking, this approach enables us to transform various distribution testing problems for k-modal distributions over {1,...,n} to the corresponding distribution testing problems for unrestricted distributions over a much smaller domain {1,...,L} where L=O(k log n).
This article provides new worst-case bounds for the size and treewidth of the result Q(D) of a conjunctive query Q applied to a database D. We derive bounds for the result size |Q(D)| in terms of structural properties of Q, both in the absence and in the presence of keys and functional dependencies. These bounds are based on a novel "coloring" of the query variables that associates a coloring number C(Q) to each query Q. Intuitively, each color used represents some possible entropy of that variable. Using this coloring number, we derive tight bounds for the size of Q(D) in case (i) no functional dependencies or keys are specified, and (ii) simple functional dependencies (keys) are given. These results generalize recent size-bounds for join queries obtained by Atserias et al. [2008]. In the case of arbitrary (compound) functional dependencies, we use tools from information theory to provide lower and upper bounds, establishing a close connection between size bounds and a basic question in information theory. Our new coloring scheme also allows us to precisely characterize (both in the absence of keys and with simple keys) the treewidth-preserving queries---the queries for which the treewidth of the output relation is bounded by a function of the treewidth of the input database. Finally, we give some results on the computational complexity of determining the size bounds, and of deciding whether the treewidth is preserved.
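For concreteness, the size bound of Atserias et al. [2008] that these results generalize states that, for any fractional edge cover (x_F)_F of the query's variables by its atoms,
\[ |Q(D)| \le \prod_{F} |R_F|^{x_F}, \]
where R_F ranges over the relations appearing in Q; the coloring number C(Q) above serves a similar purpose in the more general settings with keys and functional dependencies.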
We resolve the problem of small-space approximate selection in random-order streams. Specifically, we present an algorithm that reads the n elements of a set in random order and returns an element whose rank differs from the true median by at most n^{1/3+o(1)} while storing a constant number of elements and counters at any one time. This is optimal: it was previously shown that achieving better accuracy required poly(n) memory. However, it was conjectured that the lower bound was not tight and that a previous algorithm achieving an n^{1/2+o(1)} approximation was optimal. We therefore consider the new result a surprising resolution to a natural and basic question.
We formulate a notion of evolvability for functions with domain and range that are real-valued vectors, a compelling way of expressing many natural biological processes. We show that linear and fixed degree polynomial functions are evolvable in the following dually robust sense: There is a single evolution algorithm that for all convex loss functions converges for all distributions. Existing results suggest that for Boolean function evolution no correspondingly general result is possible.
It is possible that such dually robust results can be achieved by simpler and more natural evolution algorithms. In the second part of the paper we introduce a simple and natural algorithm that we call "wide-scale random noise" and prove a corresponding result for the L_2 metric. We conjecture that the algorithm works for more general classes of metrics.
For a broad class of practically relevant distribution properties, which includes entropy and support size, nearly all of the proposed estimators have an especially simple form. Given a set of independent samples from a discrete distribution, these estimators tally the vector of summary statistics---the number of domain elements seen once, twice, etc. in the sample---and output the dot product between these summary statistics, and a fixed vector of coefficients. We term such estimators linear. This historical proclivity towards linear estimators is slightly perplexing, since, despite many efforts over nearly 60 years, all proposed such estimators have significantly suboptimal convergence, compared to the bounds shown in [Valiant and Valiant, 2011].
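A minimal sketch of the "linear estimator" form described above, using the naive plug-in entropy estimator as the illustrative choice of coefficients (the near-optimal coefficients constructed in this work are not reproduced here).

```python
import numpy as np
from collections import Counter

def fingerprint(samples):
    """Summary statistics: maps j to F_j, the number of domain elements seen exactly j times."""
    return Counter(Counter(samples).values())

def linear_estimator(samples, coeff):
    """A linear estimator: the dot product of the fingerprint with fixed coefficients coeff(j)."""
    F = fingerprint(samples)
    return sum(F[j] * coeff(j) for j in F)

n = 5000
samples = np.random.default_rng(0).zipf(1.5, size=n)   # a heavy-tailed toy distribution
# The plug-in (empirical) entropy is itself a linear estimator, with these coefficients:
print(linear_estimator(samples, lambda j: -(j / n) * np.log(j / n)))
```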
Our main result, in some sense vindicating this insistence on linear estimators, is that for any property in this broad class, there exists a near-optimal linear estimator. Additionally, we give a practical and polynomial-time algorithm for constructing such estimators for any given parameters.
While this result does not yield explicit bounds on the sample complexities of these estimation tasks, we leverage the insights provided by this result, to give explicit constructions of near-optimal linear estimators for three properties: entropy, L1 distance to uniformity, and for pairs of distributions, L1 distance.
Our entropy estimator, when given O(n/(c log n)) independent samples from a distribution of support at most n, will estimate the entropy of the distribution to within accuracy c, with probability of failure o(1/poly(n)). From the recent lower bounds given in [Valiant and Valiant, 2011], this estimator is optimal, to constant factor, both in its dependence on n, and its dependence on c. In particular, the inverse-linear convergence rate of this estimator resolves the main open question of [Valiant and Valiant, 2011], which left open the possibility that the error decreased only with the square root of the number of samples.
Our distance to uniformity estimator, when given O(m/(c^2 log m)) independent samples from any distribution, returns a c-accurate estimate of the L1 distance to the uniform distribution of support m. This is the first sublinear-sample estimator for this problem, and is constant-factor optimal, for constant c.
Finally, our framework extends naturally to properties of pairs of distributions, including estimating the L1 distance and KL-divergence between pairs of distributions. We give an explicit linear estimator for estimating L1 distance to accuracy c using O(n/(c^2 log n)) samples from each distribution, which is constant-factor optimal, for constant c.
We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear-sample additive estimators for a class of properties that includes entropy and distribution support size. Together with the lower bounds proven in the companion paper [VV'10a], this settles the longstanding question of the sample complexities of these estimation problems (up to constant factors). Our algorithm estimates these properties up to an arbitrarily small additive constant, using O(n/log n) samples; [VV'10a] shows that no algorithm on o(n/log n) samples can achieve this (where n is a bound on the support size, or in the case of estimating the support size, 1/n is a lower bound on the probability of any element of the domain). Previously, no explicit sublinear-sample algorithms for either of these problems were known. Additionally, our algorithm runs in time linear in the number of samples used.
We prove two new multivariate central limit theorems; the first relates the sum of independent distributions to the multivariate Gaussian of corresponding mean and covariance, under the earthmover distance metric (also known as the Wasserstein metric). We leverage this central limit theorem to prove a stronger but more specific central limit theorem for "generalized multinomial" distributions---a large class of discrete distributions, parameterized by matrices, that generalize binomial and multinomial distributions, and describe many distributions encountered in computer science. This central limit theorem relates a generalized multinomial distribution to a multivariate Gaussian distribution, discretized by rounding to the nearest lattice points. In contrast to the metric of our first central limit theorem, this bound is in terms of statistical distance, which immediately implies that any algorithm with input drawn from a generalized multinomial distribution behaves essentially as if the input were drawn from a discretized Gaussian with the same mean and covariance. Such tools in the multivariate setting are rare, and we hope this new tool will be of use to the community.
In the second part of the paper, we employ this central limit theorem to establish a lower bound of Omega(n/ log n) on the sample complexity of additively estimating the entropy or support size of a distribution (where 1/n is a lower bound on the probability of any element in the domain). Together with the canonical estimator constructed in the companion paper [VV'10b], this settles the longstanding open question of the sample complexities of these estimation problems, up to constant factors. In particular, for any constants c_1 < log 2, and c_2 < 1/2, there is a family of pairs of distributions D,D' each of whose elements occurs with probability at least 1/n, whose entropies satisfy H(D)-H(D') > c_1, and whose support sizes differ by at least c_2 n, such that no algorithm on o(n/ log n) samples can distinguish D from D' with probability greater than 2/3. For the problem of estimating entropy, we also provide a bound on the
rate of convergence of an optimal estimator, showing that the sample complexity of estimating entropy to within additive c is Omega(n/(c log n)). The previous lower bounds on these sample complexities were n/2^Theta(Sqrt(log n)), for constant c. We explicitly exhibit such a family of pairs of distributions D,D' via a Laguerre polynomial construction that may be of independent interest.
We investigate the number of samples required for testing the monotonicity of a distribution with respect to an arbitrary underlying partially ordered set. Our first result is a nearly linear lower bound for the sample complexity of testing monotonicity with respect to the poset consisting of a directed perfect matching. This is the first nearly linear lower bound known for a natural non-symmetric property of distributions. Testing monotonicity with respect to the matching reduces to testing monotonicity with respect to various other natural posets, showing corresponding lower bounds for these posets also. Next, we show that whenever a poset has a linear-sized matching in the transitive closure of its Hasse digraph, testing monotonicity with respect to it requires Omega(Sqrt(n)) samples. Previous such lower bounds applied only to the total order. We also give upper bounds to the sample complexity in terms of the chain decomposition of the poset. Our results simplify the known tester for the two dimensional grid and give the first sublinear bounds for higher dimensional grids and the Boolean cube.
Because of its devastating effects in auctions and other mechanisms, collusion is prohibited and legally prosecuted. Yet, colluders have always existed, and may continue to exist. We thus raise the following question for mechanism design:
What desiderata are achievable, and by what type of mechanisms, when any set of players who wish to collude are free to do so without any restrictions on the way in which they cooperate and coordinate their actions?
In response to this question we put forward and exemplify the notion of a collusion-leveraging mechanism. In essence, this is a mechanism aligning its desiderata with the incentives of all its players, including colluders, to a significant and mutually beneficial extent. Of course such mechanisms may exist only for suitable desiderata.
In unrestricted combinatorial auctions, where classical mechanisms essentially guarantee 0 social welfare and 0 revenue in the presence of just two colluders, we prove that it is possible for collusion-leveraging mechanisms to guarantee that the sum of social welfare and revenue is always high, even when all players are collusive.
To guarantee better performance, collusion-leveraging mechanisms in essence "welcome" collusive players, rather than pretending they do not exist, raising a host of new questions at the intersection of cooperative and non-cooperative game theory.
In light of much recent interest in finding a model of multi-player multi-action games that allows for efficient computation of Nash equilibria yet remains as expressive as possible, we investigate the computational complexity of Nash equilibria in the recently proposed model of action-graph games (AGGs). AGGs, introduced by Bhat and Leyton-Brown, are succinct representations of games that encapsulate both local dependencies as in graphical games, and partial indifference to other agents' identities as in anonymous games, which occur in many natural settings such as financial markets. This is achieved by specifying a graph on the set of actions, so that the payoff of an agent for selecting a strategy depends only on the number of agents playing each of the neighboring strategies in the action graph. We present a simple Fully Polynomial Time Approximation Scheme for computing mixed Nash equilibria of AGGs with constant degree, constant treewidth and a constant number of agent types (but an arbitrary number of strategies), and extend this algorithm to a broader set of instances. However, the main results of this paper are negative, showing that when either of the latter conditions is relaxed the problem becomes intractable. In particular, we show that even if the action graph is a tree but the number of agent-types is unconstrained, it is NP-complete to decide the existence of a pure-strategy Nash equilibrium and PPAD-complete to compute a mixed Nash equilibrium (even an approximate one). The same is true for AGGs with a constant number of agent types but unconstrained treewidth. These hardness results suggest that, in some sense, our FPTAS is as strong a positive result as one can expect. In the broader context of trying to pin down the boundary where the equilibria of multi-player games can be computed efficiently, these results complement recent hardness results for graphical games and algorithmic results for anonymous games.
We introduce the notion of a Canonical Tester for a class of properties on distributions, that is, a tester strong and general enough that "a distribution property in the class is testable if and only if the Canonical Tester tests it''. We construct a Canonical Tester for the class of properties of one or two distributions that are symmetric and satisfy a certain weak continuity condition. Analyzing the performance of the Canonical Tester on specific properties resolves several open problems, establishing lower bounds that match known upper bounds: we show that distinguishing between entropy $<\alpha$ or $>\beta$ on distributions over $[n]$ requires $n^{\alpha/\beta- o(1)}$ samples, and distinguishing whether a pair of distributions has statistical distance $<\alpha$ or $>\beta$ requires $n^{1- o(1)}$ samples. Our techniques also resolve a conjecture about a property that our Canonical Tester does not apply to: distinguishing identical distributions from those with statistical distance $>\beta$ requires $\Omega(n^{2/3})$ samples.
We investigate dominant-strategy auction mechanisms that, should a sufficiently informed coalition of players be present, exploit it so as to guarantee more efficiency and revenue than is otherwise possible.
(Coming from a cryptographic tradition and prizing extreme settings, we refrain from relying on weaker notions of equilibrium; the availability of Bayesian information; any restrictions on the number of players, goods, and colluders; or any restrictions on the type of valuations and the ability of the colluders.)
Dominant-strategy truthfulness is traditionally considered the best possible solution concept in mechanism design, as it enables one to predict with confidence which strategies independent players will actually choose. Yet, as with any other form of equilibrium, it too can be extremely vulnerable to collusion. The problem of collusion is particularly evident for unrestricted combinatorial auctions, arguably the hardest type of auctions to design mechanisms for.
We thus investigate how much revenue can be guaranteed, in unrestricted combinatorial auctions, by dominant-strategy-truthful mechanisms that are collusion-resilient in a very strong sense; and obtain almost matching upper- and lower-bounds.
A probabilistically checkable proof (PCP) system enables proofs to be verified in time polylogarithmic in the length of a classical proof. Computationally sound (CS) proofs improve upon PCPs by additionally shortening the length of the transmitted proof to be polylogarithmic in the length of the classical proof.
In this paper we explore the ultimate limits of non-interactive proof systems with respect to time and space efficiency. We present a proof system where the prover uses space polynomial in the space of a classical prover and time essentially linear in the time of a classical prover, while the verifier uses time and space that are essentially constant. Further, this proof system is composable: there is an algorithm for merging two proofs of length k into a proof of the conjunction of the original two theorems in time polynomial in k, yielding a proof of length exactly k.
We deduce the existence of our proposed proof system by way of a natural new assumption about proofs of knowledge. In fact, a main contribution of our result is showing that knowledge can be "traded" for time and space efficiency in non-interactive proof systems. We motivate this result with an explicit construction of non-interactive CS proofs of knowledge in the random oracle model.
We further our algorithmic and structural understanding of Nash equilibria. Specifically:
-- We distill the hard core of the complexity of Nash equilibria, showing that even correctly computing a logarithmic number of bits of the equilibrium strategies of a two-player win-lose game is as hard as the general problem.
-- We prove the following structural result about Nash equilibria: "the set of approximate equilibria of a zero-sum game is convex."
The Internet's current interdomain routing protocol, BGP (Border Gateway Protocol), has two modes of operation: eBGP (external BGP), used to exchange routing information between autonomous systems, and iBGP (internal BGP), used to propagate that information within an autonomous system (AS). In a "full mesh" iBGP configuration, every router has a BGP session with every border router in the AS. Because a full mesh configuration has a large number of iBGP sessions and does not scale well, configurations based on route reflectors are commonly used for intra-AS route dissemination. Unfortunately, route reflector configurations violate important correctness properties, including loop-free forwarding and complete visibility to all eBGP-learned best routes, especially in the face of router and link failures.
This paper presents and analyzes the first (to our knowledge) algorithm, BGPSep, to construct an iBGP session configuration that is both correct and more scalable than a full mesh. BGPSep uses the notion of a graph separator (a small set of nodes whose removal partitions a graph into connected components of roughly equal sizes) to choose route reflectors and iBGP sessions in a way that guarantees correctness. We evaluate an implementation of the BGPSep algorithm on several real-world and simulated network topologies and find that iBGP configurations generated by BGPSep have between 2.5x and 5x fewer iBGP sessions than a full mesh.
The efficient computation of Nash equilibria is one of the most formidable challenges in computational complexity today. The problem remains open for two-player games.
We show that the complexity of two-player Nash equilibria is unchanged when all outcomes are restricted to be 0 or 1. That is, win-or-lose games are as complex as the general case for two-player games.
There has been significant interest lately in the task of constructing codes that are testable with a small number of random probes. Ben-Sasson and Sudan show that the repeated tensor product of codes leads to a general class of locally testable codes. One question that is not settled by their work is the local testability of a code generated by a single application of the tensor product. Special cases of this question have been studied in the literature in the form of "tests for bivariate polynomials", where the tensor product has been shown to be locally testable for certain families of codes. However the question remained open for the tensor product of generic families of codes. Here we resolve the question negatively, giving families of codes whose tensor product does not have good local testability properties.
For Boolean polynomials in Z_p of sufficiently low degree we derive a relation expressing their values on one level set in terms of their values on another level set. We use this relation to derive linear upper and lower bounds, tight to within a constant factor, on the degrees of various approximate majority functions, namely functions that take the value 0 on one level set, the value 1 on a different level set, and arbitrary 0-1 values on other Boolean inputs. We show sub-linear upper bounds in the case of moduli that are not prime powers.