Computing the variance in one pass

There a variety of algorithms described for computing the variance of a set of numbers. Recall that the variance of a set values is $$v = \frac{1}{n-1} \sum_i \left( x_i - \tfrac{1}{n} \sum_{j} x_j \right)^2.$$ (This is the sample variance that is sometimes computed, the differences do not matter for our discussion.)

The easiest way to compute this formula is by first computing the mean: $$ \text{mean} = \tfrac{1}{n-1} \sum_{j} x_j $$ and then computing the variance based on the mean $$v = \frac{1}{n-1} \sum_i \left( x_i - \text{mean} \right)^2.$$

But this requires two visits to each datapoint. If there are millions or billions of datapoints, this could be expensive.

In [26]:
function badvar1(x::Vector{Float64})
ex2 = 0.0
ex = 0.0
n = length(x)
for i=1:n
  ex2 = ex2 + x[i]^2
  ex = ex + x[i]
end
    @show ex2, ex^2
return 1.0/(n-1)*(ex2 - (ex)^2/n)
end
Out[26]:
badvar1 (generic function with 1 method)
In [17]:
x = randn(100)
basevar = x -> (length(x)/(length(x)-1))*mean((x - mean(x)).^2)
@show badvar1(x)
@show basevar(x)
badvar1(x) = 1.032311000268402
basevar(x) = 1.0323110002684024
Out[17]:
1.0323110002684024
In [18]:
x = randn(10000)
@show badvar1(x)
@show basevar(x)
badvar1(x) = 1.0098437328466676
basevar(x) = 1.0098437328466736
Out[18]:
1.0098437328466736

Variance computations are immune to shifts

In [19]:
x = randn(10000)+1e4
@show badvar1(x)
@show basevar(x)
badvar1(x) = 1.019557363451189
basevar(x) = 1.0195578158695435
Out[19]:
1.0195578158695435
In [20]:
x = randn(10000)+1e6
@show badvar1(x)
@show basevar(x)
badvar1(x) = 0.9928992899289929
basevar(x) = 1.0077890284079822
Out[20]:
1.0077890284079822
In [27]:
x = randn(10000)+1e8
@show badvar1(x)
@show basevar(x)
(ex2,ex ^ 2) = (1.0000000002153153e20,1.0000000002153186e24)
badvar1(x) = -34.40984098409841
basevar(x) = 1.0046044826796907
Out[27]:
1.0046044826796907
In [24]:
function goodvar(x::Vector{Float64})
    n = length(x); mean = 0.0; m2 = 0.0; N = 0.0
    for i=1:n
        N = N + 1
        delta = x[i] - mean
        mean = mean + delta/N
        m2 = m2 + delta*(x[i]-mean)
    end
    return m2/(n-1)
end
Out[24]:
goodvar (generic function with 1 method)
In [25]:
x = randn(10000)+1e8
basevar = var
@show badvar1(x)
@show basevar(x)
@show goodvar(x)
badvar1(x) = 81.92819281928193
basevar(x) = 1.016826575245556
goodvar(x) = 1.0168265735653041
Out[25]:
1.0168265735653041