Computing the variance in one pass¶

There a variety of algorithms described for computing the variance of a set of numbers. Recall that the variance of a set values is $$v = \frac{1}{n-1} \sum_i \left( x_i - \tfrac{1}{n} \sum_{j} x_j \right)^2.$$ (This is the sample variance that is sometimes computed, the differences do not matter for our discussion.)

The easiest way to compute this formula is by first computing the mean: $$ \text{mean} = \tfrac{1}{n-1} \sum_{j} x_j $$ and then computing the variance based on the mean $$v = \frac{1}{n-1} \sum_i \left( x_i - \text{mean} \right)^2.$$

But this requires two visits to each datapoint. If there are millions or billions of datapoints, this could be expensive.

function badvar1(x::Vector{Float64})
ex2 = 0.0
ex = 0.0
n = length(x)
for i=1:n
  ex2 = ex2 + x[i]^2
  ex = ex + x[i]
end
    @show ex2, ex^2
return 1.0/(n-1)*(ex2 - (ex)^2/n)
end

badvar1 (generic function with 1 method)

x = randn(100)
basevar = x -> (length(x)/(length(x)-1))*mean((x - mean(x)).^2)
@show badvar1(x)
@show basevar(x)

badvar1(x) = 1.032311000268402
basevar(x) = 1.0323110002684024

1.0323110002684024

x = randn(10000)
@show badvar1(x)
@show basevar(x)

badvar1(x) = 1.0098437328466676
basevar(x) = 1.0098437328466736

1.0098437328466736

Variance computations are immune to shifts

x = randn(10000)+1e4
@show badvar1(x)
@show basevar(x)

badvar1(x) = 1.019557363451189
basevar(x) = 1.0195578158695435

1.0195578158695435

x = randn(10000)+1e6
@show badvar1(x)
@show basevar(x)

badvar1(x) = 0.9928992899289929
basevar(x) = 1.0077890284079822

1.0077890284079822

x = randn(10000)+1e8
@show badvar1(x)
@show basevar(x)

(ex2,ex ^ 2) = (1.0000000002153153e20,1.0000000002153186e24)
badvar1(x) = -34.40984098409841
basevar(x) = 1.0046044826796907

1.0046044826796907

function goodvar(x::Vector{Float64})
    n = length(x); mean = 0.0; m2 = 0.0; N = 0.0
    for i=1:n
        N = N + 1
        delta = x[i] - mean
        mean = mean + delta/N
        m2 = m2 + delta*(x[i]-mean)
    end
    return m2/(n-1)
end

goodvar (generic function with 1 method)

x = randn(10000)+1e8
basevar = var
@show badvar1(x)
@show basevar(x)
@show goodvar(x)

badvar1(x) = 81.92819281928193
basevar(x) = 1.016826575245556
goodvar(x) = 1.0168265735653041

1.0168265735653041