
# Formulas for statistics

### T-statistic for correlations

t = r / sqrt((1 - r^2) / (N - 2)), with df = N - 2.
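
A minimal sketch of this formula (written in Python rather than Matlab, for illustration):

```python
import math

def corr_t(r, N):
    """t-statistic and degrees of freedom for a correlation r over N pairs."""
    df = N - 2
    t = r / math.sqrt((1 - r**2) / df)
    return t, df
```

For example, `corr_t(0.5, 27)` gives df = 25 and t ≈ 2.89.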

### Independent two-sample t-test, general case

t = (mean(X1) - mean(X2)) / S,

where S = sqrt(var(X1) / length(X1) + var(X2) / length(X2)).

The degrees of freedom (the Welch–Satterthwaite approximation) equal

(var(X1)/length(X1) + var(X2)/length(X2))^2 / ((var(X1)/length(X1))^2 / (length(X1) - 1) + (var(X2)/length(X2))^2 / (length(X2) - 1)).

In Matlab, the two-sided p-value follows from the t-distribution CDF, e.g. 2 * (1 - tcdf(abs(t), df)).
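
The two formulas above can be sketched as follows (in Python/NumPy rather than Matlab, for illustration):

```python
import numpy as np

def welch_t(X1, X2):
    """Welch's two-sample t-statistic and Welch-Satterthwaite degrees of freedom."""
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    v1 = X1.var(ddof=1) / len(X1)   # var(X1) / length(X1)
    v2 = X2.var(ddof=1) / len(X2)   # var(X2) / length(X2)
    t = (X1.mean() - X2.mean()) / np.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (len(X1) - 1) + v2**2 / (len(X2) - 1))
    return t, df
```

Note ddof=1: NumPy's default variance divides by n, whereas Matlab's var (and the formula above) divides by n - 1.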

### F-test for reduction in error variance

Say you have two models, New and Old, where New adds one or more parameters to Old. Fitting both to a data vector produces two prediction vectors, pred_new and pred_old. New will always explain at least as much variance as Old; the F-test asks whether New explains more variance than would be expected from adding parameters unrelated to the data. Let

F = ((RSS_old - RSS_new) / df1) / (RSS_new / df2);

where

RSS_old = sum((data - pred_old) .^ 2);
RSS_new = sum((data - pred_new) .^ 2);
df1 = p_new - p_old;
df2 = n - p_new;

and n is the number of observations, p_old is the number of parameters (including the constant) of the old model, and p_new the number of parameters of the new model. (Since least-squares residuals of a model with a constant term have zero mean, RSS also equals var(data - pred) * (n - 1).) Under the null hypothesis, F follows an F-distribution with (df1, df2) degrees of freedom, and its p-value can be calculated using

x = df1 * F / (df1 * F + df2);
a = df1 / 2;
b = df2 / 2;
p = 1 - betainc(x, a, b);

Note that Matlab's betainc is already the regularized incomplete beta function, so the code above needs no further normalization; formulas written in terms of the unregularized incomplete beta (with an extra division by beta(a, b)) give wrong results if applied on top of it.
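
The whole calculation can be sketched in Python (SciPy's betainc is, like Matlab's, the regularized incomplete beta function, but note its different argument order):

```python
import numpy as np
from scipy.special import betainc  # regularized incomplete beta, argument order (a, b, x)

def f_test(data, pred_old, pred_new, p_old, p_new):
    """F-test for whether the larger model reduces residual variance."""
    data = np.asarray(data, float)
    n = len(data)
    rss_old = np.sum((data - pred_old) ** 2)
    rss_new = np.sum((data - pred_new) ** 2)
    df1, df2 = p_new - p_old, n - p_new
    F = ((rss_old - rss_new) / df1) / (rss_new / df2)
    x = df1 * F / (df1 * F + df2)
    p = 1 - betainc(df1 / 2, df2 / 2, x)
    return F, p
```

For example, comparing a constant-only model (p_old = 1) against a linear fit (p_new = 2) on five data points reduces to an F-test with df1 = 1 and df2 = 3.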

### Variance

The (population) variance is equal to mean(X.^2) - mean(X)^2. The variance can therefore be calculated in a single pass over a vector, accumulating the values in a Sum variable and the squared values in a SquaredSum variable. (Note that this one-pass formula can lose precision when the mean is large relative to the spread.)
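
A one-pass sketch of this (in Python for illustration; the accumulator names mirror the ones above):

```python
def one_pass_variance(xs):
    """Population variance in a single pass: mean of squares minus squared mean."""
    n = total = squared_total = 0
    for x in xs:
        n += 1
        total += x            # the Sum variable
        squared_total += x * x  # the SquaredSum variable
    return squared_total / n - (total / n) ** 2
```

Multiply the result by n / (n - 1) to obtain the sample variance that Matlab's var returns.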

### Multiple regression

If the model of Y is Xb, then b = inv(X' * X) * X' * Y (in Matlab, b = X \ Y computes the same solution more stably). To remove the effect of a predictor, set its value in b to 0 before reconstructing using Xb. b is chosen so that the combination of predictors explains the greatest possible amount of variance; how reliably each predictor's b-value is estimated depends on the correlations between the predictors.

If the data contain error (Y = Xb + error), note that the term inv(X' * X) * X' * error is added to b. The estimated b therefore fluctuates around the true b with zero mean whenever the error itself has zero mean.
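
A sketch of the estimate and of removing one predictor's effect (in Python/NumPy for illustration; the simulated data and coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Design matrix: a ones column for the offset plus two random predictors
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
true_b = np.array([1.0, 2.0, -3.0])
Y = X @ true_b + 0.1 * rng.normal(size=n)  # zero-mean error added to the model

# Normal equations b = inv(X'X) X'Y, via solve rather than an explicit inverse
b = np.linalg.solve(X.T @ X, X.T @ Y)

# Remove the effect of the third predictor: zero its weight, then reconstruct
b_reduced = b.copy()
b_reduced[2] = 0
Y_without_third = X @ b_reduced
```

Because the added error has zero mean, b lands close to true_b here.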

To remove the effect of a covariate on a dependent variable, subtract the covariate's mean and use it as a predictor together with a column of ones for the offset. Reconstruct the corrected variable as the offset plus the residuals (i.e. Y minus the model fit), and use this corrected variable in subsequent analyses.
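
These steps can be sketched as follows (in Python/NumPy for illustration; the function name is made up):

```python
import numpy as np

def remove_covariate(Y, C):
    """Regress covariate C out of Y; return the offset plus the residuals."""
    C = C - C.mean()                          # mean-center the covariate
    X = np.column_stack([np.ones(len(Y)), C])  # ones column for the offset
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    return b[0] + (Y - X @ b)                 # offset plus residuals
```

Because the covariate is mean-centered, the offset equals the mean of Y, so the corrected variable keeps Y's mean while becoming uncorrelated with C.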