teg_crosstab

[O, Chi2, p, lowvals] = teg_crosstab(x, y) returns the observed crosstab for two vectors x and y, and the results of the Chi squared test. lowvals returns the number of cells with fewer than 5 observations. If more than 10% of the cells have fewer than 5 expected counts, rows and columns are iteratively removed (such that most cells are preserved) until lowvals is under 10%.

function [O, Chi2, p, lowvals] = teg_crosstab(x, y)

ux = unique(x);
uy = unique(y);
N = length(x);
O = [];
E = [];
for ix = 1:length(ux),
    for iy = 1:length(uy),
        f = find(x == ux(ix) & y == uy(iy));
        O(ix, iy) = length(f);
    end;
end;
for ix = 1:length(ux),
    mx(ix) = sum(O(ix, :));
end;
for iy = 1:length(uy),
    my(iy) = sum(O(:, iy));
end;
for ix = 1:length(ux),
    for iy = 1:length(uy),
        E(ix, iy) = mx(ix) * my(iy) / N;
    end;
end;

flow = length(find(E(:) < 5));
lowvals = flow / length(E(:));
while lowvals >= 0.1,
    [ir, ic] = find(E < 5);
    uir = unique(ir);
    uic = unique(ic);
    if length(uir) / size(E, 1) <= length(uic) / size(E, 2),
        O(ir(1), :) = [];
        E(ir(1), :) = [];
    else,
        O(:, ic(1)) = [];
        E(:, ic(1)) = [];
    end;
    flow = length(find(E(:) < 5));
    lowvals = flow / length(E(:));
end;

[nx, ny] = size(E);
if nx * ny == 0,
    O = [];
    Chi2 = 0;
    p = 0.5;
    return;
end;

X = ((O - E) .^ 2) ./ E;
Chi2 = sum(X(:));
df = (length(ux) - 1) * (length(uy) - 1);
p = teg_chi2_cdf(Chi2, df);

function p = teg_chi2_cdf(x, k)
p = 1 - gammainc(x / 2, k / 2);