## Contents

```
function ANOVA_Assumptions()
```

% Michele Rotella % Psych 216 % Final Project Tutorial % June 5, 2012

## Exploration of Normality and Equal Variance Assumptions in ANOVA test %%

% The ANOVA (analysis of variance) tests hypotheses on the means of samples % from several different populations. It is used especially to test whether % the means of two or more samples are equal, which is relevant to analysis % of behavioral data, ie. testing whether task performance is different % across feedback conditions. % The goal of this tutorial is to explore the normality and equal variance % assumptions for the parametric test. % (1) NORMALITY: % By assuming the data is normal, we accept that the mean represents the % central tendency of the data. If data is non-normal, we are increasing % our chance of finding a false positive for the test. We expect the % ANOVA will be fairly robust to non-normal data. % (2) EQUAL VARIANCE: % The equal variance assumption assumes that each data group that is % compared comes from a population with equivalent spread. We expect the % ANOVA will be less robust to violations of this assumption. % To goal of this tutorial is to % (a) Explore the effects of violating the normality and equal variance % assumptions by considering the rate of false positives for the test % (the percent of time the ANOVA finds a difference in group means when % there is none). % (b) We consider how each of these assumptions is affected by sample size, % equal/unequal group numbers, and population characteristics. clear all close all clc % Set significance value alpha = 0.05; %%%%%%%%%%%%%%%%%%%%%%% Part 0: The ANOVA Test %%%%%%%%%%%%%%%%%%%%%%%%%%%% % Three test groups are randomly sampled from normal populations with % equal variance; one group has a different mean. We run an ANOVA on the % random samples 1000 times. iterations = 1000; % Conduct a 1-way ANOVA, where a p-value < 0.05 indicates a significant % difference in means between any two (or more) groups. for i = 1:iterations % Draw three test groups (each with 100 samples) G1 = 2 + 2.*randn(100,1); % Normal distribution with mean = 2; SD = 2 G2 = 2 + 2.*randn(100,1); G3 = 3 + 2.*randn(100,1); % Normal distribution with mean = 3; SD = 2 data = [G1, G2, G3]; % Run one-way ANOVA, suppress table output [p_0] = anova1(data,[],'off'); % Collect p-values in a vector pval_0(i) = p_0; % Collect data samples G1_ALL (i,:) = G1; G2_ALL (i,:) = G2; G3_ALL (i,:) = G3; end % Visualize the averaged data figure('Name','Demonstration of ANOVA test') PlotHistogramData(mean(G1_ALL,1), mean(G2_ALL,1), mean(G3_ALL,1)) title('Averaged Samples from Three Normal Populations') % Calculate the percentage of p-values that incorrectly find no difference false_0 = find(pval_0 >= alpha); PercentFalse_0 = length(false_0)/iterations*100; % Plot the p-values across tests figure('Name','Results of Demonstration ANOVA') PlotHistogramPval(pval_0) title(sprintf('Demonstration Data\n False: %f %%',PercentFalse_0)); clear G1 G2 G3 G1_ALL G2_ALL G3_ALL % Results of the ANOVA show that a significant difference between group % means is found with a very low rate of a false negative (< 5%). %%%%%%%% Part 1. Exploring Non-Normality %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Case 1: As a baseline measure, we test for a difference in mean between % three groups of data generated from the same normal distribution with % mean = 2 and SD = 2. When we run an ANOVA test with all groups having the % same (large) number of samples = 100, we expect that there will be no % significant difference in the means. % Draw three test groups (each with 100 samples) and conduct ANOVA for i = 1:iterations % Define test groups of data G1 = 2 + 2.*randn(100,1); G2 = 2 + 2.*randn(100,1); G3 = 2 + 2.*randn(100,1); data = [G1, G2, G3]; % Perform one-way ANOVA using built-in MATLAB function [p_AllNormal] = anova1(data,[],'off'); % Collect data samples in vectors pval_AllNormal(i) = p_AllNormal; G1_ALL (i,:) = G1; G2_ALL (i,:) = G2; G3_ALL (i,:) = G3; end % Calculate percentage of iterations that are false positives false_AllNormal = CalculatePercentFalse(pval_AllNormal, alpha, iterations); % Visualize the data figure('Name','Visualization of Data Samples Including Non-normal Populations') subplot(2,3,1) PlotHistogramData(mean(G1_ALL,1), mean(G2_ALL,1), mean(G3_ALL,1)) title('Averaged Samples from All Normal Populations') clear G1 G2 G3 G1_ALL G2_ALL G3_ALL % Given that we start with data sampled from the same, normalized % distribution, we expect the group means to be statistically equivalent % and predict that 95% of p-values should be greater than the significance % level, alpha = 0.05. The ANOVA test finds a significant difference in the % means between the three samples approximately 5% of the time, as % expected. %-------------------------------------------------------------------------- % Case 2: Now we consider the case when one of the data groups is % non-normal, but all population means are expected to be equal. The third % data group is sampled from an exponential distribution with mean = 2. The % ANOVA is balanced, where each group has the same (large) number of % samples = 100; To identify the effect only of the normality parameter, % the variance is kept the same. The variance of the normal populations is % 4, and the variance of the exponential distribution is 4. % Characteristics of Exponential distribution % 1/lambda = mean; % 1/(lambda*lambda) = variance; % mean = 2; lambda = 1/2; % variance = 4; sd = 2; for i = 1:iterations G1 = 2 + 2.*randn(100,1); G2 = 2 + 2.*randn(100,1); G3 = random('exp',2,100,1); data = [G1, G2, G3]; [p_wExpLarge] = anova1(data,[],'off'); pval_wExpLarge(i) = p_wExpLarge; G1_ALL (i,:) = G1; G2_ALL (i,:) = G2; G3_ALL (i,:) = G3; end % Calculate percentage of false positives false_wExpLarge = CalculatePercentFalse(pval_wExpLarge, alpha, iterations); % Visualize the data subplot(2,3,2) PlotHistogramData(mean(G1_ALL,1), mean(G2_ALL,1), mean(G3_ALL,1)) title('Non-Normal Exp. Group, Lg. Equal Size: n = 100') clear G1 G2 G3 G1_ALL G2_ALL G3_ALL % Of all iterations, approximately 5% return false positive results showing ANOVA % test is robust to non-normality, given the same (large) number of samples % in each group. % ------------------------------------------------------------------------ % Case 3: We consider non-normally distributed data when we have equal, but % small sample sizes = 20. for i = 1:iterations G1 = 2 + 2.*randn(20,1); G2 = 2 + 2.*randn(20,1); G3 = random('exp',2,20,1); % Mean of exponential distribution is 2 data = [G1, G2, G3]; [p_wExpSmall] = anova1(data,[],'off'); pval_wExpSmall(i) = p_wExpSmall; G1_ALL (i,:) = G1; G2_ALL (i,:) = G2; G3_ALL (i,:) = G3; end % Calculate percentage of false positives false_wExpSmall = CalculatePercentFalse(pval_wExpSmall, alpha, iterations); % Visualize the data subplot(2,3,3) PlotHistogramData(mean(G1_ALL,1), mean(G2_ALL,1), mean(G3_ALL,1)) title('Non-Normal Exp. Group, Sm. Equal Size: n = 20') clear G1 G2 G3 G1_ALL G2_ALL G3_ALL %-------------------------------------------------------------------------- % Case 4: When we have unequal sample sizes and data from a % non-normal distribution. The ANOVA test is unbalanced and we do not % expect it to be robust against non-normality. for i = 1:iterations group = {'G1','G1','G1','G1','G1','G1','G1','G1','G1','G1',... 'G1','G1','G1','G1','G1','G1','G1','G1','G1','G1',... 'G1','G1','G1','G1','G1','G1','G1','G1','G1','G1',... 'G2','G2','G2','G2','G2','G2','G2','G2','G2','G2',... 'G2','G2','G2','G2','G2','G2','G2','G2','G2','G2',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... }; G1 = 2 + 2.*randn(30,1); G2 = 2 + 2.*randn(20,1); G3 = random('exp',2,10,1); % Mean of exponenetial data is data = [G1', G2', G3']; [p_wExpUnequalGrp] = anova1(data, group,'off'); pval_wExpUnequalGrp(i) = p_wExpUnequalGrp; G1_ALL (i,:) = G1; G2_ALL (i,:) = G2; G3_ALL (i,:) = G3; end % Calculate percentage of false positives false_wExpUnequalGrp = CalculatePercentFalse(pval_wExpUnequalGrp, alpha, iterations); % Visualize the data subplot(2,3,4) PlotHistogramData(mean(G1_ALL,1), mean(G2_ALL,1), mean(G3_ALL,1)) title('Non-Normal Exp. Group, Unequal Size: n(G1,G2,G3) = 30,20,10') clear G1 G2 G3 G1_ALL G2_ALL G3_ALL %-------------------------------------------------------------------------- % Case 5: Now the non-normal data is taken from the Poisson distribution % with equal mean and variance. We specify the Poisson distribution by % selecting the following parameters: % lambda = mean; % lambda = variance; % mean = 2; lambda = 2; % variance = 2; sd = sqrt(2); % In this case, two small groups (n = 15) come from a non-normal Poisson % distribution for i = 1:iterations G1 = 2 + sqrt(2).*randn(15,1); G2 = 2 + sqrt(2).*randn(15,1); G3 = random('poiss',2,15,1); data = [G1, G2, G3]; [p_wPoissSmall] = anova1(data,[],'off'); pval_wPoissSmall(i) = p_wPoissSmall; G1_ALL (i,:) = G1; G2_ALL (i,:) = G2; G3_ALL (i,:) = G3; end % Calculate percentage of false positives false_wPoissSmall = CalculatePercentFalse(pval_wPoissSmall, alpha, iterations); % Visualize the data subplot(2,3,5) PlotHistogramData(mean(G1_ALL,1), mean(G2_ALL,1), mean(G3_ALL,1)) title('Non-normal Poisson Group, Sm. Equal Size: n = 15') clear G1 G2 G3 G1_ALL G2_ALL G3_ALL %-------------------------------------------------------------------------- % Case 6: Again, the non-normal data comes from a Poisson distribution, but % there are unequal sample sizes and only one non-normal group. for i = 1:iterations group = {'G1','G1','G1','G1','G1','G1','G1','G1','G1','G1',... 'G2','G2','G2','G2','G2','G2','G2','G2','G2','G2',... 'G2','G2','G2','G2','G2','G2','G2','G2','G2','G2',... 'G2','G2','G2','G2','G2','G2','G2','G2','G2','G2',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... }; G1 = 2 + sqrt(2).*randn(10,1); G2 = 2 + sqrt(2).*randn(30,1); G3 = random('poiss',2,100,1); data = [G1', G2', G3']; [p_wPoissUnequalGrp] = anova1(data, group,'off'); pval_wPoissUnequalGrp(i) = p_wPoissUnequalGrp; G1_ALL (i,:) = G1; G2_ALL (i,:) = G2; G3_ALL (i,:) = G3; end % Calculate percentage of false positives false_wPoissUnequalGrp = CalculatePercentFalse(pval_wPoissUnequalGrp, alpha, iterations); % Visualize the data subplot(2,3,6) PlotHistogramData(mean(G1_ALL,1), mean(G2_ALL,1), mean(G3_ALL,1)) title('Non-normal Poisson Group, Unequal Size: n(G1,G2,G3) = 10,30,100') clear G1 G2 G3 G1_ALL G2_ALL G3_ALL % Plot all ANOVA results for cases 1-6 figure('Name','Results of ANOVA: Exploring Data Normality ') subplot(2,3,1) PlotHistogramPval(pval_AllNormal) title(sprintf('All Normal Data, Lg. Equal Size: n = 100\n False Positives: %f %%',false_AllNormal)); subplot(2,3,2) PlotHistogramPval(pval_wExpLarge) title(sprintf('Non-Normal Exp. Group, Lg. Equal Size: n = 100\n False Positives: %f %%', false_wExpLarge)); subplot(2,3,3) PlotHistogramPval(pval_wExpSmall) title(sprintf('Non-Normal Exp. Group, Sm. Equal Size: n = 20\n False Positives: %f %%',false_wExpSmall)); subplot(2,3,4) PlotHistogramPval(pval_wExpUnequalGrp) title(sprintf('Non-Normal Exp. Group, Unequal Size: n(G1,G2,G3) = 30,20,10\n False Positives: %f %%',false_wExpUnequalGrp)); subplot(2,3,5) PlotHistogramPval(pval_wPoissSmall) title(sprintf('Non-normal Poisson Group, Sm. Equal Size: n = 15\n False Positives: %f %%',false_wPoissSmall)); subplot(2,3,6) PlotHistogramPval(pval_wPoissUnequalGrp) title(sprintf('Non-normal Poisson Group, Unequal Size: n(G1,G2,G3) = 10,30,100\n False Positives: %f %%', false_wPoissUnequalGrp)); % We see that even with unequal sample sizes and data sampled from two % different non-normal distributions, the rate of false positives of the % ANOVA test is not affected by the violation of normality. We conclude % that that we can apply the ANOVA test even when our data violates the % normality assumption. %%%%%%%% Part 2. Exploring Un-Equal Variance %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % The second assumption for the ANOVA test is that the test groups are % sampled from populations of equal variance. To test the severity of % violations of this assumption, we simulate an ANOVA on three groups of % normal data with a difference in variance and look at the influence of % changing sample size and the degree by which variance is different. %-------------------------------------------------------------------------- % Case 1: First, all data is sampled from normal distributions, with one % distribution having unequal variance. All test groups have the same large % (n = 100) number of samples. for i = 1:iterations G1 = 2 + 2.*randn(100,1); % Variance = 4 G2 = 2 + 2.*randn(100,1); % Variance = 4 G3 = 2 + 3.*randn(100,1); % Variance = 9 data = [G1, G2, G3]; [p_DiffVarLarge] = anova1(data,[],'off'); pval_DiffVarLarge(i) = p_DiffVarLarge; G1_ALL (i,:) = G1; G2_ALL (i,:) = G2; G3_ALL (i,:) = G3; end % Calculate the percentage of false positives false_DiffVarLarge = CalculatePercentFalse(pval_DiffVarLarge, alpha, iterations); % Visualize the data figure('Name','Visualization of Samples from Populations with Unequal Variance') subplot(2,3,1) PlotHistogramData(mean(G1_ALL,1), mean(G2_ALL,1), mean(G3_ALL,1)) title('Lg. Equal Size: n = 100; var (G3) = 9') clear G1 G2 G3 G1_ALL G2_ALL G3_ALL %-------------------------------------------------------------------------- % Case 2: Data is sampled from normal distributions, with one % distribution having unequal variance, that is further amplified. All test % groups have the same large number of samples (n = 100); for i = 1:iterations G1 = 2 + 2.*randn(100,1); % Variance = 4 G2 = 2 + 2.*randn(100,1); % Variance = 4 G3 = 2 + 5.*randn(100,1); % Variance = 25 data = [G1, G2, G3]; [p_GreaterVarLarge] = anova1(data,[],'off'); pval_GreaterVarLarge(i) = p_GreaterVarLarge; G1_ALL (i,:) = G1; G2_ALL (i,:) = G2; G3_ALL (i,:) = G3; end % Calculate the percentage of false positives false_GreaterVarLarge = CalculatePercentFalse(pval_GreaterVarLarge, alpha, iterations); % Visualize the data subplot(2,3,2) PlotHistogramData(mean(G1_ALL,1), mean(G2_ALL,1), mean(G3_ALL,1)) title('Lg. Equal Size: n = 100; var(G3) = 25') clear G1 G2 G3 G1_ALL G2_ALL G3_ALL %-------------------------------------------------------------------------- % Case 3: Data is sampled from normal distributions, with one % distribution having unequal variance. All test groups have the same small % (n = 30) number of samples. for i = 1:iterations G1 = 2 + 2.*randn(30,1); % Variance = 4 G2 = 2 + 2.*randn(30,1); % Variance = 4 G3 = 2 + 3.*randn(30,1); % Variance = 9 data = [G1, G2, G3]; [p_DiffVarSmall] = anova1(data,[],'off'); pval_DiffVarSmall(i) = p_DiffVarSmall; G1_ALL (i,:) = G1; G2_ALL (i,:) = G2; G3_ALL (i,:) = G3; end % Calculate the percentage of false positives false_DiffVarSmall = CalculatePercentFalse(pval_DiffVarSmall, alpha, iterations); % Visualize the data subplot(2,3,3) PlotHistogramData(mean(G1_ALL,1), mean(G2_ALL,1), mean(G3_ALL,1)) title('Sm. Equal Size: n = 30; var(G3) = 9') clear G1 G2 G3 G1_ALL G2_ALL G3_ALL %-------------------------------------------------------------------------- % Case 4: Data is sampled from normal distributions, with one % distribution having unequal variance. All test groups have the same small % (n = 30) number of samples. Now the difference in variance is further % exaggerated. for i = 1:iterations G1 = 2 + 2.*randn(30,1); % Variance = 4 G2 = 2 + 2.*randn(30,1); % Variance = 4 G3 = 2 + 5.*randn(30,1); % Variance = 25 data = [G1, G2, G3]; [p_GreaterDiffVarSmall] = anova1(data,[],'off'); pval_GreaterDiffVarSmall(i) = p_GreaterDiffVarSmall; G1_ALL (i,:) = G1; G2_ALL (i,:) = G2; G3_ALL (i,:) = G3; end % Calculate the percentage of false positives false_GreaterDiffVarSmall = CalculatePercentFalse(pval_GreaterDiffVarSmall, alpha, iterations); % Visualize the data subplot(2,3,4) PlotHistogramData(mean(G1_ALL,1), mean(G2_ALL,1), mean(G3_ALL,1)) title('Sm. Equal Size, Lg. Variance: n = 30; var(G3) = 25') clear G1 G2 G3 G1_ALL G2_ALL G3_ALL %-------------------------------------------------------------------------- % Case 5: Data is sampled from normal distributions, with one % distribution having unequal variance. All test groups have different % numbers of samples. for i = 1:iterations group = {'G1','G1','G1','G1','G1','G1','G1','G1','G1','G1',... 'G1','G1','G1','G1','G1','G1','G1','G1','G1','G1',... 'G2','G2','G2','G2','G2','G2','G2','G2','G2','G2',... 'G2','G2','G2','G2','G2','G2','G2','G2','G2','G2',... 'G2','G2','G2','G2','G2','G2','G2','G2','G2','G2',... 'G2','G2','G2','G2','G2','G2','G2','G2','G2','G2',... 'G2','G2','G2','G2','G2','G2','G2','G2','G2','G2',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... 'G3','G3','G3','G3','G3','G3','G3','G3','G3','G3',... }; G1 = 2 + 2.*randn(20,1); % Variance = 4 G2 = 2 + 2.*randn(50,1); % Variance = 4 G3 = 2 + 6.*randn(30,1); % Variance = 36 data = [G1', G2', G3']; [p_GreaterDiffVarUnequal] = anova1(data, group,'off'); pval_GreaterDiffVarUnequal(i) = p_GreaterDiffVarUnequal; G1_ALL (i,:) = G1; G2_ALL (i,:) = G2; G3_ALL (i,:) = G3; end % Calculate the percentage of false positives false_GreaterDiffVarUnequal = CalculatePercentFalse(pval_GreaterDiffVarUnequal, alpha, iterations); % Visualize the data subplot(2,3,5) PlotHistogramData(mean(G1_ALL,1), mean(G2_ALL,1), mean(G3_ALL,1)) title('Unequal Size, Lg. Variance: n(G1,G2,G3) = 20,50,30; var(G3) = 36') clear G1 G2 G3 G1_ALL G2_ALL G3_ALL % Plot all ANOVA results for cases 1-4 figure('Name','Results of ANOVA: Exploring Data with Unequal Variance ') subplot(2,3,1) PlotHistogramPval(pval_DiffVarLarge) title(sprintf('Lg. Equal Size: n = 100; var(G3) = 9\n False Positives: %f %%',false_DiffVarLarge)); subplot(2,3,2) PlotHistogramPval(pval_GreaterVarLarge) title(sprintf('Lg. Equal Size: n = 100; var(G3) = 25\n False Positives: %f %%',false_GreaterVarLarge)); subplot(2,3,3) PlotHistogramPval(pval_DiffVarSmall) title(sprintf('Sm. Equal Size: n = 30; var(G3) = 9\n False Positives: %f %%',false_DiffVarSmall)); subplot(2,3,4) PlotHistogramPval(pval_GreaterDiffVarSmall) title(sprintf('Sm. Equal Size, Lg. Variance: n = 30; var(G3) = 25\n False Positives: %f %%',false_GreaterDiffVarSmall)); subplot(2,3,5) PlotHistogramPval(pval_GreaterDiffVarUnequal) title(sprintf('Unequal Size, Lg. Variance: n(G1,G2,G3) = 20,50,30; var(G3) = 36\n False Positives: %f %%',false_GreaterDiffVarUnequal)); % Results show that violation of the assumption of equal variance more % severely impacts the accuracy of the ANOVA test. Increasing difference in % variance between two groups lead to a noticeable change in outcome. % The overall conclusion of the simulation is that the ANOVA-one test is % robust to violations of normality, but violation of equal variance may % lead to larger errors. In this case, a non-parametric test such as the % Welch's ANOVA may be more appropriate or an adjustment such as the % Greenhouse-Geisser's adjustment should be applied to the data. % Declare subfunctions function PlotHistogramData(Data1, Data2, Data3) % This function plots histrograms for three input data sets. The third data % set is highlighted using shading. % Visualize the averaged data hist(Data3); hold on hist(Data2) hist(Data1) % Recolor and change transparency h = findobj(gca,'Type','patch'); set(h(3), 'FaceColor','g', 'EdgeColor','w', 'FaceAlpha',.5, 'LineWidth', 1.5) set(h(2), 'FaceColor','w', 'EdgeColor','b', 'FaceAlpha',0, 'LineWidth', 1.5) set(h(1), 'FaceColor','w', 'EdgeColor','m', 'FaceAlpha',0, 'LineWidth', 1.5) % Label and create legend xlabel('Number of Samples') ylabel('Frequency') legend('Group 3', 'Group 2', 'Group 1') end function PlotHistogramPval(pVal, PercentFalse) % This function plots a histogram for a single data set. % Plot histogram hist(pVal) % Modify appearance h = findobj(gca,'Type','patch'); set(h(1),'FaceColor','b', 'EdgeColor','w', 'FaceAlpha',1) % Label axes xlabel('P-value') ylabel('Frequency') end function PercentFalse = CalculatePercentFalse(pVal, alpha, iterations) % This function calculates the percent of false positives false = find(pVal < alpha); PercentFalse = length(false)/iterations*100; end

```
end
```