Interactive SAS tutorials supporting the OpenIntro Introduction to Modern Statistics textbook.
The next logical target for statistical inference is comparing two proportions. Let’s continue from the last lesson with the General Social Survey data on opinions regarding the legalization of marijuana.
* Initialize this SAS session;
%include "~/my_shared_file_links/hammi002/sasprog/run_first.sas";
* Make a working copy of the GSS2010 data and check it;
%use_data(gss2010);
%glimpse(gss2010);
For this lesson, we will compare two groups defined by the educational attainment variable, degree. Let’s limit the dataset to the groups with lowest (5 = “Less than high school”) and highest (2 = “Graduate”) educational attainment, so we can see if their opinions differ. (As before, we are also limiting the data to have non-missing information for the grass variable.) Then let’s calculate the observed proportion of each group that supports legalization of marijuana.
* Limit to the two degree groups with a non-missing question response;
data gss2010;
set gss2010(
where = (
not missing(grass) and
degree in (2, 5)
)
);
run;
* Calculate proportion of each group favoring legality of marijuana;
proc freq data=gss2010;
tables degree * grass / nocol nopct;
run;
Here we see big differences. Among the 136 respondents with a graduate degree, 53.7% believe marijuana should be legal, compared to only 33.3% of the 195 respondents with less than a high school education. So our difference in proportions (using the less-than-HS group as the reference) is about 0.20 (0.537 – 0.333 = 0.204 from these rounded percentages; the calculations later in this lesson use 0.203, which comes from the unrounded proportions). This is our point estimate for the true difference in proportions between these 2 groups of Americans.
As before, let’s estimate a plausible range of values for this difference in proportions using a 95% confidence interval based on 5,000 bootstrap re-samples. (Remember, the bootstrap method needs no distributional assumptions!)
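For intuition, the bootstrap idea can be sketched with standard SAS procedures. This is only an illustration under stated assumptions, not the code inside the course macro: it assumes the filtered gss2010 working copy from above, numeric degree and grass variables, and an arbitrary seed. The dataset names (gss_flag, boot_samples, boot_props, boot_diffs, boot_ci) are made up for this sketch.

```sas
* Sketch only, not the course macro. Flag the response of interest (1 = Legal);
data gss_flag;
set gss2010;
legal = (grass = 1);
run;

* SURVEYSELECT needs the data sorted by the STRATA variable;
proc sort data=gss_flag;
by degree;
run;

* Draw 5000 resamples with replacement within each degree group;
proc surveyselect data=gss_flag out=boot_samples
method=urs samprate=1 outhits reps=5000
seed=2010 noprint;
strata degree;
run;

* Proportion favoring legalization per resample and group;
proc means data=boot_samples noprint nway;
class replicate degree;
var legal;
output out=boot_props mean=p;
run;

* One row per resample with graduate minus less-than-HS difference;
data boot_diffs;
merge boot_props(where=(degree = 2) rename=(p=p_grad))
boot_props(where=(degree = 5) rename=(p=p_lths));
by replicate;
diff = p_grad - p_lths;
keep replicate p_grad p_lths diff;
run;

* Bootstrap SE and 95 percent percentile interval;
proc univariate data=boot_diffs noprint;
var diff;
output out=boot_ci std=boot_se pctlpts=2.5 97.5 pctlpre=ci_;
run;

proc print data=boot_ci;
run;
```

The standard deviation of the resampled differences plays the role of the bootstrap standard error, and the 2.5th and 97.5th percentiles (ci_2_5 and ci_97_5) form the percentile interval. Because resampling is random, the numbers will vary slightly with the seed.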
* Load bootstrap macros;
%include "~/my_shared_file_links/hammi002/sasprog/load-randomization.sas";
* Bootstrap the difference in proportions,
* 5000 bootstrap samples;
%boot_2props(
ds = gss2010,
groupvar = degree,
g1value = 2,
g2value = 5,
compvar = grass,
cvalue = 1,
alpha = .05,
reps = 5000
);
The relevant macro this time is %boot_2props. We need to specify the dataset, the group variable (and which values identify each of the 2 groups of interest), the comparison variable (and which value identifies the level of interest), the alpha level, and the number of bootstrap samples to generate. As before, since 1 = Legal, we are estimating proportions where people think marijuana should be legal.
Here are my relevant results:
Again, it’s always good to check that yours look similar. Since randomization is involved, our results won’t be exactly the same, but they should be close.
Bootstrap percentile interval
Let’s start with the easier confidence interval. The 95% bootstrap percentile confidence interval for our difference is (0.100, 0.310). One interpretation of this interval is that, with 95% confidence, the true difference in the proportion of Americans who believe that marijuana should be legalized, comparing those with a graduate degree to those with less than a high school education, is between 10.0 and 31.0 percentage points.
Bootstrap SE-based interval
Using the bootstrap standard error and the empirical rule, the confidence interval for this difference can also be calculated as:
\[CI = d \pm 2 \cdot SE = 0.203 \pm 2(0.054) = (0.095, 0.311)\]
which is very close to the bootstrap percentile interval and would have a similar interpretation.
When dealing with two proportions, we almost always want to test the equality of those proportions. This test can be represented with the following hypotheses, based on two population parameters: the proportion of Americans with a graduate degree who support legalization of marijuana, \(\pi_G\), and the proportion of Americans with less than a high school education who support legalization of marijuana, \(\pi_L\).
\(H_0: \pi_G = \pi_L\), or \(\pi_G - \pi_L = 0\)
\(H_A: \pi_G \neq \pi_L\), or \(\pi_G - \pi_L \neq 0\)
Because a significance level is not given, we will specify \(\alpha = 0.05\).
To calculate the p-value, we will need to generate a distribution of differences as if the null hypothesis (of no difference) were true. When we have two groups, we can do this simply by shuffling the response values among the observations. This is also called permuting the data. As with other permutation tests, this will give us a null distribution, from which we can calculate the percentage of samples with differences as or more extreme (in either direction from the null value) than our observed difference.
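To make the shuffling idea concrete, here is a rough sketch using standard SAS procedures. It is not the code inside the course macro. Instead of shuffling responses, it randomly relabels 136 of the 331 respondents as the first group in each replicate, which is an equivalent way to permute. It assumes the filtered gss2010 working copy from above; the group size (136) and observed difference (0.203) are taken from this lesson, and the seed and dataset names are arbitrary choices for illustration.

```sas
* Sketch only, not the course macro. Flag the response of interest (1 = Legal);
data gss_flag;
set gss2010;
legal = (grass = 1);
run;

* Each replicate randomly labels 136 of the 331 respondents as group 1
* (Selected = 1), ignoring their actual degree;
proc surveyselect data=gss_flag out=perm_samples
method=srs sampsize=136 outall reps=5000
seed=2010 noprint;
run;

* Proportion favoring legalization per replicate and random label;
proc means data=perm_samples noprint nway;
class replicate selected;
var legal;
output out=perm_props mean=p;
run;

* Difference per replicate, flagged if at least as extreme as observed;
data perm_diffs;
merge perm_props(where=(selected = 1) rename=(p=p_g1))
perm_props(where=(selected = 0) rename=(p=p_g2));
by replicate;
diff = p_g1 - p_g2;
extreme = (abs(diff) >= 0.203);
keep replicate diff extreme;
run;

* The mean of the flag is the two-sided permutation p-value;
proc means data=perm_diffs noprint;
var extreme;
output out=perm_p mean=p_value;
run;

proc print data=perm_p;
run;
```

Because the labels are assigned at random, any difference between the two relabeled groups reflects chance alone, which is exactly what the null hypothesis describes.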
* Simulate the null distribution of a difference in
* proportions, 5000 permutation samples;
%permute_2props(
ds = gss2010,
groupvar = degree,
g1value = 2,
g2value = 5,
compvar = grass,
cvalue = 1,
reps = 5000,
nullval = 0
);
Well, this histogram for the null distribution makes the point pretty clear. We have a very extreme sample that is very unlikely to have arisen from a population where these proportions are truly equal.
The p-value table below the histogram confirms this.
There were only 2 of 5000 permuted datasets that resulted in a difference as extreme or more extreme than our difference of 0.20. The 2-sided p-value was <0.001.
Based on this LOW p-value, what can we conclude? (Reject \(H_0\)!) Because p (<.001) is less than our specified \(\alpha\), we reject the null hypothesis and conclude that there is a difference between these two proportions. The proportion of Americans with a graduate degree who support legalization of marijuana (54%) is significantly higher than the proportion of Americans with less than a high school education who support legalization of marijuana (33%).
Now let’s use the mathematical approximation methods for these same questions.
We saw in the last lesson that the central limit theorem applies to proportions. Well, it also applies to a difference in proportions. This means we can again use the normal distribution as an approximation of the sampling distribution of the difference we are trying to make inferences about, subject to two model assumptions.
Assumption #1: Independence
Assumption #2: Sufficient sample size
For our data, these assumptions are met. The sampling scheme of the GSS assures independence of observations. And we have well over 10 responses in favor and against the legalization of marijuana in both groups.
When these assumptions hold, the estimate of the standard error for a difference in 2 proportions is:
\[SE(p_1 - p_2) = \sqrt{\frac{ p_1 \times (1 - p_1)}{n_1} + \frac{ p_2 \times (1 - p_2)}{n_2}}\]
where \(p_i\) is the observed proportion in group \(i\) and \(n_i\) is the number of observations in group \(i\).
We can use this estimate in the formulas for both the confidence interval and the z test statistic calculation.
For our data, using information from previous output generated above, the estimate of the standard error of the difference is:
\[SE = \sqrt{\frac{ .537 \times .463}{136} + \frac{ .333 \times .667}{195}} = 0.054\]
How does this compare to our original computational approach using the bootstrap? When rounded to 3 decimal places, it is exactly the same: 0.054!
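The standard error arithmetic above can be checked with a short DATA step. This is just a convenience sketch using the rounded proportions and group sizes reported in this lesson; the dataset name se_check is made up for illustration.

```sas
* Check the standard error of the difference by hand;
data se_check;
p1 = 0.537; n1 = 136; /* graduate degree */
p2 = 0.333; n2 = 195; /* less than high school */
se = sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2);
put se= 6.4;
run;
```

The value written to the log should round to 0.054.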
Let’s use the SE calculated above to estimate a confidence interval:
\[CI = \text{Observed Statistic} \pm z^* \cdot SE\]
But first we need the \(z^*\) critical value that is associated with 95% confidence (and \(\alpha = 0.05\)). If you look up that value, you should see that it’s (still) 1.96. So our confidence interval for this difference, based on the mathematical approximation, is:
\[CI = 0.203 \pm 1.96(0.054) = (0.097, 0.309)\]
With 95% confidence, the true difference in the proportion of Americans who believe that marijuana should be legalized, comparing those with a graduate degree to those with less than a high school education, is between 9.7 and 30.9 percentage points.
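If you would rather not look up the critical value, the interval can be reproduced in a DATA step. This sketch uses the rounded difference and standard error from this lesson; the dataset name ci_check is made up for illustration.

```sas
* Reproduce the 95 percent confidence interval by hand;
data ci_check;
diff = 0.203;
se = 0.054;
z_star = quantile('NORMAL', 0.975); /* critical value for 95 percent confidence */
lower = diff - z_star*se;
upper = diff + z_star*se;
put z_star= 5.3 lower= 5.3 upper= 5.3;
run;
```

The QUANTILE function returns the point of the standard normal distribution with 97.5% of the area below it, which leaves 2.5% in each tail.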
Following on the test described above, where the hypotheses were
\(H_0: \pi_G = \pi_L\), or \(\pi_G - \pi_L = 0\)
\(H_A: \pi_G \neq \pi_L\), or \(\pi_G - \pi_L \neq 0\)
and the significance level is \(\alpha = 0.05\), we can calculate a z statistic as:
\[z = \frac{\text{Observed Statistic} - \text{Null Value}}{SE}\]
For our data, this is:
\[z = \frac{0.203 - 0.0}{0.054} = 3.76\]
That’s a pretty big z score. We should expect a low p-value. Using our z-score probability calculator, we see that p < 0.001. (By the way, this level of precision is fine for a small p-value. We don’t gain much by reporting p = 0.0002, compared to p < 0.001.)
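As an alternative to an online calculator, the z statistic and its 2-sided p-value can be computed in a DATA step. This is a small sketch using the rounded values from this lesson; the dataset name z_check is made up for illustration.

```sas
* Compute the z statistic and 2-sided p-value by hand;
data z_check;
diff = 0.203;
null_value = 0;
se = 0.054;
z = (diff - null_value)/se;
/* area in both tails beyond the observed z */
p_two_sided = 2*(1 - cdf('NORMAL', abs(z)));
put z= 5.2 p_two_sided= 8.5;
run;
```

Doubling the upper-tail area gives the 2-sided p-value because the normal distribution is symmetric.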
We will draw the same conclusion as with the randomization methods above. Because p (<.001) is less than 0.05, our specified \(\alpha\), we reject the null hypothesis and conclude that there is a significant difference between the proportion of Americans with a graduate degree who support legalization of marijuana (54%) and the proportion of Americans with less than a high school education who support legalization of marijuana (33%).
We can again leverage SAS PROC FREQ to do this for us. The code required to request a test for a difference in 2 proportions, along with a confidence interval for that difference, is:
* Calculate difference in proportions of 2 groups
* favoring legality of marijuana;
proc freq data=gss2010;
tables degree * grass / nocol nopct riskdiff(equal var=null cl=wald);
run;
Things to know:
The table request takes the form group * response (tables group * response).
The riskdiff option is doing the work here. This option itself has many options. Stick with the ones here (equal var=null cl=wald) for now.
The column = 2 option can be specified in addition to the other riskdiff options above; it requests the risks for the second column of the table.
Run the code above to see the output relevant to the difference in proportions. It should look like this:
Here’s what to notice:
That’s all there is to it. You have successfully completed this tutorial.