Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Data

The General Social Survey (GSS) is a sociological survey that has been conducted regularly since 1972 by the National Opinion Research Center (NORC) at the University of Chicago. The GSS targets adults (ages 18+) living in households in the United States using an area probability design: the US is divided into smaller sub-areas, within which a sample of households is selected at random for inclusion in the survey. Within each household, one adult is chosen at random to respond to the GSS in a face-to-face interview conducted by NORC. The GSS’s use of random sampling provides strong support for generalizing survey results to the adult US population living in households.

However, although the GSS uses random sampling, some nonresponse bias may still be present in the data. The people who choose to take part in an in-person survey on sociological and political issues, some of which can be viewed as highly controversial or personal, may represent a specific subset of the population. The willingness to participate in such a survey, or comfort with doing so, may be correlated with other variables; for example, those with more controversial views (such as strong opinions on abortion or sex outside of marriage) may be less willing to share those opinions with a stranger.

Because the GSS is an observational study rather than an experiment in which participants are randomly assigned to treatment and control groups, there is no built-in control for confounding variables. We can still observe associations between variables in the data and generalize those results to the non-institutionalized adult US population, but the lack of random assignment prevents us from inferring causality.

The full dataset used in this project comes from a cumulative file that contains responses from the GSS conducted from 1972 to 2012, and includes 57,061 observations of 114 variables.


Part 2: Research question

Question: Is there an association between gender (‘sex’) and level of education (‘educ’) for people in the US? It is of interest to know whether one gender is, on average, more educated than the other.


Part 3: Exploratory data analysis


xtabs(~ sex, data = gss)
## sex
##   Male Female 
##  25146  31915

Above is the distribution of observations by gender. 25,146 respondents (44%) are male and 31,915 (56%) are female.
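The percentages quoted above can be checked directly from the printed counts. The following is a small sketch that uses only those two counts; the names sex_counts and sex_pct are introduced here for illustration and are not part of the original analysis.

```r
# Counts copied from the xtabs output above
sex_counts <- c(Male = 25146, Female = 31915)

# Share of each gender as a percentage of all respondents with a recorded sex
sex_pct <- round(100 * sex_counts / sum(sex_counts), 1)
print(sex_pct)
```

With the gss data frame loaded, prop.table(xtabs(~ sex, data = gss)) should give the same shares as proportions.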

table2 <- table(gss$educ, gss$sex)
prop.table(table2)
##     
##              Male       Female
##   0  0.0013533227 0.0013005958
##   1  0.0004569661 0.0002636343
##   2  0.0013884739 0.0011072640
##   3  0.0022145280 0.0019684693
##   4  0.0028296747 0.0026011916
##   5  0.0032339139 0.0035502751
##   6  0.0065732815 0.0066435840
##   7  0.0069072183 0.0079441798
##   8  0.0195792397 0.0260822187
##   9  0.0140956465 0.0196495422
##   10 0.0193507566 0.0269609997
##   11 0.0257658576 0.0339209449
##   12 0.1241717489 0.1832785560
##   13 0.0332882226 0.0500553632
##   14 0.0492468847 0.0591946851
##   15 0.0202822644 0.0238852664
##   16 0.0586322653 0.0641861610
##   17 0.0146229151 0.0149744275
##   18 0.0162750233 0.0184719757
##   19 0.0078387261 0.0055187444
##   20 0.0126544458 0.0076805455
plot(table2)

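Above are the joint proportions of gender and education level, where each cell is a share of all respondents with a recorded education level, along with a mosaic plot of the same table. The most frequently occurring level of education is 12 years for both genders. Note that the table shows joint rather than within-gender proportions, so the larger female cell at this level (0.183 versus 0.124) partly reflects the larger number of women in the sample. Dividing each cell by its gender's share of the sample (about 0.44 for males and 0.56 for females) gives roughly 28% of males and 33% of females at this level, so females do have a somewhat higher proportion of their group here.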
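The conversion from joint to within-gender proportions can be sketched as follows. This is an illustration only: it uses the two educ = 12 cells from the table above and the group sizes reported in the inference output in Part 4, and the names joint_12, n_group, and within_12 are hypothetical.

```r
# Joint proportions for educ = 12, copied from the prop.table output above
joint_12 <- c(Male = 0.1241717489, Female = 0.1832785560)

# Respondents with a recorded educ value, per gender (from the Part 4 output)
n_group <- c(Male = 25078, Female = 31819)

# Each gender's share of the sample, i.e. the column totals of the joint table
col_share <- n_group / sum(n_group)

# Within-gender proportion at educ = 12: joint cell divided by column total
within_12 <- joint_12 / col_share
print(round(within_12, 3))
```

On the full table, prop.table(table2, margin = 2) performs the same division for every education level at once.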

We will now look at a boxplot of education by gender:

ggplot(gss,aes(x=sex, y=educ)) + geom_boxplot()
## Warning: Removed 164 rows containing non-finite values (stat_boxplot).

The boxplot above shows that the interquartile range for males is wider than that for females, and that females have more outliers than males. The minimum education level of males is lower than that of females, while males have a higher maximum than females. However, there appears to be no difference between the median values of the two groups.

Part 4: Inference

We will now investigate whether the pattern seen in the data and illustrated in the EDA above reflects a real pattern in the population, or whether it is just due to chance.

Conditions check:

Independence: Within each group (gender), independence follows from random sampling and the 10% condition, which holds because n_male = 25078 and n_female = 31819 are each less than 10% of the respective male and female US populations. Between groups, the male and female observations are not paired, so the two groups can be assumed to be independent of each other.

Sample size/skew: Both samples are large enough that skew (if any) in the population distributions does not matter.

In this test we are working with one categorical explanatory variable (with two levels) and one numerical response variable. We are comparing two means with large sample sizes, so we will use the theoretical method based on the t distribution for inference.

We will first look at inference based on the hypothesis test:

inference(y=educ, x=sex, data=gss,type=c("ht"),statistic=c('mean'),method=c("theoretical"),alternative=c("twosided"),sig_level=0.05)
## Warning: Missing null value, set to 0
## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_Male = 25078, y_bar_Male = 12.8953, s_Male = 3.3694
## n_Female = 31819, y_bar_Female = 12.6419, s_Female = 3.0208
## H0: mu_Male =  mu_Female
## HA: mu_Male != mu_Female
## t = 9.3201, df = 25077
## p_value = < 0.0001

The output above reports the sample statistics and a p-value of < 0.0001. Since the p-value is less than the significance level (0.05), we reject the null hypothesis in favor of the alternative hypothesis. Thus, the data provide convincing evidence of a difference between the average education levels of males and females in the US.
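The reported test statistic can be approximately reproduced from the summary statistics alone. The sketch below assumes an unpooled two-sample t test with the conservative degrees of freedom min(n_1, n_2) - 1, which is what the reported df = 25077 suggests; it is not taken from the statsr source. Because the inputs are rounded to four decimals, the result matches the reported t = 9.3201 only to about two decimals.

```r
# Summary statistics copied from the inference() output above
n_m <- 25078; ybar_m <- 12.8953; s_m <- 3.3694
n_f <- 31819; ybar_f <- 12.6419; s_f <- 3.0208

# Standard error of the difference in means (unpooled variances)
se <- sqrt(s_m^2 / n_m + s_f^2 / n_f)

# Test statistic for H0: mu_Male = mu_Female (null difference of 0)
t_stat <- (ybar_m - ybar_f) / se

# Assumption: conservative degrees of freedom, smaller sample size minus 1
df <- min(n_m, n_f) - 1

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

print(c(se = se, t = t_stat, df = df))
print(p_value < 0.0001)
```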

We will now look at the confidence interval:

inference(y=educ, x=sex, data=gss,type=c("ci"),statistic=c('mean'),method=c("theoretical"),null=NULL,alternative=c("twosided"),sig_level=0.05,conf_level = 0.95)
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_Male = 25078, y_bar_Male = 12.8953, s_Male = 3.3694
## n_Female = 31819, y_bar_Female = 12.6419, s_Female = 3.0208
## 95% CI (Male - Female): (0.2001 , 0.3067)

We are 95% confident that, on average, males have 0.2001 to 0.3067 more years of education than females.

Notice that 0 does not lie in this interval, which agrees with the rejection of the null hypothesis.
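The interval can also be rebuilt by hand as point estimate plus or minus critical value times standard error. As before, this is a sketch under the assumption of unpooled variances and df = min(n_1, n_2) - 1; the variable names are illustrative, and the inputs are the rounded values printed in the output above.

```r
# Summary statistics copied from the inference() output above
n_m <- 25078; ybar_m <- 12.8953; s_m <- 3.3694
n_f <- 31819; ybar_f <- 12.6419; s_f <- 3.0208

# Point estimate and standard error of the difference (Male - Female)
diff_means <- ybar_m - ybar_f
se <- sqrt(s_m^2 / n_m + s_f^2 / n_f)

# Critical t value for a 95% confidence level (assumed conservative df)
df <- min(n_m, n_f) - 1
t_star <- qt(0.975, df = df)

# Confidence interval: estimate +/- margin of error
ci <- diff_means + c(-1, 1) * t_star * se
print(round(ci, 4))

# The interval excludes 0 when both bounds have the same sign
print(ci[1] > 0)
```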

A confidence interval is included because it can be calculated for a difference of two means, and it complements the hypothesis test by giving a range of plausible values for the size of the difference.