Analyzing survey data in R

Survey weights

Survey weights arise when the data come from a complex sampling design.
Roughly: a weight is the number of units in the population that one sampled unit represents.
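
To make the "units represented" idea concrete, here is a minimal sketch using the apisrs data that ships with the survey package. It assumes, as in the design specified below, that pw holds the weights and fpc the population size; the other variable names are illustrative only.

```r
library(survey)
data(api)  # loads apisrs, apistrat, apiclus2, ...

# In a simple random sample every unit has the same weight: N / n
n_sample       <- nrow(apisrs)      # number of sampled schools
N_population   <- apisrs$fpc[1]     # population size stored in the data
weight_by_hand <- N_population / n_sample

# The weights should add up to (roughly) the population size
total_weight <- sum(apisrs$pw)

weight_by_hand
apisrs$pw[1]
total_weight
```

If the two printed weights agree and the total is close to the population size, the weights can be read as "schools represented per sampled school".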

library: survey


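# Load the packages and example data used in these notes
library(survey)   # svydesign(), svytable(), svymean(), ...
library(dplyr)    # glimpse(), count(), %>%
library(ggplot2)  # plots further down
data(api)         # loads apisrs, apistrat, apiclus2

# Look at the apisrs dataset
glimpse(apisrs)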

# Specify a simple random sampling for apisrs
apisrs_design <- svydesign(data = apisrs, weights = ~pw, fpc = ~fpc, id = ~1)

# Produce a summary of the design
summary(apisrs_design)

Stratified design


# Glimpse the data
glimpse(apistrat)

# Summarize strata sample sizes
apistrat %>%
  count(stype)

# Specify the design
apistrat_design <- svydesign(data = apistrat, weights = ~pw, fpc = ~fpc, id = ~1, strata = ~stype)

# Look at the summary information stored in the design object
summary(apistrat_design)
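
In a stratified sample the weights differ by stratum, because each stratum has its own population size and sample size. The sketch below checks this on apistrat, assuming fpc holds the stratum population size; the column names n, N and N_over_n are made up for illustration.

```r
library(survey)
library(dplyr)
data(api)

# Compare the stored weight with N_h / n_h within each stratum
strata_weights <- apistrat %>%
  group_by(stype) %>%
  summarize(n  = n(),         # schools sampled in the stratum
            N  = first(fpc),  # schools in the stratum population
            pw = first(pw)) %>%
  mutate(N_over_n = N / n)

strata_weights
```

Strata that are sampled at a lower rate get larger weights, which is why the design object needs to know the strata.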

Clustered design

Dataset: apiclus2. The schools are clustered by school district (dnum). Within each sampled district, 5 schools were randomly selected for the sample; schools are identified by snum. fpc1 gives the number of districts in the population, and fpc2 gives the number of schools in each sampled district.


# Glimpse the data
glimpse(apiclus2)

# Specify the design
apiclus_design <- svydesign(id = ~dnum + snum, data = apiclus2, weights = ~pw, fpc = ~fpc1 + fpc2)

# Look at the summary information stored in the design object
summary(apiclus_design)

Contingency tables

svytable()


# Construct and display a frequency table
tab_D <- svytable(~Depressed,
           design = NHANES_design)
tab_D
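
The NHANES_design object used here and in the sections that follow is never defined in these notes. One plausible construction is sketched below. It is an assumption, not the notes' own code: it takes the data to be NHANESraw from the NHANES package, with SDMVSTRA as strata, SDMVPSU as cluster id, and a 4-year weight WTMEC4YR formed by halving the 2-year weight WTMEC2YR.

```r
library(survey)
library(dplyr)
library(NHANES)  # assumption: source of the NHANESraw data

# Assumption: two 2-year cycles are pooled, so halve the 2-year weights
NHANESraw <- mutate(NHANESraw, WTMEC4YR = 0.5 * WTMEC2YR)

# Stratified cluster design; nest = TRUE because PSU ids repeat across strata
NHANES_design <- svydesign(
  data    = NHANESraw,
  strata  = ~SDMVSTRA,
  id      = ~SDMVPSU,
  nest    = TRUE,
  weights = ~WTMEC4YR
)

# Weighted frequency table, as in the snippet above
tab_D <- svytable(~Depressed, design = NHANES_design)
tab_D
```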

Segmented bar graphs


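# Two-way table of Depressed by HealthGen
# (assumed definition; tab_DH is not defined elsewhere in these notes)
tab_DH <- svytable(~Depressed + HealthGen, design = NHANES_design)

# Add conditional proportions to tab_DH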
tab_DH_cond <- tab_DH %>%
    as.data.frame() %>%
    group_by(HealthGen) %>%
    mutate(n_HealthGen = sum(Freq), Prop_Depressed = Freq/sum(Freq)) %>%
    ungroup()


# Create a segmented bar graph of the conditional proportions in tab_DH_cond
ggplot(data = tab_DH_cond,
       mapping = aes(x = HealthGen, y = Prop_Depressed, fill = Depressed)) + 
  geom_col() + 
  coord_flip()


# Estimate the totals for combos of Depressed and HealthGen
tab_totals <- svytotal(x = ~interaction(Depressed, HealthGen),
                     design = NHANES_design,
                     na.rm = TRUE)

# Print table of totals
print(tab_totals)


# Estimate the means for combos of Depressed and HealthGen
tab_means <- svymean(x = ~interaction(Depressed, HealthGen),
              design = NHANES_design,
              na.rm = TRUE)

# Print table of means
print(tab_means)

Chi-squared test


# Run a chi square test between Depressed and HealthGen
svychisq(~Depressed + HealthGen, 
    design = NHANES_design, 
    statistic = "Chisq")

Survey-weighted t-test

Accounts for the survey design.
Formula: quantitative variable ~ categorical variable (two groups)
t = 0 is the value most consistent with the null hypothesis.
p-value: the probability, assuming the null hypothesis is true, of getting a result at least this extreme.
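
A minimal sketch of such a test with svyttest() from the survey package. The choice of variables is an assumption made for illustration: it compares api00 between the two levels of sch.wide in the stratified api sample from earlier, rather than using the NHANES data.

```r
library(survey)
data(api)

# Stratified design, specified as in the earlier section
apistrat_design <- svydesign(data = apistrat, weights = ~pw, fpc = ~fpc,
                             id = ~1, strata = ~stype)

# quantitative variable ~ two-level categorical variable
tt <- svyttest(api00 ~ sch.wide, design = apistrat_design)

tt$statistic  # t statistic; values near 0 are consistent with the null
tt$p.value    # design-based p-value
```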

Visualization

Add jitter to a scatterplot to see overlapping points more clearly.
Bubble plot: takes the survey weights into account by sizing each point according to its weight (the number of population units that the point represents).


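# Construct bubble plot
# (NHANES20 is not defined in these notes; assumed to be a subset of the
#  NHANES data that includes the WTMEC4YR weights)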
ggplot(data = NHANES20, 
       mapping = aes(x= Height, y = Weight, size = WTMEC4YR)) + 
    geom_point(alpha = 0.3) + 
    guides(size = "none")

In a linear model, if you want the slope of the first explanatory variable to vary across levels of the second (an interaction), use * instead of + in the formula.
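
A short sketch of the difference, assuming the model is fitted with svyglm() from the survey package. The variables (api00, meals, stype from the stratified api sample) are chosen for illustration only.

```r
library(survey)
data(api)

apistrat_design <- svydesign(data = apistrat, weights = ~pw, fpc = ~fpc,
                             id = ~1, strata = ~stype)

# "+": one common slope for meals, separate intercepts by school type
mod_parallel <- svyglm(api00 ~ meals + stype, design = apistrat_design)

# "*": the slope for meals is allowed to differ by school type
mod_interact <- svyglm(api00 ~ meals * stype, design = apistrat_design)

names(coef(mod_parallel))
names(coef(mod_interact))  # includes the meals:stype interaction terms
```

The extra meals:stype coefficients in the second model are the differences in slope relative to the reference school type.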