Analyzing survey data in R

  • survey weights
    • result of using complex sampling design
    • roughly: number of units in the population that a sample unit represents
  • library: survey
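# Load the survey package, dplyr (for glimpse() and %>%), and the api datasets
library(survey)
library(dplyr)
data(api)

# Look at the apisrs dataset
glimpse(apisrs)

# Specify a simple random sampling design for apisrs
apisrs_design <- svydesign(data = apisrs, weights = ~pw, fpc = ~fpc, id = ~1)

# Produce a summary of the design
summary(apisrs_design)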
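As a rough illustration of what a survey weight means, the base R sketch below works through a simple random sample, where every sampled unit gets the same weight N / n. The population size N, sample size n, and outcome y are made-up values chosen for illustration, not taken from the api data.

```r
# Hypothetical population and sample sizes for a simple random sample
N <- 1000   # units in the population
n <- 50     # units in the sample

# Under simple random sampling every sampled unit gets the same weight N / n
w <- rep(N / n, n)

# Each sampled unit stands in for 20 population units,
# and the weights add back up to the population size
w[1]     # 20
sum(w)   # 1000

# Made-up outcome values for the sampled units
set.seed(1)
y <- rnorm(n, mean = 650, sd = 100)

# A weighted total scales each observation by the number of units it represents
est_total <- sum(w * y)

# The weighted mean divides by the estimated population size; with equal
# weights it matches the plain sample mean
est_mean <- est_total / sum(w)
```

With unequal weights (as in stratified or clustered designs) the weighted mean would generally differ from the plain sample mean, which is why the design has to be declared before estimating anything.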
 
  • stratified design
# Glimpse the data
glimpse(apistrat)

# Summarize strata sample sizes
apistrat %>%
  count(stype)

# Specify the design
apistrat_design <- svydesign(data = apistrat, weights = ~pw, fpc = ~fpc, id = ~1, strata = ~stype)

# Look at the summary information stored in the design object
summary(apistrat_design)
 
  • Clustered design
  • dataset: apiclus2. Schools are clustered by school district (dnum). Within each sampled district, 5 schools were randomly selected; the schools are identified by snum. fpc1 gives the number of districts in the population and fpc2 gives the number of schools in each sampled district.
# Glimpse the data
glimpse(apiclus2)

# Specify the design
apiclus_design <- svydesign(id = ~dnum + snum, data = apiclus2, weights = ~pw, fpc = ~fpc1 + fpc2)

# Look at the summary information stored in the design object
summary(apiclus_design)
 
  • Contingency tables
    • svytable()
      # Construct and display a frequency table
      tab_D <- svytable(~Depressed, design = NHANES_design)
      tab_D
 
  • Segmented Bar Graphs
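      # Load ggplot2 for the plot below
      library(ggplot2)

      # tab_DH is not defined in these notes; it is assumed here to be the
      # two-way weighted table of Depressed by HealthGen
      tab_DH <- svytable(~Depressed + HealthGen, design = NHANES_design)

      # Add conditional proportions to tab_DH
      tab_DH_cond <- tab_DH %>%
        as.data.frame() %>%
        group_by(HealthGen) %>%
        mutate(n_HealthGen = sum(Freq), Prop_Depressed = Freq / sum(Freq)) %>%
        ungroup()

      # Create a segmented bar graph of the conditional proportions in tab_DH_cond
      ggplot(data = tab_DH_cond, mapping = aes(x = HealthGen, y = Prop_Depressed, fill = Depressed)) +
        geom_col() +
        coord_flip()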
# Estimate the totals for combos of Depressed and HealthGen
tab_totals <- svytotal(x = ~interaction(Depressed, HealthGen), design = NHANES_design, na.rm = TRUE)

# Print table of totals
print(tab_totals)

# Estimate the means for combos of Depressed and HealthGen
tab_means <- svymean(x = ~interaction(Depressed, HealthGen), design = NHANES_design, na.rm = TRUE)

# Print table of means
print(tab_means)
 
  • chi squared test
      # Run a chi square test between Depressed and HealthGen
      svychisq(~Depressed + HealthGen, design = NHANES_design, statistic = "Chisq")
       
  • survey-weighted t-test
    • accounts for the survey design
    • quantitative variable ~ categorical variable (two groups)
    • t = 0 is the value most consistent with the null hypothesis
    • p-value: the probability, assuming the null hypothesis is true, of getting a result at least this extreme
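A minimal sketch of the survey-weighted t-test, assuming the survey package's svyttest() and the apiclus2 data with the clustered design from earlier in these notes. The choice of variables (enroll as the quantitative response, comp.imp as the two-level grouping variable) is only for illustration.

```r
library(survey)
data(api)

# Two-stage cluster design, specified as in the clustered design notes
apiclus_design <- svydesign(id = ~dnum + snum, data = apiclus2,
                            weights = ~pw, fpc = ~fpc1 + fpc2)

# Quantitative variable ~ two-level categorical variable
tt <- svyttest(enroll ~ comp.imp, design = apiclus_design)

# Full test output
tt

# t statistic: values near 0 are the most consistent with the null hypothesis
tt$statistic

# p-value: probability of a result at least this extreme if the null is true
tt$p.value
```

Because the test is run on the design object rather than the raw data frame, the standard error behind the t statistic reflects the weights and the clustering.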
  • visualization
    • Add jitter to a scatterplot to see points more clearly.
    • Bubble plot: takes the survey weights into account by sizing each point according to the number of population units it represents
      # Construct bubble plot
      ggplot(data = NHANES20, mapping = aes(x = Height, y = Weight, size = WTMEC4YR)) +
        geom_point(alpha = 0.3) +
        guides(size = "none")
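The jitter idea from the visualization notes can be sketched with ggplot2's geom_jitter(). The built-in mtcars data is used here only as a stand-in (an assumption for illustration, not the NHANES data), because cyl and gear take few distinct values and the points overlap heavily without jitter.

```r
library(ggplot2)

# mtcars stands in for survey data here; many cars share the same
# (cyl, gear) pair, so a plain scatterplot would stack points on top of each other
p_jitter <- ggplot(data = mtcars, mapping = aes(x = cyl, y = gear)) +
  # Shift each point by a small random amount in both directions
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.5)

p_jitter
```

The width and height values are arbitrary; they should stay small relative to the spacing between distinct values so the points are not mistaken for different data.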
       
  • in a linear model, to let the slope of one explanatory variable vary across the levels of a second explanatory variable, use * instead of + in the formula (* adds the interaction term)
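The difference between + and * in a model formula can be seen in a small base R sketch. It uses lm() on the built-in mtcars data purely for illustration (these variables are not from the notes); svyglm() in the survey package takes the same formula syntax together with a design argument.

```r
# Built-in mtcars data, used only for illustration; am is 0 = automatic, 1 = manual
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# "+": one common slope for wt, with a separate intercept for each am level
fit_parallel <- lm(mpg ~ wt + am, data = cars)

# "*": main effects plus the wt:am interaction, so the wt slope differs by am
fit_varying <- lm(mpg ~ wt * am, data = cars)

names(coef(fit_parallel))  # "(Intercept)" "wt" "ammanual"
names(coef(fit_varying))   # "(Intercept)" "wt" "ammanual" "wt:ammanual"
```

The extra wt:ammanual coefficient is the difference in the wt slope between the two groups; if it is close to zero, the parallel-slopes model with + may be enough.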