- survey weights
- result of using a complex sampling design
- roughly: the number of units in the population that a sampled unit represents
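To make the "units represented" idea concrete, here is a minimal sketch in base R. The stratum labels and counts are made-up numbers for illustration only (not taken from the api data), and it assumes simple random sampling within each stratum, so the weight is just N / n.

```r
# Hypothetical population and sample counts per stratum (illustrative numbers only)
strata <- data.frame(
  stratum = c("E", "M", "H"),
  N = c(4000, 1000, 800),  # units in the population
  n = c(100, 50, 50)       # units sampled
)

# Under simple random sampling within strata, each sampled unit
# stands in for N / n population units
strata$weight <- strata$N / strata$n
strata

# Sanity check: the weights summed over all sampled units give the population size
sum(strata$weight * strata$n) == sum(strata$N)
```

A unit sampled from a heavily sampled stratum gets a small weight, and one from a lightly sampled stratum gets a large weight.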
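# Load the survey package, dplyr (for glimpse and the pipe), and the api datasets
library(survey)
library(dplyr)
data(api)
# Look at the apisrs dataset
glimpse(apisrs)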
# Specify a simple random sample design for apisrs
apisrs_design <- svydesign(data = apisrs, weights = ~pw, fpc = ~fpc, id = ~1)
# Produce a summary of the design
summary(apisrs_design)
# Glimpse the data
glimpse(apistrat)
# Summarize strata sample sizes
apistrat %>%
  count(stype)
# Specify the design
apistrat_design <- svydesign(data = apistrat, weights = ~pw, fpc = ~fpc, id = ~1, strata = ~stype)
# Look at the summary information stored in the design object
summary(apistrat_design)
- dataset apiclus2: the schools were clustered by school district (dnum). Within each sampled school district, 5 schools were randomly selected for the sample; the schools are denoted by snum. The number of districts is given by fpc1, and the number of schools in each sampled district is given by fpc2.
# Glimpse the data
glimpse(apiclus2)
# Specify the design
apiclus_design <- svydesign(id = ~dnum + snum, data = apiclus2, weights = ~pw, fpc = ~fpc1 + fpc2)
# Look at the summary information stored in the design object
summary(apiclus_design)
# Construct and display a frequency table
tab_D <- svytable(~Depressed,
                  design = NHANES_design)
tab_D
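# Build the two-way table of Depressed by HealthGen
# (assumed definition of tab_DH; it is not shown in these notes)
tab_DH <- svytable(~Depressed + HealthGen,
                   design = NHANES_design)
# Add conditional proportions to tab_DH
tab_DH_cond <- tab_DH %>%
  as.data.frame() %>%
  group_by(HealthGen) %>%
  mutate(n_HealthGen = sum(Freq), Prop_Depressed = Freq / sum(Freq)) %>%
  ungroup()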
# Create a segmented bar graph of the conditional proportions in tab_DH_cond
ggplot(data = tab_DH_cond,
       mapping = aes(x = HealthGen, y = Prop_Depressed, fill = Depressed)) +
  geom_col() +
  coord_flip()
# Estimate the totals for combos of Depressed and HealthGen
tab_totals <- svytotal(x = ~interaction(Depressed, HealthGen),
                       design = NHANES_design,
                       na.rm = TRUE)
# Print table of totals
print(tab_totals)
# Estimate the means for combos of Depressed and HealthGen
tab_means <- svymean(x = ~interaction(Depressed, HealthGen),
                     design = NHANES_design,
                     na.rm = TRUE)
# Print table of means
print(tab_means)
# Run a chi square test between Depressed and HealthGen
svychisq(~Depressed + HealthGen,
         design = NHANES_design,
         statistic = "Chisq")
- survey-weighted t-test
- accounts for the survey design
- quantitative variable ~ categorical variable
- t = 0 is the value most consistent with the null hypothesis
- p-value: the probability, assuming the null hypothesis is true, of getting a result at least this extreme
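A minimal sketch of a survey-weighted t-test using svyttest() from the survey package. It reuses the stratified api design specified earlier in these notes; the choice of api00 as the quantitative response and sch.wide as the two-level categorical variable is only an illustrative assumption, not an analysis from the notes.

```r
library(survey)
data(api)  # loads apistrat, among other datasets

# Stratified design, as specified earlier in these notes
apistrat_design <- svydesign(data = apistrat, weights = ~pw, fpc = ~fpc,
                             id = ~1, strata = ~stype)

# Quantitative response ~ two-level categorical explanatory variable
# (api00 and sch.wide are chosen here purely for illustration)
tt <- svyttest(api00 ~ sch.wide, design = apistrat_design)

tt$statistic  # t statistic; values near 0 are consistent with the null
tt$p.value    # probability of a result at least this extreme under the null
```

Printing tt on its own shows the full test summary, including the estimated difference in means.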
- visualization
- Add jitter to a scatterplot to see points more clearly.
- Bubble plot: takes the survey weights into account by sizing each point according to the number of population units it represents
# Construct bubble plot
ggplot(data = NHANES20,
       mapping = aes(x = Height, y = Weight, size = WTMEC4YR)) +
  geom_point(alpha = 0.3) +
  guides(size = "none")
- in a linear model, if you want the slope of the first explanatory variable to vary across the levels of the second explanatory variable (an interaction), use * instead of + in the formula
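A small sketch of the + versus * difference, using base R lm() on the built-in mtcars data so it runs without any survey objects. The dataset and variable choices are illustrative assumptions; svyglm() in the survey package takes the same formula syntax together with a design argument.

```r
# Built-in mtcars data; am (0 = automatic, 1 = manual) plays the role of
# the second, categorical explanatory variable
cars <- transform(mtcars, am = factor(am))

# "+": one common slope for wt, separate intercepts by am
fit_parallel <- lm(mpg ~ wt + am, data = cars)

# "*": expands to wt + am + wt:am, so the wt slope can differ by am
fit_interact <- lm(mpg ~ wt * am, data = cars)

names(coef(fit_parallel))
names(coef(fit_interact))  # includes the interaction term "wt:am1"
```

The extra wt:am1 coefficient is the difference in the wt slope between the two am groups.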