merging 2+ datasets and comparing them before analysis
Some students ask how to merge and compare more than two datasets in R. Today we are going to use the NHANES 2017–18 data for a quick demonstration of that.
Let us download four datasets from the demographic and questionnaire sections. The demographic section has only one file called DEMO , so let’s download this one and from the questionnaire let’s download another three: diabetes, disability, drug use.
Now let’s import them one by one: remember to use the correct file path, “…./DEMO_J.XPT”
demo=”/Users/user/Downloads/DEMO_J.XPT”demo<-read_xpt(
demo,
col_select = NULL,
skip = 0,
n_max = Inf,
.name_repair = “unique”
)
The second dataset:
diabetes=”/Users/user/Downloads/DIQ_J.XPT”diabetes<-read_xpt(
diabetes,
col_select = NULL,
skip = 0,
n_max = Inf,
.name_repair = “unique”
)
The third dataset:
disability=”/Users/user/Downloads/DLQ_J.XPT”disability<-read_xpt(
disability,
col_select = NULL,
skip = 0,
n_max = Inf,
.name_repair = “unique”
)
and the fourth one:
drug=”/Users/user/Downloads/DUQ_J.XPT”drug<-read_xpt(
drug,
col_select = NULL,
skip = 0,
n_max = Inf,
.name_repair = “unique”
)
The next step is to merge them. I am using the tidyverse package to try all four types of join to be able to pick the right one later:
library(tidyverse)list(demo, diabetes, disability, drug) %>% reduce(left_join, by = “SEQN”)->left.joinlist(demo, diabetes, disability, drug) %>% reduce(right_join, by = “SEQN”)->right.joinlist(demo, diabetes, disability, drug) %>% reduce(inner_join, by = “SEQN”)->inner.joinlist(demo, diabetes, disability, drug) %>% reduce(full_join, by = “SEQN”)->full.join
Check the volume of missing values for each of the datasets using the purrr package:
t(map(left.join, ~mean(is.na(.))))
And just there is a childish way to compare the number of missing cases in each of the variables:
dim(left.join)->l.join
dim(right.join)->r.join
dim(inner.join)->i.join
dim(full.join)->f.join
cbind(l.join, r.join, i.join, f.join)->df
With little more effort we can tabulate…
as.data.frame(t(df), stringsAsFactors=T)->kompare
and visualise the missing cases by the datasets:
type<- c(“left.join”,”right.join”, “inner.join”,”full.join”)cbind(kompare, type)->huuaaaggplot(huuaaa, aes(x=V1, y=type, fill=type)) +
geom_bar(stat="identity") +
xlab("Percent of missing cases") + ylab("Type of join")
Too many missing cases to handle, but that is the topic for another day!