merging 2+ datasets and comparing them before analysis

infoart.ca
2 min readJul 20, 2021

Some students ask how to merge and compare more than two datasets in R. Today we are going to use the NHANES 2017–18 data for a quick demonstration of that.

Let us download four datasets from the demographic and questionnaire sections. The demographic section has only one file called DEMO , so let’s download this one and from the questionnaire let’s download another three: diabetes, disability, drug use.

Now let’s import them one by one: remember to use the correct file path, “…./DEMO_J.XPT”

demo=”/Users/user/Downloads/DEMO_J.XPT”demo<-read_xpt(
demo,
col_select = NULL,
skip = 0,
n_max = Inf,
.name_repair = “unique”
)

The second dataset:

diabetes=”/Users/user/Downloads/DIQ_J.XPT”diabetes<-read_xpt(
diabetes,
col_select = NULL,
skip = 0,
n_max = Inf,
.name_repair = “unique”
)

The third dataset:

disability=”/Users/user/Downloads/DLQ_J.XPT”disability<-read_xpt(
disability,
col_select = NULL,
skip = 0,
n_max = Inf,
.name_repair = “unique”
)

and the fourth one:

drug=”/Users/user/Downloads/DUQ_J.XPT”drug<-read_xpt(
drug,
col_select = NULL,
skip = 0,
n_max = Inf,
.name_repair = “unique”
)

The next step is to merge them. I am using the tidyverse package to try all four types of join to be able to pick the right one later:

library(tidyverse)list(demo, diabetes, disability, drug) %>% reduce(left_join, by = “SEQN”)->left.joinlist(demo, diabetes, disability, drug) %>% reduce(right_join, by = “SEQN”)->right.joinlist(demo, diabetes, disability, drug) %>% reduce(inner_join, by = “SEQN”)->inner.joinlist(demo, diabetes, disability, drug) %>% reduce(full_join, by = “SEQN”)->full.join

Check the volume of missing values for each of the datasets using the purrr package:

t(map(left.join, ~mean(is.na(.))))

And just there is a childish way to compare the number of missing cases in each of the variables:

dim(left.join)->l.join
dim(right.join)->r.join
dim(inner.join)->i.join
dim(full.join)->f.join
cbind(l.join, r.join, i.join, f.join)->df

With little more effort we can tabulate…

as.data.frame(t(df), stringsAsFactors=T)->kompare

and visualise the missing cases by the datasets:

type<- c(“left.join”,”right.join”, “inner.join”,”full.join”)cbind(kompare, type)->huuaaaggplot(huuaaa, aes(x=V1, y=type, fill=type)) +
geom_bar(stat="identity") +
xlab("Percent of missing cases") + ylab("Type of join")

Too many missing cases to handle, but that is the topic for another day!

--

--

infoart.ca

Center for Social Capital & Environmental Research | Posts by Bishwajit Ghose, lecturer at the University of Ottawa