Handling missing values in R

infoart.ca
3 min readJun 4, 2022

Create a sample df with missing values:

df = data.frame(col1=c(1:5),
col2=c(11:13, NA, NA),
col3=c(21:22, NA, NA, NA),
col4=rep(NA, 5))
> df
col1 col2 col3 col4
1 1 11 21 NA
2 2 12 22 NA
3 3 13 NA NA
4 4 NA NA NA
5 5 NA NA NA

Check the count of missing values for all columns:

colSums(is.na(df))> col1 col2 col3 col4 
0 2 3 5

Or for the entire df

table(is.na(df))FALSE TRUE 
10 10

i.e. there are 10 missing and 10 non-missing values in the df.

Keep columns with non-missing values only

df = as.data.frame(t(na.omit(t(df))))#check the df> df  col1
1 1
2 2
3 3
4 4
5 5

Keep rows with non-missing values only

df = df[complete.cases(df), ] 
[1] col1 col2 col3 col4
# no observations as all rows contain at least one missing value.

Replace all missings with zeros

df[is.na(df)] <- 0#check the df> df
col1 col2 col3 col4
1 1 11 21 0
2 2 12 22 0
3 3 13 0 0
4 4 0 0 0
5 5 0 0 0

Replace missing values with the smallest value in the column

df %<>% mutate_at(vars(col1,col2, col3), ~replace_na(.,min(., na.rm = TRUE)))#check the df> df
col1 col2 col3 col4
1 1 11 21 NA
2 2 12 22 NA
3 3 13 21 NA
4 4 11 21 NA
5 5 11 21 NA

Replace missing values with the highest value in the column

df %<>% mutate_at(vars(col1,col2, col3), ~replace_na(.,max(., na.rm = TRUE)))#check the df> df
col1 col2 col3 col4
1 1 11 21 NA
2 2 12 22 NA
3 3 13 22 NA
4 4 13 22 NA
5 5 13 22 NA

Replace missing values with column median

df %<>% mutate_at(vars(col1, col2), ~replace_na(.,median(., na.rm = TRUE)))#check the df> df  col1 col2 col3 col4
1 1 11 21 NA
2 2 12 22 NA
3 3 13 NA NA
4 4 12 NA NA
5 5 12 NA NA

Replace missing values with column mean

df %<>% mutate_at(vars(col1, col2), ~replace_na(.,mean(., na.rm = TRUE)))#check the df> df
col1 col2 col3 col4
1 1 11 21 NA
2 2 12 22 NA
3 3 13 NA NA
4 4 12 NA NA
5 5 12 NA NA

N.B. Replacing with mean/median will not work for columns 3 and 4 due to precision error (possibly because there are more missing values than non-missing ones)

Here is a workaround using the tidyr package that basically replaces a missing value with the last non-missing value above it

df %<>% fill(col2) %>% fill(col3)#check the df   col1 col2 col3 col4
1 1 11 21 NA
2 2 12 22 NA
3 3 13 22 NA
4 4 13 22 NA
5 5 13 22 NA

Replace missing values with the last non-missing value from the preceding rows (from left)

df %<>% mutate(col5 = coalesce (col4, col3, col2, col1))#check the df> df
col1 col2 col3 col4 col5
1 1 11 21 NA 21
2 2 12 22 NA 22
3 3 13 NA NA 13
4 4 NA NA NA 4
5 5 NA NA NA 5

Happy imputing!!!

--

--

infoart.ca

Center for Social Capital & Environmental Research | Posts by Bishwajit Ghose, lecturer at the University of Ottawa