Handling missing values in R

3 min readJun 4, 2022

Create a sample df with missing values:

df = data.frame(col1=c(1:5),
              col2=c(11:13, NA, NA),
              col3=c(21:22, NA, NA, NA),
              col4=rep(NA, 5))> df
  col1 col2 col3 col4
1    1   11   21   NA
2    2   12   22   NA
3    3   13   NA   NA
4    4   NA   NA   NA
5    5   NA   NA   NA

Check the count of missing values for all columns:

colSums(is.na(df))> col1 col2 col3 col4 
   0    2    3    5

Or for the entire df

table(is.na(df))FALSE TRUE 
 10    10

i.e. there are 10 missing and 10 non-missing values in the df.

Keep columns with non-missing values only

df = as.data.frame(t(na.omit(t(df))))#check the df> df  col1
1    1
2    2
3    3
4    4
5    5

Keep rows with non-missing values only

df = df[complete.cases(df), ] 
[1] col1 col2 col3 col4# no observations as all rows contain at least one missing value.

Replace all missings with zeros

df[is.na(df)] <- 0#check the df> df
  col1 col2 col3 col4
1    1   11   21    0
2    2   12   22    0
3    3   13    0    0
4    4    0    0    0
5    5    0    0    0

Replace missing values with the smallest value in the column

df %<>% mutate_at(vars(col1,col2, col3), ~replace_na(.,min(., na.rm = TRUE)))#check the df> df
  col1 col2 col3 col4
1    1   11   21   NA
2    2   12   22   NA
3    3   13   21   NA
4    4   11   21   NA
5    5   11   21   NA

Replace missing values with the highest value in the column

df %<>% mutate_at(vars(col1,col2, col3), ~replace_na(.,max(., na.rm = TRUE)))#check the df> df
  col1 col2 col3 col4
1    1   11   21   NA
2    2   12   22   NA
3    3   13   22   NA
4    4   13   22   NA
5    5   13   22   NA

Replace missing values with column median

df %<>% mutate_at(vars(col1, col2), ~replace_na(.,median(., na.rm = TRUE)))#check the df> df  col1 col2 col3 col4
1    1   11   21   NA
2    2   12   22   NA
3    3   13   NA   NA
4    4   12   NA   NA
5    5   12   NA   NA

Replace missing values with column mean

df %<>% mutate_at(vars(col1, col2), ~replace_na(.,mean(., na.rm = TRUE)))#check the df> df
   col1 col2 col3 col4
1    1   11   21   NA
2    2   12   22   NA
3    3   13   NA   NA
4    4   12   NA   NA
5    5   12   NA   NA

N.B. Replacing with mean/median will not work for columns 3 and 4 due to precision error (possibly because there are more missing values than non-missing ones)

Here is a workaround using the tidyr package that basically replaces a missing value with the last non-missing value above it

df %<>% fill(col2) %>% fill(col3)#check the df   col1 col2 col3 col4
1    1   11   21   NA
2    2   12   22   NA
3    3   13   22   NA
4    4   13   22   NA
5    5   13   22   NA

Replace missing values with the last non-missing value from the preceding rows (from left)

df %<>% mutate(col5 = coalesce (col4, col3, col2, col1))#check the df> df
  col1 col2 col3 col4 col5
1    1   11   21   NA   21
2    2   12   22   NA   22
3    3   13   NA   NA   13
4    4   NA   NA   NA    4
5    5   NA   NA   NA    5

Happy imputing!!!

Handling missing values in R

Written by infoart.ca