Handling missing values in R
Create a sample df with missing values:
df = data.frame(col1=c(1:5),
col2=c(11:13, NA, NA),
col3=c(21:22, NA, NA, NA),
col4=rep(NA, 5))> df
col1 col2 col3 col4
1 1 11 21 NA
2 2 12 22 NA
3 3 13 NA NA
4 4 NA NA NA
5 5 NA NA NA
Check the count of missing values for all columns:
colSums(is.na(df))> col1 col2 col3 col4
0 2 3 5
Or for the entire df
table(is.na(df))FALSE TRUE
10 10
i.e. there are 10 missing and 10 non-missing values in the df.
Keep columns with non-missing values only
df = as.data.frame(t(na.omit(t(df))))#check the df> df col1
1 1
2 2
3 3
4 4
5 5
Keep rows with non-missing values only
df = df[complete.cases(df), ]
[1] col1 col2 col3 col4# no observations as all rows contain at least one missing value.
Replace all missings with zeros
df[is.na(df)] <- 0#check the df> df
col1 col2 col3 col4
1 1 11 21 0
2 2 12 22 0
3 3 13 0 0
4 4 0 0 0
5 5 0 0 0
Replace missing values with the smallest value in the column
df %<>% mutate_at(vars(col1,col2, col3), ~replace_na(.,min(., na.rm = TRUE)))#check the df> df
col1 col2 col3 col4
1 1 11 21 NA
2 2 12 22 NA
3 3 13 21 NA
4 4 11 21 NA
5 5 11 21 NA
Replace missing values with the highest value in the column
df %<>% mutate_at(vars(col1,col2, col3), ~replace_na(.,max(., na.rm = TRUE)))#check the df> df
col1 col2 col3 col4
1 1 11 21 NA
2 2 12 22 NA
3 3 13 22 NA
4 4 13 22 NA
5 5 13 22 NA
Replace missing values with column median
df %<>% mutate_at(vars(col1, col2), ~replace_na(.,median(., na.rm = TRUE)))#check the df> df col1 col2 col3 col4
1 1 11 21 NA
2 2 12 22 NA
3 3 13 NA NA
4 4 12 NA NA
5 5 12 NA NA
Replace missing values with column mean
df %<>% mutate_at(vars(col1, col2), ~replace_na(.,mean(., na.rm = TRUE)))#check the df> df
col1 col2 col3 col4
1 1 11 21 NA
2 2 12 22 NA
3 3 13 NA NA
4 4 12 NA NA
5 5 12 NA NA
N.B. Replacing with mean/median will not work for columns 3 and 4 due to precision error (possibly because there are more missing values than non-missing ones)
Here is a workaround using the tidyr package that basically replaces a missing value with the last non-missing value above it
df %<>% fill(col2) %>% fill(col3)#check the df col1 col2 col3 col4
1 1 11 21 NA
2 2 12 22 NA
3 3 13 22 NA
4 4 13 22 NA
5 5 13 22 NA
Replace missing values with the last non-missing value from the preceding rows (from left)
df %<>% mutate(col5 = coalesce (col4, col3, col2, col1))#check the df> df
col1 col2 col3 col4 col5
1 1 11 21 NA 21
2 2 12 22 NA 22
3 3 13 NA NA 13
4 4 NA NA NA 4
5 5 NA NA NA 5
Happy imputing!!!