Some cool R packages for data cleaning

infoart.ca
Jan 25, 2023

R is a powerful and versatile programming language that is well suited to data cleaning. One of its main advantages is its wide range of packages, each providing functions for specific cleaning tasks.

Let’s look at a few of them, along with their use cases:

tidyr

tidyr is a package for tidying messy data. The functions covered here are gather(), spread(), separate() and unite(), which reshape data by gathering columns into key-value pairs, spreading key-value pairs into separate columns, separating a single column into multiple columns, and uniting multiple columns into a single column respectively. Note that since tidyr 1.0.0, gather() and spread() are superseded by pivot_longer() and pivot_wider(); they still work, but new code should generally prefer the pivot functions.

1. gather(): This function can be used to gather columns into key-value pairs, creating a “long” format data frame from a “wide” format data frame.

# Load tidyr; the examples in this section assume it is attached
library(tidyr)

# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))

# Gather columns a and b into key-value pairs
data_gathered <- gather(data, key = "variable", value = "value", -id)

# Print the gathered data frame
print(data_gathered)

id variable value
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
5 1 b 5
6 2 b 6
7 3 b 7
8 4 b 8

2. spread(): This function can be used to spread key-value pairs into separate columns, creating a “wide” format data frame from a “long” format data frame.

# Create a sample data frame
data <- data.frame(id = c(1,1,2,2), variable = c("a","b","a","b"), value = c(1,5,2,6))

# Spread key-value pairs into separate columns
data_spread <- spread(data, variable, value)

# Print the spread data frame
print(data_spread)
id a b
1 1 1 5
2 2 2 6

3. separate(): This function can be used to separate a single column into multiple columns.

# Create a sample data frame
data <- data.frame(id = 1:4, full_name = c("John Doe", "Jane Smith", "Bob Johnson", "Emily Davis"))

# Separate column full_name into first_name and last_name
data_separated <- separate(data, full_name, into = c("first_name", "last_name"), sep = " ")

# Print the separated data frame
print(data_separated)
id first_name last_name
1 1 John Doe
2 2 Jane Smith
3 3 Bob Johnson
4 4 Emily Davis

4. unite(): This function can be used to unite multiple columns into a single column.

# Create a sample data frame
data <- data.frame(id = 1:4, first_name = c("John", "Jane", "Bob", "Emily"), last_name = c("Doe", "Smith", "Johnson", "Davis"))

# unite columns first_name and last_name into a single column full_name
data_united <- unite(data, col = "full_name", first_name, last_name, sep = " ")

# Print the united data frame
print(data_united)

id full_name
1 1 John Doe
2 2 Jane Smith
3 3 Bob Johnson
4 4 Emily Davis
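
For comparison, here is a minimal sketch of the same wide-to-long and long-to-wide reshaping using the newer tidyr functions pivot_longer() and pivot_wider(). It reuses the sample data from the gather() example; the names data_long and data_wide are only illustrative. Both functions return tibbles, and pivot_longer() orders rows by id rather than by variable, so the printed result is laid out differently from the gather() output.

```r
library(tidyr)

# Same wide sample data as in the gather() example
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))

# Wide to long: the counterpart of gather()
data_long <- pivot_longer(
  data,
  cols = c(a, b),
  names_to = "variable",
  values_to = "value"
)

# Long to wide: the counterpart of spread()
data_wide <- pivot_wider(
  data_long,
  names_from = variable,
  values_from = value
)

print(data_long)
print(data_wide)
```

Going long and then wide again should give back the original columns, which is a handy sanity check when reshaping.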

dplyr

dplyr is a package for data manipulation. The main functions are filter(), select(), arrange(), mutate() and summarize(), which can be used to filter rows, select columns, sort rows, add new columns, and summarize data respectively. group_by() complements them by applying operations per group.

# Load dplyr; the examples in this section assume it is attached
library(dplyr)

Use cases:

1. filter(): This function can be used to filter rows in a data frame based on a logical condition.

# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))

# Filter rows where a is greater than 2
data_filtered <- filter(data, a > 2)

# Print the filtered data frame
print(data_filtered)

id a b
1 3 3 7
2 4 4 8

2. select(): This function can be used to select specific columns from a data frame.

# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))

# Select columns id and b
data_selected <- select(data, id, b)

# Print the selected data frame
print(data_selected)

id b
1 1 5
2 2 6
3 3 7
4 4 8

3. arrange(): This function can be used to sort rows in a data frame based on one or more columns.

# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))

# Sort rows by column a in descending order
data_arranged <- arrange(data, desc(a))

# Print the arranged data frame
print(data_arranged)

id a b
1 4 4 8
2 3 3 7
3 2 2 6
4 1 1 5

4. mutate(): This function can be used to add new columns to a data frame based on existing columns or calculations.

# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))

# Add a new column c as the sum of columns a and b
data_mutated <- mutate(data, c = a + b)

# Print the mutated data frame
print(data_mutated)

id a b c
1 1 1 5 6
2 2 2 6 8
3 3 3 7 10
4 4 4 8 12

5. summarize(): This function can be used to summarize the data in a data frame by calculating summary statistics such as mean, sum, etc.

# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))

# Summarize data by calculating mean of columns a and b
data_summarized <- summarize(data, mean_a = mean(a), mean_b = mean(b))

# Print the summarized data frame
print(data_summarized)

mean_a mean_b
1 2.5 6.5

6. group_by(): This function can be used to group data by one or more columns and perform operations on the groups.

# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))

# Group data by column a and calculate mean of column b
data_grouped <- group_by(data, a) %>% summarize(mean_b = mean(b))

# Print the grouped data frame
print(data_grouped)

# A tibble: 4 × 2
a mean_b
<dbl> <dbl>
1 1 5
2 2 6
3 3 7
4 4 8
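
In practice these verbs are usually chained with the pipe rather than called one at a time. The sketch below shows one way that might look. The sales data frame and its columns (region, units, price) are hypothetical and not part of the examples above; they are used so that group_by() has repeated values to group on.

```r
library(dplyr)

# Hypothetical sample data with a repeated grouping column
sales <- data.frame(
  region = c("east", "east", "west", "west"),
  units  = c(10, 20, 5, 15),
  price  = c(2, 2, 3, 3)
)

result <- sales %>%
  filter(units > 5) %>%                          # drop orders of 5 units or fewer
  mutate(revenue = units * price) %>%            # add a derived column
  group_by(region) %>%                           # one group per region
  summarize(total_revenue = sum(revenue)) %>%    # collapse each group to one row
  arrange(desc(total_revenue))                   # largest total first

print(result)
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations are applied.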

stringr

stringr is a package for working with strings. The main functions are str_detect(), str_count(), str_replace(), str_split(), and str_trim(), which can be used to detect patterns in strings, count the number of matches, replace matches, split strings, and remove whitespace respectively.

Use cases:

1. str_detect(): This function can be used to detect whether a pattern exists in a character vector. It returns a logical vector indicating which elements of the input contain the pattern.

# Load stringr; the examples in this section assume it is attached
library(stringr)

# Create a sample character vector
data <- c("apple", "banana", "cherry")

# Detect if the string contains "a"
detected <- str_detect(data, "a")

# Print the result
print(detected)

[1] TRUE TRUE FALSE

2. str_count(): This function can be used to count the number of matches of a pattern in a character vector.

# Create a sample character vector
data <- c("apple", "banana", "cherry")

# Count the number of "a" in the string
counted <- str_count(data, "a")

# Print the result
print(counted)

[1] 1 3 0

3. str_replace(): This function can be used to replace the first match of a pattern in each element of a character vector with a replacement string. To replace every match, use str_replace_all() instead.

# Create a sample character vector
data <- c("apple", "banana", "cherry")

# Replace the first "a" in each string with "o"
replaced <- str_replace(data, "a", "o")

# Print the result
print(replaced)

[1] "opple" "bonana" "cherry"

4. str_split(): This function can be used to split each string in a character vector into substrings based on a pattern. It returns a list with one character vector per input string.

# Create a sample character vector
data <- c("apple,banana,cherry")

# Split the string by ","
split_data <- str_split(data, ",")

# Print the result (a list of length 1)
print(split_data)

[[1]]
[1] "apple" "banana" "cherry"

5. str_trim(): This function can be used to remove leading and trailing whitespace from a character vector.

# Create a sample character vector
data <- c(" apple ", " banana ", " cherry ")

# Remove whitespace
trimmed_data <- str_trim(data)

# Print the result
print(trimmed_data)

[1] "apple" "banana" "cherry"
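
Two details from the examples above are easy to miss: str_replace() changes only the first match in each string, and str_split() returns a list. The short sketch below contrasts str_replace() with str_replace_all() and shows how to pull a plain character vector out of the str_split() result. The variable names are only illustrative.

```r
library(stringr)

data <- c("apple", "banana", "cherry")

# First match only: "opple" "bonana" "cherry"
first_only <- str_replace(data, "a", "o")

# Every match: "opple" "bonono" "cherry"
all_matches <- str_replace_all(data, "a", "o")

# str_split() returns a list; [[1]] extracts the pieces of the first string
pieces <- str_split("apple,banana,cherry", ",")[[1]]

print(first_only)
print(all_matches)
print(pieces)
```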

lubridate

lubridate is a package for working with dates and times. The main functions include ymd(), dmy() and mdy(), which parse strings into dates; hms(), which parses a time of day into a period; and now(), which returns the current date-time. Accessor functions such as year(), month() and day() extract individual parts of a date or date-time.

Use cases:

1. ymd(): This function can be used to parse a string written in year-month-day order into a Date object.

# Load lubridate; the examples in this section assume it is attached
library(lubridate)

# Create a Date object for 2022-01-01
date <- ymd("2022-01-01")

# Print the result
print(date)
[1] "2022-01-01"

2. hms(): This function can be used to parse a string written in hour-minute-second order. It returns a Period object (a length of time), not a date-time.

# Create a Period object for 12:30:15
time <- hms("12:30:15")

# Print the result
print(time)

[1] "12H 30M 15S"

3. ymd_hms(): This function can be used to create date-time objects in the format of year-month-day hour-minute-second.

# Create a date-time object for 2022-01-01 12:30:15
datetime <- ymd_hms("2022-01-01 12:30:15")

# Print the result
print(datetime)
[1] "2022-01-01 12:30:15 UTC"

4. year(): This function can be used to extract the year from a date-time object.

# Create a date-time object for 2022-01-01
date <- ymd("2022-01-01")

# Extract the year (stored as yr to avoid masking the year() function)
yr <- year(date)

# Print the result
print(yr)

[1] 2022
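
The dmy() and mdy() parsers mentioned in the introduction work the same way as ymd() but expect the components in a different order. As a rough sketch, the three calls below parse differently formatted strings for the same day (the example strings are made up for illustration), after which month() and day() extract individual components.

```r
library(lubridate)

# Three spellings of the same calendar day, one parser per component order
d_ymd <- ymd("2023-01-25")
d_dmy <- dmy("25-01-2023")
d_mdy <- mdy("01/25/2023")

# All three parse to the same Date
same_day <- d_ymd == d_dmy && d_dmy == d_mdy
print(same_day)

# Extract individual components
print(month(d_ymd))
print(day(d_ymd))
```

Choosing the parser that matches the input order matters: a string such as "01-02-2023" is a valid input for both dmy() and mdy() but yields different dates.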

janitor

janitor is a package for cleaning up messy data. The main functions are clean_names(), remove_empty(), remove_constant(), tabyl(), and adorn_totals(), which can be used to format variable names, remove empty rows and columns, remove constant columns, and create frequency tables with totals.

Use cases:

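1. clean_names(): This function can be used to clean up the names of columns in a data frame by making them lowercase, removing special characters, and replacing spaces with underscores.

# Load janitor; the examples in this section assume it is attached
library(janitor)

# Create a sample data frame
# (by default data.frame() already converts invalid characters in names
# to dots; clean_names() then turns the result into lowercase snake_case)
data <- data.frame(
"Column 1*%&" = 1:3,
"Column 2&^$" = c("A", "B", "C"),
"Column_3^%#" = c("apple", "banana", "cherry")
)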

# Clean column names
data_cleaned <- clean_names(data)

# Print the cleaned data frame
print(data_cleaned)

column_1 column_2 column_3
1 1 A apple
2 2 B banana
3 3 C cherry

2. remove_empty(): This function can be used to remove rows or columns that are entirely empty, meaning every value is NA. An empty string "" does not count as empty.

# Create a sample data frame whose second row is entirely NA
data <- data.frame(
"Column 1" = c(1, NA, 3),
"Column 2" = c("A", NA, "C"),
"Column 3" = c("apple", NA, "cherry")
)

# Remove empty rows
data_cleaned <- remove_empty(data, "rows")

# Print the cleaned data frame
print(data_cleaned)

3. remove_constant(): This function can be used to remove columns that contain the same value in all rows.

# Create a sample data frame
data <- data.frame(
"Column 1" = 1:3,
"Column 2" = c("A", "A", "A"),
"Column 3" = c("apple", "banana", "cherry")
)

# Remove constant columns
data_cleaned <- remove_constant(data)
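
The introduction to this section also mentions tabyl() and adorn_totals(), which are not demonstrated above. A minimal sketch of how they might be combined follows. The fruit data frame is hypothetical, made up only to give tabyl() something to count.

```r
library(janitor)

# Hypothetical sample data with repeated values to count
data <- data.frame(fruit = c("apple", "banana", "apple", "cherry", "apple"))

# Frequency table: one row per distinct value, with columns n and percent
fruit_counts <- tabyl(data, fruit)

# Append a "Total" row to the frequency table
fruit_totals <- adorn_totals(fruit_counts, where = "row")

print(fruit_totals)
```

tabyl() returns an ordinary data frame with extra attributes, so the result can be passed on to other functions such as dplyr verbs.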

Happy cleaning!


Written by infoart.ca

Center for Social Capital & Environmental Research | Posts by Bishwajit Ghose, BI consultant and lecturer at the University of Ottawa
