Some cool R packages for data cleaning
R is a powerful, versatile programming language, and one of its main advantages for data cleaning is its wide range of packages that provide functions for common cleaning tasks.
Let’s look at a few of them, along with their use cases:
tidyr
tidyr is a package for tidying messy data. The functions covered here are gather(), spread(), separate() and unite(), which reshape data by gathering columns into key-value pairs, spreading key-value pairs into separate columns, splitting a single column into multiple columns, and uniting multiple columns into a single column, respectively. Note that recent versions of tidyr mark gather() and spread() as superseded in favor of pivot_longer() and pivot_wider(); they still work, but new code should generally prefer the pivot functions.
1. gather(): This function gathers columns into key-value pairs, turning a “wide” format data frame into a “long” format data frame.
# Load tidyr (needed for all examples in this section)
library(tidyr)
# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
# Gather columns a and b into key-value pairs
data_gathered <- gather(data, key = "variable", value = "value", -id)
# Print the gathered data frame
print(data_gathered)
id variable value
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
5 1 b 5
6 2 b 6
7 3 b 7
8 4 b 8
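The same wide-to-long reshaping can be sketched with pivot_longer(), which tidyr documents as the successor to gather(). This is a minimal sketch on the sample data above; the name data_long is only illustrative.

```r
library(tidyr)

# Same sample data as in the gather() example
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))

# Move columns a and b into name/value pairs
data_long <- pivot_longer(
  data,
  cols = c(a, b),
  names_to = "variable",
  values_to = "value"
)

print(data_long)
```

Unlike gather(), pivot_longer() returns a tibble and keeps the rows for each id together, so the row order differs from the output shown above.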
2. spread(): This function can be used to spread key-value pairs into separate columns, creating a “wide” format data frame from a “long” format data frame.
# Create a sample data frame
data <- data.frame(id = c(1,1,2,2), variable = c("a","b","a","b"), value = c(1,5,2,6))
# Spread key-value pairs into separate columns
data_spread <- spread(data, variable, value)
# Print the spread data frame
print(data_spread)
id a b
1 1 1 5
2 2 2 6
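The long-to-wide direction has a matching successor, pivot_wider(). A minimal sketch on the same sample data follows; data_wide is an illustrative name.

```r
library(tidyr)

# Same sample data as in the spread() example
data <- data.frame(
  id = c(1, 1, 2, 2),
  variable = c("a", "b", "a", "b"),
  value = c(1, 5, 2, 6)
)

# Turn each distinct value of `variable` into its own column
data_wide <- pivot_wider(data, names_from = variable, values_from = value)

print(data_wide)
```

The result has one row per id with columns a and b, returned as a tibble rather than a plain data frame.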
3. separate(): This function can be used to separate a single column into multiple columns.
# Create a sample data frame
data <- data.frame(id = 1:4, full_name = c("John Doe", "Jane Smith", "Bob Johnson", "Emily Davis"))
# Separate column full_name into first_name and last_name
data_separated <- separate(data, full_name, into = c("first_name", "last_name"), sep = " ")
# Print the separated data frame
print(data_separated)
id first_name last_name
1 1 John Doe
2 2 Jane Smith
3 3 Bob Johnson
4 4 Emily Davis
4. unite(): This function can be used to unite multiple columns into a single column.
# Create a sample data frame
data <- data.frame(id = 1:4, first_name = c("John", "Jane", "Bob", "Emily"), last_name = c("Doe", "Smith", "Johnson", "Davis"))
# unite columns first_name and last_name into a single column full_name
data_united <- unite(data, col = "full_name", first_name, last_name, sep = " ")
# Print the united data frame
print(data_united)
id full_name
1 1 John Doe
2 2 Jane Smith
3 3 Bob Johnson
4 4 Emily Davis
dplyr
dplyr is a package for data manipulation. The main functions are filter(), select(), arrange(), mutate() and summarize(), which can be used to filter rows, select columns, arrange rows, add new columns, and summarize data, respectively.
library(dplyr)
Use cases:
1. filter(): This function can be used to filter rows in a data frame based on a logical condition.
# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
# Filter rows where a is greater than 2
data_filtered <- filter(data, a > 2)
# Print the filtered data frame
print(data_filtered)
id a b
1 3 3 7
2 4 4 8
2. select(): This function can be used to select specific columns from a data frame.
# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
# Select columns id and b
data_selected <- select(data, id, b)
# Print the selected data frame
print(data_selected)
id b
1 1 5
2 2 6
3 3 7
4 4 8
3. arrange(): This function can be used to sort rows in a data frame based on one or more columns.
# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
# Sort rows by column a in descending order
data_arranged <- arrange(data, desc(a))
# Print the arranged data frame
print(data_arranged)
id a b
1 4 4 8
2 3 3 7
3 2 2 6
4 1 1 5
4. mutate(): This function can be used to add new columns to a data frame based on existing columns or calculations.
# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
# Add a new column c as the sum of columns a and b
data_mutated <- mutate(data, c = a + b)
# Print the mutated data frame
print(data_mutated)
id a b c
1 1 1 5 6
2 2 2 6 8
3 3 3 7 10
4 4 4 8 12
5. summarize(): This function can be used to summarize the data in a data frame by calculating summary statistics such as mean, sum, etc.
# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
# Summarize data by calculating mean of columns a and b
data_summarized <- summarize(data, mean_a = mean(a), mean_b = mean(b))
# Print the summarized data frame
print(data_summarized)
mean_a mean_b
1 2.5 6.5
6. group_by(): This function can be used to group data by one or more columns and perform operations on the groups.
# Create a sample data frame
data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
# Group data by column a and calculate mean of column b
data_grouped <- group_by(data, a) %>% summarize(mean_b = mean(b))
# Print the grouped data frame
print(data_grouped)
# A tibble: 4 × 2
a mean_b
<dbl> <dbl>
1 1 5
2 2 6
3 3 7
4 4 8
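In practice these verbs are usually chained with the pipe operator %>% rather than called one at a time. The following is a minimal sketch that combines several of the functions above on the same sample data; the name result is only illustrative.

```r
library(dplyr)

data <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))

result <- data %>%
  filter(a > 1) %>%      # keep rows where a is greater than 1
  mutate(c = a + b) %>%  # add column c as the sum of a and b
  arrange(desc(c)) %>%   # sort by c, largest first
  select(id, c)          # keep only the id and c columns

print(result)
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations are applied.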
stringr
stringr is a package for working with strings. The main functions are str_detect(), str_count(), str_replace(), str_split(), and str_trim(), which can be used to detect patterns in strings, count the number of matches, replace matches, split strings, and remove whitespace, respectively.
Use cases:
1. str_detect(): This function can be used to detect whether a pattern exists in a character vector. It returns a logical vector indicating which elements of the input contain the pattern.
# Load stringr (needed for all examples in this section)
library(stringr)
# Create a sample character vector
data <- c("apple", "banana", "cherry")
# Detect if the string contains "a"
detected <- str_detect(data, "a")
# Print the result
print(detected)
[1] TRUE TRUE FALSE
2. str_count(): This function can be used to count the number of matches of a pattern in a character vector.
# Create a sample character vector
data <- c("apple", "banana", "cherry")
# Count the number of "a" in the string
counted <- str_count(data, "a")
# Print the result
print(counted)
[1] 1 3 0
3. str_replace(): This function replaces the first match of a pattern in each element of a character vector with a replacement string. To replace every match, use str_replace_all().
# Create a sample character vector
data <- c("apple", "banana", "cherry")
# Replace the first "a" in each string with "o"
replaced <- str_replace(data, "a", "o")
# Print the result
print(replaced)
[1] "opple" "bonana" "cherry"
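Note that "banana" became "bonana" above because str_replace() changes only the first match in each string. A short sketch comparing it with str_replace_all(), using illustrative variable names:

```r
library(stringr)

fruits <- c("apple", "banana", "cherry")

# Only the first "a" in each string is replaced
first_only <- str_replace(fruits, "a", "o")

# Every "a" in each string is replaced
all_matches <- str_replace_all(fruits, "a", "o")

print(first_only)
print(all_matches)
```

Here first_only contains "bonana" while all_matches contains "bonono"; the other two strings come out the same either way.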
4. str_split(): This function can be used to split each element of a character vector into substrings based on a pattern. It returns a list with one element per input string.
# Create a sample character vector
data <- c("apple,banana,cherry")
# Split the string by ","
split_data <- str_split(data, ",")
# Print the result
print(split_data)
[[1]]
[1] "apple" "banana" "cherry"
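Because str_split() returns a list with one element per input string, you often need one more step to get at the pieces. A minimal sketch of two common options, with illustrative variable names:

```r
library(stringr)

fruits <- "apple,banana,cherry"

# Option 1: take the first (and only) list element to get a plain character vector
pieces <- str_split(fruits, ",")[[1]]

# Option 2: simplify = TRUE returns a character matrix, one row per input string
piece_matrix <- str_split(c("a,b", "c,d"), ",", simplify = TRUE)

print(pieces)
print(piece_matrix)
```

The matrix form is convenient when every input string splits into the same number of parts.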
5. str_trim(): This function can be used to remove leading and trailing whitespace from a character vector.
# Create a sample character vector
data <- c(" apple ", " banana ", " cherry ")
# Remove whitespace
trimmed_data <- str_trim(data)
# Print the result
print(trimmed_data)
[1] "apple" "banana" "cherry"
lubridate
lubridate is a package for working with dates and times. The main functions are ymd(), hms(), dmy(), mdy(), and now(), together with accessors such as year(). They can be used to parse dates and times from strings, get the current date-time, and extract different parts of a date-time (year, month, day, etc.).
Use cases:
1. ymd(): This function parses a string written in year-month-day order into a Date object.
# Load lubridate (needed for all examples in this section)
library(lubridate)
# Create a Date object for 2022-01-01
date <- ymd("2022-01-01")
# Print the result
print(date)
[1] "2022-01-01"
2. hms(): This function parses a string written in hour-minute-second order into a Period object, which represents a span of time rather than a point on the calendar.
# Create a Period object for 12:30:15
time <- hms("12:30:15")
# Print the result
print(time)
[1] "12H 30M 15S"
3. ymd_hms(): This function can be used to create date-time objects in the format of year-month-day hour-minute-second.
# Create a date-time object for 2022-01-01 12:30:15
datetime <- ymd_hms("2022-01-01 12:30:15")
# Print the result
print(datetime)
[1] "2022-01-01 12:30:15 UTC"
4. year(): This function can be used to extract the year from a date-time object.
# Create a Date object for 2022-01-01
date <- ymd("2022-01-01")
# Extract the year (stored as yr to avoid masking the year() function)
yr <- year(date)
# Print the result
print(yr)
[1] 2022
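The dmy() and mdy() parsers mentioned at the start of this section work like ymd() but expect a different component order, and month() and day() are accessors in the same family as year(). A minimal sketch with illustrative variable names:

```r
library(lubridate)

# The same digits mean different dates depending on the parser
d1 <- dmy("01-02-2022")  # day-month-year: 1 February 2022
d2 <- mdy("01/02/2022")  # month-day-year: 2 January 2022

print(d1)
print(d2)

# Extract individual components
print(month(d1))
print(day(d2))
```

Choosing the parser that matches the source data matters most for ambiguous strings like these, where both readings are valid dates.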
janitor
janitor is a package for cleaning up messy data. The main functions are clean_names(), remove_empty(), remove_constant(), tabyl(), and adorn_totals(), which can be used to format variable names, remove empty rows and columns, remove constant columns, and create frequency tables with totals.
Use cases:
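1. clean_names(): This function can be used to clean up the names of columns in a data frame by making them lowercase, removing special characters, and replacing spaces with underscores.
# Load janitor (needed for all examples in this section)
library(janitor)
# Create a sample data frame (data.frame() itself converts these names to
# syntactically valid ones such as Column.1..., which clean_names() then tidies)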
data <- data.frame(
"Column 1*%&" = 1:3,
"Column 2&^$" = c("A", "B", "C"),
"Column_3^%#" = c("apple", "banana", "cherry")
)
# Clean column names
data_cleaned <- clean_names(data)
# Print the cleaned data frame
print(data_cleaned)
column_1 column_2 column_3
1 1 A apple
2 2 B banana
3 3 C cherry
2. remove_empty(): This function can be used to remove rows or columns that are entirely empty (every value is NA) from a data frame. Empty strings such as "" do not count as empty.
# Create a sample data frame whose second row is entirely NA
data <- data.frame(
"Column 1" = c(1, NA, 3),
"Column 2" = c("A", NA, "C"),
"Column 3" = c("apple", NA, "cherry")
)
# Remove empty rows (the second row is dropped)
data_cleaned <- remove_empty(data, which = "rows")
# Print the cleaned data frame
print(data_cleaned)
3. remove_constant(): This function can be used to remove columns that contain the same value in all rows.
# Create a sample data frame
data <- data.frame(
"Column 1" = 1:3,
"Column 2" = c("A", "A", "A"),
"Column 3" = c("apple", "banana", "cherry")
)
# Remove constant columns
data_cleaned <- remove_constant(data)
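The tabyl() and adorn_totals() functions mentioned at the start of this section build frequency tables. A minimal sketch on a made-up fruit column; the data and variable names are illustrative only.

```r
library(janitor)

# Illustrative data: one categorical column
data <- data.frame(fruit = c("apple", "banana", "apple", "cherry"))

# Count how often each value occurs (columns: fruit, n, percent)
counts <- tabyl(data, fruit)

# Append a "Total" row that sums the numeric columns
counts_with_total <- adorn_totals(counts, where = "row")

print(counts_with_total)
```

tabyl() returns a data frame, so the result can be filtered or joined like any other table before the totals row is added.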
Happy cleaning!