Using fuzzy joins in R: handling unmatched strings
Data often comes with inconsistencies, especially when dealing with real-world datasets. Variations in spellings, typos, or differing formats can make merging datasets challenging. This is where fuzzy joins proves to be so useful. In this blog post, we’ll explore how to handle such issues in R using the fuzzyjoin
package.
What is a Fuzzy Join?
A fuzzy join allows us to merge datasets even when the key columns don’t perfectly match. It uses similarity measures like string distance to determine the best matches between keys in the two datasets.
Setting Up
We’ll use the fuzzyjoin
package for fuzzy joins. Ensure the required packages are installed:
library(dplyr, fuzzyjoin, stringdist)
To beging with, lets inspect the two datasets with slight mismatches in country names.
df1 <- data.frame(
country = c("United States", "Canada", "United Kingdom", "Germany", "France"),
population = c(331002651, 37742154, 67886011, 83783942, 65273511))
> df1
country population
1 United States 331002651
2 Canada 37742154
3 United Kingdom 67886011
4 Germany 83783942
5 France 65273511
df2 <- data.frame(
country = c("Unitef Sates", "Canadda", "Unitd Kingom", "Geermanny", "Franccee"),
gdp = c(3.5, 4.0, 6.7, 1.8, 4.5))
> df2
country gdp
1 Unitef Sates 3.5
2 Canadda 4.0
3 Unitd Kingom 6.7
4 Geermanny 1.8
5 Franccee 4.5
As the country names are not matching, we won’t be able to perform operations such as joining or merging the datasets. This is where Fuzzy Join enters the picture:
We’ll use stringdist_join
from the fuzzyjoin
package to merge these datasets based on approximate matches.
stringdist_join(df1, df2,
mode = "inner",
by = "country",
max_dist = 2)
country.x population country.y gdp
1 United States 331002651 Unitef Sates 3.5
2 Canada 37742154 Canadda 4.0
3 United Kingdom 67886011 Unitd Kingom 6.7
4 Germany 83783942 Geermanny 1.8
5 France 65273511 Franccee 4.5
As we can see, the method correctly identified the closely matched names and joined the datasets accordingly.
Meaning of the parameters
by
: Specifies the columns to join.max_dist
: The maximum allowable string distance for matches.
The result shows that even with minor mismatches, the datasets have been successfully joined.
Freel free to experiment with different distance metrics ("osa"
, "lv"
, "dl"
, etc.) to find the best fit for your data.