Using fuzzy joins in R: handling unmatched strings

infoart.ca
2 min readJan 25, 2025

--

Data often comes with inconsistencies, especially when dealing with real-world datasets. Variations in spellings, typos, or differing formats can make merging datasets challenging. This is where fuzzy joins proves to be so useful. In this blog post, we’ll explore how to handle such issues in R using the fuzzyjoin package.

What is a Fuzzy Join?

A fuzzy join allows us to merge datasets even when the key columns don’t perfectly match. It uses similarity measures like string distance to determine the best matches between keys in the two datasets.

Setting Up

We’ll use the fuzzyjoin package for fuzzy joins. Ensure the required packages are installed:

library(dplyr, fuzzyjoin, stringdist)

To beging with, lets inspect the two datasets with slight mismatches in country names.

df1 <- data.frame(
country = c("United States", "Canada", "United Kingdom", "Germany", "France"),
population = c(331002651, 37742154, 67886011, 83783942, 65273511))

> df1
country population
1 United States 331002651
2 Canada 37742154
3 United Kingdom 67886011
4 Germany 83783942
5 France 65273511

df2 <- data.frame(
country = c("Unitef Sates", "Canadda", "Unitd Kingom", "Geermanny", "Franccee"),
gdp = c(3.5, 4.0, 6.7, 1.8, 4.5))


> df2
country gdp
1 Unitef Sates 3.5
2 Canadda 4.0
3 Unitd Kingom 6.7
4 Geermanny 1.8
5 Franccee 4.5


As the country names are not matching, we won’t be able to perform operations such as joining or merging the datasets. This is where Fuzzy Join enters the picture:

We’ll use stringdist_join from the fuzzyjoin package to merge these datasets based on approximate matches.

stringdist_join(df1, df2, 
mode = "inner",
by = "country",
max_dist = 2)


country.x population country.y gdp
1 United States 331002651 Unitef Sates 3.5
2 Canada 37742154 Canadda 4.0
3 United Kingdom 67886011 Unitd Kingom 6.7
4 Germany 83783942 Geermanny 1.8
5 France 65273511 Franccee 4.5

As we can see, the method correctly identified the closely matched names and joined the datasets accordingly.

Meaning of the parameters

  • by: Specifies the columns to join.
  • max_dist: The maximum allowable string distance for matches.

The result shows that even with minor mismatches, the datasets have been successfully joined.

Freel free to experiment with different distance metrics ("osa", "lv", "dl", etc.) to find the best fit for your data.

--

--

infoart.ca
infoart.ca

Written by infoart.ca

Center for Social Capital & Environmental Research | Posts by Bishwajit Ghose, BI consultant and lecturer at the University of Ottawa

No responses yet