Using fuzzy joins in R: handling unmatched strings

2 min readJan 25, 2025

Data often comes with inconsistencies, especially when dealing with real-world datasets. Variations in spellings, typos, or differing formats can make merging datasets challenging. This is where fuzzy joins proves to be so useful. In this blog post, we’ll explore how to handle such issues in R using the fuzzyjoin package.

What is a Fuzzy Join?

A fuzzy join allows us to merge datasets even when the key columns don’t perfectly match. It uses similarity measures like string distance to determine the best matches between keys in the two datasets.

Setting Up

We’ll use the fuzzyjoin package for fuzzy joins. Ensure the required packages are installed:

library(dplyr, fuzzyjoin, stringdist)

To beging with, lets inspect the two datasets with slight mismatches in country names.

df1 <- data.frame(
    country = c("United States", "Canada", "United Kingdom", "Germany", "France"),
    population = c(331002651, 37742154, 67886011, 83783942, 65273511))

> df1
         country population
1  United States  331002651
2         Canada   37742154
3 United Kingdom   67886011
4        Germany   83783942
5         France   65273511

df2 <- data.frame(
    country = c("Unitef Sates", "Canadda", "Unitd Kingom", "Geermanny", "Franccee"),
    gdp = c(3.5, 4.0, 6.7, 1.8, 4.5))


> df2
       country gdp
1 Unitef Sates 3.5
2      Canadda 4.0
3 Unitd Kingom 6.7
4    Geermanny 1.8
5     Franccee 4.5

As the country names are not matching, we won’t be able to perform operations such as joining or merging the datasets. This is where Fuzzy Join enters the picture:

We’ll use stringdist_join from the fuzzyjoin package to merge these datasets based on approximate matches.

stringdist_join(df1, df2, 
                mode = "inner",
                by = "country",
                max_dist = 2)


       country.x population    country.y gdp
1  United States  331002651 Unitef Sates 3.5
2         Canada   37742154      Canadda 4.0
3 United Kingdom   67886011 Unitd Kingom 6.7
4        Germany   83783942    Geermanny 1.8
5         France   65273511     Franccee 4.5

As we can see, the method correctly identified the closely matched names and joined the datasets accordingly.

Meaning of the parameters

by: Specifies the columns to join.
max_dist: The maximum allowable string distance for matches.

The result shows that even with minor mismatches, the datasets have been successfully joined.

Freel free to experiment with different distance metrics ("osa", "lv", "dl", etc.) to find the best fit for your data.

Using fuzzy joins in R: handling unmatched strings

What is a Fuzzy Join?

Setting Up

Meaning of the parameters

Written by infoart.ca

No responses yet