Fuzzy Matching

code

Using the fuzzyjoin package for inexact matching

Recently had the pleasure of fuzzy matching two datasets that included first and last name and date of birth. Here are some notes:

Toy Example

library(tidyverse)
library(fuzzyjoin)

Create our list of names and dobs.

inside_list <- tribble(
  ~first, ~last, ~dob,
  "frank", "johnson", "1980-01-01",
  "debbie", "willis", "1982-01-01"
) %>% mutate(dob = parse_date(dob, format = "%Y-%m-%d"))

Create a second list to be matched against

outside_list <- tribble(
  ~first, ~last, ~dob,
  "frank", "johnson", "1980-01-01",
  "debrah", "willis", "1982-01-02"
) %>% mutate(dob = parse_date(dob, format = "%Y-%m-%d"))

Use the function of your choice.

stringdist_join(outside_list, inside_list,
                by = c("first", "last", "dob"),
                ignore_case = TRUE,
                distance_col = "dist",
                mode = "left",
                method = "lv",
                max_dist = 20) %>% 
  mutate(total_distance = dob.dist + first.dist + last.dist) %>% 
  select(-ends_with(".dist")) %>% 
  knitr::kable()
first.x last.x dob.x first.y last.y dob.y dist total_distance
frank johnson 1980-01-01 frank johnson 1980-01-01 NA 0
frank johnson 1980-01-01 debbie willis 1982-01-01 NA 14
debrah willis 1982-01-02 frank johnson 1980-01-01 NA 14
debrah willis 1982-01-02 debbie willis 1982-01-01 NA 4

You can see the difference between debrah and debbie is 4.

Setting the distance threshold is often a function of how exact you want the matches to be.

See the fuzzyjoin reference for more.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/colemanrob/robcoleman.ca, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".