Using the fuzzyjoin package for inexact matching
Recently had the pleasure of fuzzy matching two datasets that included first and last name and date of birth. Here are some notes:
Create our list of names and dobs.
inside_list <- tribble(
~first, ~last, ~dob,
"frank", "johnson", "1980-01-01",
"debbie", "willis", "1982-01-01"
) %>% mutate(dob = parse_date(dob, format = "%Y-%m-%d"))
Create a second list to be matched against
outside_list <- tribble(
~first, ~last, ~dob,
"frank", "johnson", "1980-01-01",
"debrah", "willis", "1982-01-02"
) %>% mutate(dob = parse_date(dob, format = "%Y-%m-%d"))
Use the function of your choice.
stringdist_join(outside_list, inside_list,
by = c("first", "last", "dob"),
ignore_case = TRUE,
distance_col = "dist",
mode = "left",
method = "lv",
max_dist = 20) %>%
mutate(total_distance = dob.dist + first.dist + last.dist) %>%
select(-ends_with(".dist")) %>%
knitr::kable()
first.x | last.x | dob.x | first.y | last.y | dob.y | dist | total_distance |
---|---|---|---|---|---|---|---|
frank | johnson | 1980-01-01 | frank | johnson | 1980-01-01 | NA | 0 |
frank | johnson | 1980-01-01 | debbie | willis | 1982-01-01 | NA | 14 |
debrah | willis | 1982-01-02 | frank | johnson | 1980-01-01 | NA | 14 |
debrah | willis | 1982-01-02 | debbie | willis | 1982-01-01 | NA | 4 |
You can see the difference between debrah and debbie is 4.
Setting the distance threshold is often a function of how exact you want the matches to be.
See the fuzzyjoin reference for more.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/colemanrob/robcoleman.ca, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".