The goal of record linkage and deduplication is to detect which records belong to the same object in data sets where the identifiers of the objects contain errors and missing values. The main design considerations of reclin2 are modularity/flexibility, speed and the ability to handle large data sets. The first point makes it easy for users to extend the package with custom process steps. This flexibility is obtained by using simple data structures and by following common interfaces in R as closely as possible. For large problems it is possible to distribute the work over multiple worker nodes. A benchmark comparison to other record linkage packages for R shows that, for this specific benchmark, the fastLink package performs best. However, that package supports only one specific type of record linkage model. The performance of reclin2 is not far behind that of fastLink while allowing for much greater flexibility.
Combining different data sets is an important step in many data
analysis projects. Sometimes the data sets will contain high-quality
linkage keys, especially when the data sets are based on a common
register. For example, the samples for (nearly) all social surveys
performed at Statistics Netherlands are drawn from the population
register and therefore can be linked to each other (Bakker et al. 2014). In these
cases exact linkage can be used. In exact linkage, records are linked
when they agree exactly on the linkage keys used. Exact linkage can be
performed in R using base functions such as merge. However, it is not
uncommon that data sets have to be linked on keys such as ‘first name’,
‘last name’ and ‘address’. Often these variables contain errors and/or
missing values and, therefore, exact linkage is not possible. That is
where probabilistic record linkage methods come into play
(Herzog et al. 2007; Christen 2012). These methods will calculate some sort of
likelihood that two records belong to the same object (person, company,
…). This will be called a match. Only record pairs with a high enough
likelihood are linked to each other. The goal is to minimise the number
of false links (linking two records that do not belong to the same
object) and the number of missed links (not linking two records that
do belong to the same object).
The process of probabilistic record linkage generally consists of the following steps: (1) Generate pairs of records from each of the two data sets that are to be linked; (2) Compare the two records of the pair and generate a comparison vector (in the simplest case this is a vector of ones and zeros coding agreement/disagreement on each of the linkage keys); (3) Estimate a model that predicts, based on the comparison vector, a likelihood that the two records belong to the same object; (4) Select pairs with a high enough likelihood; (5) Using the selected pairs, generate the final linked data set. reclin2 offers different methods for most of these steps and by mixing the different methods a custom linkage process can be developed. This is discussed in more detail with examples in the section on the record linkage process.
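As a preview, the whole process can be condensed into a handful of calls on the package's example data (a minimal sketch; every function used here is demonstrated step by step in that section):

library(reclin2)
data("linkexample1", "linkexample2")

# (1) generate pairs, blocking on postcode
pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
# (2) compare the records of each pair on the linkage keys
compare_pairs(pairs, on = c("lastname", "firstname", "address"),
  comparators = list(lastname = jaro_winkler(0.9),
    firstname = jaro_winkler(0.9), address = jaro_winkler(0.9)),
  inplace = TRUE)
# (3) estimate the Fellegi-Sunter model and compute weights for each pair
m <- problink_em(~ lastname + firstname + address, data = pairs)
pairs <- predict(m, pairs = pairs, add = TRUE)
# (4) select pairs with a high enough weight, enforcing one-to-one linkage
pairs <- select_n_to_m(pairs, "weights", variable = "select", threshold = 0)
# (5) generate the final linked data set
linked <- link(pairs, selection = "select")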
A variant of record linkage is deduplication. Here there is only one data set and one wants to determine which records belong to the same object. For example, a customer database can contain the same customer multiple times with slightly different information (e.g. different email addresses). Deduplication is usually performed by linking a data set to itself. Matches are then duplicate records. The principles are, therefore, the same as with regular record linkage and in the remainder of this paper we will focus on regular record linkage of two data sets.
Record linkage can be computationally and memory intensive. In principle each record from one data set has to be compared to each record in the other data set. Therefore, when the two data sets are of size \(N_1\) and \(N_2\) respectively, the computational complexity and memory requirements are of order \(O(N_1 N_2)\). For example, at Statistics Netherlands one common data set is the population register containing in the order of \(10^7\) records; other data sets can be in the order of \(10^3\)–\(10^5\) records, resulting in \(10^{10}\)–\(10^{12}\) possible comparisons.
reclin2 is a package that provides a set of tools to perform probabilistic record linkage. It is the successor of the reclin package. The reason for the update was to provide better support for the core design considerations of the reclin/reclin2 package. Unfortunately, this was not possible while keeping backward compatibility; therefore, it was decided to continue with a new package. The core design considerations are:
1. Modularity/flexibility.
2. Speed.
3. The ability to handle large data sets.
The last two points are important because of the aforementioned issues with the size of the problem. The first point is important because, in practice, no record linkage project is the same and, therefore, there is a common need to vary the default procedure. The next section will discuss how we tried to address these points. Besides reclin2, other packages exist for probabilistic record linkage. There is the RecordLinkage (Sariyar and Borg 2022) package that implements various methods, such as classic probabilistic record linkage based on the Fellegi and Sunter (1969) model and methods based on machine learning. Furthermore, there is the fastLink (Enamorado et al. 2019, 2020) package that focuses on a fast and flexible implementation of the Fellegi-Sunter model. The main difference between reclin2 and these packages is the focus on the three points above: fastLink scores well on points 2 and 3, but supports only one type of model, while RecordLinkage scores well on point 3 and better than fastLink on point 1, but lacks some flexibility and speed. Points 2 and 3 are investigated in a later section using a benchmark.
One of the main considerations when designing the package was flexibility. Therefore, the package has been designed as a set of functions that operate on data.table objects (Dowle and Srinivasan 2021). The main object of the package is the pairs object, which is a subclass of data.table. The pairs object contains pairs of records from the two data sets that are to be linked (called x and y). The first two columns of the pairs object contain the indices of the corresponding records from the two data sets. Most functions of the package accept a pairs object and return a pairs object. The package has functions for different steps in the linkage process (as described in the next section). By combining the different available functions a custom data linkage process can be built. Furthermore, as the pairs object is a data.table, it is also easy for the user to manipulate it. For example, new columns can be derived and pairs can be filtered, as shown in the sketch below.
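A hypothetical sketch (the similarity columns lastname and firstname only exist after the comparison step shown in the next section); plain data.table syntax applies directly to a pairs object:

# derive a new column from existing comparison columns
pairs[, name_sim := (lastname + firstname) / 2]
# filter: keep only the more promising pairs
pairs <- pairs[name_sim > 0.8]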
Functions that do not manipulate the pairs object are designed to follow as closely as possible the common interfaces of base R functions. For example, the function problink_em that can be used to estimate the parameters of the Fellegi-Sunter model accepts similar input as other modelling functions in R: e.g. a formula to specify the model and a data argument to pass in the data on which to estimate the model. The corresponding predict function can be used to calculate likelihoods for pairs being a match. It is therefore also easy to use models from other packages, such as machine learning methods, to estimate the likelihoods. Where a package such as RecordLinkage has functions for a number of machine learning methods, reclin2 does not need these, as the user is free to call these themselves, as demonstrated in the section on the record linkage process below.
The other two design considerations, speed and being able to handle large data sets, are obtained in two ways. First, by using a data.table as the main object. Most methods have an inplace argument (default value is FALSE). When set to TRUE the pairs object is modified using the [, :=] operation of a data.table. This prevents unnecessary copies, decreasing memory consumption and increasing speed. Second, there is the option to create a cluster and distribute the computational load over multiple cores. Using functionality from the parallel or snow (Tierney et al. 2021) packages, multiple R processes are started and the data is distributed over these processes. Each process then generates a subset of the pairs which are kept within the process. Subsequent operations on the pairs, such as comparison, are also distributed over the processes, where each process applies the operation to its subset of pairs. One of the more computationally intensive operations during linkage is comparing the records from the two data sets to each other. This problem scales well when parallelising; therefore, parallelisation can lead to a significant speed-up. Furthermore, when using a snow cluster the computation can also be distributed over multiple machines. This can not only lead to a speed-up, but also means that the memory of multiple machines can be utilised, allowing for larger problems than could be handled on a single machine. To work with a cluster, special functions with the cluster_ prefix are offered. The cluster functions generating the pairs expect as one of their inputs a valid cluster, created for example using makeCluster from the parallel package. When using the cluster variant, the object is no longer a data.table and it becomes more difficult to manually manipulate the object. The package has a few functions to help with this, which will be discussed at the end of the next section.
This section will give an overview of the linkage process and show how the functions in reclin2 can be used for this. The discussion will be brief. An overview of the main steps of a record linkage process has already been given in the introduction of the paper. A more extensive description can be found in the package vignettes and documentation. Also, we will not go into detail on the methods used, as these are well described in, for example, Herzog et al. (2007) and Christen (2012).
The first step in the linkage process is to generate pairs of records from the two data sets x and y. There are a number of functions for this: the function pair generates all possible pairs. However, this can lead to impractically large numbers of pairs. Therefore, methods are often applied to reduce the total number of pairs. One commonly used method is blocking, where only pairs are generated that agree on some key. This, of course, only works when a key of good enough quality is available, otherwise true matches are lost. Another method in the package, pair_minsim, is to generate pairs that agree on a given number of variables (e.g. they have to agree on either the postcode or the town name) (Christen 2012). In the example below we use blocking on 'postcode', i.e. pairs are only generated when they agree exactly on 'postcode' ([...] in the examples indicates removed output).
> library(reclin2)
[...]
> data("linkexample1", "linkexample2")
> (pairs <- pair_blocking(linkexample1, linkexample2, "postcode"))
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

    .x .y
 1:  1  1
 2:  1  2
 3:  1  3
 4:  2  1
 5:  2  2
 6:  2  3
 7:  3  1
[...]
A data.table is returned with the added class pairs. The columns .x and .y contain the row indices into the two data sets. A copy (when the original data sets are not modified this is only a reference) of the two data sets is stored in the attributes x and y. This makes some of the next function calls easier.
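For example, the original data sets can be retrieved from these attributes:

# the data sets x and y are stored as attributes of the pairs object
head(attr(pairs, "x"))
head(attr(pairs, "y"))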
There also exist cluster variants of these functions that return a
cluster_pairs
object:
> library(parallel)
> cl <- makeCluster(2)
> cpairs <- cluster_pair_blocking(cl, linkexample1, linkexample2, "postcode")
When calling the cluster variants of the pair generating algorithms, the records from x are randomly distributed over the nodes of the cluster and y is copied to each cluster node. On each node the corresponding pair function is called. The resulting pairs object is stored on each node in an environment inside reclin2:::reclin_env (the default name of this environment is "default"). The cluster_pairs object is a list with a copy of the cluster object and the name of the environment on the cluster nodes in which the pairs are stored.
The next step in the linkage process is to compare the pairs of records on a set of variables common to both data sets. For this the package contains various comparison functions. The default function checks for exact agreement. However, for text fields such as names and addresses, it is often better to allow for spelling errors. For this, some of the functions from the stringdist (van der Loo 2014) package are imported. For classic record linkage using the Fellegi-Sunter model it is necessary that these are translated into a similarity score between 0 and 1, where 1 is complete agreement; this is what the functions included in reclin2 do. In the example below, we provide comparison functions for 'lastname', 'firstname' and 'address', while 'sex' is compared with the default comparator:
> (compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
+   comparators = list(lastname = jaro_winkler(0.9),
+     firstname = jaro_winkler(0.9), address = jaro_winkler(0.9)),
+   inplace = TRUE))
[...]
    .x .y lastname firstname   address  sex
 1:  1  1 1.000000 0.4722222 0.9230769   NA
 2:  1  2 0.000000 0.5833333 0.8641026 TRUE
 3:  1  3 0.447619 0.4642857 0.9333333 TRUE
[...]
The Jaro-Winkler string similarity score is used: a value of one indicates complete agreement, a value of zero indicates complete disagreement (no overlap in letters) and values in between indicate partial agreement. The 0.9 in the function call is a threshold used, among others, by the EM-algorithm discussed below, as this method only handles complete agreement or disagreement: values above 0.9 are considered to agree completely. We see that the first record from x agrees exactly with the first record of y only on 'lastname', while 'sex' cannot be compared as it is missing in at least one of the data sets.
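As the comparators are ordinary functions of two vectors (the custom comparator example below relies on this as well), they can also be called directly to inspect the scores they produce:

# a comparator returned by jaro_winkler can be applied to vectors directly
cmp <- jaro_winkler(0.9)
cmp("Smith", "Smyth")  # partial agreement: a similarity between 0 and 1
cmp("Smith", "Smith")  # complete agreement: 1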
The compare_pairs method is also implemented for the cluster_pairs object. For more flexibility there is also the compare_vars method. This function compares only one variable at a time, but it allows for different names of the variables in the two data sets, for generating multiple output columns out of one comparison, and for more complex comparisons where multiple variables are taken into account. As an example of the latter, the code below compares records on first name and last name, allowing for the two parts of a name to be swapped:
> comp_name <- function(x, y) {
+   # identical() here is reclin2's comparator generator (it masks
+   # base::identical and returns a function that compares two vectors)
+   equal <- identical()
+   regular <- equal(x[[1]], y[[1]]) & equal(x[[2]], y[[2]])
+   swapped <- equal(x[[1]], y[[2]]) & equal(x[[2]], y[[1]])
+   regular | swapped
+ }
> compare_vars(pairs, "name_swap", on_x = c("firstname", "lastname"),
+   comparator = comp_name)
[...]
    .x .y lastname firstname   address  sex name_swap
 1:  1  1 1.000000 0.4722222 0.9230769   NA     FALSE
 2:  1  2 0.000000 0.5833333 0.8641026 TRUE     FALSE
 3:  1  3 0.447619 0.4642857 0.9333333 TRUE     FALSE
[...]
When records are compared on multiple columns, the comparison function
receives two data.table
objects as its inputs.
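As a further illustration (with hypothetical column names that do not occur in the example data), compare_vars can compare columns that are named differently in the two data sets:

# compare a date of birth stored under different names in x and y
compare_vars(pairs, "dob", on_x = "birthdate", on_y = "date_of_birth",
  inplace = TRUE)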
The goal of probabilistic record linkage is to generate a likelihood for each pair that the two records in the pair belong to the same object. This likelihood is based on the comparison vector. The traditional method is the model by Fellegi and Sunter (1969). The parameters of this model are usually estimated using an EM-algorithm (Winkler 2000). However, reclin2 treats this as just another model and uses the same interface as any other model that can be estimated in R:
> m <- problink_em(~ lastname + firstname + address + sex, data = pairs)
> (pairs <- predict(m, pairs = pairs, add = TRUE))
[...]
    .x .y lastname firstname   address  sex    weights
 1:  1  1 1.000000 0.4722222 0.9230769   NA  7.7103862
 2:  1  2 0.000000 0.5833333 0.8641026 TRUE -5.9463949
 3:  1  3 0.447619 0.4642857 0.9333333 TRUE  0.8042090
[...]
The range of the weights depends on the number of variables and the estimated parameters in the model. They are log-likelihood ratios (Fellegi and Sunter 1969). The values of the weights themselves are not directly of use, except that a higher weight indicates that a pair is more likely a match. In principle, a weight above zero indicates that the pair is more likely a match than not. However, in practice, a threshold higher than zero is often used in order to reduce the likelihood of false links. The predict function of the EM-model also has the option to estimate posterior probabilities. Thresholds for the weights (or probabilities) are often determined by manually inspecting pairs around potential threshold values (Herzog et al. 2007). These methods can also be used for cluster_pairs objects.
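Such an inspection is a simple data.table filter (the bounds below are arbitrary example values):

# inspect the pairs with weights around a candidate threshold
inspect <- pairs[weights > 2 & weights < 6]
inspect[order(weights)]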
As the pairs
object is a regular data.table
object, it is also
relatively easy to estimate other models on the data set. In principle
this is a classification problem: the pairs need to be divided into two
categories: matches and non-matches. For example, when for a part of the
pairs the true match status is known, a supervised learning method can
be used. In the example below the ‘id’ field is used to derive the true
match status for the dataset (in practice this would probably only be
available for a subset) and predict a linkage probability using logistic
regression:
> compare_vars(pairs, "true", on_x = "id", on_y = "id", inplace = TRUE)
> mglm <- glm(true ~ lastname + firstname, data = pairs,
+ family = binomial())
> pairs[, pglm := predict(mglm, type = "response")]
In the first line of this example, a column named 'true' is added to the data set (in place). This column is a comparison of the 'id' from the first data set (on_x = "id") to the 'id' column of the second data set (on_y = "id"). This column 'true' contains the true match status. In the second line, a logistic regression model is estimated that predicts the match status using the two columns 'firstname' and 'lastname'. Using this model the probability of a true match is estimated in the final line of the example and added to the data set.
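These probabilities can then take the place of the weights in the remaining steps; for example, with an (arbitrary) 0.5 cut-off and select_threshold, which is also used in the cluster example below:

# classify pairs with a predicted probability above 0.5 as matches
select_threshold(pairs, "pglm", variable = "glm_select", threshold = 0.5)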
In order to link the pairs, a suitable threshold needs to be determined for the weights. Records with a weight above this threshold are classified as a match. Also, we generally know that each person has only one record in each data set. So, generally we will want to enforce one-to-one linkage. This will also generally improve the quality of the linked data set. reclin2 has two methods for enforcing one-to-one linkage. The method select_n_to_m tries to select the pairs in such a way that the total weight of the selected pairs is maximised while linking each record from each data set to at most one record from the other data set (using its arguments it is also possible to enforce n-to-one or one-to-n linkage). A faster method that can lead to fewer links is select_greedy, which will try to select the pair with the highest weight for each record. Below the first method is applied; records with a weight below 0 are not considered (threshold = 0):
> (pairs <- select_n_to_m(pairs, "weights", variable = "select", threshold = 0))
[...]
    .x .y lastname firstname   address  sex    weights select
 1:  1  1 1.000000 0.4722222 0.9230769   NA  7.7103862  FALSE
 2:  1  2 0.000000 0.5833333 0.8641026 TRUE -5.9463949  FALSE
 3:  1  3 0.447619 0.4642857 0.9333333 TRUE  0.8042090  FALSE
 4:  2  1 1.000000 0.8888889 0.9230769   NA  8.6064218   TRUE
[...]
The method creates a logical column (name given by the ‘variable’
argument) in the pairs object with the selected pairs. The first pair
has a high enough weight to be selected, but there is another candidate
for the first record of y
that is more likely, namely record 2 from
x
(see fourth row of the output above).
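The greedy alternative would look similar; a sketch, assuming select_greedy accepts the same arguments as select_n_to_m:

# greedily select, per record, the pair with the highest weight
pairs <- select_greedy(pairs, "weights", variable = "greedy", threshold = 0)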
Up until now we are still working with a set of pairs. The goal is to get an actually linked data set containing matched records from both data sets. This can be done using the link method. This function takes the pairs and the name of a logical column indicating which pairs are selected, and it will generate the final linked data set. The output is similar to that of merge. The method also has arguments all_x and all_y that function the same as the corresponding all.x and all.y arguments of merge.
> (linked_data_set <- link(pairs, selection = "select"))
  Total number of pairs: 4 pairs

    .y .x id.x lastname.x firstname.x  address.x sex.x postcode.x id.y
 1:  1  2    2      Smith      George 12 Mainstr     M    1234 AB    2
 2:  2  3    3    Johnson        Anna 61 Mainstr     F    1234 AB    3
 3:  3  4    4    Johnson     Charles 61 Mainstr     M    1234 AB    4
 4:  4  6    6   Schwartz         Ben  1 Eaststr     M    6789 XY    6
    lastname.y firstname.y     address.y sex.y postcode.y
 1:      Smith      Gearge 12 Mainstreet  <NA>    1234 AB
 2:     Jonson          A. 61 Mainstreet     F    1234 AB
 3:    Johnson     Charles    61 Mainstr     F    1234 AB
 4:   Schwartz         Ben        1 Main     M    6789 XY
For cluster_pairs the steps above need to change a little, as select_n_to_m needs to consider all pairs and, therefore, does not work with objects of type cluster_pairs, where the pairs are distributed over the cluster nodes. Therefore, we first need to copy the relevant pairs to the main R process. We can use a selection variable for this, returning only the pairs with a weight above zero:
> cpairs <- predict(m, pairs = cpairs, add = TRUE)
> select_threshold(cpairs, "weights", variable = "initial", threshold = 0)
> local_cpairs <- cluster_collect(cpairs, "initial")
> local_cpairs <- select_n_to_m(local_cpairs, "weights", variable = "select")
The first line calculates weights for the cpairs object. In the second line a logical column 'initial' is created which is TRUE for records with a weight higher than 0. Using cluster_collect, we collect the pairs from the worker processes into the main R process. Using the second argument we only collect pairs for which the column 'initial' is TRUE. The local_cpairs object is a regular pairs object (and, therefore, also a data.table) on which we can use the regular select_n_to_m method.
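From here the final linked data set is created exactly as before, as local_cpairs is a regular pairs object:

# generate the final linked data set from the locally collected pairs
linked_data_set <- link(local_cpairs, selection = "select")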
It is easy to do manual manipulations on the regular pairs object. For cluster_pairs, the pairs are distributed over the worker nodes.
There are a couple of functions to help with this. The already mentioned cluster_collect function copies the pairs locally. The cluster_call function accepts the cluster_pairs and a function. It will call the function on each node and pass it the pairs, x and y. The results of the functions are copied back locally. For example, to get the number of pairs on each node:
> unlist(cluster_call(cpairs, \(p, ...) nrow(p)))
[1] 9 8
The cluster_modify_pairs function can be used to modify the pairs. The arguments are the cluster_pairs and a function with the same arguments as for cluster_call. The result of that function overwrites the pairs object on the worker node (except when NULL). In the example below, this is used to remove pairs with a weight of zero or lower.
> (cluster_modify_pairs(cpairs, \(p, ...) p[weights > 0, ]))
Cluster 'default' with size: 2
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 15 pairs
  Blocking on: 'postcode'
  Showing a random selection of pairs:

    .x .y lastname firstname   address   sex    weights initial
 1:  3  1 0.447619 0.4722222 0.8641026    NA  0.6017106    TRUE
 2:  4  3 1.000000 1.0000000 1.0000000 FALSE 15.4915816    TRUE
[...]
Note that the original cpairs object has been modified. Using the new_name argument it is also possible to generate a new set of pairs, leaving the original untouched.
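A sketch of this (the name "selected" is arbitrary):

# store the filtered pairs under a new name on the worker nodes,
# leaving the original 'default' pairs intact
cluster_modify_pairs(cpairs, \(p, ...) p[initial == TRUE],
  new_name = "selected")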
In this section the performance of the packages for data linkage is investigated using example data from the Eurostat-financed ESSnet (European Statistical System Centres and Networks of Excellence) project on Data Integration (Eurostat 2011). The two data sets 'PDR' and 'CIS' were linked to each other. The data sets have 24,750 and 24,613 records respectively, resulting in 609,171,750 possible pairs when linking the complete data sets. To study the effect of the size of the problem on the performance, samples were drawn from the two data sets based on the postcode, where care was taken to sample the same postcodes in the two data sets. The sample fraction was varied from 0.1 to 1.0 in steps of 0.1.
The methods are compared on total computation time and memory use. These were measured using the 'time' program on a Linux server. The virtual server has 16 3.2GHz Intel Xeon Gold 6146 cores and 512GB of memory and runs Ubuntu 20.04. For the time, the reported 'elapsed (wall clock) time' is used and for the memory usage the 'maximum resident set size'. For reclin2, the effect of different numbers of worker nodes was also investigated using the cluster functions. It was attempted to keep the methods used as close as possible to each other. No blocking was applied. The EM-algorithm was used. For comparing the names, the Jaro-Winkler similarity score was used with a threshold of 0.85. The quality of the resulting record linkage was recorded, but as the methods were not optimally tuned, it is difficult to compare these results and they are, therefore, not reported. The complete code for the benchmark and all results can be found on GitHub (Laan 2022).
Figures 1 and 2 show the computation times and the memory usage respectively as a function of the size of the problem. For the RecordLinkage package, larger problem sizes than reported here were not investigated as the system started running out of memory. RecordLinkage has the option to also work from disk for large problems. The performance of this was not investigated, as this would only lead to computation times even longer than those of the other packages. For this specific problem the fastLink package performs better than the other packages. Especially the memory usage is substantially lower because of its use of specialised data structures (Enamorado et al. 2019). The difference in computation time between fastLink and reclin2 using multiple cores is limited (approximately a factor 1.7 for the largest problem). However, fastLink is tailored for use with the EM-algorithm while the other packages are more general.
The speed-up of the reclin2 benchmark is not proportional to the number of cores used. This is caused mainly by the fact that some of the steps take place in the main process: reading and sampling the data, starting the worker nodes, and, importantly, finalising the record linkage using one-to-one matching. We are unfortunately limited by Amdahl's law (Amdahl 1967), although the generation, comparison and (in the case of the EM-algorithm) tabulation of pairs and calculation of the predictions will take an increasingly larger part of the running time as the size of the problem increases.
By using simple data structures, namely data.table objects, and providing a set of functions that operate on these structures, we have built a flexible and well performing toolkit for record linkage. For users, it is easy to extend the methods present in the package, either by manipulating the data structures directly, by writing custom functions, or by using existing functions. An example of the latter is the relative ease with which existing machine learning methods can be used in the record linkage process.
The package manages to keep memory use limited and on machines with 256GB of memory it should handle problems up to approximately \(10^9\) pairs. By combining the memory of multiple machines this can of course be extended. When one is only interested in using the Fellegi-Sunter model of record linkage with an EM-algorithm to estimate the parameters of that model, the fastLink package is probably the best choice. It performs better, and the model and EM-algorithm used are more flexible than those currently present in reclin2. The main advantage of reclin2 over fastLink is the flexibility reclin2 provides. Especially for non-standard problems this is important.
The package is still being developed. One of the things that is being worked on is the option to take into account the uniqueness of a certain attribute. For example, agreement on a rare family name is a stronger indication of a match than agreement on a common family name. However, we hope that making it easy for users to extend and modify the record linkage processes also lowers the threshold for contributing to the package.