Record linkage

Record linkage (also known as data matching, data linkage, entity resolution, and many other terms) is the task of finding records that refer to the same entity within a data set or across different data sources (e.g., data files, books, websites, and databases). Record linkage is necessary when joining different data sets based on entities that may or may not share a common identifier (e.g., database key, URI, national identification number), which may be due to differences in record shape, storage location, or curator style or preference. A data set that has undergone record-linkage-oriented reconciliation may be referred to as being cross-linked.

Naming conventions

"Record linkage" is the term used by statisticians, epidemiologists, and historians, among others, to describe the process of joining records from one data source with another that describe the same entity. However, many other terms are used for this process. Unfortunately, this profusion of terminology has led to few cross-references between these research communities.^[1]^[2]

Computer scientists often refer to it as "data matching" or as the "object identity problem". Commercial mail and database applications refer to it as "merge/purge processing" or "list washing". Other names used to describe the same concept include: "coreference/entity/identity/name/record resolution", "entity disambiguation/linking", "fuzzy matching", "duplicate detection", "deduplication", "record matching", "(reference) reconciliation", "object identification", "data/information integration" and "conflation".^[3]

While they share similar names, record linkage and linked data are two separate approaches to processing and structuring data. Although both involve identifying matching entities across different data sets, record linkage standardly equates "entities" with human individuals; by contrast, Linked Data is based on the possibility of interlinking any web resource across data sets, using a correspondingly broader concept of identifier, namely a URI.

History

The initial idea of record linkage goes back to Halbert L. Dunn in his 1946 article titled "Record Linkage" published in the American Journal of Public Health.^[4]

Howard Borden Newcombe then laid the probabilistic foundations of modern record linkage theory in a 1959 article in Science.^[5] These were formalized in 1969 by Ivan Fellegi and Alan Sunter, in their pioneering work "A Theory For Record Linkage", where they proved that the probabilistic decision rule they described was optimal when the comparison attributes were conditionally independent.^[6] In their work they recognized the growing interest in applying advances in computing and automation to large collections of administrative data, and the Fellegi-Sunter theory remains the mathematical foundation for many record linkage applications.

Since the late 1990s, various machine learning techniques have been developed that can, under favorable conditions, be used to estimate the conditional probabilities required by the Fellegi-Sunter theory. Several researchers have reported that the conditional independence assumption of the Fellegi-Sunter algorithm is often violated in practice; however, published efforts to explicitly model the conditional dependencies among the comparison attributes have not resulted in an improvement in record linkage quality.^{[citation needed]} On the other hand, machine learning or neural network algorithms that do not rely on these assumptions often provide far higher accuracy, when sufficient labeled training data is available.^[7]

Record linkage can be done entirely without the aid of a computer, but the primary reasons computers are often used to complete record linkages are to reduce or eliminate manual review and to make results more easily reproducible. Computer matching has the advantages of allowing central supervision of processing, better quality control, speed, consistency, and better reproducibility of results.^[8]

Methods

Data preprocessing

Record linkage is highly sensitive to the quality of the data being linked, so all data sets under consideration (particularly their key identifier fields) should ideally undergo a data quality assessment prior to record linkage. Many key identifiers for the same entity can be presented quite differently between (and even within) data sets, which can greatly complicate record linkage unless understood ahead of time. For example, key identifiers for a man named William J. Smith might appear in three different data sets as so:

Data set	Name	Date of birth	City of residence
Data set 1	William J. Smith	1/2/73	Berkeley, California
Data set 2	Smith, W. J.	1973.1.2	Berkeley, CA
Data set 3	Bill Smith	Jan 2, 1973	Berkeley, Calif.

In this example, the different formatting styles lead to records that look different but in fact all refer to the same entity with the same logical identifier values. Most, if not all, record linkage strategies would result in more accurate linkage if these values were first normalized or standardized into a consistent format (e.g., all names are "Surname, Given name", and all dates are "YYYY/MM/DD"). Standardization can be accomplished through simple rule-based data transformations or more complex procedures such as lexicon-based tokenization and probabilistic hidden Markov models.^[9] Several of the packages listed in the Software Implementations section provide some of these features to simplify the process of data standardization.
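The rule-based end of this spectrum can be sketched in a few lines of Python. The sketch below is only an illustration built around the three example rows: the function names, the list of date layouts, the month/day reading of "1/2/73", and the rule that a name without a comma ends with the surname are all assumptions, not part of any particular package.

```python
from datetime import datetime

# Date layouts seen in the example table (assumed; a real data set needs its own list).
# "%m/%d/%y" assumes US month/day order, and "%b" assumes English month abbreviations.
DATE_FORMATS = ("%m/%d/%y", "%Y.%m.%d", "%b %d, %Y")


def standardize_date(raw):
    """Return the date as YYYY/MM/DD, trying each known layout in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y/%m/%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def standardize_name(raw):
    """Return the name as 'Surname, Given name(s)'.

    Assumes that a name written without a comma ends with the surname.
    """
    raw = " ".join(raw.split())  # collapse repeated whitespace
    if "," in raw:
        surname, given = (part.strip() for part in raw.split(",", 1))
    else:
        *given_parts, surname = raw.split(" ")
        given = " ".join(given_parts)
    return f"{surname}, {given}"


for name, dob in [
    ("William J. Smith", "1/2/73"),
    ("Smith, W. J.", "1973.1.2"),
    ("Bill Smith", "Jan 2, 1973"),
]:
    print(standardize_name(name), "|", standardize_date(dob))
```

All three dates come out as 1973/01/02, while the names still differ ("Smith, William J.", "Smith, W. J.", "Smith, Bill"), which is why standardization alone does not complete the linkage.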

Entity resolution

Entity resolution is an operational intelligence process, typically powered by an entity resolution engine or middleware, whereby organizations can connect disparate data sources with a view to understanding possible entity matches and non-obvious relationships across multiple data silos. It analyzes all of the information relating to individuals and/or entities from multiple sources of data, and then applies likelihood and probability scoring to determine which identities are a match and what, if any, non-obvious relationships exist between those identities.

Entity resolution engines are typically used to uncover risk, fraud, and conflicts of interest, but are also useful tools for use within customer data integration (CDI) and master data management (MDM) requirements. Typical uses for entity resolution engines include terrorist screening, insurance fraud detection, USA Patriot Act compliance, organized retail crime ring detection and applicant screening.

For example: Across different data silos – employee records, vendor data, watch lists, etc. – an organization may have several variations of an entity named ABC, which may or may not be the same individual. These entries may, in fact, appear as ABC1, ABC2, or ABC3 within those data sources. By comparing similarities between underlying attributes such as address, date of birth, or social security number, the user can eliminate some possible matches and confirm others as very likely matches.

Entity resolution engines then apply rules, based on common sense logic, to identify hidden relationships across the data. In the example above, perhaps ABC1 and ABC2 are not the same individual, but rather two distinct people who share common attributes such as address or phone number.

Data matching

While entity resolution solutions include data matching technology, many data matching offerings do not fit the definition of entity resolution. Here are four factors that distinguish entity resolution from data matching, according to John Talburt, director of the UALR Center for Advanced Research in Entity Resolution and Information Quality:

Works with both structured and unstructured records, and it entails the process of extracting references when the sources are unstructured or semi-structured
Uses elaborate business rules and concept models to deal with missing, conflicting, and corrupted information
Utilizes non-matching, asserted linking (associate) information in addition to direct matching
Uncovers non-obvious relationships and association networks (i.e. who's associated with whom)

In contrast to data quality products, more powerful identity resolution engines also include a rules engine and workflow process, which apply business intelligence to the resolved identities and their relationships. These advanced technologies make automated decisions and impact business processes in real time, limiting the need for human intervention.

Deterministic record linkage

The simplest kind of record linkage, called deterministic or rules-based record linkage, generates links based on the number of individual identifiers that match among the available data sets.^[10] Two records are said to match via a deterministic record linkage procedure if all or some identifiers (above a certain threshold) are identical. Deterministic record linkage is a good option when the entities in the data sets are identified by a common identifier, or when there are several representative identifiers (e.g., name, date of birth, and sex when identifying a person) whose quality of data is relatively high.

As an example, consider two standardized data sets, Set A and Set B, that contain different bits of information about patients in a hospital system. The two data sets identify patients using a variety of identifiers: Social Security Number (SSN), name, date of birth (DOB), sex, and ZIP code (ZIP). The records in two data sets (identified by the "#" column) are shown below:

Data Set	#	SSN	Name	DOB	Sex	ZIP
Set A	1	000956723	Smith, William	1973/01/02	Male	94701
Set A	2	000956723	Smith, William	1973/01/02	Male	94703
Set A	3	000005555	Jones, Robert	1942/08/14	Male	94701
Set A	4	123001234	Sue, Mary	1972/11/19	Female	94109
Set B	1	000005555	Jones, Bob	1942/08/14
Set B	2		Smith, Bill	1973/01/02	Male	94701

The simplest deterministic record linkage strategy would be to pick a single identifier that is assumed to be uniquely identifying, say SSN, and declare that records sharing the same value identify the same person while records not sharing the same value identify different people. In this example, deterministic linkage based on SSN would create entities based on A1 and A2; A3 and B1; and A4. While A1, A2, and B2 appear to represent the same entity, B2 would not be included in the match because it is missing a value for SSN.
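A minimal sketch of this single-identifier strategy is shown below. The record labels (A1 for Set A, record 1, and so on) follow the text; the function name and the data layout are assumptions made for the illustration.

```python
from collections import defaultdict

# (record id, SSN) pairs taken from the example tables; None marks a missing SSN.
records = [
    ("A1", "000956723"),
    ("A2", "000956723"),
    ("A3", "000005555"),
    ("A4", "123001234"),
    ("B1", "000005555"),
    ("B2", None),
]


def link_by_ssn(rows):
    """Group record ids by identical SSN; records without an SSN stay unlinked."""
    entities = defaultdict(list)
    unlinked = []
    for record_id, ssn in rows:
        if ssn:
            entities[ssn].append(record_id)
        else:
            unlinked.append(record_id)
    return list(entities.values()), unlinked


entities, unlinked = link_by_ssn(records)
print(entities)  # [['A1', 'A2'], ['A3', 'B1'], ['A4']]
print(unlinked)  # ['B2']
```

The output reproduces the grouping described above, with B2 left out because its SSN is missing.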

Handling exceptions such as missing identifiers involves the creation of additional record linkage rules. One such rule in the case of missing SSN might be to compare name, date of birth, sex, and ZIP code with other records in hopes of finding a match. In the above example, this rule would still not match A1/A2 with B2 because the names are still slightly different: standardization put the names into the proper (Surname, Given name) format but could not discern "Bill" as a nickname for "William". Running names through a phonetic algorithm such as Soundex, NYSIIS, or Metaphone can help to resolve these types of problems, though such algorithms may still stumble over surname changes resulting from marriage or divorce. Even then, B2 would be matched only with A1, since the ZIP code in A2 is different. Thus, another rule would need to be created to determine which differences in particular identifiers are acceptable (such as ZIP code) and which are not (such as date of birth).
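To make the phonetic step concrete, here is a compact sketch of the American Soundex coding rules. It is written from the commonly published rules rather than taken from any specific package, and the function name is an assumption.

```python
# Letter groups and their Soundex digits; vowels, H, W, and Y have no digit.
_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the four-character American Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code
    return (result + "000")[:4]


print(soundex("Smith"), soundex("Smyth"))    # S530 S530
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Bill"), soundex("William"))   # B400 W450
```

In this sketch spelling variants such as "Smith" and "Smyth" share a code, whereas "Bill" and "William" do not, so a nickname lookup table (not shown) would likely still be needed to link record B2.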

As this example demonstrates, even a small decrease in data quality or small increase in the complexity of the data can result in a very large increase in the number of rules necessary to link records properly. Eventually, these linkage rules will become too numerous and interrelated to build without the aid of specialized software tools. In addition, linkage rules are often specific to the nature of the data sets they are designed to link together. One study was able to link the Social Security Death Master File with two hospital registries from the Midwestern United States using SSN, NYSIIS-encoded first name, birth month, and sex, but these rules may not work as well with data sets from other geographic regions or with data collected on younger populations.^[11] Thus, continuous maintenance testing of these rules is necessary to ensure they continue to function as expected as new data enter the system and need to be linked. New data that exhibit different characteristics than was initially expected could require a complete rebuilding of the record linkage rule set, which could be a very time-consuming and expensive endeavor.

Probabilistic record linkage

Probabilistic record linkage, sometimes called fuzzy matching (also probabilistic merging or fuzzy merging in the context of merging of databases), takes a different approach to the record linkage problem by taking into account a wider range of potential identifiers, computing weights for each identifier based on its estimated ability to correctly identify a match or a non-match, and using these weights to calculate the probability that two given records refer to the same entity. Record pairs with probabilities above a certain threshold are considered to be matches, while pairs with probabilities below another threshold are considered to be non-matches; pairs that fall between these two thresholds are considered to be "possible matches" and can be dealt with accordingly (e.g., human reviewed, linked, or not linked, depending on the requirements). Whereas deterministic record linkage requires a series of potentially complex rules to be programmed ahead of time, probabilistic record linkage methods can be "trained" to perform well with much less human intervention.

Many probabilistic record linkage algorithms assign match/non-match weights to identifiers by means of two probabilities called $u$ and $m$. The $u$ probability is the probability that an identifier in two non-matching records will agree purely by chance. For example, the $u$ probability for birth month (where there are twelve values that are approximately uniformly distributed) is $1/12\approx 0.083$; identifiers with values that are not uniformly distributed will have different $u$ probabilities for different values (possibly including missing values). The $m$ probability is the probability that an identifier in matching pairs will agree (or be sufficiently similar, such as strings with low Jaro-Winkler or Levenshtein distance). This value would be $1.0$ in the case of perfect data, but given that this is rarely (if ever) true, it can instead be estimated. This estimation may be done based on prior knowledge of the data sets, by manually identifying a large number of matching and non-matching pairs to "train" the probabilistic record linkage algorithm, or by iteratively running the algorithm to obtain closer estimations of the $m$ probability. If a value of $0.95$ were to be estimated for the $m$ probability, then the match/non-match weights for the birth month identifier would be:

Outcome	Proportion of links	Proportion of non-links	Frequency ratio	Weight
Match	$m=0.95$	$u\approx 0.083$	$m/u\approx 11.4$	$\log _{2}(m/u)\approx 3.51$
Non-match	$1-m=0.05$	$1-u\approx 0.917$	$(1-m)/(1-u)\approx 0.0545$	$\log _{2}((1-m)/(1-u))\approx -4.20$
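The birth month figures can be reproduced with a short calculation. The sketch below assumes only the values stated in the text ($m=0.95$, $u=1/12$); the function name is made up for the illustration.

```python
import math


def agreement_weights(m, u):
    """Return (match weight, non-match weight) as base-2 logs of the frequency ratios."""
    match_weight = math.log2(m / u)
    non_match_weight = math.log2((1 - m) / (1 - u))
    return match_weight, non_match_weight


# Birth month: m estimated at 0.95, u = 1/12 for twelve roughly equally likely values.
w_match, w_non_match = agreement_weights(m=0.95, u=1 / 12)
print(round(w_match, 2), round(w_non_match, 2))  # 3.51 -4.2
```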

The same calculations would be done for all other identifiers under consideration to find their match/non-match weights. Then, every identifier of one record would be compared with the corresponding identifier of another record to compute the total weight of the pair: the match weight is added to the running total whenever a pair of identifiers agree, while the non-match weight is added (i.e. the running total decreases) whenever the pair of identifiers disagrees. The resulting total weight is then compared to the aforementioned thresholds to determine whether the pair should be linked, non-linked, or set aside for special consideration (e.g. manual validation).^[12]
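The scoring and classification step described above might look roughly as follows. Only the birth month weights come from the text; the other weights, both thresholds, the field names, and the sample records are hypothetical placeholders chosen for the illustration.

```python
# Per-identifier (match weight, non-match weight).
WEIGHTS = {
    "birth_month": (3.51, -4.20),  # from the worked example
    "surname": (6.0, -3.0),        # hypothetical
    "sex": (1.0, -2.0),            # hypothetical
}

# Hypothetical thresholds; in practice they are tuned for the data at hand.
UPPER_THRESHOLD = 6.0
LOWER_THRESHOLD = 0.0


def total_weight(record_a, record_b):
    """Add the match weight for each agreeing identifier, else the non-match weight."""
    total = 0.0
    for field, (match_w, non_match_w) in WEIGHTS.items():
        total += match_w if record_a[field] == record_b[field] else non_match_w
    return total


def classify(weight):
    """Map a total weight to one of the three outcomes described in the text."""
    if weight > UPPER_THRESHOLD:
        return "link"
    if weight < LOWER_THRESHOLD:
        return "non-link"
    return "possible link"


a = {"birth_month": 1, "surname": "Smith", "sex": "M"}
b = {"birth_month": 1, "surname": "Smyth", "sex": "M"}
c = {"birth_month": 8, "surname": "Jones", "sex": "F"}

for other in (a, b, c):
    weight = total_weight(a, other)
    print(round(weight, 2), classify(weight))
```

With these placeholder numbers the three comparisons score 10.51 (link), 1.51 (possible link), and -9.2 (non-link). The sketch treats a missing value like any other value, which a real system would usually handle separately.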

Blocking

Determining where to set the match/non-match thresholds is a balancing act between obtaining an acceptable sensitivity (or recall, the proportion of truly matching records that are linked by the algorithm) and positive predictive value (or precision, the proportion of records linked by the algorithm that truly do match). Various manual and automated methods are available to predict the best thresholds, and some record linkage software packages have built-in tools to help the user find the most acceptable values. Because this can be a very computationally demanding task, particularly for large data sets, a technique known as blocking is often used to improve efficiency. Blocking attempts to restrict comparisons to just those records for which one or more particularly discriminating identifiers agree, which has the effect of increasing the positive predictive value (precision) at the expense of sensitivity (recall).^[12] For example, blocking based on a phonetically coded surname and ZIP code would reduce the total number of comparisons required and would improve the chances that linked records would be correct (since two identifiers already agree), but would potentially miss records referring to the same person whose surname or ZIP code was different (due to marriage or relocation, for instance). Blocking based on birth month, a more stable identifier that would be expected to change only in the case of data error, would provide a more modest gain in positive predictive value and loss in sensitivity, but would create only twelve distinct groups which, for extremely large data sets, may not provide much net improvement in computation speed. Thus, robust record linkage systems often use multiple blocking passes to group data in various ways in order to come up with groups of records that should be compared to each other.
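A multi-pass blocking scheme of this kind can be sketched as below. The records, field names, and key functions are hypothetical; "surname_code" stands in for a phonetically coded surname.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_funcs):
    """Return the set of record-id pairs that share a blocking key in any pass."""
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)  # one fresh set of blocks per pass
        for record in records:
            blocks[key_func(record)].append(record["id"])
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def by_surname_and_zip(record):
    return (record["surname_code"], record["zip"])


def by_birth_month(record):
    return record["birth_month"]


records = [
    {"id": 1, "surname_code": "S530", "zip": "94701", "birth_month": 1},
    {"id": 2, "surname_code": "S530", "zip": "94701", "birth_month": 1},
    {"id": 3, "surname_code": "J520", "zip": "94701", "birth_month": 8},
    {"id": 4, "surname_code": "S530", "zip": "94109", "birth_month": 1},
]

single_pass = candidate_pairs(records, [by_surname_and_zip])
two_passes = candidate_pairs(records, [by_surname_and_zip, by_birth_month])
print(sorted(single_pass))  # [(1, 2)]
print(sorted(two_passes))   # [(1, 2), (1, 4), (2, 4)]
```

Without blocking, four records would require six comparisons. The first pass alone keeps one pair and misses record 4, whose ZIP code differs; adding the birth month pass recovers it at the cost of two more comparisons.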

Machine learning

In recent years, a variety of machine learning techniques have been used in record linkage. It has been recognized^[7] that the classic Fellegi-Sunter algorithm for probabilistic record linkage outlined above is equivalent to the Naive Bayes algorithm in the field of machine learning,^[13] and suffers from the same assumption of the independence of its features (an assumption that is typically not true).^[14]^[15] Higher accuracy can often be achieved by using various other machine learning techniques, including a single-layer perceptron,^[7] random forest, and SVM.^[16] In conjunction with distributed technologies,^[17] accuracy and scale for record linkage can be improved further.
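As a rough illustration of the simplest of these learners, the sketch below trains a single-layer perceptron on binary agreement vectors. The training data are invented (a pair is labeled a match when surname and date of birth agree), the feature order is an assumption, and the code is not the method of any cited paper.

```python
def predict(weights, bias, features):
    """Return 1 (match) if the weighted sum of agreements is positive, else 0."""
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0


def train_perceptron(examples, epochs=100, learning_rate=1.0):
    """Classic perceptron rule: adjust the weights only on misclassified pairs."""
    n_features = len(examples[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in examples:
            error = label - predict(weights, bias, features)
            if error:
                mistakes += 1
                bias += learning_rate * error
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
        if mistakes == 0:  # every training pair is classified correctly
            break
    return weights, bias


# Hypothetical labeled pairs: (surname agrees, DOB agrees, ZIP agrees) -> match?
training = [
    ((1, 1, 1), 1), ((1, 1, 0), 1), ((1, 0, 1), 0), ((0, 1, 1), 0),
    ((0, 0, 0), 0), ((1, 0, 0), 0), ((0, 1, 0), 0),
]

weights, bias = train_perceptron(training)
print(weights, bias)
print([predict(weights, bias, x) for x, _ in training])
```

Unlike the fixed per-identifier weights of the probabilistic approach, the weights here are fitted jointly from labeled pairs, which is why such methods depend on sufficient training data.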

Human-machine hybrid record linkage

High-quality record linkage often requires a human–machine hybrid system to safely manage uncertainty in the ever-changing streams of chaotic big data.^[18]^[19] Because linkage errors propagate into the linked data and its analysis, interactive record linkage systems have been proposed. Interactive record linkage is defined as people iteratively fine-tuning the results from the automated methods and managing the uncertainty and its propagation to subsequent analyses.^[20] The main objective of interactive record linkage systems is to manually resolve uncertain linkages and validate the results until they reach acceptable levels for the given application. Variations of interactive record linkage that enhance privacy during the human interaction steps have also been proposed.^[21]^[22]

Privacy-preserving record linkage

Record linkage is increasingly required across databases held by different organisations, where the complementary data held by these organisations can, for example, help to identify patients that are susceptible to certain adverse drug reactions (linking hospital, doctor, pharmacy databases). In many such applications, however, the databases to be linked contain sensitive information about people which cannot be shared between the organisations.^[23]

Privacy-preserving record linkage (PPRL) methods have been developed with the aim of linking databases without the need to share the original sensitive values between the organisations that participate in a linkage.^[24]^[25] In PPRL, the attribute values of records to be compared are generally encoded or encrypted in some form. A popular encoding technique is the Bloom filter,^[26] which allows approximate similarities to be calculated between encoded values without the need to share the corresponding sensitive plain-text values. At the end of the PPRL process only limited information about the record pairs classified as matches is revealed to the organisations that participate in the linkage process. The techniques used in PPRL^[24] must guarantee that no participating organisation, nor any external adversary, can compromise the privacy of the entities that are represented by records in the databases being linked.^[27]
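The idea of comparing Bloom-filter encodings can be sketched as follows. This is a simplified illustration, not a secure implementation: the filter length, the number of hash functions, the use of padded character bigrams, and the unkeyed SHA-256 hashing are all assumptions made for brevity, whereas practical schemes typically rely on secret keys shared by the participating organisations.

```python
import hashlib

FILTER_BITS = 1024  # hypothetical filter length
NUM_HASHES = 2      # hypothetical number of hash functions per bigram


def bigrams(value):
    """Split a string into padded character bigrams, e.g. 'ab' -> {'_a', 'ab', 'b_'}."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(value):
    """Return the set of bit positions set to 1 in the value's Bloom filter."""
    positions = set()
    for gram in bigrams(value):
        for seed in range(NUM_HASHES):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            positions.add(int.from_bytes(digest, "big") % FILTER_BITS)
    return positions


def dice_similarity(bits_a, bits_b):
    """Dice coefficient of two Bloom filters given as sets of set-bit positions."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


smith, smyth, jones = encode("Smith"), encode("Smyth"), encode("Jones")
print(dice_similarity(smith, smith))
print(dice_similarity(smith, smyth))
print(dice_similarity(smith, jones))
```

Only the bit positions would be exchanged, never the names themselves, yet similar names still share most of their set bits because they share most of their bigrams.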

Mathematical model

In an application with two files, A and B, denote the rows (records) by $\alpha (a)$ in file A and $\beta (b)$ in file B. Assign $K$ characteristics to each record. The set of records that represent identical entities is defined by

$M=\left\{(a,b);a=b;a\in A;b\in B\right\}$

and the complement of set $M$, namely set $U$ representing different entities, is defined as

$U=\{(a,b);a\neq b;a\in A;b\in B\}$.

A vector $\gamma$ is defined that contains the coded agreements and disagreements on each characteristic:

$\gamma \left[\alpha (a),\beta (b)\right]=\{\gamma ^{1}\left[\alpha (a),\beta (b)\right],\ldots ,\gamma ^{K}\left[\alpha (a),\beta (b)\right]\}$

where $K$ is the number of characteristics (sex, age, marital status, etc.) in the files, indexed by the superscript. The conditional probabilities of observing a specific vector $\gamma$ given $(a,b)\in M$ and $(a,b)\in U$ are defined as

$m(\gamma )=P\left\{\gamma \left[\alpha (a),\beta (b)\right]|(a,b)\in M\right\}=\sum _{(a,b)\in M}P\left\{\gamma \left[\alpha (a),\beta (b)\right]\right\}\cdot P\left[(a,b)|M\right]$

and

$u(\gamma )=P\left\{\gamma \left[\alpha (a),\beta (b)\right]|(a,b)\in U\right\}=\sum _{(a,b)\in U}P\left\{\gamma \left[\alpha (a),\beta (b)\right]\right\}\cdot P\left[(a,b)|U\right],$

respectively.^[6]
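One way to connect this model to the per-identifier match and non-match weights is sketched below. It assumes that the components of $\gamma$ are conditionally independent given $M$ and given $U$, and that each component is a binary agreement indicator; the per-characteristic symbols $m_k$ and $u_k$ are notation introduced here for illustration.

```latex
% Assumptions: the K components of \gamma are conditionally independent given M
% and given U, and each \gamma^{k} is 1 (agreement) or 0 (disagreement).
% m_k and u_k are the per-characteristic agreement probabilities among
% matches and non-matches (notation introduced for this sketch).
\begin{align*}
m(\gamma) &= \prod_{k=1}^{K} m_k^{\gamma^{k}} (1-m_k)^{1-\gamma^{k}}, \qquad
u(\gamma) = \prod_{k=1}^{K} u_k^{\gamma^{k}} (1-u_k)^{1-\gamma^{k}}, \\
\log_2 \frac{m(\gamma)}{u(\gamma)}
  &= \sum_{k=1}^{K} \left[ \gamma^{k} \log_2 \frac{m_k}{u_k}
     + (1-\gamma^{k}) \log_2 \frac{1-m_k}{1-u_k} \right].
\end{align*}
```

Under these assumptions the total weight of a record pair is the sum of one term per characteristic: $\log_2(m_k/u_k)$ when the characteristic agrees and $\log_2((1-m_k)/(1-u_k))$ when it disagrees.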

Applications

Master data management

Most master data management (MDM) products use a record linkage process to identify records from different sources representing the same real-world entity. This linkage is used to create a "golden master record" containing the cleaned, reconciled data about the entity. The techniques used in MDM are the same as for record linkage generally. MDM extends this matching beyond creating a "golden master record" to inferring relationships as well (e.g., if two people have the same or a similar surname and the same or a similar address, this might imply that they share a household).

Data warehousing and business intelligence

Record linkage plays a key role in data warehousing and business intelligence. Data warehouses serve to combine data from many different operational source systems into one logical data model, which can then be subsequently fed into a business intelligence system for reporting and analytics. Each operational source system may have its own method of identifying the same entities used in the logical data model, so record linkage between the different sources becomes necessary to ensure that the information about a particular entity in one source system can be seamlessly compared with information about the same entity from another source system. Data standardization and subsequent record linkage often occur in the "transform" portion of the extract, transform, load (ETL) process.

Historical research

Record linkage is important to social history research since most data sets, such as census records and parish registers, were recorded long before the invention of national identification numbers. When old sources are digitized, linking of data sets is a prerequisite for longitudinal study. This process is often further complicated by a lack of standard spelling of names, family names that change according to place of dwelling, changing administrative boundaries, and problems of checking the data against other sources. Record linkage was among the most prominent themes in the History and computing field in the 1980s, but has since received less attention in research.^{[citation needed]}

Medical practice and research

Record linkage is an important tool in creating data required for examining the health of the public and of the health care system itself. It can be used to improve data holdings, data collection, quality assessment, and the dissemination of information. Data sources can be examined to eliminate duplicate records, to identify under-reporting and missing cases (e.g., census population counts), to create person-oriented health statistics, and to generate disease registries and health surveillance systems. Some cancer registries link various data sources (e.g., hospital admissions, pathology and clinical reports, and death registrations) to generate their registries. Record linkage is also used to create health indicators. For example, fetal and infant mortality is a general indicator of a country's socioeconomic development, public health, and maternal and child services. If infant death records are matched to birth records, it is possible to use birth variables, such as birth weight and gestational age, along with mortality data, such as cause of death, in analyzing the data. Linkages can help in follow-up studies of cohorts or other groups to determine factors such as vital status, residential status, or health outcomes. Tracing is often needed for follow-up of industrial cohorts, clinical trials, and longitudinal surveys to obtain the cause of death and/or cancer. An example of a successful and long-standing record linkage system allowing for population-based medical research is the Rochester Epidemiology Project based in Rochester, Minnesota.^[28]

Criticism of existing software implementations

The main criticisms cited of existing record linkage software are:^{[citation needed]}

Project costs: typically in the hundreds of thousands of dollars
Time: insufficient time to deal with large-scale data cleansing software
Security: concerns over sharing information, giving an application access across systems, and effects on legacy systems
Scalability: because records lack unique identifiers, record linkage is computationally expensive and difficult to scale.^[29]
Accuracy: keeping up with changing business data and capturing all the rules for linking is a difficult and extensive exercise
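The scalability concern can be made concrete with a small sketch. Without a shared identifier, every record in one source may have to be compared with every record in the other, so the number of comparisons grows with the product of the two file sizes. One commonly used mitigation, shown here purely as an illustration and not as a description of any particular product, is blocking: only records that agree on a cheap key are compared. The record fields and the blocking key (surname initial plus birth year) are hypothetical choices.

```python
from collections import defaultdict
from itertools import product


def all_pairs(left, right):
    """Every cross-source pair: len(left) * len(right) comparisons."""
    return list(product(left, right))


def blocked_pairs(left, right, key):
    """Only pairs whose blocking key agrees.

    Fewer comparisons, at the risk of missing true matches whose keys differ.
    """
    blocks = defaultdict(list)
    for record in right:
        blocks[key(record)].append(record)
    return [(l, r) for l in left for r in blocks.get(key(l), [])]


def initial_and_year(record):
    """Hypothetical blocking key: first letter of the surname plus birth year."""
    return (record["surname"][0].upper(), record["birth_year"])


# Hypothetical records for illustration only.
left = [
    {"surname": "Smith", "birth_year": 1901},
    {"surname": "Jones", "birth_year": 1910},
]
right = [
    {"surname": "Smyth", "birth_year": 1901},
    {"surname": "Jones", "birth_year": 1910},
    {"surname": "Brown", "birth_year": 1899},
]

full = all_pairs(left, right)
reduced = blocked_pairs(left, right, initial_and_year)
```

The trade-off is visible in the key itself: a record whose surname initial or birth year was recorded incorrectly would fall into a different block and never be compared with its true match.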

Notes and references

[1] "Cristen, P & T: Febrl - Freely extensible biomedical record linkage (Manual, release 0.3) p.9". Archived from the original on 2016-03-11. Retrieved 2006-04-21.
[2] Elmagarmid, Ahmed; Panagiotis G. Ipeirotis; Vassilios Verykios (January 2007). "Duplicate Record Detection: A Survey" (PDF). IEEE Transactions on Knowledge and Data Engineering. 19 (1): 1–16. Bibcode:2007ITKDE..19E0581E. doi:10.1109/tkde.2007.250581. S2CID 386036. Retrieved 2009-03-30.
[3] Singla, Parag; Domingos, Pedro (December 2006). "Entity Resolution with Markov Logic" (PDF). Sixth International Conference on Data Mining (ICDM'06). pp. 572–582. doi:10.1109/ICDM.2006.65. ISBN 9780769527024. S2CID 12211870. Retrieved 1 March 2023.
[4] Dunn, Halbert L. (December 1946). "Record Linkage". American Journal of Public Health. 36 (12): 1412–1416. doi:10.2105/AJPH.36.12.1412. PMC 1624512. PMID 18016455.
[5] Newcombe, H. B.; J.M. Kennedy; S.J. Axford; A. P. James (October 1959). "Automatic Linkage of Vital Records". Science. 130 (3381): 954–959. Bibcode:1959Sci...130..954N. doi:10.1126/science.130.3381.954. PMID 14426783.
[6] Fellegi, Ivan; Sunter, Alan (December 1969). "A Theory for Record Linkage" (PDF). Journal of the American Statistical Association. 64 (328): 1183–1210. doi:10.2307/2286061. JSTOR 2286061.
[7] Wilson, D. Randall (July 31 – August 5, 2011). Beyond Probabilistic Record Linkage: Using Neural Networks and Complex Features to Improve Genealogical Record Linkage (PDF). Proceedings of International Joint Conference on Neural Networks. San Jose, California, USA.
[8] Winkler, William E. "Matching and Record Linkage" (PDF). U.S. Bureau of the Census. Retrieved 12 November 2011.
[9] Churches, Tim; Peter Christen; Kim Lim; Justin Xi Zhu (13 December 2002). "Preparation of name and address data for record linkage using hidden Markov models". BMC Medical Informatics and Decision Making. 2: 9. doi:10.1186/1472-6947-2-9. PMC 140019. PMID 12482326.
[10] Roos, LL; Wajda A (April 1991). "Record linkage strategies. Part I: Estimating information and evaluating approaches". Methods of Information in Medicine. 30 (2): 117–123. doi:10.1055/s-0038-1634828. PMID 1857246. S2CID 23501719.
[11] Grannis, SJ; Overhage JM; McDonald CJ (2002). "Analysis of identifier performance using a deterministic linkage algorithm". Proc AMIA Symp.: 305–9. PMC 2244404. PMID 12463836.
[12] Blakely, Tony; Salmond, Clare (December 2002). "Probabilistic record linkage and a method to calculate the positive predictive value". International Journal of Epidemiology. 31 (6): 1246–1252. doi:10.1093/ije/31.6.1246. PMID 12540730.
[13] Quass, Dallan, and Starkey, Paul. "Record Linkage for Genealogical Databases," ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, August 24–27, 2003, Washington, D.C.
[14] Langley, Pat, Wayne Iba, and Kevin Thompson. "An Analysis of Bayesian Classifiers," In Proceedings of the 10th National Conference on Artificial Intelligence, (AAAI-92), AAAI Press/MIT Press, Cambridge, MA, pp. 223–228, 1992.
[15] Michie, D.; Spiegelhalter, D.; Taylor, C. (1994). Machine Learning, Neural and Statistical Classification. Hertfordshire, England: Ellis Horwood. ISBN 0-13-106360-X.
[16] Ramezani, M.; Ilangovan, G.; Kum, H-C. (2021). Evaluation of machine learning algorithms in a human-computer hybrid record linkage system (PDF). Vol. 2846. CEUR workshop proceedings.
[17] "Fuzzy Matching With Spark". Spark Summit. 17 July 2014.
[18] Bronstein, Janet M.; Lomatsch, Charles T.; Fletcher, David; Wooten, Terri; Lin, Tsai Mei; Nugent, Richard; Lowery, Curtis L. (2008-05-01). "Issues and Biases in Matching Medicaid Pregnancy Episodes to Vital Records Data: The Arkansas Experience". Maternal and Child Health Journal. 13 (2): 250–259. doi:10.1007/s10995-008-0347-z. ISSN 1092-7875. PMID 18449631. S2CID 22259447.
[19] Boscoe, Francis P.; Schrag, Deborah; Chen, Kun; Roohan, Patrick J.; Schymura, Maria J. (2010-12-15). "Building Capacity to Assess Cancer Care in the Medicaid Population in New York State". Health Services Research. 46 (3): 805–820. doi:10.1111/j.1475-6773.2010.01221.x. ISSN 0017-9124. PMC 3087842. PMID 21158856.
[20] Kum, Hye-Chung; Krishnamurthy, Ashok; Machanavajjhala, Ashwin; Reiter, Michael K; Ahalt, Stanley (March 2014). "Privacy preserving interactive record linkage (PPIRL)". Journal of the American Medical Informatics Association. 21 (2): 212–220. doi:10.1136/amiajnl-2013-002165. ISSN 1067-5027. PMC 3932473. PMID 24201028.
[21] Kum, H-C.; Ragan, E.; Ilangovan, G.; Ramezani, M.; Li, Q.; Schmit, C. (2019). Enhancing Privacy through an Interactive On-demand Incremental Information Disclosure Interface: Applying Privacy-by-Design to Record Linkage (PDF). Fifteenth Symposium on Usable Privacy and Security (SOUPS). pp. 175–189. ISBN 978-1-939133-05-2.
[22] Ragan, Eric D.; Kum, Hye-Chung; Ilangovan, Gurudev; Wang, Han (2018-04-21). "Balancing Privacy and Information Disclosure in Interactive Record Linkage with Visual Masking". Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. New York, NY, USA: ACM. pp. 1–12. doi:10.1145/3173574.3173900. ISBN 9781450356206. S2CID 5051254.
[23] Vatsalan, D; Sehili, Z; Christen, P; Rahm, E (2017). "Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges". Handbook of Big Data Technologies. pp. 851–895. doi:10.1007/978-3-319-49340-4_25. hdl:1885/247396. ISBN 978-3-319-49339-8.
[24] Christen, P; Ranbaduge, T; Schnell, R (2020). Linking Sensitive Data: Methods and Techniques for Practical Privacy-Preserving Information Sharing. Heidelberg: Springer. doi:10.1007/978-3-030-59706-1. ISBN 978-3-030-59706-1. S2CID 222821833.
[25] Gkoulalas-Divanis, A; Vatsalan, D; Karapiperis, D; Kantarcioglu, M (2021). "Modern Privacy-Preserving Record Linkage Techniques: An Overview". IEEE Transactions on Information Forensics and Security. 16: 4966–4987. Bibcode:2021ITIF...16.4966G. doi:10.1109/TIFS.2021.3114026. S2CID 239088979.
[26] Schnell, R; Bachteler, T; Reiher, J (2009). "Privacy-Preserving Record Linkage using Bloom filters". BMC Medical Informatics and Decision Making. 9: 41. doi:10.1186/1472-6947-9-41. PMC 2753305. PMID 19706187.
[27] Vidanage, A (2022). Efficient Cryptanalysis Techniques for Privacy-Preserving Record Linkage (Thesis). Canberra: Australian National University. doi:10.25911/VSBZ-A727. hdl:1885/254502.
[28] St. Sauver JL; Grossardt BR; Yawn BP; Melton LJ 3rd; Pankratz JJ; Brue SM; Rocca WA (2012). "Data Resource Profile: The Rochester Epidemiology Project (REP) medical records-linkage system". Int J Epidemiol. 41 (6): 1614–24. doi:10.1093/ije/dys195. PMC 3535751. PMID 23159830.
[29] "Entity Resolution at Scale". 14 February 2020.

Notes

This article is a direct transclusion of the Wikipedia article and therefore may not meet the same editing standards as LIMSwiki.

[1] "Cristen, P & T: Febrl - Freely extensible biomedical record linkage (Manual, release 0.3) p.9". Archived from the original on 2016-03-11. Retrieved 2006-04-21.

[2] Elmagarmid, Ahmed; Panagiotis G. Ipeirotis; Vassilios Verykios (January 2007). "Duplicate Record Detection: A Survey" (PDF). IEEE Transactions on Knowledge and Data Engineering. 19 (1): pp. 1–16. Bibcode:2007ITKDE..19E0581E. doi:10.1109/tkde.2007.250581. S2CID 386036. Retrieved 2009-03-30.

[3] Singla, Parag; Domingos, Pedro (December 2006). "Entity Resolution with Markov Logic" (PDF). Sixth International Conference on Data Mining (ICDM'06). pp. 572–582. doi:10.1109/ICDM.2006.65. ISBN 9780769527024. S2CID 12211870. Retrieved 1 March 2023.

[4] Dunn, Halbert L. (December 1946). "Record Linkage". American Journal of Public Health. 36 (12): pp. 1412–1416. doi:10.2105/AJPH.36.12.1412. PMC 1624512. PMID 18016455.

[5] Newcombe, H. B.; J.M. Kennedy; S.J. Axford; A. P. James (October 1959). "Automatic Linkage of Vital Records". Science. 130 (3381): 954–959. Bibcode:1959Sci...130..954N. doi:10.1126/science.130.3381.954. PMID 14426783.

[FellegiSunter-6] Fellegi, Ivan; Sunter, Alan (December 1969). "A Theory for Record Linkage" (PDF). Journal of the American Statistical Association. 64 (328): pp. 1183–1210. doi:10.2307/2286061. JSTOR 2286061.

[ReferenceA-7] Wilson, D. Randall, D. Randall (July 31 – August 5, 2011). Beyond Probabilistic Record Linkage: Using Neural Networks and Complex Features to Improve Genealogical Record Linkage (PDF). Proceedings of International Joint Conference on Neural Networks. San Jose, California, USA.

[8] Winkler, William E. "Matching and Record Linkage" (PDF). U.S. Bureau of the Census. Retrieved 12 November 2011.

[9] Churches, Tim; Peter Christen; Kim Lim; Justin Xi Zhu (13 December 2002). "Preparation of name and address data for record linkage using hidden Markov models". BMC Medical Informatics and Decision Making. 2 9. doi:10.1186/1472-6947-2-9. PMC 140019. PMID 12482326.

[10] Roos, LL; Wajda A (April 1991). "Record linkage strategies. Part I: Estimating information and evaluating approaches". Methods of Information in Medicine. 30 (2): 117–123. doi:10.1055/s-0038-1634828. PMID 1857246. S2CID 23501719.

[11] Grannis, SJ; Overhage JM; McDonald CJ (2002). "Analysis of identifier performance using a deterministic linkage algorithm". Proc AMIA Symp.: 305–9. PMC 2244404. PMID 12463836.

[prl-12] Blakely, Tony; Salmond, Clare (December 2002). "Probabilistic record linkage and a method to calculate the positive predictive value". International Journal of Epidemiology. 31 (6): 1246–1252. doi:10.1093/ije/31.6.1246. PMID 12540730.

[13] Quass, Dallan, and Starkey, Paul. “Record Linkage for Genealogical Databases,” ACM SIGKDD ’03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, August 24–27, 2003, Washington, D.C.

[14] Langley, Pat, Wayne Iba, and Kevin Thompson. “An Analysis of Bayesian Classifiers,” In Proceedings of the 10th National Conference on Artificial Intelligence, (AAAI-92), AAAI Press/MIT Press, Cambridge, MA, pp. 223-228, 1992.

[15] Michie, D.; Spiegelhalter, D.; Taylor, C. (1994). Machine Learning, Neural and Statistical Classification. Hertfordshire, England: Ellis Horwood. ISBN 0-13-106360-X.

[16] Ramezani, M.; Ilangovan, G.; Kum, H-C. (2021). Evaluation of machine learning algorithms in a human-computer hybrid record linkage system (PDF). Vol. 2846. CEUR workshop proceedings.

[spark-17] "Fuzzy Matching With Spark". Spark Summit. 17 July 2014.

[18] Bronstein, Janet M.; Lomatsch, Charles T.; Fletcher, David; Wooten, Terri; Lin, Tsai Mei; Nugent, Richard; Lowery, Curtis L. (2008-05-01). "Issues and Biases in Matching Medicaid Pregnancy Episodes to Vital Records Data: The Arkansas Experience". Maternal and Child Health Journal. 13 (2): 250–259. doi:10.1007/s10995-008-0347-z. ISSN 1092-7875. PMID 18449631. S2CID 22259447.

[19] Boscoe, Francis P.; Schrag, Deborah; Chen, Kun; Roohan, Patrick J.; Schymura, Maria J. (2010-12-15). "Building Capacity to Assess Cancer Care in the Medicaid Population in New York State". Health Services Research. 46 (3): 805–820. doi:10.1111/j.1475-6773.2010.01221.x. ISSN 0017-9124. PMC 3087842. PMID 21158856.

[20] Kum, Hye-Chung; Krishnamurthy, Ashok; Machanavajjhala, Ashwin; Reiter, Michael K; Ahalt, Stanley (March 2014). "Privacy preserving interactive record linkage (PPIRL)". Journal of the American Medical Informatics Association. 21 (2): 212–220. doi:10.1136/amiajnl-2013-002165. ISSN 1067-5027. PMC 3932473. PMID 24201028.

[21] Kum, H-C.; Ragan, E.; Ilangovan, G.; Ramezani, M.; Li, Q.; Schmit, C. (2019). Enhancing Privacy through an Interactive On-demand Incremental Information Disclosure Interface: Applying Privacy-by-Design to Record Linkage (PDF). Fifteenth Symposium on Usable Privacy and Security (SOUPS). pp. 175–189. ISBN 978-1-939133-05-2.

[22] Ragan, Eric D.; Kum, Hye-Chung; Ilangovan, Gurudev; Wang, Han (2018-04-21). "Balancing Privacy and Information Disclosure in Interactive Record Linkage with Visual Masking". Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. New York, NY, USA: ACM. pp. 1–12. doi:10.1145/3173574.3173900. ISBN 9781450356206. S2CID 5051254.

[23] Vatsalan, D; Sehili, Z; Christen, P; Rahm, E (2017). "Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges". Handbook of Big Data Technologies. pp. 851–895. doi:10.1007/978-3-319-49340-4_25. hdl:1885/247396. ISBN 978-3-319-49339-8.

[lsd-24] Christen, P; Ranbaduge, T; Schnell, R (2020). Linking Sensitive Data: Methods and Techniques for Practical Privacy-Preserving Information Sharing. Heidelberg: Springer. doi:10.1007/978-3-030-59706-1. ISBN 978-3-030-59706-1. S2CID 222821833.

[25] Gkoulalas-Divanis, A; Vatsalan, D; Karapiperis, D; Kantarcioglu, M (2021). "Modern Privacy-Preserving Record Linkage Techniques: An Overview". IEEE Transactions on Information Forensics and Security. 16: 4966–4987. Bibcode:2021ITIF...16.4966G. doi:10.1109/TIFS.2021.3114026. S2CID 239088979.

[26] Schnell, R; Bachteler, T; Reiher, J (2009). "Privacy-Preserving Record Linkage using Bloom filters". BMC Medical Informatics and Decision Making. 9 41. doi:10.1186/1472-6947-9-41. PMC 2753305. PMID 19706187.

[27] Vidanage, A (2022). Efficient Cryptanalysis Techniques for Privacy-Preserving Record Linkage (Thesis). Canberra: Australian National University. doi:10.25911/VSBZ-A727. hdl:1885/254502.

[data_resource_profile-28] St. Sauver JL; Grossardt BR; Yawn BP; Melton LJ 3rd; Pankratz JJ; Brue SM; Rocca WA (2012). "Data Resource Profile: The Rochester Epidemiology Project (REP) medical records-linkage system". Int J Epidemiol. 41 (6): 1614–24. doi:10.1093/ije/dys195. PMC 3535751. PMID 23159830.{{cite journal}}: CS1 maint: numeric names: authors list (link)

[29] "Entity Resolution at Scale". 14 February 2020.
