One question we get quite often about setting up an instance of OpenEMPI has to do with the configuration of the blocking algorithm. Everyone is familiar to some extent with matching since that is what an EMPI is supposed to do but what exactly is blocking? In this blog entry we first explain what blocking is about conceptually and in a subsequent entry we will get into the specifics of how to configure the blocking algorithm of your OpenEMPI instance.
When a new record is added to OpenEMPI, the system needs to determine if there are any existing records that match this new record. If any such matching records are found, a link is created between each of the existing matching records and the new record since all those records are considered to be duplicate records that refer to the same physical patient. The brute force approach for identifying such duplicate records would be to compare the new record against every other record in the system and have the matching algorithm evaluate every such pair. This approach would certainly work but it does not scale. If the performance implications of the brute force approach are not easy to see when considering what happens when one new record is added, they become especially clear when you consider the effect of the brute force strategy when are you first setting up your instance of OpenEMPI. Let’s say you have aggregated the patient records from all the facilities that you are integrating together and you have come up with a grand total of 100,000 records (this is a fairly small federation of healthcare facilities). You now need to run the matching algorithm against all records in the system to identify all record pairs and link them together. If you use the brute force approach to generate all record pairs that need to be compared with one another, you will end up having to evaluate 10 billion record pairs. Clearly there has to be a better way of approaching this problem.
This is where blocking comes in. The purpose of blocking is to identify record pairs that are likely to result in a match and eliminate record pairs that will certainly not do so. For example, there is no reason to compare the patient record of Odysseas Pentakalos from Virginia with the record for Jane Smith from Michigan because it is close to impossible that these two records refer to the same patient. The blocking algorithm works by identifying candidate records from a source record that may potentially result in a match, and the system generates all record pairs between the source record and each of the candidate records and passes those pairs down the pipeline for evaluation.
The blocking algorithm currently used by OpenEMPI operates by partitioning the complete set of records in the system into blocks (which explains where the name “blocking” algorithm came from) and then only comparing record pairs formed by pairing together the records in each block. This results in a considerable reduction in the number of record pairs evaluated. The next question is how is the partitioning of the records done? When configuring the blocking algorithm you must select one or more record attributes forming what is referred to as the blocking key. The system then partitions the complete set of records into blocks that have the same value for the blocking key. For example, let’s say that we choose the zip or postal code as the single attribute comprising our blocking key. The system will then form blocks for each distinct zip code in our records and all the records in each block will have the same value in the zip code field. To quantify the improvement in reducing the number of record pairs evaluated in this example, let’s assume our repository had a total of 24 records in it, with 6 distinct values for the zip code in those records. To simplify the calculation, we assume that the blocking key of zip code evenly partitions the 24 records into 6 blocks of 4 records in each block. Without using the blocking algorithm, the brute force approach would require that we evaluate n(n-1)/2 or 276 pairs. By using the blocking algorithm, we have 6 blocks generating 6 record pairs each (4×3/2) for a total of only 36 record pairs. The blocking algorithm allowed us to reduce the total number of record pairs that we have to evaluate from 276 to 36 or a factor of almost 8.
The blocking algorithm is not only used for the initial evaluation of record pairs into the system but also whenever new records are added or updated. For example, when a new record is added to the system with zip code 10000, the blocking algorithm identifies all the records in the system with zip code 10000, forms record pairs between the new record and each of the existing records with zip code 10000 and passes those record pairs on to the matching algorithm for evaluation.
There are quite a few blocking algorithms that researchers have devised over the years. If you are interested in learning more about the various algorithms that are available, I highly recommend the excellent survey article by Peter Christen [1]. The ultimate goal for the blocking algorithm is to reduce the number of record pairs that need to be evaluated for a match, but the algorithm cannot get overly aggressive because then it will start eliminating record pairs that will result in a match, causing the system to generate many false negatives (this is a concise way to refer to record pairs that were classified as non-matches where in reality they are matches). Achieving the right balance between reducing the number of record pairs to be evaluated while not eliminating matching pairs from evaluation requires careful configuration of the algorithm. In a future entry we will discuss the proper configuration of the blocking algorithm used in OpenEMPI.
- Christen, Peter, “A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication”, IEEE Transactions on Knowledge and Data Engineering, 2011.