A critical science in healthcare that has many dimensions and use-cases or misuse-cases.
De-Identification -- Break the binding:
I have been involved lately with a few De-Identification projects; to be complete: De-Identification, Anonymization, and Pseudonymization. The goal is to end up with a set of data that is useful for some research project, yet carries as low a Privacy risk as possible to the individuals the data is about.
These efforts go to great lengths to remove Direct Identifiers, those values that are publicly known to uniquely identify a single individual. For example a Driver’s License number, Passport number, Medical Record Number, Email Address, Personal Phone Number, etc.
These efforts then struggle with the Indirect Identifiers, also known as Quasi-Identifiers. These are values that are not unique to that individual, but do describe a narrow aspect of the individual. For example a birth date, gender, postal/zip code, etc. There is also the 'little' issue of free-text fields.
The struggle with De-Identification is that these Indirect Identifiers are often needed by the research project. Researchers very often need to know the gender, age, and region where the individuals live. Thus these efforts often leave some risk.
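A minimal sketch of that trade-off, in Python: drop the Direct Identifiers outright, and coarsen the Indirect Identifiers rather than removing them, so the research project still gets gender, an age band, and a region. The field names, the ten-year age bands, and the three-character region prefix are all my assumptions for illustration, not a recommended policy, and free-text fields are not handled at all.

```python
from datetime import date

# Hypothetical field names; a real dataset will have its own schema.
DIRECT_IDENTIFIERS = {"drivers_license", "passport", "mrn", "email", "phone"}


def de_identify(record: dict, today: date) -> dict:
    """Drop direct identifiers and coarsen indirect ones (illustrative only)."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # Replace the exact birth date with a ten-year age band (assumed granularity).
    birth = out.pop("birth_date", None)
    if birth is not None:
        had_birthday = (today.month, today.day) >= (birth.month, birth.day)
        age = today.year - birth.year - (0 if had_birthday else 1)
        low = (age // 10) * 10
        out["age_band"] = f"{low}-{low + 9}"

    # Replace the full postal code with its first three characters (assumed granularity).
    zip_code = out.pop("zip", None)
    if zip_code is not None:
        out["region"] = str(zip_code)[:3]

    return out
```

Even after this, the combination of gender, age band, and region can still single someone out in a small population, which is the residual risk described above.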
The concern is that with some risk left in a de-identified dataset there is a possibility that someone who has legitimate (or illegitimate) access might try to re-identify the individuals and thus violate privacy. This is an ‘attack’ upon the de-identified dataset.
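One way to put a rough number on that leftover risk is to count how many records share each combination of Indirect Identifiers. This is the idea behind k-anonymity, a measure I am bringing in here for illustration; it is only one of several ways to estimate risk. The sketch below assumes records are plain dictionaries and that the caller names the quasi-identifier fields; both function names are hypothetical.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Return (k, counts), where k is the size of the smallest group of
    records sharing one combination of quasi-identifier values."""
    counts = Counter(
        tuple(r.get(q) for q in quasi_identifiers) for r in records
    )
    return min(counts.values(), default=0), counts


def unique_records(records, quasi_identifiers):
    """Records whose quasi-identifier combination appears exactly once;
    these are the easiest targets for a re-identification attack."""
    _, counts = smallest_group(records, quasi_identifiers)
    return [
        r for r in records
        if counts[tuple(r.get(q) for q in quasi_identifiers)] == 1
    ]
```

A smallest group of 1 means at least one individual is unique on those fields alone, so anyone holding an outside dataset with the same fields has a good starting point for re-identification.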
Patient Identity Matching -- Make the binding:
I have also been involved lately with a few Patient Matching projects, where the goal is to end up with a cross-reference between many different Patient Identifiers; that is, to identify when two different Patient Identifiers are actually about the same human. This is often referred to as De-Duplication, as if you were removing duplication, when you are actually not removing it but just assertively acknowledging it.
These Patient Matching projects are most prevalent in the USA, where our government has forbidden funding to even discuss a national Patient Identity project. Thus in the USA, Patient Identity Matching is the only choice. This is not really true, the private sector can solve the problem; but the healthcare private sector is far too fragmented to work together on this… Kind of true, more to come on that… My view is that a good Patient Identifier enhances Privacy.
Binding Methodology:
I see these as two sides of the same coin. In the one case we are struggling to break any identification linkage, whereas in the other we are trying to use any fragment of truth to create linkages. The motivations are very different, the outcomes are very different; but the methods are very much the same.
Correlations between direct identifiers give a positive match. Correlations between indirect identifiers give evidence of a possible match. Each possible match has a strength based on how that specific indirect identifier is distributed across the population (a gender match alone, for example, only narrows the field to about half the population). Some threshold of ‘possible’ matches is considered sufficient to indicate an actual match. Any dissonance breaks any matches, or indicates dirty data.
Data is often sub-optimal, aka dirty. Dealing with False-Positives and False-Negatives turns into more art than science.
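The methodology above can be sketched in a few lines of Python. This is a simplification under loud assumptions: the field names are hypothetical, the agreement probabilities are made-up illustrative numbers rather than real population statistics, and the threshold is arbitrary. Each indirect identifier contributes evidence measured in bits, so gender (about a 1-in-2 chance of agreeing by accident) contributes 1 bit while a birth date contributes far more.

```python
import math

# Hypothetical direct identifier fields shared across systems.
DIRECT = ("passport", "drivers_license")

# ASSUMED chance that two unrelated people agree on each field by accident.
# Illustrative numbers only, not measured population statistics.
CHANCE_AGREEMENT = {"gender": 0.5, "birth_date": 1 / 25000, "zip": 1 / 30000}

MATCH_THRESHOLD_BITS = 20.0  # arbitrary cut-off chosen for this sketch


def compare(a: dict, b: dict) -> str:
    """Classify two records as 'match', 'possible', 'conflict', or 'unknown'."""
    # Agreement on any direct identifier is treated as a positive match.
    for field in DIRECT:
        if a.get(field) is not None and a.get(field) == b.get(field):
            return "match"

    score = 0.0
    for field, p in CHANCE_AGREEMENT.items():
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            continue  # missing data is neither evidence for nor against
        if va != vb:
            return "conflict"  # dissonance: not a match, or dirty data
        score += math.log2(1 / p)  # rarer agreement gives stronger evidence

    if score >= MATCH_THRESHOLD_BITS:
        return "match"
    return "possible" if score > 0 else "unknown"
```

Where the threshold sits is exactly the False-Positive versus False-Negative trade-off: lower it and more different people get merged, raise it and more records for the same person stay unlinked. A hard 'conflict' on any disagreement is also too strict for dirty data, since a single typo in a birth date would break a true match; real systems tend to use fuzzier comparisons.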
Risk... There is always risk, no matter how you slice it.
My other blog articles on these topics can be found at De-Identification, Anonymization, Pseudonymization, and Patient Identity.