A Database De-identification Framework to Enable Direct Queries on Medical Data for Secondary Use

Journal: Methods of Information in Medicine
Subtitle: A journal stressing, for more than 50 years, the methodology and scientific fundamentals of organizing, representing and analyzing data, information and knowledge in biomedicine and health care
ISSN: 0026-1270

Focus Theme: Medical Imaging High Performance Methods
Guest Editors: C. Kulikowski, L. Gong

Issue: 2012 (Vol. 51), Issue 3
Pages: 229-241

Original Article

B. S. Erdal (1, 2), J. Liu (1), J. Ding (1), J. Chen (1), C. B. Marsh (3), J. Kamal (1), B. D. Clymer (2)

(1) Information Warehouse, The Ohio State University Medical Center, Columbus, Ohio, USA; (2) Electrical and Computer Engineering, The Ohio State University, Columbus, Ohio, USA; (3) Internal Medicine, The Ohio State University Medical Center, Columbus, Ohio, USA


Keywords: de-identification, data warehouse


Objective: To qualify patient clinical records as non-human-subject data for research purposes, electronic medical record data must be de-identified so that the risk of exposing protected health information is minimal. This study demonstrated a robust framework for de-identifying structured data that can be applied to any relational data source.

Methods: Using a real-world clinical data warehouse, a pilot implementation covering a limited set of subject areas was used to demonstrate and evaluate the new de-identification process. Query results and performance were compared between the source and target systems to validate data accuracy and usability.

Results: The combination of hashing, pseudonyms, and a session-dependent randomizer provides a rigorous de-identification framework that guards against 1) exposure of source identifiers; 2) manual linking to source identifiers by internal data analysts; and 3) cross-linking of identifiers among different researchers or across multiple query sessions by the same researcher. In addition, a query rejection option refuses queries that return fewer than preset numbers of subjects and total records, so that users cannot accidentally identify a subject because of a low volume of data. The framework does not prevent subject re-identification based on prior knowledge and sequence of events. It also does not handle de-identification of medical free text, although text de-identification using natural language processing can be added owing to the framework's modular design.
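The layering described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyed-hash construction (HMAC-SHA-256), the function names, the thresholds, and the rule that a query is rejected when either count falls below its threshold are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical thresholds; the abstract only states that they are "preset".
MIN_SUBJECTS = 10
MIN_RECORDS = 20


def hash_identifier(source_id: str, site_key: bytes) -> str:
    """Keyed one-way hash, so the source identifier is never stored in the target database."""
    return hmac.new(site_key, source_id.encode("utf-8"), hashlib.sha256).hexdigest()


def session_pseudonym(hashed_id: str, session_salt: bytes) -> str:
    """Re-key the stored hash per query session, so pseudonyms cannot be linked across sessions."""
    digest = hmac.new(session_salt, hashed_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def release_rows(rows, session_salt: bytes):
    """Return session-pseudonymized (id, value) rows, or None if the query is rejected.

    `rows` is an iterable of (hashed_id, value) pairs from the de-identified store.
    Assumption: the query is rejected if EITHER count is below its threshold.
    """
    rows = list(rows)
    subjects = {hashed_id for hashed_id, _ in rows}
    if len(subjects) < MIN_SUBJECTS or len(rows) < MIN_RECORDS:
        return None  # too few subjects or records to release safely
    return [(session_pseudonym(h, session_salt), value) for h, value in rows]
```

In this sketch a fresh random `session_salt` (for example from `secrets.token_bytes(32)`) would be generated for each query session and discarded afterwards, so the same subject receives unrelated pseudonyms in different sessions.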

Conclusion: We demonstrated a framework that produces HIPAA-compliant databases that researchers can query directly. The technique can be extended to facilitate inter-institutional research data sharing through existing middleware such as caGrid.
