Automated Classification of Free-text Pathology Reports for Registration of Incident Cases of Cancer

Journal: Methods of Information in Medicine
Subtitle: A journal stressing, for more than 50 years, the methodology and scientific fundamentals of organizing, representing and analyzing data, information and knowledge in biomedicine and health care
ISSN: 0026-1270

Focus Theme: Medical Imaging High Performance Methods
Guest Editors: C. Kulikowski, L. Gong

Issue: 2012 (Vol. 51), Issue 3
Pages: 242-251

Original Article

V. Jouhet (1), G. Defossez (1), A. Burgun (2), P. Le Beux (2), P. Levillain (3, 4), P. Ingrand (1, 5), V. Claveau (6)

(1) Unité d’épidémiologie, biostatistique et registre des cancers de Poitou-Charentes, Faculté de médecine, Centre Hospitalier Universitaire de Poitiers, Université de Poitiers, Poitiers, France; (2) INSERM U936, Faculté de médecine, Université de Rennes 1, Rennes, France; (3) Anatomie et cytologie pathologiques, Centre Hospitalier Universitaire de Poitiers, Poitiers, France; (4) Centre de Regroupement Informatique et Statistique en Anatomo-Pathologie de Poitou-Charentes, Faculté de médecine, Université de Poitiers, Poitiers, France; (5) INSERM, CIC 802, Poitiers, France; (6) IRISA – CNRS UMR 6074, Rennes, France


Keywords: Medical informatics, pathology, neoplasm, free text, automated classification


Objective: Our study aimed to construct and evaluate functions called “classifiers”, produced by supervised machine learning techniques, that automatically categorize pathology reports using solely their content.

Methods: Patients from the Poitou-Charentes Cancer Registry having at least one pathology report and a single non-metastatic invasive neoplasm were included. A descriptor weighting function accounting for the distribution of terms among targeted classes was developed and compared to classic methods based on inverse document frequencies. The classification was performed with support vector machine (SVM) and Naive Bayes classifiers. Two levels of granularity were tested for both the topographical and the morphological axes of the ICD-O3 code. The ability to correctly attribute a precise ICD-O3 code and the ability to attribute the broad category defined by the International Agency for Research on Cancer (IARC) for the multiple primary cancer registration rules were evaluated using F1-measures.
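The baseline side of this comparison can be illustrated with a minimal Python sketch. It shows a classic inverse-document-frequency weighting and a small multinomial Naive Bayes classifier, using only the standard library. It is not the authors' implementation: the abstract does not specify their class-weighted descriptor or their SVM configuration, so neither is reproduced here. All function names, the whitespace tokenizer, the smoothing choices, and the toy report snippets are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()


def idf_weights(documents):
    """Inverse document frequency with a +1 offset: log(N / df(t)) + 1."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(document, idf):
    """Term frequency times IDF, restricted to terms seen when fitting idf."""
    counts = Counter(tokenize(document))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def train_naive_bayes(documents, labels):
    """Multinomial Naive Bayes on raw term counts (add-one smoothing at predict time)."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for doc, label in zip(documents, labels):
        tokens = tokenize(doc)
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {
        "priors": {c: math.log(n / len(labels)) for c, n in class_docs.items()},
        "term_counts": term_counts,
        "totals": {c: sum(term_counts[c].values()) for c in class_docs},
        "vocabulary": vocabulary,
    }


def predict(model, document):
    """Return the class with the highest log-posterior score."""
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, prior in model["priors"].items():
        score = prior
        for token in tokenize(document):
            if token not in model["vocabulary"]:
                continue  # terms never seen in training are ignored
            count = model["term_counts"][label][token]
            score += math.log((count + 1) / (model["totals"][label] + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy snippets and labels, for illustration only (not study data).
reports = [
    "invasive ductal carcinoma of the breast",
    "breast biopsy showing lobular carcinoma",
    "adenocarcinoma of the sigmoid colon",
    "colon resection with invasive adenocarcinoma",
]
labels = ["breast", "breast", "colon", "colon"]

idf = idf_weights(reports)
vector = tfidf_vector("carcinoma of the breast", idf)
model = train_naive_bayes(reports, labels)
prediction = predict(model, "ductal carcinoma in breast tissue")
```

In this sketch a term that occurs in a single report, such as "sigmoid", receives a higher IDF weight than one that occurs in several, such as "breast". A class-aware weighting would additionally take into account how a term is distributed across the target classes, which plain IDF ignores.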

Results: 5121 pathology reports produced by 35 pathologists were selected. The best performance was achieved by our class-weighted descriptor combined with an SVM classifier. Using this method, the pathology reports were correctly classified into the IARC categories with F1-measures of 0.967 for both topography and morphology. Attribution of the precise ICD-O3 code performed less well, with F1-measures of 0.715 for topography and 0.854 for morphology.
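For reference, the F1-measure is the harmonic mean of precision and recall. The standard per-class definition is given below, with TP, FP and FN denoting true positives, false positives and false negatives for one class. The abstract does not state how per-class scores were averaged across categories, so no averaging scheme is assumed here.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,P\,R}{P + R}
```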

Conclusion: These results suggest that free-text pathology reports could serve as a data source for automated systems that identify and report new cases of cancer. Future work is needed to evaluate how much performance improves with natural language processing, to handle reports describing multiple tumors, and to assess the possible incorporation of other medical documents such as surgical reports.
