Correlation Engine 2.0
Clear Search sequence regions

Sizes of these terms reflect their relevance to your search.

The recognition of named entities (NER) is an elementary task in biomedical text mining. A number of NER solutions have been proposed in recent years, taking advantage of available annotated corpora, terminological resources and machine-learning techniques. Currently, the best performing solutions combine the outputs from selected annotation solutions measured against a single corpus. However, little effort has been spent on a systematic analysis of methods harmonizing the annotation results and measuring against a combination of Gold Standard Corpora (GSCs). We present Totum, a machine learning solution that harmonizes gene/protein annotations provided by heterogeneous NER solutions. It has been optimized and measured against a combination of manually curated GSCs. The performed experiments show that our approach improves the F-measure of state-of-the-art solutions by up to 10% (achieving ≈70%) in exact alignment and 22% (achieving ≈82%) in nested alignment. We demonstrate that our solution delivers reliable annotation results across the GSCs and it is an important contribution towards a homogeneous annotation of MEDLINE abstracts. Availability and implementation: Totum is implemented in Java and its resources are available at


David Campos, Sérgio Matos, Ian Lewin, José Luís Oliveira, Dietrich Rebholz-Schuhmann. Harmonization of gene/protein annotations: towards a gold standard MEDLINE. Bioinformatics (Oxford, England). 2012 May 1;28(9):1253-61

Expand section icon Mesh Tags

Expand section icon Substances

PMID: 22419783

View Full Text