Simply put, information extraction (IE) accomplishes these tasks:

* Take natural language text from a document source and extract the essential facts about one or more predefined fact types.

* Represent each fact as a template whose slots are filled on the basis of what is found in the text.
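To make the two steps concrete, here is a minimal sketch in Python. It assumes a single fact type (a binding event) and a single hand-written surface pattern; the names `BindingFact`, `BINDING_PATTERN` and `extract_binding_facts`, as well as the example sentence, are made up for illustration and do not come from any particular IE system.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BindingFact:
    """Template for one predefined fact type: a binding event."""
    agent: Optional[str] = None   # slot: the entity that binds
    target: Optional[str] = None  # slot: the entity that is bound


# Hypothetical surface pattern: "<agent> binds [to] [the] <target>".
# Real systems rely on far richer linguistic analysis than one regex.
BINDING_PATTERN = re.compile(
    r"\b([A-Za-z0-9-]+) binds (?:to )?(?:the )?([A-Za-z0-9-]+)"
)


def extract_binding_facts(text: str) -> List[BindingFact]:
    """Fill one template per match of the pattern in the text."""
    return [
        BindingFact(agent=m.group(1), target=m.group(2))
        for m in BINDING_PATTERN.finditer(text)
    ]


# Illustrative input only; the second sentence carries no binding fact.
facts = extract_binding_facts(
    "GerE binds to the cotB promoter. The weather was fine."
)
for fact in facts:
    print(fact)
```

Only the first sentence yields a filled template; text that does not match the predefined fact type is simply ignored, which is the point of extracting "essential facts" rather than everything.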

IE is typically carried out in support of other tasks and usually forms part of an application or a pipeline of processes. The results of IE are either stored in databases and subjected to querying or data mining, integrated into knowledge bases to allow reasoning, or presented to users for annotation or curation tasks.

Thus, IE is an application of natural language processing (NLP). As the term implies, the goal is to extract information from text, and the aim is to do so without requiring the end user to read the text. In contrast, information retrieval (IR), as performed by a search engine, is the activity of finding documents that answer an information need with the help of an index.
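For contrast, the IR side can be sketched as a toy inverted index in Python. This is only an assumption-laden illustration: the functions `build_index` and `search`, the two-document collection and the naive whitespace tokenisation are all hypothetical. The thing to notice is what comes back: document identifiers, not facts.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lower-cased token to the set of documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip(".,;:")].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


# Made-up two-document collection, for illustration only.
docs = {
    "d1": "GerE binds to the cotB promoter.",
    "d2": "SigK activates transcription of gerE.",
}
index = build_index(docs)
print(search(index, "gerE"))
print(search(index, "binds promoter"))
```

The user still has to open and read the returned documents, whereas an IE system would hand back the fact itself.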

IE has dealt primarily with news sources and, more recently, with scientific publications. In the sciences, general-language grammars and dictionaries are not enough. Scientific fields use many technical terms, only a few of which are found in common discourse. To some extent, such terms can be listed in auxiliary terminologies. However, automatic term recognition (ATR) is useful for IE because it can extract named entities on the basis of their internal structure.
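As a rough idea of what "internal structure" can mean, the Python sketch below flags tokens whose shape alone suggests a technical term. The three cues in `INTERNAL_CUES` and the function `candidate_terms` are hypothetical simplifications chosen for this example; actual ATR methods combine linguistic and statistical evidence and are considerably more robust.

```python
import re
from typing import List

# Hypothetical internal-structure cues (assumptions for this sketch).
INTERNAL_CUES = [
    re.compile(r"^[a-z]{3}[A-Z]$"),             # three lower-case letters + capital, e.g. cotB
    re.compile(r"^[A-Za-z]+\d+[A-Za-z0-9]*$"),  # letters followed by digits, e.g. p53
    re.compile(r"^[a-z]+ase$"),                 # enzyme-like suffix, e.g. kinase
]


def candidate_terms(text: str) -> List[str]:
    """Return tokens whose shape matches a cue, in order of first appearance."""
    seen = set()
    candidates = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        if token not in seen and any(cue.match(token) for cue in INTERNAL_CUES):
            seen.add(token)
            candidates.append(token)
    return candidates


print(candidate_terms(
    "The kinase phosphorylates p53, while cotB is expressed late."
))
```

Shape cues like these over-generate (the suffix rule would also accept "phase" or "case"), which is one reason they are usually combined with terminologies rather than used alone.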

Regardless of which IE approaches were used in the past, scientific fields, especially biology and biomedicine, are not well served by IE systems that do not make use of ontologies and linguistic lexicons. The best example is GenIE ("Genome Information Extraction") from the Institute for Computational Linguistics at the University of Stuttgart. It uses ontology-driven information extraction technologies that go beyond extracting simple facts from sentences. Its aim is to deal with anaphoric reference and with cases where information from each sentence must be merged or a relation must be established between events.

For instance, if one sentence refers explicitly to a binding action, and the following sentence points to the regulation of gene expression due to the interaction between binding factors and promoter sequences, then the dependency between the two events should be captured.
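A minimal sketch of that idea in Python, under strong assumptions: events have already been extracted, each one records its sentence number and participants, and a dependency is assumed whenever a binding event is followed in the very next sentence by a regulation event that shares a participant. The `Event` class and `link_events` function are invented for illustration and are not GenIE's actual method.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Event:
    kind: str                     # e.g. "binding" or "regulation"
    participants: FrozenSet[str]  # entities involved in the event
    sentence: int                 # index of the sentence it came from


def link_events(events: List[Event]) -> List[Tuple[Event, Event]]:
    """Link a binding event to a regulation event in the next sentence
    when the two share at least one participant (a simplifying assumption)."""
    ordered = sorted(events, key=lambda e: e.sentence)
    links = []
    for earlier, later in zip(ordered, ordered[1:]):
        if (
            earlier.kind == "binding"
            and later.kind == "regulation"
            and later.sentence - earlier.sentence == 1
            and earlier.participants & later.participants
        ):
            links.append((earlier, later))
    return links


# Hand-built events standing in for the output of an earlier extraction step.
binding = Event("binding", frozenset({"binding factor", "promoter"}), 0)
regulation = Event(
    "regulation",
    frozenset({"binding factor", "promoter", "gene expression"}),
    1,
)
links = link_events([regulation, binding])
for cause, effect in links:
    print(cause.kind, "->", effect.kind)
```

An ontology-driven system would decide which event types can depend on which from its domain model instead of hard-coding the "binding then regulation" rule as done here.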

A must-read: "Literature mining for the biologist: from information retrieval to biological discovery" by Peer Bork et al., Nature Reviews Genetics, 2006.

DNA MANIA