Being able to share

The Open Text Mining Interface (OTMI) is an initiative from Nature Publishing Group (NPG). It aims to enable scholarly publishers, among others, to disclose their full text for indexing and text-mining purposes but without giving it away in a form that is readily human-readable. Here is their wiki page.

The Open Text Mining Interface provides for a range of structured disclosure options, from word vectors (lists of word occurrences with frequency counts) and the presentation of text 'snippets' out of narrative order, to the presentation of full text in "raw" or "reduced" form.

OTMI aims to bring text-mining capabilities to scientific publications, something researchers have been waiting on for decades. The initial demo on the Nature blog NASCENT uses the 23 March issue of Nature.

Embedded in the HTML of the abstract and full-text file for each article is a tag like this:

   <link rel="OTMI" type="application/atom+xml" href="../otmi/otmi-nature04614.xml"/>

which points to an OTMI file — a machine-readable representation of the text. (Technically, it's an Atom Entry document with various XML namespace extensions to allow us to include additional information.) As I write this, the example files for our test issue contain the following information:


   * Bibliographic details (of the kind you might also find in the table-of-contents RSS feeds)
   * Word vectors. That is, a list of all the words that appear in the article and the number of occurrences. (There's also a stop-word list of very common words that have been excluded.) This enables the construction of the most basic types of search index.
   * 'Snippets'. Basically sentences, presented out of order, which allows more sophisticated indexing and text mining (e.g., the kind that looks out for common constructions such as "A binds to B" or "X inhibits Y"), but not, of course, anything that looks across sentence boundaries.
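
To get at a file like this in the first place, a crawler has to spot the link tag shown earlier in the article's HTML. Here is a minimal sketch of that discovery step in Python, using only the standard library. The page URL and the helper names (`OTMILinkFinder`, `find_otmi_links`) are my own assumptions for illustration, not part of OTMI.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OTMILinkFinder(HTMLParser):
    """Collects the href of every <link rel="OTMI"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated values; compare case-insensitively.
        rels = (attr.get("rel") or "").lower().split()
        if "otmi" in rels and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_otmi_links(html, page_url):
    """Return absolute URLs for all OTMI links found in the HTML."""
    finder = OTMILinkFinder()
    finder.feed(html)
    # The href in the demo is relative, so resolve it against the page URL.
    return [urljoin(page_url, href) for href in finder.hrefs]


# Hypothetical page URL; only the link tag itself comes from the demo.
page = (
    '<html><head><link rel="OTMI" type="application/atom+xml" '
    'href="../otmi/otmi-nature04614.xml"/></head></html>'
)
links = find_otmi_links(page, "http://example.org/journal/full/nature04614.html")
print(links)
```

Resolving the relative href against the page URL matters here because the demo tag points one directory up from the article page.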


Note that for both words and sentences (actually quite hard concepts to define in strict computational terms), the algorithms used to tokenize the text are defined in the OTMI file using regular expressions, so anyone, or anything, examining the file can in principle know exactly how the text was processed to create the respective lists. Note also that the word vectors will usually be redundant if you have the sentences, but they include both for the purposes of this demo (and who knows, maybe it's useful to some people if they provide both).
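
To make the redundancy point concrete, here is a rough Python sketch of how snippets and word vectors could be produced from the same text with regular-expression tokenizers. The two patterns and the tiny stop-word list are assumptions made up for this example; the real ones are whatever a given OTMI file declares.

```python
import random
import re
from collections import Counter

# Assumed patterns, for illustration only; a real OTMI file declares its own.
WORD_PATTERN = r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*"
SENTENCE_PATTERN = r"[^.!?]+[.!?]"
STOP_WORDS = {"the", "of", "to", "and", "in", "is"}  # tiny sample list


def snippets(text, sentence_pattern=SENTENCE_PATTERN, seed=None):
    """Split text into sentences and return them out of narrative order."""
    sentences = [s.strip() for s in re.findall(sentence_pattern, text)]
    random.Random(seed).shuffle(sentences)
    return sentences


def word_vector(text, word_pattern=WORD_PATTERN, stop_words=STOP_WORDS):
    """Count lower-cased word occurrences, leaving out stop words."""
    words = (w.lower() for w in re.findall(word_pattern, text))
    return Counter(w for w in words if w not in stop_words)


text = (
    "Protein A binds to protein B. "
    "Compound X inhibits enzyme Y. "
    "Protein B is abundant."
)
shuffled = snippets(text, seed=1)
from_full_text = word_vector(text)
# Rebuild the word vector from the shuffled snippets alone.
from_snippets = sum((word_vector(s) for s in shuffled), Counter())
print(shuffled)
print(from_snippets == from_full_text)
```

For this sample the counts rebuilt from the snippets match the ones taken from the full text, which is the sense in which the word vectors add nothing new once you have the sentences. With a cruder sentence pattern that drops text (say, a trailing fragment with no full stop), the two could diverge, which is one reason it helps that the patterns travel with the file.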

There are still a lot of things that could be improved here. For example:

  1. Allow for text from different sections of an article (e.g., abstract, figure legends) to be labelled as such.
  2. Allow for text to be presented in normal human-readable form for publishers who are willing to provide this.
  3. Add a list of cited articles, providing at least DOIs but perhaps other information too. This would, of course, open up the content to citation analysis.
  4. Add references to the OTMI files from the corresponding RSS feed items (and from the log-in page where content is access-controlled).
  5. Add references to a common stop-word list instead of repeating it in each OTMI file.
  6. Add rights information.
  7. Add references to associated data files and/or database entries.
  8. Provide an actual spec. ;)


They intend to make at least some of these changes (and perhaps others besides) over the coming weeks, so expect the example files to change before your eyes. There's also an even more basic issue around whether an Atom entry document is the right starting point. For example, perhaps an RDF/XML format would be more useful, at least to some people.

The example of RSS shows how powerful a relatively simple common standard can be when it comes to aggregating content from multiple sources (even when it's messed up as badly as RSS ;). So maybe an approach like OTMI (or a better one dreamt up by someone else) can help those who want to index and text-mine scientific and other content. As with RSS, I think publishers might also come to see this as a kind of advert for their content because it should help interested readers to discover it. And on the basis that something is always better than nothing, it also doesn't force publishers to give away the human-readable form of their content: they can limit themselves to snippets or even just word vectors if they want to.

Frederic