<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet title="XSL formatting" type="text/xsl" href="http://blog.isavoir.com/feed/rss2/xslt" ?><rss version="2.0"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:wfw="http://wellformedweb.org/CommentAPI/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
  <title>DNA MANIA - Open Science</title>
  <link>http://blog.isavoir.com/</link>
  <description>Bioinformatics, Text Mining, Biological Text Mining, Named entity recognition, Genomics, Systems Biology, Semantics, Computational Biology, Semantic Web, Knowledge management, Biomedicine, Ontology, Thesaurus, Terminology, Corpora, Content management</description>
  <language>en</language>
  <pubDate>Sat, 05 Jul 2008 13:58:56 +0200</pubDate>
  <copyright>iSavoir @ 2007 copyright reserved</copyright>
  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
  <generator>Dotclear</generator>
  
    
  <item>
    <title>Open Text Mining Interface</title>
    <link>http://blog.isavoir.com/post/2007/03/11/Open-Text-Mining-Interface</link>
    <guid isPermaLink="false">urn:md5:0e6f4cd04f00c7a2e35ae08f5a25d884</guid>
    <pubDate>Sun, 11 Mar 2007 15:04:00 +0100</pubDate>
    <dc:creator>Frédéric</dc:creator>
        <category>Text Mining</category>
        <category>Open Science</category><category>Open Text Mining Interface</category><category>OTMI</category><category>RSS feeds</category><category>Text Mining</category>    
    <description>&lt;p&gt;&lt;img src=&quot;http://blog.isavoir.com/public/otmi.gif&quot; alt=&quot;OTMI&quot; style=&quot;float:left; margin: 0 1em 1em 0;&quot; /&gt; Nature might not quite be in the Open
Publishing business like PLoS, but they are an important player nevertheless. I
hope the OTMI gets picked up by other publications. It would be nice to have a
publication data standard and as one of the top two scientific journals, Nature
has the clout to make this happen. Being able to mine journals (open or
otherwise) and search them for information is invaluable, and using standard formats like
OPML is an excellent idea.&lt;/p&gt;    &lt;h3&gt;Being able to share&lt;/h3&gt;
&lt;p&gt;The Open Text Mining Interface (OTMI) is an initiative from Nature
Publishing Group (NPG). It aims to enable scholarly publishers, among others,
to disclose their full text for indexing and text-mining purposes but without
giving it away in a form that is readily human-readable. Here is their &lt;a href=&quot;http://blog.isavoir.com/post/2007/03/11/&quot; hreflang=&quot;en&quot;&gt;wiki page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Open Text Mining Interface provides for a range of structured disclosure
options, from word vectors (lists of word occurrences with frequency counts) and
the presentation of text 'snippets' out of narrative order, to the presentation
of full text in &amp;quot;raw&amp;quot; or &amp;quot;reduced&amp;quot; form.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;&lt;a href=&quot;http://blog.isavoir.com/tag/Open%20Text%20Mining%20Interface&quot;&gt;Open Text Mining
Interface&lt;/a&gt;&lt;/strong&gt; (OTMI) aims to bring &lt;strong&gt;&lt;a href=&quot;http://blog.isavoir.com/tag/text%20mining&quot;&gt;text mining&lt;/a&gt;&lt;/strong&gt; capabilities to scientific
publications, something researchers have been waiting on for decades. The initial demo on Nature's
&lt;a href=&quot;http://blogs.nature.com/wp/nascent/&quot; hreflang=&quot;fr&quot;&gt;Nascent&lt;/a&gt; blog uses the &lt;a href=&quot;http://www.nature.com/nature/journal/v440/n7083/index.html&quot; hreflang=&quot;fr&quot;&gt;23
March issue of Nature&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Embedded in the HTML of the abstract and full-text file for each article is
a tag like this:&lt;/p&gt;
&lt;pre&gt;
   &amp;lt;link rel=&amp;quot;OTMI&amp;quot; type=&amp;quot;application/atom+xml&amp;quot; href=&amp;quot;../otmi/otmi-nature04614.xml&amp;quot;/&amp;gt;
&lt;/pre&gt;
&lt;p&gt;which points to an &lt;a href=&quot;http://www.nature.com/nature/journal/v440/n7083/otmi/otmi-nature04614.xml&quot; hreflang=&quot;fr&quot;&gt;OTMI file&lt;/a&gt;, a machine-readable representation of the text.
(Technically, it's an Atom Entry document with various XML namespace extensions
that allow additional information to be included.) At the time of writing, the example
files for the test issue contain the following information:&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;pre&gt;
   * Bibliographic details (of the kind you might also find in table-of-contents &lt;a href=&quot;http://blog.isavoir.com/tag/RSS%20feeds&quot;&gt;RSS feeds&lt;/a&gt;)
   * Word vectors. That is, a list of all the words that appear in the article and the number of occurrences. (There's also a stop-word list of very common words that have been excluded.) This enables the construction of the most basic types of search index.
   * 'Snippets'. Basically sentences, presented out of order, which allows more sophisticated indexing and text mining (e.g., the kind that looks out for common constructions such as &amp;quot;A binds to B&amp;quot; or &amp;quot;X inhibits Y&amp;quot;), but not, of course, anything that looks across sentence boundaries.
&lt;/pre&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
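&lt;p&gt;To make the word-vector idea concrete, here is a minimal Python sketch. The tokenizer pattern and the stop-word list are my own assumptions for illustration, not the ones used in the OTMI example files, which declare their own.&lt;/p&gt;
&lt;pre&gt;
```python
import re
from collections import Counter

# Assumed for illustration only: a simple word pattern and a tiny stop-word list.
# A real OTMI file declares its own tokenizer expression and stop words.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")
STOP_WORDS = frozenset({"a", "the", "of", "to", "and", "in", "is"})


def word_vector(text):
    """Count lower-cased tokens in text, skipping stop words."""
    tokens = (match.group(0).lower() for match in WORD_PATTERN.finditer(text))
    return Counter(token for token in tokens if token not in STOP_WORDS)


sample = "Protein A binds to protein B. The protein X inhibits Y."
vector = word_vector(sample)

# Print each remaining word with its number of occurrences.
for word, count in sorted(vector.items()):
    print(word, count)
```
&lt;/pre&gt;
&lt;p&gt;A list like this is enough to build a basic search index, but it says nothing about which words appeared together in a sentence.&lt;/p&gt;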
&lt;p&gt;Note that for both words and sentences (actually quite hard concepts to
define in strict computational terms) the algorithms used to tokenize the text
are defined in the OTMI file using regular expressions, so anyone, or anything,
examining the file can in principle know exactly how the text was processed
to create the respective lists. Note also that the word vectors will usually be
redundant if you have the sentences, but the demo files include both
(and who knows, maybe it's useful to some people if they provide
both).&lt;/p&gt;
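&lt;p&gt;As a rough illustration of 'snippets', the Python sketch below splits text into sentences with a regular expression and returns them out of narrative order. The sentence pattern and the function name are assumptions of mine; the actual expressions are whatever each OTMI file declares.&lt;/p&gt;
&lt;pre&gt;
```python
import random
import re

# Assumed for illustration only: a sentence is a run of characters ending in
# '.', '!' or '?'. Trailing text without a terminator is dropped, and
# abbreviations such as "e.g." would be split incorrectly.
SENTENCE_PATTERN = re.compile(r"[^.!?]+[.!?]")


def snippets(text, seed=None):
    """Split text into sentences and return them in shuffled order."""
    sentences = [m.group(0).strip() for m in SENTENCE_PATTERN.finditer(text)]
    rng = random.Random(seed)  # a fixed seed makes the shuffle reproducible
    rng.shuffle(sentences)
    return sentences


sample = "A binds to B. X inhibits Y. C activates D."
for sentence in snippets(sample, seed=7):
    print(sentence)
```
&lt;/pre&gt;
&lt;p&gt;Each sentence stays intact, so patterns such as &amp;quot;X inhibits Y&amp;quot; can still be matched, while the shuffled order keeps the article from being read start to finish. A shuffle can occasionally return the original order, so a real publisher would probably want something stricter.&lt;/p&gt;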
&lt;p&gt;There are still a lot of things that could be improved here. For
example:&lt;/p&gt;
&lt;pre&gt;
  1. Allow for text from different sections of an article (e.g., abstract, figure legends) to be labelled as such.
  2. Allow for text to be presented in normal human-readable form for publishers who are willing to provide this.
  3. Add a list of cited articles, providing at least DOIs but perhaps other information too. This would, of course, open up the content to citation analysis.
  4. Add references to the &lt;a href=&quot;http://blog.isavoir.com/tag/OTMI&quot;&gt;OTMI&lt;/a&gt; files from the corresponding RSS feed items (and from the log-in page where content is access-controlled).
  5. Add references to a common stop-word list instead of repeating it in each OTMI file.
  6. Add rights information.
  7. Add references to associated data files and/or database entries.
  8. Provide an actual spec. ;)
&lt;/pre&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;They intend to make at least some of these changes (and perhaps others
besides) over the coming weeks, so expect the example files to change before
your eyes. There's also an even more basic issue around whether an Atom entry
document is the right starting point. For example, perhaps an RDF/XML format
would be more useful, at least to some people.&lt;/p&gt;
&lt;p&gt;The example of RSS shows how powerful a relatively simple common standard
can be when it comes to aggregating content from multiple sources (even when
it's messed up as badly as RSS ;). So maybe an approach like OTMI (or a better
one dreamt up by someone else) can help those who want to index and text-mine
scientific and other content. As with RSS, I think publishers might also come to
see this as a kind of advert for their content because it should help
interested readers to discover it. And on the basis that something is always
better than nothing, it also doesn't force publishers to give away the
human-readable form of their content: they can limit themselves to snippets or
even just word vectors if they want to.&lt;/p&gt;
&lt;p&gt;Frederic&lt;/p&gt;</description>
    
    
    
          <comments>http://blog.isavoir.com/post/2007/03/11/Open-Text-Mining-Interface#comment-form</comments>
      <wfw:comment>http://blog.isavoir.com/post/2007/03/11/Open-Text-Mining-Interface#comment-form</wfw:comment>
      <wfw:commentRss>http://blog.isavoir.com/feed/rss2/comments/87151</wfw:commentRss>
      </item>
    
</channel>
</rss>