NIF 2.0 Draft

NOTE: This is not an official W3C page; we are just adopting the style and will remove all mentions of W3C soon.

For general information, use cases and the rationale behind NIF, see the About page.

Abstract

The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. The core of NIF consists of a vocabulary, which can represent Strings as RDF resources. A special URI Scheme is used to pinpoint annotations to a part of a document. These URIs can then be used to attach annotations to the respective character sequence. Employing these URIs, annotations can be published on the Web as Linked Data and interchanged between NLP tools and applications.

This document, the specification of NIF 1.0, will remain mostly stable. The only changes will be clarifications, improvements to the readability of the text, corrections of spelling mistakes, or additional NLP domain vocabularies. Major changes will be collected on the NIF 2.0 Draft page and included in the next version of NIF.

Status of This Document

This section describes the status of this document at the time of its publication. This document is kept in a MediaWiki and the latest version can be found under this URI: http://wiki.nlp2rdf.org/wiki/NIF_2.0_Draft . Older revisions are available here: View history.

If you wish to make comments regarding this document, please send them to the mailing list (subscription form, archives). All feedback is welcome.

This document was published by the LOD2 FP7 EU Project as a Public Working Draft and will change frequently depending on the discussion on the mailing list. It is inappropriate to cite this document as other than work in progress. This document is intended to become a stable NIF 2.0 specification. It is published under the CC-BY-3.0 licence and may be used by anyone in any way they see fit.

NIF 1.0 in a nutshell

NIF Architecture

NIF consists of the following three components:

  1. Structural Interoperability: URI recipes are used to anchor annotations in documents with the help of fragment identifiers. The URI recipes are complemented by two ontologies (String Ontology and Structured Sentence Ontology), which are used to describe the basic types of these URIs (i.e. String, Document, Word, Sentence) as well as the relations between them (subString, superString etc.).
  2. Conceptual Interoperability: The Structured Sentence Ontology (SSO) was developed specifically to connect existing ontologies with the String Ontology and thus attach common annotations to the text fragment URIs. The NIF ontology can easily be extended and integrates several NLP ontologies such as OLiA for the morpho-syntactical NLP domains, the SCMS Vocabulary and DBpedia for entity linking, as well as the NERD Ontology (see below for details on the ontologies).
  3. Access Interoperability: A REST interface description for NIF components and web services allows NLP tools to interact on a programmatic level.

NIF-1.0 is stable and can be implemented. The experience and feedback gathered during implementation will be collected as NIF-2.0-draft. This specification is complemented by the information in this Wiki, which contains documentation on how to integrate NLP tools and adapt them to NIF. Reference implementations are also available and are listed under Implementations, and a tool for testing the web services is provided as well. An overview of the architecture can be found in the next section.

Structural Interoperability

Structural Interoperability is concerned with how the RDF is structured to represent annotations.

NIF-2.0 URI Schemes

This section is under discussion, please send feedback to the mailing list. Documents:

String Ontology

The String Ontology might be generalized into a fragment ontology. Drafts:

The String Ontology is a vocabulary for describing Strings and forms the foundation of the NLP Interchange Format. It has a class String and a property anchorOf to anchor URIs in a given text, and it describes the relations between these string URIs. The ontology can be found here: The available properties are descriptive. Only some properties are actually required for NIF. All others are optional, and using them by default is discouraged, as they cause a large number of logical assertions (there are many transitive properties). Overall, only these terms are primarily important:

Structured Sentence Ontology

The Structured Sentence Ontology (SSO) is built upon the String Ontology and additionally provides classes for three basic units: Sentences, Phrases and Words. Conceptual interoperability is ensured in NIF by providing ontologies and vocabularies for representing the actual annotations in RDF. For each NLP domain, a pre-existing vocabulary was chosen that serves the most common use cases and facilitates interoperability. Details are described elsewhere: part-of-speech tags and syntax use the Ontologies of Linguistic Annotation [1]; entity linking is realized using NERD [2]. Note that the property sso:oen, meaning 'one entity per name', is explained and formalized there.

The ontology can be found here:

Normative requirements

The URI recipes of NIF are designed to make it possible to have zero overhead and only use one triple per annotation, such as in the following example:

 ld:  .
 str:  .
 revyu:  .
ld:offset_14406_14418 rev:hasComment "Hey Tim, good idea that Semantic Web!" . 

Some additional triples are added, however, to ease programmatic access.

1. All URIs created by the above-mentioned URI recipes should be typed with the respective OWL class for the recipe (str:OffsetBasedString or str:ContextHashBasedString).

This produces one additional triple per generated URI:

 ld:  .
 str:  .
 revyu:  .
ld:offset_14406_14418 rev:hasComment "Hey Tim, good idea that Semantic Web!" . 
ld:offset_14406_14418 rdf:type str:OffsetBasedString .

2. In each returned NIF model there should be at least one URI that relates to the document as a whole and either references the page with the property str:sourceUrl or includes the whole text of the document with str:sourceString.

The definition of Document in NIF is closely tied to the request issued to annotate it. So each piece of text that is sent to a service is treated as a document. This produces three additional triples per Document (or request).

 ld:  .
 str:  .
ld:offset_0_25482 rdf:type str:OffsetBasedString . 
ld:offset_0_25482 rdf:type str:Document . 
ld:offset_0_25482 str:sourceUrl  . 

3. For each document in a NIF model, all other strings that use NIF URIs should be reachable via the subStringTrans property by inference. Either str:subString or a sub property thereof must be used for this.

Programmatically, it might be required to iterate over the strings contained in the document. This means that they have to be connected via a property. To meet this requirement, either str:subString must be added between the URIs (where appropriate) or another property that is an rdfs:subPropertyOf str:subString. Examples of such sub properties are sso:word, sso:firstWord, sso:lastWord and sso:child.

 ld:  .
 str:  .
ld:offset_0_25482 rdf:type str:Document . 
ld:offset_0_25482 str:subString ld:offset_14406_14418 .
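
The three requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the function names `offset_name` and `minimal_nif_triples` are hypothetical, the output reuses the prefixed names from the examples above (prefix declarations omitted), and the source URL is a placeholder.

```python
def offset_name(begin, end):
    """Local name of an offset-based URI, e.g. offset_14406_14418."""
    return f"offset_{begin}_{end}"


def minimal_nif_triples(doc_length, source_url, spans, prefix="ld:"):
    """Return Turtle-style triples covering requirements 1 to 3.

    spans is a list of (begin, end) character offsets inside the document.
    """
    doc = prefix + offset_name(0, doc_length)
    triples = [
        # Requirement 1: type every generated URI with its recipe class.
        f"{doc} rdf:type str:OffsetBasedString .",
        # Requirement 2: one URI stands for the document as a whole.
        f"{doc} rdf:type str:Document .",
        f"{doc} str:sourceUrl <{source_url}> .",
    ]
    for begin, end in spans:
        sub = prefix + offset_name(begin, end)
        triples.append(f"{sub} rdf:type str:OffsetBasedString .")
        # Requirement 3: make each string reachable from the document.
        triples.append(f"{doc} str:subString {sub} .")
    return triples


# Placeholder URL, used for illustration only.
for line in minimal_nif_triples(25482, "http://example.org/doc.html", [(14406, 14418)]):
    print(line)
```

Each additional annotated string adds two triples here (its type and the str:subString link), on top of the three triples per document.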

Design Choices and Future Work

This section explains the rationale behind the design choices of this section. Discussion and proposals are vital and possible via the comment section below or as written on the Get Involved page.

  • The additional brackets “(” and “)” in the Context-Hash-based URIs were introduced to make the hash more distinguishable. If there is a sentence “The dog barks.” and the context size is too big, e.g. 10, then “The”, “dog” and “barks” would have the same hash. Compare md5("The dog barks.") vs. md5("The (dog) barks.") vs. md5("The dog (barks).")
  • Note that Context-Hash-based URIs are unique identifiers of a specific string only if the context size is chosen sufficiently large. If, for example, a complete sentence is repeated in the document, parts of the preceding and/or subsequent sentences have to be included to make the reference to the string unique. In many NLP applications, however, a unique reference to a specific string is not necessary; rather, all word forms within the same minimal context (e.g., one preceding and one following word) are required to be analysed in the same way. A Context-Hash-based URI then refers uniquely to a word class, not to one specific string. Using a small context, a programmer can therefore refer to a whole class of words rather than an individual word, e.g., she can assign every occurring string “the” (with one preceding and one following white space as context) the part-of-speech tag “DT”: md5(“ (the) ”). The resulting URI is http://www.w3.org/DesignIssues/LinkedData.html#hash_md5_1_5_8dc0d6c8afa469c52ac4981011b3f582_the
  • The two ontologies should not be included in the NIF output by default (via owl:imports). Some axioms, especially the transitive properties, will produce a lot of entailed triples. It is generally better to write some additional client code that adds the required annotations, such as nextWordTrans or sub- and superString, than to infer them with a reasoner such as Pellet. SPARQL CONSTRUCT is the way to go in this case.

    CONSTRUCT {?word1 sso:nextWordTrans ?word2}
    WHERE {?word1 sso:nextWord ?word2}

    executed once and

    CONSTRUCT {?word1 sso:nextWordTrans ?word3}
    WHERE {?word1 sso:nextWordTrans ?word2 . ?word2 sso:nextWord ?word3}

    executed N-1 times (N is the number of words of the longest sentence), each time adding the result to the model, serves the same purpose and will be a lot faster.

  • At the moment only two URI Recipes were included, which operate on text in general. Text is defined as either a sequence of characters or simply everything that can be opened in a text editor. This also entails that HTML, XML, Source Code, CSS and everything else (except e.g. binary formats) can be addressed with NIF-URIs. This is the most general use case and in the future more content-specific URI Recipes such as XPath/XPointer for XML might be included.
  • The explicit typing of the URIs with the class OffsetBasedString is useful for determining the type of the URI before parsing it. Of course, the explicit typing is redundant, because the information is already included in the URI. This redundancy might be removed in the future.
  • The compatibility problem with RFC 5147 originates in the dilemma that fragment identifiers of URIs are normally media-type specific. RFC 5147 is valid for plain text (i.e. text without markup), which is disjoint from HTML. Achieving full compatibility comes with additional (unnecessary) ballast, while providing hardly any advantages for NLP tools. See the full discussion here. Final email.
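
The effect of the brackets in Context-Hash-based URIs can be checked with a short Python sketch. It only illustrates the bracket argument made above and does not reproduce the full URI recipe: the helper names are hypothetical, the hash input is assumed to be left context + "(" + string + ")" + right context, and the context length is chosen so that it covers the whole sentence.

```python
import hashlib


def md5_hex(message):
    """Hex digest of the UTF-8 encoded message."""
    return hashlib.md5(message.encode("utf-8")).hexdigest()


def context_parts(text, begin, end, context_length):
    """Split text into left context, anchored string and right context."""
    left = text[max(0, begin - context_length):begin]
    right = text[end:end + context_length]
    return left, text[begin:end], right


def hash_with_brackets(text, begin, end, context_length):
    left, string, right = context_parts(text, begin, end, context_length)
    return md5_hex(f"{left}({string}){right}")


def hash_without_brackets(text, begin, end, context_length):
    left, string, right = context_parts(text, begin, end, context_length)
    return md5_hex(left + string + right)


sentence = "The dog barks."
words = {"The": (0, 3), "dog": (4, 7), "barks": (8, 13)}
size = len(sentence)  # assumption: context large enough to span the sentence

plain = {w: hash_without_brackets(sentence, b, e, size) for w, (b, e) in words.items()}
bracketed = {w: hash_with_brackets(sentence, b, e, size) for w, (b, e) in words.items()}

print(len(set(plain.values())))      # 1: all three words collide
print(len(set(bracketed.values())))  # 3: the brackets separate them
```

Without brackets, every word hashes the same message "The dog barks."; with brackets, the three messages differ in where the brackets sit.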

Conceptual Interoperability

Conceptual Interoperability is concerned with which ontologies and vocabularies are used to represent the actual annotations in RDF. We divided the different outputs of tools into different NLP domains. For each domain a vocabulary was or will be chosen that serves the most common use cases and facilitates interoperability. In simple cases a property has been created in the Structured Sentence Ontology. In the more complex cases fully developed linguistic ontologies already existed and were reused. This part of the specification is also extended on the fly, i.e. everything in this section is stable (backward-compatible), but additional NLP domains might be appended at any time. New NLP domains can be requested or proposed here.

Golden rule of conceptual interoperability

Here is the golden rule of interoperability with respect to a reference ontology:

Alongside the local annotations used by the tool, minimal information must be added to be able to disambiguate the local annotations with the help of a reference ontology for data integration and interoperability.

NLP Domains

Begin/end index and context

Adding extra triples for indexes and context produces redundant information as the begin and end index as well as the context can be calculated from the URI. Properties for describing context and begin and end index can be found in the String Ontology.

 ld:  .
 str:  .
 xsd:  .
ld:offset_14406_14418 str:beginIndex "14406"^^xsd:int  .
ld:offset_14406_14418 str:endIndex "14418"^^xsd:int  .
ld:offset_14406_14418 str:leftContext " it " .
ld:offset_14406_14418 str:leftContext4 " it " .
ld:offset_14406_14418 str:beginIndex ".
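
As a sketch of why these triples are redundant, the following Python snippet recomputes the indices, the anchored string and the left context from an offset-based URI and the document text. It assumes zero-based offsets with an exclusive end index (consistent with the 12-character span 14406 to 14418 used above for "Semantic Web"); the function names and the example URI are hypothetical.

```python
import re

OFFSET_PATTERN = re.compile(r"offset_(\d+)_(\d+)$")


def parse_offset_uri(uri):
    """Extract (begin, end) from an offset-based URI such as ...#offset_24_36."""
    match = OFFSET_PATTERN.search(uri)
    if match is None:
        raise ValueError(f"not an offset-based URI: {uri}")
    return int(match.group(1)), int(match.group(2))


def describe(uri, text, context_length=4):
    """Recompute index and context values from the URI and the document text."""
    begin, end = parse_offset_uri(uri)
    return {
        "beginIndex": begin,
        "endIndex": end,
        "anchorOf": text[begin:end],
        "leftContext": text[max(0, begin - context_length):begin],
    }


text = "Hey Tim, good idea that Semantic Web!"
info = describe("http://example.org/doc#offset_24_36", text)
print(info["anchorOf"], repr(info["leftContext"]))
```

A client that has the document text at hand can therefore skip these triples and derive the values on demand.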

Lemmata, stems, stop words, etc.

Lemma and stem annotations are realized as simple data type properties in the Structured Sentence Ontology. A class is used for stop words (sso:StopWord).

 ld:  .
 sso:  .
ld:offset_14406_14414 sso:stem "Semant"  .
ld:offset_14406_14414 sso:lemma "Semantic"  .
ld:offset_14406_14414 a sso:StopWord  .

Part of speech tags

Annotations for part of speech tags must make use of OLiA. OLiA connects local annotation tag sets with a global reference ontology. It therefore allows the specific part of speech tag to be kept at a fine granularity, while at the same time providing a coarse-grained reference model. The RDF output must contain the original tag as a plain literal using the property sso:posTag, as well as a link to the respective OLiA individual in the annotation model for the tag set in use, using the property sso:oliaLink. Here is an example using the Penn tag set:

 ld:  .
 sso:  .
 penn:  .
ld:offset_14406_14414 sso:posTag "JJ"  .
ld:offset_14406_14414 sso:oliaLink penn:JJ  .

The extended RDF output can additionally contain the following:

  1. all classes of the OLiA reference ontology (especially from olia.owl and olia-top.owl) that belong to this individual can be copied and added to the str:String via rdf:type
  2. the rdfs:subClassOf axioms between these classes must be copied as well.

All necessary information has to be copied for conciseness and self-containedness. If information is copied in this way, it is unnecessary for the client to include the three OLiA reference ontologies. In total, these three models amount to almost 1000 OWL classes. Here is an example of the extended output:

 ld:  .
 sso:  .
 penn:  .
 olia-top:  .
 olia:  .
ld:offset_14406_14414 sso:posTag "JJ"  .
ld:offset_14406_14414 sso:oliaLink penn:JJ  .
ld:offset_14406_14414 rdf:type olia:Adjective .
ld:offset_14406_14414 rdf:type olia-top:MorphosyntacticCategory .
ld:offset_14406_14414 rdf:type olia-top:LinguisticConcept .
olia:Adjective rdfs:subClassOf olia-top:MorphosyntacticCategory .
olia-top:MorphosyntacticCategory rdfs:subClassOf olia-top:LinguisticConcept .
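
A minimal Python sketch of how a component might assemble the basic and the extended output for one string is given below. The function name is hypothetical, and the class chain is passed in by hand (taken from the example above) rather than looked up in OLiA.

```python
def extended_pos_output(subject, tag, olia_individual, class_chain):
    """Emit basic and extended part-of-speech triples for one string.

    class_chain lists the OLiA classes of the individual from the most
    specific to the most general one.
    """
    triples = [
        f'{subject} sso:posTag "{tag}" .',
        f"{subject} sso:oliaLink {olia_individual} .",
    ]
    # Extended output, step 1: copy every class onto the string via rdf:type.
    triples += [f"{subject} rdf:type {cls} ." for cls in class_chain]
    # Extended output, step 2: copy the subClassOf axioms between the classes.
    triples += [
        f"{sub} rdfs:subClassOf {sup} ."
        for sub, sup in zip(class_chain, class_chain[1:])
    ]
    return triples


chain = [
    "olia:Adjective",
    "olia-top:MorphosyntacticCategory",
    "olia-top:LinguisticConcept",
]
for line in extended_pos_output("ld:offset_14406_14414", "JJ", "penn:JJ", chain):
    print(line)
```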

Currently, there are some small disadvantages, which might be fixed in the next NIF version:

  • The ontologies might still change a little now and then and are unversioned. In the next NIF version, we will provide stable releases.
  • The semantics are not exactly modelled as instances of str:String are intuitively not instances of olia-top:LinguisticConcept. Normally you would model these classes as disjoint, but the amount of RDF triples is reduced the way it is currently done. Note that this saves an extra pattern in each SPARQL query.
  • Also, for the sake of conciseness, subClassOf axioms from OLiA are repeated. This might formally be considered ontology hijacking. Again, it is more concise than importing 1000 OWL classes.

Syntax Analysis

in development

Topic Models

An ontology draft for Topic models is available here:

Named Entity Recognition and Entity Linking

For describing Named Entity Recognition (NER), NIF uses one property from the SCMS Vocabulary. For entity linking, the property scms:means must be used.

 ld:  .
 scms:     .
 dbpedia:     .
ld:offset_22849_22852 scms:means dbpedia:World_Wide_Web_Consortium .

For entity typing, the Named Entity Recognition and Disambiguation Ontology (http://nerd.eurecom.fr/ontology/) must be used alongside the local annotation types. Here are three examples for Spotlight, Zemanta and Extractiv to showcase how important a reference ontology is, even for such a trivial annotation as organization. Note the three different spellings “Organisation”, “organization” and “ORGANIZATION”.

 dbo:     .
 nerd:     .
ld:offset_22849_22852 rdf:type dbo:Organisation .
ld:offset_22849_22852 rdf:type nerd:Organization .

 zemanta:      .
 nerd:     .
ld:offset_22849_22852 rdf:type zemanta:organization .
ld:offset_22849_22852 rdf:type nerd:Organization .

 extractiv:     .
 nerd:     .
ld:offset_22849_22852 rdf:type extractiv:ORGANIZATION .
ld:offset_22849_22852 rdf:type nerd:Organization .
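
The benefit of the shared reference type can be sketched as follows. The snippet is illustrative only: the tool labels and the tuple layout are assumptions made for this example, and the type names are the prefixed names from the three outputs above.

```python
# (tool, subject, type) tuples mirroring the three example outputs.
annotations = [
    ("spotlight", "ld:offset_22849_22852", "dbo:Organisation"),
    ("spotlight", "ld:offset_22849_22852", "nerd:Organization"),
    ("zemanta", "ld:offset_22849_22852", "zemanta:organization"),
    ("zemanta", "ld:offset_22849_22852", "nerd:Organization"),
    ("extractiv", "ld:offset_22849_22852", "extractiv:ORGANIZATION"),
    ("extractiv", "ld:offset_22849_22852", "nerd:Organization"),
]


def tools_agreeing_on(annotations, subject, reference_type):
    """Names of the tools that assign reference_type to subject."""
    return sorted(
        {tool for tool, s, t in annotations if s == subject and t == reference_type}
    )


# One query against the reference ontology covers all three tools ...
print(tools_agreeing_on(annotations, "ld:offset_22849_22852", "nerd:Organization"))
# ... whereas a local type only matches the tool that defines it.
print(tools_agreeing_on(annotations, "ld:offset_22849_22852", "dbo:Organisation"))
```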

Access Interoperability

Most aspects of access interoperability are already tackled by using the RDF standard. This section contains several smaller additions, which further improve the interoperability and accessibility of NIF Components. They are normative, but of minor importance compared to Structural and Conceptual Interoperability. While the first two allow general interoperability, Access Interoperability provides easier integration and off-the-shelf solutions.

Workflow

NIF workflow

NIF itself is a format, which can be used for import and export of data from and to NLP tools. NIF therefore makes it possible to create ad-hoc workflows following a client-server model or the SOA principle. In this approach, the client is responsible for implementing the workflow. The diagram below shows the communication model. The client sends requests to the different tools either as text or RDF and then receives an answer in RDF. This RDF can be aggregated into a local RDF model. External data in RDF can also be requested and added transparently, without any additional formalism. For acquiring and merging external data from knowledge bases, the wealth of existing RDF techniques (such as Linked Data or SPARQL) can be used.

Interfaces

Currently the main interface is a wrapper that provides the NIF Web service. Other interfaces, such as CLI or a Java interface (using Jena) are easily possible and can be provided in addition to a Web service. The Web service must:

  1. be stateless
  2. treat POST and GET the same
  3. accept the parameters in the next section

The component may have any number of additional parameters to configure and fine tune results.

Parameters

Note that the parameters must be URL-encoded before they are sent to a Web service. These parameters must be used for each request (required):

  • input-type = "text" | "nif-owl" Determines the content required for the next parameter input, either plain text or RDF/XML in NIF
  • input = text | rdf/xml Either the text or RDF in XML format in NIF

These parameters can be used for each request (optional):

  • nif = "true" | "nif-1.0" If the application already has a web service, use this parameter to enable NIF output (this allows NIF to be developed in parallel to existing output, i.e. legacy support). It is recommended for a client to always send this with each request.
  • format = "rdfxml" | "ntriples" | "turtle" | "n3" | "json" The RDF serialisation format. Standard RDF frameworks generally support all the different formats.
  • prefix = uriprefix A prefix, which is used to create any URIs. The client should therefore ensure that the prefix is valid when used at the start of a URI, e.g. http://test.de/test#. If input-type="nif-owl" was used, the server must not change existing URIs and must only use the prefix when creating new URIs. It is recommended that the client matches the prefix to the previously used prefix. If the parameter is missing, it should be substituted by a sensible default (e.g. the web service URI).
  • urirecipe = "offset" | "context-hash" The URI recipe that should be used (default is "offset").
  • context-length = integer If the given URI recipe is context-hash, the client can use this parameter to determine the length of the context. The default must be 10.
  • debug = "true" | "false" This option determines whether additional debug messages and errors should be written within the RDF. See also the next section.
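
A client-side sketch of URL-encoding these parameters in Python follows. The endpoint URL and the function name are placeholders invented for this example; only the parameter names and values come from the lists above.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical endpoint; replace with the URL of an actual NIF Web service.
SERVICE_URL = "http://example.org/nif-service"


def build_query(text, prefix="http://test.de/test#", urirecipe="offset", fmt="turtle"):
    """URL-encode the required and some of the optional NIF parameters."""
    params = {
        "input-type": "text",    # required
        "input": text,           # required
        "nif": "true",           # optional, recommended for clients
        "format": fmt,           # optional
        "prefix": prefix,        # optional
        "urirecipe": urirecipe,  # optional, default is "offset"
    }
    return urlencode(params)


query = build_query("Hey Tim, good idea that Semantic Web!")
print(f"{SERVICE_URL}?{query}")
```

Since the Web service must treat POST and GET the same, the encoded string can be appended to the URL as shown or sent as the body of a POST request.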

Errors

For easier debugging, errors and messages have to be written as RDF output. If a fatal error occurs (i.e. the component was unable to produce any output), the RDF output should contain a message using the underspecified error ontology provided by NLP2RDF. Errors can be given any URI or blank node, must be typed as error:Error, must specify whether they are fatal or not, and must contain an error message. Here is an example:

 error:     .
ld:error_1 rdf:type error:Error .
ld:error_1 error:fatal "1"^^xsd:boolean .
ld:error_1 error:hasMessage  "Wrong input parameter ..., could not parse ontology" .

If the debug parameter is set, the component is allowed to mix error messages with the RDF that contains the annotations:

 error:     .
ld:error_1 rdf:type error:Error .
ld:error_1 error:fatal "0"^^xsd:boolean .
ld:error_1 error:hasMessage  "Sentences 5 and 6 had a low confidence value for sentence splitting." .
ld:error_2 rdf:type error:Error .
ld:error_2 error:fatal "0"^^xsd:boolean .
ld:error_2 error:hasMessage  "Could not add extended output for OLiA." .
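
On the server side, producing such error triples could look like the following Python sketch. The function name is hypothetical, and the output reuses the prefixed names from the examples above without declaring the prefixes.

```python
def error_triples(index, message, fatal, prefix="ld:"):
    """Format one error as triples using the error vocabulary shown above."""
    subject = f"{prefix}error_{index}"
    flag = "1" if fatal else "0"
    # Escape backslashes and double quotes so the literal stays well-formed.
    escaped = message.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f"{subject} rdf:type error:Error .",
        f'{subject} error:fatal "{flag}"^^xsd:boolean .',
        f'{subject} error:hasMessage "{escaped}" .',
    ]


for line in error_triples(2, "Could not add extended output for OLiA.", fatal=False):
    print(line)
```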

References

  1. C. Chiarcos. Ontologies of linguistic annotation: Survey and perspectives. In LREC. European Language Resources Association, 2012.
  2. G. Rizzo, R. Troncy, S. Hellmann, and M. Bruemmer. NERD meets NIF: Lifting NLP extraction results to the linked data cloud. In LDOW, 2012.