ITS2NIF2ITS

From NLP2RDF-Wiki

General note

Some notes on the round-trip conversion from ITS to NIF and back. The usefulness of this procedure is described in this use case; below we describe the algorithms that perform the transformation and give working examples. The problem can be described as follows: ITS selects element and attribute nodes of the HTML5 or XML document as targets for metadata and annotations. Many NLP tools, however, either accept only text as input or strip all tags and work on plain text internally. For a general solution, we therefore need to transform the HTML5/XML to plain text beforehand and create a mapping from the text segments to the DOM. As such a mapping can never be perfect due to the impedance mismatch between the data structures, we will state any known limitations here.

Note: I am not aware of any NLP tool annotating the DOM directly. Please let me know if you know of any. SebastianHellmann (talk) 11:08, 6 August 2012 (CEST)

Notes on ITS/RDF ontology

  • the ontology will most likely be included in NIF to standardize NLP tools using NIF wrappers
  • open: Name and namespace
  • a list of problematic ITS properties might be found below

Notes on ITS2NIF

  • works for HTML5 or XML to NIF in general
  • this algorithm is based on the assumption that concatenating the text selected by the XPath expressions produces the plain text of the document in the correct order
  • open issue (requires input from an ITS expert): any ITS rule needs to be expanded and attached to the respective elements.
  • ITS attribute metadata is currently not included and might need to be treated separately
  • otherwise the transformation should be lossless, i.e. ITS metadata should be completely mappable to NIF and RDF.

Notes on NIF2ITS

  • We assume here that we still have the mapping from the text segments to the DOM from the previous step (ITS2NIF).
  • If annotations in NIF are too complex (e.g. overlapping or alternative candidates), conversion to ITS might fail. This might not be relevant in practice, as (1) it might not occur frequently and (2) workarounds might be found, but it is not yet well researched.
  • The NIF2ITS transition works very well as long as the annotations produced by the NLP web service match the text segments of the mapping. More research is needed for the other cases.

Notes on NIF 2.0

  • NIF 2.0 is not final yet and things might still change.
  • One change might be the grounding of identifiers on RFC5147 as discussed here

ITS2NIF

Example

It is always good to have an accompanying running example:

<html>
  <body>
    <h2 translate = "yes" >Welcome to <span 
        its-disambig-ident-ref = "http://dbpedia.org/resource/Dublin"
        translate = "no" >Dublin</span> in <b translate="no">Ireland</b>!
    </h2>
  </body>
</html>

Algorithm

Preprocessing: The algorithm is intended to extract the text from the XML/HTML/DOM for an NLP tool. Excessive whitespace in the input can produce a lot of "phantom" predicates, which (1) increases the size of the intermediate mapping and (2) extracts this whitespace as text, which might decrease NLP performance. We recommend normalizing whitespace in the input XML/HTML/DOM in order to minimize such phantom predicates. The optimized example looks like this (TODO add link to optimization algorithm, e.g. HTML-safe whitespace rules):

<html><body><h2 translate = "yes" >Welcome to <span 
   its-disambig-ident-ref = "http://dbpedia.org/resource/Dublin"
   translate = "no">Dublin</span> in <b translate="no">Ireland</b>!</h2></body></html>
  1. Get an ordered list of all text nodes of the XML document.
  2. Generate an XPath expression for each non-empty text node of all leaf elements and remember it.
  3. Get the text of each node and form a tuple (X,T) with its XPath expression.
  4. If all ITS annotations are supposed to be transferred to RDF, get all annotated non-leaf elements and map them to an ordered list of their sub-nodes (this feature is for variant (b) of the use case).
  5. Since the text nodes have a certain order, we now have an ordered list of tuples ((x0,t0), (x1,t1), ..., (xn,tn)).
  6. (optional) The list with the XPath-to-text mapping can be kept in memory. In case this mapping has to be serialized, we offer two serialization formats (XML and RDF):
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
<http://example.com/exampledoc.html#xpath(x0)>
    itsrdf:xpath2nif <http://example.com/exampledoc.html#offset_b0_e0> .
<http://example.com/exampledoc.html#xpath(x1)>
    itsrdf:xpath2nif <http://example.com/exampledoc.html#offset_b1_e1> .
# ...
<http://example.com/exampledoc.html#xpath(xn)>
    itsrdf:xpath2nif <http://example.com/exampledoc.html#offset_bn_en> .
>
   x="xpath(x0)" b="b0" e="e0" />
   x="xpath(x1)" b="b1" e="e1" />
   
   x="xpath(xn)" b="bn" e="en" />
>

where

b0 = 0
e0 = b0 + (number of characters of t0)
b1 = e0
e1 = b1 + (number of characters of t1)
...
bn = e(n-1)
en = bn + (number of characters of tn)
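
The extraction and offset steps can be sketched in Python. This is a minimal illustration rather than a reference implementation: it assumes well-formed XML input without namespaces or comments, uses the standard library `xml.etree.ElementTree`, counts characters as Python code points, and lets each segment begin where the previous one ends, as in the running example. The names `text_nodes`, `build_mapping`, `EXAMPLE` and `MAPPING` are made up for this sketch.

```python
import xml.etree.ElementTree as ET


def text_nodes(elem, path):
    """Yield (xpath, text) for every text node below elem, in document order."""
    text_index = 0
    if elem.text:  # text before the first child element
        text_index += 1
        yield f"{path}/text()[{text_index}]", elem.text
    seen = {}  # per-tag counters for positional predicates such as span[1]
    for child in elem:
        seen[child.tag] = seen.get(child.tag, 0) + 1
        yield from text_nodes(child, f"{path}/{child.tag}[{seen[child.tag]}]")
        if child.tail:  # text following the child belongs to the parent
            text_index += 1
            yield f"{path}/text()[{text_index}]", child.tail


def build_mapping(xml_source):
    """Return [(xpath, begin, end), ...] for all text nodes of the document."""
    root = ET.fromstring(xml_source)
    mapping, begin = [], 0
    for xpath, text in text_nodes(root, f"/{root.tag}"):
        end = begin + len(text)
        mapping.append((xpath, begin, end))
        begin = end  # the next segment starts where this one ends
    return mapping


# The whitespace-normalized running example.
EXAMPLE = (
    '<html><body><h2 translate="yes">Welcome to <span '
    'its-disambig-ident-ref="http://dbpedia.org/resource/Dublin" '
    'translate="no">Dublin</span> in <b translate="no">Ireland</b>!'
    "</h2></body></html>"
)
MAPPING = build_mapping(EXAMPLE)
for xpath, begin, end in MAPPING:
    print(xpath, begin, end)
```

Annotated non-leaf elements (step 4) are not covered by this sketch; they would need an additional entry spanning the offsets of all their sub-nodes.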

Example (continued)

@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
# "Welcome to "
<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/text()[1])>
    itsrdf:nif <http://example.com/exampledoc.html#offset_0_11> .
# "Dublin"
<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/span[1]/text()[1])>
    itsrdf:nif <http://example.com/exampledoc.html#offset_11_17> .
# " in "
<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/text()[2])>
    itsrdf:nif <http://example.com/exampledoc.html#offset_17_21> .
# "Ireland"
<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/b[1]/text()[1])>
    itsrdf:nif <http://example.com/exampledoc.html#offset_21_28> .
# "!"
<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/text()[3])>
    itsrdf:nif <http://example.com/exampledoc.html#offset_28_29> .
# "Welcome to Dublin in Ireland!"
<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/text())>
    itsrdf:nif <http://example.com/exampledoc.html#offset_0_29> .
>
   x="xpath(/html/body[1]/h2[1]/text()[1])" b="0" e="11" />
   x="xpath(/html/body[1]/h2[1]/span[1]/text()[1])" b="11" e="17" />
   x="xpath(/html/body[1]/h2[1]/text()[2])" b="17" e="21" />
   x="xpath(/html/body[1]/h2[1]/b[1]/text()[1])" b="21" e="28" />
   x="xpath(/html/body[1]/h2[1]/text()[3])" b="28" e="29" />
   x="xpath(/html/body[1]/h2[1])" b="0" e="29" />
>

9. Generate RDF and NIF in the following manner:
9.1. Create a context URI and attach the whole concatenated text as reference.
9.2. Now attach any ITS metadata items from the XML to the respective NIF URIs using the ITS/RDF ontology (TODO Name).
9.3. Omit all irrelevant URIs (those that do not carry annotations; they would only bloat the data).

Example (continued):

@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
# TODO we might be able to encode provenance here
<http://example.com/exampledoc.html#offset_0_29>
    rdf:type             str:Context ;
# concatenate the whole text
    str:isString         "$(t0+t1+t2+...+tn)" ;
    itsrdf:translate     "yes"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo> ;
    str:occursIn         <http://example.com/exampledoc.html> .
<http://example.com/exampledoc.html#offset_11_17>
    rdf:type             str:String ;
    itsrdf:translate     "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo> ;
    itsrdf:disambigIdentRef  <http://dbpedia.org/resource/Dublin> ;
    str:referenceContext <http://example.com/exampledoc.html#offset_0_29> .
<http://example.com/exampledoc.html#offset_21_28>
    rdf:type             str:String ;
    itsrdf:translate     "no"^^<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo> ;
    str:referenceContext <http://example.com/exampledoc.html#offset_0_29> .
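
The generation step (9.1 to 9.3) might be sketched as follows. This is only an illustration under several assumptions: the helper names (`turtle_literal`, `offset_uri`, `format_block`, `to_nif`) are invented here, ITS metadata is passed in as ready-made Turtle predicate/object strings, the prefixes `rdf:`, `str:` and `itsrdf:` are assumed to be declared elsewhere, and the property names are simply those used in the example above (NIF 2.0 is not final).

```python
def turtle_literal(text):
    """Quote a string as a Turtle literal (escapes only backslash, quote, newline)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def offset_uri(doc_uri, begin, end):
    return f"<{doc_uri}#offset_{begin}_{end}>"


def format_block(subject, pairs):
    body = " ;\n".join(f"    {pred} {obj}" for pred, obj in pairs)
    return f"{subject}\n{body} ."


def to_nif(doc_uri, text, context_props, segments):
    """segments: list of (begin, end, {predicate: object}) in document order."""
    context = offset_uri(doc_uri, 0, len(text))
    # 9.1: the context URI carries the whole concatenated text.
    pairs = [("rdf:type", "str:Context"), ("str:isString", turtle_literal(text))]
    pairs += sorted(context_props.items())
    pairs.append(("str:occursIn", f"<{doc_uri}>"))
    blocks = [format_block(context, pairs)]
    for begin, end, props in segments:
        if not props:
            continue  # 9.3: skip URIs that carry no annotations
        # 9.2: attach the ITS metadata to the offset-based URI.
        pairs = [("rdf:type", "str:String")] + sorted(props.items())
        pairs.append(("str:referenceContext", context))
        blocks.append(format_block(offset_uri(doc_uri, begin, end), pairs))
    return "\n".join(blocks)


YES_OR_NO = "<http://www.w3.org/TR/its-2.0/its.xsd#yesOrNo>"
NIF = to_nif(
    "http://example.com/exampledoc.html",
    "Welcome to Dublin in Ireland!",
    {"itsrdf:translate": f'"yes"^^{YES_OR_NO}'},
    [
        (0, 11, {}),
        (11, 17, {
            "itsrdf:translate": f'"no"^^{YES_OR_NO}',
            "itsrdf:disambigIdentRef": "<http://dbpedia.org/resource/Dublin>",
        }),
        (17, 21, {}),
        (21, 28, {"itsrdf:translate": f'"no"^^{YES_OR_NO}'}),
        (28, 29, {}),
    ],
)
print(NIF)
```

Building the output by string formatting keeps the sketch dependency-free; a real wrapper would more likely use an RDF library and proper literal escaping.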

NIF2ITS

Example (continued)

We choose DBpedia Spotlight here, because a wrapper implementation for NIF 2.0 has already been started and DBpedia Spotlight will host one of the reference implementations for NIF 2.0.

For this example let's assume DBpedia Spotlight linked "Ireland" to DBpedia:

<http://example.com/exampledoc.html#offset_21_28>
    rdf:type                 str:String ;
    itsrdf:disambigIdentRef  <http://dbpedia.org/resource/Ireland> .
<http://dbpedia.org/resource/Ireland>
    rdf:type                 <http://nerd.eurecom.fr/ontology#Country> .

Algorithm

  1. Send the text to any NIF web service.
  2. Since NIF 2.0 will most likely use the itsrdf ontology, the output will be directly compatible.
  3. Use the mapping from ITS2NIF to reintegrate the annotations into the original ITS-annotated document. Three cases can occur.

Case 1: NLP Annotation matches text segment exactly

Solution: Attach the annotation to the parent element of the text node

Example (continued)

# based on:
<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/b[1]/text()[1])>
   itsrdf:nif <http://example.com/exampledoc.html#offset_21_28> .
# and:
<http://example.com/exampledoc.html#offset_21_28>
    itsrdf:disambigIdentRef  <http://dbpedia.org/resource/Ireland> .
# we can attach the metadata to the parent node:
# TODO there is definitely more ITS metadata we can extract from RDF (e.g. entitytype)
<b its-disambig-ident-ref="http://dbpedia.org/resource/Ireland"
   translate="no">Ireland</b>

Case 2: NIF annotation is substring of a text node

Solution: Create a new element, e.g. "span" for HTML5. A different example is used here, as this case is not covered by the running example.

# Input:

<html>
  <body>
     <h2 >Welcome to Dublin in Ireland!</h2>
  </body>
</html>

# ITS2NIF

<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1]/text()[1])>
    itsrdf:nif <http://example.com/exampledoc.html#offset_0_29> .

# DBpedia Spotlight returns:

<http://example.com/exampledoc.html#offset_21_28>
    itsrdf:disambigIdentRef  <http://dbpedia.org/resource/Ireland> .

# NIF2ITS

<html>
  <body>
     <h2 >Welcome to Dublin in <span
          its-disambig-ident-ref="http://dbpedia.org/resource/Ireland" >Ireland</span>!</h2>
  </body>
</html>
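
A rough Python sketch of this case, using the standard library `xml.etree.ElementTree`: the function name `wrap_in_span` and its parameters are invented for illustration, and the sketch only handles an annotation that falls inside the first text node of the parent element (the situation in the example above).

```python
import xml.etree.ElementTree as ET


def wrap_in_span(parent, seg_begin, ann_begin, ann_end, attrs):
    """Wrap the annotated substring of parent's first text node in a new <span>.

    seg_begin is the document offset at which the text node starts;
    ann_begin and ann_end are the document offsets of the NIF annotation.
    """
    text = parent.text or ""
    start, end = ann_begin - seg_begin, ann_end - seg_begin
    if not 0 <= start < end <= len(text):
        raise ValueError("annotation is not a substring of this text node")
    span = ET.Element("span", attrs)
    span.text = text[start:end]   # the annotated substring
    span.tail = text[end:]        # whatever followed it in the text node
    parent.text = text[:start]    # whatever preceded it
    parent.insert(0, span)
    return span


root = ET.fromstring(
    "<html><body><h2>Welcome to Dublin in Ireland!</h2></body></html>"
)
h2 = root.find("body/h2")
wrap_in_span(
    h2, 0, 21, 28,
    {"its-disambig-ident-ref": "http://dbpedia.org/resource/Ireland"},
)
RESULT = ET.tostring(root, encoding="unicode")
print(RESULT)
```

Note that inserting the element changes the XPath expressions of the surrounding text nodes, so the mapping would have to be regenerated before further annotations are reintegrated.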

Case 3: NIF annotation starts in one region and ends in another

No straightforward mapping is possible in general; this needs more research. If both regions have the same parent, a mapping is possible.
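
Deciding which of the three cases applies can be sketched as a comparison of the annotation offsets with the leaf text segments of the mapping. The function name `classify`, the case labels and the tuple layout are assumptions made for this illustration; the mapping is expected to contain only leaf text segments (no entries for non-leaf elements), ordered and non-overlapping.

```python
def classify(mapping, begin, end):
    """Return (case, xpaths) for an annotation covering [begin, end).

    mapping: list of (xpath, b, e) leaf text segments in document order.
    """
    for xpath, b, e in mapping:
        if (begin, end) == (b, e):
            return "exact", [xpath]        # case 1
        if b <= begin and end <= e:
            return "substring", [xpath]    # case 2
    # Case 3: collect every segment the annotation overlaps.
    touched = [xpath for xpath, b, e in mapping if b < end and begin < e]
    return "spanning", touched


# Leaf segments of the running example.
RUNNING_EXAMPLE = [
    ("/html/body[1]/h2[1]/text()[1]", 0, 11),
    ("/html/body[1]/h2[1]/span[1]/text()[1]", 11, 17),
    ("/html/body[1]/h2[1]/text()[2]", 17, 21),
    ("/html/body[1]/h2[1]/b[1]/text()[1]", 21, 28),
    ("/html/body[1]/h2[1]/text()[3]", 28, 29),
]

print(classify(RUNNING_EXAMPLE, 21, 28))  # "Ireland"
print(classify(RUNNING_EXAMPLE, 0, 7))    # "Welcome"
print(classify(RUNNING_EXAMPLE, 11, 28))  # "Dublin in Ireland"
```

For the spanning case the returned XPath expressions at least show which regions are involved, so a caller can check whether they share a parent.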

Problematic ITS properties in RDF and NIF

Some of the ITS properties cannot be mapped straightforwardly to RDF/OWL, because RDF/OWL has different mechanisms to express the information. NIF wrappers are implemented by third parties, and implementations might not include the following properties, because they would not make sense in RDF or NIF.

its:entityTypeSourceRef

Required information should ideally be retrievable via rdfs:isDefinedBy and no developer will voluntarily implement this output.

Note: NERD doesn't define it yet. I have written a notification for them. SebastianHellmann (talk) 23:48, 5 August 2012 (CEST)
curl -L -H "Accept: application/rdf+xml" http://dbpedia.org/ontology/Organisation  | grep -C3 isDefinedBy

its:disambigType

This property will have a hard time getting adopted by the Semantic Web community, as everybody is trying to get away from strings and use URIs.

its:entityTypeRef

This property would create an inconsistency when used directly in NIF, because nerd:Place owl:disjointWith str:String. The type is therefore attached to the linked data URI (dbpedia:Dublin), as displayed in the example above and explained in NIF meets NERD.

Retrieved from "http://wiki.nlp2rdf.org/index.php?title=ITS2NIF2ITS&oldid=633"