Use cases and requirements

From NLP2RDF-Wiki
Revision as of 16:45, 14 March 2013 by Croeder (Talk | contribs)


Please refrain from linking to anchors in the contents; the titles are still changing. Use the stable #Anchors instead.

Use cases

Use case collection

These use cases are not all equally important; we may neglect some.

#common Common NLP annotations

For the sentence below, an NLP developer wants the following common, non-overlapping annotations in RDF and OWL, so that they can be loaded into a triple store and queried with SPARQL: tokens, sentences, stems, lemmas, and part-of-speech tags.
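Assuming an example sentence (the original example is not shown here), a minimal sketch of offset-based token annotations serialized as Turtle might look like this. The ex: prefix and the property names anchorOf, beginIndex, and endIndex are placeholders, not the normative NIF vocabulary:

```python
# Illustrative sketch: emit offset-based RDF (Turtle) for token annotations.
sentence = "My favorite actress is Natalie Portman ."

def tokens_with_offsets(text):
    """Yield (token, begin, end) using simple whitespace tokenization."""
    pos = 0
    for tok in text.split():
        begin = text.index(tok, pos)
        end = begin + len(tok)
        yield tok, begin, end
        pos = end

def to_turtle(text):
    """Serialize each token as one offset-addressed resource."""
    lines = ["@prefix ex: <http://example.org/doc#> ."]
    for tok, b, e in tokens_with_offsets(text):
        uri = f"ex:offset_{b}_{e}"
        lines.append(f'{uri} ex:anchorOf "{tok}" ; ex:beginIndex {b} ; ex:endIndex {e} .')
    return "\n".join(lines)

print(to_turtle(sentence))
```

Loaded into a triple store, such output can then be queried with SPARQL, e.g. for all tokens between two offsets.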

#semantic-blogging Representing key words and links to Geonames and DBpedia in a blogging scenario

Note that although the database in this use case is real, the use case itself is fictitious (realistic, but completely made up). This use case was inspired by an idea of

A web developer sets up a blog for a customer. The customer is a researcher who manages his data collection (Caucasian spiders) in a semantic data wiki powered by OntoWiki and a triple store. The existing data comprises the biological classification and the locations of spider findings in the Caucasus region, and it is already connected to (biological classification) and GeoNames. The customer wants a blogging system that automatically annotates research-related blog entries with keywords and links to DBpedia and GeoNames. The researcher tells the web developer that these annotations should be converted to RDF and loaded into the triple store alongside the research data, so that the blog posts and the data are integrated and can be browsed and queried together.

#data-extraction-and-lineage Linking text to the OCR'd source and to the extracted RDF factoids

This use case is one of the problems that I'm dealing with: how to link a raw text string back to the imaged document after a good OCR pass, and how to link it to any RDF generated by NLP extraction code. So in the case of this string:

  • "Capt." is an instance of mil:Rank_Captain_Australia
  • "Boddington" is a specific org:person
  • "B. Coy." is a specific instance of mil:ArmyCompany
  • ld:offset_1_16 dc:source #DocumentImage

The objective is to document the decisions and the provenance of the NLP-related triples and to allow the string to be queried using the extracted RDF. This would allow for information retrieval queries that are somewhat more sophisticated than simple bag-of-words queries.
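The linking idea could be sketched with plain (subject, predicate, object) tuples, reusing the identifiers from the list above. The property names ex:anchorOf and ex:denotes are made up for illustration:

```python
# Hedged sketch: provenance triples for an extracted substring, stored as
# plain Python tuples instead of a real triple store.
triples = [
    ("ld:offset_1_16", "ex:anchorOf", "Capt. Boddington"),
    ("ld:offset_1_16", "dc:source", "ex:DocumentImage"),
    ("ld:offset_1_16", "ex:denotes", "ex:Boddington"),
    ("ex:Boddington", "rdf:type", "org:person"),
]

def query(triples, s=None, p=None, o=None):
    """Return triples matching a simple (s, p, o) pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Which source image does the string come from?
print(query(triples, s="ld:offset_1_16", p="dc:source"))
```

The same pattern-matching query, expressed in SPARQL over a real store, is what makes retrieval more precise than bag-of-words search: the string, its source image, and the extracted entity are all reachable from one another.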

#event-extraction Searching text for statements of events.

RDF representations of NLP structures would enable solutions using a triple store. See BioNLP09 [1].

#log-files Data integration of plain text log files

A system admin manages a big server farm with thousands of different web service applications and server software. She has around 15 log file analyser tools (some of them do NLP, e.g. for recognizing persons). To get a better overview and more control, the output of all tools shall be converted to RDF, so she can load it into a triple store and query over the aggregated data. The log files are mostly plain text files, accessible by URL, either over the network via http(s) or (s)ftp or on the local file system, such as file:///home/sebastian/svn/nlp2rdf/nlp2rdf.lod2.eu/usecases/access.log . She integrates all tools by implementing
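A minimal sketch of the conversion step, assuming an Apache-style access log. The subject URI scheme (file URL plus line number) and the ex: property names are invented for illustration:

```python
# Hedged sketch: turn one access-log line into triples so it can be
# aggregated with the output of other analyser tools in a triple store.
line = '127.0.0.1 - - [10/Oct/2012:13:55:36] "GET /index.html HTTP/1.1" 200 2326'
parts = line.split()

# Illustrative subject: the log file URL plus a line-number fragment.
subject = "<file:///var/log/access.log#line=1>"
triples = [
    (subject, "ex:clientIP", parts[0]),   # requesting host
    (subject, "ex:status", parts[-2]),    # HTTP status code
    (subject, "ex:bytes", parts[-1]),     # response size
]
print(triples[1])
```

Each analyser tool would emit triples of this shape; because they share the subject URI scheme, the results merge cleanly in the store.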

#generalizedPOS Ability to query generalized Part-Of-Speech annotations

For many use cases and practical applications, POS tag sets are too fine-grained and the subtle distinctions between individual tags are irrelevant. Often, it is sufficient to know whether a token belongs to a general class of tags, such as Noun, Adjective, or Verb. Given a text, we want to answer two queries: (1) retrieve the types of individual tokens and (2) retrieve all tokens of a certain type.

We target the casual web developer with this use case, who has little to no knowledge about NLP systems.

We searched stackoverflow.com for 15 minutes to show the widespread relevance of this use case:
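A minimal sketch of the two queries, assuming a hypothetical mapping from Penn Treebank tags to coarse classes; a real implementation would express this mapping as a class hierarchy in an ontology such as OLiA and query it with SPARQL:

```python
# Assumed (partial) mapping from fine-grained Penn tags to coarse classes.
COARSE = {"NN": "Noun", "NNS": "Noun", "NNP": "Noun",
          "VB": "Verb", "VBD": "Verb", "VBZ": "Verb",
          "JJ": "Adjective"}

# Example tagger output (invented for illustration).
tagged = [("Paris", "NNP"), ("is", "VBZ"), ("beautiful", "JJ"), ("cities", "NNS")]

def type_of(token):
    """Query 1: the generalized class(es) of an individual token."""
    return [COARSE[tag] for tok, tag in tagged if tok == token]

def tokens_of(cls):
    """Query 2: all tokens of a generalized class."""
    return [tok for tok, tag in tagged if COARSE.get(tag) == cls]

print(type_of("Paris"))   # ['Noun']
print(tokens_of("Noun"))  # ['Paris', 'cities']
```

In the RDF setting, the COARSE dictionary corresponds to rdfs:subClassOf axioms, and both queries become one-line SPARQL patterns over the class hierarchy.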

#non-contiguous Mark (non-contiguous) words that should not be translated

The text below is to be transferred to a machine translation (MT) system. To improve the quality of the translation, all occurrences of persons and named entities are marked to be treated specially by the MT system (e.g. via its:translate=no ). Ralf Der is a professor at the University of Leipzig, but der is also the German definite article (Wiktionary), which may also be capitalized at the beginning of a sentence. Note: "non-contiguous" is stressed here, because this use case only needs a single annotation: <> its:translate "no" . So, if we are able to annotate a non-contiguous region, we only require one triple. The use case is complicated by (1) the bold and italicized occurrence of Der Agent, where Der occurs as an article at the beginning of a sentence, and (2) the occurrence of Ders (an s is attached to the person name Der for the German possessive case). Naive string matching seems impossible without also retrieving these negatives.

Zurück zu Ders Laptop. Nach etlichen Minuten hat das eingemauerte Männchen eine ganze Menge Bewegungen erlernt, mit denen es sich spielend aus seiner misslichen Lage befreien könnte, tut es aber nicht. Sobald es eine Hand und einen Fuß auf der Mauer hat, findet es offenbar etwas anderes in der Box wieder spannender – und hat dann binnen Kurzem vergessen, dass es ein Draußen gibt. Der Agent verharrt in seiner Gegenwart. Sieht so autonomes, zielgerichtetes Verhalten aus? „Dafür ist es noch zu früh“, sagt Der. „Wir versuchen Roboter derzeit lediglich dazu zu bringen, sich mit möglichst geringem Aufwand autonom zu verhalten.“ Ay und Der verschaffen Robotern also erst einmal eine Art Körpergefühl. Damit vertreten sie eine der beiden derzeit vorherrschenden Strömungen auf dem Gebiet der KI. Die Anhänger der anderen Richtung ver­suchen ihre Maschinen darauf zu trimmen, Aufgaben möglichst gut zu lösen – etwa indem sie neuronale Netze für gut erledigte Jobs belohnen. „Wir wollen dagegen erst einmal sehen, was unsere Roboter können, um die Fähigkeiten dann eventuell später zu nutzen“, erklärt Ralf Der – task independent learning heißt das.
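The single-annotation idea can be sketched in plain Python, assuming the name occurrences have already been disambiguated from the article (in the toy text below, both matches happen to be the person name):

```python
# One annotation, many ranges: a non-contiguous region is a list of
# (begin, end) offsets, all carrying a single its:translate="no" value.
text = "Ralf Der is a professor, and Der teaches in Leipzig."

# Assumed: these positions come from a disambiguation step, not from
# naive string matching (which would also hit the article "der").
name_positions = [(i, i + 3) for i in range(len(text) - 2) if text[i:i+3] == "Der"]
annotation = {"ranges": name_positions, "its:translate": "no"}

covered = [text[b:e] for b, e in annotation["ranges"]]
print(covered)  # both occurrences of the name, one annotation
```

Serialized as RDF, this stays a single its:translate triple on one region resource, rather than one triple per occurrence.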

#rdface RDFaCE: a WYSIWYM content editor

RDFaCE (RDFa Content Editor) is a rich text editor that supports different views for semantic content authoring. One of the main features of RDFaCE is combining the results of different NLP APIs for automatic content annotation. The main challenge here is the heterogeneity of the existing NLP APIs in terms of API access, URI generation, and output data structure. Different NLP APIs use different parameter names, such as “content”, “text”, and “lookupText”, to indicate the input of their REST API. Furthermore, they either use their own URI schemes like:

or use external URIs like:

for recognizing the entities which are extracted.

Another important issue is that each API returns different properties with different names and in a different structure. The following screenshot shows the outputs of two sample NLP APIs:

OutputNLPAPIs.png

To cope with these heterogeneity issues, RDFaCE employs two strategies:

  1. It uses a server-side proxy which handles the access heterogeneities by hard-coding the input parameters and connection requirements of each individual API.
  2. It maps the output of each NLP API to a predefined structure, as shown in the following: Array("label", "type", "uri", "positions" => Array("start", "end"), "properties" => Array("predicate", "object"))
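Strategy 2 can be sketched as follows; the two input shapes are invented stand-ins for real API responses, and the target structure mirrors the Array layout above:

```python
# Hedged sketch: normalizing heterogeneous NLP API outputs into one
# predefined structure (label, type, uri, positions, properties).
def normalize_api_a(resp):
    """Hypothetical API A style: flat fields with 'offset' and 'length'."""
    return {
        "label": resp["name"],
        "type": resp["category"],
        "uri": resp["link"],
        "positions": {"start": resp["offset"], "end": resp["offset"] + resp["length"]},
        "properties": [],
    }

def normalize_api_b(resp):
    """Hypothetical API B style: nested 'entity' with explicit start/end."""
    e = resp["entity"]
    return {
        "label": e["surfaceForm"],
        "type": e["types"][0],
        "uri": e["uri"],
        "positions": {"start": e["start"], "end": e["end"]},
        "properties": [],
    }

a = normalize_api_a({"name": "Paris", "category": "Place",
                     "link": "http://dbpedia.org/resource/Paris",
                     "offset": 10, "length": 5})
b = normalize_api_b({"entity": {"surfaceForm": "Paris", "types": ["Place"],
                                "uri": "http://dbpedia.org/resource/Paris",
                                "start": 10, "end": 15}})
print(a == b)  # True: both APIs collapse to the same structure
```

Every new API needs one such adapter; with NIF, the adapters disappear because all services would already speak the same format.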

Implementing the NIF standard in RDFaCE will facilitate the integration process to a great extent by removing the diversity of the different NLP APIs. Using NIF, adding new NLP APIs to RDFaCE will be straightforward, and the additional effort needed to handle heterogeneity between different data formats will be removed. For more information about RDFaCE, refer to the project page at http://aksw.org/Projects/RDFaCE

Compatibility with other projects/formats

#its-2.0 ITS 2.0

The company CMS-International develops a CMS that has integrated components which use formats defined by the Internationalization Tag Set. The company has a customer from the energy industry. The customer wants the CMS to become smarter, recognize relevant entities in the content, and link them to external and internal knowledge bases. As CMS-International has no internal NLP know-how, they need to integrate an existing NLP system into their CMS. To save costs, they use an existing NIF2ITS transformer component to integrate a NIF web service.

Scenario:

  1. Internally, the CMS uses HTML to store and transfer content. The transformer component converts the HTML to plain text and remembers the positions of the HTML tags.
  2. The transformer component sends the plain text to a NIF web service.
  3. The NIF web service returns the annotations in NIF format.
  4. The transformer component converts the NIF annotations to an ITS-compatible format and includes them in the original HTML.

Variants:

  • (a) The CMS uses XML instead of HTML.
  • (b) Instead of NER and entity linking, the NIF web service is specialized in machine translation.
  • (c) The CMS HTML already contains ITS annotations that carry important information for the NIF web service. The transformer does not send the content as plain text to the NIF web service. Instead, the transformer first converts the HTML+ITS to NIF and sends the data in NIF to the NIF web service.
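Step 1 of the scenario (HTML to plain text, remembering tag positions) might be sketched as follows; this is an illustrative stand-in, not the actual transformer component:

```python
import re

def html_to_plain(html):
    """Return (plain_text, tag_positions): the text with tags stripped,
    plus each removed tag paired with its offset in the plain text, so
    NIF annotation offsets can later be mapped back into the HTML."""
    plain, tags = [], []
    pos = 0
    last = 0
    for m in re.finditer(r"<[^>]+>", html):
        chunk = html[last:m.start()]
        plain.append(chunk)
        pos += len(chunk)
        tags.append((pos, m.group()))  # this tag sat at this plain-text offset
        last = m.end()
    plain.append(html[last:])
    return "".join(plain), tags

text, tags = html_to_plain("<p>Hello <b>Paris</b>!</p>")
print(text)  # Hello Paris!
```

In step 4, the transformer walks the annotation offsets and the recorded tag positions in parallel to re-insert the ITS markup at the right places in the original HTML.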

#lafgraf LAF / GrAF

POWLA

Stanbol Semantic Enhancement Engines

The following text is taken from their

The ontologies used by the Stanbol Enhancer can be found at http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/generic/servicesapi/src/main/resources/ . To get the Turtle for the example used in the picture, you can use:

  curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
       --data "Apache Stanbol can detect famous entities such as \
           Paris or Bob Marley" http://dev.iks-project.eu:8081/enhancer

OLiA

NERD

lemon

PROV-AQ: Provenance Access and Query

Open Annotation

Links:

WikiData

ToDos and Ideas

  1. research-articles rdfication: RDFize the open subset of PubMed Central including content as well as structure
    1. Taken from http://lists.w3.org/Archives/Public/public-lod/2012Jul/0029.html
  2. Annotate several non-contiguous strings
    1. Sometimes it is useful to group certain strings together. This equals creating a category or collection of contiguous substrings.
      1. Collection of all occurrences of the article "the" in the following text: The fragment identifier, introduced by a hash mark #, is the optional last part of a URL for a document. It is typically used to identify a portion of that document. The generic syntax is specified in RFC 3986. The hash mark separator in URIs does not belong to the fragment identifier.
    2. non-contiguous entities
      1. "President" and "Obama" belong to the entity President Barack Obama (skipping "B.") in:
      2. President B. Obama will call for a new minimum tax rate for individuals making more than $1 million a year.
  3. Robust identifiers for documents, which change frequently
  4. Test for consistency of identifiers
  5. Check consistency of RDF (e.g. tokenization or encoding)
  6. Precise text search, i.e. paragraph, sentences
  7. Annotate elements and attributes (ITS 2.0)
  8. Exchange annotation services easily
  9. Reuse of NLP components
  10. Combine tool output
  11. Create applications based on generalized entity type (e.g. using NERD )
  12. Ability to express relations between strings in RDF ( e.g. Dependencies )
  13. Provide several candidates/alternatives for an annotation, ideally ranked by confidence
  14. Annotate large text corpora

Requirements

Compatibility

Compatibility with RDF. One of the main requirements driving the development of NIF was the need to convert any NLP tool output to RDF, as virtually all software developed within the LOD2 Project is based on RDF and an underlying triple store.

Coverage

The wide range of potential NLP tools requires that the produced format and ontology be sufficiently general to cover all annotations and allow them to be converted to RDF.

Structural Interoperability.

NLP tools with a NIF wrapper should produce uniform output, which allows annotations from different tools to be merged consistently. Here, structural interoperability refers to the way annotations are represented.

Conceptual Interoperability.

In addition to structural interoperability, tools are furthermore supposed to use the same vocabularies for the same kinds of annotations. The focus lies on which annotations are used.

Granularity

The ontology is supposed to handle different kinds of granularity, not limited to the document level, which can be considered very coarse-grained. As basic units we identified the document collection, the document, the paragraph, and the sentence. A keyword search, for example, might rank a document higher if the keywords appear in the same paragraph.

(Robust) Web annotation.

The format should be able to use textual Web content (e.g. blogs) as input and provide robust standoff annotations for dynamic web data.

Provenance

In addition to the required robustness, tracking the provenance of the primary data was also required.

Simplicity

We want to encourage third parties to contribute their NLP tools to the LOD2 Stack. Therefore the format should be as simple as possible to ease integration and adoption.

Scalability

An especially important requirement is imposed on the format with regard to scalability in two dimensions: Firstly, the triple count is required to be as low as possible to reduce the memory and index footprint in all applications and triple stores. Secondly, the complexity of OWL axioms should be as low as possible to allow fast reasoning.
