API documentation (v1.0)

The DirectNLP API is a REST service that operates on the data you provide via HTTP requests. This document describes version 1.0 of the API.

Rate limiting

When using the API, we enforce limits on how much data you may process per minute. Every user is allocated a number of credit points that can be spent each minute. The amount of available credits depends on the plan. For example, the free community plan allows spending up to 100 credits per minute, which is equivalent to performing word analysis on up to 10,000 characters of text. This is roughly the size of a four-page essay or about 100 average-length tweets. In one day, you can process approximately 15 megabytes of data with the free plan. The paid plans allow you to process much higher volumes of text.

Every request has an associated cost that depends on the complexity of the operations and the amount of data you process. Text is measured in blocks of 100 characters, including whitespace. Note that incomplete blocks are counted as full blocks, so a text with 101 characters counts as two blocks. The total cost of a request is computed by multiplying the sum of the operation costs by the number of text blocks in your documents. For example, if you provide a document of 1500 characters and wish to perform language detection and morphological analysis, the total cost will be 15 · (1 + 1) = 30 credits: 1500 characters equal 15 blocks of text, language detection costs 1 credit per block, and morphological analysis is worth another 1 credit per block.
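
To make the arithmetic concrete, here is a small illustrative Python helper (not part of the API) that reproduces the pricing rule; the per-block costs for detect_language and analyze are taken from the example above.

import math

# Per-block operation costs in credits; the values for these two
# operations are taken from the pricing example above.
OPERATION_COSTS = {
    'detect_language': 1,
    'analyze': 1,
}

def request_cost(text, operations):
    # Incomplete blocks count as full blocks, hence the ceiling.
    blocks = math.ceil(len(text) / 100)
    return blocks * sum(OPERATION_COSTS[op] for op in operations)

# A 1500-character document with language detection and word analysis:
# 15 blocks * (1 + 1) = 30 credits
print(request_cost('x' * 1500, ['detect_language', 'analyze']))  # prints 30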

Another limitation is that you may process at most 10,000 characters in a single request. This is necessary to avoid overloading the servers. Fortunately, for most use cases the client can split the content into chunks smaller than this size.
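
For example, a client could split a long document with a naive helper like the sketch below (a hypothetical split_into_chunks function that breaks at whitespace where possible; a real client might prefer sentence boundaries):

def split_into_chunks(text, limit=10000):
    # Split text into chunks of at most `limit` characters,
    # breaking at whitespace where possible.
    chunks = []
    while len(text) > limit:
        cut = text.rfind(' ', 0, limit)
        if cut <= 0:  # no whitespace found, cut mid-word
            cut = limit
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks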

HTTP requests

To use the API, the client must send HTTP POST requests to the following URL:

https://directnlp.com/api/v1.0

Both API requests and responses use JSON to transfer data. This means that the request must supply the Content-Type: application/json header.

The JSON content requires the mandatory fields defined in the following table:

Attribute  Data type  Description
api_key    string     The API key for the user account
task       string     The task to execute
doc        string     The input document

Example

Let us demonstrate the API usage with a simple Python script using the popular requests library.

import requests
from pprint import pprint

data = {
    'api_key': 'insert_your_api_key_here',  # replace with your API key
    'task': 'analyze',
    'language': 'detect',
    'doc': 'It is raining cats and dogs!'
}
result = requests.post('https://directnlp.com/api/v1.0', json=data,
                       headers={'Content-Type': 'application/json'})
pprint(result.json())

Output

In addition to the output of the analyze task, there are a few other fields in the result. The first is result, which is always OK if the request succeeded and NOK if there was an error.

The second field is the cost, which is computed for every request based on the criteria described in the Rate limiting section. In this particular case the cost is 2, because we processed one block of text and performed both language detection and word analysis.

{
  "analysis": [
    {
      "words": ["It","is","raining","cats","and","dogs","!"],
      "lemmas": ["it","be","rain","cat","and","dog","!"],
      "postags": ["PRON","VERB","VERB","NOUN","CONJ","NOUN","PUNCT"]
    }
  ],
  "cost": 2,
  "language": "en",
  "result": "OK"
}

Multiple tasks and documents

It is also possible to replace the task or doc parameters with their plural forms, so you can execute more than a single task on several documents with a single HTTP request. However, note that if you use the docs parameter instead of doc, the output will be a list of dictionaries, one per document; otherwise it is just a single dictionary.

Attribute  Data type    Description
tasks      string list  Execute the list of given tasks
docs       string list  The list of input documents
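
For illustration, a request using the plural forms could look like this (the documents are placeholders):

{
  "api_key": "25FBA15B30BD4D39",
  "tasks": ["detect_language", "analyze"],
  "docs": ["First document to process.", "Second document to process."]
}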

The rest of the required parameters depend on the task you wish to execute, which is described along with the examples in the following sections.

Tasks

This section describes the available tasks for DirectNLP API. Every task lists the required parameters and describes the output along with practical examples to get you up to speed.

Language detection

Task: detect_language

Language detection is required by most other NLP tasks of the API. If you don't know the language of your documents, it can be detected automatically from their content. Currently the API assumes that all given documents are in the same language.

Example

Detect the language of a German phrase. The result is stored in the language attribute.

{
  "api_key": "25FBA15B30BD4D39",
  "task": "detect_language",
  "doc": "Guten Tag, meine Damen und Herren."
}

Output

{
  "cost": 1,
  "language": "de",
  "result": "OK"
}
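
Since most other tasks need the language, a common client-side pattern is to detect it first and reuse it in a follow-up request. The following sketch assumes the requests library and the endpoint shown earlier:

import requests

API_URL = 'https://directnlp.com/api/v1.0'
API_KEY = '25FBA15B30BD4D39'
doc = 'Guten Tag, meine Damen und Herren.'

# Step 1: detect the language of the document.
detected = requests.post(API_URL, json={
    'api_key': API_KEY, 'task': 'detect_language', 'doc': doc
}).json()

# Step 2: reuse the detected language in a word analysis request.
if detected['result'] == 'OK':
    analysis = requests.post(API_URL, json={
        'api_key': API_KEY, 'task': 'analyze',
        'language': detected['language'], 'doc': doc
    }).json()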

Word analysis

Task: analyze

Word analysis performs word and sentence tokenization and morphological analysis to obtain word lemmas, part-of-speech tags and related features. This is a necessary step for many applications such as document indexing and information retrieval.

Example

Analyzing an English document with two sentences.

{
  "api_key": "25FBA15B30BD4D39",
  "task": "analyze",
  "language": "en",
  "doc": "Two white cats went on a quest. How could they go without me?"
}

Output

By default, the analysis contains the tokenized words, lemmas and part-of-speech tags. Note that the results for different sentences are stored in separate dictionaries.

{
  "analysis": [
    {
      "words": ["Two", "white", "cats", "went", "on", "a", "quest", "."],
      "lemmas": ["two", "white", "cat", "go", "on", "a", "quest", "."],
      "postags": ["NUM", "ADJ", "NOUN", "VERB", "ADP", "DET", "NOUN", "PUNCT"]
    },
    {
      "words": ["How", "could", "they", "go", "without", "me", "?"],
      "lemmas": ["how", "could", "they", "go", "without", "me", "?"],
      "postags": ["ADV", "VERB", "PRON", "VERB", "ADP", "PRON", "PUNCT"]
    }
  ],
  "cost": 1,
  "result": "OK"
}

The word analysis takes a number of options, which are summarized in the table below.

Attribute           Data type  Default value  Description
with_words          boolean    True           Tokenized words, whitespace excluded
with_lemmas         boolean    True           Dictionary form of the word
with_postags        boolean    True           Part-of-speech tags (word types)
with_shapes         boolean    False          Compressed shape of the words
with_positions      boolean    False          Character indexes of word start and end positions
with_sentence_text  boolean    False          Include the text of the sentence
with_skipgrams      boolean    False          Compute the set of skipgrams for every sentence

Description

The word analysis output is stored in the analysis field, which is a list. The list contains a dictionary for every sentence in the document.

words contains the tokenized words as they are present in the input. The specific tokenization rules vary from language to language and aim for the best common ground.

lemmas contains the dictionary forms of the words and is useful for removing suffixes and normalizing inflected word forms. The main use case for lemmas is document indexing. For example, if a document contains "went" in the simple past tense, it can be matched by a query containing the present form "go". This normalization is especially important for languages with highly inflected words, such as Estonian.
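
As a minimal sketch of the indexing use case (client-side code, not part of the API), an inverted index could map each lemma to the documents that contain it:

from collections import defaultdict

def build_lemma_index(analyzed_docs):
    # `analyzed_docs` is assumed to be a list of analyze-task
    # responses, one per document.
    index = defaultdict(set)
    for doc_id, response in enumerate(analyzed_docs):
        for sentence in response['analysis']:
            for lemma in sentence['lemmas']:
                index[lemma].add(doc_id)
    return index

# A query for the lemma "go" now also matches documents containing "went".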

postags, i.e. part-of-speech tags, can be thought of as different word types such as nouns, verbs, and adjectives. These vary vastly among languages, but are normalized in DirectNLP to make them usable for multilingual natural language processing. The following table sums up the part-of-speech tags used in DirectNLP, using the English language as an example.

Part-of-speech  Example                           Description
NOUN            A cat                             Nouns identify things and places, or name a particular one
ADJ             A red car                         Adjectives describe attributes or traits of things
VERB            The fox jumped                    Verbs describe an action or a state, often also referred to as a predicate
ADP             He went to the shop               Adpositions describe spatial and temporal relations
ADV             She paints beautifully            Adverbs modify the meaning of verbs and clauses, but also adjectives
PRON            She talked to him                 A word that can take the place of a noun
CONJ            Rick and Morty                    Conjunctions connect clauses and sentences
INTJ            Oh!                               An interjection is an abrupt remark
NUM             Three musketeers                  Numerals express numbers
DET             The man, every person             Determiners denote the kind of a noun
PUNCT           !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~  Punctuation marks separate sentences and clauses, but also include special symbols such as $ and €
X                                                 Anything that does not fall into the above categories

shapes denote the pattern of uppercase and lowercase characters, digits, and punctuation used in a token. This is a useful feature for representing similar tokens: for example, 12,35 and 12.48 have the shape D.D, the words EvilCorp and MegaCorp have the shape ULUL, and dates like 1986.12.21 and 1950.02.12 have the shape D.D.D.

Shape character  Description
U                uppercase letters
L                lowercase letters
D                digits
.                basic punctuation , . ; : ! ? + - * / = | _
(                opening parenthesis ( { [ <
)                closing parenthesis ) } ] >
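
To illustrate the encoding, here is a client-side sketch of how such a shape could be computed; it follows the table above, but is not necessarily the exact server algorithm:

def token_shape(token):
    # Map a character to its shape class from the table above.
    def shape_class(ch):
        if ch.isupper():
            return 'U'
        if ch.islower():
            return 'L'
        if ch.isdigit():
            return 'D'
        if ch in ',.;:!?+-*/=|_':
            return '.'
        if ch in '({[<':
            return '('
        if ch in ')}]>':
            return ')'
        return ch  # leave other characters as-is

    # Collapse runs of characters from the same shape class.
    shape = ''
    for ch in token:
        cls = shape_class(ch)
        if not shape or shape[-1] != cls:
            shape += cls
    return shape

print(token_shape('EvilCorp'))    # ULUL
print(token_shape('1986.12.21'))  # D.D.D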

positions represent tokenized word positions as pairs (x, y), counted from the beginning of the sentence: x is the start position and y the end position of the token. The start and end positions are stored in separate lists, xs and ys.

sentence_text is useful when you are interested in using the tokenized sentences in your client application.

Example

Consider the following example that employs both word position analysis and sentence tokenization. Note that we turn off lemma and postag analysis for this example, because they are enabled by default.

{
  "api_key": "25FBA15B30BD4D39",
  "task": "analyze",
  "language": "en",
  "doc": "Hello world! How are you?",
  "with_words": true,
  "with_lemmas": false,
  "with_postags": false,
  "with_positions": true,
  "with_sentence_text": true
}

Output

The output contains the tokenized words with their start and end positions, along with the full text of every tokenized sentence.

{
  "analysis": [
    {
      "sentence_text": "Hello world!",
      "words": ["Hello", "world", "!"],
      "xs": [0, 6, 11],
      "ys": [5, 11, 12]
    },
    {
      "sentence_text": "How are you?",
      "words": ["How", "are", "you", "?"],
      "xs": [0, 4, 8, 11],
      "ys": [3, 7, 11, 12]
    }
  ],
  "cost": 1,
  "result": "OK"
}
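
A client can use xs and ys to slice the tokens back out of the sentence text. A brief sketch, assuming the response above is stored in a variable named result:

# Recover each token from sentence_text using its start/end positions.
for sentence in result['analysis']:
    text = sentence['sentence_text']
    for word, x, y in zip(sentence['words'], sentence['xs'], sentence['ys']):
        assert text[x:y] == word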

skipgrams outputs a list of unigrams and feature pairs found in the sentence, which can be used as a more in-depth replacement for simple bag-of-words feature extraction. It uses the words, lemmas, postags, and shapes of a sentence to formulate pairs of features that occur in close proximity within the sentence. It helps to encode the sentence structure as features to some degree, but also increases memory usage if the client application uses it.
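
Enabling skipgrams follows the same option pattern as the other flags; for example, a request could look like this (output omitted here):

{
  "api_key": "25FBA15B30BD4D39",
  "task": "analyze",
  "language": "en",
  "doc": "Two white cats went on a quest.",
  "with_skipgrams": true
}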