6. Comparison with other Python libraries

This module only deals with the splitting of a string into elementary constituents, called Span objects (this operation is usually called tokenization), and optionally the tagging of these elements using the classes Token and Tokens. So basically a Token is a Span (described shortly below) with additional attributes that can be customized freely by users, whereas a Tokens is just a collection (i.e. a list) of Token objects.

What is called a Token in the present library is often called a Token in other libraries as well, see below. What is called a Span in the present library is a restriction of a Token: a simpler object that, after the Token and Tokens classes had been designed, turned out to be more useful than Token and Tokens themselves, at least for our needs.

What is called a Tokens in the present library is often called a Document in other libraries, see below. The two notions differ in the sense that several instances of Tokens can be associated with a given string, seen as the parent document and named the parent string in the following. The parent string is available from the attribute Span.string and its sub-class attribute Token.string. Many Span or Token objects can thus be attached to a given parent string, and this behavior is the strength and the main usefulness of the class.

The originality of the Span class is its ability to deal with non-overlapping sub-parts of the parent string in a versatile way. That is, one can combine any number of ranges of positions of the parent string into a single Span entity. The selected character positions are interpreted as sets of positions inside the parent string, so the basic set algebra is implemented in the Span class: union, intersection, difference and symmetric_difference are all available at the level of the Span class. One can also add or remove a range from a Span object with ease. Importantly, there is no predefined rule for defining a Span from its parent string; one can use whatever splitting method one wants: a term detector, regular expressions (regex), by-hand construction position by position, …

Altogether, the complete set of (non-overlapping) characters of the parent string that form the token, or span, is called the child string. The child string is returned by str(Span) and, for the sub-class, by str(Token). The way a child string is reconstructed from several non-contiguous ranges of positions of the parent string is controlled by the attribute Span.subtoksep (or Token.subtoksep), which is by default the space character (character 32 in the ASCII table).
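
A minimal sketch of these ideas follows. Only the attributes and methods named above (string, subtoksep and the set operations) come from the library itself; the constructor signature is an assumption made for illustration:

    from tokenspan import Span

    parent = "Quantum mechanics is fun."

    # Assumed constructor: Span(parent_string, ranges=[(start, end), ...]).
    # Two non-contiguous ranges of the parent string, gathered in one Span.
    s1 = Span(parent, ranges=[(0, 7), (21, 24)])
    print(s1.string)  # the parent string: 'Quantum mechanics is fun.'
    print(str(s1))    # the child string: 'Quantum fun' (ranges joined by subtoksep)

    # A Span behaves as a set of positions, so set algebra applies.
    s2 = Span(parent, ranges=[(8, 17)])
    u = s1.union(s2)         # positions of s1 and s2 together
    i = s1.intersection(s2)  # common positions (empty here)
    d = u.difference(s2)     # u without the positions of s2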

Below we present the differences in conception between the tokenspan library, with its Span, Token and Tokens classes, and other libraries available on the Python Package Index (PyPI).

6.1. sklearn

Scikit-learn (sklearn) implements a tokenizer in its feature extraction package, which is quite versatile, since it allows (see the sketch after this list)

  • making e.g. \(n\)-char-grams or \(n\)-grams of a collection of different sizes, through the option ngram_range=(min,max), which accepts \(n\in[\min,\max]\)

  • a specific splitting scheme for strings, with the token_pattern option, which accepts any regular expression

  • a list of stop words and a list of vocabulary terms (sometimes generating sets of contradicting arguments, whose precedence is handled behind the scenes)

  • a full set of statistical counting parameters: max_features, max_df, min_df… see e.g. the CountVectorizer model of the feature extraction package (TfidfVectorizer has similar options).
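
For instance, the options above can be combined in a CountVectorizer; the following is standard sklearn usage, shown here only for comparison:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "Quantum mechanics is fun.",
        "Tokenization is the first step of most NLP pipelines.",
    ]

    # Word 1-grams and 2-grams, a regex token pattern, English stop words,
    # and statistical cut-offs on the document frequencies.
    vectorizer = CountVectorizer(
        ngram_range=(1, 2),
        token_pattern=r"(?u)\b\w\w+\b",  # sklearn's default pattern
        stop_words="english",
        max_features=1000,
        min_df=1,
        max_df=1.0,
    )
    X = vectorizer.fit_transform(corpus)
    print(vectorizer.get_feature_names_out())

    # Character n-grams instead of word n-grams:
    char_vec = CountVectorizer(analyzer="char", ngram_range=(2, 3))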

Unfortunately these options are tied to a few specific algorithms (mainly complete bag-of-words constructions) and do not provide any access to the resulting tokens as objects. In the present library, the Span, Token and Tokens objects are versatile, and adapting them to later machine learning algorithms is quite easy, even for beginners in the field of NLP.

sklearn is available on pypi:scikit-learn

6.2. nltk

The Natural Language ToolKit (nltk) offers a large variety of tokenizers in its module nltk.tokenize. Most of these tokenizers are nevertheless based on nltk.tokenize.api.TokenizerI, which takes a string as input and returns either

  • a list of tuples of the form (start, end), such that string[start:end] corresponds to the string of the token; this is the span_tokenize method of nltk.tokenize.api.TokenizerI, and the tokenize method of the companion module iamtokenizing behaves this way, or

  • a list of strings, each string corresponding to a token; this is the tokenize method of nltk.tokenize.api.TokenizerI.
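
For example, with nltk's WhitespaceTokenizer, both behaviors are available:

    from nltk.tokenize import WhitespaceTokenizer

    text = "Quantum mechanics is fun."
    tokenizer = WhitespaceTokenizer()

    # List of strings: ['Quantum', 'mechanics', 'is', 'fun.']
    print(tokenizer.tokenize(text))
    # List of (start, end) tuples: [(0, 7), (8, 17), (18, 20), (21, 25)]
    print(list(tokenizer.span_tokenize(text)))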

By contrast, the Span class allows non-contiguous sub-parts of a parent string to be addressed in a single object. In addition, creating objects from the string allows one to add attributes on the fly, during the specific process adapted to the NLP problem at hand.

nltk is available on pypi:nltk

6.3. spaCy

spaCy implements a tokenizer with some interesting properties, like the iterative inspection of the different portions of the tokens, allowing exceptions for hyphenated words, suffixes, prefixes and infixes, in addition to stop-word exceptions and so on (see the full API documentation). However, the tokenizer is so deeply embedded in the spaCy.Pipeline object that any optional customization requires painful (if not impossible) adaptation of the spaCy code. One can cite the missing possibility of using sentences, n-char-grams, or simply n-grams as tokens for a later NLP pipeline, among other difficulties in dealing with spaCy.

The Span object overcomes all the limitations spaCy presents in its tokenization process, at the expense that the user now has to construct the rules for splitting the document in the desired way (note there are several already-implemented tools in the companion module iamtokenizing to extract char-grams and n-grams). Also, the ability to add attributes on the fly to a spaCy.Token object is quite similar to our approach in the tokenspan.Token and tokenspan.Tokens classes. To a large extent, the tokenspan.Tokens class has similarities with the spaCy.Doc object. Note nevertheless that the main advantage of spaCy is still its implementation in Cython, which should be vastly faster than treatments with the Span, Token or Tokens classes, which are written in pure Python. In addition, spaCy proposes many already-trained language models, which should speed up industrial applications of NLP in specific contexts. This is not the case for tokenspan, which is developed for specific usages and hand-crafted by its users for dedicated tasks.
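
For comparison, spaCy's own mechanism for adding attributes on the fly is its extension API (standard spaCy usage; the attribute name below is chosen for illustration):

    import spacy
    from spacy.tokens import Token

    # Register a custom attribute, available on every token as token._.is_keyword.
    Token.set_extension("is_keyword", default=False)

    nlp = spacy.blank("en")  # a bare pipeline: tokenizer only, no trained model
    doc = nlp("Quantum mechanics is fun.")

    doc[0]._.is_keyword = True  # set the attribute on the fly
    print([(t.text, t._.is_keyword) for t in doc])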

spaCy is available on pypi:spacy

6.4. gatenlp

gatenlp allows easy annotation of a string, and allows overlapping annotations to exist. The generation of annotations is quite simple, and their versatility is ensured by the use of the Features class, which wraps Python dictionaries. In addition, one can group several annotations related to a given document in a class called AnnotationSet.

All this resembles the Token and Tokens classes quite a lot, though the Tokens class is a composite form in between the gatenlp.AnnotationSet and gatenlp.Document classes. The originality of the tokenspan library here is to present the Span class as well. A gatenlp.Span also exists, but it does not allow non-contiguous parts of a string to be handled. In addition, only overlap and/or containment information is available from the gatenlp.Span class, whereas tokenspan regards a Span as a set of positions in a parent string, and so naturally supports set operations.
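
A short sketch of this annotation workflow, following the gatenlp documentation (Document, annset and add belong to the gatenlp API; the annotation types and features below are chosen for illustration):

    from gatenlp import Document

    doc = Document("Quantum mechanics is fun.")
    annset = doc.annset()  # the default annotation set of the document

    # Annotations may overlap, and each carries Features (a dict-like object).
    annset.add(0, 7, "Token", {"kind": "word"})
    annset.add(0, 17, "NounPhrase", {"head": "mechanics"})

    for ann in annset:
        print(ann)  # each annotation knows its offsets, type and features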

gatenlp is available on pypi:gatenlp

6.5. gensim

Though it is widely used for advanced NLP tasks, like the vectorization of texts using neural networks, and although gensim is acclaimed for its speed in accomplishing these vectorization tasks, this library contains only scarce tools to tokenize a string. In practice, gensim takes as input a list of already-prepared tokens, and offers only a few utilities to help the user with the tokenization process. The Span object fills this gap between the text and its vectorization with gensim.
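
For instance, a typical gensim entry point expects already-tokenized texts; the following is standard gensim usage, shown here only for comparison:

    from gensim.corpora import Dictionary
    from gensim.utils import simple_preprocess  # one of the few tokenization helpers

    texts = [
        "Quantum mechanics is fun.",
        "Tokenization is the first step of most NLP pipelines.",
    ]

    # gensim works on lists of tokens, not on raw strings.
    tokenized = [simple_preprocess(t) for t in texts]
    dct = Dictionary(tokenized)                      # token <-> id mapping
    corpus = [dct.doc2bow(tokens) for tokens in tokenized]
    print(corpus)                                    # bag-of-words: (token_id, count) pairs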

gensim is available on pypi:gensim