Documentation of the package Tokenizer

The documentation is available at https://nlp.frama.io/tokenspan/, the PyPI package at https://pypi.org/project/tokenspan/, and the official repository at https://framagit.org/nlp/tokenspan/

Summary

This package contains several tokenizers. Most of them are built on the Token and Tokens classes. The Token class is a sub-class of the Span class, and Tokens can be seen as a collection of Token instances.

Important remark: the Span class, which handles the cutting of the parent string, was introduced in version 0.4. Before that, the cutting was handled by the Token class, which also manages the customizable attributes of the tokens.

Span class

The Span class is documented in Span_UserGuide. It abstracts a collection of character positions in a parent string, from which a sub-string (called the child string in this context) can be extracted in a versatile way. In addition, the Span class provides the basic algebra of sets of positions, which allows powerful manipulation of many different tokens taken from a single parent string.
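The idea of a span as a set of character positions can be sketched in plain Python. This is only an illustration of the concept described above, not the tokenspan API: the helper names `extract`, `positions` and `to_ranges` are hypothetical, and the real Span class may expose this behaviour quite differently (see Span_UserGuide).

```python
# Hypothetical illustration of the concept; this is NOT the tokenspan API.
parent = "Simple string for demonstration"


def extract(parent, ranges, sep=" "):
    """Join the sub-strings selected by a list of non-overlapping ranges."""
    return sep.join(parent[r.start:r.stop] for r in ranges)


def positions(ranges):
    """Flatten a list of ranges into a set of character positions."""
    return {i for r in ranges for i in r}


def to_ranges(pos):
    """Rebuild sorted, contiguous ranges from a set of positions."""
    ranges, start, prev = [], None, None
    for i in sorted(pos):
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            ranges.append(range(start, prev + 1))
            start = prev = i
    if start is not None:
        ranges.append(range(start, prev + 1))
    return ranges


a = [range(0, 6)]    # positions of "Simple"
b = [range(7, 13)]   # positions of "string"
c = [range(3, 10)]   # positions overlapping both words

# Set algebra on positions, then back to ranges.
union = to_ranges(positions(a) | positions(b))
intersection = to_ranges(positions(a) & positions(c))
difference = to_ranges(positions(c) - positions(a))

print(extract(parent, union))         # sub-string built from two ranges
print(extract(parent, intersection))  # characters shared by a and c
```

Working on positions rather than on copies of the text means that every result can still be traced back to the same parent string, which is what makes combining several tokens meaningful.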

Token and Tokens classes

The Token and Tokens classes are containers that are useful for building elaborate tokenizers. They work together: a Token object (roughly, a class built on top of the Python string) becomes a Tokens object once it is split, and a Tokens object can be seen as a collection of Token instances.
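The relation between one token and the collection obtained by splitting it can be sketched as follows. The `MiniToken` class below is a hypothetical stand-in written for this illustration; it is not the Token class of the package, whose constructor and method signatures are described in the API documentation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniToken:
    """Hypothetical stand-in for a token: a view on part of a parent string."""
    parent: str
    start: int
    stop: int

    def __str__(self):
        return self.parent[self.start:self.stop]

    def split(self, pattern=r"\s+"):
        """Cut the token at each match of pattern and return the pieces as a list."""
        pieces, cursor = [], self.start
        for match in re.finditer(pattern, str(self)):
            cut_start = self.start + match.start()
            if cut_start > cursor:
                pieces.append(MiniToken(self.parent, cursor, cut_start))
            cursor = self.start + match.end()
        if cursor < self.stop:
            pieces.append(MiniToken(self.parent, cursor, self.stop))
        return pieces


text = "Simple string for demonstration"
token = MiniToken(text, 0, len(text))   # one token covering the whole string
tokens = token.split()                  # a collection of smaller tokens
words = [str(t) for t in tokens]
print(words)
```

Each piece keeps a reference to the same parent string together with its own start and stop positions, so no text is copied when a token is split.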

The Token and Tokens classes API presents all methods available for these two classes.

The Token and Tokens classes are explained in a series of documents, either in Markdown or Jupyter-Notebook format:

  • TokenTokens_1_Basics : Understanding the difference between Token (basically a string with associated non-overlapping ranges for extracting sub-strings) and Tokens (a list of Token) classes

  • TokenTokens_2_RangesAndSpans : Understanding the underlying concept of ranges of non-overlapping sub-strings in a Token instance, and the relation between absolute positions (positions inside the parent string) and relative positions (positions inside the token)

  • TokenTokens_3_BasicExample : Construction of a basic example of tokenizer

  • TokenTokens_4_AttributesDefinition : Construction of the attributes associated with a Token instance, and understanding of how they are transmitted to Tokens instances.
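The distinction between absolute and relative positions mentioned in TokenTokens_2_RangesAndSpans can be illustrated with plain string slicing. The functions `to_absolute` and `to_relative` are hypothetical helpers introduced for this sketch only; they are not part of the package.

```python
parent = "Simple string for demonstration"

# Absolute positions: offsets counted inside the parent string.
token_start, token_stop = 7, 13
token = parent[token_start:token_stop]   # the token "string"


def to_absolute(relative, token_start):
    """Convert a position inside the token to a position inside the parent."""
    return token_start + relative


def to_relative(absolute, token_start, token_stop):
    """Convert a position inside the parent to a position inside the token."""
    if not token_start <= absolute < token_stop:
        raise ValueError("position lies outside the token")
    return absolute - token_start


relative = token.index("r")                     # position of "r" inside the token
absolute = to_absolute(relative, token_start)   # same character, seen from the parent
print(relative, absolute, parent[absolute])
```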

Advanced Tokenizers

Several more elaborate tokenizers are built from the Span, Token and Tokens classes. They have been moved to a dedicated package called iamtokenizing, available at https://pypi.org/project/iamtokenizing/.