Documentation of the package Tokenizer¶
The documentation is available at https://nlp.frama.io/tokenspan/
The PyPI package is available at https://pypi.org/project/tokenspan/
The official repository is at https://framagit.org/nlp/tokenspan/
Summary¶
There are several different tokenizers in this package. Most of them are built from the Token and Tokens classes. The Token class is a subclass of the Span class, and Tokens can be seen as a collection of Token instances.
Important remark: The Span class (which handles the cutting of the parent string) appeared in version 0.4. Previously, the cutting of the string was handled by the Token class, which also handles the customizable attributes of the tokens.
Span class¶
The Span class is documented in Span_UserGuide. It abstracts a collection of character positions in a parent string, used to extract a sub-string (called the child string in this context) in a versatile way. In addition, the Span class provides the basic algebra of sets of positions, which allows powerful manipulation of many different tokens from a single parent string.
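To make the idea of an algebra on sets of positions more concrete, here is a minimal sketch in plain Python. It does not use the tokenspan API: the class PositionSpan and its method from_range are hypothetical names introduced only for this illustration, and the sketch assumes that the two operands of each operation share the same parent string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionSpan:
    """Hypothetical stand-in for a span (NOT the tokenspan Span class):
    a parent string plus a set of character positions inside it."""

    parent: str
    positions: frozenset

    @classmethod
    def from_range(cls, parent, start, stop):
        # Select the half-open interval [start, stop) of the parent string.
        return cls(parent, frozenset(range(start, stop)))

    def __str__(self):
        # The child string is made of the selected characters, in parent order.
        return "".join(self.parent[i] for i in sorted(self.positions))

    def __or__(self, other):
        # Union: positions selected by either span.
        return PositionSpan(self.parent, self.positions | other.positions)

    def __and__(self, other):
        # Intersection: positions selected by both spans.
        return PositionSpan(self.parent, self.positions & other.positions)

    def __sub__(self, other):
        # Difference: positions of this span that the other does not select.
        return PositionSpan(self.parent, self.positions - other.positions)


parent = "A simple example"
simple = PositionSpan.from_range(parent, 2, 8)     # selects "simple"
example = PositionSpan.from_range(parent, 9, 16)   # selects "example"
both = simple | example                            # selects both words
start = PositionSpan.from_range(parent, 0, 4)      # selects "A si"

print(str(both))            # simpleexample
print(str(both - simple))   # example
print(str(simple & start))  # si
```

The parent string is never copied or modified: every operation only combines sets of positions, and the child string is rebuilt on demand.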
Token and Tokens classes¶
The Token and Tokens classes are containers that are useful for building elaborate tokenizers. They work together: a Token object (a kind of class built on top of the Python string) becomes a Tokens object once it is split (the Tokens class can be seen as a collection of Token instances).
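The relation between the two classes can be sketched as follows. This is an illustration under stated assumptions, not the tokenspan implementation: MiniToken, MiniTokens, and the methods split and strings are hypothetical names, and the splitting rule (cut on whitespace by default) is chosen only for the example.

```python
import re


class MiniToken:
    """Hypothetical token (NOT the tokenspan Token class):
    a parent string and a half-open range [start, stop) inside it."""

    def __init__(self, parent, start=0, stop=None):
        self.parent = parent
        self.start = start
        self.stop = len(parent) if stop is None else stop

    def __str__(self):
        return self.parent[self.start:self.stop]

    def split(self, pattern=r"\s+"):
        # Cut on the separator pattern. Each piece keeps a reference to the
        # same parent string and stores its own absolute range.
        pieces, cursor = [], self.start
        for match in re.finditer(pattern, str(self)):
            sep_start = self.start + match.start()
            sep_stop = self.start + match.end()
            if sep_start > cursor:
                pieces.append(MiniToken(self.parent, cursor, sep_start))
            cursor = sep_stop
        if cursor < self.stop:
            pieces.append(MiniToken(self.parent, cursor, self.stop))
        return MiniTokens(pieces)


class MiniTokens(list):
    """Hypothetical collection of MiniToken instances."""

    def strings(self):
        return [str(token) for token in self]


token = MiniToken("Tokenizers cut strings")
tokens = token.split()

print(type(tokens).__name__)  # MiniTokens
print(tokens.strings())       # ['Tokenizers', 'cut', 'strings']
```

Splitting one token returns a collection, and every element of that collection still points to the original parent string, so positions stay comparable across tokens.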
The Token and Tokens classes API presents all methods available for these two classes.
The Token and Tokens classes are explained in a series of documents, either in Markdown or Jupyter-Notebook format:
TokenTokens_1_Basics : Understanding the difference between the Token class (basically a string with associated non-overlapping ranges for extracting sub-strings) and the Tokens class (a list of Token instances)
TokenTokens_2_RangesAndSpans : Understanding the underlying concept of ranges of non-overlapping sub-strings in a Token instance, and the relation between the absolute positions (position inside the parent string) and the relative positions (position inside the token)
TokenTokens_3_BasicExample : Construction of a basic example of tokenizer
TokenTokens_4_AttributesDefinition : Construction of the attributes associated with a Token instance, and understanding of how they are transmitted to the Tokens instances.
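The distinction between absolute and relative positions mentioned above can be illustrated with a short sketch. It is written in plain Python and is not taken from tokenspan: the two function names are hypothetical, and the sketch assumes that the token text is the plain concatenation of its non-overlapping ranges, sorted by position, with no separator between them.

```python
def relative_to_absolute(ranges, relative):
    """Map a position inside the token to a position inside the parent string.

    `ranges` is a sorted list of non-overlapping (start, stop) pairs.
    Hypothetical helper, not part of the tokenspan API.
    """
    if relative < 0:
        raise IndexError("relative position must be non-negative")
    offset = relative
    for start, stop in ranges:
        length = stop - start
        if offset < length:
            return start + offset
        offset -= length
    raise IndexError("relative position outside the token")


def absolute_to_relative(ranges, absolute):
    """Map a position inside the parent string to a position inside the token."""
    consumed = 0  # number of token characters contributed by earlier ranges
    for start, stop in ranges:
        if start <= absolute < stop:
            return consumed + (absolute - start)
        consumed += stop - start
    raise IndexError("absolute position not covered by the token")


parent = "A simple example"
ranges = [(2, 8), (9, 16)]  # selects "simple" and "example"
token_text = "".join(parent[start:stop] for start, stop in ranges)

absolute = relative_to_absolute(ranges, 7)
print(token_text[7], parent[absolute])         # x x
print(absolute_to_relative(ranges, absolute))  # 7
```

Both indices designate the same character: position 7 in the token and position 10 in the parent string.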
Advanced Tokenizers¶
There are several more elaborate tokenizers built from the Span, Token and Tokens classes. They have been moved to a dedicated package called iamtokenizing, available at https://pypi.org/project/iamtokenizing/.