2. Token and Tokens classes - Chapter 1: the basics

We introduce the Token and Tokens classes as useful tools for later implementations of Natural Language Processing (NLP) tasks in a versatile, yet efficient and easy-to-run environment.

This is achieved by the construction of two classes at two different levels:

  • the Token class, which subclasses the Python string class, allows easy representation of sub-tokens, and handles any attributes to be shared among different objects.

  • the Tokens class, which (kind of) subclasses the Python list class (in fact it is just a collection of Token objects).

2.1. Motivation

Since string objects cannot be fed to a computer for mathematical manipulation (especially prediction), one always needs to separate a document into atomic string entities, called tokens, and to vectorize them as a first step of any subsequent mathematical analysis of documents. Those tokens can be, for instance:

  • words: say, string entities separated by spaces,

  • bi-grams: moving windows of two consecutive words along the document,

  • n-grams: moving windows of n consecutive words,

  • sentences: say, string entities separated by the newline character \n,

  • n-chargrams: moving windows of n consecutive characters,

or anything in between, that is, complicated associations of characters, words, hyphenation, sentences, ellipses, and so on (a few of these token types are sketched in plain Python just below).
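
As a rough plain-Python illustration (a minimal sketch that does not yet use the classes introduced below):

# A rough plain-Python sketch of a few token types (no Token/Tokens class involved yet).
doc = "A really simple string"
words = doc.split(' ')                                 # word tokens
bigrams = list(zip(words, words[1:]))                  # bi-grams of words
chargrams = [doc[i:i+3] for i in range(len(doc) - 2)]  # 3-chargrams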

In order to manipulate those elements on a computer, one needs a versatile object able to handle many different situations in a unified manner. Unfortunately, as currently implemented in the standard libraries, tokenization suffers from some limitations. For the sake of convenience, we review only some popular Python implementations of tokenization, in a separate document available in the official documentation; below we focus on the tool classes we construct to work around these limitations.

In order to avoid hiding the tokenization specificities inside an end-to-end algorithm having tokenization at its basis (as in the sklearn or spaCy libraries), we want to create a tokenization class that performs only the tokenization process. In order to avoid getting plain strings as tokenizer output (as in the nltk library), we will create a Token object which can easily be adapted to later usages at will.

Since tokenizing a text consists in splitting a string into sub-parts, starting from a single Token object before the tokenization process naturally ends with many different Token instances once the tokenization takes place. We decided to group the Token instances coming from one tokenization step inside a grouping class, called Tokens.

2.2. Summary of the Token and Tokens classes

The tokenization process (handled later by a Tokenizer class not described in this document) will sit on top of the following classes:

  • the Token class, which builds on the Python string class, allowing one to manipulate Token instances as easily as usual strings, while also making it possible to add any attribute or method on the fly, for later adaptation to a personalized pipeline. In addition, it includes some range properties, allowing one to define a Token as a combination of several sub-parts of the parent string.

  • the Tokens class, which mimics the Python list class, allowing one to group many Token instances together.

Each of these two classes has methods constructing instances of the other one. For instance, once split, a Token instance becomes a collection of several tokens, which are then grouped in a Tokens instance. One can then unsplit this single Tokens instance into a unique Token instance by way of a re-gluing process, even though the Tokens instance contained several Token ones.

Important note: Token and Tokens do not per se perform the tokenization process. They are just tool classes, helpful to quickly build more complicated tokenizer classes while preserving basic conceptual models for the related objects. See the companion library iamtokenizing for such constructions.

from tokenspan import Token, Tokens
text = "A really simple string for illustration."

2.3. Basic Usage of Token class

The basic usage of the Token class is as a container for a string. It is constructed from a string, passed through the string argument at instantiation.

For instance, Token objects implement the usual len, [n] indexing and [start:stop:step] slices.

token = Token(string=text)
print(len(token)==len(text))
print(token[12:45])
print(text[12:45])
print(token[12:45:2])
print(text[12:45:2])
True
ple string for illustration.
ple string for illustration.
pesrn o lutain
pesrn o lutain

The first basic attribute of a Token instance is its string, which is simply the string initiating the tokenization process.

token.string
'A really simple string for illustration.'

The second important attribute is ranges, which is a collection of basic Python range objects. Once the ranges attribute is not trivial (that is, token.ranges != [range(0,len(token.string)),]), one can start understanding what the Token object can do.

Note that the ranges attribute is always a collection of range objects; that is, even for a single range, one still has to pass the ranges parameter inside a list, otherwise one gets a ValueError.

token = Token(string=text,
              ranges=[range(2,15),])
print("Token length : {}".format(len(token)))
print(token[:])
print("*"*len(token))
print(token[0])
print("*"*len(token))
print(token[:])
print("*"*len(token))
print(token[0:5])
print("*"*len(token))
print(token[10:100])
print("*"*len(token))
try:
    Token(ranges=range(2,15),string=text)
except ValueError as error:
    print("ValueError{}".format(error.args))
Token length : 13
really simple
*************
r
*************
really simple
*************
reall
*************
ple
*************
ValueError('r is not an instance of Range',)

That is, the Token object behaves like a string, but with only the characters selected by the ranges attribute. The length of the Token instance is now given by the size of the range.

It is possible to have several range objects in the ranges list. We will come back to this possibility later.

The str magic function allows one to extract the string represented by the Token object.

str(token)
'really simple'

Importantly, note that str(token) is always a part of token.string. In particular, they are equal when the ranges attribute is not given at instantiation.

token.string
'A really simple string for illustration.'

To capture whether a Token has an empty string, there is a boolean evaluation of the Token object. It works for both characters and strings extracted from the Token instance, as well as for the entire Token object, in which case it answers the question: does the Token object have a non-empty string?

As usual with lists and strings, asking for a precise element can raise an IndexError, but not when passing a slice as argument, which returns an empty string in our case.

To make the entire Token instance return False, its string must be empty. This can be done either by giving no string at instantiation, or by explicitly giving an empty ranges. Note that giving no ranges parameter at instantiation is understood as the ranges corresponding to the entire string parameter.

print(bool(token[5]))
try:
    print(bool(token[len(token)+5]))
except IndexError as error:
    print("IndexError{}".format(error.args))
print(bool(token[len(token)+5:len(token)+10]))
print(bool(token))

print(bool(Token()))
print(bool(Token(string=text,ranges=[])))
print(bool(Token(string=text,ranges=[range(10,10),])))
True
IndexError('string index out of range',)
False
True
False
False
False

There are basic search possibilities inside the Token class, such as searching for a sub-string.

print('really' in token)
print('ae' in token)
True
False

Additionally, since Token kind of sub-classes the Python str object, all the string methods already available from the Python string class are converted to Token methods. Below is an example with the upper() method. Note that one has to catch the outcome of the str method in order to apply further methods, unless one just wants to extract some properties of the string.

Note: all the encoding issues of the string can be solved in the usual way, thanks to the str sub-class underneath Token.

print(token.upper())
print(token.startswith('really'))
bytes_ = token.encode('latin-1')
print(bytes_)
REALLY SIMPLE
True
b'really simple'

Nevertheless, some basic string methods return lists of strings. Basic examples are partition and split. Let us see how the Python string class handles these two cases.

s = "really simple string"
print(s.split(' '))
print(s.partition(' '))
['really', 'simple', 'string']
('really', ' ', 'simple string')

One sees that split returns a list of strings, each of them being a sub-string of the initial one once the spaces are removed, whereas partition separates the initial string into three parts, the middle one being the first occurrence of the separator it finds during its execution.

It would be handy to use split and/or partition to generate tokens from a string. Nevertheless, it is not obvious how to deal with the outcome of these methods while still being able to add attributes to them and to keep these extra attributes for later processes. That is why we disallow these basic methods in the Token class and replace them by more useful behaviors. Since one is interested in generating sub-strings, we propose to capture all the sub-strings into another class, called Tokens.

One can basically think of a Tokens instance as a list of Token objects. It has len, and its list representation allows one to print every Token element it contains. One generates a Tokens instance by passing a Token one through the split or partition methods. Nevertheless, these methods now take either a collection of range objects giving the cuts (for split) or a pair of positions start, stop (for partition), as illustrated below.

tokens = token.partition(6,7)
print(len(tokens))
list(tokens)
3
[Token('really', [(2,8)]), Token(' ', [(8,9)]), Token('simple', [(9,15)])]
tokens = token.split([range(6,7),])
print(len(tokens))
list(tokens)
3
[Token('really', [(2,8)]), Token(' ', [(8,9)]), Token('simple', [(9,15)])]

The usefulness of the split method is that it handles several partitions at once. A basic usage for tokenization is to pass the cuts obtained from a regex search, as sketched just below; this will be done more thoroughly in the following chapters. Here we restrict ourselves to basic usages of the Token and Tokens objects.
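
For instance, a minimal sketch (working on a fresh Token that covers the whole string, so that the positions found by the standard re module coincide with the Token positions) could look as follows; it should reproduce the word-by-word split performed by hand in section 2.5 below.

import re

# A hedged sketch: feed split() with the spans found by a regex search.
# We work on a fresh Token covering the whole string, so that the regex
# positions and the Token positions coincide.
token_full = Token(string=text)
cuts = [range(m.start(), m.end()) for m in re.finditer(r'\s', str(token_full))]
words_and_spaces = token_full.split(cuts)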

The third and last way of passing from a Token to a Tokens is the slice method, which consists in the overlapping creation of Token objects from start to stop, hence its name. This method allows the instantaneous creation of chargrams.

tokens = token.slice(0,6,3)
print(len(tokens))
list(tokens)
4
[Token('rea', [(2,5)]),
 Token('eal', [(3,6)]),
 Token('all', [(4,7)]),
 Token('lly', [(5,8)])]

Now we turn to a basic explanation of how the Tokens class works.

2.4. Basic usage of Tokens class

Understanding Tokens[start:stop] helps clarify the interest of the two interrelated Tokens and Token classes.

In fact, Tokens[n] returns the n-th element of the attribute Tokens.tokens (recall that a Tokens object can be thought of as nothing but a list of Token objects). Tokens[n] thus just returns a Token object.

On the contrary, Tokens[start:stop] returns a collection of Token objects, so it returns a Tokens object. One can also use a step parameter, if it helps in later developments. Note the specific representation of the Tokens class: it gives the number of Token instances it contains, and displays the strings of all the Token objects present in the Tokens.

So the usage of Tokens and Token as two different levels (or two different classes) is driven by the necessity to collect different tokens inside a part of a document. As long as they are related to the same initial string, one should not split up the different Token instances held inside a Tokens one. It also helps to move back and forth between the different scales of the token and of the whole document, seen as a collection of tokens. We will explore the usefulness of such behaviors in later chapters.

tokens = token.partition(6,7)
tokens[0:1]
Tokens(1 Token) : 
-----------------
really
tokens[0]
Token('really', [(2,8)])

Being a kind of list, Tokens implements len, append, extend, insert and __add__ as well. It does not implement remove, index, pop, sort or clear, so the list representation of Tokens has some limits. If one understands how these methods work and what they do, one sees quite easily why they are not implemented: how to sort a list of tokens if not by hand? How to tell the list what kind of token one wants to remove, or how to get the index of a given token? These methods must be handled by hand, which is not that complicated in fact, as we will see; a quick sketch follows immediately below.
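
For instance, a hand-made remove is nothing but a rebuild from a filtered list (a minimal sketch, assuming, as used further below, that a Tokens instance is iterable and that the Tokens constructor accepts a list of Token objects):

# A hedged sketch of a hand-made "remove": keep only the Token objects one
# wants, then rebuild a Tokens instance from the filtered list.
kept = [tok for tok in tokens if str(tok) != ' ']
tokens_without_spaces = Tokens(kept)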

Let us first see how one can work with the few list-like methods implemented in Tokens. The basic methods insert, extend and append work in place, as for a list. For convenience, we add Token objects taken from the Tokens object itself, but the usual usage is to attach extra tokens to a Tokens instance.

tokens.insert(1,tokens[0])
Tokens(4 Token) : 
-----------------
really
really
 
simple
tokens.extend(tokens[0:2])
Tokens(6 Token) : 
-----------------
really
really
 
simple
really
really
tokens.append(tokens[0])
Tokens(7 Token) : 
-----------------
really
really
 
simple
really
really
really

And now we see how to clean up the nasty Tokens: simply construct a list of Token objects and pass it to a new Tokens instance.

tokens_list = [tokens[1],tokens[2],tokens[3]]
new_tokens = Tokens(tokens_list)
new_tokens
Tokens(3 Token) : 
-----------------
really
 
simple

Here, one could have used the slicing syntax as well

initial_tokens = tokens[1:4]
initial_tokens
Tokens(3 Token) : 
-----------------
really
 
simple

Or one could use some combination of append and extend again, starting from an almost empty Tokens.

Note that __add__ (the + operation) works on Tokens like extend

new_tokens = tokens + tokens[1:]
new_tokens
Tokens(13 Token) : 
------------------
really
really
 
simple
really
really
really
really
 
simple
really
really
really

… and it can be used with the augmented assignment += as well.

initial_tokens += new_tokens[:2]
initial_tokens
Tokens(5 Token) : 
-----------------
really
 
simple
really
really

A Tokens instance has len and str as well, the latter given by the concatenation of all the strings of the different Token objects contained in the Tokens, with the _NEW_Token_ string interleaved between them. This separator cannot be changed unless one hard-codes it, but it appears only in this representation, so this limitation is rather harmless anyway, since the main focus of later work will be on Token and not on Tokens objects.

print(len(tokens))
str(tokens)
7
'really_NEW_Token_really_NEW_Token_ _NEW_Token_simple_NEW_Token_really_NEW_Token_really_NEW_Token_really'

2.5. Tokenization usages of Token and Tokens

In practice, the main usage of the Tokens object lies in the possibility to re-glue the different Token instances contained in it. This is achieved by the join(start,stop,step) method.

The start and stop arguments serve in case one wants to glue only a part of the Tokens list. When they are not given, the full list of Token objects is used and re-glued.

Let us discard the step argument for a while, and undo the tokenization of the above split.

tokens = initial_tokens[:3]
tokens.join()
/home/gozat/.local/lib/python3.8/site-packages/tokenspan/tools.py:28: BoundaryWarning: At least one of the boundaries has been modified
  warnings.warn(mess, category=BoundaryWarning)
Token('really simple', [(2,15)])

There is also a slice(start,stop,size,step) method, which stays in the Tokens realm and allows one to construct n-grams quite easily. So below we restart from the entire initial string, and isolate each word with the split method.

token = Token(string=text)
tokens = token.split((range(1,2),range(8,9),range(15,16),range(22,23),range(26,27),))
tokens
Tokens(11 Token) : 
------------------
A
 
really
 
simple
 
string
 
for
 
illustration.

We now apply the slice method to obtain a Tokens of bi-grams. Note that none of the start=0, stop=len(Tokens), size=1 or step=1 parameters is required; their default values are the ones indicated here.

tokens_slice1 = tokens.slice(size=3,step=2)
tokens_slice1
Tokens(5 Token) : 
-----------------
A really
really simple
simple string
string for
for illustration.

We set step=2 above because the space strings are still considered as Token objects here, and size counts Tokens elements, that is, the number of Token objects that will be glued together for each new element of the resulting Tokens. Compare the above with the following (here we give start and stop explicitly, as a reminder of the argument order).

tokens.slice(0,len(tokens),size=3,step=1)
Tokens(9 Token) : 
-----------------
A really
 really 
really simple
 simple 
simple string
 string 
string for
 for 
for illustration.

There are still 3 Token objects per line, but every second line has two space Token elements.

Note that one could have done the following as well …

tokens_slice2 = tokens[::2].slice(0,len(tokens[::2]),2,1)
tokens_slice2
Tokens(5 Token) : 
-----------------
A really
really simple
simple string
string for
for illustration.

… to both withdraw the spaces and use simpler arguments. The reason why has to do with the ranges attributes and how they are handled underneath. This is the topic of the next chapter. We just leave you with these attributes in the two slicing cases above, as food for thought.

for tok in tokens_slice1:
    print(tok.ranges)
print("="*15)
for tok in tokens_slice2:
    print(tok.ranges)
[range(0, 8)]
[range(2, 15)]
[range(9, 22)]
[range(16, 26)]
[range(23, 40)]
===============
[range(0, 1), range(2, 8)]
[range(2, 8), range(9, 15)]
[range(9, 15), range(16, 22)]
[range(16, 22), range(23, 26)]
[range(23, 26), range(27, 40)]

Hint: a Token can be described by several range objects in the ranges attribute, and overlaps between ranges are forbidden. The different range objects in Token.ranges are glued in the representation by a subtoksep string separator of length 1 (here a space, as by default). Hence the two solutions look quite the same, despite completely different ranges … well, not completely different, only the space is missing in the second option.
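
As a quick illustration of this hint (a minimal sketch, relying on the default subtoksep being a single space, as stated above), one can build such a multi-range Token by hand:

# A hedged sketch: a Token described by two non-overlapping ranges.
# The two sub-parts should be glued by the default subtoksep (a single space)
# in the string representation, mimicking the second slicing option above.
two_ranges = Token(string=text, ranges=[range(2, 8), range(9, 15)])
print(str(two_ranges))   # expected to display: really simple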

from datetime import datetime
print("Last modification {}".format(datetime.now().strftime("%c")))
Last modification Wed Jan 19 19:55:53 2022