2. Token and Tokens classes - Chapter 1: the basics¶
We introduce the Token and the Tokens classes as useful tools for later implementations of Natural Language Processing (NLP) tasks in a versatile, yet efficient and easy-to-run environment.
This is achieved by the construction of two classes at two different levels:
The Token class, which subclasses the Python string class, allows easy sub-token representation and handles any attribute to be shared among different objects.
The Tokens class, which (kind of) subclasses the Python list class (in fact, it is just a collection of Token objects).
2.1. Motivation¶
As string objects cannot be fed to a computer for mathematical manipulation (especially prediction), one always needs to separate a document into atomic string entities, called tokens, and to vectorize them as a first step of any subsequent mathematical analysis of documents. Those tokens can be just:
words: say, string entities separated by spaces,
bi-grams: moving windows of two consecutive words along the document,
n-grams: moving windows of n consecutive words,
sentences: say, string entities separated by the newline character \n,
n-chargrams: moving windows of n consecutive characters,
or anything in between, that is, complicated associations of characters, words, hyphenations, sentences, ellipses, …
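As a plain-Python illustration (a minimal sketch, independent of the classes introduced in this document), the first few token types above can be extracted from a string as follows:

```python
# Minimal stand-alone sketch of some token types listed above,
# using only built-in Python string operations.
text = "A really simple string"

words = text.split(' ')                                  # word tokens
bigrams = list(zip(words, words[1:]))                    # windows of 2 consecutive words
chargrams = [text[i:i+3] for i in range(len(text) - 2)]  # 3-chargrams

print(words)          # ['A', 'really', 'simple', 'string']
print(bigrams)        # [('A', 'really'), ('really', 'simple'), ('simple', 'string')]
print(chargrams[:3])  # ['A r', ' re', 'rea']
```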
In order to manipulate those elements from a computer, one needs a versatile object able to handle many different situations in a unified manner. Unfortunately, as currently implemented in the standard libraries, tokenization suffers from some limitations. For the sake of conciseness, we discuss some popular Python implementations of tokenization in a separate document, available in the official documentation.
In order to avoid hiding the tokenization specificities inside an end-to-end algorithm having tokenization at its basis (as in the sklearn or spaCy libraries), we want to create a Tokenization class that performs only the tokenization process. In order to avoid getting simple strings as tokenizer output (as in the nltk library), we will create a Token object which can easily be adapted to later usages at will.
Since tokenizing a text consists in splitting a string into sub-parts, starting from a single Token object before the tokenization process will naturally end with many different Token instances once the tokenization takes place. We decided to group the instances of Token coming from one tokenization step inside a grouping class, called Tokens.
2.2. Summary of the Token and Tokens classes¶
The tokenization process (handled later by a Tokenizer class not described in this document) will sit on top of the following classes:
The Token class, which combines the notions of string and ranges: it mimics the string class in Python, allowing one to manipulate Token instances as easily as a usual string, in addition to being able to add any attribute and method on the fly, for later adaptation to a personalized pipeline. In addition, it also includes some range properties, allowing one to define a Token as a combination of several sub-parts of the parent string.
The Tokens class, which mimics the list class in Python, allowing one to put many Token instances into packets.
Each of these two classes has methods constructing instances of the other one. For instance, once split, a Token instance becomes a collection of several tokens, which are then grouped in a Tokens instance. One can then unsplit this single Tokens instance into a unique Token instance by way of re-gluing processes, even though the Tokens instance contained several Token ones.
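The split/unsplit round trip can be sketched with plain offsets (a hypothetical stand-in for the actual Token/Tokens machinery): each sub-part keeps its (start, stop) position in the parent string, so re-gluing amounts to merging spans.

```python
# Plain-Python sketch of the split/unsplit idea: every sub-token keeps its
# (start, stop) span in the parent string, so re-gluing is span arithmetic.
text = "A really simple string for illustration."

# "split": two spans of the parent string, around the space at position 8
spans = [(2, 8), (9, 15)]
print([text[a:b] for a, b in spans])        # ['really', 'simple']

# "unsplit": merge the spans back into one covering span of the parent
start = min(a for a, _ in spans)
stop = max(b for _, b in spans)
print(text[start:stop])                     # 'really simple'
```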
Important note: Token and Tokens do not per se perform the tokenization process. They are just tool classes, helpful to quickly build more complicated tokenizer classes while conserving basic conceptual models for the related objects. See the companion library iamtokenizing for such constructions.
from tokenspan import Token, Tokens
text = "A really simple string for illustration."
2.3. Basic Usage of Token class¶
Basic usage of the Token class is as a container for a string. It is constructed from a string, passed as the string argument at instantiation.
For instance, Token objects implement the usual len, [n] indexing and [start:stop:step] slices.
token = Token(string=text)
print(len(token)==len(text))
print(token[12:45])
print(text[12:45])
print(token[12:45:2])
print(text[12:45:2])
True
ple string for illustration.
ple string for illustration.
pesrn o lutain
pesrn o lutain
The first basic attribute of a Token instance is its string, which is simply the string initiating the process of tokenization.
token.string
'A really simple string for illustration.'
The second important attribute is ranges, which is a collection of basic Python range objects. Once ranges is not trivial (that is, once token.ranges != [range(0, len(token.string))]), one can start understanding what the Token object can do.
Note that the ranges attribute is always a collection of range objects; that is, even for a single range, one still has to give the ranges parameter inside a list, otherwise one gets a ValueError.
token = Token(string=text,
              ranges=[range(2,15),])
print("Token length : {}".format(len(token)))
print(token[:])
print("*"*len(token))
print(token[0])
print("*"*len(token))
print(token[:])
print("*"*len(token))
print(token[0:5])
print("*"*len(token))
print(token[10:100])
print("*"*len(token))
try:
    Token(ranges=range(2,15),string=text)
except ValueError as error:
    print("ValueError{}".format(error.args))
Token length : 13
really simple
*************
r
*************
really simple
*************
reall
*************
ple
*************
ValueError('r is not an instance of Range',)
That is, the Token object behaves as a string, but with its characters filtered by the ranges attribute. The length of the Token instance is now given by the size of the range.
It is possible to have several range objects in the ranges list. We will come back to this possibility later.
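To anticipate what several ranges could mean, here is a stand-alone sketch of the idea (the glue function below is hypothetical, not part of the library; it assumes, as described later in this document, that disjoint ranges are glued with a one-character separator):

```python
# Hypothetical stand-in for a Token with several ranges: select the
# sub-strings given by the ranges and glue them with a separator.
def glue(string, ranges, subtoksep=' '):
    return subtoksep.join(string[r.start:r.stop] for r in ranges)

text = "A really simple string for illustration."
print(glue(text, [range(2, 8), range(9, 15)]))  # 'really simple'
```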
The str magic function allows one to extract the token represented by the Token object.
str(token)
'really simple'
Importantly, note that str(token) is always a part of token.string. The two are equal when the ranges attribute is not given at instantiation.
token.string
'A really simple string for illustration.'
To capture whether a Token has an empty string, there is a boolean evaluation of the Token object. It works for both characters and strings extracted from the Token instance, as well as for the entire Token object, in which case it answers the question: does the Token object have a non-empty string?
As usual with lists and strings, asking for a precise element can return an IndexError, but not when passing a slice as argument, which will return an empty string in our case.
To make the entire Token instance return False, its string must be empty. This can be done either by giving no string at instantiation, or by explicitly giving an empty ranges. Note that giving no ranges parameter at instantiation is understood as the ranges corresponding to the entire string parameter.
print(bool(token[5]))
try:
    print(bool(token[len(token)+5]))
except IndexError as error:
    print("IndexError{}".format(error.args))
print(bool(token[len(token)+5:len(token)+10]))
print(bool(token))
print(bool(Token()))
print(bool(Token(string=text,ranges=[])))
print(bool(Token(string=text,ranges=[range(10,10),])))
True
IndexError('string index out of range',)
False
True
False
False
False
There are basic search possibilities inside the Token class, such as the search for a sub-string.
print('really' in token)
print('ae' in token)
True
False
Additionally, since Token kind of sub-classes the str object from Python, all the methods already available from the Python string class are converted to Token methods. Below is an example with the upper() method. Note that one has to catch the outcome of the str method in order to apply further methods, unless one just wants to extract some properties of the string.
Note: all the encoding issues of the string can be solved in the usual way, thanks to the str sub-class underneath Token.
print(token.upper())
print(token.startswith('really'))
bytes_ = token.encode('latin-1')
print(bytes_)
REALLY SIMPLE
True
b'really simple'
Nevertheless, some basic string methods return lists of strings. Basic examples are partition and split. Let us see how the Python string class handles these two cases.
s = "really simple string"
print(s.split(' '))
print(s.partition(' '))
['really', 'simple', 'string']
('really', ' ', 'simple string')
One sees that split returns a list of strings, each of them being a sub-string of the initial one once the spaces are removed, whereas partition separates the initial string into three parts, the intermediary one being the first separator it finds during its execution.
It would be handy to use split and/or partition to generate some tokens from a string. Nevertheless, it is not obvious how to deal with the outcome of these methods while still being able to add attributes to them and to keep these extra attributes for later processes. That is why we disallow these basic methods in the Token class and replace them with more useful behaviors. Since one is interested in generating sub-strings, one proposes to capture all the sub-strings into another class, called Tokens.
One can basically think of a Tokens instance as a list of Token objects. It has a len, and its list representation allows one to print every Token element it has. One generates a Tokens instance by passing a Token one through the split or partition methods. Nevertheless, these methods now take either a list of range objects (for split) or a pair of start and stop positions (for partition).
tokens = token.partition(6,7)
print(len(tokens))
list(tokens)
3
[Token('really', [(2,8)]), Token(' ', [(8,9)]), Token('simple', [(9,15)])]
tokens = token.split([range(6,7),])
print(len(tokens))
list(tokens)
3
[Token('really', [(2,8)]), Token(' ', [(8,9)]), Token('simple', [(9,15)])]
The usefulness of the split method is that it handles several partitions at once. A basic usage for tokenization is to pass the cuts coming from a regex search. This will be done in the following chapters. Here we restrict ourselves to basic usages of the Token and Tokens objects.
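As a hint of what the following chapters will do, here is a sketch (using only the standard re module; the exact interface of the later chapters may differ) of how a regex search can produce the cuts to be passed to split:

```python
# Sketch: turning regex matches into range objects, the kind of cuts
# that the split method accepts (regex tokenization comes in later chapters).
import re

text = "A really simple string for illustration."
cuts = [range(m.start(), m.end()) for m in re.finditer(r'\s+', text)]
print(cuts)  # [range(1, 2), range(8, 9), range(15, 16), range(22, 23), range(26, 27)]
```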
The third and last way of passing from Token to Tokens is the slice method, which consists in the overlapping creation of Token instances from start to stop by step, hence its name. This method allows the creation of chargrams instantaneously.
tokens = token.slice(0,6,3)
print(len(tokens))
list(tokens)
4
[Token('rea', [(2,5)]),
Token('eal', [(3,6)]),
Token('all', [(4,7)]),
Token('lly', [(5,8)])]
Now we pass to the basic explanation of how the Tokens class works.
2.4. Basic usage of Tokens class¶
Understanding Tokens[start:stop] helps clarify the interest of the two interrelated Tokens and Token classes.
In fact, Tokens[n] will return the n-th element of the attribute Tokens.tokens (recall that a Tokens object can be thought of as nothing but a list of Token objects). Tokens[n] will thus just return a Token object.
On the contrary, Tokens[start:stop] returns a collection of Token objects, so it returns a Tokens object. One can also use a step parameter, if it helps in later developments. Note the specific representation of the Tokens class: it gives the number of Token instances it contains, and prints the concatenated version of all strings in all the Token objects present in the Tokens.
So the usage of Tokens and Token as two different levels (or two different classes) is driven by the necessity to collect different tokens inside a part of a document. As long as they are related to the same initial string, one should not split up the different Token instances inside a Tokens one. It also helps to move back and forth between the different scales of the token and the whole document, seen as a collection of tokens. We will explore the usefulness of such behaviors in later chapters.
tokens = token.partition(6,7)
tokens[0:1]
Tokens(1 Token) :
-----------------
really
tokens[0]
Token('really', [(2,8)])
Being a kind of list, Tokens implements len, append, extend, insert, and __add__ as well. It does not implement remove, index, pop, sort or clear, so the list representation of Tokens has some limits. If one understands how these methods work and what they do, one sees quite easily why they are not implemented: how to sort a list of tokens if not by hand? How to tell the list what kind of token one wants to remove, or how to get the index of a given token? These methods must be handled by hand, which is not that complicated in fact, as we will see.
Let us first see how one can work with the few list-like methods implemented in Tokens. The basic methods insert, extend and append work in place, as for a list. For convenience, we add Token instances taken from the Tokens object itself, but the usual usage is to attach extra tokens to a Tokens instance.
tokens.insert(1,tokens[0])
Tokens(4 Token) :
-----------------
really
really
simple
tokens.extend(tokens[0:2])
Tokens(6 Token) :
-----------------
really
really
simple
really
really
tokens.append(tokens[0])
Tokens(7 Token) :
-----------------
really
really
simple
really
really
really
And now we see how to clean up this messy Tokens: simply construct a list of Token objects and pass it to a new Tokens instance.
tokens_list = [tokens[1],tokens[2],tokens[3]]
new_tokens = Tokens(tokens_list)
new_tokens
Tokens(3 Token) :
-----------------
really
simple
Here, one could have used the slicing method as well.
initial_tokens = tokens[1:4]
initial_tokens
Tokens(3 Token) :
-----------------
really
simple
Or use some combination of append and extend again, starting from an almost empty Tokens…
Note that __add__ (the + operation) works on Tokens like extend…
new_tokens = tokens + tokens[1:]
new_tokens
Tokens(13 Token) :
------------------
really
really
simple
really
really
really
really
simple
really
really
really
… and can be used in its augmented form += as well.
initial_tokens += new_tokens[:2]
initial_tokens
Tokens(5 Token) :
-----------------
really
simple
really
really
A Tokens instance has len and str attributes as well, the latter given by the concatenated version of all strings of the different Token instances contained in the Tokens, with the _NEW_Token_ string intertwined. This string cannot be changed unless one hard-codes it, but it is present only in this representation, so it is kind of useless anyway, since the main focus of later works will be on Token and not on Tokens objects.
print(len(tokens))
str(tokens)
7
'really_NEW_Token_really_NEW_Token_ _NEW_Token_simple_NEW_Token_really_NEW_Token_really_NEW_Token_really'
2.5. Tokenization usages of Token and Tokens¶
In practice, the main usage of the Tokens object lies in the possibility to re-glue the different Token instances contained in it. This is achieved by the method join(start,stop,step).
The start and stop arguments serve in case one wants to glue only a part of the Tokens list. When they are not given, the full list of Tokens is used and re-glued.
Let us discard the step argument for a while, and undo the tokenization for the above split.
tokens = initial_tokens[:3]
tokens.join()
/home/gozat/.local/lib/python3.8/site-packages/tokenspan/tools.py:28: BoundaryWarning: At least one of the boundaries has been modified
warnings.warn(mess, category=BoundaryWarning)
Token('really simple', [(2,15)])
There is also a slice(start,stop,size,step) method, which stays in the Tokens realm and allows constructing some n-grams quite easily. So below we restart with the entire first string, and isolate each word with the split method.
token = Token(string=text)
tokens = token.split((range(1,2),range(8,9),range(15,16),range(22,23),range(26,27),))
tokens
Tokens(11 Token) :
------------------
A
really
simple
string
for
illustration.
We now apply the slice method to obtain a Tokens of bi-grams. Note that none of the start=0, stop=len(Tokens), size=1 or step=1 parameters is required; their default values are indicated here.
tokens_slice1 = tokens.slice(size=3,step=2)
tokens_slice1
Tokens(5 Token) :
-----------------
A really
really simple
simple string
string for
for illustration.
We set step=2 above because the space strings are still considered as Token elements here, and size counts Tokens elements, that is, the number of Token instances that will be glued for each new element of the resulting Tokens. Compare the above with the following (here we give start and stop explicitly, as a reminder of the argument order).
tokens.slice(0,len(tokens),size=3,step=1)
Tokens(9 Token) :
-----------------
A really
really
really simple
simple
simple string
string
string for
for
for illustration.
There are still 3 Token elements per line, but every second line contains two space Token elements.
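The size/step windowing at work here can be sketched on a plain list of strings (a stand-in for the actual Tokens implementation, with string concatenation playing the role of the re-gluing):

```python
# Plain-Python sketch of the slice(size, step) windowing: glue `size`
# consecutive elements, moving the window forward by `step`.
def windows(elements, size, step):
    return [''.join(elements[i:i+size])
            for i in range(0, len(elements) - size + 1, step)]

tokens = ['A', ' ', 'really', ' ', 'simple']   # word and space tokens
print(windows(tokens, size=3, step=2))         # ['A really', 'really simple']
```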
Note one could have done as well …
tokens_slice2 = tokens[::2].slice(0,len(tokens[::2]),2,1)
tokens_slice2
Tokens(5 Token) :
-----------------
A really
really simple
simple string
string for
for illustration.
… to both withdraw the spaces and use simpler arguments. The reason why has to do with the ranges attributes and how they are handled underneath. This is the topic of the next chapter. We just leave you with the ranges attributes in the two slicing cases above, for you to think about.
for tok in tokens_slice1:
print(tok.ranges)
print("="*15)
for tok in tokens_slice2:
print(tok.ranges)
[range(0, 8)]
[range(2, 15)]
[range(9, 22)]
[range(16, 26)]
[range(23, 40)]
===============
[range(0, 1), range(2, 8)]
[range(2, 8), range(9, 15)]
[range(9, 15), range(16, 22)]
[range(16, 22), range(23, 26)]
[range(23, 26), range(27, 40)]
Hint: a Token can be described by several range objects in the ranges attribute, and overlapping between range objects is forbidden. The different range objects in Token.ranges are glued, for representation, by a subtoksep string separator of length 1 (here a space, the default). Hence the two solutions look quite the same, despite having completely different ranges… well, not completely different: only the space is missing in the second option.
from datetime import datetime
print("Last modification {}".format(datetime.now().strftime("%c")))
Last modification Wed Jan 19 19:55:53 2022