4. Token and Tokens classes - Chapter 3: A basic example

In the previous chapters, we introduced the Token and Tokens objects as simple representations of a Python string. We showed the basic methods for splitting a string into its Token components and re-gluing them, and how to collect them in a Tokens instance.

Here we will work through a simple example, exploring a few of the possibilities these two classes allow. We limit ourselves to the methods already discussed in the previous chapters, namely Token.split and Tokens.join.

4.1. Motivation

We would like to implement a simple Lucene-like tokenizer. Lucene is an open-source Apache library providing powerful searching and indexing tools for large corpora of documents. To index correctly, one has to adopt a consistent tokenizer that parses every text of the corpus in the same fashion and extracts the relevant information. The Lucene library itself is written in Java, so here we will implement a Lucene-like tokenizer in Python; it is essentially the one used by the scikit-learn CountVectorizer object. Once we have such a tokenizer, one can generalize it using the versatility offered by the Token and Tokens classes.

To construct the Lucene-like tokenizer, we need some knowledge of regular expressions (REGEX). They are implemented in Python through the re package, whose documentation also offers a quick introduction to their use.

We start by instantiating a simple string, which will serve as support for later illustrations of the ranges attribute.

Recall that, as in the previous chapters, one can wrap the import in a try/except ModuleNotFoundError block to handle the case where the package has not been installed.

from tokenspan import Token, Tokens
import re

text = "A really simple string for illustration.\n"
text += "With a few more words than in the previous chapters.\n"

Once the Token object is instantiated with the above text, the tokenization just corresponds to a call to split. This method takes as parameter a list of (start, stop) positions (given below as range objects) at which the cutting will take place. The basic usage of the REGEX is to obtain these (start, stop) positions in an automatic way.

4.2. Lucene-like tokenization

The Lucene-like tokenization consists in cutting the text at word boundaries. The regular expression underneath is simply '(?u)\b\w\w+\b', where \b stands for a word boundary (the zero-width position next to a space or a punctuation mark, for instance) and \w for a word character (a letter, a digit or an underscore; the (?u) flag enforces Unicode matching, which is already the default in Python 3). Below we use the slightly relaxed pattern '(?u)\b\w+\b' so that single-character words such as 'A' are kept as well. One finds all matches of such a REGEX using the re.finditer method. finditer returns an iterator, so it can only be consumed once; consuming it a second time yields nothing. For each match, one can then extract the positions from its start() and end() methods, or from the alternative span() method. See the re package documentation for more details.
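As a quick illustration on a short, made-up string, one can compare the strict Lucene pattern with the relaxed one used below, and read the (start, stop) positions off the span() method of each match.

sample = "A tiny test."   # made-up string, only for this illustration
print([m.span() for m in re.finditer(r'(?u)\b\w\w+\b', sample)])  # [(2, 6), (7, 11)]: 'tiny' and 'test'
print([m.span() for m in re.finditer(r'(?u)\b\w+\b', sample)])    # [(0, 1), (2, 6), (7, 11)]: the single-letter 'A' is kept too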

Once the cuts are generated, one simply feeds them to Token.split to end up with a Tokens object containing all the tokens (the pieces in between the cuts, such as spaces and punctuation, are kept as Token objects as well).

token = Token(string=text)
regex_gen = re.finditer(r'(?u)\b\w+\b',token.string)
cuts = [range(r.start(),r.end()) for r in regex_gen]
tokens = token.split(cuts)
tokens
Tokens(33 Token) : 
------------------

A
 
really
 
simple
 
string
 
for
 
illustration
.

With
 
a
 
few
 
more
 
words
 
than
 
in
 
the
 
previous
 
chapters
.

Now we want to extract all the meaningful strings from the Tokens object. We can use slicing, for instance.

meaningfull_tokens = tokens[1:12:2]+tokens[13::2]
meaningfull_tokens
Tokens(16 Token) : 
------------------
A
really
simple
string
for
illustration
With
a
few
more
words
than
in
the
previous
chapters

One may produce the same result in a more automatic way by filtering the Token objects, using their string representation for instance.

meaningfull_tokens = Tokens([tok for tok in tokens 
                             if str(tok) not in [' ','.\n','']])
meaningfull_tokens
Tokens(16 Token) : 
------------------
A
really
simple
string
for
illustration
With
a
few
more
words
than
in
the
previous
chapters

And if one is familiar with REGEX, an even simpler filter consists in rejecting all tokens that contain a non-word character (which discards the spaces and the punctuation), together with the empty ones.

meaningfull_tokens = Tokens([tok for tok in tokens 
                             if not re.search(r'\W',str(tok)) and bool(tok)])
meaningfull_tokens
Tokens(16 Token) : 
------------------
A
really
simple
string
for
illustration
With
a
few
more
words
than
in
the
previous
chapters

Now we just have to extract the tokens we have constructed and use them in our next language-processing step.

Here we see the main philosophy behind the Token and Tokens classes. There is no ready-made tokenizer; rather, one has many tools to design one's own tokenizer, adapted to the needs of a given task, as sketched just below. We have not yet explored the possibility of attaching personalized attributes to the tokens; that will be the subject of the next chapter.
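As a small illustration of this philosophy, the whole pipeline of this chapter can be wrapped into a reusable helper. The sketch below only reuses the calls already seen above; the function name lucene_like_tokenize is ours, not part of the package.

def lucene_like_tokenize(text, pattern=r'(?u)\b\w+\b'):
    """Sketch of a home-made tokenizer: cut `text` at the regex matches
    and keep only the non-empty tokens made of word characters."""
    token = Token(string=text)
    cuts = [range(m.start(), m.end()) for m in re.finditer(pattern, token.string)]
    tokens = token.split(cuts)
    return Tokens([tok for tok in tokens
                   if not re.search(r'\W', str(tok)) and bool(tok)])

lucene_like_tokenize(text)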

Nevertheless, the following sections already show some interesting features of the Token and Tokens classes.

4.3. Multi-range Token

Let us note that all the Token objects in meaningfull_tokens still keep a reference to the parent string in their Token.string attribute. In addition, all these strings are in fact just references to the same, original one.

ids = [id(tok.string) for tok in meaningfull_tokens]
bools = [id(text)==i for i in ids]
print(all(bools))
True

So every Token still keeps its ranges, which are defined relative to its string attribute. One can thus construct more elaborate Token objects using some more advanced methods of the Token and Tokens classes.
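For instance, one can peek at the positions stored in the first meaningful token. This is only a hedged illustration: the exact printed form of the ranges attribute may differ from the (start, stop) pairs shown in the Token representations below.

first = meaningfull_tokens[0]   # the Token 'A'
print(first.ranges)             # its position(s) in the parent string, expected around (0, 1)
print(text[0:1])                # the corresponding piece of the original text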

For instance, one can easily add bi-grams to meaningfull_tokens using the Tokens.slice method.

bigrams_tokens = meaningfull_tokens.slice(size=2)
list(bigrams_tokens)
/home/gozat/.local/lib/python3.8/site-packages/tokenspan/tools.py:28: BoundaryWarning: At least one of the boundaries has been modified
  warnings.warn(mess, category=BoundaryWarning)
[Token('A really', [(0,1),(2,8)]),
 Token('really simple', [(2,8),(9,15)]),
 Token('simple string', [(9,15),(16,22)]),
 Token('string for', [(16,22),(23,26)]),
 Token('for illustration', [(23,26),(27,39)]),
 Token('illustration With', [(27,39),(41,45)]),
 Token('With a', [(41,45),(46,47)]),
 Token('a few', [(46,47),(48,51)]),
 Token('few more', [(48,51),(52,56)]),
 Token('more words', [(52,56),(57,62)]),
 Token('words than', [(57,62),(63,67)]),
 Token('than in', [(63,67),(68,70)]),
 Token('in the', [(68,70),(71,74)]),
 Token('the previous', [(71,74),(75,83)]),
 Token('previous chapters', [(75,83),(84,92)])]

One can then construct a full set of Token objects by concatenating the two Tokens instances.

all_tokens = bigrams_tokens + meaningfull_tokens
all_tokens
Tokens(31 Token) : 
------------------
A really
really simple
simple string
string for
for illustration
illustration With
With a
a few
few more
more words
words than
than in
in the
the previous
previous chapters
A
really
simple
string
for
illustration
With
a
few
more
words
than
in
the
previous
chapters

Perhaps more interestingly, one can select only the n-grams one wants to construct. For instance, suppose that, for one reason or another, one thinks that only 'really simple' and 'a few more' deserve to become n-grams. Constructing them is then quite simple; see the example below.

Note that only Tokens can be added to Tokens (the operation Token + Tokens would result in a ValueError, as checked right after the example), so one has to be careful to pass meaningfull_tokens[0:1] (a Tokens slice) and not meaningfull_tokens[0] (a single Token) when adding the first Token to the set.

really_simple_token = meaningfull_tokens.join(1,3)
print(really_simple_token)
a_few_more_token = meaningfull_tokens.join(7,10)
print(a_few_more_token)

# reconstruct the final Tokens object

all_tokens = meaningfull_tokens[0:1] + meaningfull_tokens[3:7] + meaningfull_tokens[10:]
all_tokens += Tokens([really_simple_token,a_few_more_token])
all_tokens
really simple
a few more
Tokens(13 Token) : 
------------------
A
string
for
illustration
With
words
than
in
the
previous
chapters
really simple
a few more

Here it is, quite simple isn't it?
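As a quick check of the remark made before the example, adding a bare Token to a Tokens object is expected to fail. The sketch below catches the ValueError announced above (a TypeError is caught too, just in case the implementation differs).

try:
    meaningfull_tokens[0] + meaningfull_tokens[3:7]   # Token + Tokens: the wrong way
except (ValueError, TypeError) as err:
    print(type(err).__name__, err)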

4.4. Complete code

To conclude, we construct the above Tokens set once again, this time changing the subtoksep to an underscore '_'. We then have a complete working example in a single block, for later reuse if you want, and the number of sub-ranges in each Token can be read off quite easily.

# construct all the Token
token = Token(string=text,subtoksep='_')
regex_gen = re.finditer(r'(?u)\b\w+\b',token.string)
cuts = [range(r.start(),r.end()) for r in regex_gen]
tokens = token.split(cuts)

# filter the Token
meaningfull_tokens = Tokens([tok for tok in tokens 
                             if not re.search(r'\W',str(tok)) and bool(tok)])

# manipulate some of the Token
really_simple_token = meaningfull_tokens.join(1,3)
a_few_more_token = meaningfull_tokens.join(7,10)

# construct the final set of Token
all_tokens = meaningfull_tokens[0:1]+meaningfull_tokens[3:7]+meaningfull_tokens[10:]
all_tokens += Tokens([really_simple_token,a_few_more_token])
all_tokens
Tokens(13 Token) : 
------------------
A
string
for
illustration
With
words
than
in
the
previous
chapters
really_simple
a_few_more

One sees that a few lines of code suffice to construct a quite interesting tokenizer. Of course, the ultimate design of the tokenizer of your dreams is up to you; the Token and Tokens classes are just designed to make your life easier. We believe the algorithmic approach underneath the Token and Tokens construction can help computers manipulate strings more cleverly, since there is a clear algebra at their disposal. After all, one simply adds Token objects to each other to construct multi-range Token objects, and one simply adds Tokens objects to each other to construct more elaborate sets of tokens. Then there are only Token.split and Token.slice to pass from Token to Tokens, and Tokens.join to convert back to Token instances. Tokens.slice is just a convenient rewording of the Tokens addition and join processes; see the design below (note it is a bit simplified from the exact method in Tokens).

def slice(self, start=0, stop=None, size=1, step=1):
    """Glue the different `Token` objects present in the `Tokens.tokens`
    list and return a list of `Token` objects with overlapping strings
    among the different `Token` objects, all together grouped in a
    `Tokens` instance."""
    stop = len(self.tokens) if stop is None else stop
    return Tokens([self.join(i, i + size)
                   for i in range(start, stop - size + 1, step)])
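To make this algebra concrete, here is a minimal sketch; it assumes that adding two Token objects of the same parent string merges their ranges, as described above, and it reuses the meaningfull_tokens built earlier.

# Token + Token: glue two tokens of the same parent string into a multi-range Token
really_simple = meaningfull_tokens[1] + meaningfull_tokens[2]
print(really_simple)            # should match meaningfull_tokens.join(1,3) seen above

# Tokens + Tokens: concatenate two sets of tokens into a larger set
print(meaningfull_tokens[:3] + meaningfull_tokens[3:])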

One can wonder about the reverse process: what would be the inverse of adding Token or Tokens objects? This, we believe, can be handled using some tree structure. This is the reason for the introduction of the Token.parent attribute, which we will discuss in a later chapter.

4.5. Concluding remark on overlapping Token

As we have seen in the previous chapter, there is no overlapping possibility at the Token level. Such overlapping possibilities must instead be constructed at the Tokens level. To illustrate this, let us add the token 'really' to all_tokens.

overlap_tokens = all_tokens + meaningfull_tokens[1:2]
overlap_tokens
Tokens(14 Token) : 
------------------
A
string
for
illustration
With
words
than
in
the
previous
chapters
really_simple
a_few_more
really

… and then join them all together to recover a single Token from this entire set.

overlap_tokens.join()
Token('A_really_simple_string_for_illustration_With_a_few_more_words_than_in_the_previous_chapters', [(0,1),(2,8),(9,15),(16,22),(23,26),(27,39),(41,45),(46,47),(48,51),(52,56),(57,62),(63,67),(68,70),(71,74),(75,83),(84,92)])

This Token is now constituted of many ranges; more importantly, the overlap of the string 'really' with itself has been properly handled at the instantiation of this new Token object.

To insist even more, let us recall that, despite the richness of the Token object, a tokenization process ends up at the Tokens level, since only the latter represents the entire document that one started with.

from datetime import datetime
print("Last modification {}".format(datetime.now().strftime("%c")))
Last modification Wed Jan 19 19:56:12 2022