4. Token and Tokens classes - Chapter 3: A basic example
In the previous chapters, we introduced the Token and Tokens objects as simple representations of a Python string. We showed the basic methods for splitting and re-gluing the different components of a string in terms of its Token components, and how to collect them in a Tokens instance.
Here we will show a simple example exploring a few of the possibilities these two classes allow. We limit ourselves to the methods we already discussed in the previous chapters, namely the Token.split and Tokens.join methods.
4.1. Motivation
We would like to implement a simple Lucene-like tokenizer. Lucene is an open-source Apache library providing powerful searching and indexing tools for large corpora of documents. To index correctly, one has to adopt a convenient tokenizer that can be used to parse all the texts of the corpus in a similar fashion and extract the relevant information. Unfortunately, the Lucene library is written in Java. Here, we will implement a Lucene-like tokenizer; this is the one used by the scikit-learn CountVectorizer object. Once we have such a tokenizer, one can generalize it by exploiting a few of the possibilities offered by the versatility of Token and Tokens.
To construct the Lucene-like tokenizer, we need some knowledge of regular expressions (REGEX). They are implemented in Python through the re package, whose documentation also offers a quick introduction to the use of REGEX.
We start by instantiating a simple string, which will serve as support for later illustrations of the ranges attribute.
Recall that, as in the previous chapters, one may wrap the import in a try/except ModuleNotFoundError block to handle the case where the package has not been installed.
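Such a guard may look like the following minimal sketch; the pip package name used in the message is an assumption.
# Sketch of the import guard mentioned above (the PyPI name is assumed here).
try:
    from tokenspan import Token, Tokens
except ModuleNotFoundError:
    print("tokenspan does not seem to be installed; try `pip install tokenspan` first.")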
from tokenspan import Token, Tokens
import re
text = "A really simple string for illustration.\n"
text += "With a few more words than in the previous chapters.\n"
Once the Token object is instantiated with the above text, the tokenization simply corresponds to the use of split. This method takes as parameter a list of (start, stop) couples (tuples or range objects) at which the cutting will take place; note that the pieces lying between and around the cuts (spaces, punctuation, empty boundaries) are kept in the result as well. The basic usage of the REGEX is to obtain these (start, stop) couples in an automatic way.
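To fix ideas, here is a minimal sketch on a hypothetical toy string, with the cut positions written by hand instead of being computed by a REGEX; it assumes, as stated above, that split accepts plain (start, stop) tuples.
# Hypothetical toy example: cut positions written by hand, no REGEX involved.
toy = Token(string="hello world")
toy_tokens = toy.split([(0, 5), (6, 11)])  # keep 'hello' and 'world'
# The in-between pieces (here the space and the empty boundaries) are kept too.
print([str(t) for t in toy_tokens])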
4.2. Lucene-like tokenization
The Lucene-like tokenization consists in cutting the text at token boundaries. The regular expression underneath is simply '(?u)\b\w\w+\b', where \b stands for a boundary (a space or a punctuation mark for instance) and \w for a word character (a letter, a digit or an underscore). In the code below we use the slightly relaxed pattern '(?u)\b\w+\b', so that single-letter words are kept as well. One finds all occurrences of such a REGEX using the re.finditer method. finditer returns an iterator, so one can consume it only once; iterating over it again would return empty results. Then one can extract the positions from the start() and end() methods of each match, or from the alternative span() method. See the re package documentation for more details.
Once the cuts are generated, one simply feeds Token.split with them and ends up with a Tokens object containing all the tokens.
token = Token(string=text)
regex_gen = re.finditer(r'(?u)\b\w+\b',token.string)
cuts = [range(r.start(),r.end()) for r in regex_gen]
tokens = token.split(cuts)
tokens
Tokens(33 Token) :
------------------
A
really
simple
string
for
illustration
.
With
a
few
more
words
than
in
the
previous
chapters
.
Now we want to extract all the meaningful strings from the Tokens object. We can use the slicing process, for instance.
meaningfull_tokens = tokens[1:12:2]+tokens[13::2]
meaningfull_tokens
Tokens(16 Token) :
------------------
A
really
simple
string
for
illustration
With
a
few
more
words
than
in
the
previous
chapters
One may produce the same result in a more automatic way by filtering the Token objects, thanks to their string representation for instance.
meaningfull_tokens = Tokens([tok for tok in tokens
if str(tok) not in [' ','.\n','']])
meaningfull_tokens
Tokens(16 Token) :
------------------
A
really
simple
string
for
illustration
With
a
few
more
words
than
in
the
previous
chapters
And if one is familiar with REGEX, one can write an even simpler filter by rejecting all tokens that contain a non-word character.
meaningfull_tokens = Tokens([tok for tok in tokens
if not re.search(r'\W',str(tok)) and bool(tok)])
meaningfull_tokens
Tokens(16 Token) :
------------------
A
really
simple
string
for
illustration
With
a
few
more
words
than
in
the
previous
chapters
Now we just have to extract the tokens we have constructed and use them in our next language-processing step.
Here we see the main philosophy behind the Token and Tokens classes. There is no direct implementation of a tokenizer. Rather, one has many tools to design one's own tokenizer, adapted to one's needs for a given task. We did not yet explore the possibility of attaching custom attributes to the tokens; that will be the subject of the next chapter.
Nevertheless, one can already see some interesting features of the Token and Tokens classes in what follows.
4.3. Multi-range Token
Let us note that all the Token objects in meaningfull_tokens still keep a link to the parent string through their Token.string attribute. In addition, all these strings are in fact just references to the same, original one.
ids = [id(tok.string) for tok in meaningfull_tokens]
bools = [id(text)==i for i in ids]
print(all(bools))
True
So all the Token objects still carry the ranges that locate them inside their string attribute. One can thus construct more elaborate Token objects by using some more advanced methods of the Token and Tokens classes.
For instance, one can easily add bi-grams to the meaningfull_tokens using the Tokens.slice method.
bigrams_tokens = meaningfull_tokens.slice(size=2)
list(bigrams_tokens)
/home/gozat/.local/lib/python3.8/site-packages/tokenspan/tools.py:28: BoundaryWarning: At least one of the boundaries has been modified
warnings.warn(mess, category=BoundaryWarning)
[Token('A really', [(0,1),(2,8)]),
Token('really simple', [(2,8),(9,15)]),
Token('simple string', [(9,15),(16,22)]),
Token('string for', [(16,22),(23,26)]),
Token('for illustration', [(23,26),(27,39)]),
Token('illustration With', [(27,39),(41,45)]),
Token('With a', [(41,45),(46,47)]),
Token('a few', [(46,47),(48,51)]),
Token('few more', [(48,51),(52,56)]),
Token('more words', [(52,56),(57,62)]),
Token('words than', [(57,62),(63,67)]),
Token('than in', [(63,67),(68,70)]),
Token('in the', [(68,70),(71,74)]),
Token('the previous', [(71,74),(75,83)]),
Token('previous chapters', [(75,83),(84,92)])]
One can then construct a full set of Token objects by concatenating the two Tokens instances.
all_tokens = bigrams_tokens + meaningfull_tokens
all_tokens
Tokens(31 Token) :
------------------
A really
really simple
simple string
string for
for illustration
illustration With
With a
a few
few more
more words
words than
than in
in the
the previous
previous chapters
A
really
simple
string
for
illustration
With
a
few
more
words
than
in
the
previous
chapters
Perhaps more interestingly, one can select the bi-grams one wants to construct. For instance, suppose that, for one reason or another, one thinks that only 'really simple' and 'a few more' deserve to become n-grams. Then constructing them is quite simple. See the example below.
Note that only Tokens can be added to Tokens (the operation Token + Tokens would result in a ValueError), so one has to be careful to pass meaningfull_tokens[0:1], and not meaningfull_tokens[0], in order to add the first Token to the set.
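As a quick illustration of this point, the following hedged sketch simply triggers the error and catches whatever exception the library raises.
# Adding a bare Token to a Tokens is not allowed; catch and display the raised exception.
try:
    meaningfull_tokens[0] + meaningfull_tokens
except Exception as err:
    print(type(err).__name__, ':', err)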
really_simple_token = meaningfull_tokens.join(1,3)
print(really_simple_token)
a_few_more_token = meaningfull_tokens.join(7,10)
print(a_few_more_token)
# reconstruct the final Tokens object
all_tokens = meaningfull_tokens[0:1] + meaningfull_tokens[3:7] + meaningfull_tokens[10:]
all_tokens += Tokens([really_simple_token,a_few_more_token])
all_tokens
really simple
a few more
Tokens(13 Token) :
------------------
A
string
for
illustration
With
words
than
in
the
previous
chapters
really simple
a few more
Here it is, quite simple, isn't it?
4.4. Complete code
To conclude, we will simply construct the above Tokens set again, this time changing the subtoksep to an underscore '_'. Then we will have a complete working example in a single block, for later reuse if you want, and we will be able to see the number of sub-ranges in each Token quite easily.
# construct all the Token
token = Token(string=text,subtoksep='_')
regex_gen = re.finditer(r'(?u)\b\w+\b',token.string)
cuts = [range(r.start(),r.end()) for r in regex_gen]
tokens = token.split(cuts)
# filter the Token
meaningfull_tokens = Tokens([tok for tok in tokens
if not re.search(r'\W',str(tok)) and bool(tok)])
# manipulate some of the Token
really_simple_token = meaningfull_tokens.join(1,3)
a_few_more_token = meaningfull_tokens.join(7,10)
# construct the final set of Token
all_tokens = meaningfull_tokens[0:1]+meaningfull_tokens[3:7]+meaningfull_tokens[10:]
all_tokens += Tokens([really_simple_token,a_few_more_token])
all_tokens
Tokens(13 Token) :
------------------
A
string
for
illustration
With
words
than
in
the
previous
chapters
really_simple
a_few_more
One sees that a few lines of code suffice to construct a quite interesting tokenizer. Of course, the ultimate design of the tokenizer of your dreams is up to you; the Token and Tokens classes are just designed to make your life easier. We believe the algorithmic approach behind the Token and Tokens construction can help computers manipulate strings more cleverly, since there is a clear algebra at their disposal. After all, one simply adds Token objects to each other to construct multi-range Token objects, and one simply adds Tokens objects to each other to construct more elaborate sets of tokens. Then there are only the Token.split and Token.slice methods to pass from Token to Tokens, and Tokens.join to convert back to Token instances. Tokens.slice is just a convenient rewording of the Tokens addition and join processes; see the design below (note it is a bit simplified from the exact method in the Tokens class).
def slice(self, start=0, stop=None, size=1, step=1):
    """Glue the different `Token` objects present in the `Tokens.tokens`
    list and return a list of `Token` objects with overlapping strings
    among the different `Token` objects, all together grouped in a
    `Tokens` instance."""
    if stop is None:
        # by default, slice up to the last Token of the set
        stop = len(self.tokens)
    return Tokens([self.join(i, i + size)
                   for i in range(start, stop - size + 1, step)])
One can worry about the reverse process: what would be the inverse of adding Token or Tokens objects? This, we believe, can be handled using some tree structure. This is the reason for the introduction of the Token.parent attribute, which we will discuss in a later chapter.
4.5. Concluding remark on overlapping Token
As we have seen in the previous chapter, there are no overlapping possibilities at the Token level. In fact, such overlapping possibilities must be constructed at the Tokens level. To illustrate this, let us add the token 'really' to all_tokens…
overlap_tokens = all_tokens + meaningfull_tokens[1:2]
overlap_tokens
Tokens(14 Token) :
------------------
A
string
for
illustration
With
words
than
in
the
previous
chapters
really_simple
a_few_more
really
… and then join them all together to recover a single Token from this entire set.
overlap_tokens.join()
Token('A_really_simple_string_for_illustration_With_a_few_more_words_than_in_the_previous_chapters', [(0,1),(2,8),(9,15),(16,22),(23,26),(27,39),(41,45),(46,47),(48,51),(52,56),(57,62),(63,67),(68,70),(71,74),(75,83),(84,92)])
This Token is now constituted of many ranges, but more importantly the overlap of the string 'really' with itself has been properly handled by the instantiation of this new Token object.
To insist even more, let us recall that despite the richness of the Token object, a tokenization process ends up at the Tokens level, since only the latter represents the entire document one started with.
from datetime import datetime
print("Last modification {}".format(datetime.now().strftime("%c")))
Last modification Wed Jan 19 19:56:12 2022