3. Token and Tokens classes - Chapter 2: ranges and subtoksep attributes, and the Span representation
In the previous chapter, we introduced the Token and Tokens objects as simple representations of a Python string. We showed the basic methods for splitting and re-gluing the different components of a string in terms of its Token components, and how to collect them in a Tokens instance.
We will show here more about the passage from string to Token, and how to put several separated strings into the same Token, which is sometimes called a Span in other libraries. This process is handled directly by the Token object in the present construction. At some point it might be unclear, given the span extension of the Token class, why the Tokens class has been implemented at all. This will ultimately become clear in the next chapters: it is because one wants to add attributes to the Token, and not to the Tokens!
There are several remarks in this notebook about the construction and limits of the design. They can be skipped at first reading. In fact, most of the ranges and subtoksep properties are of no interest for basic usage of the Token and Tokens classes, so feel free to pass over most of the material covered in the present notebook and go directly to the next chapter, where one implements a simple tokenizer in detail.
3.1. Summary of the ranges and subtoksep attributes
Every Token object has the following attributes:

- string: the associated complete string
- ranges: the intervals into the string which define the string representation of the Token, namely str(Token). This attribute is a list of basic Python range objects.
- subtoksep: when there are several range in the ranges attribute, the string representation str(Token) glues the different sub-tokens with subtoksep as the separator. To avoid missing some useful behaviors, one advises using a one-character string as subtoksep, the default being the space symbol chr(32) in Python terminology.
There are also carry_attributes and parent attributes of the Token class, but let us not think about them for now.
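To make the interplay of these three attributes concrete, here is a minimal pure-Python sketch of how str(Token) is built from them. This is not the package's implementation; the helper name render is ours.

```python
def render(string, ranges, subtoksep=chr(32)):
    """Glue the sub-strings selected by `ranges`, separated by `subtoksep`.
    A sketch of the str(Token) behavior, not the package's actual code."""
    return subtoksep.join(string[r.start:r.stop] for r in ranges)

text = "A really simple string for illustration."
# one range: a plain sub-string
print(render(text, [range(0, 8)]))  # A really
# several ranges: the sub-tokens are glued with subtoksep
print(render(text, [range(0, 1), range(9, 22), range(27, 39)], '_'))
# A_simple string_illustration
```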
We start by instantiating a simple string, which will serve as support for later illustrations of the ranges attribute. Recall that one may wrap the import in a try / except ModuleNotFoundError block to handle the case where the package is not installed.
from tokenspan import Token, Tokens
text = "A really simple string for illustration."
3.2. Basic usage of the ranges attribute
Basic usage of the Token class is as a container for a string. It is constructed from a string, passed as the string argument at instantiation. More precisely, it behaves as a sub-string of the complete Token.string string. The way one passes from the complete string to its sub-string representation is through the ranges attribute. This attribute can be designed by hand, as we will do in the following for illustration.
Recall that without a ranges parameter at instantiation, the Token.ranges attribute is designed to describe the entire Token.string string.

Recall also that the ranges attribute must be a list of range, whatever the length of this list. In particular, for a single range, Token.ranges=[range(start,stop)] is the standard convention. Also, for all subsequent functionalities, the step option of every range must be 1.
token = Token(string=text)
print(token.ranges)
print(str(token))
print(token.string)
print("#"*len(token.string))
token.ranges = [range(8)]
print(token.ranges)
print(str(token))
print(token.string)
print("#"*len(token.string))
token.ranges = [range(9,22)]
print(token.ranges)
print(str(token))
print(token.string)
print("#"*len(token.string))
[range(0, 40)]
A really simple string for illustration.
A really simple string for illustration.
########################################
[range(0, 8)]
A really
A really simple string for illustration.
########################################
[range(9, 22)]
simple string
A really simple string for illustration.
########################################
Now, let us construct a few Token from the same string.
tok1 = Token(string=text,ranges=[range(8)])
print(str(tok1))
tok2 = Token(string=text,ranges=[range(9,22)])
print(str(tok2))
tok3 = Token(string=text,ranges=[range(27,39)])
print(str(tok3))
tok4 = Token(string=text,ranges=[range(0,1),range(9,22),range(27,39)])
print(str(tok4))
A really
simple string
illustration
A simple string illustration
One sees that there is not much difference between the string representations of a Token object having one or several range. This is in fact where Token.subtoksep appears. Let us change this parameter for the above examples; by default it is the space symbol.
tok1 = Token(string=text,
             ranges=[range(8)],
             subtoksep='_')
print(str(tok1))
tok2 = Token(string=text,
             ranges=[range(9,22)],
             subtoksep='_')
print(str(tok2))
tok3 = Token(string=text,
             ranges=[range(27,39)],
             subtoksep='_')
print(str(tok3))
tok4 = Token(string=text,
             ranges=[range(0,1),range(9,22),range(27,39)],
             subtoksep='_')
print(str(tok4))
A really
simple string
illustration
A_simple string_illustration
If there is no change when there is a single range, one sees that calling str(Token) when there are several range will automatically glue the different sub-tokens using the subtoksep. But the usual spaces (as any other characters, in fact) inside a given range are not affected by the subtoksep; see 'simple string' in tok4.
3.3. Slicing procedure in a Token
How is the slicing process handled in a Token? Well, the subtoksep counts as any other character in the string representation. It also counts in len, in fact. Let us illustrate this.
print(tok1[:5])
print(tok4[:5])
print(tok4[5:10])
print(tok4[10:20])
A rea
A_sim
ple s
tring_illu
print(len(tok4))
print(len(tok4[:5]))
28
5
And if we change the subtoksep, its length is automatically taken into account in the calculation of the length of the complete Token and in the slicing process.
tok5 = Token(string=text,
             ranges=[range(0,1),range(9,22),range(27,39)],
             subtoksep='_&_')
print(str(tok5))
print(len(tok5))
print(tok5[:5])
A_&_simple string_&_illustration
32
A_&_s
So the Token class handles all the machinery of a quite normal string, thanks to the ranges and subtoksep attributes.
3.4. Absolute and relative coordinates
It seems everything is handled underneath such that a Token object is just a sub-string of its Token.string string. But there are sometimes subtleties to be understood. One of them is absolute versus relative coordinates, which may give headaches to users. Fortunately, things are quite workable if one trusts the machinery behind the Token class.

The absolute coordinate system gives the position of the text inside the parent string given at instantiation of the Token object. Most of the time, one should not worry about it, except perhaps in the Token.append and Token.remove methods that will be discussed later in this chapter.

The relative coordinate system gives the position inside the Token.ranges object. This is the natural position if one sees the Token as its string representation, namely str(Token).
So in short:

- absolute position refers to the position in the string Token.string
- relative position refers to the position in the string str(Token)
As an example, let us construct a simple string of digits, where the digit 0 appears at position 0, the digit 1 at position 10, the digit 2 at position 20, and so on. Between decades, there are the natural digits from 1 to 9. On top of this string, we construct the Token which represents a string of size 40, made of one decade over two, up to position 80 of the digits string.
root = '123456789'
digits = ''.join([str(i)+root for i in range(10)])
tok_digits = Token(string=digits,
                   ranges=[range(10,20),range(30,40),
                           range(50,60),range(70,80)],
                   subtoksep='#')
str(tok_digits)
'1123456789#3123456789#5123456789#7123456789'
Then the relative coordinates range from 0 to 39+3*len(subtoksep), since there are 4 ranges and thus 3 subtoksep separators inserted, whereas the absolute ones range from 0 to 99 without interruption by a subtoksep.
print("Relative coordinates from 0 to 20")
print(str(tok_digits)[:20])
print("Relative length: {}".format(len(tok_digits)))
print("\n")
print("Absolute coordinates from 0 to 20")
print(tok_digits.string[:20])
print("Absolute length: {}".format(len(tok_digits.string)))
Relative coordinates from 0 to 20
1123456789#312345678
Relative length: 43
Absolute coordinates from 0 to 20
01234567891123456789
Absolute length: 100
One more time, this is somewhat ridiculous complexity in the presentation for almost nothing, since most usages will never produce unnatural outcomes using the basic tools of the Token and Tokens classes, as long as subtoksep is of length 1.
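The mapping between the two coordinate systems can be sketched in plain Python. This is a sketch under our own naming (rel_to_abs is not part of the package), using the digits example above with a one-character subtoksep.

```python
def rel_to_abs(p, ranges, seplen=1):
    """Map a relative position p (an index into str(Token)) to an absolute
    position (an index into Token.string). Positions falling on a subtoksep
    separator have no absolute counterpart, so None is returned for them."""
    for i, r in enumerate(ranges):
        size = r.stop - r.start
        if p < size:                 # p falls inside this sub-token
            return r.start + p
        p -= size
        if i < len(ranges) - 1:      # account for one separator between sub-tokens
            if p < seplen:
                return None          # p falls on the separator itself
            p -= seplen
    raise IndexError("relative position out of range")

rs = [range(10, 20), range(30, 40), range(50, 60), range(70, 80)]
print(rel_to_abs(0, rs))    # 10: str(tok_digits)[0] is digits[10]
print(rel_to_abs(10, rs))   # None: relative position 10 is the '#' separator
print(rel_to_abs(11, rs))   # 30: first character after the separator
```

The last relative position, 42, maps to absolute position 79, consistent with the relative length of 43 computed above.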
3.5. Enlarge the Token, and combine overlapping ranges
Suppose one wants to collapse, for some reason, tok1 and tok2 into a new Token called tok12. Then the related string representation will be given by the concatenation of the two previous strings, and the resulting ranges attribute illustrates the combination, as does the subtoksep that is now present.
tok12 = tok1 + tok2
print(tok12) # equivalent to print(str(tok12))
print(tok12.ranges)
A really_simple string
[range(0, 8), range(9, 22)]
Note that this Token does not differ much from the initial string from position 0 to 22, except for the subtoksep, which is different from a normal space symbol in our illustration.

Note in passing that there is a blocking process that prevents adding two Token if they do not have the same subtoksep…
tok1 + tok5
Token('A really_simple string_illustration', [(0,8),(9,22),(27,39)])
…and one can compare the resulting Token; in this case it is tok1. In that case the addition is not commutative: tok1 + tok5 != tok5 + tok1! (Comparing Token will be detailed in a later chapter.)
print(tok1 + tok5 == tok1)
print(tok5 + tok1 == tok5)
print(tok1 == tok5)
False
False
False
In practice, the second Token is simply rejected from the concatenation construction.
Let us come back to the concatenation procedure. What would happen if one tried to concatenate tok3 and tok4, since they have a part of the initial string in common? In fact there is special handling underneath, which recalculates all ranges such that the overlapping disappears.
tok34 = tok3 + tok4
print(tok34)
print(tok34.ranges)
print(tok34 == tok4)
A_simple string_illustration
[range(0, 1), range(9, 22), range(27, 39)]
True
Note that the two different processes are not the same at all! tok34 == tok4 because the string representation of tok3 is already present in tok4, whereas tok1 and tok5 were simply incompatible for addition! This can be seen from the fact that the addition of tok3 and tok4 is commutative, as illustrated by the last line below.
tok43 = tok4 + tok3
print(tok43)
print(tok43.ranges)
print(tok43 == tok4)
print(tok34 == tok43)
A_simple string_illustration
[range(0, 1), range(9, 22), range(27, 39)]
True
True
To illustrate further the overlap detection, let us try to create a new Token with some overlapping range objects, and realise that the construction in fact destroys the independent range and fuses them into the non-overlapping ranges of the Token instance.
tok6 = Token(string=text,
             ranges=[range(0,9),range(9,22),range(27,39),range(30,39),range(10,15)],
             subtoksep='_')
print(tok6)
print(tok6.ranges)
A really simple string_illustration
[range(0, 22), range(27, 39)]
Only two range survive the construction process, since range(10,15) is entirely contained in range(9,22), and the same is true for range(30,39), which gives no more information than range(27,39) relative to the initial string. In addition, the two ranges range(0,9) and range(9,22) are naturally transformed into range(0,22), since taken together they in fact represent this entire range.
3.5.1. The combining function (for developers)
For those interested in the construction of the Token class, we reproduce the function _combineRanges(ranges), which avoids the proliferation of overlapping range in the Token. This function is available in the tokenizer/tokentokens.py module.
def _combineRanges(ranges):
    """
    Take a list of range objects, and transform it such that overlapping
    ranges and consecutive ranges are combined.
    `ranges` is a list of `range` objects, all with `range.step==1`
    (not verified by this function, but required for
    the algorithm to work properly).
    """
    if len(ranges) < 2:
        return ranges
    r_ = sorted([(r.start, r.stop) for r in ranges])
    temp = [list(r_[0]),]
    for start, stop in r_[1:]:
        if temp[-1][1] >= start:
            temp[-1][1] = max(temp[-1][1], stop)
        else:
            temp.append([start, stop])
    r = [range(*t) for t in temp]
    return r
3.6. Token.append and Token.remove
There are two facilities to design the Token.ranges attribute: a method that adds a range to Token.ranges, called Token.append(range) or Token.append(list_of_range), and a method that removes a range, called Token.remove(range) or Token.remove(list_of_range).
Note that appending a range using Token.append also checks for overlapping intervals, and will not duplicate the range in Token.ranges. Since one wants to append a new range, this range is given in absolute coordinates, that is, in the counting of Token.string.

On the contrary, Token.remove withdraws the passed range from the existing ranges. That is, if the removed range does not overlap with some Token.ranges, nothing is removed. Note that Token.remove also uses the absolute coordinates.
If you are not at ease with the names append and remove, because they are too close to the Python list methods, you can use append_range and remove_range, which are aliases of the two previous ones.
Importantly, append, remove and their aliases work in place, i.e. they transform the Token object itself. This is the same behavior as the corresponding methods of a Python list.
To illustrate this, we come back to our digits string introduced in the discussion about absolute and relative coordinates above. The tok_digits instance has a ranges attribute of the form [(10,20),(30,40),(50,60),(70,80)], so appending the range (20,30) should fuse its first two sub-ranges, because overlaps are automatically merged by a Token.append process. From the resulting Token one can remove the same range (20,30) to come back to the initial object. If one removes the range (15,35), then one ends up with a Token.ranges of the form [(10,15),(35,40),(50,60),(70,80)]. Let us see all of this (plus a few more operations) in the examples below.
print(tok_digits.ranges)
print("append (20,30)")
tok_digits.append_range(range(20,30))
print(tok_digits.ranges)
print("remove (20,30)")
tok_digits.remove_range(range(20,30))
print(tok_digits.ranges)
print("remove (15,35)")
tok_digits.remove_range(range(15,35))
print(tok_digits.ranges)
print("remove (15,35) and (75,125), far too long for the Token.string")
tok_digits.remove_ranges([range(15,35),range(75,125)])
print(tok_digits.ranges)
[range(10, 20), range(30, 40), range(50, 60), range(70, 80)]
append (20,30)
[range(10, 40), range(50, 60), range(70, 80)]
remove (20,30)
[range(10, 20), range(30, 40), range(50, 60), range(70, 80)]
remove (15,35)
[range(10, 15), range(35, 40), range(50, 60), range(70, 80)]
remove (15,35) and (75,125), far too long for the Token.string
[range(10, 15), range(35, 40), range(50, 60), range(70, 75)]
One sees that removing a range that does not exist in the absolute coordinates produces nothing (that's the example of removing the range (75,125) at the last step, which in fact removes only the range (75,80), as this is the only available one in the Token object at that step).

In the same way, adding a range from outside the Token.string produces nothing.
print(tok_digits.ranges)
print("append (120,130)")
tok_digits.append_range(range(120,130))
print(tok_digits.ranges)
[range(10, 15), range(35, 40), range(50, 60), range(70, 75)]
append (120,130)
[range(10, 15), range(35, 40), range(50, 60), range(70, 75)]
/home/gozat/.local/lib/python3.8/site-packages/tokenspan/tools.py:28: BoundaryWarning: At least one of the boundaries has been modified
warnings.warn(mess, category=BoundaryWarning)
Now we pass to the basic explanation of how the Token.ranges transform toward the Tokens objects.
3.7. From Token
to Tokens
classes¶
The three mechanisms to pass from Token to Tokens are the partition, split and slice methods. We review their mechanisms once a multi-ranged Token is involved in the process.

One more time, we prefer the debugging representation of the Tokens class for illustration, or its list representation, rather than the str(Tokens) representation, which is considered messier.
We first see that everything is done not to bother the user with the position arguments: the start and stop parameters of Token.partition are calculated from the string representation str(Token). In addition, the subtoksep are conserved by the splitting processes. So, when one cuts the tok4 string from position start=2 to position stop=8 using the partition method, one really isolates str(tok4)[2:8] in the middle Token of the resulting Tokens.
tokens = tok4.partition(2,8)
print(list(tokens))
print(str(tokens[0])==str(tok4)[:2])
print(str(tokens[1])==str(tok4)[2:8])
print(str(tokens[2])==str(tok4)[8:])
[Token('A_', [(0,1),(9,9)]), Token('simple', [(9,15)]), Token(' string_illustration', [(15,22),(27,39)])]
True
True
True
The way the subtoksep are conserved is due to the insertion of empty range in the Token combined in the Tokens object: see the first line below.
for tok in tokens:
    print(tok.ranges)
[range(0, 1), range(9, 9)]
[range(9, 15)]
[range(15, 22), range(27, 39)]
Note finally that the behavior is less clear as soon as one uses a subtoksep with length larger than 1, as illustrated below, where several different solutions exist for the same strings. It is quite clear in Python what to do with a range(start,stop), which always corresponds to a semi-open (mathematical) interval [start,stop[ including the start and excluding the stop; but when either the start or the stop falls on a subtoksep, should it go into the left or the right interval after splitting the string? The answer is below: since the splitting is designed for a one-character subtoksep, the algorithm waits for the subtoksep to be entirely to the left of stop before displaying it in the left range. See the illustration below.

This is the reason why using len(subtoksep)==1 is highly recommended.
tokens = tok5.partition(1,8)
print(list(tokens))
print("total length = {}".format(sum(len(tok) for tok in tokens)))
tokens = tok5.partition(2,8)
print(list(tokens))
print("total length = {}".format(sum(len(tok) for tok in tokens)))
tokens = tok5.partition(3,9)
print(list(tokens))
print("total length = {}".format(sum(len(tok) for tok in tokens)))
tokens = tok5.partition(4,10)
print(list(tokens))
print("total length = {}".format(sum(len(tok) for tok in tokens)))
[Token('A', [(0,1)]), Token('_&_simp', [(1,1),(9,13)]), Token('le string_&_illustration', [(13,22),(27,39)])]
total length = 32
[Token('A', [(0,1)]), Token('simple', [(9,15)]), Token(' string_&_illustration', [(15,22),(27,39)])]
total length = 29
[Token('A', [(0,1)]), Token('simple', [(9,15)]), Token(' string_&_illustration', [(15,22),(27,39)])]
total length = 29
[Token('A_&_', [(0,1),(9,9)]), Token('simple', [(9,15)]), Token(' string_&_illustration', [(15,22),(27,39)])]
total length = 32
On the contrary, using a correct subtoksep of length 1 never destroys the Token string.
tokens = tok4.partition(1,8)
print(list(tokens))
print("total length = {}".format(sum(len(tok) for tok in tokens)))
tokens = tok4.partition(2,8)
print(list(tokens))
print("total length = {}".format(sum(len(tok) for tok in tokens)))
[Token('A', [(0,1)]), Token('_simple', [(1,1),(9,15)]), Token(' string_illustration', [(15,22),(27,39)])]
total length = 28
[Token('A_', [(0,1),(9,9)]), Token('simple', [(9,15)]), Token(' string_illustration', [(15,22),(27,39)])]
total length = 28
Another important special case is when one tries to split the initial Token on either its initial or final character. See the example below.
tokens = tok4.partition(0,1)
print(list(tokens))
tokens = tok4.partition(len(tok4)-12,len(tok4))
print(list(tokens))
[Token('', [(0,0)]), Token('A', [(0,1)]), Token('_simple string_illustration', [(1,1),(9,22),(27,39)])]
[Token('A_simple string_', [(0,1),(9,22),(27,27)]), Token('illustration', [(27,39)]), Token('', [(39,39)])]
In that case, the left-most or right-most Token is empty. Nevertheless, one can remedy that by using the parameter remove_empty=True (default is False) when calling partition. Note that it kills one Token in that case.
tokens = tok4.partition(0,1,remove_empty=True)
print(list(tokens))
tokens = tok4.partition(len(tok4)-12,len(tok4),True)
print(list(tokens))
[Token('A', [(0,1)]), Token('_simple string_illustration', [(1,1),(9,22),(27,39)])]
[Token('A_simple string_', [(0,1),(9,22),(27,27)]), Token('illustration', [(27,39)])]
The behaviors of split and slice are quite similar, so we simply give one example of each.
tokens = tok4.split((range(0,1),range(2,8),
                     range(len(tok4)-12,len(tok4))),remove_empty=False)
print(list(tokens))
tokens = tok4.split((range(0,1),range(2,8),
                     range(len(tok4)-12,len(tok4))),remove_empty=True)
print(list(tokens))
[Token('', [(0,0)]), Token('A', [(0,1)]), Token('_', [(1,1),(9,9)]), Token('simple', [(9,15)]), Token(' string_', [(15,22),(27,27)]), Token('illustration', [(27,39)]), Token('', [(39,39)])]
[Token('A', [(0,1)]), Token('_', [(1,1),(9,9)]), Token('simple', [(9,15)]), Token(' string_', [(15,22),(27,27)]), Token('illustration', [(27,39)])]
Remark how the Token containing only a subtoksep handles its length: by keeping two empty range, which ensures its length is still that of the subtoksep. One more time, this cannot be done using len(subtoksep)>1.
tokens = tok4.split((range(0,1),range(2,8),
                     range(len(tok4)-12,len(tok4))),remove_empty=True)
tokens[1].ranges
[range(1, 1), range(9, 9)]
3.8. From Tokens to Token classes
Let us now study the join procedure. Basically, Tokens.join(start,stop,step) takes all the Token that are in the slice Tokens[start:stop:step] and concatenates them. Recall that missing arguments are calculated to fill the Tokens, but the order must be conserved, or one must pass them explicitly by name when calling the method.
print(list(tokens))
[Token('A', [(0,1)]), Token('_', [(1,1),(9,9)]), Token('simple', [(9,15)]), Token(' string_', [(15,22),(27,27)]), Token('illustration', [(27,39)])]
tok = tokens.join()
print(tok)
print(tok.ranges)
A_simple string_illustration
[range(0, 1), range(9, 22), range(27, 39)]
tok = tokens.join(1,4,1)
print(tok)
print(tok.ranges)
_simple string_
[range(1, 1), range(9, 22), range(27, 27)]
tok = tokens.join(start=2,step=2)
print(tok)
print(tok.ranges)
simple_illustration
[range(9, 15), range(27, 39)]
Underneath, the association uses only the ranges, picking the elements [start:stop:step] from the list below, then applying _combineRanges to the resulting elements, and reconstructing a Token object from those.
print([tok.ranges for tok in tokens])
[[range(0, 1)], [range(1, 1), range(9, 9)], [range(9, 15)], [range(15, 22), range(27, 27)], [range(27, 39)]]
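This selection-then-combination step can be sketched as follows. It is a sketch, not the package's actual implementation; _combineRanges is repeated from section 3.5.1 and the helper name join_ranges is ours.

```python
def _combineRanges(ranges):
    """Merge overlapping and consecutive ranges (all assumed to have step 1)."""
    if len(ranges) < 2:
        return ranges
    r_ = sorted([(r.start, r.stop) for r in ranges])
    temp = [list(r_[0])]
    for start, stop in r_[1:]:
        if temp[-1][1] >= start:
            temp[-1][1] = max(temp[-1][1], stop)
        else:
            temp.append([start, stop])
    return [range(*t) for t in temp]

def join_ranges(list_of_ranges, start=None, stop=None, step=None):
    """Pick the per-Token ranges in the slice, flatten them, and combine."""
    picked = list_of_ranges[start:stop:step]
    return _combineRanges([r for rs in picked for r in rs])

# the per-Token ranges printed above
per_token = [[range(0, 1)], [range(1, 1), range(9, 9)], [range(9, 15)],
             [range(15, 22), range(27, 27)], [range(27, 39)]]
print(join_ranges(per_token))              # [range(0, 1), range(9, 22), range(27, 39)]
print(join_ranges(per_token, 2, None, 2))  # [range(9, 15), range(27, 39)]
```

The two outputs reproduce the ranges obtained above with tokens.join() and tokens.join(start=2,step=2).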
As a final remark, let us realize that all the Token generated in this notebook have the same string, yet this string is not copied by any of them, since objects are passed by reference in Python. So keeping the complete string in a great number of Token objects will not cause any memory-usage trouble in principle.
strings = [tok.string for tok in [tok1,tok2,tok3,tok4,tok5]]
bools = [strings[0]==s for s in strings[1:]]
print(all(bools))
stringsId = [id(tok.string) for tok in [tok1,tok2,tok3,tok4,tok5]]
bools = [stringsId[0]==s for s in stringsId[1:]]
print(all(bools))
True
True
from datetime import datetime
print("Last modification {}".format(datetime.now().strftime("%c")))
Last modification Wed Jan 19 19:56:03 2022