3. Token and Tokens classes - Chapter 2: ranges and subtoksep attributes, and the Span representation

In the previous chapter, we introduced the Token and Tokens objects as simple representations of a Python string. We showed the basic methods for splitting and re-gluing the different components of a string in terms of its Token components, and how to collect them in a Tokens instance.

Here we will show more about the passage from string to Token, and how to put several separated strings into the same Token, which is sometimes called a Span in other libraries. This process is handled directly by the Token object in the present construction. At some point it might be unclear, given the span extension of the Token class, why the Tokens class has been implemented. This will ultimately become clear in the next chapters: it is because one wants to add attributes to the Token, and not to the Tokens!

There are several remarks in this NoteBook about the construction and limits of the design. They can be dropped at first reading. In fact, most of the ranges and subtoksep properties are of no interest for basic usage of the Token and Tokens classes, so feel free to skip most of the material covered in the present NoteBook and pass directly to the next chapter, where one implements a simple tokenizer in detail.

3.1. Summary of the ranges and subtoksep attributes

Every Token object has the following attributes:

  • string : the associated complete string

  • ranges : the intervals in the string which define the string representation of the Token, namely str(Token). This attribute is a list of basic Python range objects.

  • subtoksep : when there are several range objects in the ranges attribute, the string representation str(Token) glues the different sub-tokens with subtoksep as the separator. To avoid missing some useful behaviors, it is advised to use a length-one string as subtoksep, the default being the space symbol chr(32) in Python terminology.

The Token class also has carry_attributes and parent attributes, but let us not worry about them for now.
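To make these three attributes concrete, here is a minimal pure-Python sketch (not using the package itself) of how the string representation is assembled from them:

```python
# Minimal sketch of how str(Token) is built from the three attributes
# described above: string, ranges, and subtoksep.
string = "A really simple string for illustration."
ranges = [range(0, 1), range(9, 22)]   # two sub-token intervals
subtoksep = chr(32)                    # default separator: the space symbol

# extract the sub-string of each range and glue them with subtoksep
rendered = subtoksep.join(string[r.start:r.stop] for r in ranges)
print(rendered)  # 'A simple string'
```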

We start by instantiating a simple string, which will serve as support for later illustrations of the ranges attribute.

Recall that the except ModuleNotFoundError clause is there to handle the case where one has not installed the package.

try:
    from tokenspan import Token, Tokens
except ModuleNotFoundError:
    !pip install tokenspan
    from tokenspan import Token, Tokens
text = "A really simple string for illustration."

3.2. Basic usage of the ranges attribute

Basic usage of the Token class is as a container for a string. It is constructed from a string, with the argument string at instantiation. More precisely, it behaves as a sub-string of the complete Token.string string. The way one passes from the complete string to its sub-string representation is through the ranges attribute. This attribute can be set by hand, as we will do in the following for illustration.

Recall that without a ranges parameter at the instantiation of the Token object, the Token.ranges attribute defaults to describing the entire Token.string string.

Recall also that the ranges attribute must be a list of range objects, whatever the length of this list. In particular, for a single range, Token.ranges=[range(start,stop)] is the standard convention. Also, for all subsequent functionalities, the step option of every range must be 1.

token = Token(string=text)
print(token.ranges)
print(str(token))
print(token.string)
print("#"*len(token.string))

token.ranges = [range(8)]
print(token.ranges)
print(str(token))
print(token.string)
print("#"*len(token.string))

token.ranges = [range(9,22)]
print(token.ranges)
print(str(token))
print(token.string)
print("#"*len(token.string))
[range(0, 40)]
A really simple string for illustration.
A really simple string for illustration.
########################################
[range(0, 8)]
A really
A really simple string for illustration.
########################################
[range(9, 22)]
simple string
A really simple string for illustration.
########################################

Now, let us construct a few Token objects from the same string.

tok1 = Token(string=text,ranges=[range(8)])
print(str(tok1))
tok2 = Token(string=text,ranges=[range(9,22)])
print(str(tok2))
tok3 = Token(string=text,ranges=[range(27,39)])
print(str(tok3))
tok4 = Token(string=text,ranges=[range(0,1),range(9,22),range(27,39)])
print(str(tok4))
A really
simple string
illustration
A simple string illustration

One sees that there is not much difference between the string representations of Token objects having one or several range objects. This is in fact where Token.subtoksep comes into play. Let us change this parameter in the above examples. By default this parameter is the space symbol.

tok1 = Token(string=text,
             ranges=[range(8)],
             subtoksep='_')
print(str(tok1))
tok2 = Token(string=text,
             ranges=[range(9,22)],
             subtoksep='_')
print(str(tok2))
tok3 = Token(string=text,
             ranges=[range(27,39)],
             subtoksep='_')
print(str(tok3))
tok4 = Token(string=text,
             ranges=[range(0,1),range(9,22),range(27,39)],
             subtoksep='_')
print(str(tok4))
A really
simple string
illustration
A_simple string_illustration

While there is no change when there is a single range, one sees that calling str(Token) when there are several range objects automatically glues the different sub-tokens using subtoksep. But the usual spaces (as any other character, in fact) inside a given range are not affected by subtoksep: see 'simple string' in tok4.

3.3. Slicing procedure in a Token

How is the slicing process handled in a Token? Well, the subtoksep counts as any other character in the string representation. It also counts in len, in fact. Let us illustrate this.

print(tok1[:5])
print(tok4[:5])
print(tok4[5:10])
print(tok4[10:20])
A rea
A_sim
ple s
tring_illu
print(len(tok4))
print(len(tok4[:5]))
28
5

And if we change the subtoksep, its length is automatically taken into account when computing the length of the complete Token and during the slicing process.

tok5 = Token(string=text,
             ranges=[range(0,1),range(9,22),range(27,39)],
             subtoksep='_&_')
print(str(tok5))
print(len(tok5))
print(tok5[:5])
A_&_simple string_&_illustration
32
A_&_s

So the Token class handles all the machinery of a quite normal string, thanks to the ranges and subtoksep attributes.

3.4. Absolute and relative coordinates

It seems everything is handled underneath such that a Token object behaves just as a sub-string of its Token.string string. But there are sometimes subtleties to be understood. One of them is the distinction between absolute and relative coordinates, which may give headaches to users. Fortunately, things remain quite workable, trusting the machinery behind the Token class.

The absolute coordinate system gives the position of the text inside the parent string given at the instantiation of the Token object. Most of the time, one should not worry about it, except perhaps for the Token.append and Token.remove methods that will be discussed later in this chapter.

The relative coordinate system is the position inside the Token.ranges object. This is the natural position if one sees the Token as its string representation, namely str(Token).

So, in short:

  • absolute position refers to the position in the string Token.string

  • relative coordinate refers to the position in the string str(Token)

As an example, let us construct a simple string of digits, where the digit 0 appears at position 0, the digit 1 at position 10, the digit 2 at position 20, and so on. Between decades, there are the natural digits from 1 to 9. On top of this string, we construct a Token representing a string of size 40, made of every other decade, up to position 80 of the digits string.

root = '123456789'
digits = ''.join([str(i)+root for i in range(10)])
tok_digits = Token(string=digits,
                   ranges=[range(10,20),range(30,40),
                           range(50,60),range(70,80)],
                  subtoksep='#')
str(tok_digits)
'1123456789#3123456789#5123456789#7123456789'

Then the relative coordinates range from 0 to 39+3*len(subtoksep), since there are 4 ranges and hence 3 subtoksep separators inserted, whereas the absolute ones range from 0 to 99, without any subtoksep interruption.
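This relative length can be computed directly from the attributes; the sketch below (plain Python, independent of the package) reproduces the length reported by len(tok_digits):

```python
# The relative length of a multi-range Token: sum of the range lengths
# plus one subtoksep between each pair of consecutive ranges.
ranges = [range(10, 20), range(30, 40), range(50, 60), range(70, 80)]
subtoksep = '#'

relative_length = sum(len(r) for r in ranges) + (len(ranges) - 1) * len(subtoksep)
print(relative_length)  # 4*10 characters + 3 separators = 43
```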

print("Relative coordinates from 0 to 20")
print(str(tok_digits)[:20])
print("Relative length: {}".format(len(tok_digits)))
print("\n")
print("Absolute coordinates from 0 to 20")
print(tok_digits.string[:20])
print("Absolute length: {}".format(len(tok_digits.string)))
Relative coordinates from 0 to 20
1123456789#312345678
Relative length: 43


Absolute coordinates from 0 to 20
01234567891123456789
Absolute length: 100

Once more, this is quite a lot of complexity in the presentation for almost nothing, since most usages will never produce an unnatural outcome with the basic tools of the Token and Tokens classes, as long as subtoksep is of length 1.

3.5. Enlarge the Token, and combine overlapping ranges

Suppose one wants to combine, for some reason, tok1 and tok2 into a new Token called tok12. Then the related string representation will be given by the concatenation of the two previous strings, and the resulting ranges attribute illustrates the combination, as does the subtoksep that is now present.

tok12 = tok1 + tok2
print(tok12) # equivalent to print(str(tok12))
print(tok12.ranges)
A really_simple string
[range(0, 8), range(9, 22)]

Note that this Token does not differ much from the initial string between positions 0 and 22, except for the subtoksep, which differs from a normal space symbol in our illustration.

Note in passing that there is a special handling when one adds two Token objects that do not have the same subtoksep.

tok1 + tok5
Token('A really_simple string_illustration', [(0,8),(9,22),(27,39)])

... and one can compare the resulting Token: in this case it inherits the subtoksep of tok1, the left operand. The addition is therefore not commutative: tok1 + tok5 != tok5 + tok1! (Comparing Token objects will be detailed in a later chapter.)

print(tok1 + tok5 == tok1)
print(tok5 + tok1 == tok5)
print(tok1 == tok5)
False
False
False

In practice, the subtoksep of the second Token is simply discarded in the concatenation construction.

Let us come back to the concatenation procedure. What would happen if one tried to concatenate tok3 and tok4, since they have a part of the initial string in common? In fact there is special handling underneath, which recalculates all ranges such that overlaps disappear.

tok34 = tok3 + tok4
print(tok34)
print(tok34.ranges)
print(tok34 == tok4)
A_simple string_illustration
[range(0, 1), range(9, 22), range(27, 39)]
True

Note that the two different processes are not the same at all! tok34 == tok4 because the string representation of tok3 is already present in tok4, whereas tok1 and tok5 had incompatible subtoksep attributes. This can be seen because the addition is commutative in the present case, as illustrated by the last line below.

tok43 = tok4 + tok3
print(tok43)
print(tok43.ranges)
print(tok43 == tok4)
print(tok34 == tok43)
A_simple string_illustration
[range(0, 1), range(9, 22), range(27, 39)]
True
True

To illustrate the overlap handling further, let us try to create a new Token with some overlapping range objects, and realize that the construction in fact destroys the independent range objects and fuses them into non-overlapping ones.

tok6 = Token(string=text,
             ranges=[range(0,9),range(9,22),range(27,39),range(30,39),range(10,15)],
             subtoksep='_')
print(tok6)
print(tok6.ranges)
A really simple string_illustration
[range(0, 22), range(27, 39)]

Only two range objects survive the construction process, since range(10,15) is entirely contained in range(9,22), and the same is true for range(30,39), which gives no more information than range(27,39) relative to the initial string. In addition, the two ranges range(0,9) and range(9,22) are naturally transformed into range(0,22), since taken together they represent this entire range.

3.5.1. The combining function (for developers)

For those interested in the construction of the Token class, we reproduce the function _combineRanges(ranges), which avoids the proliferation of overlapping range objects in the Token. This function is available in the tokenizer/tokentokens.py module.

def _combineRanges(ranges):
    """
Take a list of range objects, and transform it such that overlapping 
ranges and consecutive ranges are combined. 

`ranges` is a list of `range` object, all with `range.step==1` 
(not verified by this function, but required for 
the algorithm to work properly).
    """
    if len(ranges)<2:
        return ranges
    r_ = sorted([(r.start,r.stop) for r in ranges])
    temp = [list(r_[0]),]
    for start,stop in r_[1:]:
        if temp[-1][1] >= start:
            temp[-1][1] = max(temp[-1][1], stop)
        else:
            temp.append([start, stop])
    r = [range(*t) for t in temp]
    return r
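As a quick check, one can feed this function the overlapping ranges used for tok6 above and recover the same two merged ranges (the function is re-declared here so the snippet runs on its own):

```python
# Re-declaration of _combineRanges so this snippet is self-contained.
def _combineRanges(ranges):
    if len(ranges) < 2:
        return ranges
    r_ = sorted([(r.start, r.stop) for r in ranges])
    temp = [list(r_[0])]
    for start, stop in r_[1:]:
        if temp[-1][1] >= start:          # overlapping or consecutive
            temp[-1][1] = max(temp[-1][1], stop)
        else:                             # disjoint: start a new interval
            temp.append([start, stop])
    return [range(*t) for t in temp]

# same overlapping ranges as in the tok6 construction above
combined = _combineRanges([range(0, 9), range(9, 22), range(27, 39),
                           range(30, 39), range(10, 15)])
print(combined)  # [range(0, 22), range(27, 39)]
```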

3.6. Token.append and Token.remove

There are two facilities to modify the Token.ranges attribute: a method to add a range to Token.ranges, called Token.append(range) or Token.append(list_of_range), and a method to remove a range, called Token.remove(range) or Token.remove(list_of_range).

Note that appending a range using Token.append also checks for overlapping intervals, and will not duplicate the range in Token.ranges. Since one wants to append a new range, this range is given in absolute coordinates, that is, in the counting of Token.string.

On the contrary, Token.remove will withdraw the passed range from the existing ranges. That is, if the removed range does not overlap with some part of Token.ranges, nothing is removed. Nevertheless, Token.remove also uses the absolute coordinates.

If you are not at ease with the names append and remove, because they are too close to the Python list methods, you can use append_range and remove_range, which are aliases for the two previous ones.

Importantly, append, remove and their aliases work in place, i.e. they transform the Token object itself. This is the same behavior as the corresponding methods of a Python list.

To illustrate this, we come back to the digits string introduced in the discussion about absolute and relative coordinates above. The tok_digits instance has a ranges attribute of the form [(10,20),(30,40),(50,60),(70,80)], so appending the range (20,30) should fuse its first two sub-ranges, because overlapping ranges are automatically merged after a Token.append process. From the result one can remove the same range (20,30) to come back to the initial object. If one then removes the range (15,35), one should end up with a Token.ranges of the form [(10,15),(35,40),(50,60),(70,80)]. Let us see all of this (plus a few more examples) below.

print(tok_digits.ranges)
print("append (20,30)")
tok_digits.append_range(range(20,30))
print(tok_digits.ranges)
print("remove (20,30)")
tok_digits.remove_range(range(20,30))
print(tok_digits.ranges)
print("remove (15,35)")
tok_digits.remove_range(range(15,35))
print(tok_digits.ranges)
print("remove (15,35) and (75,125), far too long for the Token.string")
tok_digits.remove_ranges([range(15,35),range(75,125)])
print(tok_digits.ranges)
[range(10, 20), range(30, 40), range(50, 60), range(70, 80)]
append (20,30)
[range(10, 40), range(50, 60), range(70, 80)]
remove (20,30)
[range(10, 20), range(30, 40), range(50, 60), range(70, 80)]
remove (15,35)
[range(10, 15), range(35, 40), range(50, 60), range(70, 80)]
remove (15,35) and (75,125), far too long for the Token.string
[range(10, 15), range(35, 40), range(50, 60), range(70, 75)]

One sees that removing a range that does not exist in the absolute coordinates produces nothing (that is the example of removing the range (75,125) at the last step, which in fact removes only the range (75,80), as this is the only available part in the Token object at that step).

In the same way, adding a range lying entirely outside of Token.string produces nothing, apart from a BoundaryWarning, as seen below.

print(tok_digits.ranges)
print("append (120,130)")
tok_digits.append_range(range(120,130))
print(tok_digits.ranges)
[range(10, 15), range(35, 40), range(50, 60), range(70, 75)]
append (120,130)
[range(10, 15), range(35, 40), range(50, 60), range(70, 75)]
/home/gozat/.local/lib/python3.8/site-packages/tokenspan/tools.py:28: BoundaryWarning: At least one of the boundaries has been modified
  warnings.warn(mess, category=BoundaryWarning)

Now we pass to a basic explanation of how the Token.ranges attribute transforms when passing to Tokens objects.

3.7. From Token to Tokens classes

The three mechanisms to pass from Token to Tokens use either the partition, split or slice methods. We review these mechanisms when a multi-ranged Token is involved in the process.

Once more, for illustration we prefer the debugging representation of the Tokens class, or its list representation, to the str(Tokens) representation, which is messier.

We first see that everything is done so as not to bother the user with the position arguments: the start and stop parameters of Token.partition are counted in the string representation str(Token). In addition, the subtoksep separators are conserved by the splitting processes. So, when one cuts the tok4 string from position start=2 to position stop=8 using the partition method, one really isolates str(tok4)[2:8] in the middle Token of the resulting Tokens.

tokens = tok4.partition(2,8)
print(list(tokens))
print(str(tokens[0])==str(tok4)[:2])
print(str(tokens[1])==str(tok4)[2:8])
print(str(tokens[2])==str(tok4)[8:])
[Token('A_', [(0,1),(9,9)]), Token('simple', [(9,15)]), Token(' string_illustration', [(15,22),(27,39)])]
True
True
True

The way the subtoksep separators are conserved is due to the insertion of empty range objects in the Token instances combined in the Tokens object: see the first line below.

for tok in tokens:
    print(tok.ranges)
[range(0, 1), range(9, 9)]
[range(9, 15)]
[range(15, 22), range(27, 39)]

Note finally that the behavior is less clear as soon as one uses a subtoksep of length larger than 1, as illustrated below, where several different solutions exist for the same strings. In Python it is quite clear what to do with a range(start,stop): it always corresponds to the semi-open (mathematical) interval [start,stop[, including the start and excluding the stop. But when either the start or the stop falls on a subtoksep, should it go in the left or in the right interval after splitting the string? The answer is below: the algorithm waits for the subtoksep to be entirely on the left of stop before displaying it in the left range. See the illustration below.

This is the reason why using len(subtoksep)==1 is highly recommended.

tokens = tok5.partition(1,8)
print(list(tokens))
print("total length = {}".format(sum(len(tok) for tok in tokens)))
tokens = tok5.partition(2,8)
print(list(tokens))
print("total length = {}".format(sum(len(tok) for tok in tokens)))
tokens = tok5.partition(3,9)
print(list(tokens))
print("total length = {}".format(sum(len(tok) for tok in tokens)))
tokens = tok5.partition(4,10)
print(list(tokens))
print("total length = {}".format(sum(len(tok) for tok in tokens)))
[Token('A', [(0,1)]), Token('_&_simp', [(1,1),(9,13)]), Token('le string_&_illustration', [(13,22),(27,39)])]
total length = 32
[Token('A', [(0,1)]), Token('simple', [(9,15)]), Token(' string_&_illustration', [(15,22),(27,39)])]
total length = 29
[Token('A', [(0,1)]), Token('simple', [(9,15)]), Token(' string_&_illustration', [(15,22),(27,39)])]
total length = 29
[Token('A_&_', [(0,1),(9,9)]), Token('simple', [(9,15)]), Token(' string_&_illustration', [(15,22),(27,39)])]
total length = 32

On the contrary, using a subtoksep of length 1 never destroys the Token string.

tokens = tok4.partition(1,8)
print(list(tokens))
print("total length = {}".format(sum(len(tok) for tok in tokens)))
tokens = tok4.partition(2,8)
print(list(tokens))
print("total length = {}".format(sum(len(tok) for tok in tokens)))
[Token('A', [(0,1)]), Token('_simple', [(1,1),(9,15)]), Token(' string_illustration', [(15,22),(27,39)])]
total length = 28
[Token('A_', [(0,1),(9,9)]), Token('simple', [(9,15)]), Token(' string_illustration', [(15,22),(27,39)])]
total length = 28

Another important special case is when one tries to split the initial Token on either its initial or final character. See the example below.

tokens = tok4.partition(0,1)
print(list(tokens))
tokens = tok4.partition(len(tok4)-12,len(tok4))
print(list(tokens))
[Token('', [(0,0)]), Token('A', [(0,1)]), Token('_simple string_illustration', [(1,1),(9,22),(27,39)])]
[Token('A_simple string_', [(0,1),(9,22),(27,27)]), Token('illustration', [(27,39)]), Token('', [(39,39)])]

In that case, the left-most or right-most Token is empty. Nevertheless, one can remedy this by using the parameter remove_empty=True (the default is False) when calling partition. Note that it removes one Token in that case.

tokens = tok4.partition(0,1,remove_empty=True)
print(list(tokens))
tokens = tok4.partition(len(tok4)-12,len(tok4),True)
print(list(tokens))
[Token('A', [(0,1)]), Token('_simple string_illustration', [(1,1),(9,22),(27,39)])]
[Token('A_simple string_', [(0,1),(9,22),(27,27)]), Token('illustration', [(27,39)])]

The behaviors of split and slice are quite similar, so we simply give one example of each.

tokens = tok4.split((range(0,1),range(2,8),
                     range(len(tok4)-12,len(tok4))),remove_empty=False)
print(list(tokens))
tokens = tok4.split((range(0,1),range(2,8),
                     range(len(tok4)-12,len(tok4))),remove_empty=True)
print(list(tokens))
[Token('', [(0,0)]), Token('A', [(0,1)]), Token('_', [(1,1),(9,9)]), Token('simple', [(9,15)]), Token(' string_', [(15,22),(27,27)]), Token('illustration', [(27,39)]), Token('', [(39,39)])]
[Token('A', [(0,1)]), Token('_', [(1,1),(9,9)]), Token('simple', [(9,15)]), Token(' string_', [(15,22),(27,27)]), Token('illustration', [(27,39)])]

Remark how the Token containing only a subtoksep handles its length: by keeping two empty range objects, which ensure its length is still that of the subtoksep. Once more, this cannot be done using len(subtoksep)>1.

tokens = tok4.split((range(0,1),range(2,8),
                     range(len(tok4)-12,len(tok4))),remove_empty=True)
tokens[1].ranges
[range(1, 1), range(9, 9)]

3.8. From Tokens to Token classes

Let us now study the join procedure. Basically, Tokens.join(start,stop,step) takes all the Token instances in the slice Tokens[start:stop:step] and concatenates them. Recall that missing arguments are calculated to fill the entire Tokens, but the order must be conserved, or one must pass the arguments explicitly by name when calling the method.

print(list(tokens))
[Token('A', [(0,1)]), Token('_', [(1,1),(9,9)]), Token('simple', [(9,15)]), Token(' string_', [(15,22),(27,27)]), Token('illustration', [(27,39)])]
tok = tokens.join()
print(tok)
print(tok.ranges)
A_simple string_illustration
[range(0, 1), range(9, 22), range(27, 39)]
tok = tokens.join(1,4,1)
print(tok)
print(tok.ranges)
_simple string_
[range(1, 1), range(9, 22), range(27, 27)]
tok = tokens.join(start=2,step=2)
print(tok)
print(tok.ranges)
simple_illustration
[range(9, 15), range(27, 39)]

Underneath, the association uses only the ranges, picking the elements [start:stop:step] from the list below, then applying _combineRanges to the resulting elements, and reconstructing a Token object from those.

print([tok.ranges for tok in tokens])
[[range(0, 1)], [range(1, 1), range(9, 9)], [range(9, 15)], [range(15, 22), range(27, 27)], [range(27, 39)]]
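This mechanism can be sketched in plain Python, using the per-Token range lists printed above and the same merging logic as _combineRanges; picking start=2, step=2 reproduces the ranges of the last join example:

```python
# Sketch of the join mechanics: slice the per-Token range lists,
# flatten them, then merge overlapping or consecutive intervals.
per_token_ranges = [[range(0, 1)], [range(1, 1), range(9, 9)], [range(9, 15)],
                    [range(15, 22), range(27, 27)], [range(27, 39)]]

picked = [r for rs in per_token_ranges[2::2] for r in rs]  # start=2, step=2
merged = []
for start, stop in sorted((r.start, r.stop) for r in picked):
    if merged and merged[-1][1] >= start:   # overlap or contact: fuse
        merged[-1][1] = max(merged[-1][1], stop)
    else:
        merged.append([start, stop])
print([range(*m) for m in merged])  # [range(9, 15), range(27, 39)]
```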

As a final remark, let us realize that all the Token objects generated in this NoteBook have the same string, yet this string is not copied by any of them, since objects are passed by reference in Python.

So keeping the complete string in a great number of Token objects will not cause any memory usage trouble in principle.

strings = [tok.string for tok in [tok1,tok2,tok3,tok4,tok5]]
bools = [strings[0]==s for s in strings[1:]]
print(all(bools))
stringsId = [id(tok.string) for tok in [tok1,tok2,tok3,tok4,tok5]]
bools = [stringsId[0]==s for s in stringsId[1:]]
print(all(bools))
True
True
from datetime import datetime
print("Last modification {}".format(datetime.now().strftime("%c")))
Last modification Wed Jan 19 19:56:03 2022