5. Token and Tokens classes - Chapter 4: Introduction to the attributes

Now that we are quite familiar with the basics of the Token and Tokens classes, we would like to introduce another functionality that covers several aspects of the tokenization process: the possibility to attach attributes to each Token. In the following we first give the details of the implementation, namely how to construct the attributes. Then we discuss how they are passed from the Token to the Tokens class, and by the next chapter we will have all the tools to discuss how to compare two Token objects, and to elaborate even more on their usage in a non-trivial example later on.

5.1. Motivation

The tokenization process is not only about cutting and separating parts of a large string into atomic quantities that can be given to a computer. A crucial component of its later usage is the ability to identify a token as an interrelated atom in a larger construction, that is, the sentence, the document or the corpus. To do that, a basic idea is to attach some features to each token. For instance, one may want to identify a string as being a verb, a stopword (i.e. a meaningless token), a physical unit, … Perhaps later one may want to join several tokens into a single entity, … This is the principle behind the attributes: to create an extra structure around the simple strings, such that later algorithms can read and process this extra structure. Here we show how to construct such an extra structure using the Token and Tokens classes.

We start by instantiating a simple string, as usual in these guides.

Recall that the except ModuleNotFoundError clause is here to handle the case where one has not installed the package.

try:
    from tokenspan import Token, Tokens
except ModuleNotFoundError:
    print("tokenspan is not installed; install it first (e.g. pip install tokenspan)")
import re

text = "A really simple string for illustration."

Then one can instantiate a few Token objects. Here we do that manually, since you now know from the previous chapter how to do exactly the same thing in an automatic way.

tok1 = Token(string=text,ranges=[range(2,8)])
tok2 = Token(string=text,ranges=[range(9,15)])
tok3 = Token(string=text,ranges=[range(16,22)])
tok4 = Token(string=text,ranges=[range(27,39)])
print(tok1)
print(tok2)
print(tok3)
print(tok4)
really
simple
string
illustration

5.2. Generate some attributes

One can now generate some attributes. This is simply done using the method Token.setattr(name, dict_of_attributes), with name a string and dict_of_attributes a dictionary of attributes that one would like to preserve for later use.

tok1.setattr('typo',{'type':'adverb','len':len(tok1)})
tok2.setattr('typo',{'type':'adjective','len':len(tok2)})
tok3.setattr('typo',{'type':'name','len':len(tok3)})
tok4.setattr('typo',{'type':'name','len':len(tok4)})
Token('illustration', [(27,39)])

Here we just give some examples; keep in mind that you are totally free to set the attributes the way you desire. The only thing to remember is that they accept only Python dictionaries, otherwise a TypeError is raised. In addition, if one tries to overwrite an existing attribute, an AttributeError is raised.

try:
    tok1.setattr('test',[1,2,3])
except TypeError as error:
    print("TypeError{}".format(error.args))

try:
    tok1.setattr('typo',[1,2,3])
except AttributeError as error:
    print("AttributeError{}".format(error.args))
TypeError("'list' object is not a mapping",)
AttributeError('Attribute typo already exist',)

The only way to modify an attribute is to overwrite it directly, that is, one assigns to Token.name_of_attribute, with name_of_attribute the name previously given during the Token.setattr call.

tok1.typo = {'type':'adverb','len':len(tok1),'extra':True}
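
As a quick check (the expected display follows from the outputs shown later in this chapter):

print(tok1.typo)   # expected: {'type': 'adverb', 'len': 6, 'extra': True}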

Note that there is no danger of clobbering an existing method when adding extra attributes. In particular, if one chooses the name of an already existing Token method for an attribute, an AttributeError is raised.

try:
    tok1.setattr('slice',{'test':'should raise an AttributeError'})
except AttributeError as error:
    print("AttributeError{}".format(error.args))
AttributeError('Attribute slice already exist',)
tok1.slice(size=3)
/home/gozat/.local/lib/python3.8/site-packages/tokenspan/tools.py:28: BoundaryWarning: At least one of the boundaries has been modified
  warnings.warn(mess, category=BoundaryWarning)
Tokens(4 Token) : 
-----------------
rea
eal
all
lly

The complete list of methods already implemented in the Token class is part of the module tokentokens.py.

import tokenspan.tokentokens as toktest_
print(toktest_._token_methods)
['get_subToken', 'attributes', 'keys', 'values', 'items', 'fusion_attributes', 'append', 'remove', 'append_range', 'remove_range', 'start', 'stop', '_append_range', '_remove_range', 'union', 'difference', 'intersection', 'symmetric_difference', 'copy', 'set_string_methods', 'slice', 'split', 'partition', '_prepareTokens']

5.3. Token.set_string_methods()

There is a special method attached to the Token class: the set_string_methods() method. It stores the results of the basic Python string methods in the attribute string_methods.

token = Token(string="SimPle StrIng")
token.set_string_methods()
token.string_methods
{'upper': 'SIMPLE STRING',
 'lower': 'simple string',
 'swapcase': 'sIMpLE sTRiNG',
 'capitalize': 'Simple string',
 'casefold': 'simple string',
 'isalnum': False,
 'isalpha': False,
 'isascii': True,
 'isdecimal': False,
 'isdigit': False,
 'isidentifier': False,
 'islower': False,
 'isnumeric': False,
 'isprintable': True,
 'isspace': False,
 'istitle': False,
 'isupper': False}
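
As displayed above, string_methods behaves as a dictionary, so one can query a single entry directly:

print(token.string_methods['lower'])   # displays 'simple string', as in the dictionary above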

To conclude this part about constructing personalized attributes and attaching them to the Token, remember the following important remarks:

  1. It is not possible to pass the personalized attributes as parameters during the instantiation of the Token object. That is, one necessarily has to use the setattr method.

  2. setattr is quite a rigid method: it takes a string and a dictionary. Having said that, one is free to put anything one wants in the dictionary …

  3. Only those personalized attributes that have been constructed using the setattr method will remain attached to the Token instance, and eventually be passed to the Tokens constructed on top of them. That is, one might be tempted to create any attribute as Token.my_attribute = something (and this is obviously allowed in Python), but none of these will stay attached to the Token for long during the tokenization procedure, as sketched just below.
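
As a minimal sketch of the third remark (the names tok_tmp, ad_hoc and registered are invented for this example), an attribute assigned directly does not appear among the registered attributes, while one created with setattr does:

tok_tmp = Token(string=text, ranges=[range(2,8)])
tok_tmp.ad_hoc = {'note': 'assigned directly'}              # plain Python attribute, not registered
tok_tmp.setattr('registered', {'note': 'set via setattr'})  # registered personalized attribute
print(tok_tmp.attributes)   # expected: frozenset({'registered'}), with no mention of ad_hoc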

5.4. Accessing the attributes

Once the attributes are defined, there are several ways to access them.

  1. One can call directly Token.name_of_attribute with the correct name of the attribute previously defined.

  2. One can call Token.attributes to get the Python frozenset gathering all the names of the personalized attributes previously defined, and then call Token.name_of_attribute with one of these names, or getattr(Token, 'name_of_attribute') if one prefers.

  3. One can call Token.keys() to get a generator of all the names of the personalized attributes.

  4. One can call Token.values() to get a generator of all the dictionaries corresponding to the personalized attributes.

  5. One can call Token.items() to get a generator of tuples, corresponding to the names and the values of the personalized attributes.

This multiplicity of access should help you design the Token the way you prefer.

Note that the methods keys(), values() and items() (all without parameters) are reminiscent of the Python dictionary behavior. Nevertheless, here the outcomes are generators instead of lists.

print(tok1.typo)
print(tok2.typo)
print(list(tok1.values()))
print(list(tok2.values()))
{'type': 'adverb', 'len': 6, 'extra': True}
{'type': 'adjective', 'len': 6}
[{'type': 'adverb', 'len': 6, 'extra': True}]
[{'type': 'adjective', 'len': 6}]
print(tok1.attributes)
print(tok2.attributes)
print(list(tok1.keys()))
print(list(tok2.keys()))
frozenset({'typo'})
frozenset({'typo'})
['typo']
['typo']
print(list(tok1.items()))
print(list(tok2.items()))
[('typo', {'type': 'adverb', 'len': 6, 'extra': True})]
[('typo', {'type': 'adjective', 'len': 6})]

5.5. Token.copy() and its reset_attributes parameter

One can work on a Token, feed it with some attributes, and then want to change those attributes without destroying the information created before. This can be done by storing a copy of the Token, using the Token.copy() method.

tok1_copy = tok1.copy()
tok1.typo = {'type':'adverb','len':len(tok1)}
print(tok1.typo)
print(tok1_copy.typo)
{'type': 'adverb', 'len': 6}
{'type': 'adverb', 'len': 6, 'extra': True}

Token.copy() accepts a reset_attributes parameter (default is False) which, when set to True, destroys all the attributes in the copy being made.

tok1_copy = tok1.copy(reset_attributes=True)
tok1.typo = {'type':'adverb','len':len(tok1)}
print(tok1.typo)
print(tok1_copy.attributes)
{'type': 'adverb', 'len': 6}
frozenset()

5.6. carry_attributes parameter of the class Token

At the instantiation of a Token object, one can specify carry_attributes=False (default is True), in which case the attributes are reset automatically whenever a copy is made, whatever the reset_attributes value one chooses in the Token.copy() call.

tok4 = Token(string=text,
             ranges=[range(27,39)],
             carry_attributes=False)
print(tok4.attributes)
tok4.setattr('typo',{'type':'name','len':len(tok4)})
print(tok4.attributes)
tok4_copy = tok4.copy()

print("\n")
print(tok4_copy.attributes)
print(tok4.attributes)
frozenset()
frozenset({'typo'})


frozenset()
frozenset({'typo'})

carry_attributes also changes the comparison between two Token objects quite a lot, as we will see in a later chapter.

5.7. Transferring attributes to Tokens

Once the Tokens is generated, the Token objects inside it keep their attributes.

tokens = Tokens([tok1,tok2,tok3,tok4])
tokens[0].attributes
frozenset({'typo'})

In addition, there are a few functionalities in Tokens that give a picture of the different attributes inside each Token it contains. These are:

  • attributes_keys, which gives the frozenset of all available attribute names, without mention of which Token carries which attribute

  • attributes_map, which gives, in the form of a dictionary, the list of all Token indices having each attribute

  • attributes_values, which gives, in the form of a dictionary, the list containing the attribute values of each Token

We comment on these attributes of the Tokens class below.

tokens.attributes_keys
frozenset({'typo'})

Tokens.attributes_keys contains only one name, since all the Token objects it contains share the same personalized attribute.

tokens.attributes_map
{'typo': [0, 1, 2]}

In this example, Tokens.attributes_map indicates that the personalized attribute typo is present in the Token objects at positions 0, 1 and 2 of the Tokens list. It might seem strange that tokens[3] (which corresponds to tok4) has no typo attribute. We will see why below.

tokens.attributes_values
{'typo': [{'type': 'adverb', 'len': 6},
  {'type': 'adjective', 'len': 6},
  {'type': 'name', 'len': 6},
  {}]}

One sees the values contained in the attribute typo for the four Token objects in tokens. Clearly, tok4, unlike the three other ones, has not passed its typo attribute along. The reason is the carry_attributes parameter, which reset all the personalized attributes that tok4 may have had. Internally, the construction of the Tokens instance uses the Token.copy procedure, hence this behavior.

Note that this behavior of destroying the personalized attributes when passing from Token to Tokens might be useful when, for instance, one has completely messed up the attributes and wants to start anew, when the attributes are useless, or when one wants to make a new iteration between Token and Tokens. It is less clear why one may want different behaviors depending on the Token inside a Tokens; that is, it is perhaps better that all the Token objects in a Tokens share the same carry_attributes parameter. In any case, this is just a demonstration of how versatile these classes are.

To see the mechanism behind carry_attributes, let us invert the situation: we set carry_attributes=False for tok1 and restore carry_attributes=True for tok4, then generate a new Tokens and call the three above facilities for handling personalized attributes at the Tokens level.

tok1.carry_attributes = False
tok4.carry_attributes = True
tokens = Tokens([tok1,tok2,tok3,tok4])
print(tokens.attributes_map)
print(tokens.attributes_values)
{'typo': [1, 2, 3]}
{'typo': [{}, {'type': 'adjective', 'len': 6}, {'type': 'name', 'len': 6}, {'type': 'name', 'len': 12}]}

That is, the only missing typo attribute is now at position 0: this is indeed tok1.

Note nevertheless that the information about tok4 was never destroyed in the previous process, and the same holds here: since all Token objects are passed to the Tokens via a copy, their original attributes (including the personalized ones) have never been endangered, whereas their copied representations no longer carry the personalized attributes.

print("In tokens :")
for tok in tokens:
    print(tok.attributes)
print("\nIn each Token:")
for tok in [tok1,tok2,tok3,tok4]:
    print(list(tok.items()))
In tokens :
frozenset()
frozenset({'typo'})
frozenset({'typo'})
frozenset({'typo'})

In each Token:
[('typo', {'type': 'adverb', 'len': 6})]
[('typo', {'type': 'adjective', 'len': 6})]
[('typo', {'type': 'name', 'len': 6})]
[('typo', {'type': 'name', 'len': 12})]

The above attributes_keys, attributes_map and attributes_values should help in understanding the underlying structure of the Token and of the Tokens. If they are not sufficient, there are dictionary-like methods as well: keys(), values() and items(). As for their Token counterparts, these methods return generators, not lists, so lists have to be constructed by hand, or the generators passed directly to some consumer (for instance a loop).

print(list(tokens.keys()))
print(list(tokens.values()))
print(list(tokens.items()))
['typo']
[[{}, {'type': 'adjective', 'len': 6}, {'type': 'name', 'len': 6}, {'type': 'name', 'len': 12}]]
[('typo', [{}, {'type': 'adjective', 'len': 6}, {'type': 'name', 'len': 6}, {'type': 'name', 'len': 12}])]

5.8. Going back to Token

As one has seen by now, the way back to a Token object is via the Tokens.join method. When it comes to the attributes, the Tokens object makes no choice for you: it simply collects all the personalized attributes and restitutes them in the order they had in the Tokens, i.e. what was associated to tokens[0].perso_attr ends up at the first position of a list generated in the same order as the Token objects were instantiated in the Tokens object… Hum, perhaps it is better to show examples at this level…

For convenience, we first display again the four typo attributes (as they are in the tokens object, once they have been transmitted and potentially erased by the copy process).

for i,tok in enumerate(tokens):
    try:
        print("tok{}: {}".format(i+1,tok.typo))
    except AttributeError:
        print("tok{}: no typo attribute".format(i+1))
tok1: no typo attribute
tok2: {'type': 'adjective', 'len': 6}
tok3: {'type': 'name', 'len': 6}
tok4: {'type': 'name', 'len': 12}

Let us start with just joining tok1 and tok2.

tok12 = tokens.join(0,2)
tok12.typo
{'type': [{}, 'adjective'], 'len': [{}, 6]}

Then let us join tok2 and tok3, then tok3 and tok4, then tok2 and tok4, and finally tok1 and tok4 (just to manipulate the join method a bit more).

print("tok2 and tok3 : {}".format(tokens.join(1,3).typo))
print("tok3 and tok4 : {}".format(tokens.join(2,4).typo))
print("tok2 and tok4 : {}".format(tokens.join(1,4,2).typo))
print("tok1 and tok4 : {}".format(tokens.join(0,4,3).typo))
tok2 and tok3 : {'type': ['adjective', 'name'], 'len': [6, 6]}
tok3 and tok4 : {'type': ['name', 'name'], 'len': [6, 12]}
tok2 and tok4 : {'type': ['adjective', 'name'], 'len': [6, 12]}
tok1 and tok4 : {'type': [{}, 'name'], 'len': [{}, 12]}

And finally let us glue all Token together, in the same order as they appear in the Tokens list.

tok = tokens.join()
tok.typo
{'type': [{}, 'adjective', 'name', 'name'], 'len': [{}, 6, 6, 12]}

Several remarks are in order:

  • There is no way to recover the position of the attributes before the Tokens.join: they combine in the list in the order they have in the Tokens, and this may have nothing to do with the position of the Token in the initial string! See the sketch right after this list.

  • When a Token in Tokens does not have an attribute before the Tokens.join, it is restituted as an empty dictionary {} in the attribute list after the Tokens.join. See the second example below, with alternating carry_attributes being True and False.

  • What you want to do with the resulting list of attributes is up to you. One more time, this illustrates that Token and Tokens are better seen as tools than as end-to-end tokenizer protocols.
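
To illustrate the first remark, here is a minimal sketch reusing tok1 and tok2 from above (we restore carry_attributes for tok1, which we switched off earlier); the expected output is an assumption based on the behavior shown in this chapter:

tok1.carry_attributes = True           # restore the value changed in section 5.7
tokens_rev = Tokens([tok2, tok1])      # order inverted with respect to the parent string
print(tokens_rev.join().typo)
# expected: {'type': ['adjective', 'adverb'], 'len': [6, 6]} -- 'adjective' first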

toks = [tok1,tok2,tok3,tok4]
for tok in toks:
    tok.carry_attributes = True
for t1,t2 in zip(toks,toks[1:]+toks[:1]):
    t1.carry_attributes = True
    t2.carry_attributes = False
    tokens = Tokens([tok1,tok2,tok3,tok4])
    tok = tokens.join()
    print(tok.typo)
{'type': ['adverb', {}, 'name', 'name'], 'len': [6, {}, 6, 12]}
{'type': ['adverb', 'adjective', {}, 'name'], 'len': [6, 6, {}, 12]}
{'type': ['adverb', 'adjective', 'name', {}], 'len': [6, 6, 6, {}]}
{'type': [{}, 'adjective', 'name', 'name'], 'len': [{}, 6, 6, 12]}

5.9. Complete code

As a final illustration, we reproduce the complete code and add some attributes, to see how everything is handled smoothly by the Token and Tokens machinery.

text = "A really simple string for illustration."

tok1 = Token(string=text,ranges=[range(2,8)])
tok2 = Token(string=text,ranges=[range(9,15)])
tok3 = Token(string=text,ranges=[range(16,22)])
tok4 = Token(string=text,ranges=[range(27,39)])

tok1.setattr('typo',{'type':'adverb','len':len(tok1)})
tok2.setattr('typo',{'type':'adjective','len':len(tok2)})
tok3.setattr('typo',{'type':'name','len':len(tok3)})
tok4.setattr('typo',{'type':'name','len':len(tok4)})

tok1.setattr('extra',{'test':True})
tok2.setattr('extra',{'test':False})

tok1.setattr('compare',{'tok2':0,'tok3':1,'tok4':2})
tok2.setattr('compare',{'tok1':1,'tok3':2,'tok4':0})
tok3.setattr('compare',{'tok1':0,'tok2':2,'tok4':1})
tok4.setattr('compare',{'tok1':1,'tok2':0,'tok3':2})

t1234 = Tokens([tok1,tok2,tok3,tok4])
tok1234 = t1234.join()

t4321 = Tokens([tok4,tok3,tok2,tok1])
tok4321 = t4321.join()

for tok in [tok1234,tok4321]:
    print(str(tok)+' :')
    print("-"*(len(tok)+2))
    for attr in ['typo','extra','compare']:
        print(attr+' : {}'.format(getattr(tok,attr)))
    print("\n")
really simple string illustration :
-----------------------------------
typo : {'type': ['adverb', 'adjective', 'name', 'name'], 'len': [6, 6, 6, 12]}
extra : {'test': [True, False, {}, {}]}
compare : {'tok1': [{}, 1, 0, 1], 'tok3': [1, 2, {}, 2], 'tok4': [2, 0, 1, {}], 'tok2': [0, {}, 2, 0]}


really simple string illustration :
-----------------------------------
typo : {'type': ['name', 'name', 'adjective', 'adverb'], 'len': [12, 6, 6, 6]}
extra : {'test': [{}, {}, False, True]}
compare : {'tok1': [1, 0, 1, {}], 'tok4': [{}, 1, 0, 2], 'tok2': [0, 2, {}, 0], 'tok3': [2, {}, 2, 1]}

Remark that the string of the two different Token objects is the same, because the ranges they share conserve the initial order of the parent string.
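
One can verify this directly:

print(str(tok1234) == str(tok4321))   # expected: True, both render the same substring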

from datetime import datetime
print("Last modification {}".format(datetime.now().strftime("%c")))
Last modification Wed Jan 19 19:56:24 2022