5. Token and Tokens classes - Chapter 4: Introduction to the Attributes¶
Now that we are quite familiar with the basics of the Token and Tokens classes, we would like to introduce another functionality that covers several aspects of the tokenization process: the possibility to attach attributes to each Token. In the following we give the first details of the implementation, namely how to construct the attributes. Then we discuss how they are passed from the Token to the Tokens class, and finally, in the next chapter, we will have all the tools to discuss how to compare two Token objects and to elaborate even more on their usage in a non-trivial example later on.
5.1. Motivation¶
The tokenization process is not only about cutting and separating parts of a large string into atomic quantities that can be given to a computer. A crucial component of the later usage is to be able to identify a token as an interrelated atom in a larger construction, that is, the sentence, the document or the corpus. To do that, a basic idea might be to attach some functions to each token. For instance, one may want to identify a string as being a verb, a stopword (i.e. a meaningless token), a physical unit, … Perhaps later one may want to join several tokens into a single entity, … This is the principle behind the attributes: to create an extra structure around the simple strings, such that later algorithms can read and treat this extra structure. We show here how to construct such an extra structure using the Token and Tokens classes.
We start by instantiating a simple string, as usual in these guides.
Recall that the except ModuleNotFoundError clause is here to handle the case where the package has not been installed.
try:
    from tokenspan import Token, Tokens
except ModuleNotFoundError:
    # the package is not installed: see the installation instructions
    raise
import re
text = "A really simple string for illustration."
Then one can instantiate a few Token objects. Here we do that manually, since you now know from the previous chapter how to do exactly the same thing in an automatic way.
tok1 = Token(string=text,ranges=[range(2,8)])
tok2 = Token(string=text,ranges=[range(9,15)])
tok3 = Token(string=text,ranges=[range(16,22)])
tok4 = Token(string=text,ranges=[range(27,39)])
print(tok1)
print(tok2)
print(tok3)
print(tok4)
really
simple
string
illustration
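As a reminder of the automatic way, the re module imported above can recover these spans directly. A minimal sketch (the pattern \w+ is our choice for this example, not something imposed by the package):

```python
import re

text = "A really simple string for illustration."

# find the word spans automatically; \w+ is our choice of pattern here,
# not something imposed by the tokenspan package
spans = [m.span() for m in re.finditer(r'\w+', text)]
print(spans)
# the spans (2, 8), (9, 15), (16, 22) and (27, 39) are exactly
# the ranges given manually to tok1 ... tok4 above
```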
5.2. Generate some attributes¶
One can now generate some attributes. This is simply done using the method Token.setattr(name, dict_of_attributes), with name a string and dict_of_attributes a dictionary of attributes that one would like to preserve for later use.
tok1.setattr('typo',{'type':'adverb','len':len(tok1)})
tok2.setattr('typo',{'type':'adjective','len':len(tok2)})
tok3.setattr('typo',{'type':'name','len':len(tok3)})
tok4.setattr('typo',{'type':'name','len':len(tok4)})
Here we just give some examples; keep in mind that you are totally free to set the attributes the way you desire. The only thing to remember is that they accept only Python dictionaries. In addition, if one tries to overwrite an existing attribute, an AttributeError is raised.
try:
    tok1.setattr('test',[1,2,3])
except TypeError as error:
    print("TypeError{}".format(error.args))
try:
    tok1.setattr('typo',[1,2,3])
except AttributeError as error:
    print("AttributeError{}".format(error.args))
TypeError("'list' object is not a mapping",)
AttributeError('Attribute typo already exist',)
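These two errors mimic what a hand-written guard would do. Here is a minimal sketch of such a guard in plain Python, as an illustration of the observed behavior and not the actual tokenspan implementation:

```python
from collections.abc import Mapping

class GuardedAttributes:
    """Toy class reproducing the two checks observed above."""
    def setattr(self, name, value):
        if not isinstance(value, Mapping):
            # reject anything that is not a dictionary-like object
            raise TypeError("'{}' object is not a mapping".format(type(value).__name__))
        if hasattr(self, name):
            # refuse to overwrite an already existing attribute
            raise AttributeError("Attribute {} already exist".format(name))
        object.__setattr__(self, name, value)

g = GuardedAttributes()
g.setattr('typo', {'type': 'adverb'})
try:
    g.setattr('typo', {'type': 'noun'})
except AttributeError as error:
    print(error)  # Attribute typo already exist
```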
The only way to modify an attribute is by overwriting it directly, that is, one assigns to Token.name_of_attribute, with name_of_attribute the name given during the Token.setattr call.
tok1.typo = {'type':'adverb','len':len(tok1),'extra':True}
Note that there is no danger in producing a method to implement extra attributes. In particular, if one chooses the name of an already existing Token method as the name for an attribute, an AttributeError is raised.
try:
    tok1.setattr('slice',{'test':'should raise an AttributeError'})
except AttributeError as error:
    print("AttributeError{}".format(error.args))
AttributeError('Attribute slice already exist',)
tok1.slice(size=3)
/home/gozat/.local/lib/python3.8/site-packages/tokenspan/tools.py:28: BoundaryWarning: At least one of the boundaries has been modified
warnings.warn(mess, category=BoundaryWarning)
Tokens(4 Token) :
-----------------
rea
eal
all
lly
The complete list of methods already implemented in the Token class is part of the module tokentokens.py.
import tokenspan.tokentokens as toktest_
print(toktest_._token_methods)
['get_subToken', 'attributes', 'keys', 'values', 'items', 'fusion_attributes', 'append', 'remove', 'append_range', 'remove_range', 'start', 'stop', '_append_range', '_remove_range', 'union', 'difference', 'intersection', 'symmetric_difference', 'copy', 'set_string_methods', 'slice', 'split', 'partition', '_prepareTokens']
5.3. Token.set_string_methods()¶
There is a special method attached to the Token class, the set_string_methods() method. It attaches the basic Python string methods to the attribute string_methods.
token = Token(string="SimPle StrIng")
token.set_string_methods()
token.string_methods
{'upper': 'SIMPLE STRING',
'lower': 'simple string',
'swapcase': 'sIMpLE sTRiNG',
'capitalize': 'Simple string',
'casefold': 'simple string',
'isalnum': False,
'isalpha': False,
'isascii': True,
'isdecimal': False,
'isdigit': False,
'isidentifier': False,
'islower': False,
'isnumeric': False,
'isprintable': True,
'isspace': False,
'istitle': False,
'isupper': False}
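The same dictionary can be reproduced in plain Python by looping over the string methods with getattr, which shows there is no magic involved (the list of method names below is our own selection, matching the output above):

```python
s = "SimPle StrIng"
# our own selection of parameterless str methods, matching the output above
method_names = ['upper', 'lower', 'swapcase', 'capitalize', 'casefold',
                'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit',
                'isidentifier', 'islower', 'isnumeric', 'isprintable',
                'isspace', 'istitle', 'isupper']
# call each method on the string and collect the results
string_methods = {name: getattr(s, name)() for name in method_names}
print(string_methods['upper'])    # SIMPLE STRING
print(string_methods['isalnum'])  # False, because of the space
```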
To conclude this part about constructing the personalized attributes and attaching them to the Token, remember the following important remarks:

It is not possible to pass the personalized attributes as parameters during the instantiation of the Token object. That is, one necessarily has to use the setattr method.

setattr is a quite rigid method: it takes a string and a dictionary. Having said that, one is free to put anything one wants in the dictionary …

Only those personalized attributes that have been constructed using the setattr method will remain attached to the Token instance, and later eventually passed to the Tokens constructed on top of them. That is, one might be tempted to create any attribute (and this is obviously allowed in Python) as Token.my_attribute = something, but none of these ones will stay attached to the Token during the tokenization procedure.
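The last remark can be illustrated with a toy class: only attributes recorded in an internal registry survive a copy, while plain Python assignments do not. This is just a sketch of the mechanism, not tokenspan's actual code:

```python
class ToyToken:
    """Toy class: only attributes registered via setattr survive a copy."""
    def __init__(self):
        self._registered = {}
    def setattr(self, name, value):
        # record the attribute in the registry and expose it as a real attribute
        self._registered[name] = value
        object.__setattr__(self, name, value)
    def copy(self):
        new = ToyToken()
        # only registered attributes are carried over to the copy
        for name, value in self._registered.items():
            new.setattr(name, dict(value))
        return new

tok = ToyToken()
tok.setattr('typo', {'type': 'adverb'})
tok.my_attribute = 'something'  # plain assignment, not registered
tok_copy = tok.copy()
print(hasattr(tok_copy, 'typo'))          # True
print(hasattr(tok_copy, 'my_attribute'))  # False
```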
5.4. Accessing the attributes¶
Once the attributes are defined, there are several ways to access them.
One can call directly Token.name_of_attribute, with the correct name of the attribute previously defined.

One can call Token.attributes to get the Python frozenset corresponding to all the names of the personalized attributes previously defined, and then call Token.name_of_attribute with one of these names, or getattr(Token, 'name_of_attribute') if one prefers.

One can call Token.keys() to get a generator of all the names of the personalized attributes.

One can call Token.values() to get a generator of all the dictionaries corresponding to the personalized attributes.

One can call Token.items() to get a generator of tuples, corresponding to the names and the values of the personalized attributes.
The multiplicity of access should help you to design the Token usage the way you prefer.
Note that the methods keys(), values() and items() (all without parameters) are reminiscent of the Python dictionary behavior. Nevertheless, here the outcomes are generators instead of lists.
print(tok1.typo)
print(tok2.typo)
print(list(tok1.values()))
print(list(tok2.values()))
{'type': 'adverb', 'len': 6, 'extra': True}
{'type': 'adjective', 'len': 6}
[{'type': 'adverb', 'len': 6, 'extra': True}]
[{'type': 'adjective', 'len': 6}]
print(tok1.attributes)
print(tok2.attributes)
print(list(tok1.keys()))
print(list(tok2.keys()))
frozenset({'typo'})
frozenset({'typo'})
['typo']
['typo']
print(list(tok1.items()))
print(list(tok2.items()))
[('typo', {'type': 'adverb', 'len': 6, 'extra': True})]
[('typo', {'type': 'adjective', 'len': 6})]
5.5. Token.copy() and its reset_attributes parameter¶
One can work on a Token, feed it with some attributes, and then want to change its attributes without destroying the information created before. This can be done by storing a copy of the Token using the Token.copy() method.
tok1_copy = tok1.copy()
tok1.typo = {'type':'adverb','len':len(tok1)}
print(tok1.typo)
print(tok1_copy.typo)
{'type': 'adverb', 'len': 6}
{'type': 'adverb', 'len': 6, 'extra': True}
There is a parameter reset_attributes for Token.copy(reset_attributes=True) (default is False), which destroys all the attributes of the returned copy.
tok1_copy = tok1.copy(reset_attributes=True)
tok1.typo = {'type':'adverb','len':len(tok1)}
print(tok1.typo)
print(tok1_copy.attributes)
{'type': 'adverb', 'len': 6}
frozenset()
5.6. carry_attributes parameter of the class Token¶
At the instantiation of a Token object, one can specify carry_attributes=False (default is True), in which case the attributes are reset automatically after a copy is made, whatever the reset_attributes value one chooses during the Token.copy() call.
tok4 = Token(string=text,
             ranges=[range(27,39)],
             carry_attributes=False)
print(tok4.attributes)
tok4.setattr('typo',{'type':'name','len':len(tok4)})
print(tok4.attributes)
tok4_copy = tok4.copy()
print("\n")
print(tok4_copy.attributes)
print(tok4.attributes)
frozenset()
frozenset({'typo'})
frozenset()
frozenset({'typo'})
carry_attributes also changes quite a lot the comparison between two Token objects, as we will see in a later chapter.
5.7. Transferring attributes to Tokens¶
Once generated, the Token inside the Tokens keep their attributes.
tokens = Tokens([tok1,tok2,tok3,tok4])
tokens[0].attributes
frozenset({'typo'})
In addition, there are a few functionalities in Tokens that give a picture of the different attributes inside each Token it contains. These are:

attributes_keys, which gives the frozenset of all available attributes, but has no mention of which Token brings which attribute with itself;

attributes_map, which gives a list of all Token indices having a given attribute, in the form of a dictionary;

attributes_values, which gives a list containing the values of each Token attribute, in the form of a dictionary.

We comment on these attributes of the Tokens class below.
tokens.attributes_keys
frozenset({'typo'})
Tokens.attributes_keys contains only one attribute name, since all the Token it contains have the same personalized attribute.
tokens.attributes_map
{'typo': [0, 1, 2]}
In this example, Tokens.attributes_map signifies that the personalized attribute typo is present in the Token at positions 0, 1 and 2 of the Tokens list. It might seem strange that tokens[3] (which corresponds to tok4) has no typo attribute. We will see below why.
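One way to picture what attributes_map computes: for each attribute name, collect the indices of the tokens that carry it. A plain-Python sketch of that bookkeeping (an illustration, not the library's internals):

```python
# per-token attribute sets, as reported by Token.attributes above
per_token = [frozenset({'typo'}), frozenset({'typo'}), frozenset({'typo'}), frozenset()]

# map each attribute name to the list of token indices carrying it
attributes_map = {}
for index, names in enumerate(per_token):
    for name in names:
        attributes_map.setdefault(name, []).append(index)

print(attributes_map)  # {'typo': [0, 1, 2]}
```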
tokens.attributes_values
{'typo': [{'type': 'adverb', 'len': 6},
{'type': 'adjective', 'len': 6},
{'type': 'name', 'len': 6},
{}]}
One sees the values contained in the attribute typo for the four Token in tokens. Clearly, tok4 has not passed its typo attribute, in contrast with the three other ones. The reason is the parameter carry_attributes, which reset all the personalized attributes that tok4 may have had. Internally, the construction of the Tokens instance uses the Token.copy procedure in order to handle such a behavior.
Note that this behavior of destroying the personalized attributes when passing from Token to Tokens might be useful when, for instance, one has completely messed up the attributes and wants to start anew, when the attributes are useless, or when one wants to make a new iteration between Token and Tokens. It is less clear why one may want different behaviors depending on the Token in a Tokens, that is, it is perhaps better that all the Token in a Tokens have the same carry_attributes parameter. In any case, this demonstrates that things are versatile in these classes.
To see the mechanism behind carry_attributes, we now invert the situation: we make tok1 stop carrying its attributes and restore carry_attributes=True for tok4, then generate a new Tokens and call the three above facilities for handling personalized attributes at the Tokens level.
tok1.carry_attributes = False
tok4.carry_attributes = True
tokens = Tokens([tok1,tok2,tok3,tok4])
print(tokens.attributes_map)
print(tokens.attributes_values)
{'typo': [1, 2, 3]}
{'typo': [{}, {'type': 'adjective', 'len': 6}, {'type': 'name', 'len': 6}, {'type': 'name', 'len': 12}]}
That is, the only missing typo attribute is now at position 0: this is indeed tok1.
We nevertheless see that we never destroyed the information about tok4 in the previous process, and here it is the same for tok1: because all Token are passed via copy to the Tokens, their original attributes (including the personalized ones) have never been endangered, whereas their copied representations carry no personalized attributes anymore.
print("In tokens :")
for tok in tokens:
    print(tok.attributes)
print("\nIn each Token:")
for tok in [tok1,tok2,tok3,tok4]:
    print(list(tok.items()))
In tokens :
frozenset()
frozenset({'typo'})
frozenset({'typo'})
frozenset({'typo'})
In each Token:
[('typo', {'type': 'adverb', 'len': 6})]
[('typo', {'type': 'adjective', 'len': 6})]
[('typo', {'type': 'name', 'len': 6})]
[('typo', {'type': 'name', 'len': 12})]
The above attributes_keys, attributes_map and attributes_values should help in understanding the subsequent structure of the Token and of the Tokens. If they are not sufficient, there are dictionary-like methods as well: keys(), values() and items(). As for their Token counterparts, these methods return generators, not lists, so lists have to be constructed by hand, or the generator passed directly to some consumer (for instance a loop).
print(list(tokens.keys()))
print(list(tokens.values()))
print(list(tokens.items()))
['typo']
[[{}, {'type': 'adjective', 'len': 6}, {'type': 'name', 'len': 6}, {'type': 'name', 'len': 12}]]
[('typo', [{}, {'type': 'adjective', 'len': 6}, {'type': 'name', 'len': 6}, {'type': 'name', 'len': 12}])]
5.8. Going back to Token¶
As one is now used to, the way to come back to a Token object is via the Tokens.join method. When it comes to the attributes, there is no choice made by the Tokens object: it simply collects all the personalized attributes and restitutes them in the same way as they were before the Tokens, except that what was associated to tokens[0].perso_attr will end up at the first position of a list generated in the same order as the Token were instantiated in the Tokens object… Perhaps it is better to show examples at this level.
For convenience, we first print again the four typo attributes (as they are in the tokens object, once they have been transmitted and potentially erased by the copy process).
for i,tok in enumerate(tokens):
    try:
        print("tok{}: {}".format(i+1,tok.typo))
    except AttributeError:
        print("tok{}: no typo attribute".format(i+1))
tok1: no typo attribute
tok2: {'type': 'adjective', 'len': 6}
tok3: {'type': 'name', 'len': 6}
tok4: {'type': 'name', 'len': 12}
Let us start with just joining tok1 and tok2.
tok12 = tokens.join(0,2)
tok12.typo
{'type': [{}, 'adjective'], 'len': [{}, 6]}
Then let us join tok2 and tok3, then tok3 and tok4, then tok2 and tok4, and finally tok1 and tok4 (just to manipulate the join method one more time).
print("tok2 and tok3 : {}".format(tokens.join(1,3).typo))
print("tok3 and tok4 : {}".format(tokens.join(2,4).typo))
print("tok2 and tok4 : {}".format(tokens.join(1,4,2).typo))
print("tok1 and tok4 : {}".format(tokens.join(0,4,3).typo))
tok2 and tok3 : {'type': ['adjective', 'name'], 'len': [6, 6]}
tok3 and tok4 : {'type': ['name', 'name'], 'len': [6, 12]}
tok2 and tok4 : {'type': ['adjective', 'name'], 'len': [6, 12]}
tok1 and tok4 : {'type': [{}, 'name'], 'len': [{}, 12]}
And finally let us glue all Token together, in the same order as they appear in the Tokens list.
tok = tokens.join()
tok.typo
{'type': [{}, 'adjective', 'name', 'name'], 'len': [{}, 6, 6, 12]}
Several remarks are in order:

There is no way to remember the position of the attributes before the Tokens.join: they combine in the list in the order they appear in the Tokens, but this may have nothing to do with the position of the Token in the initial string!

When a Token in a Tokens does not have an attribute before the Tokens.join, it is restituted as an empty dictionary {} in the attribute list after the Tokens.join. See the example below, with alternating carry_attributes being True and False.

What you want to do with the resulting list of attributes is up to you. One more time, this is an illustration that Token and Tokens are better seen as tools than as end-to-end tokenizer protocols.
toks = [tok1,tok2,tok3,tok4]
for tok in toks:
    tok.carry_attributes = True
for t1,t2 in zip(toks,toks[1:]+toks[:1]):
    t1.carry_attributes = True
    t2.carry_attributes = False
    tokens = Tokens([tok1,tok2,tok3,tok4])
    tok = tokens.join()
    print(tok.typo)
{'type': ['adverb', {}, 'name', 'name'], 'len': [6, {}, 6, 12]}
{'type': ['adverb', 'adjective', {}, 'name'], 'len': [6, 6, {}, 12]}
{'type': ['adverb', 'adjective', 'name', {}], 'len': [6, 6, 6, {}]}
{'type': [{}, 'adjective', 'name', 'name'], 'len': [{}, 6, 6, 12]}
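The merge rule observed in all the outputs above can be stated compactly: for each inner key, join collects one value per token, with {} as placeholder for a missing one. A sketch of that rule in plain Python (an illustration of the observed behavior, not the actual fusion_attributes implementation):

```python
def fuse(dicts):
    # collect every key appearing in at least one dictionary, preserving order,
    # then build one list per key, with {} as placeholder when the key is missing
    keys = []
    for d in dicts:
        for k in d:
            if k not in keys:
                keys.append(k)
    return {k: [d.get(k, {}) for d in dicts] for k in keys}

# the 'typo' payloads of the four tokens of the first iteration above
payloads = [{'type': 'adverb', 'len': 6}, {},
            {'type': 'name', 'len': 6}, {'type': 'name', 'len': 12}]
print(fuse(payloads))
# {'type': ['adverb', {}, 'name', 'name'], 'len': [6, {}, 6, 12]}
```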
5.9. Complete code¶
As a final illustration, we reproduce the complete code and add some attributes, to see how everything is handled smoothly inside the Token and Tokens machinery.
text = "A really simple string for illustration."
tok1 = Token(string=text,ranges=[range(2,8)])
tok2 = Token(string=text,ranges=[range(9,15)])
tok3 = Token(string=text,ranges=[range(16,22)])
tok4 = Token(string=text,ranges=[range(27,39)])
tok1.setattr('typo',{'type':'adverb','len':len(tok1)})
tok2.setattr('typo',{'type':'adjective','len':len(tok2)})
tok3.setattr('typo',{'type':'name','len':len(tok3)})
tok4.setattr('typo',{'type':'name','len':len(tok4)})
tok1.setattr('extra',{'test':True})
tok2.setattr('extra',{'test':False})
tok1.setattr('compare',{'tok2':0,'tok3':1,'tok4':2})
tok2.setattr('compare',{'tok1':1,'tok3':2,'tok4':0})
tok3.setattr('compare',{'tok1':0,'tok2':2,'tok4':1})
tok4.setattr('compare',{'tok1':1,'tok2':0,'tok3':2})
t1234 = Tokens([tok1,tok2,tok3,tok4])
tok1234 = t1234.join()
t4321 = Tokens([tok4,tok3,tok2,tok1])
tok4321 = t4321.join()
for tok in [tok1234,tok4321]:
    print(str(tok)+' :')
    print("-"*(len(tok)+2))
    for attr in ['typo','extra','compare']:
        print(attr+' : {}'.format(getattr(tok,attr)))
    print("\n")
really simple string illustration :
-----------------------------------
typo : {'type': ['adverb', 'adjective', 'name', 'name'], 'len': [6, 6, 6, 12]}
extra : {'test': [True, False, {}, {}]}
compare : {'tok1': [{}, 1, 0, 1], 'tok3': [1, 2, {}, 2], 'tok4': [2, 0, 1, {}], 'tok2': [0, {}, 2, 0]}
really simple string illustration :
-----------------------------------
typo : {'type': ['name', 'name', 'adjective', 'adverb'], 'len': [12, 6, 6, 6]}
extra : {'test': [{}, {}, False, True]}
compare : {'tok1': [1, 0, 1, {}], 'tok4': [{}, 1, 0, 2], 'tok2': [0, 2, {}, 0], 'tok3': [2, {}, 2, 1]}
Remark that the string of the two different Token is the same, because they share the same ranges, and the ranges conserve the initial order of the parent string.
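This last remark can be checked directly on the ranges: whatever the order of the tokens in the list, reading the substrings in the order of their start positions gives back the same string. A small plain-Python check:

```python
text = "A really simple string for illustration."
ranges = [range(2, 8), range(9, 15), range(16, 22), range(27, 39)]

def read(rs):
    # substrings are always read in the order of their position in the parent string
    return ' '.join(text[r.start:r.stop] for r in sorted(rs, key=lambda r: r.start))

print(read(ranges))                  # really simple string illustration
print(read(list(reversed(ranges))))  # identical, despite the reversed order
```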
from datetime import datetime
print("Last modification {}".format(datetime.now().strftime("%c")))
Last modification Wed Jan 19 19:56:24 2022