Token and Tokens class

Token sub-classes the string class in Python, thus enabling basic string usages (e.g. split, isupper, lower, … see https://docs.python.org/3.8/library/string.html), in addition to allowing extra attributes to be attached on the fly, for later compatibility with other packages. Its main original methods are:

- partition_token(start, stop) : partitions the initial Token into three new Token instances, collected in a unique Tokens instance (see below)
- split_token([(start, end), (start2, end2), ...]) : splits the Token into several instances grouped in a single Tokens object
- slice(step) : slices the initial string into overlapping sub-strings, all grouped in a single Tokens instance
Tokens collects the different Token instances in order to keep the initial Token instance glued together somehow. It also allows going back to Token instances using its original methods:

- undo(start, stop, step) : generates a unique Token from the Tokens elements Tokens[start:stop:step]
- slice(start, stop, step) : slices the list of Token instances and glues them into overlapping strings; the result is still a Tokens instance
Token Objects¶
class Token(Span)
Subclass of the Python string class. It allows manipulating a string as a usual Python object, except that it returns Token instances. Methods specific to Token instances include:

- split(s), which splits the string every time the string ‘s’ appears
- isupper() (for instance), which tells whether the string is uppercase or not
- lower() or upper(), to make the Token lower-/upper-case

See more string methods in the Python standard library documentation.

In addition, having a class associated to a Token allows adding custom attributes and methods at any moment during its use.
__init__¶
| __init__(string='', ranges=None, subtoksep=chr(32), carry_attributes=True)
A Token object is basically a string with a ranges (list of range) position. Its string is extracted from all the intervals defined in the ranges list, and its attributes are:

- Token.string -> a string
- Token.ranges -> a list of ranges
- Token.subtoksep -> a string, preferably of length 1
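As a minimal sketch of these semantics (plain Python, not the library's implementation), the visible string can be pictured as the concatenation of the parent-string intervals selected by `ranges`, joined by `subtoksep`; `token_string` is a hypothetical helper:

```python
# Illustrative sketch (not the library's code): how a Token-like object
# could derive its visible string from `ranges` over a parent string,
# joining the selected sub-strings with `subtoksep`.

def token_string(parent, ranges, subtoksep=chr(32)):
    """Join the sub-strings selected by `ranges`, separated by `subtoksep`."""
    return subtoksep.join(parent[r.start:r.stop] for r in ranges)

parent = "Simple string for this example"
# Two ranges selecting "Simple" and "example":
print(token_string(parent, [range(0, 6), range(23, 30)]))  # Simple example
```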
__repr__¶
| __repr__()
Return the two main arguments (namely the string
and the number of ranges) of a Token
instance in a readable way.
setattr¶
| setattr(name, dic=dict(), **kwargs)
Add an attribute to the Token instance. It can be copied to a new instance using Token.copy().

Call this method as setattr('string_name', dict(a=1, b=2), c=3, d=4), where all arguments except the first one are optional. One can pass the attribute dictionaries either as usual dictionaries, dict(a=1, b=2) or {'a': 1, 'b': 2}, or directly as keyword arguments, c=3, d=4, or both, in which case all the values from the different dictionaries are concatenated. The first argument must be a string, which serves to append the corresponding attribute to the Token instance, e.g. Token.string_name will exist in the above example.

Raises an AttributeError in case the attribute already exists. It is still possible to update the attribute (i.e. it is not protected) in the usual way: Token.string_name.update(a=0, b=1) will override the a and b keys in the Token.string_name dictionary.
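The merging and the already-exists guard can be sketched in plain Python (TokenLike is a hypothetical stand-in, not the library class):

```python
# Hypothetical sketch (not the library's code) of the documented setattr
# behaviour: merge a dictionary argument and keyword arguments into one
# new attribute, refusing to overwrite an attribute that already exists.

class TokenLike:
    def setattr(self, name, dic=None, **kwargs):
        if hasattr(self, name):
            raise AttributeError(f"attribute {name!r} already exists")
        merged = dict(dic or {})   # copy the dictionary argument
        merged.update(kwargs)      # keyword arguments complete it
        self.__dict__[name] = merged

t = TokenLike()
t.setattr('string_name', dict(a=1, b=2), c=3, d=4)
print(t.string_name)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4}
```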
copy¶
| copy(reset_attributes=False)
Returns a copy of the actual Token instance. The Token.attributes created using Token.setattr are carried over to the copy when reset_attributes is False (the default).
set_string_methods¶
| set_string_methods()
Add the basic Python string methods to the attribute string_methods
of the Token.
__eq__¶
| __eq__(token)
Verify whether the actual Token instance and another one have the same attributes.

Raises an Exception in case the attributes are the same but the extra_attributes are not all equal, despite the two instances having the same extra_attributes names. Raises a ValueError when the other object is not a Token instance.
get_subToken¶
| get_subToken(n)
Get the Token associated to the ranges element n (an integer or a slice). Returns a Token. Raises an IndexError in case n is larger than the number of ranges in self.ranges.
subTokens¶
| @property
| subTokens()
Get the Token associated to each element of Token.ranges, in a list. Returns a list of Token. Keeps the attributes in case Token.carry_attributes is True.
fusion_attributes¶
| fusion_attributes(token)
Fuse the attributes of the present instance with those of the given token.
union¶
| union(token)
Quasi-alias for append: returns a new Token instance with Token.ranges = self.append(token).ranges. For compatibility with set terminology.
difference¶
| difference(token)
Quasi-alias for remove: returns a new Token instance with Token.ranges = self.remove(token).ranges. For compatibility with set terminology.
intersection¶
| intersection(token)
Return a new Token whose Token.ranges is the intersection of the given token's ranges with those of the self instance.
symmetric_difference¶
| symmetric_difference(token)
Return a new Token instance whose Token.ranges is the symmetric difference of the given token's ranges with those of the self instance.
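These set-like methods can be pictured as ordinary set algebra on the character positions covered by each Token's ranges; a plain-Python sketch (the `positions` helper is hypothetical, the library works on range lists rather than index sets):

```python
# Conceptual sketch: intersection and symmetric_difference viewed as set
# algebra on the character indices covered by two lists of ranges.

def positions(ranges):
    """Flatten a list of ranges into the set of covered indices."""
    return {i for r in ranges for i in r}

a = positions([range(0, 6)])    # indices 0..5
b = positions([range(4, 10)])   # indices 4..9
print(sorted(a & b))            # intersection -> [4, 5]
print(sorted(a ^ b))            # symmetric difference -> [0, 1, 2, 3, 6, 7, 8, 9]
```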
_prepareTokens¶
| _prepareTokens(ranges, reset_attributes)
Removes empty ranges, keeps the attributes, and constructs the Tokens object.
partition¶
| partition(start, end, remove_empty=False, reset_attributes=False)
Split the Token.string into three Token objects:

- string[:start]
- string[start:end]
- string[end:]

and put all non-empty Token objects in a Tokens instance.

It acts a bit like the str.partition(s) method of the Python string object, but this method takes start and end arguments instead of a string. So in case one wants to split a string into three sub-strings using a string ‘s’, use Token.partition(s) instead, inherited from str.partition(s).

NB: Token.partition(s) has no remove_empty option.
Parameters | Type | Details
---|---|---|
start | int | Starting position of the splitting sequence. |
end | int | Ending position of the splitting sequence. |
remove_empty | bool. Default is False | If True, the empty Token instances are removed from the output. |

Returns | Type | Details
---|---|---|
tokens | Tokens | The Tokens instance containing the resulting Token objects. |
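The partition semantics on a plain string can be sketched as follows (`partition_by_positions` is a hypothetical helper, not the library method):

```python
# Sketch of the documented partition semantics on a plain string:
# cut at (start, end) into the three pieces string[:start],
# string[start:end], string[end:], optionally dropping empty pieces.

def partition_by_positions(s, start, end, remove_empty=False):
    parts = [s[:start], s[start:end], s[end:]]
    return [p for p in parts if p] if remove_empty else parts

print(partition_by_positions("Simple example", 6, 7))
# ['Simple', ' ', 'example']
```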
split¶
| split(cuts, remove_empty=False, reset_attributes=False)
Split a text as many times as there are entities in the cuts list. Returns a Tokens instance.

This is a bit like the str.split(s) method of the Python string object, except one has to feed Token.split_token with a full list of (start, end) tuples instead of the string ‘s’ in str.split(s). If the (start, end) tuples in cuts are given by a regex re.finditer search, the two methods give the same thing. So in case one wants to split a string into several Token instances according to a string ‘s’ splitting procedure, use Token.split(s) instead of Token.split_token([(start, end), ...]).
Parameters | Type | Details
---|---|---|
cuts | a list of (start, end) tuples | Basic usage is to take these cuts from the spans of a re.finditer search. |

Return | Type | Details
---|---|---|
tokens | Tokens | A Tokens instance containing the resulting Token objects. |
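The relation between the cuts list and re.finditer can be sketched in plain Python (`split_by_cuts` is a hypothetical helper, not the library method):

```python
import re

# Sketch relating the documented cuts to re.finditer: splitting on the
# (start, end) spans of each match reproduces str.split for a literal
# separator.

def split_by_cuts(s, cuts):
    """Keep the pieces of `s` lying between consecutive (start, end) cuts."""
    pieces, pos = [], 0
    for start, end in cuts:
        pieces.append(s[pos:start])
        pos = end
    pieces.append(s[pos:])
    return pieces

text = "a-b-c"
cuts = [m.span() for m in re.finditer("-", text)]
print(split_by_cuts(text, cuts))   # ['a', 'b', 'c']
print(text.split("-"))             # same result
```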
slice¶
| slice(start=0, stop=None, size=1, step=1, remove_empty=False, reset_attributes=False)
Cut the Token.string into overlapping sequences of strings of size size by step step, put all these sequences in separate Token objects, and finally put all these objects in a Tokens instance.
Parameters | Type | Details
---|---|---|
size | int | The size of the string in each subsequent Token object. |
step | int | The number of characters skipped from one Token object to the next one. |

Returns | Type | Details
---|---|---|
tokens | Tokens | The Tokens instance containing the overlapping Token objects. |
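The overlapping-window behaviour can be sketched on a plain string (`sliding_windows` is a hypothetical helper mimicking the documented semantics):

```python
# Sketch of the documented slicing: overlapping windows of length `size`,
# moving forward by `step` characters each time.

def sliding_windows(s, size=1, step=1):
    return [s[i:i + size] for i in range(0, len(s) - size + 1, step)]

print(sliding_windows("abcdef", size=3, step=2))  # ['abc', 'cde']
```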
Tokens Objects¶
class Tokens()
A tool class for later use in the Tokenizer class (not documented in this module).
This is mainly a list of Token objects, with additional methods to implement string manipulations and to go back to individual Token instances.
__init__¶
| __init__(tokens=list())
A Tokens instance is just a list of Token instances, stored in the Tokens.tokens attribute (a list). The only verification is that the Token.copy() method works.
__repr__¶
| __repr__()
Representation of the Tokens class, printing the number of Token instances inside the Tokens one, and the concatenated string from all Token instances.
__add__¶
| __add__(tokens)
Add two Tokens instances in the same way list objects can be concatenated: by concatenation of their tokens lists of Token instances. Returns a new Tokens instance.
__str__¶
| __str__()
Return the concatenated string of all the Token instances in the Tokens.tokens attribute. Each Token string is separated from its neighbours by the _NEW_Token_ string.
__getitem__¶
| __getitem__(n)
Return either a new Tokens instance in case a slice is given, or the Token instance corresponding to the position n in case an integer is caught as argument.
__contains__¶
| __contains__(token)
Return True if the given Token is included in one of the Token instances in self.
attributes_keys¶
| @property
| attributes_keys()
Find all the _extra_attributes in all the Token objects composing the Tokens; returns a frozenset.
attributes_map¶
| @property
| attributes_map()
Find all the Token indexes per _extra_attribute; returns a dictionary map.
attributes_values¶
| @property
| attributes_values()
Find all the values of all the _extra_attributes of all the Token objects composing the Tokens object; returns a dictionary {attribute: [list of the attribute's dictionaries, for Token 1, then for Token 2, …]}, with one entry per Token.
keys¶
| keys()
Returns the generator of all the attributes present in the Tokens instance, as given by each Token element of Tokens.
values¶
| values()
Returns the generator of all the attributes values in the Tokens instance, as given by each Token element of Tokens.
has_attribute¶
| has_attribute(attr)
Returns a new Tokens instance, keeping only those Token objects that have the attribute given as parameter.
insert¶
| insert(position, token)
Insert a Token instance into the actual Tokens instance at position position (an integer).
join¶
| join(start=0, stop=None, step=1, reset_attributes=False)
Glue the different Token objects present in the Tokens instance at positions [start:stop:step]. Return a Token instance.

- step = 1 : undo the Token.split_token or Token.partition_token methods
- overlap = step-1 : undo the Token.slice(step) method
Parameters | Type | Details
---|---|---|
start | int, optional. The default is 0. | Starting index in the Tokens.tokens list. |
stop | int, optional. The default is None. | Ending index in the Tokens.tokens list. |
step | int, optional. The default is 1. | The step in the Tokens.tokens list. |
overlap | int, optional. The default is 0. | The number of characters that will be dropped from each Token string when gluing. |
Remark: the reason why glue(step) does not revert the Token.slice(step) process is because one has

```python
token = Token(string=text)
tokens = token.slice(size)
undone = tokens.glue(step=size)
assert undone.string == text[:-(len(text)%size)]
```

for any text (a string) and size (an int). So when len(text)%size == 0 everything goes well, but when there is a remaining string, one has to do:

```python
token = Token(string=text)
tokens = token.slice(size)
undone = tokens.glue(overlap=size-1)
assert undone.string == text
```
Returns | Type | Details
---|---|---|
token | A Token instance. | The container containing the glued string. |
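The remainder arithmetic in the remark above can be checked with plain-Python stand-ins (`sliding_windows` and `glue` are hypothetical helpers mimicking the documented behaviour, not the library methods):

```python
# Worked check of the remark: slicing into windows of length `size` with
# step 1, then gluing every `size`-th window, loses the final
# len(text) % size characters.

def sliding_windows(s, size):
    return [s[i:i + size] for i in range(0, len(s) - size + 1)]

def glue(windows, step):
    return "".join(windows[::step])

text = "abcdefgh"   # len 8
size = 3
windows = sliding_windows(text, size)   # 'abc', 'bcd', ..., 'fgh'
glued = glue(windows, size)             # keeps windows 0 and 3
print(glued)                            # abcdef == text[:-(len(text) % size)]
```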
slice¶
| slice(start=0, stop=None, size=1, step=1)
Glue the different Token objects present in the Tokens.tokens list and return Token objects with overlapping strings among the different Token objects, all grouped together in a Tokens instance.
Parameters | Type | Details
---|---|---|
start | int, optional. The default is 0. | Starting index in the Tokens.tokens list. |
stop | int, optional. The default is None. | Ending index in the Tokens.tokens list. |
size | int, optional. The default is 1. | Size of the span of each slice. |
step | int, optional. The default is 1. | The step in the Tokens.tokens list. |

Returns | Type | Details
---|---|---|
tokens | Tokens | The Tokens instance containing the overlapping Token objects. |