Token and Tokens class

Token sub-classes the Python string class, so the usual string methods (e.g. split, isupper, lower, …; see https://docs.python.org/3.8/library/string.html) remain available, while extra attributes can be attached along the way for later compatibility with other packages. Its main original methods are

  • partition_token(start,stop) : which partitions the initial Token into three new Token instances, collected in a unique Tokens instance (see below)

  • split_token([(start,end),(start2,end2), ...]) : which splits the Token into several Token instances grouped in a single Tokens object

  • slice(step) : which slices the initial string into overlapping sub-strings, all grouped in a single Tokens instance

Tokens collects the different Token instances so that the initial Token instance somehow stays glued together. It also allows going back to Token instances using its own methods :

  • undo(start,stop,step) : which generates a single Token from the Tokens elements Tokens[start:stop:step]

  • slice(start,stop,step) : which slices the list of Token instances and glues them into overlapping strings; the result is still a Tokens instance
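A quick, illustrative sketch of the Token/Tokens round trip, using the partition and join methods documented below (the import of Token is assumed; exact results depend on the implementation):

# assuming Token has been imported from this package
token = Token(string='Simple string for testing')
tokens = token.partition(7, 13)   # three Token pieces, collected in a Tokens instance
token2 = tokens.join()            # glue the pieces back into a single Token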

Token Objects

class Token(Span)

Subclass of the Python string class. It allows manipulating a string as a usual Python object, except that it returns Token instances. Of particular interest for Token instances are the methods :

  • split(s), which splits the string every time the string ‘s’ appears

  • isupper() (for instance), which tells whether the string is uppercase or not

  • lower() or upper(), to make the Token lower-/upper-case. See more string methods on the Python standard library documentation.

In addition, having a class associated with a Token allows adding custom attributes and methods at any moment during its use.

__init__

 | __init__(string='', ranges=None, subtoksep=chr(32), carry_attributes=True)

A Token object is basically a string together with a ranges position (a list of range objects). Its string is extracted from all the intervals defined in the ranges list, and its main attributes are :

  • Token.string -> a string

  • Token.ranges -> a list of ranges

  • Token.subtoksep -> a string, preferably of length 1
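A minimal construction sketch (the import of Token is assumed; the resulting token.string depends on how the ranges and subtoksep are applied):

text = 'Simple string for testing'
token = Token(string=text, ranges=[range(0, 6), range(14, 17)])
# token.ranges    -> [range(0, 6), range(14, 17)]
# token.subtoksep -> chr(32), i.e. a space, by default
# token.string    -> the text covered by the two ranges, glued with subtoksep (illustrative)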

__repr__

 | __repr__()

Return the two main arguments (namely the string and the number of ranges) of a Token instance in a readable way.

__getattr__

 | __getattr__(name)

Add some string methods to the Token objects.

setattr

 | setattr(name, dic=dict(), **kwargs)

Adds an attribute to the Token instance. It can be copied to a new instance using Token.copy().

Call this method as setattr('string_name',dict(a=1,b=2),c=3,d=4), where all arguments except the first one are optional. One can pass the attribute dictionaries either as usual dictionaries, dict(a=1,b=2) or {'a':1, 'b':2}, or directly as keyword arguments, c=3, d=4, or both, in which case all the values from the different dictionaries are concatenated. The first argument must be a string, which serves to append the corresponding attribute to the Token instance, e.g. Token.string_name will exist in the above example.

Raises an AttributeError in case the attribute already exists. It is still possible to update the attribute (i.e. it is not protected) in the usual way : Token.string_name.update(a=0,b=1) will override the a and b keys in the Token.string_name dictionary.
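A short sketch of the call described above (the attribute name string_name is the one used in this documentation; the import of Token is assumed):

token = Token(string='Simple string for testing')
token.setattr('string_name', dict(a=1, b=2), c=3, d=4)
# token.string_name -> {'a': 1, 'b': 2, 'c': 3, 'd': 4}
token.string_name.update(a=0, b=1)    # updating the existing attribute is allowed
# token.setattr('string_name', e=5)   # would raise AttributeError: the attribute already exists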

copy

 | copy(reset_attributes=False)

Returns a copy of the current Token instance. The attributes created with Token.setattr are carried over when reset_attributes is False (the default).

set_string_methods

 | set_string_methods()

Adds the basic Python string methods to the string_methods attribute of the Token.

__eq__

 | __eq__(token)

Verifies whether the current Token instance and another one have the same attributes.

Raises an Exception in case the attributes are the same but the extra_attributes values are not all equal, even though the two instances have the same extra_attributes names.

Raises a ValueError when the other object is not a Token instance.

attributes

 | @property
 | attributes()

Returns the names of the attributes in a frozenset

keys

 | keys()

Returns the keys of the attributes in a generator

values

 | values()

Returns the values of the attributes in a generator

items

 | items()

Returns the (key, value) tuples of the attributes in a generator

get_subToken

 | get_subToken(n)

Gets the Token associated with the ranges element n (an integer or a slice). Returns a Token. Raises an IndexError in case n is larger than the number of ranges in self.ranges.

subTokens

 | @property
 | subTokens()

Gets the Token associated with each element of Token.ranges, as a list. Returns a list of Token instances. Keeps the attributes in case Token.carry_attributes is True.
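A short sketch, assuming a Token built on two ranges (import of Token assumed):

token = Token(string='Simple string for testing', ranges=[range(0, 6), range(14, 17)])
first = token.get_subToken(0)   # Token restricted to the first range
parts = token.subTokens         # list with one Token per element of token.ranges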

fusion_attributes

 | fusion_attributes(token)

Fuses the attributes of the present instance with those of token.

union

 | union(token)

Quasi-alias for append. Returns a new Token instance with Token.ranges = self.append(token).ranges. Provided for compatibility with set terminology.

difference

 | difference(token)

Quasi-alias for remove. Returns a new Token instance with Token.ranges = self.remove(token).ranges. Provided for compatibility with set terminology.

__AmB_BmA

 | __AmB_BmA(token)

Utility to calculate the intersection and symmetric_difference

intersection

 | intersection(token)

Returns a new Token whose Token.ranges is the intersection of the given token's ranges with the ranges of the self instance.

symmetric_difference

 | symmetric_difference(token)

Returns a new Token instance whose Token.ranges is the symmetric difference of the given token's ranges with the ranges of the self instance.
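An illustrative sketch of the set-like operations, assuming both Token instances are built on the same underlying string (the resulting ranges are indicative only):

text = 'Simple string for testing'
t1 = Token(string=text, ranges=[range(0, 10)])
t2 = Token(string=text, ranges=[range(5, 15)])
both = t1.intersection(t2)            # ranges around positions 5 to 10 (illustrative)
either = t1.symmetric_difference(t2)  # ranges around positions 0-5 and 10-15 (illustrative)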

_prepareTokens

 | _prepareTokens(ranges, reset_attributes)

Removes empty ranges, keeps the attributes, and constructs the Tokens object.

partition

 | partition(start, end, remove_empty=False, reset_attributes=False)

Splits the Token.string into three Token objects :

  • string[:start]

  • string[start:end]

  • string[end:]

and puts all non-empty Token objects in a Tokens instance.

It acts a bit like the str.partition(s) method of the Python string object, except that it takes start and end arguments instead of a string. So, in case one wants to split a string into three sub-strings using a string ‘s’, use Token.partition(s), inherited from str.partition(s), instead.

NB : Token.partition(s) has no remove_empty option.

Parameters :

  • start (int) : Starting position of the splitting sequence.

  • end (int) : Ending position of the splitting sequence.

  • remove_empty (bool, default is False) : If True, returns a Tokens instance with only non-empty Token objects; see the bool() method for what a non-empty Token is.

Returns :

  • tokens (Tokens object) : The Tokens object containing the different non-empty Token objects.
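A minimal sketch of partition (import of Token assumed):

text = 'Simple string for testing'
token = Token(string=text)
tokens = token.partition(7, 13)   # pieces covering text[:7], text[7:13] and text[13:]
tokens_ne = token.partition(7, 13, remove_empty=True)   # keeps only the non-empty pieces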

split

 | split(cuts, remove_empty=False, reset_attributes=False)

Splits the text as many times as there are entries in the cuts list. Returns a Tokens instance.

This is a bit like the str.split(s) method of the Python string object, except that one has to feed this method a full list of (start,end) tuples instead of the string ‘s’ of str.split(s). If the (start,end) tuples in cuts are given by a regex re.finditer search, the two methods give the same result. So, in case one wants to split a string into several Token instances according to a splitting string ‘s’, use Token.split(s), inherited from str.split(s), instead of Token.split_token([(start,end), ...]).

Parameters :

  • cuts (list of (start,end) tuples, start/end being integers) : Basic usage is to take these cuts from re.finditer.

Returns :

  • tokens (Tokens object) : A Tokens instance containing all the Token instances of the individual tokens.
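A sketch of the re.finditer usage mentioned above (import of Token assumed; the result is illustrative):

import re

text = 'Simple string for testing'
token = Token(string=text)
cuts = [m.span() for m in re.finditer(r'\s+', text)]   # (start, end) of each blank separator
tokens = token.split(cuts)
# expected to behave like text.split(): one Token per word (illustrative)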

slice

 | slice(start=0, stop=None, size=1, step=1, remove_empty=False, reset_attributes=False)

Cuts the Token.string into overlapping sequences of size size, shifted by step characters from one to the next, puts each of these sequences in a separate Token object, and finally puts all these objects in a Tokens instance.

Parameters :

  • size (int) : The size of the string in each subsequent Token object.

  • step (int) : The number of characters skipped from one Token object to the next one.

Returns :

  • tokens (Tokens object) : The Tokens object containing the different sliced Token objects.
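A short sketch of slicing into overlapping character n-grams (import of Token assumed; outputs illustrative):

token = Token(string='abcdefg')
trigrams = token.slice(size=3, step=1)   # overlapping 3-character pieces: 'abc', 'bcd', 'cde', ...
chunks = token.slice(size=3, step=3)     # contiguous 3-character pieces: 'abc', 'def', ...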

Tokens Objects

class Tokens()

A tool class for later use in the Tokenizer class (not documented in this module). It is mainly a list of Token objects, with additional methods to implement string manipulations and to go back to individual Token instances.

__init__

 | __init__(tokens=list())

A Tokens instance is just a list of Token instances, stored in

  • Tokens.tokens -> a list attribute.

The only verification performed is that the Token.copy() method works.

__repr__

 | __repr__()

Representation of the Tokens class, printing the number of Token instances inside the Tokens one and the concatenated string from all the Token instances.

__len__

 | __len__()

Return the number of Token instances in the Tokens object.

__add__

 | __add__(tokens)

Adds two Tokens instances in the same way lists can be concatenated : by concatenating their tokens lists of Token instances. Returns a new Tokens instance.

__str__

 | __str__()

Returns the concatenated string of all the Token instances in the Tokens.tokens attribute. Each Token string is separated from its neighbours by the _NEW_Token_ string.

__getitem__

 | __getitem__(n)

Returns either a new Tokens instance in case a slice is given, or the Token instance corresponding to the position n in case an integer is given as argument.

__eq__

 | __eq__(tokens)

Return True if all elements of self are the same as in tokens.

__contains__

 | __contains__(token)

Returns True if the given Token is included in one of the Token instances in self.

__bool__

 | __bool__()

Returns True if any of the Token instances in Tokens is True, otherwise False.

copy

 | copy(reset_attributes=False)

Make a copy of the Tokens object

attributes_keys

 | @property
 | attributes_keys()

Finds all the _extra_attributes in all the Token objects composing the Tokens instance; returns a frozenset.

attributes_map

 | @property
 | attributes_map()

Finds all the Token indexes per _extra_attributes; returns a dictionary map.

attributes_values

 | @property
 | attributes_values()

Finds all the values of all the _extra_attributes of all the Token objects composing the Tokens object; returns a dictionary {attribute: [dictionary of this attribute for Token 1, then for Token 2, …]}, with one entry per Token.

keys

 | keys()

Returns the generator of all the attributes present in the Tokens instance, as given by each Token element of Tokens.

values

 | values()

Returns the generator of all the attributes values in the Tokens instance, as given by each Token element of Tokens.

items

 | items()

Returns the generator of the (keys, values) tuples, as given by the keys and values methods.

has_attribute

 | has_attribute(attr)

Returns a new Tokens instance, with only those Token objects having the attribute attr.

append

 | append(token)

Append a Token instance to the actual Tokens instance.

extend

 | extend(tokens)

Extends the actual Tokens instance with a list of tokens.

insert

 | insert(position, token)

Inserts a Token instance into the actual Tokens instance at position position (an integer).

join

 | join(start=0, stop=None, step=1, reset_attributes=False)

Glue the different Token objects present in the Tokens instance at position [start:stop:step].

Return a Token instance.

  • step = 1 : undoes the Token.split_token or Token.partition_token methods

  • overlap = step-1 : undoes the Token.slice(step) method

Parameters :

  • start (int, optional, default 0) : Starting Token from which the gluing starts.

  • stop (int, optional, default None, in which case the gluing stops at the end of the string) : Ending Token at which the gluing stops.

  • step (int, optional, default 1) : The step in the Tokens.tokens list. If step = 1, undoes the Token.split_token method.

  • overlap (int, optional, default 0) : The number of characters dropped from each Token.string before concatenating it to the already glued part. If overlap = step-1 from Token.slice(step), undoes the Token.slice method.

Remark: the reason why join(step) does not revert the Token.slice(step) process is because one has

token = Token(string=text)
tokens = token.slice(size=size)
undone = tokens.join(step=size)
assert undone.string == text[:-(len(text)%size)]

for any text (a string) and size (an int). So when len(text)%size == 0 everything goes well, but when there is a remainder string, one has to do:

token = Token(string=text)
tokens = token.slice(size=size)
undone = tokens.join(overlap=size-1)
assert undone.string == text

Returns :

  • tokens (Token instance) : The container holding the glued string.
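A simple sketch of the split/join round trip (import of Token assumed; the exact glued string is illustrative):

text = 'Simple string for testing'
token = Token(string=text)
tokens = token.split([(6, 7), (13, 14), (17, 18)])   # cut the text at the three spaces
token2 = tokens.join()                               # step = 1 glues the pieces back into one Token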

slice

 | slice(start=0, stop=None, size=1, step=1)

Glues the different Token objects present in the Tokens.tokens list and returns Token objects with overlapping strings among the different Token objects, all grouped together in a Tokens instance.

Parameters :

  • start (int, optional, default 0) : Starting Token from which the gluing starts.

  • stop (int, optional, default None, in which case the gluing stops at the end of the string) : Ending Token at which the gluing stops.

  • size (int, optional, default 1) : Size of the span of each slice.

  • step (int, optional, default 1) : The step in the Tokens.tokens list. If step = 1, gives back the initial Tokens object. If step = n > 1, gives some n-gram Token objects, Token by Token.

Returns :

  • tokens (Tokens object) : Tokens object containing the list of n-gram Token objects.
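A sketch of building word n-grams with Tokens.slice, combining the methods above (import of Token assumed; results illustrative):

import re

text = 'Simple string for testing'
token = Token(string=text)
words = token.split([m.span() for m in re.finditer(r'\s+', text)])   # one Token per word
bigrams = words.slice(size=2, step=1)   # Tokens instance of 2-word (bigram) Token objects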