Token and Tokens class

Token sub-classes the Python string class, so the usual string methods (e.g. split, isupper, lower, …; see https://docs.python.org/3.8/library/string.html) remain available, while extra attributes can be attached along the way for later compatibility with other packages. Its main original methods are

  • partition_token(start,stop) : which partitions the initial Token into three new Token instances, collected in a unique Tokens instance (see below)

  • split_token([(start,end),(start2,end2), ...]) : which splits the Token into several Token instances grouped in a single Tokens object

  • slice(step) : which slices the initial string into overlapping sub-strings, all grouped in a single Tokens instance

Tokens collects the different Token instances so that the initial Token instance somehow stays glued together. It also allows going back to Token instances using its own methods :

  • undo(start,stop,step) : which generates a single Token from the Tokens elements Tokens[start:stop:step]

  • slice(start,stop,step) : which slices the list of Token instances and glues them into overlapping strings; the result is still a Tokens instance
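A quick, illustrative sketch of the Token/Tokens round trip, using the partition and join methods documented below (the import of Token is assumed; exact results depend on the implementation):

# assuming Token has been imported from this package
token = Token(string='Simple string for testing')
tokens = token.partition(7, 13)   # three Token pieces, collected in a Tokens instance
token2 = tokens.join()            # glue the pieces back into a single Token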

Token Objects

class Token(Span)

Subclass of the Python string class. It allows manipulating a string as a usual Python object, except that it returns Token instances. Of particular interest for Token instances are the methods :

  • split(s), which splits the string every time the string ‘s’ appears

  • isupper() (for instance), which tells whether the string is uppercase or not

  • lower() or upper(), to make the Token lower-/upper-case. See more string methods on the Python standard library documentation.

In addition, having a class associated with a Token allows adding custom attributes and methods at any moment during its use.

__init__

 | __init__(string='', ranges=None, subtoksep=chr(32), carry_attributes=True)

A Token object is basically a string together with a ranges position (a list of range objects). Its string is extracted from all the intervals defined in the ranges list, and its main attributes are :

  • Token.string -> a string

  • Token.ranges -> a list of ranges

  • Token.subtoksep -> a string, preferably of length 1
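A minimal construction sketch (the import of Token is assumed; the resulting token.string depends on how the ranges and subtoksep are applied):

text = 'Simple string for testing'
token = Token(string=text, ranges=[range(0, 6), range(14, 17)])
# token.ranges    -> [range(0, 6), range(14, 17)]
# token.subtoksep -> chr(32), i.e. a space, by default
# token.string    -> the text covered by the two ranges, glued with subtoksep (illustrative)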

__repr__

 | __repr__()

Return the two main arguments (namely the string and the number of ranges) of a Token instance in a readable way.

__getattr__

 | __getattr__(name)

Add some string methods to the Token objects.

setattr

 | setattr(name, dic=dict(), **kwargs)

Adds an attribute to the Token instance. It can be copied to a new instance using Token.copy().

Call this method as setattr('string_name',dict(a=1,b=2),c=3,d=4), where all arguments except the first one are optional. One can pass the attribute dictionaries either as usual dictionaries, dict(a=1,b=2) or {'a':1, 'b':2}, or directly as keyword arguments, c=3, d=4, or both, in which case all the values from the different dictionaries are concatenated. The first argument must be a string, which serves to append the corresponding attribute to the Token instance, e.g. Token.string_name will exist in the above example.

Raises an AttributeError in case the attribute already exists. It is still possible to update the attribute (i.e. it is not protected) in the usual way : Token.string_name.update(a=0,b=1) will override the a and b keys in the Token.string_name dictionary.
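A short sketch of the call described above (the attribute name string_name is the one used in this documentation; the import of Token is assumed):

token = Token(string='Simple string for testing')
token.setattr('string_name', dict(a=1, b=2), c=3, d=4)
# token.string_name -> {'a': 1, 'b': 2, 'c': 3, 'd': 4}
token.string_name.update(a=0, b=1)    # updating the existing attribute is allowed
# token.setattr('string_name', e=5)   # would raise AttributeError: the attribute already exists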

copy

 | copy(reset_attributes=False)

Returns a copy of the current Token instance. The attributes created with Token.setattr are carried over when reset_attributes is False (the default).

set_string_methods

 | set_string_methods()

Adds the basic Python string methods to the string_methods attribute of the Token.

__eq__

 | __eq__(token)

Verifies whether the current Token instance and another one have the same attributes.

Raises an Exception in case the attributes are the same but the extra_attributes values are not all equal, even though the two instances have the same extra_attributes names.

Raises a ValueError when the other object is not a Token instance.

attributes

 | @property
 | attributes()

Returns the names of the attributes in a frozenset

keys

 | keys()

Returns the keys of the attributes in a generator

values

 | values()

Returns the values of the attributes in a generator

items

 | items()

Returns the (key, value) tuples of the attributes in a generator

get_subToken

 | get_subToken(n)

Gets the Token associated with the ranges element n (an integer or a slice). Returns a Token. Raises an IndexError in case n is larger than the number of ranges in self.ranges.

subTokens

 | @property
 | subTokens()

Gets the Token associated with each element of Token.ranges, as a list. Returns a list of Token instances. Keeps the attributes in case Token.carry_attributes is True.
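A short sketch, assuming a Token built on two ranges (import of Token assumed):

token = Token(string='Simple string for testing', ranges=[range(0, 6), range(14, 17)])
first = token.get_subToken(0)   # Token restricted to the first range
parts = token.subTokens         # list with one Token per element of token.ranges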

fusion_attributes

 | fusion_attributes(token)

Fuses the attributes of the present instance with those of token.

union

 | union(token)

Quasi-alias for append. Returns a new Token instance with Token.ranges = self.append(token).ranges. Provided for compatibility with set terminology.

difference

 | difference(token)

Quasi-alias for remove. Returns a new Token instance with Token.ranges = self.remove(token).ranges. Provided for compatibility with set terminology.

__AmB_BmA

 | __AmB_BmA(token)

Utility to calculate the intersection and symmetric_difference

intersection

 | intersection(token)

Returns a new Token whose Token.ranges is the intersection of the given token's ranges with the ranges of the self instance.

symmetric_difference

 | symmetric_difference(token)

Returns a new Token instance whose Token.ranges is the symmetric difference of the given token's ranges with the ranges of the self instance.
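An illustrative sketch of the set-like operations, assuming both Token instances are built on the same underlying string (the resulting ranges are indicative only):

text = 'Simple string for testing'
t1 = Token(string=text, ranges=[range(0, 10)])
t2 = Token(string=text, ranges=[range(5, 15)])
both = t1.intersection(t2)            # ranges around positions 5 to 10 (illustrative)
either = t1.symmetric_difference(t2)  # ranges around positions 0-5 and 10-15 (illustrative)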

_prepareTokens

 | _prepareTokens(ranges, reset_attributes)

Removes empty ranges, keeps the attributes, and constructs the Tokens object.

partition

 | partition(start, end, remove_empty=False, reset_attributes=False)

Splits the Token.string into three Token objects :

  • string[:start]

  • string[start:end]

  • string[end:]

and puts all non-empty Token objects in a Tokens instance.

It acts a bit like the str.partition(s) method of the Python string object, except that it takes start and end arguments instead of a string. So, in case one wants to split a string into three sub-strings using a string ‘s’, use Token.partition(s), inherited from str.partition(s), instead.

NB : Token.partition(s) has no remove_empty option.

Parameters :

  • start (int) : Starting position of the splitting sequence.

  • end (int) : Ending position of the splitting sequence.

  • remove_empty (bool, default is False) : If True, returns a Tokens instance with only non-empty Token objects; see the bool() method for what a non-empty Token is.

Returns :

  • tokens (Tokens object) : The Tokens object containing the different non-empty Token objects.
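A minimal sketch of partition (import of Token assumed):

text = 'Simple string for testing'
token = Token(string=text)
tokens = token.partition(7, 13)   # pieces covering text[:7], text[7:13] and text[13:]
tokens_ne = token.partition(7, 13, remove_empty=True)   # keeps only the non-empty pieces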

split

 | split(cuts, remove_empty=False, reset_attributes=False)

Splits the text as many times as there are entries in the cuts list. Returns a Tokens instance.

This is a bit like the str.split(s) method of the Python string object, except that one has to feed this method a full list of (start,end) tuples instead of the string ‘s’ of str.split(s). If the (start,end) tuples in cuts are given by a regex re.finditer search, the two methods give the same result. So, in case one wants to split a string into several Token instances according to a splitting string ‘s’, use Token.split(s), inherited from str.split(s), instead of Token.split_token([(start,end), ...]).

Parameters :

  • cuts (list of (start,end) tuples, start/end being integers) : Basic usage is to take these cuts from re.finditer.

Returns :

  • tokens (Tokens object) : A Tokens instance containing all the Token instances of the individual tokens.
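A sketch of the re.finditer usage mentioned above (import of Token assumed; the result is illustrative):

import re

text = 'Simple string for testing'
token = Token(string=text)
cuts = [m.span() for m in re.finditer(r'\s+', text)]   # (start, end) of each blank separator
tokens = token.split(cuts)
# expected to behave like text.split(): one Token per word (illustrative)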

slice

 | slice(start=0, stop=None, size=1, step=1, remove_empty=False, reset_attributes=False)

Cuts the Token.string into overlapping sequences of size size, shifted by step characters from one to the next, puts each of these sequences in a separate Token object, and finally puts all these objects in a Tokens instance.

Parameters :

  • size (int) : The size of the string in each subsequent Token object.

  • step (int) : The number of characters skipped from one Token object to the next one.

Returns :

  • tokens (Tokens object) : The Tokens object containing the different sliced Token objects.
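A short sketch of slicing into overlapping character n-grams (import of Token assumed; outputs illustrative):

token = Token(string='abcdefg')
trigrams = token.slice(size=3, step=1)   # overlapping 3-character pieces: 'abc', 'bcd', 'cde', ...
chunks = token.slice(size=3, step=3)     # contiguous 3-character pieces: 'abc', 'def', ...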

Tokens Objects

class Tokens()

A tool class for later use in the Tokenizer class (not documented in this module). It is mainly a list of Token objects, with additional methods to implement string manipulations and to go back to individual Token instances.

__init__

 | __init__(tokens=list())

A Tokens instance is just a list of Token instances, stored in

  • Tokens.tokens -> a list attribute.

The only verification performed is that the Token.copy() method works.

__repr__

 | __repr__()

Representation of the Tokens class, printing the number of Token instances inside the Tokens one and the concatenated string from all the Token instances.

__len__

 | __len__()

Return the number of Token instances in the Tokens object.

__add__

 | __add__(tokens)

Adds two Tokens instances in the same way lists can be concatenated : by concatenating their tokens lists of Token instances. Returns a new Tokens instance.

__str__

 | __str__()

Returns the concatenated string of all the Token instances in the Tokens.tokens attribute. Each Token string is separated from its neighbours by the _NEW_Token_ string.

__getitem__

 | __getitem__(n)

Returns either a new Tokens instance in case a slice is given, or the Token instance corresponding to the position n in case an integer is given as argument.

__eq__

 | __eq__(tokens)

Return True if all elements of self are the same as in tokens.

__contains__

 | __contains__(token)

Returns True if the given Token is included in one of the Token instances in self.

__bool__

 | __bool__()

Returns True if any of the Token instances in Tokens is True, otherwise False.

copy

 | copy(reset_attributes=False)

Make a copy of the Tokens object

attributes_keys

 | @property
 | attributes_keys()

Finds all the _extra_attributes in all the Token objects composing the Tokens instance; returns a frozenset.

attributes_map

 | @property
 | attributes_map()

Finds all the Token indexes per _extra_attributes; returns a dictionary map.

attributes_values

 | @property
 | attributes_values()

Finds all the values of all the _extra_attributes of all the Token objects composing the Tokens object; returns a dictionary {attribute: [dictionary of this attribute for Token 1, then for Token 2, …]}, with one entry per Token.

keys

 | keys()

Returns the generator of all the attributes present in the Tokens instance, as given by each Token element of Tokens.

values

 | values()

Returns the generator of all the attributes values in the Tokens instance, as given by each Token element of Tokens.

items

 | items()

Returns the generator of the (keys, values) tuples, as given by the keys and values methods.

has_attribute

 | has_attribute(attr)

Returns a new Tokens instance, with only those Token objects having the attribute attr.

append

 | append(token)

Append a Token instance to the actual Tokens instance.

extend

 | extend(tokens)

Extends the actual Tokens instance with a list of tokens.

insert

 | insert(position, token)

Inserts a Token instance into the actual Tokens instance at position position (an integer).

join

 | join(start=0, stop=None, step=1, reset_attributes=False)

Glue the different Token objects present in the Tokens instance at position [start:stop:step].

Return a Token instance.

  • step = 1 : undoes the Token.split_token or Token.partition_token methods

  • overlap = step-1 : undoes the Token.slice(step) method

Parameters :

  • start (int, optional, default 0) : Starting Token from which the gluing starts.

  • stop (int, optional, default None, in which case the gluing stops at the end of the string) : Ending Token at which the gluing stops.

  • step (int, optional, default 1) : The step in the Tokens.tokens list. If step = 1, undoes the Token.split_token method.

  • overlap (int, optional, default 0) : The number of characters dropped from each Token.string before concatenating it to the already glued part. If overlap = step-1 from Token.slice(step), undoes the Token.slice method.

Remark: the reason why join(step) does not revert the Token.slice(step) process is because one has

token = Token(string=text)
tokens = token.slice(size=size)
undone = tokens.join(step=size)
assert undone.string == text[:-(len(text)%size)]

for any text (a string) and size (an int). So when len(text)%size == 0 everything goes well, but when there is a remainder string, one has to do:

token = Token(string=text)
tokens = token.slice(size=size)
undone = tokens.join(overlap=size-1)
assert undone.string == text

Returns :

  • tokens (Token instance) : The container holding the glued string.
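A simple sketch of the split/join round trip (import of Token assumed; the exact glued string is illustrative):

text = 'Simple string for testing'
token = Token(string=text)
tokens = token.split([(6, 7), (13, 14), (17, 18)])   # cut the text at the three spaces
token2 = tokens.join()                               # step = 1 glues the pieces back into one Token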

slice

 | slice(start=0, stop=None, size=1, step=1)

Glues the different Token objects present in the Tokens.tokens list and returns Token objects with overlapping strings among the different Token objects, all grouped together in a Tokens instance.

Parameters :

  • start (int, optional, default 0) : Starting Token from which the gluing starts.

  • stop (int, optional, default None, in which case the gluing stops at the end of the string) : Ending Token at which the gluing stops.

  • size (int, optional, default 1) : Size of the span of each slice.

  • step (int, optional, default 1) : The step in the Tokens.tokens list. If step = 1, gives back the initial Tokens object. If step = n > 1, gives some n-gram Token objects, Token by Token.

Returns :

  • tokens (Tokens object) : Tokens object containing the list of n-gram Token objects.
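A sketch of building word n-grams with Tokens.slice, combining the methods above (import of Token assumed; results illustrative):

import re

text = 'Simple string for testing'
token = Token(string=text)
words = token.split([m.span() for m in re.finditer(r'\s+', text)])   # one Token per word
bigrams = words.slice(size=2, step=1)   # Tokens instance of 2-word (bigram) Token objects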