Span class

Its main original methods are

  • partition(start, stop): partitions the initial Span into three new Span instances, collected in a list of Span objects

  • split([range(start,stop), range(start2,stop2), ...]): splits the Span into several instances, grouped in a list of Span objects

  • slice(start, stop, size, step): slices the initial string from position start to position stop, moving by step, into sub-strings of length size, all grouped in a list of Span objects

In addition, one can compare two Span objects using the non-overlapping order (< and >) or, when the two Spans overlap, the overlapping order (<= and >=).

Finally, since a Span can be seen as a selection of a set of character positions from the parent string, one can apply the basic set operations to two Spans in order to construct more elaborate Span instances.
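The set view of a Span can be illustrated without the library itself. The sketch below is not the Span implementation: positions and to_ranges are hypothetical helpers that flatten a list of range objects into a set of positions, apply Python's built-in set operators, and regroup the result into contiguous ranges.

```python
def positions(ranges):
    """Flatten a list of ranges into the set of positions they cover."""
    return {i for r in ranges for i in r}


def to_ranges(pos):
    """Group a set of positions back into sorted, contiguous ranges."""
    result = []
    for i in sorted(pos):
        if result and result[-1].stop == i:
            # Position i extends the last range by one character.
            result[-1] = range(result[-1].start, i + 1)
        else:
            result.append(range(i, i + 1))
    return result


a = [range(0, 5), range(10, 15)]
b = [range(3, 12)]
pa, pb = positions(a), positions(b)

print(to_ranges(pa | pb))  # union: [range(0, 15)]
print(to_ranges(pa - pb))  # difference: [range(0, 3), range(12, 15)]
print(to_ranges(pa & pb))  # intersection: [range(3, 5), range(10, 12)]
print(to_ranges(pa ^ pb))  # symmetric difference: [range(0, 3), range(5, 10), range(12, 15)]
```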

Span Objects

class Span()

A Span object is essentially a parent string, named string, together with ranges, a list of range objects giving positions in that string. Its basic usage is through str(Span), which returns the string extracted from all intervals defined in the ranges list, joined by the subtoksep separator.
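As a rough illustration of this string extraction, the following sketch reproduces the idea with plain Python objects. The render function is a hypothetical stand-in, not part of the Span class.

```python
def render(string, ranges, subtoksep=" "):
    """Join the pieces of `string` selected by `ranges` with `subtoksep`."""
    return subtoksep.join(string[r.start:r.stop] for r in ranges)


text = "The quick brown fox"

# Two disjoint ranges select two pieces of the parent string.
print(render(text, [range(4, 9), range(16, 19)]))  # quick fox

# A single range covering everything gives back the parent string.
print(render(text, [range(0, len(text))]))  # The quick brown fox
```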

__init__

 | __init__(string='', ranges=None, subtoksep=chr(32), encoding='utf-8')

Span attributes are

  • Span.string -> a string, default is empty

  • Span.ranges -> a list of ranges, default is None, in which case the ranges are calculated to contain the entire string

  • Span.subtoksep -> a string, preferably of length 1, default is a single space ' ' (chr(32))

  • Span.encoding -> a string, representing the encoding, default is 'utf-8'

_append_range

 | _append_range(r)

Utility that appends a range to self.ranges.

append_range

 | append_range(r)

Append a range object to self.ranges. The range r must be given in absolute coordinates. Return self (append in place). Raise a ValueError in case r is not a range object. Raise a BoundaryWarning in case r has a start or stop attribute outside the bounds of Span.string, in which case these parameters are recalculated to fit Span.string (0 for start, len(Span.string) for stop).
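The boundary handling described above might look like the sketch below. This is an illustration under assumptions, not the library code: clamp_range is a hypothetical helper, and the BoundaryWarning class defined here is a local stand-in for the one named in the text.

```python
import warnings


class BoundaryWarning(UserWarning):
    """Local stand-in for the warning class named in the text."""


def clamp_range(r, string):
    """Return r clipped to the bounds of `string`, warning if it was clipped."""
    if not isinstance(r, range):
        raise ValueError("r must be a range object")
    start = max(r.start, 0)
    stop = min(r.stop, len(string))
    if (start, stop) != (r.start, r.stop):
        warnings.warn("range clipped to the string boundaries", BoundaryWarning)
    return range(start, stop)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clipped = clamp_range(range(-3, 50), "hello world")

print(clipped)      # range(0, 11)
print(len(caught))  # 1
```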

append_ranges

 | append_ranges(ranges)

Append a list of range objects to self.ranges. This method applies append_range several times, so please see its documentation for more details.

_remove_range

 | _remove_range(r)

Utility that removes a range from self.ranges.

remove_range

 | remove_range(r)

Remove the range r from Span.ranges. The range r must be given in absolute coordinates. Return self (remove in place). In case the range r encompasses the complete string, the resulting Span has no ranges left.

remove_ranges

 | remove_ranges(ranges)

Remove a list of range objects from self.ranges. This method applies remove_range several times, so please see its documentation for more details.

__len__

 | __len__()

Return the length of the string associated with the Span

__repr__

 | __repr__()

Return the two main elements (namely the string and the ranges attributes) of a Span instance in a readable way.

__str__

 | __str__()

str(Span) returns the pieces of Span.string selected by each range in Span.ranges (one piece per sub-Span), joined by the Span.subtoksep character.

__hash__

 | __hash__()

Make the Span object hashable, so that it can serve as a set element or a dict key. The hash is built on the uniqueness of the Span object: it is the hash of the string made of the parent string plus the string representation of the instance, including subtoksep. Everything is then passed through hashlib.sha1 and converted with hexdigest.
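A possible reading of this hashing scheme is sketched below. The exact concatenation order and the conversion from hexadecimal digest to integer are assumptions made for illustration, and span_digest and span_hash are hypothetical names, not methods of the class.

```python
import hashlib


def span_digest(string, ranges, subtoksep=" ", encoding="utf-8"):
    """SHA-1 hex digest of parent string + rendered span + separator."""
    rendered = subtoksep.join(string[r.start:r.stop] for r in ranges)
    # Assumption: the three parts are simply concatenated in this order.
    payload = string + rendered + subtoksep
    return hashlib.sha1(payload.encode(encoding)).hexdigest()


def span_hash(string, ranges, subtoksep=" ", encoding="utf-8"):
    """Integer form of the digest, since __hash__ must return an int."""
    return int(span_digest(string, ranges, subtoksep, encoding), 16)


d1 = span_digest("hello world", [range(0, 5)])
d2 = span_digest("hello world", [range(6, 11)])
print(len(d1))   # 40 hexadecimal characters for SHA-1
print(d1 == d2)  # False: same parent string, different ranges
```

Including the parent string in the payload means that two Spans rendering the same text from different parent strings do not collide.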

__contains__

 | __contains__(s)

If s is a Span built on the same string as this instance, check whether the ranges of the two Spans overlap. Otherwise, check whether str(s) is a sub-string of this Span instance; the call to str converts s to a string when it is a Span built on a different string.

__bool__

 | __bool__()

Return True if the Span.ranges is non-empty, otherwise return False

__getitem__

 | __getitem__(n)

Allow integer indexing and slicing of the string of the Span. Return a string.

Note: As for the usual Python string, a slice with positions outside str(Span) returns an empty string, whereas Span[x] with x beyond the last position results in an IndexError.
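The behaviour referred to in the note is that of the built-in str type, which can be checked directly. In this short example a plain string stands in for str(Span).

```python
text = "quick fox"  # stand-in for str(Span)

# Slicing outside the string is silently truncated to an empty string.
out_of_bounds_slice = text[20:30]
print(repr(out_of_bounds_slice))  # ''

# Integer indexing outside the string raises an IndexError.
try:
    text[20]
    index_error_raised = False
except IndexError:
    index_error_raised = True
print(index_error_raised)  # True
```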

get_subSpan

 | get_subSpan(n)

Get the Span associated with the element(s) n of self.ranges (n being an integer or a slice). Return a Span object. Raise an IndexError in case n is larger than the number of ranges in self.ranges.

subSpans

 | @property
 | subSpans()

Get the Span associated to each Span.ranges in a Span object. Return a Span object.

__eq__

 | __eq__(span)

Check whether the current Span instance and another one have the same attributes.

Return a boolean.

Raise a ValueError when one object is not a Span instance.

__add__

 | __add__(span)

If the two Span objects have the same string, return a new Span object with the combined ranges of the two initial ones.

__sub__

 | __sub__(span)

If the two Span objects have the same string, return a new Span object with the ranges of self from which the ranges of span have been removed. Might return an empty Span.

__mul__

 | __mul__(span)

If the two Span objects have the same string, return a new Span object whose ranges are the intersection of the ranges of self and the ranges of span. Might return an empty Span.

__truediv__

 | __truediv__(span)

If the two Span objects have the same string, return a new Span object whose ranges are the symmetric difference of the ranges of self and the ranges of span. Might return an empty Span.

start

 | @property
 | start()

Return the starting position (an integer) of the first range. Makes sense only for a contiguous Span.

stop

 | @property
 | stop()

Return the ending position (an integer) of the last range. Makes sense only for a contiguous Span.

__lt__

 | __lt__(span)

Return True if the Span is entirely on the left of span (the Span object given as parameter). Makes sense only for contiguous Spans.

__gt__

 | __gt__(span)

Return True if the Span is entirely on the right of span (the Span object given as parameter). Makes sense only for contiguous Spans.

__le__

 | __le__(span)

Return True if the Span is partly on the left of span (the Span object given as parameter). Makes sense only for contiguous Spans.

__ge__

 | __ge__(span)

Return True if the Span is partly on the right of span (the Span object given as parameter). Makes sense only for contiguous Spans.
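One way to picture these four comparisons is sketched below, with plain range objects standing in for contiguous Spans (only their start and stop are used). The function names are hypothetical, and the precise conditions, in particular the definition of "partly on the left", are assumptions rather than the library's documented behaviour.

```python
def entirely_left(a, b):
    """a ends before b begins (no overlap): a < b."""
    return a.stop <= b.start


def entirely_right(a, b):
    """a begins after b ends (no overlap): a > b."""
    return a.start >= b.stop


def partly_left(a, b):
    """Assumed reading of a <= b: a starts first and overlaps b."""
    return a.start <= b.start < a.stop <= b.stop


def partly_right(a, b):
    """Assumed reading of a >= b: the mirror image of partly_left."""
    return partly_left(b, a)


a, b, c = range(0, 5), range(3, 8), range(6, 10)
print(entirely_left(a, c))  # True: a and c do not overlap
print(entirely_left(a, b))  # False: a and b overlap
print(partly_left(a, b))    # True: a starts first and overlaps b
print(partly_right(b, a))   # True
```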

union

 | union(span)

Takes a Span object as input and returns a new Span instance whose ranges are the union of the current Span.ranges and span.ranges, where each ranges attribute is seen as a set of positions.

| Parameters | Type | Details |
| --- | --- | --- |
| span | Span object | A Span object with the same parent string (Span.string) as the current instance, and possibly different ranges. |

| Returns | Type | Details |
| --- | --- | --- |
| newSpan | Span object | A Span object with newSpan.ranges = union of Span.ranges and span.ranges. |

| Raises | Details |
| --- | --- |
| ValueError | In case the input is not a Span instance. |
| TypeError | In case span.string is not the same as Span.string. |

difference

 | difference(span)

Takes a Span object as input and returns a new Span instance whose ranges are the difference of the current Span.ranges and span.ranges, where each ranges attribute is seen as a set of positions.

| Parameters | Type | Details |
| --- | --- | --- |
| span | Span object | A Span object with the same parent string (Span.string) as the current instance, and possibly different ranges. |

| Returns | Type | Details |
| --- | --- | --- |
| newSpan | Span object | A Span object with newSpan.ranges = difference of Span.ranges and span.ranges. |

| Raises | Details |
| --- | --- |
| ValueError | In case the input is not a Span instance. |
| TypeError | In case span.string is not the same as Span.string. |

intersection

 | intersection(span)

Takes a Span object as input and returns a new Span instance whose ranges are the intersection of the current Span.ranges and span.ranges, where each ranges attribute is seen as a set of positions.

| Parameters | Type | Details |
| --- | --- | --- |
| span | Span object | A Span object with the same parent string (Span.string) as the current instance, and possibly different ranges. |

| Returns | Type | Details |
| --- | --- | --- |
| newSpan | Span object | A Span object with newSpan.ranges = intersection of Span.ranges and span.ranges. |

| Raises | Details |
| --- | --- |
| ValueError | In case the input is not a Span instance. |
| TypeError | In case span.string is not the same as Span.string. |

symmetric_difference

 | symmetric_difference(span)

Takes a Span object as input and returns a new Span instance whose ranges are the symmetric difference of the current Span.ranges and span.ranges, where each ranges attribute is seen as a set of positions.

| Parameters | Type | Details |
| --- | --- | --- |
| span | Span object | A Span object with the same parent string (Span.string) as the current instance, and possibly different ranges. |

| Returns | Type | Details |
| --- | --- | --- |
| newSpan | Span object | A Span object with newSpan.ranges = symmetric difference of Span.ranges and span.ranges. |

| Raises | Details |
| --- | --- |
| ValueError | In case the input is not a Span instance. |
| TypeError | In case span.string is not the same as Span.string. |

_prepareSpans

 | _prepareSpans(ranges, remove_empty)

Utility that removes empty ranges and constructs a list of Span objects.

partition

 | partition(start, stop, remove_empty=False)

Split the Span.string into three Span objects:

  • string[:start]

  • string[start:stop]

  • string[stop:]

and put these Span objects in a list of Span instances (empty ones are dropped when remove_empty is True).

It acts a bit like the str.partition(s) method of the Python string object, but Span.partition takes start and stop arguments instead of a string.

| Parameters | Type | Details |
| --- | --- | --- |
| start | int | Starting position of the splitting sequence. |
| stop | int | Ending position of the splitting sequence. |
| remove_empty | bool. Default is False | If True, returns a list of Span instances with only non-empty Span objects. See the __bool__() method for non-empty Span. |

| Returns | Type | Details |
| --- | --- | --- |
| spans | list of Span objects | The list object containing the different Span objects. |
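The three-way cut performed by partition can be sketched on plain ranges as follows. This is only an illustration: partition_ranges is a hypothetical helper working on range objects rather than Span instances, and it assumes 0 <= start <= stop <= length.

```python
def partition_ranges(length, start, stop, remove_empty=False):
    """Cut [0, length) into the three ranges before, inside and after [start, stop)."""
    parts = [range(0, start), range(start, stop), range(stop, length)]
    if remove_empty:
        # Keep only the ranges that contain at least one position.
        parts = [r for r in parts if len(r) > 0]
    return parts


text = "hello world"

parts = partition_ranges(len(text), 0, 5)
print([text[r.start:r.stop] for r in parts])  # ['', 'hello', ' world']

parts = partition_ranges(len(text), 0, 5, remove_empty=True)
print([text[r.start:r.stop] for r in parts])  # ['hello', ' world']
```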

split

 | split(cuts, remove_empty=False)

Split a text as many times as there are range entities in the cuts list. Return a list of Span instances.

This is a bit like the str.split(s) method of the Python string object, except that one has to feed Span.split with a full list of range(start,stop) objects instead of the string 's' in str.split(s). If the range(start,stop) objects in cuts are given by a regex re.finditer search on str(Span), the two methods give the same result.

| Parameters | Type | Details |
| --- | --- | --- |
| cuts | a list of range(start,stop) objects, where start and stop are integers | Basic usage is to take these cuts from re.finditer. The start/stop integers are given in the relative coordinate system, that is, in terms of the position in str(Span). |
| remove_empty | bool. Default is False | If True, returns a list of Span instances with only non-empty Span objects. See the __bool__() method for non-empty Span. |

| Returns | Type | Details |
| --- | --- | --- |
| spans | list of Span objects | The list object containing the different Span objects. |
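The correspondence with regular-expression splitting can be checked with a small stand-alone sketch. split_by_cuts is a hypothetical function operating on a plain string, not the Span.split method, and it assumes the cuts are sorted and do not overlap.

```python
import re


def split_by_cuts(text, cuts, remove_empty=False):
    """Return the pieces of `text` lying between consecutive cuts."""
    pieces, cursor = [], 0
    for cut in cuts:
        pieces.append(text[cursor:cut.start])
        cursor = cut.stop
    pieces.append(text[cursor:])
    if remove_empty:
        pieces = [p for p in pieces if p]
    return pieces


text = "one, two,three"
pattern = r",\s*"

# Each match gives a (start, end) pair, turned here into a range object.
cuts = [range(*m.span()) for m in re.finditer(pattern, text)]

print(split_by_cuts(text, cuts))  # ['one', 'two', 'three']
print(re.split(pattern, text))    # ['one', 'two', 'three']
```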

slice

 | slice(start=0, stop=None, size=1, step=1, remove_empty=False)

Cut the Span.string into overlapping sequences of strings of length size, moving by step, put each of these sequences in a separate Span object, and finally put all these objects in a list of Span instances.

| Parameters | Type | Details |
| --- | --- | --- |
| start | int | The relative position where to start slicing the Span. |
| stop | int | The relative position where to stop slicing the Span. |
| size | int | The size of the string in each subsequent Span object. |
| step | int | The number of characters skipped from one Span object to the next one. Characters are counted in str(Span) (relative positions). |

| Returns | Type | Details |
| --- | --- | --- |
| spans | list of Span objects | The list object containing the different Span objects. |
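The sliding-window idea behind slice can be sketched as below. slice_ranges is a hypothetical helper producing range objects; whether the actual method keeps shorter windows at the end of the string is not stated here, so this sketch assumes that only full-size windows are kept.

```python
def slice_ranges(length, start=0, stop=None, size=1, step=1):
    """Ranges of width `size`, moving by `step`, inside [start, stop)."""
    stop = length if stop is None else min(stop, length)
    # Assumption: a window is kept only if it fits entirely before `stop`.
    return [range(i, i + size) for i in range(start, stop - size + 1, step)]


text = "abcdef"

windows = slice_ranges(len(text), size=3, step=2)
print([text[r.start:r.stop] for r in windows])  # ['abc', 'cde']

windows = slice_ranges(len(text), size=3, step=1)
print([text[r.start:r.stop] for r in windows])  # ['abc', 'bcd', 'cde', 'def']
```

With step smaller than size, consecutive windows overlap, which gives character n-grams of the string.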