1. Span User Guide

We present the Span class, which lies at the heart of this library. It models the notion of a section of a parent string, and allows algebraic manipulation of different sections taken from the same parent string.

1.1. Motivations

At its most basic level, the only manipulations one can apply to a string are :

  • extraction of a part of a parent string into a child string, this is the tokenization process

  • insertion or deletion of a child string from a parent string, this is the substitution process

We here focus on the splitting (or extracting, or cutting, or tokenizing, …) process, the substitution process being discussed in an other module of ours.

If one describes any sub-string of a parent string by the range (or interval) of its positions of characters, then extracting one sub-string from this parent string can be seen as selecting an interval of positions inside the complete range of available positions in the parent string. For instance, the string 'Simple string for demonstration and for illustration.' has length 52 and the interval [7,13[ corresponds to the sub-string 'string', as illustrated in the following cartoon

'Simple string for demonstration and for illustration.' # the parent string
'01234567891123456789212345678931234567894123456789512' # the positions

'       string                       for illustration ' # the Span span1
'       789112                       678 412345678951 ' # the ranges

'       string for                       illustration ' # the Span span2
'       789112 456                       412345678951 ' # the ranges

In addition, there is nothing which forbids to take several intervals at once, as for instance the sub-string 'string for illustration' that corresponds to the union of intervals [7,13[, [36,39[ and [40,52[ in the above example.

Thus in practice, to cut a string consists in associating to a given object a parent string and a collection of ranges. This is what we model as a Span, described by

  • a parent string: this is the attribute Span.string

  • a collection of interval: this is the attribute Span.ranges that corresponds to a list of range Python objects

(in fact there are two an other attributes: Span.subtoksep and Span.encoding that are less important, the later being used only to construct a hash of the object)

The motivation of the present library is to represent the tokenization process as the extraction of characters positions that corresponds to sub-string instead of the extraction of sub-string directly as it is usually done in other libraries dealing with Natural Language Processing (NLP). This alternative approach allows much more

  • flexibility, since one can associate an infinite collection of tokens (called Span in this library) to a given parent string, and algebraically manipulate them by union, difference, intersection, … of their positions sets

  • rigor, since e.g. the span1 and span2, having the same string representation 'string for illustration', have no more reason to be equal because their position sets are not the same ; instead there is an order relation between them, since they have a non-empty intersecting sub-set in common (namely 'string illustration')

In addition, one can easilly reconstruct the basic tokenizers (for instance n-grams or char-grams) using the Span class, as we will illustrate in a moment.

The only restriction one has to impose to the Span concept is to forbids overlapping ranges of position. Practically, if an overlap appears, the Span class will construct the union of the underneath ranges. This is not a drastic restriction, since one can always construct as many Span objects as one wants, and attach them to the same parent string in order to compare them.

from tokenspan import Span

string = 'Simple string for demonstration and for illustration.'

1.2. Basic examples

We here illustrate the basic construction of the Span object, then we reconstruct the usual basic tokenizer. note nevertheless that the approach here is not customized, and the use of the companion class iamtokenizing is recommended, see https://framagit.org/nlp/iamtokenizing for more details.

1.2.1. Attributes of the Span class

We start with the explicit construction of the above examples of span1 and span2. If one defines no ranges attributes while instanciating the Span class, this attribute is calculated from the given parent string.

span = Span(string)
span.ranges
[range(0, 53)]

In order to construct the span1, one can construct the ranges attributes manually

span1 = Span(string, ranges=[range(7,13), range(36,39), range(40,52)])
span1
Span('string for illustration', [(7,13),(36,39),(40,52)])

The same for span2

span2 = Span(string, ranges=[range(7,13), range(14,17), range(40,52)])
span2
Span('string for illustration', [(7,13),(14,17),(40,52)])

In passing, note that the attributes subtoksep is used to glue the different sub-strings given by the ranges of Span. By default, Span.subtoksep = chr(32) (the white space of length 1), and it is wise to keep a sub-token separator of length 1, but for illustration we change it here to a more complex pattern

span1_subtoksep = Span(string, 
                       ranges=[range(7,13), range(36,39), range(40,52)],
                       subtoksep=' _SubTokSep_ ')
span1_subtoksep
Span('string _SubTokSep_ for _SubTokSep_ illustration', [(7,13),(36,39),(40,52)])

1.2.2. Child-string str(Span) versus parent-string Span.string representations

What is captured by the Span representation is the collection of ranges, and its string representation, whereas Span.string keeps the parent string.

print(span1.ranges)
print(str(span1))
print(span1.string)
[range(7, 13), range(36, 39), range(40, 52)]
string for illustration
Simple string for demonstration and for illustration.
print(span1_subtoksep.ranges)
print(str(span1_subtoksep))
print(span1_subtoksep.string)
[range(7, 13), range(36, 39), range(40, 52)]
string _SubTokSep_ for _SubTokSep_ illustration
Simple string for demonstration and for illustration.

1.2.3. Different levels of equality

Since Python refers to the objects via reference, the Span.string is really the same object among the different instances of Span if they are constructed accordingly.

span2.string == span1.string
True

In the same way, since str(Span) is a string, the equality among the two resulting objects is the usual comparison of string

str(span1) == str(span2)
True

In addition, the equality of Span object verifies that Span.string and Span.ranges and Span.subtoksep are the same

span1 == span2
False
span1 == span1_subtoksep
False

1.2.4. subSpans attributes

In case there are several ranges in the Span.ranges attributes, the subSpans parameters constructs the different instances of Span corresponding to the sub-parts, and collect them in a list.

span1.subSpans
[Span('string', [(7,13)]),
 Span('for', [(36,39)]),
 Span('illustration', [(40,52)])]

It discards the use of subtoksep, …

span1_subtoksep.subSpans
[Span('string', [(7,13)]),
 Span('for', [(36,39)]),
 Span('illustration', [(40,52)])]

… despite it is still there.

span1_subtoksep.subSpans[0].subtoksep
' _SubTokSep_ '

1.3. Construction of basic Tokenizer from Span

Let us construct the basic tokenizers one usually encounters in NLP, namely the n-gram (mobile window of n contiguous words) and char-gram (mobile window of n contiguous characters).

1.3.1. Construction of n-grams

To recover the basic n-grams construction, we have to find a way to construct the ranges of useless (or usefull, depending on the strategy of keeping/excluding the sub-strings from the parent string) parts of the parent string. This is done using the re.finditer matches and the REGular EXpressions. Below we use it on the usefull parts, that is any contiguous portions of alpha-numerics characters.

from re import finditer 
ranges = [range(f.start(), f.end()) for f in finditer('\w+', string)]
ngrams = Span(string, ranges)
ngrams
Span('Simple string for demonstration and for illustration', [(0,6),(7,13),(14,17),(18,31),(32,35),(36,39),(40,52)])

It results a single Span, since one knows that the REGEX are never overlapping. In order to construct n-grams (below example is 2-grams) that are overlapping sub-tokens of the parent string, one simply has to construct independent Span instance, each of them being attached to the same parent string. This is easy using the subSpans constructor, which does that automatically, and to join the succesive sub-spans using the union operation + (that we will deal with later).

ngrams_21 = [s1+s2 for s1, s2 in zip(ngrams.subSpans[:-1], ngrams.subSpans[1:])]
ngrams_21
[Span('Simple string', [(0,6),(7,13)]),
 Span('string for', [(7,13),(14,17)]),
 Span('for demonstration', [(14,17),(18,31)]),
 Span('demonstration and', [(18,31),(32,35)]),
 Span('and for', [(32,35),(36,39)]),
 Span('for illustration', [(36,39),(40,52)])]

If one prefers, one can directly constructs the ranges, and pass them from the beginning

ranges_ngrams = [[r1,r2] for r1,r2 in zip(ranges[:-1], ranges[1:])]
ngrams_22 = [Span(string, ranges=r) for r in ranges_ngrams]
ngrams_22
[Span('Simple string', [(0,6),(7,13)]),
 Span('string for', [(7,13),(14,17)]),
 Span('for demonstration', [(14,17),(18,31)]),
 Span('demonstration and', [(18,31),(32,35)]),
 Span('and for', [(32,35),(36,39)]),
 Span('for illustration', [(36,39),(40,52)])]

The two constructions are equivalent

for s1, s2 in zip(ngrams_21, ngrams_22):
    print(s1==s2)
True
True
True
True
True
True

1.3.2. Span.start and Span.stop parameters

One might be unsatisfied by the presence of discontiguous ranges defining each span in the above ngrams construction. For instance, one may want to have the first token being 'Simple string from range [0,13[ with the empty space directly inside the string, and not given by the subtoksep attribute.

This can be realized using the Span.start and Span.stop attributes, representing the first and last positions of the ranges of the Span irrespective of the number of ranges. Said differently, one has to keep in mind there is no verification of the number of ranges element before calculating the start and stop attributes. For instance

print(ngrams_22[0].start, ngrams_22[0].stop)
0 13

and so to convert is quite easy

ngrams_23 = [Span(s.string, ranges=[range(s.start, s.stop)]) for s in ngrams_22]
ngrams_23
[Span('Simple string', [(0,13)]),
 Span('string for', [(7,17)]),
 Span('for demonstration', [(14,31)]),
 Span('demonstration and', [(18,35)]),
 Span('and for', [(32,39)]),
 Span('for illustration', [(36,52)])]

Span.start and Span.stop will play a prominent role when we will discuss the ordering relation among different instances of Span associated to the same parent string later in this introduction.

1.3.3. Span.cuts

In order to help making n-grams and tokens, the method split has been designed. Let us try it

spans = span.split(cuts=ranges)
spans
[Span('', [(0,0)]),
 Span('Simple', [(0,6)]),
 Span(' ', [(6,7)]),
 Span('string', [(7,13)]),
 Span(' ', [(13,14)]),
 Span('for', [(14,17)]),
 Span(' ', [(17,18)]),
 Span('demonstration', [(18,31)]),
 Span(' ', [(31,32)]),
 Span('and', [(32,35)]),
 Span(' ', [(35,36)]),
 Span('for', [(36,39)]),
 Span(' ', [(39,40)]),
 Span('illustration', [(40,52)]),
 Span('.', [(52,53)])]

One sees that split separates the initial span each time a range appear in the cuts parameter, without detroying information. This might not be the most suitable case for you, since one usually has to filter the irrelevant part of the splitting. Here the strategy could be to remove the empty Span (defined as having no child-string representation), the empty space spans and the non-alphanumerics span … but it is far better to use the previous approach, and to feed a new Span instance directly with the outcome of the REGEX matches…

In fact, here, and because we used REGEX to capture a pattern of contiguous alphanumeric characters as the relevant spans, one can simply capture half of the Span

relevant_spans = spans[1::2]
relevant_spans
[Span('Simple', [(0,6)]),
 Span('string', [(7,13)]),
 Span('for', [(14,17)]),
 Span('demonstration', [(18,31)]),
 Span('and', [(32,35)]),
 Span('for', [(36,39)]),
 Span('illustration', [(40,52)])]

If one is lost with the starting element of the spans list being empty or not, one can use the remove_empty=True parameter in split, which remove empty Span.

span.split(cuts=ranges, remove_empty=True)
[Span('Simple', [(0,6)]),
 Span(' ', [(6,7)]),
 Span('string', [(7,13)]),
 Span(' ', [(13,14)]),
 Span('for', [(14,17)]),
 Span(' ', [(17,18)]),
 Span('demonstration', [(18,31)]),
 Span(' ', [(31,32)]),
 Span('and', [(32,35)]),
 Span(' ', [(35,36)]),
 Span('for', [(36,39)]),
 Span(' ', [(39,40)]),
 Span('illustration', [(40,52)]),
 Span('.', [(52,53)])]

An other strategy in our specific case would have been to filter on the length of the Span (defined as the length of its child-string representation)

relevant_spans = [s for s in spans if len(s)>1]
relevant_spans
[Span('Simple', [(0,6)]),
 Span('string', [(7,13)]),
 Span('for', [(14,17)]),
 Span('demonstration', [(18,31)]),
 Span('and', [(32,35)]),
 Span('for', [(36,39)]),
 Span('illustration', [(40,52)])]

That way, one quite easilly filters the different spans extracted at the previous step. An other usefull filter is the string comparison. For instance, to withdraw all the spans that contains the character 'o' is intuitive

relevant_spans_without_o = [s for s in relevant_spans if 'o' not in s]
relevant_spans_without_o
[Span('Simple', [(0,6)]), Span('string', [(7,13)]), Span('and', [(32,35)])]

1.3.4. Char-grams and Span.slice

The char-grams are even easier to implement, since the method slice does it for us, with parameters :

  • start: the character number where the slice starts

  • stop: the character number where the slice stops

  • size: the number of characters inside an outcoming Span

  • step: the number of characters skipped from one Span to the next one in the outcome

The outcome is a list of Span.

span.slice(start=0, stop=8, size=3, step=1)
[Span('Sim', [(0,3)]),
 Span('imp', [(1,4)]),
 Span('mpl', [(2,5)]),
 Span('ple', [(3,6)]),
 Span('le ', [(4,7)]),
 Span('e s', [(5,8)])]

Note that .slice applies straightforwardly even on already truncated string. Note the starting and stoping positions are calculated inside the child-string representation. For instance

span1.slice(size=5, stop=8)
[Span('strin', [(7,12)]),
 Span('tring', [(8,13)]),
 Span('ring ', [(9,13),(36,36)]),
 Span('ing f', [(10,13),(36,37)])]

Note that the ranges are cutted to give each child-string of the Span instance the correct length (this is the reason of the appearance of zero length range(36,36) for Span('ring ', [(9,13),(36,36)]), and this is also why one should not take a subtoksep of length higher than 1, because ranges = [range(9,13), range(36,36)] signifies that there are 13-9=4 characters plus one subtoksep character in this Span, while range(36,36) selects no string in the parent-string of this Span ; compare with the last Span of the above example, and change eventually the subtoksep to '#' character to see it in action).

Note also that a BoundaryWarning easilly raises while using this method, since the default values are overwriten to start and stop on the size of the str(Span).

1.4. Order relation among Span

Span object can be ordered, providing that one keeps in mind

  • the order is only relevant for contiguous Span, a contiguous Span being a Span with a single range, that is, this is a Span for which start and stop attribute make sense

  • the order is only relevant for two Span attached to the same string. This is somehow less sensible than the previous remark, since comparison of Spans would result in a TypeError in case their Span.string are not the same.

Before entering into the details of the comparison, let us recall all the Span we constructed above

'Simple string for demonstration and for illustration.' # the Span span
'01234567891123456789212345678931234567894123456789512' # the positions

'Simple string for demonstration and for illustration.' # the Span ngrams
'012345 789112 456 8921234567893 234 678 412345678951 ' # the ranges

In particular, the span ngrams contains 7 ranges, and we can call sub-spans using the subSpans attribute.

1.4.1. < and > relations for non overlapping Span

The Span class has been designed for latin languages (though it might work for any language that can be encoded in an alphabet), and the reading sense is from left to right.

  • span1 < span2 means that the last position of span1 is on the left of the first position of span2 ; said differently span1 is entirely read before span2 is read

  • span1 > span2 means that the first position of span1 is on the right of the last position of span2 ; said differently span2 is entirely read before span1 is read

In particular, these definitions imply that :

  • if either span1 < span2 or span1 > span2, they do not overlap

  • span1 < span2 being false does not mean that span1 > span2 : the two spans may overlap

  • span1 > span2 being false does not mean that span1 < span2 : the two spans may overlap

Let us illustrate this :

# not overlapping Span
print(ngrams.subSpans[0] < ngrams.subSpans[1])
print(ngrams.subSpans[1] > ngrams.subSpans[0])
print(ngrams.subSpans[2] < ngrams.subSpans[4])
print(ngrams.subSpans[4] > ngrams.subSpans[2])
True
True
True
True

To illustrate the strict order < and > on overlapping Span, let us cartoon the instances span1 and span2 from above

'Simple string for demonstration and for illustration.' # the Span span
'01234567891123456789212345678931234567894123456789512' # the positions

'       string                       for illustration ' # the Span span1
'       789112                       678 412345678951 ' # the ranges

'       string for                       illustration ' # the Span span2
'       789112 456                       412345678951 ' # the ranges

and recall the construction of span1.start = 7, span1.stop = 51 and the same for span2: when it is about order comparison, a Span object associated to several ranges behaves as a monobloc entity from its starting to ending positions.

Now the order of these two Span entities:

# overlapping Span
print(span1 < span2)
print(span1 > span2)
print(span2 < span1)
print(span2 > span1)
False
False
False
False

1.4.2. <= and >= relations

For overlapping spans, the orders of the Span can either be

  • span1 <= span2 if there are possibly non-overlapping part of span1 on the left of span2 ; said differently the left-most part of the union of span1 and span2 belongs to span1 ; said differently span1 is partly read before span2 is read

  • span1 >= span2 if there are possibly non-overlapping part of span1 on the right of span2 ; said differently the right-most part of the union of span1 and span2 belongs to span1 ; said differently span2 is partly read before span1 is read

From the above definition, note that

  • if both span1 >= span2 and span1 <= span2 were true, there is no reason why span1 == span2 should be true. In fact we defined the overlapping orders <= and >= to be independent to each other: there is no way that span1 >= span2 and span1 <= span2 returns True

There is no difficulty in interpretation, as illustrated with span1 and the Span including the string 'documentation' from ngrams : documentation is on the right of span1 and so both documentation is on the right of span1 (documentation >= span1) and span1 is on the left of demonstration (span1 <= demonstration). To insist one more time : span1 starts before demonstration, and there are overlapping part between the two Span (this is counter-intuitive if one forgets the roles of span1.start and span1.stop in the ordering relations). Those are the only true orders in presence.

# overlapping span with no common boundary
demonstration = ngrams.subSpans[3]
print(demonstration < span1)
print(demonstration > span1)
print(span1 < demonstration)
print(span1 > demonstration)
print(demonstration <= span1)
print(demonstration >= span1)
print(span1 <= demonstration)
print(span1 >= demonstration)
False
False
False
False
False
True
True
False

The same easiness of interpretation is present in the case of one boundary in common, as e.g.

# overlapping span with left common boundary
string = ngrams.subSpans[1]
print(string < span1)
print(string > span1)
print(string <= span1)
print(string >= span1)
print(span1 < string)
print(span1 > string)
print(span1 <= string)
print(span1 >= string)
False
False
True
False
False
False
False
True

for left boundary in common, and

# overlapping span with right common boundary
illustration = ngrams.subSpans[-1]
print(illustration < span1)
print(illustration > span1)
print(illustration <= span1)
print(illustration >= span1)
print(span1 < illustration)
print(span1 > illustration)
print(span1 <= illustration)
print(span1 >= illustration)
False
False
False
True
False
False
True
False

for right boundary in commom. In contrary, if the two boundaries are in common, there is no order relation between the different Span

# overlapping Span with two common boundaries
print(span1 <= span2)
print(span1 >= span2)
print(span2 <= span1)
print(span2 >= span1)
False
False
False
False

1.4.3. Complete order of Span on a parent string

The interest in this order is that, whatever the two choosen Spans associated with a common string, there is one and only one valid assumption, either:

  • span1 > span2: the two spans do not overlap, and span1 is entirely on the right of span2,

  • span1 < span2: no overlap, span1 is entirely on the left of span2,

  • span1 >= span2: overlapping string, with span1 finishing later than span2 in the reading order,

  • span1 <= span2: overlapping string, with span2 finishing later than span1 in the reading order,

and the order is dependent of the order of the Spans, that is, we should not try to interpret the outcome of span1 <= span2 and span2 <= span1, that just signifies that span1.start == span2.start.

Recall that the order comparison takes into account only the complete size of the Span, that is, it compares only the start and stop attributes of the Span, disregarding the possibility of non-contiguous Span. Said differently, span1 above is treated (in terms of comparisons) in the same way as span2 (and this is also why one has both span1 <= span2 and span2 <= span1 in order to preserve the unique order relations).

1.5. Algebraic manipulations of Span

Since the Span object represents a collection of characters positions from a parent string, and because a position cannot appear more than once (recall overlapping range in Span.ranges are forbidden), one can treat this collection as a mathematical set. So the basic set operations of union, difference, intersection, and symmetric division are allowed.

Here are their results, illustrated from span1 and span2

1.5.1. Union

The complete set of all positions in span1 plus all positions in span2. This operation is symmetric.

# union
span1 + span2
# equals to span1.union(span2), span2+span1, span2.union(span1)
Span('string for for illustration', [(7,13),(14,17),(36,39),(40,52)])

1.5.2. Difference

The portion of the parent string in span1 that is absent in span2 is represented as span1.difference(span2). Note this is not a symmetric operation (warn the Span('for') with different ranges…)

# difference
span1 - span2
# equals to span1.difference(span2)
Span('for', [(36,39)])
# difference is not symmetric
span2.difference(span1)
# equals to span2 - span1
Span('for', [(14,17)])

1.5.3. Intersection

The ranges which are common to both span1 and span2 can be found using span1 * span2 or span1.intersection(span2). This operation is symmetric span1 * span2 == span2 * span1 is True.

# intersection
span1 * span2
# equals to span1.intersection(span2), span2*span1, span2.intersection(span1)
Span('string illustration', [(7,13),(40,52)])

1.5.4. Symmetric difference

The symmetric difference represent the union minus the intersection of the two Span. It is repsented by the division operator /, but keep in mind this relation is symmetric.

# symmetric_difference
span1/span2
# equals to span1.symmetric_difference(span2), span2/span1, ...
Span('for for', [(14,17),(36,39)])

1.6. Conclusion

The Span object is a convenient abstraction of the processus of cutting a string into sub-parts. Instead of constructing all string possibilities before feeding them to the next stage of the NLP treatment, one can apply some construction steps on their characters positions, which in principle is far more efficient in terms of algorithms. In addition to reconstruct the usual tokens (e.g. n-grams and char-grams for instance), the main advantages of the Span class is its habilities to construct as many children string as one wants, starting from a unique parent string. Using the four order relations <, >, <= and >= among all these children, as well as the infinite possibilities of combining these children strings using the +, -, * and / operations allow overwhelming new routes of manipulating a string in an abstract way.

from datetime import datetime
print("Last modification {}".format(datetime.now().strftime("%c")))
Last modification Fri Jan 21 21:48:31 2022