1. Span
User Guide¶
We present the Span
class, which lies at the heart of this library. It models the notion of a section of a parent string, and allows algebraic manipulation of different sections taken from the same parent string.
1.1. Motivations¶
At its most basic level, the only manipulations one can apply to a string are :
extraction of a part of a parent string into a child string, this is the tokenization process
insertion or deletion of a child string from a parent string, this is the substitution process
We here focus on the splitting (or extracting, or cutting, or tokenizing, …) process, the substitution process being discussed in an other module of ours.
If one describes any sub-string of a parent string by the range (or interval) of its positions of characters, then extracting one sub-string from this parent string can be seen as selecting an interval of positions inside the complete range of available positions in the parent string. For instance, the string 'Simple string for demonstration and for illustration.'
has length 52 and the interval [7,13[
corresponds to the sub-string 'string'
, as illustrated in the following cartoon
'Simple string for demonstration and for illustration.' # the parent string
'01234567891123456789212345678931234567894123456789512' # the positions
' string for illustration ' # the Span span1
' 789112 678 412345678951 ' # the ranges
' string for illustration ' # the Span span2
' 789112 456 412345678951 ' # the ranges
In addition, there is nothing which forbids to take several intervals at once, as for instance the sub-string 'string for illustration'
that corresponds to the union of intervals [7,13[
, [36,39[
and [40,52[
in the above example.
Thus in practice, to cut a string consists in associating to a given object a parent string and a collection of ranges. This is what we model as a Span
, described by
a parent string: this is the attribute
Span.string
a collection of interval: this is the attribute
Span.ranges
that corresponds to a list ofrange
Python objects
(in fact there are two an other attributes: Span.subtoksep
and Span.encoding
that are less important, the later being used only to construct a hash of the object)
The motivation of the present library is to represent the tokenization process as the extraction of characters positions that corresponds to sub-string instead of the extraction of sub-string directly as it is usually done in other libraries dealing with Natural Language Processing (NLP). This alternative approach allows much more
flexibility, since one can associate an infinite collection of tokens (called
Span
in this library) to a given parent string, and algebraically manipulate them by union, difference, intersection, … of their positions setsrigor, since e.g. the
span1
andspan2
, having the same string representation'string for illustration'
, have no more reason to be equal because their position sets are not the same ; instead there is an order relation between them, since they have a non-empty intersecting sub-set in common (namely'string illustration'
)
In addition, one can easilly reconstruct the basic tokenizers (for instance n-grams or char-grams) using the Span
class, as we will illustrate in a moment.
The only restriction one has to impose to the Span
concept is to forbids overlapping ranges of position. Practically, if an overlap appears, the Span
class will construct the union of the underneath ranges. This is not a drastic restriction, since one can always construct as many Span
objects as one wants, and attach them to the same parent string in order to compare them.
from tokenspan import Span
string = 'Simple string for demonstration and for illustration.'
1.2. Basic examples¶
We here illustrate the basic construction of the Span
object, then we reconstruct the usual basic tokenizer. note nevertheless that the approach here is not customized, and the use of the companion class iamtokenizing
is recommended, see https://framagit.org/nlp/iamtokenizing for more details.
1.2.1. Attributes of the Span
class¶
We start with the explicit construction of the above examples of span1
and span2
. If one defines no ranges
attributes while instanciating the Span
class, this attribute is calculated from the given parent string.
span = Span(string)
span.ranges
[range(0, 53)]
In order to construct the span1
, one can construct the ranges
attributes manually
span1 = Span(string, ranges=[range(7,13), range(36,39), range(40,52)])
span1
Span('string for illustration', [(7,13),(36,39),(40,52)])
The same for span2
span2 = Span(string, ranges=[range(7,13), range(14,17), range(40,52)])
span2
Span('string for illustration', [(7,13),(14,17),(40,52)])
In passing, note that the attributes subtoksep
is used to glue the different sub-strings given by the ranges of Span
. By default, Span.subtoksep = chr(32)
(the white space of length 1), and it is wise to keep a sub-token separator of length 1, but for illustration we change it here to a more complex pattern
span1_subtoksep = Span(string,
ranges=[range(7,13), range(36,39), range(40,52)],
subtoksep=' _SubTokSep_ ')
span1_subtoksep
Span('string _SubTokSep_ for _SubTokSep_ illustration', [(7,13),(36,39),(40,52)])
1.2.2. Child-string str(Span)
versus parent-string Span.string
representations¶
What is captured by the Span
representation is the collection of ranges, and its string representation, whereas Span.string
keeps the parent string.
print(span1.ranges)
print(str(span1))
print(span1.string)
[range(7, 13), range(36, 39), range(40, 52)]
string for illustration
Simple string for demonstration and for illustration.
print(span1_subtoksep.ranges)
print(str(span1_subtoksep))
print(span1_subtoksep.string)
[range(7, 13), range(36, 39), range(40, 52)]
string _SubTokSep_ for _SubTokSep_ illustration
Simple string for demonstration and for illustration.
1.2.3. Different levels of equality¶
Since Python refers to the objects via reference, the Span.string
is really the same object among the different instances of Span
if they are constructed accordingly.
span2.string == span1.string
True
In the same way, since str(Span)
is a string, the equality among the two resulting objects is the usual comparison of string
str(span1) == str(span2)
True
In addition, the equality of Span
object verifies that Span.string
and Span.ranges
and Span.subtoksep
are the same
span1 == span2
False
span1 == span1_subtoksep
False
1.2.4. subSpans
attributes¶
In case there are several ranges in the Span.ranges
attributes, the subSpans
parameters constructs the different instances of Span
corresponding to the sub-parts, and collect them in a list.
span1.subSpans
[Span('string', [(7,13)]),
Span('for', [(36,39)]),
Span('illustration', [(40,52)])]
It discards the use of subtoksep
, …
span1_subtoksep.subSpans
[Span('string', [(7,13)]),
Span('for', [(36,39)]),
Span('illustration', [(40,52)])]
… despite it is still there.
span1_subtoksep.subSpans[0].subtoksep
' _SubTokSep_ '
1.3. Construction of basic Tokenizer from Span
¶
Let us construct the basic tokenizers one usually encounters in NLP, namely the n-gram (mobile window of n contiguous words) and char-gram (mobile window of n contiguous characters).
1.3.1. Construction of n-grams¶
To recover the basic n-grams construction, we have to find a way to construct the ranges of useless (or usefull, depending on the strategy of keeping/excluding the sub-strings from the parent string) parts of the parent string. This is done using the re.finditer matches and the REGular EXpressions. Below we use it on the usefull parts, that is any contiguous portions of alpha-numerics characters.
from re import finditer
ranges = [range(f.start(), f.end()) for f in finditer('\w+', string)]
ngrams = Span(string, ranges)
ngrams
Span('Simple string for demonstration and for illustration', [(0,6),(7,13),(14,17),(18,31),(32,35),(36,39),(40,52)])
It results a single Span, since one knows that the REGEX are never overlapping. In order to construct n-grams (below example is 2-grams) that are overlapping sub-tokens of the parent string, one simply has to construct independent Span
instance, each of them being attached to the same parent string. This is easy using the subSpans
constructor, which does that automatically, and to join the succesive sub-spans using the union operation +
(that we will deal with later).
ngrams_21 = [s1+s2 for s1, s2 in zip(ngrams.subSpans[:-1], ngrams.subSpans[1:])]
ngrams_21
[Span('Simple string', [(0,6),(7,13)]),
Span('string for', [(7,13),(14,17)]),
Span('for demonstration', [(14,17),(18,31)]),
Span('demonstration and', [(18,31),(32,35)]),
Span('and for', [(32,35),(36,39)]),
Span('for illustration', [(36,39),(40,52)])]
If one prefers, one can directly constructs the ranges, and pass them from the beginning
ranges_ngrams = [[r1,r2] for r1,r2 in zip(ranges[:-1], ranges[1:])]
ngrams_22 = [Span(string, ranges=r) for r in ranges_ngrams]
ngrams_22
[Span('Simple string', [(0,6),(7,13)]),
Span('string for', [(7,13),(14,17)]),
Span('for demonstration', [(14,17),(18,31)]),
Span('demonstration and', [(18,31),(32,35)]),
Span('and for', [(32,35),(36,39)]),
Span('for illustration', [(36,39),(40,52)])]
The two constructions are equivalent
for s1, s2 in zip(ngrams_21, ngrams_22):
print(s1==s2)
True
True
True
True
True
True
1.3.2. Span.start
and Span.stop
parameters¶
One might be unsatisfied by the presence of discontiguous ranges defining each span in the above ngrams construction. For instance, one may want to have the first token being 'Simple string
from range [0,13[
with the empty space directly inside the string, and not given by the subtoksep
attribute.
This can be realized using the Span.start
and Span.stop
attributes, representing the first and last positions of the ranges of the Span
irrespective of the number of ranges. Said differently, one has to keep in mind there is no verification of the number of ranges
element before calculating the start
and stop
attributes. For instance
print(ngrams_22[0].start, ngrams_22[0].stop)
0 13
and so to convert is quite easy
ngrams_23 = [Span(s.string, ranges=[range(s.start, s.stop)]) for s in ngrams_22]
ngrams_23
[Span('Simple string', [(0,13)]),
Span('string for', [(7,17)]),
Span('for demonstration', [(14,31)]),
Span('demonstration and', [(18,35)]),
Span('and for', [(32,39)]),
Span('for illustration', [(36,52)])]
Span.start
and Span.stop
will play a prominent role when we will discuss the ordering relation among different instances of Span
associated to the same parent string later in this introduction.
1.3.3. Span.cuts
¶
In order to help making n-grams and tokens, the method split
has been designed. Let us try it
spans = span.split(cuts=ranges)
spans
[Span('', [(0,0)]),
Span('Simple', [(0,6)]),
Span(' ', [(6,7)]),
Span('string', [(7,13)]),
Span(' ', [(13,14)]),
Span('for', [(14,17)]),
Span(' ', [(17,18)]),
Span('demonstration', [(18,31)]),
Span(' ', [(31,32)]),
Span('and', [(32,35)]),
Span(' ', [(35,36)]),
Span('for', [(36,39)]),
Span(' ', [(39,40)]),
Span('illustration', [(40,52)]),
Span('.', [(52,53)])]
One sees that split
separates the initial span each time a range appear in the cuts
parameter, without detroying information. This might not be the most suitable case for you, since one usually has to filter the irrelevant part of the splitting. Here the strategy could be to remove the empty Span
(defined as having no child-string representation), the empty space spans and the non-alphanumerics span … but it is far better to use the previous approach, and to feed a new Span
instance directly with the outcome of the REGEX matches…
In fact, here, and because we used REGEX to capture a pattern of contiguous alphanumeric characters as the relevant spans, one can simply capture half of the Span
relevant_spans = spans[1::2]
relevant_spans
[Span('Simple', [(0,6)]),
Span('string', [(7,13)]),
Span('for', [(14,17)]),
Span('demonstration', [(18,31)]),
Span('and', [(32,35)]),
Span('for', [(36,39)]),
Span('illustration', [(40,52)])]
If one is lost with the starting element of the spans
list being empty or not, one can use the remove_empty=True
parameter in split
, which remove empty Span
.
span.split(cuts=ranges, remove_empty=True)
[Span('Simple', [(0,6)]),
Span(' ', [(6,7)]),
Span('string', [(7,13)]),
Span(' ', [(13,14)]),
Span('for', [(14,17)]),
Span(' ', [(17,18)]),
Span('demonstration', [(18,31)]),
Span(' ', [(31,32)]),
Span('and', [(32,35)]),
Span(' ', [(35,36)]),
Span('for', [(36,39)]),
Span(' ', [(39,40)]),
Span('illustration', [(40,52)]),
Span('.', [(52,53)])]
An other strategy in our specific case would have been to filter on the length of the Span
(defined as the length of its child-string representation)
relevant_spans = [s for s in spans if len(s)>1]
relevant_spans
[Span('Simple', [(0,6)]),
Span('string', [(7,13)]),
Span('for', [(14,17)]),
Span('demonstration', [(18,31)]),
Span('and', [(32,35)]),
Span('for', [(36,39)]),
Span('illustration', [(40,52)])]
That way, one quite easilly filters the different spans extracted at the previous step. An other usefull filter is the string comparison. For instance, to withdraw all the spans that contains the character 'o'
is intuitive
relevant_spans_without_o = [s for s in relevant_spans if 'o' not in s]
relevant_spans_without_o
[Span('Simple', [(0,6)]), Span('string', [(7,13)]), Span('and', [(32,35)])]
1.3.4. Char-grams and Span.slice
¶
The char-grams are even easier to implement, since the method slice
does it for us, with parameters :
start
: the character number where the slice startsstop
: the character number where the slice stopssize
: the number of characters inside an outcomingSpan
step
: the number of characters skipped from oneSpan
to the next one in the outcome
The outcome is a list of Span
.
span.slice(start=0, stop=8, size=3, step=1)
[Span('Sim', [(0,3)]),
Span('imp', [(1,4)]),
Span('mpl', [(2,5)]),
Span('ple', [(3,6)]),
Span('le ', [(4,7)]),
Span('e s', [(5,8)])]
Note that .slice
applies straightforwardly even on already truncated string. Note the starting and stoping positions are calculated inside the child-string representation. For instance
span1.slice(size=5, stop=8)
[Span('strin', [(7,12)]),
Span('tring', [(8,13)]),
Span('ring ', [(9,13),(36,36)]),
Span('ing f', [(10,13),(36,37)])]
Note that the ranges
are cutted to give each child-string of the Span
instance the correct length (this is the reason of the appearance of zero length range(36,36)
for Span('ring ', [(9,13),(36,36)])
, and this is also why one should not take a subtoksep
of length higher than 1, because ranges = [range(9,13), range(36,36)]
signifies that there are 13-9=4
characters plus one subtoksep
character in this Span
, while range(36,36)
selects no string in the parent-string of this Span
; compare with the last Span
of the above example, and change eventually the subtoksep
to '#'
character to see it in action).
Note also that a BoundaryWarning
easilly raises while using this method, since the default values are overwriten to start and stop on the size of the str(Span)
.
1.4. Order relation among Span
¶
Span
object can be ordered, providing that one keeps in mind
the order is only relevant for contiguous
Span
, a contiguousSpan
being aSpan
with a singlerange
, that is, this is aSpan
for whichstart
andstop
attribute make sensethe order is only relevant for two
Span
attached to the same string. This is somehow less sensible than the previous remark, since comparison ofSpan
s would result in aTypeError
in case theirSpan.string
are not the same.
Before entering into the details of the comparison, let us recall all the Span
we constructed above
'Simple string for demonstration and for illustration.' # the Span span
'01234567891123456789212345678931234567894123456789512' # the positions
'Simple string for demonstration and for illustration.' # the Span ngrams
'012345 789112 456 8921234567893 234 678 412345678951 ' # the ranges
In particular, the span ngrams
contains 7 ranges
, and we can call sub-spans using the subSpans
attribute.
1.4.1. <
and >
relations for non overlapping Span
¶
The Span
class has been designed for latin languages (though it might work for any language that can be encoded in an alphabet), and the reading sense is from left to right.
span1 < span2
means that the last position ofspan1
is on the left of the first position ofspan2
; said differentlyspan1
is entirely read beforespan2
is readspan1 > span2
means that the first position ofspan1
is on the right of the last position ofspan2
; said differentlyspan2
is entirely read beforespan1
is read
In particular, these definitions imply that :
if either
span1 < span2
orspan1 > span2
, they do not overlapspan1 < span2
being false does not mean thatspan1 > span2
: the two spans may overlapspan1 > span2
being false does not mean thatspan1 < span2
: the two spans may overlap
Let us illustrate this :
# not overlapping Span
print(ngrams.subSpans[0] < ngrams.subSpans[1])
print(ngrams.subSpans[1] > ngrams.subSpans[0])
print(ngrams.subSpans[2] < ngrams.subSpans[4])
print(ngrams.subSpans[4] > ngrams.subSpans[2])
True
True
True
True
To illustrate the strict order <
and >
on overlapping Span
, let us cartoon the instances span1
and span2
from above
'Simple string for demonstration and for illustration.' # the Span span
'01234567891123456789212345678931234567894123456789512' # the positions
' string for illustration ' # the Span span1
' 789112 678 412345678951 ' # the ranges
' string for illustration ' # the Span span2
' 789112 456 412345678951 ' # the ranges
and recall the construction of span1.start = 7
, span1.stop = 51
and the same for span2
: when it is about order comparison, a Span
object associated to several ranges
behaves as a monobloc entity from its starting to ending positions.
Now the order of these two Span
entities:
# overlapping Span
print(span1 < span2)
print(span1 > span2)
print(span2 < span1)
print(span2 > span1)
False
False
False
False
1.4.2. <=
and >=
relations¶
For overlapping spans, the orders of the Span
can either be
span1 <= span2
if there are possibly non-overlapping part ofspan1
on the left ofspan2
; said differently the left-most part of the union ofspan1
andspan2
belongs tospan1
; said differentlyspan1
is partly read beforespan2
is readspan1 >= span2
if there are possibly non-overlapping part ofspan1
on the right ofspan2
; said differently the right-most part of the union ofspan1
andspan2
belongs tospan1
; said differentlyspan2
is partly read beforespan1
is read
From the above definition, note that
if both
span1 >= span2
andspan1 <= span2
were true, there is no reason whyspan1 == span2
should be true. In fact we defined the overlapping orders<=
and>=
to be independent to each other: there is no way thatspan1 >= span2 and span1 <= span2
returnsTrue
There is no difficulty in interpretation, as illustrated with span1
and the Span
including the string 'documentation'
from ngrams
: documentation
is on the right of span1
and so both documentation
is on the right of span1
(documentation >= span1
) and span1
is on the left of demonstration
(span1 <= demonstration
). To insist one more time : span1
starts before demonstration
, and there are overlapping part between the two Span
(this is counter-intuitive if one forgets the roles of span1.start
and span1.stop
in the ordering relations). Those are the only true orders in presence.
# overlapping span with no common boundary
demonstration = ngrams.subSpans[3]
print(demonstration < span1)
print(demonstration > span1)
print(span1 < demonstration)
print(span1 > demonstration)
print(demonstration <= span1)
print(demonstration >= span1)
print(span1 <= demonstration)
print(span1 >= demonstration)
False
False
False
False
False
True
True
False
The same easiness of interpretation is present in the case of one boundary in common, as e.g.
# overlapping span with left common boundary
string = ngrams.subSpans[1]
print(string < span1)
print(string > span1)
print(string <= span1)
print(string >= span1)
print(span1 < string)
print(span1 > string)
print(span1 <= string)
print(span1 >= string)
False
False
True
False
False
False
False
True
for left boundary in common, and
# overlapping span with right common boundary
illustration = ngrams.subSpans[-1]
print(illustration < span1)
print(illustration > span1)
print(illustration <= span1)
print(illustration >= span1)
print(span1 < illustration)
print(span1 > illustration)
print(span1 <= illustration)
print(span1 >= illustration)
False
False
False
True
False
False
True
False
for right boundary in commom. In contrary, if the two boundaries are in common, there is no order relation between the different Span
# overlapping Span with two common boundaries
print(span1 <= span2)
print(span1 >= span2)
print(span2 <= span1)
print(span2 >= span1)
False
False
False
False
1.4.3. Complete order of Span
on a parent string¶
The interest in this order is that, whatever the two choosen Span
s associated with a common string, there is one and only one valid assumption, either:
span1 > span2
: the two spans do not overlap, andspan1
is entirely on the right ofspan2
,span1 < span2
: no overlap,span1
is entirely on the left ofspan2
,span1 >= span2
: overlapping string, withspan1
finishing later thanspan2
in the reading order,span1 <= span2
: overlapping string, withspan2
finishing later thanspan1
in the reading order,
and the order is dependent of the order of the Span
s, that is, we should not try to interpret the outcome of span1 <= span2 and span2 <= span1
, that just signifies that span1.start == span2.start
.
Recall that the order comparison takes into account only the complete size of the Span
, that is, it compares only the start
and stop
attributes of the Span
, disregarding the possibility of non-contiguous Span
. Said differently, span1
above is treated (in terms of comparisons) in the same way as span2
(and this is also why one has both span1 <= span2
and span2 <= span1
in order to preserve the unique order relations).
1.5. Algebraic manipulations of Span
¶
Since the Span
object represents a collection of characters positions from a parent string, and because a position cannot appear more than once (recall overlapping range
in Span.ranges
are forbidden), one can treat this collection as a mathematical set. So the basic set operations of union, difference, intersection, and symmetric division are allowed.
Here are their results, illustrated from span1
and span2
1.5.1. Union¶
The complete set of all positions in span1
plus all positions in span2
. This operation is symmetric.
# union
span1 + span2
# equals to span1.union(span2), span2+span1, span2.union(span1)
Span('string for for illustration', [(7,13),(14,17),(36,39),(40,52)])
1.5.2. Difference¶
The portion of the parent string in span1
that is absent in span2
is represented as span1.difference(span2)
. Note this is not a symmetric operation (warn the Span('for')
with different ranges
…)
# difference
span1 - span2
# equals to span1.difference(span2)
Span('for', [(36,39)])
# difference is not symmetric
span2.difference(span1)
# equals to span2 - span1
Span('for', [(14,17)])
1.5.3. Intersection¶
The ranges
which are common to both span1
and span2
can be found using span1 * span2
or span1.intersection(span2)
. This operation is symmetric span1 * span2 == span2 * span1
is True
.
# intersection
span1 * span2
# equals to span1.intersection(span2), span2*span1, span2.intersection(span1)
Span('string illustration', [(7,13),(40,52)])
1.5.4. Symmetric difference¶
The symmetric difference represent the union minus the intersection of the two Span
. It is repsented by the division operator /
, but keep in mind this relation is symmetric.
# symmetric_difference
span1/span2
# equals to span1.symmetric_difference(span2), span2/span1, ...
Span('for for', [(14,17),(36,39)])
1.6. Conclusion¶
The Span
object is a convenient abstraction of the processus of cutting a string into sub-parts. Instead of constructing all string possibilities before feeding them to the next stage of the NLP treatment, one can apply some construction steps on their characters positions, which in principle is far more efficient in terms of algorithms. In addition to reconstruct the usual tokens (e.g. n-grams and char-grams for instance), the main advantages of the Span
class is its habilities to construct as many children string as one wants, starting from a unique parent string. Using the four order relations <
, >
, <=
and >=
among all these children, as well as the infinite possibilities of combining these children strings using the +
, -
, *
and /
operations allow overwhelming new routes of manipulating a string in an abstract way.
from datetime import datetime
print("Last modification {}".format(datetime.now().strftime("%c")))
Last modification Fri Jan 21 21:48:31 2022