7. On the ordering of the Span objects¶
The Span
object has been equiped with an ordering (from version 0.6.2 of tokenspan
), which allows comparison among any kind of span that makes sense for an associated string. That is, the requirements are that
a
Span
should not be empty:len(Span)
exists (is not null)two
Span
should not be equal: otherwise they correspond to the same portion of a string, and are thus useless for comparison
We would like to show how the comparison tools might be usefull in the case when one creates many different Span
instances of the same string, and want to compare them (for instance to define a unique reading order for a further algorithm).
Note: the present Notebook uses the packages numpy
and pandas
, that the tokenspan
package doesn’t need
from itertools import permutations
from typing import List
from pandas import DataFrame, __version__ # pandas is used to present the results
print(f"pandas version: {__version__}")
import numpy as np
print(f"numpy version: {np.__version__}")
from tokenspan import Span, __version__
print(f"tokenspan version: {__version__}")
pandas version: 1.3.5
numpy version: 1.19.4
tokenspan version: 0.6.2
7.1. Motivations¶
First, we construct the different versions of Span
, they are given by the following cartoon
'Simple string for demonstration and for illustration.' # the Span span
'01234567891123456789212345678931234567894123456789512' # the positions
' string for illustration ' # the Span span1
' 789112 678 412345678951 ' # the ranges
' string for illustration ' # the Span span2
' 789112 456 412345678951 ' # the ranges
'Simple string for demonstration and for illustration.' # the Span spans
'012345 789112 456 8921234567893 234 678 4123456789512' # the positions
'Simple string for demonstration and for illustration.' # the Span ngram1
'0123456789112 ' # the positions
'Simple string for demonstration and for illustration.' # the Span ngram2
' 7891123456 ' # the positions
'Simple string for demonstration and for illustration.' # the Span ngram3
' 45678921234567893 ' # the positions
'Simple string for demonstration and for illustration.' # the Span ngram4
' 89212345678931234 ' # the positions
'Simple string for demonstration and for illustration.' # the Span ngram5
' 2345678 ' # the positions
'Simple string for demonstration and for illustration.' # the Span ngram6
' 67894123456789512' # the positions
We now have all possible configurations :
spans not overlapping (the different
spans.subspans
spans, orngram2
andngram5
for instance)spans overlapping (
ngram1
andngram2
for instance)spans having the left boundary in common (
spans.subspans[2]
andngram3
for instance)spans having the right boundary in common (
spans.subspans[3]
andngram3
for instance)spans constituted of several non-contiguous sub-spans (
span1
orspan2
for instance)spans made of a contiguous range of positions (any n-gram, or any
spans.subspans
element for instance)
We want to discuss a possibility to compare these different objects among themselves.
Now, recall that the order relations are defined using the Span.start
and Span.stop
attributes, and those are calculated as the left-most and right-most boundaries of the Span
, respectively, disregrading any possible holes in the Span.ranges
description (said differently: disregarding the number of range in Span.ranges
).
There are thus strange beast in the above list of spans, namely :
span1
andspan2
have the same boundaries (namelyspan1.start = 7
andspan1.stop = 52
, and the same forspan2
), but they contain different portions of text (the textfor
present in both spans has not the same position, and the twoSpan
are not equal)one can generate some
Span(string, ranges=spans.ranges[1:3])
that would have the same boundaries asngram2
without the sameranges
exactly
These Span
example should not be equal, but should be comparable. We will should also how to deal with such situations.
7.2. Construction of all the Span
¶
We construct the different spans by hand, to not bother us with their construction using REGEX or other methods.
string = 'Simple string for demonstration and for illustration.'
span = Span(string)
span1 = Span(string, ranges=[range(7,13), range(36,39), range(40,52)])
span2 = Span(string, ranges=[range(7,13), range(14,17), range(40,52)])
spans = Span(string, ranges=[range(0,6),range(7,13),range(14,17),
range(18,31),range(32,35),range(36,39),
range(40,52)])
ngram1 = Span(string, ranges=[range(0,13)])
ngram2 = Span(string, ranges=[range(7,17)])
ngram3 = Span(string, ranges=[range(14,31)])
ngram4 = Span(string, ranges=[range(18,35)])
ngram5 = Span(string, ranges=[range(32,39)])
ngram6 = Span(string, ranges=[range(36,52)])
relations = {' < ':'__lt__',
' > ':'__gt__',
' <= ':'__le__',
' >= ':'__ge__',
' == ':'__eq__'}
7.3. Order relations¶
Before discussing how are defined the different order relations , we simply check a few facts. What is important to remember is that the different order relations <
, >
, <=
, >=
and ==
are not conceptually related to each other. That is, each of them is an order relation that one can use to order any set of spans, and define what mathematicians call a poset, or partially ordered set, see also the simpler discussion on interval order.
The construction of the order relation is postponed to the next section. For the moment we verify that one captured many interesting properties from these order relations, and that we reusse giving them some meaning.
7.3.1. Non-overlapping spans¶
We pass the data to the order relations, starting from the non-overlapping datas. For all possible couples of Span
taken from spans
(that is the list of words in the example text), we calculate the 5 different order relations, to show that
only
<
or>
relations are verified for non-overlapping spansif
span1 > span2
, thenspan2 < span1
def order_relations(spans: List[Span]) -> DataFrame:
orders = list()
for s1, s2 in permutations(spans, 2):
data_ = {'span1': str(s1), 'span2': str(s2), }
data1 = {"span1{}span2".format(rel): getattr(s1, attr)(s2)
for rel, attr in relations.items()}
data_.update(data1)
data2 = {"span2{}span1".format(rel): getattr(s2, attr)(s1)
for rel, attr in relations.items()}
data_.update(data2)
orders.append(data_)
return DataFrame(orders)
non_overlapping = order_relations(spans.subspans[:4])
non_overlapping
span1 | span2 | span1 < span2 | span1 > span2 | span1 <= span2 | span1 >= span2 | span1 == span2 | span2 < span1 | span2 > span1 | span2 <= span1 | span2 >= span1 | span2 == span1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Simple | string | True | False | False | False | False | False | True | False | False | False |
1 | Simple | for | True | False | False | False | False | False | True | False | False | False |
2 | Simple | demonstration | True | False | False | False | False | False | True | False | False | False |
3 | string | Simple | False | True | False | False | False | True | False | False | False | False |
4 | string | for | True | False | False | False | False | False | True | False | False | False |
5 | string | demonstration | True | False | False | False | False | False | True | False | False | False |
6 | for | Simple | False | True | False | False | False | True | False | False | False | False |
7 | for | string | False | True | False | False | False | True | False | False | False | False |
8 | for | demonstration | True | False | False | False | False | False | True | False | False | False |
9 | demonstration | Simple | False | True | False | False | False | True | False | False | False | False |
10 | demonstration | string | False | True | False | False | False | True | False | False | False | False |
11 | demonstration | for | False | True | False | False | False | True | False | False | False | False |
To verify what we claim, we simply calculate the sum along the horizontal axis, and we should find 2 for each line since there are span1 > span2
and span2 < span1
, or span1 < span2
and span2 > span1
. We verify that the number of lines where these conditions are satisfied corresponds to the total number of lines of the boolean table.
check = non_overlapping['span1 < span2'] * non_overlapping['span2 > span1']
check += non_overlapping['span1 > span2'] * non_overlapping['span2 < span1']
sum(check) == len(non_overlapping)
True
7.4. Overlapping spans¶
Now we discuss the overlapping spans, with the example of the ngrams Simple string
, string for
and the span string
that share eventually some boundary with the oher ones.
overlapping = order_relations([ngram1, ngram2, spans.subspans[1]])
overlapping
span1 | span2 | span1 < span2 | span1 > span2 | span1 <= span2 | span1 >= span2 | span1 == span2 | span2 < span1 | span2 > span1 | span2 <= span1 | span2 >= span1 | span2 == span1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Simple string | string for | False | False | True | False | False | False | False | False | True | False |
1 | Simple string | string | False | False | True | False | False | False | False | False | True | False |
2 | string for | Simple string | False | False | False | True | False | False | False | True | False | False |
3 | string for | string | False | False | False | True | False | False | False | True | False | False |
4 | string | Simple string | False | False | False | True | False | False | False | True | False | False |
5 | string | string for | False | False | True | False | False | False | False | False | True | False |
One more time, one sees that, despite the difference in the order relations <=
and >=
, they are dual to each other, in the sense that span1 >= span2
implies span2 <= span1
and vice-versa.
check = overlapping['span1 <= span2'] * overlapping['span2 >= span1']
check += overlapping['span1 >= span2'] * overlapping['span2 <= span1']
sum(check) == len(overlapping)
True
We thus showed that any contiguous spans can be ordered by one of the four relation laws <
or >
(for non-overlapping spans) and <=
or >=
(for overlapping spans). In addition, only one of these relation is verified at a time, and any situation we have encountered so far by only one situation. Now that the contiguous spans have been studied, let us understand how one can deal with the non-contiguous ones.
7.5. Non-contiguous spans¶
Below we see that the only problematic case is the fake equality. Since the non-contiguous Span
are described by the Span.start
and Span.stop
attributes, span1
and span2
could be thought as being equals in the example studied here. If fact they are not, since their ranges
are not equals, which is one of the conditions of the equality of two Span
instances. This is illustrated on the first line of the table below.
order_relations([span1, span2,
Span(string, ranges=spans.ranges[1:3]),
Span(string, ranges=spans.ranges[-2:])])
span1 | span2 | span1 < span2 | span1 > span2 | span1 <= span2 | span1 >= span2 | span1 == span2 | span2 < span1 | span2 > span1 | span2 <= span1 | span2 >= span1 | span2 == span1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | string for illustration | string for illustration | False | False | False | False | False | False | False | False | False | False |
1 | string for illustration | string for | False | False | False | True | False | False | False | True | False | False |
2 | string for illustration | for illustration | False | False | True | False | False | False | False | False | True | False |
3 | string for illustration | string for illustration | False | False | False | False | False | False | False | False | False | False |
4 | string for illustration | string for | False | False | False | True | False | False | False | True | False | False |
5 | string for illustration | for illustration | False | False | True | False | False | False | False | False | True | False |
6 | string for | string for illustration | False | False | True | False | False | False | False | False | True | False |
7 | string for | string for illustration | False | False | True | False | False | False | False | False | True | False |
8 | string for | for illustration | True | False | False | False | False | False | True | False | False | False |
9 | for illustration | string for illustration | False | False | False | True | False | False | False | True | False | False |
10 | for illustration | string for illustration | False | False | False | True | False | False | False | True | False | False |
11 | for illustration | string for | False | True | False | False | False | True | False | False | False | False |
In contrary, the rest of the spans in the table behave in the same way as described before: they verify one and only one order relation, and doing so, their dual is also verified.
In conclusion, we have the following situations :
either
span1 > span2
orspan1 < span2
if theSpan
are not overlapping, in which case the dual relation is also satisfied (that isspan2 < span1
orspan2 > span1
, respectively)either
span1 >= span2
orspan1 <= span2
if theSpan
are not overlapping, in which case the dual relation is also verified (that isspan2 <= span1
orspan2 >= span1
, respectively)either no order relation applies, in which case the two
Span
are non-contiguous spans with similar boundaries.
7.6. Anti-symmetric matrix representation of the Span ordering¶
If, for two spans span1
and span2
, we add
-1
ifspan1 < span2
+1
ifspan1 > span2
-2
ifspan1 <= span2
+2
ifspan1 >= span2
and if we do the same for any combinations of all non-identical Span
constructed from a given string, we obtain an antisymmetric matrix having zero values (if zero is the initial value of all the matrix) outside the diagonal only for Span
having same boundaries.
This defined a unique reading order that can be fed to any firther algorithm that will deal with the collection of Span
inherited from the parent string.
We now construct the Span
order matrix
relations_weights = {'__lt__': -1,
'__gt__': +1,
'__le__': -2,
'__ge__': +2,}
def order_matrix(spans: List[Span]) -> np.array:
"""Construct the order matrix from a list of Span.
Returns an antisymmetric matrix (a numpy array) of integers."""
orders = np.zeros(shape=(len(spans), len(spans)), dtype=np.int8)
for i,j in permutations(range(len(spans)), 2):
for rel, weight in relations_weights.items():
orders[i,j] += weight if getattr(spans[i], rel)(spans[j]) else 0
return orders
non_contiguous = order_matrix([span1, span2,
Span(string, ranges=spans.ranges[1:3]),
Span(string, ranges=spans.ranges[-2:])]
)
non_contiguous
array([[ 0, 0, 2, -2],
[ 0, 0, 2, -2],
[-2, -2, 0, -1],
[ 2, 2, 1, 0]], dtype=int8)
subspans = span1.subspans+span2.subspans
non_contiguous_subspans = order_matrix(subspans)
non_contiguous_subspans
array([[ 0, -1, -1, 0, -1, -1],
[ 1, 0, -1, 1, 1, -1],
[ 1, 1, 0, 1, 1, 0],
[ 0, -1, -1, 0, -1, -1],
[ 1, -1, -1, 1, 0, -1],
[ 1, 1, 0, 1, 1, 0]], dtype=int8)
print(subspans[0], subspans[3])
print(subspans[2], subspans[5])
string string
illustration illustration
7.7. Mathematical elaboration on the order relations¶
We now enter a more technical part on the order relations described above.
7.7.1. Do we have one order, or four ?¶
First of all, let us clarify what we call an order, in comparison with what mathematicians define as a poset, or partially ordered set. According to the definition, a weak \(\leq\) or a strict \(\lt\) order on a set \(P\) verify a few axioms :
reflexivity: \(a \leq a\), i.e. every element is related to itself.
antisymmetry: if \(a \leq b\) and \(b \leq a\) then \(a = b\), i.e. no two distinct elements precede each other.
transitivity: if \(a \leq b\) and \(b \leq c\) then \(a \leq c\),
for the weak-order, and
irreflexivity: not \(a \lt a\), i.e. no element is related to itself
transitivity: if \(a \lt b\) and \( b \lt c\) then \(a \lt c\),
asymmetry: if \(a \lt b\) then not \(b \lt a\),
for the strict order.
Clearly, we defined four different order relations on the ensemble of Span
objects, and all of them are stricts. This is clear from the fact that an order relation is always false when applied to a Span
element.
This is certainly the most important message at this stage : each of the order relation >
, >
, >=
or <=
corresponds to one order relation, while it is conventional to write \(\leq\) the weak version of the \(\lt\) strict order relation.
This can be easilly check once we have an algorithm that sort Span
according to a given order relation. We give below a simple example (this is not an optimized algorithm to sort long sequences of Span
, though, since its time complexity scales like \(O\left(n^{2}\right)\) with \(n\) the sequence length).
def sort_spans(spans: List[Span],
rel: str='__lt__',
) -> List[Span]:
"""Sort a list of Span, according to the relation rel (in class nomenclature).
Returns a list of Span"""
sorted_spans = []
temp_spans = spans[:]
for _ in range(len(spans)):
extremum = temp_spans[0]
i_extremum = 0
for i, span in enumerate(temp_spans[1:]):
if getattr(span, rel)(extremum):
extremum = span
i_extremum = i+1
sorted_spans.append(temp_spans.pop(i_extremum))
return sorted_spans
This algorithm take the first element of the list, compare it with all the remaining ones to find the extremum (say the minimum if the order relation were \(\lt\) associated to the integer number for instance) among them, then append this extremum to the ordered list and withdraw it from the initial list (the .pop
action at the end of the loop) and starts again with the next element. The order relation is a parameter, and one should pass it using the nomenclature of class, namely:
<
corresponds to__lt__
(lower than magic-function)>
corresponds to__gt__
(greater than magic-function)<=
corresponds to__le__
(lower than or equal magic-function)>=
corresponds to__ge__
(greater than or equal magic-function)
Now we can order the spans
according to the two non-overlapping orders >
and <
. We start with the >
order
gt_order = sort_spans(spans.subspans, rel='__gt__')
gt_order
[Span('illustration', [(40,52)]),
Span('for', [(36,39)]),
Span('and', [(32,35)]),
Span('demonstration', [(18,31)]),
Span('for', [(14,17)]),
Span('string', [(7,13)]),
Span('Simple', [(0,6)])]
which classifies the different Span
in a satisfactory way, the right-most first, then its successors in a reading order from right to left, …
We can re-order this sequence using the <
-order
lt_order = sort_spans(gt_order, rel='__lt__')
lt_order
[Span('Simple', [(0,6)]),
Span('string', [(7,13)]),
Span('for', [(14,17)]),
Span('demonstration', [(18,31)]),
Span('and', [(32,35)]),
Span('for', [(36,39)]),
Span('illustration', [(40,52)])]
and this des the job as excpected.
What happens now if one use an order that is dedicated for overlapping Span
on the above sequences, which contain only non-overlapping Span
? Well, try it: below is the example of >=
order on the original spans
sequence.
ge_order = sort_spans(spans.subspans, rel='__ge__')
ge_order
[Span('Simple', [(0,6)]),
Span('string', [(7,13)]),
Span('for', [(14,17)]),
Span('demonstration', [(18,31)]),
Span('and', [(32,35)]),
Span('for', [(36,39)]),
Span('illustration', [(40,52)])]
We see that the initial list is left untouched. This is because there is never in the process a situation where the >=
relation returns True
, and so the inner-loop of the sort_spans
algorithm does nothing.
This is the same if one applies the <=
order relation to the gt_order
ed list.
le_order = sort_spans(gt_order, rel='__le__')
le_order
[Span('illustration', [(40,52)]),
Span('for', [(36,39)]),
Span('and', [(32,35)]),
Span('demonstration', [(18,31)]),
Span('for', [(14,17)]),
Span('string', [(7,13)]),
Span('Simple', [(0,6)])]
7.7.2. What are the reasons of these confusing definitions ?¶
The only reason for the confusing call if the limitation of Python magic-functions. Since there are only four magic functions conceptually related to an order, we choose to employ the four of them for each of the order relations we have in head. The reason why there is no more usefull order relations will be explained below.
We would like to recast that the notations introduced in the present section are confusing only if one does not remember what an order relation is. In fact to order any set, we only need to find a comparison relation (called an order relation in this context) and to apply it to all elements of the set (recall the construction of the algorithm sort_spans
above). So our strategy is to use the different Python magic function not to construct an order relation, say \(\leq\) and its dual \(\geq\) and their strict restrictions \(\lt\) and \(\gt\), but instead to construct four different order relations.
7.7.3. Why is 4 a good number for constructing the order relations of a Span
objects ?¶
It seems clear that one can invent any kind of ordering relation to order a set of Span
object. But one can try to restrict the range of possibilities by thinking a bit about the possible application.
First of all, let us try to choose order relations which are mutually exclusive to each other. That is, we would like to find four order relations such that two Span
objects must be in one and only one of the constructed order relations. Since a Span
object is basically a start
(we note the start position \(\alpha\) in the following) and a stop
(that we will note \(\omega\) in the following), to compare two Span
objects consists in comparing four integers. So four relations are in principle sufficient. In fact three order relations are sufficient since there are some constraints on the Span
object (that is \(\alpha\leq\omega\) is always true).
One can see that by proposing the four following relations to compare the four parameters of the two Span
\(S_{1}\left(\alpha_{1},\omega_{1}\right)\) and \(S_{2}\left(\alpha_{2},\omega_{2}\right)\) :
which mutually exclusive: \(\bar{A}\) is the complementary of \(A\), and \(\bar{B}\) is the complementary of \(B\). now to compare \(S_{1}\) and \(S_{2}\) one simply have to combine \(A\) and \(B\), \(A\) and \(\bar{B}\), \(\bar{A}\) and \(B\) and finally \(\bar{A}\) and \(\bar{B}\) and attribute to each mutually exclusive order relations thus created a name, for instance \(R_{1,2,3,4}\), respectively. Then we have
except \(R_{3}\) is never verified in practice (its stoping point is left to its starting one). So we just proove than at most, one can find three different order relations to describe all situations between two Span
objects.
7.7.4. What were the guidelines for our construction ?¶
We choose to get four order relation for the following reasons :
four relations is sufficient to make any collection of
Span
entirely sortable, even if one has to pass from one order relation to the next one to understand the ordering: that is, one can create a graph of ordering that will cover the complete set ofSpan
constructed on top of a stringfour order relations make them meaningfull in term of reading order: the ordered span will interpretable in term of e.g. “
span1
is read first, thenspan2
starts to be read, whilespan3
starts to be read beforespan2
is entirely read” (this reads in our notationsspan1 < span2
andspan3 <= span2
, here we didn’t described the order relation betweenspan1
andspan3
, but such an order relation exists in virtue of the previous point)four order relations allows to construct dual order relations, that is if
span1 < span2
, thenspan2 > span1
(in mathematician parlance, we say that the strict order relation<
is dual to the strict order relation>
)
The way we defined the two order relations \(<\) (<
in the package) and \(\prec\) (<=
in the package) and their duals (\(>\)/>
and \(\succ\)/>=
) is given by the following relations
and
which are easilly showed to be dual to each other (simply invert the \(1\) and \(2\) indices of the first line to instantaneously get the definition of the second line) and mutually exclusive (the duality implies the mutual exclusivity, and the opposite of \(\omega_{1}\leq\alpha_{2}\) defining \(S_{1}<S_{2}\) is indeed the condition \(\alpha_{2}<\omega_{1}\) that is present on the definition of \(S_{1}\prec S_{2}\)).
Note that
equal ranges of the span is forbidden in any order relation (except the two ranges being empty in the definition of \(<\)): the reason is that this is considered useless: a
Span
whoseranges
attribute being empty is simply not aSpan
that one is able to readfollowing previous remark: if a non-contiguous span present the same ranges than its comparing span, those two are not detected. Since this is the only case without order (because all other situations are mutualy exclusive by construction), this situation is in fact detected (see example above).
because of the ordering we just established, it is possible to construct any graph on top of any string, by way of splitting and re-gluing the cuts
7.8. Conclusion¶
It is possible to order a set of Span
objects, using three order relation that naturally describes the notion of reading: two of these order relations are strict and present a dual, the third one is self-dual (this is the equality relation that we shortly noticed its existence above). With the help of these order relations, it is posisble to describe any cutting of a text using a weighted directed graph, the direction being given by the duality of the order relation, and the weigth by the nature of the order relation itself. For reasonnable Span
(that is non-empty ones, since those are excluded from the order relation), the graph representing any collection of Span
constructed from a given parent string is entirely connected.
from datetime import datetime
print("Last modification {}".format(datetime.now().strftime("%c")))
Last modification Sat Jan 22 00:09:43 2022