7. On the ordering of the Span objects

The Span object has been equiped with an ordering (from version 0.6.2 of tokenspan), which allows comparison among any kind of span that makes sense for an associated string. That is, the requirements are that

  • a Span should not be empty: len(Span) exists (is not null)

  • two Span should not be equal: otherwise they correspond to the same portion of a string, and are thus useless for comparison

We would like to show how the comparison tools might be usefull in the case when one creates many different Span instances of the same string, and want to compare them (for instance to define a unique reading order for a further algorithm).

Note: the present Notebook uses the packages numpy and pandas, that the tokenspan package doesn’t need

from itertools import permutations
from typing import List

from pandas import DataFrame, __version__ # pandas is used to present the results
print(f"pandas version: {__version__}")
import numpy as np
print(f"numpy version: {np.__version__}")

from tokenspan import Span, __version__
print(f"tokenspan version: {__version__}")
pandas version: 1.3.5
numpy version: 1.19.4
tokenspan version: 0.6.2

7.1. Motivations

First, we construct the different versions of Span, they are given by the following cartoon

'Simple string for demonstration and for illustration.' # the Span span
'01234567891123456789212345678931234567894123456789512' # the positions

'       string                       for illustration ' # the Span span1
'       789112                       678 412345678951 ' # the ranges

'       string for                       illustration ' # the Span span2
'       789112 456                       412345678951 ' # the ranges

'Simple string for demonstration and for illustration.' # the Span spans
'012345 789112 456 8921234567893 234 678 4123456789512' # the positions

'Simple string for demonstration and for illustration.' # the Span ngram1
'0123456789112                                        ' # the positions

'Simple string for demonstration and for illustration.' # the Span ngram2
'       7891123456                                    ' # the positions

'Simple string for demonstration and for illustration.' # the Span ngram3
'              45678921234567893                      ' # the positions

'Simple string for demonstration and for illustration.' # the Span ngram4
'                  89212345678931234                  ' # the positions

'Simple string for demonstration and for illustration.' # the Span ngram5
'                                2345678              ' # the positions

'Simple string for demonstration and for illustration.' # the Span ngram6
'                                    67894123456789512' # the positions

We now have all possible configurations :

  • spans not overlapping (the different spans.subspans spans, or ngram2 and ngram5 for instance)

  • spans overlapping (ngram1 and ngram2 for instance)

  • spans having the left boundary in common (spans.subspans[2] and ngram3 for instance)

  • spans having the right boundary in common (spans.subspans[3] and ngram3 for instance)

  • spans constituted of several non-contiguous sub-spans (span1 or span2 for instance)

  • spans made of a contiguous range of positions (any n-gram, or any spans.subspans element for instance)

We want to discuss a possibility to compare these different objects among themselves.

Now, recall that the order relations are defined using the Span.start and Span.stop attributes, and those are calculated as the left-most and right-most boundaries of the Span, respectively, disregrading any possible holes in the Span.ranges description (said differently: disregarding the number of range in Span.ranges).

There are thus strange beast in the above list of spans, namely :

  • span1 and span2 have the same boundaries (namely span1.start = 7 and span1.stop = 52, and the same for span2), but they contain different portions of text (the text for present in both spans has not the same position, and the two Span are not equal)

  • one can generate some Span(string, ranges=spans.ranges[1:3]) that would have the same boundaries as ngram2 without the same ranges exactly

These Span example should not be equal, but should be comparable. We will should also how to deal with such situations.

7.2. Construction of all the Span

We construct the different spans by hand, to not bother us with their construction using REGEX or other methods.

string = 'Simple string for demonstration and for illustration.'

span = Span(string)
span1 = Span(string, ranges=[range(7,13), range(36,39), range(40,52)])
span2 = Span(string, ranges=[range(7,13), range(14,17), range(40,52)])
spans = Span(string, ranges=[range(0,6),range(7,13),range(14,17),
                             range(18,31),range(32,35),range(36,39),
                             range(40,52)])
ngram1 = Span(string, ranges=[range(0,13)])
ngram2 = Span(string, ranges=[range(7,17)])
ngram3 = Span(string, ranges=[range(14,31)])
ngram4 = Span(string, ranges=[range(18,35)])
ngram5 = Span(string, ranges=[range(32,39)])
ngram6 = Span(string, ranges=[range(36,52)])

relations = {' <  ':'__lt__',
             ' >  ':'__gt__', 
             ' <= ':'__le__',
             ' >= ':'__ge__', 
             ' == ':'__eq__'}

7.3. Order relations

Before discussing how are defined the different order relations , we simply check a few facts. What is important to remember is that the different order relations <, >, <=, >= and == are not conceptually related to each other. That is, each of them is an order relation that one can use to order any set of spans, and define what mathematicians call a poset, or partially ordered set, see also the simpler discussion on interval order.

The construction of the order relation is postponed to the next section. For the moment we verify that one captured many interesting properties from these order relations, and that we reusse giving them some meaning.

7.3.1. Non-overlapping spans

We pass the data to the order relations, starting from the non-overlapping datas. For all possible couples of Span taken from spans (that is the list of words in the example text), we calculate the 5 different order relations, to show that

  • only < or > relations are verified for non-overlapping spans

  • if span1 > span2, then span2 < span1

def order_relations(spans: List[Span]) -> DataFrame:
    orders = list()
    for s1, s2 in permutations(spans, 2):
        data_ = {'span1': str(s1), 'span2': str(s2), }
        data1 = {"span1{}span2".format(rel): getattr(s1, attr)(s2) 
                 for rel, attr in relations.items()}
        data_.update(data1)
        data2 = {"span2{}span1".format(rel): getattr(s2, attr)(s1) 
                 for rel, attr in relations.items()}
        data_.update(data2)
        orders.append(data_)
    return DataFrame(orders)

non_overlapping = order_relations(spans.subspans[:4])
non_overlapping
span1 span2 span1 < span2 span1 > span2 span1 <= span2 span1 >= span2 span1 == span2 span2 < span1 span2 > span1 span2 <= span1 span2 >= span1 span2 == span1
0 Simple string True False False False False False True False False False
1 Simple for True False False False False False True False False False
2 Simple demonstration True False False False False False True False False False
3 string Simple False True False False False True False False False False
4 string for True False False False False False True False False False
5 string demonstration True False False False False False True False False False
6 for Simple False True False False False True False False False False
7 for string False True False False False True False False False False
8 for demonstration True False False False False False True False False False
9 demonstration Simple False True False False False True False False False False
10 demonstration string False True False False False True False False False False
11 demonstration for False True False False False True False False False False

To verify what we claim, we simply calculate the sum along the horizontal axis, and we should find 2 for each line since there are span1 > span2 and span2 < span1, or span1 < span2 and span2 > span1. We verify that the number of lines where these conditions are satisfied corresponds to the total number of lines of the boolean table.

check = non_overlapping['span1 <  span2'] * non_overlapping['span2 >  span1']
check += non_overlapping['span1 >  span2'] * non_overlapping['span2 <  span1']
sum(check) == len(non_overlapping)
True

7.4. Overlapping spans

Now we discuss the overlapping spans, with the example of the ngrams Simple string, string for and the span string that share eventually some boundary with the oher ones.

overlapping = order_relations([ngram1, ngram2, spans.subspans[1]])
overlapping
span1 span2 span1 < span2 span1 > span2 span1 <= span2 span1 >= span2 span1 == span2 span2 < span1 span2 > span1 span2 <= span1 span2 >= span1 span2 == span1
0 Simple string string for False False True False False False False False True False
1 Simple string string False False True False False False False False True False
2 string for Simple string False False False True False False False True False False
3 string for string False False False True False False False True False False
4 string Simple string False False False True False False False True False False
5 string string for False False True False False False False False True False

One more time, one sees that, despite the difference in the order relations <= and >=, they are dual to each other, in the sense that span1 >= span2 implies span2 <= span1 and vice-versa.

check = overlapping['span1 <= span2'] * overlapping['span2 >= span1']
check += overlapping['span1 >= span2'] * overlapping['span2 <= span1']
sum(check) == len(overlapping)
True

We thus showed that any contiguous spans can be ordered by one of the four relation laws < or > (for non-overlapping spans) and <= or >= (for overlapping spans). In addition, only one of these relation is verified at a time, and any situation we have encountered so far by only one situation. Now that the contiguous spans have been studied, let us understand how one can deal with the non-contiguous ones.

7.5. Non-contiguous spans

Below we see that the only problematic case is the fake equality. Since the non-contiguous Span are described by the Span.start and Span.stop attributes, span1 and span2 could be thought as being equals in the example studied here. If fact they are not, since their ranges are not equals, which is one of the conditions of the equality of two Span instances. This is illustrated on the first line of the table below.

order_relations([span1, span2, 
                 Span(string, ranges=spans.ranges[1:3]),
                 Span(string, ranges=spans.ranges[-2:])])
span1 span2 span1 < span2 span1 > span2 span1 <= span2 span1 >= span2 span1 == span2 span2 < span1 span2 > span1 span2 <= span1 span2 >= span1 span2 == span1
0 string for illustration string for illustration False False False False False False False False False False
1 string for illustration string for False False False True False False False True False False
2 string for illustration for illustration False False True False False False False False True False
3 string for illustration string for illustration False False False False False False False False False False
4 string for illustration string for False False False True False False False True False False
5 string for illustration for illustration False False True False False False False False True False
6 string for string for illustration False False True False False False False False True False
7 string for string for illustration False False True False False False False False True False
8 string for for illustration True False False False False False True False False False
9 for illustration string for illustration False False False True False False False True False False
10 for illustration string for illustration False False False True False False False True False False
11 for illustration string for False True False False False True False False False False

In contrary, the rest of the spans in the table behave in the same way as described before: they verify one and only one order relation, and doing so, their dual is also verified.

In conclusion, we have the following situations :

  • either span1 > span2 or span1 < span2 if the Span are not overlapping, in which case the dual relation is also satisfied (that is span2 < span1 or span2 > span1, respectively)

  • either span1 >= span2 or span1 <= span2 if the Span are not overlapping, in which case the dual relation is also verified (that is span2 <= span1 or span2 >= span1, respectively)

  • either no order relation applies, in which case the two Span are non-contiguous spans with similar boundaries.

7.6. Anti-symmetric matrix representation of the Span ordering

If, for two spans span1 and span2, we add

  • -1 if span1 < span2

  • +1 if span1 > span2

  • -2 if span1 <= span2

  • +2 if span1 >= span2

and if we do the same for any combinations of all non-identical Span constructed from a given string, we obtain an antisymmetric matrix having zero values (if zero is the initial value of all the matrix) outside the diagonal only for Span having same boundaries.

This defined a unique reading order that can be fed to any firther algorithm that will deal with the collection of Span inherited from the parent string.

We now construct the Span order matrix

relations_weights = {'__lt__': -1,
                     '__gt__': +1, 
                     '__le__': -2,
                     '__ge__': +2,}

def order_matrix(spans: List[Span]) -> np.array:
    """Construct the order matrix from a list of Span. 
    Returns an antisymmetric matrix (a numpy array) of integers."""
    orders = np.zeros(shape=(len(spans), len(spans)), dtype=np.int8)
    for i,j in permutations(range(len(spans)), 2):
        for rel, weight in relations_weights.items():
            orders[i,j] += weight if getattr(spans[i], rel)(spans[j]) else 0
    return orders

non_contiguous = order_matrix([span1, span2, 
                               Span(string, ranges=spans.ranges[1:3]),
                               Span(string, ranges=spans.ranges[-2:])]
                             )
non_contiguous
array([[ 0,  0,  2, -2],
       [ 0,  0,  2, -2],
       [-2, -2,  0, -1],
       [ 2,  2,  1,  0]], dtype=int8)
subspans = span1.subspans+span2.subspans
non_contiguous_subspans = order_matrix(subspans)
non_contiguous_subspans
array([[ 0, -1, -1,  0, -1, -1],
       [ 1,  0, -1,  1,  1, -1],
       [ 1,  1,  0,  1,  1,  0],
       [ 0, -1, -1,  0, -1, -1],
       [ 1, -1, -1,  1,  0, -1],
       [ 1,  1,  0,  1,  1,  0]], dtype=int8)
print(subspans[0], subspans[3])
print(subspans[2], subspans[5])
string string
illustration illustration

7.7. Mathematical elaboration on the order relations

We now enter a more technical part on the order relations described above.

7.7.1. Do we have one order, or four ?

First of all, let us clarify what we call an order, in comparison with what mathematicians define as a poset, or partially ordered set. According to the definition, a weak \(\leq\) or a strict \(\lt\) order on a set \(P\) verify a few axioms :

  1. reflexivity: \(a \leq a\), i.e. every element is related to itself.

  2. antisymmetry: if \(a \leq b\) and \(b \leq a\) then \(a = b\), i.e. no two distinct elements precede each other.

  3. transitivity: if \(a \leq b\) and \(b \leq c\) then \(a \leq c\),

for the weak-order, and

  1. irreflexivity: not \(a \lt a\), i.e. no element is related to itself

  2. transitivity: if \(a \lt b\) and \( b \lt c\) then \(a \lt c\),

  3. asymmetry: if \(a \lt b\) then not \(b \lt a\),

for the strict order.

Clearly, we defined four different order relations on the ensemble of Span objects, and all of them are stricts. This is clear from the fact that an order relation is always false when applied to a Span element.

This is certainly the most important message at this stage : each of the order relation >, >, >= or <= corresponds to one order relation, while it is conventional to write \(\leq\) the weak version of the \(\lt\) strict order relation.

This can be easilly check once we have an algorithm that sort Span according to a given order relation. We give below a simple example (this is not an optimized algorithm to sort long sequences of Span, though, since its time complexity scales like \(O\left(n^{2}\right)\) with \(n\) the sequence length).

def sort_spans(spans: List[Span], 
               rel: str='__lt__',
              ) -> List[Span]:
    """Sort a list of Span, according to the relation rel (in class nomenclature).
    Returns a list of Span"""
    sorted_spans = []
    temp_spans = spans[:]
    for _ in range(len(spans)):
        extremum = temp_spans[0]
        i_extremum = 0
        for i, span in enumerate(temp_spans[1:]):
            if getattr(span, rel)(extremum): 
                extremum = span
                i_extremum = i+1
        sorted_spans.append(temp_spans.pop(i_extremum))
    return sorted_spans

This algorithm take the first element of the list, compare it with all the remaining ones to find the extremum (say the minimum if the order relation were \(\lt\) associated to the integer number for instance) among them, then append this extremum to the ordered list and withdraw it from the initial list (the .pop action at the end of the loop) and starts again with the next element. The order relation is a parameter, and one should pass it using the nomenclature of class, namely:

  • < corresponds to __lt__ (lower than magic-function)

  • > corresponds to __gt__ (greater than magic-function)

  • <= corresponds to __le__ (lower than or equal magic-function)

  • >= corresponds to __ge__ (greater than or equal magic-function)

Now we can order the spans according to the two non-overlapping orders > and <. We start with the > order

gt_order = sort_spans(spans.subspans, rel='__gt__')
gt_order
[Span('illustration', [(40,52)]),
 Span('for', [(36,39)]),
 Span('and', [(32,35)]),
 Span('demonstration', [(18,31)]),
 Span('for', [(14,17)]),
 Span('string', [(7,13)]),
 Span('Simple', [(0,6)])]

which classifies the different Span in a satisfactory way, the right-most first, then its successors in a reading order from right to left, …

We can re-order this sequence using the <-order

lt_order = sort_spans(gt_order, rel='__lt__')
lt_order
[Span('Simple', [(0,6)]),
 Span('string', [(7,13)]),
 Span('for', [(14,17)]),
 Span('demonstration', [(18,31)]),
 Span('and', [(32,35)]),
 Span('for', [(36,39)]),
 Span('illustration', [(40,52)])]

and this des the job as excpected.

What happens now if one use an order that is dedicated for overlapping Span on the above sequences, which contain only non-overlapping Span ? Well, try it: below is the example of >= order on the original spans sequence.

ge_order = sort_spans(spans.subspans, rel='__ge__')
ge_order
[Span('Simple', [(0,6)]),
 Span('string', [(7,13)]),
 Span('for', [(14,17)]),
 Span('demonstration', [(18,31)]),
 Span('and', [(32,35)]),
 Span('for', [(36,39)]),
 Span('illustration', [(40,52)])]

We see that the initial list is left untouched. This is because there is never in the process a situation where the >= relation returns True, and so the inner-loop of the sort_spans algorithm does nothing.

This is the same if one applies the <= order relation to the gt_ordered list.

le_order = sort_spans(gt_order, rel='__le__')
le_order
[Span('illustration', [(40,52)]),
 Span('for', [(36,39)]),
 Span('and', [(32,35)]),
 Span('demonstration', [(18,31)]),
 Span('for', [(14,17)]),
 Span('string', [(7,13)]),
 Span('Simple', [(0,6)])]

7.7.2. What are the reasons of these confusing definitions ?

The only reason for the confusing call if the limitation of Python magic-functions. Since there are only four magic functions conceptually related to an order, we choose to employ the four of them for each of the order relations we have in head. The reason why there is no more usefull order relations will be explained below.

We would like to recast that the notations introduced in the present section are confusing only if one does not remember what an order relation is. In fact to order any set, we only need to find a comparison relation (called an order relation in this context) and to apply it to all elements of the set (recall the construction of the algorithm sort_spans above). So our strategy is to use the different Python magic function not to construct an order relation, say \(\leq\) and its dual \(\geq\) and their strict restrictions \(\lt\) and \(\gt\), but instead to construct four different order relations.

7.7.3. Why is 4 a good number for constructing the order relations of a Span objects ?

It seems clear that one can invent any kind of ordering relation to order a set of Span object. But one can try to restrict the range of possibilities by thinking a bit about the possible application.

First of all, let us try to choose order relations which are mutually exclusive to each other. That is, we would like to find four order relations such that two Span objects must be in one and only one of the constructed order relations. Since a Span object is basically a start (we note the start position \(\alpha\) in the following) and a stop (that we will note \(\omega\) in the following), to compare two Span objects consists in comparing four integers. So four relations are in principle sufficient. In fact three order relations are sufficient since there are some constraints on the Span object (that is \(\alpha\leq\omega\) is always true).

One can see that by proposing the four following relations to compare the four parameters of the two Span \(S_{1}\left(\alpha_{1},\omega_{1}\right)\) and \(S_{2}\left(\alpha_{2},\omega_{2}\right)\) :

(7.1)\[\begin{align} A&:\alpha_{1}\leq\alpha_{2}\\\bar{A}&:\alpha_{2}<\alpha_{1}\\B&:\omega_{1}\leq\omega_{2}\\\bar{B}&:\omega_{2}<\omega_{1} \end{align}\]

which mutually exclusive: \(\bar{A}\) is the complementary of \(A\), and \(\bar{B}\) is the complementary of \(B\). now to compare \(S_{1}\) and \(S_{2}\) one simply have to combine \(A\) and \(B\), \(A\) and \(\bar{B}\), \(\bar{A}\) and \(B\) and finally \(\bar{A}\) and \(\bar{B}\) and attribute to each mutually exclusive order relations thus created a name, for instance \(R_{1,2,3,4}\), respectively. Then we have

(7.2)\[\begin{align} R_{1}&:\alpha_{1}\leq\alpha_{2}\cap\omega_{1}\leq\omega_{2}\\R_{2}&:\alpha_{1}\leq\alpha_{2}\cap\omega_{2}<\omega_{1}\\R_{3}&:\alpha_{2}<\alpha_{1}\cap\omega_{1}\leq\omega_{2}\\R_{4}&:\alpha_{2}<\alpha_{1}\cap\omega_{2}<\omega_{1} \end{align}\]

except \(R_{3}\) is never verified in practice (its stoping point is left to its starting one). So we just proove than at most, one can find three different order relations to describe all situations between two Span objects.

7.7.4. What were the guidelines for our construction ?

We choose to get four order relation for the following reasons :

  1. four relations is sufficient to make any collection of Span entirely sortable, even if one has to pass from one order relation to the next one to understand the ordering: that is, one can create a graph of ordering that will cover the complete set of Span constructed on top of a string

  2. four order relations make them meaningfull in term of reading order: the ordered span will interpretable in term of e.g. “span1 is read first, then span2 starts to be read, while span3 starts to be read before span2 is entirely read” (this reads in our notations span1 < span2 and span3 <= span2, here we didn’t described the order relation between span1 and span3, but such an order relation exists in virtue of the previous point)

  3. four order relations allows to construct dual order relations, that is if span1 < span2, then span2 > span1 (in mathematician parlance, we say that the strict order relation < is dual to the strict order relation >)

The way we defined the two order relations \(<\) (< in the package) and \(\prec\) (<= in the package) and their duals (\(>\)/> and \(\succ\)/>=) is given by the following relations

(7.3)\[\begin{align} S_{1}<S_{2}&\Leftrightarrow\omega_{1}\leq\alpha_{2}\\S_{1}>S_{2}&\Leftrightarrow\omega_{2}\leq\alpha_{1} \end{align}\]

and

(7.4)\[\begin{align} S_{1}\prec S_{2}&\Leftrightarrow\alpha_{1}<\alpha_{2}<\omega_{1}\cup\left(\alpha_{1}=\alpha_{2}\cap\omega_{1}<\omega_{2}\right)\\S_{1}\succ S_{2}&\Leftrightarrow\alpha_{2}<\alpha_{1}<\omega_{2}\cup\left(\alpha_{1}=\alpha_{2}\cap\omega_{2}<\omega_{1}\right) \end{align}\]

which are easilly showed to be dual to each other (simply invert the \(1\) and \(2\) indices of the first line to instantaneously get the definition of the second line) and mutually exclusive (the duality implies the mutual exclusivity, and the opposite of \(\omega_{1}\leq\alpha_{2}\) defining \(S_{1}<S_{2}\) is indeed the condition \(\alpha_{2}<\omega_{1}\) that is present on the definition of \(S_{1}\prec S_{2}\)).

Note that

  1. equal ranges of the span is forbidden in any order relation (except the two ranges being empty in the definition of \(<\)): the reason is that this is considered useless: a Span whose ranges attribute being empty is simply not a Span that one is able to read

  2. following previous remark: if a non-contiguous span present the same ranges than its comparing span, those two are not detected. Since this is the only case without order (because all other situations are mutualy exclusive by construction), this situation is in fact detected (see example above).

  3. because of the ordering we just established, it is possible to construct any graph on top of any string, by way of splitting and re-gluing the cuts

7.8. Conclusion

It is possible to order a set of Span objects, using three order relation that naturally describes the notion of reading: two of these order relations are strict and present a dual, the third one is self-dual (this is the equality relation that we shortly noticed its existence above). With the help of these order relations, it is posisble to describe any cutting of a text using a weighted directed graph, the direction being given by the duality of the order relation, and the weigth by the nature of the order relation itself. For reasonnable Span (that is non-empty ones, since those are excluded from the order relation), the graph representing any collection of Span constructed from a given parent string is entirely connected.

from datetime import datetime
print("Last modification {}".format(datetime.now().strftime("%c")))
Last modification Sat Jan 22 00:09:43 2022