Module `fusus.align`

ALignment of the Afifi and Lakhnawi editions of the Fusus

We have two versions of the Fusus text, obtained in wildly different ways:

AF the Afifi edition, obtained via the OCR pipeline;
LK the Lakhnawi edition, obtained by reverse engineering a textual PDF with unusual fonts and private use characters.

The results of both attempts have been cleaned and enriched by Cornelis van Lit, through visual inspection and manipulation in Pandas.

In this notebook, we align the results obtained by Cornelis. The essential result is an alignment table where the words of the LK are brought into correspondence with the words in AF.

A candidate tool to perform sequence alignment with is Collatex by Ronald Haentjes-Dekker. We experimented with it, see the collatexAfLk notebook, but it took a very long time to run.

However, the two editions are very similar, with very few transpositions but a lot of OCR errors. We have developed our own algorithm, which runs in about a second.

The cleaned LK and AF are present in the Text-Fabric resources fususl and fususa. In this way we have each word precisely numbered in both version, and we also have a latin transcription of the words, which eases the visual comparison of similar words. We use the latin transcription, only because the author of this notebook (Dirk Roorda) is not trained to read Arabic, but also beecause the Arabaic letters change shape depending on their position in the word, which makes a task like this needlessly complicated.

Edit distance

We need a measure to estimate how similar words are. A good measure is the edit distance, i.e. the amount of edit operations to change one word into another. Since one of the editions is the result of OCR, we expect (many) OCR errors, and for this the concept of edit distance is useful.

However, we also need the slightly more refined notion of ratio, which takes the lengths of the words into account. It roughly corresponds to proportion of the common part of the words in relation to the totality of both words under comparison. In other words, an edit distance of 2 between words of length 10 tells you that the words are rather similar, their ratio is high. But the same edit distance between words of length 3 means that the words are very different, their ratio is low.

The concepts ratio and edit distance can be computed by the Levenshtein algorithm, which is implemented in the Python module Levenshtein.

We have included this module in the fusus dependencies, but if for some reason yoou have not recently installed fusus, you may have to

pip3 install python-Levenshtein

The alignment table

The table itself is stored in a Python list alignment. The elements of this list are tuples with a slot number in LK and a corresponding slot number in AF and a measure of how well these slots correspond with each other.

Some words cannot be matched with the other source. In those cases the corresponding member of the tuple is the empty string.

That means that the alignment list becomes longer than each of the editions, and that it is not trivial to locate a specific slot in this table.

For that reason we maintain two indexes that map the slot numbers of both editions to positions in the alignment list.

Alignment algorithm

We align the editions by walking through both of them in parallel and performing comparison actions at each point.

Comparison actions may succeed, in which case we jump in both sources to the first point after the compared material.

If they fail, we try several jumps in the neighbourhood to see if the alignment catches up later.

If that fails, we stop the process prematurely and print the situation around the point of failure.

Comparison

We do not only compare single words, we also try out short combinations of words at both sides. We do this because the editions do not always agree on word boundaries.

So, we need to compute whether $n$ consecutive words left are similar to $m$ consecutive words right.

We set a boundary $C$ on the amount of words that we combine at each source.

We need to try out all possible combinations, from simplest and shortest to longest and most complex.

Every combination can be characterized by $(n, m)$, where $n$ is the number of words on the left and $m$ is the number of words on the right. $n$ and $m$ are in the range $1 \ldots C$.

Suppose $C = 3$, then we want to compare combinations in the following order:

combination	x
$(1, 1)$	–
$(1, 2)$	–
$(2, 1)$	–
$(2, 2)$	–
$(1, 3)$	–
$(3, 1)$	–
$(2, 3)$	–
$(3, 2)$	–
$(3, 3)$	–

In fact, we list all possible combinations and then sort them first by sum of the pair and then by decreasing difference of the pair.

We fix a C (called COMBI) and compute the sequence of combinations up front.

It turns out that 4 is a good value for this source. 3 is definitely too low. 5 is too high, because then the following is likely to happen.

Suppose we have to compare the sentence left to the one right.

left	right
A, BB, CCC, DDD, EEE	BB, CCC, DDD, EEE, X

The best alignment is to decide that A on the left is not matched with anything on the right, X on the right is not matched to anything on the left, and BB, CCC, DDD, EEE will be exact matches.

Buth when comparing combinations of length 5, we get the comparison

ABBCCCDDDEEE versus BBCCCDDDEEEX with distance and ratio:

left = "ABBCCCDDDEEE"
right = "BBCCCDDDEEEX"

distance(left, right)=2
ratio(left, right)=0.9166666666666666

whereas when we go no further than 4, we would get:

left = "ABBCCCDDD"
right = "BBCCCDDDEEE"

distance(left, right)=4
ratio(left, right)=0.8

The algorithm specifies limits in terms of distance and ratio to determine when a comparison succeeds or fails, and as you see, the combination of 5 words is much more likely to succeed than the combination of 4.

And we want it to fail in this case, because it is a suboptimal match.

Similarity

We compute the similarity of words by a function that reports whether the words are similar with respect to a given edit distance and ratio.

The two words match if their edit distance is at most the given edit distance and their ratio is at least the given ratio.

The function returns the decision, and the computed distance and ratio between the words.

Comparing

This is about making a comparison between the locations in both editions where we are currently at.

First we make a quick 1-1 comparison between the words in both editions at the current positions. If that fails, we check whether the next word at both sides match. If that fails, we try out all possible short combinations, using findCombi above.

If all fails, None is returned. In case of success, a tuple with comparison information is appended to the alignment table and the next positions in both editions are returned.

Looking up

If comparison fails, even after having tried all possible combinations of words, we are potentially stuck.

But before giving up, we can make a jump if we can match the current word left to a word right that is not far ahead. Or vice versa.

The following function checks whether there are successful jumps within a specific region and within a given strictness. Typically, this function is called with different strictnesses for different regions. For example, first we want to try short jumps with moderate strictness, and if that fails, we try out longer jumps with higher strictness.

Short jumps are tried out before long jumps, and we try the jumps in the left and right edition alternately.

If a jump is successful, the next position will be returned, and a catchup will be done for the edition in which the jump has been made. When there is no successful jump, None is returned.

The catchup is a bit complicated, because the comparison that led to the successful jump has already been appended to the alignment table. So we have to pop that first (it could be multiple entries in case the comparison involved a combination of words), and then we can do the catchup, and then we can push the comparison entries again. We also need to update the indexes between slot numbers and the alignment table entries.

Orchestrating

We can now define how we are going to orchestrate the comparisons that will build the alignment table.

First we define a function that does a single step or a limited number of steps, starting at given positions in the editions.

This function will then be called from a big loop.

The comparison consists of checking out a number of things under several strictness conditions. As soon as a check succeeds, we perform the actions associated with that, and continue to the next iteration.

First of all: is there a special case defined for this point? If so, follow its instructions.
Perform local and very strict comparisons.
Relax the strictness a little bit and try the local comparisons again
Try out small jumps with great strictness
Try out small jumps with lesser strictness
Try out bigger jumps with great strictness
Try out bigger jumps with lesser strictness
Fail!

Clearly, there is no guarantee that we reach the end without failing. When it fails, we have to inspect the failure and see what happens there. There are basically three solutions to overcome such roadblocks

Define a special case. Easy. Until it appears that too many special cases are needed.
Tweak the strictness levels in the algorithm. Easy, but risky. Things that went well may now go awry. It is not only about preventing the roadblocks, but also maintaining the quality of the output alignment table.
Change the orchestration: change the order of comparing and jumping, introduce more strictness levels, try more or less combinations. Invent new criteria for comparison. This is really difficult. The present orchestration is the result of quite some trial and error.

We have provided an analysis function for the alignment table that assesses its quality and reports where the bad stretches are. This is a very helpful aid in tweaking the parameters and defining special cases. More about this below.

The diff function

Finally we can write up the doDiff function which loops through the editions and produces an alignment table.

But you can also run call it for a specific pair of positions and a limited number of steps. Handy for debugging and tweaking the decision parameters.

You specify the start positions by means of the slot numbers in the LK and AF. You can specify the number of steps, or pass -1 in order to continue to the end.

And if you pass show=True, the alignment table will be printed after completion. Only do this if you run a limited number of steps.

Define the special cases

The special cases that you find in the notebook are the result of trial and error.

The keys are the LK slot numbers where the cases must be applied.

The values are the amount of words in LK and in AF that will be identified.

So 1000: (1100, 3, 4) means that at slot position 1000 in LK resp. 1100 in AF we take 3 resp. 4 consecutive words and declare that those 3 words in LK match those 3 words in AF.

If for some reason the algorithm never reaches a point where the current position in LF is 1000 and the current position in AF is 1100, the case will be reported as a failed case.

When you read the alignment table (by printDiff or printLines) you see the information on the basis of which you could define a special case.

Quality check

How did the alignment perform? It did complete, but what have we got? It could be just garbage.

Sanity

First of all we need to know whether all words of both LF and LK occur left and right, without gaps and duplications and in the right order. We check that. This is important, because the alignment algorithm is under intense evolution, and it could easily incur a flaw in which the material of the editions is not properly conserved.

Agreement

We provide information about the agreement of the words in both sources. How many words are there for which there is no counterpart in the other edition?

And how close are the words for which an alignment could be established?

Note that there are two reasons for bad agreement results:

The editions are really very different
The alignment is not optimal and fails to align many words that would have matched under another alignment strategy.

Bad stretches

Are there long stretches of poorly matching alignments? We are going to examine them.

If they contain many cases of left missing words and many cases of right missing words, they are suspect, because they might contain largely the same words, but the algorithm has failed to match them.

We show all suspect bad stretches. It is advisable to tweak the algorithm until all suspect bad stretches are gone. We have done so.

The remaining stretches are benign. We also show examples of benign bad strectches (at most three examples per size).

Expand source code Browse git

"""
.. include:: docs/about/alignment.md
"""

from Levenshtein import distance, ratio

import collections
import os

from tf.app import use


ORG = "among"
REPO = "fusus"
RELATIVE = "tf"
BASE = os.path.expanduser(f"~/github/{ORG}/{REPO}")
ALIGNMENT_DIR = f"{BASE}/ur/Fusus"

LK = "LK"
AF = "AF"

EDITIONS = {
    LK: "Lakhnawi",
    AF: "Afifi",
}

COMBINE = 4
LOOKAHEAD = 3


def similar(s1, s2, maxD, minR):
    if s1 == s2:
        return (True, 0, 1.0)

    if maxD == 0 or minR == 1:
        return (False, 77, 0.0)

    d = distance(s1, s2)
    r = ratio(s1, s2)
    return (d <= maxD and r >= minR, d, r)


class Alignment:
    """Aligns the words of the LK edition with those of the AF edition.

    The main method is `doDiffs` which produces the alignment table.
    """

    def __init__(self):
        self.getCombis(COMBINE)

    def readEditions(self, version, cases):
        """Read the material of both LK and AF editions.

        Parameters
        ----------
        version: string
            The version number of the source datasets for both editions.
            We assume that both editions are available as Text-Fabric datasets.

        cases: dict
            The special cases defined to help the alignment algorithm through
            difficult spots.

            The dictionary is keyed by LK slot number, and the value is a tuple
            consisting of a corresponding AF slot number, the amount of
            words to collect at the LK side, and the amount of words to collect at
            the AF side.

        Returns
        -------
        None
            But the Alignment object will have attributes set by which
            the other methods are capable of reading the data source.
        """

        self.version = version
        self.cases = cases
        self.alignmentPath = f"{ALIGNMENT_DIR}/alignment{version}.tsv"
        self.alignmentText = f"{ALIGNMENT_DIR}/alignment{version}.txt"
        self.alignmentTextBlocked = f"{ALIGNMENT_DIR}/alignment-incomplete{version}.txt"

        if not os.path.exists(ALIGNMENT_DIR):
            os.makedirs(self.alignmentPath, exist_ok=True)

        A = {}
        F = {}
        T = {}
        L = {}
        maxSlot = {}

        for (acro, name) in EDITIONS.items():
            A[acro] = use(
                f"among/fusus/tf/{name}:clone", writing="ara", version=version
            )
            F[acro] = A[acro].api.F
            T[acro] = A[acro].api.T
            L[acro] = A[acro].api.L
            maxSlot[acro] = F[acro].otype.maxSlot

        self.F_LK = F[LK]
        self.F_AF = F[AF]
        self.T_LK = T[LK]
        self.T_AF = T[AF]
        self.L_LK = L[LK]
        self.L_AF = L[AF]
        self.maxLK = maxSlot[LK]
        self.maxAF = maxSlot[AF]

        self.getTextLK = F[LK].lettersn.v
        self.getTextAF = F[AF].lettersn.v

    def getPosLK(self, slot):
        """Obtain the page and line numbers where a slot in the LK occurs.
        """

        T_LK = self.T_LK

        (piece, page, line) = T_LK.sectionFromNode(slot)
        return f"{page:>03}:{line:>02}"

    def getPosAF(self, slot):
        """Obtain the page and line numbers where a slot in the AF occurs.
        """
        T_AF = self.T_AF

        (page, block, line) = T_AF.sectionFromNode(slot)
        return f"{page:>03}:{line:>02}"

    def printLines(self, start=0, end=None):
        """Print specified entries in the alignment table.

        Parameters
        ----------
        start: integer, optional 0
            Position of first alignment entry that needs to be printed.
            If it is 0 or negative, it starts from the beginning.

        end: integer, optional `None`
            Position of last alignment entry that needs to be printed.
            If it is negative, it indicates a position that much
            before the end of the list.
            If it is absent, it indicates the last line of the list.
        """

        getTextLK = self.getTextLK
        getTextAF = self.getTextAF
        alignment = self.alignment

        if start < 0:
            start = 0
        if end is None or end > len(alignment):
            end = len(alignment)
        lines = []

        lines.append(
            "pag:ln|slot |cc|textLakhnawi        |@ed~rat|textAfifi           |cc| slot|pag:ln"
        )
        lines.append(
            "------|-----|--|--------------------|-------|--------------------|--|-----|------"
        )
        for (iLK, left, d, r, right, iAF) in alignment[start:end]:
            textLK = getTextLK(iLK) if iLK else ""
            textAF = getTextAF(iAF) if iAF else ""
            posLK = self.getPosLK(iLK) if iLK else ""
            posAF = self.getPosAF(iAF) if iAF else ""
            lines.append(
                f"{posLK:>6}|{iLK:>5}|{left:<2}|{textLK:>20}|@{d:<2}~{r:3.1f}|{textAF:<20}|{right:>2}|{iAF:>5} {posAF:<6}"
            )
        return "\n".join(lines)

    def printAlignment(self, path, asTsv=False):
        """Prints the whole alignment table to file.

        Parameters
        ----------
        path: string
            The path name of the file to which the information is printed.

        asTsv: boolean, optional `False
            If False, pretty prints the table to file, using
            `printLines` above.
            If True, write essential data only in tab separated format.
            The essential information is *slot in LK*, *distance*, *slot in AF*
        """
        alignment = self.alignment

        with open(path, "w") as fh:
            if asTsv:
                for item in alignment:
                    fh.write(
                        "\t".join(
                            f"{it:3.1f}" if i == 3 else str(it)
                            for (i, it) in enumerate(item)
                        )
                        + "\n"
                    )
            else:
                fh.write(self.printLines())
                fh.write("\n")

    def printDiff(self, before, after):
        """Print the last alignment entries plus what comes after.

        This function is useful if alignment fails at some point.
        It then shows what happened before the failure and how it looks
        after the failure.

        Parameters
        ----------
        before: integer
            The amount of alignment entries to print.
            They will be picked from the end of the current alignment table.
        after: integer
            The amount of slots after the last alignment entry that
            needs to be shown for each source.
        """
        alignment = self.alignment
        maxLK = self.maxLK
        maxAF = self.maxAF
        getTextLK = self.getTextLK
        getTextAF = self.getTextAF

        print(self.printLines(start=len(alignment) - before))
        print("^" * 67)
        lastLK = None
        lastAF = None
        for c in range(len(alignment) - 1, -1, -1):
            comp = alignment[c]
            if lastLK is None:
                if comp[0]:
                    lastLK = comp[0]
            if lastAF is None:
                if comp[-1]:
                    lastAF = comp[-1]
            if lastLK is not None and lastAF is not None:
                break
        if lastLK is not None and lastAF is not None:
            for i in range(after):
                iLK = lastLK + 1 + i
                iAF = lastAF + 1 + i
                textLK = getTextLK(iLK) if iLK <= maxLK else ""
                textAF = getTextAF(iAF) if iAF <= maxAF else ""
                d = distance(textLK, textAF)
                r = ratio(textLK, textAF)
                print(
                    f"{iLK:>5} =  {textLK:>20} @{d:>2}~{r:3.1f} {textAF:<20}  = {iAF:>5}"
                )

    def printCase(self, iLK, label):
        """Print the alignment entries that belong to a special case.

        Parameters
        ----------
        iLK: integer
            The slot number in LK that triggered a special case.
        label: string
            The kind of issue with this case that you want to report:
            FAILED or MISSING.
        """
        indexLK = self.indexLK
        indexAF = self.indexAF

        cases = self.cases
        case = cases[iLK]
        (iAF, cLK, cAF) = case
        start = min((indexLK[iLK], indexAF[iAF]))
        end = max((indexLK[iLK + cLK + 1], indexAF[iAF + cAF + 1]))
        print(f"{label} CASE: {iLK} vs {iAF}:")
        print(self.printLines(start=start, end=end))

    def getCombis(self, c):
        """Precompute a specification of all combinations that might be tried.

        When the alignment fails between single words, we try combinations
        of words left and right in a specific order.
        The result of this function lists the combinations that must be
        tried, where each combination is specified as a tuple
        of the number of words left and the number of words right
        that needs to be taken together for the matching.

        See also: `findCommbi`.
        """
        combis = []
        for i in range(1, c + 1):
            for j in range(1, c + 1):
                if i != 1 or j != 1:
                    combis.append((i, j))
        self.combis = tuple(
            sorted(combis, key=lambda x: (x[0] + x[1], abs(x[0] - x[1])))
        )

    def catchupAF(self, start, end):
        """Move the current position in the AF edition forward.

        While doing so, it adds an entry to the alignment table
        for each time the position is shifted by one.

        Parameters
        ----------
        start: integer
            From where to start shifting
        end: integer
            Where to end shifting
        """
        indexAF = self.indexAF
        alignment = self.alignment

        for i in range(start, end + 1):
            indexAF[i] = len(alignment)
            alignment.append(("", 0, 99, 0.0, "", i))

    def catchupLK(self, start, end):
        """Move the current position in the LK edition forward.

        While doing so, it adds an entry to the alignment table
        for each time the position is shifted by one.

        Parameters
        ----------
        start: integer
            From where to start shifting
        end: integer
            Where to end shifting
        """
        indexLK = self.indexLK
        alignment = self.alignment

        for i in range(start, end + 1):
            indexLK[i] = len(alignment)
            alignment.append((i, "", 99, 0.0, 0, ""))

    def doCase(self, iLK, iAF, debug=False):
        """Deal with a special case.

        Parameters
        ----------
        iLK: integer
            Current position in LK edition.
            A case is triggered when iLK is equal
            to a key in the cases table and if the same case
            has not already been applied.
            This last condition is relevant for special
            cases where the number of LK words that must be
            combined is 0. In that case, the current LK
            position does not shift, and we would be in an
            infinite loop.
        iAF: integer
            Current position in AF edition.
            If a case is triggered by iLK, but the AF position
            specified in the case does not match iAF,
            the case is skipped and is reported as a failed case.
        """
        indexLK = self.indexLK
        indexAF = self.indexAF
        alignment = self.alignment
        usedCases = self.usedCases
        failedCases = self.failedCases

        cases = self.cases

        if iLK not in cases:
            return None

        (givenIAF, cLK, cAF) = cases[iLK]
        if givenIAF != iAF:
            failedCases.add(iLK)
            return None

        usedCases.add(iLK)
        common = min((cLK, cAF))
        for i in range(max((cLK, cAF))):
            nAlignment = len(alignment)
            if i < common:
                indexLK[iLK + i] = nAlignment
                indexAF[iAF + i] = nAlignment
                alignment.append((iLK + i, cLK, 88, 0.0, cAF, iAF + i))
            elif i < cLK:
                indexLK[iLK + i] = nAlignment
                alignment.append((iLK + i, cLK, 88, 0.0, cAF, ""))
            else:
                indexAF[iAF + i] = nAlignment
                alignment.append(("", cLK, 88, 0.0, cAF, iAF + i))
        if debug:
            print(f"[{iLK}~{iAF}] special case ({cLK}, {cAF})")
        return (iLK + cLK, iAF + cAF)

    def findCombi(self, iLK, iAF, maxD, minR):
        """Tries out all possible combinations at this point until success.

        This is typically called when a direct match at the current slots
        in LK and AF fail.

        We are going to take together small amounts of words and match
        them instead. As soon as we have a match, we break out of the loop
        and use step to the new positions.

        See also: `getCombis` and `compare`.

        Parameters
        ----------
        iLK: integer
            Current slot position in LK edition
        iAF: integer
            Current slot position in AF edition
        maxD: integer
            maximum edit distance that we allow for the comparisons to succeed
        minR: integer
            minimum similarity ratio that we require for the comparisons to succeed
        """
        combis = self.combis
        maxLK = self.maxLK
        maxAF = self.maxAF
        getTextLK = self.getTextLK
        getTextAF = self.getTextAF
        alignment = self.alignment
        indexLK = self.indexLK
        indexAF = self.indexAF

        found = None

        for (cLK, cAF) in combis:
            if iLK + cLK > maxLK or iAF + cAF > maxAF:
                continue
            textLK = "".join(getTextLK(iLK + i) for i in range(cLK))
            textAF = "".join(getTextAF(iAF + i) for i in range(cAF))
            (isSimilar, d, r) = similar(textLK, textAF, maxD, minR)
            if isSimilar:
                found = (cLK, cAF)
                common = min((cLK, cAF))
                for i in range(max((cLK, cAF))):
                    nAlignment = len(alignment)
                    if i < common:
                        indexLK[iLK + i] = nAlignment
                        indexAF[iAF + i] = nAlignment
                        alignment.append((iLK + i, cLK, d, r, cAF, iAF + i))
                    elif i < cLK:
                        indexLK[iLK + i] = nAlignment
                        alignment.append((iLK + i, cLK, d, r, cAF, ""))
                    elif i < cAF:
                        indexAF[iAF + i] = nAlignment
                        alignment.append(("", cLK, d, r, cAF, iAF + i))
                break
        return found

    def compare(self, iLK, iAF, maxD, minR, debug=False):
        """Does a full comparison at a location, including between combinations.

        First a direct match between the words at the current positions in
        the LK and AF editions is tried. If that fails,
        combinations of words from those points onwards are tried.

        See also: `getCombis` and `findCombi`.

        Parameters
        ----------
        iLK: integer
            Current slot position in LK edition
        iAF: integer
            Current slot position in AF edition
        maxD: integer
            maximum edit distance that we allow for the comparisons to succeed
        minR: integer
            minimum similarity ratio that we require for the comparisons to succeed
        debug: boolean, optional `False`
            If True, print a statement indicating the result of the comparison.
        """
        maxLK = self.maxLK
        maxAF = self.maxAF
        getTextLK = self.getTextLK
        getTextAF = self.getTextAF
        alignment = self.alignment
        indexLK = self.indexLK
        indexAF = self.indexAF

        textLK = getTextLK(iLK)
        textAF = getTextAF(iAF)
        (isSimilar, d, r) = similar(textLK, textAF, maxD, minR)
        if isSimilar:
            nAlignment = len(alignment)
            indexLK[iLK] = nAlignment
            indexAF[iAF] = nAlignment
            alignment.append((iLK, "", d, r, "", iAF))
            if debug:
                print(
                    f"[{iLK}~{iAF}] single comparison with distance <= {maxD} and ratio >= {minR}"
                )
            return (iLK + 1, iAF + 1)

        if iLK < maxLK and iAF < maxAF:
            textLK = getTextLK(iLK + 1)
            textAF = getTextAF(iAF + 1)
            (isSimilarNext, dNext, rNext) = similar(textLK, textAF, maxD, minR)
            if isSimilarNext:
                nAlignment = len(alignment)
                indexLK[iLK] = nAlignment
                indexAF[iAF] = nAlignment
                alignment.append((iLK, "", dNext, rNext, "", iAF))
                if debug:
                    print(
                        f"[{iLK}~{iAF}] single comparison failed with distance <= {maxD} and ratio >= {minR}"
                    )
                nAlignment = len(alignment)
                indexLK[iLK + 1] = nAlignment
                indexAF[iAF + 1] = nAlignment
                alignment.append((iLK + 1, "", dNext, rNext, "", iAF + 1))
                if debug:
                    print(
                        f"[{iLK}~{iAF}] single comparison recovered with distance <= {maxD} and ratio >= {minR}"
                    )
                return (iLK + 2, iAF + 2)

        combi = self.findCombi(iLK, iAF, maxD, minR)
        if combi is not None:
            (cLK, cAF) = combi
            if debug:
                print(
                    f"[{iLK}~{iAF}] ({cLK}, {cAF}) comparison with distance <= {maxD} and ratio >= {minR}"
                )
            return (iLK + cLK, iAF + cAF)

        return None

    def lookup(self, iLK, iAF, maxD, minR, start, end, debug=False):
        """Performs a jump in both editions.

        Typically invoked when comparison at the current locations fail,
        even after having tried combinations.

        We compare the current word in one edition with those from a certain
        interval in the other edition. And vice versa, alternately.
        of subsequent words.

        Parameters
        ----------
        iLK: integer
            Current slot position in LK edition
        iAF: integer
            Current slot position in AF edition
        maxD: integer
            maximum edit distance that we allow for the comparisons to succeed
        minR: integer
            minimum similarity ratio that we require for the comparisons to succeed
        start: integer
            first position in the other edition where we start comparing
        end: integer
            last position in the other edition where we stop comparing
        debug: boolean, optional `False`
            If True, print a statement indicating the result of the lookup.
        """
        maxLK = self.maxLK
        maxAF = self.maxAF
        alignment = self.alignment
        indexLK = self.indexLK
        indexAF = self.indexAF

        step = None

        for i in range(start, end + 1):
            prevAlignmentIndex = len(alignment)

            if iAF + i <= maxAF:
                step = self.compare(iLK, iAF + i, maxD, minR, debug=debug)
                if step:
                    if debug:
                        print(f"[{iLK}~{iAF}] right {i}-jump to {iAF + i}")
                    thisAlignment = list(alignment[prevAlignmentIndex:])
                    alignment[prevAlignmentIndex:] = []

                    self.catchupAF(iAF, iAF + i - 1)
                    for thisComp in thisAlignment:
                        nAlignment = len(alignment)
                        thisLK = thisComp[0]
                        thisAF = thisComp[-1]
                        if thisLK:
                            indexLK[thisLK] = nAlignment
                        if thisAF:
                            indexAF[thisAF] = nAlignment
                        alignment.append(thisComp)
                    break

            if iLK + i <= maxLK:
                step = self.compare(iLK + i, iAF, maxD, minR, debug=debug)
                if step:
                    if debug:
                        print(f"[{iLK}~{iAF}] left {i}-jump to {iLK + i}")
                    thisAlignment = list(alignment[prevAlignmentIndex:])
                    alignment[prevAlignmentIndex:] = []

                    self.catchupLK(iLK, iLK + i - 1)
                    for thisComp in thisAlignment:
                        nAlignment = len(alignment)
                        thisLK = thisComp[0]
                        thisAF = thisComp[-1]
                        if thisLK:
                            indexLK[thisLK] = nAlignment
                        if thisAF:
                            indexAF[thisAF] = nAlignment
                        alignment.append(thisComp)
                    break
        return step

    def doDiffs(self, startLK=1, startAF=1, steps=-1, show=False, debug=False):
        """Main loop of the alignment process.

        Without optional parameters, performs the whole alignment until
        it is completed or fails.

        But you can also run a few alignment steps, from arbitrary positions
        and record the decision steps. This is handy for debugging and
        exploring.

        !!! hint "doDiff"
            This function defines an inner function `doDiff`, which
            contains the logic of a single alignment step.
            That function `doDiff` issues a sequence of `compare` and `lookup`
            commands with various strictnesses and various jump sizes,
            which will be tried in order.

            Here you can finetune the amount of looseness you allow in a comparison
            before jumping from that position.
            You can first compare strictly, then jump away a short distance while
            still comparing strictly, and then start comparing more loosely, and
            then jump away a longer distance, possibly a bit more loosely.
            That is a matter of trial and error.

            The current implementation of `doDiff` matches the present LK and AF well,
            especially in conjunction with the special cases that have been defined.

        Parameters
        ----------
        startLK: integer, optional 1
            Start position in LK edition. If not passed, starts from the beginning.
        startAF: integer, optional 1
            Start position in AF edition. If not passed, starts from the beginning.
        steps: integer, optional `-1`
            The number of alignment steps to perform.
            If not passed or -1, there is no limit on the steps.
        show: boolean, optional False
            If True, prints the resulting alignment table to the screen.
            Only do this when you produce a small part of the alignment table!
            maximum edit distance that we allow for the comparisons to succeed
        debug: boolean, optional `False`
            If True, print a statement indicating the result of the each decision
            that has been taken during the alignment process.
        """
        maxLK = self.maxLK
        maxAF = self.maxAF
        cases = self.cases
        alignmentPath = self.alignmentPath
        alignmentText = self.alignmentText
        alignmentTextBlocked = self.alignmentTextBlocked

        self.alignment = []
        self.indexLK = {}
        self.indexAF = {}
        self.decisions = collections.Counter()
        self.usedCases = set()
        self.failedCases = set()

        alignment = self.alignment
        indexLK = self.indexLK
        decisions = self.decisions
        usedCases = self.usedCases
        failedCases = self.failedCases

        lastCase = None
        step = (startLK, startAF)

        it = 0

        def doDiff(iLK, iAF):
            decision = 0
            if iLK > maxLK or iAF > maxAF:
                if iAF < maxAF:
                    self.catchupAF(iAF, maxAF)
                if iLK < maxLK:
                    self.catchupLK(iLK, maxLK)
                decisions[decision] += 1
                return True

            nonlocal lastCase

            decision += 1

            if lastCase != iLK:
                step = self.doCase(iLK, iAF, debug=debug)
                if step:
                    lastCase = iLK
                    decisions[decision] += 1
                    return step

            step = self.compare(iLK, iAF, 0, 1.0, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.compare(iLK, iAF, 1, 0.8, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 0, 1.0, 1, 5, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.compare(iLK, iAF, 1, 0.8, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.compare(iLK, iAF, 2, 0.7, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.compare(iLK, iAF, 3, 0.6, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 0, 1.0, 1, 10, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 1, 0.8, 1, 10, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 0, 1.0, 10, 20, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 1, 0.8, 10, 20, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 2, 0.7, 1, 20, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 1, 0.8, 20, 100, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 2, 0.7, 20, 100, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            return False

        while it != steps:
            it += 1
            step = doDiff(*step)

            if step is True:
                self.printAlignment(alignmentText)
                self.printAlignment(alignmentPath, asTsv=True)
                print(f"Alignment complete, {len(alignment)} entries.")
                break
            elif step is False:
                self.printAlignment(alignmentTextBlocked)
                print(f"Alignment blocked, {len(alignment)} entries.")
                self.printDiff(20, 20)
                break

        endLK = startLK if len(indexLK) == 0 else max(indexLK)
        possibleCases = set(range(startLK, endLK + 1))
        casesSet = set(cases)
        relevantCases = casesSet & possibleCases
        missedCases = relevantCases - usedCases - failedCases

        if failedCases:
            for iLK in sorted(failedCases):
                self.printCase(iLK, "FAILED")
        if missedCases:
            for iLK in sorted(missedCases):
                self.printCase(iLK, "MISSED")

        if not failedCases and not missedCases:
            if not relevantCases:
                print("No special cases defined for this stretch")
            else:
                print(
                    f"Special cases: all relevant {len(relevantCases)} cases "
                    "defined, encountered, and applied"
                )

        if show:
            print(self.printLines())

        for d in sorted(decisions):
            n = decisions[d]
            print(f"{d:>2} taken: {n:>5} x")

    def analyseStretch(self, start, end):
        """Analyse a stretch in the alignment table and reports whether it is suspect.

        Stretches are suspect if they contain both a number entries where an LK
        word is missing, and a number of entries where an AF word is missing.

        This is usually an indication that the alignment has gone out of sync.

        Parameters
        ----------
        start: integer
            Index in the alignment list where to start
        end: integer
            Index in the alignment list where to stop
        """
        alignment = self.alignment

        total = 0
        good = 0
        onlyLK = 0
        onlyAF = 0

        for (iLK, left, d, r, right, iAF) in alignment[start : end + 1]:
            total += 1
            if not iLK:
                onlyAF += 1
            if not iAF:
                onlyLK += 1
            if d == 0:
                good += 1

        suspect = onlyAF > 1 and onlyLK > 1 and onlyAF + onlyLK > 5
        return suspect

    def check(self):
        """Comprehensive quality assessment of the alignment table.

        Reports various statistics on how close the matches are overall.

        Checks the minimum requirement of sanity:
        All words in the LK must occur without duplications in the right order
        int the alignment table. Same for the AF words.

        Reports where the combinations are: the cases where more than one word
        in the LK has matched with more than one word in the AF.
        Or where single words have been forced to match, or where words
        in one edition have no counterpart in the other edition.
        Gives at most three examples of all distinct combination patterns.

        Identifies bad strectches: chunks of entries within the alignment
        table where words do not match for various reasons.
        The identification is a bit loose in the sense that bad stretches
        with a single perfect match in between are combined.

        Some bad stretches are suspect (see `analyseStretch`).
        All suspect stretches are shown, and at most three examples
        of the benign (non-suspsect) bad stretches are shown.
        """
        maxLK = self.maxLK
        maxAF = self.maxAF
        alignment = self.alignment

        errors = {}
        prevILK = 0
        prevIAF = 0

        where = collections.Counter()
        agreement = collections.Counter()
        badStretches = collections.defaultdict(lambda: [])
        combinations = collections.defaultdict(lambda: [])

        startBad = 0
        startCombi = 0
        nCombi = 0

        for (c, (iLK, left, d, r, right, iAF)) in enumerate(alignment):
            thisBad = d > 0 or not iLK or not iAF
            hasLeft = left != ""
            hasRight = right != ""

            # a good line between bad lines is counted as bad
            if not thisBad and startBad:
                nextGood = True
                for j in range(1, LOOKAHEAD + 1):
                    if c + j < len(alignment):
                        compJ = alignment[c + j]
                        if compJ[2] > 0 or not compJ[0] or not compJ[-1]:
                            nextGood = False
                            break
                if not nextGood:
                    thisBad = True
            if startBad:
                if not thisBad:
                    badStretches[c - startBad].append(startBad)
                    startBad = 0
            else:
                if thisBad:
                    startBad = c

            if startCombi and nCombi == c - startCombi:
                startCombi = 0
                nCombi = 0
            if startCombi == 0:
                thisCombi = hasLeft and left > 1 or hasRight and right > 1
                if thisCombi:
                    combinations[(left, right)].append(c)
                    startCombi = c
                    nCombi = max((left, right))

            agreement[d] += 1

            if iLK:
                if iLK != prevILK + 1:
                    errors.setdefault("wrong iLK", []).append(
                        f"{c:>5}: Expected {prevILK + 1}, found {iLK}"
                    )
                prevILK = iLK
                if iAF:
                    where["both"] += 1
            else:
                where[AF] += 1
            if iAF:
                if iAF != prevIAF + 1:
                    errors.setdefault("wrong iAF", []).append(
                        f"{c:>5}: Expected {prevIAF + 1}, found {iAF}"
                    )
                prevIAF = iAF
            else:
                where[LK] += 1

        if startBad:
            badStretches[len(alignment) - startBad].append(startBad)

        if prevILK < maxLK:
            errors.setdefault("missing iLKs at the end", []).append(
                f"last is {prevILK}, expected {maxLK}"
            )
        elif prevILK > maxLK:
            errors.setdefault("too many iLKs at the end", []).append(
                f"last is {prevILK}, expected {maxLK}"
            )
        if prevIAF < maxAF:
            errors.setdefault("missing iAFs at the end", []).append(
                f"last is {prevIAF}, expected {maxAF}"
            )
        elif prevIAF > maxAF:
            errors.setdefault("too many iAFs at the end", []).append(
                f"last is {prevIAF}, expected {maxAF}"
            )

        print("\nSANITY\n")
        if not errors:
            print("All OK")
        else:
            for (kind, msgs) in errors.items():
                print(f"ERROR {kind} ({len(msgs):>5}x):")
                for msg in msgs[0:10]:
                    print(f"\t{msg}")
                if len(msgs) > 10:
                    print(f"\t ... and {len(msgs) - 10} more ...")

        print("\nAGREEMENT\n")
        print("Where are the words?\n")
        print(f"\t{LK}-only: {where[LK]:>5} slots")
        print(f"\t{AF}-only: {where[AF]:>5} slots")
        print(f"\tboth:    {where['both']:>5} slots")

        print("\nHow well is the agreement?\n")
        for (d, n) in sorted(agreement.items()):
            print(f"edit distance {d:>3} : {n:>5} words")
        print("NB: 88 are special cases that have been declared explicitly")
        print("NB: 99 are words that have no counterpart in the other edition")

        print("\nCOMBINATIONS\n")
        print("What combination alignments are there and how many?")
        for comb in sorted(combinations):
            cs = combinations[comb]
            (left, right) = comb
            print(f"\t({left:>2}, {right:>2}) : {len(cs):>4} x :")
            for (i, c) in enumerate(cs[0:3]):
                print(f"EXAMPLE {i + 1}:")
                print(
                    self.printLines(
                        start=max((1, c - 2)),
                        end=min((len(alignment), c + 2 + max(comb))),
                    )
                )
                print("")

        print("\nBAD STRETCHES\n")
        print("How many of which size?\n")
        allSuspects = []
        someBenigns = []
        for (size, starts) in sorted(badStretches.items(), key=lambda x: (-x[0], x[1])):
            suspects = {
                start: size
                for start in starts
                if self.analyseStretch(start, start + size)
            }
            benigns = {start: size for start in starts if start not in suspects}
            allSuspects.extend(
                [(start, start + size) for (start, size) in suspects.items()]
            )
            someBenigns.extend(
                [(start, start + size) for (start, size) in list(benigns.items())[0:3]]
            )
            examples = ", ".join(str(start) for start in list(suspects.keys())[0:3])
            if not suspects:
                examples = ", ".join(str(start) for start in list(benigns.keys())[0:3])
            print(
                f"bad stretches of size {size:>3} : {len(suspects):>4} suspect of total {len(starts):>4} x see e.g. {examples}"
            )

        print(
            f"\nShowing all {len(allSuspects)} inversion suspects"
            if len(allSuspects)
            else "\nNo suspect bad stretches\n"
        )
        for (i, (start, end)) in enumerate(reversed(allSuspects)):
            print(f"\nSUSPECT {i + 1:>2}")
            print(self.printLines(max((1, start - 5)), min((len(alignment), end + 5))))
        print(
            f"\nShowing some ({len(someBenigns)}) benign examples"
            if len(someBenigns)
            else "\nNo bad stretches\n"
        )
        for (i, (start, end)) in enumerate(someBenigns):
            print(f"\nBENIGN {i + 1:>2}")
            print(self.printLines(max((1, start - 2)), min((len(alignment), end + 2))))

Functions

def similar(s1, s2, maxD, minR)

Expand source code Browse git

def similar(s1, s2, maxD, minR):
    if s1 == s2:
        return (True, 0, 1.0)

    if maxD == 0 or minR == 1:
        return (False, 77, 0.0)

    d = distance(s1, s2)
    r = ratio(s1, s2)
    return (d <= maxD and r >= minR, d, r)

Classes

class Alignment

Aligns the words of the LK edition with those of the AF edition.

The main method is doDiffs which produces the alignment table.

Expand source code Browse git

class Alignment:
    """Aligns the words of the LK edition with those of the AF edition.

    The main method is `doDiffs` which produces the alignment table.
    """

    def __init__(self):
        self.getCombis(COMBINE)

    def readEditions(self, version, cases):
        """Read the material of both LK and AF editions.

        Parameters
        ----------
        version: string
            The version number of the source datasets for both editions.
            We assume that both editions are available as Text-Fabric datasets.

        cases: dict
            The special cases defined to help the alignment algorithm through
            difficult spots.

            The dictionary is keyed by LK slot number, and the value is a tuple
            consisting of a corresponding AF slot number, the amount of
            words to collect at the LK side, and the amount of words to collect at
            the AF side.

        Returns
        -------
        None
            But the Alignment object will have attributes set by which
            the other methods are capable of reading the data source.
        """

        self.version = version
        self.cases = cases
        self.alignmentPath = f"{ALIGNMENT_DIR}/alignment{version}.tsv"
        self.alignmentText = f"{ALIGNMENT_DIR}/alignment{version}.txt"
        self.alignmentTextBlocked = f"{ALIGNMENT_DIR}/alignment-incomplete{version}.txt"

        if not os.path.exists(ALIGNMENT_DIR):
            os.makedirs(self.alignmentPath, exist_ok=True)

        A = {}
        F = {}
        T = {}
        L = {}
        maxSlot = {}

        for (acro, name) in EDITIONS.items():
            A[acro] = use(
                f"among/fusus/tf/{name}:clone", writing="ara", version=version
            )
            F[acro] = A[acro].api.F
            T[acro] = A[acro].api.T
            L[acro] = A[acro].api.L
            maxSlot[acro] = F[acro].otype.maxSlot

        self.F_LK = F[LK]
        self.F_AF = F[AF]
        self.T_LK = T[LK]
        self.T_AF = T[AF]
        self.L_LK = L[LK]
        self.L_AF = L[AF]
        self.maxLK = maxSlot[LK]
        self.maxAF = maxSlot[AF]

        self.getTextLK = F[LK].lettersn.v
        self.getTextAF = F[AF].lettersn.v

    def getPosLK(self, slot):
        """Obtain the page and line numbers where a slot in the LK occurs.
        """

        T_LK = self.T_LK

        (piece, page, line) = T_LK.sectionFromNode(slot)
        return f"{page:>03}:{line:>02}"

    def getPosAF(self, slot):
        """Obtain the page and line numbers where a slot in the AF occurs.
        """
        T_AF = self.T_AF

        (page, block, line) = T_AF.sectionFromNode(slot)
        return f"{page:>03}:{line:>02}"

    def printLines(self, start=0, end=None):
        """Print specified entries in the alignment table.

        Parameters
        ----------
        start: integer, optional 0
            Position of first alignment entry that needs to be printed.
            If it is 0 or negative, it starts from the beginning.

        end: integer, optional `None`
            Position of last alignment entry that needs to be printed.
            If it is negative, it indicates a position that much
            before the end of the list.
            If it is absent, it indicates the last line of the list.
        """

        getTextLK = self.getTextLK
        getTextAF = self.getTextAF
        alignment = self.alignment

        if start < 0:
            start = 0
        if end is None or end > len(alignment):
            end = len(alignment)
        lines = []

        lines.append(
            "pag:ln|slot |cc|textLakhnawi        |@ed~rat|textAfifi           |cc| slot|pag:ln"
        )
        lines.append(
            "------|-----|--|--------------------|-------|--------------------|--|-----|------"
        )
        for (iLK, left, d, r, right, iAF) in alignment[start:end]:
            textLK = getTextLK(iLK) if iLK else ""
            textAF = getTextAF(iAF) if iAF else ""
            posLK = self.getPosLK(iLK) if iLK else ""
            posAF = self.getPosAF(iAF) if iAF else ""
            lines.append(
                f"{posLK:>6}|{iLK:>5}|{left:<2}|{textLK:>20}|@{d:<2}~{r:3.1f}|{textAF:<20}|{right:>2}|{iAF:>5} {posAF:<6}"
            )
        return "\n".join(lines)

    def printAlignment(self, path, asTsv=False):
        """Prints the whole alignment table to file.

        Parameters
        ----------
        path: string
            The path name of the file to which the information is printed.

        asTsv: boolean, optional `False
            If False, pretty prints the table to file, using
            `printLines` above.
            If True, write essential data only in tab separated format.
            The essential information is *slot in LK*, *distance*, *slot in AF*
        """
        alignment = self.alignment

        with open(path, "w") as fh:
            if asTsv:
                for item in alignment:
                    fh.write(
                        "\t".join(
                            f"{it:3.1f}" if i == 3 else str(it)
                            for (i, it) in enumerate(item)
                        )
                        + "\n"
                    )
            else:
                fh.write(self.printLines())
                fh.write("\n")

    def printDiff(self, before, after):
        """Print the last alignment entries plus what comes after.

        This function is useful if alignment fails at some point.
        It then shows what happened before the failure and how it looks
        after the failure.

        Parameters
        ----------
        before: integer
            The amount of alignment entries to print.
            They will be picked from the end of the current alignment table.
        after: integer
            The amount of slots after the last alignment entry that
            needs to be shown for each source.
        """
        alignment = self.alignment
        maxLK = self.maxLK
        maxAF = self.maxAF
        getTextLK = self.getTextLK
        getTextAF = self.getTextAF

        print(self.printLines(start=len(alignment) - before))
        print("^" * 67)
        lastLK = None
        lastAF = None
        for c in range(len(alignment) - 1, -1, -1):
            comp = alignment[c]
            if lastLK is None:
                if comp[0]:
                    lastLK = comp[0]
            if lastAF is None:
                if comp[-1]:
                    lastAF = comp[-1]
            if lastLK is not None and lastAF is not None:
                break
        if lastLK is not None and lastAF is not None:
            for i in range(after):
                iLK = lastLK + 1 + i
                iAF = lastAF + 1 + i
                textLK = getTextLK(iLK) if iLK <= maxLK else ""
                textAF = getTextAF(iAF) if iAF <= maxAF else ""
                d = distance(textLK, textAF)
                r = ratio(textLK, textAF)
                print(
                    f"{iLK:>5} =  {textLK:>20} @{d:>2}~{r:3.1f} {textAF:<20}  = {iAF:>5}"
                )

    def printCase(self, iLK, label):
        """Print the alignment entries that belong to a special case.

        Parameters
        ----------
        iLK: integer
            The slot number in LK that triggered a special case.
        label: string
            The kind of issue with this case that you want to report:
            FAILED or MISSING.
        """
        indexLK = self.indexLK
        indexAF = self.indexAF

        cases = self.cases
        case = cases[iLK]
        (iAF, cLK, cAF) = case
        start = min((indexLK[iLK], indexAF[iAF]))
        end = max((indexLK[iLK + cLK + 1], indexAF[iAF + cAF + 1]))
        print(f"{label} CASE: {iLK} vs {iAF}:")
        print(self.printLines(start=start, end=end))

    def getCombis(self, c):
        """Precompute a specification of all combinations that might be tried.

        When the alignment fails between single words, we try combinations
        of words left and right in a specific order.
        The result of this function lists the combinations that must be
        tried, where each combination is specified as a tuple
        of the number of words left and the number of words right
        that needs to be taken together for the matching.

        See also: `findCommbi`.
        """
        combis = []
        for i in range(1, c + 1):
            for j in range(1, c + 1):
                if i != 1 or j != 1:
                    combis.append((i, j))
        self.combis = tuple(
            sorted(combis, key=lambda x: (x[0] + x[1], abs(x[0] - x[1])))
        )

    def catchupAF(self, start, end):
        """Move the current position in the AF edition forward.

        While doing so, it adds an entry to the alignment table
        for each time the position is shifted by one.

        Parameters
        ----------
        start: integer
            From where to start shifting
        end: integer
            Where to end shifting
        """
        indexAF = self.indexAF
        alignment = self.alignment

        for i in range(start, end + 1):
            indexAF[i] = len(alignment)
            alignment.append(("", 0, 99, 0.0, "", i))

    def catchupLK(self, start, end):
        """Move the current position in the LK edition forward.

        While doing so, it adds an entry to the alignment table
        for each time the position is shifted by one.

        Parameters
        ----------
        start: integer
            From where to start shifting
        end: integer
            Where to end shifting
        """
        indexLK = self.indexLK
        alignment = self.alignment

        for i in range(start, end + 1):
            indexLK[i] = len(alignment)
            alignment.append((i, "", 99, 0.0, 0, ""))

    def doCase(self, iLK, iAF, debug=False):
        """Deal with a special case.

        Parameters
        ----------
        iLK: integer
            Current position in LK edition.
            A case is triggered when iLK is equal
            to a key in the cases table and if the same case
            has not already been applied.
            This last condition is relevant for special
            cases where the number of LK words that must be
            combined is 0. In that case, the current LK
            position does not shift, and we would be in an
            infinite loop.
        iAF: integer
            Current position in AF edition.
            If a case is triggered by iLK, but the AF position
            specified in the case does not match iAF,
            the case is skipped and is reported as a failed case.
        """
        indexLK = self.indexLK
        indexAF = self.indexAF
        alignment = self.alignment
        usedCases = self.usedCases
        failedCases = self.failedCases

        cases = self.cases

        if iLK not in cases:
            return None

        (givenIAF, cLK, cAF) = cases[iLK]
        if givenIAF != iAF:
            failedCases.add(iLK)
            return None

        usedCases.add(iLK)
        common = min((cLK, cAF))
        for i in range(max((cLK, cAF))):
            nAlignment = len(alignment)
            if i < common:
                indexLK[iLK + i] = nAlignment
                indexAF[iAF + i] = nAlignment
                alignment.append((iLK + i, cLK, 88, 0.0, cAF, iAF + i))
            elif i < cLK:
                indexLK[iLK + i] = nAlignment
                alignment.append((iLK + i, cLK, 88, 0.0, cAF, ""))
            else:
                indexAF[iAF + i] = nAlignment
                alignment.append(("", cLK, 88, 0.0, cAF, iAF + i))
        if debug:
            print(f"[{iLK}~{iAF}] special case ({cLK}, {cAF})")
        return (iLK + cLK, iAF + cAF)

    def findCombi(self, iLK, iAF, maxD, minR):
        """Tries out all possible combinations at this point until success.

        This is typically called when a direct match at the current slots
        in LK and AF fail.

        We are going to take together small amounts of words and match
        them instead. As soon as we have a match, we break out of the loop
        and use step to the new positions.

        See also: `getCombis` and `compare`.

        Parameters
        ----------
        iLK: integer
            Current slot position in LK edition
        iAF: integer
            Current slot position in AF edition
        maxD: integer
            maximum edit distance that we allow for the comparisons to succeed
        minR: integer
            minimum similarity ratio that we require for the comparisons to succeed
        """
        combis = self.combis
        maxLK = self.maxLK
        maxAF = self.maxAF
        getTextLK = self.getTextLK
        getTextAF = self.getTextAF
        alignment = self.alignment
        indexLK = self.indexLK
        indexAF = self.indexAF

        found = None

        for (cLK, cAF) in combis:
            if iLK + cLK > maxLK or iAF + cAF > maxAF:
                continue
            textLK = "".join(getTextLK(iLK + i) for i in range(cLK))
            textAF = "".join(getTextAF(iAF + i) for i in range(cAF))
            (isSimilar, d, r) = similar(textLK, textAF, maxD, minR)
            if isSimilar:
                found = (cLK, cAF)
                common = min((cLK, cAF))
                for i in range(max((cLK, cAF))):
                    nAlignment = len(alignment)
                    if i < common:
                        indexLK[iLK + i] = nAlignment
                        indexAF[iAF + i] = nAlignment
                        alignment.append((iLK + i, cLK, d, r, cAF, iAF + i))
                    elif i < cLK:
                        indexLK[iLK + i] = nAlignment
                        alignment.append((iLK + i, cLK, d, r, cAF, ""))
                    elif i < cAF:
                        indexAF[iAF + i] = nAlignment
                        alignment.append(("", cLK, d, r, cAF, iAF + i))
                break
        return found

    def compare(self, iLK, iAF, maxD, minR, debug=False):
        """Does a full comparison at a location, including between combinations.

        First a direct match between the words at the current positions in
        the LK and AF editions is tried. If that fails,
        combinations of words from those points onwards are tried.

        See also: `getCombis` and `findCombi`.

        Parameters
        ----------
        iLK: integer
            Current slot position in LK edition
        iAF: integer
            Current slot position in AF edition
        maxD: integer
            maximum edit distance that we allow for the comparisons to succeed
        minR: integer
            minimum similarity ratio that we require for the comparisons to succeed
        debug: boolean, optional `False`
            If True, print a statement indicating the result of the comparison.
        """
        maxLK = self.maxLK
        maxAF = self.maxAF
        getTextLK = self.getTextLK
        getTextAF = self.getTextAF
        alignment = self.alignment
        indexLK = self.indexLK
        indexAF = self.indexAF

        textLK = getTextLK(iLK)
        textAF = getTextAF(iAF)
        (isSimilar, d, r) = similar(textLK, textAF, maxD, minR)
        if isSimilar:
            nAlignment = len(alignment)
            indexLK[iLK] = nAlignment
            indexAF[iAF] = nAlignment
            alignment.append((iLK, "", d, r, "", iAF))
            if debug:
                print(
                    f"[{iLK}~{iAF}] single comparison with distance <= {maxD} and ratio >= {minR}"
                )
            return (iLK + 1, iAF + 1)

        if iLK < maxLK and iAF < maxAF:
            textLK = getTextLK(iLK + 1)
            textAF = getTextAF(iAF + 1)
            (isSimilarNext, dNext, rNext) = similar(textLK, textAF, maxD, minR)
            if isSimilarNext:
                nAlignment = len(alignment)
                indexLK[iLK] = nAlignment
                indexAF[iAF] = nAlignment
                alignment.append((iLK, "", dNext, rNext, "", iAF))
                if debug:
                    print(
                        f"[{iLK}~{iAF}] single comparison failed with distance <= {maxD} and ratio >= {minR}"
                    )
                nAlignment = len(alignment)
                indexLK[iLK + 1] = nAlignment
                indexAF[iAF + 1] = nAlignment
                alignment.append((iLK + 1, "", dNext, rNext, "", iAF + 1))
                if debug:
                    print(
                        f"[{iLK}~{iAF}] single comparison recovered with distance <= {maxD} and ratio >= {minR}"
                    )
                return (iLK + 2, iAF + 2)

        combi = self.findCombi(iLK, iAF, maxD, minR)
        if combi is not None:
            (cLK, cAF) = combi
            if debug:
                print(
                    f"[{iLK}~{iAF}] ({cLK}, {cAF}) comparison with distance <= {maxD} and ratio >= {minR}"
                )
            return (iLK + cLK, iAF + cAF)

        return None

    def lookup(self, iLK, iAF, maxD, minR, start, end, debug=False):
        """Performs a jump in both editions.

        Typically invoked when comparison at the current locations fail,
        even after having tried combinations.

        We compare the current word in one edition with those from a certain
        interval in the other edition. And vice versa, alternately.
        of subsequent words.

        Parameters
        ----------
        iLK: integer
            Current slot position in LK edition
        iAF: integer
            Current slot position in AF edition
        maxD: integer
            maximum edit distance that we allow for the comparisons to succeed
        minR: integer
            minimum similarity ratio that we require for the comparisons to succeed
        start: integer
            first position in the other edition where we start comparing
        end: integer
            last position in the other edition where we stop comparing
        debug: boolean, optional `False`
            If True, print a statement indicating the result of the lookup.
        """
        maxLK = self.maxLK
        maxAF = self.maxAF
        alignment = self.alignment
        indexLK = self.indexLK
        indexAF = self.indexAF

        step = None

        for i in range(start, end + 1):
            prevAlignmentIndex = len(alignment)

            if iAF + i <= maxAF:
                step = self.compare(iLK, iAF + i, maxD, minR, debug=debug)
                if step:
                    if debug:
                        print(f"[{iLK}~{iAF}] right {i}-jump to {iAF + i}")
                    thisAlignment = list(alignment[prevAlignmentIndex:])
                    alignment[prevAlignmentIndex:] = []

                    self.catchupAF(iAF, iAF + i - 1)
                    for thisComp in thisAlignment:
                        nAlignment = len(alignment)
                        thisLK = thisComp[0]
                        thisAF = thisComp[-1]
                        if thisLK:
                            indexLK[thisLK] = nAlignment
                        if thisAF:
                            indexAF[thisAF] = nAlignment
                        alignment.append(thisComp)
                    break

            if iLK + i <= maxLK:
                step = self.compare(iLK + i, iAF, maxD, minR, debug=debug)
                if step:
                    if debug:
                        print(f"[{iLK}~{iAF}] left {i}-jump to {iLK + i}")
                    thisAlignment = list(alignment[prevAlignmentIndex:])
                    alignment[prevAlignmentIndex:] = []

                    self.catchupLK(iLK, iLK + i - 1)
                    for thisComp in thisAlignment:
                        nAlignment = len(alignment)
                        thisLK = thisComp[0]
                        thisAF = thisComp[-1]
                        if thisLK:
                            indexLK[thisLK] = nAlignment
                        if thisAF:
                            indexAF[thisAF] = nAlignment
                        alignment.append(thisComp)
                    break
        return step

    def doDiffs(self, startLK=1, startAF=1, steps=-1, show=False, debug=False):
        """Main loop of the alignment process.

        Without optional parameters, performs the whole alignment until
        it is completed or fails.

        But you can also run a few alignment steps, from arbitrary positions
        and record the decision steps. This is handy for debugging and
        exploring.

        !!! hint "doDiff"
            This function defines an inner function `doDiff`, which
            contains the logic of a single alignment step.
            That function `doDiff` issues a sequence of `compare` and `lookup`
            commands with various strictnesses and various jump sizes,
            which will be tried in order.

            Here you can finetune the amount of looseness you allow in a comparison
            before jumping from that position.
            You can first compare strictly, then jump away a short distance while
            still comparing strictly, and then start comparing more loosely, and
            then jump away a longer distance, possibly a bit more loosely.
            That is a matter of trial and error.

            The current implementation of `doDiff` matches the present LK and AF well,
            especially in conjunction with the special cases that have been defined.

        Parameters
        ----------
        startLK: integer, optional 1
            Start position in LK edition. If not passed, starts from the beginning.
        startAF: integer, optional 1
            Start position in AF edition. If not passed, starts from the beginning.
        steps: integer, optional `-1`
            The number of alignment steps to perform.
            If not passed or -1, there is no limit on the steps.
        show: boolean, optional False
            If True, prints the resulting alignment table to the screen.
            Only do this when you produce a small part of the alignment table!
            maximum edit distance that we allow for the comparisons to succeed
        debug: boolean, optional `False`
            If True, print a statement indicating the result of the each decision
            that has been taken during the alignment process.
        """
        maxLK = self.maxLK
        maxAF = self.maxAF
        cases = self.cases
        alignmentPath = self.alignmentPath
        alignmentText = self.alignmentText
        alignmentTextBlocked = self.alignmentTextBlocked

        self.alignment = []
        self.indexLK = {}
        self.indexAF = {}
        self.decisions = collections.Counter()
        self.usedCases = set()
        self.failedCases = set()

        alignment = self.alignment
        indexLK = self.indexLK
        decisions = self.decisions
        usedCases = self.usedCases
        failedCases = self.failedCases

        lastCase = None
        step = (startLK, startAF)

        it = 0

        def doDiff(iLK, iAF):
            decision = 0
            if iLK > maxLK or iAF > maxAF:
                if iAF < maxAF:
                    self.catchupAF(iAF, maxAF)
                if iLK < maxLK:
                    self.catchupLK(iLK, maxLK)
                decisions[decision] += 1
                return True

            nonlocal lastCase

            decision += 1

            if lastCase != iLK:
                step = self.doCase(iLK, iAF, debug=debug)
                if step:
                    lastCase = iLK
                    decisions[decision] += 1
                    return step

            step = self.compare(iLK, iAF, 0, 1.0, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.compare(iLK, iAF, 1, 0.8, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 0, 1.0, 1, 5, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.compare(iLK, iAF, 1, 0.8, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.compare(iLK, iAF, 2, 0.7, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.compare(iLK, iAF, 3, 0.6, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 0, 1.0, 1, 10, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 1, 0.8, 1, 10, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 0, 1.0, 10, 20, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 1, 0.8, 10, 20, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 2, 0.7, 1, 20, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 1, 0.8, 20, 100, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            step = self.lookup(iLK, iAF, 2, 0.7, 20, 100, debug=debug)
            decision += 1
            if step:
                decisions[decision] += 1
                return step

            return False

        while it != steps:
            it += 1
            step = doDiff(*step)

            if step is True:
                self.printAlignment(alignmentText)
                self.printAlignment(alignmentPath, asTsv=True)
                print(f"Alignment complete, {len(alignment)} entries.")
                break
            elif step is False:
                self.printAlignment(alignmentTextBlocked)
                print(f"Alignment blocked, {len(alignment)} entries.")
                self.printDiff(20, 20)
                break

        endLK = startLK if len(indexLK) == 0 else max(indexLK)
        possibleCases = set(range(startLK, endLK + 1))
        casesSet = set(cases)
        relevantCases = casesSet & possibleCases
        missedCases = relevantCases - usedCases - failedCases

        if failedCases:
            for iLK in sorted(failedCases):
                self.printCase(iLK, "FAILED")
        if missedCases:
            for iLK in sorted(missedCases):
                self.printCase(iLK, "MISSED")

        if not failedCases and not missedCases:
            if not relevantCases:
                print("No special cases defined for this stretch")
            else:
                print(
                    f"Special cases: all relevant {len(relevantCases)} cases "
                    "defined, encountered, and applied"
                )

        if show:
            print(self.printLines())

        for d in sorted(decisions):
            n = decisions[d]
            print(f"{d:>2} taken: {n:>5} x")

    def analyseStretch(self, start, end):
        """Analyse a stretch in the alignment table and reports whether it is suspect.

        Stretches are suspect if they contain both a number entries where an LK
        word is missing, and a number of entries where an AF word is missing.

        This is usually an indication that the alignment has gone out of sync.

        Parameters
        ----------
        start: integer
            Index in the alignment list where to start
        end: integer
            Index in the alignment list where to stop
        """
        alignment = self.alignment

        total = 0
        good = 0
        onlyLK = 0
        onlyAF = 0

        for (iLK, left, d, r, right, iAF) in alignment[start : end + 1]:
            total += 1
            if not iLK:
                onlyAF += 1
            if not iAF:
                onlyLK += 1
            if d == 0:
                good += 1

        suspect = onlyAF > 1 and onlyLK > 1 and onlyAF + onlyLK > 5
        return suspect

    def check(self):
        """Comprehensive quality assessment of the alignment table.

        Reports various statistics on how close the matches are overall.

        Checks the minimum requirement of sanity:
        All words in the LK must occur without duplications in the right order
        int the alignment table. Same for the AF words.

        Reports where the combinations are: the cases where more than one word
        in the LK has matched with more than one word in the AF.
        Or where single words have been forced to match, or where words
        in one edition have no counterpart in the other edition.
        Gives at most three examples of all distinct combination patterns.

        Identifies bad strectches: chunks of entries within the alignment
        table where words do not match for various reasons.
        The identification is a bit loose in the sense that bad stretches
        with a single perfect match in between are combined.

        Some bad stretches are suspect (see `analyseStretch`).
        All suspect stretches are shown, and at most three examples
        of the benign (non-suspsect) bad stretches are shown.
        """
        maxLK = self.maxLK
        maxAF = self.maxAF
        alignment = self.alignment

        errors = {}
        prevILK = 0
        prevIAF = 0

        where = collections.Counter()
        agreement = collections.Counter()
        badStretches = collections.defaultdict(lambda: [])
        combinations = collections.defaultdict(lambda: [])

        startBad = 0
        startCombi = 0
        nCombi = 0

        for (c, (iLK, left, d, r, right, iAF)) in enumerate(alignment):
            thisBad = d > 0 or not iLK or not iAF
            hasLeft = left != ""
            hasRight = right != ""

            # a good line between bad lines is counted as bad
            if not thisBad and startBad:
                nextGood = True
                for j in range(1, LOOKAHEAD + 1):
                    if c + j < len(alignment):
                        compJ = alignment[c + j]
                        if compJ[2] > 0 or not compJ[0] or not compJ[-1]:
                            nextGood = False
                            break
                if not nextGood:
                    thisBad = True
            if startBad:
                if not thisBad:
                    badStretches[c - startBad].append(startBad)
                    startBad = 0
            else:
                if thisBad:
                    startBad = c

            if startCombi and nCombi == c - startCombi:
                startCombi = 0
                nCombi = 0
            if startCombi == 0:
                thisCombi = hasLeft and left > 1 or hasRight and right > 1
                if thisCombi:
                    combinations[(left, right)].append(c)
                    startCombi = c
                    nCombi = max((left, right))

            agreement[d] += 1

            if iLK:
                if iLK != prevILK + 1:
                    errors.setdefault("wrong iLK", []).append(
                        f"{c:>5}: Expected {prevILK + 1}, found {iLK}"
                    )
                prevILK = iLK
                if iAF:
                    where["both"] += 1
            else:
                where[AF] += 1
            if iAF:
                if iAF != prevIAF + 1:
                    errors.setdefault("wrong iAF", []).append(
                        f"{c:>5}: Expected {prevIAF + 1}, found {iAF}"
                    )
                prevIAF = iAF
            else:
                where[LK] += 1

        if startBad:
            badStretches[len(alignment) - startBad].append(startBad)

        if prevILK < maxLK:
            errors.setdefault("missing iLKs at the end", []).append(
                f"last is {prevILK}, expected {maxLK}"
            )
        elif prevILK > maxLK:
            errors.setdefault("too many iLKs at the end", []).append(
                f"last is {prevILK}, expected {maxLK}"
            )
        if prevIAF < maxAF:
            errors.setdefault("missing iAFs at the end", []).append(
                f"last is {prevIAF}, expected {maxAF}"
            )
        elif prevIAF > maxAF:
            errors.setdefault("too many iAFs at the end", []).append(
                f"last is {prevIAF}, expected {maxAF}"
            )

        print("\nSANITY\n")
        if not errors:
            print("All OK")
        else:
            for (kind, msgs) in errors.items():
                print(f"ERROR {kind} ({len(msgs):>5}x):")
                for msg in msgs[0:10]:
                    print(f"\t{msg}")
                if len(msgs) > 10:
                    print(f"\t ... and {len(msgs) - 10} more ...")

        print("\nAGREEMENT\n")
        print("Where are the words?\n")
        print(f"\t{LK}-only: {where[LK]:>5} slots")
        print(f"\t{AF}-only: {where[AF]:>5} slots")
        print(f"\tboth:    {where['both']:>5} slots")

        print("\nHow well is the agreement?\n")
        for (d, n) in sorted(agreement.items()):
            print(f"edit distance {d:>3} : {n:>5} words")
        print("NB: 88 are special cases that have been declared explicitly")
        print("NB: 99 are words that have no counterpart in the other edition")

        print("\nCOMBINATIONS\n")
        print("What combination alignments are there and how many?")
        for comb in sorted(combinations):
            cs = combinations[comb]
            (left, right) = comb
            print(f"\t({left:>2}, {right:>2}) : {len(cs):>4} x :")
            for (i, c) in enumerate(cs[0:3]):
                print(f"EXAMPLE {i + 1}:")
                print(
                    self.printLines(
                        start=max((1, c - 2)),
                        end=min((len(alignment), c + 2 + max(comb))),
                    )
                )
                print("")

        print("\nBAD STRETCHES\n")
        print("How many of which size?\n")
        allSuspects = []
        someBenigns = []
        for (size, starts) in sorted(badStretches.items(), key=lambda x: (-x[0], x[1])):
            suspects = {
                start: size
                for start in starts
                if self.analyseStretch(start, start + size)
            }
            benigns = {start: size for start in starts if start not in suspects}
            allSuspects.extend(
                [(start, start + size) for (start, size) in suspects.items()]
            )
            someBenigns.extend(
                [(start, start + size) for (start, size) in list(benigns.items())[0:3]]
            )
            examples = ", ".join(str(start) for start in list(suspects.keys())[0:3])
            if not suspects:
                examples = ", ".join(str(start) for start in list(benigns.keys())[0:3])
            print(
                f"bad stretches of size {size:>3} : {len(suspects):>4} suspect of total {len(starts):>4} x see e.g. {examples}"
            )

        print(
            f"\nShowing all {len(allSuspects)} inversion suspects"
            if len(allSuspects)
            else "\nNo suspect bad stretches\n"
        )
        for (i, (start, end)) in enumerate(reversed(allSuspects)):
            print(f"\nSUSPECT {i + 1:>2}")
            print(self.printLines(max((1, start - 5)), min((len(alignment), end + 5))))
        print(
            f"\nShowing some ({len(someBenigns)}) benign examples"
            if len(someBenigns)
            else "\nNo bad stretches\n"
        )
        for (i, (start, end)) in enumerate(someBenigns):
            print(f"\nBENIGN {i + 1:>2}")
            print(self.printLines(max((1, start - 2)), min((len(alignment), end + 2))))

Methods

def analyseStretch(self, start, end)

Analyse a stretch in the alignment table and reports whether it is suspect.

Stretches are suspect if they contain both a number entries where an LK word is missing, and a number of entries where an AF word is missing.

This is usually an indication that the alignment has gone out of sync.

Parameters

start : integer: Index in the alignment list where to start
end : integer: Index in the alignment list where to stop

Expand source code Browse git

def analyseStretch(self, start, end):
    """Analyse a stretch in the alignment table and reports whether it is suspect.

    Stretches are suspect if they contain both a number entries where an LK
    word is missing, and a number of entries where an AF word is missing.

    This is usually an indication that the alignment has gone out of sync.

    Parameters
    ----------
    start: integer
        Index in the alignment list where to start
    end: integer
        Index in the alignment list where to stop
    """
    alignment = self.alignment

    total = 0
    good = 0
    onlyLK = 0
    onlyAF = 0

    for (iLK, left, d, r, right, iAF) in alignment[start : end + 1]:
        total += 1
        if not iLK:
            onlyAF += 1
        if not iAF:
            onlyLK += 1
        if d == 0:
            good += 1

    suspect = onlyAF > 1 and onlyLK > 1 and onlyAF + onlyLK > 5
    return suspect

def catchupAF(self, start, end)

Move the current position in the AF edition forward.

While doing so, it adds an entry to the alignment table for each time the position is shifted by one.

Parameters

start : integer: From where to start shifting
end : integer: Where to end shifting

Expand source code Browse git

def catchupAF(self, start, end):
    """Move the current position in the AF edition forward.

    While doing so, it adds an entry to the alignment table
    for each time the position is shifted by one.

    Parameters
    ----------
    start: integer
        From where to start shifting
    end: integer
        Where to end shifting
    """
    indexAF = self.indexAF
    alignment = self.alignment

    for i in range(start, end + 1):
        indexAF[i] = len(alignment)
        alignment.append(("", 0, 99, 0.0, "", i))

def catchupLK(self, start, end)

Move the current position in the LK edition forward.

While doing so, it adds an entry to the alignment table for each time the position is shifted by one.

Parameters

start : integer: From where to start shifting
end : integer: Where to end shifting

Expand source code Browse git

def catchupLK(self, start, end):
    """Move the current position in the LK edition forward.

    While doing so, it adds an entry to the alignment table
    for each time the position is shifted by one.

    Parameters
    ----------
    start: integer
        From where to start shifting
    end: integer
        Where to end shifting
    """
    indexLK = self.indexLK
    alignment = self.alignment

    for i in range(start, end + 1):
        indexLK[i] = len(alignment)
        alignment.append((i, "", 99, 0.0, 0, ""))

def check(self)

Comprehensive quality assessment of the alignment table.

Reports various statistics on how close the matches are overall.

Checks the minimum requirement of sanity: All words in the LK must occur without duplications in the right order int the alignment table. Same for the AF words.

Reports where the combinations are: the cases where more than one word in the LK has matched with more than one word in the AF. Or where single words have been forced to match, or where words in one edition have no counterpart in the other edition. Gives at most three examples of all distinct combination patterns.

Identifies bad strectches: chunks of entries within the alignment table where words do not match for various reasons. The identification is a bit loose in the sense that bad stretches with a single perfect match in between are combined.

Some bad stretches are suspect (see analyseStretch). All suspect stretches are shown, and at most three examples of the benign (non-suspsect) bad stretches are shown.

Expand source code Browse git

def check(self):
    """Comprehensive quality assessment of the alignment table.

    Reports various statistics on how close the matches are overall.

    Checks the minimum requirement of sanity:
    All words in the LK must occur without duplications in the right order
    int the alignment table. Same for the AF words.

    Reports where the combinations are: the cases where more than one word
    in the LK has matched with more than one word in the AF.
    Or where single words have been forced to match, or where words
    in one edition have no counterpart in the other edition.
    Gives at most three examples of all distinct combination patterns.

    Identifies bad strectches: chunks of entries within the alignment
    table where words do not match for various reasons.
    The identification is a bit loose in the sense that bad stretches
    with a single perfect match in between are combined.

    Some bad stretches are suspect (see `analyseStretch`).
    All suspect stretches are shown, and at most three examples
    of the benign (non-suspsect) bad stretches are shown.
    """
    maxLK = self.maxLK
    maxAF = self.maxAF
    alignment = self.alignment

    errors = {}
    prevILK = 0
    prevIAF = 0

    where = collections.Counter()
    agreement = collections.Counter()
    badStretches = collections.defaultdict(lambda: [])
    combinations = collections.defaultdict(lambda: [])

    startBad = 0
    startCombi = 0
    nCombi = 0

    for (c, (iLK, left, d, r, right, iAF)) in enumerate(alignment):
        thisBad = d > 0 or not iLK or not iAF
        hasLeft = left != ""
        hasRight = right != ""

        # a good line between bad lines is counted as bad
        if not thisBad and startBad:
            nextGood = True
            for j in range(1, LOOKAHEAD + 1):
                if c + j < len(alignment):
                    compJ = alignment[c + j]
                    if compJ[2] > 0 or not compJ[0] or not compJ[-1]:
                        nextGood = False
                        break
            if not nextGood:
                thisBad = True
        if startBad:
            if not thisBad:
                badStretches[c - startBad].append(startBad)
                startBad = 0
        else:
            if thisBad:
                startBad = c

        if startCombi and nCombi == c - startCombi:
            startCombi = 0
            nCombi = 0
        if startCombi == 0:
            thisCombi = hasLeft and left > 1 or hasRight and right > 1
            if thisCombi:
                combinations[(left, right)].append(c)
                startCombi = c
                nCombi = max((left, right))

        agreement[d] += 1

        if iLK:
            if iLK != prevILK + 1:
                errors.setdefault("wrong iLK", []).append(
                    f"{c:>5}: Expected {prevILK + 1}, found {iLK}"
                )
            prevILK = iLK
            if iAF:
                where["both"] += 1
        else:
            where[AF] += 1
        if iAF:
            if iAF != prevIAF + 1:
                errors.setdefault("wrong iAF", []).append(
                    f"{c:>5}: Expected {prevIAF + 1}, found {iAF}"
                )
            prevIAF = iAF
        else:
            where[LK] += 1

    if startBad:
        badStretches[len(alignment) - startBad].append(startBad)

    if prevILK < maxLK:
        errors.setdefault("missing iLKs at the end", []).append(
            f"last is {prevILK}, expected {maxLK}"
        )
    elif prevILK > maxLK:
        errors.setdefault("too many iLKs at the end", []).append(
            f"last is {prevILK}, expected {maxLK}"
        )
    if prevIAF < maxAF:
        errors.setdefault("missing iAFs at the end", []).append(
            f"last is {prevIAF}, expected {maxAF}"
        )
    elif prevIAF > maxAF:
        errors.setdefault("too many iAFs at the end", []).append(
            f"last is {prevIAF}, expected {maxAF}"
        )

    print("\nSANITY\n")
    if not errors:
        print("All OK")
    else:
        for (kind, msgs) in errors.items():
            print(f"ERROR {kind} ({len(msgs):>5}x):")
            for msg in msgs[0:10]:
                print(f"\t{msg}")
            if len(msgs) > 10:
                print(f"\t ... and {len(msgs) - 10} more ...")

    print("\nAGREEMENT\n")
    print("Where are the words?\n")
    print(f"\t{LK}-only: {where[LK]:>5} slots")
    print(f"\t{AF}-only: {where[AF]:>5} slots")
    print(f"\tboth:    {where['both']:>5} slots")

    print("\nHow well is the agreement?\n")
    for (d, n) in sorted(agreement.items()):
        print(f"edit distance {d:>3} : {n:>5} words")
    print("NB: 88 are special cases that have been declared explicitly")
    print("NB: 99 are words that have no counterpart in the other edition")

    print("\nCOMBINATIONS\n")
    print("What combination alignments are there and how many?")
    for comb in sorted(combinations):
        cs = combinations[comb]
        (left, right) = comb
        print(f"\t({left:>2}, {right:>2}) : {len(cs):>4} x :")
        for (i, c) in enumerate(cs[0:3]):
            print(f"EXAMPLE {i + 1}:")
            print(
                self.printLines(
                    start=max((1, c - 2)),
                    end=min((len(alignment), c + 2 + max(comb))),
                )
            )
            print("")

    print("\nBAD STRETCHES\n")
    print("How many of which size?\n")
    allSuspects = []
    someBenigns = []
    for (size, starts) in sorted(badStretches.items(), key=lambda x: (-x[0], x[1])):
        suspects = {
            start: size
            for start in starts
            if self.analyseStretch(start, start + size)
        }
        benigns = {start: size for start in starts if start not in suspects}
        allSuspects.extend(
            [(start, start + size) for (start, size) in suspects.items()]
        )
        someBenigns.extend(
            [(start, start + size) for (start, size) in list(benigns.items())[0:3]]
        )
        examples = ", ".join(str(start) for start in list(suspects.keys())[0:3])
        if not suspects:
            examples = ", ".join(str(start) for start in list(benigns.keys())[0:3])
        print(
            f"bad stretches of size {size:>3} : {len(suspects):>4} suspect of total {len(starts):>4} x see e.g. {examples}"
        )

    print(
        f"\nShowing all {len(allSuspects)} inversion suspects"
        if len(allSuspects)
        else "\nNo suspect bad stretches\n"
    )
    for (i, (start, end)) in enumerate(reversed(allSuspects)):
        print(f"\nSUSPECT {i + 1:>2}")
        print(self.printLines(max((1, start - 5)), min((len(alignment), end + 5))))
    print(
        f"\nShowing some ({len(someBenigns)}) benign examples"
        if len(someBenigns)
        else "\nNo bad stretches\n"
    )
    for (i, (start, end)) in enumerate(someBenigns):
        print(f"\nBENIGN {i + 1:>2}")
        print(self.printLines(max((1, start - 2)), min((len(alignment), end + 2))))

def compare(self, iLK, iAF, maxD, minR, debug=False)

Does a full comparison at a location, including between combinations.

First a direct match between the words at the current positions in the LK and AF editions is tried. If that fails, combinations of words from those points onwards are tried.

Parameters

iLK : integer: Current slot position in LK edition
iAF : integer: Current slot position in AF edition
maxD : integer: maximum edit distance that we allow for the comparisons to succeed
minR : integer: minimum similarity ratio that we require for the comparisons to succeed
debug : boolean, optional False: If True, print a statement indicating the result of the comparison.

Expand source code Browse git

def compare(self, iLK, iAF, maxD, minR, debug=False):
    """Does a full comparison at a location, including between combinations.

    First a direct match between the words at the current positions in
    the LK and AF editions is tried. If that fails,
    combinations of words from those points onwards are tried.

    See also: `getCombis` and `findCombi`.

    Parameters
    ----------
    iLK: integer
        Current slot position in LK edition
    iAF: integer
        Current slot position in AF edition
    maxD: integer
        maximum edit distance that we allow for the comparisons to succeed
    minR: integer
        minimum similarity ratio that we require for the comparisons to succeed
    debug: boolean, optional `False`
        If True, print a statement indicating the result of the comparison.
    """
    maxLK = self.maxLK
    maxAF = self.maxAF
    getTextLK = self.getTextLK
    getTextAF = self.getTextAF
    alignment = self.alignment
    indexLK = self.indexLK
    indexAF = self.indexAF

    textLK = getTextLK(iLK)
    textAF = getTextAF(iAF)
    (isSimilar, d, r) = similar(textLK, textAF, maxD, minR)
    if isSimilar:
        nAlignment = len(alignment)
        indexLK[iLK] = nAlignment
        indexAF[iAF] = nAlignment
        alignment.append((iLK, "", d, r, "", iAF))
        if debug:
            print(
                f"[{iLK}~{iAF}] single comparison with distance <= {maxD} and ratio >= {minR}"
            )
        return (iLK + 1, iAF + 1)

    if iLK < maxLK and iAF < maxAF:
        textLK = getTextLK(iLK + 1)
        textAF = getTextAF(iAF + 1)
        (isSimilarNext, dNext, rNext) = similar(textLK, textAF, maxD, minR)
        if isSimilarNext:
            nAlignment = len(alignment)
            indexLK[iLK] = nAlignment
            indexAF[iAF] = nAlignment
            alignment.append((iLK, "", dNext, rNext, "", iAF))
            if debug:
                print(
                    f"[{iLK}~{iAF}] single comparison failed with distance <= {maxD} and ratio >= {minR}"
                )
            nAlignment = len(alignment)
            indexLK[iLK + 1] = nAlignment
            indexAF[iAF + 1] = nAlignment
            alignment.append((iLK + 1, "", dNext, rNext, "", iAF + 1))
            if debug:
                print(
                    f"[{iLK}~{iAF}] single comparison recovered with distance <= {maxD} and ratio >= {minR}"
                )
            return (iLK + 2, iAF + 2)

    combi = self.findCombi(iLK, iAF, maxD, minR)
    if combi is not None:
        (cLK, cAF) = combi
        if debug:
            print(
                f"[{iLK}~{iAF}] ({cLK}, {cAF}) comparison with distance <= {maxD} and ratio >= {minR}"
            )
        return (iLK + cLK, iAF + cAF)

    return None

def doCase(self, iLK, iAF, debug=False)

Deal with a special case.

Parameters

iLK : integer: Current position in LK edition. A case is triggered when iLK is equal to a key in the cases table and if the same case has not already been applied. This last condition is relevant for special cases where the number of LK words that must be combined is 0. In that case, the current LK position does not shift, and we would be in an infinite loop.
iAF : integer: Current position in AF edition. If a case is triggered by iLK, but the AF position specified in the case does not match iAF, the case is skipped and is reported as a failed case.

Expand source code Browse git

def doCase(self, iLK, iAF, debug=False):
    """Deal with a special case.

    Parameters
    ----------
    iLK: integer
        Current position in LK edition.
        A case is triggered when iLK is equal
        to a key in the cases table and if the same case
        has not already been applied.
        This last condition is relevant for special
        cases where the number of LK words that must be
        combined is 0. In that case, the current LK
        position does not shift, and we would be in an
        infinite loop.
    iAF: integer
        Current position in AF edition.
        If a case is triggered by iLK, but the AF position
        specified in the case does not match iAF,
        the case is skipped and is reported as a failed case.
    """
    indexLK = self.indexLK
    indexAF = self.indexAF
    alignment = self.alignment
    usedCases = self.usedCases
    failedCases = self.failedCases

    cases = self.cases

    if iLK not in cases:
        return None

    (givenIAF, cLK, cAF) = cases[iLK]
    if givenIAF != iAF:
        failedCases.add(iLK)
        return None

    usedCases.add(iLK)
    common = min((cLK, cAF))
    for i in range(max((cLK, cAF))):
        nAlignment = len(alignment)
        if i < common:
            indexLK[iLK + i] = nAlignment
            indexAF[iAF + i] = nAlignment
            alignment.append((iLK + i, cLK, 88, 0.0, cAF, iAF + i))
        elif i < cLK:
            indexLK[iLK + i] = nAlignment
            alignment.append((iLK + i, cLK, 88, 0.0, cAF, ""))
        else:
            indexAF[iAF + i] = nAlignment
            alignment.append(("", cLK, 88, 0.0, cAF, iAF + i))
    if debug:
        print(f"[{iLK}~{iAF}] special case ({cLK}, {cAF})")
    return (iLK + cLK, iAF + cAF)

def doDiffs(self, startLK=1, startAF=1, steps=-1, show=False, debug=False)

Main loop of the alignment process.

Without optional parameters, performs the whole alignment until it is completed or fails.

But you can also run a few alignment steps, from arbitrary positions and record the decision steps. This is handy for debugging and exploring.

doDiff

This function defines an inner function doDiff, which contains the logic of a single alignment step. That function doDiff issues a sequence of compare and lookup commands with various strictnesses and various jump sizes, which will be tried in order.

Here you can finetune the amount of looseness you allow in a comparison before jumping from that position. You can first compare strictly, then jump away a short distance while still comparing strictly, and then start comparing more loosely, and then jump away a longer distance, possibly a bit more loosely. That is a matter of trial and error.

The current implementation of doDiff matches the present LK and AF well, especially in conjunction with the special cases that have been defined.

Parameters

startLK : integer, optional 1: Start position in LK edition. If not passed, starts from the beginning.
startAF : integer, optional 1: Start position in AF edition. If not passed, starts from the beginning.
steps : integer, optional -1: The number of alignment steps to perform. If not passed or -1, there is no limit on the steps.
show : boolean, optional False: If True, prints the resulting alignment table to the screen. Only do this when you produce a small part of the alignment table! maximum edit distance that we allow for the comparisons to succeed
debug : boolean, optional False: If True, print a statement indicating the result of the each decision that has been taken during the alignment process.

Expand source code Browse git

def doDiffs(self, startLK=1, startAF=1, steps=-1, show=False, debug=False):
    """Main loop of the alignment process.

    Without optional parameters, performs the whole alignment until
    it is completed or fails.

    But you can also run a few alignment steps, from arbitrary positions
    and record the decision steps. This is handy for debugging and
    exploring.

    !!! hint "doDiff"
        This function defines an inner function `doDiff`, which
        contains the logic of a single alignment step.
        That function `doDiff` issues a sequence of `compare` and `lookup`
        commands with various strictnesses and various jump sizes,
        which will be tried in order.

        Here you can finetune the amount of looseness you allow in a comparison
        before jumping from that position.
        You can first compare strictly, then jump away a short distance while
        still comparing strictly, and then start comparing more loosely, and
        then jump away a longer distance, possibly a bit more loosely.
        That is a matter of trial and error.

        The current implementation of `doDiff` matches the present LK and AF well,
        especially in conjunction with the special cases that have been defined.

    Parameters
    ----------
    startLK: integer, optional 1
        Start position in LK edition. If not passed, starts from the beginning.
    startAF: integer, optional 1
        Start position in AF edition. If not passed, starts from the beginning.
    steps: integer, optional `-1`
        The number of alignment steps to perform.
        If not passed or -1, there is no limit on the steps.
    show: boolean, optional False
        If True, prints the resulting alignment table to the screen.
        Only do this when you produce a small part of the alignment table!
        maximum edit distance that we allow for the comparisons to succeed
    debug: boolean, optional `False`
        If True, print a statement indicating the result of the each decision
        that has been taken during the alignment process.
    """
    maxLK = self.maxLK
    maxAF = self.maxAF
    cases = self.cases
    alignmentPath = self.alignmentPath
    alignmentText = self.alignmentText
    alignmentTextBlocked = self.alignmentTextBlocked

    self.alignment = []
    self.indexLK = {}
    self.indexAF = {}
    self.decisions = collections.Counter()
    self.usedCases = set()
    self.failedCases = set()

    alignment = self.alignment
    indexLK = self.indexLK
    decisions = self.decisions
    usedCases = self.usedCases
    failedCases = self.failedCases

    lastCase = None
    step = (startLK, startAF)

    it = 0

    def doDiff(iLK, iAF):
        decision = 0
        if iLK > maxLK or iAF > maxAF:
            if iAF < maxAF:
                self.catchupAF(iAF, maxAF)
            if iLK < maxLK:
                self.catchupLK(iLK, maxLK)
            decisions[decision] += 1
            return True

        nonlocal lastCase

        decision += 1

        if lastCase != iLK:
            step = self.doCase(iLK, iAF, debug=debug)
            if step:
                lastCase = iLK
                decisions[decision] += 1
                return step

        step = self.compare(iLK, iAF, 0, 1.0, debug=debug)
        decision += 1
        if step:
            decisions[decision] += 1
            return step

        step = self.compare(iLK, iAF, 1, 0.8, debug=debug)
        decision += 1
        if step:
            decisions[decision] += 1
            return step

        step = self.lookup(iLK, iAF, 0, 1.0, 1, 5, debug=debug)
        decision += 1
        if step:
            decisions[decision] += 1
            return step

        step = self.compare(iLK, iAF, 1, 0.8, debug=debug)
        decision += 1
        if step:
            decisions[decision] += 1
            return step

        step = self.compare(iLK, iAF, 2, 0.7, debug=debug)
        decision += 1
        if step:
            decisions[decision] += 1
            return step

        step = self.compare(iLK, iAF, 3, 0.6, debug=debug)
        decision += 1
        if step:
            decisions[decision] += 1
            return step

        step = self.lookup(iLK, iAF, 0, 1.0, 1, 10, debug=debug)
        decision += 1
        if step:
            decisions[decision] += 1
            return step

        step = self.lookup(iLK, iAF, 1, 0.8, 1, 10, debug=debug)
        decision += 1
        if step:
            decisions[decision] += 1
            return step

        step = self.lookup(iLK, iAF, 0, 1.0, 10, 20, debug=debug)
        decision += 1
        if step:
            decisions[decision] += 1
            return step

        step = self.lookup(iLK, iAF, 1, 0.8, 10, 20, debug=debug)
        decision += 1
        if step:
            decisions[decision] += 1
            return step

        step = self.lookup(iLK, iAF, 2, 0.7, 1, 20, debug=debug)
        decision += 1
        if step:
            decisions[decision] += 1
            return step

        step = self.lookup(iLK, iAF, 1, 0.8, 20, 100, debug=debug)
        decision += 1
        if step:
            decisions[decision] += 1
            return step

        step = self.lookup(iLK, iAF, 2, 0.7, 20, 100, debug=debug)
        decision += 1
        if step:
            decisions[decision] += 1
            return step

        return False

    while it != steps:
        it += 1
        step = doDiff(*step)

        if step is True:
            self.printAlignment(alignmentText)
            self.printAlignment(alignmentPath, asTsv=True)
            print(f"Alignment complete, {len(alignment)} entries.")
            break
        elif step is False:
            self.printAlignment(alignmentTextBlocked)
            print(f"Alignment blocked, {len(alignment)} entries.")
            self.printDiff(20, 20)
            break

    endLK = startLK if len(indexLK) == 0 else max(indexLK)
    possibleCases = set(range(startLK, endLK + 1))
    casesSet = set(cases)
    relevantCases = casesSet & possibleCases
    missedCases = relevantCases - usedCases - failedCases

    if failedCases:
        for iLK in sorted(failedCases):
            self.printCase(iLK, "FAILED")
    if missedCases:
        for iLK in sorted(missedCases):
            self.printCase(iLK, "MISSED")

    if not failedCases and not missedCases:
        if not relevantCases:
            print("No special cases defined for this stretch")
        else:
            print(
                f"Special cases: all relevant {len(relevantCases)} cases "
                "defined, encountered, and applied"
            )

    if show:
        print(self.printLines())

    for d in sorted(decisions):
        n = decisions[d]
        print(f"{d:>2} taken: {n:>5} x")

def findCombi(self, iLK, iAF, maxD, minR)

Tries out all possible combinations at this point until success.

This is typically called when a direct match at the current slots in LK and AF fail.

We are going to take together small amounts of words and match them instead. As soon as we have a match, we break out of the loop and use step to the new positions.

Parameters

iLK : integer: Current slot position in LK edition
iAF : integer: Current slot position in AF edition
maxD : integer: maximum edit distance that we allow for the comparisons to succeed
minR : integer: minimum similarity ratio that we require for the comparisons to succeed

Expand source code Browse git

def findCombi(self, iLK, iAF, maxD, minR):
    """Tries out all possible combinations at this point until success.

    This is typically called when a direct match at the current slots
    in LK and AF fail.

    We are going to take together small amounts of words and match
    them instead. As soon as we have a match, we break out of the loop
    and use step to the new positions.

    See also: `getCombis` and `compare`.

    Parameters
    ----------
    iLK: integer
        Current slot position in LK edition
    iAF: integer
        Current slot position in AF edition
    maxD: integer
        maximum edit distance that we allow for the comparisons to succeed
    minR: integer
        minimum similarity ratio that we require for the comparisons to succeed
    """
    combis = self.combis
    maxLK = self.maxLK
    maxAF = self.maxAF
    getTextLK = self.getTextLK
    getTextAF = self.getTextAF
    alignment = self.alignment
    indexLK = self.indexLK
    indexAF = self.indexAF

    found = None

    for (cLK, cAF) in combis:
        if iLK + cLK > maxLK or iAF + cAF > maxAF:
            continue
        textLK = "".join(getTextLK(iLK + i) for i in range(cLK))
        textAF = "".join(getTextAF(iAF + i) for i in range(cAF))
        (isSimilar, d, r) = similar(textLK, textAF, maxD, minR)
        if isSimilar:
            found = (cLK, cAF)
            common = min((cLK, cAF))
            for i in range(max((cLK, cAF))):
                nAlignment = len(alignment)
                if i < common:
                    indexLK[iLK + i] = nAlignment
                    indexAF[iAF + i] = nAlignment
                    alignment.append((iLK + i, cLK, d, r, cAF, iAF + i))
                elif i < cLK:
                    indexLK[iLK + i] = nAlignment
                    alignment.append((iLK + i, cLK, d, r, cAF, ""))
                elif i < cAF:
                    indexAF[iAF + i] = nAlignment
                    alignment.append(("", cLK, d, r, cAF, iAF + i))
            break
    return found

def getCombis(self, c)

Precompute a specification of all combinations that might be tried.

When the alignment fails between single words, we try combinations of words left and right in a specific order. The result of this function lists the combinations that must be tried, where each combination is specified as a tuple of the number of words left and the number of words right that needs to be taken together for the matching.

Parameters

iLK : integer: Current slot position in LK edition
iAF : integer: Current slot position in AF edition
maxD : integer: maximum edit distance that we allow for the comparisons to succeed
minR : integer: minimum similarity ratio that we require for the comparisons to succeed
start : integer: first position in the other edition where we start comparing
end : integer: last position in the other edition where we stop comparing
debug : boolean, optional False: If True, print a statement indicating the result of the lookup.

Expand source code Browse git

def lookup(self, iLK, iAF, maxD, minR, start, end, debug=False):
    """Performs a jump in both editions.

    Typically invoked when comparison at the current locations fail,
    even after having tried combinations.

    We compare the current word in one edition with those from a certain
    interval in the other edition. And vice versa, alternately.
    of subsequent words.

    Parameters
    ----------
    iLK: integer
        Current slot position in LK edition
    iAF: integer
        Current slot position in AF edition
    maxD: integer
        maximum edit distance that we allow for the comparisons to succeed
    minR: integer
        minimum similarity ratio that we require for the comparisons to succeed
    start: integer
        first position in the other edition where we start comparing
    end: integer
        last position in the other edition where we stop comparing
    debug: boolean, optional `False`
        If True, print a statement indicating the result of the lookup.
    """
    maxLK = self.maxLK
    maxAF = self.maxAF
    alignment = self.alignment
    indexLK = self.indexLK
    indexAF = self.indexAF

    step = None

    for i in range(start, end + 1):
        prevAlignmentIndex = len(alignment)

        if iAF + i <= maxAF:
            step = self.compare(iLK, iAF + i, maxD, minR, debug=debug)
            if step:
                if debug:
                    print(f"[{iLK}~{iAF}] right {i}-jump to {iAF + i}")
                thisAlignment = list(alignment[prevAlignmentIndex:])
                alignment[prevAlignmentIndex:] = []

                self.catchupAF(iAF, iAF + i - 1)
                for thisComp in thisAlignment:
                    nAlignment = len(alignment)
                    thisLK = thisComp[0]
                    thisAF = thisComp[-1]
                    if thisLK:
                        indexLK[thisLK] = nAlignment
                    if thisAF:
                        indexAF[thisAF] = nAlignment
                    alignment.append(thisComp)
                break

        if iLK + i <= maxLK:
            step = self.compare(iLK + i, iAF, maxD, minR, debug=debug)
            if step:
                if debug:
                    print(f"[{iLK}~{iAF}] left {i}-jump to {iLK + i}")
                thisAlignment = list(alignment[prevAlignmentIndex:])
                alignment[prevAlignmentIndex:] = []

                self.catchupLK(iLK, iLK + i - 1)
                for thisComp in thisAlignment:
                    nAlignment = len(alignment)
                    thisLK = thisComp[0]
                    thisAF = thisComp[-1]
                    if thisLK:
                        indexLK[thisLK] = nAlignment
                    if thisAF:
                        indexAF[thisAF] = nAlignment
                    alignment.append(thisComp)
                break
    return step

def printAlignment(self, path, asTsv=False)

Prints the whole alignment table to file.

Parameters

path : string: The path name of the file to which the information is printed.
asTsv : boolean, optional `False: If False, pretty prints the table to file, using printLines above. If True, write essential data only in tab separated format. The essential information is slot in LK, distance, slot in AF

Expand source code Browse git

def printAlignment(self, path, asTsv=False):
    """Prints the whole alignment table to file.

    Parameters
    ----------
    path: string
        The path name of the file to which the information is printed.

    asTsv: boolean, optional `False
        If False, pretty prints the table to file, using
        `printLines` above.
        If True, write essential data only in tab separated format.
        The essential information is *slot in LK*, *distance*, *slot in AF*
    """
    alignment = self.alignment

    with open(path, "w") as fh:
        if asTsv:
            for item in alignment:
                fh.write(
                    "\t".join(
                        f"{it:3.1f}" if i == 3 else str(it)
                        for (i, it) in enumerate(item)
                    )
                    + "\n"
                )
        else:
            fh.write(self.printLines())
            fh.write("\n")

def printCase(self, iLK, label)

Print the alignment entries that belong to a special case.

Parameters

iLK : integer: The slot number in LK that triggered a special case.
label : string: The kind of issue with this case that you want to report: FAILED or MISSING.

Expand source code Browse git

def printCase(self, iLK, label):
    """Print the alignment entries that belong to a special case.

    Parameters
    ----------
    iLK: integer
        The slot number in LK that triggered a special case.
    label: string
        The kind of issue with this case that you want to report:
        FAILED or MISSING.
    """
    indexLK = self.indexLK
    indexAF = self.indexAF

    cases = self.cases
    case = cases[iLK]
    (iAF, cLK, cAF) = case
    start = min((indexLK[iLK], indexAF[iAF]))
    end = max((indexLK[iLK + cLK + 1], indexAF[iAF + cAF + 1]))
    print(f"{label} CASE: {iLK} vs {iAF}:")
    print(self.printLines(start=start, end=end))

def printDiff(self, before, after)

Print the last alignment entries plus what comes after.

This function is useful if alignment fails at some point. It then shows what happened before the failure and how it looks after the failure.

Parameters

before : integer: The amount of alignment entries to print. They will be picked from the end of the current alignment table.
after : integer: The amount of slots after the last alignment entry that needs to be shown for each source.

Expand source code Browse git

def printDiff(self, before, after):
    """Print the last alignment entries plus what comes after.

    This function is useful if alignment fails at some point.
    It then shows what happened before the failure and how it looks
    after the failure.

    Parameters
    ----------
    before: integer
        The amount of alignment entries to print.
        They will be picked from the end of the current alignment table.
    after: integer
        The amount of slots after the last alignment entry that
        needs to be shown for each source.
    """
    alignment = self.alignment
    maxLK = self.maxLK
    maxAF = self.maxAF
    getTextLK = self.getTextLK
    getTextAF = self.getTextAF

    print(self.printLines(start=len(alignment) - before))
    print("^" * 67)
    lastLK = None
    lastAF = None
    for c in range(len(alignment) - 1, -1, -1):
        comp = alignment[c]
        if lastLK is None:
            if comp[0]:
                lastLK = comp[0]
        if lastAF is None:
            if comp[-1]:
                lastAF = comp[-1]
        if lastLK is not None and lastAF is not None:
            break
    if lastLK is not None and lastAF is not None:
        for i in range(after):
            iLK = lastLK + 1 + i
            iAF = lastAF + 1 + i
            textLK = getTextLK(iLK) if iLK <= maxLK else ""
            textAF = getTextAF(iAF) if iAF <= maxAF else ""
            d = distance(textLK, textAF)
            r = ratio(textLK, textAF)
            print(
                f"{iLK:>5} =  {textLK:>20} @{d:>2}~{r:3.1f} {textAF:<20}  = {iAF:>5}"
            )

def printLines(self, start=0, end=None)

Print specified entries in the alignment table.

Parameters

start : integer, optional 0: Position of first alignment entry that needs to be printed. If it is 0 or negative, it starts from the beginning.
end : integer, optional None: Position of last alignment entry that needs to be printed. If it is negative, it indicates a position that much before the end of the list. If it is absent, it indicates the last line of the list.

Expand source code Browse git

def printLines(self, start=0, end=None):
    """Print specified entries in the alignment table.

    Parameters
    ----------
    start: integer, optional 0
        Position of first alignment entry that needs to be printed.
        If it is 0 or negative, it starts from the beginning.

    end: integer, optional `None`
        Position of last alignment entry that needs to be printed.
        If it is negative, it indicates a position that much
        before the end of the list.
        If it is absent, it indicates the last line of the list.
    """

    getTextLK = self.getTextLK
    getTextAF = self.getTextAF
    alignment = self.alignment

    if start < 0:
        start = 0
    if end is None or end > len(alignment):
        end = len(alignment)
    lines = []

    lines.append(
        "pag:ln|slot |cc|textLakhnawi        |@ed~rat|textAfifi           |cc| slot|pag:ln"
    )
    lines.append(
        "------|-----|--|--------------------|-------|--------------------|--|-----|------"
    )
    for (iLK, left, d, r, right, iAF) in alignment[start:end]:
        textLK = getTextLK(iLK) if iLK else ""
        textAF = getTextAF(iAF) if iAF else ""
        posLK = self.getPosLK(iLK) if iLK else ""
        posAF = self.getPosAF(iAF) if iAF else ""
        lines.append(
            f"{posLK:>6}|{iLK:>5}|{left:<2}|{textLK:>20}|@{d:<2}~{r:3.1f}|{textAF:<20}|{right:>2}|{iAF:>5} {posAF:<6}"
        )
    return "\n".join(lines)

def readEditions(self, version, cases)

Read the material of both LK and AF editions.

Parameters

version : string

The version number of the source datasets for both editions. We assume that both editions are available as Text-Fabric datasets.

cases : dict

The special cases defined to help the alignment algorithm through difficult spots.

The dictionary is keyed by LK slot number, and the value is a tuple consisting of a corresponding AF slot number, the amount of words to collect at the LK side, and the amount of words to collect at the AF side.

Returns

None: But the Alignment object will have attributes set by which the other methods are capable of reading the data source.

Expand source code Browse git

def readEditions(self, version, cases):
    """Read the material of both LK and AF editions.

    Parameters
    ----------
    version: string
        The version number of the source datasets for both editions.
        We assume that both editions are available as Text-Fabric datasets.

    cases: dict
        The special cases defined to help the alignment algorithm through
        difficult spots.

        The dictionary is keyed by LK slot number, and the value is a tuple
        consisting of a corresponding AF slot number, the amount of
        words to collect at the LK side, and the amount of words to collect at
        the AF side.

    Returns
    -------
    None
        But the Alignment object will have attributes set by which
        the other methods are capable of reading the data source.
    """

    self.version = version
    self.cases = cases
    self.alignmentPath = f"{ALIGNMENT_DIR}/alignment{version}.tsv"
    self.alignmentText = f"{ALIGNMENT_DIR}/alignment{version}.txt"
    self.alignmentTextBlocked = f"{ALIGNMENT_DIR}/alignment-incomplete{version}.txt"

    if not os.path.exists(ALIGNMENT_DIR):
        os.makedirs(self.alignmentPath, exist_ok=True)

    A = {}
    F = {}
    T = {}
    L = {}
    maxSlot = {}

    for (acro, name) in EDITIONS.items():
        A[acro] = use(
            f"among/fusus/tf/{name}:clone", writing="ara", version=version
        )
        F[acro] = A[acro].api.F
        T[acro] = A[acro].api.T
        L[acro] = A[acro].api.L
        maxSlot[acro] = F[acro].otype.maxSlot

    self.F_LK = F[LK]
    self.F_AF = F[AF]
    self.T_LK = T[LK]
    self.T_AF = T[AF]
    self.L_LK = L[LK]
    self.L_AF = L[AF]
    self.maxLK = maxSlot[LK]
    self.maxAF = maxSlot[AF]

    self.getTextLK = F[LK].lettersn.v
    self.getTextAF = F[AF].lettersn.v