Module fusus.align
ALignment of the Afifi and Lakhnawi editions of the Fusus
We have two versions of the Fusus text, obtained in wildly different ways:
AF
the Afifi edition, obtained via the OCR pipeline;LK
the Lakhnawi edition, obtained by reverse engineering a textual PDF with unusual fonts and private use characters.
The results of both attempts have been cleaned and enriched by Cornelis van Lit, through visual inspection and manipulation in Pandas.
In this notebook, we align the results obtained by Cornelis.
The essential result is an alignment table where the words of the LK
are brought into
correspondence with the words in AF
.
A candidate tool to perform sequence alignment with is Collatex by Ronald Haentjes-Dekker. We experimented with it, see the collatexAfLk notebook, but it took a very long time to run.
However, the two editions are very similar, with very few transpositions but a lot of OCR errors. We have developed our own algorithm, which runs in about a second.
The cleaned LK
and AF
are present in the Text-Fabric resources fususl
and fususa
.
In this way we have each word precisely numbered in both version, and we also have a latin
transcription of the words, which eases the visual comparison of similar words.
We use the latin transcription,
only because the author of this notebook (Dirk Roorda) is not trained to read Arabic,
but also beecause the Arabaic letters change shape depending on their position in the word,
which makes a task like this needlessly complicated.
Edit distance
We need a measure to estimate how similar words are. A good measure is the edit distance, i.e. the amount of edit operations to change one word into another. Since one of the editions is the result of OCR, we expect (many) OCR errors, and for this the concept of edit distance is useful.
However, we also need the slightly more refined notion of ratio, which takes the lengths of the words into account. It roughly corresponds to proportion of the common part of the words in relation to the totality of both words under comparison. In other words, an edit distance of 2 between words of length 10 tells you that the words are rather similar, their ratio is high. But the same edit distance between words of length 3 means that the words are very different, their ratio is low.
The concepts ratio and edit distance can be computed by the
Levenshtein algorithm,
which is implemented in the Python module Levenshtein
.
We have included this module in the fusus dependencies, but if for some reason yoou have not recently installed fusus, you may have to
pip3 install python-Levenshtein
The alignment table
The table itself is stored in a Python list alignment
.
The elements of this list are tuples with a slot number in LK and
a corresponding slot number in
AF and a measure of how well these slots correspond with each other.
Some words cannot be matched with the other source. In those cases the corresponding member of the tuple is the empty string.
That means that the alignment list becomes longer than each of the editions, and that it is not trivial to locate a specific slot in this table.
For that reason we maintain two indexes that map the slot numbers of both editions to positions in the alignment list.
Alignment algorithm
We align the editions by walking through both of them in parallel and performing comparison actions at each point.
Comparison actions may succeed, in which case we jump in both sources to the first point after the compared material.
If they fail, we try several jumps in the neighbourhood to see if the alignment catches up later.
If that fails, we stop the process prematurely and print the situation around the point of failure.
Comparison
We do not only compare single words, we also try out short combinations of words at both sides. We do this because the editions do not always agree on word boundaries.
So, we need to compute whether $n$ consecutive words left are similar to $m$ consecutive words right.
We set a boundary $C$ on the amount of words that we combine at each source.
We need to try out all possible combinations, from simplest and shortest to longest and most complex.
Every combination can be characterized by $(n, m)$, where $n$ is the number of words on the left and $m$ is the number of words on the right. $n$ and $m$ are in the range $1 \ldots C$.
Suppose $C = 3$, then we want to compare combinations in the following order:
combination | x |
---|---|
$(1, 1)$ | – |
$(1, 2)$ | – |
$(2, 1)$ | – |
$(2, 2)$ | – |
$(1, 3)$ | – |
$(3, 1)$ | – |
$(2, 3)$ | – |
$(3, 2)$ | – |
$(3, 3)$ | – |
In fact, we list all possible combinations and then sort them first by sum of the pair and then by decreasing difference of the pair.
We fix a C
(called COMBI
) and compute the sequence of combinations up front.
It turns out that 4 is a good value for this source. 3 is definitely too low. 5 is too high, because then the following is likely to happen.
Suppose we have to compare the sentence left to the one right.
left | right |
---|---|
A, BB, CCC, DDD, EEE | BB, CCC, DDD, EEE, X |
The best alignment is to decide that A on the left is not matched with anything on the right, X on the right is not matched to anything on the left, and BB, CCC, DDD, EEE will be exact matches.
Buth when comparing combinations of length 5, we get the comparison
ABBCCCDDDEEE versus BBCCCDDDEEEX with distance and ratio:
left = "ABBCCCDDDEEE"
right = "BBCCCDDDEEEX"
distance(left, right)=2
ratio(left, right)=0.9166666666666666
whereas when we go no further than 4, we would get:
left = "ABBCCCDDD"
right = "BBCCCDDDEEE"
distance(left, right)=4
ratio(left, right)=0.8
The algorithm specifies limits in terms of distance and ratio to determine when a comparison succeeds or fails, and as you see, the combination of 5 words is much more likely to succeed than the combination of 4.
And we want it to fail in this case, because it is a suboptimal match.
Similarity
We compute the similarity of words by a function that reports whether the words are similar with respect to a given edit distance and ratio.
The two words match if their edit distance is at most the given edit distance and their ratio is at least the given ratio.
The function returns the decision, and the computed distance and ratio between the words.
Comparing
This is about making a comparison between the locations in both editions where we are currently at.
First we make a quick 1-1 comparison between the words in both editions
at the current positions.
If that fails, we check whether the next word at both sides match.
If that fails, we try out all possible short combinations, using findCombi
above.
If all fails, None is returned. In case of success, a tuple with comparison information is appended to the alignment table and the next positions in both editions are returned.
Looking up
If comparison fails, even after having tried all possible combinations of words, we are potentially stuck.
But before giving up, we can make a jump if we can match the current word left to a word right that is not far ahead. Or vice versa.
The following function checks whether there are successful jumps within a specific region and within a given strictness. Typically, this function is called with different strictnesses for different regions. For example, first we want to try short jumps with moderate strictness, and if that fails, we try out longer jumps with higher strictness.
Short jumps are tried out before long jumps, and we try the jumps in the left and right edition alternately.
If a jump is successful, the next position will be returned, and a catchup will be done for the edition in which the jump has been made. When there is no successful jump, None is returned.
The catchup is a bit complicated, because the comparison that led to the successful jump has already been appended to the alignment table. So we have to pop that first (it could be multiple entries in case the comparison involved a combination of words), and then we can do the catchup, and then we can push the comparison entries again. We also need to update the indexes between slot numbers and the alignment table entries.
Orchestrating
We can now define how we are going to orchestrate the comparisons that will build the alignment table.
First we define a function that does a single step or a limited number of steps, starting at given positions in the editions.
This function will then be called from a big loop.
The comparison consists of checking out a number of things under several strictness conditions. As soon as a check succeeds, we perform the actions associated with that, and continue to the next iteration.
- First of all: is there a special case defined for this point? If so, follow its instructions.
- Perform local and very strict comparisons.
- Relax the strictness a little bit and try the local comparisons again
- Try out small jumps with great strictness
- Try out small jumps with lesser strictness
- Try out bigger jumps with great strictness
- Try out bigger jumps with lesser strictness
- Fail!
Clearly, there is no guarantee that we reach the end without failing. When it fails, we have to inspect the failure and see what happens there. There are basically three solutions to overcome such roadblocks
- Define a special case. Easy. Until it appears that too many special cases are needed.
- Tweak the strictness levels in the algorithm. Easy, but risky. Things that went well may now go awry. It is not only about preventing the roadblocks, but also maintaining the quality of the output alignment table.
- Change the orchestration: change the order of comparing and jumping, introduce more strictness levels, try more or less combinations. Invent new criteria for comparison. This is really difficult. The present orchestration is the result of quite some trial and error.
We have provided an analysis function for the alignment table that assesses its quality and reports where the bad stretches are. This is a very helpful aid in tweaking the parameters and defining special cases. More about this below.
The diff function
Finally we can write up the doDiff
function which loops through the editions
and produces an alignment table.
But you can also run call it for a specific pair of positions and a limited number of steps. Handy for debugging and tweaking the decision parameters.
You specify the start positions by means of the slot numbers in the LK and AF. You can specify the number of steps, or pass -1 in order to continue to the end.
And if you pass show=True
, the alignment table will be printed after completion.
Only do this if you run a limited number of steps.
Define the special cases
The special cases that you find in the notebook are the result of trial and error.
The keys are the LK slot numbers where the cases must be applied.
The values are the amount of words in LK and in AF that will be identified.
So 1000: (1100, 3, 4)
means that at slot position 1000 in LK resp. 1100 in AF
we take 3 resp. 4 consecutive words and declare that those 3 words in LK
match those 3 words in AF.
If for some reason the algorithm never reaches a point where the current position in LF is 1000 and the current position in AF is 1100, the case will be reported as a failed case.
When you read the alignment table (by printDiff
or printLines
)
you see the information on the basis of which you could define a special case.
Quality check
How did the alignment perform? It did complete, but what have we got? It could be just garbage.
Sanity
First of all we need to know whether all words of both LF and LK occur left and right, without gaps and duplications and in the right order. We check that. This is important, because the alignment algorithm is under intense evolution, and it could easily incur a flaw in which the material of the editions is not properly conserved.
Agreement
We provide information about the agreement of the words in both sources. How many words are there for which there is no counterpart in the other edition?
And how close are the words for which an alignment could be established?
Note that there are two reasons for bad agreement results:
- The editions are really very different
- The alignment is not optimal and fails to align many words that would have matched under another alignment strategy.
Bad stretches
Are there long stretches of poorly matching alignments? We are going to examine them.
If they contain many cases of left missing words and many cases of right missing words, they are suspect, because they might contain largely the same words, but the algorithm has failed to match them.
We show all suspect bad stretches. It is advisable to tweak the algorithm until all suspect bad stretches are gone. We have done so.
The remaining stretches are benign. We also show examples of benign bad strectches (at most three examples per size).
Expand source code Browse git
"""
.. include:: docs/about/alignment.md
"""
from Levenshtein import distance, ratio
import collections
import os
from tf.app import use
ORG = "among"
REPO = "fusus"
RELATIVE = "tf"
BASE = os.path.expanduser(f"~/github/{ORG}/{REPO}")
ALIGNMENT_DIR = f"{BASE}/ur/Fusus"
LK = "LK"
AF = "AF"
EDITIONS = {
LK: "Lakhnawi",
AF: "Afifi",
}
COMBINE = 4
LOOKAHEAD = 3
def similar(s1, s2, maxD, minR):
if s1 == s2:
return (True, 0, 1.0)
if maxD == 0 or minR == 1:
return (False, 77, 0.0)
d = distance(s1, s2)
r = ratio(s1, s2)
return (d <= maxD and r >= minR, d, r)
class Alignment:
"""Aligns the words of the LK edition with those of the AF edition.
The main method is `doDiffs` which produces the alignment table.
"""
def __init__(self):
self.getCombis(COMBINE)
def readEditions(self, version, cases):
"""Read the material of both LK and AF editions.
Parameters
----------
version: string
The version number of the source datasets for both editions.
We assume that both editions are available as Text-Fabric datasets.
cases: dict
The special cases defined to help the alignment algorithm through
difficult spots.
The dictionary is keyed by LK slot number, and the value is a tuple
consisting of a corresponding AF slot number, the amount of
words to collect at the LK side, and the amount of words to collect at
the AF side.
Returns
-------
None
But the Alignment object will have attributes set by which
the other methods are capable of reading the data source.
"""
self.version = version
self.cases = cases
self.alignmentPath = f"{ALIGNMENT_DIR}/alignment{version}.tsv"
self.alignmentText = f"{ALIGNMENT_DIR}/alignment{version}.txt"
self.alignmentTextBlocked = f"{ALIGNMENT_DIR}/alignment-incomplete{version}.txt"
if not os.path.exists(ALIGNMENT_DIR):
os.makedirs(self.alignmentPath, exist_ok=True)
A = {}
F = {}
T = {}
L = {}
maxSlot = {}
for (acro, name) in EDITIONS.items():
A[acro] = use(
f"among/fusus/tf/{name}:clone", writing="ara", version=version
)
F[acro] = A[acro].api.F
T[acro] = A[acro].api.T
L[acro] = A[acro].api.L
maxSlot[acro] = F[acro].otype.maxSlot
self.F_LK = F[LK]
self.F_AF = F[AF]
self.T_LK = T[LK]
self.T_AF = T[AF]
self.L_LK = L[LK]
self.L_AF = L[AF]
self.maxLK = maxSlot[LK]
self.maxAF = maxSlot[AF]
self.getTextLK = F[LK].lettersn.v
self.getTextAF = F[AF].lettersn.v
def getPosLK(self, slot):
"""Obtain the page and line numbers where a slot in the LK occurs.
"""
T_LK = self.T_LK
(piece, page, line) = T_LK.sectionFromNode(slot)
return f"{page:>03}:{line:>02}"
def getPosAF(self, slot):
"""Obtain the page and line numbers where a slot in the AF occurs.
"""
T_AF = self.T_AF
(page, block, line) = T_AF.sectionFromNode(slot)
return f"{page:>03}:{line:>02}"
def printLines(self, start=0, end=None):
"""Print specified entries in the alignment table.
Parameters
----------
start: integer, optional 0
Position of first alignment entry that needs to be printed.
If it is 0 or negative, it starts from the beginning.
end: integer, optional `None`
Position of last alignment entry that needs to be printed.
If it is negative, it indicates a position that much
before the end of the list.
If it is absent, it indicates the last line of the list.
"""
getTextLK = self.getTextLK
getTextAF = self.getTextAF
alignment = self.alignment
if start < 0:
start = 0
if end is None or end > len(alignment):
end = len(alignment)
lines = []
lines.append(
"pag:ln|slot |cc|textLakhnawi |@ed~rat|textAfifi |cc| slot|pag:ln"
)
lines.append(
"------|-----|--|--------------------|-------|--------------------|--|-----|------"
)
for (iLK, left, d, r, right, iAF) in alignment[start:end]:
textLK = getTextLK(iLK) if iLK else ""
textAF = getTextAF(iAF) if iAF else ""
posLK = self.getPosLK(iLK) if iLK else ""
posAF = self.getPosAF(iAF) if iAF else ""
lines.append(
f"{posLK:>6}|{iLK:>5}|{left:<2}|{textLK:>20}|@{d:<2}~{r:3.1f}|{textAF:<20}|{right:>2}|{iAF:>5} {posAF:<6}"
)
return "\n".join(lines)
def printAlignment(self, path, asTsv=False):
"""Prints the whole alignment table to file.
Parameters
----------
path: string
The path name of the file to which the information is printed.
asTsv: boolean, optional `False
If False, pretty prints the table to file, using
`printLines` above.
If True, write essential data only in tab separated format.
The essential information is *slot in LK*, *distance*, *slot in AF*
"""
alignment = self.alignment
with open(path, "w") as fh:
if asTsv:
for item in alignment:
fh.write(
"\t".join(
f"{it:3.1f}" if i == 3 else str(it)
for (i, it) in enumerate(item)
)
+ "\n"
)
else:
fh.write(self.printLines())
fh.write("\n")
def printDiff(self, before, after):
"""Print the last alignment entries plus what comes after.
This function is useful if alignment fails at some point.
It then shows what happened before the failure and how it looks
after the failure.
Parameters
----------
before: integer
The amount of alignment entries to print.
They will be picked from the end of the current alignment table.
after: integer
The amount of slots after the last alignment entry that
needs to be shown for each source.
"""
alignment = self.alignment
maxLK = self.maxLK
maxAF = self.maxAF
getTextLK = self.getTextLK
getTextAF = self.getTextAF
print(self.printLines(start=len(alignment) - before))
print("^" * 67)
lastLK = None
lastAF = None
for c in range(len(alignment) - 1, -1, -1):
comp = alignment[c]
if lastLK is None:
if comp[0]:
lastLK = comp[0]
if lastAF is None:
if comp[-1]:
lastAF = comp[-1]
if lastLK is not None and lastAF is not None:
break
if lastLK is not None and lastAF is not None:
for i in range(after):
iLK = lastLK + 1 + i
iAF = lastAF + 1 + i
textLK = getTextLK(iLK) if iLK <= maxLK else ""
textAF = getTextAF(iAF) if iAF <= maxAF else ""
d = distance(textLK, textAF)
r = ratio(textLK, textAF)
print(
f"{iLK:>5} = {textLK:>20} @{d:>2}~{r:3.1f} {textAF:<20} = {iAF:>5}"
)
def printCase(self, iLK, label):
"""Print the alignment entries that belong to a special case.
Parameters
----------
iLK: integer
The slot number in LK that triggered a special case.
label: string
The kind of issue with this case that you want to report:
FAILED or MISSING.
"""
indexLK = self.indexLK
indexAF = self.indexAF
cases = self.cases
case = cases[iLK]
(iAF, cLK, cAF) = case
start = min((indexLK[iLK], indexAF[iAF]))
end = max((indexLK[iLK + cLK + 1], indexAF[iAF + cAF + 1]))
print(f"{label} CASE: {iLK} vs {iAF}:")
print(self.printLines(start=start, end=end))
def getCombis(self, c):
"""Precompute a specification of all combinations that might be tried.
When the alignment fails between single words, we try combinations
of words left and right in a specific order.
The result of this function lists the combinations that must be
tried, where each combination is specified as a tuple
of the number of words left and the number of words right
that needs to be taken together for the matching.
See also: `findCommbi`.
"""
combis = []
for i in range(1, c + 1):
for j in range(1, c + 1):
if i != 1 or j != 1:
combis.append((i, j))
self.combis = tuple(
sorted(combis, key=lambda x: (x[0] + x[1], abs(x[0] - x[1])))
)
def catchupAF(self, start, end):
"""Move the current position in the AF edition forward.
While doing so, it adds an entry to the alignment table
for each time the position is shifted by one.
Parameters
----------
start: integer
From where to start shifting
end: integer
Where to end shifting
"""
indexAF = self.indexAF
alignment = self.alignment
for i in range(start, end + 1):
indexAF[i] = len(alignment)
alignment.append(("", 0, 99, 0.0, "", i))
def catchupLK(self, start, end):
"""Move the current position in the LK edition forward.
While doing so, it adds an entry to the alignment table
for each time the position is shifted by one.
Parameters
----------
start: integer
From where to start shifting
end: integer
Where to end shifting
"""
indexLK = self.indexLK
alignment = self.alignment
for i in range(start, end + 1):
indexLK[i] = len(alignment)
alignment.append((i, "", 99, 0.0, 0, ""))
def doCase(self, iLK, iAF, debug=False):
"""Deal with a special case.
Parameters
----------
iLK: integer
Current position in LK edition.
A case is triggered when iLK is equal
to a key in the cases table and if the same case
has not already been applied.
This last condition is relevant for special
cases where the number of LK words that must be
combined is 0. In that case, the current LK
position does not shift, and we would be in an
infinite loop.
iAF: integer
Current position in AF edition.
If a case is triggered by iLK, but the AF position
specified in the case does not match iAF,
the case is skipped and is reported as a failed case.
"""
indexLK = self.indexLK
indexAF = self.indexAF
alignment = self.alignment
usedCases = self.usedCases
failedCases = self.failedCases
cases = self.cases
if iLK not in cases:
return None
(givenIAF, cLK, cAF) = cases[iLK]
if givenIAF != iAF:
failedCases.add(iLK)
return None
usedCases.add(iLK)
common = min((cLK, cAF))
for i in range(max((cLK, cAF))):
nAlignment = len(alignment)
if i < common:
indexLK[iLK + i] = nAlignment
indexAF[iAF + i] = nAlignment
alignment.append((iLK + i, cLK, 88, 0.0, cAF, iAF + i))
elif i < cLK:
indexLK[iLK + i] = nAlignment
alignment.append((iLK + i, cLK, 88, 0.0, cAF, ""))
else:
indexAF[iAF + i] = nAlignment
alignment.append(("", cLK, 88, 0.0, cAF, iAF + i))
if debug:
print(f"[{iLK}~{iAF}] special case ({cLK}, {cAF})")
return (iLK + cLK, iAF + cAF)
def findCombi(self, iLK, iAF, maxD, minR):
"""Tries out all possible combinations at this point until success.
This is typically called when a direct match at the current slots
in LK and AF fail.
We are going to take together small amounts of words and match
them instead. As soon as we have a match, we break out of the loop
and use step to the new positions.
See also: `getCombis` and `compare`.
Parameters
----------
iLK: integer
Current slot position in LK edition
iAF: integer
Current slot position in AF edition
maxD: integer
maximum edit distance that we allow for the comparisons to succeed
minR: integer
minimum similarity ratio that we require for the comparisons to succeed
"""
combis = self.combis
maxLK = self.maxLK
maxAF = self.maxAF
getTextLK = self.getTextLK
getTextAF = self.getTextAF
alignment = self.alignment
indexLK = self.indexLK
indexAF = self.indexAF
found = None
for (cLK, cAF) in combis:
if iLK + cLK > maxLK or iAF + cAF > maxAF:
continue
textLK = "".join(getTextLK(iLK + i) for i in range(cLK))
textAF = "".join(getTextAF(iAF + i) for i in range(cAF))
(isSimilar, d, r) = similar(textLK, textAF, maxD, minR)
if isSimilar:
found = (cLK, cAF)
common = min((cLK, cAF))
for i in range(max((cLK, cAF))):
nAlignment = len(alignment)
if i < common:
indexLK[iLK + i] = nAlignment
indexAF[iAF + i] = nAlignment
alignment.append((iLK + i, cLK, d, r, cAF, iAF + i))
elif i < cLK:
indexLK[iLK + i] = nAlignment
alignment.append((iLK + i, cLK, d, r, cAF, ""))
elif i < cAF:
indexAF[iAF + i] = nAlignment
alignment.append(("", cLK, d, r, cAF, iAF + i))
break
return found
def compare(self, iLK, iAF, maxD, minR, debug=False):
"""Does a full comparison at a location, including between combinations.
First a direct match between the words at the current positions in
the LK and AF editions is tried. If that fails,
combinations of words from those points onwards are tried.
See also: `getCombis` and `findCombi`.
Parameters
----------
iLK: integer
Current slot position in LK edition
iAF: integer
Current slot position in AF edition
maxD: integer
maximum edit distance that we allow for the comparisons to succeed
minR: integer
minimum similarity ratio that we require for the comparisons to succeed
debug: boolean, optional `False`
If True, print a statement indicating the result of the comparison.
"""
maxLK = self.maxLK
maxAF = self.maxAF
getTextLK = self.getTextLK
getTextAF = self.getTextAF
alignment = self.alignment
indexLK = self.indexLK
indexAF = self.indexAF
textLK = getTextLK(iLK)
textAF = getTextAF(iAF)
(isSimilar, d, r) = similar(textLK, textAF, maxD, minR)
if isSimilar:
nAlignment = len(alignment)
indexLK[iLK] = nAlignment
indexAF[iAF] = nAlignment
alignment.append((iLK, "", d, r, "", iAF))
if debug:
print(
f"[{iLK}~{iAF}] single comparison with distance <= {maxD} and ratio >= {minR}"
)
return (iLK + 1, iAF + 1)
if iLK < maxLK and iAF < maxAF:
textLK = getTextLK(iLK + 1)
textAF = getTextAF(iAF + 1)
(isSimilarNext, dNext, rNext) = similar(textLK, textAF, maxD, minR)
if isSimilarNext:
nAlignment = len(alignment)
indexLK[iLK] = nAlignment
indexAF[iAF] = nAlignment
alignment.append((iLK, "", dNext, rNext, "", iAF))
if debug:
print(
f"[{iLK}~{iAF}] single comparison failed with distance <= {maxD} and ratio >= {minR}"
)
nAlignment = len(alignment)
indexLK[iLK + 1] = nAlignment
indexAF[iAF + 1] = nAlignment
alignment.append((iLK + 1, "", dNext, rNext, "", iAF + 1))
if debug:
print(
f"[{iLK}~{iAF}] single comparison recovered with distance <= {maxD} and ratio >= {minR}"
)
return (iLK + 2, iAF + 2)
combi = self.findCombi(iLK, iAF, maxD, minR)
if combi is not None:
(cLK, cAF) = combi
if debug:
print(
f"[{iLK}~{iAF}] ({cLK}, {cAF}) comparison with distance <= {maxD} and ratio >= {minR}"
)
return (iLK + cLK, iAF + cAF)
return None
def lookup(self, iLK, iAF, maxD, minR, start, end, debug=False):
"""Performs a jump in both editions.
Typically invoked when comparison at the current locations fail,
even after having tried combinations.
We compare the current word in one edition with those from a certain
interval in the other edition. And vice versa, alternately.
of subsequent words.
Parameters
----------
iLK: integer
Current slot position in LK edition
iAF: integer
Current slot position in AF edition
maxD: integer
maximum edit distance that we allow for the comparisons to succeed
minR: integer
minimum similarity ratio that we require for the comparisons to succeed
start: integer
first position in the other edition where we start comparing
end: integer
last position in the other edition where we stop comparing
debug: boolean, optional `False`
If True, print a statement indicating the result of the lookup.
"""
maxLK = self.maxLK
maxAF = self.maxAF
alignment = self.alignment
indexLK = self.indexLK
indexAF = self.indexAF
step = None
for i in range(start, end + 1):
prevAlignmentIndex = len(alignment)
if iAF + i <= maxAF:
step = self.compare(iLK, iAF + i, maxD, minR, debug=debug)
if step:
if debug:
print(f"[{iLK}~{iAF}] right {i}-jump to {iAF + i}")
thisAlignment = list(alignment[prevAlignmentIndex:])
alignment[prevAlignmentIndex:] = []
self.catchupAF(iAF, iAF + i - 1)
for thisComp in thisAlignment:
nAlignment = len(alignment)
thisLK = thisComp[0]
thisAF = thisComp[-1]
if thisLK:
indexLK[thisLK] = nAlignment
if thisAF:
indexAF[thisAF] = nAlignment
alignment.append(thisComp)
break
if iLK + i <= maxLK:
step = self.compare(iLK + i, iAF, maxD, minR, debug=debug)
if step:
if debug:
print(f"[{iLK}~{iAF}] left {i}-jump to {iLK + i}")
thisAlignment = list(alignment[prevAlignmentIndex:])
alignment[prevAlignmentIndex:] = []
self.catchupLK(iLK, iLK + i - 1)
for thisComp in thisAlignment:
nAlignment = len(alignment)
thisLK = thisComp[0]
thisAF = thisComp[-1]
if thisLK:
indexLK[thisLK] = nAlignment
if thisAF:
indexAF[thisAF] = nAlignment
alignment.append(thisComp)
break
return step
def doDiffs(self, startLK=1, startAF=1, steps=-1, show=False, debug=False):
"""Main loop of the alignment process.
Without optional parameters, performs the whole alignment until
it is completed or fails.
But you can also run a few alignment steps, from arbitrary positions
and record the decision steps. This is handy for debugging and
exploring.
!!! hint "doDiff"
This function defines an inner function `doDiff`, which
contains the logic of a single alignment step.
That function `doDiff` issues a sequence of `compare` and `lookup`
commands with various strictnesses and various jump sizes,
which will be tried in order.
Here you can finetune the amount of looseness you allow in a comparison
before jumping from that position.
You can first compare strictly, then jump away a short distance while
still comparing strictly, and then start comparing more loosely, and
then jump away a longer distance, possibly a bit more loosely.
That is a matter of trial and error.
The current implementation of `doDiff` matches the present LK and AF well,
especially in conjunction with the special cases that have been defined.
Parameters
----------
startLK: integer, optional 1
Start position in LK edition. If not passed, starts from the beginning.
startAF: integer, optional 1
Start position in AF edition. If not passed, starts from the beginning.
steps: integer, optional `-1`
The number of alignment steps to perform.
If not passed or -1, there is no limit on the steps.
show: boolean, optional False
If True, prints the resulting alignment table to the screen.
Only do this when you produce a small part of the alignment table!
maximum edit distance that we allow for the comparisons to succeed
debug: boolean, optional `False`
If True, print a statement indicating the result of the each decision
that has been taken during the alignment process.
"""
maxLK = self.maxLK
maxAF = self.maxAF
cases = self.cases
alignmentPath = self.alignmentPath
alignmentText = self.alignmentText
alignmentTextBlocked = self.alignmentTextBlocked
self.alignment = []
self.indexLK = {}
self.indexAF = {}
self.decisions = collections.Counter()
self.usedCases = set()
self.failedCases = set()
alignment = self.alignment
indexLK = self.indexLK
decisions = self.decisions
usedCases = self.usedCases
failedCases = self.failedCases
lastCase = None
step = (startLK, startAF)
it = 0
def doDiff(iLK, iAF):
decision = 0
if iLK > maxLK or iAF > maxAF:
if iAF < maxAF:
self.catchupAF(iAF, maxAF)
if iLK < maxLK:
self.catchupLK(iLK, maxLK)
decisions[decision] += 1
return True
nonlocal lastCase
decision += 1
if lastCase != iLK:
step = self.doCase(iLK, iAF, debug=debug)
if step:
lastCase = iLK
decisions[decision] += 1
return step
step = self.compare(iLK, iAF, 0, 1.0, debug=debug)
decision += 1
if step:
decisions[decision] += 1
return step
step = self.compare(iLK, iAF, 1, 0.8, debug=debug)
decision += 1
if step:
decisions[decision] += 1
return step
step = self.lookup(iLK, iAF, 0, 1.0, 1, 5, debug=debug)
decision += 1
if step:
decisions[decision] += 1
return step
step = self.compare(iLK, iAF, 1, 0.8, debug=debug)
decision += 1
if step:
decisions[decision] += 1
return step
step = self.compare(iLK, iAF, 2, 0.7, debug=debug)
decision += 1
if step:
decisions[decision] += 1
return step
step = self.compare(iLK, iAF, 3, 0.6, debug=debug)
decision += 1
if step:
decisions[decision] += 1
return step
step = self.lookup(iLK, iAF, 0, 1.0, 1, 10, debug=debug)
decision += 1
if step:
decisions[decision] += 1
return step
step = self.lookup(iLK, iAF, 1, 0.8, 1, 10, debug=debug)
decision += 1
if step:
decisions[decision] += 1
return step
step = self.lookup(iLK, iAF, 0, 1.0, 10, 20, debug=debug)
decision += 1
if step:
decisions[decision] += 1
return step
step = self.lookup(iLK, iAF, 1, 0.8, 10, 20, debug=debug)
decision += 1
if step:
decisions[decision] += 1
return step
step = self.lookup(iLK, iAF, 2, 0.7, 1, 20, debug=debug)
decision += 1
if step:
decisions[decision] += 1
return step
step = self.lookup(iLK, iAF, 1, 0.8, 20, 100, debug=debug)
decision += 1
if step:
decisions[decision] += 1
return step
step = self.lookup(iLK, iAF, 2, 0.7, 20, 100, debug=debug)
decision += 1
if step:
decisions[decision] += 1
return step
return False
while it != steps:
it += 1
step = doDiff(*step)
if step is True:
self.printAlignment(alignmentText)
self.printAlignment(alignmentPath, asTsv=True)
print(f"Alignment complete, {len(alignment)} entries.")
break
elif step is False:
self.printAlignment(alignmentTextBlocked)
print(f"Alignment blocked, {len(alignment)} entries.")
self.printDiff(20, 20)
break
endLK = startLK if len(indexLK) == 0 else max(indexLK)
possibleCases = set(range(startLK, endLK + 1))
casesSet = set(cases)
relevantCases = casesSet & possibleCases
missedCases = relevantCases - usedCases - failedCases
if failedCases:
for iLK in sorted(failedCases):
self.printCase(iLK, "FAILED")
if missedCases:
for iLK in sorted(missedCases):
self.printCase(iLK, "MISSED")
if not failedCases and not missedCases:
if not relevantCases:
print("No special cases defined for this stretch")
else:
print(
f"Special cases: all relevant {len(relevantCases)} cases "
"defined, encountered, and applied"
)
if show:
print(self.printLines())
for d in sorted(decisions):
n = decisions[d]
print(f"{d:>2} taken: {n:>5} x")
def analyseStretch(self, start, end):
"""Analyse a stretch in the alignment table and reports whether it is suspect.
Stretches are suspect if they contain both a number entries where an LK
word is missing, and a number of entries where an AF word is missing.
This is usually an indication that the alignment has gone out of sync.
Parameters
----------
start: integer
Index in the alignment list where to start
end: integer
Index in the alignment list where to stop
"""
alignment = self.alignment
total = 0
good = 0
onlyLK = 0
onlyAF = 0
for (iLK, left, d, r, right, iAF) in alignment[start : end + 1]:
total += 1
if not iLK:
onlyAF += 1
if not iAF:
onlyLK += 1
if d == 0:
good += 1
suspect = onlyAF > 1 and onlyLK > 1 and onlyAF + onlyLK > 5
return suspect
def check(self):
"""Comprehensive quality assessment of the alignment table.
Reports various statistics on how close the matches are overall.
Checks the minimum requirement of sanity:
All words in the LK must occur without duplications in the right order
int the alignment table. Same for the AF words.
Reports where the combinations are: the cases where more than one word
in the LK has matched with more than one word in the AF.
Or where single words have been forced to match, or where words
in one edition have no counterpart in the other edition.
Gives at most three examples of all distinct combination patterns.
Identifies bad strectches: chunks of entries within the alignment
table where words do not match for various reasons.
The identification is a bit loose in the sense that bad stretches
with a single perfect match in between are combined.
Some bad stretches are suspect (see `analyseStretch`).
All suspect stretches are shown, and at most three examples
of the benign (non-suspsect) bad stretches are shown.
"""
maxLK = self.maxLK
maxAF = self.maxAF
alignment = self.alignment
errors = {}
prevILK = 0
prevIAF = 0
where = collections.Counter()
agreement = collections.Counter()
badStretches = collections.defaultdict(lambda: [])
combinations = collections.defaultdict(lambda: [])
startBad = 0
startCombi = 0
nCombi = 0
for (c, (iLK, left, d, r, right, iAF)) in enumerate(alignment):
thisBad = d > 0 or not iLK or not iAF
hasLeft = left != ""
hasRight = right != ""
# a good line between bad lines is counted as bad
if not thisBad and startBad:
nextGood = True
for j in range(1, LOOKAHEAD + 1):
if c + j < len(alignment):
compJ = alignment[c + j]
if compJ[2] > 0 or not compJ[0] or not compJ[-1]:
nextGood = False
break
if not nextGood:
thisBad = True
if startBad:
if not thisBad:
badStretches[c - startBad].append(startBad)
startBad = 0
else:
if thisBad:
startBad = c
if startCombi and nCombi == c - startCombi:
startCombi = 0
nCombi = 0
if startCombi == 0:
thisCombi = hasLeft and left > 1 or hasRight and right > 1
if thisCombi:
combinations[(left, right)].append(c)
startCombi = c
nCombi = max((left, right))
agreement[d] += 1
if iLK:
if iLK != prevILK + 1:
errors.setdefault("wrong iLK", []).append(
f"{c:>5}: Expected {prevILK + 1}, found {iLK}"
)
prevILK = iLK
if iAF:
where["both"] += 1
else:
where[AF] += 1
if iAF:
if iAF != prevIAF + 1:
errors.setdefault("wrong iAF", []).append(
f"{c:>5}: Expected {prevIAF + 1}, found {iAF}"
)
prevIAF = iAF
else:
where[LK] += 1
if startBad:
badStretches[len(alignment) - startBad].append(startBad)
if prevILK < maxLK:
errors.setdefault("missing iLKs at the end", []).append(
f"last is {prevILK}, expected {maxLK}"
)
elif prevILK > maxLK:
errors.setdefault("too many iLKs at the end", []).append(
f"last is {prevILK}, expected {maxLK}"
)
if prevIAF < maxAF:
errors.setdefault("missing iAFs at the end", []).append(
f"last is {prevIAF}, expected {maxAF}"
)
elif prevIAF > maxAF:
errors.setdefault("too many iAFs at the end", []).append(
f"last is {prevIAF}, expected {maxAF}"
)
print("\nSANITY\n")
if not errors:
print("All OK")
else:
for (kind, msgs) in errors.items():
print(f"ERROR {kind} ({len(msgs):>5}x):")
for msg in msgs[0:10]:
print(f"\t{msg}")
if len(msgs) > 10:
print(f"\t ... and {len(msgs) - 10} more ...")
print("\nAGREEMENT\n")
print("Where are the words?\n")
print(f"\t{LK}-only: {where[LK]:>5} slots")
print(f"\t{AF}-only: {where[AF]:>5} slots")
print(f"\tboth: {where['both']:>5} slots")
print("\nHow well is the agreement?\n")
for (d, n) in sorted(agreement.items()):
print(f"edit distance {d:>3} : {n:>5} words")
print("NB: 88 are special cases that have been declared explicitly")
print("NB: 99 are words that have no counterpart in the other edition")
print("\nCOMBINATIONS\n")
print("What combination alignments are there and how many?")
for comb in sorted(combinations):
cs = combinations[comb]
(left, right) = comb
print(f"\t({left:>2}, {right:>2}) : {len(cs):>4} x :")
for (i, c) in enumerate(cs[0:3]):
print(f"EXAMPLE {i + 1}:")
print(
self.printLines(
start=max((1, c - 2)),
end=min((len(alignment), c + 2 + max(comb))),
)
)
print("")
print("\nBAD STRETCHES\n")
print("How many of which size?\n")
allSuspects = []
someBenigns = []
for (size, starts) in sorted(badStretches.items(), key=lambda x: (-x[0], x[1])):
suspects = {
start: size
for start in starts
if self.analyseStretch(start, start + size)
}
benigns = {start: size for start in starts if start not in suspects}
allSuspects.extend(
[(start, start + size) for (start, size) in suspects.items()]
)
someBenigns.extend(
[(start, start + size) for (start, size) in list(benigns.items())[0:3]]
)
examples = ", ".join(str(start) for start in list(suspects.keys())[0:3])
if not suspects:
examples = ", ".join(str(start) for start in list(benigns.keys())[0:3])
print(
f"bad stretches of size {size:>3} : {len(suspects):>4} suspect of total {len(starts):>4} x see e.g. {examples}"
)
print(
f"\nShowing all {len(allSuspects)} inversion suspects"
if len(allSuspects)
else "\nNo suspect bad stretches\n"
)
for (i, (start, end)) in enumerate(reversed(allSuspects)):
print(f"\nSUSPECT {i + 1:>2}")
print(self.printLines(max((1, start - 5)), min((len(alignment), end + 5))))
print(
f"\nShowing some ({len(someBenigns)}) benign examples"
if len(someBenigns)
else "\nNo bad stretches\n"
)
for (i, (start, end)) in enumerate(someBenigns):
print(f"\nBENIGN {i + 1:>2}")
print(self.printLines(max((1, start - 2)), min((len(alignment), end + 2))))
Functions
def similar(s1, s2, maxD, minR)
-
Expand source code Browse git
def similar(s1, s2, maxD, minR): if s1 == s2: return (True, 0, 1.0) if maxD == 0 or minR == 1: return (False, 77, 0.0) d = distance(s1, s2) r = ratio(s1, s2) return (d <= maxD and r >= minR, d, r)
Classes
class Alignment
-
Aligns the words of the LK edition with those of the AF edition.
The main method is
doDiffs
which produces the alignment table.Expand source code Browse git
class Alignment: """Aligns the words of the LK edition with those of the AF edition. The main method is `doDiffs` which produces the alignment table. """ def __init__(self): self.getCombis(COMBINE) def readEditions(self, version, cases): """Read the material of both LK and AF editions. Parameters ---------- version: string The version number of the source datasets for both editions. We assume that both editions are available as Text-Fabric datasets. cases: dict The special cases defined to help the alignment algorithm through difficult spots. The dictionary is keyed by LK slot number, and the value is a tuple consisting of a corresponding AF slot number, the amount of words to collect at the LK side, and the amount of words to collect at the AF side. Returns ------- None But the Alignment object will have attributes set by which the other methods are capable of reading the data source. """ self.version = version self.cases = cases self.alignmentPath = f"{ALIGNMENT_DIR}/alignment{version}.tsv" self.alignmentText = f"{ALIGNMENT_DIR}/alignment{version}.txt" self.alignmentTextBlocked = f"{ALIGNMENT_DIR}/alignment-incomplete{version}.txt" if not os.path.exists(ALIGNMENT_DIR): os.makedirs(self.alignmentPath, exist_ok=True) A = {} F = {} T = {} L = {} maxSlot = {} for (acro, name) in EDITIONS.items(): A[acro] = use( f"among/fusus/tf/{name}:clone", writing="ara", version=version ) F[acro] = A[acro].api.F T[acro] = A[acro].api.T L[acro] = A[acro].api.L maxSlot[acro] = F[acro].otype.maxSlot self.F_LK = F[LK] self.F_AF = F[AF] self.T_LK = T[LK] self.T_AF = T[AF] self.L_LK = L[LK] self.L_AF = L[AF] self.maxLK = maxSlot[LK] self.maxAF = maxSlot[AF] self.getTextLK = F[LK].lettersn.v self.getTextAF = F[AF].lettersn.v def getPosLK(self, slot): """Obtain the page and line numbers where a slot in the LK occurs. """ T_LK = self.T_LK (piece, page, line) = T_LK.sectionFromNode(slot) return f"{page:>03}:{line:>02}" def getPosAF(self, slot): """Obtain the page and line numbers where a slot in the AF occurs. """ T_AF = self.T_AF (page, block, line) = T_AF.sectionFromNode(slot) return f"{page:>03}:{line:>02}" def printLines(self, start=0, end=None): """Print specified entries in the alignment table. Parameters ---------- start: integer, optional 0 Position of first alignment entry that needs to be printed. If it is 0 or negative, it starts from the beginning. end: integer, optional `None` Position of last alignment entry that needs to be printed. If it is negative, it indicates a position that much before the end of the list. If it is absent, it indicates the last line of the list. """ getTextLK = self.getTextLK getTextAF = self.getTextAF alignment = self.alignment if start < 0: start = 0 if end is None or end > len(alignment): end = len(alignment) lines = [] lines.append( "pag:ln|slot |cc|textLakhnawi |@ed~rat|textAfifi |cc| slot|pag:ln" ) lines.append( "------|-----|--|--------------------|-------|--------------------|--|-----|------" ) for (iLK, left, d, r, right, iAF) in alignment[start:end]: textLK = getTextLK(iLK) if iLK else "" textAF = getTextAF(iAF) if iAF else "" posLK = self.getPosLK(iLK) if iLK else "" posAF = self.getPosAF(iAF) if iAF else "" lines.append( f"{posLK:>6}|{iLK:>5}|{left:<2}|{textLK:>20}|@{d:<2}~{r:3.1f}|{textAF:<20}|{right:>2}|{iAF:>5} {posAF:<6}" ) return "\n".join(lines) def printAlignment(self, path, asTsv=False): """Prints the whole alignment table to file. Parameters ---------- path: string The path name of the file to which the information is printed. asTsv: boolean, optional `False If False, pretty prints the table to file, using `printLines` above. If True, write essential data only in tab separated format. The essential information is *slot in LK*, *distance*, *slot in AF* """ alignment = self.alignment with open(path, "w") as fh: if asTsv: for item in alignment: fh.write( "\t".join( f"{it:3.1f}" if i == 3 else str(it) for (i, it) in enumerate(item) ) + "\n" ) else: fh.write(self.printLines()) fh.write("\n") def printDiff(self, before, after): """Print the last alignment entries plus what comes after. This function is useful if alignment fails at some point. It then shows what happened before the failure and how it looks after the failure. Parameters ---------- before: integer The amount of alignment entries to print. They will be picked from the end of the current alignment table. after: integer The amount of slots after the last alignment entry that needs to be shown for each source. """ alignment = self.alignment maxLK = self.maxLK maxAF = self.maxAF getTextLK = self.getTextLK getTextAF = self.getTextAF print(self.printLines(start=len(alignment) - before)) print("^" * 67) lastLK = None lastAF = None for c in range(len(alignment) - 1, -1, -1): comp = alignment[c] if lastLK is None: if comp[0]: lastLK = comp[0] if lastAF is None: if comp[-1]: lastAF = comp[-1] if lastLK is not None and lastAF is not None: break if lastLK is not None and lastAF is not None: for i in range(after): iLK = lastLK + 1 + i iAF = lastAF + 1 + i textLK = getTextLK(iLK) if iLK <= maxLK else "" textAF = getTextAF(iAF) if iAF <= maxAF else "" d = distance(textLK, textAF) r = ratio(textLK, textAF) print( f"{iLK:>5} = {textLK:>20} @{d:>2}~{r:3.1f} {textAF:<20} = {iAF:>5}" ) def printCase(self, iLK, label): """Print the alignment entries that belong to a special case. Parameters ---------- iLK: integer The slot number in LK that triggered a special case. label: string The kind of issue with this case that you want to report: FAILED or MISSING. """ indexLK = self.indexLK indexAF = self.indexAF cases = self.cases case = cases[iLK] (iAF, cLK, cAF) = case start = min((indexLK[iLK], indexAF[iAF])) end = max((indexLK[iLK + cLK + 1], indexAF[iAF + cAF + 1])) print(f"{label} CASE: {iLK} vs {iAF}:") print(self.printLines(start=start, end=end)) def getCombis(self, c): """Precompute a specification of all combinations that might be tried. When the alignment fails between single words, we try combinations of words left and right in a specific order. The result of this function lists the combinations that must be tried, where each combination is specified as a tuple of the number of words left and the number of words right that needs to be taken together for the matching. See also: `findCommbi`. """ combis = [] for i in range(1, c + 1): for j in range(1, c + 1): if i != 1 or j != 1: combis.append((i, j)) self.combis = tuple( sorted(combis, key=lambda x: (x[0] + x[1], abs(x[0] - x[1]))) ) def catchupAF(self, start, end): """Move the current position in the AF edition forward. While doing so, it adds an entry to the alignment table for each time the position is shifted by one. Parameters ---------- start: integer From where to start shifting end: integer Where to end shifting """ indexAF = self.indexAF alignment = self.alignment for i in range(start, end + 1): indexAF[i] = len(alignment) alignment.append(("", 0, 99, 0.0, "", i)) def catchupLK(self, start, end): """Move the current position in the LK edition forward. While doing so, it adds an entry to the alignment table for each time the position is shifted by one. Parameters ---------- start: integer From where to start shifting end: integer Where to end shifting """ indexLK = self.indexLK alignment = self.alignment for i in range(start, end + 1): indexLK[i] = len(alignment) alignment.append((i, "", 99, 0.0, 0, "")) def doCase(self, iLK, iAF, debug=False): """Deal with a special case. Parameters ---------- iLK: integer Current position in LK edition. A case is triggered when iLK is equal to a key in the cases table and if the same case has not already been applied. This last condition is relevant for special cases where the number of LK words that must be combined is 0. In that case, the current LK position does not shift, and we would be in an infinite loop. iAF: integer Current position in AF edition. If a case is triggered by iLK, but the AF position specified in the case does not match iAF, the case is skipped and is reported as a failed case. """ indexLK = self.indexLK indexAF = self.indexAF alignment = self.alignment usedCases = self.usedCases failedCases = self.failedCases cases = self.cases if iLK not in cases: return None (givenIAF, cLK, cAF) = cases[iLK] if givenIAF != iAF: failedCases.add(iLK) return None usedCases.add(iLK) common = min((cLK, cAF)) for i in range(max((cLK, cAF))): nAlignment = len(alignment) if i < common: indexLK[iLK + i] = nAlignment indexAF[iAF + i] = nAlignment alignment.append((iLK + i, cLK, 88, 0.0, cAF, iAF + i)) elif i < cLK: indexLK[iLK + i] = nAlignment alignment.append((iLK + i, cLK, 88, 0.0, cAF, "")) else: indexAF[iAF + i] = nAlignment alignment.append(("", cLK, 88, 0.0, cAF, iAF + i)) if debug: print(f"[{iLK}~{iAF}] special case ({cLK}, {cAF})") return (iLK + cLK, iAF + cAF) def findCombi(self, iLK, iAF, maxD, minR): """Tries out all possible combinations at this point until success. This is typically called when a direct match at the current slots in LK and AF fail. We are going to take together small amounts of words and match them instead. As soon as we have a match, we break out of the loop and use step to the new positions. See also: `getCombis` and `compare`. Parameters ---------- iLK: integer Current slot position in LK edition iAF: integer Current slot position in AF edition maxD: integer maximum edit distance that we allow for the comparisons to succeed minR: integer minimum similarity ratio that we require for the comparisons to succeed """ combis = self.combis maxLK = self.maxLK maxAF = self.maxAF getTextLK = self.getTextLK getTextAF = self.getTextAF alignment = self.alignment indexLK = self.indexLK indexAF = self.indexAF found = None for (cLK, cAF) in combis: if iLK + cLK > maxLK or iAF + cAF > maxAF: continue textLK = "".join(getTextLK(iLK + i) for i in range(cLK)) textAF = "".join(getTextAF(iAF + i) for i in range(cAF)) (isSimilar, d, r) = similar(textLK, textAF, maxD, minR) if isSimilar: found = (cLK, cAF) common = min((cLK, cAF)) for i in range(max((cLK, cAF))): nAlignment = len(alignment) if i < common: indexLK[iLK + i] = nAlignment indexAF[iAF + i] = nAlignment alignment.append((iLK + i, cLK, d, r, cAF, iAF + i)) elif i < cLK: indexLK[iLK + i] = nAlignment alignment.append((iLK + i, cLK, d, r, cAF, "")) elif i < cAF: indexAF[iAF + i] = nAlignment alignment.append(("", cLK, d, r, cAF, iAF + i)) break return found def compare(self, iLK, iAF, maxD, minR, debug=False): """Does a full comparison at a location, including between combinations. First a direct match between the words at the current positions in the LK and AF editions is tried. If that fails, combinations of words from those points onwards are tried. See also: `getCombis` and `findCombi`. Parameters ---------- iLK: integer Current slot position in LK edition iAF: integer Current slot position in AF edition maxD: integer maximum edit distance that we allow for the comparisons to succeed minR: integer minimum similarity ratio that we require for the comparisons to succeed debug: boolean, optional `False` If True, print a statement indicating the result of the comparison. """ maxLK = self.maxLK maxAF = self.maxAF getTextLK = self.getTextLK getTextAF = self.getTextAF alignment = self.alignment indexLK = self.indexLK indexAF = self.indexAF textLK = getTextLK(iLK) textAF = getTextAF(iAF) (isSimilar, d, r) = similar(textLK, textAF, maxD, minR) if isSimilar: nAlignment = len(alignment) indexLK[iLK] = nAlignment indexAF[iAF] = nAlignment alignment.append((iLK, "", d, r, "", iAF)) if debug: print( f"[{iLK}~{iAF}] single comparison with distance <= {maxD} and ratio >= {minR}" ) return (iLK + 1, iAF + 1) if iLK < maxLK and iAF < maxAF: textLK = getTextLK(iLK + 1) textAF = getTextAF(iAF + 1) (isSimilarNext, dNext, rNext) = similar(textLK, textAF, maxD, minR) if isSimilarNext: nAlignment = len(alignment) indexLK[iLK] = nAlignment indexAF[iAF] = nAlignment alignment.append((iLK, "", dNext, rNext, "", iAF)) if debug: print( f"[{iLK}~{iAF}] single comparison failed with distance <= {maxD} and ratio >= {minR}" ) nAlignment = len(alignment) indexLK[iLK + 1] = nAlignment indexAF[iAF + 1] = nAlignment alignment.append((iLK + 1, "", dNext, rNext, "", iAF + 1)) if debug: print( f"[{iLK}~{iAF}] single comparison recovered with distance <= {maxD} and ratio >= {minR}" ) return (iLK + 2, iAF + 2) combi = self.findCombi(iLK, iAF, maxD, minR) if combi is not None: (cLK, cAF) = combi if debug: print( f"[{iLK}~{iAF}] ({cLK}, {cAF}) comparison with distance <= {maxD} and ratio >= {minR}" ) return (iLK + cLK, iAF + cAF) return None def lookup(self, iLK, iAF, maxD, minR, start, end, debug=False): """Performs a jump in both editions. Typically invoked when comparison at the current locations fail, even after having tried combinations. We compare the current word in one edition with those from a certain interval in the other edition. And vice versa, alternately. of subsequent words. Parameters ---------- iLK: integer Current slot position in LK edition iAF: integer Current slot position in AF edition maxD: integer maximum edit distance that we allow for the comparisons to succeed minR: integer minimum similarity ratio that we require for the comparisons to succeed start: integer first position in the other edition where we start comparing end: integer last position in the other edition where we stop comparing debug: boolean, optional `False` If True, print a statement indicating the result of the lookup. """ maxLK = self.maxLK maxAF = self.maxAF alignment = self.alignment indexLK = self.indexLK indexAF = self.indexAF step = None for i in range(start, end + 1): prevAlignmentIndex = len(alignment) if iAF + i <= maxAF: step = self.compare(iLK, iAF + i, maxD, minR, debug=debug) if step: if debug: print(f"[{iLK}~{iAF}] right {i}-jump to {iAF + i}") thisAlignment = list(alignment[prevAlignmentIndex:]) alignment[prevAlignmentIndex:] = [] self.catchupAF(iAF, iAF + i - 1) for thisComp in thisAlignment: nAlignment = len(alignment) thisLK = thisComp[0] thisAF = thisComp[-1] if thisLK: indexLK[thisLK] = nAlignment if thisAF: indexAF[thisAF] = nAlignment alignment.append(thisComp) break if iLK + i <= maxLK: step = self.compare(iLK + i, iAF, maxD, minR, debug=debug) if step: if debug: print(f"[{iLK}~{iAF}] left {i}-jump to {iLK + i}") thisAlignment = list(alignment[prevAlignmentIndex:]) alignment[prevAlignmentIndex:] = [] self.catchupLK(iLK, iLK + i - 1) for thisComp in thisAlignment: nAlignment = len(alignment) thisLK = thisComp[0] thisAF = thisComp[-1] if thisLK: indexLK[thisLK] = nAlignment if thisAF: indexAF[thisAF] = nAlignment alignment.append(thisComp) break return step def doDiffs(self, startLK=1, startAF=1, steps=-1, show=False, debug=False): """Main loop of the alignment process. Without optional parameters, performs the whole alignment until it is completed or fails. But you can also run a few alignment steps, from arbitrary positions and record the decision steps. This is handy for debugging and exploring. !!! hint "doDiff" This function defines an inner function `doDiff`, which contains the logic of a single alignment step. That function `doDiff` issues a sequence of `compare` and `lookup` commands with various strictnesses and various jump sizes, which will be tried in order. Here you can finetune the amount of looseness you allow in a comparison before jumping from that position. You can first compare strictly, then jump away a short distance while still comparing strictly, and then start comparing more loosely, and then jump away a longer distance, possibly a bit more loosely. That is a matter of trial and error. The current implementation of `doDiff` matches the present LK and AF well, especially in conjunction with the special cases that have been defined. Parameters ---------- startLK: integer, optional 1 Start position in LK edition. If not passed, starts from the beginning. startAF: integer, optional 1 Start position in AF edition. If not passed, starts from the beginning. steps: integer, optional `-1` The number of alignment steps to perform. If not passed or -1, there is no limit on the steps. show: boolean, optional False If True, prints the resulting alignment table to the screen. Only do this when you produce a small part of the alignment table! maximum edit distance that we allow for the comparisons to succeed debug: boolean, optional `False` If True, print a statement indicating the result of the each decision that has been taken during the alignment process. """ maxLK = self.maxLK maxAF = self.maxAF cases = self.cases alignmentPath = self.alignmentPath alignmentText = self.alignmentText alignmentTextBlocked = self.alignmentTextBlocked self.alignment = [] self.indexLK = {} self.indexAF = {} self.decisions = collections.Counter() self.usedCases = set() self.failedCases = set() alignment = self.alignment indexLK = self.indexLK decisions = self.decisions usedCases = self.usedCases failedCases = self.failedCases lastCase = None step = (startLK, startAF) it = 0 def doDiff(iLK, iAF): decision = 0 if iLK > maxLK or iAF > maxAF: if iAF < maxAF: self.catchupAF(iAF, maxAF) if iLK < maxLK: self.catchupLK(iLK, maxLK) decisions[decision] += 1 return True nonlocal lastCase decision += 1 if lastCase != iLK: step = self.doCase(iLK, iAF, debug=debug) if step: lastCase = iLK decisions[decision] += 1 return step step = self.compare(iLK, iAF, 0, 1.0, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.compare(iLK, iAF, 1, 0.8, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 0, 1.0, 1, 5, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.compare(iLK, iAF, 1, 0.8, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.compare(iLK, iAF, 2, 0.7, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.compare(iLK, iAF, 3, 0.6, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 0, 1.0, 1, 10, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 1, 0.8, 1, 10, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 0, 1.0, 10, 20, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 1, 0.8, 10, 20, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 2, 0.7, 1, 20, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 1, 0.8, 20, 100, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 2, 0.7, 20, 100, debug=debug) decision += 1 if step: decisions[decision] += 1 return step return False while it != steps: it += 1 step = doDiff(*step) if step is True: self.printAlignment(alignmentText) self.printAlignment(alignmentPath, asTsv=True) print(f"Alignment complete, {len(alignment)} entries.") break elif step is False: self.printAlignment(alignmentTextBlocked) print(f"Alignment blocked, {len(alignment)} entries.") self.printDiff(20, 20) break endLK = startLK if len(indexLK) == 0 else max(indexLK) possibleCases = set(range(startLK, endLK + 1)) casesSet = set(cases) relevantCases = casesSet & possibleCases missedCases = relevantCases - usedCases - failedCases if failedCases: for iLK in sorted(failedCases): self.printCase(iLK, "FAILED") if missedCases: for iLK in sorted(missedCases): self.printCase(iLK, "MISSED") if not failedCases and not missedCases: if not relevantCases: print("No special cases defined for this stretch") else: print( f"Special cases: all relevant {len(relevantCases)} cases " "defined, encountered, and applied" ) if show: print(self.printLines()) for d in sorted(decisions): n = decisions[d] print(f"{d:>2} taken: {n:>5} x") def analyseStretch(self, start, end): """Analyse a stretch in the alignment table and reports whether it is suspect. Stretches are suspect if they contain both a number entries where an LK word is missing, and a number of entries where an AF word is missing. This is usually an indication that the alignment has gone out of sync. Parameters ---------- start: integer Index in the alignment list where to start end: integer Index in the alignment list where to stop """ alignment = self.alignment total = 0 good = 0 onlyLK = 0 onlyAF = 0 for (iLK, left, d, r, right, iAF) in alignment[start : end + 1]: total += 1 if not iLK: onlyAF += 1 if not iAF: onlyLK += 1 if d == 0: good += 1 suspect = onlyAF > 1 and onlyLK > 1 and onlyAF + onlyLK > 5 return suspect def check(self): """Comprehensive quality assessment of the alignment table. Reports various statistics on how close the matches are overall. Checks the minimum requirement of sanity: All words in the LK must occur without duplications in the right order int the alignment table. Same for the AF words. Reports where the combinations are: the cases where more than one word in the LK has matched with more than one word in the AF. Or where single words have been forced to match, or where words in one edition have no counterpart in the other edition. Gives at most three examples of all distinct combination patterns. Identifies bad strectches: chunks of entries within the alignment table where words do not match for various reasons. The identification is a bit loose in the sense that bad stretches with a single perfect match in between are combined. Some bad stretches are suspect (see `analyseStretch`). All suspect stretches are shown, and at most three examples of the benign (non-suspsect) bad stretches are shown. """ maxLK = self.maxLK maxAF = self.maxAF alignment = self.alignment errors = {} prevILK = 0 prevIAF = 0 where = collections.Counter() agreement = collections.Counter() badStretches = collections.defaultdict(lambda: []) combinations = collections.defaultdict(lambda: []) startBad = 0 startCombi = 0 nCombi = 0 for (c, (iLK, left, d, r, right, iAF)) in enumerate(alignment): thisBad = d > 0 or not iLK or not iAF hasLeft = left != "" hasRight = right != "" # a good line between bad lines is counted as bad if not thisBad and startBad: nextGood = True for j in range(1, LOOKAHEAD + 1): if c + j < len(alignment): compJ = alignment[c + j] if compJ[2] > 0 or not compJ[0] or not compJ[-1]: nextGood = False break if not nextGood: thisBad = True if startBad: if not thisBad: badStretches[c - startBad].append(startBad) startBad = 0 else: if thisBad: startBad = c if startCombi and nCombi == c - startCombi: startCombi = 0 nCombi = 0 if startCombi == 0: thisCombi = hasLeft and left > 1 or hasRight and right > 1 if thisCombi: combinations[(left, right)].append(c) startCombi = c nCombi = max((left, right)) agreement[d] += 1 if iLK: if iLK != prevILK + 1: errors.setdefault("wrong iLK", []).append( f"{c:>5}: Expected {prevILK + 1}, found {iLK}" ) prevILK = iLK if iAF: where["both"] += 1 else: where[AF] += 1 if iAF: if iAF != prevIAF + 1: errors.setdefault("wrong iAF", []).append( f"{c:>5}: Expected {prevIAF + 1}, found {iAF}" ) prevIAF = iAF else: where[LK] += 1 if startBad: badStretches[len(alignment) - startBad].append(startBad) if prevILK < maxLK: errors.setdefault("missing iLKs at the end", []).append( f"last is {prevILK}, expected {maxLK}" ) elif prevILK > maxLK: errors.setdefault("too many iLKs at the end", []).append( f"last is {prevILK}, expected {maxLK}" ) if prevIAF < maxAF: errors.setdefault("missing iAFs at the end", []).append( f"last is {prevIAF}, expected {maxAF}" ) elif prevIAF > maxAF: errors.setdefault("too many iAFs at the end", []).append( f"last is {prevIAF}, expected {maxAF}" ) print("\nSANITY\n") if not errors: print("All OK") else: for (kind, msgs) in errors.items(): print(f"ERROR {kind} ({len(msgs):>5}x):") for msg in msgs[0:10]: print(f"\t{msg}") if len(msgs) > 10: print(f"\t ... and {len(msgs) - 10} more ...") print("\nAGREEMENT\n") print("Where are the words?\n") print(f"\t{LK}-only: {where[LK]:>5} slots") print(f"\t{AF}-only: {where[AF]:>5} slots") print(f"\tboth: {where['both']:>5} slots") print("\nHow well is the agreement?\n") for (d, n) in sorted(agreement.items()): print(f"edit distance {d:>3} : {n:>5} words") print("NB: 88 are special cases that have been declared explicitly") print("NB: 99 are words that have no counterpart in the other edition") print("\nCOMBINATIONS\n") print("What combination alignments are there and how many?") for comb in sorted(combinations): cs = combinations[comb] (left, right) = comb print(f"\t({left:>2}, {right:>2}) : {len(cs):>4} x :") for (i, c) in enumerate(cs[0:3]): print(f"EXAMPLE {i + 1}:") print( self.printLines( start=max((1, c - 2)), end=min((len(alignment), c + 2 + max(comb))), ) ) print("") print("\nBAD STRETCHES\n") print("How many of which size?\n") allSuspects = [] someBenigns = [] for (size, starts) in sorted(badStretches.items(), key=lambda x: (-x[0], x[1])): suspects = { start: size for start in starts if self.analyseStretch(start, start + size) } benigns = {start: size for start in starts if start not in suspects} allSuspects.extend( [(start, start + size) for (start, size) in suspects.items()] ) someBenigns.extend( [(start, start + size) for (start, size) in list(benigns.items())[0:3]] ) examples = ", ".join(str(start) for start in list(suspects.keys())[0:3]) if not suspects: examples = ", ".join(str(start) for start in list(benigns.keys())[0:3]) print( f"bad stretches of size {size:>3} : {len(suspects):>4} suspect of total {len(starts):>4} x see e.g. {examples}" ) print( f"\nShowing all {len(allSuspects)} inversion suspects" if len(allSuspects) else "\nNo suspect bad stretches\n" ) for (i, (start, end)) in enumerate(reversed(allSuspects)): print(f"\nSUSPECT {i + 1:>2}") print(self.printLines(max((1, start - 5)), min((len(alignment), end + 5)))) print( f"\nShowing some ({len(someBenigns)}) benign examples" if len(someBenigns) else "\nNo bad stretches\n" ) for (i, (start, end)) in enumerate(someBenigns): print(f"\nBENIGN {i + 1:>2}") print(self.printLines(max((1, start - 2)), min((len(alignment), end + 2))))
Methods
def analyseStretch(self, start, end)
-
Analyse a stretch in the alignment table and reports whether it is suspect.
Stretches are suspect if they contain both a number entries where an LK word is missing, and a number of entries where an AF word is missing.
This is usually an indication that the alignment has gone out of sync.
Parameters
start
:integer
- Index in the alignment list where to start
end
:integer
- Index in the alignment list where to stop
Expand source code Browse git
def analyseStretch(self, start, end): """Analyse a stretch in the alignment table and reports whether it is suspect. Stretches are suspect if they contain both a number entries where an LK word is missing, and a number of entries where an AF word is missing. This is usually an indication that the alignment has gone out of sync. Parameters ---------- start: integer Index in the alignment list where to start end: integer Index in the alignment list where to stop """ alignment = self.alignment total = 0 good = 0 onlyLK = 0 onlyAF = 0 for (iLK, left, d, r, right, iAF) in alignment[start : end + 1]: total += 1 if not iLK: onlyAF += 1 if not iAF: onlyLK += 1 if d == 0: good += 1 suspect = onlyAF > 1 and onlyLK > 1 and onlyAF + onlyLK > 5 return suspect
def catchupAF(self, start, end)
-
Move the current position in the AF edition forward.
While doing so, it adds an entry to the alignment table for each time the position is shifted by one.
Parameters
start
:integer
- From where to start shifting
end
:integer
- Where to end shifting
Expand source code Browse git
def catchupAF(self, start, end): """Move the current position in the AF edition forward. While doing so, it adds an entry to the alignment table for each time the position is shifted by one. Parameters ---------- start: integer From where to start shifting end: integer Where to end shifting """ indexAF = self.indexAF alignment = self.alignment for i in range(start, end + 1): indexAF[i] = len(alignment) alignment.append(("", 0, 99, 0.0, "", i))
def catchupLK(self, start, end)
-
Move the current position in the LK edition forward.
While doing so, it adds an entry to the alignment table for each time the position is shifted by one.
Parameters
start
:integer
- From where to start shifting
end
:integer
- Where to end shifting
Expand source code Browse git
def catchupLK(self, start, end): """Move the current position in the LK edition forward. While doing so, it adds an entry to the alignment table for each time the position is shifted by one. Parameters ---------- start: integer From where to start shifting end: integer Where to end shifting """ indexLK = self.indexLK alignment = self.alignment for i in range(start, end + 1): indexLK[i] = len(alignment) alignment.append((i, "", 99, 0.0, 0, ""))
def check(self)
-
Comprehensive quality assessment of the alignment table.
Reports various statistics on how close the matches are overall.
Checks the minimum requirement of sanity: All words in the LK must occur without duplications in the right order int the alignment table. Same for the AF words.
Reports where the combinations are: the cases where more than one word in the LK has matched with more than one word in the AF. Or where single words have been forced to match, or where words in one edition have no counterpart in the other edition. Gives at most three examples of all distinct combination patterns.
Identifies bad strectches: chunks of entries within the alignment table where words do not match for various reasons. The identification is a bit loose in the sense that bad stretches with a single perfect match in between are combined.
Some bad stretches are suspect (see
analyseStretch
). All suspect stretches are shown, and at most three examples of the benign (non-suspsect) bad stretches are shown.Expand source code Browse git
def check(self): """Comprehensive quality assessment of the alignment table. Reports various statistics on how close the matches are overall. Checks the minimum requirement of sanity: All words in the LK must occur without duplications in the right order int the alignment table. Same for the AF words. Reports where the combinations are: the cases where more than one word in the LK has matched with more than one word in the AF. Or where single words have been forced to match, or where words in one edition have no counterpart in the other edition. Gives at most three examples of all distinct combination patterns. Identifies bad strectches: chunks of entries within the alignment table where words do not match for various reasons. The identification is a bit loose in the sense that bad stretches with a single perfect match in between are combined. Some bad stretches are suspect (see `analyseStretch`). All suspect stretches are shown, and at most three examples of the benign (non-suspsect) bad stretches are shown. """ maxLK = self.maxLK maxAF = self.maxAF alignment = self.alignment errors = {} prevILK = 0 prevIAF = 0 where = collections.Counter() agreement = collections.Counter() badStretches = collections.defaultdict(lambda: []) combinations = collections.defaultdict(lambda: []) startBad = 0 startCombi = 0 nCombi = 0 for (c, (iLK, left, d, r, right, iAF)) in enumerate(alignment): thisBad = d > 0 or not iLK or not iAF hasLeft = left != "" hasRight = right != "" # a good line between bad lines is counted as bad if not thisBad and startBad: nextGood = True for j in range(1, LOOKAHEAD + 1): if c + j < len(alignment): compJ = alignment[c + j] if compJ[2] > 0 or not compJ[0] or not compJ[-1]: nextGood = False break if not nextGood: thisBad = True if startBad: if not thisBad: badStretches[c - startBad].append(startBad) startBad = 0 else: if thisBad: startBad = c if startCombi and nCombi == c - startCombi: startCombi = 0 nCombi = 0 if startCombi == 0: thisCombi = hasLeft and left > 1 or hasRight and right > 1 if thisCombi: combinations[(left, right)].append(c) startCombi = c nCombi = max((left, right)) agreement[d] += 1 if iLK: if iLK != prevILK + 1: errors.setdefault("wrong iLK", []).append( f"{c:>5}: Expected {prevILK + 1}, found {iLK}" ) prevILK = iLK if iAF: where["both"] += 1 else: where[AF] += 1 if iAF: if iAF != prevIAF + 1: errors.setdefault("wrong iAF", []).append( f"{c:>5}: Expected {prevIAF + 1}, found {iAF}" ) prevIAF = iAF else: where[LK] += 1 if startBad: badStretches[len(alignment) - startBad].append(startBad) if prevILK < maxLK: errors.setdefault("missing iLKs at the end", []).append( f"last is {prevILK}, expected {maxLK}" ) elif prevILK > maxLK: errors.setdefault("too many iLKs at the end", []).append( f"last is {prevILK}, expected {maxLK}" ) if prevIAF < maxAF: errors.setdefault("missing iAFs at the end", []).append( f"last is {prevIAF}, expected {maxAF}" ) elif prevIAF > maxAF: errors.setdefault("too many iAFs at the end", []).append( f"last is {prevIAF}, expected {maxAF}" ) print("\nSANITY\n") if not errors: print("All OK") else: for (kind, msgs) in errors.items(): print(f"ERROR {kind} ({len(msgs):>5}x):") for msg in msgs[0:10]: print(f"\t{msg}") if len(msgs) > 10: print(f"\t ... and {len(msgs) - 10} more ...") print("\nAGREEMENT\n") print("Where are the words?\n") print(f"\t{LK}-only: {where[LK]:>5} slots") print(f"\t{AF}-only: {where[AF]:>5} slots") print(f"\tboth: {where['both']:>5} slots") print("\nHow well is the agreement?\n") for (d, n) in sorted(agreement.items()): print(f"edit distance {d:>3} : {n:>5} words") print("NB: 88 are special cases that have been declared explicitly") print("NB: 99 are words that have no counterpart in the other edition") print("\nCOMBINATIONS\n") print("What combination alignments are there and how many?") for comb in sorted(combinations): cs = combinations[comb] (left, right) = comb print(f"\t({left:>2}, {right:>2}) : {len(cs):>4} x :") for (i, c) in enumerate(cs[0:3]): print(f"EXAMPLE {i + 1}:") print( self.printLines( start=max((1, c - 2)), end=min((len(alignment), c + 2 + max(comb))), ) ) print("") print("\nBAD STRETCHES\n") print("How many of which size?\n") allSuspects = [] someBenigns = [] for (size, starts) in sorted(badStretches.items(), key=lambda x: (-x[0], x[1])): suspects = { start: size for start in starts if self.analyseStretch(start, start + size) } benigns = {start: size for start in starts if start not in suspects} allSuspects.extend( [(start, start + size) for (start, size) in suspects.items()] ) someBenigns.extend( [(start, start + size) for (start, size) in list(benigns.items())[0:3]] ) examples = ", ".join(str(start) for start in list(suspects.keys())[0:3]) if not suspects: examples = ", ".join(str(start) for start in list(benigns.keys())[0:3]) print( f"bad stretches of size {size:>3} : {len(suspects):>4} suspect of total {len(starts):>4} x see e.g. {examples}" ) print( f"\nShowing all {len(allSuspects)} inversion suspects" if len(allSuspects) else "\nNo suspect bad stretches\n" ) for (i, (start, end)) in enumerate(reversed(allSuspects)): print(f"\nSUSPECT {i + 1:>2}") print(self.printLines(max((1, start - 5)), min((len(alignment), end + 5)))) print( f"\nShowing some ({len(someBenigns)}) benign examples" if len(someBenigns) else "\nNo bad stretches\n" ) for (i, (start, end)) in enumerate(someBenigns): print(f"\nBENIGN {i + 1:>2}") print(self.printLines(max((1, start - 2)), min((len(alignment), end + 2))))
def compare(self, iLK, iAF, maxD, minR, debug=False)
-
Does a full comparison at a location, including between combinations.
First a direct match between the words at the current positions in the LK and AF editions is tried. If that fails, combinations of words from those points onwards are tried.
See also:
getCombis
andfindCombi
.Parameters
iLK
:integer
- Current slot position in LK edition
iAF
:integer
- Current slot position in AF edition
maxD
:integer
- maximum edit distance that we allow for the comparisons to succeed
minR
:integer
- minimum similarity ratio that we require for the comparisons to succeed
debug
:boolean
, optionalFalse
- If True, print a statement indicating the result of the comparison.
Expand source code Browse git
def compare(self, iLK, iAF, maxD, minR, debug=False): """Does a full comparison at a location, including between combinations. First a direct match between the words at the current positions in the LK and AF editions is tried. If that fails, combinations of words from those points onwards are tried. See also: `getCombis` and `findCombi`. Parameters ---------- iLK: integer Current slot position in LK edition iAF: integer Current slot position in AF edition maxD: integer maximum edit distance that we allow for the comparisons to succeed minR: integer minimum similarity ratio that we require for the comparisons to succeed debug: boolean, optional `False` If True, print a statement indicating the result of the comparison. """ maxLK = self.maxLK maxAF = self.maxAF getTextLK = self.getTextLK getTextAF = self.getTextAF alignment = self.alignment indexLK = self.indexLK indexAF = self.indexAF textLK = getTextLK(iLK) textAF = getTextAF(iAF) (isSimilar, d, r) = similar(textLK, textAF, maxD, minR) if isSimilar: nAlignment = len(alignment) indexLK[iLK] = nAlignment indexAF[iAF] = nAlignment alignment.append((iLK, "", d, r, "", iAF)) if debug: print( f"[{iLK}~{iAF}] single comparison with distance <= {maxD} and ratio >= {minR}" ) return (iLK + 1, iAF + 1) if iLK < maxLK and iAF < maxAF: textLK = getTextLK(iLK + 1) textAF = getTextAF(iAF + 1) (isSimilarNext, dNext, rNext) = similar(textLK, textAF, maxD, minR) if isSimilarNext: nAlignment = len(alignment) indexLK[iLK] = nAlignment indexAF[iAF] = nAlignment alignment.append((iLK, "", dNext, rNext, "", iAF)) if debug: print( f"[{iLK}~{iAF}] single comparison failed with distance <= {maxD} and ratio >= {minR}" ) nAlignment = len(alignment) indexLK[iLK + 1] = nAlignment indexAF[iAF + 1] = nAlignment alignment.append((iLK + 1, "", dNext, rNext, "", iAF + 1)) if debug: print( f"[{iLK}~{iAF}] single comparison recovered with distance <= {maxD} and ratio >= {minR}" ) return (iLK + 2, iAF + 2) combi = self.findCombi(iLK, iAF, maxD, minR) if combi is not None: (cLK, cAF) = combi if debug: print( f"[{iLK}~{iAF}] ({cLK}, {cAF}) comparison with distance <= {maxD} and ratio >= {minR}" ) return (iLK + cLK, iAF + cAF) return None
def doCase(self, iLK, iAF, debug=False)
-
Deal with a special case.
Parameters
iLK
:integer
- Current position in LK edition. A case is triggered when iLK is equal to a key in the cases table and if the same case has not already been applied. This last condition is relevant for special cases where the number of LK words that must be combined is 0. In that case, the current LK position does not shift, and we would be in an infinite loop.
iAF
:integer
- Current position in AF edition. If a case is triggered by iLK, but the AF position specified in the case does not match iAF, the case is skipped and is reported as a failed case.
Expand source code Browse git
def doCase(self, iLK, iAF, debug=False): """Deal with a special case. Parameters ---------- iLK: integer Current position in LK edition. A case is triggered when iLK is equal to a key in the cases table and if the same case has not already been applied. This last condition is relevant for special cases where the number of LK words that must be combined is 0. In that case, the current LK position does not shift, and we would be in an infinite loop. iAF: integer Current position in AF edition. If a case is triggered by iLK, but the AF position specified in the case does not match iAF, the case is skipped and is reported as a failed case. """ indexLK = self.indexLK indexAF = self.indexAF alignment = self.alignment usedCases = self.usedCases failedCases = self.failedCases cases = self.cases if iLK not in cases: return None (givenIAF, cLK, cAF) = cases[iLK] if givenIAF != iAF: failedCases.add(iLK) return None usedCases.add(iLK) common = min((cLK, cAF)) for i in range(max((cLK, cAF))): nAlignment = len(alignment) if i < common: indexLK[iLK + i] = nAlignment indexAF[iAF + i] = nAlignment alignment.append((iLK + i, cLK, 88, 0.0, cAF, iAF + i)) elif i < cLK: indexLK[iLK + i] = nAlignment alignment.append((iLK + i, cLK, 88, 0.0, cAF, "")) else: indexAF[iAF + i] = nAlignment alignment.append(("", cLK, 88, 0.0, cAF, iAF + i)) if debug: print(f"[{iLK}~{iAF}] special case ({cLK}, {cAF})") return (iLK + cLK, iAF + cAF)
def doDiffs(self, startLK=1, startAF=1, steps=-1, show=False, debug=False)
-
Main loop of the alignment process.
Without optional parameters, performs the whole alignment until it is completed or fails.
But you can also run a few alignment steps, from arbitrary positions and record the decision steps. This is handy for debugging and exploring.
doDiff
This function defines an inner function
doDiff
, which contains the logic of a single alignment step. That functiondoDiff
issues a sequence ofcompare
andlookup
commands with various strictnesses and various jump sizes, which will be tried in order.Here you can finetune the amount of looseness you allow in a comparison before jumping from that position. You can first compare strictly, then jump away a short distance while still comparing strictly, and then start comparing more loosely, and then jump away a longer distance, possibly a bit more loosely. That is a matter of trial and error.
The current implementation of
doDiff
matches the present LK and AF well, especially in conjunction with the special cases that have been defined.Parameters
startLK
:integer
, optional1
- Start position in LK edition. If not passed, starts from the beginning.
startAF
:integer
, optional1
- Start position in AF edition. If not passed, starts from the beginning.
steps
:integer
, optional-1
- The number of alignment steps to perform. If not passed or -1, there is no limit on the steps.
show
:boolean
, optionalFalse
- If True, prints the resulting alignment table to the screen. Only do this when you produce a small part of the alignment table! maximum edit distance that we allow for the comparisons to succeed
debug
:boolean
, optionalFalse
- If True, print a statement indicating the result of the each decision that has been taken during the alignment process.
Expand source code Browse git
def doDiffs(self, startLK=1, startAF=1, steps=-1, show=False, debug=False): """Main loop of the alignment process. Without optional parameters, performs the whole alignment until it is completed or fails. But you can also run a few alignment steps, from arbitrary positions and record the decision steps. This is handy for debugging and exploring. !!! hint "doDiff" This function defines an inner function `doDiff`, which contains the logic of a single alignment step. That function `doDiff` issues a sequence of `compare` and `lookup` commands with various strictnesses and various jump sizes, which will be tried in order. Here you can finetune the amount of looseness you allow in a comparison before jumping from that position. You can first compare strictly, then jump away a short distance while still comparing strictly, and then start comparing more loosely, and then jump away a longer distance, possibly a bit more loosely. That is a matter of trial and error. The current implementation of `doDiff` matches the present LK and AF well, especially in conjunction with the special cases that have been defined. Parameters ---------- startLK: integer, optional 1 Start position in LK edition. If not passed, starts from the beginning. startAF: integer, optional 1 Start position in AF edition. If not passed, starts from the beginning. steps: integer, optional `-1` The number of alignment steps to perform. If not passed or -1, there is no limit on the steps. show: boolean, optional False If True, prints the resulting alignment table to the screen. Only do this when you produce a small part of the alignment table! maximum edit distance that we allow for the comparisons to succeed debug: boolean, optional `False` If True, print a statement indicating the result of the each decision that has been taken during the alignment process. """ maxLK = self.maxLK maxAF = self.maxAF cases = self.cases alignmentPath = self.alignmentPath alignmentText = self.alignmentText alignmentTextBlocked = self.alignmentTextBlocked self.alignment = [] self.indexLK = {} self.indexAF = {} self.decisions = collections.Counter() self.usedCases = set() self.failedCases = set() alignment = self.alignment indexLK = self.indexLK decisions = self.decisions usedCases = self.usedCases failedCases = self.failedCases lastCase = None step = (startLK, startAF) it = 0 def doDiff(iLK, iAF): decision = 0 if iLK > maxLK or iAF > maxAF: if iAF < maxAF: self.catchupAF(iAF, maxAF) if iLK < maxLK: self.catchupLK(iLK, maxLK) decisions[decision] += 1 return True nonlocal lastCase decision += 1 if lastCase != iLK: step = self.doCase(iLK, iAF, debug=debug) if step: lastCase = iLK decisions[decision] += 1 return step step = self.compare(iLK, iAF, 0, 1.0, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.compare(iLK, iAF, 1, 0.8, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 0, 1.0, 1, 5, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.compare(iLK, iAF, 1, 0.8, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.compare(iLK, iAF, 2, 0.7, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.compare(iLK, iAF, 3, 0.6, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 0, 1.0, 1, 10, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 1, 0.8, 1, 10, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 0, 1.0, 10, 20, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 1, 0.8, 10, 20, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 2, 0.7, 1, 20, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 1, 0.8, 20, 100, debug=debug) decision += 1 if step: decisions[decision] += 1 return step step = self.lookup(iLK, iAF, 2, 0.7, 20, 100, debug=debug) decision += 1 if step: decisions[decision] += 1 return step return False while it != steps: it += 1 step = doDiff(*step) if step is True: self.printAlignment(alignmentText) self.printAlignment(alignmentPath, asTsv=True) print(f"Alignment complete, {len(alignment)} entries.") break elif step is False: self.printAlignment(alignmentTextBlocked) print(f"Alignment blocked, {len(alignment)} entries.") self.printDiff(20, 20) break endLK = startLK if len(indexLK) == 0 else max(indexLK) possibleCases = set(range(startLK, endLK + 1)) casesSet = set(cases) relevantCases = casesSet & possibleCases missedCases = relevantCases - usedCases - failedCases if failedCases: for iLK in sorted(failedCases): self.printCase(iLK, "FAILED") if missedCases: for iLK in sorted(missedCases): self.printCase(iLK, "MISSED") if not failedCases and not missedCases: if not relevantCases: print("No special cases defined for this stretch") else: print( f"Special cases: all relevant {len(relevantCases)} cases " "defined, encountered, and applied" ) if show: print(self.printLines()) for d in sorted(decisions): n = decisions[d] print(f"{d:>2} taken: {n:>5} x")
def findCombi(self, iLK, iAF, maxD, minR)
-
Tries out all possible combinations at this point until success.
This is typically called when a direct match at the current slots in LK and AF fail.
We are going to take together small amounts of words and match them instead. As soon as we have a match, we break out of the loop and use step to the new positions.
See also:
getCombis
andcompare
.Parameters
iLK
:integer
- Current slot position in LK edition
iAF
:integer
- Current slot position in AF edition
maxD
:integer
- maximum edit distance that we allow for the comparisons to succeed
minR
:integer
- minimum similarity ratio that we require for the comparisons to succeed
Expand source code Browse git
def findCombi(self, iLK, iAF, maxD, minR): """Tries out all possible combinations at this point until success. This is typically called when a direct match at the current slots in LK and AF fail. We are going to take together small amounts of words and match them instead. As soon as we have a match, we break out of the loop and use step to the new positions. See also: `getCombis` and `compare`. Parameters ---------- iLK: integer Current slot position in LK edition iAF: integer Current slot position in AF edition maxD: integer maximum edit distance that we allow for the comparisons to succeed minR: integer minimum similarity ratio that we require for the comparisons to succeed """ combis = self.combis maxLK = self.maxLK maxAF = self.maxAF getTextLK = self.getTextLK getTextAF = self.getTextAF alignment = self.alignment indexLK = self.indexLK indexAF = self.indexAF found = None for (cLK, cAF) in combis: if iLK + cLK > maxLK or iAF + cAF > maxAF: continue textLK = "".join(getTextLK(iLK + i) for i in range(cLK)) textAF = "".join(getTextAF(iAF + i) for i in range(cAF)) (isSimilar, d, r) = similar(textLK, textAF, maxD, minR) if isSimilar: found = (cLK, cAF) common = min((cLK, cAF)) for i in range(max((cLK, cAF))): nAlignment = len(alignment) if i < common: indexLK[iLK + i] = nAlignment indexAF[iAF + i] = nAlignment alignment.append((iLK + i, cLK, d, r, cAF, iAF + i)) elif i < cLK: indexLK[iLK + i] = nAlignment alignment.append((iLK + i, cLK, d, r, cAF, "")) elif i < cAF: indexAF[iAF + i] = nAlignment alignment.append(("", cLK, d, r, cAF, iAF + i)) break return found
def getCombis(self, c)
-
Precompute a specification of all combinations that might be tried.
When the alignment fails between single words, we try combinations of words left and right in a specific order. The result of this function lists the combinations that must be tried, where each combination is specified as a tuple of the number of words left and the number of words right that needs to be taken together for the matching.
See also:
findCommbi
.Expand source code Browse git
def getCombis(self, c): """Precompute a specification of all combinations that might be tried. When the alignment fails between single words, we try combinations of words left and right in a specific order. The result of this function lists the combinations that must be tried, where each combination is specified as a tuple of the number of words left and the number of words right that needs to be taken together for the matching. See also: `findCommbi`. """ combis = [] for i in range(1, c + 1): for j in range(1, c + 1): if i != 1 or j != 1: combis.append((i, j)) self.combis = tuple( sorted(combis, key=lambda x: (x[0] + x[1], abs(x[0] - x[1]))) )
def getPosAF(self, slot)
-
Obtain the page and line numbers where a slot in the AF occurs.
Expand source code Browse git
def getPosAF(self, slot): """Obtain the page and line numbers where a slot in the AF occurs. """ T_AF = self.T_AF (page, block, line) = T_AF.sectionFromNode(slot) return f"{page:>03}:{line:>02}"
def getPosLK(self, slot)
-
Obtain the page and line numbers where a slot in the LK occurs.
Expand source code Browse git
def getPosLK(self, slot): """Obtain the page and line numbers where a slot in the LK occurs. """ T_LK = self.T_LK (piece, page, line) = T_LK.sectionFromNode(slot) return f"{page:>03}:{line:>02}"
def lookup(self, iLK, iAF, maxD, minR, start, end, debug=False)
-
Performs a jump in both editions.
Typically invoked when comparison at the current locations fail, even after having tried combinations.
We compare the current word in one edition with those from a certain interval in the other edition. And vice versa, alternately. of subsequent words.
Parameters
iLK
:integer
- Current slot position in LK edition
iAF
:integer
- Current slot position in AF edition
maxD
:integer
- maximum edit distance that we allow for the comparisons to succeed
minR
:integer
- minimum similarity ratio that we require for the comparisons to succeed
start
:integer
- first position in the other edition where we start comparing
end
:integer
- last position in the other edition where we stop comparing
debug
:boolean
, optionalFalse
- If True, print a statement indicating the result of the lookup.
Expand source code Browse git
def lookup(self, iLK, iAF, maxD, minR, start, end, debug=False): """Performs a jump in both editions. Typically invoked when comparison at the current locations fail, even after having tried combinations. We compare the current word in one edition with those from a certain interval in the other edition. And vice versa, alternately. of subsequent words. Parameters ---------- iLK: integer Current slot position in LK edition iAF: integer Current slot position in AF edition maxD: integer maximum edit distance that we allow for the comparisons to succeed minR: integer minimum similarity ratio that we require for the comparisons to succeed start: integer first position in the other edition where we start comparing end: integer last position in the other edition where we stop comparing debug: boolean, optional `False` If True, print a statement indicating the result of the lookup. """ maxLK = self.maxLK maxAF = self.maxAF alignment = self.alignment indexLK = self.indexLK indexAF = self.indexAF step = None for i in range(start, end + 1): prevAlignmentIndex = len(alignment) if iAF + i <= maxAF: step = self.compare(iLK, iAF + i, maxD, minR, debug=debug) if step: if debug: print(f"[{iLK}~{iAF}] right {i}-jump to {iAF + i}") thisAlignment = list(alignment[prevAlignmentIndex:]) alignment[prevAlignmentIndex:] = [] self.catchupAF(iAF, iAF + i - 1) for thisComp in thisAlignment: nAlignment = len(alignment) thisLK = thisComp[0] thisAF = thisComp[-1] if thisLK: indexLK[thisLK] = nAlignment if thisAF: indexAF[thisAF] = nAlignment alignment.append(thisComp) break if iLK + i <= maxLK: step = self.compare(iLK + i, iAF, maxD, minR, debug=debug) if step: if debug: print(f"[{iLK}~{iAF}] left {i}-jump to {iLK + i}") thisAlignment = list(alignment[prevAlignmentIndex:]) alignment[prevAlignmentIndex:] = [] self.catchupLK(iLK, iLK + i - 1) for thisComp in thisAlignment: nAlignment = len(alignment) thisLK = thisComp[0] thisAF = thisComp[-1] if thisLK: indexLK[thisLK] = nAlignment if thisAF: indexAF[thisAF] = nAlignment alignment.append(thisComp) break return step
def printAlignment(self, path, asTsv=False)
-
Prints the whole alignment table to file.
Parameters
path
:string
- The path name of the file to which the information is printed.
asTsv
:boolean
, optional`False
- If False, pretty prints the table to file, using
printLines
above. If True, write essential data only in tab separated format. The essential information is slot in LK, distance, slot in AF
Expand source code Browse git
def printAlignment(self, path, asTsv=False): """Prints the whole alignment table to file. Parameters ---------- path: string The path name of the file to which the information is printed. asTsv: boolean, optional `False If False, pretty prints the table to file, using `printLines` above. If True, write essential data only in tab separated format. The essential information is *slot in LK*, *distance*, *slot in AF* """ alignment = self.alignment with open(path, "w") as fh: if asTsv: for item in alignment: fh.write( "\t".join( f"{it:3.1f}" if i == 3 else str(it) for (i, it) in enumerate(item) ) + "\n" ) else: fh.write(self.printLines()) fh.write("\n")
def printCase(self, iLK, label)
-
Print the alignment entries that belong to a special case.
Parameters
iLK
:integer
- The slot number in LK that triggered a special case.
label
:string
- The kind of issue with this case that you want to report: FAILED or MISSING.
Expand source code Browse git
def printCase(self, iLK, label): """Print the alignment entries that belong to a special case. Parameters ---------- iLK: integer The slot number in LK that triggered a special case. label: string The kind of issue with this case that you want to report: FAILED or MISSING. """ indexLK = self.indexLK indexAF = self.indexAF cases = self.cases case = cases[iLK] (iAF, cLK, cAF) = case start = min((indexLK[iLK], indexAF[iAF])) end = max((indexLK[iLK + cLK + 1], indexAF[iAF + cAF + 1])) print(f"{label} CASE: {iLK} vs {iAF}:") print(self.printLines(start=start, end=end))
def printDiff(self, before, after)
-
Print the last alignment entries plus what comes after.
This function is useful if alignment fails at some point. It then shows what happened before the failure and how it looks after the failure.
Parameters
before
:integer
- The amount of alignment entries to print. They will be picked from the end of the current alignment table.
after
:integer
- The amount of slots after the last alignment entry that needs to be shown for each source.
Expand source code Browse git
def printDiff(self, before, after): """Print the last alignment entries plus what comes after. This function is useful if alignment fails at some point. It then shows what happened before the failure and how it looks after the failure. Parameters ---------- before: integer The amount of alignment entries to print. They will be picked from the end of the current alignment table. after: integer The amount of slots after the last alignment entry that needs to be shown for each source. """ alignment = self.alignment maxLK = self.maxLK maxAF = self.maxAF getTextLK = self.getTextLK getTextAF = self.getTextAF print(self.printLines(start=len(alignment) - before)) print("^" * 67) lastLK = None lastAF = None for c in range(len(alignment) - 1, -1, -1): comp = alignment[c] if lastLK is None: if comp[0]: lastLK = comp[0] if lastAF is None: if comp[-1]: lastAF = comp[-1] if lastLK is not None and lastAF is not None: break if lastLK is not None and lastAF is not None: for i in range(after): iLK = lastLK + 1 + i iAF = lastAF + 1 + i textLK = getTextLK(iLK) if iLK <= maxLK else "" textAF = getTextAF(iAF) if iAF <= maxAF else "" d = distance(textLK, textAF) r = ratio(textLK, textAF) print( f"{iLK:>5} = {textLK:>20} @{d:>2}~{r:3.1f} {textAF:<20} = {iAF:>5}" )
def printLines(self, start=0, end=None)
-
Print specified entries in the alignment table.
Parameters
start
:integer
, optional0
- Position of first alignment entry that needs to be printed. If it is 0 or negative, it starts from the beginning.
end
:integer
, optionalNone
- Position of last alignment entry that needs to be printed. If it is negative, it indicates a position that much before the end of the list. If it is absent, it indicates the last line of the list.
Expand source code Browse git
def printLines(self, start=0, end=None): """Print specified entries in the alignment table. Parameters ---------- start: integer, optional 0 Position of first alignment entry that needs to be printed. If it is 0 or negative, it starts from the beginning. end: integer, optional `None` Position of last alignment entry that needs to be printed. If it is negative, it indicates a position that much before the end of the list. If it is absent, it indicates the last line of the list. """ getTextLK = self.getTextLK getTextAF = self.getTextAF alignment = self.alignment if start < 0: start = 0 if end is None or end > len(alignment): end = len(alignment) lines = [] lines.append( "pag:ln|slot |cc|textLakhnawi |@ed~rat|textAfifi |cc| slot|pag:ln" ) lines.append( "------|-----|--|--------------------|-------|--------------------|--|-----|------" ) for (iLK, left, d, r, right, iAF) in alignment[start:end]: textLK = getTextLK(iLK) if iLK else "" textAF = getTextAF(iAF) if iAF else "" posLK = self.getPosLK(iLK) if iLK else "" posAF = self.getPosAF(iAF) if iAF else "" lines.append( f"{posLK:>6}|{iLK:>5}|{left:<2}|{textLK:>20}|@{d:<2}~{r:3.1f}|{textAF:<20}|{right:>2}|{iAF:>5} {posAF:<6}" ) return "\n".join(lines)
def readEditions(self, version, cases)
-
Read the material of both LK and AF editions.
Parameters
version
:string
- The version number of the source datasets for both editions. We assume that both editions are available as Text-Fabric datasets.
cases
:dict
-
The special cases defined to help the alignment algorithm through difficult spots.
The dictionary is keyed by LK slot number, and the value is a tuple consisting of a corresponding AF slot number, the amount of words to collect at the LK side, and the amount of words to collect at the AF side.
Returns
None
- But the Alignment object will have attributes set by which the other methods are capable of reading the data source.
Expand source code Browse git
def readEditions(self, version, cases): """Read the material of both LK and AF editions. Parameters ---------- version: string The version number of the source datasets for both editions. We assume that both editions are available as Text-Fabric datasets. cases: dict The special cases defined to help the alignment algorithm through difficult spots. The dictionary is keyed by LK slot number, and the value is a tuple consisting of a corresponding AF slot number, the amount of words to collect at the LK side, and the amount of words to collect at the AF side. Returns ------- None But the Alignment object will have attributes set by which the other methods are capable of reading the data source. """ self.version = version self.cases = cases self.alignmentPath = f"{ALIGNMENT_DIR}/alignment{version}.tsv" self.alignmentText = f"{ALIGNMENT_DIR}/alignment{version}.txt" self.alignmentTextBlocked = f"{ALIGNMENT_DIR}/alignment-incomplete{version}.txt" if not os.path.exists(ALIGNMENT_DIR): os.makedirs(self.alignmentPath, exist_ok=True) A = {} F = {} T = {} L = {} maxSlot = {} for (acro, name) in EDITIONS.items(): A[acro] = use( f"among/fusus/tf/{name}:clone", writing="ara", version=version ) F[acro] = A[acro].api.F T[acro] = A[acro].api.T L[acro] = A[acro].api.L maxSlot[acro] = F[acro].otype.maxSlot self.F_LK = F[LK] self.F_AF = F[AF] self.T_LK = T[LK] self.T_AF = T[AF] self.L_LK = L[LK] self.L_AF = L[AF] self.maxLK = maxSlot[LK] self.maxAF = maxSlot[AF] self.getTextLK = F[LK].lettersn.v self.getTextAF = F[AF].lettersn.v