Module fusus.about.transcriptionl

Lakhnawi transcription

The Text-Fabric data is derived from the Lakhnawi PDF by reverse engineering. The PDF is a textual PDF with an unusual usage of fonts to obtain desired effects with ligatures and diacritics.

Divisions

The text is divided into the following chunks

Piece

Section level 1

Logical unit, corresponding to the main division of the work: bezel. (The title of the work is: bezels of wisdom.)

Some pieces are in fact introductory chapters, and not the bezels of the main work.

Features

name type description
n int sequence number of a piece, starting with 1
np int sequence number of a proper content piece, i.e. a bezel
title str title of a piece

Page

Section level 2

Physical unit: a printed page.

Features

name type description
n int sequence number of a page, starting with 1

Line

Section level 3

Physical unit: a printed line within a page.

Features

name type description
n int sequence number of a page, starting with 1

Column

Logical/physical unit: a column within a line.

Note that the page is not divided into columns. Some lines are divided into columns in hemistic poems. See Lakhnawi.columns.

Span

Logical/physical unit: a strectch of text with the same writing direction. Whenever the writing direction reverses, a new span is started.

Features

name type description
n int sequence number of a span within a column or line
dir str writing direction of a span; either r or l

Sentence

Logical unit: a sentence, defined by the full-stop marker. Whenever the writing direction reverses, a new span is started.

Features

name type description
n int sequence number of a span within a column or line

Word

Logical/physical unit: individual words in as far they are separated by whitespace.

Imperfect whitespace detection

We do not guarantee that whitespace has been detected perfectly. So we do miss word boundaries on the one hand, and we have spurious word boundaries on the other hand.

Features

name type description
boxl int left x-coordinate of the bounding box of a word
boxt int top y-coordinate of the bounding box of a word
boxr int right x-coordinate of the bounding box of a word
boxb int bottom y-coordinate of the bounding box of a word
letters str the text of a word in Arabic, unicode, without punctuation
lettersn str the text of a word in beta code, latin + diacritics
lettersp str the text of a word in beta code, ascii
letterst str the text of a word in romanized transcription
punc str the punctuation and/or space immediately after a word in Arabic, unicode
punca str the punctuation and/or space immediately after a word in ascii
Expand source code Browse git
"""
.. include:: ../docs/about/transcriptionl.md
"""