Module fusus.about.howto

Install and update

code and documentation fusus.about.install

  • get code
  • update docs
  • update code

Run

Straight from the command line fusus.convert

  • run the OCR pipeline from the command line
  • run the PDF extraction from the command line
  • convert TSV to TF

Contribute more sources

From "no comments" to "more comments" fusus.works

  • add commentaries as works

Explore

Page by page in a notebook

  • do example Run the pipeline in a notebook on the examples.
  • do Afifi Run the pipeline in a notebook on the Afifi edition of the Fusus.
  • inspect Inspect intermediate results in a notebook.
  • ocr Read the proofs of Kraken-OCR.
  • notebooks on nbviewer. All notebooks.

Tweak

Sickness and cure by parameters

  • tweak Basic parameter tweaking.
  • fusus.parameters All parameters.
  • comma A ministudy in cleaning: tweak mark templates and parameters to wipe commas.
  • lines Follow the line detection algorithm in a wide variety of cases.
  • piece What to do if you have an image that is a small fragment of a page.

Engineer

Change the flow

  • fusus.lakhnawi PDF reverse engineering.

  • drilldown Narrow down to specific pages and lines and see what text is extracted from which portion.

  • pages Work with pages, follow line division, extract text and save to disk.
  • characters See which characters are in the PDF and how they are converted.
  • final See the effect of final characters on spacing.
  • border See how black borders get removed from a page. See also cropBorders() and removeBorders().

Work

Do data science with the results

  • fusus Description of the TSV output of the pipeline and the PDF text extraction.
  • useTsv Use the TSV output of the pipeline.
  • useTf Use the Text-Fabric output of the pipeline.
  • boxes Work with bounding boxes in the Text-Fabric data of the Lakhnawi.
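
As a rough illustration of what working with the TSV output can look like, the sketch below reads a tab-separated table with a header row and groups words by page and line. It uses only the Python standard library. The column names `page`, `line` and `word`, the sample data and the helper `words_per_line` are hypothetical and chosen for illustration only; the actual columns are described in the fusus entry above and in the useTsv notebook.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: the column names are assumptions for this sketch,
# not the documented layout of the fusus TSV files.
SAMPLE_TSV = (
    "page\tline\tword\n"
    "1\t1\talpha\n"
    "1\t1\tbeta\n"
    "1\t2\tgamma\n"
    "2\t1\tdelta\n"
)


def words_per_line(handle):
    """Group words by (page, line) from a tab-separated file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    lines = defaultdict(list)
    for row in reader:
        key = (int(row["page"]), int(row["line"]))
        lines[key].append(row["word"])
    return dict(lines)


# An in-memory file stands in for a real TSV file here.
result = words_per_line(io.StringIO(SAMPLE_TSV))
for (page, line), words in sorted(result.items()):
    print(page, line, " ".join(words))
```

To apply the same function to a file on disk, open it with `open(path, newline="", encoding="utf8")` and pass the handle in place of the `io.StringIO` object, after adjusting the column names to those of the real file.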
"""
.. include:: ../docs/about/howto.md
"""