Use ocrmypdf tool to create pdf/a during conversion

- Use another external tool to convert pdf to pdf which also adds the extracted text as another layer into the pdf - Although not used, the external conversion routine will now check for an existing text file that is named as the pdf file with extension `.txt`. If present it is included in the conversion result and will be used as the extracted text. - text extraction for pdf files happens now on the converted file, because it may already contain the text from the conversion step and thus avoids running OCR twice. - All errors during conversion are not fatal; processing continues without a converted file.
2025-09-28 23:58:21 +00:00 · 2020-07-18 12:48:41 +02:00
parent 99210365ce
commit 3d49ceaab5
16 changed files with 316 additions and 21 deletions
--- a/modules/microsite/docs/doc/install.md
+++ b/modules/microsite/docs/doc/install.md
@@ -77,6 +77,10 @@ component.
  office documents into PDF files. It uses libreoffice/openoffice.
 - [wkhtmltopdf](https://wkhtmltopdf.org/) is used to convert HTML into
  PDF files.
+- [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF) can be optionally
+  used to convert PDF to PDF files. It adds an OCR layer to scanned
+  PDF files to make them searchable. It also creates PDF/A files from
+  the input pdf.

 The performance of `unoconv` can be improved by starting `unoconv -l`
 in a separate process. This runs a libreoffice/openoffice listener
@@ -87,7 +91,7 @@ therefore avoids starting one each time `unoconv` is called.
 On Debian this should install all joex requirements:

 ``` bash
-sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf
+sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf ocrmypdf
 ```