Use ocrmypdf tool to create pdf/a during conversion

- Use another external tool to convert pdf to pdf which also adds the
  extracted text as another layer into the pdf

- Although not used, the external conversion routine will now check
  for an existing text file that is named as the pdf file with extension
  `.txt`. If present it is included in the conversion result and will be
  used as the extracted text.

- text extraction for pdf files happens now on the converted file,
  because it may already contain the text from the conversion step and
  thus avoids running OCR twice.

- All errors during conversion are not fatal; processing continues
  without a converted file.
This commit is contained in:
Eike Kettner
2020-07-18 12:48:41 +02:00
parent 99210365ce
commit 3d49ceaab5
16 changed files with 316 additions and 21 deletions

View File

@ -77,6 +77,10 @@ component.
office documents into PDF files. It uses libreoffice/openoffice.
- [wkhtmltopdf](https://wkhtmltopdf.org/) is used to convert HTML into
PDF files.
- [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF) can be optionally
used to convert PDF to PDF files. It adds an OCR layer to scanned
PDF files to make them searchable. It also creates PDF/A files from
the input pdf.
The performance of `unoconv` can be improved by starting `unoconv -l`
in a separate process. This runs a libreoffice/openoffice listener
@ -87,7 +91,7 @@ therefore avoids starting one each time `unoconv` is called.
On Debian this should install all joex requirements:
``` bash
sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf
sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf ocrmypdf
```