Use the ocrmypdf tool to create PDF/A files during conversion

- Use another external tool (ocrmypdf) for PDF-to-PDF conversion; it
  also adds the extracted text to the PDF as an additional layer.

- Although not currently used, the external conversion routine now
  checks for an existing text file with the same name as the PDF file
  but the extension `.txt`. If present, it is included in the conversion
  result and used as the extracted text.

- Text extraction for PDF files now happens on the converted file,
  because it may already contain the text from the conversion step.
  This avoids running OCR twice.

- Errors during conversion are never fatal; processing continues
  without a converted file.
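
The behaviour described above can be illustrated with a minimal Python
sketch. It is not the code from this commit: `ConversionResult` and
`convert_pdf` are hypothetical names, the converter command is passed
in by the caller, and the sketch assumes that "named as the pdf file
with extension `.txt`" means the extension is replaced (`out.pdf`
becomes `out.txt`).

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Sequence


@dataclass
class ConversionResult:
    pdf: Optional[Path]  # converted file, or None if the conversion failed
    text: Optional[str]  # text found next to the converted file, if any


def convert_pdf(command: Sequence[str], out_pdf: Path) -> ConversionResult:
    """Run an external converter that is expected to write `out_pdf`."""
    try:
        subprocess.run(list(command), check=True, capture_output=True, timeout=600)
    except (OSError, subprocess.SubprocessError):
        # Conversion errors are not fatal: continue without a converted file.
        return ConversionResult(pdf=None, text=None)

    if not out_pdf.is_file():
        # The tool exited cleanly but produced no output file.
        return ConversionResult(pdf=None, text=None)

    # Assumption: the text file shares the PDF's base name (out.pdf -> out.txt).
    sidecar = out_pdf.with_suffix(".txt")
    text = None
    if sidecar.is_file():
        text = sidecar.read_text(encoding="utf-8", errors="replace")
    return ConversionResult(pdf=out_pdf, text=text)
```

A caller would skip its own text extraction when `text` is set and
fall back to the original file when `pdf` is `None`.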
Eike Kettner
2020-07-18 12:48:41 +02:00
parent 99210365ce
commit 3d49ceaab5
16 changed files with 316 additions and 21 deletions

@@ -23,3 +23,4 @@ Some early information about certain details can be found in a few
 - [0012 Periodic Tasks](adr/0012_periodic_tasks)
 - [0013 Archive Files](adr/0013_archive_files)
 - [0014 Full-Text Search](adr/0014_fulltext_search_engine)
+- [0015 Convert PDF files](adr/0015_convert_pdf_files)