Use ocrmypdf tool to create pdf/a during conversion

- Use another external tool to convert pdf to pdf which also adds the extracted text as another layer into the pdf - Although not used, the external conversion routine will now check for an existing text file that is named as the pdf file with extension `.txt`. If present it is included in the conversion result and will be used as the extracted text. - text extraction for pdf files happens now on the converted file, because it may already contain the text from the conversion step and thus avoids running OCR twice. - All errors during conversion are not fatal; processing continues without a converted file.
2025-06-22 02:18:26 +00:00 · 2020-07-18 12:48:41 +02:00
parent 99210365ce
commit 3d49ceaab5
16 changed files with 316 additions and 21 deletions
--- a/modules/microsite/docs/dev/adr.md
+++ b/modules/microsite/docs/dev/adr.md
@ -23,3 +23,4 @@ Some early information about certain details can be found in a few
 - [0012 Periodic Tasks](adr/0012_periodic_tasks)
 - [0013 Archive Files](adr/0013_archive_files)
 - [0014 Full-Text Search](adr/0014_fulltext_search_engine)
+- [0015 Convert PDF files](adr/0015_convert_pdf_files)
--- a/modules/microsite/docs/dev/adr/0015_convert_pdf_files.md
+++ b/modules/microsite/docs/dev/adr/0015_convert_pdf_files.md
@ -0,0 +1,67 @@
+---
+layout: docs
+title: Convert PDF Files
+permalink: dev/adr/0015_convert_pdf_files
+---
+
+# {{ page.title }}
+
+## Context and Problem Statement
+
+Some PDFs contain only images (when coming from a scanner) and
+therefore one is not able to click into the pdf and select text for
+copy&paste. Also it is not searchable in a PDF viewer. These are
+really shortcomings that can be fixed, especially when there is
+already OCR build in.
+
+For images, this works already as tesseract is used to create the PDF
+files. Tesseract creates the files with an additional text layer
+containing the OCRed text.
+
+## Considered Options
+
+* [ocrmypdf](https://github.com/jbarlow83/OCRmyPDF) OCRmyPDF adds an
+  OCR text layer to scanned PDF files, allowing them to be searched
+
+
+### ocrmypdf
+
+This is a very nice python tool, that uses tesseract to do OCR on each
+page and add the extracted text as a pdf text layer to the page.
+Additionally it creates PDF/A type pdfs, which are great for
+archiving. This fixes exactly the things stated above.
+
+#### Integration
+
+Docspell already has this built in for images. When converting images
+to a PDF (which is done early in processing), the process creates a
+text and a PDF file. Docspell then sets the text in this step and the
+text extraction step skips doing its work, if there is already text
+available.
+
+It would be possible to use the `--sidecar` option with ocrmypdf to
+create a text file of the extracted text with one run, too (exactly
+like it works for tesseract). But for "text" pdfs, ocrmypdf writes
+some info-message into this text file:
+
+```
+[OCR skipped on page 1][OCR skipped on page 2]
+```
+
+Docspell cannot reliably tell, wether this is extracted text or not.
+It would be reqiured to load the pdf and check its contents. This is a
+bit of bad luck, because everything would just work already. So it
+requires a (small) change in the text-extraction step. By default,
+text extraction happens on the source file. For PDFs, text extraction
+should now be run on the converted file, to avoid running OCR twice.
+
+The converted pdf file is either be a text-pdf in the first place,
+where ocrmypdf would only convert it to a PDF/A file; or it may be a
+converted file containing the OCR-ed text as a pdf layer. If ocrmypdf
+is disabled, the converted file and the source file are the same for
+PDFs.
+
+## Decision Outcome
+
+Add ocrmypdf as an optional conversion from PDF to PDF. Ocrmypdf is
+distributed under the GPL-3 license.
--- a/modules/microsite/docs/doc/install.md
+++ b/modules/microsite/docs/doc/install.md
@ -77,6 +77,10 @@ component.
  office documents into PDF files. It uses libreoffice/openoffice.
 - [wkhtmltopdf](https://wkhtmltopdf.org/) is used to convert HTML into
  PDF files.
+- [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF) can be optionally
+  used to convert PDF to PDF files. It adds an OCR layer to scanned
+  PDF files to make them searchable. It also creates PDF/A files from
+  the input pdf.

 The performance of `unoconv` can be improved by starting `unoconv -l`
 in a separate process. This runs a libreoffice/openoffice listener
@ -87,7 +91,7 @@ therefore avoids starting one each time `unoconv` is called.
 On Debian this should install all joex requirements:

 ``` bash
-sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf
+sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf ocrmypdf
 ```


--- a/modules/microsite/docs/features.md
+++ b/modules/microsite/docs/features.md
@ -13,7 +13,9 @@ permalink: features
 - OCR using [tesseract](https://github.com/tesseract-ocr/tesseract)
 - [Full-Text Search](doc/finding#full-text-search) based on [Apache
  SOLR](https://lucene.apache.org/solr)
- Conversion to PDF: all files are converted into a PDF file
+- Conversion to PDF: all files are converted into a PDF file. PDFs
+  with only images (as often returned from scanners) are converted
+  into searchable PDF/A pdfs.
 - Non-destructive: all your uploaded files are never modified and can
  always be downloaded untouched
 - Text is analysed to find and attach meta data automatically