Use ocrmypdf tool to create pdf/a during conversion

- Use another external tool to convert pdf to pdf which also adds the
  extracted text as another layer into the pdf

- Although not used, the external conversion routine will now check
  for an existing text file that is named as the pdf file with extension
  `.txt`. If present it is included in the conversion result and will be
  used as the extracted text.

- text extraction for pdf files happens now on the converted file,
  because it may already contain the text from the conversion step and
  thus avoids running OCR twice.

- All errors during conversion are not fatal; processing continues
  without a converted file.
This commit is contained in:
Eike Kettner
2020-07-18 12:48:41 +02:00
parent 99210365ce
commit 3d49ceaab5
16 changed files with 316 additions and 21 deletions

View File

@ -13,7 +13,9 @@ permalink: features
- OCR using [tesseract](https://github.com/tesseract-ocr/tesseract)
- [Full-Text Search](doc/finding#full-text-search) based on [Apache
SOLR](https://lucene.apache.org/solr)
- Conversion to PDF: all files are converted into a PDF file
- Conversion to PDF: all files are converted into a PDF file. PDFs
with only images (as often returned from scanners) are converted
into searchable PDF/A pdfs.
- Non-destructive: all your uploaded files are never modified and can
always be downloaded untouched
- Text is analysed to find and attach meta data automatically