mirror of
https://github.com/TheAnachronism/docspell.git
synced 2025-06-22 02:18:26 +00:00
Use ocrmypdf tool to create pdf/a during conversion
- Use another external tool to convert PDF to PDF; it also adds the extracted text to the PDF as an additional layer.
- Although not currently used, the external conversion routine now checks for an existing text file that has the same name as the PDF file but the extension `.txt`. If present, it is included in the conversion result and used as the extracted text.
- Text extraction for PDF files now happens on the converted file, because it may already contain the text from the conversion step; this avoids running OCR twice.
- Errors during conversion are not fatal; processing continues without a converted file.
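
The behaviour described in this message can be sketched roughly as follows. This is an illustrative Python sketch, not the code of this commit: the names `ConversionResult`, `ocrmypdf_command` and `convert_pdf` are hypothetical, the exact `ocrmypdf` arguments the project passes are an assumption, and "same name with extension `.txt`" is read here as replacing the `.pdf` suffix of the output file.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, NamedTuple, Optional


class ConversionResult(NamedTuple):
    pdf: Optional[Path]   # converted file, or None if the conversion failed
    text: Optional[str]   # text found next to the converted file, if any


def ocrmypdf_command(infile: Path, outfile: Path) -> List[str]:
    # Assumed invocation: --skip-text leaves pages that already carry a
    # text layer untouched, so only image-only pages are OCR'ed.
    return ["ocrmypdf", "--skip-text", str(infile), str(outfile)]


def convert_pdf(
    infile: Path,
    outfile: Path,
    build_command: Callable[[Path, Path], List[str]] = ocrmypdf_command,
    timeout: float = 300.0,
) -> ConversionResult:
    """Run the external tool; never raise if the tool fails."""
    try:
        subprocess.run(
            build_command(infile, outfile),
            check=True,            # non-zero exit raises CalledProcessError
            capture_output=True,
            timeout=timeout,
        )
    except (OSError, subprocess.SubprocessError):
        # Missing binary, non-zero exit or timeout: not fatal, the caller
        # simply continues without a converted file.
        return ConversionResult(None, None)

    if not outfile.is_file():
        return ConversionResult(None, None)

    # Look for a text file named like the output PDF (assumed: out.pdf -> out.txt).
    sidecar = outfile.with_suffix(".txt")
    text = sidecar.read_text(encoding="utf-8") if sidecar.is_file() else None
    return ConversionResult(outfile, text)
```

A caller would extract text from `result.pdf` when it is present and fall back to the original upload when it is `None`. If `result.text` is set, it can be used directly as the extracted text, so OCR does not have to run a second time.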
@@ -13,7 +13,9 @@ permalink: features
 - OCR using [tesseract](https://github.com/tesseract-ocr/tesseract)
 - [Full-Text Search](doc/finding#full-text-search) based on [Apache
   SOLR](https://lucene.apache.org/solr)
-- Conversion to PDF: all files are converted into a PDF file
+- Conversion to PDF: all files are converted into a PDF file. PDFs
+  with only images (as often returned from scanners) are converted
+  into searchable PDF/A pdfs.
 - Non-destructive: all your uploaded files are never modified and can
   always be downloaded untouched
 - Text is analysed to find and attach meta data automatically