diff --git a/modules/microsite/docs/doc/install.md b/modules/microsite/docs/doc/install.md
index af4ba903..6f085d53 100644
--- a/modules/microsite/docs/doc/install.md
+++ b/modules/microsite/docs/doc/install.md
@@ -68,19 +68,28 @@ component.
   extraction](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
   (at the expense of a longer runtime).
 - [Tesseract](https://github.com/tesseract-ocr/tesseract) is the tool
-  doing the OCR (converts images into text). It is a widely used open
-  source OCR engine. Tesseract 3 and 4 should work with docspell; you
-  can adopt the command line in the configuration file, if necessary.
+  doing the OCR (converts images into text). It can also convert
+  images into pdf files. It is a widely used open source OCR engine.
+  Tesseract 3 and 4 should work with docspell; you can adopt the
+  command line in the configuration file, if necessary.
+- [Unoconv](https://github.com/unoconv/unoconv) is used to convert
+  office documents into PDF files. It uses libreoffice/openoffice.
+- [wkhtmltopdf](https://wkhtmltopdf.org/) is used to convert HTML into
+  PDF files.
 
+The performance of `unoconv` can be improved by starting `unoconv -l`
+in a separate process. This runs a libreoffice/openoffice listener
+therefore avoids starting one each time `unoconv` is called.
 
 ### Example Debian
 
 On Debian this should install all joex requirements:
 
 ``` bash
-sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper
+sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf
 ```
 
+
 ## Database
 
 Both components must have access to a SQL database. Docspell has
@@ -203,12 +212,15 @@ work is done by the joex components.
 ### Joex
 
 Running the joex component on the Raspberry Pi is possible, but will
-result in long processing times. Tested on a RPi model 3 (4 cores, 1G
-RAM) processing a PDF (scanned with 300dpi) with two pages took
-9:52. You can speed it up considerably by uninstalling the `unpaper`
-command, because this step takes quite long. This, of course, reduces
-the quality of OCR. But without `unpaper` the same sample pdf was then
-processed in 1:24, a speedup of 8 minutes.
+result in long processing times for OCR. Files that don't require OCR
+are no problem.
+
+Tested on a RPi model 3 (4 cores, 1G RAM) processing a PDF (scanned
+with 300dpi) with two pages took 9:52. You can speed it up
+considerably by uninstalling the `unpaper` command, because this step
+takes quite long. This, of course, reduces the quality of OCR. But
+without `unpaper` the same sample pdf was then processed in 1:24, a
+speedup of 8 minutes.
 
 You should limit the joex pool size to 1 and, depending on your model
 and the amount of RAM, set a heap size of at least 500M
diff --git a/modules/microsite/docs/features.md b/modules/microsite/docs/features.md
index 5e9ec6ba..c5d60643 100644
--- a/modules/microsite/docs/features.md
+++ b/modules/microsite/docs/features.md
@@ -9,6 +9,7 @@ title: Features and Limitations
 - Multiple users per account
 - Handle multiple documents as one unit
 - OCR using [tesseract](https://github.com/tesseract-ocr/tesseract)
+- Conversion to PDF: all files are converted into a PDF file
 - Text is analysed to find and attach meta data automatically
 - Manage document processing (cancel jobs, set priorities)
 - Everything available via a documented [REST Api](api)
@@ -18,6 +19,14 @@ title: Features and Limitations
 - REST server and document processing are separate applications which
   can be scaled-out independently
 - Everything stored in a SQL database: PostgreSQL, MariaDB or H2
+- Files supported:
+  - PDF
+  - common MS Office (doc, docx, xls, xlsx)
+  - OpenDocument (odt, ods)
+  - RichText (rtf)
+  - Images (jpg, png, tiff)
+  - HTML
+  - text/* (treated as Markdown)
 - Tools:
   - Watch a folder: watch folders for changes and send files to docspell
   - Firefox plugin: right click on a link and send the file to docspell
@@ -31,7 +40,6 @@ These are current known limitations that may be of interest for
 considering docspell at the moment. Hopefully they will be resolved
 eventually….
 
-- Only PDF files possible for now.
 - No fulltext search implemented. This currently has very low
   priority, because I myself never needed it. Open an issue if you
   find it important.
diff --git a/modules/microsite/docs/getit.md b/modules/microsite/docs/getit.md
index 0d533269..19c2721f 100644
--- a/modules/microsite/docs/getit.md
+++ b/modules/microsite/docs/getit.md
@@ -18,11 +18,20 @@ You need to download the two files:
 ## Prerequisite
 
 Install Java (use your package manager or look
-[here](https://adoptopenjdk.net/)),
-[tesseract](https://github.com/tesseract-ocr/tesseract),
-[ghostscript](http://pages.cs.wisc.edu/~ghost/) and possibly
-[unpaper](https://github.com/Flameeyes/unpaper). The last is not
-really required, but improves OCR.
+[here](https://adoptopenjdk.net/)).
+
+OCR functionality requires the following tools:
+
+- [tesseract](https://github.com/tesseract-ocr/tesseract),
+- [ghostscript](http://pages.cs.wisc.edu/~ghost/) and possibly
+- [unpaper](https://github.com/Flameeyes/unpaper).
+
+The last is not really required, but improves OCR.
+
+PDF conversion requires the following tools:
+
+- [unoconv](https://github.com/unoconv/unoconv)
+- [wkhtmltopdf](https://wkhtmltopdf.org/)
 
 ## Running
 