Update documentation sites

Eike Kettner 2020-02-20 21:43:37 +01:00
parent 3f316ab4d0
commit 7fe8843893
3 changed files with 45 additions and 16 deletions


@@ -68,19 +68,28 @@ component.
 extraction](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
 (at the expense of a longer runtime).
 - [Tesseract](https://github.com/tesseract-ocr/tesseract) is the tool
-doing the OCR (converts images into text). It is a widely used open
-source OCR engine. Tesseract 3 and 4 should work with docspell; you
-can adopt the command line in the configuration file, if necessary.
+doing the OCR (converts images into text). It can also convert
+images into pdf files. It is a widely used open source OCR engine.
+Tesseract 3 and 4 should work with docspell; you can adopt the
+command line in the configuration file, if necessary.
+- [Unoconv](https://github.com/unoconv/unoconv) is used to convert
+office documents into PDF files. It uses libreoffice/openoffice.
+- [wkhtmltopdf](https://wkhtmltopdf.org/) is used to convert HTML into
+PDF files.
+
+The performance of `unoconv` can be improved by starting `unoconv -l`
+in a separate process. This runs a libreoffice/openoffice listener
+therefore avoids starting one each time `unoconv` is called.
 ### Example Debian
 On Debian this should install all joex requirements:
 ``` bash
-sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper
+sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf
 ```
 ## Database
 Both components must have access to a SQL database. Docspell has
@@ -203,12 +212,15 @@ work is done by the joex components.
 ### Joex
 Running the joex component on the Raspberry Pi is possible, but will
-result in long processing times. Tested on a RPi model 3 (4 cores, 1G
-RAM) processing a PDF (scanned with 300dpi) with two pages took
-9:52. You can speed it up considerably by uninstalling the `unpaper`
-command, because this step takes quite long. This, of course, reduces
-the quality of OCR. But without `unpaper` the same sample pdf was then
-processed in 1:24, a speedup of 8 minutes.
+result in long processing times for OCR. Files that don't require OCR
+are no problem.
+
+Tested on a RPi model 3 (4 cores, 1G RAM) processing a PDF (scanned
+with 300dpi) with two pages took 9:52. You can speed it up
+considerably by uninstalling the `unpaper` command, because this step
+takes quite long. This, of course, reduces the quality of OCR. But
+without `unpaper` the same sample pdf was then processed in 1:24, a
+speedup of 8 minutes.
 You should limit the joex pool size to 1 and, depending on your model
 and the amount of RAM, set a heap size of at least 500M


@@ -9,6 +9,7 @@ title: Features and Limitations
 - Multiple users per account
 - Handle multiple documents as one unit
 - OCR using [tesseract](https://github.com/tesseract-ocr/tesseract)
+- Conversion to PDF: all files are converted into a PDF file
 - Text is analysed to find and attach meta data automatically
 - Manage document processing (cancel jobs, set priorities)
 - Everything available via a documented [REST Api](api)
@@ -18,6 +19,14 @@ title: Features and Limitations
 - REST server and document processing are separate applications which
 can be scaled-out independently
 - Everything stored in a SQL database: PostgreSQL, MariaDB or H2
+- Files supported:
+  - PDF
+  - common MS Office (doc, docx, xls, xlsx)
+  - OpenDocument (odt, ods)
+  - RichText (rtf)
+  - Images (jpg, png, tiff)
+  - HTML
+  - text/* (treated as Markdown)
 - Tools:
   - Watch a folder: watch folders for changes and send files to docspell
   - Firefox plugin: right click on a link and send the file to docspell
@@ -31,7 +40,6 @@ These are current known limitations that may be of interest for
 considering docspell at the moment. Hopefully they will be resolved
 eventually….
-- Only PDF files possible for now.
 - No fulltext search implemented. This currently has very low
 priority, because I myself never needed it. Open an issue if you
 find it important.


@@ -18,11 +18,20 @@ You need to download the two files:
 ## Prerequisite
 Install Java (use your package manager or look
-[here](https://adoptopenjdk.net/)),
-[tesseract](https://github.com/tesseract-ocr/tesseract),
-[ghostscript](http://pages.cs.wisc.edu/~ghost/) and possibly
-[unpaper](https://github.com/Flameeyes/unpaper). The last is not
-really required, but improves OCR.
+[here](https://adoptopenjdk.net/)).
+
+OCR functionality requires the following tools:
+
+- [tesseract](https://github.com/tesseract-ocr/tesseract),
+- [ghostscript](http://pages.cs.wisc.edu/~ghost/) and possibly
+- [unpaper](https://github.com/Flameeyes/unpaper).
+
+The last is not really required, but improves OCR.
+
+PDF conversion requires the following tools:
+
+- [unoconv](https://github.com/unoconv/unoconv)
+- [wkhtmltopdf](https://wkhtmltopdf.org/)
 ## Running
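
One possible way to check the prerequisites named in this diff before starting joex is to look each tool up on the PATH. The sketch below is only an illustration: `missing_tools` is a hypothetical helper, and the binary names (`gs` for ghostscript, `tesseract`, `unpaper`, `unoconv`, `wkhtmltopdf`) are assumptions based on what the packages usually install, not something stated in the commit.

```shell
#!/bin/sh
# Print every command name from the arguments that is not found on PATH.
# (missing_tools is a hypothetical helper, not shipped with docspell.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed binary names; the ghostscript package usually installs `gs`.
ocr_tools="gs tesseract unpaper"
pdf_tools="unoconv wkhtmltopdf"

# Word splitting of the two lists is intended here.
missing=$(missing_tools $ocr_tools $pdf_tools)
if [ -n "$missing" ]; then
    echo "missing tools:" $missing
else
    echo "all tools found"
fi
```

Since the diff describes `unpaper` as not strictly required, a report that lists only `unpaper` could be treated as a warning rather than an error.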