mirror of https://github.com/TheAnachronism/docspell.git
synced 2025-04-12 05:59:33 +00:00

Update documentation sites

parent 3f316ab4d0
commit 7fe8843893
@@ -68,19 +68,28 @@ component.
 extraction](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
 (at the expense of a longer runtime).
 - [Tesseract](https://github.com/tesseract-ocr/tesseract) is the tool
-doing the OCR (converts images into text). It is a widely used open
-source OCR engine. Tesseract 3 and 4 should work with docspell; you
-can adopt the command line in the configuration file, if necessary.
+doing the OCR (converts images into text). It can also convert
+images into pdf files. It is a widely used open source OCR engine.
+Tesseract 3 and 4 should work with docspell; you can adopt the
+command line in the configuration file, if necessary.
+- [Unoconv](https://github.com/unoconv/unoconv) is used to convert
+office documents into PDF files. It uses libreoffice/openoffice.
+- [wkhtmltopdf](https://wkhtmltopdf.org/) is used to convert HTML into
+PDF files.
+
+The performance of `unoconv` can be improved by starting `unoconv -l`
+in a separate process. This runs a libreoffice/openoffice listener
+therefore avoids starting one each time `unoconv` is called.

 ### Example Debian

 On Debian this should install all joex requirements:

 ``` bash
-sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper
+sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf
 ```


 ## Database

 Both components must have access to a SQL database. Docspell has
@@ -203,12 +212,15 @@ work is done by the joex components.
 ### Joex

 Running the joex component on the Raspberry Pi is possible, but will
-result in long processing times. Tested on a RPi model 3 (4 cores, 1G
-RAM) processing a PDF (scanned with 300dpi) with two pages took
-9:52. You can speed it up considerably by uninstalling the `unpaper`
-command, because this step takes quite long. This, of course, reduces
-the quality of OCR. But without `unpaper` the same sample pdf was then
-processed in 1:24, a speedup of 8 minutes.
+result in long processing times for OCR. Files that don't require OCR
+are no problem.
+
+Tested on a RPi model 3 (4 cores, 1G RAM) processing a PDF (scanned
+with 300dpi) with two pages took 9:52. You can speed it up
+considerably by uninstalling the `unpaper` command, because this step
+takes quite long. This, of course, reduces the quality of OCR. But
+without `unpaper` the same sample pdf was then processed in 1:24, a
+speedup of 8 minutes.

 You should limit the joex pool size to 1 and, depending on your model
 and the amount of RAM, set a heap size of at least 500M
@@ -9,6 +9,7 @@ title: Features and Limitations
 - Multiple users per account
 - Handle multiple documents as one unit
 - OCR using [tesseract](https://github.com/tesseract-ocr/tesseract)
+- Conversion to PDF: all files are converted into a PDF file
 - Text is analysed to find and attach meta data automatically
 - Manage document processing (cancel jobs, set priorities)
 - Everything available via a documented [REST Api](api)
@@ -18,6 +19,14 @@ title: Features and Limitations
 - REST server and document processing are separate applications which
 can be scaled-out independently
 - Everything stored in a SQL database: PostgreSQL, MariaDB or H2
+- Files supported:
+- PDF
+- common MS Office (doc, docx, xls, xlsx)
+- OpenDocument (odt, ods)
+- RichText (rtf)
+- Images (jpg, png, tiff)
+- HTML
+- text/* (treated as Markdown)
 - Tools:
 - Watch a folder: watch folders for changes and send files to docspell
 - Firefox plugin: right click on a link and send the file to docspell
@@ -31,7 +40,6 @@ These are current known limitations that may be of interest for
 considering docspell at the moment. Hopefully they will be resolved
 eventually….

-- Only PDF files possible for now.
 - No fulltext search implemented. This currently has very low
 priority, because I myself never needed it. Open an issue if you
 find it important.
@@ -18,11 +18,20 @@ You need to download the two files:
 ## Prerequisite

 Install Java (use your package manager or look
-[here](https://adoptopenjdk.net/)),
-[tesseract](https://github.com/tesseract-ocr/tesseract),
-[ghostscript](http://pages.cs.wisc.edu/~ghost/) and possibly
-[unpaper](https://github.com/Flameeyes/unpaper). The last is not
-really required, but improves OCR.
+[here](https://adoptopenjdk.net/)).
+
+OCR functionality requires the following tools:
+
+- [tesseract](https://github.com/tesseract-ocr/tesseract),
+- [ghostscript](http://pages.cs.wisc.edu/~ghost/) and possibly
+- [unpaper](https://github.com/Flameeyes/unpaper).
+
+The last is not really required, but improves OCR.
+
+PDF conversion requires the following tools:
+
+- [unoconv](https://github.com/unoconv/unoconv)
+- [wkhtmltopdf](https://wkhtmltopdf.org/)


 ## Running
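The tools named in the prerequisite hunk above can be checked before starting docspell. This is a minimal sketch, not part of the commit: the function name `missing_tools` is made up for illustration, and the command names are assumptions about a typical installation (`gs` is the usual ghostscript binary, the others are assumed to match their project names).

``` shell
#!/bin/sh
# Print the name of every given command that is not found on PATH.
# "missing_tools" is a hypothetical helper, not something docspell provides.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed command names for Java, the OCR tools and the PDF conversion tools.
missing=$(missing_tools java tesseract gs unpaper unoconv wkhtmltopdf)

if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing
else
  echo "All listed tools were found."
fi
```

Since the text above says `unpaper` is not strictly required, a report that lists only `unpaper` as missing would not block a basic setup.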