2019-07-22 22:53:30 +00:00
|
|
|
|
---
|
|
|
|
|
layout: docs
|
|
|
|
|
title: Installation
|
2020-03-28 15:35:28 +00:00
|
|
|
|
permalink: doc/install
|
2019-07-22 22:53:30 +00:00
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
# {{ page.title }}
|
|
|
|
|
|
|
|
|
|
This page contains detailed installation instructions. For a quick
|
2020-03-28 15:35:28 +00:00
|
|
|
|
start, refer to [this page](../getit).
|
2019-07-22 22:53:30 +00:00
|
|
|
|
|
|
|
|
|
Docspell has been developed and tested on a GNU/Linux system. It may
|
|
|
|
|
run on Windows and MacOS machines, too (ghostscript and tesseract are
|
|
|
|
|
available on these systems). But I've never tried.
|
|
|
|
|
|
|
|
|
|
Docspell consists of two components that are started in separate
|
|
|
|
|
processes:
|
|
|
|
|
|
|
|
|
|
1. *REST Server* This is the main application, providing the REST Api
|
|
|
|
|
and the web application.
|
|
|
|
|
2. *Joex* (job executor) This is the component that does the document
|
|
|
|
|
processing.
|
|
|
|
|
|
|
|
|
|
They can run on multiple machines. All REST server and Joex instances
|
|
|
|
|
should be on the same network. It is not strictly required that they
|
|
|
|
|
can reach each other, but the components can then notify themselves
|
|
|
|
|
about new or done work.
|
|
|
|
|
|
|
|
|
|
While this is possible, the simple setup is to start both components
|
|
|
|
|
once on the same machine.
|
|
|
|
|
|
|
|
|
|
The [download page](https://github.com/eikek/docspell/releases)
|
2020-03-28 15:35:28 +00:00
|
|
|
|
provides pre-compiled packages and the [development page](../dev)
|
2019-07-22 22:53:30 +00:00
|
|
|
|
contains build instructions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Prerequisites
|
|
|
|
|
|
|
|
|
|
The two components have one prerequisite in common: they both require
|
|
|
|
|
Java to run. While this is the only requirement for the *REST server*,
|
|
|
|
|
the *Joex* components requires some more external programs.
|
|
|
|
|
|
|
|
|
|
### Java
|
|
|
|
|
|
|
|
|
|
Very often, Java is already installed. You can check this by opening a
|
|
|
|
|
terminal and typing `java -version`. Otherwise install Java using your
|
|
|
|
|
package manager or see [this site](https://adoptopenjdk.net/) for
|
|
|
|
|
other options.
|
|
|
|
|
|
|
|
|
|
It is enough to install the JRE. The JDK is required, if you want to
|
|
|
|
|
build docspell from source.
|
|
|
|
|
|
|
|
|
|
Docspell has been tested with Java version 1.8 (or sometimes referred
|
|
|
|
|
to as JRE 8 and JDK 8, respectively). The pre-build packages are also
|
|
|
|
|
build using JDK 8. But a later version of Java should work as well.
|
|
|
|
|
|
|
|
|
|
The next tools are only required on machines running the *Joex*
|
|
|
|
|
component.
|
|
|
|
|
|
|
|
|
|
### External Tools for Joex
|
|
|
|
|
|
|
|
|
|
- [Ghostscript](http://pages.cs.wisc.edu/~ghost/) (the `gs` command)
|
|
|
|
|
is used to extract/convert PDF files into images that are then fed
|
|
|
|
|
to ocr. It is available on most GNU/Linux distributions.
|
|
|
|
|
- [Unpaper](https://github.com/Flameeyes/unpaper) is a program that
|
|
|
|
|
pre-processes images to yield better results when doing ocr. If this
|
|
|
|
|
is not installed, docspell tries without it. However, it is
|
|
|
|
|
recommended to install, because it [improves text
|
|
|
|
|
extraction](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
|
|
|
|
|
(at the expense of a longer runtime).
|
|
|
|
|
- [Tesseract](https://github.com/tesseract-ocr/tesseract) is the tool
|
2020-02-20 20:43:37 +00:00
|
|
|
|
doing the OCR (converts images into text). It can also convert
|
|
|
|
|
images into pdf files. It is a widely used open source OCR engine.
|
|
|
|
|
Tesseract 3 and 4 should work with docspell; you can adopt the
|
|
|
|
|
command line in the configuration file, if necessary.
|
|
|
|
|
- [Unoconv](https://github.com/unoconv/unoconv) is used to convert
|
|
|
|
|
office documents into PDF files. It uses libreoffice/openoffice.
|
|
|
|
|
- [wkhtmltopdf](https://wkhtmltopdf.org/) is used to convert HTML into
|
|
|
|
|
PDF files.
|
|
|
|
|
|
|
|
|
|
The performance of `unoconv` can be improved by starting `unoconv -l`
|
|
|
|
|
in a separate process. This runs a libreoffice/openoffice listener
|
|
|
|
|
therefore avoids starting one each time `unoconv` is called.
|
2019-07-22 22:53:30 +00:00
|
|
|
|
|
|
|
|
|
### Example Debian
|
|
|
|
|
|
|
|
|
|
On Debian this should install all joex requirements:
|
|
|
|
|
|
|
|
|
|
``` bash
|
2020-02-20 20:43:37 +00:00
|
|
|
|
sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf
|
2019-07-22 22:53:30 +00:00
|
|
|
|
```
|
|
|
|
|
|
2020-02-20 20:43:37 +00:00
|
|
|
|
|
2019-07-22 22:53:30 +00:00
|
|
|
|
## Database
|
|
|
|
|
|
|
|
|
|
Both components must have access to a SQL database. Docspell has
|
|
|
|
|
support these databases:
|
|
|
|
|
|
|
|
|
|
- PostreSQL
|
|
|
|
|
- MariaDB
|
|
|
|
|
- H2
|
|
|
|
|
|
|
|
|
|
The H2 database is an interesting option for personal and mid-size
|
|
|
|
|
setups, as it requires no additional work. It is integrated into
|
|
|
|
|
docspell and works really well. It is also configured as the default
|
|
|
|
|
database.
|
|
|
|
|
|
|
|
|
|
For large installations, PostgreSQL or MariaDB is recommended. Create
|
|
|
|
|
a database and a user with enough privileges (read, write, create
|
|
|
|
|
table) to that database.
|
|
|
|
|
|
|
|
|
|
When using H2, make sure that all components access the same database
|
|
|
|
|
– the jdbc url must point to the same file. Then, it is important to
|
|
|
|
|
add the options
|
|
|
|
|
`;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE` at the end
|
2020-03-28 15:35:28 +00:00
|
|
|
|
of the url. See the [config page](configure#jdbc) for an example.
|
2019-07-22 22:53:30 +00:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Installing from ZIP files
|
|
|
|
|
|
|
|
|
|
After extracting the zip files, you'll find a start script in the
|
|
|
|
|
`bin/` folder.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Installing from DEB packages
|
|
|
|
|
|
|
|
|
|
The DEB packages can be installed on Debian, or Debian based Distros:
|
|
|
|
|
|
|
|
|
|
``` bash
|
|
|
|
|
$ sudo dpkg -i docspell*.deb
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
Then the start scripts are in your `$PATH`. Run `docspell-restserver`
|
|
|
|
|
or `docspell-joex` from a terminal window.
|
|
|
|
|
|
|
|
|
|
The packages come with a systemd unit file that will be installed to
|
|
|
|
|
autostart the services.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Running
|
|
|
|
|
|
|
|
|
|
Run the start script (in the corresponding `bin/` directory when using
|
|
|
|
|
the zip files):
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
$ ./docspell-restserver*/bin/docspell-restserver
|
|
|
|
|
$ ./docspell-joex*/bin/docspell-joex
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
This will startup both components using the default configuration. The
|
|
|
|
|
configuration should be adopted to your needs. For example, the
|
|
|
|
|
database connection is configured to use a H2 database in the `/tmp`
|
2020-03-28 15:35:28 +00:00
|
|
|
|
directory. Please refer to the [configuration page](configure) for how
|
|
|
|
|
to create a custom config file. Once you have your config file, simply
|
|
|
|
|
pass it as argument to the command:
|
2019-07-22 22:53:30 +00:00
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
$ ./docspell-restserver*/bin/docspell-restserver /path/to/server-config.conf
|
|
|
|
|
$ ./docspell-joex*/bin/docspell-joex /path/to/joex-config.conf
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
After starting the rest server, you can reach the web application at
|
2019-12-30 20:59:06 +00:00
|
|
|
|
path `/app`, so using default values it would be
|
|
|
|
|
`http://localhost:7880/app`.
|
2019-07-22 22:53:30 +00:00
|
|
|
|
|
|
|
|
|
You should be able to create a new account and sign in. Check the
|
2020-03-28 15:35:28 +00:00
|
|
|
|
[configuration page](configure) to further customize docspell.
|
2019-07-22 22:53:30 +00:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### Options
|
|
|
|
|
|
|
|
|
|
The start scripts support some options to configure the JVM. One often
|
|
|
|
|
used setting is the maximum heap size of the JVM. By default, java
|
|
|
|
|
determines it based on properties of the current machine. You can
|
|
|
|
|
specify it by given java startup options to the command:
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -- /path/to/server-config.conf
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
This would limit the maximum heap to 1GB. The double slash separates
|
|
|
|
|
internal options and the arguments to the program. Another frequently
|
|
|
|
|
used option is to change the default temp directory. Usually it is
|
|
|
|
|
`/tmp`, but it may be desired to have a dedicated temp directory,
|
|
|
|
|
which can be configured:
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -Djava.io.tmpdir=/path/to/othertemp -- /path/to/server-config.conf
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
The command:
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
$ ./docspell-restserver*/bin/docspell-restserver -h
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
gives an overview of supported options.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Raspberry Pi, and similiar
|
|
|
|
|
|
|
|
|
|
Both component can run next to each other on a raspberry pi or
|
|
|
|
|
similiar device.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### REST Server
|
|
|
|
|
|
|
|
|
|
The REST server component runs very well on the Raspberry Pi and
|
|
|
|
|
similiar devices. It doesn't require much resources, because the heavy
|
|
|
|
|
work is done by the joex components.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### Joex
|
|
|
|
|
|
|
|
|
|
Running the joex component on the Raspberry Pi is possible, but will
|
2020-02-20 20:43:37 +00:00
|
|
|
|
result in long processing times for OCR. Files that don't require OCR
|
|
|
|
|
are no problem.
|
|
|
|
|
|
|
|
|
|
Tested on a RPi model 3 (4 cores, 1G RAM) processing a PDF (scanned
|
|
|
|
|
with 300dpi) with two pages took 9:52. You can speed it up
|
|
|
|
|
considerably by uninstalling the `unpaper` command, because this step
|
|
|
|
|
takes quite long. This, of course, reduces the quality of OCR. But
|
|
|
|
|
without `unpaper` the same sample pdf was then processed in 1:24, a
|
|
|
|
|
speedup of 8 minutes.
|
2019-07-22 22:53:30 +00:00
|
|
|
|
|
|
|
|
|
You should limit the joex pool size to 1 and, depending on your model
|
|
|
|
|
and the amount of RAM, set a heap size of at least 500M
|
|
|
|
|
(`-J-Xmx500M`).
|
|
|
|
|
|
|
|
|
|
For personal setups, when you don't need the processing results asap,
|
|
|
|
|
this can work well enough.
|