mirror of
https://github.com/TheAnachronism/docspell.git
synced 2024-11-13 02:31:10 +00:00
219 lines
7.1 KiB
Markdown
219 lines
7.1 KiB
Markdown
|
---
|
|||
|
layout: docs
|
|||
|
title: Installation
|
|||
|
---
|
|||
|
|
|||
|
# {{ page.title }}
|
|||
|
|
|||
|
This page contains detailed installation instructions. For a quick
|
|||
|
start, refer to [this page](../getit.html).
|
|||
|
|
|||
|
Docspell has been developed and tested on a GNU/Linux system. It may
|
|||
|
run on Windows and MacOS machines, too (ghostscript and tesseract are
|
|||
|
available on these systems). But I've never tried.
|
|||
|
|
|||
|
Docspell consists of two components that are started in separate
|
|||
|
processes:
|
|||
|
|
|||
|
1. *REST Server* This is the main application, providing the REST Api
|
|||
|
and the web application.
|
|||
|
2. *Joex* (job executor) This is the component that does the document
|
|||
|
processing.
|
|||
|
|
|||
|
They can run on multiple machines. All REST server and Joex instances
|
|||
|
should be on the same network. It is not strictly required that they
|
|||
|
can reach each other, but the components can then notify themselves
|
|||
|
about new or done work.
|
|||
|
|
|||
|
While this is possible, the simple setup is to start both components
|
|||
|
once on the same machine.
|
|||
|
|
|||
|
The [download page](https://github.com/eikek/docspell/releases)
|
|||
|
provides pre-compiled packages and the [development page](dev.html)
|
|||
|
contains build instructions.
|
|||
|
|
|||
|
|
|||
|
## Prerequisites
|
|||
|
|
|||
|
The two components have one prerequisite in common: they both require
|
|||
|
Java to run. While this is the only requirement for the *REST server*,
|
|||
|
the *Joex* components requires some more external programs.
|
|||
|
|
|||
|
### Java
|
|||
|
|
|||
|
Very often, Java is already installed. You can check this by opening a
|
|||
|
terminal and typing `java -version`. Otherwise install Java using your
|
|||
|
package manager or see [this site](https://adoptopenjdk.net/) for
|
|||
|
other options.
|
|||
|
|
|||
|
It is enough to install the JRE. The JDK is required, if you want to
|
|||
|
build docspell from source.
|
|||
|
|
|||
|
Docspell has been tested with Java version 1.8 (or sometimes referred
|
|||
|
to as JRE 8 and JDK 8, respectively). The pre-build packages are also
|
|||
|
build using JDK 8. But a later version of Java should work as well.
|
|||
|
|
|||
|
The next tools are only required on machines running the *Joex*
|
|||
|
component.
|
|||
|
|
|||
|
### External Tools for Joex
|
|||
|
|
|||
|
- [Ghostscript](http://pages.cs.wisc.edu/~ghost/) (the `gs` command)
|
|||
|
is used to extract/convert PDF files into images that are then fed
|
|||
|
to ocr. It is available on most GNU/Linux distributions.
|
|||
|
- [Unpaper](https://github.com/Flameeyes/unpaper) is a program that
|
|||
|
pre-processes images to yield better results when doing ocr. If this
|
|||
|
is not installed, docspell tries without it. However, it is
|
|||
|
recommended to install, because it [improves text
|
|||
|
extraction](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
|
|||
|
(at the expense of a longer runtime).
|
|||
|
- [Tesseract](https://github.com/tesseract-ocr/tesseract) is the tool
|
|||
|
doing the OCR (converts images into text). It is a widely used open
|
|||
|
source OCR engine. Tesseract 3 and 4 should work with docspell; you
|
|||
|
can adopt the command line in the configuration file, if necessary.
|
|||
|
|
|||
|
|
|||
|
### Example Debian
|
|||
|
|
|||
|
On Debian this should install all joex requirements:
|
|||
|
|
|||
|
``` bash
|
|||
|
sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper
|
|||
|
```
|
|||
|
|
|||
|
## Database
|
|||
|
|
|||
|
Both components must have access to a SQL database. Docspell has
|
|||
|
support these databases:
|
|||
|
|
|||
|
- PostreSQL
|
|||
|
- MariaDB
|
|||
|
- H2
|
|||
|
|
|||
|
The H2 database is an interesting option for personal and mid-size
|
|||
|
setups, as it requires no additional work. It is integrated into
|
|||
|
docspell and works really well. It is also configured as the default
|
|||
|
database.
|
|||
|
|
|||
|
For large installations, PostgreSQL or MariaDB is recommended. Create
|
|||
|
a database and a user with enough privileges (read, write, create
|
|||
|
table) to that database.
|
|||
|
|
|||
|
When using H2, make sure that all components access the same database
|
|||
|
– the jdbc url must point to the same file. Then, it is important to
|
|||
|
add the options
|
|||
|
`;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE` at the end
|
|||
|
of the url. See the [default config](configure.html) for an example.
|
|||
|
|
|||
|
|
|||
|
## Installing from ZIP files
|
|||
|
|
|||
|
After extracting the zip files, you'll find a start script in the
|
|||
|
`bin/` folder.
|
|||
|
|
|||
|
|
|||
|
## Installing from DEB packages
|
|||
|
|
|||
|
The DEB packages can be installed on Debian, or Debian based Distros:
|
|||
|
|
|||
|
``` bash
|
|||
|
$ sudo dpkg -i docspell*.deb
|
|||
|
```
|
|||
|
|
|||
|
Then the start scripts are in your `$PATH`. Run `docspell-restserver`
|
|||
|
or `docspell-joex` from a terminal window.
|
|||
|
|
|||
|
The packages come with a systemd unit file that will be installed to
|
|||
|
autostart the services.
|
|||
|
|
|||
|
|
|||
|
## Running
|
|||
|
|
|||
|
Run the start script (in the corresponding `bin/` directory when using
|
|||
|
the zip files):
|
|||
|
|
|||
|
```
|
|||
|
$ ./docspell-restserver*/bin/docspell-restserver
|
|||
|
$ ./docspell-joex*/bin/docspell-joex
|
|||
|
```
|
|||
|
|
|||
|
This will startup both components using the default configuration. The
|
|||
|
configuration should be adopted to your needs. For example, the
|
|||
|
database connection is configured to use a H2 database in the `/tmp`
|
|||
|
directory. Please refer to the [configuration page](configure.html)
|
|||
|
for how to create a custom config file. Once you have your config
|
|||
|
file, simply pass it as argument to the command:
|
|||
|
|
|||
|
```
|
|||
|
$ ./docspell-restserver*/bin/docspell-restserver /path/to/server-config.conf
|
|||
|
$ ./docspell-joex*/bin/docspell-joex /path/to/joex-config.conf
|
|||
|
```
|
|||
|
|
|||
|
After starting the rest server, you can reach the web application at
|
|||
|
path `/app/index.html`, so using default values it would be
|
|||
|
`http://localhost:7880/app/index.html`.
|
|||
|
|
|||
|
You should be able to create a new account and sign in. Check the
|
|||
|
[configuration page](configure.html) to further customize docspell.
|
|||
|
|
|||
|
|
|||
|
### Options
|
|||
|
|
|||
|
The start scripts support some options to configure the JVM. One often
|
|||
|
used setting is the maximum heap size of the JVM. By default, java
|
|||
|
determines it based on properties of the current machine. You can
|
|||
|
specify it by given java startup options to the command:
|
|||
|
|
|||
|
```
|
|||
|
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -- /path/to/server-config.conf
|
|||
|
```
|
|||
|
|
|||
|
This would limit the maximum heap to 1GB. The double slash separates
|
|||
|
internal options and the arguments to the program. Another frequently
|
|||
|
used option is to change the default temp directory. Usually it is
|
|||
|
`/tmp`, but it may be desired to have a dedicated temp directory,
|
|||
|
which can be configured:
|
|||
|
|
|||
|
```
|
|||
|
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -Djava.io.tmpdir=/path/to/othertemp -- /path/to/server-config.conf
|
|||
|
```
|
|||
|
|
|||
|
The command:
|
|||
|
|
|||
|
```
|
|||
|
$ ./docspell-restserver*/bin/docspell-restserver -h
|
|||
|
```
|
|||
|
|
|||
|
gives an overview of supported options.
|
|||
|
|
|||
|
|
|||
|
## Raspberry Pi, and similiar
|
|||
|
|
|||
|
Both component can run next to each other on a raspberry pi or
|
|||
|
similiar device.
|
|||
|
|
|||
|
|
|||
|
### REST Server
|
|||
|
|
|||
|
The REST server component runs very well on the Raspberry Pi and
|
|||
|
similiar devices. It doesn't require much resources, because the heavy
|
|||
|
work is done by the joex components.
|
|||
|
|
|||
|
|
|||
|
### Joex
|
|||
|
|
|||
|
Running the joex component on the Raspberry Pi is possible, but will
|
|||
|
result in long processing times. Tested on a RPi model 3 (4 cores, 1G
|
|||
|
RAM) processing a PDF (scanned with 300dpi) with two pages took
|
|||
|
9:52. You can speed it up considerably by uninstalling the `unpaper`
|
|||
|
command, because this step takes quite long. This, of course, reduces
|
|||
|
the quality of OCR. But without `unpaper` the same sample pdf was then
|
|||
|
processed in 1:24, a speedup of 8 minutes.
|
|||
|
|
|||
|
You should limit the joex pool size to 1 and, depending on your model
|
|||
|
and the amount of RAM, set a heap size of at least 500M
|
|||
|
(`-J-Xmx500M`).
|
|||
|
|
|||
|
For personal setups, when you don't need the processing results asap,
|
|||
|
this can work well enough.
|