docspell/modules/microsite/docs/doc/install.md
Eike Kettner 3d49ceaab5 Use ocrmypdf tool to create pdf/a during conversion
- Use another external tool to convert pdf to pdf which also adds the
  extracted text as another layer into the pdf

- Although not used, the external conversion routine will now check
  for an existing text file that is named as the pdf file with extension
  `.txt`. If present it is included in the conversion result and will be
  used as the extracted text.

- text extraction for pdf files happens now on the converted file,
  because it may already contain the text from the conversion step and
  thus avoids running OCR twice.

- All errors during conversion are not fatal; processing continues
  without a converted file.
2020-07-18 17:19:29 +02:00


layout title permalink
docs Installation doc/install

{{ page.title }}

This page contains detailed installation instructions. For a quick start, refer to this page.

Docspell has been developed and tested on a GNU/Linux system. It may also run on Windows and macOS machines (ghostscript and tesseract are available on these systems), but I've never tried it.

Docspell consists of two components that are started in separate processes:

  1. REST Server: This is the main application, providing the REST API and the web application.
  2. Joex (job executor): This is the component that does the document processing.

They can run on multiple machines. All REST server and Joex instances should be on the same network. It is not strictly required that they can reach each other, but if they can, the components notify each other about new or finished work.

While this is possible, the simple setup is to start both components once on the same machine.

The download page provides pre-compiled packages and the development page contains build instructions.

Prerequisites

The two components have one prerequisite in common: they both require Java to run. While this is the only requirement for the REST server, the Joex component requires some more external programs.

Java

Very often, Java is already installed. You can check this by opening a terminal and typing java -version. Otherwise install Java using your package manager or see this site for other options.

It is enough to install the JRE. The JDK is required only if you want to build docspell from source.

Docspell has been tested with Java version 1.8 (sometimes referred to as JRE 8 and JDK 8, respectively). The pre-built packages are also built using JDK 8, but a later version of Java should work as well.
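Because Java 8 reports itself as 1.8 while later releases report 11, 17 and so on, a small helper can normalize the version string before comparing it. This is a minimal sketch; the function name java_major is made up for this example, and the sed expression assumes the usual first line of java -version output (a quoted version string).

```shell
# Map a Java version string to its major version number.
# Legacy scheme: "1.8.0_252" -> 8; newer scheme: "11.0.7" -> 11.
java_major() {
    case "$1" in
        1.*) v=${1#1.}; echo "${v%%[!0-9]*}" ;;  # strip "1." and keep leading digits
        *)   echo "${1%%[!0-9]*}" ;;              # keep leading digits only
    esac
}

# Usage sketch (only runs when java is installed).
# java -version prints to stderr, hence the 2>&1.
if command -v java >/dev/null 2>&1; then
    ver=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    echo "Java major version: $(java_major "$ver")"
fi
```

A result of 8 or higher matches what this page describes as tested or expected to work.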

The next tools are only required on machines running the Joex component.

External Programs for Joex

  • Ghostscript (the gs command) is used to extract/convert PDF files into images that are then fed to OCR. It is available on most GNU/Linux distributions.
  • Unpaper is a program that pre-processes images to yield better results when doing OCR. If this is not installed, docspell tries without it. However, installing it is recommended, because it improves text extraction (at the expense of a longer runtime).
  • Tesseract is the tool doing the OCR (converts images into text). It can also convert images into PDF files. It is a widely used open source OCR engine. Tesseract 3 and 4 should work with docspell; you can adapt the command line in the configuration file, if necessary.
  • Unoconv is used to convert office documents into PDF files. It uses libreoffice/openoffice.
  • wkhtmltopdf is used to convert HTML into PDF files.
  • OCRmyPDF can be optionally used to convert PDF to PDF files. It adds an OCR layer to scanned PDF files to make them searchable. It also creates PDF/A files from the input pdf.
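To see which of the programs listed above are present on a Joex machine, something like the following can be used. It is only a sketch: the helper name missing_tools is invented here, and the command names are assumed to be the usual binary names of these packages (gs, unpaper, tesseract, unoconv, wkhtmltopdf, ocrmypdf).

```shell
# Print every command from the argument list that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools from the list above; unpaper and ocrmypdf are optional.
missing=$(missing_tools gs unpaper tesseract unoconv wkhtmltopdf ocrmypdf)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are printed on one line.
    echo "not installed:" $missing
else
    echo "all joex tools found"
fi
```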

The performance of unoconv can be improved by starting unoconv -l in a separate process. This runs a libreoffice/openoffice listener and therefore avoids starting one each time unoconv is called.

Example Debian

On Debian this should install all joex requirements:

sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf ocrmypdf

Database

Both components must have access to a SQL database. Docspell supports these databases:

  • PostgreSQL
  • MariaDB
  • H2

The H2 database is an interesting option for personal and mid-size setups, as it requires no additional work. It is integrated into docspell and works really well. It is also configured as the default database.

For large installations, PostgreSQL or MariaDB is recommended. Create a database and a user with enough privileges (read, write, create table) to that database.
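For PostgreSQL, creating the user and database could look roughly like this. The names docspell and the password change-me are placeholders chosen for this sketch, as is the helper name pg_setup_sql; making the user the owner of the database is one simple way to give it the privileges mentioned above.

```shell
# Print SQL that creates a login role and a database owned by it.
# All three arguments are placeholders for this sketch.
pg_setup_sql() {
    db=$1 user=$2 pass=$3
    cat <<EOF
CREATE USER $user WITH PASSWORD '$pass';
CREATE DATABASE $db OWNER $user;
EOF
}

# Show the statements for a database and user both named "docspell".
pg_setup_sql docspell docspell change-me
```

The output can be reviewed first and then piped into psql under an account that is allowed to create roles, for example: pg_setup_sql docspell docspell change-me | sudo -u postgres psql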

When using H2, make sure that all components access the same database: the JDBC URL must point to the same file. It is also important to add the options ;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE at the end of the URL. See the config page for an example.
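As an illustration of how such a URL is put together, the sketch below appends the required options to a database file path. The helper name h2_url and the example path are made up; the jdbc:h2:file: prefix is H2's documented form for embedded file databases, and the example on the config page may use a slightly different prefix.

```shell
# Build an H2 JDBC URL for a database file and append the options
# docspell needs. Use the same URL for the REST server and joex.
h2_url() {
    printf 'jdbc:h2:file:%s;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE\n' "$1"
}

# Hypothetical database location outside of /tmp.
h2_url /var/lib/docspell/docspell.db
```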

Installing from ZIP files

After extracting the zip files, you'll find a start script in the bin/ folder.

Installing from DEB packages

The DEB packages can be installed on Debian or Debian-based distributions:

$ sudo dpkg -i docspell*.deb

Then the start scripts are in your $PATH. Run docspell-restserver or docspell-joex from a terminal window.

The packages come with a systemd unit file that will be installed to autostart the services.

Running

Run the start script (in the corresponding bin/ directory when using the zip files):

$ ./docspell-restserver*/bin/docspell-restserver
$ ./docspell-joex*/bin/docspell-joex

This will start up both components using the default configuration. The configuration should be adapted to your needs. For example, the database connection is configured to use an H2 database in the /tmp directory. Please refer to the configuration page for how to create a custom config file. Once you have your config file, simply pass it as an argument to the command:

$ ./docspell-restserver*/bin/docspell-restserver /path/to/server-config.conf
$ ./docspell-joex*/bin/docspell-joex /path/to/joex-config.conf

After starting the rest server, you can reach the web application at path /app, so using default values it would be http://localhost:7880/app.

You should be able to create a new account and sign in. Check the configuration page to further customize docspell.

Options

The start scripts support some options to configure the JVM. One often used setting is the maximum heap size of the JVM. By default, Java determines it based on properties of the current machine. You can specify it by giving Java startup options to the command:

$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -- /path/to/server-config.conf

This would limit the maximum heap to 1GB. The double dash (--) separates internal options from the arguments to the program. Another frequently used option changes the default temp directory. Usually it is /tmp, but you may want a dedicated temp directory, which can be configured like this:

$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -Djava.io.tmpdir=/path/to/othertemp -- /path/to/server-config.conf

The command:

$ ./docspell-restserver*/bin/docspell-restserver -h

gives an overview of supported options.

Raspberry Pi, and similar

Both components can run next to each other on a Raspberry Pi or similar device.

REST Server

The REST server component runs very well on the Raspberry Pi and similar devices. It doesn't require many resources, because the heavy work is done by the joex components.

Joex

Running the joex component on the Raspberry Pi is possible, but will result in long processing times for OCR. Files that don't require OCR are no problem.

Tested on an RPi model 3 (4 cores, 1G RAM), processing a two-page PDF (scanned with 300dpi) took 9:52. You can speed this up considerably by uninstalling the unpaper command, because that step takes quite long. This, of course, reduces the quality of OCR. But without unpaper the same sample pdf was processed in 1:24, which is about 8.5 minutes faster.
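The difference between the two timings quoted above (9:52 and 1:24) can be worked out with a few lines of shell arithmetic. The helper name to_seconds is invented for this sketch, and the ratio uses integer division, so it is only a rough figure.

```shell
# Convert a "minutes:seconds" duration to seconds.
to_seconds() {
    m=${1%%:*}; m=${m#0}   # drop one leading zero so "08" is not read as octal
    s=${1##*:}; s=${s#0}
    echo $(( ${m:-0} * 60 + ${s:-0} ))
}

with_unpaper=$(to_seconds 9:52)      # 592 seconds
without_unpaper=$(to_seconds 1:24)   # 84 seconds
saved=$(( with_unpaper - without_unpaper ))

echo "saved: $(( saved / 60 ))m $(( saved % 60 ))s"
echo "ratio: roughly $(( with_unpaper / without_unpaper ))x"
```

For this single sample, that is 8 minutes 28 seconds saved, or roughly a sevenfold reduction in processing time.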

You should limit the joex pool size to 1 and, depending on your model and the amount of RAM, set a heap size of at least 500M (-J-Xmx500M).

For personal setups, when you don't need the processing results asap, this can work well enough.