mirror of
https://github.com/TheAnachronism/docspell.git
synced 2025-08-05 02:24:52 +00:00
Initial website
107
website/site/content/docs/install/prereq.md
Normal file
@ -0,0 +1,107 @@
+++
title = "Prerequisites"
weight = 10
+++

The two components have one prerequisite in common: both require Java
to run. While this is the only requirement for the *REST server*, the
*Joex* component requires some additional external programs.

The REST server and Joex components are not required to "see" each
other, though it is recommended.

# Java

Very often, Java is already installed. You can check this by opening a
terminal and typing `java -version`. Otherwise, install Java using your
package manager or see [this site](https://adoptopenjdk.net/) for
other options.

It is enough to install the JRE. The JDK is required only if you want
to build docspell from source.

Docspell has been tested with Java version 1.8 (also referred to as
JRE 8 or JDK 8). The pre-built packages are also built using JDK 8,
but a later version of Java should work as well.

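The check described above could be scripted roughly as follows. This is only a sketch: the helper name `java_major_version` is hypothetical, and it assumes that `java -version` prints a line containing `version "..."`, as OpenJDK builds do.

``` bash
#!/usr/bin/env bash
# Hypothetical helper: extract the major Java version from a version string.
# Handles the legacy "1.x" scheme (1.8.0_292) and the current one (11.0.2).
java_major_version() {
  local v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy scheme: drop the leading "1."
  esac
  echo "${v%%[._-]*}"      # keep everything before the first ".", "_" or "-"
}

if command -v java >/dev/null 2>&1; then
  # java prints its version to stderr, hence the 2>&1
  raw="$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)"
  echo "Found Java ${raw} (major version $(java_major_version "$raw"))"
else
  echo "Java not found; install a JRE (or a JDK to build from source)" >&2
fi
```
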
The following tools are required only on machines running the *Joex*
component.

# External Programs for Joex

- [Ghostscript](http://pages.cs.wisc.edu/~ghost/) (the `gs` command)
  is used to extract/convert PDF files into images that are then fed
  to OCR. It is available on most GNU/Linux distributions.
- [Unpaper](https://github.com/Flameeyes/unpaper) is a program that
  pre-processes images to yield better results when doing OCR. If it
  is not installed, docspell tries without it. However, installing it
  is recommended, because it [improves text
  extraction](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
  (at the expense of a longer runtime).
- [Tesseract](https://github.com/tesseract-ocr/tesseract) is the tool
  doing the OCR (it converts images into text). It can also convert
  images into PDF files. It is a widely used open source OCR engine.
  Tesseract 3 and 4 should work with docspell; you can adapt the
  command line in the configuration file, if necessary.
- [Unoconv](https://github.com/unoconv/unoconv) is used to convert
  office documents into PDF files. It uses libreoffice/openoffice.
- [wkhtmltopdf](https://wkhtmltopdf.org/) is used to convert HTML into
  PDF files.
- [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF) can optionally be
  used to convert PDF to PDF files. It adds an OCR layer to scanned
  PDF files to make them searchable. It also creates PDF/A files from
  the input PDF.

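To see which of the tools listed above are still missing on a Joex machine, a small check like the following may help. It is a sketch only: `missing_commands` is a hypothetical helper, and apart from `gs` the executable names are assumptions based on the usual package defaults.

``` bash
#!/usr/bin/env bash
# Hypothetical helper: print every given command that is not found on PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Only `gs` is named explicitly in the list above; the other executable
# names are assumed and may differ on your system.
missing="$(missing_commands gs unpaper tesseract unoconv wkhtmltopdf ocrmypdf)"
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "All Joex tools found"
fi
```
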
The performance of `unoconv` can be improved by starting `unoconv -l`
in a separate process. This runs a libreoffice/openoffice listener and
therefore avoids starting one each time `unoconv` is called.

## Example Debian

On Debian this should install all joex requirements:

``` bash
sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf ocrmypdf
```

# Apache SOLR

SOLR is used to provide the fulltext search feature. This feature can
be disabled, so installing SOLR is optional. But without it, there is
no fulltext search.

When installing manually (i.e. not via docker), install SOLR and
create a core as described in the [solr
documentation](https://lucene.apache.org/solr/guide/8_4/installing-solr.html).
That will provide you with the connection URL (the last part is the
core name).

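As a rough illustration of how the connection URL relates to the core name, consider the sketch below. The host, the port (8983 is SOLR's default) and the core name `docspell` are assumptions, and `solr_url` is a hypothetical helper; the commented `bin/solr` commands are the standard way to start SOLR 8.x and create a core from its installation directory.

``` bash
#!/usr/bin/env bash
# Assumed values, not taken from the text: adjust to your installation.
SOLR_HOST="localhost"
SOLR_PORT="8983"        # SOLR's default port
CORE_NAME="docspell"    # hypothetical core name

# Build the connection URL; the last path segment is the core name.
solr_url() {
  echo "http://$1:$2/solr/$3"
}

# Run from the SOLR installation directory (SOLR 8.x):
#   bin/solr start
#   bin/solr create -c "$CORE_NAME"

echo "Connection URL: $(solr_url "$SOLR_HOST" "$SOLR_PORT" "$CORE_NAME")"
```
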
When using the provided `docker-compose.yml` setup, SOLR is already
set up.

SOLR must be reachable from all joex and all rest server components.

# Database

Both components must have access to a SQL database. The SQL database
contains all data (including binary files) and is the central
component of docspell. Docspell supports these databases:

- PostgreSQL
- MariaDB
- H2

The H2 database is an interesting option for personal and mid-size
setups, as it requires no additional work. It is integrated into
docspell and works really well out of the box. It is also configured
as the default database.

When using H2, make sure that all components access the same
database: the JDBC URL must point to the same file. It is also
important to add the options
`;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE` at the end
of the URL. See the [config page](@/docs/configure/_index.md#jdbc) for
an example.

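The following sketch shows how such a URL is put together. The helper `h2_jdbc_url` and the file path are hypothetical, and the `jdbc:h2:file:` prefix follows H2's general URL format; the config page remains the authoritative source for the exact form docspell expects.

``` bash
#!/usr/bin/env bash
# Hypothetical helper: append the H2 options named above to a file path.
h2_jdbc_url() {
  echo "jdbc:h2:file:$1;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
}

# Hypothetical path; every component must point to the same file.
h2_jdbc_url "/var/lib/docspell/docspell.db"
```
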
For large installations, PostgreSQL or MariaDB is recommended. Create
a database and a user with enough privileges (read, write, create
table) on that database.
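For PostgreSQL, the database and user could be created along these lines. This is a sketch under assumptions: `postgres_setup_sql` is a hypothetical helper, the names and password are placeholders, and making the user the owner of the database is just one way to grant the required privileges.

``` bash
#!/usr/bin/env bash
# Hypothetical helper: print SQL that creates a user and a database owned
# by that user. As owner, the user can read, write and create tables.
# Arguments are inserted verbatim, so pass trusted values only.
postgres_setup_sql() {
  local db="$1" user="$2" password="$3"
  echo "CREATE USER ${user} WITH PASSWORD '${password}';"
  echo "CREATE DATABASE ${db} OWNER ${user};"
}

# Print the statements. To apply them, pipe the output into psql as a
# superuser, for example: ... | sudo -u postgres psql
postgres_setup_sql docspell docspell "change-me"
```
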