Initial website

2025-09-28 15:48:22 +00:00 · 2020-07-27 22:13:22 +02:00
parent dbd0f3ff97
commit f8c6f79b10
160 changed files with 8854 additions and 64 deletions
--- a/website/site/content/docs/install/prereq.md
+++ b/website/site/content/docs/install/prereq.md
@@ -0,0 +1,107 @@
+++
+title = "Prerequisites"
+weight = 10
+++
+
+The two components have one prerequisite in common: they both require
+Java to run. While this is the only requirement for the *REST server*,
+the *Joex* component requires some additional external programs.
+
+The rest server and joex components are not required to "see" each
+other, though it is recommended.
+
+# Java
+
+Very often, Java is already installed. You can check this by opening a
+terminal and typing `java -version`. Otherwise install Java using your
+package manager or see [this site](https://adoptopenjdk.net/) for
+other options.
+
+It is enough to install the JRE. The JDK is only required if you want
+to build docspell from source.
+
+Docspell has been tested with Java version 1.8 (also referred to as
+JRE 8 or JDK 8). The pre-built packages are also built using JDK 8,
+but a later version of Java should work as well.
+
+The next tools are only required on machines running the *Joex*
+component.
+
+# External Programs for Joex
+
+- [Ghostscript](http://pages.cs.wisc.edu/~ghost/) (the `gs` command)
+  is used to extract/convert PDF files into images that are then fed
+  to OCR. It is available on most GNU/Linux distributions.
+- [Unpaper](https://github.com/Flameeyes/unpaper) is a program that
+  pre-processes images to yield better results when doing OCR. If it
+  is not installed, docspell tries without it. However, installing it
+  is recommended, because it [improves text
+  extraction](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
+  (at the expense of a longer runtime).
+- [Tesseract](https://github.com/tesseract-ocr/tesseract) is the tool
+  doing the OCR (converts images into text). It can also convert
+  images into PDF files. It is a widely used open source OCR engine.
+  Tesseract 3 and 4 should work with docspell; you can adapt the
+  command line in the configuration file, if necessary.
+- [Unoconv](https://github.com/unoconv/unoconv) is used to convert
+  office documents into PDF files. It uses libreoffice/openoffice.
+- [wkhtmltopdf](https://wkhtmltopdf.org/) is used to convert HTML into
+  PDF files.
+- [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF) can optionally be
+  used to convert PDF to PDF files. It adds an OCR layer to scanned
+  PDF files to make them searchable. It also creates PDF/A files from
+  the input PDF.
+
+The performance of `unoconv` can be improved by starting `unoconv -l`
+in a separate process. This runs a libreoffice/openoffice listener and
+therefore avoids starting one each time `unoconv` is called.
+
+## Example Debian
+
+On Debian this should install all joex requirements:
+
+``` bash
+sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf ocrmypdf
+```
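+
+After installing, a quick check can confirm that the programs are on
+the `PATH`. This is only a sketch: `missing_tools` is a hypothetical
+helper, and the command names are assumed to be the ones the packages
+above provide.
+
+``` bash
+# Print every given command name that cannot be resolved on the PATH.
+# `missing_tools` is a hypothetical helper, not part of docspell.
+missing_tools() {
+  for tool in "$@"; do
+    # `command -v` succeeds only if the shell can find the command
+    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
+  done
+}
+
+# Assumed command names for the programs listed above
+missing_tools gs tesseract unpaper unoconv wkhtmltopdf ocrmypdf
+```
+
+No output means all listed commands were found. Only `unpaper` and
+`ocrmypdf` are described as optional above.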
+
+# Apache SOLR
+
+SOLR is used to provide the fulltext search feature. This feature can
+be disabled, so installing SOLR is optional. But without it, there is
+no fulltext search.
+
+When installing manually (i.e. not via docker), just install solr and
+create a core as described in the [solr
+documentation](https://lucene.apache.org/solr/guide/8_4/installing-solr.html).
+That will provide you with the connection url (the last part is the
+core name).
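+
+As a rough sketch of the manual route, a core can be created with the
+`bin/solr` script that ships with SOLR. The core name `docspell`, the
+host and SOLR's default port 8983 are assumptions here; adjust them to
+your installation.
+
+``` bash
+# Run from the SOLR installation directory, as the user that runs SOLR.
+# "docspell" is a placeholder core name.
+bin/solr create -c docspell
+
+# With the default port, the connection url would then be:
+echo "http://localhost:8983/solr/docspell"
+```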
+
+When using the provided `docker-compose.yml` setup, SOLR is already set up.
+
+SOLR must be reachable from all joex and all rest server components.
+
+# Database
+
+Both components must have access to a SQL database. The SQL database
+contains all data (including binary files) and is the central
+component of docspell. Docspell supports these databases:
+
+- PostgreSQL
+- MariaDB
+- H2
+
+The H2 database is an interesting option for personal and mid-size
+setups, as it requires no additional work. It is integrated into
+docspell and works really well out of the box. It is also configured
+as the default database.
+
+When using H2, make sure that all components access the same database
+– the jdbc url must point to the same file. It is also important to
+append the options
+`;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE` to the end
+of the url. See the [config page](@/docs/configure/_index.md#jdbc) for
+an example.
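+
+To illustrate how the pieces fit together, the following sketch
+assembles such a url. The file location and the `jdbc:h2:file:` prefix
+(a form described in the H2 documentation) are assumptions; only the
+options are taken from the text above.
+
+``` bash
+# Hypothetical location of the H2 database file shared by all components
+db_file="/var/lib/docspell/docspell.db"
+
+# Options as given above; every component must use the identical url
+h2_opts=";MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
+
+jdbc_url="jdbc:h2:file:${db_file}${h2_opts}"
+printf '%s\n' "$jdbc_url"
+```
+
+The resulting value is what goes into the jdbc url setting of both the
+rest server and joex.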
+
+For large installations, PostgreSQL or MariaDB is recommended. Create
+a database and a user with enough privileges (read, write, create
+table) on that database.
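+
+For PostgreSQL, this could look roughly like the following sketch. The
+user name, database name and password are placeholders, and it assumes
+a local server whose superuser account is `postgres`.
+
+``` bash
+# Run on the database host; all names and the password are placeholders.
+sudo -u postgres psql <<'SQL'
+CREATE USER docspell WITH PASSWORD 'change-me';
+-- Making the user the owner lets it read, write and create tables here.
+CREATE DATABASE docspell OWNER docspell;
+SQL
+```
+
+Making the user the owner of the database is one simple way to grant
+the required privileges; a more restrictive set of grants works as
+well.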