mirror of
https://github.com/TheAnachronism/docspell.git
synced 2024-11-13 02:31:10 +00:00
551 lines
17 KiB
Markdown
551 lines
17 KiB
Markdown
+++
|
||
title = "Configuration"
|
||
insert_anchor_links = "right"
|
||
description = "Describes the configuration file and shows all default settings."
|
||
weight = 40
|
||
[extra]
|
||
mktoc = true
|
||
+++
|
||
|
||
Docspell's executables (restserver and joex) can take one argument – a
|
||
configuration file. If that is not given, the defaults are used. The
|
||
config file overrides default values, so only values that differ from
|
||
the defaults are necessary. The complete default options and their
|
||
documentation is at the end of this page.
|
||
|
||
Besides the config file, another way is to provide individual settings
|
||
via key-value pairs to the executable by the `-D` option. For example
|
||
to override only `base-url` you could add the argument
|
||
`-Ddocspell.server.base-url=…` to the command. Multiple options are
|
||
possible. For more than few values this is very tedious, obviously, so
|
||
the recommended way is to maintain a config file. If these options
|
||
*and* a file is provded, then any setting given via the `-D…` option
|
||
overrides the same setting from the config file.
|
||
|
||
# File Format
|
||
|
||
The format of the configuration files can be
|
||
[HOCON](https://github.com/lightbend/config/blob/master/HOCON.md#hocon-human-optimized-config-object-notation),
|
||
JSON or what this [config
|
||
library](https://github.com/lightbend/config) understands. The default
|
||
values below are in HOCON format, which is recommended, since it
|
||
allows comments and has some [advanced
|
||
features](https://github.com/lightbend/config#features-of-hocon).
|
||
Please refer to their documentation for more on this.
|
||
|
||
A short description (please see the links for better understanding):
|
||
The config consists of key-value pairs and can be written in a
|
||
JSON-like format (called HOCON). Keys are organized in trees, and a
|
||
key defines a full path into the tree. There are two ways:
|
||
|
||
```
|
||
a.b.c.d=15
|
||
```
|
||
|
||
or
|
||
|
||
```
|
||
a {
|
||
b {
|
||
c {
|
||
d = 15
|
||
}
|
||
}
|
||
}
|
||
```
|
||
|
||
Both are exactly the same and these forms are both used at the same
|
||
time. Usually the braces approach is used to group some more settings,
|
||
for better readability.
|
||
|
||
Strings that contain "not-so-common" characters should be enclosed in
|
||
quotes. It is possible to define values at the top of the file and
|
||
reuse them on different locations via the `${full.path.to.key}`
|
||
syntax. When using these variables, they *must not* be enclosed in
|
||
quotes.
|
||
|
||
|
||
# Important Config Options
|
||
|
||
The configuration of both components uses separate namespaces. The
|
||
configuration for the REST server is below `docspell.server`, while
|
||
the one for joex is below `docspell.joex`.
|
||
|
||
You can therefore use two separate config files or one single file
|
||
containing both namespaces.
|
||
|
||
## JDBC
|
||
|
||
This configures the connection to the database. This has to be
|
||
specified for the rest server and joex. By default, a H2 database in
|
||
the current `/tmp` directory is configured.
|
||
|
||
The config looks like this (both components):
|
||
|
||
``` bash
|
||
docspell.joex.jdbc {
|
||
url = ...
|
||
user = ...
|
||
password = ...
|
||
}
|
||
|
||
docspell.server.backend.jdbc {
|
||
url = ...
|
||
user = ...
|
||
password = ...
|
||
}
|
||
```
|
||
|
||
The `url` is the connection to the database. It must start with
|
||
`jdbc`, followed by name of the database. The rest is specific to the
|
||
database used: it is either a path to a file for H2 or a host/database
|
||
url for MariaDB and PostgreSQL.
|
||
|
||
When using H2, the user and password can be chosen freely on first
|
||
start, but must stay the same on subsequent starts. Usually, the user
|
||
is `sa` and the password is left empty. Additionally, the url must
|
||
include these options:
|
||
|
||
```
|
||
;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE
|
||
```
|
||
|
||
### Examples
|
||
|
||
PostgreSQL:
|
||
```
|
||
url = "jdbc:postgresql://localhost:5432/docspelldb"
|
||
```
|
||
|
||
MariaDB:
|
||
```
|
||
url = "jdbc:mariadb://localhost:3306/docspelldb"
|
||
```
|
||
|
||
H2
|
||
```
|
||
url = "jdbc:h2:///path/to/a/file.db;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
|
||
```
|
||
|
||
## Admin Endpoint
|
||
|
||
The admin endpoint defines some [routes](@/docs/api/intro.md#admin)
|
||
for adminstration tasks. This is disabled by default and can be
|
||
enabled by providing a secret:
|
||
|
||
``` bash
|
||
...
|
||
admin-endpoint {
|
||
secret = "123"
|
||
}
|
||
```
|
||
|
||
This secret must be provided to all requests to a `/api/v1/admin/`
|
||
endpoint.
|
||
|
||
|
||
## Full-Text Search: SOLR
|
||
|
||
[Apache SOLR](https://solr.apache.org) is used to provide the
|
||
full-text search. Both docspell components must provide the same
|
||
connection setup. This is defined in the `full-text-search.solr`
|
||
subsection:
|
||
|
||
``` bash
|
||
...
|
||
full-text-search {
|
||
enabled = true
|
||
...
|
||
solr = {
|
||
url = "http://localhost:8983/solr/docspell"
|
||
}
|
||
}
|
||
```
|
||
|
||
The default configuration at the end of this page contains more
|
||
information about each setting.
|
||
|
||
The `solr.url` is the mandatory setting that you need to change to
|
||
point to your SOLR instance. Then you need to set the `enabled` flag
|
||
to `true`.
|
||
|
||
When installing docspell manually, just install solr and create a core
|
||
as described in the [solr
|
||
documentation](https://solr.apache.org/guide/8_4/installing-solr.html).
|
||
That will provide you with the connection url (the last part is the
|
||
core name). If Docspell detects an empty core it will run a schema
|
||
setup on start automatically.
|
||
|
||
The `full-text-search.solr` options are the same for joex and the
|
||
restserver.
|
||
|
||
There is an [admin route](@/docs/api/intro.md#admin) that allows to
|
||
re-create the entire index (for all collectives). This is possible via
|
||
a call:
|
||
|
||
``` bash
|
||
$ curl -XPOST -H "Docspell-Admin-Secret: test123" http://localhost:7880/api/v1/admin/fts/reIndexAll
|
||
```
|
||
|
||
Here the `test123` is the key defined with `admin-endpoint.secret`. If
|
||
it is empty (the default), this call is disabled (all admin routes).
|
||
Otherwise, the POST request will submit a system task that is executed
|
||
by a joex instance eventually.
|
||
|
||
Using this endpoint, the entire index (including the schema) will be
|
||
re-created. This is sometimes necessary, for example if you upgrade
|
||
SOLR or delete the core to provide a new one (see
|
||
[here](https://solr.apache.org/guide/8_4/reindexing.html) for
|
||
details). Another way is to restart docspell (while clearing the
|
||
index). If docspell detects an empty index at startup, it will submit
|
||
a task to build the index automatically.
|
||
|
||
Note that a collective can also re-index their data using a similiar
|
||
endpoint; but this is only deleting their data and doesn't do a full
|
||
re-index.
|
||
|
||
The solr index doesn't contain any new information, it can be
|
||
regenerated any time using the above REST call. Thus it doesn't need
|
||
to be backed up.
|
||
|
||
## Bind
|
||
|
||
The host and port the http server binds to. This applies to both
|
||
components. The joex component also exposes a small REST api to
|
||
inspect its state and notify the scheduler.
|
||
|
||
``` bash
|
||
docspell.server.bind {
|
||
address = localhost
|
||
port = 7880
|
||
}
|
||
docspell.joex.bind {
|
||
address = localhost
|
||
port = 7878
|
||
}
|
||
```
|
||
|
||
By default, it binds to `localhost` and some predefined port. This
|
||
must be changed, if components are on different machines.
|
||
|
||
## Baseurl
|
||
|
||
The base url is an important setting that defines the http URL where
|
||
the corresponding component can be reached. It applies to both
|
||
components. For a joex component, the url must be resolvable from a
|
||
REST server component. The REST server also uses this url to create
|
||
absolute urls and to configure the authenication cookie.
|
||
|
||
By default it is build using the information from the `bind` setting,
|
||
which is `http://localhost:7880`.
|
||
|
||
If the default is not changed, docspell will use the request to
|
||
determine the base-url. It first inspects the `X-Forwarded-For` header
|
||
that is often used with reverse proxies. If that is not present, the
|
||
`Host` header of the request is used. However, if the `base-url`
|
||
setting is changed, then only this setting is used.
|
||
|
||
```
|
||
docspell.server.base-url = ...
|
||
docspell.joex.base-url = ...
|
||
```
|
||
|
||
If you are unsure, leave it at its default.
|
||
|
||
### Examples
|
||
|
||
```
|
||
docspell.server.baseurl = "https://docspell.example.com"
|
||
docspell.joex.baseurl = "http://192.168.101.10"
|
||
```
|
||
|
||
|
||
## App-id
|
||
|
||
The `app-id` is the identifier of the corresponding instance. It *must
|
||
be unique* for all instances. By default the REST server uses `rest1`
|
||
and joex `joex1`. It is recommended to overwrite this setting to have
|
||
an explicit and stable identifier should multiple instances are
|
||
intended.
|
||
|
||
``` bash
|
||
docspell.server.app-id = "rest1"
|
||
docspell.joex.app-id = "joex1"
|
||
```
|
||
|
||
## Registration Options
|
||
|
||
This defines if and how new users can create accounts. There are 3
|
||
options:
|
||
|
||
- *closed* no new user can sign up
|
||
- *open* new users can sign up
|
||
- *invite* new users can sign up but require an invitation key
|
||
|
||
This applies only to the REST sevrer component.
|
||
|
||
``` bash
|
||
docspell.server.backend.signup {
|
||
mode = "open"
|
||
|
||
# If mode == 'invite', a password must be provided to generate
|
||
# invitation keys. It must not be empty.
|
||
new-invite-password = ""
|
||
|
||
# If mode == 'invite', this is the period an invitation token is
|
||
# considered valid.
|
||
invite-time = "3 days"
|
||
}
|
||
```
|
||
|
||
The mode `invite` is intended to open the application only to some
|
||
users. The admin can create these invitation keys and distribute them
|
||
to the desired people. For this, the `new-invite-password` must be
|
||
given. The idea is that only the person who installs docspell knows
|
||
this. If it is not set, then invitation won't work. New invitation
|
||
keys can be generated from within the web application or via REST
|
||
calls (using `curl`, for example).
|
||
|
||
``` bash
|
||
curl -X POST -d '{"password":"blabla"}' "http://localhost:7880/api/v1/open/signup/newinvite"
|
||
```
|
||
|
||
## Authentication
|
||
|
||
Authentication works in two ways:
|
||
|
||
- with an account-name / password pair
|
||
- with an authentication token
|
||
|
||
The initial authentication must occur with an accountname/password
|
||
pair. This will generate an authentication token which is valid for a
|
||
some time. Subsequent calls to secured routes can use this token. The
|
||
token can be given as a normal http header or via a cookie header.
|
||
|
||
These settings apply only to the REST server.
|
||
|
||
``` bash
|
||
docspell.server.auth {
|
||
server-secret = "hex:caffee" # or "b64:Y2FmZmVlCg=="
|
||
session-valid = "5 minutes"
|
||
}
|
||
```
|
||
|
||
The `server-secret` is used to sign the token. If multiple REST
|
||
servers are deployed, all must share the same server secret. Otherwise
|
||
tokens from one instance are not valid on another instance. The secret
|
||
can be given as Base64 encoded string or in hex form. Use the prefix
|
||
`hex:` and `b64:`, respectively. If no prefix is given, the UTF8 bytes
|
||
of the string are used.
|
||
|
||
The `session-valid` determines how long a token is valid. This can be
|
||
just some minutes, the web application obtains new ones
|
||
periodically. So a rather short time is recommended.
|
||
|
||
|
||
## File Processing
|
||
|
||
Files are being processed by the joex component. So all the respective
|
||
configuration is in this config only.
|
||
|
||
File processing involves several stages, detailed information can be
|
||
found [here](@/docs/joex/file-processing.md#text-analysis) and in the
|
||
corresponding sections in [joex default config](#joex).
|
||
|
||
Configuration allows to define the external tools and set some
|
||
limitations to control memory usage. The sections are:
|
||
|
||
- `docspell.joex.extraction`
|
||
- `docspell.joex.text-analysis`
|
||
- `docspell.joex.convert`
|
||
|
||
Options to external commands can use variables that are replaced by
|
||
values at runtime. Variables are enclosed in double braces `{{…}}`.
|
||
Please see the default configuration for what variables exist per
|
||
command.
|
||
|
||
### Classification
|
||
|
||
In `text-analysis.classification` you can define how many documents at
|
||
most should be used for learning. The default settings should work
|
||
well for most cases. However, it always depends on the amount of data
|
||
and the machine that runs joex. For example, by default the documents
|
||
to learn from are limited to 600 (`classification.item-count`) and
|
||
every text is cut after 5000 characters (`text-analysis.max-length`).
|
||
This is fine if *most* of your documents are small and only a few are
|
||
near 5000 characters). But if *all* your documents are very large, you
|
||
probably need to either assign more heap memory or go down with the
|
||
limits.
|
||
|
||
Classification can be disabled, too, for when it's not needed.
|
||
|
||
### NLP
|
||
|
||
This setting defines which NLP mode to use. It defaults to `full`,
|
||
which requires more memory for certain languages (with the advantage
|
||
of better results). Other values are `basic`, `regexonly` and
|
||
`disabled`. The modes `full` and `basic` use pre-defined lanugage
|
||
models for procesing documents of languaes German, English and French.
|
||
These require some amount of memory (see below).
|
||
|
||
The mode `basic` is like the "light" variant to `full`. It doesn't use
|
||
all NLP features, which makes memory consumption much lower, but comes
|
||
with the compromise of less accurate results.
|
||
|
||
The mode `regexonly` doesn't use pre-defined lanuage models, even if
|
||
available. It checks your address book against a document to find
|
||
metadata. That means, it is language independent. Also, when using
|
||
`full` or `basic` with lanugages where no pre-defined models exist, it
|
||
will degrade to `regexonly` for these.
|
||
|
||
The mode `disabled` skips NLP processing completely. This has least
|
||
impact in memory consumption, obviously, but then only the classifier
|
||
is used to find metadata (unless it is disabled, too).
|
||
|
||
You might want to try different modes and see what combination suits
|
||
best your usage pattern and machine running joex. If a powerful
|
||
machine is used, simply leave the defaults. When running on an
|
||
raspberry pi, for example, you might need to adjust things.
|
||
|
||
### Memory Usage
|
||
|
||
The memory requirements for the joex component depends on the document
|
||
language and the enabled features for text-analysis. The `nlp.mode`
|
||
setting has significant impact, especially when your documents are in
|
||
German. Here are some rough numbers on jvm heap usage (the same file
|
||
was used for all tries):
|
||
|
||
<table class="table is-hoverable is-striped">
|
||
<thead>
|
||
<tr><th>nlp.mode</th><th>English</th><th>German</th><th>French</th></tr>
|
||
</thead>
|
||
<tfoot>
|
||
</tfoot>
|
||
<tbody>
|
||
<tr><td>full</td><td>420M</td><td>950M</td><td>490M</td></tr>
|
||
<tr><td>basic</td><td>170M</td><td>380M</td><td>390M</td></tr>
|
||
</tbody>
|
||
</table>
|
||
|
||
Note that these are only rough numbers and they show the maximum used
|
||
heap memory while processing a file.
|
||
|
||
When using `mode=full`, a heap setting of at least `-Xmx1400M` is
|
||
recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
|
||
recommended.
|
||
|
||
Other languages can't use these two modes, and so don't require this
|
||
amount of memory (but don't have as good results). Then you can go
|
||
with less heap. For these languages, the nlp mode is the same as
|
||
`regexonly`.
|
||
|
||
Training the classifier is also memory intensive, which solely depends
|
||
on the size and number of documents that are being trained. However,
|
||
training the classifier is done periodically and can happen maybe
|
||
every two weeks. When classifying new documents, memory requirements
|
||
are lower, since the model already exists.
|
||
|
||
More details about these modes can be found
|
||
[here](@/docs/joex/file-processing.md#text-analysis).
|
||
|
||
|
||
The restserver component is very lightweight, here you can use
|
||
defaults.
|
||
|
||
|
||
# JVM Options
|
||
|
||
The start scripts support some options to configure the JVM. One often
|
||
used setting is the maximum heap size of the JVM. By default, java
|
||
determines it based on properties of the current machine. You can
|
||
specify it by given java startup options to the command:
|
||
|
||
```
|
||
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -- /path/to/server-config.conf
|
||
```
|
||
|
||
This would limit the maximum heap to 1GB. The double slash separates
|
||
internal options and the arguments to the program. Another frequently
|
||
used option is to change the default temp directory. Usually it is
|
||
`/tmp`, but it may be desired to have a dedicated temp directory,
|
||
which can be configured:
|
||
|
||
```
|
||
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -Djava.io.tmpdir=/path/to/othertemp -- /path/to/server-config.conf
|
||
```
|
||
|
||
The command:
|
||
|
||
```
|
||
$ ./docspell-restserver*/bin/docspell-restserver -h
|
||
```
|
||
|
||
gives an overview of supported options.
|
||
|
||
It is recommended to run joex with the G1GC enabled. If you use java8,
|
||
you need to add an option to use G1GC (`-XX:+UseG1GC`), for java11
|
||
this is not necessary (but doesn't hurt either). This could look like
|
||
this:
|
||
|
||
```
|
||
./docspell-joex-{{version()}}/bin/docspell-joex -J-Xmx1596M -J-XX:+UseG1GC -- /path/to/joex.conf
|
||
```
|
||
|
||
Using these options you can define how much memory the JVM process is
|
||
able to use. This might be necessary to adopt depending on the usage
|
||
scenario and configured text analysis features.
|
||
|
||
Please have a look at the corresponding [section](@/docs/configure/_index.md#memory-usage).
|
||
|
||
|
||
|
||
# Logging
|
||
|
||
By default, docspell logs to stdout. This works well, when managed by
|
||
systemd or other inits. Logging is done by
|
||
[logback](https://logback.qos.ch/). Please refer to its documentation
|
||
for how to configure logging.
|
||
|
||
If you created your logback config file, it can be added as argument
|
||
to the executable using this syntax:
|
||
|
||
``` bash
|
||
/path/to/docspell -Dlogback.configurationFile=/path/to/your/logging-config-file
|
||
```
|
||
|
||
To get started, the default config looks like this:
|
||
|
||
``` xml
|
||
<configuration>
|
||
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
|
||
<withJansi>true</withJansi>
|
||
|
||
<encoder>
|
||
<pattern>[%thread] %highlight(%-5level) %cyan(%logger{15}) - %msg %n</pattern>
|
||
</encoder>
|
||
</appender>
|
||
|
||
<logger name="docspell" level="debug" />
|
||
<root level="INFO">
|
||
<appender-ref ref="STDOUT" />
|
||
</root>
|
||
</configuration>
|
||
```
|
||
|
||
The `<root level="INFO">` means, that only log statements with level
|
||
"INFO" will be printed. But the `<logger name="docspell"
|
||
level="debug">` above says, that for loggers with name "docspell"
|
||
statements with level "DEBUG" will be printed, too.
|
||
|
||
|
||
# Default Config
|
||
## Rest Server
|
||
|
||
{{ incl_conf(path="templates/shortcodes/server.conf") }}
|
||
|
||
|
||
## Joex
|
||
|
||
|
||
{{ incl_conf(path="templates/shortcodes/joex.conf") }}
|