mirror of
https://github.com/TheAnachronism/docspell.git
synced 2025-06-22 02:18:26 +00:00
Upgrade microsite
This commit is contained in:
261
modules/microsite/docs/doc/configure.md
Normal file
261
modules/microsite/docs/doc/configure.md
Normal file
@ -0,0 +1,261 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Configuring
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
Docspell's executable can take one argument – a configuration file. If
|
||||
that is not given, the defaults are used. The config file overrides
|
||||
default values, so only values that differ from the defaults are
|
||||
necessary.
|
||||
|
||||
This applies to the restserver and the joex as well.
|
||||
|
||||
## Important Config Options
|
||||
|
||||
The configuration of both components uses separate namespaces. The
|
||||
configuration for the REST server is below `docspell.server`, while
|
||||
the one for joex is below `docspell.joex`.
|
||||
|
||||
### JDBC
|
||||
|
||||
This configures the connection to the database. This has to be
|
||||
specified for the rest server and joex. By default, a H2 database in
|
||||
the current `/tmp` directory is configured.
|
||||
|
||||
The config looks like this (both components):
|
||||
|
||||
```
|
||||
docspell.joex.jdbc {
|
||||
url = ...
|
||||
user = ...
|
||||
password = ...
|
||||
}
|
||||
|
||||
docspell.server.backend.jdbc {
|
||||
url = ...
|
||||
user = ...
|
||||
password = ...
|
||||
}
|
||||
```
|
||||
|
||||
The `url` is the connection to the database. It must start with
|
||||
`jdbc`, followed by name of the database. The rest is specific to the
|
||||
database used: it is either a path to a file for H2 or a host/database
|
||||
url for MariaDB and PostgreSQL.
|
||||
|
||||
When using H2, the user is `sa`, the password can be empty and the url
|
||||
must include these options:
|
||||
|
||||
```
|
||||
;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE
|
||||
```
|
||||
|
||||
#### Examples
|
||||
|
||||
PostgreSQL:
|
||||
```
|
||||
url = "jdbc:postgresql://localhost:5432/docspelldb"
|
||||
```
|
||||
|
||||
MariaDB:
|
||||
```
|
||||
url = "jdbc:mariadb://localhost:3306/docspelldb"
|
||||
```
|
||||
|
||||
H2
|
||||
```
|
||||
url = "jdbc:h2:///path/to/a/file.db;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
|
||||
```
|
||||
|
||||
### Bind
|
||||
|
||||
The host and port the http server binds to. This applies to both
|
||||
components. The joex component also exposes a small REST api to
|
||||
inspect its state and notify the scheduler.
|
||||
|
||||
```
|
||||
docspell.server.bind {
|
||||
address = localhost
|
||||
port = 7880
|
||||
}
|
||||
docspell.joex.bind {
|
||||
address = localhost
|
||||
port = 7878
|
||||
}
|
||||
```
|
||||
|
||||
By default, it binds to `localhost` and some predefined port. This
|
||||
must be changed, if components are on different machines.
|
||||
|
||||
### baseurl
|
||||
|
||||
The base url is an important setting that defines the http URL where
|
||||
the corresponding component can be reached. It applies to both
|
||||
components. For a joex component, the url must be resolvable from a
|
||||
REST server component. The REST server also uses this url to create
|
||||
absolute urls and to configure the authenication cookie.
|
||||
|
||||
By default it is build using the information from the `bind` setting.
|
||||
|
||||
|
||||
```
|
||||
docspell.server.baseurl = ...
|
||||
docspell.joex.baseurl = ...
|
||||
```
|
||||
|
||||
#### Examples
|
||||
|
||||
```
|
||||
docspell.server.baseurl = "https://docspell.example.com"
|
||||
docspell.joex.baseurl = "http://192.168.101.10"
|
||||
```
|
||||
|
||||
|
||||
### app-id
|
||||
|
||||
The `app-id` is the identifier of the corresponding instance. It *must
|
||||
be unique* for all instances. By default the REST server uses `rest1`
|
||||
and joex `joex1`. It is recommended to overwrite this setting to have
|
||||
an explicit and stable identifier.
|
||||
|
||||
```
|
||||
docspell.server.app-id = "rest1"
|
||||
docspell.joex.app-id = "joex1"
|
||||
```
|
||||
|
||||
### registration options
|
||||
|
||||
This defines if and how new users can create accounts. There are 3
|
||||
options:
|
||||
|
||||
- *closed* no new user can sign up
|
||||
- *open* new users can sign up
|
||||
- *invite* new users can sign up but require an invitation key
|
||||
|
||||
This applies only to the REST sevrer component.
|
||||
|
||||
```
|
||||
docspell.server.signup {
|
||||
mode = "open"
|
||||
|
||||
# If mode == 'invite', a password must be provided to generate
|
||||
# invitation keys. It must not be empty.
|
||||
new-invite-password = ""
|
||||
|
||||
# If mode == 'invite', this is the period an invitation token is
|
||||
# considered valid.
|
||||
invite-time = "3 days"
|
||||
}
|
||||
```
|
||||
|
||||
The mode `invite` is intended to open the application only to some
|
||||
users. The admin can create these invitation keys and distribute them
|
||||
to the desired people. For this, the `new-invite-password` must be
|
||||
given. The idea is that only the person who installs docspell knows
|
||||
this. If it is not set, then invitation won't work. New invitation
|
||||
keys can be generated from within the web application or via REST
|
||||
calls (using `curl`, for example).
|
||||
|
||||
```
|
||||
curl -X POST -d '{"password":"blabla"}' "http://localhost:7880/api/v1/open/signup/newinvite"
|
||||
```
|
||||
|
||||
### Authentication
|
||||
|
||||
Authentication works in two ways:
|
||||
|
||||
- with an account-name / password pair
|
||||
- with an authentication token
|
||||
|
||||
The initial authentication must occur with an accountname/password
|
||||
pair. This will generate an authentication token which is valid for a
|
||||
some time. Subsequent calls to secured routes can use this token. The
|
||||
token can be given as a normal http header or via a cookie header.
|
||||
|
||||
These settings apply only to the REST server.
|
||||
|
||||
```
|
||||
docspell.server.auth {
|
||||
server-secret = "hex:caffee" # or "b64:Y2FmZmVlCg=="
|
||||
session-valid = "5 minutes"
|
||||
}
|
||||
```
|
||||
|
||||
The `server-secret` is used to sign the token. If multiple REST
|
||||
servers are deployed, all must share the same server secret. Otherwise
|
||||
tokens from one instance are not valid on another instance. The secret
|
||||
can be given as Base64 encoded string or in hex form. Use the prefix
|
||||
`hex:` and `b64:`, respectively.
|
||||
|
||||
The `session-valid` deterimens how long a token is valid. This can be
|
||||
just some minutes, the web application obtains new ones
|
||||
periodically. So a short time is recommended.
|
||||
|
||||
|
||||
## File Format
|
||||
|
||||
The format of the configuration files can be
|
||||
[HOCON](https://github.com/lightbend/config/blob/master/HOCON.md#hocon-human-optimized-config-object-notation),
|
||||
JSON or whatever the used [config
|
||||
library](https://github.com/lightbend/config) understands. The default
|
||||
values below are in HOCON format, which is recommended, since it
|
||||
allows comments and has some [advanced
|
||||
features](https://github.com/lightbend/config/blob/master/README.md#features-of-hocon). Please
|
||||
refer to their documentation for more on this.
|
||||
|
||||
Here are the default configurations.
|
||||
|
||||
|
||||
## Default Config
|
||||
|
||||
### Rest Server
|
||||
|
||||
```
|
||||
{% include server.conf %}
|
||||
```
|
||||
|
||||
### Joex
|
||||
|
||||
```
|
||||
{% include joex.conf %}
|
||||
```
|
||||
|
||||
## Logging
|
||||
|
||||
By default, docspell logs to stdout. This works well, when managed by
|
||||
systemd or other inits. Logging is done by
|
||||
[logback](https://logback.qos.ch/). Please refer to its documentation
|
||||
for how to configure logging.
|
||||
|
||||
If you created your logback config file, it can be added as argument
|
||||
to the executable using this syntax:
|
||||
|
||||
```
|
||||
/path/to/docspell -Dlogback.configurationFile=/path/to/your/logging-config-file
|
||||
```
|
||||
|
||||
To get started, the default config looks like this:
|
||||
|
||||
``` xml
|
||||
<configuration>
|
||||
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
|
||||
<withJansi>true</withJansi>
|
||||
|
||||
<encoder>
|
||||
<pattern>[%thread] %highlight(%-5level) %cyan(%logger{15}) - %msg %n</pattern>
|
||||
</encoder>
|
||||
</appender>
|
||||
|
||||
<logger name="docspell" level="debug" />
|
||||
<root level="INFO">
|
||||
<appender-ref ref="STDOUT" />
|
||||
</root>
|
||||
</configuration>
|
||||
```
|
||||
|
||||
The `<root level="INFO">` means, that only log statements with level
|
||||
"INFO" will be printed. But the `<logger name="docspell"
|
||||
level="debug">` above says, that for loggers with name "docspell"
|
||||
statements with level "DEBUG" will be printed, too.
|
77
modules/microsite/docs/doc/curate.md
Normal file
77
modules/microsite/docs/doc/curate.md
Normal file
@ -0,0 +1,77 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Find and Review
|
||||
---
|
||||
|
||||
# {{page.title}}
|
||||
|
||||
Curating the items meta data helps finding them later. This page
|
||||
describes how you can quickly go through those items and correct or
|
||||
amend with existing data.
|
||||
|
||||
## Select New items
|
||||
|
||||
After files have been uploaded and the job executor created the
|
||||
corresponding items, they will show up on the main page. All items,
|
||||
the job executor has created are initially marked as *New*. The option
|
||||
*only New* in the left search menu can be used to select only new
|
||||
items:
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-1.jpg">
|
||||
</div>
|
||||
|
||||
|
||||
## Check selected items
|
||||
|
||||
Then you can go through all new items and check their metadata: Click
|
||||
on the first item to open the detail view. This shows the documents
|
||||
and the meta data in the header.
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-2.jpg">
|
||||
</div>
|
||||
|
||||
|
||||
## Modify if necessary
|
||||
|
||||
To change something, click the *Edit* button in the menu above the
|
||||
document view. This will open a form next to your documents. You can
|
||||
compare the data with the documents and change as you like. Since the
|
||||
item status is *New*, you'll see the suggestions docspell found during
|
||||
processing. If there were multiple candidates, you can select another
|
||||
one by clicking its name in the suggestion list.
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-3.jpg">
|
||||
</div>
|
||||
|
||||
|
||||
When you change something in the form, it is immediatly applied. Only
|
||||
when changing text fields, a click on the *Save* symbol next to the
|
||||
field is required.
|
||||
|
||||
|
||||
## Confirm
|
||||
|
||||
If everything looks good, click the *Confirm* button to confirm the
|
||||
current data. The *New* status goes away and also the suggestions are
|
||||
hidden in this state. You can always go back by clicking the
|
||||
*Unconfirm* button.
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-5.jpg">
|
||||
</div>
|
||||
|
||||
|
||||
## Proceed with next item
|
||||
|
||||
To look at the next item in the search results, click the *Next*
|
||||
button in the menu (next to the *Edit* button). Clicking next, will
|
||||
keep the current view, so you can continue checking the data. If you
|
||||
are on the last item, the view switches to the listing view when
|
||||
clicking *Next*.
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-6.jpg">
|
||||
</div>
|
218
modules/microsite/docs/doc/install.md
Normal file
218
modules/microsite/docs/doc/install.md
Normal file
@ -0,0 +1,218 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Installation
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
This page contains detailed installation instructions. For a quick
|
||||
start, refer to [this page](../getit.html).
|
||||
|
||||
Docspell has been developed and tested on a GNU/Linux system. It may
|
||||
run on Windows and MacOS machines, too (ghostscript and tesseract are
|
||||
available on these systems). But I've never tried.
|
||||
|
||||
Docspell consists of two components that are started in separate
|
||||
processes:
|
||||
|
||||
1. *REST Server* This is the main application, providing the REST Api
|
||||
and the web application.
|
||||
2. *Joex* (job executor) This is the component that does the document
|
||||
processing.
|
||||
|
||||
They can run on multiple machines. All REST server and Joex instances
|
||||
should be on the same network. It is not strictly required that they
|
||||
can reach each other, but the components can then notify themselves
|
||||
about new or done work.
|
||||
|
||||
While this is possible, the simple setup is to start both components
|
||||
once on the same machine.
|
||||
|
||||
The [download page](https://github.com/eikek/docspell/releases)
|
||||
provides pre-compiled packages and the [development page](dev.html)
|
||||
contains build instructions.
|
||||
|
||||
|
||||
## Prerequisites
|
||||
|
||||
The two components have one prerequisite in common: they both require
|
||||
Java to run. While this is the only requirement for the *REST server*,
|
||||
the *Joex* components requires some more external programs.
|
||||
|
||||
### Java
|
||||
|
||||
Very often, Java is already installed. You can check this by opening a
|
||||
terminal and typing `java -version`. Otherwise install Java using your
|
||||
package manager or see [this site](https://adoptopenjdk.net/) for
|
||||
other options.
|
||||
|
||||
It is enough to install the JRE. The JDK is required, if you want to
|
||||
build docspell from source.
|
||||
|
||||
Docspell has been tested with Java version 1.8 (or sometimes referred
|
||||
to as JRE 8 and JDK 8, respectively). The pre-build packages are also
|
||||
build using JDK 8. But a later version of Java should work as well.
|
||||
|
||||
The next tools are only required on machines running the *Joex*
|
||||
component.
|
||||
|
||||
### External Tools for Joex
|
||||
|
||||
- [Ghostscript](http://pages.cs.wisc.edu/~ghost/) (the `gs` command)
|
||||
is used to extract/convert PDF files into images that are then fed
|
||||
to ocr. It is available on most GNU/Linux distributions.
|
||||
- [Unpaper](https://github.com/Flameeyes/unpaper) is a program that
|
||||
pre-processes images to yield better results when doing ocr. If this
|
||||
is not installed, docspell tries without it. However, it is
|
||||
recommended to install, because it [improves text
|
||||
extraction](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
|
||||
(at the expense of a longer runtime).
|
||||
- [Tesseract](https://github.com/tesseract-ocr/tesseract) is the tool
|
||||
doing the OCR (converts images into text). It is a widely used open
|
||||
source OCR engine. Tesseract 3 and 4 should work with docspell; you
|
||||
can adopt the command line in the configuration file, if necessary.
|
||||
|
||||
|
||||
### Example Debian
|
||||
|
||||
On Debian this should install all joex requirements:
|
||||
|
||||
``` bash
|
||||
sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper
|
||||
```
|
||||
|
||||
## Database
|
||||
|
||||
Both components must have access to a SQL database. Docspell has
|
||||
support these databases:
|
||||
|
||||
- PostreSQL
|
||||
- MariaDB
|
||||
- H2
|
||||
|
||||
The H2 database is an interesting option for personal and mid-size
|
||||
setups, as it requires no additional work. It is integrated into
|
||||
docspell and works really well. It is also configured as the default
|
||||
database.
|
||||
|
||||
For large installations, PostgreSQL or MariaDB is recommended. Create
|
||||
a database and a user with enough privileges (read, write, create
|
||||
table) to that database.
|
||||
|
||||
When using H2, make sure that all components access the same database
|
||||
– the jdbc url must point to the same file. Then, it is important to
|
||||
add the options
|
||||
`;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE` at the end
|
||||
of the url. See the [default config](configure.html) for an example.
|
||||
|
||||
|
||||
## Installing from ZIP files
|
||||
|
||||
After extracting the zip files, you'll find a start script in the
|
||||
`bin/` folder.
|
||||
|
||||
|
||||
## Installing from DEB packages
|
||||
|
||||
The DEB packages can be installed on Debian, or Debian based Distros:
|
||||
|
||||
``` bash
|
||||
$ sudo dpkg -i docspell*.deb
|
||||
```
|
||||
|
||||
Then the start scripts are in your `$PATH`. Run `docspell-restserver`
|
||||
or `docspell-joex` from a terminal window.
|
||||
|
||||
The packages come with a systemd unit file that will be installed to
|
||||
autostart the services.
|
||||
|
||||
|
||||
## Running
|
||||
|
||||
Run the start script (in the corresponding `bin/` directory when using
|
||||
the zip files):
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver
|
||||
$ ./docspell-joex*/bin/docspell-joex
|
||||
```
|
||||
|
||||
This will startup both components using the default configuration. The
|
||||
configuration should be adopted to your needs. For example, the
|
||||
database connection is configured to use a H2 database in the `/tmp`
|
||||
directory. Please refer to the [configuration page](configure.html)
|
||||
for how to create a custom config file. Once you have your config
|
||||
file, simply pass it as argument to the command:
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver /path/to/server-config.conf
|
||||
$ ./docspell-joex*/bin/docspell-joex /path/to/joex-config.conf
|
||||
```
|
||||
|
||||
After starting the rest server, you can reach the web application at
|
||||
path `/app/index.html`, so using default values it would be
|
||||
`http://localhost:7880/app/index.html`.
|
||||
|
||||
You should be able to create a new account and sign in. Check the
|
||||
[configuration page](configure.html) to further customize docspell.
|
||||
|
||||
|
||||
### Options
|
||||
|
||||
The start scripts support some options to configure the JVM. One often
|
||||
used setting is the maximum heap size of the JVM. By default, java
|
||||
determines it based on properties of the current machine. You can
|
||||
specify it by given java startup options to the command:
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -- /path/to/server-config.conf
|
||||
```
|
||||
|
||||
This would limit the maximum heap to 1GB. The double slash separates
|
||||
internal options and the arguments to the program. Another frequently
|
||||
used option is to change the default temp directory. Usually it is
|
||||
`/tmp`, but it may be desired to have a dedicated temp directory,
|
||||
which can be configured:
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -Djava.io.tmpdir=/path/to/othertemp -- /path/to/server-config.conf
|
||||
```
|
||||
|
||||
The command:
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver -h
|
||||
```
|
||||
|
||||
gives an overview of supported options.
|
||||
|
||||
|
||||
## Raspberry Pi, and similiar
|
||||
|
||||
Both component can run next to each other on a raspberry pi or
|
||||
similiar device.
|
||||
|
||||
|
||||
### REST Server
|
||||
|
||||
The REST server component runs very well on the Raspberry Pi and
|
||||
similiar devices. It doesn't require much resources, because the heavy
|
||||
work is done by the joex components.
|
||||
|
||||
|
||||
### Joex
|
||||
|
||||
Running the joex component on the Raspberry Pi is possible, but will
|
||||
result in long processing times. Tested on a RPi model 3 (4 cores, 1G
|
||||
RAM) processing a PDF (scanned with 300dpi) with two pages took
|
||||
9:52. You can speed it up considerably by uninstalling the `unpaper`
|
||||
command, because this step takes quite long. This, of course, reduces
|
||||
the quality of OCR. But without `unpaper` the same sample pdf was then
|
||||
processed in 1:24, a speedup of 8 minutes.
|
||||
|
||||
You should limit the joex pool size to 1 and, depending on your model
|
||||
and the amount of RAM, set a heap size of at least 500M
|
||||
(`-J-Xmx500M`).
|
||||
|
||||
For personal setups, when you don't need the processing results asap,
|
||||
this can work well enough.
|
155
modules/microsite/docs/doc/joex.md
Normal file
155
modules/microsite/docs/doc/joex.md
Normal file
@ -0,0 +1,155 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Joex
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
Joex is short for *Job Executor* and it is the component managing long
|
||||
running tasks in docspell. One of these long running tasks is the file
|
||||
processing task.
|
||||
|
||||
One joex component handles the processing of all files of all
|
||||
collectives/users. It requires much more resources than the rest
|
||||
server component. Therefore the number of jobs that can run in
|
||||
parallel is limited with respect to the hardware it is running on.
|
||||
|
||||
For larger installations, it is probably better to run several joex
|
||||
components on different machines. That works out of the box, as long
|
||||
as all components point to the same database and use different
|
||||
`app-id`s (see [configuring docspell](./configure.html)).
|
||||
|
||||
When files are submitted to docspell, they are stored in the database
|
||||
and all known joex components are notified about new work. Then they
|
||||
compete on getting the next job from the queue. After a job finishes
|
||||
and no job is waiting in the queue, joex will sleep until notified
|
||||
again. It will also periodically notify itself as a fallback.
|
||||
|
||||
## Scheduler and Queue
|
||||
|
||||
The scheduler is the part that runs and monitors the long running
|
||||
jobs. It works together with the job queue, which defines what job to
|
||||
take next.
|
||||
|
||||
To create a somewhat fair distribution among multiple collectives, a
|
||||
collective is first chosen in a simple round-robin way. Then a job
|
||||
from this collective is chosen by priority.
|
||||
|
||||
There are only two priorities: low and high. A simple *counting
|
||||
scheme* determines if a low prio or high prio job is selected
|
||||
next. The default is `4, 1`, meaning to first select 4 high priority
|
||||
jobs and then 1 low priority job, then starting over. If no such job
|
||||
exists, its falls back to the other priority.
|
||||
|
||||
The priority can be set on a *Source* (see
|
||||
[uploads](uploading.html)). Uploading through the web application will
|
||||
always use priority *high*. The idea is that while logged in, jobs are
|
||||
more important that those submitted when not logged in.
|
||||
|
||||
|
||||
## Scheduler Config
|
||||
|
||||
The relevant part of the config file regarding the scheduler is shown
|
||||
below with some explanations.
|
||||
|
||||
```
|
||||
docspell.joex {
|
||||
# other settings left out for brevity
|
||||
|
||||
scheduler {
|
||||
|
||||
# Number of processing allowed in parallel.
|
||||
pool-size = 2
|
||||
|
||||
# A counting scheme determines the ratio of how high- and low-prio
|
||||
# jobs are run. For example: 4,1 means run 4 high prio jobs, then
|
||||
# 1 low prio and then start over.
|
||||
counting-scheme = "4,1"
|
||||
|
||||
# How often a failed job should be retried until it enters failed
|
||||
# state. If a job fails, it becomes "stuck" and will be retried
|
||||
# after a delay.
|
||||
retries = 5
|
||||
|
||||
# The delay until the next try is performed for a failed job. This
|
||||
# delay is increased exponentially with the number of retries.
|
||||
retry-delay = "1 minute"
|
||||
|
||||
# The queue size of log statements from a job.
|
||||
log-buffer-size = 500
|
||||
|
||||
# If no job is left in the queue, the scheduler will wait until a
|
||||
# notify is requested (using the REST interface). To also retry
|
||||
# stuck jobs, it will notify itself periodically.
|
||||
wakeup-period = "30 minutes"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The `pool-size` setting deterimens how many jobs run in parallel. You
|
||||
need to play with this setting on your machine to find an optimal
|
||||
value.
|
||||
|
||||
The `counting-scheme` determines for all collectives how to select
|
||||
between high and low priority jobs; as explained above. It is
|
||||
currently not possible to define that per collective.
|
||||
|
||||
If a job fails, it will be set to *stuck* state and retried by the
|
||||
scheduler. The `retries` setting defines how many times a job is
|
||||
retried until it enters the final *failed* state. The scheduler waits
|
||||
some time until running the next try. This delay is given by
|
||||
`retry-delay`. This is the initial delay, the time until the first
|
||||
re-try (the second attempt). This time increases exponentially with
|
||||
the number of retries.
|
||||
|
||||
The jobs will log about what they do, which is picked up and stored
|
||||
into the database asynchronously. The log events are buffered in a
|
||||
queue and another thread will consume this queue and store them in the
|
||||
database. The `log-buffer-size` determines the size of the queue.
|
||||
|
||||
At last, there is a `wakeup-period` that determines at what interval
|
||||
the joex component notifies itself to look for new jobs. If jobs get
|
||||
stuck, and joex is not notified externally it could miss to
|
||||
retry. Also, since networks are not reliable, a notification may not
|
||||
reach a joex component. This periodic wakup is just to ensure that
|
||||
jobs are eventually run.
|
||||
|
||||
|
||||
## Starting on demand
|
||||
|
||||
The job executor and rest server can be started multiple times. This
|
||||
is especially useful for the job executor. For example, when
|
||||
submitting a lot of files in a short time, you can simply startup more
|
||||
job executors on other computers on your network. Maybe use your
|
||||
laptop to help with processing for a while.
|
||||
|
||||
You have to make sure, that all connect to the same database, and that
|
||||
all have unique `app-id`s.
|
||||
|
||||
Once the files have been processced you can stop the additional
|
||||
executors.
|
||||
|
||||
## Shutting down
|
||||
|
||||
If a job executor is sleeping and not executing any jobs, you can just
|
||||
quit using SIGTERM or `Ctrl-C` when running in a terminal. But if
|
||||
there are jobs currently executing, it is advisable to initiate a
|
||||
graceful shutdown. The job executor will then stop taking new jobs
|
||||
from the queue but it will wait until all running jobs have completed
|
||||
before shutting down.
|
||||
|
||||
This can be done by sending a http POST request to the api of this job
|
||||
executor:
|
||||
|
||||
```
|
||||
curl -XPOST "http://localhost:7878/api/v1/shutdownAndExit"
|
||||
```
|
||||
|
||||
If joex receives this request it will immediately stop taking new jobs
|
||||
and it will quit when all running jobs are done.
|
||||
|
||||
If a job executor gets terminated while there are running jobs, the
|
||||
jobs are still in the current state marked to be executed by this job
|
||||
executor. In order to fix this, start the job executor again. It will
|
||||
search all jobs that are marked with its id and put them back into
|
||||
waiting state. Then send a graceful shutdown request as shown above.
|
87
modules/microsite/docs/doc/metadata.md
Normal file
87
modules/microsite/docs/doc/metadata.md
Normal file
@ -0,0 +1,87 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Adding Meta Data
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
## Meta Data
|
||||
|
||||
The processing can be controlled implicitely by the provided meta
|
||||
data. The *Meta Data* page allows to manage this meta data. You can
|
||||
create the following:
|
||||
|
||||
- Tags
|
||||
- Organizations
|
||||
- Persons
|
||||
- Equipments
|
||||
|
||||
### Tags
|
||||
|
||||
Items can be tagged with multiple custom tags (aka labels). This
|
||||
allows to describe many different workflows people may have with their
|
||||
documents.
|
||||
|
||||
A tag can have a *category*. This is meant to group tags together. For
|
||||
example, you may want to have a tag category *doctype* that is
|
||||
comprised of tags like *bill*, *contract*, *receipt* and so on. Or for
|
||||
workflows, a tag category *state* may exist that includes tags like
|
||||
*Todo* or *Waiting*. Or you can tag items with user names to provide
|
||||
"assignment" semantics. Docspell doesn't propose any workflow, but it
|
||||
can help to implement some.
|
||||
|
||||
The tags are *not* taken into account when processing. Docspell will
|
||||
not automatically associate tags to your items. The tags are only
|
||||
meant to be used manually.
|
||||
|
||||
|
||||
### Organization and Person
|
||||
|
||||
The organization entity represents an non-personal (organization or
|
||||
company) correspondent of an item. Docspell will choose one or more
|
||||
organizations when processing documents and associate the "best" match
|
||||
with your item.
|
||||
|
||||
The person entitiy can appear in two roles: It may be a correspondent
|
||||
or the person an item is about. So a person is either a correspondent
|
||||
or a concerning person. Docspell can not know which person is which,
|
||||
therefore you need to tell this by checking the box "Use for
|
||||
concerning person suggestion only". If this is checked, docspell will
|
||||
use this person only to suggest a concerning person. Otherwise the
|
||||
person is used only for correspondent suggestions.
|
||||
|
||||
Document processing uses the following properties:
|
||||
|
||||
- name
|
||||
- websites
|
||||
- e-mails
|
||||
|
||||
The website an e-mails can be added as contact information. If these
|
||||
three are present, you should get good matches from docspell. All
|
||||
other fields of an organization and person are not used during
|
||||
document processing. They might be useful when using this as a real
|
||||
address book.
|
||||
|
||||
|
||||
### Equipment
|
||||
|
||||
The equipment entity is almost like a tag. In fact, it could be
|
||||
replaced by a tag with a specific known category. The difference is
|
||||
that docspell will try to find a match and associate it with your
|
||||
item. The equipment represents non-personal things that an item is
|
||||
about. Examples are: bills or insurances for *cars*, contracts for
|
||||
*houses* or *flats*.
|
||||
|
||||
Equipments don't have contact information, so the only property that
|
||||
is used to find matches during document processing is its name.
|
||||
|
||||
|
||||
## Document Language
|
||||
|
||||
An important setting is the language of your documents. This helps OCR
|
||||
and text analysis. You can select between English and German
|
||||
currently.
|
||||
|
||||
Go to the *Collective Settings* page and click *Document
|
||||
Language*. This will set the lanugage for all your documents. It is
|
||||
not (yet) possible to specify it when uploading.
|
40
modules/microsite/docs/doc/processing.md
Normal file
40
modules/microsite/docs/doc/processing.md
Normal file
@ -0,0 +1,40 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Processing Queue
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
|
||||
The page *Processing Queue* shows the current state of document
|
||||
processing for your uploads.
|
||||
|
||||
At the top of the page a list of running jobs is shown. Below that,
|
||||
the left column shows jobs that wait to be picked up by the job
|
||||
executor. On the right are finished jobs. The number of finished jobs
|
||||
is cut to some maximum and is also restricted by a date range. The
|
||||
page refreshes itself automatically to show the progress.
|
||||
|
||||
Example screenshot:
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/processing-queue.jpg">
|
||||
</div>
|
||||
|
||||
You can cancel running jobs or remove waiting ones from the queue. If
|
||||
you click on the small file symbol on finished jobs, you can inspect
|
||||
its log messages again. A running job displays the job executor id
|
||||
that executes the job.
|
||||
|
||||
Currently the job queue executes just the document processing tasks,
|
||||
but it may be used for other long running tasks in the future.
|
||||
|
||||
Since job executors are shared among all collectives, it may happen
|
||||
that a job is some time waiting until it is picked up by a job
|
||||
executor. You can always start more job executors to help out.
|
||||
|
||||
If a job fails, it is retried after some time. Only if it fails too
|
||||
often (can be configured), it then is finished with *failed* state. If
|
||||
processing finally fails, the item is still created, just without
|
||||
suggestions. But if processing is cancelled by the user, the item is
|
||||
not created.
|
187
modules/microsite/docs/doc/tools.md
Normal file
187
modules/microsite/docs/doc/tools.md
Normal file
@ -0,0 +1,187 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Tools
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
The `tools/` folder contains some scripts and other resources intented
|
||||
for integrating docspell.
|
||||
|
||||
## consumedir
|
||||
|
||||
The `consumerdir.sh` is a bash script that works in two modes:
|
||||
|
||||
- Go through all files in given directories (non recursively) and sent
|
||||
each to docspell.
|
||||
- Watch one or more directories for new files and upload them to
|
||||
docspell.
|
||||
|
||||
It can watch or go through one or more directories. Files can be
|
||||
uploaded to multiple urls.
|
||||
|
||||
Run the script with the `-h` option, to see a short help text. The
|
||||
help text will also show the values for any given option.
|
||||
|
||||
The script requires `curl` for uploading. It requires the
|
||||
`inotifywait` command if directories should be watched for new
|
||||
files. If the `-m` option is used, the script will skip duplicate
|
||||
files. For this the `sha256sum` command is required.
|
||||
|
||||
Example for watching two directories:
|
||||
|
||||
``` bash
|
||||
./tools/consumedir.sh --path ~/Downloads --path ~/pdfs -m /var/run/consumedir -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
```
|
||||
|
||||
The script by default watches the given directories. If the `-o`
|
||||
option is used, it will instead go through these directories and
|
||||
upload all pdf files in there.
|
||||
|
||||
Example for uploading all immediatly (the same as above only with `-o`
|
||||
added):
|
||||
|
||||
``` bash
|
||||
./tools/consumedir.sh -o --path ~/Downloads --path ~/pdfs/ -m /var/run/consumedir -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
```
|
||||
|
||||
|
||||
### Systemd
|
||||
|
||||
The script can be used with systemd to run as a service. This is an
|
||||
example unit file:
|
||||
|
||||
```
|
||||
[Unit]
|
||||
After=networking.target
|
||||
Description=Docspell Consumedir
|
||||
|
||||
[Service]
|
||||
Environment="PATH=/set/a/path"
|
||||
|
||||
ExecStartPre=mkdir -p /var/run/consumedir && chown -R someuser /var/run/consumedir
|
||||
ExecStart=/bin/su -s /bin/bash someuser -c "consumedir.sh --path '/a/path/' -m '/var/run/consumedir' 'http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ'"
|
||||
```
|
||||
|
||||
This unit file is just an example, it needs some fiddling. It assumes
|
||||
an existing user `someuser` that is used to run this service. The url
|
||||
`http://localhost:7880/api/v1/open/upload/...` is an anonymous upload
|
||||
url as described [here](./uploading.html).
|
||||
|
||||
|
||||
## ds.sh
|
||||
|
||||
A bash script to quickly upload files from the command line. It reads
|
||||
a configuration file containing the URLs to upload to. Then each file
|
||||
given to the script will be uploaded to al URLs in the config.
|
||||
|
||||
The config file is expected in
|
||||
`$XDG_CONFIG_HOME/docspell/ds.conf`. `$XDG_CONFIG_HOME` defaults to
|
||||
`~/.config`.
|
||||
|
||||
The config file contains lines with key-value pairs, separated by an
|
||||
`=` sign. Lines starting with `#` are ignored. Example:
|
||||
|
||||
```
|
||||
# Config file
|
||||
url.1 = http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
url.2 = http://localhost:7880/api/v1/open/upload/item/6DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
```
|
||||
|
||||
The key must start with `url`.
|
||||
|
||||
### Usage
|
||||
|
||||
The `-h` option shows a help overview.
|
||||
|
||||
The script takes a list of files as arguments. It checks the file
|
||||
types and will raise an error (and quit) if a file is included that is
|
||||
not a PDF. The `-s` option can be used to skip them instead.
|
||||
|
||||
The `-c` option allows to specifiy a different config file.
|
||||
|
||||
Example:
|
||||
|
||||
``` bash
|
||||
./ds.sh ~/Downloads/*.pdf
|
||||
```
|
||||
|
||||
|
||||
## Webextension for Docspell
|
||||
|
||||
Idea: Inside the browser click on a PDF and send it to docspell. It is
|
||||
downloaded in the context of your current page. Then handed to an
|
||||
application that pushes it to docspell. There is a browser add-on
|
||||
implementing this in `tools/webextension`. This add-on only works with
|
||||
firefox.
|
||||
|
||||
### Install
|
||||
|
||||
This is a bit complicated, since you need to install external tools
|
||||
and the web extension. Both work together.
|
||||
|
||||
#### Install `ds.sh`
|
||||
|
||||
First install the `ds.sh` tool somewhere, maybe `/usr/local/bin` as
|
||||
described above.
|
||||
|
||||
|
||||
#### Install the native part
|
||||
|
||||
Then install the "native" part of the web extension:
|
||||
|
||||
Copy or symlink the `native.py` script into some known location. For
|
||||
example:
|
||||
|
||||
``` bash
|
||||
ln -s ~/docspell-checkout/tools/webextension/native/native.py /usr/local/share/docspell/native.py
|
||||
```
|
||||
|
||||
Then copy the `app_manifest.json` to
|
||||
`$HOME/.mozilla/native-messaging-hosts/docspell.json`. For example:
|
||||
|
||||
``` bash
|
||||
cp ~/docspell-checkout/tools/webextension/native/app_manifest.json ~/.mozilla/native-messaging-hosts/docspell.json
|
||||
```
|
||||
|
||||
See
|
||||
[here](https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/Native_manifests#Manifest_location)
|
||||
for details.
|
||||
|
||||
And you might want to modify this json file, so the path to the
|
||||
`native.py` script is correct (it must be absolute).
|
||||
|
||||
If the `ds.sh` script is in your `$PATH`, then this should
|
||||
work. Otherwise, edit the `native.py` script and change the path to
|
||||
the tool. Or create a file `$HOME/.config/docspell/ds.cmd` whose
|
||||
content is the path to the `ds.sh` script.
|
||||
|
||||
|
||||
#### Install the extension
|
||||
|
||||
An extension file can be build using the `make-xpi.sh` script. But
|
||||
installing it in "standard" firefox won't work, because [Mozilla
|
||||
requires extensions to be signed by
|
||||
them](https://wiki.mozilla.org/Add-ons/Extension_Signing). This means
|
||||
creating an account and going through some process…. So here are two
|
||||
alternatives:
|
||||
|
||||
1. Open firefox and type `about:debugging` in the addressbar. Then
|
||||
click on *'Load Temporary Add-on...'* and select the
|
||||
`manifest.json` file. The extension is now installed. The downside
|
||||
is, that the extension will be removed once firefox is closed.
|
||||
2. Use Firefox ESR, which allows to install Add-ons not signed by
|
||||
Mozilla. But it has to be configured: Open firefox and type
|
||||
`about:config` in the address bar. Search for key
|
||||
`xpinstall.signatures.required` and set it to `false`. This is
|
||||
described on the last paragraph on [this
|
||||
page](https://support.mozilla.org/en-US/kb/add-on-signing-in-firefox).
|
||||
|
||||
When you right click on a file link, there should be a context menu
|
||||
entry *'Docspell Upload Helper'*. The add-on will download this file
|
||||
using the browser and then send the file path to the `native.py`
|
||||
script. This script will in turn call `ds.sh` which finally uploads it
|
||||
to your configured URLs.
|
||||
|
||||
Open the Add-ons page (`Ctrl`+`Shift`+`A`), the new add-on should be
|
||||
there.
|
130
modules/microsite/docs/doc/uploading.md
Normal file
130
modules/microsite/docs/doc/uploading.md
Normal file
@ -0,0 +1,130 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Uploads
|
||||
---
|
||||
|
||||
# {{page.title}}
|
||||
|
||||
|
||||
This page describes, how files can get into docspell. Technically,
|
||||
there is just one way: via http multipart/form-data requests.
|
||||
|
||||
|
||||
## Authenticated Upload
|
||||
|
||||
From within the web application there is the "Upload Files"
|
||||
page. There you can select multiple files to upload. You can also
|
||||
specify whether these files should become one item or if every file is
|
||||
a separate item.
|
||||
|
||||
When you click "Submit" the files are uploaded and stored in the
|
||||
database. Then the job executor(s) are notified which immediately
|
||||
start processing them.
|
||||
|
||||
Go to the top-right menu and click "Processing Queue" to see the
|
||||
current state.
|
||||
|
||||
This obviously requires an authenticated user. While this is handy for
|
||||
ad-hoc uploads, it is very inconvenient for automating it by custom
|
||||
scripts. For this the next variant exists.
|
||||
|
||||
## Anonymous Upload
|
||||
|
||||
It is also possible to upload files without authentication. This
|
||||
should make tools that interact with docspell much easier to write.
|
||||
|
||||
|
||||
### Creating Anonymous Uploads
|
||||
|
||||
Go to "Collective Settings" and then to the "Source" tab. A *Source*
|
||||
identifies an endpoint where files can be uploaded
|
||||
anonymously. Creating a new source creates a long unique id which is
|
||||
part on an url that can be used to upload files. You can choose any
|
||||
time to deactivate or delete the source at which point uploading is
|
||||
not possible anymore. The idea is to give this URL away safely. You
|
||||
can delete it any time and no passwords or secrets are visible, even
|
||||
your username is not visible.
|
||||
|
||||
Example screenshot:
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/sources-form.jpg">
|
||||
</div>
|
||||
|
||||
This example shows a source with name "test". It defines two urls:
|
||||
|
||||
- `/app/index.html#/upload/<id>`
|
||||
- `/api/v1/open/upload/item/<id>`
|
||||
|
||||
The first points to a web page where everyone could upload files into
|
||||
your account. You could give this url to people for sending files
|
||||
directly into your docspell.
|
||||
|
||||
The second url is the API url, which accepts the requests to upload
|
||||
files (which is used by the first url).
|
||||
|
||||
For example, this url can be used to upload files with curl:
|
||||
|
||||
``` bash
|
||||
$ curl -XPOST -F file=@test.pdf http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
{"success":true,"message":"Files submitted."}
|
||||
```
|
||||
|
||||
You could add more `-F file=@/path/to/your/file.pdf` to upload
|
||||
multiple files (note, the `@` is required by curl, so it knows that
|
||||
the following is a file).
|
||||
|
||||
When files are uploaded to an source endpoint, the items resulting
|
||||
from this uploads are marked with the name of the source. So you know
|
||||
which source an item originated.
|
||||
|
||||
If files are uploaded using the web applications *Upload files* page,
|
||||
the source is implicitly set to `webapp`. If you also want to let
|
||||
docspell count the files uploaded through the web interface, just
|
||||
create a source (can be inactive) with that name (`webapp`).
|
||||
|
||||
|
||||
## The Request
|
||||
|
||||
This gives more details about the request for uploads. It is a http
|
||||
`multipart/form-data` request, with two possible fields:
|
||||
|
||||
- meta
|
||||
- file
|
||||
|
||||
The `file` field can appear multiple times and is required at least
|
||||
once. It is the part containing the file to upload.
|
||||
|
||||
The `meta` part is completely optional and can define additional meta
|
||||
data, that docspell uses to create items from the given files. It
|
||||
allows to transfer structured information together with the
|
||||
unstructured binary files.
|
||||
|
||||
The `meta` content must be `application/json` containing this
|
||||
structure:
|
||||
|
||||
```
|
||||
{ multiple: Bool
|
||||
, direction: Maybe String
|
||||
}
|
||||
```
|
||||
|
||||
The `multiple` property is by default `true`. It means that each file
|
||||
in the upload request corresponds to a single item. An upload with 5
|
||||
files will result in 5 items created. If it is `false`, then docspell
|
||||
will create just one item, that will then contain all files.
|
||||
|
||||
Furthermore, the direction of the document (one of `incoming` or
|
||||
`outgoing`) can be given. It is optional, it can be left out or
|
||||
`null`.
|
||||
|
||||
This kind of request is very common and most programming languages
|
||||
have support for this. For example, here is another curl command
|
||||
uploading two files with meta data:
|
||||
|
||||
```
|
||||
curl -XPOST -F meta='{"multiple":false, "direction": "outgoing"}' \
|
||||
-F file=@letter-en-source.pdf \
|
||||
-F file=@letter-de-source.pdf \
|
||||
http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
```
|
Reference in New Issue
Block a user