mirror of
https://github.com/TheAnachronism/docspell.git
synced 2025-06-22 02:18:26 +00:00
Upgrade microsite
91
modules/microsite/docs/api.md
Normal file
@ -0,0 +1,91 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Api
|
||||
---
|
||||
|
||||
# {{page.title}}
|
||||
|
||||
Docspell is designed as a REST server that uses JSON to exchange
|
||||
data. The REST api can be used to integrate docspell into your
|
||||
workflow.
|
||||
|
||||
[Docspell REST Api Doc](openapi/docspell-openapi.html)
|
||||
|
||||
The "raw" `openapi.yml` specification file can be found
|
||||
[here](openapi/docspell-openapi.yml).
|
||||
|
||||
The routes can be divided into protected and unprotected routes. The
|
||||
unprotected, or open, routes are at `/open/*` while the protected
|
||||
routes are at `/sec/*`. Open routes don't require authenticated access
|
||||
and can be used by any user. The protected routes require an
|
||||
authenticated user.
|
||||
|
||||
## Authentication
|
||||
|
||||
The unprotected route `/open/auth/login` can be used to log in with
|
||||
account name and password. The response contains a token that can be
|
||||
used for accessing protected routes. The token is only valid for a
|
||||
restricted time which can be configured (default is 5 minutes).
|
||||
|
||||
New tokens can be generated using an existing valid token and the
|
||||
protected route `/sec/auth/session`. This will return the same
|
||||
response as above, giving a new token.
|
||||
|
||||
This token can be added to requests in two ways: as a cookie header or
|
||||
a "normal" http header. If a cookie header is used, the cookie name
|
||||
must be `docspell_auth` and a custom header must be named
|
||||
`X-Docspell-Auth`.
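
As a minimal sketch of these two options, a client could build its request headers as shown here. The helper name `auth_headers` is hypothetical and not part of docspell; only the cookie name and the header name are taken from the text above.

```python
def auth_headers(token, use_cookie=False):
    """Return HTTP headers that carry a docspell auth token.

    Hypothetical helper for illustration; only the names
    `docspell_auth` and `X-Docspell-Auth` come from the docs.
    """
    if use_cookie:
        # Cookie variant: the cookie must be named `docspell_auth`.
        return {"Cookie": "docspell_auth=" + token}
    # Custom-header variant.
    return {"X-Docspell-Auth": token}


# The token value is a placeholder for the one returned by the login route.
headers = auth_headers("<token from the login response>")
```

Either dictionary can be passed to whatever HTTP client is in use; the header variant corresponds to the `-H 'X-Docspell-Auth: ...'` option in the curl examples.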
|
||||
|
||||
## Live Api
|
||||
|
||||
Besides the statically generated documentation at this site, the rest
|
||||
server provides swagger-generated api documentation that allows
|
||||
playing around with the api. It requires a running docspell rest
|
||||
server. If it is deployed at `http://localhost:7880`, then check this
|
||||
url:
|
||||
|
||||
```
|
||||
http://localhost:7880/app/doc
|
||||
```
|
||||
|
||||
## Examples
|
||||
|
||||
These examples use the great command line tool
|
||||
[curl](https://curl.haxx.se/).
|
||||
|
||||
### Login
|
||||
|
||||
```
|
||||
$ curl -X POST -d '{"account": "smith", "password": "test"}' http://localhost:7880/api/v1/open/auth/login
|
||||
{"collective":"smith"
|
||||
,"user":"smith"
|
||||
,"success":true
|
||||
,"message":"Login successful"
|
||||
,"token":"1568142350115-ZWlrZS9laWtl-$2a$10$rGZUFDAVNIKh4Tj6u6tlI.-O2euwCvmBT0TlyDmIHR1ZsLQPAI="
|
||||
,"validMs":300000
|
||||
}
|
||||
```
|
||||
|
||||
### Get new token
|
||||
|
||||
```
|
||||
$ curl -XPOST -H 'X-Docspell-Auth: 1568142350115-ZWlrZS9laWtl-$2a$10$rGZUFDAVNIKh4Tj6u6tlI.-O2euwCvmBT0TlyDmIHR1ZsLQPAI=' http://localhost:7880/api/v1/sec/auth/session
|
||||
{"collective":"smith"
|
||||
,"user":"smith"
|
||||
,"success":true
|
||||
,"message":"Login successful"
|
||||
,"token":"1568142446077-ZWlrZS9laWtl-$2a$10$3B0teJ9rMpsBJPzHfZZPoO-WeA1bkfEONBN8fyzWE8DeaAHtUc="
|
||||
,"validMs":300000
|
||||
}
|
||||
```
|
||||
|
||||
### Get some insights
|
||||
|
||||
```
|
||||
$ curl -H 'X-Docspell-Auth: 1568142446077-ZWlrZS9laWtl-$2a$10$3B0teJ9rMpsBJPzHfZZPoO-WeA1bkfEONBN8fyzWE8DeaAHtUc=' http://localhost:7880/api/v1/sec/collective/insights
|
||||
{"incomingCount":3
|
||||
,"outgoingCount":1
|
||||
,"itemSize":207310
|
||||
,"tagCloud":{"items":[]}
|
||||
}
|
||||
```
|
10
modules/microsite/docs/demo.md
Normal file
@ -0,0 +1,10 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Demo
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
|
||||
|
||||
<img width="100%" src="img/docspell-demo.gif" title="Demo">
|
86
modules/microsite/docs/dev.md
Normal file
@ -0,0 +1,86 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Development
|
||||
---
|
||||
|
||||
|
||||
# {{page.title}}
|
||||
|
||||
|
||||
## Building
|
||||
|
||||
[Sbt](https://scala-sbt.org) is used to build the application. Clone
|
||||
the sources and run:
|
||||
|
||||
- `make` to compile all sources (Elm + Scala)
|
||||
- `make-zip` to create zip packages
|
||||
- `make-deb` to create debian packages
|
||||
|
||||
The zip files can be found afterwards in:
|
||||
|
||||
```
|
||||
modules/restserver/target/universal
|
||||
modules/joex/target/universal
|
||||
```
|
||||
|
||||
|
||||
## Starting Servers with `reStart`
|
||||
|
||||
When developing, it's very convenient to use the [revolver sbt
|
||||
plugin](https://github.com/spray/sbt-revolver). Start the sbt console
|
||||
and then run:
|
||||
|
||||
```
|
||||
sbt:docspell-root> restserver/reStart
|
||||
```
|
||||
|
||||
This starts a REST server. Once it has started up, type:
|
||||
|
||||
```
|
||||
sbt:docspell-root> joex/reStart
|
||||
```
|
||||
|
||||
if a joex component is also required. Prefixing the commands with `~`
|
||||
results in recompile+restart once a source file is modified.
|
||||
|
||||
|
||||
## Custom config file
|
||||
|
||||
The sbt build is set up such that a file `dev.conf` in the root of the
|
||||
source tree is picked up as config file, if it exists. So you can
|
||||
create a custom config file for development. For example, a custom
|
||||
database for development may be set up this way:
|
||||
|
||||
```
|
||||
#jdbcurl = "jdbc:h2:///home/dev/workspace/projects/docspell/local/docspell-demo.db;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
|
||||
jdbcurl = "jdbc:postgresql://localhost:5432/docspelldev"
|
||||
#jdbcurl = "jdbc:mariadb://localhost:3306/docspelldev"
|
||||
|
||||
docspell.server {
|
||||
backend {
|
||||
jdbc {
|
||||
url = ${jdbcurl}
|
||||
user = "dev"
|
||||
password = "dev"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
docspell.joex {
|
||||
jdbc {
|
||||
url = ${jdbcurl}
|
||||
user = "dev"
|
||||
password = "dev"
|
||||
}
|
||||
scheduler {
|
||||
pool-size = 1
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## ADRs
|
||||
|
||||
Some early information about certain details can be found in the few
|
||||
[ADR](https://adr.github.io/) that exist:
|
||||
|
||||
- [ADRs](dev/adr.html)
|
12
modules/microsite/docs/dev/adr.md
Normal file
@ -0,0 +1,12 @@
|
||||
---
|
||||
layout: docs
|
||||
title: ADRs
|
||||
---
|
||||
|
||||
# ADR
|
||||
|
||||
- [0001 Components](adr/0001_components.html)
|
||||
- [0002 Component Interaction](adr/0002_component_interaction.html)
|
||||
- [0003 Encryption](adr/0003_encryption.html)
|
||||
- [0004 ISO8601 vs Unix](adr/0004_iso8601vsEpoch.html)
|
||||
- [0005 Job Executor](adr/0005_job-executor.html)
|
@ -0,0 +1,33 @@
|
||||
# Use Markdown Architectural Decision Records
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
We want to [record architectural decisions](https://adr.github.io/)
|
||||
made in this project. Which format and structure should these records
|
||||
follow?
|
||||
|
||||
## Considered Options
|
||||
|
||||
* [MADR](https://adr.github.io/madr/) 2.1.0 - The Markdown Architectural Decision Records
|
||||
* [Michael Nygard's template](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions) - The first incarnation of the term "ADR"
|
||||
* [Sustainable Architectural
|
||||
Decisions](https://www.infoq.com/articles/sustainable-architectural-design-decisions) -
|
||||
The Y-Statements
|
||||
* Other templates listed at
|
||||
<https://github.com/joelparkerhenderson/architecture_decision_record>
|
||||
* Formless - No conventions for file format and structure
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
Chosen option: "MADR 2.1.0", because
|
||||
|
||||
* Implicit assumptions should be made explicit. Design documentation
|
||||
is important to enable people to understand the decisions later on.
|
||||
See also [A rational design process: How and why to fake
|
||||
it](https://doi.org/10.1109/TSE.1986.6312940).
|
||||
* The MADR format is lean and fits our development style.
|
||||
* The MADR structure is comprehensible and facilitates usage &
|
||||
maintenance.
|
||||
* The MADR project is vivid.
|
||||
* Version 2.1.0 is the latest one available when starting to document
|
||||
ADRs.
|
66
modules/microsite/docs/dev/adr/0001_components.md
Normal file
@ -0,0 +1,66 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Components
|
||||
---
|
||||
|
||||
# Components
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
How should the application be structured into its main components? The
|
||||
goal is to be able to have multiple rest servers/webapps and multiple
|
||||
document processor components working together.
|
||||
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
The following are the "main" modules. There may be more helper modules
|
||||
and libraries that support implementing a feature.
|
||||
|
||||
### store
|
||||
|
||||
The code related to database access. It also provides the job
|
||||
queue. It is designed as a library.
|
||||
|
||||
### joex
|
||||
|
||||
Joex stands for "job executor".
|
||||
|
||||
An application that executes jobs from the queue and therefore depends
|
||||
on the `store` module. It provides the code for all tasks that can be
|
||||
submitted as jobs. If no jobs are in the queue, the joex "sleeps"
|
||||
and must be woken via an external request.
|
||||
|
||||
It provides the document processing code.
|
||||
|
||||
It provides an http rest server to get insight into the joex state
|
||||
and also to be notified of new jobs.
|
||||
|
||||
### backend
|
||||
|
||||
It provides all the logic, except document processing, as a set of
|
||||
"operations". An operation can be directly mapped to a rest
|
||||
endpoint.
|
||||
|
||||
It is designed as a library.
|
||||
|
||||
### rest api
|
||||
|
||||
This module contains the specification for the rest server as an
|
||||
`openapi.yml` file. It is packaged as a scala library that also
|
||||
provides types and conversions to/from json.
|
||||
|
||||
The idea is that the `rest server` module can depend on it as well as
|
||||
rest clients.
|
||||
|
||||
### rest server
|
||||
|
||||
This is the main application. It directly depends on the `backend`
|
||||
module, and each rest endpoint maps to a "backend operation". It is
|
||||
also responsible for converting the json data inside http requests
|
||||
to/from types recognized by the `backend` module.
|
||||
|
||||
|
||||
### webapp
|
||||
|
||||
This module provides the user interface as a web application.
|
65
modules/microsite/docs/dev/adr/0002_component_interaction.md
Normal file
@ -0,0 +1,65 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Component Interaction
|
||||
---
|
||||
|
||||
# Component Interaction
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
There are multiple web applications with their rest servers and there
|
||||
are multiple document processors. These processes must communicate:
|
||||
|
||||
- once a new job is added to the queue the rest server must somehow
|
||||
notify processors to wake up
|
||||
- once a processor takes a job, it must propagate the progress and
|
||||
outcome to all rest servers so that a rest server can notify the
|
||||
user that is currently logged in. Since it's not known which
|
||||
rest-server the user is using right now, all must be notified.
|
||||
|
||||
## Considered Options
|
||||
|
||||
1. JMS (ActiveMQ or similar): Message Broker as another active
|
||||
component
|
||||
2. Akka: using a cluster
|
||||
3. DB: Register with "call back urls"
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
Choosing option 3: DB as central synchronisation point.
|
||||
|
||||
The reason is that this is the simplest solution and doesn't require
|
||||
external libraries or more processes. The other options seem too big
|
||||
of a weapon for the task at hand. They are both large components
|
||||
themselves and require more knowledge to use efficiently.
|
||||
|
||||
It works roughly like this:
|
||||
|
||||
- rest servers and processors register at the database on startup each
|
||||
with a unique call-back url
|
||||
- and deregister on shutdown
|
||||
- each component has db access
|
||||
- rest servers can list all processors and vice versa
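
The registration scheme can be pictured with a small in-memory sketch. The class and method names (`NodeRegistry`, `register`, `deregister`, `urls_of`), the node type strings and the example urls are all made up for illustration, and a dict stands in for the database table; the actual schema may differ.

```python
class NodeRegistry:
    """In-memory stand-in for the shared database table (illustrative only)."""

    def __init__(self):
        # node id -> (node type, call-back url)
        self._nodes = {}

    def register(self, node_id, node_type, url):
        """Called on startup; node_type is e.g. "restserver" or "joex"."""
        self._nodes[node_id] = (node_type, url)

    def deregister(self, node_id):
        """Called on shutdown; unknown ids are ignored."""
        self._nodes.pop(node_id, None)

    def urls_of(self, node_type):
        """List the call-back urls of all registered nodes of one type."""
        return sorted(url for (t, url) in self._nodes.values() if t == node_type)


registry = NodeRegistry()
registry.register("rest1", "restserver", "http://rest1.example:7880")
registry.register("joex1", "joex", "http://joex1.example:8000")
# A rest server that queued a job would notify every url in this list.
joex_urls = registry.urls_of("joex")
```

Because every component reads the same table, a rest server can find all processors and a processor can find all rest servers without any additional broker.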
|
||||
|
||||
### Positive Consequences
|
||||
|
||||
- complexity of the whole application is not touched
|
||||
- since a lot of data must be transferred to the document processors,
|
||||
this is solved by simply accessing the db. So the protocol for data
|
||||
exchange is set. There is no need for other protocols that handle
|
||||
large data (http chunking etc)
|
||||
- uses the already existing db as synchronisation point
|
||||
- no additional knowledge required
|
||||
- simple to understand and so not hard to debug
|
||||
|
||||
### Negative Consequences
|
||||
|
||||
- all components must have db access. this also is a security con,
|
||||
because if one of those processes is hacked, db access is
|
||||
possible. and it simply is another dependency that is not really
|
||||
required for the joex component
|
||||
- the joex component cannot be in an untrusted environment (untrusted
|
||||
from the db's point of view). For example, it is not possible to
|
||||
create "personal joex" that only receive your own jobs…
|
||||
- in order to know if a component is really active, one must run a
|
||||
ping against the call-back url
|
95
modules/microsite/docs/dev/adr/0003_encryption.md
Normal file
@ -0,0 +1,95 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Encryption
|
||||
---
|
||||
|
||||
# Encryption
|
||||
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
Since docspell may store important documents, it should be possible to
|
||||
encrypt them on the server. It should be (almost) transparent to the
|
||||
user, for example, a user must be able to login and download a file in
|
||||
clear form. That is, the server must also decrypt them.
|
||||
|
||||
Then all users of a collective should have access to the files. This
|
||||
requires sharing the key among the users of a collective.
|
||||
|
||||
But, even when files are encrypted, the associated meta data is not!
|
||||
So especially access to the database would allow seeing tags,
|
||||
associated persons and correspondents of documents.
|
||||
|
||||
So in short, encryption means:
|
||||
|
||||
- file contents (the blobs and extracted text) are encrypted
|
||||
- metadata is not
|
||||
- secret keys are stored at the server (protected by a passphrase),
|
||||
such that files can be downloaded in clear form
|
||||
|
||||
|
||||
## Decision Drivers
|
||||
|
||||
* the major driver is to provide the most possible privacy for users
|
||||
* even at the expense of fewer features; currently I think that the
|
||||
associated meta data is enough for finding documents (i.e. full text
|
||||
search is not needed)
|
||||
|
||||
## Considered Options
|
||||
|
||||
It is clear that only blobs (file contents) can be encrypted, but not
|
||||
the associated metadata. And the extracted text must be encrypted,
|
||||
too, obviously.
|
||||
|
||||
|
||||
### Public Key Encryption (PKE)
|
||||
|
||||
With PKE the server can automatically encrypt files using
|
||||
publicly available key data. It wouldn't require a user to provide a
|
||||
passphrase for encryption, only for decryption.
|
||||
|
||||
This would allow for first processing files (extracting text, doing
|
||||
text analysis) and encrypting them (and the text) afterwards.
|
||||
|
||||
The public and secret keys are stored at the database. The secret key
|
||||
must be protected. This can be done by encrypting the passphrase to
|
||||
the secret key using each user's login password. If a user logs in, he
|
||||
or she must provide the correct password. Using this password, the
|
||||
private key can be unlocked. This requires storing the private key
|
||||
passphrase encrypted with every user's password in the database. So the
|
||||
whole security then depends on the users' password quality.
|
||||
|
||||
There are plenty of other difficulties with this approach (how about
|
||||
password change, new secret keys, adding users etc).
|
||||
|
||||
Using this kind of encryption would protect the data against offline
|
||||
attacks and also against accidental leakage (for example, if a bug in the
|
||||
software would access a file of another user).
|
||||
|
||||
|
||||
### No Encryption
|
||||
|
||||
If only blobs are encrypted, against which type of attack would it
|
||||
provide protection?
|
||||
|
||||
The users must still trust the server. First, in order to provide the
|
||||
wanted features (document processing), the server must see the file
|
||||
contents. Then, it will receive and serve files in clear form, so it
|
||||
has access to them anyways.
|
||||
|
||||
With that in mind, the "only" feature is to protect against "stolen
|
||||
database" attacks. If the database is somehow leaked, the attackers
|
||||
would only see the metadata, but not real documents. It also protects
|
||||
against leakage, maybe caused by a programming error.
|
||||
|
||||
But the downside is that it increases complexity *a lot*. And since
|
||||
this is a personal tool for personal use, is it worth the effort?
|
||||
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
No encryption, because of its complexity.
|
||||
|
||||
For now, this tool is only meant for "self deployment" and personal
|
||||
use. If this changes or there is enough time, this decision should be
|
||||
reconsidered.
|
42
modules/microsite/docs/dev/adr/0004_iso8601vsEpoch.md
Normal file
@ -0,0 +1,42 @@
|
||||
---
|
||||
layout: docs
|
||||
title: ISO8601 vs Millis
|
||||
---
|
||||
|
||||
# ISO8601 vs Millis as Date-Time transfer
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
The question is whether the REST Api should return an ISO8601
|
||||
formatted string in UTC timezone, or the unix time (number of
|
||||
milliseconds since 1970-01-01).
|
||||
|
||||
There is quite some controversy about it.
|
||||
|
||||
- <https://stackoverflow.com/questions/47426786/epoch-or-iso8601-date-format>
|
||||
- <https://nbsoftsolutions.com/blog/designing-a-rest-api-unix-time-vs-iso-8601>
|
||||
|
||||
In my opinion, the ISO8601 format (always UTC) is better. The reason
|
||||
is the better readability. But elm folks are on the other side:
|
||||
|
||||
- <https://package.elm-lang.org/packages/elm/time/1.0.0#iso-8601>
|
||||
- <https://package.elm-lang.org/packages/rtfeldman/elm-iso8601-date-strings/latest/>
|
||||
|
||||
One can convert from an ISO8601 date-time string in UTC time into the
|
||||
epoch millis and vice versa. So it is the same to me. There is no less
|
||||
information in an ISO8601 string than in the epoch millis.
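
The equivalence of the two encodings can be illustrated with a short sketch. The function names are hypothetical, and the string form `YYYY-MM-DDTHH:MM:SS.mmmZ` is just one assumed ISO8601 variant, chosen here for the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_iso(millis):
    """Format epoch milliseconds as an ISO8601 UTC string."""
    dt = EPOCH + timedelta(milliseconds=millis)
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + ".%03dZ" % (dt.microsecond // 1000)


def iso_to_millis(text):
    """Parse an ISO8601 UTC string of the form produced above."""
    dt = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    delta = dt - EPOCH
    # Integer arithmetic avoids floating point rounding of the millis.
    return delta.days * 86400000 + delta.seconds * 1000 + delta.microseconds // 1000
```

Converting a value to a string and back yields the original number, so choosing one encoding over the other loses no information at millisecond precision.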
|
||||
|
||||
To avoid confusion, all date/time values should use the same encoding.
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
I go with the epoch time. Every timestamp/date-time value is
|
||||
transferred as a Unix timestamp.
|
||||
|
||||
Reasons:
|
||||
|
||||
- the Elm application needs to frequently calculate with these values
|
||||
to render the current waiting time etc. This is better if there are
|
||||
numbers without requiring to parse dates first
|
||||
- Since the UI is written with Elm, it's probably good to adopt their
|
||||
style
|
136
modules/microsite/docs/dev/adr/0005_job-executor.md
Normal file
@ -0,0 +1,136 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Joex - Job Executor
|
||||
---
|
||||
|
||||
# Job Executor
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
Docspell is a multi-user application. When processing users'
|
||||
documents, there must be some thought on how to distribute all the
|
||||
processing jobs on a much more restricted set of resources. There
|
||||
may be 100 users but only 4 cores that can process documents at a
|
||||
time. Doing simply FIFO is not enough since it provides an unfair
|
||||
distribution. The first user who submits 20 documents will then occupy
|
||||
all cores for quite some time and all other users would need to wait.
|
||||
|
||||
This tries to find a more fair distribution among the users (strictly
|
||||
meaning collectives here) of docspell.
|
||||
|
||||
The job executor is a separate component that will run in its own
|
||||
process. It takes the next job from the "queue" and executes the
|
||||
associated task. This is used to run the document processing jobs
|
||||
(text extraction, text analysis etc).
|
||||
|
||||
1. The task execution should survive restarts. State and task code
|
||||
must be recreated from some persisted state.
|
||||
|
||||
2. The processing should be fair with respect to collectives.
|
||||
|
||||
3. It must be possible to run many job executors, possibly on
|
||||
different machines. This can be used to quickly enable more
|
||||
processing power and removing it once the peak is over.
|
||||
|
||||
4. Task execution can fail and it should be possible to retry those
|
||||
tasks. Reasons are that errors may be temporary (for example
|
||||
talking to a third party service), and to enable repairing without
|
||||
stopping the job executor. Some errors might be easily repaired (a
|
||||
program was not installed or whatever). In such a case it is good
|
||||
to know that the task will be retried later.
|
||||
|
||||
## Considered Options
|
||||
|
||||
In contrast to other ADRs this is just some sketching of thoughts for
|
||||
the current implementation.
|
||||
|
||||
1. Job descriptions are serialized and written to the database into a
|
||||
table. This becomes the queue. Tasks are identified by names and a
|
||||
job executor implementation must have a map of names to code to
|
||||
look up the task to perform. The task's arguments are serialized into
|
||||
a string and written to the database. Tasks must decode the
|
||||
string. This can be conveniently done using JSON and the provided
|
||||
circe decoders.
|
||||
|
||||
2. To provide fair execution, jobs are organized into groups. When a
|
||||
new job is requested from the queue, first a group is selected
|
||||
using a round-robin strategy. This should ensure good enough
|
||||
fairness among groups. A group maps to a collective. Within a
|
||||
group, a job is selected based on priority, submitted time (fifo)
|
||||
and job state (see notes about stuck jobs).
|
||||
|
||||
3. Allowing multiple job executors means that getting the next job can
|
||||
fail due to simultaneous running transactions. It is retried until
|
||||
it succeeds. Taking a job puts it into _scheduled_ state. Each job
|
||||
executor has a unique (manually supplied) id and jobs are marked
|
||||
with that id once they are handed to the executor.
|
||||
|
||||
4. When a task fails, its state is updated to state _stuck_. Stuck
|
||||
jobs are retried in the future. The queue prefers to return stuck
|
||||
jobs that are due at the specific point in time ignoring the
|
||||
priority hint.
|
||||
|
||||
### More Details
|
||||
|
||||
A job has these properties
|
||||
|
||||
- id (something random)
|
||||
- group
|
||||
- taskname (to choose task to run)
|
||||
- submitted-date
|
||||
- worker (the id of the job executor)
|
||||
- state, one of: waiting, scheduled, running, stuck, cancelled,
|
||||
failed, success
|
||||
- waiting: job has been inserted into the queue
|
||||
- scheduled: job has been handed over to some executor and is
|
||||
marked with the job executor id
|
||||
- running: a task is currently executing
|
||||
- stuck: a task has failed and will be retried eventually
|
||||
- cancelled: task has finished and there was a cancel request
|
||||
- failed: task has failed and exceeded the retries
|
||||
- success: task has completed successfully
|
||||
|
||||
The queue has a `take` or `nextJob` operation that takes the worker-id
|
||||
and a priority hint and goes roughly like this:
|
||||
|
||||
- select the next group using round-robin strategy
|
||||
- select all jobs with that group, where
|
||||
- state is stuck and waiting time has elapsed
|
||||
- state is waiting and the job has the given priority, if possible
|
||||
- jobs are ordered by submitted time, but stuck jobs whose waiting
|
||||
time elapsed are preferred
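
The selection within one group can be sketched as follows. This is only an illustration of the steps listed above: the `Job` fields, the function name `next_job` and the use of integer timestamps are assumptions, and the real queue runs as a database query rather than over an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Job:
    id: str
    group: str
    priority: str      # "high" or "low"
    state: str         # only "waiting" and "stuck" jobs are eligible
    submitted: int     # submission time, epoch millis
    retry_at: int = 0  # earliest retry time for stuck jobs, epoch millis


def next_job(jobs, group, priority_hint, now):
    """Pick the next job of `group`, or None if nothing is eligible."""
    in_group = [j for j in jobs if j.group == group]
    # Stuck jobs whose waiting time has elapsed win, ignoring the hint.
    due = [j for j in in_group if j.state == "stuck" and j.retry_at <= now]
    if due:
        return min(due, key=lambda j: j.submitted)
    waiting = [j for j in in_group if j.state == "waiting"]
    # Prefer the hinted priority, fall back to any waiting job.
    preferred = [j for j in waiting if j.priority == priority_hint]
    candidates = preferred or waiting
    return min(candidates, key=lambda j: j.submitted) if candidates else None
```

The group itself would be chosen beforehand with the round-robin strategy; that step is left out here.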
|
||||
|
||||
There are two priorities within a group: high and low. A configured
|
||||
counting scheme determines when to select a certain priority. For
|
||||
example, a counting scheme of `(2,1)` would select two high priority
|
||||
jobs and then 1 low priority job. The `take` operation tries to prefer
|
||||
this priority but falls back to the other if no job with this priority
|
||||
is available.
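
A counting scheme can be read as an endlessly repeating sequence of priority hints. The following sketch shows that reading; the function name `priority_hints` is hypothetical.

```python
from itertools import cycle, islice


def priority_hints(high, low):
    """Yield priority hints forever: `high` times "high", then `low` times "low"."""
    return cycle(["high"] * high + ["low"] * low)


# The first six hints for the counting scheme (2,1).
first_six = list(islice(priority_hints(2, 1), 6))
```

Each hint is only a preference: if no job with the hinted priority is waiting, a job of the other priority is taken instead.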
|
||||
|
||||
A group corresponds to a collective. Then all collectives get
|
||||
(roughly) equal treatment.
|
||||
|
||||
Once there are no jobs in the queue the executor goes into sleep and
|
||||
must be woken to run again. If a job is submitted, the executors are
|
||||
notified.
|
||||
|
||||
### Stuck Jobs
|
||||
|
||||
A job goes into the _stuck_ state if the task has failed. In this
|
||||
state, the task is rerun after a while until a maximum retry count is
|
||||
reached.
|
||||
|
||||
The problem is how to notify all executors when the waiting time has
|
||||
elapsed. If one executor puts a job into stuck state, it means that
|
||||
all others should start looking into the queue again after `x`
|
||||
minutes. It would be possible to tell all existing executors to
|
||||
schedule themselves to wake up in the future, but this would miss all
|
||||
executors that show up later.
|
||||
|
||||
The waiting time is increased exponentially after each retry (`2 ^
|
||||
retry`) and it is meant as the minimum waiting time. So it is ok if
|
||||
all executors wake up periodically and check for new work. Most of the
|
||||
time this should not be necessary and is just a fallback if only stuck
|
||||
jobs are in the queue and nothing is submitted for a long time. If the
|
||||
system is used, jobs get submitted once in a while and would awake all
|
||||
executors.
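
The retry timing can be sketched in a few lines. The text only gives the factor `2 ^ retry`; treating it as a number of minutes is an assumption made for this example, and the function names are hypothetical.

```python
def min_wait_minutes(retry):
    """Minimum waiting time before retry number `retry` (unit assumed: minutes)."""
    return 2 ** retry


def is_due(failed_at_minute, retry, now_minute):
    """True once the minimum waiting time since the failure has elapsed."""
    return now_minute - failed_at_minute >= min_wait_minutes(retry)


# Minimum waits for the first four retries.
waits = [min_wait_minutes(r) for r in range(1, 5)]
```

Since the value is a lower bound, an executor that wakes up somewhat later than this still behaves correctly; the stuck job simply waits a little longer.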
|
72
modules/microsite/docs/dev/adr/template.md
Normal file
@ -0,0 +1,72 @@
|
||||
# [short title of solved problem and solution]
|
||||
|
||||
* Status: [proposed | rejected | accepted | deprecated | … | superseded by [ADR-0005](0005-example.md)] <!-- optional -->
|
||||
* Deciders: [list everyone involved in the decision] <!-- optional -->
|
||||
* Date: [YYYY-MM-DD when the decision was last updated] <!-- optional -->
|
||||
|
||||
Technical Story: [description | ticket/issue URL] <!-- optional -->
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]
|
||||
|
||||
## Decision Drivers <!-- optional -->
|
||||
|
||||
* [driver 1, e.g., a force, facing concern, …]
|
||||
* [driver 2, e.g., a force, facing concern, …]
|
||||
* … <!-- numbers of drivers can vary -->
|
||||
|
||||
## Considered Options
|
||||
|
||||
* [option 1]
|
||||
* [option 2]
|
||||
* [option 3]
|
||||
* … <!-- numbers of options can vary -->
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)].
|
||||
|
||||
### Positive Consequences <!-- optional -->
|
||||
|
||||
* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
|
||||
* …
|
||||
|
||||
### Negative Consequences <!-- optional -->
|
||||
|
||||
* [e.g., compromising quality attribute, follow-up decisions required, …]
|
||||
* …
|
||||
|
||||
## Pros and Cons of the Options <!-- optional -->
|
||||
|
||||
### [option 1]
|
||||
|
||||
[example | description | pointer to more information | …] <!-- optional -->
|
||||
|
||||
* Good, because [argument a]
|
||||
* Good, because [argument b]
|
||||
* Bad, because [argument c]
|
||||
* … <!-- numbers of pros and cons can vary -->
|
||||
|
||||
### [option 2]
|
||||
|
||||
[example | description | pointer to more information | …] <!-- optional -->
|
||||
|
||||
* Good, because [argument a]
|
||||
* Good, because [argument b]
|
||||
* Bad, because [argument c]
|
||||
* … <!-- numbers of pros and cons can vary -->
|
||||
|
||||
### [option 3]
|
||||
|
||||
[example | description | pointer to more information | …] <!-- optional -->
|
||||
|
||||
* Good, because [argument a]
|
||||
* Good, because [argument b]
|
||||
* Bad, because [argument c]
|
||||
* … <!-- numbers of pros and cons can vary -->
|
||||
|
||||
## Links <!-- optional -->
|
||||
|
||||
* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
|
||||
* … <!-- numbers of links can vary -->
|
99
modules/microsite/docs/doc.md
Normal file
@ -0,0 +1,99 @@
|
||||
---
|
||||
layout: docs
|
||||
position: 4
|
||||
title: Documentation
|
||||
---
|
||||
|
||||
# {{page.title}}
|
||||
|
||||
Docspell assists in organizing large amounts of PDF files that are
|
||||
typically scanned paper documents. You can associate tags, set
|
||||
correspondents, what a document is concerned with, a name, a date and
|
||||
some more. If your documents are associated with this meta data, you
|
||||
should be able to quickly find them later using the search
|
||||
feature. But adding this manually to each document is a tedious
|
||||
task. What if most of it could be attached automatically?
|
||||
|
||||
## How it works
|
||||
|
||||
Documents have two main properties: a correspondent (sender or
|
||||
receiver that is not you) and something the document is about. Usually
|
||||
it is about a person or some thing – maybe your car, or contracts
|
||||
concerning some family member, etc.
|
||||
|
||||
1. You maintain a kind of address book. It should list all possible
|
||||
correspondents and the concerning people/things. This grows
|
||||
incrementally with each new unknown document.
|
||||
2. When docspell analyzes a document, it tries to find matches within
|
||||
your address book. It can detect the correspondent and a concerning
|
||||
person or thing. It will then associate this data to your
|
||||
documents.
|
||||
3. You can inspect what docspell has done and correct it. If docspell
|
||||
has found multiple suggestions, they will be shown for you to
|
||||
select one. If it is not correctly associated, very often the
|
||||
correct one is just one click away.
|
||||
|
||||
The set of meta data that docspell uses to draw suggestions from, must
|
||||
be maintained manually. But usually, this data doesn't grow as fast as
|
||||
the documents. After a while there is a quite complete address book
|
||||
and only once in a while it has to be revisited.
|
||||
|
||||
|
||||
## Terms
|
||||
|
||||
In order to better understand these pages, some terms should be
|
||||
explained first.
|
||||
|
||||
### Item
|
||||
|
||||
An **Item** is roughly your (pdf) document, only that an item may span
|
||||
multiple files, which are called **attachments**. And an item has
|
||||
**meta data** associated:
|
||||
|
||||
- a **correspondent**: the other side of the communication. It can be
|
||||
an organization or a person.
|
||||
- a **concerning person** or **equipment**: a person or thing that
|
||||
this item is about. Maybe it is an insurance contract about your
|
||||
car.
|
||||
- **tag**: an item can be tagged with custom tags. A tag can have a
|
||||
*category*. This is intended for grouping tags, for example a
|
||||
category `doctype` could be used to group tags like `bill`,
|
||||
`contract`, `receipt` etc. Usually an item is not tagged with more
|
||||
than one tag of a category.
|
||||
- an **item date**: this is the date of the document – if this is not
|
||||
set, the created date of the item is used.
|
||||
- a **due date**: an optional date indicating that something has to be
|
||||
done (e.g. paying a bill, submitting it) about this item by this
|
||||
date
|
||||
- a **direction**: one of "incoming" or "outgoing"
|
||||
- a **name**: some item name, defaults to the file name of the
|
||||
attachments
|
||||
- some **notes**: arbitrary descriptive text. You can use markdown
|
||||
here, which is appropriately formatted in the web application.
|
||||
|
||||
### Collective
|
||||
|
||||
The users of the application are part of a **collective**. A
|
||||
**collective** is a group of users that share access to the same
|
||||
items. The account name is therefore comprised of a *collective name*
|
||||
and a *user name*.
|
||||
|
||||
All users of a collective are equal; they have the same permissions to
|
||||
access all items. The items don't belong to a user, but to the
|
||||
collective.
|
||||
|
||||
That means, to identify yourself when signing in, you have to give the
|
||||
collective name and your user name. By default they are separated by a
|
||||
slash `/`, for example `smith/john`. If your user name is the same as
|
||||
the collective name, you can omit one; so `smith/smith` can be
|
||||
abbreviated to just `smith`.
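The naming rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not docspell's own parsing code; the function name `parse_account` and the `separator` parameter are made up for this example.

```python
from typing import Tuple


def parse_account(account: str, separator: str = "/") -> Tuple[str, str]:
    """Split an account name into (collective, user).

    A bare name such as "smith" is treated as shorthand for "smith/smith".
    """
    collective, sep, user = account.partition(separator)
    if not collective or (sep and not user):
        raise ValueError(f"invalid account name: {account!r}")
    # No separator present: the user name equals the collective name.
    return (collective, user if sep else collective)


print(parse_account("smith/john"))  # ('smith', 'john')
print(parse_account("smith"))       # ('smith', 'smith')
```

Both `smith` and `smith/smith` resolve to the same pair, which is why the short form is sufficient at sign-in.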
|
||||
|
||||
|
||||
## Limitations
|
||||
|
||||
* Docspell currently supports only PDF files.
|
||||
* The PDF view relies on the browser's capabilities. Sadly, not all
|
||||
browsers can display PDF files. Some may require extra plugins. And
|
||||
it's especially sad that mobile browsers won't display the
|
||||
files. It works with the major desktop browsers (firefox, chromium),
|
||||
though.
|
261
modules/microsite/docs/doc/configure.md
Normal file
@ -0,0 +1,261 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Configuring
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
Docspell's executable can take one argument – a configuration file. If
|
||||
that is not given, the defaults are used. The config file overrides
|
||||
default values, so only values that differ from the defaults are
|
||||
necessary.
|
||||
|
||||
This applies to the restserver and the joex as well.
|
||||
|
||||
## Important Config Options
|
||||
|
||||
The configuration of both components uses separate namespaces. The
|
||||
configuration for the REST server is below `docspell.server`, while
|
||||
the one for joex is below `docspell.joex`.
|
||||
|
||||
### JDBC
|
||||
|
||||
This configures the connection to the database. This has to be
|
||||
specified for the rest server and joex. By default, an H2 database in
|
||||
the `/tmp` directory is configured.
|
||||
|
||||
The config looks like this (both components):
|
||||
|
||||
```
|
||||
docspell.joex.jdbc {
|
||||
url = ...
|
||||
user = ...
|
||||
password = ...
|
||||
}
|
||||
|
||||
docspell.server.backend.jdbc {
|
||||
url = ...
|
||||
user = ...
|
||||
password = ...
|
||||
}
|
||||
```
|
||||
|
||||
The `url` is the connection to the database. It must start with
|
||||
`jdbc`, followed by the name of the database. The rest is specific to the
|
||||
database used: it is either a path to a file for H2 or a host/database
|
||||
url for MariaDB and PostgreSQL.
|
||||
|
||||
When using H2, the user is `sa`, the password can be empty and the url
|
||||
must include these options:
|
||||
|
||||
```
|
||||
;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE
|
||||
```
|
||||
|
||||
#### Examples
|
||||
|
||||
PostgreSQL:
|
||||
```
|
||||
url = "jdbc:postgresql://localhost:5432/docspelldb"
|
||||
```
|
||||
|
||||
MariaDB:
|
||||
```
|
||||
url = "jdbc:mariadb://localhost:3306/docspelldb"
|
||||
```
|
||||
|
||||
H2:
|
||||
```
|
||||
url = "jdbc:h2:///path/to/a/file.db;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
|
||||
```
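Because the three H2 options are easy to forget, a small helper can append and check them. The following Python sketch only does string handling on the url format shown above; the names `h2_url` and `has_required_h2_options` are hypothetical and not part of docspell.

```python
H2_REQUIRED_OPTIONS = ";MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"


def h2_url(db_file: str) -> str:
    """Build an H2 JDBC url for an absolute file path, adding the required options."""
    if not db_file.startswith("/"):
        raise ValueError("expected an absolute path")
    return "jdbc:h2://" + db_file + H2_REQUIRED_OPTIONS


def has_required_h2_options(url: str) -> bool:
    """Check that an H2 url carries all three options (in any order)."""
    _, *options = url.split(";")
    required = set(H2_REQUIRED_OPTIONS.strip(";").split(";"))
    return required <= set(options)


print(h2_url("/path/to/a/file.db"))
print(has_required_h2_options("jdbc:h2:///path/to/a/file.db"))  # False
```

The same url string must be used for the REST server and for joex, so generating it once and pasting it into both config files avoids subtle mismatches.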
|
||||
|
||||
### Bind
|
||||
|
||||
The host and port the http server binds to. This applies to both
|
||||
components. The joex component also exposes a small REST api to
|
||||
inspect its state and notify the scheduler.
|
||||
|
||||
```
|
||||
docspell.server.bind {
|
||||
address = localhost
|
||||
port = 7880
|
||||
}
|
||||
docspell.joex.bind {
|
||||
address = localhost
|
||||
port = 7878
|
||||
}
|
||||
```
|
||||
|
||||
By default, it binds to `localhost` and some predefined port. This
|
||||
must be changed if the components run on different machines.
|
||||
|
||||
### baseurl
|
||||
|
||||
The base url is an important setting that defines the http URL where
|
||||
the corresponding component can be reached. It applies to both
|
||||
components. For a joex component, the url must be resolvable from a
|
||||
REST server component. The REST server also uses this url to create
|
||||
absolute urls and to configure the authentication cookie.
|
||||
|
||||
By default it is built using the information from the `bind` setting.
|
||||
|
||||
|
||||
```
|
||||
docspell.server.baseurl = ...
|
||||
docspell.joex.baseurl = ...
|
||||
```
|
||||
|
||||
#### Examples
|
||||
|
||||
```
|
||||
docspell.server.baseurl = "https://docspell.example.com"
|
||||
docspell.joex.baseurl = "http://192.168.101.10"
|
||||
```
|
||||
|
||||
|
||||
### app-id
|
||||
|
||||
The `app-id` is the identifier of the corresponding instance. It *must
|
||||
be unique* for all instances. By default the REST server uses `rest1`
|
||||
and joex `joex1`. It is recommended to override this setting to have
|
||||
an explicit and stable identifier.
|
||||
|
||||
```
|
||||
docspell.server.app-id = "rest1"
|
||||
docspell.joex.app-id = "joex1"
|
||||
```
|
||||
|
||||
### registration options
|
||||
|
||||
This defines if and how new users can create accounts. There are 3
|
||||
options:
|
||||
|
||||
- *closed* no new user can sign up
|
||||
- *open* new users can sign up
|
||||
- *invite* new users can sign up but require an invitation key
|
||||
|
||||
This applies only to the REST server component.
|
||||
|
||||
```
|
||||
docspell.server.signup {
|
||||
mode = "open"
|
||||
|
||||
# If mode == 'invite', a password must be provided to generate
|
||||
# invitation keys. It must not be empty.
|
||||
new-invite-password = ""
|
||||
|
||||
# If mode == 'invite', this is the period an invitation token is
|
||||
# considered valid.
|
||||
invite-time = "3 days"
|
||||
}
|
||||
```
|
||||
|
||||
The mode `invite` is intended to open the application only to some
|
||||
users. The admin can create these invitation keys and distribute them
|
||||
to the desired people. For this, the `new-invite-password` must be
|
||||
given. The idea is that only the person who installs docspell knows
|
||||
this. If it is not set, then invitations won't work. New invitation
|
||||
keys can be generated from within the web application or via REST
|
||||
calls (using `curl`, for example).
|
||||
|
||||
```
|
||||
curl -X POST -d '{"password":"blabla"}' "http://localhost:7880/api/v1/open/signup/newinvite"
|
||||
```
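The same request can be prepared from a script. The sketch below uses only Python's standard library and mirrors the `curl` call above; the helper name `new_invite_request` is hypothetical, and the `Content-Type` header is an assumption that the `curl` example does not set in this form. The response format is not described here, so the sketch just prints whatever comes back.

```python
import json
import urllib.request


def new_invite_request(base_url: str, invite_password: str) -> urllib.request.Request:
    """Prepare the POST request that asks the REST server for a new invitation key."""
    body = json.dumps({"password": invite_password}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/open/signup/newinvite",
        data=body,
        headers={"Content-Type": "application/json"},  # assumption, see lead-in
        method="POST",
    )


req = new_invite_request("http://localhost:7880", "blabla")
print(req.get_method(), req.full_url)

# Sending it requires a running REST server with signup mode "invite":
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```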
|
||||
|
||||
### Authentication
|
||||
|
||||
Authentication works in two ways:
|
||||
|
||||
- with an account-name / password pair
|
||||
- with an authentication token
|
||||
|
||||
The initial authentication must occur with an accountname/password
|
||||
pair. This will generate an authentication token which is valid for
|
||||
some time. Subsequent calls to secured routes can use this token. The
|
||||
token can be given as a normal http header or via a cookie header.
|
||||
|
||||
These settings apply only to the REST server.
|
||||
|
||||
```
|
||||
docspell.server.auth {
|
||||
server-secret = "hex:caffee" # or "b64:Y2FmZmVlCg=="
|
||||
session-valid = "5 minutes"
|
||||
}
|
||||
```
|
||||
|
||||
The `server-secret` is used to sign the token. If multiple REST
|
||||
servers are deployed, all must share the same server secret. Otherwise
|
||||
tokens from one instance are not valid on another instance. The secret
|
||||
can be given as a Base64-encoded string or in hex form. Use the prefix
|
||||
`b64:` and `hex:`, respectively.
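To see what bytes a given secret stands for, the two encodings can be decoded with Python's standard library. This is a sketch of the prefix convention described above, not docspell's parser; `decode_secret` is a made-up name.

```python
import base64


def decode_secret(value: str) -> bytes:
    """Decode a secret written as "hex:..." or "b64:..." into raw bytes."""
    prefix, sep, payload = value.partition(":")
    if sep and prefix == "hex":
        return bytes.fromhex(payload)
    if sep and prefix == "b64":
        return base64.b64decode(payload, validate=True)
    raise ValueError("expected a 'hex:' or 'b64:' prefix")


print(decode_secret("hex:caffee"))        # b'\xca\xff\xee'
print(decode_secret("b64:Y2FmZmVlCg=="))  # b'caffee\n'
```

Note that the two sample values from the config snippet decode to different bytes: they illustrate the two notations and are not the same secret. When running several REST servers, compare the decoded bytes rather than the textual form.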
|
||||
|
||||
The `session-valid` determines how long a token is valid. This can be
|
||||
just a few minutes, since the web application obtains new ones
|
||||
periodically. So a short time is recommended.
|
||||
|
||||
|
||||
## File Format
|
||||
|
||||
The format of the configuration files can be
|
||||
[HOCON](https://github.com/lightbend/config/blob/master/HOCON.md#hocon-human-optimized-config-object-notation),
|
||||
JSON or whatever the used [config
|
||||
library](https://github.com/lightbend/config) understands. The default
|
||||
values below are in HOCON format, which is recommended, since it
|
||||
allows comments and has some [advanced
|
||||
features](https://github.com/lightbend/config/blob/master/README.md#features-of-hocon). Please
|
||||
refer to their documentation for more on this.
|
||||
|
||||
Here are the default configurations.
|
||||
|
||||
|
||||
## Default Config
|
||||
|
||||
### Rest Server
|
||||
|
||||
```
|
||||
{% include server.conf %}
|
||||
```
|
||||
|
||||
### Joex
|
||||
|
||||
```
|
||||
{% include joex.conf %}
|
||||
```
|
||||
|
||||
## Logging
|
||||
|
||||
By default, docspell logs to stdout. This works well when managed by
|
||||
systemd or other inits. Logging is done by
|
||||
[logback](https://logback.qos.ch/). Please refer to its documentation
|
||||
for how to configure logging.
|
||||
|
||||
If you have created your own logback config file, it can be passed as an argument
|
||||
to the executable using this syntax:
|
||||
|
||||
```
|
||||
/path/to/docspell -Dlogback.configurationFile=/path/to/your/logging-config-file
|
||||
```
|
||||
|
||||
To get started, the default config looks like this:
|
||||
|
||||
``` xml
|
||||
<configuration>
|
||||
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
|
||||
<withJansi>true</withJansi>
|
||||
|
||||
<encoder>
|
||||
<pattern>[%thread] %highlight(%-5level) %cyan(%logger{15}) - %msg %n</pattern>
|
||||
</encoder>
|
||||
</appender>
|
||||
|
||||
<logger name="docspell" level="debug" />
|
||||
<root level="INFO">
|
||||
<appender-ref ref="STDOUT" />
|
||||
</root>
|
||||
</configuration>
|
||||
```
|
||||
|
||||
The `<root level="INFO">` means that only log statements with at least level
|
||||
"INFO" will be printed. But the `<logger name="docspell"
|
||||
level="debug">` above says that for loggers whose name starts with "docspell"
|
||||
statements with level "DEBUG" will be printed, too.
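The way a named logger overrides the root level can be tried out without a Java setup. The sketch below uses Python's `logging` module purely as an analogy, since it follows the same hierarchical rule (a logger inherits the level of its nearest configured ancestor); it is not logback, and the logger names other than `docspell` are invented.

```python
import logging

# Analogous to <root level="INFO">
logging.getLogger().setLevel(logging.INFO)
# Analogous to <logger name="docspell" level="debug" />
logging.getLogger("docspell").setLevel(logging.DEBUG)


def is_printed(logger_name: str, level: int) -> bool:
    """Would a statement of this level pass the logger's effective level?"""
    return logging.getLogger(logger_name).isEnabledFor(level)


print(is_printed("docspell.joex", logging.DEBUG))    # True
print(is_printed("other.library", logging.DEBUG))    # False
print(is_printed("other.library", logging.WARNING))  # True
```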
|
77
modules/microsite/docs/doc/curate.md
Normal file
@ -0,0 +1,77 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Find and Review
|
||||
---
|
||||
|
||||
# {{page.title}}
|
||||
|
||||
Curating the items' meta data helps to find them later. This page
|
||||
describes how you can quickly go through those items and correct or
|
||||
amend with existing data.
|
||||
|
||||
## Select New items
|
||||
|
||||
After files have been uploaded and the job executor created the
|
||||
corresponding items, they will show up on the main page. All items
|
||||
the job executor has created are initially marked as *New*. The option
|
||||
*only New* in the left search menu can be used to select only new
|
||||
items:
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-1.jpg">
|
||||
</div>
|
||||
|
||||
|
||||
## Check selected items
|
||||
|
||||
Then you can go through all new items and check their metadata: Click
|
||||
on the first item to open the detail view. This shows the documents
|
||||
and the meta data in the header.
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-2.jpg">
|
||||
</div>
|
||||
|
||||
|
||||
## Modify if necessary
|
||||
|
||||
To change something, click the *Edit* button in the menu above the
|
||||
document view. This will open a form next to your documents. You can
|
||||
compare the data with the documents and change as you like. Since the
|
||||
item status is *New*, you'll see the suggestions docspell found during
|
||||
processing. If there were multiple candidates, you can select another
|
||||
one by clicking its name in the suggestion list.
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-3.jpg">
|
||||
</div>
|
||||
|
||||
|
||||
When you change something in the form, it is immediately applied. Only
|
||||
when changing text fields, a click on the *Save* symbol next to the
|
||||
field is required.
|
||||
|
||||
|
||||
## Confirm
|
||||
|
||||
If everything looks good, click the *Confirm* button to confirm the
|
||||
current data. The *New* status goes away and also the suggestions are
|
||||
hidden in this state. You can always go back by clicking the
|
||||
*Unconfirm* button.
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-5.jpg">
|
||||
</div>
|
||||
|
||||
|
||||
## Proceed with next item
|
||||
|
||||
To look at the next item in the search results, click the *Next*
|
||||
button in the menu (next to the *Edit* button). Clicking *Next* will
|
||||
keep the current view, so you can continue checking the data. If you
|
||||
are on the last item, the view switches to the listing view when
|
||||
clicking *Next*.
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-6.jpg">
|
||||
</div>
|
218
modules/microsite/docs/doc/install.md
Normal file
@ -0,0 +1,218 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Installation
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
This page contains detailed installation instructions. For a quick
|
||||
start, refer to [this page](../getit.html).
|
||||
|
||||
Docspell has been developed and tested on a GNU/Linux system. It may
|
||||
run on Windows and macOS machines, too (ghostscript and tesseract are
|
||||
available on these systems). But I've never tried.
|
||||
|
||||
Docspell consists of two components that are started in separate
|
||||
processes:
|
||||
|
||||
1. *REST Server* This is the main application, providing the REST Api
|
||||
and the web application.
|
||||
2. *Joex* (job executor) This is the component that does the document
|
||||
processing.
|
||||
|
||||
They can run on multiple machines. All REST server and Joex instances
|
||||
should be on the same network. It is not strictly required that they
|
||||
can reach each other, but the components can then notify each other
|
||||
about new or done work.
|
||||
|
||||
While this is possible, the simple setup is to start both components
|
||||
once on the same machine.
|
||||
|
||||
The [download page](https://github.com/eikek/docspell/releases)
|
||||
provides pre-compiled packages and the [development page](dev.html)
|
||||
contains build instructions.
|
||||
|
||||
|
||||
## Prerequisites
|
||||
|
||||
The two components have one prerequisite in common: they both require
|
||||
Java to run. While this is the only requirement for the *REST server*,
|
||||
the *Joex* component requires some more external programs.
|
||||
|
||||
### Java
|
||||
|
||||
Very often, Java is already installed. You can check this by opening a
|
||||
terminal and typing `java -version`. Otherwise install Java using your
|
||||
package manager or see [this site](https://adoptopenjdk.net/) for
|
||||
other options.
|
||||
|
||||
It is enough to install the JRE. The JDK is required, if you want to
|
||||
build docspell from source.
|
||||
|
||||
Docspell has been tested with Java version 1.8 (or sometimes referred
|
||||
to as JRE 8 and JDK 8, respectively). The pre-built packages are also
|
||||
built using JDK 8. But a later version of Java should work as well.
|
||||
|
||||
The next tools are only required on machines running the *Joex*
|
||||
component.
|
||||
|
||||
### External Tools for Joex
|
||||
|
||||
- [Ghostscript](http://pages.cs.wisc.edu/~ghost/) (the `gs` command)
|
||||
is used to extract/convert PDF files into images that are then fed
|
||||
to ocr. It is available on most GNU/Linux distributions.
|
||||
- [Unpaper](https://github.com/Flameeyes/unpaper) is a program that
|
||||
pre-processes images to yield better results when doing ocr. If this
|
||||
is not installed, docspell tries without it. However, it is
|
||||
recommended to install, because it [improves text
|
||||
extraction](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
|
||||
(at the expense of a longer runtime).
|
||||
- [Tesseract](https://github.com/tesseract-ocr/tesseract) is the tool
|
||||
doing the OCR (converts images into text). It is a widely used open
|
||||
source OCR engine. Tesseract 3 and 4 should work with docspell; you
|
||||
can adapt the command line in the configuration file, if necessary.
|
||||
|
||||
|
||||
### Example Debian
|
||||
|
||||
On Debian this should install all joex requirements:
|
||||
|
||||
``` bash
|
||||
sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper
|
||||
```
|
||||
|
||||
## Database
|
||||
|
||||
Both components must have access to a SQL database. Docspell has
|
||||
support for these databases:
|
||||
|
||||
- PostgreSQL
|
||||
- MariaDB
|
||||
- H2
|
||||
|
||||
The H2 database is an interesting option for personal and mid-size
|
||||
setups, as it requires no additional work. It is integrated into
|
||||
docspell and works really well. It is also configured as the default
|
||||
database.
|
||||
|
||||
For large installations, PostgreSQL or MariaDB is recommended. Create
|
||||
a database and a user with enough privileges (read, write, create
|
||||
table) to that database.
|
||||
|
||||
When using H2, make sure that all components access the same database
|
||||
– the jdbc url must point to the same file. Then, it is important to
|
||||
add the options
|
||||
`;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE` at the end
|
||||
of the url. See the [default config](configure.html) for an example.
|
||||
|
||||
|
||||
## Installing from ZIP files
|
||||
|
||||
After extracting the zip files, you'll find a start script in the
|
||||
`bin/` folder.
|
||||
|
||||
|
||||
## Installing from DEB packages
|
||||
|
||||
The DEB packages can be installed on Debian or Debian-based distros:
|
||||
|
||||
``` bash
|
||||
$ sudo dpkg -i docspell*.deb
|
||||
```
|
||||
|
||||
Then the start scripts are in your `$PATH`. Run `docspell-restserver`
|
||||
or `docspell-joex` from a terminal window.
|
||||
|
||||
The packages come with a systemd unit file that will be installed to
|
||||
autostart the services.
|
||||
|
||||
|
||||
## Running
|
||||
|
||||
Run the start script (in the corresponding `bin/` directory when using
|
||||
the zip files):
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver
|
||||
$ ./docspell-joex*/bin/docspell-joex
|
||||
```
|
||||
|
||||
This will start up both components using the default configuration. The
|
||||
configuration should be adapted to your needs. For example, the
|
||||
database connection is configured to use an H2 database in the `/tmp`
|
||||
directory. Please refer to the [configuration page](configure.html)
|
||||
for how to create a custom config file. Once you have your config
|
||||
file, simply pass it as an argument to the command:
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver /path/to/server-config.conf
|
||||
$ ./docspell-joex*/bin/docspell-joex /path/to/joex-config.conf
|
||||
```
|
||||
|
||||
After starting the rest server, you can reach the web application at
|
||||
path `/app/index.html`, so using default values it would be
|
||||
`http://localhost:7880/app/index.html`.
|
||||
|
||||
You should be able to create a new account and sign in. Check the
|
||||
[configuration page](configure.html) to further customize docspell.
|
||||
|
||||
|
||||
### Options
|
||||
|
||||
The start scripts support some options to configure the JVM. One often
|
||||
used setting is the maximum heap size of the JVM. By default, java
|
||||
determines it based on properties of the current machine. You can
|
||||
specify it by giving Java startup options to the command:
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -- /path/to/server-config.conf
|
||||
```
|
||||
|
||||
This would limit the maximum heap to 1GB. The double dash (`--`) separates
|
||||
internal options and the arguments to the program. Another frequently
|
||||
used option is to change the default temp directory. Usually it is
|
||||
`/tmp`, but it may be desired to have a dedicated temp directory,
|
||||
which can be configured:
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -Djava.io.tmpdir=/path/to/othertemp -- /path/to/server-config.conf
|
||||
```
|
||||
|
||||
The command:
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver -h
|
||||
```
|
||||
|
||||
gives an overview of supported options.
|
||||
|
||||
|
||||
## Raspberry Pi and similar
|
||||
|
||||
Both components can run next to each other on a Raspberry Pi or
|
||||
similar device.
|
||||
|
||||
|
||||
### REST Server
|
||||
|
||||
The REST server component runs very well on the Raspberry Pi and
|
||||
similar devices. It doesn't require many resources, because the heavy
|
||||
work is done by the joex components.
|
||||
|
||||
|
||||
### Joex
|
||||
|
||||
Running the joex component on the Raspberry Pi is possible, but will
|
||||
result in long processing times. Tested on a RPi model 3 (4 cores, 1G
|
||||
RAM) processing a PDF (scanned with 300dpi) with two pages took
|
||||
9:52. You can speed it up considerably by uninstalling the `unpaper`
|
||||
command, because this step takes quite long. This, of course, reduces
|
||||
the quality of OCR. But without `unpaper` the same sample pdf was then
|
||||
processed in 1:24, about 8.5 minutes faster.
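The two timings quoted above can be compared with a little arithmetic. The Python snippet below only restates the measured values from this page (9:52 with `unpaper`, 1:24 without); the helper `to_seconds` is introduced just for this calculation.

```python
def to_seconds(mm_ss: str) -> int:
    """Convert a "minutes:seconds" string into seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


with_unpaper = to_seconds("9:52")     # 592 seconds
without_unpaper = to_seconds("1:24")  # 84 seconds

saved = with_unpaper - without_unpaper
print(f"saved {saved // 60}:{saved % 60:02d}")          # saved 8:28
print(f"{with_unpaper / without_unpaper:.1f}x faster")  # 7.0x faster
```

So skipping `unpaper` cut the processing time of this sample by roughly a factor of seven, at the cost of OCR quality.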
|
||||
|
||||
You should limit the joex pool size to 1 and, depending on your model
|
||||
and the amount of RAM, set a heap size of at least 500M
|
||||
(`-J-Xmx500M`).
|
||||
|
||||
For personal setups, when you don't need the processing results asap,
|
||||
this can work well enough.
|
155
modules/microsite/docs/doc/joex.md
Normal file
@ -0,0 +1,155 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Joex
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
Joex is short for *Job Executor* and it is the component managing long
|
||||
running tasks in docspell. One of these long running tasks is the file
|
||||
processing task.
|
||||
|
||||
One joex component handles the processing of all files of all
|
||||
collectives/users. It requires many more resources than the rest
|
||||
server component. Therefore the number of jobs that can run in
|
||||
parallel is limited with respect to the hardware it is running on.
|
||||
|
||||
For larger installations, it is probably better to run several joex
|
||||
components on different machines. That works out of the box, as long
|
||||
as all components point to the same database and use different
|
||||
`app-id`s (see [configuring docspell](./configure.html)).
|
||||
|
||||
When files are submitted to docspell, they are stored in the database
|
||||
and all known joex components are notified about new work. Then they
|
||||
compete for the next job from the queue. After a job finishes
|
||||
and no job is waiting in the queue, joex will sleep until notified
|
||||
again. It will also periodically notify itself as a fallback.
|
||||
|
||||
## Scheduler and Queue
|
||||
|
||||
The scheduler is the part that runs and monitors the long running
|
||||
jobs. It works together with the job queue, which defines what job to
|
||||
take next.
|
||||
|
||||
To create a somewhat fair distribution among multiple collectives, a
|
||||
collective is first chosen in a simple round-robin way. Then a job
|
||||
from this collective is chosen by priority.
|
||||
|
||||
There are only two priorities: low and high. A simple *counting
|
||||
scheme* determines if a low prio or high prio job is selected
|
||||
next. The default is `4, 1`, meaning to first select 4 high priority
|
||||
jobs and then 1 low priority job, then starting over. If no such job
|
||||
exists, it falls back to the other priority.
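The counting scheme is easier to follow with a toy model. The Python sketch below covers a single collective only (the round-robin step between collectives is left out) and is one possible reading of the description above, not docspell's scheduler code; `pick_jobs` and the job names are invented. In particular, whether the counter advances when a fallback happens is an assumption of this sketch.

```python
from collections import deque
from itertools import cycle


def pick_jobs(high, low, scheme=(4, 1)):
    """Yield jobs from a high and a low priority queue following a counting scheme."""
    high, low = deque(high), deque(low)
    # Repeating pattern of preferred priorities, e.g. H H H H L for (4, 1).
    pattern = cycle(["high"] * scheme[0] + ["low"] * scheme[1])
    while high or low:
        preferred = next(pattern)
        if preferred == "high":
            queue = high if high else low  # fall back to the other priority
        else:
            queue = low if low else high
        yield queue.popleft()


order = list(pick_jobs(high=["h1", "h2", "h3", "h4", "h5", "h6"], low=["l1", "l2", "l3"]))
print(order)  # ['h1', 'h2', 'h3', 'h4', 'l1', 'h5', 'h6', 'l2', 'l3']
```

Low priority jobs are never starved: with `4, 1` at least every fifth slot goes to them, and they take over completely once no high priority job is waiting.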
|
||||
|
||||
The priority can be set on a *Source* (see
|
||||
[uploads](uploading.html)). Uploading through the web application will
|
||||
always use priority *high*. The idea is that while logged in, jobs are
|
||||
more important than those submitted when not logged in.
|
||||
|
||||
|
||||
## Scheduler Config
|
||||
|
||||
The relevant part of the config file regarding the scheduler is shown
|
||||
below with some explanations.
|
||||
|
||||
```
|
||||
docspell.joex {
|
||||
# other settings left out for brevity
|
||||
|
||||
scheduler {
|
||||
|
||||
# Number of processing allowed in parallel.
|
||||
pool-size = 2
|
||||
|
||||
# A counting scheme determines the ratio of how high- and low-prio
|
||||
# jobs are run. For example: 4,1 means run 4 high prio jobs, then
|
||||
# 1 low prio and then start over.
|
||||
counting-scheme = "4,1"
|
||||
|
||||
# How often a failed job should be retried until it enters failed
|
||||
# state. If a job fails, it becomes "stuck" and will be retried
|
||||
# after a delay.
|
||||
retries = 5
|
||||
|
||||
# The delay until the next try is performed for a failed job. This
|
||||
# delay is increased exponentially with the number of retries.
|
||||
retry-delay = "1 minute"
|
||||
|
||||
# The queue size of log statements from a job.
|
||||
log-buffer-size = 500
|
||||
|
||||
# If no job is left in the queue, the scheduler will wait until a
|
||||
# notify is requested (using the REST interface). To also retry
|
||||
# stuck jobs, it will notify itself periodically.
|
||||
wakeup-period = "30 minutes"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The `pool-size` setting determines how many jobs run in parallel. You
|
||||
need to play with this setting on your machine to find an optimal
|
||||
value.
|
||||
|
||||
The `counting-scheme` determines for all collectives how to select
|
||||
between high and low priority jobs; as explained above. It is
|
||||
currently not possible to define that per collective.
|
||||
|
||||
If a job fails, it will be set to *stuck* state and retried by the
|
||||
scheduler. The `retries` setting defines how many times a job is
|
||||
retried until it enters the final *failed* state. The scheduler waits
|
||||
some time until running the next try. This delay is given by
|
||||
`retry-delay`. This is the initial delay, the time until the first
|
||||
re-try (the second attempt). This time increases exponentially with
|
||||
the number of retries.
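To get a feeling for how long a failing job stays in the *stuck* state, the delays can be tabulated. The sketch below assumes that the delay doubles with every retry; the text only says "exponentially", so the factor of 2 is an assumption, and `retry_delays` is a hypothetical helper.

```python
from datetime import timedelta


def retry_delays(retries: int, initial: timedelta, factor: int = 2):
    """Delay before each retry; grows by `factor` each time (factor is assumed)."""
    return [initial * factor ** n for n in range(retries)]


delays = retry_delays(retries=5, initial=timedelta(minutes=1))
print([int(d.total_seconds() // 60) for d in delays])  # [1, 2, 4, 8, 16]
print(sum(delays, timedelta()))                        # 0:31:00
```

Under this assumption, the default values `retries = 5` and `retry-delay = "1 minute"` mean a job that keeps failing spends about half an hour waiting in total before it is marked *failed*.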
|
||||
|
||||
The jobs will log about what they do, which is picked up and stored
|
||||
into the database asynchronously. The log events are buffered in a
|
||||
queue and another thread will consume this queue and store them in the
|
||||
database. The `log-buffer-size` determines the size of the queue.
|
||||
|
||||
At last, there is a `wakeup-period` that determines at what interval
|
||||
the joex component notifies itself to look for new jobs. If jobs get
|
||||
stuck, and joex is not notified externally it could fail to
|
||||
retry. Also, since networks are not reliable, a notification may not
|
||||
reach a joex component. This periodic wakeup is just to ensure that
|
||||
jobs are eventually run.
|
||||
|
||||
|
||||
## Starting on demand
|
||||
|
||||
The job executor and rest server can be started multiple times. This
|
||||
is especially useful for the job executor. For example, when
|
||||
submitting a lot of files in a short time, you can simply start up more
|
||||
job executors on other computers on your network. Maybe use your
|
||||
laptop to help with processing for a while.
|
||||
|
||||
You have to make sure that all of them connect to the same database and that
|
||||
all have unique `app-id`s.
|
||||
|
||||
Once the files have been processed you can stop the additional
|
||||
executors.
|
||||
|
||||
## Shutting down
|
||||
|
||||
If a job executor is sleeping and not executing any jobs, you can just
|
||||
quit using SIGTERM or `Ctrl-C` when running in a terminal. But if
|
||||
there are jobs currently executing, it is advisable to initiate a
|
||||
graceful shutdown. The job executor will then stop taking new jobs
|
||||
from the queue but it will wait until all running jobs have completed
|
||||
before shutting down.
|
||||
|
||||
This can be done by sending a http POST request to the api of this job
|
||||
executor:
|
||||
|
||||
```
|
||||
curl -XPOST "http://localhost:7878/api/v1/shutdownAndExit"
|
||||
```
|
||||
|
||||
If joex receives this request it will immediately stop taking new jobs
|
||||
and it will quit when all running jobs are done.
|
||||
|
||||
If a job executor gets terminated while there are running jobs, the
|
||||
jobs are still in the current state marked to be executed by this job
|
||||
executor. In order to fix this, start the job executor again. It will
|
||||
search all jobs that are marked with its id and put them back into
|
||||
waiting state. Then send a graceful shutdown request as shown above.
|
87
modules/microsite/docs/doc/metadata.md
Normal file
@ -0,0 +1,87 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Adding Meta Data
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
## Meta Data
|
||||
|
||||
The processing can be controlled implicitly by the provided meta
|
||||
data. The *Meta Data* page allows you to manage this meta data. You can
|
||||
create the following:
|
||||
|
||||
- Tags
|
||||
- Organizations
|
||||
- Persons
|
||||
- Equipments
|
||||
|
||||
### Tags
|
||||
|
||||
Items can be tagged with multiple custom tags (aka labels). This
|
||||
allows you to describe many different workflows people may have with their
|
||||
documents.
|
||||
|
||||
A tag can have a *category*. This is meant to group tags together. For
|
||||
example, you may want to have a tag category *doctype* that is
|
||||
comprised of tags like *bill*, *contract*, *receipt* and so on. Or for
|
||||
workflows, a tag category *state* may exist that includes tags like
|
||||
*Todo* or *Waiting*. Or you can tag items with user names to provide
|
||||
"assignment" semantics. Docspell doesn't propose any workflow, but it
|
||||
can help to implement some.
|
||||
|
||||
The tags are *not* taken into account when processing. Docspell will
|
||||
not automatically associate tags to your items. The tags are only
|
||||
meant to be used manually.
|
||||
|
||||
|
||||
### Organization and Person
|
||||
|
||||
The organization entity represents a non-personal (organization or
|
||||
company) correspondent of an item. Docspell will choose one or more
|
||||
organizations when processing documents and associate the "best" match
|
||||
with your item.
|
||||
|
||||
The person entity can appear in two roles: It may be a correspondent
|
||||
or the person an item is about. So a person is either a correspondent
|
||||
or a concerning person. Docspell cannot know which person is which,
|
||||
therefore you need to tell this by checking the box "Use for
|
||||
concerning person suggestion only". If this is checked, docspell will
|
||||
use this person only to suggest a concerning person. Otherwise the
|
||||
person is used only for correspondent suggestions.
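The effect of the checkbox can be pictured as splitting the person list into two candidate pools. The Python sketch below models only that rule; the `Person` class, its `concerning_only` field and the sample names are invented for illustration and do not reflect docspell's data model.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    # Mirrors the "Use for concerning person suggestion only" checkbox.
    concerning_only: bool = False


def split_candidates(persons):
    """Return (correspondent candidates, concerning person candidates)."""
    correspondents = [p.name for p in persons if not p.concerning_only]
    concerning = [p.name for p in persons if p.concerning_only]
    return correspondents, concerning


people = [Person("Jane Clerk"), Person("John Smith", concerning_only=True)]
print(split_candidates(people))  # (['Jane Clerk'], ['John Smith'])
```

A person therefore never shows up in both suggestion lists; if someone plays both roles, this rule implies creating two entries.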
|
||||
|
||||
Document processing uses the following properties:
|
||||
|
||||
- name
|
||||
- websites
|
||||
- e-mails
|
||||
|
||||
The websites and e-mails can be added as contact information. If these
|
||||
three are present, you should get good matches from docspell. All
|
||||
other fields of an organization and person are not used during
|
||||
document processing. They might be useful when using this as a real
|
||||
address book.
|
||||
|
||||
|
||||
### Equipment
|
||||
|
||||
The equipment entity is almost like a tag. In fact, it could be
|
||||
replaced by a tag with a specific known category. The difference is
|
||||
that docspell will try to find a match and associate it with your
|
||||
item. The equipment represents non-personal things that an item is
|
||||
about. Examples are: bills or insurances for *cars*, contracts for
|
||||
*houses* or *flats*.
|
||||
|
||||
Equipments don't have contact information, so the only property that
|
||||
is used to find matches during document processing is its name.
|
||||
|
||||
|
||||
## Document Language
|
||||
|
||||
An important setting is the language of your documents. This helps OCR
|
||||
and text analysis. You can select between English and German
|
||||
currently.
|
||||
|
||||
Go to the *Collective Settings* page and click *Document
|
||||
Language*. This will set the language for all your documents. It is
|
||||
not (yet) possible to specify it when uploading.
|
40
modules/microsite/docs/doc/processing.md
Normal file
@ -0,0 +1,40 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Processing Queue
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
|
||||
The page *Processing Queue* shows the current state of document
|
||||
processing for your uploads.
|
||||
|
||||
At the top of the page a list of running jobs is shown. Below that,
|
||||
the left column shows jobs that wait to be picked up by the job
|
||||
executor. On the right are finished jobs. The number of finished jobs
|
||||
is cut to some maximum and is also restricted by a date range. The
|
||||
page refreshes itself automatically to show the progress.
|
||||
|
||||
Example screenshot:
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/processing-queue.jpg">
|
||||
</div>
|
||||
|
||||
You can cancel running jobs or remove waiting ones from the queue. If
|
||||
you click on the small file symbol on finished jobs, you can inspect
|
||||
its log messages again. A running job displays the job executor id
|
||||
that executes the job.
|
||||
|
||||
Currently the job queue executes just the document processing tasks,
|
||||
but it may be used for other long running tasks in the future.
|
||||
|
||||
Since job executors are shared among all collectives, it may happen
|
||||
that a job waits for some time until it is picked up by a job
|
||||
executor. You can always start more job executors to help out.
|
||||
|
||||
If a job fails, it is retried after some time. Only if it fails too
|
||||
often (can be configured), it then is finished with *failed* state. If
|
||||
processing finally fails, the item is still created, just without
|
||||
suggestions. But if processing is cancelled by the user, the item is
|
||||
not created.
|
187
modules/microsite/docs/doc/tools.md
Normal file
@ -0,0 +1,187 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Tools
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
The `tools/` folder contains some scripts and other resources intended
|
||||
for integrating docspell.
|
||||
|
||||
## consumedir
|
||||
|
||||
The `consumedir.sh` is a bash script that works in two modes:
|
||||
|
||||
- Go through all files in given directories (non recursively) and send
|
||||
each to docspell.
|
||||
- Watch one or more directories for new files and upload them to
|
||||
docspell.
|
||||
|
||||
It can watch or go through one or more directories. Files can be
|
||||
uploaded to multiple urls.
|
||||
|
||||
Run the script with the `-h` option to see a short help text. The
|
||||
help text will also show the values for any given option.
|
||||
|
||||
The script requires `curl` for uploading. It requires the
|
||||
`inotifywait` command if directories should be watched for new
|
||||
files. If the `-m` option is used, the script will skip duplicate
|
||||
files. For this the `sha256sum` command is required.
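
One way duplicate detection based on `sha256sum` could work is sketched below. This is only an illustration and not the code from `consumedir.sh`: the function name `is_new_file` and the idea of keeping one empty marker file per checksum in the directory given to `-m` are assumptions.

```bash
# Hypothetical sketch: decide whether a file was seen before by its SHA-256.
# Returns 0 if the file is new (and records it), 1 if it is a duplicate,
# 2 if the checksum could not be computed.
is_new_file() {
  local marker_dir="$1" file="$2" sum
  sum="$(sha256sum "$file" 2>/dev/null | cut -d' ' -f1)"
  [ -n "$sum" ] || return 2
  mkdir -p "$marker_dir"
  if [ -e "$marker_dir/$sum" ]; then
    return 1            # same content was already recorded
  fi
  : > "$marker_dir/$sum"  # remember this checksum with an empty marker file
  return 0
}

# Possible use inside an upload loop (paths are placeholders):
#   is_new_file /var/run/consumedir "$f" && echo "would upload $f"
```

Because the check uses the file content rather than the file name, a renamed copy of an already uploaded file would be skipped as well.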
|
||||
|
||||
Example for watching two directories:
|
||||
|
||||
``` bash
|
||||
./tools/consumedir.sh --path ~/Downloads --path ~/pdfs -m /var/run/consumedir -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
```
|
||||
|
||||
The script by default watches the given directories. If the `-o`
|
||||
option is used, it will instead go through these directories and
|
||||
upload all pdf files in there.
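
To get a feeling for which files the one-shot mode would pick up, the selection can be approximated with `find`. This is a rough sketch under the assumption that only regular files ending in `.pdf` directly inside the given directories are considered; `list_pdfs` is a hypothetical helper and not part of `consumedir.sh`.

```bash
# Hypothetical sketch: list PDF files in the given directories, non-recursively.
list_pdfs() {
  local dir
  for dir in "$@"; do
    # -maxdepth 1 keeps find from descending into subdirectories
    find "$dir" -maxdepth 1 -type f -name '*.pdf' | sort
  done
}

# Example (directories are placeholders):
#   list_pdfs ~/Downloads ~/pdfs
```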
|
||||
|
||||
Example for uploading all immediately (the same as above only with `-o`
|
||||
added):
|
||||
|
||||
``` bash
|
||||
./tools/consumedir.sh -o --path ~/Downloads --path ~/pdfs/ -m /var/run/consumedir -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
```
|
||||
|
||||
|
||||
### Systemd
|
||||
|
||||
The script can be used with systemd to run as a service. This is an
|
||||
example unit file:
|
||||
|
||||
```
|
||||
[Unit]
|
||||
After=network.target
|
||||
Description=Docspell Consumedir
|
||||
|
||||
[Service]
|
||||
Environment="PATH=/set/a/path"
|
||||
|
||||
ExecStartPre=/bin/sh -c "mkdir -p /var/run/consumedir && chown -R someuser /var/run/consumedir"
|
||||
ExecStart=/bin/su -s /bin/bash someuser -c "consumedir.sh --path '/a/path/' -m '/var/run/consumedir' 'http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ'"
|
||||
```
|
||||
|
||||
This unit file is just an example, it needs some fiddling. It assumes
|
||||
an existing user `someuser` that is used to run this service. The url
|
||||
`http://localhost:7880/api/v1/open/upload/...` is an anonymous upload
|
||||
url as described [here](./uploading.html).
|
||||
|
||||
|
||||
## ds.sh
|
||||
|
||||
A bash script to quickly upload files from the command line. It reads
|
||||
a configuration file containing the URLs to upload to. Then each file
|
||||
given to the script will be uploaded to all URLs in the config.
|
||||
|
||||
The config file is expected in
|
||||
`$XDG_CONFIG_HOME/docspell/ds.conf`. `$XDG_CONFIG_HOME` defaults to
|
||||
`~/.config`.
|
||||
|
||||
The config file contains lines with key-value pairs, separated by an
|
||||
`=` sign. Lines starting with `#` are ignored. Example:
|
||||
|
||||
```
|
||||
# Config file
|
||||
url.1 = http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
url.2 = http://localhost:7880/api/v1/open/upload/item/6DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
```
|
||||
|
||||
The key must start with `url`.
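
The format rules above can be illustrated with a small parser. This is a minimal sketch and not the parsing code of `ds.sh`; the function name `read_upload_urls` and the whitespace trimming are assumptions made for the example.

```bash
# Hypothetical sketch: print every upload URL found in a ds.conf style file.
# The default path follows the location described above.
read_upload_urls() {
  local conf="${1:-${XDG_CONFIG_HOME:-$HOME/.config}/docspell/ds.conf}"
  local line key value
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      '#'*) continue ;;   # comment line
      *=*) ;;             # key-value pair, handled below
      *) continue ;;      # anything else is ignored
    esac
    key="${line%%=*}"     # text before the first '='
    value="${line#*=}"    # text after the first '='
    key="$(printf '%s' "$key" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')"
    value="$(printf '%s' "$value" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')"
    case "$key" in
      url*) printf '%s\n' "$value" ;;   # only keys starting with "url" count
    esac
  done < "$conf"
}
```

Splitting at the first `=` only keeps URLs intact that contain a `=` themselves, for example in a query string.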
|
||||
|
||||
### Usage
|
||||
|
||||
The `-h` option shows a help overview.
|
||||
|
||||
The script takes a list of files as arguments. It checks the file
|
||||
types and will raise an error (and quit) if a file is included that is
|
||||
not a PDF. The `-s` option can be used to skip them instead.
|
||||
|
||||
The `-c` option allows specifying a different config file.
|
||||
|
||||
Example:
|
||||
|
||||
``` bash
|
||||
./ds.sh ~/Downloads/*.pdf
|
||||
```
|
||||
|
||||
|
||||
## Webextension for Docspell
|
||||
|
||||
Idea: Inside the browser click on a PDF and send it to docspell. It is
|
||||
downloaded in the context of your current page. Then handed to an
|
||||
application that pushes it to docspell. There is a browser add-on
|
||||
implementing this in `tools/webextension`. This add-on only works with
|
||||
firefox.
|
||||
|
||||
### Install
|
||||
|
||||
This is a bit complicated, since you need to install external tools
|
||||
and the web extension. Both work together.
|
||||
|
||||
#### Install `ds.sh`
|
||||
|
||||
First install the `ds.sh` tool somewhere, maybe `/usr/local/bin` as
|
||||
described above.
|
||||
|
||||
|
||||
#### Install the native part
|
||||
|
||||
Then install the "native" part of the web extension:
|
||||
|
||||
Copy or symlink the `native.py` script into some known location. For
|
||||
example:
|
||||
|
||||
``` bash
|
||||
ln -s ~/docspell-checkout/tools/webextension/native/native.py /usr/local/share/docspell/native.py
|
||||
```
|
||||
|
||||
Then copy the `app_manifest.json` to
|
||||
`$HOME/.mozilla/native-messaging-hosts/docspell.json`. For example:
|
||||
|
||||
``` bash
|
||||
cp ~/docspell-checkout/tools/webextension/native/app_manifest.json ~/.mozilla/native-messaging-hosts/docspell.json
|
||||
```
|
||||
|
||||
See
|
||||
[here](https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/Native_manifests#Manifest_location)
|
||||
for details.
|
||||
|
||||
And you might want to modify this json file, so the path to the
|
||||
`native.py` script is correct (it must be absolute).
|
||||
|
||||
If the `ds.sh` script is in your `$PATH`, then this should
|
||||
work. Otherwise, edit the `native.py` script and change the path to
|
||||
the tool. Or create a file `$HOME/.config/docspell/ds.cmd` whose
|
||||
content is the path to the `ds.sh` script.
|
||||
|
||||
|
||||
#### Install the extension
|
||||
|
||||
An extension file can be built using the `make-xpi.sh` script. But
|
||||
installing it in "standard" firefox won't work, because [Mozilla
|
||||
requires extensions to be signed by
|
||||
them](https://wiki.mozilla.org/Add-ons/Extension_Signing). This means
|
||||
creating an account and going through some process…. So here are two
|
||||
alternatives:
|
||||
|
||||
1. Open firefox and type `about:debugging` in the address bar. Then
|
||||
click on *'Load Temporary Add-on...'* and select the
|
||||
`manifest.json` file. The extension is now installed. The downside
|
||||
is that the extension will be removed once firefox is closed.
|
||||
2. Use Firefox ESR, which allows installing add-ons not signed by
|
||||
Mozilla. But it has to be configured: Open firefox and type
|
||||
`about:config` in the address bar. Search for key
|
||||
`xpinstall.signatures.required` and set it to `false`. This is
|
||||
described in the last paragraph of [this
|
||||
page](https://support.mozilla.org/en-US/kb/add-on-signing-in-firefox).
|
||||
|
||||
When you right click on a file link, there should be a context menu
|
||||
entry *'Docspell Upload Helper'*. The add-on will download this file
|
||||
using the browser and then send the file path to the `native.py`
|
||||
script. This script will in turn call `ds.sh` which finally uploads it
|
||||
to your configured URLs.
|
||||
|
||||
Open the Add-ons page (`Ctrl`+`Shift`+`A`), the new add-on should be
|
||||
there.
|
130
modules/microsite/docs/doc/uploading.md
Normal file
@ -0,0 +1,130 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Uploads
|
||||
---
|
||||
|
||||
# {{page.title}}
|
||||
|
||||
|
||||
This page describes how files can get into docspell. Technically,
|
||||
there is just one way: via http multipart/form-data requests.
|
||||
|
||||
|
||||
## Authenticated Upload
|
||||
|
||||
From within the web application there is the "Upload Files"
|
||||
page. There you can select multiple files to upload. You can also
|
||||
specify whether these files should become one item or if every file is
|
||||
a separate item.
|
||||
|
||||
When you click "Submit" the files are uploaded and stored in the
|
||||
database. Then the job executor(s) are notified which immediately
|
||||
start processing them.
|
||||
|
||||
Go to the top-right menu and click "Processing Queue" to see the
|
||||
current state.
|
||||
|
||||
This obviously requires an authenticated user. While this is handy for
|
||||
ad-hoc uploads, it is very inconvenient for automating it by custom
|
||||
scripts. For this the next variant exists.
|
||||
|
||||
## Anonymous Upload
|
||||
|
||||
It is also possible to upload files without authentication. This
|
||||
should make tools that interact with docspell much easier to write.
|
||||
|
||||
|
||||
### Creating Anonymous Uploads
|
||||
|
||||
Go to "Collective Settings" and then to the "Source" tab. A *Source*
|
||||
identifies an endpoint where files can be uploaded
|
||||
anonymously. Creating a new source creates a long unique id which is
|
||||
part of a url that can be used to upload files. You can choose any
|
||||
time to deactivate or delete the source at which point uploading is
|
||||
not possible anymore. The idea is to give this URL away safely. You
|
||||
can delete it any time and no passwords or secrets are visible, even
|
||||
your username is not visible.
|
||||
|
||||
Example screenshot:
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/sources-form.jpg">
|
||||
</div>
|
||||
|
||||
This example shows a source with name "test". It defines two urls:
|
||||
|
||||
- `/app/index.html#/upload/<id>`
|
||||
- `/api/v1/open/upload/item/<id>`
|
||||
|
||||
The first points to a web page where everyone could upload files into
|
||||
your account. You could give this url to people for sending files
|
||||
directly into your docspell.
|
||||
|
||||
The second url is the API url, which accepts the requests to upload
|
||||
files (which is used by the first url).
|
||||
|
||||
For example, this url can be used to upload files with curl:
|
||||
|
||||
``` bash
|
||||
$ curl -XPOST -F file=@test.pdf http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
{"success":true,"message":"Files submitted."}
|
||||
```
|
||||
|
||||
You could add more `-F file=@/path/to/your/file.pdf` to upload
|
||||
multiple files (note, the `@` is required by curl, so it knows that
|
||||
the following is a file).
|
||||
|
||||
When files are uploaded to a source endpoint, the items resulting
|
||||
from these uploads are marked with the name of the source. So you know
|
||||
which source an item originated from.
|
||||
|
||||
If files are uploaded using the web application's *Upload files* page,
|
||||
the source is implicitly set to `webapp`. If you also want to let
|
||||
docspell count the files uploaded through the web interface, just
|
||||
create a source (can be inactive) with that name (`webapp`).
|
||||
|
||||
|
||||
## The Request
|
||||
|
||||
This gives more details about the request for uploads. It is a http
|
||||
`multipart/form-data` request, with two possible fields:
|
||||
|
||||
- meta
|
||||
- file
|
||||
|
||||
The `file` field can appear multiple times and is required at least
|
||||
once. It is the part containing the file to upload.
|
||||
|
||||
The `meta` part is completely optional and can define additional meta
|
||||
data that docspell uses to create items from the given files. It
|
||||
allows transferring structured information together with the
|
||||
unstructured binary files.
|
||||
|
||||
The `meta` content must be `application/json` containing this
|
||||
structure:
|
||||
|
||||
```
|
||||
{ multiple: Bool
|
||||
, direction: Maybe String
|
||||
}
|
||||
```
|
||||
|
||||
The `multiple` property is by default `true`. It means that each file
|
||||
in the upload request corresponds to a single item. An upload with 5
|
||||
files will result in 5 items created. If it is `false`, then docspell
|
||||
will create just one item that will then contain all files.
|
||||
|
||||
Furthermore, the direction of the document (one of `incoming` or
|
||||
`outgoing`) can be given. It is optional; it can be left out or
|
||||
`null`.
|
||||
|
||||
This kind of request is very common and most programming languages
|
||||
have support for this. For example, here is another curl command
|
||||
uploading two files with meta data:
|
||||
|
||||
```
|
||||
curl -XPOST -F meta='{"multiple":false, "direction": "outgoing"}' \
|
||||
-F file=@letter-en-source.pdf \
|
||||
-F file=@letter-de-source.pdf \
|
||||
http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
```
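
When the `meta` part is assembled in a script, it can help to generate and validate the JSON in one place. The following is a sketch only: `upload_meta` is a hypothetical helper, and it simply encodes the two properties described above (`multiple` as a boolean, `direction` as `incoming`, `outgoing` or `null`).

```bash
# Hypothetical sketch: build the JSON for the "meta" form field.
# Usage: upload_meta <true|false> [incoming|outgoing]
upload_meta() {
  local multiple="$1" direction="${2:-}"
  case "$multiple" in
    true|false) ;;
    *) echo "multiple must be true or false" >&2; return 1 ;;
  esac
  case "$direction" in
    '')
      # no direction given: send null, which the description above allows
      printf '{"multiple":%s,"direction":null}\n' "$multiple" ;;
    incoming|outgoing)
      printf '{"multiple":%s,"direction":"%s"}\n' "$multiple" "$direction" ;;
    *)
      echo "direction must be incoming or outgoing" >&2; return 1 ;;
  esac
}

# Possible use with curl (URL and file names are placeholders):
#   curl -XPOST -F meta="$(upload_meta false outgoing)" \
#        -F file=@letter-en-source.pdf "$UPLOAD_URL"
```

Rejecting unknown values before the request is sent avoids uploading files with meta data the server may not accept.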
|
49
modules/microsite/docs/getit.md
Normal file
@ -0,0 +1,49 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Quickstart
|
||||
---
|
||||
|
||||
## Download
|
||||
|
||||
You can download pre-compiled binaries from the [Release
|
||||
Page](https://github.com/eikek/docspell/releases). There are `deb`
|
||||
packages and generic zip files.
|
||||
|
||||
You need to download the two files:
|
||||
|
||||
- [docspell-restserver-{{site.version}}.zip](https://github.com/eikek/docspell/releases/download/v{{site.version}}/docspell-restserver-{{site.version}}.zip)
|
||||
- [docspell-joex-{{site.version}}.zip](https://github.com/eikek/docspell/releases/download/v{{site.version}}/docspell-joex-{{site.version}}.zip)
|
||||
|
||||
|
||||
## Prerequisite
|
||||
|
||||
Install Java (use your package manager or look
|
||||
[here](https://adoptopenjdk.net/)),
|
||||
[tesseract](https://github.com/tesseract-ocr/tesseract),
|
||||
[ghostscript](http://pages.cs.wisc.edu/~ghost/) and possibly
|
||||
[unpaper](https://github.com/Flameeyes/unpaper). The last is not
|
||||
really required, but improves OCR.
|
||||
|
||||
|
||||
## Running
|
||||
|
||||
1. Unzip both files:
|
||||
``` bash
|
||||
$ unzip docspell-*.zip
|
||||
```
|
||||
2. Open two terminal windows and navigate to the directory
|
||||
containing the zip files.
|
||||
3. Start both components executing:
|
||||
``` bash
|
||||
$ ./docspell-restserver*/bin/docspell-restserver
|
||||
```
|
||||
in one terminal and
|
||||
``` bash
|
||||
$ ./docspell-joex*/bin/docspell-joex
|
||||
```
|
||||
in the other.
|
||||
4. Point your browser to: <http://localhost:7880/app/index.html>
|
||||
5. Register a new account, sign in and try it.
|
||||
|
||||
Check the [documentation](doc.html) for more information on how to use
|
||||
docspell.
|
46
modules/microsite/docs/index.md
Normal file
@ -0,0 +1,46 @@
|
||||
---
|
||||
layout: homeFeatures
|
||||
features:
|
||||
- first: ["Stow documents away", "Most of the time documents (emails, postal mail) are received or created. It should be fast to stow them away, knowing that they can be found if necessary."]
|
||||
- second: ["Semi-Automatic Tagging", "Documents are analyzed and tagged automatically. “Semi–”, because it may not always be correct; results can be reviewed and corrected."]
|
||||
- third: ["Find them", "If there is a document needed, you can search for it. Usually, restricting to a date range and a correspondent will result in only a few documents to sift through. Alternatively, you can add your own tags, names etc to better match your workflow."]
|
||||
---
|
||||
|
||||
# A Document Organizer
|
||||
|
||||
Docspell is a simple tool to cope with your piles of (digitized) paper
|
||||
documents. You'll need a scanner to convert your papers into PDF
|
||||
files. Docspell can then assist in organizing the resulting PDF files
|
||||
easily. Its main goal is to efficiently support two major use cases:
|
||||
|
||||
1. **Stowing documents away**: Most of the time documents are received
|
||||
or created. It should be *fast* to stow them away, knowing that
|
||||
they can be found if necessary.
|
||||
|
||||
Upload the PDF files to docspell. Docspell finds meta data and will
|
||||
link them to your document, automatically. There may be false
|
||||
positives, so a short review is recommended. Though even if not,
|
||||
the results are not that bad.
|
||||
2. **Finding them**: If there is a document needed, you can search for
|
||||
it. Usually, restricting to a date range and a correspondent will
|
||||
result in only a few documents to sift through. Alternatively, you
|
||||
can add your own tags, names etc to better match your workflow.
|
||||
|
||||
The meta data that docspell uses is provided by you. You need to
|
||||
maintain a list of correspondents and maybe other things you want
|
||||
docspell to draw suggestions from. So if a new document arrives (from
|
||||
an unknown correspondent) then you would add a new entry to your meta
|
||||
data and link it manually to the document. But the next time, docspell
|
||||
will do it for you.
|
||||
|
||||
Docspell is *not* a document management system. There exist a lot of
|
||||
these systems that have many more features. Docspell's focus is around
|
||||
the two use cases described above, which already is quite useful.
|
||||
|
||||
Check out the quick [demo](demo.html) to get a first impression and the
|
||||
[quickstart](getit.html) page if you want to try it out.
|
||||
|
||||
## License
|
||||
|
||||
This project is distributed under the
|
||||
[GPLv3](http://www.gnu.org/licenses/gpl-3.0.html)
|