Initial version.

Features:

- Upload PDF files and let them be analyzed
- Manage meta data and items
- See processing in webapp
2025-06-22 02:18:26 +00:00 · 2019-07-23 00:53:30 +02:00
parent 6154e6a387
commit 831cd8b655
341 changed files with 23634 additions and 484 deletions
--- a/modules/microsite/src/main/resources/microsite/css/docspell.css
+++ b/modules/microsite/src/main/resources/microsite/css/docspell.css
@@ -0,0 +1,18 @@
+.jumbotron {
+    background: url(../img/back-master-small.jpg);
+    background-repeat: no-repeat;
+    background-size: 100% 800px;
+}
+
+.content-wrapper h1, .h1 {
+    border-bottom: 1px solid #d8dfe5;
+    padding-bottom: 0.8rem;
+}
+
+body {
+    font-size: 1.75em;
+}
+
+h4 {
+    text-decoration: underline;
+}
--- a/modules/microsite/src/main/resources/microsite/data/menu.yml
+++ b/modules/microsite/src/main/resources/microsite/data/menu.yml
@@ -0,0 +1,48 @@
+options:
+  - title: Home
+    url: index.html
+
+  - title: Getit
+    url: getit.html
+
+  - title: Documentation
+    url: doc.html
+
+    nested_options:
+     - title: Installation
+       url: doc/install.html
+
+     - title: Configuring
+       url: doc/configure.html
+
+     - title: Adding Meta Data
+       url: doc/metadata.html
+
+     - title: Uploads
+       url: doc/uploading.html
+
+     - title: Processing Queue
+       url: doc/processing.html
+
+     - title: Find and Review
+       url: doc/curate.html
+
+     - title: Joex
+       url: doc/joex.html
+
+  - title: Development
+    url: dev.html
+
+    nested_options:
+      - title: ADRs
+        url: dev/adr.html
+
+  - title: Api
+    url: api.html
+
+    nested_options:
+     - title: REST Api Doc
+       url: openapi/docspell-openapi.html
+
+     - title: REST OpenApi Spec
+       url: openapi/docspell-openapi.yml
--- a/modules/microsite/src/main/resources/microsite/img/back-master-small.jpg
+++ b/modules/microsite/src/main/resources/microsite/img/back-master-small.jpg
--- a/modules/microsite/src/main/resources/microsite/img/docspell-curate-1.jpg
+++ b/modules/microsite/src/main/resources/microsite/img/docspell-curate-1.jpg
--- a/modules/microsite/src/main/resources/microsite/img/docspell-curate-2.jpg
+++ b/modules/microsite/src/main/resources/microsite/img/docspell-curate-2.jpg
--- a/modules/microsite/src/main/resources/microsite/img/docspell-curate-3.jpg
+++ b/modules/microsite/src/main/resources/microsite/img/docspell-curate-3.jpg
--- a/modules/microsite/src/main/resources/microsite/img/docspell-curate-4.jpg
+++ b/modules/microsite/src/main/resources/microsite/img/docspell-curate-4.jpg
--- a/modules/microsite/src/main/resources/microsite/img/docspell-curate-5.jpg
+++ b/modules/microsite/src/main/resources/microsite/img/docspell-curate-5.jpg
--- a/modules/microsite/src/main/resources/microsite/img/docspell-curate-6.jpg
+++ b/modules/microsite/src/main/resources/microsite/img/docspell-curate-6.jpg
--- a/modules/microsite/src/main/resources/microsite/img/docspell-demo.gif
+++ b/modules/microsite/src/main/resources/microsite/img/docspell-demo.gif
--- a/modules/microsite/src/main/resources/microsite/img/docspell-try.gif
+++ b/modules/microsite/src/main/resources/microsite/img/docspell-try.gif
--- a/modules/microsite/src/main/resources/microsite/img/favicon.png
+++ b/modules/microsite/src/main/resources/microsite/img/favicon.png
@@ -0,0 +1 @@
+../../../../../../webapp/src/main/webjar/favicon/android-icon-96x96.png
--- a/modules/microsite/src/main/resources/microsite/img/navbar_brand.png
+++ b/modules/microsite/src/main/resources/microsite/img/navbar_brand.png
--- a/modules/microsite/src/main/resources/microsite/img/navbar_brand2x.png
+++ b/modules/microsite/src/main/resources/microsite/img/navbar_brand2x.png
--- a/modules/microsite/src/main/resources/microsite/img/processing-queue.jpg
+++ b/modules/microsite/src/main/resources/microsite/img/processing-queue.jpg
--- a/modules/microsite/src/main/resources/microsite/img/sidebar_brand.png
+++ b/modules/microsite/src/main/resources/microsite/img/sidebar_brand.png
--- a/modules/microsite/src/main/resources/microsite/img/sidebar_brand2x.png
+++ b/modules/microsite/src/main/resources/microsite/img/sidebar_brand2x.png
--- a/modules/microsite/src/main/resources/microsite/img/sources-form.jpg
+++ b/modules/microsite/src/main/resources/microsite/img/sources-form.jpg
--- a/modules/microsite/src/main/resources/microsite/img/wand-white.png
+++ b/modules/microsite/src/main/resources/microsite/img/wand-white.png
--- a/modules/microsite/src/main/tut/api.md
+++ b/modules/microsite/src/main/tut/api.md
@@ -0,0 +1,92 @@
+---
+layout: docs
+position: 5
+title: Api
+---
+
+# {{page.title}}
+
+Docspell is designed as a REST server that uses JSON to exchange
+data. The REST api can be used to integrate docspell into your
+workflow.
+
+[Docspell REST Api Doc](openapi/docspell-openapi.html)
+
+The "raw" `openapi.yml` specification file can be found
+[here](openapi/docspell-openapi.yml).
+
+The routes can be divided into protected and unprotected routes. The
+unprotected, or open, routes are at `/open/*` while the protected
+routes are at `/sec/*`. Open routes don't require authenticated access
+and can be used by any user. The protected routes require an
+authenticated user.
+
+## Authentication
+
+The unprotected route `/open/auth/login` can be used to log in with
+account name and password. The response contains a token that can be
+used for accessing protected routes. The token is only valid for a
+restricted time which can be configured (default is 5 minutes).
+
+New tokens can be generated using an existing valid token and the
+protected route `/sec/auth/session`. This will return the same
+response as above, giving a new token.
+
+This token can be added to requests in two ways: as a cookie header or
+a "normal" http header. If a cookie header is used, the cookie name
+must be `docspell_auth`; a custom header must be named
+`X-Docspell-Auth`.
+
+## Live Api
+
+Besides the statically generated documentation on this site, the rest
+server provides Swagger-generated api documentation that allows
+playing around with the api. It requires a running docspell rest
+server. If it is deployed at `http://localhost:7880`, then check this
+url:
+
+```
+http://localhost:7880/app/doc
+```
+
+## Examples
+
+These examples use the great command line tool
+[curl](https://curl.haxx.se/).
+
+### Login
+
+```
+$ curl -X POST -d '{"account": "smith", "password": "test"}' http://localhost:7880/api/v1/open/auth/login
+{"collective":"smith"
+,"user":"smith"
+,"success":true
+,"message":"Login successful"
+,"token":"1568142350115-ZWlrZS9laWtl-$2a$10$rGZUFDAVNIKh4Tj6u6tlI.-O2euwCvmBT0TlyDmIHR1ZsLQPAI="
+,"validMs":300000
+}
+```
+
+### Get new token
+
+```
+$ curl -XPOST -H 'X-Docspell-Auth: 1568142350115-ZWlrZS9laWtl-$2a$10$rGZUFDAVNIKh4Tj6u6tlI.-O2euwCvmBT0TlyDmIHR1ZsLQPAI=' http://localhost:7880/api/v1/sec/auth/session
+{"collective":"smith"
+,"user":"smith"
+,"success":true
+,"message":"Login successful"
+,"token":"1568142446077-ZWlrZS9laWtl-$2a$10$3B0teJ9rMpsBJPzHfZZPoO-WeA1bkfEONBN8fyzWE8DeaAHtUc="
+,"validMs":300000
+}
+```
+
+### Get some insights
+
+```
+$ curl -H 'X-Docspell-Auth: 1568142446077-ZWlrZS9laWtl-$2a$10$3B0teJ9rMpsBJPzHfZZPoO-WeA1bkfEONBN8fyzWE8DeaAHtUc=' http://localhost:7880/api/v1/sec/collective/insights
+{"incomingCount":3
+,"outgoingCount":1
+,"itemSize":207310
+,"tagCloud":{"items":[]}
+}
+```
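
Note on `api.md`: the authentication flow described there (log in, send the token as a header or cookie, renew it before `validMs` elapses) can be sketched in Python. This is only an illustration and not code from the repository. The helper names `login_request`, `auth_headers` and `needs_refresh`, as well as the 30 second safety margin, are assumptions; the routes, header name and cookie name are taken from the text.

```python
import json

AUTH_HEADER = "X-Docspell-Auth"   # custom header name given in api.md
AUTH_COOKIE = "docspell_auth"     # cookie name given in api.md


def login_request(base_url, account, password):
    """Describe the login call as (method, url, body); nothing is sent."""
    body = json.dumps({"account": account, "password": password})
    return ("POST", f"{base_url}/api/v1/open/auth/login", body)


def auth_headers(token, use_cookie=False):
    """Return the headers that carry the token on a protected route."""
    if use_cookie:
        return {"Cookie": f"{AUTH_COOKIE}={token}"}
    return {AUTH_HEADER: token}


def needs_refresh(received_at_ms, valid_ms, now_ms, margin_ms=30_000):
    """True once the token is within margin_ms of expiring (hypothetical policy)."""
    return now_ms >= received_at_ms + valid_ms - margin_ms
```

A client would call `/sec/auth/session` with `auth_headers(token)` whenever `needs_refresh` returns true and replace its token with the one from the response.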
--- a/modules/microsite/src/main/tut/demo.md
+++ b/modules/microsite/src/main/tut/demo.md
@@ -0,0 +1,16 @@
+---
+layout: home
+position: 2
+section: demo
+title: Demo
+technologies:
+ - first: ["Scala + Elm", "Backend is in Scala with Cats/Fs2, Webapp in Elm"]
+ - second: ["Unpaper + Tesseract", "Text is extracted using OCR provided by tesseract"]
+ - third: ["Stanford NLP", "Documents are analyzed using Stanford NLP classifiers"]
+---
+
+# {{ page.title }}
+
+
+
+<img width="100%" src="img/docspell-demo.gif" title="Demo">
--- a/modules/microsite/src/main/tut/dev.md
+++ b/modules/microsite/src/main/tut/dev.md
@@ -0,0 +1,86 @@
+---
+layout: docs
+title: Development
+---
+
+
+# {{page.title}}
+
+
+## Building
+
+[Sbt](https://scala-sbt.org) is used to build the application. Clone
+the sources and run:
+
+- `make` to compile all sources (Elm + Scala)
+- `make-zip` to create zip packages
+- `make-deb` to create debian packages
+
+The zip files can be found afterwards in:
+
+```
+modules/restserver/target/universal
+modules/joex/target/universal
+```
+
+
+## Starting Servers with `reStart`
+
+When developing, it's very convenient to use the [revolver sbt
+plugin](https://github.com/spray/sbt-revolver). Start the sbt console
+and then run:
+
+```
+sbt:docspell-root> restserver/reStart
+```
+
+This starts a REST server. Once it has started up, type:
+
+```
+sbt:docspell-root> joex/reStart
+```
+
+if a joex component is also required. Prefixing the commands with `~`
+results in recompile+restart once a source file is modified.
+
+
+## Custom config file
+
+The sbt build is set up such that a file `dev.conf` in the root of the
+source tree is picked up as config file, if it exists. So you can
+create a custom config file for development. For example, a custom
+database for development may be set up this way:
+
+```
+#jdbcurl = "jdbc:h2:///home/dev/workspace/projects/docspell/local/docspell-demo.db;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
+jdbcurl = "jdbc:postgresql://localhost:5432/docspelldev"
+#jdbcurl = "jdbc:mariadb://localhost:3306/docspelldev"
+
+docspell.server {
+  backend {
+    jdbc {
+      url = ${jdbcurl}
+      user = "dev"
+      password = "dev"
+    }
+  }
+}
+
+docspell.joex {
+  jdbc {
+    url = ${jdbcurl}
+    user = "dev"
+    password = "dev"
+  }
+  scheduler {
+    pool-size = 1
+  }
+}
+```
+
+## ADRs
+
+Some early information about certain details can be found in the few
+[ADR](https://adr.github.io/) that exist:
+
+- [ADRs](dev/adr.html)
--- a/modules/microsite/src/main/tut/dev/adr.md
+++ b/modules/microsite/src/main/tut/dev/adr.md
@@ -0,0 +1,12 @@
+---
+layout: docs
+title: ADRs
+---
+
+# ADR
+
+- [0001 Components](adr/0001_components.html)
+- [0002 Component Interaction](adr/0002_component_interaction.html)
+- [0003 Encryption](adr/0003_encryption.html)
+- [0004 ISO8601 vs Unix](adr/0004_iso8601vsEpoch.html)
+- [0005 Job Executor](adr/0005_job-executor.html)
--- a/modules/microsite/src/main/tut/dev/adr/0000_use_markdown_architectural_decision_records.md
+++ b/modules/microsite/src/main/tut/dev/adr/0000_use_markdown_architectural_decision_records.md
@@ -0,0 +1,33 @@
+# Use Markdown Architectural Decision Records
+
+## Context and Problem Statement
+
+We want to [record architectural decisions](https://adr.github.io/)
+made in this project.  Which format and structure should these records
+follow?
+
+## Considered Options
+
+* [MADR](https://adr.github.io/madr/) 2.1.0 - The Markdown Architectural Decision Records
+* [Michael Nygard's template](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions) - The first incarnation of the term "ADR"
+* [Sustainable Architectural
+  Decisions](https://www.infoq.com/articles/sustainable-architectural-design-decisions) -
+  The Y-Statements
+* Other templates listed at
+  <https://github.com/joelparkerhenderson/architecture_decision_record>
+* Formless - No conventions for file format and structure
+
+## Decision Outcome
+
+Chosen option: "MADR 2.1.0", because
+
+* Implicit assumptions should be made explicit. Design documentation
+  is important to enable people understanding the decisions later on.
+  See also [A rational design process: How and why to fake
+  it](https://doi.org/10.1109/TSE.1986.6312940).
+* The MADR format is lean and fits our development style.
+* The MADR structure is comprehensible and facilitates usage &
+  maintenance.
+* The MADR project is vivid.
+* Version 2.1.0 is the latest one available when starting to document
+  ADRs.
--- a/modules/microsite/src/main/tut/dev/adr/0001_components.md
+++ b/modules/microsite/src/main/tut/dev/adr/0001_components.md
@@ -0,0 +1,66 @@
+---
+layout: docs
+title: Components
+---
+
+# Components
+
+## Context and Problem Statement
+
+How should the application be structured into its main components? The
+goal is to be able to have multiple rest servers/webapps and multiple
+document processor components working together.
+
+
+## Decision Outcome
+
+The following are the "main" modules. There may be more helper modules
+and libraries that support implementing a feature.
+
+### store
+
+The code related to database access. It also provides the job
+queue. It is designed as a library.
+
+### joex
+
+Joex stands for "job executor".
+
+An application that executes jobs from the queue and therefore depends
+on the `store` module. It provides the code for all tasks that can be
+submitted as jobs. If no jobs are in the queue, the joex "sleeps"
+and must be woken via an external request.
+
+It provides the document processing code.
+
+It provides an HTTP rest server to get insight into the joex state
+and also to be notified of new jobs.
+
+### backend
+
+It provides all the logic, except document processing, as a set of
+"operations". An operation can be directly mapped to a rest
+endpoint.
+
+It is designed as a library.
+
+### rest api
+
+This module contains the specification for the rest server as an
+`openapi.yml` file. It is packaged as a scala library that also
+provides types and conversions to/from json.
+
+The idea is that the `rest server` module can depend on it as well as
+rest clients.
+
+### rest server
+
+This is the main application. It directly depends on the `backend`
+module, and each rest endpoint maps to a "backend operation". It is
+also responsible for converting the json data inside http requests
+to/from types recognized by the `backend` module.
+
+
+### webapp
+
+This module provides the user interface as a web application.
--- a/modules/microsite/src/main/tut/dev/adr/0002_component_interaction.md
+++ b/modules/microsite/src/main/tut/dev/adr/0002_component_interaction.md
@@ -0,0 +1,65 @@
+---
+layout: docs
+title: Component Interaction
+---
+
+# Component Interaction
+
+## Context and Problem Statement
+
+There are multiple web applications with their rest servers and there
+are multiple document processors. These processes must communicate:
+
+- once a new job is added to the queue the rest server must somehow
+  notify processors to wake up
+- once a processor takes a job, it must propagate the progress and
+  outcome to all rest servers so that the rest server can notify the
+  user that is currently logged in. Since it's not known which
+  rest-server the user is using right now, all must be notified.
+
+## Considered Options
+
+1. JMS (ActiveMQ or similar): Message Broker as another active
+   component
+2. Akka: using a cluster
+3. DB: Register with "call back urls"
+
+## Decision Outcome
+
+Choosing option 3: DB as central synchronisation point.
+
+The reason is that this is the simplest solution and doesn't require
+external libraries or more processes. The other options seem too big
+of a weapon for the task at hand. They are both large components
+themselves and require more knowledge to use them efficiently.
+
+It works roughly like this:
+
+- rest servers and processors register at the database on startup each
+  with a unique call-back url
+- and deregister on shutdown
+- each component has db access
+- rest servers can list all processors and vice versa
+
+### Positive Consequences
+
+- complexity of the whole application is not touched
+- since a lot of data must be transferred to the document processors,
+  this is solved by simply accessing the db. So the protocol for data
+  exchange is set. There is no need for other protocols that handle
+  large data (http chunking etc)
+- uses the already existing db as synchronisation point
+- no additional knowledge required
+- simple to understand and so not hard to debug
+
+### Negative Consequences
+
+- all components must have db access. This is also a security con,
+  because if one of those processes is hacked, db access is
+  possible. And it simply is another dependency that is not really
+  required for the joex component
+- the joex component cannot be in an untrusted environment (untrusted
+  from the db's point of view). For example, it is not possible to
+  create "personal joex" that only receive your own jobs…
+- in order to know if a component is really active, one must run a
+  ping against the call-back url
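
Note on ADR 0002: the chosen option (each component registers a call-back url in the shared database on startup and removes it on shutdown) can be sketched with an in-memory SQLite database. This is a minimal illustration under assumptions: the table name `node`, its columns, the function names and the example urls are hypothetical and not taken from the docspell sources.

```python
import sqlite3


def open_registry():
    """Create an in-memory stand-in for the shared database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE node ("
        " id TEXT PRIMARY KEY,"      # unique id of the component
        " kind TEXT NOT NULL,"       # 'restserver' or 'joex'
        " url TEXT NOT NULL UNIQUE)" # unique call-back url
    )
    return db


def register(db, node_id, kind, url):
    """Called on startup: announce this component and its call-back url."""
    with db:
        db.execute(
            "INSERT OR REPLACE INTO node (id, kind, url) VALUES (?, ?, ?)",
            (node_id, kind, url),
        )


def deregister(db, node_id):
    """Called on shutdown: remove this component from the registry."""
    with db:
        db.execute("DELETE FROM node WHERE id = ?", (node_id,))


def callback_urls(db, kind):
    """List the call-back urls of all registered components of one kind."""
    rows = db.execute("SELECT url FROM node WHERE kind = ? ORDER BY id", (kind,))
    return [row[0] for row in rows]
```

A rest server that has just added a job would iterate over `callback_urls(db, "joex")` and notify each url. As the ADR points out, a registered url does not guarantee a live process, so a ping is still needed.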
--- a/modules/microsite/src/main/tut/dev/adr/0003_encryption.md
+++ b/modules/microsite/src/main/tut/dev/adr/0003_encryption.md
@@ -0,0 +1,95 @@
+---
+layout: docs
+title: Encryption
+---
+
+# Encryption
+
+
+## Context and Problem Statement
+
+Since docspell may store important documents, it should be possible to
+encrypt them on the server. It should be (almost) transparent to the
+user, for example, a user must be able to login and download a file in
+clear form. That is, the server must also decrypt them.
+
+Then all users of a collective should have access to the files. This
+requires sharing the key among users of a collective.
+
+But, even when files are encrypted, the associated meta data is not!
+So especially access to the database would allow seeing tags,
+associated persons and correspondents of documents.
+
+So in short, encryption means:
+
+- file contents (the blobs and extracted text) are encrypted
+- metadata is not
+- secret keys are stored at the server (protected by a passphrase),
+  such that files can be downloaded in clear form
+
+
+## Decision Drivers
+
+* major driver is to provide most possible privacy for users
+* even at the expense of less features; currently I think that the
+  associated meta data is enough for finding documents (i.e. full text
+  search is not needed)
+
+## Considered Options
+
+It is clear, that only blobs (file contents) can be encrypted, but not
+the associated metadata. And the extracted text must be encrypted,
+too, obviously.
+
+
+### Public Key Encryption (PKE)
+
+With PKE, the server can automatically encrypt files using
+publicly available key data. It wouldn't require a user to provide a
+passphrase for encryption, only for decryption.
+
+This would allow for first processing files (extracting text, doing
+text analysis) and encrypting them (and the text) afterwards.
+
+The public and secret keys are stored in the database. The secret key
+must be protected. This can be done by encrypting the passphrase to
+the secret key using each user's login password. If a user logs in, he
+or she must provide the correct password. Using this password, the
+private key can be unlocked. This requires storing the private key
+passphrase encrypted with every user's password in the database. So the
+whole security then depends on the quality of the users' passwords.
+
+There are plenty of other difficulties with this approach (how about
+password change, new secret keys, adding users etc).
+
+Using this kind of encryption would protect the data against offline
+attacks and also against accidental leakage (for example, if a bug in the
+software would access a file of another user).
+
+
+### No Encryption
+
+If only blobs are encrypted, against which type of attack would it
+provide protection?
+
+The users must still trust the server. First, in order to provide the
+wanted features (document processing), the server must see the file
+contents. Then, it will receive and serve files in clear form, so it
+has access to them anyways.
+
+With that in mind, the "only" feature is to protect against "stolen
+database" attacks. If the database is somehow leaked, the attackers
+would only see the metadata, but not real documents. It also protects
+against leakage, maybe caused by a programming error.
+
+But the downside is, that it increases complexity *a lot*. And since
+this is a personal tool for personal use, is it worth the effort?
+
+
+## Decision Outcome
+
+No encryption, because of its complexity.
+
+For now, this tool is only meant for "self deployment" and personal
+use. If this changes or there is enough time, this decision should be
+reconsidered.
--- a/modules/microsite/src/main/tut/dev/adr/0004_iso8601vsEpoch.md
+++ b/modules/microsite/src/main/tut/dev/adr/0004_iso8601vsEpoch.md
@@ -0,0 +1,42 @@
+---
+layout: docs
+title: ISO8601 vs Millis
+---
+
+# ISO8601 vs Millis as Date-Time transfer
+
+## Context and Problem Statement
+
+The question is whether the REST Api should return an ISO8601
+formatted string in UTC timezone, or the unix time (number of
+milliseconds since 1970-01-01).
+
+There is quite some controversy about it.
+
+- <https://stackoverflow.com/questions/47426786/epoch-or-iso8601-date-format>
+- <https://nbsoftsolutions.com/blog/designing-a-rest-api-unix-time-vs-iso-8601>
+
+In my opinion, the ISO8601 format (always UTC) is better. The reason
+is the better readability. But elm folks are on the other side:
+
+- <https://package.elm-lang.org/packages/elm/time/1.0.0#iso-8601>
+- <https://package.elm-lang.org/packages/rtfeldman/elm-iso8601-date-strings/latest/>
+
+One can convert from an ISO8601 date-time string in UTC time into the
+epoch millis and vice versa. So it is the same to me. There is no less
+information in an ISO8601 string than in the epoch millis.
+
+To avoid confusion, all date/time values should use the same encoding.
+
+## Decision Outcome
+
+I go with the epoch time. Every timestamp/date-time value is
+transferred as Unix timestamp.
+
+Reasons:
+
+- the Elm application needs to frequently calculate with these values
+  to render the current waiting time etc. This is better if there are
+  numbers without requiring to parse dates first
+- Since the UI is written with Elm, it's probably good to adopt their
+  style
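
Note on ADR 0004: the claim that an ISO8601 string in UTC and epoch milliseconds carry the same information can be checked with a short round-trip sketch. The function names `millis_to_iso` and `iso_to_millis` are hypothetical, and the fixed format with exactly three fractional digits and a trailing `Z` is an assumption made for this illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_iso(ms):
    """Render epoch milliseconds as an ISO8601 string in UTC."""
    dt = EPOCH + timedelta(milliseconds=ms)
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"


def iso_to_millis(text):
    """Parse an ISO8601 UTC string (as produced above) into epoch milliseconds."""
    dt = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    dt = dt.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)
```

Converting a value to a string and back yields the original number, which matches the ADR's reasoning that the choice is about convenience for the Elm client rather than about information content.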
--- a/modules/microsite/src/main/tut/dev/adr/0005_job-executor.md
+++ b/modules/microsite/src/main/tut/dev/adr/0005_job-executor.md
@@ -0,0 +1,136 @@
+---
+layout: docs
+title: Joex - Job Executor
+---
+
+# Job Executor
+
+## Context and Problem Statement
+
+Docspell is a multi-user application. When processing users'
+documents, there must be some thought on how to distribute all the
+processing jobs on a much more restricted set of resources. There
+may be 100 users but only 4 cores that can process documents at a
+time. Doing simply FIFO is not enough since it provides an unfair
+distribution. The first user who submits 20 documents will then occupy
+all cores for quite some time and all other users would need to wait.
+
+This tries to find a more fair distribution among the users (strictly
+meaning collectives here) of docspell.
+
+The job executor is a separate component that will run in its own
+process. It takes the next job from the "queue" and executes the
+associated task. This is used to run the document processing jobs
+(text extraction, text analysis etc).
+
+1. The task execution should survive restarts. State and task code
+   must be recreated from some persisted state.
+
+2. The processing should be fair with respect to collectives.
+
+3. It must be possible to run many job executors, possibly on
+   different machines. This can be used to quickly enable more
+   processing power and removing it once the peak is over.
+
+4. Task execution can fail and it should be possible to retry those
+   tasks. Reasons are that errors may be temporary (for example
+   talking to a third party service), and to enable repairing without
+   stopping the job executor. Some errors might be easily repaired (a
+   program was not installed or whatever). In such a case it is good
+   to know that the task will be retried later.
+
+## Considered Options
+
+In contrast to other ADRs this is just some sketching of thoughts for
+the current implementation.
+
+1. Job descriptions are serialized and written to the database into a
+   table. This becomes the queue. Tasks are identified by names and a
+   job executor implementation must have a map of names to code to
+   look up the task to perform. The task's arguments are serialized into
+   a string and written to the database. Tasks must decode the
+   string. This can be conveniently done using JSON and the provided
+   circe decoders.
+
+2. To provide a fair execution, jobs are organized into groups. When a
+   new job is requested from the queue, first a group is selected
+   using a round-robin strategy. This should ensure good enough
+   fairness among groups. A group maps to a collective. Within a
+   group, a job is selected based on priority, submitted time (fifo)
+   and job state (see notes about stuck jobs).
+
+3. Allowing multiple job executors means that getting the next job can
+   fail due to simultaneous running transactions. It is retried until
+   it succeeds. Taking a job puts it into _scheduled_ state. Each job
+   executor has a unique (manually supplied) id and jobs are marked
+   with that id once they are handed to the executor.
+
+4. When a task fails, its state is updated to state _stuck_. Stuck
+   jobs are retried in the future. The queue prefers to return stuck
+   jobs that are due at the specific point in time ignoring the
+   priority hint.
+
+### More Details
+
+A job has these properties
+
+- id (something random)
+- group
+- taskname (to choose task to run)
+- submitted-date
+- worker (the id of the job executor)
+- state, one of: waiting, scheduled, running, stuck, cancelled,
+  failed, success
+  - waiting: job has been inserted into the queue
+  - scheduled: job has been handed over to some executor and is
+    marked with the job executor id
+  - running: a task is currently executing
+  - stuck: a task has failed and will be retried eventually
+  - cancelled: task has finished and there was a cancel request
+  - failed: task has failed and exceeded the retries
+  - success: task has completed successfully
+
+The queue has a `take` or `nextJob` operation that takes the worker-id
+and a priority hint and goes roughly like this:
+
+- select the next group using round-robin strategy
+- select all jobs with that group, where
+  - state is stuck and waiting time has elapsed
+  - state is waiting and has the given priority if possible
+- jobs are ordered by submitted time, but stuck jobs whose waiting
+  time elapsed are preferred
+
+There are two priorities within a group: high and low. A configured
+counting scheme determines when to select a certain priority. For
+example, counting scheme of `(2,1)` would select two high priority
+jobs and then 1 low priority job. The `take` operation tries to prefer
+this priority but falls back to the other if no job with this priority
+is available.
+
+A group corresponds to a collective. Then all collectives get
+(roughly) equal treatment.
+
+Once there are no jobs in the queue the executor goes to sleep and
+must be woken to run again. If a job is submitted, the executors are
+notified.
+
+### Stuck Jobs
+
+A job goes into _stuck_ state if the task has failed. In this
+state, the task is rerun after a while until a maximum retry count is
+reached.
+
+The problem is how to notify all executors when the waiting time has
+elapsed. If one executor puts a job into stuck state, it means that
+all others should start looking into the queue again after `x`
+minutes. It would be possible to tell all existing executors to
+schedule themselves to wake up in the future, but this would miss all
+executors that show up later.
+
+The waiting time is increased exponentially after each retry (`2 ^
+retry`) and it is meant as the minimum waiting time. So it is ok if
+all executors wake up periodically and check for new work. Most of the
+time this should not be necessary and is just a fallback if only stuck
+jobs are in the queue and nothing is submitted for a long time. If the
+system is used, jobs get submitted once in a while and would awake all
+executors.
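
Note on ADR 0005: the `take` operation and the retry behaviour described above can be sketched as a small in-memory queue. This is one possible reading of the ADR, not the docspell implementation: the class and method names, the alphabetical round-robin over group names, the sort order (due stuck jobs first, then the hinted priority, then submission time) and the abstract time unit are all assumptions. The `2 ^ retry` waiting time is taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    id: str
    group: str                      # corresponds to a collective
    priority: str                   # "high" or "low"
    submitted: int                  # submission time, abstract time units
    state: str = "waiting"
    retries: int = 0
    retry_at: int = 0               # earliest time a stuck job may run again
    worker: Optional[str] = None    # id of the executor that took the job


class JobQueue:
    def __init__(self, max_retries: int = 3):
        self.jobs: List[Job] = []
        self.max_retries = max_retries
        self.last_group: Optional[str] = None

    def submit(self, job: Job) -> None:
        self.jobs.append(job)

    def _runnable(self, job: Job, now: int) -> bool:
        # Waiting jobs are always eligible; stuck jobs only once they are due.
        if job.state == "waiting":
            return True
        return job.state == "stuck" and job.retry_at <= now

    def _next_group(self, now: int) -> Optional[str]:
        # Round-robin over the groups that currently have runnable jobs.
        groups = sorted({j.group for j in self.jobs if self._runnable(j, now)})
        if not groups:
            return None
        later = [g for g in groups if self.last_group is None or g > self.last_group]
        self.last_group = later[0] if later else groups[0]
        return self.last_group

    def take(self, worker: str, priority_hint: str, now: int) -> Optional[Job]:
        group = self._next_group(now)
        if group is None:
            return None             # nothing to do: the executor may sleep
        candidates = [j for j in self.jobs
                      if j.group == group and self._runnable(j, now)]
        # Due stuck jobs first, then the hinted priority, then FIFO.
        candidates.sort(key=lambda j: (j.state != "stuck",
                                       j.priority != priority_hint,
                                       j.submitted))
        job = candidates[0]
        job.state, job.worker = "scheduled", worker
        return job

    def fail(self, job: Job, now: int) -> None:
        # Retry with exponentially growing minimum waiting time (2 ^ retry).
        job.retries += 1
        if job.retries > self.max_retries:
            job.state = "failed"
        else:
            job.state = "stuck"
            job.retry_at = now + 2 ** job.retries
```

With three jobs submitted by group `a` and one by group `b`, successive `take` calls alternate between the groups while both have work, so `b` is served second instead of waiting behind all of `a`'s jobs. A real implementation would do the selection in a database transaction and retry on conflicts, as the ADR notes.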
--- a/modules/microsite/src/main/tut/dev/adr/template.md
+++ b/modules/microsite/src/main/tut/dev/adr/template.md
@@ -0,0 +1,72 @@
+# [short title of solved problem and solution]
+
+* Status: [proposed | rejected | accepted | deprecated | … | superseded by [ADR-0005](0005-example.md)] <!-- optional -->
+* Deciders: [list everyone involved in the decision] <!-- optional -->
+* Date: [YYYY-MM-DD when the decision was last updated] <!-- optional -->
+
+Technical Story: [description | ticket/issue URL] <!-- optional -->
+
+## Context and Problem Statement
+
+[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]
+
+## Decision Drivers <!-- optional -->
+
+* [driver 1, e.g., a force, facing concern, …]
+* [driver 2, e.g., a force, facing concern, …]
+* … <!-- numbers of drivers can vary -->
+
+## Considered Options
+
+* [option 1]
+* [option 2]
+* [option 3]
+* … <!-- numbers of options can vary -->
+
+## Decision Outcome
+
+Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)].
+
+### Positive Consequences <!-- optional -->
+
+* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
+* …
+
+### Negative Consequences <!-- optional -->
+
+* [e.g., compromising quality attribute, follow-up decisions required, …]
+* …
+
+## Pros and Cons of the Options <!-- optional -->
+
+### [option 1]
+
+[example | description | pointer to more information | …] <!-- optional -->
+
+* Good, because [argument a]
+* Good, because [argument b]
+* Bad, because [argument c]
+* … <!-- numbers of pros and cons can vary -->
+
+### [option 2]
+
+[example | description | pointer to more information | …] <!-- optional -->
+
+* Good, because [argument a]
+* Good, because [argument b]
+* Bad, because [argument c]
+* … <!-- numbers of pros and cons can vary -->
+
+### [option 3]
+
+[example | description | pointer to more information | …] <!-- optional -->
+
+* Good, because [argument a]
+* Good, because [argument b]
+* Bad, because [argument c]
+* … <!-- numbers of pros and cons can vary -->
+
+## Links <!-- optional -->
+
+* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
+* … <!-- numbers of links can vary -->
--- a/modules/microsite/src/main/tut/doc.md
+++ b/modules/microsite/src/main/tut/doc.md
@@ -0,0 +1,99 @@
+---
+layout: docs
+position: 4
+title: Documentation
+---
+
+# {{page.title}}
+
+Docspell assists in organizing large amounts of PDF files that are
+typically scanned paper documents. You can associate tags, set
+correspondents, what a document is concerned with, a name, a date and
+some more. If your documents are associated with this meta data, you
+should be able to quickly find them later using the search
+feature. But adding this manually to each document is a tedious
+task. What if most of it could be attached automatically?
+
+## How it works
+
+Documents have two main properties: a correspondent (sender or
+receiver that is not you) and something the document is about. Usually
+it is about a person or some thing – maybe your car, or contracts
+concerning some family member, etc.
+
+1. You maintain a kind of address book. It should list all possible
+   correspondents and the concerning people/things. This grows
+   incrementally with each new unknown document.
+2. When docspell analyzes a document, it tries to find matches within
+   your address book. It can detect the correspondent and a concerning
+   person or thing. It will then associate this data to your
+   documents.
+3. You can inspect what docspell has done and correct it. If docspell
+   has found multiple suggestions, they will be shown for you to
+   select one. If it is not correctly associated, very often the
+   correct one is just one click away.
+
+The set of meta data that docspell uses to draw suggestions from, must
+be maintained manually. But usually, this data doesn't grow as fast as
+the documents. After a while there is a quite complete address book
+and only once in a while it has to be revisited.
+
+
+## Terms
+
+In order to better understand these pages, some terms should be
+explained first.
+
+### Item
+
+An **Item** is roughly your (pdf) document, only that an item may span
+multiple files, which are called **attachments**. And an item has
+**meta data** associated:
+
+- a **correspondent**: the other side of the communication. It can be
+  an organization or a person.
+- a **concerning person** or **equipment**: a person or thing that
+  this item is about. Maybe it is an insurance contract about your
+  car.
+- **tag**: an item can be tagged with custom tags. A tag can have a
+  *category*. This is intended for grouping tags, for example a
+  category `doctype` could be used to group tags like `bill`,
+  `contract`, `receipt` etc. Usually an item is not tagged with more
+  than one tag of a category.
+- an **item date**: this is the date of the document – if this is not
+  set, the created date of the item is used.
+- a **due date**: an optional date indicating that something has to be
+  done (e.g. paying a bill, submitting it) about this item until this
+  date
+- a **direction**: one of "incoming" or "outgoing"
+- a **name**: some item name, defaults to the file name of the
+  attachments
+- some **notes**: arbitrary descriptive text. You can use markdown
+  here, which is appropriately formatted in the web application.
+
+### Collective
+
+The users of the application are part of a **collective**. A
+**collective** is a group of users that share access to the same
+items. The account name is therefore comprised of a *collective name*
+and a *user name*.
+
+All users of a collective are equal; they have the same permissions to
+access all items. The items don't belong to a user, but to the
+collective.
+
+That means, to identify yourself when signing in, you have to give the
+collective name and your user name. By default they are separated by a
+slash `/`, for example `smith/john`. If your user name is the same as
+the collective name, you can omit one; so `smith/smith` can be
+abbreviated to just `smith`.
+
+
+## Limitations
+
+* Docspell currently supports only PDF files.
+* The PDF view relies on the browser's capabilities. Sadly, not all
+  browsers can display PDF files. Some may require extra plugins. And
+  it's especially sad that mobile browsers won't display the
+  files. It works with the major desktop browsers (firefox, chromium),
+  though.
--- a/modules/microsite/src/main/tut/doc/configure.md
+++ b/modules/microsite/src/main/tut/doc/configure.md
@ -0,0 +1,261 @@
+---
+layout: docs
+title: Configuring
+---
+
+# {{ page.title }}
+
+Docspell's executable can take one argument – a configuration file. If
+that is not given, the defaults are used. The config file overrides
+default values, so only values that differ from the defaults are
+necessary.
+
+This applies to the restserver and the joex as well.
+
+## Important Config Options
+
+The configuration of both components uses separate namespaces. The
+configuration for the REST server is below `docspell.server`, while
+the one for joex is below `docspell.joex`.
+
+### JDBC
+
+This configures the connection to the database. This has to be
+specified for the rest server and joex. By default, an H2 database in
+the `/tmp` directory is configured.
+
+The config looks like this (both components):
+
+```
+docspell.joex.jdbc {
+  url = ...
+  user = ...
+  password = ...
+}
+
+docspell.server.backend.jdbc {
+  url = ...
+  user = ...
+  password = ...
+}
+```
+
+The `url` is the connection to the database. It must start with
+`jdbc`, followed by the name of the database. The rest is specific to the
+database used: it is either a path to a file for H2 or a host/database
+url for MariaDB and PostgreSQL.
+
+When using H2, the user is `sa`, the password can be empty and the url
+must include these options:
+
+```
+;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE
+```
+
+#### Examples
+
+PostgreSQL:
+```
+url = "jdbc:postgresql://localhost:5432/docspelldb"
+```
+
+MariaDB:
+```
+url = "jdbc:mariadb://localhost:3306/docspelldb"
+```
+
+H2
+```
+url = "jdbc:h2:///path/to/a/file.db;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
+```
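+
+As a rough sketch, a setup where both components share one PostgreSQL
+database could combine the snippets above like this. The database
+name, user and password are placeholders chosen for illustration:
+
+```
+# joex config file
+docspell.joex.jdbc {
+  url = "jdbc:postgresql://localhost:5432/docspelldb"
+  user = "docspell"         # placeholder
+  password = "change-me"    # placeholder
+}
+
+# rest server config file
+docspell.server.backend.jdbc {
+  url = "jdbc:postgresql://localhost:5432/docspelldb"
+  user = "docspell"         # placeholder
+  password = "change-me"    # placeholder
+}
+```
+
+Both blocks point to the same database, which both components need in
+order to work on the same items and jobs.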
+
+### Bind
+
+The host and port the http server binds to. This applies to both
+components. The joex component also exposes a small REST api to
+inspect its state and notify the scheduler.
+
+```
+docspell.server.bind {
+  address = localhost
+  port = 7880
+}
+docspell.joex.bind {
+  address = localhost
+  port = 7878
+}
+```
+
+By default, it binds to `localhost` and some predefined port. This
+must be changed, if components are on different machines.
+
+### baseurl
+
+The base url is an important setting that defines the http URL where
+the corresponding component can be reached. It applies to both
+components. For a joex component, the url must be resolvable from a
+REST server component. The REST server also uses this url to create
+absolute urls and to configure the authentication cookie.
+
+By default it is built using the information from the `bind` setting.
+
+
+```
+docspell.server.baseurl = ...
+docspell.joex.baseurl = ...
+```
+
+#### Examples
+
+```
+docspell.server.baseurl = "https://docspell.example.com"
+docspell.joex.baseurl = "http://192.168.101.10"
+```
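+
+When the components run on different machines, `bind` and `baseurl`
+usually have to be changed together. A minimal sketch for a joex
+instance, assuming its machine has the address `192.168.101.10` and
+that the default port is kept:
+
+```
+docspell.joex {
+  bind {
+    address = "192.168.101.10"   # assumed address of this machine
+    port = 7878
+  }
+  # must be resolvable from the REST server
+  baseurl = "http://192.168.101.10:7878"
+}
+```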
+
+
+### app-id
+
+The `app-id` is the identifier of the corresponding instance. It *must
+be unique* for all instances. By default the REST server uses `rest1`
+and joex `joex1`. It is recommended to overwrite this setting to have
+an explicit and stable identifier.
+
+```
+docspell.server.app-id = "rest1"
+docspell.joex.app-id = "joex1"
+```
+
+### registration options
+
+This defines if and how new users can create accounts. There are 3
+options:
+
+- *closed* no new user can sign up
+- *open* new users can sign up
+- *invite* new users can sign up but require an invitation key
+
+This applies only to the REST server component.
+
+```
+docspell.server.signup {
+  mode = "open"
+
+  # If mode == 'invite', a password must be provided to generate
+  # invitation keys. It must not be empty.
+  new-invite-password = ""
+
+  # If mode == 'invite', this is the period an invitation token is
+  # considered valid.
+  invite-time = "3 days"
+}
+```
+
+The mode `invite` is intended to open the application only to some
+users. The admin can create these invitation keys and distribute them
+to the desired people. For this, the `new-invite-password` must be
+given. The idea is that only the person who installs docspell knows
+this. If it is not set, then invitation won't work. New invitation
+keys can be generated from within the web application or via REST
+calls (using `curl`, for example).
+
+```
+curl -X POST -d '{"password":"blabla"}' "http://localhost:7880/api/v1/open/signup/newinvite"
+```
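+
+A configuration matching the `curl` call above could look like the
+following sketch. It reuses the example password `blabla`; pick your
+own secret value for a real installation:
+
+```
+docspell.server.signup {
+  mode = "invite"
+
+  # must not be empty when mode is 'invite'
+  new-invite-password = "blabla"
+
+  invite-time = "3 days"
+}
+```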
+
+### Authentication
+
+Authentication works in two ways:
+
+- with an account-name / password pair
+- with an authentication token
+
+The initial authentication must occur with an accountname/password
+pair. This will generate an authentication token which is valid for
+some time. Subsequent calls to secured routes can use this token. The
+token can be given as a normal http header or via a cookie header.
+
+These settings apply only to the REST server.
+
+```
+docspell.server.auth {
+  server-secret = "hex:caffee" # or "b64:Y2FmZmVlCg=="
+  session-valid = "5 minutes"
+}
+```
+
+The `server-secret` is used to sign the token. If multiple REST
+servers are deployed, all must share the same server secret. Otherwise
+tokens from one instance are not valid on another instance. The secret
+can be given as Base64 encoded string or in hex form. Use the prefix
+`b64:` and `hex:`, respectively.
+
+The `session-valid` determines how long a token is valid. This can be
+just a few minutes, since the web application obtains new ones
+periodically. So a short time is recommended.
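+
+As an illustration, two REST server instances could be configured
+like this. The secret is the sample value from above and `rest2` is a
+made-up identifier; the `app-id` differs while the secret is the same:
+
+```
+# config file of the first REST server
+docspell.server {
+  app-id = "rest1"
+  auth.server-secret = "hex:caffee"
+}
+
+# config file of the second REST server
+docspell.server {
+  app-id = "rest2"
+  auth.server-secret = "hex:caffee"
+}
+```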
+
+
+## File Format
+
+The format of the configuration files can be
+[HOCON](https://github.com/lightbend/config/blob/master/HOCON.md#hocon-human-optimized-config-object-notation),
+JSON or whatever the used [config
+library](https://github.com/lightbend/config) understands. The default
+values below are in HOCON format, which is recommended, since it
+allows comments and has some [advanced
+features](https://github.com/lightbend/config/blob/master/README.md#features-of-hocon). Please
+refer to their documentation for more on this.
+
+Here are the default configurations.
+
+
+## Default Config
+
+### Rest Server
+
+```
+{% include server.conf %}
+```
+
+### Joex
+
+```
+{% include joex.conf %}
+```
+
+## Logging
+
+By default, docspell logs to stdout. This works well, when managed by
+systemd or other inits. Logging is done by
+[logback](https://logback.qos.ch/). Please refer to its documentation
+for how to configure logging.
+
+If you created your logback config file, it can be added as argument
+to the executable using this syntax:
+
+```
+/path/to/docspell -Dlogback.configurationFile=/path/to/your/logging-config-file
+```
+
+To get started, the default config looks like this:
+
+``` xml
+<configuration>
+  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
+    <withJansi>true</withJansi>
+
+    <encoder>
+      <pattern>[%thread] %highlight(%-5level) %cyan(%logger{15}) - %msg %n</pattern>
+    </encoder>
+  </appender>
+
+  <logger name="docspell" level="debug" />
+  <root level="INFO">
+    <appender-ref ref="STDOUT" />
+  </root>
+</configuration>
+```
+
+The `<root level="INFO">` means that only log statements with level
+"INFO" or higher will be printed. But the `<logger name="docspell"
+level="debug">` above says that for loggers with name "docspell"
+statements with level "DEBUG" will be printed, too.
--- a/modules/microsite/src/main/tut/doc/curate.md
+++ b/modules/microsite/src/main/tut/doc/curate.md
@ -0,0 +1,77 @@
+---
+layout: docs
+title: Find and Review
+---
+
+# {{page.title}}
+
+Curating the items' meta data helps to find them later. This page
+describes how you can quickly go through those items and correct or
+amend with existing data.
+
+## Select New items
+
+After files have been uploaded and the job executor created the
+corresponding items, they will show up on the main page. All items
+the job executor has created are initially marked as *New*. The option
+*only New* in the left search menu can be used to select only new
+items:
+
+<div class="thumbnail">
+  <img src="../img/docspell-curate-1.jpg">
+</div>
+
+
+## Check selected items
+
+Then you can go through all new items and check their metadata: Click
+on the first item to open the detail view. This shows the documents
+and the meta data in the header.
+
+<div class="thumbnail">
+  <img src="../img/docspell-curate-2.jpg">
+</div>
+
+
+## Modify if necessary
+
+To change something, click the *Edit* button in the menu above the
+document view. This will open a form next to your documents. You can
+compare the data with the documents and change as you like. Since the
+item status is *New*, you'll see the suggestions docspell found during
+processing. If there were multiple candidates, you can select another
+one by clicking its name in the suggestion list.
+
+<div class="thumbnail">
+  <img src="../img/docspell-curate-3.jpg">
+</div>
+
+
+When you change something in the form, it is immediately applied. Only
+when changing text fields, a click on the *Save* symbol next to the
+field is required.
+
+
+## Confirm
+
+If everything looks good, click the *Confirm* button to confirm the
+current data. The *New* status goes away and also the suggestions are
+hidden in this state. You can always go back by clicking the
+*Unconfirm* button.
+
+<div class="thumbnail">
+  <img src="../img/docspell-curate-5.jpg">
+</div>
+
+
+## Proceed with next item
+
+To look at the next item in the search results, click the *Next*
+button in the menu (next to the *Edit* button). Clicking *Next* will
+keep the current view, so you can continue checking the data. If you
+are on the last item, the view switches to the listing view when
+clicking *Next*.
+
+<div class="thumbnail">
+  <img src="../img/docspell-curate-6.jpg">
+</div>
--- a/modules/microsite/src/main/tut/doc/install.md
+++ b/modules/microsite/src/main/tut/doc/install.md
@ -0,0 +1,218 @@
+---
+layout: docs
+title: Installation
+---
+
+# {{ page.title }}
+
+This page contains detailed installation instructions. For a quick
+start, refer to [this page](../getit.html).
+
+Docspell has been developed and tested on a GNU/Linux system. It may
+run on Windows and MacOS machines, too (ghostscript and tesseract are
+available on these systems). But I've never tried.
+
+Docspell consists of two components that are started in separate
+processes:
+
+1. *REST Server* This is the main application, providing the REST Api
+   and the web application.
+2. *Joex* (job executor) This is the component that does the document
+   processing.
+
+They can run on multiple machines. All REST server and Joex instances
+should be on the same network. It is not strictly required that they
+can reach each other, but if they can, the components notify each
+other about new or finished work.
+
+While this is possible, the simple setup is to start both components
+once on the same machine.
+
+The [download page](https://github.com/eikek/docspell/releases)
+provides pre-compiled packages and the [development page](dev.html)
+contains build instructions.
+
+
+## Prerequisites
+
+The two components have one prerequisite in common: they both require
+Java to run. While this is the only requirement for the *REST server*,
+the *Joex* component requires some more external programs.
+
+### Java
+
+Very often, Java is already installed. You can check this by opening a
+terminal and typing `java -version`. Otherwise install Java using your
+package manager or see [this site](https://adoptopenjdk.net/) for
+other options.
+
+It is enough to install the JRE. The JDK is required, if you want to
+build docspell from source.
+
+Docspell has been tested with Java version 1.8 (sometimes referred
+to as JRE 8 and JDK 8, respectively). The pre-built packages are also
+built using JDK 8. But a later version of Java should work as well.
+
+The next tools are only required on machines running the *Joex*
+component.
+
+### External Tools for Joex
+
+- [Ghostscript](http://pages.cs.wisc.edu/~ghost/) (the `gs` command)
+  is used to extract/convert PDF files into images that are then fed
+  to ocr. It is available on most GNU/Linux distributions.
+- [Unpaper](https://github.com/Flameeyes/unpaper) is a program that
+  pre-processes images to yield better results when doing ocr. If this
+  is not installed, docspell tries without it. However, it is
+  recommended to install, because it [improves text
+  extraction](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
+  (at the expense of a longer runtime).
+- [Tesseract](https://github.com/tesseract-ocr/tesseract) is the tool
+  doing the OCR (converts images into text). It is a widely used open
+  source OCR engine. Tesseract 3 and 4 should work with docspell; you
+  can adapt the command line in the configuration file, if necessary.
+
+
+### Example Debian
+
+On Debian this should install all joex requirements:
+
+``` bash
+sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper
+```
+
+## Database
+
+Both components must have access to a SQL database. Docspell
+supports these databases:
+
+- PostgreSQL
+- MariaDB
+- H2
+
+The H2 database is an interesting option for personal and mid-size
+setups, as it requires no additional work. It is integrated into
+docspell and works really well. It is also configured as the default
+database.
+
+For large installations, PostgreSQL or MariaDB is recommended. Create
+a database and a user with enough privileges (read, write, create
+table) to that database.
+
+When using H2, make sure that all components access the same database
+– the jdbc url must point to the same file. Then, it is important to
+add the options
+`;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE` at the end
+of the url. See the [default config](configure.html) for an example.
+
+
+## Installing from ZIP files
+
+After extracting the zip files, you'll find a start script in the
+`bin/` folder.
+
+
+## Installing from DEB packages
+
+The DEB packages can be installed on Debian, or Debian based Distros:
+
+``` bash
+$ sudo dpkg -i docspell*.deb
+```
+
+Then the start scripts are in your `$PATH`. Run `docspell-restserver`
+or `docspell-joex` from a terminal window.
+
+The packages come with a systemd unit file that will be installed to
+autostart the services.
+
+
+## Running
+
+Run the start script (in the corresponding `bin/` directory when using
+the zip files):
+
+```
+$ ./docspell-restserver*/bin/docspell-restserver
+$ ./docspell-joex*/bin/docspell-joex
+```
+
+This will start up both components using the default configuration. The
+configuration should be adapted to your needs. For example, the
+database connection is configured to use an H2 database in the `/tmp`
+directory. Please refer to the [configuration page](configure.html)
+for how to create a custom config file. Once you have your config
+file, simply pass it as argument to the command:
+
+```
+$ ./docspell-restserver*/bin/docspell-restserver /path/to/server-config.conf
+$ ./docspell-joex*/bin/docspell-joex /path/to/joex-config.conf
+```
+
+After starting the rest server, you can reach the web application at
+path `/app/index.html`, so using default values it would be
+`http://localhost:7880/app/index.html`.
+
+You should be able to create a new account and sign in. Check the
+[configuration page](configure.html) to further customize docspell.
+
+
+### Options
+
+The start scripts support some options to configure the JVM. One often
+used setting is the maximum heap size of the JVM. By default, java
+determines it based on properties of the current machine. You can
+specify it by giving java startup options to the command:
+
+```
+$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -- /path/to/server-config.conf
+```
+
+This would limit the maximum heap to 1GB. The double dash (`--`) separates
+internal options and the arguments to the program. Another frequently
+used option is to change the default temp directory. Usually it is
+`/tmp`, but it may be desired to have a dedicated temp directory,
+which can be configured:
+
+```
+$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -Djava.io.tmpdir=/path/to/othertemp -- /path/to/server-config.conf
+```
+
+The command:
+
+```
+$ ./docspell-restserver*/bin/docspell-restserver -h
+```
+
+gives an overview of supported options.
+
+
+## Raspberry Pi, and similar
+
+Both components can run next to each other on a raspberry pi or
+similar device.
+
+
+### REST Server
+
+The REST server component runs very well on the Raspberry Pi and
+similar devices. It doesn't require many resources, because the heavy
+work is done by the joex components.
+
+
+### Joex
+
+Running the joex component on the Raspberry Pi is possible, but will
+result in long processing times. Tested on a RPi model 3 (4 cores, 1G
+RAM) processing a PDF (scanned with 300dpi) with two pages took
+9:52. You can speed it up considerably by uninstalling the `unpaper`
+command, because this step takes quite long. This, of course, reduces
+the quality of OCR. But without `unpaper` the same sample pdf was then
+processed in 1:24, which is about 8.5 minutes faster.
+
+You should limit the joex pool size to 1 and, depending on your model
+and the amount of RAM, set a heap size of at least 500M
+(`-J-Xmx500M`).
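+
+In the joex config file, the pool size from the recommendation above
+could be set like this (a sketch; the heap size is passed on the
+command line, not in this file):
+
+```
+docspell.joex.scheduler {
+  # run only one job at a time on small devices
+  pool-size = 1
+}
+```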
+
+For personal setups, when you don't need the processing results asap,
+this can work well enough.
--- a/modules/microsite/src/main/tut/doc/joex.md
+++ b/modules/microsite/src/main/tut/doc/joex.md
@ -0,0 +1,155 @@
+---
+layout: docs
+title: Joex
+---
+
+# {{ page.title }}
+
+Joex is short for *Job Executor* and it is the component managing long
+running tasks in docspell. One of these long running tasks is the file
+processing task.
+
+One joex component handles the processing of all files of all
+collectives/users. It requires much more resources than the rest
+server component. Therefore the number of jobs that can run in
+parallel is limited with respect to the hardware it is running on.
+
+For larger installations, it is probably better to run several joex
+components on different machines. That works out of the box, as long
+as all components point to the same database and use different
+`app-id`s (see [configuring docspell](./configure.html)).
+
+When files are submitted to docspell, they are stored in the database
+and all known joex components are notified about new work. Then they
+compete on getting the next job from the queue. After a job finishes
+and no job is waiting in the queue, joex will sleep until notified
+again. It will also periodically notify itself as a fallback.
+
+## Scheduler and Queue
+
+The scheduler is the part that runs and monitors the long running
+jobs. It works together with the job queue, which defines what job to
+take next.
+
+To create a somewhat fair distribution among multiple collectives, a
+collective is first chosen in a simple round-robin way. Then a job
+from this collective is chosen by priority.
+
+There are only two priorities: low and high. A simple *counting
+scheme* determines if a low prio or high prio job is selected
+next. The default is `4, 1`, meaning to first select 4 high priority
+jobs and then 1 low priority job, then starting over. If no such job
+exists, it falls back to the other priority.
+
+The priority can be set on a *Source* (see
+[uploads](uploading.html)). Uploading through the web application will
+always use priority *high*. The idea is that while logged in, jobs are
+more important than those submitted when not logged in.
+
+
+## Scheduler Config
+
+The relevant part of the config file regarding the scheduler is shown
+below with some explanations.
+
+```
+docspell.joex {
+  # other settings left out for brevity
+
+  scheduler {
+
+    # Number of processing allowed in parallel.
+    pool-size = 2
+
+    # A counting scheme determines the ratio of how high- and low-prio
+    # jobs are run. For example: 4,1 means run 4 high prio jobs, then
+    # 1 low prio and then start over.
+    counting-scheme = "4,1"
+
+    # How often a failed job should be retried until it enters failed
+    # state. If a job fails, it becomes "stuck" and will be retried
+    # after a delay.
+    retries = 5
+
+    # The delay until the next try is performed for a failed job. This
+    # delay is increased exponentially with the number of retries.
+    retry-delay = "1 minute"
+
+    # The queue size of log statements from a job.
+    log-buffer-size = 500
+
+    # If no job is left in the queue, the scheduler will wait until a
+    # notify is requested (using the REST interface). To also retry
+    # stuck jobs, it will notify itself periodically.
+    wakeup-period = "30 minutes"
+  }
+}
+```
+
+The `pool-size` setting determines how many jobs run in parallel. You
+need to play with this setting on your machine to find an optimal
+value.
+
+The `counting-scheme` determines for all collectives how to select
+between high and low priority jobs; as explained above. It is
+currently not possible to define that per collective.
+
+If a job fails, it will be set to *stuck* state and retried by the
+scheduler. The `retries` setting defines how many times a job is
+retried until it enters the final *failed* state. The scheduler waits
+some time until running the next try. This delay is given by
+`retry-delay`. This is the initial delay, the time until the first
+re-try (the second attempt). This time increases exponentially with
+the number of retries.
+
+The jobs will log about what they do, which is picked up and stored
+into the database asynchronously. The log events are buffered in a
+queue and another thread will consume this queue and store them in the
+database. The `log-buffer-size` determines the size of the queue.
+
+At last, there is a `wakeup-period` that determines at what interval
+the joex component notifies itself to look for new jobs. If jobs get
+stuck, and joex is not notified externally it could fail to
+retry them. Also, since networks are not reliable, a notification may not
+reach a joex component. This periodic wakeup is just to ensure that
+jobs are eventually run.
+
+
+## Starting on demand
+
+The job executor and rest server can be started multiple times. This
+is especially useful for the job executor. For example, when
+submitting a lot of files in a short time, you can simply startup more
+job executors on other computers on your network. Maybe use your
+laptop to help with processing for a while.
+
+You have to make sure, that all connect to the same database, and that
+all have unique `app-id`s.
+
+Once the files have been processed you can stop the additional
+executors.
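+
+A sketch of a config file for such a temporary executor, for example
+on a laptop. The id `joex-laptop`, the addresses and the credentials
+are placeholders; the `jdbc` block must match the one used by the
+other components:
+
+```
+docspell.joex {
+  # must be unique among all instances
+  app-id = "joex-laptop"
+
+  baseurl = "http://192.168.101.20:7878"
+  bind {
+    address = "192.168.101.20"
+    port = 7878
+  }
+
+  # same database as all other components
+  jdbc {
+    url = "jdbc:postgresql://192.168.101.1:5432/docspelldb"
+    user = "docspell"
+    password = "change-me"
+  }
+}
+```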
+
+## Shutting down
+
+If a job executor is sleeping and not executing any jobs, you can just
+quit using SIGTERM or `Ctrl-C` when running in a terminal. But if
+there are jobs currently executing, it is advisable to initiate a
+graceful shutdown. The job executor will then stop taking new jobs
+from the queue but it will wait until all running jobs have completed
+before shutting down.
+
+This can be done by sending a http POST request to the api of this job
+executor:
+
+```
+curl -XPOST "http://localhost:7878/api/v1/shutdownAndExit"
+```
+
+If joex receives this request it will immediately stop taking new jobs
+and it will quit when all running jobs are done.
+
+If a job executor gets terminated while there are running jobs, the
+jobs are still in the current state marked to be executed by this job
+executor. In order to fix this, start the job executor again. It will
+search all jobs that are marked with its id and put them back into
+waiting state. Then send a graceful shutdown request as shown above.
--- a/modules/microsite/src/main/tut/doc/metadata.md
+++ b/modules/microsite/src/main/tut/doc/metadata.md
@ -0,0 +1,87 @@
+---
+layout: docs
+title: Adding Meta Data
+---
+
+# {{ page.title }}
+
+## Meta Data
+
+The processing can be controlled implicitly by the provided meta
+data. The *Meta Data* page lets you manage this meta data. You can
+create the following:
+
+- Tags
+- Organizations
+- Persons
+- Equipments
+
+### Tags
+
+Items can be tagged with multiple custom tags (aka labels). This
+allows to describe many different workflows people may have with their
+documents.
+
+A tag can have a *category*. This is meant to group tags together. For
+example, you may want to have a tag category *doctype* that is
+comprised of tags like *bill*, *contract*, *receipt* and so on. Or for
+workflows, a tag category *state* may exist that includes tags like
+*Todo* or *Waiting*. Or you can tag items with user names to provide
+"assignment" semantics. Docspell doesn't propose any workflow, but it
+can help to implement some.
+
+The tags are *not* taken into account when processing. Docspell will
+not automatically associate tags to your items. The tags are only
+meant to be used manually.
+
+
+### Organization and Person
+
+The organization entity represents a non-personal (organization or
+company) correspondent of an item. Docspell will choose one or more
+organizations when processing documents and associate the "best" match
+with your item.
+
+The person entity can appear in two roles: It may be a correspondent
+or the person an item is about. So a person is either a correspondent
+or a concerning person. Docspell cannot know which person is which,
+therefore you need to tell this by checking the box "Use for
+concerning person suggestion only". If this is checked, docspell will
+use this person only to suggest a concerning person. Otherwise the
+person is used only for correspondent suggestions.
+
+Document processing uses the following properties:
+
+- name
+- websites
+- e-mails
+
+The websites and e-mails can be added as contact information. If these
+three are present, you should get good matches from docspell. All
+other fields of an organization and person are not used during
+document processing. They might be useful when using this as a real
+address book.
+
+
+### Equipment
+
+The equipment entity is almost like a tag. In fact, it could be
+replaced by a tag with a specific known category. The difference is
+that docspell will try to find a match and associate it with your
+item. The equipment represents non-personal things that an item is
+about. Examples are: bills or insurances for *cars*, contracts for
+*houses* or *flats*.
+
+Equipments don't have contact information, so the only property that
+is used to find matches during document processing is its name.
+
+
+## Document Language
+
+An important setting is the language of your documents. This helps OCR
+and text analysis. You can select between English and German
+currently.
+
+Go to the *Collective Settings* page and click *Document
+Language*. This will set the language for all your documents. It is
+not (yet) possible to specify it when uploading.
--- a/modules/microsite/src/main/tut/doc/processing.md
+++ b/modules/microsite/src/main/tut/doc/processing.md
@ -0,0 +1,40 @@
+---
+layout: docs
+title: Processing Queue
+---
+
+# {{ page.title }}
+
+
+The page *Processing Queue* shows the current state of document
+processing for your uploads.
+
+At the top of the page a list of running jobs is shown. Below that,
+the left column shows jobs that wait to be picked up by the job
+executor. On the right are finished jobs. The number of finished jobs
+is cut to some maximum and is also restricted by a date range. The
+page refreshes itself automatically to show the progress.
+
+Example screenshot:
+
+<div class="thumbnail">
+  <img src="../img/processing-queue.jpg">
+</div>
+
+You can cancel running jobs or remove waiting ones from the queue. If
+you click on the small file symbol on finished jobs, you can inspect
+its log messages again. A running job displays the job executor id
+that executes the job.
+
+Currently the job queue executes just the document processing tasks,
+but it may be used for other long running tasks in the future.
+
+Since job executors are shared among all collectives, it may happen
+that a job is some time waiting until it is picked up by a job
+executor. You can always start more job executors to help out.
+
+If a job fails, it is retried after some time. Only if it fails too
+often (can be configured), it then is finished with *failed* state. If
+processing finally fails, the item is still created, just without
+suggestions. But if processing is cancelled by the user, the item is
+not created.
--- a/modules/microsite/src/main/tut/doc/uploading.md
+++ b/modules/microsite/src/main/tut/doc/uploading.md
@ -0,0 +1,130 @@
+---
+layout: docs
+title: Uploads
+---
+
+# {{page.title}}
+
+
+This page describes how files can get into docspell. Technically,
+there is just one way: via http multipart/form-data requests.
+
+
+## Authenticated Upload
+
+From within the web application there is the "Upload Files"
+page. There you can select multiple files to upload. You can also
+specify whether these files should become one item or if every file is
+a separate item.
+
+When you click "Submit" the files are uploaded and stored in the
+database. Then the job executors are notified and immediately
+start processing them.
+
+Go to the top-right menu and click "Processing Queue" to see the
+current state.
+
+This obviously requires an authenticated user. While this is handy for
+ad-hoc uploads, it is very inconvenient for automating it by custom
+scripts. For this the next variant exists.
+
+## Anonymous Upload
+
+It is also possible to upload files without authentication. This
+should make tools that interact with docspell much easier to write.
+
+
+### Creating Anonymous Uploads
+
+Go to "Collective Settings" and then to the "Source" tab. A *Source*
+identifies an endpoint where files can be uploaded
+anonymously. Creating a new source creates a long unique id which is
+part of a URL that can be used to upload files. You can deactivate or
+delete the source at any time, at which point uploading is no longer
+possible. The idea is that this URL can be given away safely: you can
+delete it at any time, and no passwords or secrets are visible. Not
+even your username is visible.
+
+Example screenshot:
+
+<div class="thumbnail">
+  <img src="../img/sources-form.jpg">
+</div>
+
+This example shows a source with name "test". It defines two urls:
+
+- `/app/index.html#/upload/<id>`
+- `/api/v1/open/upload/item/<id>`
+
+The first points to a web page where anyone can upload files into
+your account. You could give this url to people so they can send files
+directly into your docspell.
+
+The second url is the API url, which accepts the upload requests (it
+is also used by the first url).
+
+For example, this url can be used to upload files with curl:
+
+``` bash
+$ curl -XPOST -F file=@test.pdf http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
+{"success":true,"message":"Files submitted."}
+```
+
+You could add more `-F file=@/path/to/your/file.pdf` to upload
+multiple files (note, the `@` is required by curl, so it knows that
+the following is a file).
+
+When files are uploaded to a source endpoint, the items resulting
+from these uploads are marked with the name of the source. So you know
+which source an item originated from.
+
+If files are uploaded using the web application's *Upload files* page,
+the source is implicitly set to `webapp`. If you also want to let
+docspell count the files uploaded through the web interface, just
+create a source (can be inactive) with that name (`webapp`).
+
+
+## The Request
+
+This section gives more details about the upload request. It is an
+HTTP `multipart/form-data` request with two possible fields:
+
+- meta
+- file
+
+The `file` field can appear multiple times and is required at least
+once. It is the part containing the file to upload.
+
+The `meta` part is completely optional and can define additional meta
+data that docspell uses to create items from the given files. It
+allows transferring structured information together with the
+unstructured binary files.
+
+The `meta` content must be `application/json` containing this
+structure:
+
+```
+{ multiple: Bool
+, direction: Maybe String
+}
+```
+
+The `multiple` property defaults to `true`. It means that each file
+in the upload request corresponds to a single item. An upload with 5
+files will result in 5 items being created. If it is `false`, then
+docspell will create just one item that contains all files.
+
+Furthermore, the direction of the document (one of `incoming` or
+`outgoing`) can be given. It is optional: it can be left out or set to
+`null`.
+
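As a small illustration of the `meta` part described above, the following Python sketch builds the JSON text for that field. The helper name `upload_meta` and the client-side check of `direction` are assumptions made for this example and are not part of docspell. Only the two properties and their allowed values come from the documentation.

```python
import json

# Directions listed in the documentation; None stands for "not specified".
VALID_DIRECTIONS = {"incoming", "outgoing", None}


def upload_meta(multiple=True, direction=None):
    """Return the JSON text for the optional `meta` form field (hypothetical helper)."""
    if direction not in VALID_DIRECTIONS:
        raise ValueError(
            f"direction must be 'incoming', 'outgoing' or None, got {direction!r}"
        )
    # None is serialized as JSON null, which the documentation allows.
    return json.dumps({"multiple": bool(multiple), "direction": direction})


# One item that contains all uploaded files, marked as outgoing.
print(upload_meta(multiple=False, direction="outgoing"))
# prints {"multiple": false, "direction": "outgoing"}
```

Calling `upload_meta()` without arguments produces the documented defaults: one item per file and no direction.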
+This kind of request is very common and most programming languages
+have support for this. For example, here is another curl command
+uploading two files with meta data:
+
+``` bash
+curl -XPOST -F meta='{"multiple":false, "direction": "outgoing"}' \
+            -F file=@letter-en-source.pdf \
+            -F file=@letter-de-source.pdf \
+            http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
+```
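For scripts that cannot shell out to curl, the same request can be assembled by hand. The following is a minimal Python sketch using only the standard library. The function names `build_upload_body` and `upload` are made up for this example. Setting `Content-Type: application/json` on the `meta` part follows the description of the request above, and guessing each file's content type from its name is an assumption about what the server accepts.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_upload_body(files, meta=None):
    """Encode files and optional meta data as multipart/form-data.

    `files` is a list of (filename, bytes) pairs.
    Returns the pair (body, content_type_header).
    """
    boundary = uuid.uuid4().hex
    parts = []
    if meta is not None:
        meta_part = (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="meta"\r\n'
            "Content-Type: application/json\r\n\r\n"
            f"{json.dumps(meta)}\r\n"
        )
        parts.append(meta_part.encode("utf-8"))
    for filename, content in files:
        # Fall back to a generic binary type when the name gives no hint.
        ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            f"Content-Type: {ctype}\r\n\r\n"
        )
        parts.append(header.encode("utf-8") + content + b"\r\n")
    # The closing delimiter ends the multipart body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def upload(url, paths, meta=None):
    """POST the files at `paths` to a source upload url and return the JSON reply."""
    files = [(Path(p).name, Path(p).read_bytes()) for p in paths]
    body, content_type = build_upload_body(files, meta)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a running server, `upload("http://localhost:7880/api/v1/open/upload/item/<id>", ["letter-en-source.pdf", "letter-de-source.pdf"], {"multiple": False, "direction": "outgoing"})` would send the same request as the curl command above, where `<id>` stands for the id of your source.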
--- a/modules/microsite/src/main/tut/getit.md
+++ b/modules/microsite/src/main/tut/getit.md
@ -0,0 +1,55 @@
+---
+layout: home
+position: 3
+section: quickstart
+title: Quickstart
+technologies:
+ - first: ["Scala + Elm", "Backend is in Scala with Cats/Fs2, Webapp in Elm"]
+ - second: ["Unpaper + Tesseract", "Text is extracted using OCR provided by tesseract"]
+ - third: ["Stanford NLP", "Documents are analyzed using Stanford NLP classifiers"]
+---
+
+## Download
+
+You can download pre-compiled binaries from the [Release
+Page](https://github.com/eikek/docspell/releases). There are `deb`
+packages and generic zip files.
+
+You need to download the two files:
+
+- [docspell-restserver-{{site.version}}.zip](https://github.com/eikek/docspell/releases/download/v{{site.version}}/docspell-restserver-{{site.version}}.zip)
+- [docspell-joex-{{site.version}}.zip](https://github.com/eikek/docspell/releases/download/v{{site.version}}/docspell-joex-{{site.version}}.zip)
+
+
+## Prerequisite
+
+Install Java (use your package manager or look
+[here](https://adoptopenjdk.net/)),
+[tesseract](https://github.com/tesseract-ocr/tesseract),
+[ghostscript](http://pages.cs.wisc.edu/~ghost/) and possibly
+[unpaper](https://github.com/Flameeyes/unpaper). The last is not
+really required, but improves OCR.
+
+
+## Running
+
+1. Unzip both files:
+   ``` bash
+   $ unzip docspell-*.zip
+   ```
+2. Open two terminal windows and navigate to the directory
+   containing the zip files.
+3. Start both components by executing:
+   ``` bash
+   $ ./docspell-restserver*/bin/docspell-restserver
+   ```
+   in one terminal and
+   ``` bash
+   $ ./docspell-joex*/bin/docspell-joex
+   ```
+   in the other.
+4. Point your browser to: <http://localhost:7880/app/index.html>
+5. Register a new account, sign in and try it.
+
+Check the [documentation](doc.html) for more information on how to use
+docspell.
--- a/modules/microsite/src/main/tut/index.md
+++ b/modules/microsite/src/main/tut/index.md
@ -0,0 +1,49 @@
+---
+layout: home
+position: 1
+section: home
+title: Home
+technologies:
+ - first: ["Scala + Elm", "Backend is in Scala with Cats/Fs2, Webapp in Elm"]
+ - second: ["Unpaper + Tesseract", "Text is extracted using OCR provided by tesseract"]
+ - third: ["Stanford NLP", "Documents are analyzed using Stanford NLP classifiers"]
+---
+
+# A Document Organizer
+
+Docspell is a simple tool to cope with your piles of (digitized) paper
+documents. You'll need a scanner to convert your papers into PDF
+files. Docspell can then assist in organizing the resulting PDF files
+easily. Its main goal is to efficiently support two major use cases:
+
+1. **Stowing documents away**: Most of the time documents are received
+   or created. It should be *fast* to stow them away, knowing that
+   they can be found if necessary.
+
+   Upload the PDF files to docspell. Docspell finds meta data and will
+   link them to your document, automatically. There may be false
+   positives, so a short review is recommended. Though even if not,
+   the results are not that bad.
+2. **Finding them**: If you need a document, you can search for
+   it. Usually, restricting to a date range and a correspondent will
+   result in only a few documents to sift through. Alternatively, you
+   can add your own tags, names etc. to better match your workflow.
+
+The meta data that docspell uses is provided by you. You need to
+maintain a list of correspondents and maybe other things you want
+docspell to draw suggestions from. So if a new document arrives (from
+an unknown correspondent) then you would add a new entry to your meta
+data and link it manually to the document. But the next time, docspell
+will do it for you.
+
+Docspell is *not* a document management system. There are many such
+systems that have many more features. Docspell focuses on the two use
+cases described above, which is already quite useful.
+
+Check out the quick [demo](demo.html) to get a first impression and the
+[quickstart](getit.html) page if you want to try it out.
+
+## License
+
+This project is distributed under the
+[GPLv3](http://www.gnu.org/licenses/gpl-3.0.html).
				`@ -0,0 +1 @@`
				`../../../../../../webapp/src/main/webjar/favicon/android-icon-96x96.png`