Upgrade microsite

Eike Kettner
2019-12-29 23:37:32 +01:00
parent 2001cca88b
commit 57e274e2b0
42 changed files with 599 additions and 70 deletions

@ -0,0 +1,12 @@
---
layout: docs
title: ADRs
---
# ADR
- [0001 Components](adr/0001_components.html)
- [0002 Component Interaction](adr/0002_component_interaction.html)
- [0003 Encryption](adr/0003_encryption.html)
- [0004 ISO8601 vs Unix](adr/0004_iso8601vsEpoch.html)
- [0005 Job Executor](adr/0005_job-executor.html)

@ -0,0 +1,33 @@
# Use Markdown Architectural Decision Records
## Context and Problem Statement
We want to [record architectural decisions](https://adr.github.io/)
made in this project. Which format and structure should these records
follow?
## Considered Options
* [MADR](https://adr.github.io/madr/) 2.1.0 - The Markdown Architectural Decision Records
* [Michael Nygard's template](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions) - The first incarnation of the term "ADR"
* [Sustainable Architectural
Decisions](https://www.infoq.com/articles/sustainable-architectural-design-decisions) -
The Y-Statements
* Other templates listed at
<https://github.com/joelparkerhenderson/architecture_decision_record>
* Formless - No conventions for file format and structure
## Decision Outcome
Chosen option: "MADR 2.1.0", because
* Implicit assumptions should be made explicit. Design documentation
is important to enable people to understand the decisions later on.
See also [A rational design process: How and why to fake
it](https://doi.org/10.1109/TSE.1986.6312940).
* The MADR format is lean and fits our development style.
* The MADR structure is comprehensible and facilitates usage &
maintenance.
* The MADR project is actively maintained.
* Version 2.1.0 was the latest version available when we started
documenting ADRs.

@ -0,0 +1,66 @@
---
layout: docs
title: Components
---
# Components
## Context and Problem Statement
How should the application be structured into its main components? The
goal is to be able to have multiple rest servers/webapps and multiple
document processor components working together.
## Decision Outcome
The following are the "main" modules. There may be more helper modules
and libraries that support implementing a feature.
### store
The code related to database access. It also provides the job
queue. It is designed as a library.
### joex
Joex stands for "job executor".
An application that executes jobs from the queue and therefore depends
on the `store` module. It provides the code for all tasks that can be
submitted as jobs. If no jobs are in the queue, the joex "sleeps"
and must be woken via an external request.
It provides the document processing code.
It provides an http rest server to get insight into the joex state
and also to be notified of new jobs.
### backend
It provides all the logic, except document processing, as a set of
"operations". An operation can be directly mapped to a rest
endpoint.
It is designed as a library.
### rest api
This module contains the specification for the rest server as an
`openapi.yml` file. It is packaged as a scala library that also
provides types and conversions to/from json.
The idea is that the `rest server` module can depend on it as well as
rest clients.
### rest server
This is the main application. It directly depends on the `backend`
module, and each rest endpoint maps to a "backend operation". It is
also responsible for converting the json data inside http requests
to/from types recognized by the `backend` module.
### webapp
This module provides the user interface as a web application.
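To make the wiring concrete, here is a minimal sketch of how these
modules might be declared in sbt. The directory layout and the exact
dependency list are assumptions for illustration, not the actual build
definition:

```scala
// build.sbt (sketch; module paths and names are assumed)
lazy val store      = project.in(file("modules/store"))
lazy val joex       = project.in(file("modules/joex")).dependsOn(store)
lazy val backend    = project.in(file("modules/backend")).dependsOn(store)
lazy val restapi    = project.in(file("modules/restapi"))      // openapi.yml + json types
lazy val restserver = project.in(file("modules/restserver")).dependsOn(backend, restapi)
lazy val webapp     = project.in(file("modules/webapp"))       // the web ui
```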

@ -0,0 +1,65 @@
---
layout: docs
title: Component Interaction
---
# Component Interaction
## Context and Problem Statement
There are multiple web applications with their rest servers and there
are multiple document processors. These processes must communicate:
- once a new job is added to the queue the rest server must somehow
notify processors to wake up
- once a processor takes a job, it must propagate the progress and
outcome to all rest servers, so that the rest server can notify the
user who is currently logged in. Since it's not known which
rest server the user is using right now, all must be notified.
## Considered Options
1. JMS (ActiveMQ or similar): Message Broker as another active
component
2. Akka: using a cluster
3. DB: Register with "call back urls"
## Decision Outcome
Choosing option 3: DB as central synchronisation point.
The reason is that this is the simplest solution and doesn't require
external libraries or additional processes. The other options seem too
big a weapon for the task at hand. They are both large components
themselves and require more knowledge to use them efficiently.
It works roughly like this:
- rest servers and processors register at the database on startup each
with a unique call-back url
- and deregister on shutdown
- each component has db access
- rest servers can list all processors and vice versa
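A minimal sketch of what this registration could look like in code;
all type, field and method names here are made up for illustration:

```scala
import java.time.Instant

// Each component inserts one row into a shared table on startup and
// deletes it again on shutdown.
final case class Node(
  id: String,          // unique, manually supplied component id
  nodeType: String,    // "restserver" or "joex"
  callbackUrl: String, // where other components can reach this one
  created: Instant
)

trait NodeRepo {
  def register(node: Node): Unit            // called on startup
  def deregister(id: String): Unit          // called on shutdown
  def findAll(nodeType: String): List[Node] // list peers to notify or ping
}
```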
### Positive Consequences
- the complexity of the whole application is not increased
- since a lot of data must be transferred to the document processors,
this is solved by simply accessing the db. So the protocol for data
exchange is set. There is no need for other protocols that handle
large data (http chunking etc)
- uses the already existing db as synchronisation point
- no additional knowledge required
- simple to understand and therefore not hard to debug
### Negative Consequences
- all components must have db access. This is also a security
concern: if one of those processes is compromised, db access is
possible. And it simply is another dependency that is not really
required for the joex component
- the joex component cannot be in an untrusted environment (untrusted
from the db's point of view). For example, it is not possible to
create "personal joex" that only receive your own jobs…
- in order to know if a component is really active, one must run a
ping against the call-back url

@ -0,0 +1,95 @@
---
layout: docs
title: Encryption
---
# Encryption
## Context and Problem Statement
Since docspell may store important documents, it should be possible to
encrypt them on the server. It should be (almost) transparent to the
user; for example, a user must be able to log in and download a file in
clear form. That is, the server must also decrypt them.
All users of a collective should have access to the files. This
requires sharing the key among the users of a collective.
But even when files are encrypted, the associated metadata is not!
In particular, access to the database would allow one to see tags,
associated persons and correspondents of documents.
So in short, encryption means:
- file contents (the blobs and extracted text) are encrypted
- metadata is not
- secret keys are stored at the server (protected by a passphrase),
such that files can be downloaded in clear form
## Decision Drivers
* the major driver is to provide as much privacy as possible for users
* even at the expense of fewer features; currently I think that the
associated metadata is enough for finding documents (i.e. full text
search is not needed)
## Considered Options
It is clear that only blobs (file contents) can be encrypted, but not
the associated metadata. And obviously, the extracted text must be
encrypted, too.
### Public Key Encryption (PKE)
With PKE, the server can automatically encrypt files using publicly
available key data. It wouldn't require a user to provide a
passphrase for encryption, only for decryption.
This would allow processing files first (extracting text, doing text
analysis) and encrypting them (and the text) afterwards.
The public and secret keys are stored in the database. The secret key
must be protected. This can be done by encrypting the passphrase to
the secret key using each user's login password. If a user logs in, he
or she must provide the correct password. Using this password, the
private key can be unlocked. This requires storing the private key
passphrase, encrypted with every user's password, in the database. So
the whole security then depends on the users' password quality.
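For illustration, here is a sketch of such a key-protection step using
standard JCA primitives. The concrete parameters (PBKDF2, AES-GCM,
iteration count) are assumptions, not a decided design:

```scala
import java.security.SecureRandom
import javax.crypto.{Cipher, SecretKeyFactory}
import javax.crypto.spec.{GCMParameterSpec, PBEKeySpec, SecretKeySpec}

object KeyWrap {
  // Derive an AES key from the user's login password via PBKDF2.
  private def deriveKey(password: Array[Char], salt: Array[Byte]): SecretKeySpec = {
    val spec = new PBEKeySpec(password, salt, 100000, 256)
    val kf   = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
    new SecretKeySpec(kf.generateSecret(spec).getEncoded, "AES")
  }

  // Encrypt the secret-key passphrase with the derived key; the salt,
  // iv and ciphertext would be stored per user in the database.
  def wrap(passphrase: Array[Byte], password: Array[Char]): (Array[Byte], Array[Byte], Array[Byte]) = {
    val rnd  = new SecureRandom()
    val salt = new Array[Byte](16); rnd.nextBytes(salt)
    val iv   = new Array[Byte](12); rnd.nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, deriveKey(password, salt), new GCMParameterSpec(128, iv))
    (salt, iv, cipher.doFinal(passphrase))
  }
}
```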
There are plenty of other difficulties with this approach (password
changes, new secret keys, adding users, etc).
Using this kind of encryption would protect the data against offline
attacks and also against accidental leakage (for example, if a bug in
the software were to expose a file of another user).
### No Encryption
If only blobs are encrypted, against which type of attack would it
provide protection?
The users must still trust the server. First, in order to provide the
wanted features (document processing), the server must see the file
contents. Second, it will receive and serve files in clear form, so it
has access to them anyway.
With that in mind, the "only" feature is to protect against "stolen
database" attacks. If the database is somehow leaked, the attackers
would only see the metadata, but not the real documents. It also
protects against leakage caused by a programming error.
But the downside is that it increases complexity *a lot*. And since
this is a personal tool for personal use, is it worth the effort?
## Decision Outcome
No encryption, because of its complexity.
For now, this tool is only meant for "self deployment" and personal
use. If this changes or there is enough time, this decision should be
reconsidered.

@ -0,0 +1,42 @@
---
layout: docs
title: ISO8601 vs Millis
---
# ISO8601 vs Millis as Date-Time transfer
## Context and Problem Statement
The question is whether the REST API should return an ISO8601
formatted string in UTC timezone, or the Unix time (number of
milliseconds since 1970-01-01 UTC).
There is quite some controversy about it.
- <https://stackoverflow.com/questions/47426786/epoch-or-iso8601-date-format>
- <https://nbsoftsolutions.com/blog/designing-a-rest-api-unix-time-vs-iso-8601>
In my opinion, the ISO8601 format (always UTC) is better, because it
is more readable. But the Elm folks are on the other side:
- <https://package.elm-lang.org/packages/elm/time/1.0.0#iso-8601>
- <https://package.elm-lang.org/packages/rtfeldman/elm-iso8601-date-strings/latest/>
One can convert from an ISO8601 date-time string in UTC into the
epoch millis and vice versa. So to me it is the same. There is no less
information in an ISO8601 string than in the epoch millis.
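For example, with `java.time` on the server side the conversion is
lossless in both directions:

```scala
import java.time.Instant

object TimeCodec {
  // ISO8601 in UTC and epoch millis carry exactly the same information.
  def toIso(millis: Long): String = Instant.ofEpochMilli(millis).toString
  def toMillis(iso: String): Long = Instant.parse(iso).toEpochMilli
  // toIso(0L)                        == "1970-01-01T00:00:00Z"
  // toMillis("1970-01-01T00:00:00Z") == 0L
}
```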
To avoid confusion, all date/time values should use the same encoding.
## Decision Outcome
I go with the epoch time. Every timestamp/date-time value is
transferred as a Unix timestamp in milliseconds.
Reasons:
- the Elm application needs to frequently calculate with these values
to render the current waiting time etc. This is easier with plain
numbers, as no date parsing is required first
- Since the UI is written with Elm, it's probably good to adopt their
style

@ -0,0 +1,136 @@
---
layout: docs
title: Joex - Job Executor
---
# Job Executor
## Context and Problem Statement
Docspell is a multi-user application. When processing users'
documents, there must be some thought on how to distribute all the
processing jobs on a much more restricted set of resources. There may
be 100 users but only 4 cores that can process documents at a
time. A simple FIFO is not enough, since it results in an unfair
distribution. The first user to submit 20 documents would then occupy
all cores for quite some time while all other users wait.
This tries to find a more fair distribution among the users (strictly
meaning collectives here) of docspell.
The job executor is a separate component that will run in its own
process. It takes the next job from the "queue" and executes the
associated task. This is used to run the document processing jobs
(text extraction, text analysis etc).
1. The task execution should survive restarts. State and task code
must be recreated from some persisted state.
2. The processing should be fair with respect to collectives.
3. It must be possible to run many job executors, possibly on
different machines. This can be used to quickly enable more
processing power and removing it once the peak is over.
4. Task execution can fail and it should be possible to retry those
tasks. One reason is that errors may be temporary (for example when
talking to a third party service); another is to enable repairing
without stopping the job executor. Some errors might be easily
repaired (a program was not installed or whatever). In such a case it
is good to know that the task will be retried later.
## Considered Options
In contrast to other ADRs, this is more a sketch of the thoughts
behind the current implementation.
1. Job descriptions are serialized and written to the database into a
table. This becomes the queue. Tasks are identified by names and a
job executor implementation must have a map of names to code to
look up the task to perform. The task's arguments are serialized into
a string and written to the database. Tasks must decode the
string. This can be conveniently done using JSON and the provided
circe decoders (see the sketch after this list).
2. To provide fair execution, jobs are organized into groups. When a
new job is requested from the queue, first a group is selected
using a round-robin strategy. This should ensure good enough
fairness among groups. A group maps to a collective. Within a
group, a job is selected based on priority, submitted time (fifo)
and job state (see notes about stuck jobs).
3. Allowing multiple job executors means that getting the next job can
fail due to simultaneously running transactions. It is retried until
it succeeds. Taking a job puts it into _scheduled_ state. Each job
executor has a unique (manually supplied) id and a job is marked
with that id once it is handed to the executor.
4. When a task fails, its state is updated to state _stuck_. Stuck
jobs are retried in the future. The queue prefers to return stuck
jobs that are due at the specific point in time ignoring the
priority hint.
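As a sketch of point 1, task arguments can be modeled as an ordinary
case class with circe codecs. The task name and the fields here are
hypothetical, chosen only for illustration:

```scala
import io.circe.{Decoder, Encoder}
import io.circe.generic.semiauto.{deriveDecoder, deriveEncoder}
import io.circe.syntax._

// Hypothetical arguments for a "process-item" task.
final case class ProcessItemArgs(itemId: String, collective: String)
object ProcessItemArgs {
  val taskName = "process-item"
  implicit val encoder: Encoder[ProcessItemArgs] = deriveEncoder
  implicit val decoder: Decoder[ProcessItemArgs] = deriveDecoder
}

object QueueDemo {
  // Stored in the queue table as a plain string column:
  val argsJson: String =
    ProcessItemArgs("item-123", "family").asJson.noSpaces

  // The executor resolves the task by name and decodes the arguments:
  val args: Either[io.circe.Error, ProcessItemArgs] =
    io.circe.parser.decode[ProcessItemArgs](argsJson)
}
```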
### More Details
A job has these properties:
- id (something random)
- group
- taskname (to choose task to run)
- submitted-date
- worker (the id of the job executor)
- state, one of: waiting, scheduled, running, stuck, cancelled,
failed, success
- waiting: job has been inserted into the queue
- scheduled: job has been handed over to some executor and is
marked with the job executor id
- running: a task is currently executing
- stuck: a task has failed and is being retried eventually
- cancelled: task has finished and there was a cancel request
- failed: task has failed, retries exceeded
- success: task has completed successfully
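Modeled as code, the states could form a simple ADT (a sketch, not
the actual implementation):

```scala
sealed trait JobState
object JobState {
  case object Waiting   extends JobState // inserted into the queue
  case object Scheduled extends JobState // handed to an executor, marked with its id
  case object Running   extends JobState // task is currently executing
  case object Stuck     extends JobState // failed, will be retried eventually
  case object Cancelled extends JobState // finished after a cancel request
  case object Failed    extends JobState // failed, retries exceeded
  case object Success   extends JobState // completed successfully
}
```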
The queue has a `take` or `nextJob` operation that takes the worker-id
and a priority hint and goes roughly like this:
- select the next group using round-robin strategy
- select all jobs with that group, where
- state is stuck and waiting time has elapsed
- state is waiting, with the given priority if possible
- jobs are ordered by submitted time, but stuck jobs whose waiting
time elapsed are preferred
There are two priorities within a group: high and low. A configured
counting scheme determines when to select which priority. For example,
a counting scheme of `(2,1)` would select two high-priority jobs and
then one low-priority job. The `take` operation tries to prefer this
priority but falls back to the other if no job with this priority is
available.
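A simplified, in-memory sketch of this selection logic, reusing the
`JobState` sketch above. All names are illustrative; the real
implementation works against the database:

```scala
import java.time.Instant

object SchedulingSketch {
  sealed trait Priority
  case object High extends Priority
  case object Low  extends Priority

  final case class QJob(
    id: String,
    group: String,            // the collective
    priority: Priority,
    submitted: Instant,
    retryAt: Option[Instant], // only set for stuck jobs
    state: JobState
  )

  // Counting scheme (2,1): two high-priority picks, then one low-priority pick.
  final class CountingScheme(high: Int, low: Int) {
    private var n = 0
    def nextPriority(): Priority = {
      val p = if (n < high) High else Low
      n = (n + 1) % (high + low)
      p
    }
  }

  // Within the group chosen by round-robin: stuck jobs whose waiting
  // time has elapsed come first, then waiting jobs in FIFO order; the
  // hinted priority is preferred, with a fallback to any other.
  def selectInGroup(jobs: List[QJob], prioHint: Priority, now: Instant): Option[QJob] = {
    val dueStuck = jobs
      .filter(j => j.state == JobState.Stuck && j.retryAt.exists(t => !t.isAfter(now)))
      .sortBy(_.submitted.toEpochMilli)
    val waiting = jobs.filter(_.state == JobState.Waiting).sortBy(_.submitted.toEpochMilli)
    val ordered = dueStuck ++ waiting
    ordered.find(_.priority == prioHint).orElse(ordered.headOption)
  }
}
```

An executor would ask the `CountingScheme` for the next priority hint,
pick a group round-robin, and then run `selectInGroup` over that
group's jobs.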
A group corresponds to a collective, so all collectives get (roughly)
equal treatment.
Once there are no jobs in the queue, the executor goes to sleep and
must be woken to run again. If a job is submitted, the executors are
notified.
### Stuck Jobs
A job goes into _stuck_ state if its task has failed. In this state,
the task is rerun after a while, until a maximum retry count is
reached.
The problem is how to notify all executors when the waiting time has
elapsed. If one executor puts a job into stuck state, it means that
all others should start looking into the queue again after `x`
minutes. It would be possible to tell all existing executors to
schedule themselves to wake up in the future, but this would miss all
executors that show up later.
The waiting time is increased exponentially after each retry (`2 ^
retry`) and is meant as the minimum waiting time. So it is ok if
all executors wake up periodically and check for new work. Most of the
time this should not be necessary and is just a fallback if only stuck
jobs are in the queue and nothing is submitted for a long time. If the
system is used, jobs get submitted once in a while and would wake all
executors.
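As a sketch, the minimum waiting time is then:

```scala
import scala.concurrent.duration._

object RetrySketch {
  // Minimum wait before retry `n`: 2 ^ n minutes (2, 4, 8, … minutes).
  def retryWait(n: Int): FiniteDuration = math.pow(2, n.toDouble).toInt.minutes
}
```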

@ -0,0 +1,72 @@
# [short title of solved problem and solution]
* Status: [proposed | rejected | accepted | deprecated | … | superseded by [ADR-0005](0005-example.md)] <!-- optional -->
* Deciders: [list everyone involved in the decision] <!-- optional -->
* Date: [YYYY-MM-DD when the decision was last updated] <!-- optional -->
Technical Story: [description | ticket/issue URL] <!-- optional -->
## Context and Problem Statement
[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]
## Decision Drivers <!-- optional -->
* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
* … <!-- numbers of drivers can vary -->
## Considered Options
* [option 1]
* [option 2]
* [option 3]
* … <!-- numbers of options can vary -->
## Decision Outcome
Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)].
### Positive Consequences <!-- optional -->
* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
* …
### Negative Consequences <!-- optional -->
* [e.g., compromising quality attribute, follow-up decisions required, …]
* …
## Pros and Cons of the Options <!-- optional -->
### [option 1]
[example | description | pointer to more information | …] <!-- optional -->
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
### [option 2]
[example | description | pointer to more information | …] <!-- optional -->
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
### [option 3]
[example | description | pointer to more information | …] <!-- optional -->
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
## Links <!-- optional -->
* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->