Mirror of https://github.com/TheAnachronism/docspell.git (synced 2025-06-22 02:18:26 +00:00)

Commit: Upgrade microsite
# Use Markdown Architectural Decision Records

## Context and Problem Statement

We want to [record architectural decisions](https://adr.github.io/)
made in this project. Which format and structure should these records
follow?

## Considered Options

* [MADR](https://adr.github.io/madr/) 2.1.0 - The Markdown Architectural Decision Records
* [Michael Nygard's template](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions) - The first incarnation of the term "ADR"
* [Sustainable Architectural Decisions](https://www.infoq.com/articles/sustainable-architectural-design-decisions) - The Y-Statements
* Other templates listed at <https://github.com/joelparkerhenderson/architecture_decision_record>
* Formless - No conventions for file format and structure

## Decision Outcome

Chosen option: "MADR 2.1.0", because

* Implicit assumptions should be made explicit. Design documentation
  is important to enable people to understand the decisions later on.
  See also [A rational design process: How and why to fake
  it](https://doi.org/10.1109/TSE.1986.6312940).
* The MADR format is lean and fits our development style.
* The MADR structure is comprehensible and facilitates usage &
  maintenance.
* The MADR project is actively maintained.
* Version 2.1.0 was the latest one available when we started to
  document ADRs.

modules/microsite/docs/dev/adr/0001_components.md (new file, 66 lines)

---
layout: docs
title: Components
---

# Components

## Context and Problem Statement

How should the application be structured into its main components? The
goal is to be able to have multiple rest servers/webapps and multiple
document processor components working together.

## Decision Outcome

The following are the "main" modules. There may be more helper modules
and libraries that support implementing a feature.

### store

The code related to database access. It also provides the job
queue. It is designed as a library.

### joex

Joex stands for "job executor".

An application that executes jobs from the queue and therefore depends
on the `store` module. It provides the code for all tasks that can be
submitted as jobs. If no jobs are in the queue, the joex "sleeps"
and must be woken via an external request.

It provides the document processing code.

It provides an http rest server to get insight into the joex state
and also to be notified of new jobs.

### backend

It provides all the logic, except document processing, as a set of
"operations". An operation can be directly mapped to a rest
endpoint.

It is designed as a library.

### rest api

This module contains the specification for the rest server as an
`openapi.yml` file. It is packaged as a scala library that also
provides types and conversions to/from json.

The idea is that the `rest server` module can depend on it, as well
as rest clients.

### rest server

This is the main application. It directly depends on the `backend`
module, and each rest endpoint maps to a "backend operation". It is
also responsible for converting the json data inside http requests
to/from types recognized by the `backend` module.

### webapp

This module provides the user interface as a web application.
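The module layout described above could be expressed as an sbt multi-project build. This is only an illustrative sketch: the project names, directory paths, and dependency arrows are assumed from the descriptions on this page, not taken from the actual build definition.

```scala
// Hypothetical sbt sketch of the components described above.
// `store`, `backend` and `restapi` are libraries; `joex` and
// `restserver` are the runnable applications; `webapp` is the UI.
lazy val store      = project.in(file("modules/store"))
lazy val joex       = project.in(file("modules/joex")).dependsOn(store)
lazy val backend    = project.in(file("modules/backend")).dependsOn(store)
lazy val restapi    = project.in(file("modules/restapi"))
lazy val restserver = project
  .in(file("modules/restserver"))
  .dependsOn(backend, restapi)
lazy val webapp     = project.in(file("modules/webapp"))
```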

modules/microsite/docs/dev/adr/0002_component_interaction.md (new file, 65 lines)

---
layout: docs
title: Component Interaction
---

# Component Interaction

## Context and Problem Statement

There are multiple web applications with their rest servers and there
are multiple document processors. These processes must communicate:

- once a new job is added to the queue, the rest server must somehow
  notify processors to wake up
- once a processor takes a job, it must propagate the progress and
  outcome to all rest servers, so that the rest server can notify the
  user that is currently logged in. Since it's not known which
  rest server the user is using right now, all must be notified.

## Considered Options

1. JMS (ActiveMQ or similar): a message broker as another active
   component
2. Akka: using a cluster
3. DB: register with "call-back urls"

## Decision Outcome

Choosing option 3: the DB as central synchronisation point.

The reason is that this is the simplest solution and doesn't require
external libraries or more processes. The other options seem too big
a weapon for the task at hand. They are both large components
themselves and require more knowledge to use them efficiently.

It works roughly like this:

- rest servers and processors register at the database on startup,
  each with a unique call-back url
- and deregister on shutdown
- each component has db access
- rest servers can list all processors and vice versa
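The registration scheme above can be sketched as follows. In docspell this would be a database table; here a plain in-memory map stands in for it, and the `Node` fields and type names are assumptions made for illustration only.

```scala
// Hypothetical sketch of the db-based registration described above.
// A real implementation would back this with a table, not a map.
final case class Node(id: String, nodeType: String, callbackUrl: String)

final class NodeRegistry {
  private var nodes = Map.empty[String, Node]

  // register at startup (idempotent: re-registering updates the entry)
  def register(n: Node): Unit = nodes = nodes.updated(n.id, n)

  // deregister on shutdown
  def deregister(id: String): Unit = nodes = nodes - id

  // "rest servers can list all processors and vice versa"
  def listByType(t: String): List[Node] =
    nodes.values.filter(_.nodeType == t).toList
}
```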
### Positive Consequences

- the complexity of the whole application is not increased
- since a lot of data must be transferred to the document processors,
  this is solved by simply accessing the db. So the protocol for data
  exchange is set. There is no need for other protocols that handle
  large data (http chunking etc.)
- uses the already existing db as synchronisation point
- no additional knowledge required
- simple to understand and so not hard to debug

### Negative Consequences

- all components must have db access. This is also a security concern:
  if one of those processes is hacked, db access is possible. And it
  simply is another dependency that is not really required for the
  joex component
- the joex component cannot be in an untrusted environment (untrusted
  from the db's point of view). For example, it is not possible to
  create a "personal joex" that only receives your own jobs…
- in order to know if a component is really active, one must run a
  ping against the call-back url

modules/microsite/docs/dev/adr/0003_encryption.md (new file, 95 lines)

---
layout: docs
title: Encryption
---

# Encryption

## Context and Problem Statement

Since docspell may store important documents, it should be possible to
encrypt them on the server. It should be (almost) transparent to the
user; for example, a user must be able to login and download a file in
clear form. That is, the server must also decrypt them.

Additionally, all users of a collective should have access to the
files. This requires sharing the key among the users of a collective.

But even when files are encrypted, the associated metadata is not!
So, especially, access to the database would reveal the tags,
associated persons and correspondents of documents.

In short, encryption means:

- file contents (the blobs and extracted text) are encrypted
- metadata is not
- secret keys are stored at the server (protected by a passphrase),
  such that files can be downloaded in clear form

## Decision Drivers

* the major driver is to provide the most privacy possible for users
* even at the expense of fewer features; currently I think that the
  associated metadata is enough for finding documents (i.e. full text
  search is not needed)

## Considered Options

It is clear that only blobs (file contents) can be encrypted, but not
the associated metadata. And the extracted text must be encrypted,
too, obviously.

### Public Key Encryption (PKE)

With PKE the server can automatically encrypt files using publicly
available key data. It wouldn't require a user to provide a
passphrase for encryption, only for decryption.

This would allow for first processing files (extracting text, doing
text analysis) and encrypting them (and the text) afterwards.

The public and secret keys are stored in the database. The secret key
must be protected. This can be done by encrypting the passphrase to
the secret key using each user's login password. If a user logs in, he
or she must provide the correct password. Using this password, the
private key can be unlocked. This requires storing the private key's
passphrase, encrypted with every user's password, in the database. So
the whole security then depends on the users' password quality.

There are plenty of other difficulties with this approach (what about
password changes, new secret keys, adding users etc.).

Using this kind of encryption would protect the data against offline
attacks and also against accidental leakage (for example, if a bug in
the software would give access to a file of another user).
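The key-handling idea sketched in the PKE section amounts to wrapping the secret key's passphrase with a key derived from the user's login password. Below is a minimal, hypothetical sketch using the JDK's `javax.crypto` primitives; docspell does not implement this, the object and method names are made up, and a real version would generate a fresh random salt and IV per user and per encryption.

```scala
import javax.crypto.{Cipher, SecretKeyFactory}
import javax.crypto.spec.{GCMParameterSpec, PBEKeySpec, SecretKeySpec}

// Hypothetical sketch: the passphrase protecting the collective's
// secret key is itself encrypted with a key derived from the user's
// login password (PBKDF2 + AES-GCM). Salt and IV are supplied by the
// caller here only to keep the sketch short; real code must use
// fresh random values.
object PassphraseWrap {
  private def deriveKey(password: String, salt: Array[Byte]): SecretKeySpec = {
    val spec = new PBEKeySpec(password.toCharArray, salt, 100000, 256)
    val kf   = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
    new SecretKeySpec(kf.generateSecret(spec).getEncoded, "AES")
  }

  def wrap(passphrase: Array[Byte], password: String,
           salt: Array[Byte], iv: Array[Byte]): Array[Byte] = {
    val c = Cipher.getInstance("AES/GCM/NoPadding")
    c.init(Cipher.ENCRYPT_MODE, deriveKey(password, salt), new GCMParameterSpec(128, iv))
    c.doFinal(passphrase)
  }

  def unwrap(wrapped: Array[Byte], password: String,
             salt: Array[Byte], iv: Array[Byte]): Array[Byte] = {
    val c = Cipher.getInstance("AES/GCM/NoPadding")
    c.init(Cipher.DECRYPT_MODE, deriveKey(password, salt), new GCMParameterSpec(128, iv))
    c.doFinal(wrapped)
  }
}
```

On a password change, only this wrapped passphrase would need re-encrypting, not the files themselves.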
### No Encryption

If only blobs are encrypted, against which type of attack would it
provide protection?

The users must still trust the server. First, in order to provide the
wanted features (document processing), the server must see the file
contents. Then, it will receive and serve files in clear form, so it
has access to them anyway.

With that in mind, the "only" feature is to protect against "stolen
database" attacks. If the database is somehow leaked, the attackers
would only see the metadata, but not the real documents. It also
protects against leakage, maybe caused by a programming error.

But the downside is that it increases complexity *a lot*. And since
this is a personal tool for personal use, is it worth the effort?

## Decision Outcome

No encryption, because of its complexity.

For now, this tool is only meant for "self deployment" and personal
use. If this changes, or there is enough time, this decision should be
reconsidered.

modules/microsite/docs/dev/adr/0004_iso8601vsEpoch.md (new file, 42 lines)

---
layout: docs
title: ISO8601 vs Millis
---

# ISO8601 vs Millis for date-time transfer

## Context and Problem Statement

The question is whether the REST Api should return an ISO8601
formatted string in the UTC timezone, or the unix time (number of
milliseconds since 1970-01-01).

There is quite some controversy about it.

- <https://stackoverflow.com/questions/47426786/epoch-or-iso8601-date-format>
- <https://nbsoftsolutions.com/blog/designing-a-rest-api-unix-time-vs-iso-8601>

In my opinion, the ISO8601 format (always UTC) is better. The reason
is the better readability. But the elm folks are on the other side:

- <https://package.elm-lang.org/packages/elm/time/1.0.0#iso-8601>
- <https://package.elm-lang.org/packages/rtfeldman/elm-iso8601-date-strings/latest/>

One can convert from an ISO8601 date-time string in UTC time into
epoch millis and vice versa. So it is the same to me. There is no less
information in an ISO8601 string than in the epoch millis.

To avoid confusion, all date/time values should use the same encoding.

## Decision Outcome

I go with the epoch time. Every timestamp/date-time value is
transferred as a Unix timestamp.

Reasons:

- the Elm application needs to frequently calculate with these values
  to render the current waiting time etc. This is easier if they are
  numbers that don't require parsing dates first
- since the UI is written in Elm, it's probably good to adopt their
  style
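The lossless conversion claimed above is a one-liner with `java.time`, shown here as a Scala sketch (the `TimeCodec` object and its method names are made up for illustration):

```scala
import java.time.Instant

// An ISO8601 instant in UTC and epoch millis carry the same
// information: each converts losslessly into the other.
object TimeCodec {
  def toIso(millis: Long): String = Instant.ofEpochMilli(millis).toString
  def toMillis(iso: String): Long = Instant.parse(iso).toEpochMilli
}
```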

modules/microsite/docs/dev/adr/0005_job-executor.md (new file, 136 lines)

---
layout: docs
title: Joex - Job Executor
---

# Job Executor

## Context and Problem Statement

Docspell is a multi-user application. When processing users'
documents, there must be some thought on how to distribute all the
processing jobs over a much more restricted set of resources. There
may be 100 users but only 4 cores that can process documents at a
time. Doing simple FIFO is not enough, since it provides an unfair
distribution. The first user who submits 20 documents will then occupy
all cores for quite some time, and all other users would need to wait.

This tries to find a fairer distribution among the users (strictly
meaning collectives here) of docspell.

The job executor is a separate component that will run in its own
process. It takes the next job from the "queue" and executes the
associated task. This is used to run the document processing jobs
(text extraction, text analysis etc.).

1. The task execution should survive restarts. State and task code
   must be recreated from some persisted state.

2. The processing should be fair with respect to collectives.

3. It must be possible to run many job executors, possibly on
   different machines. This can be used to quickly add more
   processing power and to remove it once the peak is over.

4. Task execution can fail, and it should be possible to retry those
   tasks. The reasons are that errors may be temporary (for example
   when talking to a third party service), and to enable repairing
   without stopping the job executor. Some errors might be easily
   repaired (a program was not installed or whatever). In such a case
   it is good to know that the task will be retried later.
## Considered Options

In contrast to other ADRs, this is just some sketching of thoughts for
the current implementation.

1. Job descriptions are serialized and written to the database into a
   table. This becomes the queue. Tasks are identified by names, and a
   job executor implementation must have a map of names to code to
   look up the task to perform. The task's arguments are serialized
   into a string and written to the database. Tasks must decode the
   string. This can be conveniently done using JSON and the provided
   circe decoders.

2. To provide fair execution, jobs are organized into groups. When a
   new job is requested from the queue, first a group is selected
   using a round-robin strategy. This should ensure good enough
   fairness among groups. A group maps to a collective. Within a
   group, a job is selected based on priority, submitted time (fifo)
   and job state (see the notes about stuck jobs).

3. Allowing multiple job executors means that getting the next job can
   fail due to simultaneously running transactions. It is retried
   until it succeeds. Taking a job puts it into the _scheduled_ state.
   Each job executor has a unique (manually supplied) id, and jobs are
   marked with that id once they are handed to the executor.

4. When a task fails, its state is updated to _stuck_. Stuck
   jobs are retried in the future. The queue prefers to return stuck
   jobs that are due at the specific point in time, ignoring the
   priority hint.
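Point 1 above can be sketched like this: the executor holds a map from task name to code, and each task decodes its own serialized arguments. Docspell uses JSON via circe for this; the hand-rolled decoder, the `add-numbers` task and its argument format below are stand-ins made up for illustration.

```scala
// A task pairs a decoder for its serialized arguments with the code
// to run. Keeping `execute` on the class sidesteps the existential
// type when tasks of different argument types live in one map.
final case class Task[A](decode: String => Either[String, A], run: A => String) {
  def execute(args: String): Either[String, String] = decode(args).map(run)
}

object TaskRegistry {
  // the name -> code map; task name and argument format are made up
  private val tasks: Map[String, Task[_]] = Map(
    "add-numbers" -> Task[(Int, Int)](
      s =>
        s.split(',').map(_.trim) match {
          case Array(a, b) => Right((a.toInt, b.toInt))
          case _           => Left(s"invalid args: $s")
        },
      { case (a, b) => (a + b).toString }
    )
  )

  def execute(name: String, args: String): Either[String, String] =
    tasks.get(name).toRight(s"unknown task: $name").flatMap(_.execute(args))
}
```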
### More Details

A job has these properties:

- id (something random)
- group
- taskname (to choose the task to run)
- submitted-date
- worker (the id of the job executor)
- state, one of: waiting, scheduled, running, stuck, cancelled,
  failed, success
  - waiting: the job has been inserted into the queue
  - scheduled: the job has been handed over to some executor and is
    marked with the job executor id
  - running: a task is currently executing
  - stuck: a task has failed and is being retried eventually
  - cancelled: the task has finished and there was a cancel request
  - failed: the task has failed, exceeded the retries
  - success: the task has completed successfully

The queue has a `take` or `nextJob` operation that takes the worker-id
and a priority hint and goes roughly like this:

- select the next group using a round-robin strategy
- select all jobs of that group, where
  - the state is stuck and the waiting time has elapsed, or
  - the state is waiting and, if possible, the priority matches
- jobs are ordered by submitted time, but stuck jobs whose waiting
  time has elapsed are preferred

There are two priorities within a group: high and low. A configured
counting scheme determines when to select each priority. For
example, a counting scheme of `(2,1)` would select two high priority
jobs and then one low priority job. The `take` operation tries to
prefer this priority but falls back to the other if no job with this
priority is available.

A group corresponds to a collective, so all collectives get
(roughly) equal treatment.

Once there are no jobs in the queue, the executor goes to sleep and
must be woken to run again. If a job is submitted, the executors are
notified.
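The `(2,1)` counting scheme described above can be sketched as a simple cyclic counter. This is an illustrative sketch only; class and method names are made up, and it only states which priority the `take` operation should *prefer* (the fallback when no such job exists lives in the query itself).

```scala
// Cyclic priority selection: with scheme (high = 2, low = 1) the
// queue offers two high-priority slots, then one low-priority slot,
// and repeats.
final class PriorityCounter(high: Int, low: Int) {
  private var n = 0

  def nextPriority(): String = {
    val slot = n % (high + low) // position within the current cycle
    n += 1
    if (slot < high) "high" else "low"
  }
}
```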
### Stuck Jobs

A job goes into the _stuck_ state if the task has failed. In this
state, the task is rerun after a while until a maximum retry count is
reached.

The problem is how to notify all executors when the waiting time has
elapsed. If one executor puts a job into the stuck state, it means
that all others should start looking into the queue again after `x`
minutes. It would be possible to tell all existing executors to
schedule themselves to wake up in the future, but this would miss all
executors that show up later.

The waiting time is increased exponentially after each retry (`2 ^
retry`) and is meant as the minimum waiting time. So it is ok if all
executors wake up periodically and check for new work. Most of the
time this should not be necessary and is just a fallback for when only
stuck jobs are in the queue and nothing is submitted for a long time.
If the system is used, jobs get submitted once in a while and would
wake all executors.
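The retry delay above can be sketched in a few lines (treating the result as minutes is an assumption for illustration; the function names are made up):

```scala
// Minimum waiting time before the next retry of a stuck job, growing
// exponentially with the retry count as described above (2 ^ retry).
def retryDelayMinutes(retryCount: Int): Long =
  1L << retryCount

// A stuck job is due once its minimum waiting time has elapsed.
def isDue(stuckAtMillis: Long, retryCount: Int, nowMillis: Long): Boolean =
  nowMillis >= stuckAtMillis + retryDelayMinutes(retryCount) * 60000L
```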

modules/microsite/docs/dev/adr/template.md (new file, 72 lines)

# [short title of solved problem and solution]

* Status: [proposed | rejected | accepted | deprecated | … | superseded by [ADR-0005](0005-example.md)] <!-- optional -->
* Deciders: [list everyone involved in the decision] <!-- optional -->
* Date: [YYYY-MM-DD when the decision was last updated] <!-- optional -->

Technical Story: [description | ticket/issue URL] <!-- optional -->

## Context and Problem Statement

[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]

## Decision Drivers <!-- optional -->

* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
* … <!-- numbers of drivers can vary -->

## Considered Options

* [option 1]
* [option 2]
* [option 3]
* … <!-- numbers of options can vary -->

## Decision Outcome

Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)].

### Positive Consequences <!-- optional -->

* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
* …

### Negative Consequences <!-- optional -->

* [e.g., compromising quality attribute, follow-up decisions required, …]
* …

## Pros and Cons of the Options <!-- optional -->

### [option 1]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 2]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 3]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

## Links <!-- optional -->

* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->