Upgrade microsite

Eike Kettner
2019-12-29 23:37:32 +01:00
parent 2001cca88b
commit 57e274e2b0
42 changed files with 599 additions and 70 deletions

@ -0,0 +1,12 @@
---
layout: docs
title: ADRs
---
# ADR
- [0001 Components](adr/0001_components.html)
- [0002 Component Interaction](adr/0002_component_interaction.html)
- [0003 Encryption](adr/0003_encryption.html)
- [0004 ISO8601 vs Unix](adr/0004_iso8601vsEpoch.html)
- [0005 Job Executor](adr/0005_job-executor.html)

@ -0,0 +1,33 @@
# Use Markdown Architectural Decision Records
## Context and Problem Statement
We want to [record architectural decisions](https://adr.github.io/)
made in this project. Which format and structure should these records
follow?
## Considered Options
* [MADR](https://adr.github.io/madr/) 2.1.0 - The Markdown Architectural Decision Records
* [Michael Nygard's template](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions) - The first incarnation of the term "ADR"
* [Sustainable Architectural
Decisions](https://www.infoq.com/articles/sustainable-architectural-design-decisions) -
The Y-Statements
* Other templates listed at
<https://github.com/joelparkerhenderson/architecture_decision_record>
* Formless - No conventions for file format and structure
## Decision Outcome
Chosen option: "MADR 2.1.0", because
* Implicit assumptions should be made explicit. Design documentation
is important to enable people to understand the decisions later on.
See also [A rational design process: How and why to fake
it](https://doi.org/10.1109/TSE.1986.6312940).
* The MADR format is lean and fits our development style.
* The MADR structure is comprehensible and facilitates usage &
maintenance.
* The MADR project is actively maintained.
* Version 2.1.0 was the latest version available when we started
documenting ADRs.

@ -0,0 +1,66 @@
---
layout: docs
title: Components
---
# Components
## Context and Problem Statement
How should the application be structured into its main components? The
goal is to be able to have multiple rest servers/webapps and multiple
document processor components working together.
## Decision Outcome
The following are the "main" modules. There may be more helper modules
and libraries that support implementing a feature.
### store
The code related to database access. It also provides the job
queue. It is designed as a library.
### joex
Joex stands for "job executor".
An application that executes jobs from the queue and therefore depends
on the `store` module. It provides the code for all tasks that can be
submitted as jobs. If no jobs are in the queue, the joex "sleeps"
and must be woken via an external request.
It provides the document processing code.
It provides an http rest server to get insight into the joex state
and also to be notified of new jobs.
### backend
It provides all the logic, except document processing, as a set of
"operations". An operation can be directly mapped to a rest
endpoint.
It is designed as a library.
### rest api
This module contains the specification for the rest server as an
`openapi.yml` file. It is packaged as a scala library that also
provides types and conversions to/from json.
The idea is that the `rest server` module can depend on it as well as
rest clients.
### rest server
This is the main application. It directly depends on the `backend`
module, and each rest endpoint maps to a "backend operation". It is
also responsible for converting the json data inside http requests
to/from types recognized by the `backend` module.
### webapp
This module provides the user interface as a web application.
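To make the wiring concrete, here is a minimal sketch of how these
modules might be declared in sbt. The directory layout and the exact
dependency list are assumptions for illustration, not the actual build
definition:

```scala
// build.sbt (sketch; module paths and names are assumed)
lazy val store      = project.in(file("modules/store"))
lazy val joex       = project.in(file("modules/joex")).dependsOn(store)
lazy val backend    = project.in(file("modules/backend")).dependsOn(store)
lazy val restapi    = project.in(file("modules/restapi"))      // openapi.yml + json types
lazy val restserver = project.in(file("modules/restserver")).dependsOn(backend, restapi)
lazy val webapp     = project.in(file("modules/webapp"))       // the web ui
```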

@ -0,0 +1,65 @@
---
layout: docs
title: Component Interaction
---
# Component Interaction
## Context and Problem Statement
There are multiple web applications with their rest servers and there
are multiple document processors. These processes must communicate:
- once a new job is added to the queue the rest server must somehow
notify processors to wake up
- once a processor takes a job, it must propagate the progress and
outcome to all rest servers, so that the rest server can notify the
user who is currently logged in. Since it's not known which
rest server the user is using right now, all must be notified.
## Considered Options
1. JMS (ActiveMQ or similar): Message Broker as another active
component
2. Akka: using a cluster
3. DB: Register with "call back urls"
## Decision Outcome
Choosing option 3: DB as central synchronisation point.
The reason is that this is the simplest solution and doesn't require
external libraries or additional processes. The other options seem too
big a weapon for the task at hand. They are both large components
themselves and require more knowledge to use them efficiently.
It works roughly like this:
- rest servers and processors register at the database on startup each
with a unique call-back url
- and deregister on shutdown
- each component has db access
- rest servers can list all processors and vice versa
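A minimal sketch of what this registration could look like in code;
all type, field and method names here are made up for illustration:

```scala
import java.time.Instant

// Each component inserts one row into a shared table on startup and
// deletes it again on shutdown.
final case class Node(
  id: String,          // unique, manually supplied component id
  nodeType: String,    // "restserver" or "joex"
  callbackUrl: String, // where other components can reach this one
  created: Instant
)

trait NodeRepo {
  def register(node: Node): Unit            // called on startup
  def deregister(id: String): Unit          // called on shutdown
  def findAll(nodeType: String): List[Node] // list peers to notify or ping
}
```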
### Positive Consequences
- the complexity of the whole application is not increased
- since a lot of data must be transferred to the document processors,
this is solved by simply accessing the db. So the protocol for data
exchange is set. There is no need for other protocols that handle
large data (http chunking etc)
- uses the already existing db as synchronisation point
- no additional knowledge required
- simple to understand and therefore not hard to debug
### Negative Consequences
- all components must have db access. This is also a security
concern: if one of those processes is compromised, db access is
possible. And it simply is another dependency that is not really
required for the joex component
- the joex component cannot be in an untrusted environment (untrusted
from the db's point of view). For example, it is not possible to
create "personal joex" that only receive your own jobs…
- in order to know if a component is really active, one must run a
ping against the call-back url

@ -0,0 +1,95 @@
---
layout: docs
title: Encryption
---
# Encryption
## Context and Problem Statement
Since docspell may store important documents, it should be possible to
encrypt them on the server. It should be (almost) transparent to the
user; for example, a user must be able to log in and download a file in
clear form. That is, the server must also decrypt them.
All users of a collective should have access to the files. This
requires sharing the key among the users of a collective.
But even when files are encrypted, the associated metadata is not!
In particular, access to the database would allow one to see tags,
associated persons and correspondents of documents.
So in short, encryption means:
- file contents (the blobs and extracted text) are encrypted
- metadata is not
- secret keys are stored at the server (protected by a passphrase),
such that files can be downloaded in clear form
## Decision Drivers
* the major driver is to provide as much privacy as possible for users
* even at the expense of fewer features; currently I think that the
associated metadata is enough for finding documents (i.e. full text
search is not needed)
## Considered Options
It is clear that only blobs (file contents) can be encrypted, but not
the associated metadata. And obviously, the extracted text must be
encrypted, too.
### Public Key Encryption (PKE)
With PKE, the server can automatically encrypt files using publicly
available key data. It wouldn't require a user to provide a
passphrase for encryption, only for decryption.
This would allow processing files first (extracting text, doing text
analysis) and encrypting them (and the text) afterwards.
The public and secret keys are stored in the database. The secret key
must be protected. This can be done by encrypting the passphrase to
the secret key using each user's login password. If a user logs in, he
or she must provide the correct password. Using this password, the
private key can be unlocked. This requires storing the private key
passphrase, encrypted with every user's password, in the database. So
the whole security then depends on the users' password quality.
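For illustration, here is a sketch of such a key-protection step using
standard JCA primitives. The concrete parameters (PBKDF2, AES-GCM,
iteration count) are assumptions, not a decided design:

```scala
import java.security.SecureRandom
import javax.crypto.{Cipher, SecretKeyFactory}
import javax.crypto.spec.{GCMParameterSpec, PBEKeySpec, SecretKeySpec}

object KeyWrap {
  // Derive an AES key from the user's login password via PBKDF2.
  private def deriveKey(password: Array[Char], salt: Array[Byte]): SecretKeySpec = {
    val spec = new PBEKeySpec(password, salt, 100000, 256)
    val kf   = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
    new SecretKeySpec(kf.generateSecret(spec).getEncoded, "AES")
  }

  // Encrypt the secret-key passphrase with the derived key; the salt,
  // iv and ciphertext would be stored per user in the database.
  def wrap(passphrase: Array[Byte], password: Array[Char]): (Array[Byte], Array[Byte], Array[Byte]) = {
    val rnd  = new SecureRandom()
    val salt = new Array[Byte](16); rnd.nextBytes(salt)
    val iv   = new Array[Byte](12); rnd.nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, deriveKey(password, salt), new GCMParameterSpec(128, iv))
    (salt, iv, cipher.doFinal(passphrase))
  }
}
```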
There are plenty of other difficulties with this approach (password
changes, new secret keys, adding users, etc).
Using this kind of encryption would protect the data against offline
attacks and also against accidental leakage (for example, if a bug in
the software were to expose a file of another user).
### No Encryption
If only blobs are encrypted, against which type of attack would it
provide protection?
The users must still trust the server. First, in order to provide the
wanted features (document processing), the server must see the file
contents. Second, it will receive and serve files in clear form, so it
has access to them anyway.
With that in mind, the "only" feature is to protect against "stolen
database" attacks. If the database is somehow leaked, the attackers
would only see the metadata, but not the real documents. It also
protects against leakage caused by a programming error.
But the downside is that it increases complexity *a lot*. And since
this is a personal tool for personal use, is it worth the effort?
## Decision Outcome
No encryption, because of its complexity.
For now, this tool is only meant for "self deployment" and personal
use. If this changes or there is enough time, this decision should be
reconsidered.

@ -0,0 +1,42 @@
---
layout: docs
title: ISO8601 vs Millis
---
# ISO8601 vs Millis as Date-Time transfer
## Context and Problem Statement
The question is whether the REST API should return an ISO8601
formatted string in UTC timezone, or the Unix time (number of
milliseconds since 1970-01-01 UTC).
There is quite some controversy about it.
- <https://stackoverflow.com/questions/47426786/epoch-or-iso8601-date-format>
- <https://nbsoftsolutions.com/blog/designing-a-rest-api-unix-time-vs-iso-8601>
In my opinion, the ISO8601 format (always UTC) is better, because it
is more readable. But the Elm folks are on the other side:
- <https://package.elm-lang.org/packages/elm/time/1.0.0#iso-8601>
- <https://package.elm-lang.org/packages/rtfeldman/elm-iso8601-date-strings/latest/>
One can convert from an ISO8601 date-time string in UTC into the
epoch millis and vice versa. So to me it is the same. There is no less
information in an ISO8601 string than in the epoch millis.
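For example, with `java.time` on the server side the conversion is
lossless in both directions:

```scala
import java.time.Instant

object TimeCodec {
  // ISO8601 in UTC and epoch millis carry exactly the same information.
  def toIso(millis: Long): String = Instant.ofEpochMilli(millis).toString
  def toMillis(iso: String): Long = Instant.parse(iso).toEpochMilli
  // toIso(0L)                        == "1970-01-01T00:00:00Z"
  // toMillis("1970-01-01T00:00:00Z") == 0L
}
```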
To avoid confusion, all date/time values should use the same encoding.
## Decision Outcome
I go with the epoch time. Every timestamp/date-time value is
transferred as a Unix timestamp in milliseconds.
Reasons:
- the Elm application needs to frequently calculate with these values
to render the current waiting time etc. This is easier with plain
numbers, as no date parsing is required first
- Since the UI is written with Elm, it's probably good to adopt their
style

@ -0,0 +1,136 @@
---
layout: docs
title: Joex - Job Executor
---
# Job Executor
## Context and Problem Statement
Docspell is a multi-user application. When processing users'
documents, there must be some thought on how to distribute all the
processing jobs on a much more restricted set of resources. There may
be 100 users but only 4 cores that can process documents at a
time. A simple FIFO is not enough, since it results in an unfair
distribution. The first user to submit 20 documents would then occupy
all cores for quite some time while all other users wait.
This tries to find a more fair distribution among the users (strictly
meaning collectives here) of docspell.
The job executor is a separate component that will run in its own
process. It takes the next job from the "queue" and executes the
associated task. This is used to run the document processing jobs
(text extraction, text analysis etc).
1. The task execution should survive restarts. State and task code
must be recreated from some persisted state.
2. The processing should be fair with respect to collectives.
3. It must be possible to run many job executors, possibly on
different machines. This can be used to quickly enable more
processing power and removing it once the peak is over.
4. Task execution can fail and it should be possible to retry those
tasks. One reason is that errors may be temporary (for example when
talking to a third party service); another is to enable repairing
without stopping the job executor. Some errors might be easily
repaired (a program was not installed or whatever). In such a case it
is good to know that the task will be retried later.
## Considered Options
In contrast to other ADRs, this is more a sketch of the thoughts
behind the current implementation.
1. Job descriptions are serialized and written to the database into a
table. This becomes the queue. Tasks are identified by names and a
job executor implementation must have a map of names to code to
look up the task to perform. The task's arguments are serialized into
a string and written to the database. Tasks must decode the
string. This can be conveniently done using JSON and the provided
circe decoders (see the sketch after this list).
2. To provide fair execution, jobs are organized into groups. When a
new job is requested from the queue, first a group is selected
using a round-robin strategy. This should ensure good enough
fairness among groups. A group maps to a collective. Within a
group, a job is selected based on priority, submitted time (fifo)
and job state (see notes about stuck jobs).
3. Allowing multiple job executors means that getting the next job can
fail due to simultaneously running transactions. It is retried until
it succeeds. Taking a job puts it into _scheduled_ state. Each job
executor has a unique (manually supplied) id and a job is marked
with that id once it is handed to the executor.
4. When a task fails, its state is updated to state _stuck_. Stuck
jobs are retried in the future. The queue prefers to return stuck
jobs that are due at the specific point in time ignoring the
priority hint.
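As a sketch of point 1, task arguments can be modeled as an ordinary
case class with circe codecs. The task name and the fields here are
hypothetical, chosen only for illustration:

```scala
import io.circe.{Decoder, Encoder}
import io.circe.generic.semiauto.{deriveDecoder, deriveEncoder}
import io.circe.syntax._

// Hypothetical arguments for a "process-item" task.
final case class ProcessItemArgs(itemId: String, collective: String)
object ProcessItemArgs {
  val taskName = "process-item"
  implicit val encoder: Encoder[ProcessItemArgs] = deriveEncoder
  implicit val decoder: Decoder[ProcessItemArgs] = deriveDecoder
}

object QueueDemo {
  // Stored in the queue table as a plain string column:
  val argsJson: String =
    ProcessItemArgs("item-123", "family").asJson.noSpaces

  // The executor resolves the task by name and decodes the arguments:
  val args: Either[io.circe.Error, ProcessItemArgs] =
    io.circe.parser.decode[ProcessItemArgs](argsJson)
}
```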
### More Details
A job has these properties:
- id (something random)
- group
- taskname (to choose task to run)
- submitted-date
- worker (the id of the job executor)
- state, one of: waiting, scheduled, running, stuck, cancelled,
failed, success
- waiting: job has been inserted into the queue
- scheduled: job has been handed over to some executor and is
marked with the job executor id
- running: a task is currently executing
- stuck: a task has failed and is being retried eventually
- cancelled: task has finished and there was a cancel request
- failed: task has failed, retries exceeded
- success: task has completed successfully
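Modeled as code, the states could form a simple ADT (a sketch, not
the actual implementation):

```scala
sealed trait JobState
object JobState {
  case object Waiting   extends JobState // inserted into the queue
  case object Scheduled extends JobState // handed to an executor, marked with its id
  case object Running   extends JobState // task is currently executing
  case object Stuck     extends JobState // failed, will be retried eventually
  case object Cancelled extends JobState // finished after a cancel request
  case object Failed    extends JobState // failed, retries exceeded
  case object Success   extends JobState // completed successfully
}
```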
The queue has a `take` or `nextJob` operation that takes the worker-id
and a priority hint and goes roughly like this:
- select the next group using round-robin strategy
- select all jobs with that group, where
- state is stuck and waiting time has elapsed
- state is waiting, with the given priority if possible
- jobs are ordered by submitted time, but stuck jobs whose waiting
time elapsed are preferred
There are two priorities within a group: high and low. A configured
counting scheme determines when to select which priority. For example,
a counting scheme of `(2,1)` would select two high-priority jobs and
then one low-priority job. The `take` operation tries to prefer this
priority but falls back to the other if no job with this priority is
available.
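A simplified, in-memory sketch of this selection logic, reusing the
`JobState` sketch above. All names are illustrative; the real
implementation works against the database:

```scala
import java.time.Instant

object SchedulingSketch {
  sealed trait Priority
  case object High extends Priority
  case object Low  extends Priority

  final case class QJob(
    id: String,
    group: String,            // the collective
    priority: Priority,
    submitted: Instant,
    retryAt: Option[Instant], // only set for stuck jobs
    state: JobState
  )

  // Counting scheme (2,1): two high-priority picks, then one low-priority pick.
  final class CountingScheme(high: Int, low: Int) {
    private var n = 0
    def nextPriority(): Priority = {
      val p = if (n < high) High else Low
      n = (n + 1) % (high + low)
      p
    }
  }

  // Within the group chosen by round-robin: stuck jobs whose waiting
  // time has elapsed come first, then waiting jobs in FIFO order; the
  // hinted priority is preferred, with a fallback to any other.
  def selectInGroup(jobs: List[QJob], prioHint: Priority, now: Instant): Option[QJob] = {
    val dueStuck = jobs
      .filter(j => j.state == JobState.Stuck && j.retryAt.exists(t => !t.isAfter(now)))
      .sortBy(_.submitted.toEpochMilli)
    val waiting = jobs.filter(_.state == JobState.Waiting).sortBy(_.submitted.toEpochMilli)
    val ordered = dueStuck ++ waiting
    ordered.find(_.priority == prioHint).orElse(ordered.headOption)
  }
}
```

An executor would ask the `CountingScheme` for the next priority hint,
pick a group round-robin, and then run `selectInGroup` over that
group's jobs.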
A group corresponds to a collective, so all collectives get (roughly)
equal treatment.
Once there are no jobs in the queue, the executor goes to sleep and
must be woken to run again. If a job is submitted, the executors are
notified.
### Stuck Jobs
A job goes into _stuck_ state if its task has failed. In this state,
the task is rerun after a while, until a maximum retry count is
reached.
The problem is how to notify all executors when the waiting time has
elapsed. If one executor puts a job into stuck state, it means that
all others should start looking into the queue again after `x`
minutes. It would be possible to tell all existing executors to
schedule themselves to wake up in the future, but this would miss all
executors that show up later.
The waiting time is increased exponentially after each retry (`2 ^
retry`) and is meant as the minimum waiting time. So it is ok if
all executors wake up periodically and check for new work. Most of the
time this should not be necessary and is just a fallback if only stuck
jobs are in the queue and nothing is submitted for a long time. If the
system is used, jobs get submitted once in a while and would wake all
executors.
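As a sketch, the minimum waiting time is then:

```scala
import scala.concurrent.duration._

object RetrySketch {
  // Minimum wait before retry `n`: 2 ^ n minutes (2, 4, 8, … minutes).
  def retryWait(n: Int): FiniteDuration = math.pow(2, n.toDouble).toInt.minutes
}
```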

@ -0,0 +1,72 @@
# [short title of solved problem and solution]
* Status: [proposed | rejected | accepted | deprecated | … | superseded by [ADR-0005](0005-example.md)] <!-- optional -->
* Deciders: [list everyone involved in the decision] <!-- optional -->
* Date: [YYYY-MM-DD when the decision was last updated] <!-- optional -->
Technical Story: [description | ticket/issue URL] <!-- optional -->
## Context and Problem Statement
[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]
## Decision Drivers <!-- optional -->
* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
* … <!-- numbers of drivers can vary -->
## Considered Options
* [option 1]
* [option 2]
* [option 3]
* … <!-- numbers of options can vary -->
## Decision Outcome
Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)].
### Positive Consequences <!-- optional -->
* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
* …
### Negative Consequences <!-- optional -->
* [e.g., compromising quality attribute, follow-up decisions required, …]
* …
## Pros and Cons of the Options <!-- optional -->
### [option 1]
[example | description | pointer to more information | …] <!-- optional -->
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
### [option 2]
[example | description | pointer to more information | …] <!-- optional -->
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
### [option 3]
[example | description | pointer to more information | …] <!-- optional -->
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->
## Links <!-- optional -->
* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->