Initial website

This commit is contained in:
Eike Kettner
2020-07-27 22:13:22 +02:00
parent dbd0f3ff97
commit f8c6f79b10
160 changed files with 8854 additions and 64 deletions

View File

@ -0,0 +1,36 @@
+++
title = "Use Markdown Architectural Decision Records"
weight = 10
+++
# Context and Problem Statement
We want to [record architectural decisions](https://adr.github.io/)
made in this project. Which format and structure should these records
follow?
# Considered Options
* [MADR](https://adr.github.io/madr/) 2.1.0 - The Markdown Architectural Decision Records
* [Michael Nygard's template](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions) - The first incarnation of the term "ADR"
* [Sustainable Architectural
Decisions](https://www.infoq.com/articles/sustainable-architectural-design-decisions) -
The Y-Statements
* Other templates listed at
<https://github.com/joelparkerhenderson/architecture_decision_record>
* Formless - No conventions for file format and structure
# Decision Outcome
Chosen option: "MADR 2.1.0", because
* Implicit assumptions should be made explicit. Design documentation
is important to enable people to understand the decisions later on.
See also [A rational design process: How and why to fake
it](https://doi.org/10.1109/TSE.1986.6312940).
* The MADR format is lean and fits our development style.
* The MADR structure is comprehensible and facilitates usage &
maintenance.
* The MADR project is vivid.
* Version 2.1.0 is the latest one available when starting to document
ADRs.

View File

@ -0,0 +1,64 @@
+++
title = "Components"
weight = 20
+++
# Context and Problem Statement
How should the application be structured into its main components? The
goal is to be able to have multiple rest servers/webapps and multiple
document processor components working together.
# Decision Outcome
The following are the "main" modules. There may be more helper modules
and libraries that support implementing a feature.
## store
The code related to database access. It also provides the job
queue. It is designed as a library.
## joex
Joex stands for "job executor".
An application that executes jobs from the queue and therefore depends
on the `store` module. It provides the code for all tasks that can be
submitted as jobs. If no jobs are in the queue, the joex "sleeps"
and must be woken via an external request.
It provides the document processing code.
It provides an http rest server to get insight into the joex state
and also to be notified of new jobs.
## backend
It provides all the logic, except document processing, as a set of
"operations". An operation can be directly mapped to a rest
endpoint.
It is designed as a library.
## rest api
This module contains the specification for the rest server as an
`openapi.yml` file. It is packaged as a scala library that also
provides types and conversions to/from json.
The idea is that the `rest server` module can depend on it as well as
rest clients.
## rest server
This is the main application. It directly depends on the `backend`
module, and each rest endpoint maps to a "backend operation". It is
also responsible for converting the json data inside http requests
to/from types recognized by the `backend` module.
## webapp
This module provides the user interface as a web application.

View File

@ -0,0 +1,63 @@
+++
title = "Component Interaction"
weight = 30
+++
# Context and Problem Statement
There are multiple web applications with their rest servers and there
are multiple document processors. These processes must communicate:
- once a new job is added to the queue, the rest server must somehow
notify processors to wake up
- once a processor takes a job, it must propagate the progress and
outcome to all rest servers so that the rest server can notify the
user who is currently logged in. Since it's not known which
rest-server the user is using right now, all must be notified.
# Considered Options
1. JMS (ActiveMQ or similar): Message Broker as another active
component
2. Akka: using a cluster
3. DB: Register with "call back urls"
# Decision Outcome
Choosing option 3: DB as central synchronisation point.
The reason is that this is the simplest solution and doesn't require
external libraries or more processes. The other options seem too big
of a weapon for the task at hand. They are both large components
themselves and require more knowledge to use them efficiently.
It works roughly like this (a sketch of a possible registration table
follows below):
- rest servers and processors register at the database on startup each
with a unique call-back url
- and deregister on shutdown
- each component has db access
- rest servers can list all processors and vice versa
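A minimal sketch of what such a registration table could look like
(table and column names here are made up for illustration, not the
actual schema):
```sql
CREATE TABLE "node" (
  "id" varchar(254) not null primary key,
  "type" varchar(254) not null,   -- 'restserver' or 'joex'
  "url" varchar(254) not null,    -- the call-back url
  "updated" timestamp not null    -- refreshed on startup
);
```
Rest servers would select all rows of type `joex` to notify
processors, and vice versa.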
## Positive Consequences
- complexity of the whole application is not touched
- since a lot of data must be transferred to the document processors,
this is solved by simply accessing the db. So the protocol for data
exchange is set. There is no need for other protocols that handle
large data (http chunking etc)
- uses the already existing db as synchronisation point
- no additional knowledge required
- simple to understand and so not hard to debug
## Negative Consequences
- all components must have db access. This is also a security concern,
because if one of those processes is compromised, db access is
possible. And it is simply another dependency that is not really
required for the joex component
- the joex component cannot be in an untrusted environment (untrusted
from the db's point of view). For example, it is not possible to
create "personal joex" that only receive your own jobs…
- in order to know if a component is really active, one must run a
ping against the call-back url

View File

@ -0,0 +1,93 @@
+++
title = "Encryption"
weight = 40
+++
# Context and Problem Statement
Since docspell may store important documents, it should be possible to
encrypt them on the server. It should be (almost) transparent to the
user, for example, a user must be able to log in and download a file in
clear form. That is, the server must also decrypt them.
Then all users of a collective should have access to the files. This
requires sharing the key among users of a collective.
But, even when files are encrypted, the associated meta data is not!
So especially access to the database would allow seeing tags,
associated persons and correspondents of documents.
So in short, encryption means:
- file contents (the blobs and extracted text) are encrypted
- metadata is not
- secret keys are stored at the server (protected by a passphrase),
such that files can be downloaded in clear form
# Decision Drivers
* the major driver is to provide as much privacy as possible for users
* even at the expense of fewer features; currently I think that the
associated meta data is enough for finding documents (i.e. full text
search is not needed)
# Considered Options
It is clear that only blobs (file contents) can be encrypted, but not
the associated metadata. And the extracted text must be encrypted,
too, obviously.
## Public Key Encryption (PKE)
With PKE, the server can automatically encrypt files using
publicly available key data. It wouldn't require a user to provide a
passphrase for encryption, only for decryption.
This would allow first processing files (extracting text, doing
text analysis) and encrypting them (and the text) afterwards.
The public and secret keys are stored in the database. The secret key
must be protected. This can be done by encrypting the passphrase to
the secret key using each user's login password. If a user logs in, he
or she must provide the correct password. Using this password, the
private key can be unlocked. This requires storing the private key
passphrase, encrypted with every user's password, in the database. So
the whole security then depends on the users' password quality.
There are plenty of other difficulties with this approach (what about
password changes, new secret keys, adding users, etc.).
Using this kind of encryption would protect the data against offline
attacks and also against accidental leakage (for example, if a bug in
the software would grant access to a file of another user).
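To make the idea more concrete, here is a rough sketch of wrapping the
private-key passphrase with a key derived from the user's login
password, using only JDK crypto. This is not implemented in docspell;
all names and parameters are illustrative.
``` scala
import java.security.SecureRandom
import javax.crypto.{Cipher, SecretKeyFactory}
import javax.crypto.spec.{GCMParameterSpec, PBEKeySpec, SecretKeySpec}

// Encrypts the private-key passphrase with a key derived from the user's
// login password. Ciphertext, salt and iv would be stored per user.
def wrapPassphrase(passphrase: Array[Byte], loginPassword: Array[Char]): (Array[Byte], Array[Byte], Array[Byte]) = {
  val rnd  = new SecureRandom()
  val salt = new Array[Byte](16); rnd.nextBytes(salt)
  val iv   = new Array[Byte](12); rnd.nextBytes(iv)
  // derive an AES key from the login password
  val spec = new PBEKeySpec(loginPassword, salt, 100000, 256)
  val key  = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256").generateSecret(spec).getEncoded
  val cipher = Cipher.getInstance("AES/GCM/NoPadding")
  cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv))
  (cipher.doFinal(passphrase), salt, iv)
}
```
On a password change, only this wrapped passphrase would need to be
re-encrypted, not the files themselves.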
## No Encryption
If only blobs are encrypted, against which type of attack would it
provide protection?
The users must still trust the server. First, in order to provide the
wanted features (document processing), the server must see the file
contents. Then, it will receive and serve files in clear form, so it
has access to them anyways.
With that in mind, the "only" feature is to protect against "stolen
database" attacks. If the database is somehow leaked, the attackers
would only see the metadata, but not real documents. It also protects
against leakage, maybe caused by a programming error.
But the downside is that it increases complexity *a lot*. And since
this is a personal tool for personal use, is it worth the effort?
# Decision Outcome
No encryption, because of its complexity.
For now, this tool is only meant for "self deployment" and personal
use. If this changes or there is enough time, this decision should be
reconsidered.

View File

@ -0,0 +1,40 @@
+++
title = "ISO8601 vs Millis as Date-Time transfer"
weight = 50
+++
# Context and Problem Statement
The question is whether the REST Api should return an ISO8601
formatted string in UTC timezone, or the unix time (number of
milliseconds since 1970-01-01).
There is quite some controversy about it.
- <https://stackoverflow.com/questions/47426786/epoch-or-iso8601-date-format>
- <https://nbsoftsolutions.com/blog/designing-a-rest-api-unix-time-vs-iso-8601>
In my opinion, the ISO8601 format (always UTC) is better. The reason
is its better readability. But the Elm folks are on the other side:
- <https://package.elm-lang.org/packages/elm/time/1.0.0#iso-8601>
- <https://package.elm-lang.org/packages/rtfeldman/elm-iso8601-date-strings/latest/>
One can convert from an ISO8601 date-time string in UTC time into the
epoch millis and vice versa. So it is the same to me. There is no less
information in an ISO8601 string than in the epoch millis.
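For illustration, the conversion is a one-liner in both directions
with `java.time` (a quick sketch, not docspell code):
``` scala
import java.time.Instant

val millis = 1595880802000L
Instant.ofEpochMilli(millis).toString              // "2020-07-27T20:13:22Z" (ISO8601, UTC)
Instant.parse("2020-07-27T20:13:22Z").toEpochMilli // 1595880802000
```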
To avoid confusion, all date/time values should use the same encoding.
# Decision Outcome
I go with the epoch time. Every timestamp/date-time value is
transferred as a Unix timestamp.
Reasons:
- the Elm application needs to frequently calculate with these values
to render the current waiting time etc. This is easier with plain
numbers, without having to parse dates first
- Since the UI is written with Elm, it's probably good to adopt their
style

View File

@ -0,0 +1,134 @@
+++
title = "Joex - Job Executor"
weight = 60
+++
# Context and Problem Statement
Docspell is a multi-user application. When processing users'
documents, there must be some thought on how to distribute all the
processing jobs on a much more restricted set of resources. There
may be 100 users but only 4 cores that can process documents at a
time. Simple FIFO is not enough since it leads to an unfair
distribution. The first user who submits 20 documents will then occupy
all cores for quite some time and all other users would need to wait.
This tries to find a fairer distribution among the users (strictly
speaking, collectives here) of docspell.
The job executor is a separate component that will run in its own
process. It takes the next job from the "queue" and executes the
associated task. This is used to run the document processing jobs
(text extraction, text analysis etc).
1. The task execution should survive restarts. State and task code
must be recreated from some persisted state.
2. The processing should be fair with respect to collectives.
3. It must be possible to run many job executors, possibly on
different machines. This can be used to quickly enable more
processing power and remove it once the peak is over.
4. Task execution can fail and it should be possible to retry those
tasks. Reasons are that errors may be temporary (for example when
talking to a third-party service), and to enable repairing without
stopping the job executor. Some errors might be easily repaired (a
program was not installed or whatever). In such a case it is good
to know that the task will be retried later.
# Considered Options
In contrast to other ADRs this is just some sketching of thoughts for
the current implementation.
1. Job descriptions are serialized and written to the database into a
table. This becomes the queue. Tasks are identified by names and a
job executor implementation must have a map of names to code to
look up the task to perform. The task's arguments are serialized into
a string and written to the database. Tasks must decode the
string. This can be conveniently done using JSON and the provided
circe decoders.
2. To provide fair execution, jobs are organized into groups. When a
new job is requested from the queue, first a group is selected
using a round-robin strategy. This should ensure good enough
fairness among groups. A group maps to a collective. Within a
group, a job is selected based on priority, submitted time (fifo)
and job state (see notes about stuck jobs).
3. Allowing multiple job executors means that getting the next job can
fail due to simultaneously running transactions. It is retried until
it succeeds. Taking a job puts it into _scheduled_ state. Each job
executor has a unique (manually supplied) id and jobs are marked
with that id once they are handed to the executor.
4. When a task fails, its state is updated to state _stuck_. Stuck
jobs are retried in the future. The queue prefers to return stuck
jobs that are due at the current point in time, ignoring the
priority hint.
## More Details
A job has these properties (a sketch of the states in code follows
this list):
- id (something random)
- group
- taskname (to choose task to run)
- submitted-date
- worker (the id of the job executor)
- state, one of: waiting, scheduled, running, stuck, cancelled,
failed, success
- waiting: job has been inserted into the queue
- scheduled: job has been handed over to some executor and is
marked with the job executor id
- running: a task is currently executing
- stuck: a task has failed and is being retried eventually
- cancelled: task has finished and there was a cancel request
- failed: task has failed, exceeded the retries
- success: task has completed successfully
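In code, the state could be a simple enumeration. This is an
illustrative sketch, not the actual implementation:
``` scala
sealed trait JobState
object JobState {
  case object Waiting   extends JobState
  case object Scheduled extends JobState
  case object Running   extends JobState
  case object Stuck     extends JobState
  case object Cancelled extends JobState
  case object Failed    extends JobState
  case object Success   extends JobState

  // states that mean "no more work will happen for this job"
  val done: Set[JobState] = Set(Cancelled, Failed, Success)
}
```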
The queue has a `take` or `nextJob` operation that takes the worker-id
and a priority hint and goes roughly like this (see the sketch below):
- select the next group using round-robin strategy
- select all jobs with that group, where
- state is stuck and waiting time has elapsed
- state is waiting and has the given priority if possible
- jobs are ordered by submitted time, but stuck jobs whose waiting
time elapsed are preferred
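A sketch of the per-group selection in SQL (column names are
illustrative, not the actual schema):
```sql
SELECT "id" FROM "job"
WHERE "group_" = :group
  AND (   ("state" = 'stuck'   AND "nextretry" <= :now)
       OR ("state" = 'waiting' AND "priority" = :prio))
ORDER BY CASE WHEN "state" = 'stuck' AND "nextretry" <= :now THEN 0 ELSE 1 END,
         "submitted"
LIMIT 1;
```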
There are two priorities within a group: high and low. A configured
counting scheme determines which priority to select next. For
example, a counting scheme of `(2,1)` would select two high priority
jobs and then one low priority job. The `take` operation tries to prefer
this priority but falls back to the other if no job with this priority
is available.
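A small sketch of such a counting scheme (all names are made up):
``` scala
final case class CountingScheme(high: Int, low: Int) {
  // an endless sequence of priorities to try in order: high, high, low, ...
  def priorities: LazyList[String] =
    LazyList.continually(LazyList.fill(high)("high") ++ LazyList.fill(low)("low")).flatten
}

// CountingScheme(2, 1).priorities.take(6).toList
// => List("high", "high", "low", "high", "high", "low")
```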
A group corresponds to a collective. Then all collectives get
(roughly) equal treatment.
Once there are no jobs in the queue, the executor goes to sleep and
must be woken to run again. If a job is submitted, the executors are
notified.
## Stuck Jobs
A job goes into _stuck_ state if the task has failed. In this
state, the task is rerun after a while until a maximum retry count is
reached.
The problem is how to notify all executors when the waiting time has
elapsed. If one executor puts a job into stuck state, it means that
all others should start looking into the queue again after `x`
minutes. It would be possible to tell all existing executors to
schedule themselves to wake up in the future, but this would miss all
executors that show up later.
The waiting time is increased exponentially after each retry (`2 ^
retry`) and it is meant as the minimum waiting time. So it is ok if
all executors wake up periodically and check for new work. Most of the
time this should not be necessary and is just a fallback if only stuck
jobs are in the queue and nothing is submitted for a long time. If the
system is used, jobs get submitted once in a while and would wake all
executors.
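As a sketch, the retry delay from above could be computed like this
(assuming minutes as the base unit):
``` scala
import scala.concurrent.duration._

// retry 1 -> 2 minutes, retry 2 -> 4 minutes, retry 3 -> 8 minutes, ...
def retryDelay(retry: Int): FiniteDuration =
  math.pow(2, retry.toDouble).toLong.minutes
```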

View File

@ -0,0 +1,150 @@
+++
title = "More File Types"
weight = 70
+++
# Context and Problem Statement
Docspell currently only supports PDF files. This has simplified early
development and design a lot and so helped with starting the project.
Handling pdf files is usually easy (to view, to extract text, print
etc).
The pdf format has been chosen because PDF files are very common and
can be viewed with many tools on many systems (i.e. non-proprietary
tools). Docspell also is a document archive and from this perspective,
it is important that documents can be viewed in 10 years and more. The
hope is, that the PDF format is best suited for this. Therefore all
documents in Docspell must be accessible as PDF. The trivial solution
to this requirement is to only allow PDF files.
Support for more document types must then take care of the following:
- extracting text
- converting into pdf
- access original file
Text should be extracted from the source file, in case conversion is
not lossless. Since Docspell can already extract text from PDF files
using OCR, text can also be extracted from the converted file as a
fallback.
The original file must always be accessible. The main reason is that
all uploaded data should be accessible without any modification. And
since the conversion may not always create best results, the original
file should be kept.
# Decision Drivers
People expect that software like Docspell supports the most common
document types, like all the “office documents” (`docx`, `rtf`, `odt`,
`xlsx`, …) and images. For many people it is more common to create
those files instead of PDF. Some (older) scanners may not be able to
scan into PDF files but only to image files.
# Considered Options
This ADR does not evaluate different options. It rather documents why
this feature is realized and the thoughts that lead to how it is
implemented.
# Realization
## Data Model
The `attachment` table holds one file. There will be another table
`attachment_source` that holds the original file. It looks like this:
``` sql
CREATE TABLE "attachment_source" (
"id" varchar(254) not null primary key,
"file_id" varchar(254) not null,
"filename" varchar(254),
"created" timestamp not null,
foreign key ("file_id") references "filemeta"("id"),
foreign key ("id") references "attachment"("attachid")
);
```
The `id` is the primary key and is the same as the associated
`attachment`, creating a `1-1` relationship (well, more correctly
`0..1-1`) between `attachment` and `attachment_source`.
There will always be an `attachment_source` record for every
`attachment` record. If the original file is a PDF already, then both
tables' `file_id` columns point to the same file. But now the user can
change the filename of an `attachment` while the original filename is
preserved in `attachment_source`. It must not be possible for the user
to change anything in `attachment_source`.
The `attachment` table is not touched in order to keep current code
mostly unchanged and to have a simpler data migration. The downside
is that the data model allows an `attachment` record without
an `attachment_source` record. OTOH, a foreign key inside `attachment`
pointing to an `attachment_source` is also not correct, because it
allows the same `attachment_source` record to be associated with many
`attachment` records. This would do even more harm, in my opinion.
## Migration
Creating a new table and not altering existing ones should simplify
data migration.
Since only PDF files were allowed and the user could not change
anything in the `attachment` table, the existing data can simply be
inserted into the new table. This presents the trivial case where the
attachment and source are the same.
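The migration could then be a single insert-select statement, roughly
like below. The column names selected from `attachment` are
assumptions for illustration only.
```sql
INSERT INTO "attachment_source" ("id", "file_id", "filename", "created")
SELECT "attachid", "filemetaid", "name", "created" FROM "attachment";
```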
## Processing
The first step in processing is now converting the file into a pdf. If
it already is a pdf, nothing is done. This step is before text
extraction, so extraction can first be tried on the source file
and only if that fails (or is not supported), text can be extracted
from the converted pdf file. All remaining steps are untouched.
If conversion is not supported for the input file, it is skipped. If
conversion fails, the error is propagated to let the retry mechanism
take care.
### What types?
Which file types should be supported? As a first step, all major
office documents, common images, plain text (e.g. markdown) and html
should be supported. In terms of file extensions: `doc`, `docx`,
`xls`, `xlsx`, `odt`, `md`, `html`, `txt`, `jpg`, `png`, `tif`.
There is always the preference to use jvm internal libraries in order
to be more platform independent and to reduce external dependencies.
But this is not always possible (like doing OCR).
{{ figure(file="process-files.png") }}
### Conversion
- Office documents (`doc`, `docx`, `xls`, `xlsx`, `odt`, `ods`):
unoconv (see [ADR 9](@/docs/dev/adr/0009_convert_office_docs.md))
- HTML (`html`): wkhtmltopdf (see [ADR 7](@/docs/dev/adr/0007_convert_html_files.md))
- Text/Markdown (`txt`, `md`): Java-Lib flexmark + wkhtmltopdf
- Images (`jpg`, `png`, `tif`): Tesseract (see [ADR
10](@/docs/dev/adr/0010_convert_image_files.md))
### Text Extraction
- Office documents (`doc`, `docx`, `xls`, `xlsx`): Apache POI
- Office documents (`odt`, `ods`): Apache Tika (including the sources)
- HTML: not supported, extract text from converted PDF
- Images (`jpg`, `png`, `tif`): Tesseract
- Text/Markdown: n.a.
- PDF: Apache PDFBox or Tesseract
# Links
* [Convert HTML Files](@/docs/dev/adr/0007_convert_html_files.md)
* [Convert Plain Text](@/docs/dev/adr/0008_convert_plain_text.md)
* [Convert Office Documents](@/docs/dev/adr/0009_convert_office_docs.md)
* [Convert Image Files](@/docs/dev/adr/0010_convert_image_files.md)
* [Extract Text from Files](@/docs/dev/adr/0011_extract_text.md)

View File

@ -0,0 +1,59 @@
+++
title = "Convert HTML Files"
weight = 80
+++
# Context and Problem Statement
How can HTML documents be converted into a PDF file that looks as much
as possible like the original?
It would be nice to have a java-only solution. But if an external tool
has a better outcome, then an external tool is fine, too.
Since Docspell is free software, the tools must also be free.
# Considered Options
* [pandoc](https://pandoc.org/) external command
* [wkhtmltopdf](https://wkhtmltopdf.org/) external command
* [Unoconv](https://github.com/unoconv/unoconv) external command
Native (firefox) view:
{{ figure(file="example-html-native.jpg") }}
Note: the example html is from
[here](https://www.sparksuite.com/open-source/invoice.html).
I downloaded the HTML file to disk together with its resources (using
*Save as...* in the browser).
## Pandoc
{{ figure(file="example-html-pandoc-latex.jpg") }}
{{ figure(file="example-html-pandoc-html.jpg") }}
Not showing the version using the `context` pdf-engine, since it looked
very similar to the latex variant.
## wkhtmltopdf
{{ figure(file="example-html-wkhtmltopdf.jpg") }}
## Unoconv
{{ figure(file="example-html-unoconv.jpg") }}
# Decision Outcome
wkhtmltopdf.
It shows the best results.
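For reference, a typical invocation is simply (file names are
placeholders):
```
wkhtmltopdf page.html page.pdf
```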

View File

@ -0,0 +1,177 @@
+++
title = "Convert Text Files"
weight = 90
+++
# Context and Problem Statement
How can plain text and markdown documents be converted into PDF
files?
Rendering images is not important here, since the files must be
self-contained when uploaded to Docspell.
The test file is the current documentation page of Docspell, found in
`microsite/docs/doc.md`.
```
---
layout: docs
position: 4
title: Documentation
---
# {page .title}
Docspell assists in organizing large amounts of PDF files that are
...
## How it works
Documents have two ...
1. You maintain a kind of address book. It should list all possible
correspondents and the concerning people/things. This grows
incrementally with each new unknown document.
2. When docspell analyzes a document, it tries to find matches within
your address ...
3. You can inspect ...
The set of meta data that docspell uses to draw suggestions from, must
be maintained ...
## Terms
In order to better understand these pages, some terms should be
explained first.
### Item
An **Item** is roughly your (pdf) document, only that an item may span
multiple files, which are called **attachments**. And an item has
**meta data** associated:
- a **correspondent**: the other side of the communication. It can be
an organization or a person.
- a **concerning person** or **equipment**: a person or thing that
this item is about. Maybe it is an insurance contract about your
car.
- ...
### Collective
The users of the application are part of a **collective**. A
**collective** is a group of users that share access to the same
items. The account name is therefore comprised of a *collective name*
and a *user name*.
All users of a collective are equal; they have same permissions to
access all...
```
Then a plain text file is tried, too (without any markup).
```
Maecenas mauris lectus, lobortis et purus mattis
Duis vehicula mi vel mi pretium
In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu.
Pellentesque scelerisque fermentum erat, id posuere justo pulvinar ut.
Cras id eros sed enim aliquam lobortis. Sed lobortis nisl ut eros
efficitur tincidunt. Cras justo mi, porttitor quis mattis vel,
ultricies ut purus. Ut facilisis et lacus eu cursus.
In eleifend velit vitae libero sollicitudin euismod:
- Fusce vitae vestibulum velit,
- Pellentesque vulputate lectus quis pellentesque commodo
the end.
```
# Considered Options
* [flexmark](https://github.com/vsch/flexmark-java) for markdown to
HTML, then use existing machinery described in [adr
7](@/docs/dev/adr/0007_convert_html_files.md)
* [pandoc](https://pandoc.org/) external command
## flexmark markdown library for java
Process files with [flexmark](https://github.com/vsch/flexmark-java)
and then create a PDF from the resulting html.
Using the following snippet:
``` scala
// imports assuming flexmark-java and cats-effect; package paths may vary by version
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import java.util

import cats.effect.ExitCode
import com.vladsch.flexmark.ext.gfm.strikethrough.StrikethroughExtension
import com.vladsch.flexmark.ext.tables.TablesExtension
import com.vladsch.flexmark.html.HtmlRenderer
import com.vladsch.flexmark.parser.Parser
import com.vladsch.flexmark.util.data.{DataKey, MutableDataSet}

def renderMarkdown(): ExitCode = {
  val opts = new MutableDataSet()
  // enable the tables and strikethrough extensions
  opts.set(Parser.EXTENSIONS.asInstanceOf[DataKey[util.Collection[_]]],
    util.Arrays.asList(TablesExtension.create(),
      StrikethroughExtension.create()));
  val parser = Parser.builder(opts).build()
  val renderer = HtmlRenderer.builder(opts).build()
  val reader = Files.newBufferedReader(Paths.get("in.txt|md")) // "in.txt" or "in.md"
  val doc = parser.parseReader(reader)
  val html = renderer.render(doc)
  // wrap the rendered fragment into a minimal html page
  val body = "<html><head></head><body style=\"padding: 0 5em;\">" + html + "</body></html>"
  Files.write(
    Paths.get("test.html"),
    body.getBytes(StandardCharsets.UTF_8))
  ExitCode.Success
}
```
Then run the result through `wkhtmltopdf`.
Markdown file:
{{ figure(file="example-md-java.jpg") }}
TXT file:
{{ figure(file="example-txt-java.jpg") }}
## pandoc
Command:
```
pandoc -f markdown -t html -o test.pdf microsite/docs/doc.md
```
Markdown/Latex:
{{ figure(file="example-md-pandoc-latex.jpg") }}
Markdown/Html:
{{ figure(file="example-md-pandoc-html.jpg") }}
Text/Latex:
{{ figure(file="example-txt-pandoc-latex.jpg") }}
Text/Html:
{{ figure(file="example-txt-pandoc-html.jpg") }}
# Decision Outcome
Java library "flexmark".
I think all results are great. It depends on the type of document and
what one expects to see. I guess that most people expect something
like what pandoc-html produces for the kind of files docspell is for (it is
not for newspaper articles, where pandoc-latex would be the best fit).
But choosing pandoc means yet another external command to depend on.
And the results from flexmark are really good, too. One can fiddle
with options and css to make it look better.
To not introduce another external command, the decision is to use flexmark
and then the already existing html->pdf conversion.

View File

@ -0,0 +1,205 @@
+++
title = "Convert Office Documents"
weight = 100
+++
# Context and Problem Statement
How can office documents, like `docx` or `odt`, be converted into a PDF
file that looks as much as possible like the original?
It would be nice to have a java-only solution. But if an external tool
has a better outcome, then an external tool is fine, too.
Since Docspell is free software, the tools must also be free.
# Considered Options
* [Apache POI](https://poi.apache.org) together with
[this](https://search.maven.org/artifact/fr.opensagres.xdocreport/org.apache.poi.xwpf.converter.pdf/1.0.6/jar)
library
* [pandoc](https://pandoc.org/) external command
* [abiword](https://www.abisource.com/) external command
* [Unoconv](https://github.com/unoconv/unoconv) external command
To choose an option, some documents are converted to pdf and compared.
Only the formats `docx` and `odt` are considered here. These are the
most used formats. They have to look good; if an `xlsx` or `pptx`
doesn't look so great, that is ok.
Here is the native view to compare with:
ODT:
{{ figure(file="example-odt-native.jpg") }}
## `XWPFConverter`
I couldn't get any example to work. There were exceptions:
```
java.lang.IllegalArgumentException: Value for parameter 'id' was out of bounds
at org.apache.poi.util.IdentifierManager.reserve(IdentifierManager.java:80)
at org.apache.poi.xwpf.usermodel.XWPFRun.<init>(XWPFRun.java:101)
at org.apache.poi.xwpf.usermodel.XWPFRun.<init>(XWPFRun.java:146)
at org.apache.poi.xwpf.usermodel.XWPFParagraph.buildRunsInOrderFromXml(XWPFParagraph.java:135)
at org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(XWPFParagraph.java:88)
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:147)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
at docspell.convert.Testing$.withPoi(Testing.scala:17)
at docspell.convert.Testing$.$anonfun$run$1(Testing.scala:12)
at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:87)
at cats.effect.internals.IORunLoop$RestartCallback.signal(IORunLoop.scala:355)
at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:376)
at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:316)
at cats.effect.internals.IOShift$Tick.run(IOShift.scala:36)
at cats.effect.internals.PoolUtils$$anon$2$$anon$3.run(PoolUtils.scala:51)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
The project (not Apache POI, the other one) seems unmaintained. I could
not find any website and the artifact in maven central is from 2016.
## Pandoc
I know pandoc as a great tool for converting between markup
documents. So this tries it with office documents. It supports `docx`
and `odt` according to its `--list-input-formats`.
From the pandoc manual:
> By default, pandoc will use LaTeX to create the PDF, which requires
> that a LaTeX engine be installed (see --pdf-engine below).
> Alternatively, pandoc can use ConTeXt, roff ms, or HTML as an
> intermediate format. To do this, specify an output file with a .pdf
> extension, as before, but add the --pdf-engine option or -t context,
> -t html, or -t ms to the command line. The tool used to generate the
> PDF from the intermediate format may be specified using --pdf-engine.
Trying with latex engine:
```
pandoc -f odt -o test.pdf example.odt
```
Results ODT:
{{ figure(file="example-odt-pandoc-latex.jpg") }}
```
pandoc -f odt -o test.pdf example.docx
```
Results DOCX:
{{ figure(file="example-docx-pandoc-latex.jpg") }}
----
Trying with context engine:
```
pandoc -f odt -t context -o test.pdf example.odt
```
Results ODT:
{{ figure(file="example-odt-pandoc-context.jpg") }}
Results DOCX:
{{ figure(file="example-docx-pandoc-context.jpg") }}
----
Trying with ms engine:
```
pandoc -f odt -t ms -o test.pdf example.odt
```
Results ODT:
{{ figure(file="example-odt-pandoc-ms.jpg") }}
Results DOCX:
{{ figure(file="example-docx-pandoc-ms.jpg") }}
---
Trying with html engine (this requires `wkhtmltopdf` to be present):
```
$ pandoc --extract-media . -f odt -t html -o test.pdf example.odt
```
Results ODT:
{{ figure(file="example-odt-pandoc-html.jpg") }}
Results DOCX:
{{ figure(file="example-docx-pandoc-html.jpg") }}
## Abiword
Trying with:
```
abiword --to=pdf example.odt
```
Results:
{{ figure(file="example-odt-abiword.jpg") }}
Trying with a `docx` file failed. It worked with a `doc` file.
## Unoconv
Unoconv relies on libreoffice/openoffice, so installing it will result
in installing parts of libreoffice, which is a very large dependency.
Trying with:
```
unoconv -f pdf example.odt
```
Results ODT:
{{ figure(file="example-odt-unoconv.jpg") }}
Results DOCX:
{{ figure(file="example-docx-unoconv.jpg") }}
# Decision Outcome
Unoconv.
The results from `unoconv` are really good.
Abiword is also not that bad; it didn't convert the chart, but all
font markup is there. It would be great to not depend on something as
big as libreoffice, but the results are so much better.
Also pandoc deals very well with DOCX files (using the `context`
engine). The only thing that was not rendered was the embedded chart
(like abiword). But all images and font styling were present.
It will be a configurable external command anyways, so users can
exchange it at any time with a different one.

View File

@ -0,0 +1,190 @@
+++
title = "Convert Image Files"
weight = 110
+++
# Context and Problem Statement
How can image files be properly converted to pdf?
Since there are thousands of different image formats, there will never
be support for all. The most common containers should be supported,
though:
- jpeg (jfif, exif)
- png
- tiff (baseline, single page)
The focus is on document images, maybe from digital cameras or
scanners.
# Considered Options
* [pdfbox](https://pdfbox.apache.org/) library
* [imagemagick](https://www.imagemagick.org/) external command
* [img2pdf](https://github.com/josch/img2pdf) external command
* [tesseract](https://github.com/tesseract-ocr/tesseract) external command
There are no screenshots here, because it doesn't make sense since
they all look the same on the screen. Instead we look at the files'
properties.
**Input File**
The input files are:
```
$ identify input/*
input/jfif.jpg JPEG 2480x3514 2480x3514+0+0 8-bit sRGB 240229B 0.000u 0:00.000
input/letter-en.jpg JPEG 1695x2378 1695x2378+0+0 8-bit Gray 256c 467341B 0.000u 0:00.000
input/letter-en.png PNG 1695x2378 1695x2378+0+0 8-bit Gray 256c 191571B 0.000u 0:00.000
input/letter-en.tiff TIFF 1695x2378 1695x2378+0+0 8-bit Grayscale Gray 4030880B 0.000u 0:00.000
```
Size:
- jfif.jpg 240k
- letter-en.jpg 467k
- letter-en.png 191k
- letter-en.tiff 4.0M
## pdfbox
Using a java library is preferred, if the quality is good enough.
There is an
[example](https://github.com/apache/pdfbox/blob/2cea31cc63623fd6ece149c60d5f0cc05a696ea7/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ImageToPDF.java)
for this exact use case.
This is the sample code:
``` scala
// imports assuming pdfbox 2.x and cats-effect
import java.nio.file.{Files, Paths}
import javax.imageio.ImageIO

import cats.effect.ExitCode
import org.apache.pdfbox.pdmodel.{PDDocument, PDPage, PDPageContentStream}
import org.apache.pdfbox.pdmodel.common.PDRectangle
import org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory

def imgtopdf(file: String): ExitCode = {
  val jpg = Paths.get(file).toAbsolutePath
  if (!Files.exists(jpg)) {
    sys.error(s"file doesn't exist: $jpg")
  }
  // create a one-page A4 document and draw the image onto the page
  val pd = new PDDocument()
  val page = new PDPage(PDRectangle.A4)
  pd.addPage(page)
  val bimg = ImageIO.read(jpg.toFile)
  val img = LosslessFactory.createFromImage(pd, bimg)
  val stream = new PDPageContentStream(pd, page)
  stream.drawImage(img, 0, 0, PDRectangle.A4.getWidth, PDRectangle.A4.getHeight)
  stream.close()
  pd.save("test.pdf")
  pd.close()
  ExitCode.Success
}
```
Using pdfbox 2.0.18 and twelvemonkeys 3.5. Running time: `1384ms`
```
$ identify *.pdf
jfif.jpg.pdf PDF 595x842 595x842+0+0 16-bit sRGB 129660B 0.000u 0:00.000
letter-en.jpg.pdf PDF 595x842 595x842+0+0 16-bit sRGB 49118B 0.000u 0:00.000
letter-en.png.pdf PDF 595x842 595x842+0+0 16-bit sRGB 49118B 0.000u 0:00.000
letter-en.tiff.pdf PDF 595x842 595x842+0+0 16-bit sRGB 49118B 0.000u 0:00.000
```
Size:
- jfif.jpg 1.1M
- letter-en.jpg 142k
- letter-en.png 142k
- letter-en.tiff 142k
## img2pdf
This is a python tool that adds the image into the pdf without
re-encoding.
Using version 0.3.1. Running time: `323ms`.
```
$ identify *.pdf
jfif.jpg.pdf PDF 595x842 595x842+0+0 16-bit sRGB 129708B 0.000u 0:00.000
letter-en.jpg.pdf PDF 595x842 595x842+0+0 16-bit sRGB 49864B 0.000u 0:00.000
letter-en.png.pdf PDF 595x842 595x842+0+0 16-bit sRGB 49864B 0.000u 0:00.000
letter-en.tiff.pdf PDF 595x842 595x842+0+0 16-bit sRGB 49864B 0.000u 0:00.000
```
Size:
- jfif.jpg 241k
- letter-en.jpg 468k
- letter-en.png 191k
- letter-en.tiff 192k
## ImageMagick
The well-known imagemagick tool can convert images to pdfs, too.
Using version 6.9.10-71. Running time: `881ms`.
```
$ identify *.pdf
jfif.jpg.pdf PDF 595x843 595x843+0+0 16-bit sRGB 134873B 0.000u 0:00.000
letter-en.jpg.pdf PDF 1695x2378 1695x2378+0+0 16-bit sRGB 360100B 0.000u 0:00.000
letter-en.png.pdf PDF 1695x2378 1695x2378+0+0 16-bit sRGB 322418B 0.000u 0:00.000
letter-en.tiff.pdf PDF 1695x2378 1695x2378+0+0 16-bit sRGB 322418B 0.000u 0:00.000
```
Size:
- jfif.jpg 300k
- letter-en.jpg 390k
- letter-en.png 180k
- letter-en.tiff 5.1M
## Tesseract
Docspell already relies on tesseract for doing OCR. And in contrast to
all other candidates, it can create PDFs that are searchable. Of
course, this results in much longer running times, which cannot be
compared to the times of the other options.
```
tesseract doc3.jpg out -l deu pdf
```
It can also create both outputs in one go:
```
tesseract doc3.jpg out -l deu pdf txt
```
Using tesseract 4. Running time: `6661ms`
```
$ identify *.pdf
tesseract/jfif.jpg.pdf PDF 595x843 595x843+0+0 16-bit sRGB 130535B 0.000u 0:00.000
tesseract/letter-en.jpg.pdf PDF 1743x2446 1743x2446+0+0 16-bit sRGB 328716B 0.000u 0:00.000
tesseract/letter-en.png.pdf PDF 1743x2446 1743x2446+0+0 16-bit sRGB 328716B 0.000u 0:00.000
tesseract/letter-en.tiff.pdf PDF 1743x2446 1743x2446+0+0 16-bit sRGB 328716B 0.000u 0:00.000
```
Size:
- jfif.jpg 246k
- letter-en.jpg 473k
- letter-en.png 183k
- letter-en.tiff 183k
# Decision Outcome
Tesseract.
To not use more external tools, imagemagick and img2pdf are not
chosen, even though img2pdf shows the best results and is fastest.
The pdfbox library would be the favorite, because results are good and
with the [twelvemonkeys](https://github.com/haraldk/TwelveMonkeys)
library there is support for many image formats. The priority is to avoid
more external commands if possible.
But since there already is a dependency on tesseract and it can create
searchable pdfs, the decision is to use tesseract for this. Then PDFs
with images can be converted to searchable PDFs with images. And text
extraction is required anyways.

View File

@ -0,0 +1,76 @@
+++
title = "Extract Text from Files"
weight = 120
+++
# Context and Problem Statement
With support for more file types there must be a way to extract text
from all of them. It is better to extract text from the source files,
in contrast to extracting the text from the converted pdf file.
There are multiple options and multiple file types. Again, the highest
priority is to use a java/scala library to reduce external
dependencies.
# Considered Options
## MS Office Documents
There is only one library I know: [Apache
POI](https://poi.apache.org/). It supports `doc(x)` and `xls(x)`.
However, it doesn't support open-document format (odt and ods).
## OpenDocument Format
There are two libraries:
- [Apache Tika Parser](https://tika.apache.org/)
- [ODFToolkit](https://github.com/tdf/odftoolkit)
*Tika:* The tika-parsers package contains an opendocument parser for
extracting text. But it has a huge dependency tree, since it is a
super-package containing a parser for almost every common file type.
*ODF Toolkit:* This depends on [Apache Jena](https://jena.apache.org)
and also pulls in quite some dependencies (while not as many as
tika-parser). It is not too bad, since it is a library for
manipulating opendocument files. But all I need is to only extract
text. I created tests that extracted text from my odt/ods files. It
worked at first sight, but running the tests in a loop resulted in
strange nullpointer exceptions (it only worked the first run).
## Richtext
Richtext is supported by the jdk (using `RTFEditorKit` from
swing).
## PDF
For "image" pdf files, tesseract is used. For "text" PDF files, the
library [Apache PDFBox](https://pdfbox.apache.org) can be used.
There also is [iText](https://github.com/itext/itext7) with an AGPL
license.
## Images
For images and "image" PDF files, there is already tesseract in place.
## HTML
HTML must be converted into a PDF file before text can be extracted.
## Text/Markdown
These files can be used as-is, obviously.
# Decision Outcome
- MS Office files: POI library
- Open Document files: Tika, but integrating the few source files that
make up the open document parser. Due to its huge dependency tree,
the library is not added.
- PDF: Apache PDFBox. I know this library better than itext.
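A minimal sketch of PDF text extraction with PDFBox (assuming pdfbox
2.x; not docspell's actual code):
``` scala
import java.io.File
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// loads a pdf and returns the embedded text (no OCR, "text" pdfs only)
def extractPdfText(file: File): String = {
  val doc = PDDocument.load(file)
  try new PDFTextStripper().getText(doc)
  finally doc.close()
}
```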

View File

@ -0,0 +1,103 @@
+++
title = "Periodic Tasks"
weight = 130
+++
# Context and Problem Statement
Currently there is a `Scheduler` that consumes tasks off a queue in
the database. This allows multiple job executors running in parallel
racing for the next job to execute. This is for executing tasks
immediately as long as there are enough resources.
What is missing, is a component that maintains periodic tasks. The
reason for this is to have housekeeping tasks that run regularly and
clean up stale or unused data. Later, users should be able to create
periodic tasks, for example to read e-mails from an inbox or to be
notified of due items.
The problem is again that it must work with multiple job executor
instances running at the same time. This is the same pattern as with
the `Scheduler`: it must be ensured that a task is run by only one
executor at a time. Multiple job executors must not schedule a periodic
task more than once. If a periodic task takes longer than the time
between runs, it must wait for the next interval.
# Considered Options
1. Adding a `timer` and `nextrun` field to the current `job` table
2. Creating a separate table for periodic tasks
# Decision Outcome
Option 2: a separate table for periodic tasks.
For internal housekeeping tasks, it may suffice to reuse the existing
`job` queue by adding more fields such that a job may be considered
periodic. But this conflates with what the `Scheduler` is doing now
(executing tasks as soon as possible while being bound to some
resource limits) with a completely different subject.
There will be a new `PeriodicScheduler` that works on a new table in
the database that is representing periodic tasks. This table will
share fields with the `job` table to be able to create `RJob` records.
This new component is only taking care of periodically submitting jobs
to the job queue such that the `Scheduler` will eventually pick them up
and run them. If the tasks cannot run (for example due to resource
limitations), the periodic scheduler can do nothing but wait and try
again next time.
```sql
CREATE TABLE "periodic_task" (
"id" varchar(254) not null primary key,
"enabled" boolean not null,
"task" varchar(254) not null,
"group_" varchar(254) not null,
"args" text not null,
"subject" varchar(254) not null,
"submitter" varchar(254) not null,
"priority" int not null,
"worker" varchar(254),
"marked" timestamp,
"timer" varchar(254) not null,
"nextrun" timestamp not null,
"created" timestamp not null
);
```
Preparing for other features, at some point periodic tasks will be
created by users. It should be possible to disable/enable them. The
next 6 properties are needed to insert jobs into the `job` table. The
`worker` field (and `marked`) are used to mark a periodic job as
"being worked on by a job executor".
The `timer` is the schedule, which is a
[systemd-like](https://man.cx/systemd.time#heading7) calendar event
string. This is parsed by [this
library](https://github.com/eikek/calev). The `nextrun` field will
store the timestamp of the next time the task would need to be
executed. This is needed to query this table for the next due task.
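For illustration, parsing a timer and computing the next run might
look like this (assuming the calev library exposes `CalEvent` and
`nextElapse` as in its README; treat this as a sketch):
``` scala
import java.time.{ZoneOffset, ZonedDateTime}
import com.github.eikek.calev.CalEvent

// "every day at 2am and 2pm"
val timer = CalEvent.unsafe("*-*-* 02,14:00:00")
val next: Option[ZonedDateTime] =
  timer.nextElapse(ZonedDateTime.now(ZoneOffset.UTC))
```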
The `PeriodicScheduler` works roughly like this:
On startup:
- Remove stale worker values. If the process has been killed, there
may be marked tasks which must be cleared now.
Main-Loop:
0. Cancel current scheduled notify (see 4. below)
1. get next (= earliest & enabled) periodic job
2. if none: stop
3. if triggered (= `nextrun <= 'now'`):
- Mark periodic task. On fail: goto 1.
- Submit new job into the jobqueue:
- Update `nextrun` field
- Check for non-final jobs of that name. This is required to not
run the same periodic task multiple times concurrently.
- if exist: goto 4.
- if not exist: submit job
- Unmark periodic task
4. if the next run is in the future
- schedule notify: notify self to run again next time the task
schedule triggers

View File

@ -0,0 +1,42 @@
+++
title = "Archive Files"
weight = 140
+++
# Context and Problem Statement
Docspell should have support for files that contain the actual files
that matter, like zip files and other such things. It should extract
their contents automatically.
Since docspell should never drop or modify user data, the archive file
must be present in the database. And it must be possible to download
the file unmodified.
On the other hand, files in there need to be text analysed and
converted to pdf files.
# Decision Outcome
There is currently a table `attachment_source` which holds references
to "original" files. These are the files as uploaded by the user,
before converted to pdf. Archive files add a subtlety to this: in case
of an archive, an `attachment_source` is the original (non-archive)
file inside an archive.
The archive file itself will be stored in a separate table `attachment_archive`.
Example: uploading a `files.zip` ZIP file containing `report.jpg`:
- `attachment_source`: report.jpg
- `attachment`: report.pdf
- `attachment_archive`: files.zip
Archives may contain other archives. Then the inner archives will not
be saved. The archive file is extracted recursively, until no known
archive file is found.
# Initial Support
Initial support is implemented for ZIP and EML (e-mail files) files.

View File

@ -0,0 +1,47 @@
+++
title = "Fulltext Search Engine"
weight = 150
+++
It should be possible to search the contents of all documents.
# Context and Problem Statement
To allow searching the documents contents efficiently, a separate
index is necessary. The "de-facto standard" for fulltext search on the
JVM is something backed by [Lucene](https://lucene.apache.org).
Another option is to use a RDBMS that supports fulltext search.
This adds another component to the mix, which increases the complexity
of the setup and the software. Since docspell works great without this
feature, it shouldn't have a huge impact on the application, i.e. if
the fulltext search component is down or broken, docspell should still
work (just the fulltext search is then not working).
# Considered Options
* [Apache SOLR](https://lucene.apache.org/solr)
* [ElasticSearch](https://www.elastic.co/elasticsearch/)
* [PostgreSQL](https://www.postgresql.org/docs/12/textsearch.html)
* All of them or a subset
# Decision Outcome
If docspell is running on PostgreSQL, it would be nice to also use it
for fulltext search to save the cost of running another component. But
I don't want to lock the database to PostgreSQL *only* because of the
fulltext search feature.
ElasticSearch and Apache SOLR are quite similar in features. SOLR is
part of Lucene and therefore lives in the Apache ecosystem. I would
choose SOLR over ElasticSearch, because I used it before.
The last option (supporting all) is interesting, since it would allow
using PostgreSQL for fulltext search for those that use PostgreSQL as
the database for docspell.
In a first step, identify what docspell needs from a fulltext search
component and create this interface and an implementation for Apache
SOLR. This enables all users to use the fulltext search feature. As a
later step, an implementation based on PostgreSQL and/or ElasticSearch
could be provided, too.
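Such an interface could look roughly like this. This is only a sketch;
all names are made up and the real interface will differ:
``` scala
import fs2.Stream

final case class TextData(itemId: String, attachmentId: String, collective: String, text: String)
final case class FtsQuery(collective: String, query: String, limit: Int, offset: Int)
final case class FtsResult(itemIds: List[String])

trait FtsClient[F[_]] {
  def indexData(data: Stream[F, TextData]): F[Unit]
  def search(q: FtsQuery): F[FtsResult]
  def clearAll: F[Unit]
}
```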

View File

@ -0,0 +1,64 @@
+++
title = "Convert PDF Files"
weight = 160
+++
# Context and Problem Statement
Some PDFs contain only images (when coming from a scanner) and
therefore one is not able to click into the pdf and select text for
copy&paste. Also it is not searchable in a PDF viewer. These are
real shortcomings that can be fixed, especially when there is
already OCR built in.
For images, this works already as tesseract is used to create the PDF
files. Tesseract creates the files with an additional text layer
containing the OCRed text.
# Considered Options
* [ocrmypdf](https://github.com/jbarlow83/OCRmyPDF) OCRmyPDF adds an
OCR text layer to scanned PDF files, allowing them to be searched
## ocrmypdf
This is a very nice python tool that uses tesseract to do OCR on each
page and add the extracted text as a pdf text layer to the page.
Additionally it creates PDF/A type pdfs, which are great for
archiving. This fixes exactly the things stated above.
### Integration
Docspell already has this built in for images. When converting images
to a PDF (which is done early in processing), the process creates a
text and a PDF file. Docspell then sets the text in this step and the
text extraction step skips doing its work, if there is already text
available.
It would be possible to use the `--sidecar` option with ocrmypdf to
create a text file of the extracted text with one run, too (exactly
like it works for tesseract). But for "text" pdfs, ocrmypdf writes
an info message into this text file:
```
[OCR skipped on page 1] [OCR skipped on page 2]
```
Docspell cannot reliably tell whether this is extracted text or not.
It would be required to load the pdf and check its contents. This is a
bit of bad luck, because everything would just work already. So it
requires a (small) change in the text-extraction step. By default,
text extraction happens on the source file. For PDFs, text extraction
should now be run on the converted file, to avoid running OCR twice.
The converted pdf file is either a text-pdf in the first place,
where ocrmypdf would only convert it to a PDF/A file; or it may be a
converted file containing the OCR-ed text as a pdf layer. If ocrmypdf
is disabled, the converted file and the source file are the same for
PDFs.
# Decision Outcome
Add ocrmypdf as an optional conversion from PDF to PDF. Ocrmypdf is
distributed under the GPL-3 license.
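A typical invocation might look like this (flags as documented by
ocrmypdf; file names are placeholders):
```
ocrmypdf --skip-text --sidecar out.txt -l deu in.pdf out.pdf
```
Here `--sidecar` writes the extracted text to `out.txt` and
`--skip-text` leaves pages that already contain text untouched.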

View File

@ -0,0 +1,14 @@
+++
title = "ADRs"
description = "Contains some ADRs, which are internal notes on decisions made."
weight = 300
sort_by = "weight"
insert_anchor_links = "right"
template = "pages.html"
[extra]
mktoc = true
+++
This contains a list of ADRs; most of them are from very early on. They
often just contain notes that could go nowhere else, but should still
be captured.

View File

@ -0,0 +1,43 @@
@startuml
scale 1200 width
title: Processing Files
skinparam monochrome true
skinparam backgroundColor white
skinparam rectangle {
roundCorner<<Input>> 25
roundCorner<<Output>> 5
}
rectangle Input <<Input>> {
file "html"
file "plaintext"
file "image"
file "msoffice"
file "rtf"
file "odf"
file "pdf"
}
node toBoth [
PDF + TXT
]
node toPdf [
PDF
]
node toTxt [
TXT
]
image --> toBoth:<tesseract>
html --> toPdf:<wkhtmltopdf>
toPdf --> toTxt:[pdfbox]
plaintext --> html:[flexmark]
msoffice --> toPdf:<unoconv>
msoffice --> toTxt:[poi]
rtf --> toTxt:[jdk]
rtf --> toPdf:<unoconv>
odf --> toTxt:[tika]
odf --> toPdf:<unoconv>
pdf --> toTxt:<tesseract>
pdf --> toTxt:[pdfbox]
plaintext -> toTxt:[identity]
@enduml

View File

@ -0,0 +1,77 @@
+++
title = "Short Title"
draft = true
+++
# [short title of solved problem and solution]
* Status: [proposed | rejected | accepted | deprecated | … | superseded by [ADR-0005](0005-example.md)] <!-- optional -->
* Deciders: [list everyone involved in the decision] <!-- optional -->
* Date: [YYYY-MM-DD when the decision was last updated] <!-- optional -->
Technical Story: [description | ticket/issue URL] <!-- optional -->
## Context and Problem Statement
[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]
## Decision Drivers <!-- optional -->
* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
*<!-- numbers of drivers can vary -->
## Considered Options
* [option 1]
* [option 2]
* [option 3]
*<!-- numbers of options can vary -->
## Decision Outcome
Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)].
### Positive Consequences <!-- optional -->
* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
*
### Negative Consequences <!-- optional -->
* [e.g., compromising quality attribute, follow-up decisions required, …]
*
## Pros and Cons of the Options <!-- optional -->
### [option 1]
[example | description | pointer to more information | …] <!-- optional -->
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
*<!-- numbers of pros and cons can vary -->
### [option 2]
[example | description | pointer to more information | …] <!-- optional -->
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
*<!-- numbers of pros and cons can vary -->
### [option 3]
[example | description | pointer to more information | …] <!-- optional -->
* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
*<!-- numbers of pros and cons can vary -->
## Links <!-- optional -->
* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
*<!-- numbers of links can vary -->