Initial version.
Features:
- Upload PDF files and let them be analyzed
- Manage meta data and items
- See processing in webapp
@ -0,0 +1,18 @@
|
||||
.jumbotron {
|
||||
background: url(../img/back-master-small.jpg);
|
||||
background-repeat: no-repeat;
|
||||
background-size: 100% 800px;
|
||||
}
|
||||
|
||||
.content-wrapper h1, .h1 {
|
||||
border-bottom: 1px solid #d8dfe5;
|
||||
padding-bottom: 0.8rem;
|
||||
}
|
||||
|
||||
body {
|
||||
font-size: 1.75em;
|
||||
}
|
||||
|
||||
h4 {
|
||||
text-decoration: underline;
|
||||
}
|
48
modules/microsite/src/main/resources/microsite/data/menu.yml
Normal file
@ -0,0 +1,48 @@
|
||||
options:
|
||||
- title: Home
|
||||
url: index.html
|
||||
|
||||
- title: Getit
|
||||
url: getit.html
|
||||
|
||||
- title: Documentation
|
||||
url: doc.html
|
||||
|
||||
nested_options:
|
||||
- title: Installation
|
||||
url: doc/install.html
|
||||
|
||||
- title: Configuring
|
||||
url: doc/configure.html
|
||||
|
||||
- title: Adding Meta Data
|
||||
url: doc/metadata.html
|
||||
|
||||
- title: Uploads
|
||||
url: doc/uploading.html
|
||||
|
||||
- title: Processing Queue
|
||||
url: doc/processing.html
|
||||
|
||||
- title: Find and Review
|
||||
url: doc/curate.html
|
||||
|
||||
- title: Joex
|
||||
url: doc/joex.html
|
||||
|
||||
- title: Development
|
||||
url: dev.html
|
||||
|
||||
nested_options:
|
||||
- title: ADRs
|
||||
url: dev/adr.html
|
||||
|
||||
- title: Api
|
||||
url: api.html
|
||||
|
||||
nested_options:
|
||||
- title: REST Api Doc
|
||||
url: openapi/docspell-openapi.html
|
||||
|
||||
- title: REST OpenApi Spec
|
||||
url: openapi/docspell-openapi.yml
|
1
modules/microsite/src/main/resources/microsite/img/favicon.png
Symbolic link
@ -0,0 +1 @@
|
||||
../../../../../../webapp/src/main/webjar/favicon/android-icon-96x96.png
|
92
modules/microsite/src/main/tut/api.md
Normal file
@ -0,0 +1,92 @@
|
||||
---
|
||||
layout: docs
|
||||
position: 5
|
||||
title: Api
|
||||
---
|
||||
|
||||
# {{page.title}}
|
||||
|
||||
Docspell is designed as a REST server that uses JSON to exchange
|
||||
data. The REST api can be used to integrate docspell into your
|
||||
workflow.
|
||||
|
||||
[Docspell REST Api Doc](openapi/docspell-openapi.html)
|
||||
|
||||
The "raw" `openapi.yml` specification file can be found
|
||||
[here](openapi/docspell-openapi.yml).
|
||||
|
||||
The routes can be divided into protected and unprotected routes. The
|
||||
unprotected, or open, routes are at `/open/*` while the protected
|
||||
routes are at `/sec/*`. Open routes don't require authenticated access
|
||||
and can be used by any user. The protected routes require an
|
||||
authenticated user.
|
||||
|
||||
## Authentication
|
||||
|
||||
The unprotected route `/open/auth/login` can be used to log in with
|
||||
account name and password. The response contains a token that can be
|
||||
used for accessing protected routes. The token is only valid for a
|
||||
restricted time which can be configured (default is 5 minutes).
|
||||
|
||||
New tokens can be generated using an existing valid token and the
|
||||
protected route `/sec/auth/session`. This will return the same
|
||||
response as above, giving a new token.
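
Because a token is only valid for `validMs` milliseconds, a client that wants to stay logged in has to request a new one before that time is over. The following is a minimal sketch of the timing logic only. It assumes the client notes its own clock when the response arrives; the `Session` class and the 30 second safety margin are illustrative and not part of docspell.

```python
import time
from dataclasses import dataclass


@dataclass
class Session:
    """Client-side record of a login response (hypothetical helper)."""
    token: str
    valid_ms: int        # the `validMs` field of the response
    received_at_ms: int  # local clock when the response arrived

    def expires_at_ms(self) -> int:
        return self.received_at_ms + self.valid_ms

    def needs_refresh(self, now_ms: int, margin_ms: int = 30_000) -> bool:
        """True once the current time is within `margin_ms` of the expiry."""
        return now_ms >= self.expires_at_ms() - margin_ms


def current_millis() -> int:
    """Local wall clock in milliseconds."""
    return int(time.time() * 1000)
```

With the default of 5 minutes, `needs_refresh` turns true 270 seconds after the response was received, which leaves time for a call to `/sec/auth/session`.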
|
||||
|
||||
This token can be added to requests in two ways: as a cookie header or
|
||||
a "normal" http header. If a cookie header is used, the cookie name
|
||||
must be `docspell_auth` and a custom header must be named
|
||||
`X-Docspell-Auth`.
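
The two ways of sending the token can be sketched with the Python standard library. This is an illustration under assumptions, not a client shipped with docspell: the base url is the local one used in the examples on this page, and the helper names are made up here. Nothing is sent over the network; the functions only build the pieces of a request.

```python
import json
import urllib.request

# Assumption: a rest server running locally, as in the examples on this page.
BASE_URL = "http://localhost:7880/api/v1"


def header_auth(token: str) -> dict:
    """Send the token as a custom http header."""
    return {"X-Docspell-Auth": token}


def cookie_auth(token: str) -> dict:
    """Send the token as a cookie named `docspell_auth`."""
    return {"Cookie": f"docspell_auth={token}"}


def login_request(account: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the open login route."""
    body = json.dumps({"account": account, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/open/auth/login", data=body, method="POST"
    )


def protected_request(path: str, token: str) -> urllib.request.Request:
    """Build a GET request for a route below `/sec/`, using the header variant."""
    return urllib.request.Request(f"{BASE_URL}/sec/{path}", headers=header_auth(token))
```

Passing one of the returned requests to `urllib.request.urlopen` would perform the call against a running server.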
|
||||
|
||||
## Live Api
|
||||
|
||||
Besides the statically generated documentation at this site, the rest
|
||||
server provides a swagger generated api documentation that allows
|
||||
playing around with the api. It requires a running docspell rest
|
||||
server. If it is deployed at `http://localhost:7880`, then check this
|
||||
url:
|
||||
|
||||
```
|
||||
http://localhost:7880/app/doc
|
||||
```
|
||||
|
||||
## Examples
|
||||
|
||||
These examples use the great command line tool
|
||||
[curl](https://curl.haxx.se/).
|
||||
|
||||
### Login
|
||||
|
||||
```
|
||||
$ curl -X POST -d '{"account": "smith", "password": "test"}' http://localhost:7880/api/v1/open/auth/login
|
||||
{"collective":"smith"
|
||||
,"user":"smith"
|
||||
,"success":true
|
||||
,"message":"Login successful"
|
||||
,"token":"1568142350115-ZWlrZS9laWtl-$2a$10$rGZUFDAVNIKh4Tj6u6tlI.-O2euwCvmBT0TlyDmIHR1ZsLQPAI="
|
||||
,"validMs":300000
|
||||
}
|
||||
```
|
||||
|
||||
### Get new token
|
||||
|
||||
```
|
||||
$ curl -XPOST -H 'X-Docspell-Auth: 1568142350115-ZWlrZS9laWtl-$2a$10$rGZUFDAVNIKh4Tj6u6tlI.-O2euwCvmBT0TlyDmIHR1ZsLQPAI=' http://localhost:7880/api/v1/sec/auth/session
|
||||
{"collective":"smith"
|
||||
,"user":"smith"
|
||||
,"success":true
|
||||
,"message":"Login successful"
|
||||
,"token":"1568142446077-ZWlrZS9laWtl-$2a$10$3B0teJ9rMpsBJPzHfZZPoO-WeA1bkfEONBN8fyzWE8DeaAHtUc="
|
||||
,"validMs":300000
|
||||
}
|
||||
```
|
||||
|
||||
### Get some insights
|
||||
|
||||
```
|
||||
$ curl -H 'X-Docspell-Auth: 1568142446077-ZWlrZS9laWtl-$2a$10$3B0teJ9rMpsBJPzHfZZPoO-WeA1bkfEONBN8fyzWE8DeaAHtUc=' http://localhost:7880/api/v1/sec/collective/insights
|
||||
{"incomingCount":3
|
||||
,"outgoingCount":1
|
||||
,"itemSize":207310
|
||||
,"tagCloud":{"items":[]}
|
||||
}
|
||||
```
|
16
modules/microsite/src/main/tut/demo.md
Normal file
@ -0,0 +1,16 @@
|
||||
---
|
||||
layout: home
|
||||
position: 2
|
||||
section: demo
|
||||
title: Demo
|
||||
technologies:
|
||||
- first: ["Scala + Elm", "Backend is in Scala with Cats/Fs2, Webapp in Elm"]
|
||||
- second: ["Unpaper + Tesseract", "Text is extracted using OCR provided by tesseract"]
|
||||
- third: ["Stanford NLP", "Documents are analyzed using Stanford NLP classifiers"]
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
|
||||
|
||||
<img width="100%" src="img/docspell-demo.gif" title="Demo">
|
86
modules/microsite/src/main/tut/dev.md
Normal file
@ -0,0 +1,86 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Development
|
||||
---
|
||||
|
||||
|
||||
# {{page.title}}
|
||||
|
||||
|
||||
## Building
|
||||
|
||||
[Sbt](https://scala-sbt.org) is used to build the application. Clone
|
||||
the sources and run:
|
||||
|
||||
- `make` to compile all sources (Elm + Scala)
|
||||
- `make-zip` to create zip packages
|
||||
- `make-deb` to create debian packages
|
||||
|
||||
The zip files can be found afterwards in:
|
||||
|
||||
```
|
||||
modules/restserver/target/universal
|
||||
modules/joex/target/universal
|
||||
```
|
||||
|
||||
|
||||
## Starting Servers with `reStart`
|
||||
|
||||
When developing, it's very convenient to use the [revolver sbt
|
||||
plugin](https://github.com/spray/sbt-revolver). Start the sbt console
|
||||
and then run:
|
||||
|
||||
```
|
||||
sbt:docspell-root> restserver/reStart
|
||||
```
|
||||
|
||||
This starts a REST server. Once it has started up, type:
|
||||
|
||||
```
|
||||
sbt:docspell-root> joex/reStart
|
||||
```
|
||||
|
||||
if a joex component is also required. Prefixing the commands with `~`
|
||||
results in recompile+restart once a source file is modified.
|
||||
|
||||
|
||||
## Custom config file
|
||||
|
||||
The sbt build is set up such that a file `dev.conf` in the root of the
|
||||
source tree is picked up as config file, if it exists. So you can
|
||||
create a custom config file for development. For example, a custom
|
||||
database for development may be set up this way:
|
||||
|
||||
```
|
||||
#jdbcurl = "jdbc:h2:///home/dev/workspace/projects/docspell/local/docspell-demo.db;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
|
||||
jdbcurl = "jdbc:postgresql://localhost:5432/docspelldev"
|
||||
#jdbcurl = "jdbc:mariadb://localhost:3306/docspelldev"
|
||||
|
||||
docspell.server {
|
||||
backend {
|
||||
jdbc {
|
||||
url = ${jdbcurl}
|
||||
user = "dev"
|
||||
password = "dev"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
docspell.joex {
|
||||
jdbc {
|
||||
url = ${jdbcurl}
|
||||
user = "dev"
|
||||
password = "dev"
|
||||
}
|
||||
scheduler {
|
||||
pool-size = 1
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## ADRs
|
||||
|
||||
Some early information about certain details can be found in the few
|
||||
[ADR](https://adr.github.io/) that exist:
|
||||
|
||||
- [ADRs](dev/adr.html)
|
12
modules/microsite/src/main/tut/dev/adr.md
Normal file
@ -0,0 +1,12 @@
|
||||
---
|
||||
layout: docs
|
||||
title: ADRs
|
||||
---
|
||||
|
||||
# ADR
|
||||
|
||||
- [0001 Components](adr/0001_components.html)
|
||||
- [0002 Component Interaction](adr/0002_component_interaction.html)
|
||||
- [0003 Encryption](adr/0003_encryption.html)
|
||||
- [0004 ISO8601 vs Unix](adr/0004_iso8601vsEpoch.html)
|
||||
- [0005 Job Executor](adr/0005_job-executor.html)
|
@ -0,0 +1,33 @@
|
||||
# Use Markdown Architectural Decision Records
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
We want to [record architectural decisions](https://adr.github.io/)
|
||||
made in this project. Which format and structure should these records
|
||||
follow?
|
||||
|
||||
## Considered Options
|
||||
|
||||
* [MADR](https://adr.github.io/madr/) 2.1.0 - The Markdown Architectural Decision Records
|
||||
* [Michael Nygard's template](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions) - The first incarnation of the term "ADR"
|
||||
* [Sustainable Architectural
|
||||
Decisions](https://www.infoq.com/articles/sustainable-architectural-design-decisions) -
|
||||
The Y-Statements
|
||||
* Other templates listed at
|
||||
<https://github.com/joelparkerhenderson/architecture_decision_record>
|
||||
* Formless - No conventions for file format and structure
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
Chosen option: "MADR 2.1.0", because
|
||||
|
||||
* Implicit assumptions should be made explicit. Design documentation
|
||||
is important to enable people to understand the decisions later on.
|
||||
See also [A rational design process: How and why to fake
|
||||
it](https://doi.org/10.1109/TSE.1986.6312940).
|
||||
* The MADR format is lean and fits our development style.
|
||||
* The MADR structure is comprehensible and facilitates usage &
|
||||
maintenance.
|
||||
* The MADR project is vivid.
|
||||
* Version 2.1.0 is the latest one available when starting to document
|
||||
ADRs.
|
66
modules/microsite/src/main/tut/dev/adr/0001_components.md
Normal file
@ -0,0 +1,66 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Components
|
||||
---
|
||||
|
||||
# Components
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
How should the application be structured into its main components? The
|
||||
goal is to be able to have multiple rest servers/webapps and multiple
|
||||
document processor components working together.
|
||||
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
The following are the "main" modules. There may be more helper modules
|
||||
and libraries that support implementing a feature.
|
||||
|
||||
### store
|
||||
|
||||
The code related to database access. It also provides the job
|
||||
queue. It is designed as a library.
|
||||
|
||||
### joex
|
||||
|
||||
Joex stands for "job executor".
|
||||
|
||||
An application that executes jobs from the queue and therefore depends
|
||||
on the `store` module. It provides the code for all tasks that can be
|
||||
submitted as jobs. If no jobs are in the queue, the joex "sleeps"
|
||||
and must be woken via an external request.
|
||||
|
||||
It provides the document processing code.
|
||||
|
||||
It provides a http rest server to get insight into the joex state
|
||||
and also to be notified of new jobs.
|
||||
|
||||
### backend
|
||||
|
||||
It provides all the logic, except document processing, as a set of
|
||||
"operations". An operation can be directly mapped to a rest
|
||||
endpoint.
|
||||
|
||||
It is designed as a library.
|
||||
|
||||
### rest api
|
||||
|
||||
This module contains the specification for the rest server as an
|
||||
`openapi.yml` file. It is packaged as a scala library that also
|
||||
provides types and conversions to/from json.
|
||||
|
||||
The idea is that the `rest server` module can depend on it as well as
|
||||
rest clients.
|
||||
|
||||
### rest server
|
||||
|
||||
This is the main application. It directly depends on the `backend`
|
||||
module, and each rest endpoint maps to a "backend operation". It is
|
||||
also responsible for converting the json data inside http requests
|
||||
to/from types recognized by the `backend` module.
|
||||
|
||||
|
||||
### webapp
|
||||
|
||||
This module provides the user interface as a web application.
|
@ -0,0 +1,65 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Component Interaction
|
||||
---
|
||||
|
||||
# Component Interaction
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
There are multiple web applications with their rest servers and there
|
||||
are multiple document processors. These processes must communicate:
|
||||
|
||||
- once a new job is added to the queue the rest server must somehow
|
||||
notify processors to wake up
|
||||
- once a processor takes a job, it must propagate the progress and
|
||||
outcome to all rest servers, so that the rest server can notify the
|
||||
user that is currently logged in. Since it's not known which
|
||||
rest-server the user is using right now, all must be notified.
|
||||
|
||||
## Considered Options
|
||||
|
||||
1. JMS (ActiveMQ or similar): Message Broker as another active
|
||||
component
|
||||
2. Akka: using a cluster
|
||||
3. DB: Register with "call back urls"
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
Choosing option 3: DB as central synchronisation point.
|
||||
|
||||
The reason is that this is the simplest solution and doesn't require
|
||||
external libraries or more processes. The other options seem too big
|
||||
of a weapon for the task at hand. They are both large components
|
||||
themselves and require more knowledge to be used efficiently.
|
||||
|
||||
It works roughly like this:
|
||||
|
||||
- rest servers and processors register at the database on startup each
|
||||
with a unique call-back url
|
||||
- and deregister on shutdown
|
||||
- each component has db access
|
||||
- rest servers can list all processors and vice versa
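
The registration scheme above can be sketched with an in-memory SQLite database standing in for the shared db. The table name `node`, its columns and the function names are assumptions made for this illustration; the actual schema may look different.

```python
import sqlite3


def open_registry() -> sqlite3.Connection:
    """Create an in-memory stand-in for the shared database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE node ("
        " id TEXT PRIMARY KEY,"
        " kind TEXT NOT NULL,"        # 'restserver' or 'joex'
        " url TEXT NOT NULL UNIQUE)"  # unique call-back url
    )
    return db


def register(db, node_id, kind, url):
    """Called by a component on startup."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO node (id, kind, url) VALUES (?, ?, ?)",
            (node_id, kind, url),
        )


def deregister(db, node_id):
    """Called by a component on shutdown."""
    with db:
        db.execute("DELETE FROM node WHERE id = ?", (node_id,))


def list_urls(db, kind):
    """Call-back urls of all registered components of one kind."""
    rows = db.execute("SELECT url FROM node WHERE kind = ? ORDER BY id", (kind,))
    return [url for (url,) in rows]
```

A rest server that has added a job would call `list_urls(db, "joex")` and send a wake-up request to each url; a processor would do the same with `"restserver"` to report progress.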
|
||||
|
||||
### Positive Consequences
|
||||
|
||||
- complexity of the whole application is not touched
|
||||
- since a lot of data must be transferred to the document processors,
|
||||
this is solved by simply accessing the db. So the protocol for data
|
||||
exchange is set. There is no need for other protocols that handle
|
||||
large data (http chunking etc)
|
||||
- uses the already existing db as synchronisation point
|
||||
- no additional knowledge required
|
||||
- simple to understand and so not hard to debug
|
||||
|
||||
### Negative Consequences
|
||||
|
||||
- all components must have db access. This is also a security con,
|
||||
because if one of those processes is hacked, db access is
|
||||
possible. And it simply is another dependency that is not really
|
||||
required for the joex component
|
||||
- the joex component cannot be in an untrusted environment (untrusted
|
||||
from the db's point of view). For example, it is not possible to
|
||||
create "personal joex" that only receive your own jobs…
|
||||
- in order to know if a component is really active, one must run a
|
||||
ping against the call-back url
|
95
modules/microsite/src/main/tut/dev/adr/0003_encryption.md
Normal file
@ -0,0 +1,95 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Encryption
|
||||
---
|
||||
|
||||
# Encryption
|
||||
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
Since docspell may store important documents, it should be possible to
|
||||
encrypt them on the server. It should be (almost) transparent to the
|
||||
user, for example, a user must be able to log in and download a file in
|
||||
clear form. That is, the server must also decrypt them.
|
||||
|
||||
Then all users of a collective should have access to the files. This
|
||||
requires sharing the key among the users of a collective.
|
||||
|
||||
But, even when files are encrypted, the associated meta data is not!
|
||||
So access to the database in particular would reveal tags,
|
||||
associated persons and correspondents of documents.
|
||||
|
||||
So in short, encryption means:
|
||||
|
||||
- file contents (the blobs and extracted text) are encrypted
|
||||
- metadata is not
|
||||
- secret keys are stored at the server (protected by a passphrase),
|
||||
such that files can be downloaded in clear form
|
||||
|
||||
|
||||
## Decision Drivers
|
||||
|
||||
* the major driver is to provide the most privacy possible for users
|
||||
* even at the expense of fewer features; currently I think that the
|
||||
associated meta data is enough for finding documents (i.e. full text
|
||||
search is not needed)
|
||||
|
||||
## Considered Options
|
||||
|
||||
It is clear, that only blobs (file contents) can be encrypted, but not
|
||||
the associated metadata. And the extracted text must be encrypted,
|
||||
too, obviously.
|
||||
|
||||
|
||||
### Public Key Encryption (PKE)
|
||||
|
||||
With PKE, the server can automatically encrypt files using
|
||||
publicly available key data. It wouldn't require a user to provide a
|
||||
passphrase for encryption, only for decryption.
|
||||
|
||||
This would allow for first processing files (extracting text, doing
|
||||
text analysis) and encrypting them (and the text) afterwards.
|
||||
|
||||
The public and secret keys are stored at the database. The secret key
|
||||
must be protected. This can be done by encrypting the passphrase to
|
||||
the secret key using each user's login password. If a user logs in, he
|
||||
or she must provide the correct password. Using this password, the
|
||||
private key can be unlocked. This requires storing the private key
|
||||
passphrase encrypted with every user's password in the database. So the
|
||||
whole security then depends on the quality of the users' passwords.
|
||||
|
||||
There are plenty of other difficulties with this approach (how about
|
||||
password change, new secret keys, adding users etc).
|
||||
|
||||
Using this kind of encryption would protect the data against offline
|
||||
attacks and also against accidental leakage (for example, if a bug in the
|
||||
software would access a file of another user).
|
||||
|
||||
|
||||
### No Encryption
|
||||
|
||||
If only blobs are encrypted, against which type of attack would it
|
||||
provide protection?
|
||||
|
||||
The users must still trust the server. First, in order to provide the
|
||||
wanted features (document processing), the server must see the file
|
||||
contents. Then, it will receive and serve files in clear form, so it
|
||||
has access to them anyways.
|
||||
|
||||
With that in mind, the "only" feature is to protect against "stolen
|
||||
database" attacks. If the database is somehow leaked, the attackers
|
||||
would only see the metadata, but not real documents. It also protects
|
||||
against leakage, maybe caused by a programming error.
|
||||
|
||||
But the downside is, that it increases complexity *a lot*. And since
|
||||
this is a personal tool for personal use, is it worth the effort?
|
||||
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
No encryption, because of its complexity.
|
||||
|
||||
For now, this tool is only meant for "self deployment" and personal
|
||||
use. If this changes or there is enough time, this decision should be
|
||||
reconsidered.
|
@ -0,0 +1,42 @@
|
||||
---
|
||||
layout: docs
|
||||
title: ISO8601 vs Millis
|
||||
---
|
||||
|
||||
# ISO8601 vs Millis as Date-Time transfer
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
The question is whether the REST Api should return an ISO8601
|
||||
formatted string in UTC timezone, or the unix time (number of
|
||||
milliseconds since 1970-01-01).
|
||||
|
||||
There is quite some controversy about it.
|
||||
|
||||
- <https://stackoverflow.com/questions/47426786/epoch-or-iso8601-date-format>
|
||||
- <https://nbsoftsolutions.com/blog/designing-a-rest-api-unix-time-vs-iso-8601>
|
||||
|
||||
In my opinion, the ISO8601 format (always UTC) is better. The reason
|
||||
is the better readability. But elm folks are on the other side:
|
||||
|
||||
- <https://package.elm-lang.org/packages/elm/time/1.0.0#iso-8601>
|
||||
- <https://package.elm-lang.org/packages/rtfeldman/elm-iso8601-date-strings/latest/>
|
||||
|
||||
One can convert from an ISO8601 date-time string in UTC time into the
|
||||
epoch millis and vice versa. So it is the same to me. There is no less
|
||||
information in an ISO8601 string than in the epoch millis.
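
The conversion in both directions can be sketched in a few lines. This is only an illustration of the equivalence and not code from docspell; the function names are made up here, and integer arithmetic is used so that no millisecond is lost to floating point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_iso(millis: int) -> str:
    """Epoch milliseconds -> ISO8601 string in UTC."""
    moment = EPOCH + timedelta(milliseconds=millis)
    return moment.isoformat(timespec="milliseconds")


def iso_to_millis(text: str) -> int:
    """ISO8601 string with an explicit offset -> epoch milliseconds."""
    return (datetime.fromisoformat(text) - EPOCH) // timedelta(milliseconds=1)
```

Applying `iso_to_millis` to the result of `millis_to_iso` returns the original number. Note that `iso_to_millis` expects an offset such as `+00:00` in the string; a value without one would be ambiguous and raises an error here.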
|
||||
|
||||
To avoid confusion, all date/time values should use the same encoding.
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
I go with the epoch time. Every timestamp/date-time value is
|
||||
transferred as a Unix timestamp.
|
||||
|
||||
Reasons:
|
||||
|
||||
- the Elm application needs to frequently calculate with these values
|
||||
to render the current waiting time etc. This is better if there are
|
||||
numbers without requiring to parse dates first
|
||||
- Since the UI is written with Elm, it's probably good to adopt their
|
||||
style
|
136
modules/microsite/src/main/tut/dev/adr/0005_job-executor.md
Normal file
@ -0,0 +1,136 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Joex - Job Executor
|
||||
---
|
||||
|
||||
# Job Executor
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
Docspell is a multi-user application. When processing users'
|
||||
documents, there must be some thought on how to distribute all the
|
||||
processing jobs on a much more restricted set of resources. There
|
||||
may be 100 users but only 4 cores that can process documents at a
|
||||
time. Doing simply FIFO is not enough since it provides an unfair
|
||||
distribution. The first user who submits 20 documents will then occupy
|
||||
all cores for quite some time and all other users would need to wait.
|
||||
|
||||
This tries to find a more fair distribution among the users (strictly
|
||||
meaning collectives here) of docspell.
|
||||
|
||||
The job executor is a separate component that will run in its own
|
||||
process. It takes the next job from the "queue" and executes the
|
||||
associated task. This is used to run the document processing jobs
|
||||
(text extraction, text analysis etc).
|
||||
|
||||
1. The task execution should survive restarts. State and task code
|
||||
must be recreated from some persisted state.
|
||||
|
||||
2. The processing should be fair with respect to collectives.
|
||||
|
||||
3. It must be possible to run many job executors, possibly on
|
||||
different machines. This can be used to quickly enable more
|
||||
processing power and to remove it once the peak is over.
|
||||
|
||||
4. Task execution can fail and it should be possible to retry those
|
||||
tasks. Reasons are that errors may be temporary (for example
|
||||
talking to a third party service), and to enable repairing without
|
||||
stopping the job executor. Some errors might be easily repaired (a
|
||||
program was not installed or whatever). In such a case it is good
|
||||
to know that the task will be retried later.
|
||||
|
||||
## Considered Options
|
||||
|
||||
In contrast to other ADRs this is just some sketching of thoughts for
|
||||
the current implementation.
|
||||
|
||||
1. Job descriptions are serialized and written to the database into a
|
||||
table. This becomes the queue. Tasks are identified by names and a
|
||||
job executor implementation must have a map of names to code to
|
||||
look up the task to perform. The task's arguments are serialized into
|
||||
a string and written to the database. Tasks must decode the
|
||||
string. This can be conveniently done using JSON and the provided
|
||||
circe decoders.
|
||||
|
||||
2. To provide a fair execution, jobs are organized into groups. When a
|
||||
new job is requested from the queue, first a group is selected
|
||||
using a round-robin strategy. This should ensure good enough
|
||||
fairness among groups. A group maps to a collective. Within a
|
||||
group, a job is selected based on priority, submitted time (fifo)
|
||||
and job state (see notes about stuck jobs).
|
||||
|
||||
3. Allowing multiple job executors means that getting the next job can
|
||||
fail due to simultaneous running transactions. It is retried until
|
||||
it succeeds. Taking a job puts it into _scheduled_ state. Each job
|
||||
executor has a unique (manually supplied) id and jobs are marked
|
||||
with that id once they are handed to the executor.
|
||||
|
||||
4. When a task fails, its state is updated to state _stuck_. Stuck
|
||||
jobs are retried in the future. The queue prefers to return stuck
|
||||
jobs that are due at the specific point in time ignoring the
|
||||
priority hint.
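
The "map of names to code" from point 1 can be sketched as follows. The real implementation is written in Scala and decodes arguments with circe; here Python's `json` module stands in for it, and the task name `process-item` as well as the argument fields are invented for the example.

```python
import json

TASKS = {}  # task name -> function that receives the decoded arguments


def task(name):
    """Register a function under a task name (hypothetical decorator)."""
    def wrap(fn):
        TASKS[name] = fn
        return fn
    return wrap


@task("process-item")
def process_item(args):
    return f"processing {len(args['files'])} file(s) for {args['collective']}"


def run_job(job):
    """Look up the task by name and hand it the decoded argument string."""
    fn = TASKS.get(job["task"])
    if fn is None:
        raise KeyError(f"unknown task: {job['task']}")
    return fn(json.loads(job["args"]))
```

Since only the name and a string are persisted, a restarted executor can rebuild the task from the database row, provided it knows a task of that name.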
|
||||
|
||||
### More Details
|
||||
|
||||
A job has these properties
|
||||
|
||||
- id (something random)
|
||||
- group
|
||||
- taskname (to choose task to run)
|
||||
- submitted-date
|
||||
- worker (the id of the job executor)
|
||||
- state, one of: waiting, scheduled, running, stuck, cancelled,
|
||||
failed, success
|
||||
- waiting: job has been inserted into the queue
|
||||
- scheduled: job has been handed over to some executor and is
|
||||
marked with the job executor id
|
||||
- running: a task is currently executing
|
||||
- stuck: a task has failed and will be retried eventually
|
||||
- cancelled: task has finished and there was a cancel request
|
||||
- failed: task has failed and exceeded the retries
|
||||
- success: task has completed successfully
|
||||
|
||||
The queue has a `take` or `nextJob` operation that takes the worker-id
|
||||
and a priority hint and goes roughly like this:
|
||||
|
||||
- select the next group using round-robin strategy
|
||||
- select all jobs with that group, where
|
||||
- state is stuck and waiting time has elapsed
|
||||
- state is waiting and has the given priority if possible
|
||||
- jobs are ordered by submitted time, but stuck jobs whose waiting
|
||||
time elapsed are preferred
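
The selection steps above can be sketched over an in-memory list of jobs. This is a simplification under assumptions: the real queue lives in the database and runs inside a transaction, round-robin is done here over the sorted names of groups that currently have work, and all names in the code are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    id: str
    group: str
    state: str         # only "waiting" and "stuck" jobs can be taken
    priority: str      # "high" or "low"
    submitted: int     # submission time (epoch millis)
    retry_at: int = 0  # stuck jobs: earliest time of the next attempt
    worker: Optional[str] = None


def next_group(groups, last):
    """Round-robin: the group after `last` in sorted order, wrapping around."""
    ordered = sorted(set(groups))
    if not ordered:
        return None
    later = [g for g in ordered if last is not None and g > last]
    return later[0] if later else ordered[0]


def take(jobs, worker, prio, last_group, now):
    """Mark the next job as scheduled for `worker` and return it, or None."""
    def eligible(job):
        due = job.state == "stuck" and job.retry_at <= now
        return job.state == "waiting" or due

    def order(job):
        if job.state == "stuck":
            return (0, 0, job.submitted)  # due stuck jobs first, hint ignored
        return (1, 0 if job.priority == prio else 1, job.submitted)

    group = next_group([j.group for j in jobs if eligible(j)], last_group)
    if group is None:
        return None
    job = min((j for j in jobs if j.group == group and eligible(j)), key=order)
    job.state, job.worker = "scheduled", worker
    return job
```

The caller keeps the group of the returned job and passes it as `last_group` on the next call, so that consecutive calls rotate through the groups.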
|
||||
|
||||
There are two priorities within a group: high and low. A configured
|
||||
counting scheme determines when to select a certain priority. For
|
||||
example, counting scheme of `(2,1)` would select two high priority
|
||||
jobs and then 1 low priority job. The `take` operation tries to prefer
|
||||
this priority but falls back to the other if no job with this priority
|
||||
is available.
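
A counting scheme can be read as a repeating sequence of priority hints. A small sketch, with a function name chosen for this example only:

```python
from itertools import cycle, islice


def priority_hints(high: int, low: int):
    """Endless sequence of hints for the counting scheme `(high, low)`."""
    return cycle(["high"] * high + ["low"] * low)


# The scheme (2,1): two high priority hints, then one low priority hint.
first_six = list(islice(priority_hints(2, 1), 6))
```

Here `first_six` is `["high", "high", "low", "high", "high", "low"]`. Each value is only a hint: if no job with that priority exists in the selected group, a job with the other priority is taken instead.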
|
||||
|
||||
A group corresponds to a collective. Then all collectives get
|
||||
(roughly) equal treatment.
|
||||
|
||||
Once there are no jobs in the queue the executor goes into sleep and
|
||||
must be woken to run again. If a job is submitted, the executors are
|
||||
notified.
|
||||
|
||||
### Stuck Jobs
|
||||
|
||||
A job goes into the _stuck_ state if the task has failed. In this
|
||||
state, the task is rerun after a while until a maximum retry count is
|
||||
reached.
|
||||
|
||||
The problem is how to notify all executors when the waiting time has
|
||||
elapsed. If one executor puts a job into stuck state, it means that
|
||||
all others should start looking into the queue again after `x`
|
||||
minutes. It would be possible to tell all existing executors to
|
||||
schedule themselves to wake up in the future, but this would miss all
|
||||
executors that show up later.
|
||||
|
||||
The waiting time is increased exponentially after each retry (`2 ^
|
||||
retry`) and it is meant as the minimum waiting time. So it is ok if
|
||||
all executors wake up periodically and check for new work. Most of the
|
||||
time this should not be necessary and is just a fallback if only stuck
|
||||
jobs are in the queue and nothing is submitted for a long time. If the
|
||||
system is used, jobs get submitted once in a while and would awake all
|
||||
executors.
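
The retry timing can be sketched as below. Two points are assumptions of this example and not stated above: the unit of the waiting time is taken to be minutes, and the retry counter is taken to start at 0. The function names are illustrative.

```python
def retry_delay_minutes(retry: int) -> int:
    """Minimum waiting time before the next attempt: 2 ^ retry."""
    return 2 ** retry


def next_attempt_ms(failed_at_ms: int, retry: int) -> int:
    """Earliest point in time (epoch millis) at which the job is due again."""
    return failed_at_ms + retry_delay_minutes(retry) * 60_000


def state_after_failure(retries_done: int, max_retries: int) -> str:
    """A failed task stays stuck until the maximum retry count is reached."""
    return "stuck" if retries_done < max_retries else "failed"
```

Because the value is only a minimum, an executor that wakes up somewhat later than the computed time still behaves correctly; the job simply waits a little longer.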
|
72
modules/microsite/src/main/tut/dev/adr/template.md
Normal file
@ -0,0 +1,72 @@
|
||||
# [short title of solved problem and solution]
|
||||
|
||||
* Status: [proposed | rejected | accepted | deprecated | … | superseded by [ADR-0005](0005-example.md)] <!-- optional -->
|
||||
* Deciders: [list everyone involved in the decision] <!-- optional -->
|
||||
* Date: [YYYY-MM-DD when the decision was last updated] <!-- optional -->
|
||||
|
||||
Technical Story: [description | ticket/issue URL] <!-- optional -->
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]
|
||||
|
||||
## Decision Drivers <!-- optional -->
|
||||
|
||||
* [driver 1, e.g., a force, facing concern, …]
|
||||
* [driver 2, e.g., a force, facing concern, …]
|
||||
* … <!-- numbers of drivers can vary -->
|
||||
|
||||
## Considered Options
|
||||
|
||||
* [option 1]
|
||||
* [option 2]
|
||||
* [option 3]
|
||||
* … <!-- numbers of options can vary -->
|
||||
|
||||
## Decision Outcome
|
||||
|
||||
Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)].
|
||||
|
||||
### Positive Consequences <!-- optional -->
|
||||
|
||||
* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
|
||||
* …
|
||||
|
||||
### Negative Consequences <!-- optional -->
|
||||
|
||||
* [e.g., compromising quality attribute, follow-up decisions required, …]
|
||||
* …
|
||||
|
||||
## Pros and Cons of the Options <!-- optional -->
|
||||
|
||||
### [option 1]
|
||||
|
||||
[example | description | pointer to more information | …] <!-- optional -->
|
||||
|
||||
* Good, because [argument a]
|
||||
* Good, because [argument b]
|
||||
* Bad, because [argument c]
|
||||
* … <!-- numbers of pros and cons can vary -->
|
||||
|
||||
### [option 2]
|
||||
|
||||
[example | description | pointer to more information | …] <!-- optional -->
|
||||
|
||||
* Good, because [argument a]
|
||||
* Good, because [argument b]
|
||||
* Bad, because [argument c]
|
||||
* … <!-- numbers of pros and cons can vary -->
|
||||
|
||||
### [option 3]
|
||||
|
||||
[example | description | pointer to more information | …] <!-- optional -->
|
||||
|
||||
* Good, because [argument a]
|
||||
* Good, because [argument b]
|
||||
* Bad, because [argument c]
|
||||
* … <!-- numbers of pros and cons can vary -->
|
||||
|
||||
## Links <!-- optional -->
|
||||
|
||||
* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
|
||||
* … <!-- numbers of links can vary -->
|
99
modules/microsite/src/main/tut/doc.md
Normal file
@ -0,0 +1,99 @@
|
||||
---
|
||||
layout: docs
|
||||
position: 4
|
||||
title: Documentation
|
||||
---
|
||||
|
||||
# {{page.title}}
|
||||
|
||||
Docspell assists in organizing large amounts of PDF files that are
|
||||
typically scanned paper documents. You can associate tags, set
|
||||
correspondents, what a document is concerned with, a name, a date and
|
||||
some more. If your documents are associated with this meta data, you
|
||||
should be able to quickly find them later using the search
|
||||
feature. But adding this manually to each document is a tedious
|
||||
task. What if most of it could be attached automatically?
|
||||
|
||||
## How it works
|
||||
|
||||
Documents have two main properties: a correspondent (sender or
|
||||
receiver that is not you) and something the document is about. Usually
|
||||
it is about a person or some thing – maybe your car, or contracts
|
||||
concerning some family member, etc.
|
||||
|
||||
1. You maintain a kind of address book. It should list all possible
|
||||
correspondents and the concerning people/things. This grows
|
||||
incrementally with each new unknown document.
|
||||
2. When docspell analyzes a document, it tries to find matches within
|
||||
your address book. It can detect the correspondent and a concerning
|
||||
person or thing. It will then associate this data to your
|
||||
documents.
|
||||
3. You can inspect what docspell has done and correct it. If docspell
|
||||
has found multiple suggestions, they will be shown for you to
|
||||
select one. If it is not correctly associated, very often the
|
||||
correct one is just one click away.
|
||||
|
||||
The set of meta data that docspell uses to draw suggestions from, must
|
||||
be maintained manually. But usually, this data doesn't grow as fast as
|
||||
the documents. After a while there is a quite complete address book
|
||||
and only once in a while it has to be revisited.
|
||||
|
||||
|
||||
## Terms
|
||||
|
||||
In order to better understand these pages, some terms should be
|
||||
explained first.
|
||||
|
||||
### Item
|
||||
|
||||
An **Item** is roughly your (pdf) document, only that an item may span
|
||||
multiple files, which are called **attachments**. And an item has
|
||||
**meta data** associated:
|
||||
|
||||
- a **correspondent**: the other side of the communication. It can be
|
||||
an organization or a person.
|
||||
- a **concerning person** or **equipment**: a person or thing that
|
||||
this item is about. Maybe it is an insurance contract about your
|
||||
car.
|
||||
- **tag**: an item can be tagged with custom tags. A tag can have a
|
||||
*category*. This is intended for grouping tags, for example a
|
||||
category `doctype` could be used to group tags like `bill`,
|
||||
`contract`, `receipt` etc. Usually an item is not tagged with more
|
||||
than one tag of a category.
|
||||
- an **item date**: this is the date of the document – if this is not
|
||||
set, the created date of the item is used.
|
||||
- a **due date**: an optional date indicating that something has to be
|
||||
done (e.g. paying a bill, submitting it) about this item until this
|
||||
date
|
||||
- a **direction**: one of "incoming" or "outgoing"
|
||||
- a **name**: some item name, defaults to the file name of the
|
||||
attachments
|
||||
- some **notes**: arbitrary descriptive text. You can use markdown
|
||||
here, which is appropriately formatted in the web application.
|
||||
|
||||
### Collective
|
||||
|
||||
The users of the application are part of a **collective**. A
|
||||
**collective** is a group of users that share access to the same
|
||||
items. The account name is therefore comprised of a *collective name*
|
||||
and a *user name*.
|
||||
|
||||
All users of a collective are equal; they have the same permissions to
|
||||
access all items. The items don't belong to a user, but to the
|
||||
collective.
|
||||
|
||||
That means, to identify yourself when signing in, you have to give the
|
||||
collective name and your user name. By default it is separated by a
|
||||
slash `/`, for example `smith/john`. If your user name is the same as
|
||||
the collective name, you can omit one; so `smith/smith` can be
|
||||
abbreviated to just `smith`.
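The sign-in rule above can be sketched in a few lines of Python. This is only an illustration, not docspell's actual code: the function name `parse_account` and the decision to reject empty parts are assumptions made for this example.

```python
from typing import Tuple


def parse_account(login: str, sep: str = "/") -> Tuple[str, str]:
    """Split a login string into (collective, user).

    Hypothetical helper: a login without a separator is treated as
    "collective name equals user name", as described above.
    """
    collective, found, user = login.partition(sep)
    if not collective or (found and not user):
        # Assumption: empty parts such as "smith/" or "/john" are rejected.
        raise ValueError("invalid account name: %r" % login)
    if not found:
        # "smith" is shorthand for "smith/smith"
        user = collective
    return collective, user
```

With this sketch, `parse_account("smith/john")` yields the collective `smith` and the user `john`, while `parse_account("smith")` yields `smith` for both.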
|
||||
|
||||
|
||||
## Limitations
|
||||
|
||||
* Docspell currently supports only PDF files.
|
||||
* The PDF view relies on the browser's capabilities. Sadly, not all
|
||||
browsers can display PDF files. Some may require extra plugins. And
|
||||
it's especially sad that mobile browsers won't display the
|
||||
files. It works with the major desktop browsers (firefox, chromium),
|
||||
though.
|
261
modules/microsite/src/main/tut/doc/configure.md
Normal file
@ -0,0 +1,261 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Configuring
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
Docspell's executable can take one argument – a configuration file. If
|
||||
that is not given, the defaults are used. The config file overrides
|
||||
default values, so only values that differ from the defaults are
|
||||
necessary.
|
||||
|
||||
This applies to the restserver and the joex as well.
|
||||
|
||||
## Important Config Options
|
||||
|
||||
The configuration of both components uses separate namespaces. The
|
||||
configuration for the REST server is below `docspell.server`, while
|
||||
the one for joex is below `docspell.joex`.
|
||||
|
||||
### JDBC
|
||||
|
||||
This configures the connection to the database. This has to be
|
||||
specified for the rest server and joex. By default, a H2 database in
|
||||
the `/tmp` directory is configured.
|
||||
|
||||
The config looks like this (both components):
|
||||
|
||||
```
|
||||
docspell.joex.jdbc {
|
||||
url = ...
|
||||
user = ...
|
||||
password = ...
|
||||
}
|
||||
|
||||
docspell.server.backend.jdbc {
|
||||
url = ...
|
||||
user = ...
|
||||
password = ...
|
||||
}
|
||||
```
|
||||
|
||||
The `url` is the connection to the database. It must start with
|
||||
`jdbc`, followed by the name of the database. The rest is specific to the
|
||||
database used: it is either a path to a file for H2 or a host/database
|
||||
url for MariaDB and PostgreSQL.
|
||||
|
||||
When using H2, the user is `sa`, the password can be empty and the url
|
||||
must include these options:
|
||||
|
||||
```
|
||||
;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE
|
||||
```
|
||||
|
||||
#### Examples
|
||||
|
||||
PostgreSQL:
|
||||
```
|
||||
url = "jdbc:postgresql://localhost:5432/docspelldb"
|
||||
```
|
||||
|
||||
MariaDB:
|
||||
```
|
||||
url = "jdbc:mariadb://localhost:3306/docspelldb"
|
||||
```
|
||||
|
||||
H2
|
||||
```
|
||||
url = "jdbc:h2:///path/to/a/file.db;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
|
||||
```
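Because the H2 options are easy to forget, it may help to generate the url instead of typing it. The following Python sketch is not part of docspell; the helper name `h2_url` is made up for this example, and it assumes an absolute file path as in the example above.

```python
# Options that the text above requires for every H2 url.
H2_OPTIONS = ";MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"


def h2_url(db_file: str) -> str:
    """Build a JDBC url for an H2 database file (hypothetical helper).

    Assumes `db_file` is an absolute path such as /path/to/a/file.db.
    """
    if not db_file.startswith("/"):
        raise ValueError("expected an absolute path, got %r" % db_file)
    return "jdbc:h2://" + db_file + H2_OPTIONS
```

The result can be pasted as the `url` value for both `docspell.server.backend.jdbc` and `docspell.joex.jdbc`, so that both components point to the same file.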
|
||||
|
||||
### Bind
|
||||
|
||||
The host and port the http server binds to. This applies to both
|
||||
components. The joex component also exposes a small REST api to
|
||||
inspect its state and notify the scheduler.
|
||||
|
||||
```
|
||||
docspell.server.bind {
|
||||
address = localhost
|
||||
port = 7880
|
||||
}
|
||||
docspell.joex.bind {
|
||||
address = localhost
|
||||
port = 7878
|
||||
}
|
||||
```
|
||||
|
||||
By default, it binds to `localhost` and some predefined port. This
|
||||
must be changed, if components are on different machines.
|
||||
|
||||
### baseurl
|
||||
|
||||
The base url is an important setting that defines the http URL where
|
||||
the corresponding component can be reached. It applies to both
|
||||
components. For a joex component, the url must be resolvable from a
|
||||
REST server component. The REST server also uses this url to create
|
||||
absolute urls and to configure the authentication cookie.
|
||||
|
||||
By default it is built using the information from the `bind` setting.
|
||||
|
||||
|
||||
```
|
||||
docspell.server.baseurl = ...
|
||||
docspell.joex.baseurl = ...
|
||||
```
|
||||
|
||||
#### Examples
|
||||
|
||||
```
|
||||
docspell.server.baseurl = "https://docspell.example.com"
|
||||
docspell.joex.baseurl = "http://192.168.101.10"
|
||||
```
|
||||
|
||||
|
||||
### app-id
|
||||
|
||||
The `app-id` is the identifier of the corresponding instance. It *must
|
||||
be unique* for all instances. By default the REST server uses `rest1`
|
||||
and joex `joex1`. It is recommended to overwrite this setting to have
|
||||
an explicit and stable identifier.
|
||||
|
||||
```
|
||||
docspell.server.app-id = "rest1"
|
||||
docspell.joex.app-id = "joex1"
|
||||
```
|
||||
|
||||
### registration options
|
||||
|
||||
This defines if and how new users can create accounts. There are 3
|
||||
options:
|
||||
|
||||
- *closed* no new user can sign up
|
||||
- *open* new users can sign up
|
||||
- *invite* new users can sign up but require an invitation key
|
||||
|
||||
This applies only to the REST server component.
|
||||
|
||||
```
|
||||
docspell.server.signup {
|
||||
mode = "open"
|
||||
|
||||
# If mode == 'invite', a password must be provided to generate
|
||||
# invitation keys. It must not be empty.
|
||||
new-invite-password = ""
|
||||
|
||||
# If mode == 'invite', this is the period an invitation token is
|
||||
# considered valid.
|
||||
invite-time = "3 days"
|
||||
}
|
||||
```
|
||||
|
||||
The mode `invite` is intended to open the application only to some
|
||||
users. The admin can create these invitation keys and distribute them
|
||||
to the desired people. For this, the `new-invite-password` must be
|
||||
given. The idea is that only the person who installs docspell knows
|
||||
this. If it is not set, then invitation won't work. New invitation
|
||||
keys can be generated from within the web application or via REST
|
||||
calls (using `curl`, for example).
|
||||
|
||||
```
|
||||
curl -X POST -d '{"password":"blabla"}' "http://localhost:7880/api/v1/open/signup/newinvite"
|
||||
```
|
||||
|
||||
### Authentication
|
||||
|
||||
Authentication works in two ways:
|
||||
|
||||
- with an account-name / password pair
|
||||
- with an authentication token
|
||||
|
||||
The initial authentication must occur with an accountname/password
|
||||
pair. This will generate an authentication token which is valid for
|
||||
some time. Subsequent calls to secured routes can use this token. The
|
||||
token can be given as a normal http header or via a cookie header.
|
||||
|
||||
These settings apply only to the REST server.
|
||||
|
||||
```
|
||||
docspell.server.auth {
|
||||
server-secret = "hex:caffee" # or "b64:Y2FmZmVlCg=="
|
||||
session-valid = "5 minutes"
|
||||
}
|
||||
```
|
||||
|
||||
The `server-secret` is used to sign the token. If multiple REST
|
||||
servers are deployed, all must share the same server secret. Otherwise
|
||||
tokens from one instance are not valid on another instance. The secret
|
||||
can be given as Base64 encoded string or in hex form. Use the prefix
|
||||
`b64:` and `hex:`, respectively.
|
||||
|
||||
The `session-valid` determines how long a token is valid. This can be
|
||||
just some minutes, the web application obtains new ones
|
||||
periodically. So a short time is recommended.
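To see what the two secret notations stand for, the prefixes can be decoded by hand. The Python sketch below only illustrates the `hex:` and `b64:` notation; it is not how docspell reads its config, the name `decode_secret` is made up, and rejecting values without a prefix is an assumption of this example.

```python
import base64


def decode_secret(value: str) -> bytes:
    """Decode a secret written as "hex:..." or "b64:..." into raw bytes."""
    if value.startswith("hex:"):
        return bytes.fromhex(value[len("hex:"):])
    if value.startswith("b64:"):
        return base64.b64decode(value[len("b64:"):])
    # Assumption for this sketch: anything else is treated as an error.
    raise ValueError("secret must start with 'hex:' or 'b64:'")
```

Note that the two sample values in the config snippet above do not decode to the same bytes: `hex:caffee` is three bytes, while `b64:Y2FmZmVlCg==` is the text `caffee` followed by a newline. When running several REST servers, copy the exact same string to all of them.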
|
||||
|
||||
|
||||
## File Format
|
||||
|
||||
The format of the configuration files can be
|
||||
[HOCON](https://github.com/lightbend/config/blob/master/HOCON.md#hocon-human-optimized-config-object-notation),
|
||||
JSON or whatever the used [config
|
||||
library](https://github.com/lightbend/config) understands. The default
|
||||
values below are in HOCON format, which is recommended, since it
|
||||
allows comments and has some [advanced
|
||||
features](https://github.com/lightbend/config/blob/master/README.md#features-of-hocon). Please
|
||||
refer to their documentation for more on this.
|
||||
|
||||
Here are the default configurations.
|
||||
|
||||
|
||||
## Default Config
|
||||
|
||||
### Rest Server
|
||||
|
||||
```
|
||||
{% include server.conf %}
|
||||
```
|
||||
|
||||
### Joex
|
||||
|
||||
```
|
||||
{% include joex.conf %}
|
||||
```
|
||||
|
||||
## Logging
|
||||
|
||||
By default, docspell logs to stdout. This works well, when managed by
|
||||
systemd or other inits. Logging is done by
|
||||
[logback](https://logback.qos.ch/). Please refer to its documentation
|
||||
for how to configure logging.
|
||||
|
||||
If you created your logback config file, it can be added as argument
|
||||
to the executable using this syntax:
|
||||
|
||||
```
|
||||
/path/to/docspell -Dlogback.configurationFile=/path/to/your/logging-config-file
|
||||
```
|
||||
|
||||
To get started, the default config looks like this:
|
||||
|
||||
``` xml
|
||||
<configuration>
|
||||
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
|
||||
<withJansi>true</withJansi>
|
||||
|
||||
<encoder>
|
||||
<pattern>[%thread] %highlight(%-5level) %cyan(%logger{15}) - %msg %n</pattern>
|
||||
</encoder>
|
||||
</appender>
|
||||
|
||||
<logger name="docspell" level="debug" />
|
||||
<root level="INFO">
|
||||
<appender-ref ref="STDOUT" />
|
||||
</root>
|
||||
</configuration>
|
||||
```
|
||||
|
||||
The `<root level="INFO">` means that only log statements with at least level
|
||||
"INFO" will be printed. But the `<logger name="docspell"
|
||||
level="debug">` above says, that for loggers with name "docspell"
|
||||
statements with level "DEBUG" will be printed, too.
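The way a named logger overrides the root level can be tried out without touching logback. The following sketch uses Python's `logging` module purely as an analogy, since it also uses dot-separated logger names that inherit from their parents; it says nothing about logback internals, and the logger names other than `docspell` are invented.

```python
import logging

# Same idea as the XML above: root at INFO, everything below "docspell" at DEBUG.
logging.getLogger().setLevel(logging.INFO)
logging.getLogger("docspell").setLevel(logging.DEBUG)


def is_enabled(logger_name: str, level: int) -> bool:
    """Return True if a statement at `level` would be logged by this logger."""
    return logging.getLogger(logger_name).isEnabledFor(level)
```

In this analogy, `is_enabled("docspell.joex", logging.DEBUG)` is true because the logger inherits from `docspell`, while a logger outside that hierarchy only passes statements at `INFO` or above.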
|
77
modules/microsite/src/main/tut/doc/curate.md
Normal file
@ -0,0 +1,77 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Find and Review
|
||||
---
|
||||
|
||||
# {{page.title}}
|
||||
|
||||
Curating the items' meta data helps you find them later. This page
|
||||
describes how you can quickly go through those items and correct or
|
||||
amend with existing data.
|
||||
|
||||
## Select New items
|
||||
|
||||
After files have been uploaded and the job executor created the
|
||||
corresponding items, they will show up on the main page. All items,
|
||||
the job executor has created are initially marked as *New*. The option
|
||||
*only New* in the left search menu can be used to select only new
|
||||
items:
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-1.jpg">
|
||||
</div>
|
||||
|
||||
|
||||
## Check selected items
|
||||
|
||||
Then you can go through all new items and check their metadata: Click
|
||||
on the first item to open the detail view. This shows the documents
|
||||
and the meta data in the header.
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-2.jpg">
|
||||
</div>
|
||||
|
||||
|
||||
## Modify if necessary
|
||||
|
||||
To change something, click the *Edit* button in the menu above the
|
||||
document view. This will open a form next to your documents. You can
|
||||
compare the data with the documents and change as you like. Since the
|
||||
item status is *New*, you'll see the suggestions docspell found during
|
||||
processing. If there were multiple candidates, you can select another
|
||||
one by clicking its name in the suggestion list.
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-3.jpg">
|
||||
</div>
|
||||
|
||||
|
||||
When you change something in the form, it is immediately applied. Only
|
||||
when changing text fields, a click on the *Save* symbol next to the
|
||||
field is required.
|
||||
|
||||
|
||||
## Confirm
|
||||
|
||||
If everything looks good, click the *Confirm* button to confirm the
|
||||
current data. The *New* status goes away and also the suggestions are
|
||||
hidden in this state. You can always go back by clicking the
|
||||
*Unconfirm* button.
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-5.jpg">
|
||||
</div>
|
||||
|
||||
|
||||
## Proceed with next item
|
||||
|
||||
To look at the next item in the search results, click the *Next*
|
||||
button in the menu (next to the *Edit* button). Clicking *Next* will
|
||||
keep the current view, so you can continue checking the data. If you
|
||||
are on the last item, the view switches to the listing view when
|
||||
clicking *Next*.
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/docspell-curate-6.jpg">
|
||||
</div>
|
218
modules/microsite/src/main/tut/doc/install.md
Normal file
@ -0,0 +1,218 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Installation
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
This page contains detailed installation instructions. For a quick
|
||||
start, refer to [this page](../getit.html).
|
||||
|
||||
Docspell has been developed and tested on a GNU/Linux system. It may
|
||||
run on Windows and MacOS machines, too (ghostscript and tesseract are
|
||||
available on these systems). But I've never tried.
|
||||
|
||||
Docspell consists of two components that are started in separate
|
||||
processes:
|
||||
|
||||
1. *REST Server* This is the main application, providing the REST Api
|
||||
and the web application.
|
||||
2. *Joex* (job executor) This is the component that does the document
|
||||
processing.
|
||||
|
||||
They can run on multiple machines. All REST server and Joex instances
|
||||
should be on the same network. It is not strictly required that they
|
||||
can reach each other, but if they can, the components notify each other
|
||||
about new or done work.
|
||||
|
||||
While this is possible, the simple setup is to start both components
|
||||
once on the same machine.
|
||||
|
||||
The [download page](https://github.com/eikek/docspell/releases)
|
||||
provides pre-compiled packages and the [development page](dev.html)
|
||||
contains build instructions.
|
||||
|
||||
|
||||
## Prerequisites
|
||||
|
||||
The two components have one prerequisite in common: they both require
|
||||
Java to run. While this is the only requirement for the *REST server*,
|
||||
the *Joex* component requires some more external programs.
|
||||
|
||||
### Java
|
||||
|
||||
Very often, Java is already installed. You can check this by opening a
|
||||
terminal and typing `java -version`. Otherwise install Java using your
|
||||
package manager or see [this site](https://adoptopenjdk.net/) for
|
||||
other options.
|
||||
|
||||
It is enough to install the JRE. The JDK is required, if you want to
|
||||
build docspell from source.
|
||||
|
||||
Docspell has been tested with Java version 1.8 (or sometimes referred
|
||||
to as JRE 8 and JDK 8, respectively). The pre-built packages are also
|
||||
built using JDK 8. But a later version of Java should work as well.
|
||||
|
||||
The next tools are only required on machines running the *Joex*
|
||||
component.
|
||||
|
||||
### External Tools for Joex
|
||||
|
||||
- [Ghostscript](http://pages.cs.wisc.edu/~ghost/) (the `gs` command)
|
||||
is used to extract/convert PDF files into images that are then fed
|
||||
to ocr. It is available on most GNU/Linux distributions.
|
||||
- [Unpaper](https://github.com/Flameeyes/unpaper) is a program that
|
||||
pre-processes images to yield better results when doing ocr. If this
|
||||
is not installed, docspell tries without it. However, it is
|
||||
recommended to install, because it [improves text
|
||||
extraction](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
|
||||
(at the expense of a longer runtime).
|
||||
- [Tesseract](https://github.com/tesseract-ocr/tesseract) is the tool
|
||||
doing the OCR (converts images into text). It is a widely used open
|
||||
source OCR engine. Tesseract 3 and 4 should work with docspell; you
|
||||
can adapt the command line in the configuration file, if necessary.
|
||||
|
||||
|
||||
### Example Debian
|
||||
|
||||
On Debian this should install all joex requirements:
|
||||
|
||||
``` bash
|
||||
sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper
|
||||
```
|
||||
|
||||
## Database
|
||||
|
||||
Both components must have access to a SQL database. Docspell has
|
||||
support for these databases:
|
||||
|
||||
- PostgreSQL
|
||||
- MariaDB
|
||||
- H2
|
||||
|
||||
The H2 database is an interesting option for personal and mid-size
|
||||
setups, as it requires no additional work. It is integrated into
|
||||
docspell and works really well. It is also configured as the default
|
||||
database.
|
||||
|
||||
For large installations, PostgreSQL or MariaDB is recommended. Create
|
||||
a database and a user with enough privileges (read, write, create
|
||||
table) to that database.
|
||||
|
||||
When using H2, make sure that all components access the same database
|
||||
– the jdbc url must point to the same file. Then, it is important to
|
||||
add the options
|
||||
`;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE` at the end
|
||||
of the url. See the [default config](configure.html) for an example.
|
||||
|
||||
|
||||
## Installing from ZIP files
|
||||
|
||||
After extracting the zip files, you'll find a start script in the
|
||||
`bin/` folder.
|
||||
|
||||
|
||||
## Installing from DEB packages
|
||||
|
||||
The DEB packages can be installed on Debian, or Debian based Distros:
|
||||
|
||||
``` bash
|
||||
$ sudo dpkg -i docspell*.deb
|
||||
```
|
||||
|
||||
Then the start scripts are in your `$PATH`. Run `docspell-restserver`
|
||||
or `docspell-joex` from a terminal window.
|
||||
|
||||
The packages come with a systemd unit file that will be installed to
|
||||
autostart the services.
|
||||
|
||||
|
||||
## Running
|
||||
|
||||
Run the start script (in the corresponding `bin/` directory when using
|
||||
the zip files):
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver
|
||||
$ ./docspell-joex*/bin/docspell-joex
|
||||
```
|
||||
|
||||
This will startup both components using the default configuration. The
|
||||
configuration should be adapted to your needs. For example, the
|
||||
database connection is configured to use a H2 database in the `/tmp`
|
||||
directory. Please refer to the [configuration page](configure.html)
|
||||
for how to create a custom config file. Once you have your config
|
||||
file, simply pass it as argument to the command:
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver /path/to/server-config.conf
|
||||
$ ./docspell-joex*/bin/docspell-joex /path/to/joex-config.conf
|
||||
```
|
||||
|
||||
After starting the rest server, you can reach the web application at
|
||||
path `/app/index.html`, so using default values it would be
|
||||
`http://localhost:7880/app/index.html`.
|
||||
|
||||
You should be able to create a new account and sign in. Check the
|
||||
[configuration page](configure.html) to further customize docspell.
|
||||
|
||||
|
||||
### Options
|
||||
|
||||
The start scripts support some options to configure the JVM. One often
|
||||
used setting is the maximum heap size of the JVM. By default, java
|
||||
determines it based on properties of the current machine. You can
|
||||
specify it by giving java startup options to the command:
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -- /path/to/server-config.conf
|
||||
```
|
||||
|
||||
This would limit the maximum heap to 1GB. The double dash (`--`) separates
|
||||
internal options and the arguments to the program. Another frequently
|
||||
used option is to change the default temp directory. Usually it is
|
||||
`/tmp`, but it may be desired to have a dedicated temp directory,
|
||||
which can be configured:
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -Djava.io.tmpdir=/path/to/othertemp -- /path/to/server-config.conf
|
||||
```
|
||||
|
||||
The command:
|
||||
|
||||
```
|
||||
$ ./docspell-restserver*/bin/docspell-restserver -h
|
||||
```
|
||||
|
||||
gives an overview of supported options.
|
||||
|
||||
|
||||
## Raspberry Pi, and similar
|
||||
|
||||
Both components can run next to each other on a raspberry pi or
|
||||
similar device.
|
||||
|
||||
|
||||
### REST Server
|
||||
|
||||
The REST server component runs very well on the Raspberry Pi and
|
||||
similar devices. It doesn't require many resources, because the heavy
|
||||
work is done by the joex components.
|
||||
|
||||
|
||||
### Joex
|
||||
|
||||
Running the joex component on the Raspberry Pi is possible, but will
|
||||
result in long processing times. Tested on a RPi model 3 (4 cores, 1G
|
||||
RAM) processing a PDF (scanned with 300dpi) with two pages took
|
||||
9:52. You can speed it up considerably by uninstalling the `unpaper`
|
||||
command, because this step takes quite long. This, of course, reduces
|
||||
the quality of OCR. But without `unpaper` the same sample pdf was then
|
||||
processed in 1:24, about 8.5 minutes faster.
|
||||
|
||||
You should limit the joex pool size to 1 and, depending on your model
|
||||
and the amount of RAM, set a heap size of at least 500M
|
||||
(`-J-Xmx500M`).
|
||||
|
||||
For personal setups, when you don't need the processing results asap,
|
||||
this can work well enough.
|
155
modules/microsite/src/main/tut/doc/joex.md
Normal file
@ -0,0 +1,155 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Joex
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
Joex is short for *Job Executor* and it is the component managing long
|
||||
running tasks in docspell. One of these long running tasks is the file
|
||||
processing task.
|
||||
|
||||
One joex component handles the processing of all files of all
|
||||
collectives/users. It requires many more resources than the rest
|
||||
server component. Therefore the number of jobs that can run in
|
||||
parallel is limited with respect to the hardware it is running on.
|
||||
|
||||
For larger installations, it is probably better to run several joex
|
||||
components on different machines. That works out of the box, as long
|
||||
as all components point to the same database and use different
|
||||
`app-id`s (see [configuring docspell](./configure.html)).
|
||||
|
||||
When files are submitted to docspell, they are stored in the database
|
||||
and all known joex components are notified about new work. Then they
|
||||
compete on getting the next job from the queue. After a job finishes
|
||||
and no job is waiting in the queue, joex will sleep until notified
|
||||
again. It will also periodically notify itself as a fallback.
|
||||
|
||||
## Scheduler and Queue
|
||||
|
||||
The scheduler is the part that runs and monitors the long running
|
||||
jobs. It works together with the job queue, which defines what job to
|
||||
take next.
|
||||
|
||||
To create a somewhat fair distribution among multiple collectives, a
|
||||
collective is first chosen in a simple round-robin way. Then a job
|
||||
from this collective is chosen by priority.
|
||||
|
||||
There are only two priorities: low and high. A simple *counting
|
||||
scheme* determines if a low prio or high prio job is selected
|
||||
next. The default is `4, 1`, meaning to first select 4 high priority
|
||||
jobs and then 1 low priority job, then starting over. If no such job
|
||||
exists, it falls back to the other priority.
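The counting scheme is easier to follow with a small simulation. The Python sketch below is not docspell's scheduler; it only replays the rule described above for the jobs of a single collective. The function name `pick_order` and the job names are invented, and it assumes that all jobs are already queued when selection starts.

```python
from collections import deque
from typing import List, Tuple


def pick_order(high: List[str], low: List[str],
               scheme: Tuple[int, int] = (4, 1)) -> List[str]:
    """Return the order in which queued jobs would be taken.

    `scheme` is (number of high prio picks, number of low prio picks)
    per round, e.g. (4, 1) for the default "4,1".
    """
    high_q, low_q = deque(high), deque(low)
    order: List[str] = []
    while high_q or low_q:
        phases = ((high_q, low_q, scheme[0]), (low_q, high_q, scheme[1]))
        for wanted, other, count in phases:
            for _ in range(count):
                if wanted:
                    order.append(wanted.popleft())
                elif other:
                    # No job of the wanted priority: fall back to the other one.
                    order.append(other.popleft())
                else:
                    break
    return order
```

For five high priority jobs `h1` to `h5` and two low priority jobs `l1` and `l2`, this sketch takes `h1, h2, h3, h4`, then `l1`, then `h5`, and fills the remaining high priority slot with `l2`.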
|
||||
|
||||
The priority can be set on a *Source* (see
|
||||
[uploads](uploading.html)). Uploading through the web application will
|
||||
always use priority *high*. The idea is that while logged in, jobs are
|
||||
more important than those submitted when not logged in.
|
||||
|
||||
|
||||
## Scheduler Config
|
||||
|
||||
The relevant part of the config file regarding the scheduler is shown
|
||||
below with some explanations.
|
||||
|
||||
```
|
||||
docspell.joex {
|
||||
# other settings left out for brevity
|
||||
|
||||
scheduler {
|
||||
|
||||
# Number of processing allowed in parallel.
|
||||
pool-size = 2
|
||||
|
||||
# A counting scheme determines the ratio of how high- and low-prio
|
||||
# jobs are run. For example: 4,1 means run 4 high prio jobs, then
|
||||
# 1 low prio and then start over.
|
||||
counting-scheme = "4,1"
|
||||
|
||||
# How often a failed job should be retried until it enters failed
|
||||
# state. If a job fails, it becomes "stuck" and will be retried
|
||||
# after a delay.
|
||||
retries = 5
|
||||
|
||||
# The delay until the next try is performed for a failed job. This
|
||||
# delay is increased exponentially with the number of retries.
|
||||
retry-delay = "1 minute"
|
||||
|
||||
# The queue size of log statements from a job.
|
||||
log-buffer-size = 500
|
||||
|
||||
# If no job is left in the queue, the scheduler will wait until a
|
||||
# notify is requested (using the REST interface). To also retry
|
||||
# stuck jobs, it will notify itself periodically.
|
||||
wakeup-period = "30 minutes"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The `pool-size` setting determines how many jobs run in parallel. You
|
||||
need to play with this setting on your machine to find an optimal
|
||||
value.
|
||||
|
||||
The `counting-scheme` determines for all collectives how to select
|
||||
between high and low priority jobs; as explained above. It is
|
||||
currently not possible to define that per collective.
|
||||
|
||||
If a job fails, it will be set to *stuck* state and retried by the
|
||||
scheduler. The `retries` setting defines how many times a job is
|
||||
retried until it enters the final *failed* state. The scheduler waits
|
||||
some time until running the next try. This delay is given by
|
||||
`retry-delay`. This is the initial delay, the time until the first
|
||||
re-try (the second attempt). This time increases exponentially with
|
||||
the number of retries.
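To get a feeling for how long a failing job stays in the *stuck* state, the delays can be tabulated. The text only says that the delay grows exponentially, so the doubling factor in this Python sketch is an assumption, and `retry_delays` is a made-up name; the real scheduler may use a different base.

```python
from datetime import timedelta
from typing import List


def retry_delays(initial: timedelta, retries: int, factor: int = 2) -> List[timedelta]:
    """List the waiting time before each retry.

    Assumption: the delay is multiplied by `factor` after every retry,
    starting with `initial` before the first retry.
    """
    return [initial * factor ** n for n in range(retries)]
```

With the defaults shown above (`retry-delay = "1 minute"`, `retries = 5`) and a doubling delay, the waits would be 1, 2, 4, 8 and 16 minutes, so a job would enter the *failed* state after roughly half an hour plus its own run time.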
|
||||
|
||||
The jobs will log about what they do, which is picked up and stored
|
||||
into the database asynchronously. The log events are buffered in a
|
||||
queue and another thread will consume this queue and store them in the
|
||||
database. The `log-buffer-size` determines the size of the queue.
|
||||
|
||||
At last, there is a `wakeup-period` that determines at what interval
|
||||
the joex component notifies itself to look for new jobs. If jobs get
|
||||
stuck and joex is not notified externally, it could fail to
|
||||
retry. Also, since networks are not reliable, a notification may not
|
||||
reach a joex component. This periodic wakeup is just to ensure that
|
||||
jobs are eventually run.
|
||||
|
||||
|
||||
## Starting on demand
|
||||
|
||||
The job executor and rest server can be started multiple times. This
|
||||
is especially useful for the job executor. For example, when
|
||||
submitting a lot of files in a short time, you can simply startup more
|
||||
job executors on other computers on your network. Maybe use your
|
||||
laptop to help with processing for a while.
|
||||
|
||||
You have to make sure, that all connect to the same database, and that
|
||||
all have unique `app-id`s.
|
||||
|
||||
Once the files have been processed you can stop the additional
|
||||
executors.
|
||||
|
||||
## Shutting down
|
||||
|
||||
If a job executor is sleeping and not executing any jobs, you can just
|
||||
quit using SIGTERM or `Ctrl-C` when running in a terminal. But if
|
||||
there are jobs currently executing, it is advisable to initiate a
|
||||
graceful shutdown. The job executor will then stop taking new jobs
|
||||
from the queue but it will wait until all running jobs have completed
|
||||
before shutting down.
|
||||
|
||||
This can be done by sending a http POST request to the api of this job
|
||||
executor:
|
||||
|
||||
```
|
||||
curl -XPOST "http://localhost:7878/api/v1/shutdownAndExit"
|
||||
```
|
||||
|
||||
If joex receives this request it will immediately stop taking new jobs
|
||||
and it will quit when all running jobs are done.
|
||||
|
||||
If a job executor gets terminated while there are running jobs, the
|
||||
jobs are still in the current state marked to be executed by this job
|
||||
executor. In order to fix this, start the job executor again. It will
|
||||
search all jobs that are marked with its id and put them back into
|
||||
waiting state. Then send a graceful shutdown request as shown above.
|
87
modules/microsite/src/main/tut/doc/metadata.md
Normal file
@ -0,0 +1,87 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Adding Meta Data
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
## Meta Data
|
||||
|
||||
The processing can be controlled implicitly by the provided meta
|
||||
data. The *Meta Data* page lets you manage this meta data. You can
|
||||
create the following:
|
||||
|
||||
- Tags
|
||||
- Organizations
|
||||
- Persons
|
||||
- Equipments
|
||||
|
||||
### Tags
|
||||
|
||||
Items can be tagged with multiple custom tags (aka labels). This
|
||||
allows to describe many different workflows people may have with their
|
||||
documents.
|
||||
|
||||
A tag can have a *category*. This is meant to group tags together. For
|
||||
example, you may want to have a tag category *doctype* that is
|
||||
comprised of tags like *bill*, *contract*, *receipt* and so on. Or for
|
||||
workflows, a tag category *state* may exist that includes tags like
|
||||
*Todo* or *Waiting*. Or you can tag items with user names to provide
|
||||
"assignment" semantics. Docspell doesn't propose any workflow, but it
|
||||
can help to implement some.
|
||||
|
||||
The tags are *not* taken into account when processing. Docspell will
|
||||
not automatically associate tags with your items. The tags are only
|
||||
meant to be used manually.
|
||||
|
||||
|
||||
### Organization and Person
|
||||
|
||||
The organization entity represents a non-personal (organization or
|
||||
company) correspondent of an item. Docspell will choose one or more
|
||||
organizations when processing documents and associate the "best" match
|
||||
with your item.
|
||||
|
||||
The person entity can appear in two roles: It may be a correspondent
|
||||
or the person an item is about. So a person is either a correspondent
|
||||
or a concerning person. Docspell cannot know which person is which,
|
||||
therefore you need to tell this by checking the box "Use for
|
||||
concerning person suggestion only". If this is checked, docspell will
|
||||
use this person only to suggest a concerning person. Otherwise the
|
||||
person is used only for correspondent suggestions.
|
||||
|
||||
Document processing uses the following properties:
|
||||
|
||||
- name
|
||||
- websites
|
||||
- e-mails
|
||||
|
||||
The websites and e-mails can be added as contact information. If these
|
||||
three are present, you should get good matches from docspell. All
|
||||
other fields of an organization and person are not used during
|
||||
document processing. They might be useful when using this as a real
|
||||
address book.
|
||||
|
||||
|
||||
### Equipment
|
||||
|
||||
The equipment entity is almost like a tag. In fact, it could be
|
||||
replaced by a tag with a specific known category. The difference is
|
||||
that docspell will try to find a match and associate it with your
|
||||
item. The equipment represents non-personal things that an item is
|
||||
about. Examples are: bills or insurances for *cars*, contracts for
|
||||
*houses* or *flats*.
|
||||
|
||||
Equipments don't have contact information, so the only property that
|
||||
is used to find matches during document processing is its name.
|
||||
|
||||
|
||||
## Document Language
|
||||
|
||||
An important setting is the language of your documents. This helps OCR
|
||||
and text analysis. You can select between English and German
|
||||
currently.
|
||||
|
||||
Go to the *Collective Settings* page and click *Document
|
||||
Language*. This will set the language for all your documents. It is
|
||||
not (yet) possible to specify it when uploading.
|
40
modules/microsite/src/main/tut/doc/processing.md
Normal file
@ -0,0 +1,40 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Processing Queue
|
||||
---
|
||||
|
||||
# {{ page.title }}
|
||||
|
||||
|
||||
The page *Processing Queue* shows the current state of document
|
||||
processing for your uploads.
|
||||
|
||||
At the top of the page a list of running jobs is shown. Below that,
|
||||
the left column shows jobs that wait to be picked up by the job
|
||||
executor. On the right are finished jobs. The number of finished jobs
|
||||
is cut to some maximum and is also restricted by a date range. The
|
||||
page refreshes itself automatically to show the progress.
|
||||
|
||||
Example screenshot:
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/processing-queue.jpg">
|
||||
</div>
|
||||
|
||||
You can cancel running jobs or remove waiting ones from the queue. If
|
||||
you click on the small file symbol on finished jobs, you can inspect
|
||||
its log messages again. A running job displays the job executor id
|
||||
that executes the job.
|
||||
|
||||
Currently the job queue executes just the document processing tasks,
|
||||
but it may be used for other long running tasks in the future.
|
||||
|
||||
Since job executors are shared among all collectives, it may happen
|
||||
that a job waits for some time until it is picked up by a job
|
||||
executor. You can always start more job executors to help out.
|
||||
|
||||
If a job fails, it is retried after some time. Only if it fails too
|
||||
often (this is configurable) is it finished with the *failed* state. If
|
||||
processing finally fails, the item is still created, just without
|
||||
suggestions. But if processing is cancelled by the user, the item is
|
||||
not created.
|
130
modules/microsite/src/main/tut/doc/uploading.md
Normal file
@ -0,0 +1,130 @@
|
||||
---
|
||||
layout: docs
|
||||
title: Uploads
|
||||
---
|
||||
|
||||
# {{page.title}}
|
||||
|
||||
|
||||
This page describes how files can get into docspell. Technically,
|
||||
there is just one way: via HTTP multipart/form-data requests.
|
||||
|
||||
|
||||
## Authenticated Upload
|
||||
|
||||
From within the web application there is the "Upload Files"
|
||||
page. There you can select multiple files to upload. You can also
|
||||
specify whether these files should become one item or if every file is
|
||||
a separate item.
|
||||
|
||||
When you click "Submit" the files are uploaded and stored in the
|
||||
database. Then the job executors are notified, and they immediately
|
||||
start processing them.
|
||||
|
||||
Go to the top-right menu and click "Processing Queue" to see the
|
||||
current state.
|
||||
|
||||
This obviously requires an authenticated user. While this is handy for
|
||||
ad-hoc uploads, it is inconvenient for automation with custom
|
||||
scripts. For this the next variant exists.
|
||||
|
||||
## Anonymous Upload
|
||||
|
||||
It is also possible to upload files without authentication. This
|
||||
should make tools that interact with docspell much easier to write.
|
||||
|
||||
|
||||
### Creating Anonymous Uploads
|
||||
|
||||
Go to "Collective Settings" and then to the "Source" tab. A *Source*
|
||||
identifies an endpoint where files can be uploaded
|
||||
anonymously. Creating a new source creates a long unique id which is
|
||||
part of a URL that can be used to upload files. You can choose any
|
||||
time to deactivate or delete the source at which point uploading is
|
||||
not possible anymore. The idea is to give this URL away safely. You
|
||||
can delete it any time and no passwords or secrets are visible, even
|
||||
your username is not visible.
|
||||
|
||||
Example screenshot:
|
||||
|
||||
<div class="thumbnail">
|
||||
<img src="../img/sources-form.jpg">
|
||||
</div>
|
||||
|
||||
This example shows a source with name "test". It defines two urls:
|
||||
|
||||
- `/app/index.html#/upload/<id>`
|
||||
- `/api/v1/open/upload/item/<id>`
|
||||
|
||||
The first points to a web page where anyone can upload files into
|
||||
your account. You could give this url to people for sending files
|
||||
directly into your docspell.
|
||||
|
||||
The second url is the API url, which accepts the requests to upload
|
||||
files (which is used by the first url).
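
Both urls follow directly from the source id. As a small sketch (the helper name `source_urls` and the dictionary keys are illustrative, not part of docspell), they can be derived like this:

```python
from typing import Dict


def source_urls(base_url: str, source_id: str) -> Dict[str, str]:
    """Return the upload page url and the API url for a given source id."""
    base = base_url.rstrip("/")
    return {
        # Web page where files can be uploaded through the browser.
        "page": f"{base}/app/index.html#/upload/{source_id}",
        # API endpoint that accepts the multipart/form-data upload request.
        "api": f"{base}/api/v1/open/upload/item/{source_id}",
    }
```

Here `base_url` is whatever address your REST server is reachable at, for example `http://localhost:7880` in a local setup.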
|
||||
|
||||
For example, this url can be used to upload files with curl:
|
||||
|
||||
``` bash
|
||||
$ curl -XPOST -F file=@test.pdf http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
{"success":true,"message":"Files submitted."}
|
||||
```
|
||||
|
||||
You could add more `-F file=@/path/to/your/file.pdf` to upload
|
||||
multiple files (note, the `@` is required by curl, so it knows that
|
||||
the following is a file).
|
||||
|
||||
When files are uploaded to a source endpoint, the items resulting
|
||||
from these uploads are marked with the name of the source. So you know
|
||||
which source an item originated from.
|
||||
|
||||
If files are uploaded using the web application's *Upload files* page,
|
||||
the source is implicitly set to `webapp`. If you also want to let
|
||||
docspell count the files uploaded through the web interface, just
|
||||
create a source (can be inactive) with that name (`webapp`).
|
||||
|
||||
|
||||
## The Request
|
||||
|
||||
This gives more details about the request for uploads. It is an HTTP
|
||||
`multipart/form-data` request, with two possible fields:
|
||||
|
||||
- meta
|
||||
- file
|
||||
|
||||
The `file` field can appear multiple times and is required at least
|
||||
once. It is the part containing the file to upload.
|
||||
|
||||
The `meta` part is completely optional and can define additional meta
|
||||
data that docspell uses to create items from the given files. It
|
||||
lets you transfer structured information together with the
|
||||
unstructured binary files.
|
||||
|
||||
The `meta` content must be `application/json` containing this
|
||||
structure:
|
||||
|
||||
```
|
||||
{ multiple: Bool
|
||||
, direction: Maybe String
|
||||
}
|
||||
```
|
||||
|
||||
The `multiple` property is by default `true`. It means that each file
|
||||
in the upload request corresponds to a single item. An upload with 5
|
||||
files will result in 5 items created. If it is `false`, then docspell
|
||||
will create just one item, that will then contain all files.
|
||||
|
||||
Furthermore, the direction of the document (one of `incoming` or
|
||||
`outgoing`) can be given. It is optional; it can be left out or set to
|
||||
`null`.
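
To make the rules for `meta` concrete, here is a minimal sketch that produces the JSON text for this part. The function name `build_meta` is hypothetical; the defaults and allowed values mirror the description above, and the validation is done on the client side only.

```python
import json
from typing import Optional


def build_meta(multiple: bool = True, direction: Optional[str] = None) -> str:
    """Return the JSON text for the `meta` part of an upload request."""
    if direction not in (None, "incoming", "outgoing"):
        raise ValueError("direction must be 'incoming', 'outgoing' or None")
    meta = {"multiple": multiple}
    # The direction is optional, so it is simply left out when not given.
    if direction is not None:
        meta["direction"] = direction
    return json.dumps(meta)
```

For example, `build_meta(False, "outgoing")` describes a single outgoing item that contains all uploaded files.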
|
||||
|
||||
This kind of request is very common and most programming languages
|
||||
have support for this. For example, here is another curl command
|
||||
uploading two files with meta data:
|
||||
|
||||
```
|
||||
curl -XPOST -F meta='{"multiple":false, "direction": "outgoing"}' \
|
||||
-F file=@letter-en-source.pdf \
|
||||
-F file=@letter-de-source.pdf \
|
||||
http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
|
||||
```
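
As an illustration of doing the same without curl, the following sketch builds the `multipart/form-data` body by hand with the Python standard library. It assumes the field names `meta` and `file` described above; the function names are made up for this example, and file names containing quotes or line breaks are not handled.

```python
import json
import urllib.request
import uuid
from typing import Dict, Optional, Tuple


def encode_upload(files: Dict[str, bytes], meta: Optional[dict] = None) -> Tuple[bytes, str]:
    """Encode files (file name -> content) and optional meta data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    if meta is not None:
        # The meta part carries JSON, as required by the description above.
        head = (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="meta"\r\n'
            "Content-Type: application/json\r\n\r\n"
        )
        parts.append(head.encode() + json.dumps(meta).encode() + b"\r\n")
    for name, content in files.items():
        # One "file" part per file; the field may appear multiple times.
        head = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        )
        parts.append(head.encode() + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def upload_request(url: str, files: Dict[str, bytes],
                   meta: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) the POST request for an upload url."""
    body, content_type = encode_upload(files, meta)
    return urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": content_type}
    )
```

Passing the result of `upload_request` to `urllib.request.urlopen` sends the files; the url is the API url of one of your sources.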
|
55
modules/microsite/src/main/tut/getit.md
Normal file
@ -0,0 +1,55 @@
|
||||
---
|
||||
layout: home
|
||||
position: 3
|
||||
section: quickstart
|
||||
title: Quickstart
|
||||
technologies:
|
||||
- first: ["Scala + Elm", "Backend is in Scala with Cats/Fs2, Webapp in Elm"]
|
||||
- second: ["Unpaper + Tesseract", "Text is extracted using OCR provided by tesseract"]
|
||||
- third: ["Stanford NLP", "Documents are analyzed using Stanford NLP classifiers"]
|
||||
---
|
||||
|
||||
## Download
|
||||
|
||||
You can download pre-compiled binaries from the [Release
|
||||
Page](https://github.com/eikek/docspell/releases). There are `deb`
|
||||
packages and generic zip files.
|
||||
|
||||
You need to download the two files:
|
||||
|
||||
- [docspell-restserver-{{site.version}}.zip](https://github.com/eikek/docspell/releases/download/v{{site.version}}/docspell-restserver-{{site.version}}.zip)
|
||||
- [docspell-joex-{{site.version}}.zip](https://github.com/eikek/docspell/releases/download/v{{site.version}}/docspell-joex-{{site.version}}.zip)
|
||||
|
||||
|
||||
## Prerequisite
|
||||
|
||||
Install Java (use your package manager or look
|
||||
[here](https://adoptopenjdk.net/)),
|
||||
[tesseract](https://github.com/tesseract-ocr/tesseract),
|
||||
[ghostscript](http://pages.cs.wisc.edu/~ghost/) and possibly
|
||||
[unpaper](https://github.com/Flameeyes/unpaper). The last is not
|
||||
really required, but improves OCR.
|
||||
|
||||
|
||||
## Running
|
||||
|
||||
1. Unzip both files:
|
||||
``` bash
|
||||
$ unzip docspell-*.zip
|
||||
```
|
||||
2. Open two terminal windows and navigate to the directory
|
||||
containing the zip files.
|
||||
3. Start both components by executing:
|
||||
``` bash
|
||||
$ ./docspell-restserver*/bin/docspell-restserver
|
||||
```
|
||||
in one terminal and
|
||||
``` bash
|
||||
$ ./docspell-joex*/bin/docspell-joex
|
||||
```
|
||||
in the other.
|
||||
4. Point your browser to: <http://localhost:7880/app/index.html>
|
||||
5. Register a new account, sign in and try it.
|
||||
|
||||
Check the [documentation](doc.html) for more information on how to use
|
||||
docspell.
|
49
modules/microsite/src/main/tut/index.md
Normal file
@ -0,0 +1,49 @@
|
||||
---
|
||||
layout: home
|
||||
position: 1
|
||||
section: home
|
||||
title: Home
|
||||
technologies:
|
||||
- first: ["Scala + Elm", "Backend is in Scala with Cats/Fs2, Webapp in Elm"]
|
||||
- second: ["Unpaper + Tesseract", "Text is extracted using OCR provided by tesseract"]
|
||||
- third: ["Stanford NLP", "Documents are analyzed using Stanford NLP classifiers"]
|
||||
---
|
||||
|
||||
# A Document Organizer
|
||||
|
||||
Docspell is a simple tool to cope with your piles of (digitized) paper
|
||||
documents. You'll need a scanner to convert your papers into PDF
|
||||
files. Docspell can then assist in organizing the resulting PDF files
|
||||
easily. Its main goal is to efficiently support two major use cases:
|
||||
|
||||
1. **Stowing documents away**: Most of the time documents are received
|
||||
or created. It should be *fast* to stow them away, knowing that
|
||||
they can be found if necessary.
|
||||
|
||||
Upload the PDF files to docspell. Docspell finds meta data and will
|
||||
link them to your document, automatically. There may be false
|
||||
positives, so a short review is recommended. Though even if not,
|
||||
the results are not that bad.
|
||||
2. **Finding them**: If there is a document needed, you can search for
|
||||
it. Usually, restricting to a date range and a correspondent will
|
||||
result in only a few documents to sift through. Alternatively, you
|
||||
can add your own tags, names etc to better match your workflow.
|
||||
|
||||
The meta data that docspell uses is provided by you. You need to
|
||||
maintain a list of correspondents and maybe other things you want
|
||||
docspell to draw suggestions from. So if a new document arrives (from
|
||||
an unknown correspondent) then you would add a new entry to your meta
|
||||
data and link it manually to the document. But the next time, docspell
|
||||
will do it for you.
|
||||
|
||||
Docspell is *not* a document management system. There are a lot of
|
||||
these systems that have many more features. Docspell's focus is around
|
||||
the two use cases described above, which already is quite useful.
|
||||
|
||||
Check out the quick [demo](demo.html) to get a first impression and the
|
||||
[quickstart](getit.html) page if you want to try it out.
|
||||
|
||||
## License
|
||||
|
||||
This project is distributed under the
|
||||
[GPLv3](http://www.gnu.org/licenses/gpl-3.0.html)
|