Upgrade microsite

This commit is contained in:
Eike Kettner
2019-12-29 23:37:32 +01:00
parent 2001cca88b
commit 57e274e2b0
42 changed files with 599 additions and 70 deletions

View File

@ -0,0 +1,261 @@
---
layout: docs
title: Configuring
---
# {{ page.title }}
Docspell's executable can take one argument a configuration file. If
that is not given, the defaults are used. The config file overrides
default values, so only values that differ from the defaults are
necessary.
This applies to the restserver and the joex as well.
## Important Config Options
The configuration of both components uses separate namespaces. The
configuration for the REST server is below `docspell.server`, while
the one for joex is below `docspell.joex`.
### JDBC
This configures the connection to the database. This has to be
specified for the rest server and joex. By default, a H2 database in
the current `/tmp` directory is configured.
The config looks like this (both components):
```
docspell.joex.jdbc {
url = ...
user = ...
password = ...
}
docspell.server.backend.jdbc {
url = ...
user = ...
password = ...
}
```
The `url` is the connection to the database. It must start with
`jdbc`, followed by name of the database. The rest is specific to the
database used: it is either a path to a file for H2 or a host/database
url for MariaDB and PostgreSQL.
When using H2, the user is `sa`, the password can be empty and the url
must include these options:
```
;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE
```
#### Examples
PostgreSQL:
```
url = "jdbc:postgresql://localhost:5432/docspelldb"
```
MariaDB:
```
url = "jdbc:mariadb://localhost:3306/docspelldb"
```
H2
```
url = "jdbc:h2:///path/to/a/file.db;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
```
### Bind
The host and port the http server binds to. This applies to both
components. The joex component also exposes a small REST api to
inspect its state and notify the scheduler.
```
docspell.server.bind {
address = localhost
port = 7880
}
docspell.joex.bind {
address = localhost
port = 7878
}
```
By default, it binds to `localhost` and some predefined port. This
must be changed, if components are on different machines.
### baseurl
The base url is an important setting that defines the http URL where
the corresponding component can be reached. It applies to both
components. For a joex component, the url must be resolvable from a
REST server component. The REST server also uses this url to create
absolute urls and to configure the authenication cookie.
By default it is build using the information from the `bind` setting.
```
docspell.server.baseurl = ...
docspell.joex.baseurl = ...
```
#### Examples
```
docspell.server.baseurl = "https://docspell.example.com"
docspell.joex.baseurl = "http://192.168.101.10"
```
### app-id
The `app-id` is the identifier of the corresponding instance. It *must
be unique* for all instances. By default the REST server uses `rest1`
and joex `joex1`. It is recommended to overwrite this setting to have
an explicit and stable identifier.
```
docspell.server.app-id = "rest1"
docspell.joex.app-id = "joex1"
```
### registration options
This defines if and how new users can create accounts. There are 3
options:
- *closed* no new user can sign up
- *open* new users can sign up
- *invite* new users can sign up but require an invitation key
This applies only to the REST sevrer component.
```
docspell.server.signup {
mode = "open"
# If mode == 'invite', a password must be provided to generate
# invitation keys. It must not be empty.
new-invite-password = ""
# If mode == 'invite', this is the period an invitation token is
# considered valid.
invite-time = "3 days"
}
```
The mode `invite` is intended to open the application only to some
users. The admin can create these invitation keys and distribute them
to the desired people. For this, the `new-invite-password` must be
given. The idea is that only the person who installs docspell knows
this. If it is not set, then invitation won't work. New invitation
keys can be generated from within the web application or via REST
calls (using `curl`, for example).
```
curl -X POST -d '{"password":"blabla"}' "http://localhost:7880/api/v1/open/signup/newinvite"
```
### Authentication
Authentication works in two ways:
- with an account-name / password pair
- with an authentication token
The initial authentication must occur with an accountname/password
pair. This will generate an authentication token which is valid for a
some time. Subsequent calls to secured routes can use this token. The
token can be given as a normal http header or via a cookie header.
These settings apply only to the REST server.
```
docspell.server.auth {
server-secret = "hex:caffee" # or "b64:Y2FmZmVlCg=="
session-valid = "5 minutes"
}
```
The `server-secret` is used to sign the token. If multiple REST
servers are deployed, all must share the same server secret. Otherwise
tokens from one instance are not valid on another instance. The secret
can be given as Base64 encoded string or in hex form. Use the prefix
`hex:` and `b64:`, respectively.
The `session-valid` deterimens how long a token is valid. This can be
just some minutes, the web application obtains new ones
periodically. So a short time is recommended.
## File Format
The format of the configuration files can be
[HOCON](https://github.com/lightbend/config/blob/master/HOCON.md#hocon-human-optimized-config-object-notation),
JSON or whatever the used [config
library](https://github.com/lightbend/config) understands. The default
values below are in HOCON format, which is recommended, since it
allows comments and has some [advanced
features](https://github.com/lightbend/config/blob/master/README.md#features-of-hocon). Please
refer to their documentation for more on this.
Here are the default configurations.
## Default Config
### Rest Server
```
{% include server.conf %}
```
### Joex
```
{% include joex.conf %}
```
## Logging
By default, docspell logs to stdout. This works well, when managed by
systemd or other inits. Logging is done by
[logback](https://logback.qos.ch/). Please refer to its documentation
for how to configure logging.
If you created your logback config file, it can be added as argument
to the executable using this syntax:
```
/path/to/docspell -Dlogback.configurationFile=/path/to/your/logging-config-file
```
To get started, the default config looks like this:
``` xml
<configuration>
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<withJansi>true</withJansi>
<encoder>
<pattern>[%thread] %highlight(%-5level) %cyan(%logger{15}) - %msg %n</pattern>
</encoder>
</appender>
<logger name="docspell" level="debug" />
<root level="INFO">
<appender-ref ref="STDOUT" />
</root>
</configuration>
```
The `<root level="INFO">` means, that only log statements with level
"INFO" will be printed. But the `<logger name="docspell"
level="debug">` above says, that for loggers with name "docspell"
statements with level "DEBUG" will be printed, too.

View File

@ -0,0 +1,77 @@
---
layout: docs
title: Find and Review
---
# {{page.title}}
Curating the items meta data helps finding them later. This page
describes how you can quickly go through those items and correct or
amend with existing data.
## Select New items
After files have been uploaded and the job executor created the
corresponding items, they will show up on the main page. All items,
the job executor has created are initially marked as *New*. The option
*only New* in the left search menu can be used to select only new
items:
<div class="thumbnail">
<img src="../img/docspell-curate-1.jpg">
</div>
## Check selected items
Then you can go through all new items and check their metadata: Click
on the first item to open the detail view. This shows the documents
and the meta data in the header.
<div class="thumbnail">
<img src="../img/docspell-curate-2.jpg">
</div>
## Modify if necessary
To change something, click the *Edit* button in the menu above the
document view. This will open a form next to your documents. You can
compare the data with the documents and change as you like. Since the
item status is *New*, you'll see the suggestions docspell found during
processing. If there were multiple candidates, you can select another
one by clicking its name in the suggestion list.
<div class="thumbnail">
<img src="../img/docspell-curate-3.jpg">
</div>
When you change something in the form, it is immediatly applied. Only
when changing text fields, a click on the *Save* symbol next to the
field is required.
## Confirm
If everything looks good, click the *Confirm* button to confirm the
current data. The *New* status goes away and also the suggestions are
hidden in this state. You can always go back by clicking the
*Unconfirm* button.
<div class="thumbnail">
<img src="../img/docspell-curate-5.jpg">
</div>
## Proceed with next item
To look at the next item in the search results, click the *Next*
button in the menu (next to the *Edit* button). Clicking next, will
keep the current view, so you can continue checking the data. If you
are on the last item, the view switches to the listing view when
clicking *Next*.
<div class="thumbnail">
<img src="../img/docspell-curate-6.jpg">
</div>

View File

@ -0,0 +1,218 @@
---
layout: docs
title: Installation
---
# {{ page.title }}
This page contains detailed installation instructions. For a quick
start, refer to [this page](../getit.html).
Docspell has been developed and tested on a GNU/Linux system. It may
run on Windows and MacOS machines, too (ghostscript and tesseract are
available on these systems). But I've never tried.
Docspell consists of two components that are started in separate
processes:
1. *REST Server* This is the main application, providing the REST Api
and the web application.
2. *Joex* (job executor) This is the component that does the document
processing.
They can run on multiple machines. All REST server and Joex instances
should be on the same network. It is not strictly required that they
can reach each other, but the components can then notify themselves
about new or done work.
While this is possible, the simple setup is to start both components
once on the same machine.
The [download page](https://github.com/eikek/docspell/releases)
provides pre-compiled packages and the [development page](dev.html)
contains build instructions.
## Prerequisites
The two components have one prerequisite in common: they both require
Java to run. While this is the only requirement for the *REST server*,
the *Joex* components requires some more external programs.
### Java
Very often, Java is already installed. You can check this by opening a
terminal and typing `java -version`. Otherwise install Java using your
package manager or see [this site](https://adoptopenjdk.net/) for
other options.
It is enough to install the JRE. The JDK is required, if you want to
build docspell from source.
Docspell has been tested with Java version 1.8 (or sometimes referred
to as JRE 8 and JDK 8, respectively). The pre-build packages are also
build using JDK 8. But a later version of Java should work as well.
The next tools are only required on machines running the *Joex*
component.
### External Tools for Joex
- [Ghostscript](http://pages.cs.wisc.edu/~ghost/) (the `gs` command)
is used to extract/convert PDF files into images that are then fed
to ocr. It is available on most GNU/Linux distributions.
- [Unpaper](https://github.com/Flameeyes/unpaper) is a program that
pre-processes images to yield better results when doing ocr. If this
is not installed, docspell tries without it. However, it is
recommended to install, because it [improves text
extraction](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
(at the expense of a longer runtime).
- [Tesseract](https://github.com/tesseract-ocr/tesseract) is the tool
doing the OCR (converts images into text). It is a widely used open
source OCR engine. Tesseract 3 and 4 should work with docspell; you
can adopt the command line in the configuration file, if necessary.
### Example Debian
On Debian this should install all joex requirements:
``` bash
sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper
```
## Database
Both components must have access to a SQL database. Docspell has
support these databases:
- PostreSQL
- MariaDB
- H2
The H2 database is an interesting option for personal and mid-size
setups, as it requires no additional work. It is integrated into
docspell and works really well. It is also configured as the default
database.
For large installations, PostgreSQL or MariaDB is recommended. Create
a database and a user with enough privileges (read, write, create
table) to that database.
When using H2, make sure that all components access the same database
the jdbc url must point to the same file. Then, it is important to
add the options
`;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE` at the end
of the url. See the [default config](configure.html) for an example.
## Installing from ZIP files
After extracting the zip files, you'll find a start script in the
`bin/` folder.
## Installing from DEB packages
The DEB packages can be installed on Debian, or Debian based Distros:
``` bash
$ sudo dpkg -i docspell*.deb
```
Then the start scripts are in your `$PATH`. Run `docspell-restserver`
or `docspell-joex` from a terminal window.
The packages come with a systemd unit file that will be installed to
autostart the services.
## Running
Run the start script (in the corresponding `bin/` directory when using
the zip files):
```
$ ./docspell-restserver*/bin/docspell-restserver
$ ./docspell-joex*/bin/docspell-joex
```
This will startup both components using the default configuration. The
configuration should be adopted to your needs. For example, the
database connection is configured to use a H2 database in the `/tmp`
directory. Please refer to the [configuration page](configure.html)
for how to create a custom config file. Once you have your config
file, simply pass it as argument to the command:
```
$ ./docspell-restserver*/bin/docspell-restserver /path/to/server-config.conf
$ ./docspell-joex*/bin/docspell-joex /path/to/joex-config.conf
```
After starting the rest server, you can reach the web application at
path `/app/index.html`, so using default values it would be
`http://localhost:7880/app/index.html`.
You should be able to create a new account and sign in. Check the
[configuration page](configure.html) to further customize docspell.
### Options
The start scripts support some options to configure the JVM. One often
used setting is the maximum heap size of the JVM. By default, java
determines it based on properties of the current machine. You can
specify it by given java startup options to the command:
```
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -- /path/to/server-config.conf
```
This would limit the maximum heap to 1GB. The double slash separates
internal options and the arguments to the program. Another frequently
used option is to change the default temp directory. Usually it is
`/tmp`, but it may be desired to have a dedicated temp directory,
which can be configured:
```
$ ./docspell-restserver*/bin/docspell-restserver -J-Xmx1G -Djava.io.tmpdir=/path/to/othertemp -- /path/to/server-config.conf
```
The command:
```
$ ./docspell-restserver*/bin/docspell-restserver -h
```
gives an overview of supported options.
## Raspberry Pi, and similiar
Both component can run next to each other on a raspberry pi or
similiar device.
### REST Server
The REST server component runs very well on the Raspberry Pi and
similiar devices. It doesn't require much resources, because the heavy
work is done by the joex components.
### Joex
Running the joex component on the Raspberry Pi is possible, but will
result in long processing times. Tested on a RPi model 3 (4 cores, 1G
RAM) processing a PDF (scanned with 300dpi) with two pages took
9:52. You can speed it up considerably by uninstalling the `unpaper`
command, because this step takes quite long. This, of course, reduces
the quality of OCR. But without `unpaper` the same sample pdf was then
processed in 1:24, a speedup of 8 minutes.
You should limit the joex pool size to 1 and, depending on your model
and the amount of RAM, set a heap size of at least 500M
(`-J-Xmx500M`).
For personal setups, when you don't need the processing results asap,
this can work well enough.

View File

@ -0,0 +1,155 @@
---
layout: docs
title: Joex
---
# {{ page.title }}
Joex is short for *Job Executor* and it is the component managing long
running tasks in docspell. One of these long running tasks is the file
processing task.
One joex component handles the processing of all files of all
collectives/users. It requires much more resources than the rest
server component. Therefore the number of jobs that can run in
parallel is limited with respect to the hardware it is running on.
For larger installations, it is probably better to run several joex
components on different machines. That works out of the box, as long
as all components point to the same database and use different
`app-id`s (see [configuring docspell](./configure.html)).
When files are submitted to docspell, they are stored in the database
and all known joex components are notified about new work. Then they
compete on getting the next job from the queue. After a job finishes
and no job is waiting in the queue, joex will sleep until notified
again. It will also periodically notify itself as a fallback.
## Scheduler and Queue
The scheduler is the part that runs and monitors the long running
jobs. It works together with the job queue, which defines what job to
take next.
To create a somewhat fair distribution among multiple collectives, a
collective is first chosen in a simple round-robin way. Then a job
from this collective is chosen by priority.
There are only two priorities: low and high. A simple *counting
scheme* determines if a low prio or high prio job is selected
next. The default is `4, 1`, meaning to first select 4 high priority
jobs and then 1 low priority job, then starting over. If no such job
exists, its falls back to the other priority.
The priority can be set on a *Source* (see
[uploads](uploading.html)). Uploading through the web application will
always use priority *high*. The idea is that while logged in, jobs are
more important that those submitted when not logged in.
## Scheduler Config
The relevant part of the config file regarding the scheduler is shown
below with some explanations.
```
docspell.joex {
# other settings left out for brevity
scheduler {
# Number of processing allowed in parallel.
pool-size = 2
# A counting scheme determines the ratio of how high- and low-prio
# jobs are run. For example: 4,1 means run 4 high prio jobs, then
# 1 low prio and then start over.
counting-scheme = "4,1"
# How often a failed job should be retried until it enters failed
# state. If a job fails, it becomes "stuck" and will be retried
# after a delay.
retries = 5
# The delay until the next try is performed for a failed job. This
# delay is increased exponentially with the number of retries.
retry-delay = "1 minute"
# The queue size of log statements from a job.
log-buffer-size = 500
# If no job is left in the queue, the scheduler will wait until a
# notify is requested (using the REST interface). To also retry
# stuck jobs, it will notify itself periodically.
wakeup-period = "30 minutes"
}
}
```
The `pool-size` setting deterimens how many jobs run in parallel. You
need to play with this setting on your machine to find an optimal
value.
The `counting-scheme` determines for all collectives how to select
between high and low priority jobs; as explained above. It is
currently not possible to define that per collective.
If a job fails, it will be set to *stuck* state and retried by the
scheduler. The `retries` setting defines how many times a job is
retried until it enters the final *failed* state. The scheduler waits
some time until running the next try. This delay is given by
`retry-delay`. This is the initial delay, the time until the first
re-try (the second attempt). This time increases exponentially with
the number of retries.
The jobs will log about what they do, which is picked up and stored
into the database asynchronously. The log events are buffered in a
queue and another thread will consume this queue and store them in the
database. The `log-buffer-size` determines the size of the queue.
At last, there is a `wakeup-period` that determines at what interval
the joex component notifies itself to look for new jobs. If jobs get
stuck, and joex is not notified externally it could miss to
retry. Also, since networks are not reliable, a notification may not
reach a joex component. This periodic wakup is just to ensure that
jobs are eventually run.
## Starting on demand
The job executor and rest server can be started multiple times. This
is especially useful for the job executor. For example, when
submitting a lot of files in a short time, you can simply startup more
job executors on other computers on your network. Maybe use your
laptop to help with processing for a while.
You have to make sure, that all connect to the same database, and that
all have unique `app-id`s.
Once the files have been processced you can stop the additional
executors.
## Shutting down
If a job executor is sleeping and not executing any jobs, you can just
quit using SIGTERM or `Ctrl-C` when running in a terminal. But if
there are jobs currently executing, it is advisable to initiate a
graceful shutdown. The job executor will then stop taking new jobs
from the queue but it will wait until all running jobs have completed
before shutting down.
This can be done by sending a http POST request to the api of this job
executor:
```
curl -XPOST "http://localhost:7878/api/v1/shutdownAndExit"
```
If joex receives this request it will immediately stop taking new jobs
and it will quit when all running jobs are done.
If a job executor gets terminated while there are running jobs, the
jobs are still in the current state marked to be executed by this job
executor. In order to fix this, start the job executor again. It will
search all jobs that are marked with its id and put them back into
waiting state. Then send a graceful shutdown request as shown above.

View File

@ -0,0 +1,87 @@
---
layout: docs
title: Adding Meta Data
---
# {{ page.title }}
## Meta Data
The processing can be controlled implicitely by the provided meta
data. The *Meta Data* page allows to manage this meta data. You can
create the following:
- Tags
- Organizations
- Persons
- Equipments
### Tags
Items can be tagged with multiple custom tags (aka labels). This
allows to describe many different workflows people may have with their
documents.
A tag can have a *category*. This is meant to group tags together. For
example, you may want to have a tag category *doctype* that is
comprised of tags like *bill*, *contract*, *receipt* and so on. Or for
workflows, a tag category *state* may exist that includes tags like
*Todo* or *Waiting*. Or you can tag items with user names to provide
"assignment" semantics. Docspell doesn't propose any workflow, but it
can help to implement some.
The tags are *not* taken into account when processing. Docspell will
not automatically associate tags to your items. The tags are only
meant to be used manually.
### Organization and Person
The organization entity represents an non-personal (organization or
company) correspondent of an item. Docspell will choose one or more
organizations when processing documents and associate the "best" match
with your item.
The person entitiy can appear in two roles: It may be a correspondent
or the person an item is about. So a person is either a correspondent
or a concerning person. Docspell can not know which person is which,
therefore you need to tell this by checking the box "Use for
concerning person suggestion only". If this is checked, docspell will
use this person only to suggest a concerning person. Otherwise the
person is used only for correspondent suggestions.
Document processing uses the following properties:
- name
- websites
- e-mails
The website an e-mails can be added as contact information. If these
three are present, you should get good matches from docspell. All
other fields of an organization and person are not used during
document processing. They might be useful when using this as a real
address book.
### Equipment
The equipment entity is almost like a tag. In fact, it could be
replaced by a tag with a specific known category. The difference is
that docspell will try to find a match and associate it with your
item. The equipment represents non-personal things that an item is
about. Examples are: bills or insurances for *cars*, contracts for
*houses* or *flats*.
Equipments don't have contact information, so the only property that
is used to find matches during document processing is its name.
## Document Language
An important setting is the language of your documents. This helps OCR
and text analysis. You can select between English and German
currently.
Go to the *Collective Settings* page and click *Document
Language*. This will set the lanugage for all your documents. It is
not (yet) possible to specify it when uploading.

View File

@ -0,0 +1,40 @@
---
layout: docs
title: Processing Queue
---
# {{ page.title }}
The page *Processing Queue* shows the current state of document
processing for your uploads.
At the top of the page a list of running jobs is shown. Below that,
the left column shows jobs that wait to be picked up by the job
executor. On the right are finished jobs. The number of finished jobs
is cut to some maximum and is also restricted by a date range. The
page refreshes itself automatically to show the progress.
Example screenshot:
<div class="thumbnail">
<img src="../img/processing-queue.jpg">
</div>
You can cancel running jobs or remove waiting ones from the queue. If
you click on the small file symbol on finished jobs, you can inspect
its log messages again. A running job displays the job executor id
that executes the job.
Currently the job queue executes just the document processing tasks,
but it may be used for other long running tasks in the future.
Since job executors are shared among all collectives, it may happen
that a job is some time waiting until it is picked up by a job
executor. You can always start more job executors to help out.
If a job fails, it is retried after some time. Only if it fails too
often (can be configured), it then is finished with *failed* state. If
processing finally fails, the item is still created, just without
suggestions. But if processing is cancelled by the user, the item is
not created.

View File

@ -0,0 +1,187 @@
---
layout: docs
title: Tools
---
# {{ page.title }}
The `tools/` folder contains some scripts and other resources intented
for integrating docspell.
## consumedir
The `consumerdir.sh` is a bash script that works in two modes:
- Go through all files in given directories (non recursively) and sent
each to docspell.
- Watch one or more directories for new files and upload them to
docspell.
It can watch or go through one or more directories. Files can be
uploaded to multiple urls.
Run the script with the `-h` option, to see a short help text. The
help text will also show the values for any given option.
The script requires `curl` for uploading. It requires the
`inotifywait` command if directories should be watched for new
files. If the `-m` option is used, the script will skip duplicate
files. For this the `sha256sum` command is required.
Example for watching two directories:
``` bash
./tools/consumedir.sh --path ~/Downloads --path ~/pdfs -m /var/run/consumedir -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
```
The script by default watches the given directories. If the `-o`
option is used, it will instead go through these directories and
upload all pdf files in there.
Example for uploading all immediatly (the same as above only with `-o`
added):
``` bash
./tools/consumedir.sh -o --path ~/Downloads --path ~/pdfs/ -m /var/run/consumedir -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
```
### Systemd
The script can be used with systemd to run as a service. This is an
example unit file:
```
[Unit]
After=networking.target
Description=Docspell Consumedir
[Service]
Environment="PATH=/set/a/path"
ExecStartPre=mkdir -p /var/run/consumedir && chown -R someuser /var/run/consumedir
ExecStart=/bin/su -s /bin/bash someuser -c "consumedir.sh --path '/a/path/' -m '/var/run/consumedir' 'http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ'"
```
This unit file is just an example, it needs some fiddling. It assumes
an existing user `someuser` that is used to run this service. The url
`http://localhost:7880/api/v1/open/upload/...` is an anonymous upload
url as described [here](./uploading.html).
## ds.sh
A bash script to quickly upload files from the command line. It reads
a configuration file containing the URLs to upload to. Then each file
given to the script will be uploaded to al URLs in the config.
The config file is expected in
`$XDG_CONFIG_HOME/docspell/ds.conf`. `$XDG_CONFIG_HOME` defaults to
`~/.config`.
The config file contains lines with key-value pairs, separated by an
`=` sign. Lines starting with `#` are ignored. Example:
```
# Config file
url.1 = http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
url.2 = http://localhost:7880/api/v1/open/upload/item/6DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
```
The key must start with `url`.
### Usage
The `-h` option shows a help overview.
The script takes a list of files as arguments. It checks the file
types and will raise an error (and quit) if a file is included that is
not a PDF. The `-s` option can be used to skip them instead.
The `-c` option allows to specifiy a different config file.
Example:
``` bash
./ds.sh ~/Downloads/*.pdf
```
## Webextension for Docspell
Idea: Inside the browser click on a PDF and send it to docspell. It is
downloaded in the context of your current page. Then handed to an
application that pushes it to docspell. There is a browser add-on
implementing this in `tools/webextension`. This add-on only works with
firefox.
### Install
This is a bit complicated, since you need to install external tools
and the web extension. Both work together.
#### Install `ds.sh`
First install the `ds.sh` tool somewhere, maybe `/usr/local/bin` as
described above.
#### Install the native part
Then install the "native" part of the web extension:
Copy or symlink the `native.py` script into some known location. For
example:
``` bash
ln -s ~/docspell-checkout/tools/webextension/native/native.py /usr/local/share/docspell/native.py
```
Then copy the `app_manifest.json` to
`$HOME/.mozilla/native-messaging-hosts/docspell.json`. For example:
``` bash
cp ~/docspell-checkout/tools/webextension/native/app_manifest.json ~/.mozilla/native-messaging-hosts/docspell.json
```
See
[here](https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/Native_manifests#Manifest_location)
for details.
And you might want to modify this json file, so the path to the
`native.py` script is correct (it must be absolute).
If the `ds.sh` script is in your `$PATH`, then this should
work. Otherwise, edit the `native.py` script and change the path to
the tool. Or create a file `$HOME/.config/docspell/ds.cmd` whose
content is the path to the `ds.sh` script.
#### Install the extension
An extension file can be build using the `make-xpi.sh` script. But
installing it in "standard" firefox won't work, because [Mozilla
requires extensions to be signed by
them](https://wiki.mozilla.org/Add-ons/Extension_Signing). This means
creating an account and going through some process…. So here are two
alternatives:
1. Open firefox and type `about:debugging` in the addressbar. Then
click on *'Load Temporary Add-on...'* and select the
`manifest.json` file. The extension is now installed. The downside
is, that the extension will be removed once firefox is closed.
2. Use Firefox ESR, which allows to install Add-ons not signed by
Mozilla. But it has to be configured: Open firefox and type
`about:config` in the address bar. Search for key
`xpinstall.signatures.required` and set it to `false`. This is
described on the last paragraph on [this
page](https://support.mozilla.org/en-US/kb/add-on-signing-in-firefox).
When you right click on a file link, there should be a context menu
entry *'Docspell Upload Helper'*. The add-on will download this file
using the browser and then send the file path to the `native.py`
script. This script will in turn call `ds.sh` which finally uploads it
to your configured URLs.
Open the Add-ons page (`Ctrl`+`Shift`+`A`), the new add-on should be
there.

View File

@ -0,0 +1,130 @@
---
layout: docs
title: Uploads
---
# {{page.title}}
This page describes, how files can get into docspell. Technically,
there is just one way: via http multipart/form-data requests.
## Authenticated Upload
From within the web application there is the "Upload Files"
page. There you can select multiple files to upload. You can also
specify whether these files should become one item or if every file is
a separate item.
When you click "Submit" the files are uploaded and stored in the
database. Then the job executor(s) are notified which immediately
start processing them.
Go to the top-right menu and click "Processing Queue" to see the
current state.
This obviously requires an authenticated user. While this is handy for
ad-hoc uploads, it is very inconvenient for automating it by custom
scripts. For this the next variant exists.
## Anonymous Upload
It is also possible to upload files without authentication. This
should make tools that interact with docspell much easier to write.
### Creating Anonymous Uploads
Go to "Collective Settings" and then to the "Source" tab. A *Source*
identifies an endpoint where files can be uploaded
anonymously. Creating a new source creates a long unique id which is
part on an url that can be used to upload files. You can choose any
time to deactivate or delete the source at which point uploading is
not possible anymore. The idea is to give this URL away safely. You
can delete it any time and no passwords or secrets are visible, even
your username is not visible.
Example screenshot:
<div class="thumbnail">
<img src="../img/sources-form.jpg">
</div>
This example shows a source with name "test". It defines two urls:
- `/app/index.html#/upload/<id>`
- `/api/v1/open/upload/item/<id>`
The first points to a web page where everyone could upload files into
your account. You could give this url to people for sending files
directly into your docspell.
The second url is the API url, which accepts the requests to upload
files (which is used by the first url).
For example, this url can be used to upload files with curl:
``` bash
$ curl -XPOST -F file=@test.pdf http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
{"success":true,"message":"Files submitted."}
```
You could add more `-F file=@/path/to/your/file.pdf` to upload
multiple files (note, the `@` is required by curl, so it knows that
the following is a file).
When files are uploaded to an source endpoint, the items resulting
from this uploads are marked with the name of the source. So you know
which source an item originated.
If files are uploaded using the web applications *Upload files* page,
the source is implicitly set to `webapp`. If you also want to let
docspell count the files uploaded through the web interface, just
create a source (can be inactive) with that name (`webapp`).
## The Request
This gives more details about the request for uploads. It is a http
`multipart/form-data` request, with two possible fields:
- meta
- file
The `file` field can appear multiple times and is required at least
once. It is the part containing the file to upload.
The `meta` part is completely optional and can define additional meta
data, that docspell uses to create items from the given files. It
allows to transfer structured information together with the
unstructured binary files.
The `meta` content must be `application/json` containing this
structure:
```
{ multiple: Bool
, direction: Maybe String
}
```
The `multiple` property is by default `true`. It means that each file
in the upload request corresponds to a single item. An upload with 5
files will result in 5 items created. If it is `false`, then docspell
will create just one item, that will then contain all files.
Furthermore, the direction of the document (one of `incoming` or
`outgoing`) can be given. It is optional, it can be left out or
`null`.
This kind of request is very common and most programming languages
have support for this. For example, here is another curl command
uploading two files with meta data:
```
curl -XPOST -F meta='{"multiple":false, "direction": "outgoing"}' \
-F file=@letter-en-source.pdf \
-F file=@letter-de-source.pdf \
http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
```