mirror of https://github.com/TheAnachronism/docspell.git
synced 2025-06-06 07:05:59 +00:00

Add docs for file processing

parent 96c337f7db
commit 8910ac6954
@@ -3,178 +3,10 @@ title = "Joex"
description = "More information about the job executor component."
weight = 90
insert_anchor_links = "right"
[extra]
summary = true
mktoc = true
template = "pages.html"
sort_by = "weight"
redirect_to = "docs/joex/intro"
+++
website/site/content/docs/joex/file-processing.md (new file, 366 lines)
@@ -0,0 +1,366 @@
+++
title = "File Processing"
description = "How Docspell processes files."
weight = 20
insert_anchor_links = "right"
[extra]
mktoc = true
+++

When uploading a file, it is only saved to the database together with
the given meta information. The file is not visible in the ui yet.
Then joex takes the next such file (or files, in case you uploaded
many) and starts processing it. When processing has finished, the item
and its files will show up in the ui.

If an error occurs during processing, the item will be created anyway,
so you can see it. Depending on the error, some information may not be
available.

Processing files may require some resources, like memory and cpu. Many
things can be configured in the config file to adapt it to the machine
it is running on.

Most important is the setting `docspell.joex.scheduler.pool-size`,
which defines how many tasks can run in parallel on the machine
running joex. For machines that are not very powerful, a value of `1`
is recommended.
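When experimenting with `pool-size`, the number of CPU cores on the
joex machine is a natural upper bound to start from (a rough heuristic
of this guide, not a docspell rule). On Linux it can be read with
`nproc` from GNU coreutils:

```shell
# Rough heuristic (not a docspell rule): the CPU core count is a
# natural upper bound when experimenting with pool-size.
nproc
```

From there, lower the value if joex runs next to other services or if
memory is tight.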

# Stages

```
DuplicateCheck ->
Extract Archives ->
Conversion to PDF ->
Text Extraction ->
Generate Previews ->
Text Analysis
```

These steps are executed sequentially. There are many config options
available for each step.

## External Commands

External programs are all configured the same way. You can change the
command (add or remove options, etc.) in the config file. As an
example, here is the `wkhtmltopdf` command that is used to convert
html files to pdf:

``` conf
docspell.joex.convert {
  wkhtmlpdf {
    command = {
      program = "wkhtmltopdf"
      args = [
        "-s",
        "A4",
        "--encoding",
        "{{encoding}}",
        "--load-error-handling", "ignore",
        "--load-media-error-handling", "ignore",
        "-",
        "{{outfile}}"
      ]
      timeout = "2 minutes"
    }
    working-dir = ${java.io.tmpdir}"/docspell-convert"
  }
}
```

Strings in `{{…}}` are replaced by docspell with the appropriate
values at runtime. However, based on your use case, you can just set
constant values or add other options. This might be necessary when
different versions are installed that require changes to the command
line. As you can see, for `wkhtmltopdf` the page size is fixed to DIN
A4. Other commands are configured like this as well.

For the default values, please see the [configuration
page](@/docs/configure/_index.md#joex).
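The placeholder mechanism can be pictured as plain string substitution
performed before the program is invoked. A minimal sketch (the
substituted values here are made up; docspell fills in the real ones at
runtime):

```shell
# Sketch only: render the configured args by substituting the
# {{…}} variables, as docspell does before invoking the program.
args='--encoding {{encoding}} - {{outfile}}'
echo "$args" | sed -e 's/{{encoding}}/UTF-8/' -e 's|{{outfile}}|/tmp/out.pdf|'
# prints: --encoding UTF-8 - /tmp/out.pdf
```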

## Duplicate Check

If enabled, the uploaded file's sha256 checksum is used to check
whether it has been uploaded before. If so, it is removed from the set
of uploaded files. You can control this with the upload metadata.

If this results in an empty set, the processing ends.
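The check boils down to comparing sha256 checksums. You can compute
the checksum of a local file yourself, e.g. with `sha256sum` from GNU
coreutils:

```shell
# Compute the sha256 checksum used for duplicate detection; only the
# file content matters, not its name or path.
printf 'example content' > /tmp/upload-sample
sha256sum /tmp/upload-sample | cut -d' ' -f1
```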

## Extract Archives

If a file is a `zip` or `eml` (e-mail) file, it is extracted and its
entries are added to the file set. The original (archive) file is kept
in the database, but removed from further processing.


## Conversion to PDF

All files are converted to a PDF file. How this is done depends on the
file type. External programs are required, which must be installed on
the machine running joex. The config file allows specifying the exact
commands used.

See the section `docspell.joex.convert` in the config file.

The following config options apply to the conversion as a whole:

``` conf
docspell.joex.convert {
  converted-filename-part = "converted"
  max-image-size = ${docspell.joex.extraction.ocr.max-image-size}
}
```

The first setting defines a suffix that is appended to the original
file name to name the converted file. You can set an empty string to
keep the same filename as the original. The extension is always
changed to `.pdf`, of course.

The second option defines a limit for reading images. Some images may
be small as files but very large uncompressed. To avoid allocating too
much memory, there is a limit. It defaults to 14 megapixels.

### Html

Html files are converted with the external tool
[wkhtmltopdf](https://wkhtmltopdf.org/). It produces quite nice
results using the webkit rendering engine, so the resulting PDF looks
just like in a browser.


### Images

Images are converted using
[tesseract](https://github.com/tesseract-ocr).

This might be interesting if you want to try a different language that
is not available in docspell's settings yet. Tesseract also adds the
extracted text as a separate layer to the PDF.

For images, tesseract is configured to create a text and a pdf file.

### Text

Plaintext files are treated as markdown. You can modify the result by
providing some custom css.

The resulting HTML files are then converted to PDF via `wkhtmltopdf`
as described above.

### Office

To convert office files, [Libreoffice](https://www.libreoffice.org/)
is required and used via the command line tool
[unoconv](https://github.com/unoconv/unoconv).

To improve performance, it is recommended to start a libreoffice
listener by running `unoconv -l` in a separate process.


### PDF

PDFs can be converted into PDFs, which may sound silly at first. But
PDFs come in many different flavors and may not contain a separate
text layer, making it impossible to "copy & paste" text in them. So
you can optionally use the tool
[ocrmypdf](https://github.com/jbarlow83/OCRmyPDF) to create a PDF/A
type PDF file containing a text layer with the extracted text.

It is recommended to install ocrmypdf, but it is optional. If it is
enabled but fails, the error is not fatal and processing will continue
using the original pdf for extracting text. You can also disable it to
remove the errors from the processing logs.

The `--skip-text` option is necessary to not fail on "text" pdfs
(where ocr is not necessary). In this case, the pdf will be converted
to PDF/A.

## Text Extraction

Text extraction also depends on the file type. Some tools from the
convert section are used here, too.

Extraction is first attempted on the original file. If that can't be
done or results in an error, the converted file is tried next.

### Html

Html files are not used directly; rather, the converted PDF file is
used to extract the text. This makes sure that the text you actually
see is what gets extracted. The conversion is done anyway, and the
resulting PDF already has a text layer.

### Images

For images, [tesseract](https://github.com/tesseract-ocr) is used
again. In most cases this step is not executed, because the text has
already been extracted in the conversion step. But if the conversion
failed for some reason, tesseract is called here (with different
options).

### Text

This is obviously trivial :)

### Office

MS Office files are processed using a library without any external
tool. It uses [apache poi](https://poi.apache.org/), which is well
known for these tasks.

A rich text file (`.rtf`) is processed by Java "natively" (using its
standard library).

OpenDocument files are processed using the ODS/ODT/ODF parser from
tika.

### PDF

PDF files are first checked for a text layer. If this returns some
text longer than the configured minimum length, it is used. Otherwise,
OCR is started for the whole pdf file, page by page.

```conf
docspell.joex {
  extraction {
    pdf {
      min-text-len = 500
    }
  }
}
```
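The effect of `min-text-len` can be sketched in plain shell (this is
an illustration, not docspell code; the extracted text is made up):

```shell
# Use the embedded text layer only if it meets the configured minimum
# length; otherwise fall back to OCR.
min_text_len=500
text_layer='short scanned stamp'   # pretend result of pdf text extraction
if [ "${#text_layer}" -ge "$min_text_len" ]; then
  echo "use embedded text layer"
else
  echo "run ocr page by page"
fi
# prints: run ocr page by page
```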

After OCR, both texts are compared and the longer one is used. Since
PDFs can contain both text and images, it might be safer to always do
OCR, but this is something for the user to choose.

PDF ocr consists of multiple steps. At first, only the first
`page-range` pages are extracted, to avoid very long running tasks (if
someone submits an ebook, for example). But you can disable this limit
by setting it to `-1`. After all, text that is not extracted won't be
indexed either and is therefore not searchable. It depends on your
machine/setup.

Another limit is `max-image-size`, which defines the size of an image
in pixels (`width * height`) above which processing is skipped.

Then [ghostscript](http://pages.cs.wisc.edu/~ghost/) is used to
extract single pages into image files and
[unpaper](https://github.com/Flameeyes/unpaper) is used to optimize
the images for ocr. Unpaper is optional; if it is not found, it is
skipped, which may be a reasonable compromise on slow machines.

```conf
docspell.joex {
  extraction {
    ocr {
      max-image-size = 14000000
      page-range {
        begin = 10
      }
      ghostscript {
        command {
          program = "gs"
          args = [ "-dNOPAUSE"
                 , "-dBATCH"
                 , "-dSAFER"
                 , "-sDEVICE=tiffscaled8"
                 , "-sOutputFile={{outfile}}"
                 , "{{infile}}"
                 ]
          timeout = "5 minutes"
        }
        working-dir = ${java.io.tmpdir}"/docspell-extraction"
      }
      unpaper {
        command {
          program = "unpaper"
          args = [ "{{infile}}", "{{outfile}}" ]
          timeout = "5 minutes"
        }
      }
      tesseract {
        command {
          program = "tesseract"
          args = [ "{{file}}"
                 , "stdout"
                 , "-l"
                 , "{{lang}}"
                 ]
          timeout = "5 minutes"
        }
      }
    }
  }
}
```

# Generating Previews

Previews are generated from the converted PDF of every file. The first
page of each file is converted into an image file. The config file
allows specifying a dpi, which is used to render the pdf page. The
default is 32dpi, which results in roughly a 200x300px image. For
comparison, a standard A4 page is usually rendered at 96dpi, which
results in a 790x1100px image.

```conf
docspell.joex {
  extraction {
    preview {
      dpi = 32
    }
  }
}
```

{% infobubble(mode="warning", title="Please note") %}

When this is changed, you must re-generate all preview images. Check
the api for this; there is an endpoint to regenerate all preview
images for a collective. There is also a bash script provided in the
`tools/` directory that can be used to call this endpoint.

{% end %}


# Text Analysis

This uses the extracted text to find out what could be attached to the
new item. Several things are provided.


## Classification

If you enabled classification in the config file, a model is trained
periodically from your files. This is then used to guess a tag for the
item.


## Natural Language Processing

NLP is used to find out which terms in the text may be a company or a
person, which is then used to find metadata to attach. It also uses
your address book to match terms in the text.

This requires loading language model files into memory, which takes
quite a lot of it. Also, the number of supported languages is much
more restricted than for tesseract. Currently English, German and
French are supported.

Another feature that is planned, but not yet provided, is to propose
new companies/people that are not in your address book yet.

The config file allows some settings. You can specify a limit for
texts, since large texts result in higher memory consumption. By
default, the first 10'000 characters are taken into account.

The setting `clear-stanford-nlp-interval` allows defining an idle time
after which the model files are cleared from memory, so it can be
reclaimed by the OS. The timer starts after the last file has been
processed. If you can afford it, it is recommended to disable this by
setting it to `0`.
website/site/content/docs/joex/intro.md (new file, 180 lines)
@@ -0,0 +1,180 @@

+++
title = "Joex"
description = "More information about the job executor component."
weight = 10
insert_anchor_links = "right"
[extra]
mktoc = true
+++

# Introduction

Joex is short for *Job Executor* and it is the component managing long
running tasks in docspell. One of these long running tasks is the file
processing task.

One joex component handles the processing of all files of all
collectives/users. It requires far more resources than the rest server
component. Therefore the number of jobs that can run in parallel is
limited with respect to the hardware it is running on.

For larger installations, it is probably better to run several joex
components on different machines. That works out of the box, as long
as all components point to the same database and use different
`app-id`s (see [configuring
docspell](@/docs/configure/_index.md#app-id)).

When files are submitted to docspell, they are stored in the database
and all known joex components are notified about new work. Then they
compete to get the next job from the queue. After a job finishes and
no job is waiting in the queue, joex will sleep until notified
again. It will also periodically notify itself as a fallback.

# Task vs Job

Just for the sake of this document, a task denotes the code that has
to be executed or the thing that has to be done. It emerges in a job
once a task is submitted into the queue, from where it will be picked
up and executed eventually. A job maintains state and other things,
while a task is just code.


# Scheduler and Queue

The scheduler is the part that runs and monitors the long running
jobs. It works together with the job queue, which defines what job to
take next.

To create a somewhat fair distribution among multiple collectives, a
collective is first chosen in a simple round-robin way. Then a job
from this collective is chosen by priority.

There are only two priorities: low and high. A simple *counting
scheme* determines whether a low-prio or high-prio job is selected
next. The default is `4, 1`, meaning to first select 4 high priority
jobs and then 1 low priority job, then start over. If no such job
exists, it falls back to the other priority.
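The selection can be pictured as a repeating cycle of slots. A toy
illustration of the default `4, 1` scheme (not the actual scheduler
code): four high-priority slots are followed by one low-priority slot.

```shell
# Toy model of the "4,1" counting scheme: positions 0-3 of each
# 5-slot cycle pick a high-prio job, position 4 picks a low-prio one.
high=4; low=1
cycle=$((high + low))
for n in 0 1 2 3 4 5 6 7 8 9; do
  if [ $((n % cycle)) -lt "$high" ]; then printf 'H'; else printf 'L'; fi
done
echo
# prints: HHHHLHHHHL
```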

The priority can be set on a *Source* (see
[uploads](@/docs/webapp/uploading.md)). Uploading through the web
application will always use priority *high*. The idea is that, while
logged in, jobs are more important than those submitted when not
logged in.


# Scheduler Config

The relevant part of the config file regarding the scheduler is shown
below with some explanations.

``` conf
docspell.joex {
  # other settings left out for brevity

  scheduler {

    # Number of jobs allowed to run in parallel.
    pool-size = 2

    # A counting scheme determines the ratio of how high- and low-prio
    # jobs are run. For example: 4,1 means run 4 high prio jobs, then
    # 1 low prio and then start over.
    counting-scheme = "4,1"

    # How often a failed job should be retried until it enters failed
    # state. If a job fails, it becomes "stuck" and will be retried
    # after a delay.
    retries = 5

    # The delay until the next try is performed for a failed job. This
    # delay is increased exponentially with the number of retries.
    retry-delay = "1 minute"

    # The queue size of log statements from a job.
    log-buffer-size = 500

    # If no job is left in the queue, the scheduler will wait until a
    # notify is requested (using the REST interface). To also retry
    # stuck jobs, it will notify itself periodically.
    wakeup-period = "30 minutes"
  }
}
```

The `pool-size` setting determines how many jobs run in parallel. You
need to play with this setting on your machine to find an optimal
value.

The `counting-scheme` determines, for all collectives, how to select
between high and low priority jobs, as explained above. It is
currently not possible to define this per collective.

If a job fails, it is set to *stuck* state and retried by the
scheduler. The `retries` setting defines how many times a job is
retried until it enters the final *failed* state. The scheduler waits
some time before running the next try. This delay is given by
`retry-delay`. It is the initial delay, i.e. the time until the first
retry (the second attempt). This time increases exponentially with the
number of retries.
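With the defaults (`retries = 5`, `retry-delay = "1 minute"`) and
assuming the delay doubles on each retry (the exact growth factor is
an assumption here; only "exponentially" is documented), the waits
look roughly like this:

```shell
# Sketch of exponential backoff: the delay doubles with each retry
# (growth factor assumed, not taken from docspell's documentation).
delay=60   # seconds, from retry-delay = "1 minute"
for retry in 1 2 3 4 5; do
  echo "retry $retry after ${delay}s"
  delay=$((delay * 2))
done
# last line printed: retry 5 after 960s
```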

The jobs log what they are doing, which is picked up and stored into
the database asynchronously. The log events are buffered in a queue,
and another thread consumes this queue and stores the events in the
database. The `log-buffer-size` determines the size of this queue.

Finally, there is the `wakeup-period`, which determines at what
interval the joex component notifies itself to look for new jobs. If
jobs get stuck and joex is not notified externally, it could miss
retrying them. Also, since networks are not reliable, a notification
may not reach a joex component. This periodic wakeup just ensures that
jobs are eventually run.


# Periodic Tasks

The job executor can execute tasks periodically. These tasks are
stored in the database so that they can be submitted into the job
queue. Multiple job executors can run at once, but only one of them
ever does something with a given task. So a periodic task is never
submitted twice. It is also not submitted if a previous run has not
finished yet.


# Starting on demand

The job executor and rest server can be started multiple times. This
is especially useful for the job executor. For example, when
submitting a lot of files in a short time, you can simply start up
more job executors on other computers on your network. Maybe use your
laptop to help with processing for a while.

You have to make sure that all of them connect to the same database
and that all have unique `app-id`s.

Once the files have been processed, you can stop the additional
executors.


# Shutting down

If a job executor is sleeping and not executing any jobs, you can just
quit using SIGTERM or `Ctrl-C` when running in a terminal. But if
there are jobs currently executing, it is advisable to initiate a
graceful shutdown. The job executor will then stop taking new jobs
from the queue, but it will wait until all running jobs have completed
before shutting down.

This can be done by sending an HTTP POST request to the api of this
job executor:

```
curl -XPOST "http://localhost:7878/api/v1/shutdownAndExit"
```

If joex receives this request, it will immediately stop taking new
jobs and it will quit when all running jobs are done.

If a job executor gets terminated while there are running jobs, the
jobs remain in their current state, marked to be executed by this job
executor. To fix this, start the job executor again. It will search
for all jobs that are marked with its id and put them back into
waiting state. Then send a graceful shutdown request as shown above.
@@ -46,6 +46,6 @@ There will be one task per file to convert. All these tasks are
 submitted with a low priority. So files uploaded through the webapp or
 a [source](@/docs/webapp/uploading.md#anonymous-upload) with a high
 priority, will be preferred as [configured in the job
-executor](@/docs/joex/_index.md#scheduler-config). This is to not
+executor](@/docs/joex/intro.md#scheduler-config). This is to not
 disturb normal processing when many conversion tasks are being
 executed.

website/site/content/docs/tools/regenerate-previews.md (new file, 42 lines)
@@ -0,0 +1,42 @@

+++
title = "Regenerate Preview Images"
description = "Re-generates all preview images."
weight = 80
+++

# regenerate-previews.sh

This is a simple bash script to trigger the endpoint that submits
tasks for generating preview images of your files. This is usually not
needed, but should you change the `preview.dpi` setting in joex's
config file, you need to regenerate the images for it to have any
effect.

# Requirements

It is a bash script that additionally needs
[curl](https://curl.haxx.se/) and
[jq](https://stedolan.github.io/jq/).

# Usage

```
./regenerate-previews.sh [docspell-base-url]
```

For example, if docspell is at `http://localhost:7880`:

```
./regenerate-previews.sh http://localhost:7880
```

The script asks for your account name and password. It then logs in
and triggers the said endpoint. After this, you should see a few tasks
running.

There will be one task per file to convert. All these tasks are
submitted with a low priority. So files uploaded through the webapp or
a [source](@/docs/webapp/uploading.md#anonymous-upload) with a high
priority will be preferred, as [configured in the job
executor](@/docs/joex/intro.md#scheduler-config). This is to not
disturb normal processing when many conversion tasks are being
executed.