mirror of https://github.com/TheAnachronism/docspell.git
synced 2025-06-06 07:05:59 +00:00

Add docs for file processing

parent 96c337f7db
commit 8910ac6954
@@ -3,178 +3,10 @@ title = "Joex"
description = "More information about the job executor component."
weight = 90
insert_anchor_links = "right"
[extra]
summary = true
mktoc = true
template = "pages.html"
sort_by = "weight"
redirect_to = "docs/joex/intro"
+++
website/site/content/docs/joex/file-processing.md (new file, 366 lines)
@@ -0,0 +1,366 @@
+++
title = "File Processing"
description = "How Docspell processes files."
weight = 20
insert_anchor_links = "right"
[extra]
mktoc = true
+++

When uploading a file, it is only saved to the database together with
the given meta information. The file is not visible in the ui yet.
Then joex takes the next such file (or files, in case you uploaded
many) and starts processing it. When processing has finished, the item
and its files will show up in the ui.

If an error occurs during processing, the item will be created anyway,
so you can see it. Depending on the error, some information may not be
available.

Processing files may require some resources, like memory and cpu. Many
things can be configured in the config file to adapt it to the machine
it is running on.

Most important is the setting `docspell.joex.scheduler.pool-size`,
which defines how many tasks can run in parallel on the machine
running joex. For machines that are not very powerful, a value of `1`
is recommended.
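When experimenting with `pool-size`, the number of CPU cores on the
joex machine is a natural upper bound to start from (a rough heuristic
of this guide, not a docspell rule). On Linux it can be read with
`nproc` from GNU coreutils:

```shell
# Rough heuristic (not a docspell rule): the CPU core count is a
# natural upper bound when experimenting with pool-size.
nproc
```

From there, lower the value if joex runs next to other services or if
memory is tight.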

# Stages

```
DuplicateCheck ->
Extract Archives ->
Conversion to PDF ->
Text Extraction ->
Generate Previews ->
Text Analysis
```

These steps are executed sequentially. There are many config options
available for each step.

## External Commands

External programs are all configured the same way. You can change the
command (add or remove options, etc.) in the config file. As an
example, here is the `wkhtmltopdf` command that is used to convert
html files to pdf:

``` conf
docspell.joex.convert {
  wkhtmlpdf {
    command = {
      program = "wkhtmltopdf"
      args = [
        "-s",
        "A4",
        "--encoding",
        "{{encoding}}",
        "--load-error-handling", "ignore",
        "--load-media-error-handling", "ignore",
        "-",
        "{{outfile}}"
      ]
      timeout = "2 minutes"
    }
    working-dir = ${java.io.tmpdir}"/docspell-convert"
  }
}
```

Strings in `{{…}}` are replaced by docspell with the appropriate
values at runtime. However, based on your use case, you can just set
constant values or add other options. This might be necessary when
different versions are installed that require changes to the command
line. As you can see, for `wkhtmltopdf` the page size is fixed to DIN
A4. Other commands are configured like this as well.

For the default values, please see the [configuration
page](@/docs/configure/_index.md#joex).
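The placeholder mechanism can be pictured as plain string substitution
performed before the program is invoked. A minimal sketch (the
substituted values here are made up; docspell fills in the real ones at
runtime):

```shell
# Sketch only: render the configured args by substituting the
# {{…}} variables, as docspell does before invoking the program.
args='--encoding {{encoding}} - {{outfile}}'
echo "$args" | sed -e 's/{{encoding}}/UTF-8/' -e 's|{{outfile}}|/tmp/out.pdf|'
# prints: --encoding UTF-8 - /tmp/out.pdf
```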

## Duplicate Check

If enabled, the uploaded file's sha256 checksum is used to check
whether it has been uploaded before. If so, it is removed from the set
of uploaded files. You can control this with the upload metadata.

If this results in an empty set, the processing ends.
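The check boils down to comparing sha256 checksums. You can compute
the checksum of a local file yourself, e.g. with `sha256sum` from GNU
coreutils:

```shell
# Compute the sha256 checksum used for duplicate detection; only the
# file content matters, not its name or path.
printf 'example content' > /tmp/upload-sample
sha256sum /tmp/upload-sample | cut -d' ' -f1
```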

## Extract Archives

If a file is a `zip` or `eml` (e-mail) file, it is extracted and its
entries are added to the file set. The original (archive) file is kept
in the database, but removed from further processing.


## Conversion to PDF

All files are converted to a PDF file. How this is done depends on the
file type. External programs are required, which must be installed on
the machine running joex. The config file allows specifying the exact
commands used.

See the section `docspell.joex.convert` in the config file.

The following config options apply to the conversion as a whole:

``` conf
docspell.joex.convert {
  converted-filename-part = "converted"
  max-image-size = ${docspell.joex.extraction.ocr.max-image-size}
}
```

The first setting defines a suffix that is appended to the original
file name to name the converted file. You can set an empty string to
keep the same filename as the original. The extension is always
changed to `.pdf`, of course.

The second option defines a limit for reading images. Some images may
be small as files but very large uncompressed. To avoid allocating too
much memory, there is a limit. It defaults to 14 megapixels.

### Html

Html files are converted with the external tool
[wkhtmltopdf](https://wkhtmltopdf.org/). It produces quite nice
results using the webkit rendering engine, so the resulting PDF looks
just like in a browser.


### Images

Images are converted using
[tesseract](https://github.com/tesseract-ocr).

This might be interesting if you want to try a different language that
is not available in docspell's settings yet. Tesseract also adds the
extracted text as a separate layer to the PDF.

For images, tesseract is configured to create a text and a pdf file.

### Text

Plaintext files are treated as markdown. You can modify the result by
providing some custom css.

The resulting HTML files are then converted to PDF via `wkhtmltopdf`
as described above.

### Office

To convert office files, [Libreoffice](https://www.libreoffice.org/)
is required and used via the command line tool
[unoconv](https://github.com/unoconv/unoconv).

To improve performance, it is recommended to start a libreoffice
listener by running `unoconv -l` in a separate process.


### PDF

PDFs can be converted into PDFs, which may sound silly at first. But
PDFs come in many different flavors and may not contain a separate
text layer, making it impossible to "copy & paste" text in them. So
you can optionally use the tool
[ocrmypdf](https://github.com/jbarlow83/OCRmyPDF) to create a PDF/A
type PDF file containing a text layer with the extracted text.

It is recommended to install ocrmypdf, but it is optional. If it is
enabled but fails, the error is not fatal and processing will continue
using the original pdf for extracting text. You can also disable it to
remove the errors from the processing logs.

The `--skip-text` option is necessary to not fail on "text" pdfs
(where ocr is not necessary). In this case, the pdf will be converted
to PDF/A.

## Text Extraction

Text extraction also depends on the file type. Some tools from the
convert section are used here, too.

Extraction is first attempted on the original file. If that can't be
done or results in an error, the converted file is tried next.

### Html

Html files are not used directly; rather, the converted PDF file is
used to extract the text. This makes sure that the text you actually
see is what gets extracted. The conversion is done anyway, and the
resulting PDF already has a text layer.

### Images

For images, [tesseract](https://github.com/tesseract-ocr) is used
again. In most cases this step is not executed, because the text has
already been extracted in the conversion step. But if the conversion
failed for some reason, tesseract is called here (with different
options).

### Text

This is obviously trivial :)

### Office

MS Office files are processed using a library without any external
tool. It uses [apache poi](https://poi.apache.org/), which is well
known for these tasks.

A rich text file (`.rtf`) is processed by Java "natively" (using its
standard library).

OpenDocument files are processed using the ODS/ODT/ODF parser from
tika.

### PDF

PDF files are first checked for a text layer. If this returns some
text longer than the configured minimum length, it is used. Otherwise,
OCR is started for the whole pdf file, page by page.

```conf
docspell.joex {
  extraction {
    pdf {
      min-text-len = 500
    }
  }
}
```
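The effect of `min-text-len` can be sketched in plain shell (this is
an illustration, not docspell code; the extracted text is made up):

```shell
# Use the embedded text layer only if it meets the configured minimum
# length; otherwise fall back to OCR.
min_text_len=500
text_layer='short scanned stamp'   # pretend result of pdf text extraction
if [ "${#text_layer}" -ge "$min_text_len" ]; then
  echo "use embedded text layer"
else
  echo "run ocr page by page"
fi
# prints: run ocr page by page
```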

After OCR, both texts are compared and the longer one is used. Since
PDFs can contain both text and images, it might be safer to always do
OCR, but this is something for the user to choose.

PDF ocr consists of multiple steps. At first, only the first
`page-range` pages are extracted, to avoid very long running tasks (if
someone submits an ebook, for example). But you can disable this limit
by setting it to `-1`. After all, text that is not extracted won't be
indexed either and is therefore not searchable. It depends on your
machine/setup.

Another limit is `max-image-size`, which defines the size of an image
in pixels (`width * height`) above which processing is skipped.

Then [ghostscript](http://pages.cs.wisc.edu/~ghost/) is used to
extract single pages into image files and
[unpaper](https://github.com/Flameeyes/unpaper) is used to optimize
the images for ocr. Unpaper is optional; if it is not found, it is
skipped, which may be a reasonable compromise on slow machines.

```conf
docspell.joex {
  extraction {
    ocr {
      max-image-size = 14000000
      page-range {
        begin = 10
      }
      ghostscript {
        command {
          program = "gs"
          args = [ "-dNOPAUSE"
                 , "-dBATCH"
                 , "-dSAFER"
                 , "-sDEVICE=tiffscaled8"
                 , "-sOutputFile={{outfile}}"
                 , "{{infile}}"
                 ]
          timeout = "5 minutes"
        }
        working-dir = ${java.io.tmpdir}"/docspell-extraction"
      }
      unpaper {
        command {
          program = "unpaper"
          args = [ "{{infile}}", "{{outfile}}" ]
          timeout = "5 minutes"
        }
      }
      tesseract {
        command {
          program = "tesseract"
          args = [ "{{file}}"
                 , "stdout"
                 , "-l"
                 , "{{lang}}"
                 ]
          timeout = "5 minutes"
        }
      }
    }
  }
}
```

# Generating Previews

Previews are generated from the converted PDF of every file. The first
page of each file is converted into an image file. The config file
allows specifying a dpi, which is used to render the pdf page. The
default is 32dpi, which results in roughly a 200x300px image. For
comparison, a standard A4 page is usually rendered at 96dpi, which
results in a 790x1100px image.

```conf
docspell.joex {
  extraction {
    preview {
      dpi = 32
    }
  }
}
```

{% infobubble(mode="warning", title="Please note") %}

When this is changed, you must re-generate all preview images. Check
the api for this; there is an endpoint to regenerate all preview
images for a collective. There is also a bash script provided in the
`tools/` directory that can be used to call this endpoint.

{% end %}


# Text Analysis

This uses the extracted text to find out what could be attached to the
new item. Several things are provided.


## Classification

If you enabled classification in the config file, a model is trained
periodically from your files. This is then used to guess a tag for the
item.


## Natural Language Processing

NLP is used to find out which terms in the text may be a company or a
person, which is then used to find metadata to attach. It also uses
your address book to match terms in the text.

This requires loading language model files into memory, which takes
quite a lot of it. Also, the number of supported languages is much
more restricted than for tesseract. Currently English, German and
French are supported.

Another feature that is planned, but not yet provided, is to propose
new companies/people that are not in your address book yet.

The config file allows some settings. You can specify a limit for
texts, since large texts result in higher memory consumption. By
default, the first 10'000 characters are taken into account.

The setting `clear-stanford-nlp-interval` allows defining an idle time
after which the model files are cleared from memory, so it can be
reclaimed by the OS. The timer starts after the last file has been
processed. If you can afford it, it is recommended to disable this by
setting it to `0`.
website/site/content/docs/joex/intro.md (new file, 180 lines)
@@ -0,0 +1,180 @@

+++
title = "Joex"
description = "More information about the job executor component."
weight = 10
insert_anchor_links = "right"
[extra]
mktoc = true
+++

# Introduction

Joex is short for *Job Executor* and it is the component managing long
running tasks in docspell. One of these long running tasks is the file
processing task.

One joex component handles the processing of all files of all
collectives/users. It requires far more resources than the rest server
component. Therefore the number of jobs that can run in parallel is
limited with respect to the hardware it is running on.

For larger installations, it is probably better to run several joex
components on different machines. That works out of the box, as long
as all components point to the same database and use different
`app-id`s (see [configuring
docspell](@/docs/configure/_index.md#app-id)).

When files are submitted to docspell, they are stored in the database
and all known joex components are notified about new work. Then they
compete to get the next job from the queue. After a job finishes and
no job is waiting in the queue, joex will sleep until notified
again. It will also periodically notify itself as a fallback.

# Task vs Job

Just for the sake of this document, a task denotes the code that has
to be executed or the thing that has to be done. It emerges in a job
once a task is submitted into the queue, from where it will be picked
up and executed eventually. A job maintains state and other things,
while a task is just code.


# Scheduler and Queue

The scheduler is the part that runs and monitors the long running
jobs. It works together with the job queue, which defines what job to
take next.

To create a somewhat fair distribution among multiple collectives, a
collective is first chosen in a simple round-robin way. Then a job
from this collective is chosen by priority.

There are only two priorities: low and high. A simple *counting
scheme* determines whether a low-prio or high-prio job is selected
next. The default is `4, 1`, meaning to first select 4 high priority
jobs and then 1 low priority job, then start over. If no such job
exists, it falls back to the other priority.
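The selection can be pictured as a repeating cycle of slots. A toy
illustration of the default `4, 1` scheme (not the actual scheduler
code): four high-priority slots are followed by one low-priority slot.

```shell
# Toy model of the "4,1" counting scheme: positions 0-3 of each
# 5-slot cycle pick a high-prio job, position 4 picks a low-prio one.
high=4; low=1
cycle=$((high + low))
for n in 0 1 2 3 4 5 6 7 8 9; do
  if [ $((n % cycle)) -lt "$high" ]; then printf 'H'; else printf 'L'; fi
done
echo
# prints: HHHHLHHHHL
```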

The priority can be set on a *Source* (see
[uploads](@/docs/webapp/uploading.md)). Uploading through the web
application will always use priority *high*. The idea is that, while
logged in, jobs are more important than those submitted when not
logged in.


# Scheduler Config

The relevant part of the config file regarding the scheduler is shown
below with some explanations.

``` conf
docspell.joex {
  # other settings left out for brevity

  scheduler {

    # Number of jobs allowed to run in parallel.
    pool-size = 2

    # A counting scheme determines the ratio of how high- and low-prio
    # jobs are run. For example: 4,1 means run 4 high prio jobs, then
    # 1 low prio and then start over.
    counting-scheme = "4,1"

    # How often a failed job should be retried until it enters failed
    # state. If a job fails, it becomes "stuck" and will be retried
    # after a delay.
    retries = 5

    # The delay until the next try is performed for a failed job. This
    # delay is increased exponentially with the number of retries.
    retry-delay = "1 minute"

    # The queue size of log statements from a job.
    log-buffer-size = 500

    # If no job is left in the queue, the scheduler will wait until a
    # notify is requested (using the REST interface). To also retry
    # stuck jobs, it will notify itself periodically.
    wakeup-period = "30 minutes"
  }
}
```

The `pool-size` setting determines how many jobs run in parallel. You
need to play with this setting on your machine to find an optimal
value.

The `counting-scheme` determines, for all collectives, how to select
between high and low priority jobs, as explained above. It is
currently not possible to define this per collective.

If a job fails, it is set to *stuck* state and retried by the
scheduler. The `retries` setting defines how many times a job is
retried until it enters the final *failed* state. The scheduler waits
some time before running the next try. This delay is given by
`retry-delay`. It is the initial delay, i.e. the time until the first
retry (the second attempt). This time increases exponentially with the
number of retries.
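With the defaults (`retries = 5`, `retry-delay = "1 minute"`) and
assuming the delay doubles on each retry (the exact growth factor is
an assumption here; only "exponentially" is documented), the waits
look roughly like this:

```shell
# Sketch of exponential backoff: the delay doubles with each retry
# (growth factor assumed, not taken from docspell's documentation).
delay=60   # seconds, from retry-delay = "1 minute"
for retry in 1 2 3 4 5; do
  echo "retry $retry after ${delay}s"
  delay=$((delay * 2))
done
# last line printed: retry 5 after 960s
```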

The jobs log what they are doing, which is picked up and stored into
the database asynchronously. The log events are buffered in a queue,
and another thread consumes this queue and stores the events in the
database. The `log-buffer-size` determines the size of this queue.

Finally, there is the `wakeup-period`, which determines at what
interval the joex component notifies itself to look for new jobs. If
jobs get stuck and joex is not notified externally, it could miss
retrying them. Also, since networks are not reliable, a notification
may not reach a joex component. This periodic wakeup just ensures that
jobs are eventually run.


# Periodic Tasks

The job executor can execute tasks periodically. These tasks are
stored in the database so that they can be submitted into the job
queue. Multiple job executors can run at once, but only one of them
ever does something with a given task. So a periodic task is never
submitted twice. It is also not submitted if a previous run has not
finished yet.


# Starting on demand

The job executor and rest server can be started multiple times. This
is especially useful for the job executor. For example, when
submitting a lot of files in a short time, you can simply start up
more job executors on other computers on your network. Maybe use your
laptop to help with processing for a while.

You have to make sure that all of them connect to the same database
and that all have unique `app-id`s.

Once the files have been processed, you can stop the additional
executors.


# Shutting down

If a job executor is sleeping and not executing any jobs, you can just
quit using SIGTERM or `Ctrl-C` when running in a terminal. But if
there are jobs currently executing, it is advisable to initiate a
graceful shutdown. The job executor will then stop taking new jobs
from the queue, but it will wait until all running jobs have completed
before shutting down.

This can be done by sending an HTTP POST request to the api of this
job executor:

```
curl -XPOST "http://localhost:7878/api/v1/shutdownAndExit"
```

If joex receives this request, it will immediately stop taking new
jobs and it will quit when all running jobs are done.

If a job executor gets terminated while there are running jobs, the
jobs remain in their current state, marked to be executed by this job
executor. To fix this, start the job executor again. It will search
for all jobs that are marked with its id and put them back into
waiting state. Then send a graceful shutdown request as shown above.
@@ -46,6 +46,6 @@ There will be one task per file to convert. All these tasks are
 submitted with a low priority. So files uploaded through the webapp or
 a [source](@/docs/webapp/uploading.md#anonymous-upload) with a high
 priority, will be preferred as [configured in the job
-executor](@/docs/joex/_index.md#scheduler-config). This is to not
+executor](@/docs/joex/intro.md#scheduler-config). This is to not
 disturb normal processing when many conversion tasks are being
 executed.

website/site/content/docs/tools/regenerate-previews.md (new file, 42 lines)
@@ -0,0 +1,42 @@

+++
title = "Regenerate Preview Images"
description = "Re-generates all preview images."
weight = 80
+++

# regenerate-previews.sh

This is a simple bash script to trigger the endpoint that submits
tasks for generating preview images of your files. This is usually not
needed, but should you change the `preview.dpi` setting in joex's
config file, you need to regenerate the images for it to have any
effect.

# Requirements

It is a bash script that additionally needs
[curl](https://curl.haxx.se/) and
[jq](https://stedolan.github.io/jq/).

# Usage

```
./regenerate-previews.sh [docspell-base-url]
```

For example, if docspell is at `http://localhost:7880`:

```
./regenerate-previews.sh http://localhost:7880
```

The script asks for your account name and password. It then logs in
and triggers the said endpoint. After this, you should see a few tasks
running.

There will be one task per file to convert. All these tasks are
submitted with a low priority. So files uploaded through the webapp or
a [source](@/docs/webapp/uploading.md#anonymous-upload) with a high
priority will be preferred, as [configured in the job
executor](@/docs/joex/intro.md#scheduler-config). This is to not
disturb normal processing when many conversion tasks are being
executed.