diff --git a/website/site/content/docs/joex/_index.md b/website/site/content/docs/joex/_index.md index 7614d2e9..9fb9a1c5 100644 --- a/website/site/content/docs/joex/_index.md +++ b/website/site/content/docs/joex/_index.md @@ -3,178 +3,10 @@ title = "Joex" description = "More information about the job executor component." weight = 90 insert_anchor_links = "right" -[extra] -mktoc = true +summary = true +template = "pages.html" +sort_by = "weight" +redirect_to = "docs/joex/intro" +++ -# Introduction - -Joex is short for *Job Executor* and it is the component managing long -running tasks in docspell. One of these long running tasks is the file -processing task. - -One joex component handles the processing of all files of all -collectives/users. It requires much more resources than the rest -server component. Therefore the number of jobs that can run in -parallel is limited with respect to the hardware it is running on. - -For larger installations, it is probably better to run several joex -components on different machines. That works out of the box, as long -as all components point to the same database and use different -`app-id`s (see [configuring -docspell](@/docs/configure/_index.md#app-id)). - -When files are submitted to docspell, they are stored in the database -and all known joex components are notified about new work. Then they -compete on getting the next job from the queue. After a job finishes -and no job is waiting in the queue, joex will sleep until notified -again. It will also periodically notify itself as a fallback. - -# Task vs Job - -Just for the sake of this document, a task denotes the code that has -to be executed or the thing that has to be done. It emerges in a job, -once a task is submitted into the queue from where it will be picked -up and executed eventually. A job maintains a state and other things, -while a task is just code. - - -# Scheduler and Queue - -The scheduler is the part that runs and monitors the long running -jobs. 
It works together with the job queue, which defines what job to -take next. - -To create a somewhat fair distribution among multiple collectives, a -collective is first chosen in a simple round-robin way. Then a job -from this collective is chosen by priority. - -There are only two priorities: low and high. A simple *counting -scheme* determines if a low prio or high prio job is selected -next. The default is `4, 1`, meaning to first select 4 high priority -jobs and then 1 low priority job, then starting over. If no such job -exists, its falls back to the other priority. - -The priority can be set on a *Source* (see -[uploads](@/docs/webapp/uploading.md)). Uploading through the web -application will always use priority *high*. The idea is that while -logged in, jobs are more important that those submitted when not -logged in. - - -# Scheduler Config - -The relevant part of the config file regarding the scheduler is shown -below with some explanations. - -``` bash -docspell.joex { - # other settings left out for brevity - - scheduler { - - # Number of processing allowed in parallel. - pool-size = 2 - - # A counting scheme determines the ratio of how high- and low-prio - # jobs are run. For example: 4,1 means run 4 high prio jobs, then - # 1 low prio and then start over. - counting-scheme = "4,1" - - # How often a failed job should be retried until it enters failed - # state. If a job fails, it becomes "stuck" and will be retried - # after a delay. - retries = 5 - - # The delay until the next try is performed for a failed job. This - # delay is increased exponentially with the number of retries. - retry-delay = "1 minute" - - # The queue size of log statements from a job. - log-buffer-size = 500 - - # If no job is left in the queue, the scheduler will wait until a - # notify is requested (using the REST interface). To also retry - # stuck jobs, it will notify itself periodically. 
- wakeup-period = "30 minutes" - } -} -``` - -The `pool-size` setting determines how many jobs run in parallel. You -need to play with this setting on your machine to find an optimal -value. - -The `counting-scheme` determines for all collectives how to select -between high and low priority jobs; as explained above. It is -currently not possible to define that per collective. - -If a job fails, it will be set to *stuck* state and retried by the -scheduler. The `retries` setting defines how many times a job is -retried until it enters the final *failed* state. The scheduler waits -some time until running the next try. This delay is given by -`retry-delay`. This is the initial delay, the time until the first -re-try (the second attempt). This time increases exponentially with -the number of retries. - -The jobs will log about what they do, which is picked up and stored -into the database asynchronously. The log events are buffered in a -queue and another thread will consume this queue and store them in the -database. The `log-buffer-size` determines the size of the queue. - -At last, there is a `wakeup-period` that determines at what interval -the joex component notifies itself to look for new jobs. If jobs get -stuck, and joex is not notified externally it could miss to -retry. Also, since networks are not reliable, a notification may not -reach a joex component. This periodic wakup is just to ensure that -jobs are eventually run. - - -# Periodic Tasks - -The job executor can execute tasks periodically. These tasks are -stored in the database such that they can be submitted into the job -queue. Multiple job executors can run at once, only one is ever doing -something with a task. So a periodic task is never submitted twice. It -is also not submitted, if a previous task has not finished yet. - - -# Starting on demand - -The job executor and rest server can be started multiple times. This -is especially useful for the job executor. 
For example, when -submitting a lot of files in a short time, you can simply startup more -job executors on other computers on your network. Maybe use your -laptop to help with processing for a while. - -You have to make sure, that all connect to the same database, and that -all have unique `app-id`s. - -Once the files have been processced you can stop the additional -executors. - - -# Shutting down - -If a job executor is sleeping and not executing any jobs, you can just -quit using SIGTERM or `Ctrl-C` when running in a terminal. But if -there are jobs currently executing, it is advisable to initiate a -graceful shutdown. The job executor will then stop taking new jobs -from the queue but it will wait until all running jobs have completed -before shutting down. - -This can be done by sending a http POST request to the api of this job -executor: - -``` -curl -XPOST "http://localhost:7878/api/v1/shutdownAndExit" -``` - -If joex receives this request it will immediately stop taking new jobs -and it will quit when all running jobs are done. - -If a job executor gets terminated while there are running jobs, the -jobs are still in the current state marked to be executed by this job -executor. In order to fix this, start the job executor again. It will -search all jobs that are marked with its id and put them back into -waiting state. Then send a graceful shutdown request as shown above. +No content here. diff --git a/website/site/content/docs/joex/file-processing.md b/website/site/content/docs/joex/file-processing.md new file mode 100644 index 00000000..68811bca --- /dev/null +++ b/website/site/content/docs/joex/file-processing.md @@ -0,0 +1,366 @@ ++++ +title = "File Processing" +description = "How Docspell processes files." +weight = 20 +insert_anchor_links = "right" +[extra] +mktoc = true ++++ + +When uploading a file, it is only saved to the database together with +the given meta information. The file is not visible in the ui yet. 
+Then joex takes the next such file (or files, in case you uploaded
+many) and starts processing it. When processing finishes, the item
+and its files will show up in the ui.
+
+If an error occurs during processing, the item is created anyway, so
+you can see it. Depending on the error, some information may not be
+available.
+
+Processing files may require some resources, like memory and cpu. Many
+things can be configured in the config file to adapt it to the machine
+it is running on.
+
+An important setting is `docspell.joex.scheduler.pool-size`, which
+defines how many tasks can run in parallel on the machine running
+joex. For machines that are not very powerful, choosing `1` is
+recommended.
+
+
+# Stages
+
+```
+DuplicateCheck ->
+Extract Archives ->
+Conversion to PDF ->
+Text Extraction ->
+Generate Previews ->
+Text Analysis
+```
+
+These steps are executed sequentially. There are many config options
+available for each step.
+
+## External Commands
+
+External programs are all configured the same way. You can change the
+command (add or remove options, etc.) in the config file. As an
+example, here is the `wkhtmltopdf` command that is used to convert
+html files to pdf:
+
+``` conf
+docspell.joex.convert {
+  wkhtmlpdf {
+    command = {
+      program = "wkhtmltopdf"
+      args = [
+        "-s",
+        "A4",
+        "--encoding",
+        "{{encoding}}",
+        "--load-error-handling", "ignore",
+        "--load-media-error-handling", "ignore",
+        "-",
+        "{{outfile}}"
+      ]
+      timeout = "2 minutes"
+    }
+    working-dir = ${java.io.tmpdir}"/docspell-convert"
+  }
+}
+```
+
+Strings in `{{…}}` are replaced by docspell with the appropriate
+values at runtime. Based on your use case, you can also set constant
+values or add other options. This might be necessary when a different
+version is installed that requires changes to the command line. As you
+can see, for `wkhtmltopdf` the page size is fixed to DIN A4. Other
+commands are configured like this as well.
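To make the placeholder mechanism concrete, here is a small shell sketch of how such a template could be filled in. This is an illustration only: docspell performs the substitution internally, and the values below are made up.

```shell
# A template in the style of the config above; docspell replaces the
# "{{…}}" markers with actual values at runtime (sketched here via sed).
template='wkhtmltopdf -s A4 --encoding {{encoding}} - {{outfile}}'

# Example values (made up): the detected encoding and a temp output file
cmd=$(printf '%s' "$template" \
  | sed -e 's/{{encoding}}/UTF-8/' -e 's|{{outfile}}|/tmp/out.pdf|')

echo "$cmd"
# → wkhtmltopdf -s A4 --encoding UTF-8 - /tmp/out.pdf
```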
+
+For the default values, please see the [configuration
+page](@/docs/configure/_index.md#joex).
+
+## Duplicate Check
+
+If enabled, the sha256 hash of an uploaded file is used to check
+whether it has been uploaded before. If so, it is removed from the set
+of uploaded files. You can enable this check via the upload metadata.
+
+If this results in an empty set, the processing ends.
+
+
+## Extract Archives
+
+If a file is a `zip` or `eml` (e-mail) file, it is extracted and its
+entries are added to the file set. The original (archive) file is kept
+in the database, but removed from further processing.
+
+
+## Conversion to PDF
+
+All files are converted to a PDF file. How this is done depends on the
+file type. External programs are required, which must be installed on
+the machine running joex. The config file allows you to specify the
+exact commands used.
+
+See the section `docspell.joex.convert` in the config file.
+
+The following config options apply to the conversion as a whole:
+
+``` conf
+docspell.joex.convert {
+  converted-filename-part = "converted"
+  max-image-size = ${docspell.joex.extraction.ocr.max-image-size}
+}
+```
+
+The first setting defines a suffix that is appended to the original
+file name to name the converted file. You can set an empty string to
+keep the same filename as the original. The extension is always
+changed to `.pdf`, of course.
+
+The second option defines a limit for reading images. Some images are
+small as files but very large when uncompressed. To avoid allocating
+too much memory, there is a limit. It defaults to 14 megapixels.
+
+### Html
+
+Html files are converted with the external tool
+[wkhtmltopdf](https://wkhtmltopdf.org/). It produces quite nice
+results by using the webkit rendering engine, so the resulting PDF
+looks just like in a browser.
+
+
+### Images
+
+Images are converted using
+[tesseract](https://github.com/tesseract-ocr).
+
+This might be interesting if you want to try a different language
+that is not available in docspell's settings yet. Tesseract also adds
+the extracted text as a separate layer to the PDF.
+
+For images, tesseract is configured to create a text and a pdf file.
+
+### Text
+
+Plaintext files are treated as markdown. You can modify the results by
+providing some custom css.
+
+The resulting HTML files are then converted to PDF via `wkhtmltopdf`
+as described above.
+
+### Office
+
+To convert office files, [Libreoffice](https://www.libreoffice.org/)
+is required and used via the command line tool
+[unoconv](https://github.com/unoconv/unoconv).
+
+To improve performance, it is recommended to start a libreoffice
+listener by running `unoconv -l` in a separate process.
+
+
+### PDF
+
+PDFs can be converted into PDFs, which may sound silly at first. But
+PDFs come in many different flavors and may not contain a separate
+text layer, making it impossible to "copy & paste" text in them. So
+you can optionally use the tool
+[ocrmypdf](https://github.com/jbarlow83/OCRmyPDF) to create a PDF/A
+type PDF file containing a text layer with the extracted text.
+
+It is recommended to install ocrmypdf, but it is optional. If it is
+enabled but fails, the error is not fatal and processing will continue
+using the original pdf for extracting text. You can also disable it to
+remove the errors from the processing logs.
+
+The `--skip-text` option is necessary to not fail on "text" pdfs
+(where ocr is not necessary). In this case, the pdf is converted to
+PDF/A.
+
+
+## Text Extraction
+
+Text extraction also depends on the file type. Some tools from the
+convert section are used here, too.
+
+Text extraction is first tried on the original file. If that can't be
+done or results in an error, the converted file is tried next.
+
+### Html
+
+Html files are not used directly; the converted PDF file is used
+to extract the text.
This ensures that the text
+you actually see is extracted. The conversion is done anyway and the
+resulting PDF already has a text layer.
+
+### Images
+
+For images, [tesseract](https://github.com/tesseract-ocr) is used
+again. In most cases this step is not executed, because the text has
+already been extracted in the conversion step. But if the conversion
+failed for some reason, tesseract is called here (with different
+options).
+
+### Text
+
+This is obviously trivial :)
+
+### Office
+
+MS Office files are processed using a library without any external
+tool. It uses [apache poi](https://poi.apache.org/), which is well
+known for these tasks.
+
+A rich text file (`.rtf`) is processed by Java "natively" (using the
+standard library).
+
+OpenDocument files are processed using the ODS/ODT/ODF parser from
+tika.
+
+### PDF
+
+PDF files are first checked for a text layer. If this yields text
+longer than the configured minimum length, it is used. Otherwise, OCR
+is started for the whole pdf file, page by page.
+
+
+```conf
+docspell.joex {
+  extraction {
+    pdf {
+      min-text-len = 500
+    }
+  }
+}
+```
+
+After OCR, both texts are compared and the longer one is used. Since
+PDFs can contain text and images, it might be safer to always do OCR,
+but this is for the user to choose.
+
+PDF ocr is comprised of multiple steps. At first, only the first
+`page-range` pages are extracted to avoid overly long running tasks
+(if someone submits an ebook, for example). You can disable this limit
+by setting it to `-1`. Keep in mind that text that is not extracted
+won't be indexed either and is therefore not searchable. The right
+value depends on your machine/setup.
+
+Another limit is `max-image-size`, which defines the size of an image
+in pixels (`width * height`) above which processing is skipped.
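As a rough sketch (the numbers here are examples, and the actual check happens inside joex), the skip condition amounts to comparing the pixel count against the configured limit:

```shell
# Sketch of the max-image-size check: an image is skipped when its
# pixel count (width * height) exceeds the configured limit.
max_image_size=14000000   # the default, roughly 14 megapixels

width=3000                # example image dimensions
height=4000

if [ $((width * height)) -gt "$max_image_size" ]; then
  echo "skipped"
else
  echo "processed"
fi
# → processed  (3000 * 4000 = 12,000,000 pixels, below the limit)
```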
+
+Then [ghostscript](http://pages.cs.wisc.edu/~ghost/) is used to
+extract single pages into image files and
+[unpaper](https://github.com/Flameeyes/unpaper) is used to optimize
+the images for ocr. Unpaper is optional; if it is not found, it is
+skipped, which may be an acceptable compromise on slow machines.
+
+```conf
+docspell.joex {
+  extraction {
+    ocr {
+      max-image-size = 14000000
+      page-range {
+        begin = 10
+      }
+      ghostscript {
+        command {
+          program = "gs"
+          args = [ "-dNOPAUSE"
+                 , "-dBATCH"
+                 , "-dSAFER"
+                 , "-sDEVICE=tiffscaled8"
+                 , "-sOutputFile={{outfile}}"
+                 , "{{infile}}"
+                 ]
+          timeout = "5 minutes"
+        }
+        working-dir = ${java.io.tmpdir}"/docspell-extraction"
+      }
+      unpaper {
+        command {
+          program = "unpaper"
+          args = [ "{{infile}}", "{{outfile}}" ]
+          timeout = "5 minutes"
+        }
+      }
+      tesseract {
+        command {
+          program = "tesseract"
+          args = ["{{file}}"
+                 , "stdout"
+                 , "-l"
+                 , "{{lang}}"
+                 ]
+          timeout = "5 minutes"
+        }
+      }
+    }
+  }
+}
+```
+
+# Generating Previews
+
+Previews are generated from the converted PDF of every file. The first
+page of each file is converted into an image file. The config file
+allows you to specify the dpi used to render the pdf page. The default
+is 32dpi, which results in roughly a 200x300px image. For comparison,
+a standard A4 page is usually rendered at 96dpi, which results in a
+790x1100px image.
+
+```conf
+docspell.joex {
+  extraction {
+    preview {
+      dpi = 32
+    }
+  }
+}
+```
+
+{% infobubble(mode="warning", title="Please note") %}
+
+When this is changed, you must re-generate all preview images. Check
+the api for this; there is an endpoint to regenerate all preview
+images for a collective. There is also a bash script provided in the
+`tools/` directory that can be used to call this endpoint.
+
+{% end %}
+
+
+# Text Analysis
+
+This uses the extracted text to find out what could be attached to the
+new item. Multiple mechanisms are provided.
+
+
+## Classification
+
+If you enabled classification in the config file, a model is trained
+periodically from your files. It is used to guess a tag for the item.
+
+
+## Natural Language Processing
+
+NLP is used to find terms in the text that may denote a company or
+person, which is later used to find metadata to attach to the item. It
+also uses your address book to match terms in the text.
+
+This requires loading language model files into memory, which takes
+quite a lot of it. Also, the number of supported languages is much
+more restricted than for tesseract. Currently English, German and
+French are supported.
+
+Another feature that is planned, but not yet provided, is to propose
+new companies/people that are not yet in your address book.
+
+The config file offers some settings here. You can specify a limit for
+text length; large texts result in higher memory consumption. By
+default, the first 10'000 characters are taken into account.
+
+The setting `clear-stanford-nlp-interval` defines an idle time after
+which the model files are cleared from memory, allowing it to be
+reclaimed by the OS. The timer starts after the last file has been
+processed. If you can afford it, it is recommended to disable it by
+setting it to `0`.
diff --git a/website/site/content/docs/joex/intro.md b/website/site/content/docs/joex/intro.md
new file mode 100644
index 00000000..930439fa
--- /dev/null
+++ b/website/site/content/docs/joex/intro.md
@@ -0,0 +1,180 @@
++++
+title = "Joex"
+description = "More information about the job executor component."
+weight = 10
+insert_anchor_links = "right"
+[extra]
+mktoc = true
++++
+
+# Introduction
+
+Joex is short for *Job Executor* and it is the component managing long
+running tasks in docspell. One of these long running tasks is the file
+processing task.
+
+One joex component handles the processing of all files of all
+collectives/users. It requires far more resources than the rest
+server component.
Therefore the number of jobs that can run in
+parallel is limited with respect to the hardware it is running on.
+
+For larger installations, it is probably better to run several joex
+components on different machines. That works out of the box, as long
+as all components point to the same database and use different
+`app-id`s (see [configuring
+docspell](@/docs/configure/_index.md#app-id)).
+
+When files are submitted to docspell, they are stored in the database
+and all known joex components are notified about new work. Then they
+compete for the next job from the queue. After a job finishes and no
+job is waiting in the queue, joex will sleep until notified again. It
+will also periodically notify itself as a fallback.
+
+# Task vs Job
+
+Just for the sake of this document, a task denotes the code that has
+to be executed or the thing that has to be done. It becomes a job once
+it is submitted into the queue, from where it will be picked up and
+executed eventually. A job maintains state and other things, while a
+task is just code.
+
+
+# Scheduler and Queue
+
+The scheduler is the part that runs and monitors the long running
+jobs. It works together with the job queue, which defines what job to
+take next.
+
+To create a somewhat fair distribution among multiple collectives, a
+collective is first chosen in a simple round-robin way. Then a job
+from this collective is chosen by priority.
+
+There are only two priorities: low and high. A simple *counting
+scheme* determines whether a low prio or high prio job is selected
+next. The default is `4, 1`, meaning to first select 4 high priority
+jobs and then 1 low priority job, then start over. If no such job
+exists, it falls back to the other priority.
+
+The priority can be set on a *Source* (see
+[uploads](@/docs/webapp/uploading.md)). Uploading through the web
+application will always use priority *high*.
The idea is that while
+logged in, jobs are more important than those submitted when not
+logged in.
+
+
+# Scheduler Config
+
+The relevant part of the config file regarding the scheduler is shown
+below with some explanations.
+
+``` bash
+docspell.joex {
+  # other settings left out for brevity
+
+  scheduler {
+
+    # Number of jobs that can run in parallel.
+    pool-size = 2
+
+    # A counting scheme determines the ratio of how high- and low-prio
+    # jobs are run. For example: 4,1 means run 4 high prio jobs, then
+    # 1 low prio and then start over.
+    counting-scheme = "4,1"
+
+    # How often a failed job should be retried until it enters failed
+    # state. If a job fails, it becomes "stuck" and will be retried
+    # after a delay.
+    retries = 5
+
+    # The delay until the next try is performed for a failed job. This
+    # delay is increased exponentially with the number of retries.
+    retry-delay = "1 minute"
+
+    # The queue size of log statements from a job.
+    log-buffer-size = 500
+
+    # If no job is left in the queue, the scheduler will wait until a
+    # notify is requested (using the REST interface). To also retry
+    # stuck jobs, it will notify itself periodically.
+    wakeup-period = "30 minutes"
+  }
+}
+```
+
+The `pool-size` setting determines how many jobs run in parallel. You
+need to play with this setting on your machine to find an optimal
+value.
+
+The `counting-scheme` determines for all collectives how to select
+between high and low priority jobs, as explained above. It is
+currently not possible to define this per collective.
+
+If a job fails, it is set to *stuck* state and retried by the
+scheduler. The `retries` setting defines how many times a job is
+retried until it enters the final *failed* state. The scheduler waits
+some time before running the next try. This delay is given by
+`retry-delay`: it is the initial delay, the time until the first
+re-try (the second attempt). It increases exponentially with the
+number of retries.
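Assuming the delay doubles with each attempt (an illustrative assumption; the config only says it grows exponentially), the default `retry-delay = "1 minute"` with `retries = 5` gives a schedule like this:

```shell
# Sketch: exponential growth of the retry delay, assuming doubling.
# retry-delay = "1 minute", retries = 5 (the defaults shown above).
delay=1
for attempt in 1 2 3 4 5; do
  echo "attempt $attempt: wait $delay minute(s)"
  delay=$((delay * 2))
done
# prints waits of 1, 2, 4, 8 and 16 minutes
```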
+
+The jobs will log what they do, which is picked up and stored into the
+database asynchronously. The log events are buffered in a queue and
+another thread consumes this queue and stores the events in the
+database. The `log-buffer-size` determines the size of this queue.
+
+Finally, there is a `wakeup-period` that determines at what interval
+the joex component notifies itself to look for new jobs. If jobs get
+stuck and joex is not notified externally, it could miss retrying
+them. Also, since networks are not reliable, a notification may not
+reach a joex component. This periodic wakeup just ensures that jobs
+are eventually run.
+
+
+# Periodic Tasks
+
+The job executor can execute tasks periodically. These tasks are
+stored in the database such that they can be submitted into the job
+queue. Multiple job executors can run at once, but only one ever works
+on a given task. So a periodic task is never submitted twice. It is
+also not submitted if a previous run has not finished yet.
+
+
+# Starting on demand
+
+The job executor and rest server can be started multiple times. This
+is especially useful for the job executor. For example, when
+submitting a lot of files in a short time, you can simply start up
+more job executors on other computers on your network. Maybe use your
+laptop to help with processing for a while.
+
+You have to make sure that all of them connect to the same database,
+and that all have unique `app-id`s.
+
+Once the files have been processed you can stop the additional
+executors.
+
+
+# Shutting down
+
+If a job executor is sleeping and not executing any jobs, you can just
+quit using SIGTERM or `Ctrl-C` when running in a terminal. But if
+there are jobs currently executing, it is advisable to initiate a
+graceful shutdown. The job executor will then stop taking new jobs
+from the queue but wait until all running jobs have completed before
+shutting down.
+
+This can be done by sending an HTTP POST request to the api of this
+job executor:
+
+```
+curl -XPOST "http://localhost:7878/api/v1/shutdownAndExit"
+```
+
+If joex receives this request, it will immediately stop taking new
+jobs and will quit when all running jobs are done.
+
+If a job executor gets terminated while there are running jobs, the
+jobs remain in their current state, marked to be executed by this job
+executor. To fix this, start the job executor again. It will search
+all jobs that are marked with its id and put them back into waiting
+state. Then send a graceful shutdown request as shown above.
diff --git a/website/site/content/docs/tools/convert-all-pdf.md b/website/site/content/docs/tools/convert-all-pdf.md
index 08265127..412c363b 100644
--- a/website/site/content/docs/tools/convert-all-pdf.md
+++ b/website/site/content/docs/tools/convert-all-pdf.md
@@ -46,6 +46,6 @@ There will be one task per file to convert. All these tasks are
 submitted with a low priority. So files uploaded through the webapp or
 a [source](@/docs/webapp/uploading.md#anonymous-upload) with a high
 priority, will be preferred as [configured in the job
-executor](@/docs/joex/_index.md#scheduler-config). This is to not
+executor](@/docs/joex/intro.md#scheduler-config). This is to not
 disturb normal processing when many conversion tasks are being
 executed.
diff --git a/website/site/content/docs/tools/regenerate-previews.md b/website/site/content/docs/tools/regenerate-previews.md
new file mode 100644
index 00000000..878a56ef
--- /dev/null
+++ b/website/site/content/docs/tools/regenerate-previews.md
@@ -0,0 +1,42 @@
++++
+title = "Regenerate Preview Images"
+description = "Re-generates all preview images."
+weight = 80
++++
+
+# regenerate-previews.sh
+
+This is a simple bash script to trigger the endpoint that submits
+tasks for generating preview images of your files.
This is usually not
+needed; but should you change the `preview.dpi` setting in joex'
+config file, you need to regenerate the images for it to take effect.
+
+# Requirements
+
+It is a bash script that additionally needs
+[curl](https://curl.haxx.se/) and
+[jq](https://stedolan.github.io/jq/).
+
+# Usage
+
+```
+./regenerate-previews.sh [docspell-base-url]
+```
+
+For example, if docspell is at `http://localhost:7880`:
+
+```
+./regenerate-previews.sh http://localhost:7880
+```
+
+The script asks for your account name and password. It then logs in
+and triggers the endpoint mentioned above. After this you should see a
+few tasks running.
+
+There will be one task per file. All these tasks are submitted with a
+low priority. So files uploaded through the webapp or a
+[source](@/docs/webapp/uploading.md#anonymous-upload) with a high
+priority will be preferred, as [configured in the job
+executor](@/docs/joex/intro.md#scheduler-config). This is to not
+disturb normal processing when many preview tasks are being executed.
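Like the other tool scripts, this one first logs in via the REST api before calling the endpoint. A minimal sketch of how the login payload could be assembled is shown below; the JSON field names are an assumption for illustration, so consult the script itself for the real request.

```shell
# Sketch: building the JSON body a login request might carry.
# Field names are assumed; no escaping is done, which is fine for
# this illustration but not for real input.
account="demo"
password="test123"
payload=$(printf '{"account":"%s","password":"%s"}' "$account" "$password")
echo "$payload"
# → {"account":"demo","password":"test123"}
```

The real script reads the account name and password interactively and extracts the session token from the response with `jq`.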