Adding a new language without nlp requires now only to fill out the
pieces:
- define a list of month names to support date recognition
- add it to joex' dockerfile to be available for tesseract
- update the solr migration/field definitions
- update the elm file so it shows up on the client
Improves and reorganizes how nlp pipelines are setup. Now users can
choose from many options, depending on their hardware and usage
scenario.
This is the base to use more languages without depending on what
stanford-nlp supports. Support then is involves to text extraction and
simple regex-ner processing.
The scaling factor can be given in the config file. When this changes,
images can be regenerated via POSTing to certain endpoints. It is
possible to regenerate just one attachment preview or all within a
collective.
There is a task to generate preview images per attachment. It can
either add them (if not present yet) or overwrite them (e.g. some
config has changed).
There is a task that selects all attachments without previews and
submits a task to create it. This is submitted on start automatically
to generate previews for all existing attachments.
When `base-url` is the default (i.e. localhost), the cookie is now
configured with the domain doing the request and the webapp is
configured to run requests against the host in the address bar of the
browser.
A request to cancel a job was not processed correctly. The cancelling
routine of a task must run, regardless of the (non-final) state. Now
it works like this: if a job is currently running, it is interrupted
and its cancel routine is invoked. It then enters "cancelled" state.
If it is stuck, it is loaded and only its cancel routine is run. If it
is in a final state or waiting, it is removed from the queue.
Invalid items are those that are not ready, and not shown to the user.
When changing metadata, it should only be changed, if the item was not
already shown to the user.
Applications running next to docspell may want a way to upload files
to any collective for integration purposes. This endpoint can be used
for this. It is disabled by default and can be enabled via the
configuration file.
- When converting from html->pdf, the wkhtmltopdf program exits with
errors if the document contains invalid links. The content is now
cleaned before handed to wkhtmltopdf.
- Update emil library which fixes a bug when reading mails without
explicit transfer encoding (8bit)
- Add a info header to converted mails
Html and text files are not fixed to be UTF-8. The encoding is now
detected, which may not work for all files. Default/fallback will be
utf-8.
There is still a problem with mails that contain html parts not in
utf8 encoding. The mail text is always returned as a string and the
original encoding is lost. Then the html is stored using utf-8 bytes,
but wkhtmltopdf reads it using latin1. It seems that the `--encoding`
setting doesn't override encoding provided by the document.
Periodic tasks are special in that they are usually kept around and
started based on a schedule. A new component checks periodic tasks and
submits them in the queue once they are due.
In order to avoid duplicate periodic jobs, the tracker of a job is
used to store the periodic job id. Each time a periodic task is due,
it is first checked if there is a job running (or queued) for this
task.
The restriction that only pdf files can be uploaded is removed. All
files can now be uploaded. The processing may not process all. It is
still possible to restrict file uploads by types via a configuration.
First, files are be converted to PDF for archiving. It is also easier
to create a preview. This is done via the `ConvertPdf` processing
task (which is not yet implemented).
Text extraction then tries first with the original file. If that
fails, OCR is done on the (potentially) converted pdf file.
To not loose information of the original file, it is saved using the
table `attachment_source`. If the original file is already a pdf, or
the conversion did not succeed, the `attachment` and
`attachment_source` record point to the same file.