Addons allow executing external programs in certain contexts inside
docspell. Currently they can be run after files have been processed.
Addons are provided as URLs to zip files.
- Log levels of specific loggers can be defined in the config
file (this doesn't work with env variables)
- Log events of background tasks now carry additional data
Improves and reorganizes how nlp pipelines are set up. Users can now
choose from many options, depending on their hardware and usage
scenario.
This is the basis for using more languages without depending on what
stanford-nlp supports. Support then consists of text extraction and
simple regex-ner processing.
The scaling factor can be given in the config file. When this changes,
images can be regenerated via POSTing to certain endpoints. It is
possible to regenerate just one attachment preview or all within a
collective.
This value is used to decide whether to try OCR. If the length of the
extracted text is below this value, OCR is run and both results are
compared. It was set to 10, which is just one or two words. Since
docspell works with documents, this value is too low.
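
The decision described above can be sketched roughly as follows. This
is an illustration only, not docspell's implementation: the names
`extract_text`, `run_ocr` and `min_text_length` are hypothetical, and
"comparing both results" is assumed here to mean keeping the longer
text.

```python
from typing import Callable


def extract_text(
    embedded_text: str,
    run_ocr: Callable[[], str],
    min_text_length: int,
) -> str:
    """Return the embedded text, trying OCR when it is too short."""
    stripped = embedded_text.strip()
    if len(stripped) >= min_text_length:
        # Enough text was found; skip the expensive OCR step.
        return stripped
    # Too little text: run OCR and compare both results.
    # Assumption: "compare" means keeping the longer of the two texts.
    ocr_text = run_ocr().strip()
    return ocr_text if len(ocr_text) > len(stripped) else stripped
```

With a threshold of 10, a scanned page whose only embedded text is a
one- or two-word stamp would already skip OCR, which is why a larger
value makes sense for documents.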
- Use another external tool to convert pdf to pdf which also adds the
extracted text as another layer into the pdf
- Although not used, the external conversion routine now checks for an
existing text file that is named like the pdf file, with extension
`.txt`. If present, it is included in the conversion result and used
as the extracted text.
- Text extraction for pdf files now happens on the converted file,
because it may already contain the text from the conversion step; this
avoids running OCR twice.
- Errors during conversion are never fatal; processing continues
without a converted file.
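
A minimal sketch of the text-file lookup and the non-fatal error
handling from the list above. The function names and the `run_tool`
callback are hypothetical. It also assumes that the text file sits next
to the converted pdf and that `.txt` replaces the `.pdf` extension; the
text above does not state either.

```python
from pathlib import Path
from typing import Callable, Optional, Tuple


def sidecar_text(pdf_path: Path) -> Optional[str]:
    """Return the content of the `.txt` file next to the pdf, if present."""
    # Assumption: `out.pdf` is accompanied by `out.txt` in the same directory.
    txt_path = pdf_path.with_suffix(".txt")
    if txt_path.is_file():
        return txt_path.read_text(encoding="utf-8", errors="replace")
    return None


def convert(
    source: Path, run_tool: Callable[[Path], Path]
) -> Tuple[Optional[Path], Optional[str]]:
    """Run a conversion tool; a failure yields (None, None) instead of raising."""
    try:
        converted = run_tool(source)
    except Exception:
        # Conversion errors are not fatal: continue without a converted file.
        return None, None
    return converted, sidecar_text(converted)
```

A caller that receives `(None, None)` would simply continue processing
with the original file.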
Many errors cannot be recovered from by retrying. There is currently
no way to distinguish these states, so it is now set to a lower value
to avoid long wait times until an item arrives.
The task runs on application start. It sets the schema using Solr's
schema api and then indexes all data in the database. Each step is
recorded so that it is not executed again on subsequent starts.
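
The run-once behaviour can be sketched like this. It is an
illustration under assumptions: `run_startup_task`, the step names and
the in-memory `completed` set are hypothetical; a real application
would persist the completed steps so they survive a restart.

```python
from typing import Callable, Dict, List, Set


def run_startup_task(
    steps: Dict[str, Callable[[], None]], completed: Set[str]
) -> List[str]:
    """Run each step not yet recorded as done; return the names that ran."""
    executed = []
    for name, action in steps.items():
        if name in completed:
            # Already done on an earlier start; skip it.
            continue
        action()
        # Record only after success, so a failed step is tried again
        # on the next start.
        completed.add(name)
        executed.append(name)
    return executed
```

Calling it a second time with the same `completed` set executes
nothing, which mirrors a subsequent application start.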