This value is used to decide whether to try OCR or not. If text is
below this value, OCR is run and both results are compared. It was set
to 10, which is just one or two words. Since the context for docspell
are documents, this value is too low.
- Use another external tool to convert pdf to pdf which also adds the
extracted text as another layer into the pdf
- Although not used, the external conversion routine will now check
for an existing text file that is named as the pdf file with extension
`.txt`. If present it is included in the conversion result and will be
used as the extracted text.
- text extraction for pdf files happens now on the converted file,
because it may already contain the text from the conversion step and
thus avoids running OCR twice.
- All errors during conversion are not fatal; processing continues
without a converted file.
A request to cancel a job was not processed correctly. The cancelling
routine of a task must run, regardless of the (non-final) state. Now
it works like this: if a job is currently running, it is interrupted
and its cancel routine is invoked. It then enters "cancelled" state.
If it is stuck, it is loaded and only its cancel routine is run. If it
is in a final state or waiting, it is removed from the queue.
Many errors cannot be recovered from by retrying. There is currently
no way to distinguish these states so it is now set to a lower value
to have not long wait times until an item arrives.
The task runs on application start. It sets the schema using solr's
schema api and then indexes all data in the database. Each step is
memorized so that it is not executed again on subsequent starts.
It worked for new items, because the implicit offset was 0. when
adding archives to existing items, there are already attachments and
the new attachments are added to the end. This won't work if files are
added concurrently, because there is no quick and reliable way to
determine the offset then.