This value is used to decide whether to try OCR or not. If text is
below this value, OCR is run and both results are compared. It was set
to 10, which is just one or two words. Since the context for docspell
are documents, this value is too low.
- Use another external tool to convert pdf to pdf which also adds the
extracted text as another layer into the pdf
- Although not used, the external conversion routine will now check
for an existing text file that is named as the pdf file with extension
`.txt`. If present it is included in the conversion result and will be
used as the extracted text.
- text extraction for pdf files happens now on the converted file,
because it may already contain the text from the conversion step and
thus avoids running OCR twice.
- All errors during conversion are not fatal; processing continues
without a converted file.
A request to cancel a job was not processed correctly. The cancelling
routine of a task must run, regardless of the (non-final) state. Now
it works like this: if a job is currently running, it is interrupted
and its cancel routine is invoked. It then enters "cancelled" state.
If it is stuck, it is loaded and only its cancel routine is run. If it
is in a final state or waiting, it is removed from the queue.
Many errors cannot be recovered from by retrying. There is currently
no way to distinguish these states so it is now set to a lower value
to have not long wait times until an item arrives.
The task runs on application start. It sets the schema using solr's
schema api and then indexes all data in the database. Each step is
memorized so that it is not executed again on subsequent starts.
It worked for new items, because the implicit offset was 0. when
adding archives to existing items, there are already attachments and
the new attachments are added to the end. This won't work if files are
added concurrently, because there is no quick and reliable way to
determine the offset then.
Invalid items are those that are not ready, and not shown to the user.
When changing metadata, it should only be changed, if the item was not
already shown to the user.
The job executor depends on backend module, since it may control the
application via user tasks. The `ONode` can now be moved from the
store module into the backend module.
- When converting from html->pdf, the wkhtmltopdf program exits with
errors if the document contains invalid links. The content is now
cleaned before handed to wkhtmltopdf.
- Update emil library which fixes a bug when reading mails without
explicit transfer encoding (8bit)
- Add a info header to converted mails
Html and text files are not fixed to be UTF-8. The encoding is now
detected, which may not work for all files. Default/fallback will be
utf-8.
There is still a problem with mails that contain html parts not in
utf8 encoding. The mail text is always returned as a string and the
original encoding is lost. Then the html is stored using utf-8 bytes,
but wkhtmltopdf reads it using latin1. It seems that the `--encoding`
setting doesn't override encoding provided by the document.
Each attachment is now first extracted into potentially multiple ones,
if it is recognized as an archive. This is the first step in
processing. The original archive file is also stored and the resulting
attachments are associated to their original archive.
First support is implemented for zip files.
Periodic tasks are special in that they are usually kept around and
started based on a schedule. A new component checks periodic tasks and
submits them in the queue once they are due.
In order to avoid duplicate periodic jobs, the tracker of a job is
used to store the periodic job id. Each time a periodic task is due,
it is first checked if there is a job running (or queued) for this
task.
The restriction that only pdf files can be uploaded is removed. All
files can now be uploaded. The processing may not process all. It is
still possible to restrict file uploads by types via a configuration.
First, files are be converted to PDF for archiving. It is also easier
to create a preview. This is done via the `ConvertPdf` processing
task (which is not yet implemented).
Text extraction then tries first with the original file. If that
fails, OCR is done on the (potentially) converted pdf file.
To not loose information of the original file, it is saved using the
table `attachment_source`. If the original file is already a pdf, or
the conversion did not succeed, the `attachment` and
`attachment_source` record point to the same file.