+++
title = "Scan Mailboxes"
weight = 70
[extra]
mktoc = true
+++

# Scan Mailboxes

Users who provide valid e-mail (IMAP) settings can import mails from
their mailbox into docspell periodically.

You first need to define IMAP settings; please see [this
page](@/docs/webapp/emailsettings.md#imap-settings).

Go to *User Settings -> Scan Mailbox Task*. You can define periodic
tasks that connect to your mailbox and import mails into docspell. It
is possible to define multiple tasks, for example, if you have
multiple e-mail accounts you want to import periodically.

{{ figure2(light="scanmailbox-list.png", dark="scanmailbox-list_dark.png") }}

## Details

### General

{{ figure2(light="scanmailbox-detail-01.png", dark="scanmailbox-detail-01_dark.png") }}

You can enable or disable this task. A disabled task will not run
periodically. You can still choose to run it manually if you click the
`Start Once` button.

Then you need to specify which [IMAP
connection](@/docs/webapp/emailsettings.md#imap-settings) to use.

### Processing

{{ figure2(light="scanmailbox-detail-02.png", dark="scanmailbox-detail-02_dark.png") }}

A list of folders is required. Docspell will only look into these
folders. You can specify multiple folders. The "Inbox" folder is a
special folder, which will usually appear translated in your web-mail
client. You can specify "INBOX" case-insensitively; docspell will then
read the mails in your inbox. Any other folder name is usually
case-sensitive (this depends on the IMAP server). The path separator
may also vary depending on your IMAP server - if the slash doesn't
work, try using a dot (e.g. "INBOX.invoices" instead of
"INBOX/invoices").

Type in a folder name and click the add button on the right.
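
If you are unsure which path separator your server uses, the IMAP
`LIST` response contains it. The following is a minimal sketch, not
part of docspell, that parses such response lines with Python. The
sample lines and folder names are made up for illustration; against a
real server you would get lines of this shape from the `list()` method
of an `imaplib.IMAP4_SSL` connection.

```python
import re
from typing import Optional, Tuple

# One IMAP LIST response line has the shape: (flags) "delimiter" name
LIST_LINE = re.compile(
    r'\((?P<flags>[^)]*)\) (?:"(?P<delim>[^"]*)"|NIL) (?P<name>.*)'
)


def parse_list_line(line: bytes) -> Tuple[Optional[str], str]:
    """Return (path separator, folder name) for one LIST response line."""
    match = LIST_LINE.match(line.decode("utf-8", errors="replace"))
    if match is None:
        raise ValueError(f"unexpected LIST line: {line!r}")
    # The delimiter is None when the server reports NIL (flat namespace).
    return match.group("delim"), match.group("name").strip('"')


# Hypothetical sample lines; a real server may report other folders.
samples = [
    b'(\\HasNoChildren) "." "INBOX.invoices"',
    b'(\\HasNoChildren) "/" "INBOX/invoices"',
]
for sample in samples:
    separator, folder = parse_list_line(sample)
    print(f"separator={separator!r} folder={folder!r}")
```

The folder name printed here is the form you would type into the
folder list of the task.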

The field *Received Since Hours* defines how many hours to go back
and look for mails. Usually there are many mails in your inbox and
importing them all at once is not feasible or desirable. It can work
together with the *Schedule* field below. For example, you could run
this task every 6 hours and read mails from 8 hours back. This setting
is used to query the mail server.
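
The interplay of the two settings is simple arithmetic. As a rough
sketch (the function names are hypothetical and this is not docspell
code), the following Python snippet computes the oldest receive time a
run still considers and how many hours two consecutive runs have in
common:

```python
from datetime import datetime, timedelta, timezone


def received_since_cutoff(run_time: datetime, received_since_hours: int) -> datetime:
    """Oldest receive time that a run starting at run_time still considers."""
    return run_time - timedelta(hours=received_since_hours)


def overlap_hours(schedule_interval_hours: int, received_since_hours: int) -> int:
    """Hours covered by two consecutive runs; 0 means no overlap."""
    return max(0, received_since_hours - schedule_interval_hours)


run = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(received_since_cutoff(run, 8))  # 2024-01-01 04:00:00+00:00
print(overlap_hours(6, 8))            # 2
```

With the values from the example above, mails received in the 2
overlapping hours are seen by two consecutive runs if they are still in
the folder.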

### Additional Filter

{{ figure2(light="scanmailbox-detail-03.png", dark="scanmailbox-detail-03_dark.png") }}

The following properties let you filter which of the downloaded mails
should be imported.

The *File Filter* can be specified as a glob to only import mail
attachments based on their file name. For example, a value of `*.pdf`
will only import files that have a `pdf` extension. The mail body is
named `mail.html` by convention and would be excluded when only
specifying `*.pdf`. You can combine multiple globs using OR via the
pipe `|` symbol. For example, to include the mail body as well as
all pdf attachments: `*.pdf|mail.html`.

The *Subject Filter* is a glob filter that is applied to the mail
subject. This can be used to only import mails whose subject matches
some pattern. For example, if your scanner mails to you with a certain
subject like _"Scanned Document 214"_, you could include those via a
`Scanned Document*` pattern.
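
To get a feeling for how such filters behave, here is a small Python
sketch that approximates them with the standard `fnmatch` module. It is
an illustration only and not docspell's glob implementation; in
particular, the case-sensitive matching used here is an assumption, and
the helper name `matches_filter` is hypothetical.

```python
from fnmatch import fnmatchcase


def matches_filter(value: str, glob_filter: str) -> bool:
    """True if value matches at least one of the `|`-separated globs."""
    return any(fnmatchcase(value, glob) for glob in glob_filter.split("|"))


# File filter: pdf attachments plus the mail body
for name in ["invoice.pdf", "mail.html", "photo.png"]:
    print(name, matches_filter(name, "*.pdf|mail.html"))

# Subject filter: mails sent by the scanner
print(matches_filter("Scanned Document 214", "Scanned Document*"))
```

With the filter `*.pdf|mail.html`, the first two file names match and
`photo.png` does not.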

### Post Processing

{{ figure2(light="scanmailbox-detail-04.png", dark="scanmailbox-detail-04_dark.png") }}

The next settings tell docspell what to do once it has read a mail.
The mail can be moved into another folder in your mail account. This
moves it out of the way for the next run. You can also choose to
delete the mail, but *note that it will really be deleted and not
moved to your trash folder*. If both options are off, nothing happens
with that mail; it simply stays (and could be re-read on the next
run).

Be careful when mails are neither moved nor deleted after processing.
They could be selected anew in the next run, meaning that the job
cannot progress, because it filters out the same mails all the time.
You can, however, simply schedule the task in an interval >= the
`Received Since Hours` setting.

By default, post-processing is only applied to mails that have been
*submitted to docspell*. Some mails may have been skipped due to
subject filtering. If you also want these skipped mails to be affected
by post-processing, enable the option *Apply post-processing to all
fetched mails*.

### Metadata

{{ figure2(light="scanmailbox-detail-05.png", dark="scanmailbox-detail-05_dark.png") }}

These properties let you specify metadata that is automatically
attached to the items being created.

Every item in docspell has a direction value (incoming or outgoing).
If you know that all mails you want to import have a specific
direction, then you can set it here. Otherwise, *automatic* means that
docspell chooses a direction based on the `From` header of a mail. If
the `From` header is an e-mail address that belongs to a "concerning"
person in your address book, then it is set to "outgoing". Otherwise
it is set to "incoming". To support this, you need to add your own
e-mail address(es) to your address book.
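
The rule for the *automatic* direction can be sketched in a few lines
of Python. This is a simplified illustration, not docspell's
implementation: the address book is reduced to a plain set of your own
addresses, the comparison ignores case (an assumption), and the
function name and addresses are hypothetical.

```python
from email.utils import parseaddr
from typing import Set


def guess_direction(from_header: str, own_addresses: Set[str]) -> str:
    """Return "outgoing" if the sender is one of your own addresses."""
    # parseaddr splits 'Jane Doe <jane@example.com>' into name and address
    _name, address = parseaddr(from_header)
    known = {addr.lower() for addr in own_addresses}
    return "outgoing" if address.lower() in known else "incoming"


own = {"jane@example.com"}
print(guess_direction("Jane Doe <jane@example.com>", own))    # outgoing
print(guess_direction("Shop <billing@shop.example>", own))    # incoming
```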

The *Item Folder* setting is used to put all items that are created
from mails into the specified [folder](metadata#folders). If you
define a folder here that you are not a member of, you won't find the
resulting items.

The *Tags* setting can be used to associate a fixed set of tags with
all items that are imported by this mail task.

The *Language* setting is applied when processing the mails. If not
set, the default language of the collective is used.

### Schedule

{{ figure2(light="scanmailbox-detail-06.png", dark="scanmailbox-detail-06_dark.png") }}

Finally, the *Schedule* defines when and how often this task should
run. The syntax is similar to a date-time string, like
`2019-09-15 12:32`, where each part is a pattern that can also match
multiple values. The UI tries to help a little by displaying the next
two date-times this task would execute. More in-depth help is
available
[here](https://github.com/eikek/calev#what-are-calendar-events).

For example, to execute the task every Monday at noon, you would
write: `Mon *-*-* 12:00`. A date-time part can match all values (`*`),
a list of values (e.g. `1,5,12,19`) or a range (e.g. `1..9`). Long
lists may be written in a shorter way using a repetition value. It is
written like this: `1/7` which is the same as a list with `1` and all
multiples of `7` added to it. In other words, it matches `1`, `1+7`,
`1+7+7`, `1+7+7+7` and so on.
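
To make the list, range and repetition forms concrete, the following
Python sketch expands a single date-time part into the values it
matches. It is not the parser used by docspell and only handles the
forms described above (`*`, lists, `a..b` ranges and `start/step`
repetitions); the function name and the bounds passed in are
assumptions for illustration.

```python
from typing import List


def expand_part(part: str, lowest: int, highest: int) -> List[int]:
    """Expand one schedule part into the sorted values it matches."""
    values = set()
    for piece in part.split(","):
        if piece == "*":
            # matches every value in the allowed range
            values.update(range(lowest, highest + 1))
        elif "/" in piece:
            # repetition: start, start+step, start+2*step, ...
            start, step = (int(x) for x in piece.split("/"))
            values.update(range(start, highest + 1, step))
        elif ".." in piece:
            # inclusive range
            first, last = (int(x) for x in piece.split(".."))
            values.update(range(first, last + 1))
        else:
            values.add(int(piece))
    return sorted(v for v in values if lowest <= v <= highest)


# Day-of-month part, assuming the bounds 1..31
print(expand_part("1/7", 1, 31))        # [1, 8, 15, 22, 29]
print(expand_part("1,5,12,19", 1, 31))  # [1, 5, 12, 19]
print(expand_part("1..9", 1, 31))       # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```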

# Configuration

The admin can tweak some properties in the config file at
`docspell.joex.user-tasks.scan-mailbox`:

- `max-folders`: the maximum number of folders to scan. This sets a
  hard limit on the folder selection of the user settings.
- `mail-chunk-size`: the batch size to use when fetching mails; after
  this many mails have been processed, the connection is
  re-established to get the next `mail-chunk-size` mails.
- `max-mails`: the maximum number of mails to process in one job run.
  If the mailbox contains more mails than this, the remaining mails
  have to wait for the next scheduled execution.

# Reading Mails twice / Duplicates

Since users can move mails around in their mailboxes, it can happen
that docspell unintentionally reads a mail multiple times. When
docspell reads a mail, it first checks whether an item already exists
that originated from this mail. It only proceeds to import it if it
cannot find any. If you deleted an item in the meantime, docspell
would import the mail again.

This check uses the
[`Message-ID`](https://en.wikipedia.org/wiki/Message-ID) of an e-mail.
This header is usually present and should identify a complete mail.
But it won't catch duplicate mails that are sent multiple times, since
they might have different `Message-ID`s. Also, some mails have no such
id and are then imported by docspell without any checks.

In later versions, docspell may also use the checksum of the generated
eml file to look for duplicates.
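
The check described above can be illustrated with Python's standard
`email` package. This is a simplified sketch under stated assumptions,
not docspell's code: the set `known_message_ids` stands in for the
lookup of existing items, and the function name and sample mails are
hypothetical.

```python
from email import message_from_string
from typing import Set


def should_import(raw_mail: str, known_message_ids: Set[str]) -> bool:
    """Decide whether a mail is imported, based on its Message-ID."""
    message_id = message_from_string(raw_mail)["Message-ID"]
    if message_id is None:
        # No id present: the mail is imported without any check.
        return True
    return message_id.strip() not in known_message_ids


known = {"<abc@example.com>"}
seen_before = "Message-ID: <abc@example.com>\nSubject: Invoice\n\nBody\n"
without_id = "Subject: Invoice\n\nBody\n"

print(should_import(seen_before, known))  # False
print(should_import(without_id, known))   # True
```

Removing an id from `known` corresponds to deleting the item: the same
mail would then be imported again.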

# How it works

Docspell will go through all folders and download mails in "batches".
The batch size can be set by the admin in the [configuration
file](@/docs/configure/defaults.md#joex) and applies to all these
tasks (same for all users). A batch only contains the mail headers
and not the complete mails.

Then each mail is downloaded completely, one by one, and converted
into an [eml](https://en.wikipedia.org/wiki/Email#Filename_extensions)
file which is then submitted to docspell. Then the usual processing
machinery starts, just like uploading an eml file via the webapp.

The number of folders and the number of mails to import can be limited
by an admin via the config file. Note that this limit applies to one
task run only; it is meant to reduce resource allocation of one task.
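
The effect of the batch size and the mail limit can be pictured with a
short Python sketch. It is a simplified model and not how docspell is
implemented: the mail headers are plain strings in one flat list
instead of being fetched per folder from a server, and the parameter
names merely mirror the `mail-chunk-size` and `max-mails` settings.

```python
from typing import Iterator, List, Sequence


def header_batches(
    headers: Sequence[str], mail_chunk_size: int, max_mails: int
) -> Iterator[List[str]]:
    """Yield header batches, stopping once max_mails have been handed out."""
    if mail_chunk_size < 1 or max_mails < 1:
        raise ValueError("mail_chunk_size and max_mails must be positive")
    limited = headers[:max_mails]  # everything beyond waits for the next run
    chunk = min(mail_chunk_size, max_mails)
    for start in range(0, len(limited), chunk):
        yield list(limited[start:start + chunk])


# 12 mails in the mailbox, batches of 5, at most 10 mails per run
mailbox = [f"mail-{i}" for i in range(12)]
for batch in header_batches(mailbox, mail_chunk_size=5, max_mails=10):
    print(len(batch), batch[0], "...", batch[-1])
```

In this model the run processes two batches of 5 mails; the last 2
mails are left for a later run.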