mirror of https://github.com/TheAnachronism/docspell.git, synced 2025-04-04 10:29:34 +00:00
Update documentation
parent 4309bd8dfd
commit 145c308461
@@ -67,7 +67,7 @@ Text is extracted from all files. For scanned documents/images, OCR is used by u
 , { image = "img/analyze-feature.png"
 , header = "Text Analysis"
 , description = """
-The extracted text is analyzed and is used to find properties that can be annotated to your documents automatically.
+The extracted text is analyzed using ML techniques to find properties that can be annotated to your documents automatically.
 """
 }
 , { image = "img/filetype-feature.svg"
@@ -33,11 +33,26 @@ workflows, a tag category *state* may exist that includes tags like
 "assignment" semantics. Docspell doesn't propose any workflow, but it
 can help to implement some.
 
-The tags are *not* taken into account when creating suggestions from
-analyzed text yet. However, PDF files may contain metadata itself and
-if there is a metadata *keywords* list, these keywords are matched
-against the tags in the database. If they match, the item is tagged
-automatically.
+Docspell can try to predict a tag for new incoming documents
+automatically based on your existing data. This requires to train an
+algorithm. There are some caveats: the more data you have correctly
+tagged, the better are the results. So it won't work well for maybe
+the first 100 documents. Then the tags must somehow relate to a
+pattern in the document text. Tags like *todo* or *waiting* probably
+won't work, obviously. But the typical "document type" tag, like
+*invoice* and *receipt* is a good fit! That is why you need to provide
+a tag category so only sensible tags are being learned. The algorithm
+goes through all your items and learns patterns in the text that
+relate to the given tags. This training step can be run periodically,
+as specified in your collective settings such that docspell keeps
+learning from your already tagged data! More information about the
+algorithm can be found in the config, where it is possible to
+fine-tune this process.
+
+Another way to have items tagged automatically is when an input PDF
+file contains a list of keywords in its metadata section (this only
+applies to PDF files). These keywords are then matched against the
+tags in the database. If they match, the item is tagged with them.
 
 
 ## Organization and Person
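The two tagging mechanisms described in the added documentation text can be illustrated with a small sketch. This is not Docspell's implementation: the commit does not name the algorithm, so a naive Bayes word-count classifier is swapped in here purely as an assumption. All function names (`tokenize`, `train`, `predict`, `match_keywords`), the sample items, and the case-insensitive keyword comparison are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(items, category_tags):
    """Count words per tag, using only tags from the chosen category.

    items: list of (text, set_of_tags) pairs (hypothetical data shape).
    Tags outside category_tags, such as "todo", are ignored.
    """
    word_counts = defaultdict(Counter)  # tag -> word frequencies
    doc_counts = Counter()              # tag -> number of tagged items
    for text, tags in items:
        for tag in tags & category_tags:
            doc_counts[tag] += 1
            word_counts[tag].update(tokenize(text))
    return word_counts, doc_counts


def predict(text, model):
    """Return the most likely tag (naive Bayes, add-one smoothing)."""
    word_counts, doc_counts = model
    if not doc_counts:
        return None  # nothing learned yet
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_tag, best_score = None, -math.inf
    for tag in sorted(doc_counts):
        counts = word_counts[tag]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(doc_counts[tag] / total_docs)
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


def match_keywords(pdf_keywords, known_tags):
    """Return the known tags equal to a PDF keyword.

    Ignoring case and surrounding whitespace is an assumption.
    """
    wanted = {k.strip().lower() for k in pdf_keywords}
    return {t for t in known_tags if t.lower() in wanted}


# Hypothetical, already tagged items; "todo" and "waiting" are workflow tags.
items = [
    ("Invoice number 4711, amount due 120 EUR", {"invoice", "todo"}),
    ("Invoice for services, payment due in 30 days", {"invoice"}),
    ("Receipt: thank you for your purchase, paid in cash", {"receipt"}),
    ("Receipt for your purchase at the store, paid by card", {"receipt", "waiting"}),
]
model = train(items, category_tags={"invoice", "receipt"})
print(predict("Amount due on this invoice", model))
print(match_keywords(["Invoice", " urgent "], {"invoice", "receipt"}))
```

Restricting training to one tag category mirrors the reasoning in the text: workflow tags such as *todo* have no stable relation to the document text, so they are left out of the model. With only four items the sketch is a toy; the text's caveat about needing many correctly tagged documents applies to any real classifier.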