Extend consumedir.sh to work with integration endpoint

Now running one consumedir script can upload files to multiple
collectives separately.
This commit is contained in:
Eike Kettner 2020-06-28 00:08:37 +02:00
parent d13e0a4370
commit 8500d4d804
6 changed files with 269 additions and 55 deletions


@@ -8,21 +8,20 @@ permalink: doc/tools/consumedir

The `consumedir.sh` is a bash script that works in two modes:

- Go through all files in given directories (recursively, if `-r` is
  specified) and send each to docspell.
- Watch one or more directories for new files and upload them to
  docspell.

It can watch or go through one or more directories. Files can be
uploaded to multiple urls.

Run the script with the `-h` or `--help` option, to see a short help
text. The help text will also show the values for any given option.

The script requires `curl` for uploading. It requires the
`inotifywait` command if directories should be watched for new
files.
Example for watching two directories:
@@ -30,18 +29,69 @@ Example for watching two directories:

``` bash
./tools/consumedir.sh --path ~/Downloads --path ~/pdfs -m -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
```

The script by default watches the given directories. If the `-o` or
`--once` option is used, it will instead go through these directories
and upload all files in there.

Example for uploading all files immediately (the same as above, only
with `-o` added):

``` bash
$ consumedir.sh -o --path ~/Downloads --path ~/pdfs/ -m -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
```
The URL can be any docspell url that accepts uploads without
authentication. This is usually a [source
url](../uploading#anonymous-upload). It is also possible to use the
script with the [integration
endpoint](../uploading#integration-endpoint).
## Integration Endpoint
When given the `-i` or `--integration` option, the script changes its
behaviour slightly to work with the [integration
endpoint](../uploading#integration-endpoint).
First, `-i` implies `-r`, so the directories are watched or traversed
recursively. The script then assumes that files are not placed
directly into a directory given by `-p`, but into a sub-directory
named after a collective. To determine which collective a file belongs
to, the script uses the name of this first sub-directory.
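To make the mapping concrete, here is a small sketch (with hypothetical paths) of how the first path component below a watched directory yields the collective name:

``` bash
# Hypothetical watched directory and file; the first component of the
# path relative to the watched directory names the collective.
dir="/home/user/Downloads"
file="/home/user/Downloads/family/test.pdf"
rel="${file#"$dir"/}"     # strip the watched dir -> family/test.pdf
collective="${rel%%/*}"   # first path component  -> family
echo "$collective"
```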
If the endpoint is protected with a username/password or a fixed
header, these credentials can be specified with the `--iuser` and
`--iheader` options, respectively. The format is for both
`<name>:<value>`, so the username cannot contain a colon character
(but the password can).
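The script passes the `--iuser` value to curl as-is; splitting happens at the first colon only, which is why the password may itself contain colons. A quick illustration with made-up values:

``` bash
# Hypothetical credentials in <name>:<value> form.
iuser="alice:s3cret:with:colons"
user="${iuser%%:*}"   # everything before the first colon
pass="${iuser#*:}"    # the rest, colons allowed
echo "$user / $pass"
```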
Example:
``` bash
$ consumedir.sh -i -iheader 'Docspell-Integration:test123' -m -p ~/Downloads/ http://localhost:7880/api/v1/open/integration/item
```
The url is the integration endpoint url without the collective; the
collective name is appended by the script.
This watches the folder `~/Downloads`. If a file is placed directly
into this folder, say `~/Downloads/test.pdf`, the upload will fail,
because the collective cannot be determined. Create a subfolder below
`~/Downloads` with the name of a collective, for example
`~/Downloads/family` and place files somewhere below this `family`
subfolder, like `~/Downloads/family/test.pdf`.
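For example, a layout for a (hypothetical) collective named `family` can be created like this, shown here in a temporary directory:

``` bash
# Create the expected layout: watched dir / collective / files.
base=$(mktemp -d)
mkdir -p "$base/family"
: > "$base/family/test.pdf"
found=$(find "$base" -type f)
echo "$found"
rm -r "$base"
```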
## Duplicates
With the `-m` option, the script will not upload files that already
exist at docspell. For this the `sha256sum` command is required.
So you can move and rename files in those folders without worrying
about duplicates. This allows you to keep your files organized in the
file-system and have them mirrored into docspell as well.
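The check can be reproduced manually: the script computes the hex digest with `sha256sum` and sends it to the server's checkfile route. For a file containing the text `hello`:

``` bash
# Compute the checksum the way the script does (sha256sum, first field).
f=$(mktemp)
printf 'hello' > "$f"
sum=$(sha256sum "$f" | cut -d' ' -f1)
echo "$sum"
rm "$f"
```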
## Systemd
The script can be used with systemd to run as a service. This is an


@@ -112,8 +112,15 @@ the [configuration file](configure#rest-server).
If queried by a `GET` request, it returns whether it is enabled and
the collective exists.
It is also possible to check for existing files using their sha256
checksum with:
```
/api/v1/open/integration/checkfile/[collective-name]/[sha256-checksum]
```
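The resulting url for a given collective and checksum looks like this (host and values are examples):

``` bash
# Build the checkfile url from its parts (hypothetical base url).
base="http://localhost:7880/api/v1/open/integration"
collective="family"
sum="2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
url="$base/checkfile/$collective/$sum"
echo "$url"
```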
See the [SMTP gateway](tools/smtpgateway) or the [consumedir
script](tools/consumedir) for examples to use this endpoint.
## The Request


@@ -299,8 +299,8 @@ paths:
$ref: "#/components/schemas/BasicResult"
/open/integration/item/{id}:
get:
tags: [ Integration Endpoint ]
summary: Check if integration endpoint is available.
description: |
Allows to check whether an integration endpoint is enabled for
a collective. The collective is given by the `id` parameter.
@@ -325,7 +325,7 @@ paths:
401:
description: Unauthorized
post:
tags: [ Integration Endpoint ]
summary: Upload files to docspell.
description: |
Upload a file to docspell for processing. The id is a
@@ -368,6 +368,30 @@ paths:
application/json:
schema:
$ref: "#/components/schemas/BasicResult"
/open/integration/checkfile/{id}/{checksum}:
get:
tags: [ Integration Endpoint ]
summary: Check if a file is in docspell.
description: |
Checks if a file with the given SHA-256 checksum is in
docspell. The `id` is the *collective name*. This route only
exists if it is enabled in the configuration file.
The result shows all items that contain a file with the given
checksum.
security:
- authTokenHeader: []
parameters:
- $ref: "#/components/parameters/id"
- $ref: "#/components/parameters/checksum"
responses:
200:
description: Ok
content:
application/json:
schema:
$ref: "#/components/schemas/CheckFileResult"
/open/signup/register:
post:
tags: [ Registration ]


@@ -42,7 +42,7 @@ object CheckFileRoutes {
}
}
def convert(v: Vector[RItem]): CheckFileResult =
CheckFileResult(
v.nonEmpty,
v.map(r => BasicItem(r.id, r.name, r.direction, r.state, r.created, r.itemDate))


@@ -8,6 +8,7 @@ import docspell.common._
import docspell.restserver.Config
import docspell.restserver.conv.Conversions._
import docspell.restserver.http4s.Responses
import docspell.store.records.RItem
import org.http4s._
import org.http4s.circe.CirceEntityEncoder._
import org.http4s.dsl.Http4sDsl
@@ -24,12 +25,17 @@ object IntegrationEndpointRoutes {
val dsl = new Http4sDsl[F] {}
import dsl._
def validate(req: Request[F], collective: Ident) =
for {
_ <- authRequest(req, cfg.integrationEndpoint)
_ <- checkEnabled(cfg.integrationEndpoint)
_ <- lookupCollective(collective, backend)
} yield ()
HttpRoutes.of {
case req @ POST -> Root / "item" / Ident(collective) =>
(for {
_ <- validate(req, collective)
res <- EitherT.liftF[F, Response[F], Response[F]](
uploadFile(collective, backend, cfg, dsl)(req)
)
@@ -37,11 +43,20 @@ object IntegrationEndpointRoutes {
case req @ GET -> Root / "item" / Ident(collective) =>
(for {
_ <- validate(req, collective)
res <- EitherT.liftF[F, Response[F], Response[F]](Ok(()))
} yield res).fold(identity, identity)
case req @ GET -> Root / "checkfile" / Ident(collective) / checksum =>
(for {
_ <- validate(req, collective)
items <- EitherT.liftF[F, Response[F], Vector[RItem]](
backend.itemSearch.findByFileCollective(checksum, collective)
)
resp <-
EitherT.liftF[F, Response[F], Response[F]](Ok(CheckFileRoutes.convert(items)))
} yield resp).fold(identity, identity)
}
}


@@ -13,6 +13,7 @@ CURL_CMD="curl"
INOTIFY_CMD="inotifywait"
SHA256_CMD="sha256sum"
MKTEMP_CMD="mktemp"
CURL_OPTS=${CURL_OPTS:-}
! getopt --test > /dev/null
if [[ ${PIPESTATUS[0]} -ne 4 ]]; then
@@ -20,8 +21,8 @@ if [[ ${PIPESTATUS[0]} -ne 4 ]]; then
exit 1
fi
OPTIONS=omhdp:vrmi
LONGOPTS=once,distinct,help,delete,path:,verbose,recursive,dry,integration,iuser:,iheader:
! PARSED=$(getopt --options=$OPTIONS --longoptions=$LONGOPTS --name "$0" -- "$@")
if [[ ${PIPESTATUS[0]} -ne 0 ]]; then
@@ -35,6 +36,7 @@ eval set -- "$PARSED"
declare -a watchdir
help=n verbose=n delete=n once=n distinct=n recursive=n dryrun=n
integration=n iuser="" iheader=""
while true; do
case "$1" in
-h|--help)
@@ -69,6 +71,19 @@ while true; do
dryrun=y
shift
;;
-i|--integration)
integration=y
recursive=y
shift
;;
--iuser)
iuser="$2"
shift 2
;;
--iheader)
iheader="$2"
shift 2
;;
--)
shift
break
@@ -87,14 +102,27 @@ showUsage() {
echo "Usage: $0 [options] url url ..."
echo
echo "Options:"
echo " -v | --verbose Print more to stdout. (value: $verbose)"
echo " -d | --delete Delete the file if successfully uploaded. (value: $delete)"
echo " -p | --path <dir> The directories to watch. This is required. (value: ${watchdir[@]})"
echo " -h | --help Prints this help text. (value: $help)"
echo " -m | --distinct Optional. Upload only if the file doesn't already exist. (value: $distinct)"
echo " -o | --once Instead of watching, upload all files in that dir. (value: $once)"
echo " -r | --recursive Traverse the directory(ies) recursively (value: $recursive)"
echo " -i | --integration Upload to the integration endpoint. It implies -r. This puts the script in"
echo " a different mode, where the first subdirectory of any given starting point"
echo " is read as the collective name. The url(s) are completed with this name in"
echo " order to upload files to the respective collective. So each directory"
echo " given is expected to contain one subdirectory per collective and the urls"
echo " are expected to identify the integration endpoint, which is"
echo " /api/v1/open/integration/item/<collective-name>. (value: $integration)"
echo " --iheader The header name and value to use with the integration endpoint. This must be"
echo " in form 'headername:value'. Only used if '-i' is supplied."
echo " (value: $iheader)"
echo " --iuser The username and password for basic auth to use with the integration"
echo " endpoint. This must be of form 'user:pass'. Only used if '-i' is supplied."
echo " (value: $iuser)"
echo " --dry Do a 'dry run', not uploading anything only printing to stdout (value: $dryrun)"
echo ""
echo "Arguments:"
echo " A list of URLs to upload the files to."
@@ -105,6 +133,9 @@ showUsage() {
echo "Example: Upload all files in a directory"
echo "$0 --path ~/Downloads -m -dv --once http://localhost:7880/api/v1/open/upload/item/abcde-12345-abcde-12345"
echo ""
echo "Example: Integration Endpoint"
echo "$0 -i -iheader 'Docspell-Integration:test123' -m -p ~/Downloads/ http://localhost:7880/api/v1/open/integration/item"
echo ""
}
if [ "$help" = "y" ]; then
@@ -127,32 +158,67 @@ fi
trace() {
if [ "$verbose" = "y" ]; then
>&2 echo "$1"
fi
}
info() {
>&2 echo "$1"
}
getCollective() {
file="$(realpath -e "$1")"
dir="$(realpath -e "$2")"
collective=${file#"$dir"}
coll=$(echo "$collective" | cut -d'/' -f1)
if [ -z "$coll" ]; then
coll=$(echo "$collective" | cut -d'/' -f2)
fi
echo "$coll"
}
upload() {
dir="$(realpath -e "$1")"
file="$(realpath -e "$2")"
url="$3"
OPTS="$CURL_OPTS"
if [ "$integration" = "y" ]; then
collective=$(getCollective "$file" "$dir")
trace "- upload: collective = $collective"
url="$url/$collective"
if [ $iuser ]; then
OPTS="$OPTS --user $iuser"
fi
if [ $iheader ]; then
OPTS="$OPTS -H $iheader"
fi
fi
if [ "$dryrun" = "y" ]; then
info "- Not uploading (dry-run) $file to $url with opts $OPTS"
else
trace "- Uploading $file to $url with options $OPTS"
tf1=$($MKTEMP_CMD) tf2=$($MKTEMP_CMD)
$CURL_CMD --fail -# -o "$tf1" --stderr "$tf2" $OPTS -XPOST -F file=@"$file" "$url"
rc=$?
if [ $rc -ne 0 ]; then
info "Upload failed. Exit code: $rc"
cat "$tf1"
cat "$tf2"
echo ""
rm "$tf1" "$tf2"
return $rc
else
if grep -q '{"success":false' "$tf1"; then
echo "Upload failed. Message from server:"
cat "$tf1"
echo ""
rm "$tf1" "$tf2"
return 1
else
info "- Upload done."
rm "$tf1" "$tf2"
return 0
fi
fi
fi
}
@@ -162,28 +228,69 @@ checksum() {
}
checkFile() {
local url="$1"
local file="$2"
local dir="$3"
OPTS="$CURL_OPTS"
if [ "$integration" = "y" ]; then
collective=$(getCollective "$file" "$dir")
url="$url/$collective"
url=$(echo "$url" | sed 's,/item/,/checkfile/,g')
if [ $iuser ]; then
OPTS="$OPTS --user $iuser"
fi
if [ $iheader ]; then
OPTS="$OPTS -H $iheader"
fi
else
url=$(echo "$1" | sed 's,upload/item,checkfile,g')
fi
trace "- Check file: $url/$(checksum "$file")"
tf1=$($MKTEMP_CMD) tf2=$($MKTEMP_CMD)
$CURL_CMD --fail -o "$tf1" --stderr "$tf2" $OPTS -XGET -s "$url/$(checksum "$file")"
if [ $? -ne 0 ]; then
info "Checking file failed!"
cat "$tf1" >&2
cat "$tf2" >&2
info ""
rm "$tf1" "$tf2"
echo "failed"
return 1
else
if cat "$tf1" | grep -q '{"exists":true'; then
rm "$tf1" "$tf2"
echo "y"
else
rm "$tf1" "$tf2"
echo "n"
fi
fi
}
process() {
file="$(realpath -e "$1")"
dir="$2"
info "---- Processing $file ----------"
declare -i curlrc=0
set +e
for url in $urls; do
if [ "$distinct" = "y" ]; then
trace "- Checking if $file has been uploaded to $url already"
res=$(checkFile "$url" "$file" "$dir")
rc=$?
curlrc=$(expr $curlrc + $rc)
trace "- Result from checkfile: $res"
if [ "$res" = "y" ]; then
info "- Skipping file '$file' because it has been uploaded in the past."
continue
elif [ "$res" != "n" ]; then
info "- Checking file failed, skipping the file."
continue
fi
fi
trace "- Uploading '$file' to '$url'."
upload "$dir" "$file" "$url"
rc=$?
curlrc=$(expr $curlrc + $rc)
if [ $rc -ne 0 ]; then
@@ -207,6 +314,16 @@ process() {
fi
}
findDir() {
path="$1"
for dir in "${watchdir[@]}"; do
if [[ $path = ${dir}* ]]
then
echo $dir
fi
done
}
if [ "$once" = "y" ]; then
info "Uploading all files in '${watchdir[*]}'."
MD="-maxdepth 1"
@@ -215,7 +332,7 @@ if [ "$once" = "y" ]; then
fi
for dir in "${watchdir[@]}"; do
find "$dir" $MD -type f -print0 | while IFS= read -d '' -r file; do
process "$file" "$dir"
done
done
else
@@ -225,8 +342,9 @@ else
fi
$INOTIFY_CMD $REC -m "${watchdir[@]}" -e close_write -e moved_to |
while read path action file; do
dir=$(findDir "$path")
trace "The file '$file' appeared in directory '$path' below '$dir' via '$action'"
sleep 1
process "$(realpath -e "$path$file")" "$dir"
done
fi