Extend consumedir.sh to work with integration endpoint

Now running one consumedir script can upload files to multiple
collectives separately.
Eike Kettner 2020-06-28 00:08:37 +02:00
parent d13e0a4370
commit 8500d4d804
6 changed files with 269 additions and 55 deletions

View File

@@ -8,21 +8,20 @@ permalink: doc/tools/consumedir
 The `consumerdir.sh` is a bash script that works in two modes:
 
-- Go through all files in given directories (non recursively) and send
-  each to docspell.
+- Go through all files in given directories (recursively, if `-r` is
+  specified) and send each to docspell.
 - Watch one or more directories for new files and upload them to
   docspell.
 
 It can watch or go through one or more directories. Files can be
 uploaded to multiple urls.
 
-Run the script with the `-h` option, to see a short help text. The
-help text will also show the values for any given option.
+Run the script with the `-h` or `--help` option to see a short help
+text. The help text will also show the values for any given option.
 
 The script requires `curl` for uploading. It requires the
 `inotifywait` command if directories should be watched for new
-files. If the `-m` option is used, the script will skip duplicate
-files. For this the `sha256sum` command is required.
+files.
 
 Example for watching two directories:
@@ -30,18 +29,69 @@ Example for watching two directories:
 ./tools/consumedir.sh --path ~/Downloads --path ~/pdfs -m -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
 ```
 
-The script by default watches the given directories. If the `-o`
-option is used, it will instead go through these directories and
-upload all files in there.
+The script by default watches the given directories. If the `-o` or
+`--once` option is used, it will instead go through these directories
+and upload all files in there.
 
 Example for uploading all immediately (the same as above, only with
 `-o` added):
 
 ``` bash
-./tools/consumedir.sh -o --path ~/Downloads --path ~/pdfs/ -m -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
+$ consumedir.sh -o --path ~/Downloads --path ~/pdfs/ -m -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
 ```
 
+The URL can be any docspell url that accepts uploads without
+authentication. This is usually a [source
+url](../uploading#anonymous-upload). It is also possible to use the
+script with the [integration
+endpoint](../uploading#integration-endpoint).
+
+## Integration Endpoint
+
+When given the `-i` or `--integration` option, the script changes its
+behaviour slightly to work with the [integration
+endpoint](../uploading#integration-endpoint).
+
+First, `-i` implies `-r`, so the directories are watched or traversed
+recursively. The script then assumes that there is a subfolder named
+after the collective. Files must not be placed directly into a folder
+given by `-p`, but below a sub-directory that matches a collective
+name. To know which collective a file belongs to, the script uses the
+first subfolder below the watched directory.
+
+If the endpoint is protected, the credentials can be given with the
+`--iheader` or `--iuser` argument, respectively. The format for both
+is `<name>:<value>`, so the username cannot contain a colon character
+(but the password can).
+
+Example:
+
+``` bash
+$ consumedir.sh -i --iheader 'Docspell-Integration:test123' -m -p ~/Downloads/ http://localhost:7880/api/v1/open/integration/item
+```
+
+The url is the integration endpoint url without the collective, since
+the collective name is appended by the script.
+
+This watches the folder `~/Downloads`. If a file is placed directly in
+this folder, say `~/Downloads/test.pdf`, the upload will fail, because
+the collective cannot be determined. Create a subfolder below
+`~/Downloads` with the name of a collective, for example
+`~/Downloads/family`, and place files somewhere below this `family`
+subfolder, like `~/Downloads/family/test.pdf`.
+
+## Duplicates
+
+With the `-m` option, the script will not upload files that already
+exist in docspell. For this the `sha256sum` command is required. You
+can then move and rename files in those folders without worrying about
+duplicates. This allows you to keep your files organized in the
+file-system and have them mirrored into docspell as well.
+
 ## Systemd
 
 The script can be used with systemd to run as a service. This is an
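The subfolder rule of the integration mode can be illustrated with a small standalone sketch. The paths and the `family` collective below are example values, and the simplified helper only mirrors the idea of the script's `getCollective` function, not its exact code:

```shell
# Derive the collective name from the first path segment below a
# watched directory -- a simplified sketch of the script's
# getCollective helper (example paths, not a drop-in replacement).
getCollective() {
    file="$1"
    dir="$2"
    rel=${file#"$dir"}   # strip the watched-directory prefix
    rel=${rel#/}         # drop a leading slash, if any
    echo "${rel%%/*}"    # first remaining path segment
}

getCollective "/home/user/Downloads/family/test.pdf" "/home/user/Downloads"
# prints: family
```

So `~/Downloads/family/test.pdf` is uploaded to the `family` collective, while a file placed directly in `~/Downloads` yields no collective name and the upload fails.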

View File

@@ -112,8 +112,15 @@ the [configuration file](configure#rest-server).
 If queried by a `GET` request, it returns whether it is enabled and
 the collective exists.
 
-See the [SMTP gateway](tools/smtpgateway) for an example to use this
-endpoint.
+It is also possible to check for existing files using their sha256
+checksum with:
+
+```
+/api/v1/open/integration/checkfile/[collective-name]/[sha256-checksum]
+```
+
+See the [SMTP gateway](tools/smtpgateway) or the [consumedir
+script](tools/consumedir) for examples of using this endpoint.
 
 ## The Request
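For illustration, a client could compute the checksum locally with `sha256sum` and build the route shown above. The host and the `family` collective are placeholder values, and the `Docspell-Integration` header value is an example:

```shell
# Build the checkfile URL for a document (placeholder host and
# collective). sha256sum prints "<checksum>  <file>", so cut out
# the first field.
f=$(mktemp)
printf 'hello' > "$f"
sum=$(sha256sum "$f" | cut -d' ' -f1)
url="http://localhost:7880/api/v1/open/integration/checkfile/family/$sum"
echo "$url"
rm "$f"
# With a running server, query it like:
#   curl -s -H 'Docspell-Integration:test123' "$url"
```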

View File

@@ -299,8 +299,8 @@ paths:
               $ref: "#/components/schemas/BasicResult"
   /open/integration/item/{id}:
     get:
-      tags: [ Upload Integration ]
-      summary: Upload files to docspell.
+      tags: [ Integration Endpoint ]
+      summary: Check if integration endpoint is available.
       description: |
         Allows to check whether an integration endpoint is enabled for
         a collective. The collective is given by the `id` parameter.
@@ -325,7 +325,7 @@ paths:
         401:
           description: Unauthorized
     post:
-      tags: [ Upload Integration ]
+      tags: [ Integration Endpoint ]
       summary: Upload files to docspell.
       description: |
         Upload a file to docspell for processing. The id is a
@@ -368,6 +368,30 @@ paths:
             application/json:
               schema:
                 $ref: "#/components/schemas/BasicResult"
+  /open/integration/checkfile/{id}/{checksum}:
+    get:
+      tags: [ Integration Endpoint ]
+      summary: Check if a file is in docspell.
+      description: |
+        Checks if a file with the given SHA-256 checksum is in
+        docspell. The `id` is the *collective name*. This route only
+        exists if it is enabled in the configuration file.
+
+        The result shows all items that contain a file with the given
+        checksum.
+      security:
+        - authTokenHeader: []
+      parameters:
+        - $ref: "#/components/parameters/id"
+        - $ref: "#/components/parameters/checksum"
+      responses:
+        200:
+          description: Ok
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/CheckFileResult"
   /open/signup/register:
     post:
       tags: [ Registration ]

View File

@@ -42,7 +42,7 @@ object CheckFileRoutes {
       }
     }
 
-  private def convert(v: Vector[RItem]): CheckFileResult =
+  def convert(v: Vector[RItem]): CheckFileResult =
     CheckFileResult(
       v.nonEmpty,
       v.map(r => BasicItem(r.id, r.name, r.direction, r.state, r.created, r.itemDate))

View File

@@ -8,6 +8,7 @@ import docspell.common._
 import docspell.restserver.Config
 import docspell.restserver.conv.Conversions._
 import docspell.restserver.http4s.Responses
+import docspell.store.records.RItem
 import org.http4s._
 import org.http4s.circe.CirceEntityEncoder._
 import org.http4s.dsl.Http4sDsl
@@ -24,12 +25,17 @@ object IntegrationEndpointRoutes {
     val dsl = new Http4sDsl[F] {}
     import dsl._
 
+    def validate(req: Request[F], collective: Ident) =
+      for {
+        _ <- authRequest(req, cfg.integrationEndpoint)
+        _ <- checkEnabled(cfg.integrationEndpoint)
+        _ <- lookupCollective(collective, backend)
+      } yield ()
+
     HttpRoutes.of {
       case req @ POST -> Root / "item" / Ident(collective) =>
         (for {
-          _   <- authRequest(req, cfg.integrationEndpoint)
-          _   <- checkEnabled(cfg.integrationEndpoint)
-          _   <- lookupCollective(collective, backend)
+          _   <- validate(req, collective)
           res <- EitherT.liftF[F, Response[F], Response[F]](
                    uploadFile(collective, backend, cfg, dsl)(req)
                  )
@@ -37,11 +43,20 @@ object IntegrationEndpointRoutes {
 
       case req @ GET -> Root / "item" / Ident(collective) =>
         (for {
-          _   <- authRequest(req, cfg.integrationEndpoint)
-          _   <- checkEnabled(cfg.integrationEndpoint)
-          _   <- lookupCollective(collective, backend)
+          _   <- validate(req, collective)
           res <- EitherT.liftF[F, Response[F], Response[F]](Ok(()))
         } yield res).fold(identity, identity)
+
+      case req @ GET -> Root / "checkfile" / Ident(collective) / checksum =>
+        (for {
+          _     <- validate(req, collective)
+          items <- EitherT.liftF[F, Response[F], Vector[RItem]](
+                     backend.itemSearch.findByFileCollective(checksum, collective)
+                   )
+          resp  <- EitherT.liftF[F, Response[F], Response[F]](
+                     Ok(CheckFileRoutes.convert(items))
+                   )
+        } yield resp).fold(identity, identity)
     }
   }
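The routes above can be exercised with plain `curl`. A hedged sketch that only prints the commands (host, header value and the `family` collective are placeholders, the checksum is left as a placeholder too, and the endpoint must be enabled in the server configuration):

```shell
# Print curl invocations for the three integration routes; all
# concrete values below are placeholders.
base="http://localhost:7880/api/v1/open/integration"
hdr='Docspell-Integration:test123'

# GET: is the endpoint enabled for this collective?
echo "curl -H '$hdr' $base/item/family"
# POST: upload a file for the collective
echo "curl -H '$hdr' -XPOST -F file=@test.pdf $base/item/family"
# GET: is a file with this checksum already there?
echo "curl -H '$hdr' $base/checkfile/family/<sha256-checksum>"
```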

View File

@@ -13,6 +13,7 @@ CURL_CMD="curl"
 INOTIFY_CMD="inotifywait"
 SHA256_CMD="sha256sum"
 MKTEMP_CMD="mktemp"
+CURL_OPTS=${CURL_OPTS:-}
 
 ! getopt --test > /dev/null
 if [[ ${PIPESTATUS[0]} -ne 4 ]]; then
@@ -20,8 +21,8 @@ if [[ ${PIPESTATUS[0]} -ne 4 ]]; then
     exit 1
 fi
 
-OPTIONS=omhdp:vr
-LONGOPTS=once,distinct,help,delete,path:,verbose,recursive,dry
+OPTIONS=omhdp:vrmi
+LONGOPTS=once,distinct,help,delete,path:,verbose,recursive,dry,integration,iuser:,iheader:
 
 ! PARSED=$(getopt --options=$OPTIONS --longoptions=$LONGOPTS --name "$0" -- "$@")
 if [[ ${PIPESTATUS[0]} -ne 0 ]]; then
@@ -35,6 +36,7 @@ eval set -- "$PARSED"
 
 declare -a watchdir
 help=n verbose=n delete=n once=n distinct=n recursive=n dryrun=n
+integration=n iuser="" iheader=""
 
 while true; do
     case "$1" in
         -h|--help)
@@ -69,6 +71,19 @@ while true; do
             dryrun=y
             shift
             ;;
+        -i|--integration)
+            integration=y
+            recursive=y
+            shift
+            ;;
+        --iuser)
+            iuser="$2"
+            shift 2
+            ;;
+        --iheader)
+            iheader="$2"
+            shift 2
+            ;;
         --)
             shift
             break
@@ -87,14 +102,27 @@ showUsage() {
     echo "Usage: $0 [options] url url ..."
     echo
    echo "Options:"
    echo "  -v | --verbose      Print more to stdout. (value: $verbose)"
    echo "  -d | --delete       Delete the file if successfully uploaded. (value: $delete)"
    echo "  -p | --path <dir>   The directories to watch. This is required. (value: ${watchdir[@]})"
    echo "  -h | --help         Prints this help text. (value: $help)"
    echo "  -m | --distinct     Optional. Upload only if the file doesn't already exist. (value: $distinct)"
    echo "  -o | --once         Instead of watching, upload all files in that dir. (value: $once)"
    echo "  -r | --recursive    Traverse the directory(ies) recursively (value: $recursive)"
-    echo "  --dry               Do a 'dry run', not uploading anything only printing to stdout (value: $dryrun)"
+    echo "  -i | --integration  Upload to the integration endpoint. It implies -r. This puts the script in"
+    echo "                      a different mode, where the first subdirectory of any given starting point"
+    echo "                      is read as the collective name. The url(s) are completed with this name in"
+    echo "                      order to upload files to the respective collective. So each directory"
+    echo "                      given is expected to contain one subdirectory per collective and the urls"
+    echo "                      are expected to identify the integration endpoint, which is"
+    echo "                      /api/v1/open/integration/item/<collective-name>. (value: $integration)"
+    echo "  --iheader           The header name and value to use with the integration endpoint. This must be"
+    echo "                      in form 'headername:value'. Only used if '-i' is supplied."
+    echo "                      (value: $iheader)"
+    echo "  --iuser             The username and password for basic auth to use with the integration"
+    echo "                      endpoint. This must be of form 'user:pass'. Only used if '-i' is supplied."
+    echo "                      (value: $iuser)"
+    echo "  --dry               Do a 'dry run', not uploading anything only printing to stdout (value: $dryrun)"
    echo ""
    echo "Arguments:"
    echo "  A list of URLs to upload the files to."
@@ -105,6 +133,9 @@ showUsage() {
     echo "Example: Upload all files in a directory"
     echo "$0 --path ~/Downloads -m -dv --once http://localhost:7880/api/v1/open/upload/item/abcde-12345-abcde-12345"
     echo ""
+    echo "Example: Integration Endpoint"
+    echo "$0 -i --iheader 'Docspell-Integration:test123' -m -p ~/Downloads/ http://localhost:7880/api/v1/open/integration/item"
+    echo ""
 }
 
 if [ "$help" = "y" ]; then
@@ -127,32 +158,67 @@ fi
 
 trace() {
     if [ "$verbose" = "y" ]; then
-        echo "$1"
+        echo "$1" >&2
     fi
 }
 
 info() {
-    echo $1
+    echo "$1" >&2
 }
 
+getCollective() {
+    file="$(realpath -e "$1")"
+    dir="$(realpath -e "$2")"
+    collective=${file#"$dir"}
+    coll=$(echo "$collective" | cut -d'/' -f1)
+    if [ -z "$coll" ]; then
+        coll=$(echo "$collective" | cut -d'/' -f2)
+    fi
+    echo "$coll"
+}
+
 upload() {
+    dir="$(realpath -e "$1")"
+    file="$(realpath -e "$2")"
+    url="$3"
+    OPTS="$CURL_OPTS"
+    if [ "$integration" = "y" ]; then
+        collective=$(getCollective "$file" "$dir")
+        trace "- upload: collective = $collective"
+        url="$url/$collective"
+        if [ -n "$iuser" ]; then
+            OPTS="$OPTS --user $iuser"
+        fi
+        if [ -n "$iheader" ]; then
+            OPTS="$OPTS -H $iheader"
+        fi
+    fi
     if [ "$dryrun" = "y" ]; then
-        info "Not uploading (dry-run) $1 to $2"
+        info "- Not uploading (dry-run) $file to $url with opts $OPTS"
     else
-        tf=$($MKTEMP_CMD) rc=0
-        $CURL_CMD -# -o "$tf" --stderr "$tf" -w "%{http_code}" -XPOST -F file=@"$1" "$2" | (2>&1 1>/dev/null grep 200)
-        rc=$(expr $rc + $?)
-        cat $tf | (2>&1 1>/dev/null grep '{"success":true')
-        rc=$(expr $rc + $?)
+        trace "- Uploading $file to $url with options $OPTS"
+        tf1=$($MKTEMP_CMD) tf2=$($MKTEMP_CMD)
+        $CURL_CMD --fail -# -o "$tf1" --stderr "$tf2" $OPTS -XPOST -F file=@"$file" "$url"
+        rc=$?
         if [ $rc -ne 0 ]; then
             info "Upload failed. Exit code: $rc"
-            cat "$tf"
+            cat "$tf1"
+            cat "$tf2"
             echo ""
-            rm "$tf"
+            rm "$tf1" "$tf2"
             return $rc
         else
-            rm "$tf"
-            return 0
+            if grep -q '{"success":false' "$tf1"; then
+                info "Upload failed. Message from server:"
+                cat "$tf1"
+                echo ""
+                rm "$tf1" "$tf2"
+                return 1
+            else
+                info "- Upload done."
+                rm "$tf1" "$tf2"
+                return 0
+            fi
         fi
     fi
 }
@@ -162,28 +228,69 @@ checksum() {
 }
 
 checkFile() {
-    local url=$(echo "$1" | sed 's,upload/item,checkfile,g')
+    local url="$1"
     local file="$2"
-    trace "Check file: $url/$(checksum "$file")"
-    $CURL_CMD -XGET -s "$url/$(checksum "$file")" | (2>&1 1>/dev/null grep '"exists":true')
+    local dir="$3"
+    OPTS="$CURL_OPTS"
+    if [ "$integration" = "y" ]; then
+        collective=$(getCollective "$file" "$dir")
+        url="$url/$collective"
+        url=$(echo "$url" | sed 's,/item/,/checkfile/,g')
+        if [ -n "$iuser" ]; then
+            OPTS="$OPTS --user $iuser"
+        fi
+        if [ -n "$iheader" ]; then
+            OPTS="$OPTS -H $iheader"
+        fi
+    else
+        url=$(echo "$1" | sed 's,upload/item,checkfile,g')
+    fi
+    trace "- Check file: $url/$(checksum "$file")"
+    tf1=$($MKTEMP_CMD) tf2=$($MKTEMP_CMD)
+    $CURL_CMD --fail -o "$tf1" --stderr "$tf2" $OPTS -XGET -s "$url/$(checksum "$file")"
+    if [ $? -ne 0 ]; then
+        info "Checking file failed!"
+        cat "$tf1" >&2
+        cat "$tf2" >&2
+        info ""
+        rm "$tf1" "$tf2"
+        echo "failed"
+        return 1
+    else
+        if grep -q '{"exists":true' "$tf1"; then
+            echo "y"
+        else
+            echo "n"
+        fi
+        rm "$tf1" "$tf2"
+    fi
 }
 
 process() {
-    file="$1"
+    file="$(realpath -e "$1")"
+    dir="$2"
     info "---- Processing $file ----------"
     declare -i curlrc=0
     set +e
     for url in $urls; do
         if [ "$distinct" = "y" ]; then
             trace "- Checking if $file has been uploaded to $url already"
-            checkFile "$url" "$file"
-            if [ $? -eq 0 ]; then
+            res=$(checkFile "$url" "$file" "$dir")
+            rc=$?
+            curlrc=$(expr $curlrc + $rc)
+            trace "- Result from checkfile: $res"
+            if [ "$res" = "y" ]; then
                 info "- Skipping file '$file' because it has been uploaded in the past."
                 continue
+            elif [ "$res" != "n" ]; then
+                info "- Checking the file failed, skipping it."
+                continue
             fi
         fi
         trace "- Uploading '$file' to '$url'."
-        upload "$file" "$url"
+        upload "$dir" "$file" "$url"
         rc=$?
         curlrc=$(expr $curlrc + $rc)
         if [ $rc -ne 0 ]; then
@@ -207,6 +314,16 @@ process() {
     fi
 }
 
+findDir() {
+    path="$1"
+    for dir in "${watchdir[@]}"; do
+        if [[ $path = ${dir}* ]]; then
+            echo "$dir"
+        fi
+    done
+}
+
 if [ "$once" = "y" ]; then
     info "Uploading all files in '$watchdir'."
     MD="-maxdepth 1"
@@ -215,7 +332,7 @@ if [ "$once" = "y" ]; then
     fi
 
     for dir in "${watchdir[@]}"; do
         find "$dir" $MD -type f -print0 | while IFS= read -d '' -r file; do
-            process "$file"
+            process "$file" "$dir"
         done
     done
 else
@@ -225,8 +342,9 @@ else
     fi
 
     $INOTIFY_CMD $REC -m "${watchdir[@]}" -e close_write -e moved_to |
         while read path action file; do
-            trace "The file '$file' appeared in directory '$path' via '$action'"
+            dir=$(findDir "$path")
+            trace "The file '$file' appeared in directory '$path' below '$dir' via '$action'"
             sleep 1
-            process "$path$file"
+            process "$(realpath -e "$path$file")" "$dir"
         done
 fi
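The `findDir` helper added above maps a path reported by `inotifywait` back to the watched root it lies under, so `process` can derive the collective relative to that root. The prefix match can be sketched in isolation (the directories are example values; the real helper reads the `watchdir` array instead of taking arguments):

```shell
# Return the first watched root that a path lies under -- a
# standalone sketch of the script's findDir helper.
findDir() {
    path="$1"; shift
    for dir in "$@"; do
        case "$path" in
            "$dir"*) echo "$dir"; return 0 ;;
        esac
    done
    return 1
}

findDir "/data/in/family/test.pdf" "/data/in" "/data/other"
# prints: /data/in
```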