diff --git a/modules/microsite/docs/doc/tools/consumedir.md b/modules/microsite/docs/doc/tools/consumedir.md
index f95fff0c..1369b67b 100644
--- a/modules/microsite/docs/doc/tools/consumedir.md
+++ b/modules/microsite/docs/doc/tools/consumedir.md
@@ -8,21 +8,20 @@ permalink: doc/tools/consumedir
 
 The `consumerdir.sh` is a bash script that works in two modes:
 
-- Go through all files in given directories (non recursively) and sent
-  each to docspell.
+- Go through all files in given directories (recursively, if `-r` is
+  specified) and send each to docspell.
 - Watch one or more directories for new files and upload them to
   docspell.
 
 It can watch or go through one or more directories. Files can be
 uploaded to multiple urls.
 
-Run the script with the `-h` option, to see a short help text. The
-help text will also show the values for any given option.
+Run the script with the `-h` or `--help` option to see a short help
+text. The help text will also show the values for any given option.
 
 The script requires `curl` for uploading. It requires the
 `inotifywait` command if directories should be watched for new
-files. If the `-m` option is used, the script will skip duplicate
-files. For this the `sha256sum` command is required.
+files.
 
 Example for watching two directories:
 
@@ -30,18 +29,69 @@ Example for watching two directories:
 ./tools/consumedir.sh --path ~/Downloads --path ~/pdfs -m -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
 ```
 
-The script by default watches the given directories. If the `-o`
-option is used, it will instead go through these directories and
-upload all files in there.
+The script by default watches the given directories. If the `-o` or
+`--once` option is used, it will instead go through these directories
+and upload all files in there.
 Example for uploading all immediately (the same as above, only with
 `-o` added):
 
 ``` bash
-./tools/consumedir.sh -o --path ~/Downloads --path ~/pdfs/ -m -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
+$ consumedir.sh -o --path ~/Downloads --path ~/pdfs/ -m -dv http://localhost:7880/api/v1/open/upload/item/5DxhjkvWf9S-CkWqF3Kr892-WgoCspFWDo7-XBykwCyAUxQ
 ```
 
+The URL can be any docspell url that accepts uploads without
+authentication. This is usually a [source
+url](../uploading#anonymous-upload). It is also possible to use the
+script with the [integration
+endpoint](../uploading#integration-endpoint).
+
+
+## Integration Endpoint
+
+When given the `-i` or `--integration` option, the script changes its
+behaviour slightly to work with the [integration
+endpoint](../uploading#integration-endpoint).
+
+First, if `-i` is given, it implies `-r`, so the directories are
+watched or traversed recursively. The script then assumes that there
+is a subfolder with the collective name. Files must not be placed
+directly into a folder given by `-p`, but below a sub-directory that
+matches a collective name. To know which collective a file belongs
+to, the script uses the first subfolder.
+
+If the endpoint is protected, the credentials can be specified with
+the `--iuser` and `--iheader` options, respectively. The format for
+both is `name:value`, so the username cannot contain a colon
+character (but the password can).
+
+Example:
+``` bash
+$ consumedir.sh -i --iheader 'Docspell-Integration:test123' -m -p ~/Downloads/ http://localhost:7880/api/v1/open/integration/item
+```
+
+The url is the integration endpoint url without the collective, as
+this is appended by the script.
+
+This watches the folder `~/Downloads`. If a file is placed directly
+in this folder, say `~/Downloads/test.pdf`, the upload will fail,
+because the collective cannot be determined.
+Create a subfolder below `~/Downloads` with the name of a collective,
+for example `~/Downloads/family`, and place files somewhere below
+this `family` subfolder, like `~/Downloads/family/test.pdf`.
+
+
+## Duplicates
+
+With the `-m` option, the script will not upload files that already
+exist in docspell. For this the `sha256sum` command is required.
+
+So you can move and rename files in those folders without worrying
+about duplicates. This allows you to keep your files organized using
+the file system and have them mirrored into docspell as well.
+
 
 ## Systemd
 
 The script can be used with systemd to run as a service. This is an
diff --git a/modules/microsite/docs/doc/uploading.md b/modules/microsite/docs/doc/uploading.md
index b827246c..1b6d8239 100644
--- a/modules/microsite/docs/doc/uploading.md
+++ b/modules/microsite/docs/doc/uploading.md
@@ -112,8 +112,15 @@ the [configuration file](configure#rest-server).
 If queried by a `GET` request, it returns whether it is enabled and
 the collective exists.
 
-See the [SMTP gateway](tools/smtpgateway) for an example to use this
-endpoint.
+It is also possible to check for existing files using their sha256
+checksum with:
+
+```
+/api/v1/open/integration/checkfile/[collective-name]/[sha256-checksum]
+```
+
+See the [SMTP gateway](tools/smtpgateway) or the [consumedir
+script](tools/consumedir) for examples of using this endpoint.
 
 ## The Request
 
diff --git a/modules/restapi/src/main/resources/docspell-openapi.yml b/modules/restapi/src/main/resources/docspell-openapi.yml
index aaf52fb5..aacca9b3 100644
--- a/modules/restapi/src/main/resources/docspell-openapi.yml
+++ b/modules/restapi/src/main/resources/docspell-openapi.yml
@@ -299,8 +299,8 @@ paths:
                 $ref: "#/components/schemas/BasicResult"
   /open/integration/item/{id}:
     get:
-      tags: [ Upload Integration ]
-      summary: Upload files to docspell.
+      tags: [ Integration Endpoint ]
+      summary: Check if integration endpoint is available.
       description: |
         Allows to check whether an integration endpoint is enabled
         for a collective. The collective is given by the `id`
         parameter.
@@ -325,7 +325,7 @@ paths:
         401:
           description: Unauthorized
     post:
-      tags: [ Upload Integration ]
+      tags: [ Integration Endpoint ]
       summary: Upload files to docspell.
       description: |
         Upload a file to docspell for processing. The id is a
@@ -368,6 +368,30 @@ paths:
             application/json:
               schema:
                 $ref: "#/components/schemas/BasicResult"
+  /open/integration/checkfile/{id}/{checksum}:
+    get:
+      tags: [ Integration Endpoint ]
+      summary: Check if a file is in docspell.
+      description: |
+        Checks if a file with the given SHA-256 checksum is in
+        docspell. The `id` is the *collective name*. This route only
+        exists if it is enabled in the configuration file.
+
+        The result shows all items that contain a file with the given
+        checksum.
+      security:
+        - authTokenHeader: []
+      parameters:
+        - $ref: "#/components/parameters/id"
+        - $ref: "#/components/parameters/checksum"
+      responses:
+        200:
+          description: Ok
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/CheckFileResult"
   /open/signup/register:
     post:
       tags: [ Registration ]
diff --git a/modules/restserver/src/main/scala/docspell/restserver/routes/CheckFileRoutes.scala b/modules/restserver/src/main/scala/docspell/restserver/routes/CheckFileRoutes.scala
index 3065761a..73defa96 100644
--- a/modules/restserver/src/main/scala/docspell/restserver/routes/CheckFileRoutes.scala
+++ b/modules/restserver/src/main/scala/docspell/restserver/routes/CheckFileRoutes.scala
@@ -42,7 +42,7 @@ object CheckFileRoutes {
       }
     }
 
-  private def convert(v: Vector[RItem]): CheckFileResult =
+  def convert(v: Vector[RItem]): CheckFileResult =
     CheckFileResult(
       v.nonEmpty,
       v.map(r => BasicItem(r.id, r.name, r.direction, r.state, r.created, r.itemDate))
diff --git a/modules/restserver/src/main/scala/docspell/restserver/routes/IntegrationEndpointRoutes.scala b/modules/restserver/src/main/scala/docspell/restserver/routes/IntegrationEndpointRoutes.scala
index 15ee7dfe..c6a6f67e 100644
--- a/modules/restserver/src/main/scala/docspell/restserver/routes/IntegrationEndpointRoutes.scala
+++ b/modules/restserver/src/main/scala/docspell/restserver/routes/IntegrationEndpointRoutes.scala
@@ -8,6 +8,7 @@ import docspell.common._
 import docspell.restserver.Config
 import docspell.restserver.conv.Conversions._
 import docspell.restserver.http4s.Responses
+import docspell.store.records.RItem
 import org.http4s._
 import org.http4s.circe.CirceEntityEncoder._
 import org.http4s.dsl.Http4sDsl
@@ -24,12 +25,17 @@ object IntegrationEndpointRoutes {
     val dsl = new Http4sDsl[F] {}
     import dsl._
 
+    def validate(req: Request[F], collective: Ident) =
+      for {
+        _ <- authRequest(req, cfg.integrationEndpoint)
+        _ <- checkEnabled(cfg.integrationEndpoint)
+        _ <- lookupCollective(collective, backend)
+      } yield ()
+
     HttpRoutes.of {
       case req @ POST -> Root / "item" / Ident(collective) =>
         (for {
-          _ <- authRequest(req, cfg.integrationEndpoint)
-          _ <- checkEnabled(cfg.integrationEndpoint)
-          _ <- lookupCollective(collective, backend)
+          _ <- validate(req, collective)
           res <- EitherT.liftF[F, Response[F], Response[F]](
             uploadFile(collective, backend, cfg, dsl)(req)
           )
@@ -37,11 +43,20 @@ object IntegrationEndpointRoutes {
 
       case req @ GET -> Root / "item" / Ident(collective) =>
         (for {
-          _ <- authRequest(req, cfg.integrationEndpoint)
-          _ <- checkEnabled(cfg.integrationEndpoint)
-          _ <- lookupCollective(collective, backend)
+          _ <- validate(req, collective)
           res <- EitherT.liftF[F, Response[F], Response[F]](Ok(()))
         } yield res).fold(identity, identity)
+
+      case req @ GET -> Root / "checkfile" / Ident(collective) / checksum =>
+        (for {
+          _ <- validate(req, collective)
+          items <- EitherT.liftF[F, Response[F], Vector[RItem]](
+            backend.itemSearch.findByFileCollective(checksum, collective)
+          )
+          resp <-
+            EitherT.liftF[F, Response[F],
+              Response[F]](Ok(CheckFileRoutes.convert(items)))
+        } yield resp).fold(identity, identity)
+
     }
   }
diff --git a/tools/consumedir.sh b/tools/consumedir.sh
index f5b01bc5..345a3f1a 100755
--- a/tools/consumedir.sh
+++ b/tools/consumedir.sh
@@ -13,6 +13,7 @@ CURL_CMD="curl"
 INOTIFY_CMD="inotifywait"
 SHA256_CMD="sha256sum"
 MKTEMP_CMD="mktemp"
+CURL_OPTS=${CURL_OPTS:-}
 
 ! getopt --test > /dev/null
 if [[ ${PIPESTATUS[0]} -ne 4 ]]; then
@@ -20,8 +21,8 @@ if [[ ${PIPESTATUS[0]} -ne 4 ]]; then
     exit 1
 fi
 
-OPTIONS=omhdp:vr
-LONGOPTS=once,distinct,help,delete,path:,verbose,recursive,dry
+OPTIONS=omhdp:vri
+LONGOPTS=once,distinct,help,delete,path:,verbose,recursive,dry,integration,iuser:,iheader:
 
 ! PARSED=$(getopt --options=$OPTIONS --longoptions=$LONGOPTS --name "$0" -- "$@")
 if [[ ${PIPESTATUS[0]} -ne 0 ]]; then
@@ -35,6 +36,7 @@ eval set -- "$PARSED"
 
 declare -a watchdir
 help=n verbose=n delete=n once=n distinct=n recursive=n dryrun=n
+integration=n iuser="" iheader=""
 
 while true; do
     case "$1" in
         -h|--help)
@@ -69,6 +71,19 @@ while true; do
             dryrun=y
             shift
             ;;
+        -i|--integration)
+            integration=y
+            recursive=y
+            shift
+            ;;
+        --iuser)
+            iuser="$2"
+            shift 2
+            ;;
+        --iheader)
+            iheader="$2"
+            shift 2
+            ;;
         --)
             shift
             break
@@ -87,14 +102,27 @@ showUsage() {
     echo "Usage: $0 [options] url url ..."
    echo
     echo "Options:"
-    echo " -v | --verbose    Print more to stdout. (value: $verbose)"
-    echo " -d | --delete     Delete the file if successfully uploaded. (value: $delete)"
-    echo " -p | --path       The directories to watch. This is required. (value: ${watchdir[@]})"
-    echo " -h | --help       Prints this help text. (value: $help)"
-    echo " -m | --distinct   Optional. Upload only if the file doesn't already exist. (value: $distinct)"
-    echo " -o | --once       Instead of watching, upload all files in that dir. (value: $once)"
-    echo " -r | --recursive  Traverse the directory(ies) recursively (value: $recursive)"
-    echo "      --dry        Do a 'dry run', not uploading anything only printing to stdout (value: $dryrun)"
+    echo " -v | --verbose      Print more to stdout. (value: $verbose)"
+    echo " -d | --delete       Delete the file if successfully uploaded. (value: $delete)"
+    echo " -p | --path         The directories to watch. This is required. (value: ${watchdir[@]})"
+    echo " -h | --help         Prints this help text. (value: $help)"
+    echo " -m | --distinct     Optional. Upload only if the file doesn't already exist. (value: $distinct)"
+    echo " -o | --once         Instead of watching, upload all files in that dir. (value: $once)"
+    echo " -r | --recursive    Traverse the directory(ies) recursively. (value: $recursive)"
+    echo " -i | --integration  Upload to the integration endpoint. It implies -r. This puts the script in"
+    echo "                     a different mode, where the first subdirectory of any given starting point"
+    echo "                     is read as the collective name. The url(s) are completed with this name in"
+    echo "                     order to upload files to the respective collective. So each directory"
+    echo "                     given is expected to contain one subdirectory per collective and the urls"
+    echo "                     are expected to identify the integration endpoint, which is"
+    echo "                     /api/v1/open/integration/item/. (value: $integration)"
+    echo "      --iheader      The header name and value to use with the integration endpoint. This must"
+    echo "                     be in the form 'headername:value'. Only used if '-i' is supplied."
+    echo "                     (value: $iheader)"
+    echo "      --iuser        The username and password for basic auth to use with the integration"
+    echo "                     endpoint. This must be of the form 'user:pass'. Only used if '-i' is"
+    echo "                     supplied. (value: $iuser)"
+    echo "      --dry          Do a 'dry run', not uploading anything, only printing to stdout. (value: $dryrun)"
     echo ""
     echo "Arguments:"
    echo "  A list of URLs to upload the files to."
@@ -105,6 +133,9 @@ showUsage() {
     echo "Example: Upload all files in a directory"
     echo "$0 --path ~/Downloads -m -dv --once http://localhost:7880/api/v1/open/upload/item/abcde-12345-abcde-12345"
     echo ""
+    echo "Example: Integration Endpoint"
+    echo "$0 -i --iheader 'Docspell-Integration:test123' -m -p ~/Downloads/ http://localhost:7880/api/v1/open/integration/item"
+    echo ""
 }
 
 if [ "$help" = "y" ]; then
@@ -127,32 +158,67 @@ fi
 
 trace() {
     if [ "$verbose" = "y" ]; then
-        echo "$1"
+        >&2 echo "$1"
     fi
 }
 
 info() {
-    echo $1
+    >&2 echo "$1"
 }
 
+getCollective() {
+    file="$(realpath -e "$1")"
+    dir="$(realpath -e "$2")"
+    collective=${file#"$dir"}
+    coll=$(echo "$collective" | cut -d'/' -f1)
+    if [ -z "$coll" ]; then
+        coll=$(echo "$collective" | cut -d'/' -f2)
+    fi
+    echo "$coll"
+}
+
+
 upload() {
+    dir="$(realpath -e "$1")"
+    file="$(realpath -e "$2")"
+    url="$3"
+    OPTS="$CURL_OPTS"
+    if [ "$integration" = "y" ]; then
+        collective=$(getCollective "$file" "$dir")
+        trace "- upload: collective = $collective"
+        url="$url/$collective"
+        if [ -n "$iuser" ]; then
+            OPTS="$OPTS --user $iuser"
+        fi
+        if [ -n "$iheader" ]; then
+            OPTS="$OPTS -H $iheader"
+        fi
+    fi
     if [ "$dryrun" = "y" ]; then
-        info "Not uploading (dry-run) $1 to $2"
+        info "- Not uploading (dry-run) $file to $url with opts $OPTS"
     else
-        tf=$($MKTEMP_CMD) rc=0
-        $CURL_CMD -# -o "$tf" --stderr "$tf" -w "%{http_code}" -XPOST -F file=@"$1" "$2" | (2>&1 1>/dev/null grep 200)
-        rc=$(expr $rc + $?)
-        cat $tf | (2>&1 1>/dev/null grep '{"success":true')
-        rc=$(expr $rc + $?)
-        if [ $rc -ne 0 ]; then
+        trace "- Uploading $file to $url with options $OPTS"
+        tf1=$($MKTEMP_CMD) tf2=$($MKTEMP_CMD) rc=0
+        $CURL_CMD --fail -# -o "$tf1" --stderr "$tf2" $OPTS -XPOST -F file=@"$file" "$url"
+        rc=$?
+        if [ $rc -ne 0 ]; then
             info "Upload failed. Exit code: $rc"
-            cat "$tf"
+            cat "$tf1"
+            cat "$tf2"
             echo ""
-            rm "$tf"
+            rm "$tf1" "$tf2"
             return $rc
         else
-            rm "$tf"
-            return 0
+            if cat "$tf1" | grep -q '{"success":false'; then
+                echo "Upload failed. Message from server:"
+                cat "$tf1"
+                echo ""
+                rm "$tf1" "$tf2"
+                return 1
+            else
+                info "- Upload done."
+                rm "$tf1" "$tf2"
+                return 0
+            fi
         fi
     fi
 }
@@ -162,28 +228,69 @@ checksum() {
 }
 
 checkFile() {
-    local url=$(echo "$1" | sed 's,upload/item,checkfile,g')
+    local url="$1"
     local file="$2"
-    trace "Check file: $url/$(checksum "$file")"
-    $CURL_CMD -XGET -s "$url/$(checksum "$file")" | (2>&1 1>/dev/null grep '"exists":true')
+    local dir="$3"
+    OPTS="$CURL_OPTS"
+    if [ "$integration" = "y" ]; then
+        collective=$(getCollective "$file" "$dir")
+        url="$url/$collective"
+        url=$(echo "$url" | sed 's,/item/,/checkfile/,g')
+        if [ -n "$iuser" ]; then
+            OPTS="$OPTS --user $iuser"
+        fi
+        if [ -n "$iheader" ]; then
+            OPTS="$OPTS -H $iheader"
+        fi
+    else
+        url=$(echo "$1" | sed 's,upload/item,checkfile,g')
+    fi
+    trace "- Check file: $url/$(checksum "$file")"
+    tf1=$($MKTEMP_CMD) tf2=$($MKTEMP_CMD)
+
+    $CURL_CMD --fail -o "$tf1" --stderr "$tf2" $OPTS -XGET -s "$url/$(checksum "$file")"
+    if [ $? -ne 0 ]; then
+        info "Checking file failed!"
+        cat "$tf1" >&2
+        cat "$tf2" >&2
+        info ""
+        rm "$tf1" "$tf2"
+        echo "failed"
+        return 1
+    else
+        if cat "$tf1" | grep -q '{"exists":true'; then
+            rm "$tf1" "$tf2"
+            echo "y"
+        else
+            rm "$tf1" "$tf2"
+            echo "n"
+        fi
+    fi
 }
 
 process() {
-    file="$1"
+    file="$(realpath -e "$1")"
+    dir="$2"
     info "---- Processing $file ----------"
     declare -i curlrc=0
     set +e
     for url in $urls; do
         if [ "$distinct" = "y" ]; then
             trace "- Checking if $file has been uploaded to $url already"
-            checkFile "$url" "$file"
-            if [ $? -eq 0 ]; then
+            res=$(checkFile "$url" "$file" "$dir")
+            rc=$?
+            curlrc=$(expr $curlrc + $rc)
+            trace "- Result from checkfile: $res"
+            if [ "$res" = "y" ]; then
                 info "- Skipping file '$file' because it has been uploaded in the past."
                 continue
+            elif [ "$res" != "n" ]; then
+                info "- Checking file failed, skipping the file."
+                continue
             fi
         fi
         trace "- Uploading '$file' to '$url'."
-        upload "$file" "$url"
+        upload "$dir" "$file" "$url"
         rc=$?
         curlrc=$(expr $curlrc + $rc)
         if [ $rc -ne 0 ]; then
@@ -207,6 +314,16 @@ process() {
     fi
 }
 
+findDir() {
+    path="$1"
+    for dir in "${watchdir[@]}"; do
+        if [[ $path = ${dir}* ]]
+        then
+            echo "$dir"
+        fi
+    done
+}
+
 if [ "$once" = "y" ]; then
     info "Uploading all files in '$watchdir'."
     MD="-maxdepth 1"
@@ -215,7 +332,7 @@ if [ "$once" = "y" ]; then
     fi
     for dir in "${watchdir[@]}"; do
         find "$dir" $MD -type f -print0 | while IFS= read -d '' -r file; do
-            process "$file"
+            process "$file" "$dir"
         done
     done
 else
@@ -225,8 +342,9 @@ else
     fi
     $INOTIFY_CMD $REC -m "${watchdir[@]}" -e close_write -e moved_to |
         while read path action file; do
-            trace "The file '$file' appeared in directory '$path' via '$action'"
+            dir=$(findDir "$path")
+            trace "The file '$file' appeared in directory '$path' below '$dir' via '$action'"
             sleep 1
-            process "$path$file"
+            process "$(realpath -e "$path$file")" "$dir"
         done
 fi
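
---

The collective lookup that the `-i` mode in this patch relies on (the `getCollective` function) boils down to stripping the watch-directory prefix from the file path and taking the first remaining path component. A minimal standalone sketch of that logic, using bash parameter expansion instead of the script's `realpath`/`cut` approach (the paths below are made up for illustration):

```bash
#!/usr/bin/env bash
# Sketch of the collective lookup used by consumedir.sh -i:
# the first directory below the watched dir names the collective.
get_collective() {
    local file="$1" watch="$2"
    local rel="${file#"$watch"}"   # strip the watch-dir prefix
    rel="${rel#/}"                 # drop a leading slash, if any
    echo "${rel%%/*}"              # first path component
}

get_collective "/home/user/Downloads/family/test.pdf" "/home/user/Downloads"
# prints: family
```

With this name in hand, the script appends it to the configured integration url, yielding e.g. `/api/v1/open/integration/item/family`, which is why files placed directly in the watched directory (with no collective subfolder) cannot be uploaded.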