Improves and reorganizes how NLP pipelines are set up. Users can now
choose from many options, depending on their hardware and usage
scenario.
This is the basis for supporting more languages without depending on
what stanford-nlp supports. For such languages, support is then limited
to text extraction and simple regex-NER processing.
The container name can't be hard-coded, and each joex instance needs a
unique name. Since Docker always sets the `HOSTNAME` variable and
hostnames are unique per container, we can interpolate the hostname into
the joex app identifier and avoid creating multiple config files.
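
The idea can be illustrated with a minimal sketch. It is not the actual joex configuration (which does the interpolation in the config file itself); the function name `joex_app_id` and the `joex` prefix are hypothetical and only show how a unique identifier could be derived from `HOSTNAME`:

```python
import os
from typing import Mapping, Optional


def joex_app_id(prefix: str = "joex", env: Optional[Mapping[str, str]] = None) -> str:
    """Derive a per-container app id by appending the container hostname.

    `prefix` and this helper are illustrative assumptions, not Docspell names.
    """
    env = os.environ if env is None else env
    hostname = env.get("HOSTNAME", "")
    if not hostname:
        # Without a hostname we cannot guarantee uniqueness, so fail loudly.
        raise ValueError("HOSTNAME is not set; cannot derive a unique app id")
    return f"{prefix}-{hostname}"
```

Because every container gets its own hostname, two joex containers started from the same image and the same config end up with different identifiers.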
Currently these commands should be run in a single sbt session, since
the first one sets the Elm compilation mode to "prod". This makes the JS
a bit smaller by removing debug information.
- Now using a script in `/usr/local/bin` to override the default *ocrmypdf* version; this replaces the approach that used a bash function
- Also had to add a volume mapping to the docker call
**ATTENTION** the path /tmp/docspell-convert:/tmp/docspell-convert must be mapped when starting Docspell's docker image!
- if `/var/run/docker.sock` is found in the docker-container, this feature is activated - if not, nothing changes
- usage: mount bind `docker.sock` from host by using `-v` or `volumes:`
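
The activation rule in the last two points amounts to a simple existence check. The following is a hedged sketch only: the real check lives in the container's scripts, and the function name `docker_socket_available` is hypothetical.

```python
import os


def docker_socket_available(path: str = "/var/run/docker.sock") -> bool:
    """Return True if the Docker socket was bind-mounted into the container.

    Only checks that the path exists, mirroring "is found" in the note above;
    it does not verify that the Docker daemon actually answers on it.
    """
    return os.path.exists(path)


if __name__ == "__main__":
    if docker_socket_available():
        print("docker.sock found: feature activated")
    else:
        print("docker.sock not found: nothing changes")
```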
* new file: base.dockerfile
base docker file for compiling sources
deleted: build-images.sh
replaced by dev-build-images.sh
deleted: build-joex-base.sh
replaced by base / joex dockerfile
changed: consumedir.dockerfile
based on compiled tool-binaries from base image
added health check (basic, check for REST server connection)
new file: dev-build-images.sh
added one build script for all purposes
derives tag from version-file (all snapshots become latest)
new file: dev-push-images.sh
added one push script for all purposes (similar to build)
changed: docker-compose.yml
* changed regarding entrypoints and commands now being in the images
* added health checks for 3rd party images (postgres and solr)
* some minor renaming of the areas
renamed: entrypoint-joex.sh -> joex-entrypoint.sh
* for better order
renamed: joex-base.dockerfile -> joex.dockerfile
* also reworked to base on main base image
* plus renamed for better order
deleted: push-images.sh
* replaced by dev-push-images.sh
deleted: push-joex-base.sh
* not necessary anymore
changed: restserver.dockerfile
* reworked to be based on main base image
* smaller
* added health check
* updated docker-compose to new images
* update docker-compose.yml
remove unnecessary network entries
* update docker-compose.yml
added missing volume for postgres
* reverted image naming scheme and added log to docker build
1. go back to local code instead of cloning git
2. added build log to docker image build script, incl. log build times.
Logs can be found in `docker/dev-log` folder
3. added docker docs and new docker build logs folder to .gitignore
4. added
* build docker images from local files instead of cloned remote repo, plus time recording of builds
- switched way docker images are built from remote git repo to local files (which should be the git repo, but may have local changes)
- the docker build logs will show the time needed for the single image builds
* reverted deletion of joex base dockerfile
* joex base dockerfile plus smaller improvements
- separate joex base file
- added docker hook to improve Docker Hub experience
* updated docker hub build hook
corrected wrong path (base is build context)
* Fix of docker hub build hook again
base path seems to be the dockerfile's folder
* fixed typo in .dockerignore
* added ability to spool log to console instead of file, especially for automated docker builds
* improved logging of build script
* minor tweaks from review (.dockerignore, docker hub hook and an error when using other repos)
* added push of non-base images to automated docker hook
* fixes for docker hub build hook
* fixed/improved docker build hook
* replaced tag-version of untagged versions with SNAPSHOT (was LATEST, which should be used for stable tags only)
plus, made the version tag mandatory for the dockerfiles
* adapted docker build and push scripts for tagged images (using docker automated builds)
* fixed docker build hook
stupid copy & paste mistake...
* minor mistake in build hook
* added validation of matching version numbers for docker automated builds (for non-snapshot builds)
* fixed missing fi in new validity check
* fixed docker build hook validity check
* mixed up version comparison fixed
* relative path error in hook validation
* mixed up version comparison fixed
* test
* fixed error in version matching for docker hook
* test
* improved versioning, so that docker images are v0.00.00
* revert version.sbt
got overwritten by accident
* reverted version.sbt
* improved environment parameters, especially enabled setting DB params by them
- additionally added .env file to have the same env variables for all containers
* cleaned up docker-compose.yml to fit public origin repo again
* optimized way db params are set
figured out that we do not need the DB string to be built at startup; docspell.conf also reads multiple variables.
I still kept the restserver entrypoint, although we do not need it now - it might be helpful in the future or for debugging purposes
* added restart option to restart docspell, e.g. after a system reboot - but only if it was running before
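
The tagging rules from the list above (snapshot versions are tagged `SNAPSHOT`, release images are tagged like `v0.00.00`, and non-snapshot automated builds must match the version file) can be sketched as follows. This is an illustration under assumptions, not the actual build hook: the helper names are hypothetical, and the version file is assumed to contain a single quoted version string such as `"0.19.0-SNAPSHOT"`.

```python
import re

# Assumption: the version file holds one double-quoted version string.
_QUOTED = re.compile(r'"([^"]+)"')


def read_version(version_file_text: str) -> str:
    """Extract the first quoted string from the version file's content."""
    match = _QUOTED.search(version_file_text)
    if match is None:
        raise ValueError("no quoted version string found")
    return match.group(1)


def docker_tag(version: str) -> str:
    """Map a project version to an image tag."""
    if version.endswith("-SNAPSHOT"):
        return "SNAPSHOT"  # untagged/snapshot builds share one tag
    return "v" + version   # stable builds get a v-prefixed tag


def check_release_tag(requested_tag: str, version: str) -> None:
    """For non-snapshot builds, require the requested tag to match the version."""
    expected = docker_tag(version)
    if expected != "SNAPSHOT" and requested_tag != expected:
        raise ValueError(
            f"tag {requested_tag!r} does not match version file ({expected!r})"
        )
```

Keeping the mapping in one small function makes it harder to mix up the direction of the comparison, which is the kind of mistake several of the fixes above address.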
- Use another external tool to convert pdf to pdf which also adds the
extracted text as another layer into the pdf
- Although not used, the external conversion routine will now check
for an existing text file that is named as the pdf file with extension
`.txt`. If present it is included in the conversion result and will be
used as the extracted text.
- text extraction for pdf files happens now on the converted file,
because it may already contain the text from the conversion step and
thus avoids running OCR twice.
- All errors during conversion are not fatal; processing continues
without a converted file.
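
The behaviour described in the points above can be sketched roughly as below. This is a simplified illustration, not the actual implementation: `convert_with_sidecar` and `run_tool` are hypothetical names, and the text file is assumed to be the output file name with `.txt` appended (the exact naming is not specified here).

```python
from pathlib import Path
from typing import Callable, Optional, Tuple


def convert_with_sidecar(
    pdf: Path,
    out: Path,
    run_tool: Callable[[Path, Path], None],
) -> Tuple[Optional[Path], Optional[str]]:
    """Run an external pdf-to-pdf conversion and pick up an optional text file.

    Returns (converted_file, extracted_text); either may be None.
    """
    try:
        run_tool(pdf, out)
    except Exception:
        # Conversion errors are not fatal: continue without a converted file.
        return None, None
    if not out.is_file():
        return None, None
    # Assumption: the text file sits next to the result as "<name>.pdf.txt".
    sidecar = out.with_name(out.name + ".txt")
    text = sidecar.read_text(encoding="utf-8") if sidecar.is_file() else None
    return out, text
```

A caller would run text extraction on the returned file only when `extracted_text` is `None`, which is how running OCR twice is avoided.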