Commit Graph

30 Commits

Author SHA1 Message Date
Eike Kettner
26dff18ae0 Add spanish as an example
Adding a new language without nlp support now only requires filling in
these pieces:

- define a list of month names to support date recognition
- add it to joex' dockerfile to be available for tesseract
- update the solr migration/field definitions
- update the elm file so it shows up on the client
2021-01-18 17:41:40 +01:00
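The first item in commit 26dff18ae0 (a list of month names for date recognition) could look roughly like the sketch below. This is only an illustration in Python, not Docspell's actual code. `SPANISH_MONTHS`, `DATE_RE` and `find_dates` are hypothetical names, and the "day [de] month [de] year" pattern is an assumption about how such dates are written.

```python
import re
from datetime import date

# Hypothetical month list for Spanish; the index in the list gives the month number.
SPANISH_MONTHS = [
    "enero", "febrero", "marzo", "abril", "mayo", "junio",
    "julio", "agosto", "septiembre", "octubre", "noviembre", "diciembre",
]
MONTH_INDEX = {name: i + 1 for i, name in enumerate(SPANISH_MONTHS)}

# Assumed pattern: "18 de enero de 2021" or "18 enero 2021".
DATE_RE = re.compile(
    r"\b(\d{1,2})\s+(?:de\s+)?(" + "|".join(SPANISH_MONTHS) + r")\s+(?:de\s+)?(\d{4})\b",
    re.IGNORECASE,
)


def find_dates(text):
    """Return all dates in `text` that are written with a Spanish month name."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), MONTH_INDEX[month.lower()], int(day)))
        except ValueError:
            # Skip impossible dates such as "31 de febrero".
            continue
    return found
```

For example, `find_dates("Factura del 18 de enero de 2021")` returns a list containing `date(2021, 1, 18)`.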
Eike Kettner
f01646aeb5 Reorganize nlp pipeline and add nlp-unsupported language italian
Improves and reorganizes how nlp pipelines are setup. Now users can
choose from many options, depending on their hardware and usage
scenario.

This is the base to use more languages without depending on what
stanford-nlp supports. Support then involves only text extraction and
simple regex-ner processing.
2021-01-18 17:41:40 +01:00
Bo Jeanes
36c29812c7 Allow scaling joex with docker-compose up --scale
Container name can't be hard coded and each joex instance needs a unique
name. Since Docker always sets the `HOSTNAME` variable and these are
unique, we can just interpolate the hostname into the joex app
identifier, to avoid creating multiple config files.
2021-01-09 10:33:11 +11:00
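The idea in commit 36c29812c7 (deriving a unique joex identifier from Docker's `HOSTNAME` variable) can be sketched as follows. This is a Python illustration only; the commit itself interpolates the variable in the joex configuration. The function name `joex_app_id` and the `joex-` prefix are assumptions made for this example.

```python
import os


def joex_app_id(env=None, prefix="joex-"):
    """Build a per-container app identifier from the HOSTNAME variable.

    `prefix` is a hypothetical naming choice, not taken from the commit.
    """
    env = os.environ if env is None else env
    hostname = env.get("HOSTNAME", "").strip()
    if not hostname:
        raise ValueError("HOSTNAME is not set; cannot derive a unique app id")
    return prefix + hostname
```

Because every container gets its own hostname, two scaled instances yield different identifiers without needing separate config files: `joex_app_id({"HOSTNAME": "a1b2c3"})` gives `"joex-a1b2c3"`.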
Eike Kettner
f7ffa10b07 Fix docker build
Currently these commands should be run in a single sbt session, since
the first one sets the elm compilation mode to "prod". This makes the js a
bit smaller by removing debug info.
2021-01-07 00:39:36 +01:00
Eike Kettner
2a172ce720 Remove fulltext recreate-key config value
It's now in the admin routes, protected by the
`admin-endpoint.secret`.
2021-01-04 15:18:02 +01:00
Eike Kettner
df1fc845e9 Docker: add missing language for tesseract
Closes: #525
2021-01-02 14:15:09 +01:00
totti4ever
5dbd35060a Fix ocrmypdf containers not being removed after a run
Before, they would linger and pile up after a couple of processed items.
2020-10-31 19:35:35 +01:00
Malte
69465807c5 fixed wrong timezone because of missing tzdata package
- Now the timezone can be set as expected using TZ env variable
2020-10-27 23:23:38 +01:00
Malte
3d074c5fc9 Bugfixes
- Now using a script in `/usr/local/bin` to override the default *ocrmypdf* version, replacing the earlier approach that used a bash function
- Also had to add volume mapping to docker call

**ATTENTION** the path /tmp/docspell-convert:/tmp/docspell-convert must be mapped when starting Docspell's docker image!
2020-10-27 12:37:37 +01:00
Malte
cde7519f24 set default version of OCRmyPDF's docker image to _v11.2.1_, which seems to be the latest stable before _11.3.0_ 2020-10-27 06:57:27 +01:00
Malte
e9db579af6 added environment variable to set preferred OCRmyPDF version when using docker image
- e.g. `- OCRMYPDF_VERSION=v11.2.1`
 - default is `latest`
2020-10-27 06:30:50 +01:00
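The behaviour described in commit e9db579af6 (an optional `OCRMYPDF_VERSION` variable that falls back to `latest`) can be illustrated with a small Python sketch. The function name `ocrmypdf_image` is hypothetical, and the repository name `jbarlow83/ocrmypdf` is an assumption based on the image mentioned in the neighbouring commit. The later commit cde7519f24 changes the default to `v11.2.1`.

```python
import os


def ocrmypdf_image(env=None, repo="jbarlow83/ocrmypdf", default_tag="latest"):
    """Return the image reference to use, honouring OCRMYPDF_VERSION if set."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the default tag.
    tag = env.get("OCRMYPDF_VERSION", "").strip() or default_tag
    return f"{repo}:{tag}"
```

With `OCRMYPDF_VERSION=v11.2.1` this yields `jbarlow83/ocrmypdf:v11.2.1`; without the variable it yields `jbarlow83/ocrmypdf:latest`.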
Malte
c56f692ff5 (DOCKER) allows using jbarlow83's official docker image of OCRmyPDF, i.e. a newer version
- if `/var/run/docker.sock` is found in the docker-container, this feature is activated - if not, nothing changes
 - usage: mount bind `docker.sock` from host by using `-v` or `volumes:`
2020-10-27 06:00:55 +01:00
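The activation rule in commit c56f692ff5 (use the external OCRmyPDF image only when `/var/run/docker.sock` is present in the container) can be sketched in Python as below. This is an illustration, not the script from the commit. The function name is hypothetical, and checking that the path is really a socket (rather than merely existing) is an assumption that is stricter than "is found".

```python
import os
import stat


def docker_socket_available(path="/var/run/docker.sock"):
    """Return True only if `path` exists and is a Unix socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Path is missing or not accessible: the feature stays off.
        return False
```

When the socket is not bind-mounted from the host, the function returns `False` and nothing changes, which matches the fallback described in the commit message.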
totti4ever
bcb42920e1 Proper docker files (build from code) - 2.1 (#311)
* new file: base.dockerfile
base docker file for compiling sources

deleted: build-images.sh
replaced by dev-build-images.sh

deleted: build-joex-base.sh
replaced by base / joex dockerfile

changed: consumedir.dockerfile
based on compiled tool-binaries from base image
added health check (basic, check for REST server connection)

new file: dev-build-images.sh
added one build script for all purposes
derives tag from version-file (all snapshots become latest)

new file: dev-push-images.sh
added one push script for all purposes (similar to build)

changed: docker-compose.yml
	* changed regarding entrypoints and commands now being in the images
	* added health checks for 3rd party images (postgres and solr)
	* some minor renaming of the areas

renamed: entrypoint-joex.sh -> joex-entrypoint.sh
	* for better order

renamed: joex-base.dockerfile -> joex.dockerfile
	* also reworked to base on main base image
	* plus renamed for better order

deleted: push-images.sh
	* replaced by dev-push-images.sh

deleted: push-joex-base.sh
	* not necessary anymore

changed: restserver.dockerfile
	* reworked to be based on main base image
	* smaller
	* added health check

* updated docker-compose to new images

* update docker-compose.yml

remove unnecessary network entries

* update docker-compose.yml

added missing volume for postgres

* reverted image naming scheme and added log to docker build

1. go back to local code instead of cloning git
2. added build log to docker image build script, incl. log build times.
Logs can be found in `docker/dev-log` folder
3. added docker docs and new docker build logs folder to .gitignore
4. added

* build docker images from local files instead of cloned remote repo, plus time recording of builds
 - switched way docker images are built from remote git repo to local files (which should be the git repo, but may have local changes)
 - the docker build logs will show the time needed for the single image builds

* reverted deletion of joex base dockerfile

* joex base dockerfile plus smaller improvements
  - separate joex base file
  - added docker hook to improve Docker Hub experience

* updated docker hub build hook

corrected wrong path (base is build context)

* Fix of docker hub build hook again

base path seems to be the dockerfile's folder

* fixed typo in .dockerignore

* added ability to spool log to console instead of file, especially for automated docker builds

* improved logging of build script

* minor tweaks from review (.dockerignore, docker hub hook and an error when using other repos)

* added push of non-base images to automated docker hook

* fixes for docker hub build hook

* fixed/improved docker build hook

* replaced tag-version of untagged versions with SNAPSHOT (was LATEST, which should be used for stable tags only)
plus, made the version tag mandatory for the dockerfiles

* adapted docker build and push scripts for tagged images (using docker automated builds)

* fixed docker build hook
stupid copy & paste mistake...

* minor mistake in build hook

* added validation of matching version numbers for docker automated builds (for non-snapshot builds)

* fixed missing fi in new validity check

* fixed docker build hook validity check

* mixed up version comparison fixed

* relative path error in hook validation

* mixed up version comparison fixed

* test

* fixed error in version matching for docker hook

* test

* improved versioning, so that docker images are v0.00.00

* revert version.sbt

got overwritten by accident

* reverted version.sbt

* improved environment parameters, especially enabled setting DB params by them

	- additionally added .env file to have the same env variables for all containers

* cleaned up docker-compose.yml to fit public origin repo again

* optimized way db params are set

Figured out that we do not need the DB string to be built at startup; docspell.conf can also read multiple variables.
I still kept the restserver entrypoint, although we do not need it now - it might be helpful in the future or for debugging purposes

* added restart option to restart docspell, e.g. after a system reboot - but only if it was running before
2020-10-19 13:56:44 +02:00
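Commit bcb42920e1 mentions deriving the image tag from the version file, tagging untagged (snapshot) builds as SNAPSHOT, and naming release images in the form v0.00.00. A minimal Python sketch of such a rule follows. It is an assumption about the logic, not the build script from the commit; `image_tag` is a hypothetical name, and the `-SNAPSHOT` suffix is assumed from the usual sbt version convention.

```python
def image_tag(version):
    """Map a project version string to a docker image tag.

    Snapshot versions share the single tag "SNAPSHOT"; release versions
    get a "v" prefix, e.g. "0.13.0" -> "v0.13.0".
    """
    version = version.strip()
    if not version:
        raise ValueError("a version is mandatory")
    if version.upper().endswith("-SNAPSHOT"):
        return "SNAPSHOT"
    return version if version.startswith("v") else "v" + version
```

Under these assumptions, `image_tag("0.13.0-SNAPSHOT")` gives `"SNAPSHOT"` and `image_tag("0.13.0")` gives `"v0.13.0"`.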
Eike Kettner
13daa99933 Update docker and nix setup 2020-09-28 01:10:44 +02:00
Eike Kettner
28a70f56ec Fix joex docker image 2020-09-27 01:20:00 +02:00
Eike Kettner
4451ba0ef3 Configure joex with 1.5g heap in docker compose
Issue: #287
2020-09-26 13:18:54 +02:00
Eike Kettner
29ddcccbba Use a base image for joex containing all the tools 2020-09-09 22:59:34 +02:00
Eike Kettner
dc88fcb960 Update nix and docker setup 2020-09-09 22:31:35 +02:00
Eike Kettner
8e5e198098 Update nix and docker setups 2020-09-08 00:32:17 +02:00
Eike Kettner
d68d076c84 Update nix and docker setups 2020-08-15 00:34:33 +02:00
Eike Kettner
66793080d8 Update docker setup 2020-08-01 19:01:49 +02:00
Eike Kettner
3d49ceaab5 Use ocrmypdf tool to create pdf/a during conversion
- Use another external tool to convert pdf to pdf which also adds the
  extracted text as another layer into the pdf

- Although not used, the external conversion routine will now check
  for an existing text file that is named as the pdf file with extension
  `.txt`. If present it is included in the conversion result and will be
  used as the extracted text.

- text extraction for pdf files happens now on the converted file,
  because it may already contain the text from the conversion step and
  thus avoids running OCR twice.

- All errors during conversion are not fatal; processing continues
  without a converted file.
2020-07-18 17:19:29 +02:00
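The second point of commit 3d49ceaab5 (look for a text file named like the pdf but with extension `.txt`, and use it as the extracted text) can be sketched in Python as below. This is not Docspell's Scala code. `sidecar_text` is a hypothetical name, and the sketch assumes the `.txt` extension replaces `.pdf` (so `scan.pdf` pairs with `scan.txt`), which the commit message does not spell out.

```python
from pathlib import Path


def sidecar_text(pdf_path):
    """Return the text from a sibling .txt file, or None if there is none."""
    txt = Path(pdf_path).with_suffix(".txt")
    if txt.is_file():
        # Tolerate odd encodings instead of failing the whole conversion.
        return txt.read_text(encoding="utf-8", errors="replace")
    return None
```

A caller would run OCR-based text extraction only when this returns `None`, which is how the conversion step can avoid running OCR twice.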
Eike Kettner
ec7b34ee6f Update nix/nixos and docker setups 2020-06-29 21:01:07 +02:00
Eike Kettner
f883648839 Add missing entrypoint script for docker 2020-06-28 13:50:14 +02:00
Eike Kettner
d3b3c6289b Prepare docker setup for fulltext search 2020-06-28 13:37:39 +02:00
Eike Kettner
41964027d1 Update docker files 2020-06-17 22:28:04 +02:00
Eike Kettner
3d902c3273 Add a docker image for watching a directory 2020-05-25 19:43:06 +02:00
Eike Kettner
0b7cc0ec6b Update nix and docker setups 2020-05-25 17:57:41 +02:00
Eike Kettner
8f46f6b57b Update docker setup 2020-04-30 22:38:53 +02:00
Eike Kettner
5b21a876aa Try provide docker setup 2020-03-31 00:45:43 +02:00