Use ocrmypdf tool to create pdf/a during conversion

- Use another external tool to convert pdf to pdf which also adds the extracted text as another layer into the pdf - Although not used, the external conversion routine will now check for an existing text file that is named as the pdf file with extension `.txt`. If present it is included in the conversion result and will be used as the extracted text. - text extraction for pdf files happens now on the converted file, because it may already contain the text from the conversion step and thus avoids running OCR twice. - All errors during conversion are not fatal; processing continues without a converted file.
2025-08-05 02:24:52 +00:00 · 2020-07-18 12:48:41 +02:00
parent 99210365ce
commit 3d49ceaab5
16 changed files with 316 additions and 21 deletions
--- a/docker/joex.dockerfile
+++ b/docker/joex.dockerfile
@ -19,6 +19,17 @@ RUN apk add --no-cache openjdk11-jre \
    ttf-dejavu \
    ttf-freefont \
    ttf-liberation \
+    libxml2-dev \
+    libxslt-dev \
+    pngquant \
+    zlib-dev \
+    g++ \
+    qpdf \
+    python3-dev \
+    libffi-dev\
+    qpdf-dev \
+  && pip3 install --upgrade pip \
+  && pip3 install ocrmypdf \
  && curl -Ls $UNO_URL -o /usr/local/bin/unoconv \
  && chmod +x /usr/local/bin/unoconv \
  && ln -s /usr/bin/python3 /usr/bin/python \
@ -27,7 +38,7 @@ RUN apk add --no-cache openjdk11-jre \
  && curl -L -o docspell.zip https://github.com/eikek/docspell/releases/download/v0.8.0/docspell-joex-0.8.0.zip \
  && unzip docspell.zip \
  && rm docspell.zip \
-  && apk del curl unzip
+  && apk del curl unzip libxml2-dev libxslt-dev zlib-dev g++ python3-dev libffi-dev qpdf-dev

 COPY entrypoint-joex.sh /opt/entrypoint.sh