mirror of
https://github.com/TheAnachronism/docspell.git
synced 2025-06-20 17:39:54 +00:00
Use ocrmypdf tool to create pdf/a during conversion
- Use another external tool to convert PDF to PDF, which also adds the extracted text to the PDF as an additional layer.
- Although not used, the external conversion routine now checks for an existing text file named after the PDF file with the extension `.txt`. If present, it is included in the conversion result and used as the extracted text.
- Text extraction for PDF files now happens on the converted file, because it may already contain the text from the conversion step; this avoids running OCR twice.
- Errors during conversion are not fatal; processing continues without a converted file.
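
The flow described in the list above can be sketched in shell. This is only an illustration, not the project's actual conversion code: the function names `convert_pdf` and `sidecar_text` are hypothetical, the default command `ocrmypdf --skip-text` is an assumption about how the tool might be invoked, and the sketch assumes the text file for `scan.pdf` is called `scan.txt`.

```shell
#!/bin/sh
# Hypothetical helper: convert_pdf <input.pdf> <output.pdf> [converter command...]
# Prints the path of the PDF that later steps should use for text extraction.
convert_pdf() {
    in_pdf=$1
    out_pdf=$2
    shift 2
    # Assumed default converter; --skip-text leaves pages that already
    # contain text untouched instead of running OCR on them again.
    [ "$#" -gt 0 ] || set -- ocrmypdf --skip-text

    if "$@" "$in_pdf" "$out_pdf" >/dev/null 2>&1 && [ -s "$out_pdf" ]; then
        printf '%s\n' "$out_pdf"
    else
        # Conversion errors are not fatal: warn and continue with the original.
        echo "conversion failed, continuing without a converted file" >&2
        printf '%s\n' "$in_pdf"
    fi
}

# Hypothetical helper: print the text the converter left next to the PDF.
# Returns non-zero when no such file exists, so the caller can fall back
# to regular text extraction.
sidecar_text() {
    txt="${1%.pdf}.txt"
    if [ -f "$txt" ]; then
        cat "$txt"
    else
        return 1
    fi
}
```

A caller might run `pdf=$(convert_pdf in.pdf out.pdf)` and then try `sidecar_text "$pdf"` before extracting text from `$pdf` itself. Passing the converter as trailing arguments keeps the fallback logic independent of any particular tool.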
@@ -19,6 +19,17 @@ RUN apk add --no-cache openjdk11-jre \
     ttf-dejavu \
     ttf-freefont \
     ttf-liberation \
+    libxml2-dev \
+    libxslt-dev \
+    pngquant \
+    zlib-dev \
+    g++ \
+    qpdf \
+    python3-dev \
+    libffi-dev\
+    qpdf-dev \
+    && pip3 install --upgrade pip \
+    && pip3 install ocrmypdf \
     && curl -Ls $UNO_URL -o /usr/local/bin/unoconv \
     && chmod +x /usr/local/bin/unoconv \
     && ln -s /usr/bin/python3 /usr/bin/python \
@@ -27,7 +38,7 @@ RUN apk add --no-cache openjdk11-jre \
     && curl -L -o docspell.zip https://github.com/eikek/docspell/releases/download/v0.8.0/docspell-joex-0.8.0.zip \
     && unzip docspell.zip \
     && rm docspell.zip \
-    && apk del curl unzip
+    && apk del curl unzip libxml2-dev libxslt-dev zlib-dev g++ python3-dev libffi-dev qpdf-dev
 
 COPY entrypoint-joex.sh /opt/entrypoint.sh
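
Because the second hunk removes the build-only packages in the same `RUN` step that installs `ocrmypdf`, it may be worth confirming that the runtime tools are still available in the built image. The following is a minimal sketch of such a check; the function name `check_tools` is hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical smoke check: report every tool from the argument list that
# is missing from PATH; the exit status is non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Inside the image this could be called as `check_tools ocrmypdf qpdf pngquant unoconv`. It only verifies that the commands are found, not that they start correctly.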