From 919381be1e80f1cf48c1b976a0d96d7dfc0cf968 Mon Sep 17 00:00:00 2001
From: Eike Kettner <eike.kettner@posteo.de>
Date: Sat, 15 Feb 2020 13:57:21 +0100
Subject: [PATCH] More research on how to create pdfs from other files

---
 .../docs/dev/adr/0006_more-file-types.md      |   9 +
 .../docs/dev/adr/0010_convert_image_files.md  | 178 +++++++++++++++++-
 2 files changed, 186 insertions(+), 1 deletion(-)

diff --git a/modules/microsite/docs/dev/adr/0006_more-file-types.md b/modules/microsite/docs/dev/adr/0006_more-file-types.md
index 251c3e93..6c433051 100644
--- a/modules/microsite/docs/dev/adr/0006_more-file-types.md
+++ b/modules/microsite/docs/dev/adr/0006_more-file-types.md
@@ -119,7 +119,16 @@ office documents, common images, plain text (i.e. markdown) and html
 should be supported. In terms of file extensions: `doc`, `docx`,
 `xls`, `xlsx`, `odt`, `md`, `html`, `txt`, `jpg`, `png`, `tif`.
 
+There is always the preference to use jvm internal libraries in order
+to be more platform independent and to reduce external dependencies.
+But this is not always possible (like doing OCR).
 
+- Office documents (`doc`, `docx`, `xls`, `xlsx`, `odt`, `ods`):
+  unoconv (see [ADR 9](0009_convert_office_docs))
+- HTML (`html`): wkhtmltopdf (see [ADR 7](0007_convert_html_files))
+- Text/Markdown (`txt`, `md`): Java-Lib flexmark + wkhtmltopdf
+- Images (`jpg`, `png`, `tif`): Tesseract (see [ADR
+  10](0010_convert_image_files))
 
 ## Links
 
diff --git a/modules/microsite/docs/dev/adr/0010_convert_image_files.md b/modules/microsite/docs/dev/adr/0010_convert_image_files.md
index 458fa828..bf8e16d2 100644
--- a/modules/microsite/docs/dev/adr/0010_convert_image_files.md
+++ b/modules/microsite/docs/dev/adr/0010_convert_image_files.md
@@ -9,8 +9,184 @@ title: Convert Image Files
 
 How to convert image files properly to pdf?
 
+Since there are thousands of different image formats, there will never
+be support for all. The most common containers should be supported,
+though:
+
+- jpeg (jfif, exif)
+- png
+- tiff (baseline, single page)
+
+The focus is on document images, maybe from digital cameras or
+scanners.
 
 ## Considered Options
 
 * [pdfbox]() library
-* [pandoc](https://pandoc.org/) external command
+* [imagemagick](https://www.imagemagick.org/) external command
+* [img2pdf](https://github.com/josch/img2pdf) external command
+* [tesseract](https://github.com/tesseract-ocr/tesseract) external command
+
+There are no screenshots here, because it doesn't make sense since
+they all look the same on the screen. Instead we look at the file
+properties.
+
+**Input File**
+
+The input files are:
+
+```
+$ identify input/*
+input/jfif.jpg JPEG 2480x3514 2480x3514+0+0 8-bit sRGB 240229B 0.000u 0:00.000
+input/letter-en.jpg JPEG 1695x2378 1695x2378+0+0 8-bit Gray 256c 467341B 0.000u 0:00.000
+input/letter-en.png PNG 1695x2378 1695x2378+0+0 8-bit Gray 256c 191571B 0.000u 0:00.000
+input/letter-en.tiff TIFF 1695x2378 1695x2378+0+0 8-bit Grayscale Gray 4030880B 0.000u 0:00.000
+```
+
+Size:
+- jfif.jpg 240k
+- letter-en.jpg 467k
+- letter-en.png 191k
+- letter-en.tiff 4.0M
+
+### pdfbox
+
+Using a java library is preferred, if the quality is good enough.
+There is an
+[example](https://github.com/apache/pdfbox/blob/2cea31cc63623fd6ece149c60d5f0cc05a696ea7/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ImageToPDF.java)
+for this exact use case.
+
+This is the sample code:
+
+``` scala
+def imgtopdf(file: String): ExitCode = {
+  val jpg = Paths.get(file).toAbsolutePath
+  if (!Files.exists(jpg)) {
+    sys.error(s"file doesn't exist: $jpg")
+  }
+  val pd = new PDDocument()
+  val page = new PDPage(PDRectangle.A4)
+  pd.addPage(page)
+  val bimg = ImageIO.read(jpg.toFile)
+
+  val img = LosslessFactory.createFromImage(pd, bimg)
+
+  val stream = new PDPageContentStream(pd, page)
+  stream.drawImage(img, 0, 0, PDRectangle.A4.getWidth, PDRectangle.A4.getHeight)
+  stream.close()
+
+  pd.save("test.pdf")
+  pd.close()
+
+  ExitCode.Success
+}
+```
+
+Using pdfbox 2.0.18 and twelvemonkeys 3.5. Running time: `1384ms`
+
+```
+$ identify *.pdf
+jfif.jpg.pdf PDF 595x842 595x842+0+0 16-bit sRGB 129660B 0.000u 0:00.000
+letter-en.jpg.pdf PDF 595x842 595x842+0+0 16-bit sRGB 49118B 0.000u 0:00.000
+letter-en.png.pdf PDF 595x842 595x842+0+0 16-bit sRGB 49118B 0.000u 0:00.000
+letter-en.tiff.pdf PDF 595x842 595x842+0+0 16-bit sRGB 49118B 0.000u 0:00.000
+```
+
+Size:
+- jfif.jpg 1.1M
+- letter-en.jpg 142k
+- letter-en.png 142k
+- letter-en.tiff 142k
+
+### img2pdf
+
+This is a python tool that adds the image into the pdf without
+reencoding.
+
+Using version 0.3.1. Running time: `323ms`.
+
+```
+$ identify *.pdf
+jfif.jpg.pdf PDF 595x842 595x842+0+0 16-bit sRGB 129708B 0.000u 0:00.000
+letter-en.jpg.pdf PDF 595x842 595x842+0+0 16-bit sRGB 49864B 0.000u 0:00.000
+letter-en.png.pdf PDF 595x842 595x842+0+0 16-bit sRGB 49864B 0.000u 0:00.000
+letter-en.tiff.pdf PDF 595x842 595x842+0+0 16-bit sRGB 49864B 0.000u 0:00.000
+```
+
+Size:
+- jfif.jpg 241k
+- letter-en.jpg 468k
+- letter-en.png 191k
+- letter-en.tiff 192k
+
+### ImageMagick
+
+The well known imagemagick tool can convert images to pdfs, too.
+
+Using version 6.9.10-71. Running time: `881ms`.
+
+```
+$ identify *.pdf
+jfif.jpg.pdf PDF 595x843 595x843+0+0 16-bit sRGB 134873B 0.000u 0:00.000
+letter-en.jpg.pdf PDF 1695x2378 1695x2378+0+0 16-bit sRGB 360100B 0.000u 0:00.000
+letter-en.png.pdf PDF 1695x2378 1695x2378+0+0 16-bit sRGB 322418B 0.000u 0:00.000
+letter-en.tiff.pdf PDF 1695x2378 1695x2378+0+0 16-bit sRGB 322418B 0.000u 0:00.000
+```
+
+Size:
+- jfif.jpg 300k
+- letter-en.jpg 390k
+- letter-en.png 180k
+- letter-en.tiff 5.1M
+
+
+### Tesseract
+
+Docspell already relies on tesseract for doing OCR. And in contrast to
+all other candidates, it can create PDFs that are searchable. Of
+course, this results in a much longer running time that cannot be
+compared to the times of the other options.
+
+```
+tesseract doc3.jpg out -l deu pdf
+```
+
+It can also create both outputs in one go:
+
+```
+tesseract doc3.jpg out -l deu pdf txt
+```
+
+Using tesseract 4. Running time: `6661ms`
+
+```
+$ identify *.pdf
+tesseract/jfif.jpg.pdf PDF 595x843 595x843+0+0 16-bit sRGB 130535B 0.000u 0:00.000
+tesseract/letter-en.jpg.pdf PDF 1743x2446 1743x2446+0+0 16-bit sRGB 328716B 0.000u 0:00.000
+tesseract/letter-en.png.pdf PDF 1743x2446 1743x2446+0+0 16-bit sRGB 328716B 0.000u 0:00.000
+tesseract/letter-en.tiff.pdf PDF 1743x2446 1743x2446+0+0 16-bit sRGB 328716B 0.000u 0:00.000
+```
+
+Size:
+- jfif.jpg 246k
+- letter-en.jpg 473k
+- letter-en.png 183k
+- letter-en.tiff 183k
+
+
+## Decision
+
+Tesseract.
+
+To not use more external tools, imagemagick and img2pdf are not
+chosen, even though img2pdf shows the best results and is fastest.
+
+Pdfbox library would be the favorite, because results are good and
+with the [twelvemonkeys](https://github.com/haraldk/TwelveMonkeys)
+library there is support for many images. The priority is to avoid
+more external commands if possible.
+
+But since there already is a dependency on tesseract and it can create
+searchable pdfs, the decision is to use tesseract for this. Then PDFs
+with images can be converted to searchable PDFs with images. And text
+extraction is required anyway.