Some research on pdf conversion

This commit is contained in:
Eike Kettner 2020-02-11 22:41:44 +01:00
parent ce22b727b1
commit 3026f199f7
30 changed files with 649 additions and 5 deletions

View File

@ -5,8 +5,9 @@ title: ADRs
# ADR
- [0001 Components](adr/0001_components.html)
- [0002 Component Interaction](adr/0002_component_interaction.html)
- [0003 Encryption](adr/0003_encryption.html)
- [0004 ISO8601 vs Unix](adr/0004_iso8601vsEpoch.html)
- [0005 Job Executor](adr/0005_job-executor.html)
- [0001 Components](adr/0001_components)
- [0002 Component Interaction](adr/0002_component_interaction)
- [0003 Encryption](adr/0003_encryption)
- [0004 ISO8601 vs Unix](adr/0004_iso8601vsEpoch)
- [0005 Job Executor](adr/0005_job-executor)
- [0006 More File Types](adr/0006_more-file-types)

View File

@ -1,3 +1,8 @@
---
layout: docs
title: Use Markdown Architectural Decision Records
---
# Use Markdown Architectural Decision Records
## Context and Problem Statement

View File

@ -0,0 +1,129 @@
---
layout: docs
title: More File Types
---
# More File Types
## Context and Problem Statement
Docspell currently only supports PDF files. This has simplified early
development and design a lot and so helped with starting the project.
Handling pdf files is usually easy (to view, to extract text, print
etc).
The PDF format was chosen because PDF files are very common and
can be viewed with many tools on many systems (i.e. non-proprietary
tools). Docspell is also a document archive and from this perspective,
it is important that documents can be viewed in 10 years and more. The
hope is that the PDF format is best suited for this. Therefore all
documents in Docspell must be accessible as PDF. The trivial solution
to this requirement is to only allow PDF files.
Support for more document types must then take care of the following:
- extracting text
- converting into pdf
- access original file
Text should be extracted from the source file, in case conversion is
not lossless. Since Docspell can already extract text from PDF files
using OCR, text can also be extracted from the converted file as a
fallback.
The original file must always be accessible. The main reason is that
all uploaded data should be accessible without any modification. And
since the conversion may not always create best results, the original
file should be kept.
## Decision Drivers
People expect that software like Docspell supports the most common
document types, like all the “office documents” (`docx`, `rtf`, `odt`,
`xlsx`, …) and images. For many people it is more common to create
those files instead of PDF. Some (older) scanners may not be able to
scan into PDF files but only into image files.
## Considered Options
This ADR does not evaluate different options. It rather documents why
this feature is realized and the thoughts that led to how it is
implemented.
## Realization
### Data Model
The `attachment` table holds one file. There will be another table
`attachment_source` that holds the original file. It looks like this:
``` sql
CREATE TABLE "attachment_source" (
"id" varchar(254) not null primary key,
"file_id" varchar(254) not null,
"filename" varchar(254),
"created" timestamp not null,
foreign key ("file_id") references "filemeta"("id"),
foreign key ("id") references "attachment"("attachid")
);
```
The `id` is the primary key and is the same as the id of the associated
`attachment`, creating a `1-1` relationship (well, more correct is
`0..1-1`) between `attachment` and `attachment_source`.
There will always be an `attachment_source` record for every
`attachment` record. If the original file is a PDF already, then both
tables' `file_id` columns point to the same file. But now the user can
change the filename of an `attachment` while the original filename is
preserved in `attachment_source`. It must not be possible for the user
to change anything in `attachment_source`.
The `attachment` table is not touched in order to keep current code
mostly unchanged and to have a simpler data migration. The downside
is that the data model allows an `attachment` record to exist without
an `attachment_source` record. OTOH, a foreign key inside `attachment`
pointing to an `attachment_source` is also not correct, because it
allows the same `attachment_source` record to be associated with many
`attachment` records. This would do even more harm, in my opinion.
### Migration
Creating a new table and not altering existing ones should simplify
data migration.
Since only PDF files were allowed and the user could not change
anything in the `attachment` table, the existing data can simply be
inserted into the new table. This represents the trivial case where the
attachment and source are the same.
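The trivial case can be sketched in Scala as follows. This is only an illustration and not the actual migration: `AttachmentSource` mirrors the columns of the `attachment_source` table above, while the fields of `Attachment` (apart from the id) are assumptions, since this ADR does not list the columns of the `attachment` table.

``` scala
import java.time.Instant

// Hypothetical view of an existing `attachment` row; the field names
// are assumptions made for this sketch.
final case class Attachment(
    id: String,
    fileId: String,
    name: Option[String],
    created: Instant
)

// Mirrors the columns of the `attachment_source` table.
final case class AttachmentSource(
    id: String,
    fileId: String,
    filename: Option[String],
    created: Instant
)

object Migration {
  // Trivial case: the source shares id, file and name with its attachment.
  def toSource(a: Attachment): AttachmentSource =
    AttachmentSource(a.id, a.fileId, a.name, a.created)
}
```

Because the source record reuses the attachment id, the `1-1` relationship is established without touching the existing table.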
### Processing
The first step in processing is now converting the file into a PDF. If
it already is a PDF, nothing is done. This step comes before text
extraction, so that extraction can first be tried on the source file;
only if that fails (or is not supported) is text extracted
from the converted PDF file. All remaining steps are untouched.
If conversion is not supported for the input file, it is skipped. If
conversion fails, the error is propagated so that the retry mechanism
can take care of it.
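The order of these steps could look roughly like the following sketch. All names here (`Conversion`, `Pdf`, `Skipped`, `extractText`) are hypothetical and only illustrate the fallback. A failed conversion is assumed to surface as an error before this point, so it is not modelled.

``` scala
object ProcessingSketch {
  // Outcome of the conversion step.
  sealed trait Conversion[+A]
  // The file was converted, or it was a PDF already.
  final case class Pdf[+A](file: A) extends Conversion[A]
  // Conversion is not supported for this input, so it was skipped.
  case object Skipped extends Conversion[Nothing]

  // Text is taken from the source file first; only if that yields
  // nothing is the converted PDF consulted.
  def extractText[A](
      source: A,
      converted: Conversion[A],
      fromSource: A => Option[String],
      fromPdf: A => Option[String]
  ): Option[String] =
    fromSource(source).orElse {
      converted match {
        case Pdf(file) => fromPdf(file)
        case Skipped   => None
      }
    }
}
```

The two extraction functions are passed in as parameters, so the sketch makes no claim about how text is actually read from a file.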
### What types?
Which file types should be supported? As a first step, all major
office documents, common images, plain text (e.g. markdown) and html
should be supported. In terms of file extensions: `doc`, `docx`,
`xls`, `xlsx`, `odt`, `md`, `html`, `txt`, `jpg`, `png`, `tif`.
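As an illustration, the extensions listed above can be grouped by the kind of conversion they need, following the linked ADRs. This is a sketch under assumptions: the names are hypothetical, and selecting by file extension is chosen here only for brevity (a real implementation may prefer detecting the content type).

``` scala
import java.util.Locale

object FileTypes {
  sealed trait Converter
  case object HtmlConverter extends Converter   // see "Convert HTML Files"
  case object TextConverter extends Converter   // see "Convert Plain Text"
  case object OfficeConverter extends Converter // see "Convert Office Documents"
  case object ImageConverter extends Converter  // see "Convert Image Files"

  // Returns None for anything not in the list above. PDF files are not
  // listed, because they need no conversion.
  def converterFor(filename: String): Option[Converter] = {
    val dot = filename.lastIndexOf('.')
    val ext =
      if (dot < 0) ""
      else filename.substring(dot + 1).toLowerCase(Locale.ROOT)
    ext match {
      case "html"                                  => Some(HtmlConverter)
      case "md" | "txt"                            => Some(TextConverter)
      case "doc" | "docx" | "xls" | "xlsx" | "odt" => Some(OfficeConverter)
      case "jpg" | "png" | "tif"                   => Some(ImageConverter)
      case _                                       => None
    }
  }
}
```

For example, `FileTypes.converterFor("letter.DOCX")` selects the office conversion, while an unknown extension yields `None` and conversion is skipped.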
## Links
* [Convert HTML Files](0007_convert_html_files)
* [Convert Plain Text](0008_convert_plain_text)
* [Convert Office Documents](0009_convert_office_docs)
* [Convert Image Files](0010_convert_image_files)

View File

@ -0,0 +1,71 @@
---
layout: docs
title: Convert HTML Files
---
# {{ page.title }}
## Context and Problem Statement
How can HTML documents be converted into a PDF file that looks as much
as possible like the original?
It would be nice to have a java-only solution. But if an external tool
has a better outcome, then an external tool is fine, too.
Since Docspell is free software, the tools must also be free.
## Considered Options
* [pandoc](https://pandoc.org/) external command
* [wkhtmltopdf](https://wkhtmltopdf.org/) external command
* [Unoconv](https://github.com/unoconv/unoconv) external command
Native (firefox) view:
<div class="thumbnail">
<img src="./img/example-html-native.jpg" title="Native view of an HTML example file">
</div>
Note: the example html is from
[here](https://www.sparksuite.com/open-source/invoice.html).
I downloaded the HTML file to disk together with its resources (using
*Save as...* in the browser).
### Pandoc
<div class="thumbnail">
<img src="./img/example-html-pandoc-latex.jpg" title="Pandoc (Latex) HTML->PDF">
</div>
<div class="thumbnail">
<img src="./img/example-html-pandoc-html.jpg" title="Pandoc (html) HTML->PDF">
</div>
Not showing the version using the `context` pdf-engine, since it looked
very similar to the latex variant.
### wkhtmltopdf
<div class="thumbnail">
<img src="./img/example-html-wkhtmltopdf.jpg" title="wkhtmltopdf HTML->PDF">
</div>
### Unoconv
<div class="thumbnail">
<img src="./img/example-html-unoconv.jpg" title="Unoconv HTML->PDF">
</div>
## Decision Outcome
wkhtmltopdf.
It shows the best results.

View File

@ -0,0 +1,191 @@
---
layout: docs
title: Convert Text Files
---
# {{ page.title }}
## Context and Problem Statement
How can plain text and markdown documents be converted into PDF
files?
Rendering images is not important here, since the files must be self
contained when uploaded to Docspell.
The test file is the current documentation page of Docspell, found in
`microsite/docs/doc.md`.
```
---
layout: docs
position: 4
title: Documentation
---
# {page .title}
Docspell assists in organizing large amounts of PDF files that are
...
## How it works
Documents have two ...
1. You maintain a kind of address book. It should list all possible
correspondents and the concerning people/things. This grows
incrementally with each new unknown document.
2. When docspell analyzes a document, it tries to find matches within
your address ...
3. You can inspect ...
The set of meta data that docspell uses to draw suggestions from, must
be maintained ...
## Terms
In order to better understand these pages, some terms should be
explained first.
### Item
An **Item** is roughly your (pdf) document, only that an item may span
multiple files, which are called **attachments**. And an item has
**meta data** associated:
- a **correspondent**: the other side of the communication. It can be
an organization or a person.
- a **concerning person** or **equipment**: a person or thing that
this item is about. Maybe it is an insurance contract about your
car.
- ...
### Collective
The users of the application are part of a **collective**. A
**collective** is a group of users that share access to the same
items. The account name is therefore comprised of a *collective name*
and a *user name*.
All users of a collective are equal; they have same permissions to
access all...
```
Then a plain text file is tried, too (without any markup).
```
Maecenas mauris lectus, lobortis et purus mattis
Duis vehicula mi vel mi pretium
In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu.
Pellentesque scelerisque fermentum erat, id posuere justo pulvinar ut.
Cras id eros sed enim aliquam lobortis. Sed lobortis nisl ut eros
efficitur tincidunt. Cras justo mi, porttitor quis mattis vel,
ultricies ut purus. Ut facilisis et lacus eu cursus.
In eleifend velit vitae libero sollicitudin euismod:
- Fusce vitae vestibulum velit,
- Pellentesque vulputate lectus quis pellentesque commodo
the end.
```
## Considered Options
* [flexmark](https://github.com/vsch/flexmark-java) for markdown to
HTML, then use existing machinery described in [adr
7](./0007_convert_html_files)
* [pandoc](https://pandoc.org/) external command
### flexmark markdown library for java
Process files with [flexmark](https://github.com/vsch/flexmark-java)
and then create a PDF from the resulting html.
Using the following snippet:
``` scala
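// Imports assume flexmark 0.50 or later and cats-effect for ExitCode.
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import java.util
import cats.effect.ExitCode
import com.vladsch.flexmark.ext.gfm.strikethrough.StrikethroughExtension
import com.vladsch.flexmark.ext.tables.TablesExtension
import com.vladsch.flexmark.html.HtmlRenderer
import com.vladsch.flexmark.parser.Parser
import com.vladsch.flexmark.util.data.{DataKey, MutableDataSet}

def renderMarkdown(): ExitCode = {
  val opts = new MutableDataSet()
  opts.set(Parser.EXTENSIONS.asInstanceOf[DataKey[util.Collection[_]]],
    util.Arrays.asList(TablesExtension.create(),
      StrikethroughExtension.create()))
  val parser = Parser.builder(opts).build()
  val renderer = HtmlRenderer.builder(opts).build()
  // "in.txt|md" stands for the input file (either the txt or the md one)
  val reader = Files.newBufferedReader(Paths.get("in.txt|md"))
  val doc = parser.parseReader(reader)
  val html = renderer.render(doc)
  // wrap the rendered fragment into a minimal page with some padding
  val body = "<html><head></head><body style=\"padding: 0 5em;\">" + html + "</body></html>"
  Files.write(
    Paths.get("test.html"),
    body.getBytes(StandardCharsets.UTF_8))
  ExitCode.Success
}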
```
Then run the result through `wkhtmltopdf`.
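Running `wkhtmltopdf` on the generated file could be done from Scala with `scala.sys.process`, as in this minimal sketch. The object and method names are hypothetical; the only thing taken from the tool itself is that it accepts the input file followed by the output file.

``` scala
import java.nio.file.{Path, Paths}
import scala.sys.process._

object Wkhtmltopdf {
  // wkhtmltopdf takes the input file followed by the output file.
  def command(html: Path, pdf: Path): Seq[String] =
    Seq("wkhtmltopdf", html.toString, pdf.toString)

  // Runs the tool and returns its exit code (0 means success).
  // Requires `wkhtmltopdf` to be available on the PATH.
  def run(html: Path, pdf: Path): Int =
    command(html, pdf).!
}
```

With the snippet above, `Wkhtmltopdf.run(Paths.get("test.html"), Paths.get("test.pdf"))` would convert the `test.html` file written by `renderMarkdown()`.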
Markdown file:
<div class="thumbnail">
<img src="./img/example-md-java.jpg" title="Flexmark/wkhtmltopdf MD->PDF">
</div>
TXT file:
<div class="thumbnail">
<img src="./img/example-txt-java.jpg" title="Flexmark/wkhtmltopdf TXT->PDF">
</div>
### pandoc
Command:
```
pandoc -f markdown -t html -o test.pdf microsite/docs/doc.md
```
Markdown/Latex:
<div class="thumbnail">
<img src="./img/example-md-pandoc-latex.jpg" title="Pandoc (Latex) MD->PDF">
</div>
Markdown/Html:
<div class="thumbnail">
<img src="./img/example-md-pandoc-html.jpg" title="Pandoc (html) MD->PDF">
</div>
Text/Latex:
<div class="thumbnail">
<img src="./img/example-txt-pandoc-latex.jpg" title="Pandoc (Latex) TXT->PDF">
</div>
Text/Html:
<div class="thumbnail">
<img src="./img/example-txt-pandoc-html.jpg" title="Pandoc (html) TXT->PDF">
</div>
## Decision Outcome
Java library "flexmark".
I think all results are great. It depends on the type of document and
what one expects to see. I guess that most people expect something
like what pandoc-html produces for the kind of files docspell is for (it is
not for newspaper articles, where pandoc-latex would be the best fit).
But choosing pandoc means yet another external command to depend on.
And the results from flexmark are really good, too. One can fiddle
with options and css to make it look better.
To not introduce another external command, the decision is to use flexmark
and then the already existing html->pdf conversion.

View File

@ -0,0 +1,231 @@
---
layout: docs
title: Convert Office Documents
---
# {{ page.title }}
## Context and Problem Statement
How can office documents, like `docx` or `odt` be converted into a PDF
file that looks as much as possible like the original?
It would be nice to have a java-only solution. But if an external tool
has a better outcome, then an external tool is fine, too.
Since Docspell is free software, the tools must also be free.
## Considered Options
* [Apache POI](https://poi.apache.org) together with
[this](https://search.maven.org/artifact/fr.opensagres.xdocreport/org.apache.poi.xwpf.converter.pdf/1.0.6/jar)
library
* [pandoc](https://pandoc.org/) external command
* [abiword]() external command
* [Unoconv](https://github.com/unoconv/unoconv) external command
To choose an option, some documents are converted to pdf and compared.
Only the formats `docx` and `odt` are considered here. These are the
most used formats. They have to look good; if an `xlsx` or `pptx`
doesn't look so great, that is ok.
Here is the native view to compare with:
ODT:
<div class="thumbnail">
<img src="./img/example-odt-native.jpg" title="Native view of an ODT example file">
</div>
### `XWPFConverter`
I couldn't get any example to work. There were exceptions:
```
java.lang.IllegalArgumentException: Value for parameter 'id' was out of bounds
at org.apache.poi.util.IdentifierManager.reserve(IdentifierManager.java:80)
at org.apache.poi.xwpf.usermodel.XWPFRun.<init>(XWPFRun.java:101)
at org.apache.poi.xwpf.usermodel.XWPFRun.<init>(XWPFRun.java:146)
at org.apache.poi.xwpf.usermodel.XWPFParagraph.buildRunsInOrderFromXml(XWPFParagraph.java:135)
at org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(XWPFParagraph.java:88)
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:147)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
at docspell.convert.Testing$.withPoi(Testing.scala:17)
at docspell.convert.Testing$.$anonfun$run$1(Testing.scala:12)
at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:87)
at cats.effect.internals.IORunLoop$RestartCallback.signal(IORunLoop.scala:355)
at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:376)
at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:316)
at cats.effect.internals.IOShift$Tick.run(IOShift.scala:36)
at cats.effect.internals.PoolUtils$$anon$2$$anon$3.run(PoolUtils.scala:51)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
The project (not Apache POI, the other one) seems unmaintained. I could
not find any website and the artifact in maven central is from 2016.
### Pandoc
I know pandoc as a great tool for converting between markup
documents. So this tries it with office documents. It supports `docx`
and `odt`, according to its `--list-input-formats` output.
From the pandoc manual:
> By default, pandoc will use LaTeX to create the PDF, which requires
> that a LaTeX engine be installed (see --pdf-engine below).
> Alternatively, pandoc can use ConTeXt, roff ms, or HTML as an
> intermediate format. To do this, specify an output file with a .pdf
> extension, as before, but add the --pdf-engine option or -t context,
> -t html, or -t ms to the command line. The tool used to generate the
> PDF from the intermediate format may be specified using --pdf-engine.
Trying with latex engine:
```
pandoc -f odt -o test.pdf example.odt
```
Results ODT:
<div class="thumbnail">
<img src="./img/example-odt-pandoc-latex.jpg" title="Pandoc (Latex) ODT->PDF">
</div>
```
pandoc -f docx -o test.pdf example.docx
```
Results DOCX:
<div class="thumbnail">
<img src="./img/example-docx-pandoc-latex.jpg" title="Pandoc (Latex) DOCX->PDF">
</div>
----
Trying with context engine:
```
pandoc -f odt -t context -o test.pdf example.odt
```
Results ODT:
<div class="thumbnail">
<img src="./img/example-odt-pandoc-context.jpg" title="Pandoc (Context) ODT->PDF">
</div>
Results DOCX:
<div class="thumbnail">
<img src="./img/example-docx-pandoc-context.jpg" title="Pandoc (Context) DOCX->PDF">
</div>
----
Trying with ms engine:
```
pandoc -f odt -t ms -o test.pdf example.odt
```
Results ODT:
<div class="thumbnail">
<img src="./img/example-odt-pandoc-ms.jpg" title="Pandoc (MS) ODT->PDF">
</div>
Results DOCX:
<div class="thumbnail">
<img src="./img/example-docx-pandoc-ms.jpg" title="Pandoc (MS) DOCX->PDF">
</div>
---
Trying with html engine (this requires `wkhtmltopdf` to be present):
```
$ pandoc --extract-media . -f odt -t html -o test.pdf example.odt
```
Results ODT:
<div class="thumbnail">
<img src="./img/example-odt-pandoc-html.jpg" title="Pandoc (html) ODT->PDF">
</div>
Results DOCX:
<div class="thumbnail">
<img src="./img/example-docx-pandoc-html.jpg" title="Pandoc (html) DOCX->PDF">
</div>
### Abiword
Trying with:
```
abiword --to=pdf example.odt
```
Results:
<div class="thumbnail">
<img src="./img/example-odt-abiword.jpg" title="Abiword ODT->PDF">
</div>
Trying with a `docx` file failed. It worked with a `doc` file.
### Unoconv
Unoconv relies on libreoffice/openoffice, so installing it will result
in installing parts of libreoffice, which is a very large dependency.
Trying with:
```
unoconv -f pdf example.odt
```
Results ODT:
<div class="thumbnail">
<img src="./img/example-odt-unoconv.jpg" title="Unoconv ODT->PDF">
</div>
Results DOCX:
<div class="thumbnail">
<img src="./img/example-docx-unoconv.jpg" title="Unoconv DOCX->PDF">
</div>
## Decision Outcome
Unoconv.
The results from `unoconv` are really good.
Abiword is also not that bad: it didn't convert the chart, but all
font markup is there. It would be great to not depend on something as
big as libreoffice, but the results are so much better.
Pandoc also deals very well with DOCX files (using the `context`
engine). The only thing that was not rendered was the embedded chart
(as with abiword). But all images and font styling were present.
It will be a configurable external command anyway, so users can
exchange it at any time with a different one.
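One way such a configurable command could be modelled is sketched below. Everything here is an assumption made for illustration: the `Command` shape, the `{infile}`/`{outfile}` placeholder syntax and the default arguments are not taken from Docspell. The `-f pdf` arguments match the command used in the test above, and `-o` is unoconv's option for naming the output.

``` scala
object ExternalCommand {
  // Hypothetical configuration shape: a program plus argument templates.
  final case class Command(program: String, args: Seq[String])

  // A possible default matching the decision above.
  val unoconv: Command =
    Command("unoconv", Seq("-f", "pdf", "-o", "{outfile}", "{infile}"))

  // Replaces the placeholders with concrete paths and returns the
  // full command line, program first.
  def render(cmd: Command, infile: String, outfile: String): Seq[String] =
    cmd.program +: cmd.args.map(
      _.replace("{infile}", infile).replace("{outfile}", outfile)
    )
}
```

Exchanging the tool then only means supplying a different `Command` value, for example `Command("abiword", Seq("--to=pdf", "{infile}"))`.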

View File

@ -0,0 +1,16 @@
---
layout: docs
title: Convert Image Files
---
# {{ page.title }}
## Context and Problem Statement
How to convert image files properly to pdf?
## Considered Options
* [pdfbox]() library
* [pandoc](https://pandoc.org/) external command
