Some research on pdf conversion
@ -5,8 +5,9 @@ title: ADRs
|
|||||||
|
|
||||||
# ADR
|
# ADR
|
||||||
|
|
||||||
- [0001 Components](adr/0001_components.html)
|
- [0001 Components](adr/0001_components)
|
||||||
- [0002 Component Interaction](adr/0002_component_interaction.html)
|
- [0002 Component Interaction](adr/0002_component_interaction)
|
||||||
- [0003 Encryption](adr/0003_encryption.html)
|
- [0003 Encryption](adr/0003_encryption)
|
||||||
- [0004 ISO8601 vs Unix](adr/0004_iso8601vsEpoch.html)
|
- [0004 ISO8601 vs Unix](adr/0004_iso8601vsEpoch)
|
||||||
- [0005 Job Executor](adr/0005_job-executor.html)
|
- [0005 Job Executor](adr/0005_job-executor)
|
||||||
|
- [0006 More File Types](adr/0006_more-file-types)
|
||||||
|
@ -1,3 +1,8 @@
|
|||||||
|
---
|
||||||
|
layout: docs
|
||||||
|
title: Use Markdown Architectural Decision Records
|
||||||
|
---
|
||||||
|
|
||||||
# Use Markdown Architectural Decision Records
|
# Use Markdown Architectural Decision Records
|
||||||
|
|
||||||
## Context and Problem Statement
|
## Context and Problem Statement
|
||||||
|
129
modules/microsite/docs/dev/adr/0006_more-file-types.md
Normal file
@ -0,0 +1,129 @@
|
|||||||
|
---
|
||||||
|
layout: docs
|
||||||
|
title: More File Types
|
||||||
|
---
|
||||||
|
|
||||||
|
# More File Types
|
||||||
|
|
||||||
|
## Context and Problem Statement
|
||||||
|
|
||||||
|
Docspell currently only supports PDF files. This has simplified early
development and design a lot and so helped with starting the project.
Handling PDF files is usually easy (viewing, extracting text, printing
etc).

The PDF format was chosen because PDF files are very common and can be
viewed with many tools on many systems (i.e. non-proprietary tools).
Docspell is also a document archive, and from this perspective it is
important that documents can still be viewed in 10 years and more. The
hope is that the PDF format is best suited for this. Therefore all
documents in Docspell must be accessible as PDF. The trivial solution
to this requirement is to only allow PDF files.
|
||||||
|
|
||||||
|
Support for more document types must then take care of the following:

- extracting text
- converting into PDF
- accessing the original file
|
||||||
|
|
||||||
|
Text should be extracted from the source file, in case conversion is
|
||||||
|
not lossless. Since Docspell can already extract text from PDF files
|
||||||
|
using OCR, text can also be extracted from the converted file as a
|
||||||
|
fallback.
|
||||||
|
|
||||||
|
The original file must always be accessible. The main reason is that
all uploaded data should be accessible without any modification. And
since the conversion may not always produce the best results, the
original file should be kept.
|
||||||
|
|
||||||
|
|
||||||
|
## Decision Drivers
|
||||||
|
|
||||||
|
People expect software like Docspell to support the most common
document types, like all the “office documents” (`docx`, `rtf`, `odt`,
`xlsx`, …) and images. For many people it is more common to create
those files than PDF files. Some (older) scanners may not be able to
scan into PDF files, only into image files.
|
||||||
|
|
||||||
|
|
||||||
|
## Considered Options
|
||||||
|
|
||||||
|
This ADR does not evaluate different options. Rather, it documents why
this feature is realized and the thoughts that led to how it is
implemented.
|
||||||
|
|
||||||
|
## Realization
|
||||||
|
|
||||||
|
### Data Model
|
||||||
|
|
||||||
|
The `attachment` table holds one file. There will be another table
|
||||||
|
`attachment_source` that holds the original file. It looks like this:
|
||||||
|
|
||||||
|
``` sql
|
||||||
|
CREATE TABLE "attachment_source" (
|
||||||
|
"id" varchar(254) not null primary key,
|
||||||
|
"file_id" varchar(254) not null,
|
||||||
|
"filename" varchar(254),
|
||||||
|
"created" timestamp not null,
|
||||||
|
foreign key ("file_id") references "filemeta"("id"),
|
||||||
|
foreign key ("id") references "attachment"("attachid")
|
||||||
|
);
|
||||||
|
```
|
||||||
|
|
||||||
|
The `id` is the primary key and is the same as the id of the
associated `attachment`, creating a `1-1` relationship (more precisely
`0..1-1`) between `attachment` and `attachment_source`.

There will always be an `attachment_source` record for every
`attachment` record. If the original file is already a PDF, then both
tables' `file_id` columns point to the same file. But now the user can
change the filename of an `attachment` while the original filename is
preserved in `attachment_source`. It must not be possible for the user
to change anything in `attachment_source`.
|
||||||
|
|
||||||
|
The `attachment` table is not touched in order to keep current code
mostly unchanged and to have a simpler data migration. The downside is
that the data model allows an `attachment` record to exist without an
`attachment_source` record. On the other hand, a foreign key inside
`attachment` pointing to an `attachment_source` is also not correct,
because it allows the same `attachment_source` record to be associated
with many `attachment` records. This would do even more harm, in my
opinion.
|
||||||
|
|
||||||
|
### Migration
|
||||||
|
|
||||||
|
Creating a new table and not altering existing ones should simplify
data migration.

Since only PDF files were allowed and the user could not change
anything in the `attachment` table, the existing data can simply be
inserted into the new table. This represents the trivial case where
the attachment and source are the same.
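
The trivial case can be pictured with a small in-memory model. This is only a sketch: `AttachmentSource` mirrors the columns of the `attachment_source` table above, while the fields assumed for `Attachment` (`fileId`, `name`, `created`) and the `Migration` object are hypothetical, since this ADR only names the `attachid` column of the `attachment` table.

``` scala
import java.time.Instant

// Hypothetical view of an existing `attachment` row; only the key
// column `attachid` is named in this ADR, the other fields are assumed.
final case class Attachment(
    attachId: String,
    fileId: String,
    name: Option[String],
    created: Instant
)

// Mirrors the columns of the `attachment_source` table.
final case class AttachmentSource(
    id: String,
    fileId: String,
    filename: Option[String],
    created: Instant
)

object Migration {
  // Trivial case: the source row shares the attachment's id and points
  // to the very same file, because only PDFs could be uploaded so far.
  def toSource(a: Attachment): AttachmentSource =
    AttachmentSource(a.attachId, a.fileId, a.name, a.created)

  def migrate(existing: Seq[Attachment]): Seq[AttachmentSource] =
    existing.map(toSource)
}
```

In a real migration this would probably be a single SQL `INSERT ... SELECT` statement; the Scala model only makes the row-to-row mapping explicit.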
|
||||||
|
|
||||||
|
|
||||||
|
### Processing
|
||||||
|
|
||||||
|
The first step in processing is now converting the file into a PDF. If
it already is a PDF, nothing is done. This step comes before text
extraction, so that extraction can first be attempted on the source
file, and only if that fails (or is not supported) is the text
extracted from the converted PDF file. All remaining steps are
untouched.

If conversion is not supported for the input file, it is skipped. If
conversion fails, the error is propagated so that the retry mechanism
can handle it.
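
The order of these two steps can be sketched as follows. All names here (`SourceFile`, `Conversion`, `convert`, `extractText`) are hypothetical and only illustrate the control flow described above; they are not taken from the Docspell code base.

``` scala
object ProcessingSketch {
  // Hypothetical input type; the MIME type decides which converter applies.
  final case class SourceFile(name: String, mime: String, data: Array[Byte])

  sealed trait Conversion
  // Holds the converted file, or the input itself if it already was a PDF.
  final case class Pdf(data: Array[Byte]) extends Conversion
  // No converter is available for this file type.
  case object Skipped extends Conversion

  /** Step 1: convert to PDF. A failing converter throws, so the error
    * propagates to the caller (and with it to the retry mechanism).
    */
  def convert(
      file: SourceFile,
      converters: Map[String, Array[Byte] => Array[Byte]]
  ): Conversion =
    if (file.mime == "application/pdf") Pdf(file.data)
    else
      converters.get(file.mime) match {
        case Some(run) => Pdf(run(file.data))
        case None      => Skipped
      }

  /** Step 2: try the source file first; fall back to the converted PDF. */
  def extractText(
      file: SourceFile,
      conversion: Conversion,
      fromSource: SourceFile => Option[String],
      fromPdf: Array[Byte] => Option[String]
  ): Option[String] =
    fromSource(file).orElse(conversion match {
      case Pdf(data) => fromPdf(data)
      case Skipped   => None
    })
}
```

Here `fromSource` returning `None` covers both cases mentioned above: extraction from the source failed, or it is not supported for that type.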
|
||||||
|
|
||||||
|
### What types?
|
||||||
|
|
||||||
|
Which file types should be supported? As a first step, all major
office documents, common images, plain text (e.g. markdown) and html
should be supported. In terms of file extensions: `doc`, `docx`,
`xls`, `xlsx`, `odt`, `md`, `html`, `txt`, `jpg`, `png`, `tif`.
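
As a rough illustration, the extensions listed above could be grouped by the kind of conversion they need. The grouping follows the four follow-up ADRs (office documents, plain text, HTML, images); the names `FileTypes`, `Kind` and `kindOf` are hypothetical, and a real implementation might well decide by MIME type instead of file extension.

``` scala
object FileTypes {
  // One kind per follow-up ADR.
  sealed trait Kind
  case object Office extends Kind
  case object Text extends Kind
  case object Html extends Kind
  case object Image extends Kind

  // The extensions named in this ADR, grouped by conversion kind.
  private val byExtension: Map[String, Kind] = Map(
    "doc" -> Office, "docx" -> Office, "xls" -> Office,
    "xlsx" -> Office, "odt" -> Office,
    "md" -> Text, "txt" -> Text,
    "html" -> Html,
    "jpg" -> Image, "png" -> Image, "tif" -> Image
  )

  // Looks up the kind by the (case-insensitive) extension of a file name.
  def kindOf(filename: String): Option[Kind] = {
    val dot = filename.lastIndexOf('.')
    if (dot < 0) None
    else byExtension.get(filename.substring(dot + 1).toLowerCase)
  }
}
```

A file whose extension is not in the map yields `None`, which corresponds to the "conversion is skipped" case from the processing section.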
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
* [Convert HTML Files](0007_convert_html_files)
|
||||||
|
* [Convert Plain Text](0008_convert_plain_text)
|
||||||
|
* [Convert Office Documents](0009_convert_office_docs)
|
||||||
|
* [Convert Image Files](0010_convert_image_files)
|
71
modules/microsite/docs/dev/adr/0007_convert_html_files.md
Normal file
@ -0,0 +1,71 @@
|
|||||||
|
---
|
||||||
|
layout: docs
|
||||||
|
title: Convert HTML Files
|
||||||
|
---
|
||||||
|
|
||||||
|
# {{ page.title }}
|
||||||
|
|
||||||
|
## Context and Problem Statement
|
||||||
|
|
||||||
|
How can HTML documents be converted into a PDF file that looks as much
|
||||||
|
as possible like the original?
|
||||||
|
|
||||||
|
It would be nice to have a java-only solution. But if an external tool
|
||||||
|
has a better outcome, then an external tool is fine, too.
|
||||||
|
|
||||||
|
Since Docspell is free software, the tools must also be free.
|
||||||
|
|
||||||
|
|
||||||
|
## Considered Options
|
||||||
|
|
||||||
|
* [pandoc](https://pandoc.org/) external command
|
||||||
|
* [wkhtmltopdf](https://wkhtmltopdf.org/) external command
|
||||||
|
* [Unoconv](https://github.com/unoconv/unoconv) external command
|
||||||
|
|
||||||
|
Native (Firefox) view:
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-html-native.jpg" title="Native view of an HTML example file">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
Note: the example html is from
|
||||||
|
[here](https://www.sparksuite.com/open-source/invoice.html).
|
||||||
|
|
||||||
|
I downloaded the HTML file to disk together with its resources (using
|
||||||
|
*Save as...* in the browser).
|
||||||
|
|
||||||
|
|
||||||
|
### Pandoc
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-html-pandoc-latex.jpg" title="Pandoc (Latex) HTML->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-html-pandoc-html.jpg" title="Pandoc (html) HTML->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
Not showing the version using the `context` pdf-engine, since it looked
very similar to the latex variant.
|
||||||
|
|
||||||
|
|
||||||
|
### wkhtmltopdf
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-html-wkhtmltopdf.jpg" title="wkhtmltopdf HTML->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
### Unoconv
|
||||||
|
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-html-unoconv.jpg" title="Unoconv HTML->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
## Decision Outcome
|
||||||
|
|
||||||
|
wkhtmltopdf.
|
||||||
|
|
||||||
|
It shows the best results.
|
191
modules/microsite/docs/dev/adr/0008_convert_plain_text.md
Normal file
@ -0,0 +1,191 @@
|
|||||||
|
---
|
||||||
|
layout: docs
|
||||||
|
title: Convert Text Files
|
||||||
|
---
|
||||||
|
|
||||||
|
# {{ page.title }}
|
||||||
|
|
||||||
|
## Context and Problem Statement
|
||||||
|
|
||||||
|
How can plain text and markdown documents be converted into PDF
files?

Rendering images is not important here, since the files must be
self-contained when uploaded to Docspell.
|
||||||
|
|
||||||
|
The test file is the current documentation page of Docspell, found in
|
||||||
|
`microsite/docs/doc.md`.
|
||||||
|
|
||||||
|
```
|
||||||
|
---
|
||||||
|
layout: docs
|
||||||
|
position: 4
|
||||||
|
title: Documentation
|
||||||
|
---
|
||||||
|
|
||||||
|
# {page .title}
|
||||||
|
|
||||||
|
|
||||||
|
Docspell assists in organizing large amounts of PDF files that are
|
||||||
|
...
|
||||||
|
|
||||||
|
## How it works
|
||||||
|
|
||||||
|
Documents have two ...
|
||||||
|
|
||||||
|
1. You maintain a kind of address book. It should list all possible
|
||||||
|
correspondents and the concerning people/things. This grows
|
||||||
|
incrementally with each new unknown document.
|
||||||
|
2. When docspell analyzes a document, it tries to find matches within
|
||||||
|
your address ...
|
||||||
|
3. You can inspect ...
|
||||||
|
|
||||||
|
The set of meta data that docspell uses to draw suggestions from, must
|
||||||
|
be maintained ...
|
||||||
|
|
||||||
|
|
||||||
|
## Terms
|
||||||
|
|
||||||
|
In order to better understand these pages, some terms should be
|
||||||
|
explained first.
|
||||||
|
|
||||||
|
### Item
|
||||||
|
|
||||||
|
An **Item** is roughly your (pdf) document, only that an item may span
|
||||||
|
multiple files, which are called **attachments**. And an item has
|
||||||
|
**meta data** associated:
|
||||||
|
|
||||||
|
- a **correspondent**: the other side of the communication. It can be
|
||||||
|
an organization or a person.
|
||||||
|
- a **concerning person** or **equipment**: a person or thing that
|
||||||
|
this item is about. Maybe it is an insurance contract about your
|
||||||
|
car.
|
||||||
|
- ...
|
||||||
|
|
||||||
|
### Collective
|
||||||
|
|
||||||
|
The users of the application are part of a **collective**. A
|
||||||
|
**collective** is a group of users that share access to the same
|
||||||
|
items. The account name is therefore comprised of a *collective name*
|
||||||
|
and a *user name*.
|
||||||
|
|
||||||
|
All users of a collective are equal; they have same permissions to
|
||||||
|
access all...
|
||||||
|
```
|
||||||
|
|
||||||
|
Then a plain text file is tried, too (without any markup).
|
||||||
|
|
||||||
|
```
|
||||||
|
Maecenas mauris lectus, lobortis et purus mattis
|
||||||
|
|
||||||
|
Duis vehicula mi vel mi pretium
|
||||||
|
|
||||||
|
In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu.
|
||||||
|
|
||||||
|
Pellentesque scelerisque fermentum erat, id posuere justo pulvinar ut.
|
||||||
|
Cras id eros sed enim aliquam lobortis. Sed lobortis nisl ut eros
|
||||||
|
efficitur tincidunt. Cras justo mi, porttitor quis mattis vel,
|
||||||
|
ultricies ut purus. Ut facilisis et lacus eu cursus.
|
||||||
|
|
||||||
|
In eleifend velit vitae libero sollicitudin euismod:
|
||||||
|
|
||||||
|
- Fusce vitae vestibulum velit,
|
||||||
|
- Pellentesque vulputate lectus quis pellentesque commodo
|
||||||
|
|
||||||
|
the end.
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
## Considered Options
|
||||||
|
|
||||||
|
* [flexmark](https://github.com/vsch/flexmark-java) for markdown to
|
||||||
|
HTML, then use existing machinery described in [adr
|
||||||
|
7](./0007_convert_html_files)
|
||||||
|
* [pandoc](https://pandoc.org/) external command
|
||||||
|
|
||||||
|
|
||||||
|
### flexmark markdown library for java
|
||||||
|
|
||||||
|
Process files with [flexmark](https://github.com/vsch/flexmark-java)
|
||||||
|
and then create a PDF from the resulting html.
|
||||||
|
|
||||||
|
Using the following snippet:
|
||||||
|
|
||||||
|
``` scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import java.util

import cats.effect.ExitCode
import com.vladsch.flexmark.ext.gfm.strikethrough.StrikethroughExtension
import com.vladsch.flexmark.ext.tables.TablesExtension
import com.vladsch.flexmark.html.HtmlRenderer
import com.vladsch.flexmark.parser.Parser
// Package as of flexmark-java 0.50+; older releases use `util.options`.
import com.vladsch.flexmark.util.data.{DataKey, MutableDataSet}

def renderMarkdown(): ExitCode = {
  val opts = new MutableDataSet()
  opts.set(Parser.EXTENSIONS.asInstanceOf[DataKey[util.Collection[_]]],
    util.Arrays.asList(TablesExtension.create(),
      StrikethroughExtension.create()))

  val parser = Parser.builder(opts).build()
  val renderer = HtmlRenderer.builder(opts).build()
  // "in.txt|md" presumably stands for the input file, `in.txt` or `in.md`.
  val reader = Files.newBufferedReader(Paths.get("in.txt|md"))
  val doc = parser.parseReader(reader)
  val html = renderer.render(doc)
  // Wrap the rendered fragment into a complete page with side padding.
  val body = "<html><head></head><body style=\"padding: 0 5em;\">" + html + "</body></html>"
  Files.write(
    Paths.get("test.html"),
    body.getBytes(StandardCharsets.UTF_8))

  ExitCode.Success
}
```
|
||||||
|
|
||||||
|
Then run the result through `wkhtmltopdf`.
|
||||||
|
|
||||||
|
Markdown file:
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-md-java.jpg" title="Flexmark/wkhtmltopdf MD->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
TXT file:
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-txt-java.jpg" title="Flexmark/wkhtmltopdf TXT->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
### pandoc
|
||||||
|
|
||||||
|
Command:
|
||||||
|
|
||||||
|
```
|
||||||
|
pandoc -f markdown -t html -o test.pdf microsite/docs/doc.md
|
||||||
|
```
|
||||||
|
|
||||||
|
Markdown/Latex:
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-md-pandoc-latex.jpg" title="Pandoc (Latex) MD->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
Markdown/Html:
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-md-pandoc-html.jpg" title="Pandoc (html) MD->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
Text/Latex:
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-txt-pandoc-latex.jpg" title="Pandoc (Latex) TXT->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
Text/Html:
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-txt-pandoc-html.jpg" title="Pandoc (html) TXT->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
## Decision Outcome
|
||||||
|
|
||||||
|
Java library "flexmark".
|
||||||
|
|
||||||
|
I think all results are great. It depends on the type of document and
what one expects to see. I guess that most people expect something
like what pandoc-html produces for the kind of files Docspell is meant
for (it is not for newspaper articles, where pandoc-latex would be the
best fit).

But choosing pandoc means yet another external command to depend on.
And the results from flexmark are really good, too. One can fiddle
with options and CSS to make it look better.

To avoid introducing another external command, the decision is to use
flexmark and then the already existing html->pdf conversion.
|
231
modules/microsite/docs/dev/adr/0009_convert_office_docs.md
Normal file
@ -0,0 +1,231 @@
|
|||||||
|
---
|
||||||
|
layout: docs
|
||||||
|
title: Convert Office Documents
|
||||||
|
---
|
||||||
|
|
||||||
|
# {{ page.title }}
|
||||||
|
|
||||||
|
## Context and Problem Statement
|
||||||
|
|
||||||
|
How can office documents, like `docx` or `odt` be converted into a PDF
|
||||||
|
file that looks as much as possible like the original?
|
||||||
|
|
||||||
|
It would be nice to have a java-only solution. But if an external tool
|
||||||
|
has a better outcome, then an external tool is fine, too.
|
||||||
|
|
||||||
|
Since Docspell is free software, the tools must also be free.
|
||||||
|
|
||||||
|
## Considered Options
|
||||||
|
|
||||||
|
* [Apache POI](https://poi.apache.org) together with
|
||||||
|
[this](https://search.maven.org/artifact/fr.opensagres.xdocreport/org.apache.poi.xwpf.converter.pdf/1.0.6/jar)
|
||||||
|
library
|
||||||
|
* [pandoc](https://pandoc.org/) external command
|
||||||
|
* [abiword]() external command
|
||||||
|
* [Unoconv](https://github.com/unoconv/unoconv) external command
|
||||||
|
|
||||||
|
To choose an option, some documents are converted to PDF and compared.
Only the formats `docx` and `odt` are considered here. These are the
most used formats, so they have to look good. If an `xlsx` or `pptx`
doesn't look so great, that is ok.
|
||||||
|
|
||||||
|
Here is the native view to compare with:
|
||||||
|
|
||||||
|
ODT:
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-odt-native.jpg" title="Native view of an ODT example file">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
### `XWPFConverter`
|
||||||
|
|
||||||
|
I couldn't get any example to work. There were exceptions:
|
||||||
|
|
||||||
|
```
|
||||||
|
java.lang.IllegalArgumentException: Value for parameter 'id' was out of bounds
|
||||||
|
at org.apache.poi.util.IdentifierManager.reserve(IdentifierManager.java:80)
|
||||||
|
at org.apache.poi.xwpf.usermodel.XWPFRun.<init>(XWPFRun.java:101)
|
||||||
|
at org.apache.poi.xwpf.usermodel.XWPFRun.<init>(XWPFRun.java:146)
|
||||||
|
at org.apache.poi.xwpf.usermodel.XWPFParagraph.buildRunsInOrderFromXml(XWPFParagraph.java:135)
|
||||||
|
at org.apache.poi.xwpf.usermodel.XWPFParagraph.<init>(XWPFParagraph.java:88)
|
||||||
|
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:147)
|
||||||
|
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
|
||||||
|
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
|
||||||
|
at docspell.convert.Testing$.withPoi(Testing.scala:17)
|
||||||
|
at docspell.convert.Testing$.$anonfun$run$1(Testing.scala:12)
|
||||||
|
at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:87)
|
||||||
|
at cats.effect.internals.IORunLoop$RestartCallback.signal(IORunLoop.scala:355)
|
||||||
|
at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:376)
|
||||||
|
at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:316)
|
||||||
|
at cats.effect.internals.IOShift$Tick.run(IOShift.scala:36)
|
||||||
|
at cats.effect.internals.PoolUtils$$anon$2$$anon$3.run(PoolUtils.scala:51)
|
||||||
|
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
|
||||||
|
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
|
||||||
|
at java.lang.Thread.run(Thread.java:748)
|
||||||
|
```
|
||||||
|
|
||||||
|
The converter project (not Apache POI itself) seems unmaintained. I
could not find a website for it, and the artifact in Maven Central is
from 2016.
|
||||||
|
|
||||||
|
|
||||||
|
### Pandoc
|
||||||
|
|
||||||
|
I know pandoc as a great tool for converting between markup
documents. So this tries it with office documents. According to its
`--list-input-formats` output, it supports `docx` and `odt`.
|
||||||
|
|
||||||
|
From the pandoc manual:
|
||||||
|
|
||||||
|
> By default, pandoc will use LaTeX to create the PDF, which requires
|
||||||
|
> that a LaTeX engine be installed (see --pdf-engine below).
|
||||||
|
> Alternatively, pandoc can use ConTeXt, roff ms, or HTML as an
|
||||||
|
> intermediate format. To do this, specify an output file with a .pdf
|
||||||
|
> extension, as before, but add the --pdf-engine option or -t context,
|
||||||
|
> -t html, or -t ms to the command line. The tool used to generate the
|
||||||
|
> PDF from the intermediate format may be specified using --pdf-engine.
|
||||||
|
|
||||||
|
Trying with latex engine:
|
||||||
|
|
||||||
|
```
|
||||||
|
pandoc -f odt -o test.pdf example.odt
|
||||||
|
```
|
||||||
|
|
||||||
|
Results ODT:
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-odt-pandoc-latex.jpg" title="Pandoc (Latex) ODT->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
```
|
||||||
|
pandoc -f docx -o test.pdf example.docx
|
||||||
|
```
|
||||||
|
|
||||||
|
Results DOCX:
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-docx-pandoc-latex.jpg" title="Pandoc (Latex) DOCX->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
----
|
||||||
|
|
||||||
|
Trying with context engine:
|
||||||
|
|
||||||
|
```
|
||||||
|
pandoc -f odt -t context -o test.pdf example.odt
|
||||||
|
```
|
||||||
|
|
||||||
|
Results ODT:
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-odt-pandoc-context.jpg" title="Pandoc (Context) ODT->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
Results DOCX:
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-docx-pandoc-context.jpg" title="Pandoc (Context) DOCX->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
----
|
||||||
|
|
||||||
|
Trying with ms engine:
|
||||||
|
|
||||||
|
```
|
||||||
|
pandoc -f odt -t ms -o test.pdf example.odt
|
||||||
|
```
|
||||||
|
|
||||||
|
Results ODT:
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-odt-pandoc-ms.jpg" title="Pandoc (MS) ODT->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
Results DOCX:
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-docx-pandoc-ms.jpg" title="Pandoc (MS) DOCX->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
Trying with html engine (this requires `wkhtmltopdf` to be present):
|
||||||
|
|
||||||
|
```
|
||||||
|
$ pandoc --extract-media . -f odt -t html -o test.pdf example.odt
|
||||||
|
```
|
||||||
|
|
||||||
|
Results ODT:
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-odt-pandoc-html.jpg" title="Pandoc (html) ODT->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
Results DOCX:
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-docx-pandoc-html.jpg" title="Pandoc (html) DOCX->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
### Abiword
|
||||||
|
|
||||||
|
Trying with:
|
||||||
|
|
||||||
|
```
|
||||||
|
abiword --to=pdf example.odt
|
||||||
|
```
|
||||||
|
|
||||||
|
Results:
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-odt-abiword.jpg" title="Abiword ODT->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
Trying with a `docx` file failed. It worked with a `doc` file.
|
||||||
|
|
||||||
|
|
||||||
|
### Unoconv
|
||||||
|
|
||||||
|
Unoconv relies on libreoffice/openoffice, so installing it will result
|
||||||
|
in installing parts of libreoffice, which is a very large dependency.
|
||||||
|
|
||||||
|
Trying with:
|
||||||
|
|
||||||
|
```
|
||||||
|
unoconv -f pdf example.odt
|
||||||
|
```
|
||||||
|
|
||||||
|
Results ODT:
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-odt-unoconv.jpg" title="Unoconv ODT->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
Results DOCX:
|
||||||
|
|
||||||
|
<div class="thumbnail">
|
||||||
|
<img src="./img/example-docx-unoconv.jpg" title="Unoconv DOCX->PDF">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
## Decision Outcome
|
||||||
|
|
||||||
|
Unoconv.
|
||||||
|
|
||||||
|
The results from `unoconv` are really good.
|
||||||
|
|
||||||
|
Abiword is also not that bad: it didn't convert the chart, but all
font markup is there. It would be great to not depend on something as
big as libreoffice, but the results from `unoconv` are so much better.

Pandoc also deals very well with DOCX files (using the `context`
engine). The only thing that was not rendered was the embedded chart
(as with abiword). But all images and font styling were present.

It will be a configurable external command anyway, so users can
exchange it at any time for a different one.
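
What "configurable external command" could mean in practice is sketched below. The `CommandConfig` shape and the `{in}` placeholder are assumptions made for this illustration, not the actual Docspell configuration; the two example commands only use the arguments shown in the experiments above.

``` scala
object ExternalCommand {
  // Hypothetical configuration: a program name plus argument templates.
  final case class CommandConfig(program: String, args: Seq[String])

  // Builds the final command line by replacing the `{in}` placeholder
  // in every argument with the path of the input file.
  def render(cfg: CommandConfig, inputFile: String): Seq[String] =
    cfg.program +: cfg.args.map(_.replace("{in}", inputFile))

  // The chosen default and a possible replacement.
  val unoconv: CommandConfig = CommandConfig("unoconv", Seq("-f", "pdf", "{in}"))
  val abiword: CommandConfig = CommandConfig("abiword", Seq("--to=pdf", "{in}"))
}
```

Swapping the converter then amounts to changing one configuration value. The resulting `Seq[String]` could be handed to `scala.sys.process.Process` to run the command.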
|
16
modules/microsite/docs/dev/adr/0010_convert_image_files.md
Normal file
@ -0,0 +1,16 @@
|
|||||||
|
---
|
||||||
|
layout: docs
|
||||||
|
title: Convert Image Files
|
||||||
|
---
|
||||||
|
|
||||||
|
# {{ page.title }}
|
||||||
|
|
||||||
|
## Context and Problem Statement
|
||||||
|
|
||||||
|
How to convert image files properly to pdf?
|
||||||
|
|
||||||
|
|
||||||
|
## Considered Options
|
||||||
|
|
||||||
|
* [pdfbox]() library
|
||||||
|
* [pandoc](https://pandoc.org/) external command
|
BIN (file name missing from the listing), 385 KiB
BIN modules/microsite/docs/dev/adr/img/example-docx-pandoc-html.jpg, 443 KiB
BIN modules/microsite/docs/dev/adr/img/example-docx-pandoc-latex.jpg, 291 KiB
BIN modules/microsite/docs/dev/adr/img/example-docx-pandoc-ms.jpg, 353 KiB
BIN modules/microsite/docs/dev/adr/img/example-docx-unoconv.jpg, 292 KiB
BIN modules/microsite/docs/dev/adr/img/example-html-native.jpg, 145 KiB
BIN modules/microsite/docs/dev/adr/img/example-html-pandoc-html.jpg, 167 KiB
BIN modules/microsite/docs/dev/adr/img/example-html-pandoc-latex.jpg, 135 KiB
BIN modules/microsite/docs/dev/adr/img/example-html-unoconv.jpg, 148 KiB
BIN modules/microsite/docs/dev/adr/img/example-html-wkhtmltopdf.jpg, 142 KiB
BIN modules/microsite/docs/dev/adr/img/example-md-java.jpg, 586 KiB
BIN modules/microsite/docs/dev/adr/img/example-md-pandoc-html.jpg, 479 KiB
BIN modules/microsite/docs/dev/adr/img/example-md-pandoc-latex.jpg, 280 KiB
BIN modules/microsite/docs/dev/adr/img/example-odt-abiword.jpg, 270 KiB
BIN modules/microsite/docs/dev/adr/img/example-odt-native.jpg, 363 KiB
BIN (file name missing from the listing), 418 KiB
BIN modules/microsite/docs/dev/adr/img/example-odt-pandoc-html.jpg, 500 KiB
BIN modules/microsite/docs/dev/adr/img/example-odt-pandoc-latex.jpg, 349 KiB
BIN modules/microsite/docs/dev/adr/img/example-odt-pandoc-ms.jpg, 350 KiB
BIN modules/microsite/docs/dev/adr/img/example-odt-unoconv.jpg, 296 KiB
BIN modules/microsite/docs/dev/adr/img/example-txt-java.jpg, 176 KiB
BIN modules/microsite/docs/dev/adr/img/example-txt-pandoc-html.jpg, 174 KiB
BIN modules/microsite/docs/dev/adr/img/example-txt-pandoc-latex.jpg, 155 KiB