mirror of
https://github.com/TheAnachronism/docspell.git
synced 2024-11-13 02:31:10 +00:00
192 lines
5.2 KiB
Markdown
192 lines
5.2 KiB
Markdown
|
---
|
||
|
layout: docs
|
||
|
title: Convert Text Files
|
||
|
---
|
||
|
|
||
|
# {{ page.title }}
|
||
|
|
||
|
## Context and Problem Statement
|
||
|
|
||
|
How can plain text and markdown documents be converted into a PDF
|
||
|
files?
|
||
|
|
||
|
Rendering images is not important here, since the files must be self
|
||
|
contained when uploaded to Docspell.
|
||
|
|
||
|
The test file is the current documentation page of Docspell, found in
|
||
|
`microsite/docs/doc.md`.
|
||
|
|
||
|
```
|
||
|
---
|
||
|
layout: docs
|
||
|
position: 4
|
||
|
title: Documentation
|
||
|
---
|
||
|
|
||
|
# {page .title}
|
||
|
|
||
|
|
||
|
Docspell assists in organizing large amounts of PDF files that are
|
||
|
...
|
||
|
|
||
|
## How it works
|
||
|
|
||
|
Documents have two ...
|
||
|
|
||
|
1. You maintain a kind of address book. It should list all possible
|
||
|
correspondents and the concerning people/things. This grows
|
||
|
incrementally with each new unknown document.
|
||
|
2. When docspell analyzes a document, it tries to find matches within
|
||
|
your address ...
|
||
|
3. You can inspect ...
|
||
|
|
||
|
The set of meta data that docspell uses to draw suggestions from, must
|
||
|
be maintained ...
|
||
|
|
||
|
|
||
|
## Terms
|
||
|
|
||
|
In order to better understand these pages, some terms should be
|
||
|
explained first.
|
||
|
|
||
|
### Item
|
||
|
|
||
|
An **Item** is roughly your (pdf) document, only that an item may span
|
||
|
multiple files, which are called **attachments**. And an item has
|
||
|
**meta data** associated:
|
||
|
|
||
|
- a **correspondent**: the other side of the communication. It can be
|
||
|
an organization or a person.
|
||
|
- a **concerning person** or **equipment**: a person or thing that
|
||
|
this item is about. Maybe it is an insurance contract about your
|
||
|
car.
|
||
|
- ...
|
||
|
|
||
|
### Collective
|
||
|
|
||
|
The users of the application are part of a **collective**. A
|
||
|
**collective** is a group of users that share access to the same
|
||
|
items. The account name is therefore comprised of a *collective name*
|
||
|
and a *user name*.
|
||
|
|
||
|
All users of a collective are equal; they have same permissions to
|
||
|
access all...
|
||
|
```
|
||
|
|
||
|
Then a plain text file is tried, too (without any markup).
|
||
|
|
||
|
```
|
||
|
Maecenas mauris lectus, lobortis et purus mattis
|
||
|
|
||
|
Duis vehicula mi vel mi pretium
|
||
|
|
||
|
In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu.
|
||
|
|
||
|
Pellentesque scelerisque fermentum erat, id posuere justo pulvinar ut.
|
||
|
Cras id eros sed enim aliquam lobortis. Sed lobortis nisl ut eros
|
||
|
efficitur tincidunt. Cras justo mi, porttitor quis mattis vel,
|
||
|
ultricies ut purus. Ut facilisis et lacus eu cursus.
|
||
|
|
||
|
In eleifend velit vitae libero sollicitudin euismod:
|
||
|
|
||
|
- Fusce vitae vestibulum velit,
|
||
|
- Pellentesque vulputate lectus quis pellentesque commodo
|
||
|
|
||
|
the end.
|
||
|
```
|
||
|
|
||
|
|
||
|
## Considered Options
|
||
|
|
||
|
* [flexmark](https://github.com/vsch/flexmark-java) for markdown to
|
||
|
HTML, then use existing machinery described in [adr
|
||
|
7](./0007_convert_html_files)
|
||
|
* [pandoc](https://pandoc.org/) external command
|
||
|
|
||
|
|
||
|
### flexmark markdown library for java
|
||
|
|
||
|
Process files with [flexmark](https://github.com/vsch/flexmark-java)
|
||
|
and then create a PDF from the resulting html.
|
||
|
|
||
|
Using the following snippet:
|
||
|
|
||
|
``` scala
|
||
|
def renderMarkdown(): ExitCode = {
|
||
|
val opts = new MutableDataSet()
|
||
|
opts.set(Parser.EXTENSIONS.asInstanceOf[DataKey[util.Collection[_]]],
|
||
|
util.Arrays.asList(TablesExtension.create(),
|
||
|
StrikethroughExtension.create()));
|
||
|
|
||
|
val parser = Parser.builder(opts).build()
|
||
|
val renderer = HtmlRenderer.builder(opts).build()
|
||
|
val reader = Files.newBufferedReader(Paths.get("in.txt|md"))
|
||
|
val doc = parser.parseReader(reader)
|
||
|
val html = renderer.render(doc)
|
||
|
val body = "<html><head></head><body style=\"padding: 0 5em;\">" + html + "</body></html>"
|
||
|
Files.write(
|
||
|
Paths.get("test.html"),
|
||
|
body.getBytes(StandardCharsets.UTF_8))
|
||
|
|
||
|
ExitCode.Success
|
||
|
}
|
||
|
```
|
||
|
|
||
|
Then run the result through `wkhtmltopdf`.
|
||
|
|
||
|
Markdown file:
|
||
|
<div class="thumbnail">
|
||
|
<img src="./img/example-md-java.jpg" title="Flexmark/wkhtmltopdf MD->PDF">
|
||
|
</div>
|
||
|
|
||
|
TXT file:
|
||
|
<div class="thumbnail">
|
||
|
<img src="./img/example-txt-java.jpg" title="Flexmark/wkhtmltopdf TXT->PDF">
|
||
|
</div>
|
||
|
|
||
|
|
||
|
### pandoc
|
||
|
|
||
|
Command:
|
||
|
|
||
|
```
|
||
|
pandoc -f markdown -t html -o test.pdf microsite/docs/doc.md
|
||
|
```
|
||
|
|
||
|
Markdown/Latex:
|
||
|
<div class="thumbnail">
|
||
|
<img src="./img/example-md-pandoc-latex.jpg" title="Pandoc (Latex) MD->PDF">
|
||
|
</div>
|
||
|
|
||
|
Markdown/Html:
|
||
|
<div class="thumbnail">
|
||
|
<img src="./img/example-md-pandoc-html.jpg" title="Pandoc (html) MD->PDF">
|
||
|
</div>
|
||
|
|
||
|
Text/Latex:
|
||
|
<div class="thumbnail">
|
||
|
<img src="./img/example-txt-pandoc-latex.jpg" title="Pandoc (Latex) TXT->PDF">
|
||
|
</div>
|
||
|
|
||
|
Text/Html:
|
||
|
<div class="thumbnail">
|
||
|
<img src="./img/example-txt-pandoc-html.jpg" title="Pandoc (html) TXT->PDF">
|
||
|
</div>
|
||
|
|
||
|
|
||
|
## Decision Outcome
|
||
|
|
||
|
Java library "flexmark".
|
||
|
|
||
|
I think all results are great. It depends on the type of document and
|
||
|
what one expects to see. I guess that most people expect something
|
||
|
like pandoc-html produces for the kind of files docspell is for (it is
|
||
|
not for newspaper articles, where pandoc-latex would be best fit).
|
||
|
|
||
|
But choosing pandoc means yet another external command to depend on.
|
||
|
And the results from flexmark are really good, too. One can fiddle
|
||
|
with options and css to make it look better.
|
||
|
|
||
|
To not introduce another external command, decision is to use flexmark
|
||
|
and then the already existing html->pdf conversion.
|