From 7fe8843893a2bc7bb1c7c08eb6c540308fbeccee Mon Sep 17 00:00:00 2001
From: Eike Kettner <eike.kettner@posteo.de>
Date: Thu, 20 Feb 2020 21:43:37 +0100
Subject: [PATCH] Update documentation sites

---
 modules/microsite/docs/doc/install.md | 32 ++++++++++++++++++---------
 modules/microsite/docs/features.md    | 10 ++++++++-
 modules/microsite/docs/getit.md       | 19 +++++++++++-----
 3 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/modules/microsite/docs/doc/install.md b/modules/microsite/docs/doc/install.md
index af4ba903..6f085d53 100644
--- a/modules/microsite/docs/doc/install.md
+++ b/modules/microsite/docs/doc/install.md
@@ -68,19 +68,28 @@ component.
   extraction](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
   (at the expense of a longer runtime).
 - [Tesseract](https://github.com/tesseract-ocr/tesseract) is the tool
-  doing the OCR (converts images into text). It is a widely used open
-  source OCR engine. Tesseract 3 and 4 should work with docspell; you
-  can adopt the command line in the configuration file, if necessary.
+  doing the OCR (converts images into text). It can also convert
+  images into PDF files. It is a widely used open source OCR engine.
+  Tesseract 3 and 4 should work with docspell; you can adapt the
+  command line in the configuration file, if necessary.
+- [Unoconv](https://github.com/unoconv/unoconv) is used to convert
+  office documents into PDF files. It uses libreoffice/openoffice.
+- [wkhtmltopdf](https://wkhtmltopdf.org/) is used to convert HTML into
+  PDF files.
 
+The performance of `unoconv` can be improved by starting `unoconv -l`
+in a separate process. This runs a libreoffice/openoffice listener
+and therefore avoids starting one each time `unoconv` is called.
 
 ### Example Debian
 
 On Debian this should install all joex requirements:
 
 ``` bash
-sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper
+sudo apt-get install ghostscript tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper unoconv wkhtmltopdf
 ```
 
+
 ## Database
 
 Both components must have access to a SQL database. Docspell has
@@ -203,12 +212,15 @@ work is done by the joex components.
 ### Joex
 
 Running the joex component on the Raspberry Pi is possible, but will
-result in long processing times. Tested on a RPi model 3 (4 cores, 1G
-RAM) processing a PDF (scanned with 300dpi) with two pages took
-9:52. You can speed it up considerably by uninstalling the `unpaper`
-command, because this step takes quite long. This, of course, reduces
-the quality of OCR. But without `unpaper` the same sample pdf was then
-processed in 1:24, a speedup of 8 minutes.
+result in long processing times for OCR. Files that don't require OCR
+are no problem.
+
+In a test on a RPi model 3 (4 cores, 1G RAM), processing a two-page
+PDF (scanned at 300dpi) took 9:52. You can speed it up considerably
+by uninstalling the `unpaper` command, because this step takes quite
+long. This, of course, reduces the quality of OCR. But without
+`unpaper` the same sample PDF was processed in 1:24, about 8.5
+minutes faster.
 
 You should limit the joex pool size to 1 and, depending on your model
 and the amount of RAM, set a heap size of at least 500M
diff --git a/modules/microsite/docs/features.md b/modules/microsite/docs/features.md
index 5e9ec6ba..c5d60643 100644
--- a/modules/microsite/docs/features.md
+++ b/modules/microsite/docs/features.md
@@ -9,6 +9,7 @@ title: Features and Limitations
 - Multiple users per account
 - Handle multiple documents as one unit
 - OCR using [tesseract](https://github.com/tesseract-ocr/tesseract)
+- Conversion to PDF: all files are converted into a PDF file
 - Text is analysed to find and attach meta data automatically
 - Manage document processing (cancel jobs, set priorities)
 - Everything available via a documented [REST Api](api)
@@ -18,6 +19,14 @@ title: Features and Limitations
 - REST server and document processing are separate applications which
   can be scaled-out independently
 - Everything stored in a SQL database: PostgreSQL, MariaDB or H2
+- Files supported:
+  - PDF
+  - common MS Office (doc, docx, xls, xlsx)
+  - OpenDocument (odt, ods)
+  - RichText (rtf)
+  - Images (jpg, png, tiff)
+  - HTML
+  - text/* (treated as Markdown)
 - Tools:
   - Watch a folder: watch folders for changes and send files to docspell
   - Firefox plugin: right click on a link and send the file to docspell
@@ -31,7 +40,6 @@ These are current known limitations that may be of interest for
 considering docspell at the moment. Hopefully they will be resolved
 eventually….
 
-- Only PDF files possible for now.
 - No fulltext search implemented. This currently has very low
   priority, because I myself never needed it. Open an issue if you
   find it important.
diff --git a/modules/microsite/docs/getit.md b/modules/microsite/docs/getit.md
index 0d533269..19c2721f 100644
--- a/modules/microsite/docs/getit.md
+++ b/modules/microsite/docs/getit.md
@@ -18,11 +18,20 @@ You need to download the two files:
 ## Prerequisite
 
 Install Java (use your package manager or look
-[here](https://adoptopenjdk.net/)),
-[tesseract](https://github.com/tesseract-ocr/tesseract),
-[ghostscript](http://pages.cs.wisc.edu/~ghost/) and possibly
-[unpaper](https://github.com/Flameeyes/unpaper). The last is not
-really required, but improves OCR.
+[here](https://adoptopenjdk.net/)).
+
+OCR functionality requires the following tools:
+
+- [tesseract](https://github.com/tesseract-ocr/tesseract)
+- [ghostscript](http://pages.cs.wisc.edu/~ghost/)
+- [unpaper](https://github.com/Flameeyes/unpaper) (optional)
+
+`unpaper` is not strictly required, but improves OCR.
+
+PDF conversion requires the following tools:
+
+- [unoconv](https://github.com/unoconv/unoconv)
+- [wkhtmltopdf](https://wkhtmltopdf.org/)
 
 
 ## Running