Adding extraction primitives

2025-09-28 15:48:22 +00:00 · 2020-02-16 21:37:26 +01:00
parent 851ee7ef0f
commit 8143a4edcc
46 changed files with 2731 additions and 83 deletions
--- a/modules/microsite/docs/dev/adr/0011_extract_text.md
+++ b/modules/microsite/docs/dev/adr/0011_extract_text.md
@@ -0,0 +1,77 @@
+---
+layout: docs
+title: Extract Text from Files
+---
+
+# Extract Text from Files
+
+## Context and Problem Statement
+
+With support for more file types, there must be a way to extract text
+from all of them. It is better to extract text from the source files
+than from the converted PDF file.
+
+There are multiple options and multiple file types. Again, the top
+priority is to use a Java/Scala library to reduce external
+dependencies.
+
+## Considered Options
+
+### MS Office Documents
+
+There is only one library I know of: [Apache
+POI](https://poi.apache.org/). It supports `doc(x)` and `xls(x)`.
+However, it doesn't support the OpenDocument format (odt and ods).
+
+### OpenDocument Format
+
+There are two libraries:
+
+- [Apache Tika Parser](https://tika.apache.org/)
+- [ODFToolkit](https://github.com/tdf/odftoolkit)
+
+*Tika:* The tika-parsers package contains an OpenDocument parser for
+extracting text. But it has a huge dependency tree, since it is a
+super-package containing a parser for almost every common file type.
+
+*ODF Toolkit:* This depends on [Apache Jena](https://jena.apache.org)
+and also pulls in quite a few dependencies (though not as many as
+tika-parsers). That is not too bad, since it is a library for
+manipulating OpenDocument files. But all I need is to extract
+text. I created tests that extracted text from my odt/ods files. It
+worked at first sight, but running the tests in a loop resulted in
+strange `NullPointerException`s (it only worked on the first run).
+
+### Richtext
+
+Rich text (RTF) is supported by the JDK (using `RTFEditorKit` from
+Swing).
+
+### PDF
+
+For "image" PDF files, Tesseract is used. For "text" PDF files, the
+library [Apache PDFBox](https://pdfbox.apache.org) can be used.
+
+There is also [iText](https://github.com/itext/itext7), which is
+licensed under the AGPL.
+
+### Images
+
+For images and "image" PDF files, Tesseract is already in place.
+
+### HTML
+
+HTML must be converted into a PDF file before text can be extracted.
+
+### Text/Markdown
+
+These files can be used as-is, obviously.
+
+
+## Decision Outcome
+
+- MS Office files: POI library
+- OpenDocument files: Tika, but integrating only the few source files
+  that make up its OpenDocument parser. Because of its huge dependency
+  tree, the library itself is not added.
+- PDF: Apache PDFBox. I know this library better than iText.