mirror of https://github.com/TheAnachronism/docspell.git synced 2024-11-13 02:31:10 +00:00

Eike Kettner 3bc24af85d Remove thumbnail faq entry

2020-11-15 20:10:50 +01:00


+++
title = "FAQ"
weight = 100
description = "Frequently asked questions."
insert_anchor_links = "right"
[extra]
mktoc = true
+++

FAQ

Where are my files stored?

Everything, including all files, is stored in the database.

That seems to put off some people coming to Docspell, so here are some thoughts on why this is so and why it may not be such a big deal. It was a conscious decision: keeping all files in the file system was considered, but not chosen in the end.

First, it was clear that a database is required in order to support the planned features. It is required to efficiently support a multi-user application: the account data, passwords and many other things (tags, metadata, etc.) must be stored and queried reliably. Very often a relational model emerges and a database is the best fit; otherwise one would just "reinvent the wheel". So the options are a database plus files in the filesystem, or everything in one database. There are, of course, pros and cons to both. These were the reasons for the current decision:

Backups: With two things, you have to take care to back up both. All supported databases have good support for backups, so having just one thing to back up is (usually) better than having to back up two. YMMV if you already have some backup system in place.
Simpler, easier to maintain application: there is just one storage system used by the application instead of two, which reduces complexity in the code.
Consistency: Both "databases" (filesystem + relational db) can easily get out of sync, and this will break the application. It is a big plus to be able to rely on the strong ACID guarantees of database systems.
Distributed/Scaling: One goal is to run Docspell in a distributed way. If files were on the filesystem, they would eventually have to be transferred to all the nodes. This is trivially solved by using the database as a central storage and synchronization point.
Support for binary files in today's databases is not that bad. Docspell has no intention of storing very large files, so it will be quite efficient. I've used it several times and never had problems related to this. This postgres page shows some pros and cons.
The advantage of having files in the filesystem is weakened if files have to be stored using some hash of filenames, which might be necessary in order to overcome certain file-system limitations.
For low-volume/traffic installations where you just don't want to run a real database server, you can use the H2 database. This is an in-process database (comparable to sqlite) and doesn't require another server to be running.

You can find more in these issues: 270, 289.

What's the Exit Strategy then?

Of course, there is no guarantee that this project will be alive in the future. It is important to know how to use your data then.

A very important thing is: it is FREE software (as in freedom and in beer). That is, you can be sure to use the current version for as long as you want. So it is a good idea to back up the releases (or docker images) you are using alongside your data. This ensures that you will be able to use your data "forever". This also means that you can inspect the data model and use the API and/or standard SQL tools to get all the data. While this may be difficult or inconvenient, the point here is only that it is possible. It's not hidden or obscured; nothing is lost. You can even back up the sources to keep this documentation, too.

In order to move to a different tool, it is necessary to get the data out of Docspell in a machine-readable, automated way. Currently, there is no easy way to do this. However, it is possible to get at all data with some scripting effort. Everything can be queried using an HTTP/REST API, so you can write a script/program that, for example, queries all items and downloads the files (something like this might be provided soon; for now there are starting points in the /tools folder). It is planned to provide a more convenient way to export the data into the file system, but there is no ETA for this.
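As a rough illustration of such a script, here is a minimal Python sketch of the "query all items, then download the files" loop. It is not taken from the /tools folder and makes several assumptions: `fetch_page` and `download` are hypothetical callbacks that you would implement against the REST API, and the item shape (`id`, `attachments`, `name`) is made up for the example and may not match what the API actually returns.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, List


def iter_items(fetch_page: Callable[[int, int], List[Dict]],
               page_size: int = 50) -> Iterator[Dict]:
    """Yield every item by requesting pages until an empty one comes back.

    fetch_page(offset, limit) is a hypothetical callback wrapping the
    actual HTTP request against the API.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield from page
        offset += len(page)


def export_files(items: Iterable[Dict],
                 download: Callable[[str], bytes],
                 out_dir: Path) -> int:
    """Write each attachment to out_dir/<item id>/<file name>.

    download(attachment_id) is a hypothetical callback returning the
    file content. Returns the number of files written.
    """
    count = 0
    for item in items:
        item_dir = out_dir / item["id"]
        item_dir.mkdir(parents=True, exist_ok=True)
        for att in item["attachments"]:
            # Path(...).name drops any directory part of the stored name.
            (item_dir / Path(att["name"]).name).write_bytes(download(att["id"]))
            count += 1
    return count
```

Keeping the HTTP calls behind two callbacks means the loop can be tried against fake data first and wired to the real API afterwards.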

My recommendation is to run periodic database backups and also store the binaries/docker images. This lets you re-create the current state at any time, which allows you to postpone the problem of getting the data out of Docspell in a specific format.
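A periodic backup could look roughly like the following Python sketch. It assumes PostgreSQL is the database in use and that `pg_dump` is installed; the database name, the target directory and the helper `build_backup_command` are hypothetical and need to be adapted to your installation.

```python
from datetime import datetime
from pathlib import Path
from typing import List


def build_backup_command(db_name: str, out_dir: Path, now: datetime) -> List[str]:
    """Build a pg_dump call that writes a timestamped dump file.

    Assumes PostgreSQL; other databases need their own backup tool.
    """
    stamp = now.strftime("%Y%m%d-%H%M%S")
    target = out_dir / f"{db_name}-{stamp}.dump"
    return [
        "pg_dump",
        "--format=custom",   # archive format that pg_restore can read
        f"--file={target}",
        db_name,             # hypothetical database name, passed by the caller
    ]
```

To actually run it, pass the result to `subprocess.run(cmd, check=True)` from a cron job or systemd timer, and copy the release or docker image you are using next to the dump.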

Note that you don't need to back up the SOLR instance (if you're using fulltext search), since it can be recreated by Docspell.

What if my documents already contain OCR-ed text?

Documents are normally not OCR-ed twice. Docspell first extracts the text from a PDF. If this is below some configurable minimum length, it will still run OCR just to see if that gives more. Then the longer of the texts is taken. By default it will hand all PDFs to ocrmypdf, but this will skip already OCR-ed files. The whole ocrmypdf process can be switched off in the config file. So if you only have such PDFs, this would be an option. Alternatively, it is possible to change the ocrmypdf options in Docspell's config file to fit your requirements.
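The text selection described above can be sketched roughly as follows. This is only an illustration in Python, not Docspell's actual code: `choose_text`, `run_ocr` and `min_length` are hypothetical names, and the default of 500 characters is made up for the example.

```python
from typing import Callable


def choose_text(extracted: str,
                run_ocr: Callable[[], str],
                min_length: int = 500) -> str:
    """Return the text to keep for a PDF.

    extracted  -- text pulled directly from the PDF
    run_ocr    -- hypothetical callback that performs OCR and returns its text
    min_length -- hypothetical threshold; the real one is configurable
    """
    text = extracted.strip()
    if len(text) >= min_length:
        # Enough text is already embedded, so OCR is not run at all.
        return text
    # Too little text: run OCR anyway and keep whichever result is longer.
    ocr_text = run_ocr().strip()
    return ocr_text if len(ocr_text) > len(text) else text
```

Passing OCR as a callback keeps the expensive step lazy: it only runs when the embedded text is too short.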

Is there support for migrating from other tools?

Currently there exists a bash script to import files and metadata from Paperless. Please see this issue.

Wh…?

If you have any questions, don't hesitate to ask. You can open an issue or leave a message in the gitter room. If you don't want to sign up there, drop a mail to info at docspell.org.
