Tuesday, January 23, 2018

Your Documents Under the Magnifying Glass

A few years ago I moved my household administrivia to a paperless system. Instead of stacking file folders deep with bills and statements, I would scan and shred everything. This greatly helped with storage space, but within a couple of years I ended up with a network drive filled with over 3,000 PDFs, images and documents. Bear in mind that the majority of these are scanned documents, so the contents are images instead of machine-readable text. Everything was dumped into a single directory, with files named by the timestamp of when they were scanned, so organizing the documents into folders and sub-folders would have taken hours.

Instead of burning hours sorting documents, I started burning hours building a simple set of applications that would read document metadata, attempt to convert the images to text, group documents by common letterhead and then provide a simple search interface over all of it. Since optical character recognition is hit-and-miss, any full-text search needs to permit approximate indexing and searching so that fuzzy matches are possible.
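To make "fuzzy match" concrete: Elasticsearch's fuzzy queries are based on edit distance, meaning the number of single-character changes needed to turn one term into another. The sketch below is not DocMag code; the class and method names are hypothetical, and it only illustrates why a small edit budget lets a search recover from typical OCR mistakes such as reading "m" as "rn".

```java
// Hypothetical illustration, not taken from DocMag or DocIndex.
final class FuzzyMatch {

    /** Minimum number of single-character insertions, deletions or substitutions turning a into b. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning an empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Case-insensitive match that tolerates up to maxEdits character errors. */
    static boolean fuzzyEquals(String query, String token, int maxEdits) {
        return editDistance(query.toLowerCase(), token.toLowerCase()) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("invoice", "invo1ce"));         // prints 1
        System.out.println(fuzzyEquals("Statement", "staternent", 2));  // prints true
    }
}
```

In a real deployment the search engine does this work against its index; the point here is only that a tolerance of one or two edits is enough to absorb many OCR misreads.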

In the end I created two apps: DocMag and DocIndex. DocMag serves as the search front-end and allows users to perform full-text searches on scanned documents, label them with tags and automagically group other documents with the same letterhead or logo. The interface is pretty spartan and uses Spring Boot to build a straightforward integration with Elasticsearch. DocIndex is the batch process that crawls a filesystem, parses the documents using OCR, generates thumbnails, tags similar documents using computer vision-based template matching, and stores document metadata within Elasticsearch.

DocMag was created in Groovy using Spring Boot (Spring Web, Spring Data, etc.). I did this mainly to understand how Spring Boot's conventions translated over to the Groovy world... it had been quite a while since I had worked with Grails. It turns out that Groovy, Spring Boot and Thymeleaf complement each other quite well and make for fairly simple web development.

DocIndex was initially created with Spring Boot and Java 9. I griped in an earlier post about my problems with Java 9's dependency management, so I fell back to Java 8 and its lambda expressions and work queue management. This permits multithreaded parsing of discovered files, so document indexing scales vertically as cores are added. Horizontal scaling should be possible by replacing the in-memory work queue with a proper shared message broker. I have already filed a "reminder" issue to migrate to a proper broker so this can be done sometime in the future.
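As a rough idea of what an in-memory work queue with lambdas looks like in Java 8, here is a minimal sketch built on the standard ExecutorService. It is not DocIndex's actual code: the ParseQueue class, the parseAll method and the stand-in parser lambda are all hypothetical, and the real OCR and thumbnail work is replaced by a trivial string operation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration, not taken from DocIndex.
final class ParseQueue {

    /** Runs parser over every path on a fixed pool of worker threads and returns results in input order. */
    static <T> List<T> parseAll(List<String> paths, Function<String, T> parser, int workers) {
        // The pool keeps submitted tasks in an internal in-memory queue until a worker is free.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<T>> pending = new ArrayList<>();
            for (String path : paths) {
                pending.add(pool.submit(() -> parser.apply(path)));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : pending) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for parsing to finish", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A parse task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // One worker per core, which is what "scaling by adding cores" amounts to here.
        int cores = Runtime.getRuntime().availableProcessors();
        List<String> discovered = Arrays.asList("a.pdf", "b.png", "c.tiff");
        // Stand-in for the real per-file work (OCR, thumbnails, template matching).
        List<String> parsed = parseAll(discovered, path -> "parsed:" + path, cores);
        System.out.println(parsed); // prints [parsed:a.pdf, parsed:b.png, parsed:c.tiff]
    }
}
```

Because the queue lives inside one JVM, only that machine's cores can drain it. Swapping it for a shared message broker, as described above, is what would let several indexer instances pull from the same backlog.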

Both DocMag and DocIndex are published as container images on Docker Hub. This was especially necessary with DocIndex, as it relies heavily on native libraries for Tesseract OCR and OpenCV. OpenCV was the most troublesome: each Linux distribution ships a different version of OpenCV, and the version changes quite rapidly. Building containers for distribution allowed me to ensure users get the version of the native libraries that works with their Java bindings.

Another nice feature of the containerized deployment model was composition. I was able to pair the correct revision of Elasticsearch, conditionally include Kibana, and provide a simple web application firewall by placing DocMag behind ModSecurity and Apache. Network connections could be maintained between Elasticsearch, ModSecurity, and DocMag without any of these interconnects leaking to the "outside" world. That let me expose only ModSecurity to outside traffic and permit DocMag to receive requests only through ModSecurity. Elasticsearch could be hidden as well, available only on the internal network managed by Docker Compose.

Deployment should be relatively straightforward. Since everything is published to Docker Hub as a container, one should just need to download the docker-compose.yml file and issue:

    export DOCUMENT_HOST_DIR=/mnt/documents && docker-compose up -d

This should provision a single-node Elasticsearch instance, start DocMag behind ModSecurity, and begin indexing with DocIndex.

If you are stuck digging through mountains of scanned documents, give DocMag a try. Ease of installation is one of its primary goals - so let me know if you find any issues getting it running!