Tuesday, January 23, 2018

Your Documents Under the Magnifying Glass

A few years ago I moved my household administrivia to a paperless system. Instead of stacking file folders deep with bills and statements, everything would be scanned & shredded. This greatly helped with storage space - but in a couple of years I ended up with a network drive filled with over 3,000 PDFs, images and documents. Bear in mind the majority of these are scanned documents - so the contents are images instead of machine-readable text. Everything was dumped into a single directory and files were named based on the timestamp of when they were scanned, so organizing the documents into folders and sub-folders would have taken hours.

Instead of burning hours sorting documents I started burning hours building a simple set of applications that would read document metadata, attempt to convert the images to text, group documents by common letterhead and then provide a simple search interface over all of it. Since optical character recognition is hit-and-miss, any full-text search should permit approximate indexing and searching to allow for fuzzy matches.
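
As a rough illustration of what a fuzzy match looks like in practice, here is a minimal sketch of an Elasticsearch match query built as a Python dictionary. The helper name and the "body" field are hypothetical placeholders rather than anything taken from DocMag; the "fuzziness" option set to "AUTO" is a standard setting on match queries that tolerates a few character edits per term, which is roughly what a misread letter from OCR needs.

```python
import json


def build_fuzzy_query(text, field="body"):
    """Build an Elasticsearch match query that tolerates small OCR errors.

    The default field name "body" is a hypothetical placeholder.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": "AUTO",  # allowed edits scale with term length
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent to a search endpoint.
    print(json.dumps(build_fuzzy_query("electric bill"), indent=2))
```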

In the end I created two apps: DocMag and DocIndex. DocMag serves as the search front-end and allows users to perform full-text searches on scanned documents, label them with tags and automagically group other documents with the same letterhead or logo. The interface is pretty spartan and uses Spring Boot to build a straightforward integration into Elasticsearch. DocIndex is the batch process that crawls a filesystem and parses the documents using OCR, generates thumbnails, tags similar documents using computer vision-based template matching, and stores document metadata within Elasticsearch.
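
To give a feel for what template matching does, here is a toy sketch in plain Python. It is not the DocIndex code and not how OpenCV implements it; it simply slides a small "logo" over a "page" of grayscale values and keeps the position with the smallest sum of squared differences. The function name and the tiny sample grids are made up for illustration.

```python
def match_template(image, template):
    """Slide template over image and return (row, col, score) of the best fit.

    Both arguments are 2D lists of grayscale values. The score is the sum of
    squared differences, so 0 means a pixel-perfect match.
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    best = None
    for row in range(img_h - tpl_h + 1):
        for col in range(img_w - tpl_w + 1):
            score = sum(
                (image[row + r][col + c] - template[r][c]) ** 2
                for r in range(tpl_h)
                for c in range(tpl_w)
            )
            if best is None or score < best[2]:
                best = (row, col, score)
    return best


# A tiny made-up "page" with a 2x2 "logo" placed at row 1, column 2.
page = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 7, 0],
    [0, 0, 8, 9, 0],
    [0, 0, 0, 0, 0],
]
logo = [
    [9, 7],
    [8, 9],
]

if __name__ == "__main__":
    print(match_template(page, logo))
```

Two documents whose best score for the same letterhead template falls under some chosen threshold could then be given the same tag.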

DocMag was created in Groovy using Spring Boot (Spring Web, Spring Data, etc.). I did this mainly to understand how Spring Boot's conventions translated over to the Groovy world... it had been quite a while since I had worked with Grails. It turns out that Groovy, Spring Boot and Thymeleaf complement each other quite well and make for fairly simple web development.

DocIndex was created with Spring Boot and Java 9 initially. I griped in an earlier post about my problems with Java 9's dependency management, so instead I fell back to the lambda expressions and work queue management within Java 8. This permits multithreaded parsing of discovered files, which then allows for vertically scaling document indexing by adding cores. Horizontal scaling should be possible by replacing the in-memory work queue with a proper shared message broker. There is a "reminder" issue I've already filed to migrate to a proper broker so this can be done sometime in the future.
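
The shape of an in-memory work queue feeding a pool of workers can be sketched as follows. This is an illustration in Python rather than the Java 8 code DocIndex uses, and the function and parameter names are assumptions of mine. One caveat: Python threads will not spread CPU-heavy parsing across cores the way Java threads do, so treat this purely as a picture of the queue-and-workers pattern.

```python
import queue
import threading


def index_files(paths, parse, workers=4):
    """Parse every path using a pool of worker threads fed by one queue.

    Returns a dict mapping each path to its parse result, or to the
    exception raised while parsing it.
    """
    work = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            path = work.get()
            if path is None:  # sentinel: no more work for this thread
                work.task_done()
                return
            try:
                outcome = parse(path)
            except Exception as exc:  # keep the worker alive on bad files
                outcome = exc
            with lock:
                results[path] = outcome
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for path in paths:
        work.put(path)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    work.join()
    for thread in threads:
        thread.join()
    return results
```

Swapping the queue.Queue for a client of a shared message broker is the change that would let several machines pull from the same backlog.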

Both DocMag and DocIndex are deployed as containers on Docker Hub. This was especially necessary with DocIndex, as it relied heavily on native libraries for Tesseract OCR and OpenCV. OpenCV was the most contentious - each Linux distribution has a different version of OpenCV, and the version changes quite rapidly. Building containers for distribution allowed me to ensure users got the correct version of native libraries that worked well with their Java bindings.

Another nice feature of the containerized deployment model was composition - I was able to pair the correct revision of Elasticsearch, conditionally include Kibana, and provide a simple web application firewall by placing DocMag behind modsecurity and Apache. Network connections could be maintained between Elasticsearch, modsecurity, and DocMag without any of these interconnects leaking to the "outside" world, allowing me to do things such as only expose modsecurity to outside traffic and only permitting DocMag to receive requests through modsecurity. Elasticsearch could be hidden as well, only available on the internal network managed by Docker Compose.

Deployment can be relatively straightforward; since everything is deployed to Docker Hub as a container, one should just need to download the docker-compose.yml file and issue export DOCUMENT_HOST_DIR=/mnt/documents && docker-compose up -d. This should provision a single-node Elasticsearch instance, start DocMag behind modsecurity, and begin indexing with DocIndex.

If you are stuck digging through mountains of scanned documents, give DocMag a try. Ease of installation is one of its primary goals - so let me know if you find any issues getting it running!

Wednesday, December 20, 2017

Java Jigsaw Puzzles DevOps

Oh man that's a catchy blog title.

For the past couple o' weeks, my after-hours project has been trying out building webapps and batch jobs using the combo of Java 9, Spring Boot 2 milestone releases, Elasticsearch 6.1 and Docker Edge with Docker Compose. Just because I was in a WAF frame of mind I added modsecurity as a web application firewall in front of the app so I could learn a bit more about building WAF rules with Apache 2.

It was a fun lil' exercise, but in the end I found that all the cutting edge releases simply wouldn't play nicely with each other.

One painful exercise was trying to get Java 9 distributions to work within a Docker container just as it would within my desktop environment. Project Jigsaw is an oft-cited future feature of Java that build engineers have been asking for to end the myriad of JavaEE / Java ME / Java Desktop / Java Server distributions. It should help containerization by allowing svelte JRE installations to bootstrap within a minimal OS. However... this new way of distributing JREs with modular components creates yet another dependency management headache for builds. Once you begin writing manifest elements for Jigsaw + Java 9, every library and its mother now needs to be managed by your manifest as well. It's enough to drive you nuts.

Let's say you don't want to jump into building modular JARs yet and just build traditional JARs that don't use Jigsaw dependency management. Well... Ubuntu's OpenJRE 9 distribution doesn't automatically inject some Java 9 foundation libraries (such as javax.image), while Oracle's JDK does. If you use an Oracle JDK locally to develop, things may appear just fine, but then you need to perform some command-line overrides for things to work on an OpenJRE 9 build. To make things more hairy, it seems that OpenJDK and Oracle have built implementations that might be runtime compatible but are NOT compatible from a build & deployment standpoint. Command-line arguments are vastly different, even though manifest formats are the same. That makes building standard build & deployment scripts a pain, as well as local testing. Distributing Oracle's JRE within a container is just too fraught for me to attempt - so I stick to distribution with OpenJDK instead.

I ended up burning too much time trying to get a consistent build between my streamlined Ubuntu-powered Docker container and my local MacOS development environment, so I punted back to Java 8. While Java 9 had some nice memory management features and some syntactic sugar, what I really needed was Lambda and Stream support. Java 8 was sufficient for this in both Oracle and OpenJDK-land.

The combo of Spring Boot 2 (milestone 7) and Elasticsearch 6.1.0 was another mix that simply didn't pan out. The Java libraries for Elasticsearch 6 had a few signature changes across the API which were entirely incompatible with Spring Data Elasticsearch, and the protocol between ES 5 and 6 did not appear to be compatible. I'm sure this will get patched up in short order within the Spring project, however until then I had to fall back to Elasticsearch 5.6.4. I wanted to stick with Spring Boot conventions as closely as possible, so I did not go native just for ES 6 support.

In the end... I do have a fully containerized solution using Spring Boot 2, Java 8, Elasticsearch 5.6.4, and modsecurity. Getting WAF protection, a single-node ES cluster, a web front-end and an indexing batch process running in the background all happens with:

export DOCUMENT_HOST_DIR=/mnt/documents && docker-compose up -d

...and that's it! Containers are also available at Docker Hub and require thankfully LITTLE dependency management.

Monday, May 15, 2017

Climate Change By The Dollar

One of my lil' neuroses is ensuring that I reduce my energy usage year over year. To make sure I'm following a downward trend, I've been trending the dollar cost for energy and water bills. Assuming that cost per unit does not go down year over year (which so far has been true), this should be a reflection of overall energy use.

Note that the large hills on the graph spurred on by heating bills (both water and central air) are shrinking each year. Air conditioning during the summer months is showing small increases. Over the past three years I have also replaced all light fixtures with LED lighting - which does help drop the constant spend month over month.

It is interesting that while both winters and summers are getting warmer, heating the house expends much more energy than cooling the house, providing an overall downward trend. Water use is also beginning to spike due to the lawn irrigation system, which is why I created the Sprinkler Switch project to only water when no rain has occurred recently or is forecast to occur that day.
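
The watering rule described above boils down to a single boolean check. Here is a minimal sketch of it; the function and argument names are mine and this is not the actual Sprinkler Switch code, which would also need to decide what counts as "recently" and where the forecast comes from.

```python
def should_water(rained_recently, rain_forecast_today):
    """Water only when it has not rained recently and none is forecast today."""
    return not rained_recently and not rain_forecast_today


if __name__ == "__main__":
    # Dry spell with a clear forecast: go ahead and water.
    print(should_water(rained_recently=False, rain_forecast_today=False))
    # Rain expected later today: skip this cycle.
    print(should_water(rained_recently=False, rain_forecast_today=True))
```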

This is an indirect measure of how our climate is changing, and only represents a four-year sample size. The trends are still quite visible - and demonstrate how evolutions in home heating could significantly reduce energy consumption.

Saturday, February 04, 2017

Alarm Clock Hacking by Blocks

A little over two years ago I built an alarm clock intended for hacking by kids, using a web-based Python IDE. When I tested the lessons, I found that kids didn't like messing with Python and only learned enough to get things barely working. Yet, when it came to Scratch Jr or the desktop version of Scratch, they would spend hours at a time. I needed to find a more approachable way to code.

Recently I discovered Blockly, a product from Google for Education. With that framework you can code by blocks and use its transcoder to output JavaScript, Python, Lua, Dart or (ugh) PHP. The transcoder runs entirely client-side, and the output is human-readable - well indented and even commented.

Writing custom blocks turned out to be an easy thing, so I created blocks to modify the LED display, send audio out to a speaker, or react to button presses. Now you can use blocks to program the clock, while retaining all the functionality present in the older Python interface.

If I was going to redo the Hack Clock, this time I wanted to have a presentable site with full hardware and software lessons, for both Python and Blockly. I revamped the Hack Clock website, completed the Python lessons that I left incomplete last time, wrote new Blockly lessons for the new IDE, and completely re-did the hardware how-tos. Lesson writing took up the lion's share of time, since they all needed new images and better testing.

Another bit o' feedback I had received was that installing the Hack Clock software was too much of a pain. I tried to make this a bit easier this time by offering releases within a Debian pkg, although you still needed to use apt to install dependencies. Still, this cuts down installation from over an hour to about ten minutes... and most of those ten minutes is spent twiddling your thumbs while you wait for packages to download and install.

The hardware needed tweaking as well. It turns out the Raspberry Pi headphone jack is just a PWM pin hack and it seemed that GStreamer sometimes just couldn't grok it. The headphone jack was never a complete solution either - it required a discrete amplifier to power speakers, and soldering wires onto a 1/8" jack is a GIGANTIC pain. To make the audio hardware easier to cope with, I moved away from the headphone jack to Adafruit's I2S decoder and amplifier. It provided better audio and cleaner installation without increasing my part count or price. It has proven to be easier for everyone so far.

The old Hack Clock had another embarrassing flaw: it could only handle one button input and couldn't manage output at all. That drove me nuts and was probably the second biggest thing I wanted to fix. With the latest release the Hack Clock can handle as many buttons as you have GPIO pins, and you can also drive output pins as "switches" in code. The code-by-blocks IDE could deal with buttons and switches as simple function blocks - which meant reacting to user input became much easier to code.

Once things were ready, I installed the Hack Clock software in a mission-critical environment: kids' rooms. So far things have gone well; audio has been more reliable than with the headphone jack, and they have been able to tweak the software more easily than with Python. One bit I noticed this round however: kids don't like looking down to read something, then looking back to code it. The next generation Hack Clock should have an interactive demo to guide them through the lessons so they never have to glance away from the IDE.

I'd love to hear what other people experience when they try to get the Hack Clock running as well. A hardware list is posted on Hackaday, and all the instructions are at http://hackclock.deckerego.net/. Let me know what you think!

Thursday, December 15, 2016

Arcade Addiction

Ah, who can forget playing Pac-Man at the Pizza Hut. Or Joust waiting for a pizza at Noble Roman's. Or DigDug at Pizza King. Come to think of it... I ate a lot of pizza as a kid.

Fast forward to Christmas of 2014 - I purchased an arcade cocktail cabinet from Rec Room Masters. After it was assembled in Ikea-like fashion I mounted an old monitor, discarded 2.1 speakers and a Raspberry Pi 3 inside of the chassis. Nifty.

One oddity was that I didn't want to shell out all the cash for every single button in a panel... so I needed a cap for each remaining hole. Luckily I had access to a 3D printer, so was able to remix a hole cap on Thingiverse and print black caps to fill the gaps.

I wanted the Raspberry Pi to sit a bit out of the way, so I screwed it into the VESA mount that the monitor rested on. After sawing an Adafruit perma-protoboard in half I was able to craft some custom headers that allow ribbon cables to connect from the Raspberry Pi and join with header posts for the joystick pins and buttons. This allowed for much better cable management and room for the subwoofer & speakers underneath.

I wasn't interested in installing a coin door on one side - so I kept it wide open and instead had the cabinet door facing the center of the room. Little did I expect that cats would LOVE climbing in the open gap of the cabinet... and ripping cables off my Pi. I shoved the open side against the wall - allowing the extension cord to conveniently poke through - and now the only access is through the swinging door on the opposite side.

On the software side, joysticks and buttons are mapped through mk_arcade_joystick_rpi, an archaically named but amazingly useful module that allows GPIO pins to become joystick inputs recognized by Linux. It took some work in order to have libretro recognize these buttons; many of them had to be remapped. However, libretro quickly became my go-to MAME emulator and now supports controls on both sides of the cocktail cabinet.

I had to perform some slight modifications to the RetroPie display setup to rotate the screen 90˚, but luckily so many cocktail cabinet titles were programmed for this 4:3 aspect ratio. Titles are working flawlessly now, and I can host two-player action by flipping a few emulated dip switches.

One thing I found interesting was how MAME distributions were entirely dependent on the exact name of your zip file. In addition, each ZIP was a true manifestation of the on-board arcade ROMs - in that sometimes a US distribution or second edition game actually piggybacked on top of a previous ROM. In this same way, two ZIPs were sometimes required to run a single title: one for the older ROM, one for the later version. I ended up combining the two ZIP archives into a single one - in this way older ROM images were still injected as a dependency, while the ZIP name was that of the older title and was still executed correctly.
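
Merging two archives like this can be done by hand with any ZIP tool, but a small script makes it repeatable. The sketch below uses Python's standard zipfile module; the function name and the parent/child terminology are my own, and the output file still has to be given whatever name the emulator expects for the title.

```python
import zipfile


def merge_rom_zips(parent_zip, child_zip, output_zip):
    """Copy every entry from both archives into one new ZIP.

    Entries from child_zip win when both archives contain the same file name.
    """
    with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as merged:
        seen = set()
        # Read the child first so its files take priority over the parent's.
        for source in (child_zip, parent_zip):
            with zipfile.ZipFile(source) as archive:
                for info in archive.infolist():
                    if info.filename in seen:
                        continue
                    seen.add(info.filename)
                    merged.writestr(info, archive.read(info.filename))
    return output_zip
```

Writing to a new output file rather than appending in place keeps both original archives untouched if something goes wrong.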

Pizza night at the household now takes on an entirely new meaning. A few slices and a frosty beverage help me appreciate Ms. Pac-Man in a whole new light.

Wednesday, October 07, 2015

Raspberry Pi Finally Conquers Userland

Raspberry Pi developers have had quite a coup on their hands these past few weeks. The "official" Raspberry Pi Linux distribution Raspbian was just upgraded to Debian 8, or "Jessie." This provides a huge number of wins - the 4.1 release of the Linux kernel, latest glibc and build chain updates, more native packages (like Node.JS and wiringPi), and device trees. Oh, sweet device trees.

While the current Raspbian distribution still relies on wiringPi 2.24, the most recent 2.29 version has a much nicer way of addressing GPIO in userspace by exposing the GPIO ports in /dev/gpiomem. All too often Raspberry Pi developers run GPIO apps as root to access the array of general purpose I/O pins; however, this leads to all the lovely security holes and vulnerabilities that privileged access brings. You never want Apache or Python or any user-created apps running as root - so instead you must find a way to export these ports and allow unprivileged users to access them. Traditionally this has been done using wiringPi's export utility; however, the latest gpiomem exposure seems to be much cleaner.

With Jessie I've been able to significantly cut the complexity of installing Garage Security and Sprinkler Switch. I don't need to manually install wiringPi, Node.JS, Video4Linux and a number of other packages. Things seem to largely "just work" as one might expect of a modern distro. One example is that Motion has been updated and appears to be pre-packaged on Raspbian, and the necessary Video4Linux bcm2835-v4l2 kernel module properly creates a /dev/video0 device. CPU utilization appears to be much lower with the current stack, and it appears that I can just tweak Motion's configs to save videos in an HTML 5-friendly way rather than transcoding them with a script.

Garage Security and Sprinkler Switch are being updated now for Jessie and testing is underway... the new Jessie builds are looking very promising so far.

Friday, September 25, 2015

XMPP (Jabber) as a Message Broker

For a long while I’ve relied on Jabber/XMPP support within Google Talk to communicate with back-end systems like my Garage Security monitor. Garage Security could push notifications to me when motion was detected, and I could reply back to ask for camera snapshots or current temperatures. It’s almost as if I was using XMPP and Google’s Talk servers as a message bus; everything was a request/response pair that I could receive as notifications in a nice lil’ mobile interface. This was a superior approach to having a peer-to-peer communication channel over the Internet at large - I could keep my firewall completely closed and instead publish events to a trusted broker over at Google HQ. I essentially treated Google Talk like a hosted RabbitMQ instance.

This “XMPP broker” approach continued to work after Google moved from Google Talk to Hangouts and dropped full XMPP support (notably for federation), however things appear to have become a bit more difficult when two different systems (like Garage Security and my new Sprinkler Switch) want to share the same Hangouts user ID. Previously both systems would receive an inbound message, so I would filter by a token in the message body. If I asked for “garage status,” Garage Security would catch the “garage” keyword and respond while Sprinkler Switch would just ignore it. As Hangouts has turned the XMPP support decidedly more text-message-ish, it seems now the last system to authenticate will starve out the previous system, and only one system will actually receive the messages.
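
The keyword filtering that used to work can be pictured with a few lines of Python. This is only a sketch of the idea, not the code running in Garage Security or Sprinkler Switch; the helper name and the canned replies are invented for the example.

```python
def make_handler(keyword, respond):
    """Return a handler that only answers messages containing keyword."""
    def handle(body):
        if keyword in body.lower().split():
            return respond(body)
        return None  # not addressed to this system, so stay quiet
    return handle


# Two systems sharing one account, each listening for its own keyword.
garage = make_handler("garage", lambda body: "Garage door is closed")
sprinkler = make_handler("sprinkler", lambda body: "Sprinkler is off")

if __name__ == "__main__":
    message = "garage status"
    # Every system sees the message; only the matching one replies.
    print([reply for reply in (garage(message), sprinkler(message)) if reply])
```

This only works while every connected system actually receives each inbound message, which is exactly what stopped being true.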

This is not outside of the XMPP spec it seems, and the protocol itself specifies two ways for the systems themselves to deal with the issue:

  1. When connecting to the XMPP server set the priority for your connection. A higher priority is more likely to get inbound messages.
  2. Specify a resource name within your XMPP user identifier. This allows a system to be uniquely addressable with the same username.

The first option doesn’t necessarily help my situation - I want both systems to receive inbound messages. The second option is possible using XMPP’s definition of user IDs… where a user identifier is actually the composite of:

  • The username that is used for authentication
  • The domain that the user resides within
  • The resource that uniquely identifies who is signing in

Using this schema, I could provide SleekXMPP a JabberID (its native user identifier) of chuckleface@gmail.com/garage and have it uniquely identify Garage Security, while chuckleface@gmail.com/sprinkler uniquely identifies Sprinkler Switch. It’s not entirely unlike the routing key in AMQP or a topic name in JMS… chuckleface could be considered your message type, gmail.com could be considered your exchange, and garage could be considered your ID. Or something like that. It makes sense in my head at least.
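
For the curious, splitting one of these identifiers into its three parts takes only a few lines. The sketch below is a simplification with names of my own choosing; it ignores the finer validation rules in the XMPP addressing spec and is not how SleekXMPP parses JIDs internally.

```python
from collections import namedtuple

JabberID = namedtuple("JabberID", ["username", "domain", "resource"])


def parse_jid(jid):
    """Split 'user@domain/resource' into its three parts.

    The resource is optional; it is None for a bare JID.
    """
    bare, _, resource = jid.partition("/")
    username, _, domain = bare.partition("@")
    return JabberID(username, domain, resource or None)


if __name__ == "__main__":
    print(parse_jid("chuckleface@gmail.com/garage"))
    print(parse_jid("chuckleface@gmail.com/sprinkler"))
```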

Hangouts, however, just cares about chat messages. It could give two craps about my resource name. There’s no way to specify that in a contact either… with Hangouts you specify an e-mail address which in turn becomes a username and a domain. That’s fine for chat… but when I want to address an individual system I’m kinda outta luck. Hangouts will just reply back to the last resource that sent it a message - no way to specify a specific resource.

I've posted a demo using Python on GitHub which lets you build a quick XMPP client. An example might be:

>>> from jabtest import Jabber
>>> jab1 = Jabber('test@gmail.com', 'apikeyh4x0r5', 'testone')
Opened XMPP Connection
>>> jab2 = Jabber('test@gmail.com', 'apikeyh4x0r5', 'testone')
Opened XMPP Connection
>>> jab1.send_msg('deckerego@gmail.com', 'Testing One')
Sending message: Testing One

I don't have a fantastic solution for now... so in the interim I've disabled Jabber support for Sprinkler Switch.