Sunday, June 22, 2014

A Drift Into Failure

I'm still working towards catching up on my Christmas reading. I already wrote my missives on Thinking in Systems and A Pattern Language; next up is the DevOps favorite Drift into Failure.

The basic premise of Drift is that failures, even massive ones, don't (usually) happen because of a vast conspiracy or from the deeds of evil people. Massive failures occur from behavior that is considered completely normal, even accepted, as part of a daily routine. These routines give our perspectives tunnel vision and often don't allow us to see the underlying issue. Production goals, scarce resources and pressure on performance cause drift in these routines that slowly erodes safe practices.

Fatal aircraft crashes and space shuttle disasters are often quoted in the book; however, every operations or software engineer in IT has seen this play out a gazillion times before. The site goes down on a regular basis... and no one knows quite why. After digging and pushing new code and re-pushing bug fixes for many sleepless nights, one often finds out that the outage was due to a routine maintenance task gone awry. Maybe a query optimization cache was manually flushed within the production RDBMS, causing the entire cluster to freak out and create a bad query plan. It seemed perfectly sane at the time, and even if every single person had known the day before that it was going to happen, the risk likely wouldn't have been caught.

Drift points out how remediation and "root cause" reporting are often fruitless. The concept of high-reliability organizations was pushed in the 1980s as an entire school of thought, focused on errors and incidents as the basic units of failure. Dekker demonstrates that "the past is no good basis for sustaining a belief in future safety," and such a focus on root-cause analysis often does not prevent future incidents. The traditional "Swiss Cheese Model" for determining cause attempts to see where all of the holes within established safety procedures line up, creating a long gap through which problems can pass. This type of reductionist thinking, where atomic failures create linear consequences, has turned out not to be predictive after all - instead we need to look at things through the lens of probability.

One of the best practices that anyone, including those supporting enterprise software, can encourage to avoid failure is to be skeptical during the quiet times and always invite in a wide range of viewpoints and opinions. Overconfidence can be your downfall, and dissent is always a healthy way to get new perspective. Dekker quotes Managing the Unexpected to point out that "success... breeds overconfidence in the adequacy of current practices and reduces the acceptance of opposing points of view." Those who were not technically qualified to make decisions often were the ones who made them, or outside pressures (even subtle ones) caused trade-offs in accepted practices. Redundancies that were supposed to make things highly available often make systems more complicated and, in turn, actually make them more likely to fail.

The best way to avoid a drift into failure is to invite outside opinion, even bring in disparate practice groups. Take minority opinions seriously. Don't distill everything to a slideshow. Be wary of adding redundancies and failsafes - often the simplest solution will be the most reliable. The recent renewed interest in microservices is a great example of this - by simplifying the pieces of a complex system, we can allow each component to work in isolation and ignore the remainder of the system. This allows the system to grow, adapt and evolve without the support systems usually provided for monolithic software stacks.

Drifting into failure occurs when an organization can't adapt to an increasingly complex environment. Never settle, always embrace diversity and keep exploring new ways to evolve. A great quote from Dekker is "if you stop exploring, however, and just keep exploiting [by only taking advantage of what you already know], you might miss out on something much better, and your system may become brittle, or less adaptive."

Sunday, March 23, 2014

An Expensive Failure of Judgement

So remember when I precariously perched a moderately encased rangefinder above my sump pump well? It was kinda wedged in between the cover and the well wall, and I thought there wasn't enough play in the line leading to the rangefinder to let it drop in. Well... all my hackery finally caught up to me and a very expensive sensor ended up taking a swim. Current kept running through it the entire time, so for several hours it swam in well water, slowly accreting minerals. No amount of drying out would save it.

I wasn't going to replace it with another expensive sensor... so I went the completely opposite direction and built an unbelievably primitive water detector. Here two plates of aluminum foil were hot-glued to construction paper and the bare end of my infamous telephone wire, then isolated in electrical tape. If water bridged the two aluminum plates, a connection would be made - at least enough of a connection to be considered a "high" signal.

The other ends of the two wires were connected to the NPN transistor that was originally intended to work as a UART logic inverter. Now it was a simple logic gate; once the water closed the circuit, the NPN shut off the current headed to a GPIO pin. If the pin was live, no water was detected. If the pin was dead, you had a problem.
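The inverted logic described above can be sketched in a few lines of Python. This is only an illustration, not the code running on the Pi: the function names are hypothetical, and requiring several consecutive low reads before raising an alarm is an added assumption to ride out a momentary glitch on the line.

```python
from typing import Callable


def water_detected(pin_is_high: bool) -> bool:
    """Inverted logic: a live (high) pin means dry, a dead (low) pin means water."""
    return not pin_is_high


def stable_water_reading(read_pin: Callable[[], bool], samples: int = 5) -> bool:
    """Report water only if every one of `samples` consecutive reads is low.

    `read_pin` is any zero-argument callable returning True when the pin is
    high. The repeated sampling is an assumption, not part of the original rig.
    """
    return all(water_detected(read_pin()) for _ in range(samples))
```

On the Pi itself, `read_pin` could be a small wrapper around `RPi.GPIO.input()` for whichever pin the transistor feeds; the pin number is left out here because the post does not give it.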

The web front-end that I created for this whole rigamarole was updated to reflect this hack, and now just reports the binary status of the water detector. I'm not thrilled with the setup, but I also wasn't too keen on the idea of plopping any more money down on a solution.

So... lesson learned. Don't dangle water sensitive components over a well of water.

I do have a need for another security camera, so this whole setup may just be ditched in favor of another Motion rig. I really dig the I2C temperature and humidity breakout board however, and I'd like to keep using it. Maybe I'll save up my allowance and get a CC3000 WiFi board and pair it up with the temp/humidity board... that would be a pretty nifty & tiny package.

Tuesday, March 11, 2014

It's a Basement, not a Swimming Pool

Second up on my paranoia list is my basement slowly filling with water. My paranoia is founded in a rich history of failed sump pumps, broken water mains and power outages. I can mitigate some of my worries by installing a backup, non-electric Venturi aspirator and a die-cast primary sump pump - however anything mechanical can break. I believe in nothing anymore.

A Raspberry Pi can help satiate most of my neurosis, including this one. Using a Honeywell HumidIcon Digital Humidity/Temperature Sensor and a Maxbotix Ultrasonic Range Finder I can monitor basement humidity, temperature and sump well levels.

My first component to integrate was the range finder. The Maxbotix LV-EZ4 can report range in one of two ways - either as an ASCII representation of the range over RS232 serial communication or as an analog voltage. I dorked around with both. First I tried feeding the analog signal through an Adafruit Trinket and having the values translated into an I2C signal. However - I had a 5v Trinket - and even after constructing voltage dividers I couldn't quite coordinate the right voltages to negotiate with the Pi. I punted and used the serial port from the LV-EZ4; however, its RS232-style signal is inverted relative to what the Pi's UART expects, so I had to create a logic inverter using a recycled NPN transistor. Once I inverted the signals from the range finder, the Pi was able to read the inbound ASCII representation of the range.
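Reading the ASCII range off the serial line might look something like the sketch below. It assumes the frame format I understand the Maxbotix datasheet to describe: a capital "R", three ASCII digits giving the range in inches, then a carriage return. The function name is hypothetical, and the parser works on a raw byte buffer so that it can be tried without any hardware attached.

```python
import re
from typing import Optional

# Assumed frame: b"R", three ASCII digits (inches), carriage return.
_FRAME = re.compile(rb"R(\d{3})\r")


def parse_range_inches(buffer: bytes) -> Optional[int]:
    """Return the most recent complete range reading in `buffer`, or None.

    Partial frames at either end of the buffer are ignored, since a serial
    read can start or stop in the middle of a frame.
    """
    matches = _FRAME.findall(buffer)
    return int(matches[-1]) if matches else None
```

With pyserial, the buffer would come from something like `serial.Serial(port, 9600, timeout=1).read(16)`; the port name and baud rate are assumptions to check against your own setup.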

After I had the range finder working, I used Sparkfun's Honeywell breakout board via I2C to communicate temperature and humidity to the Pi. Both the range finder and the breakout board fit nicely on a mini breadboard, sharing voltage and ground while splitting out I2C data, clock and RS232 data feeds. Once permissions were correctly set and kernel modules loaded, things appeared to be working nicely.
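For reference, here is a hedged sketch of decoding a four-byte reading from the humidity/temperature sensor. It assumes the breakout carries an HIH6130-class HumidIcon part (the post does not name the exact part) and the byte layout I understand its datasheet to describe: two status bits, a 14-bit humidity count, then a 14-bit temperature count. `Climate` and `decode_humidicon` are hypothetical names.

```python
from typing import NamedTuple

FULL_SCALE = 2**14 - 2  # assumed full-scale count for both 14-bit readings


class Climate(NamedTuple):
    status: int            # 0 = fresh reading, 1 = stale (already fetched) data
    humidity_pct: float    # relative humidity, percent
    temperature_c: float   # degrees Celsius


def decode_humidicon(frame: bytes) -> Climate:
    """Decode the 4 bytes returned by an I2C read of the sensor."""
    if len(frame) != 4:
        raise ValueError("expected exactly 4 bytes")
    status = frame[0] >> 6                          # top two bits of byte 0
    raw_rh = ((frame[0] & 0x3F) << 8) | frame[1]    # 14-bit humidity count
    raw_t = (frame[2] << 6) | (frame[3] >> 2)       # 14-bit temperature count
    humidity = raw_rh / FULL_SCALE * 100.0
    temperature = raw_t / FULL_SCALE * 165.0 - 40.0
    return Climate(status, humidity, temperature)
```

Checking `status` before trusting a reading is a cheap way to notice when the sensor is handing back the same stored values rather than a new measurement.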

I wanted to save the range finder from water splashes, or at least slow its eventual decay. I re-used the case from the SD card I purchased for the Raspberry Pi, cutting out holes for the extrusions in the range finder board. Corners were then covered in electrical tape, and the seams were covered in hot glue. No, it's not pretty. No, it may not add to the LV-EZ4's lifespan. It was at least worth a shot however, and I've added a bit of crush/drop protection.

Everything is hooked into a Raspberry Pi Model A, just to save a few bucks. For an enclosure I ripped apart an old Netgear wireless access point, which easily housed the mini breadboard and the Pi. I decided to try things out but stumbled upon an unsettling fact... there are no power outlets near the sump pump well. Undeterred, I went looking for any long length of wire and found twenty feet of RJ11 telephone cable. It had four total wires - which would be more than enough to carry voltage, ground and signal wires. I sloppily spliced the wire, soldered it onto three jumpers, attached one side to the breadboard and another to the range finder. To my surprise - it actually worked. I was able to string the range finder all the way across the room, which also made ambient humidity readings more accurate.

In much the same way as I created the Bottle application for the garage door security monitor, I created a Bottle app to host REST APIs and display the well depth, temperature and humidity as well as allow Jabber (e.g. Google Talk) clients to request the status of the well and the climate. It all is working well so far, however I still need to tweak the Honeywell I2C code to make sure the component re-samples conditions at every request. Right now it is just fetching the currently stored values.
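The status payload such an API returns could be assembled along these lines. This is a sketch using only the standard library, not the actual Bottle app: the field names are made up, and `SENSOR_TO_FLOOR_IN` is a hypothetical mounting height used to turn the range finder's distance-to-water into a water level.

```python
import json

# Hypothetical: distance in inches from the sensor face to the well floor.
SENSOR_TO_FLOOR_IN = 30


def build_status(range_in, humidity_pct, temperature_c,
                 sensor_to_floor_in=SENSOR_TO_FLOOR_IN):
    """Combine the raw sensor readings into one status dictionary."""
    # The range finder reports distance to the water surface, so the water
    # level is whatever is left between that surface and the well floor.
    water_level_in = max(sensor_to_floor_in - range_in, 0)
    return {
        "range_in": range_in,
        "water_level_in": water_level_in,
        "humidity_pct": round(humidity_pct, 1),
        "temperature_c": round(temperature_c, 1),
    }


def status_json(range_in, humidity_pct, temperature_c):
    """Serialize the status for a REST response or a chat reply."""
    return json.dumps(build_status(range_in, humidity_pct, temperature_c),
                      sort_keys=True)
```

In a Bottle route, returning the dictionary from `build_status` directly should be enough, since Bottle serializes dictionaries to JSON on its own.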

Right now the range finder is resting atop the sump pump well and is just waiting for the upcoming rains. My eventual goal is to create a home dashboard that aggregates all sensor data from around the house: sump pump well depth, basement temperature and humidity from the Basement Monitor APIs, ground-level temperature and humidity from a Nest thermostat, garage door state and camera feeds from the Garage Security APIs and maybe even power data from an attached APC UPS. The Bottle apps would then work to expose sensor data as REST APIs, and a more powerful Play application would serve the user interface, archive historical data, provide alerts and indicate trends.

Saturday, March 01, 2014

A Systems Language

A Pattern Language is an interesting book to pick up, and that's not just a joke about the size of the volume. Its web site betrays how old the book actually is; it was published in 1977 based on research that had been ongoing for several years. Its scope is pretty large and covers everything from the layout of an office building to the composition of an entire town. Much attention is focused on how to build communities within these spaces, and a lot of research provides evidence on optimal ways of building and tearing down boundaries.

Of particular interest to me were chapters concerning self-governing workshops and offices. The book stresses that no one enjoys their work if they are a cog in a machine. Instead, "work is a form of living, with its own intrinsic rewards; any way of organizing work which is at odds with this idea, which treats work instrumentally, as a means only to other ends, is inhuman." This is a fairly strongly worded assertion that means that employees must feel empowered in order to construct meaningful product.

Just as Thinking in Systems postulated that groups should autonomously self-organize in order to realize their greatest efficiency, A Pattern Language encourages the formation of self-governing workshops and offices of 5 to 20 workers. A chapter is dedicated to the federation of these workgroups to produce complex artifacts - such as several independent workshops working in concert to build much larger things.

A Pattern Language also encourages keeping service departments small (fewer than 12 members) and ensuring that they can work without having to fight red tape. This applies to many shared services departments in both government and public sector organizations; departments and public services don't work if they are too large, as the human qualities vanish. One must fight the urge to make an "idiot-proof system," since this can cause the system to devolve to the point that only idiots will run it.

The book is largely about physical space of course, so it has many recommendations on how offices should be connected. The authors specifically studied what isolated groups within a company, and found that even what we might consider small physical distances amounted to big interruptions in communication. If two parts of an office are too far apart, people will not move between them as often as they need to. If they are a floor apart, they sometimes will not speak at all.

Ultimately A Pattern Language has a lot of common sense to offer up about how to build a work community, backed by a fair amount of research that bucked many trends in the 1970s. It had points that should not be glossed over even now, including:
  • You spend 8 hours at work - there is no reason it should be any less of a community
  • Workplaces must not be too scattered, nor too agglomerated, but clustered in groups of 15
  • Workplaces should be decentralized, not reliant on a central hub
  • Mix manual jobs, desk jobs, craft jobs, selling, etc. within a community
  • There should always be a common piece of land (or a courtyard) within the work community which ties offices together
  • The work community is interlaced with the larger community they operate within

    Workspace efficiency and community engagement are definitely not new practices, however we always tend to think they are. If we can remember the lessons learned thirty-seven years ago, we may be in a better place to make a better workplace today.
    Wednesday, February 12, 2014

    Thinking in Patterns

    Cognitive Hazard by Arenamontanus
    I've finally started to look at some recommended reading that has been on my wish list for going on two years now. Two of the books, Thinking in Systems and A Pattern Language, have particularly resonated with me since they spoke directly to the practice of software engineering without mentioning it once.

    Donella Meadows has left behind quite a legacy, and has great observations on how people work within overarching systems. Systems are everywhere and are often composed of yet other systems - just as it is with how people manage their workload every day. In particular, Donella notes the traps that systems can fall into, which cause things to go completely awry. Let's see if we can identify any of these traps within the context of enterprise software development:
    • Policy Resistance (think of "The War on Drugs," where two sides are trying to leverage the same system)
    • Tragedy of the Commons (exhausting a shared resource)
    • Drift to Low Performance (goals are eroded because negative feedback has more resonance than positive feedback)
    • Escalation (one side is attempting to out-produce the other, without a balance in between the two sides)
    • Competitive Exclusion (success to the successful)
    • Shifting the Burden to the Intervenor (an addiction has removed a system's ability to shoulder its own burdens)
    • Rule Beating (finding loopholes)
    • Seeking the Wrong Goal

    Any of those sound familiar in your current software engineering practice? Whether this is exhibited between the business and the engineers, between PMs and engineers, or between engineers - these are universal pratfalls.

    There are ways to influence systems and avoid the traps we often fall into. These leverage points within the system can allow you to alter behavior and encourage positive results. A tricky point remains that some of the leverage points that are easiest to alter have the smallest impact, and some of the largest impact leverage points are very difficult to manipulate. If we look at an Agile software scrum, you might identify these leverage points, from least to most impactful, as:
    1. Numbers, Constants and Parameters. It often feels like you're changing things because you have the most control over these knobs and dials... but all too often reactions are delayed and are cushioned by buffers within the system. Sure, you can change your sprint velocity or begin estimating bugs, but those are just different views on the same result.
    2. Buffers, or the sizes of stabilizing stocks that act as reservoirs of results. A buffer may delay or even out the consequence of a change within the system. Changing buffers would be like changing from a two to a four week development sprint in Agile - you may give yourself more time to recover, but more than likely you're just delaying an inevitable fail.
    3. Failing that, you might try to alter the real, physical parts of the system and how they interact. This can happen, but they are often difficult to change and result in a game of whack-a-mole. This is more fondly called "re-arranging the chairs on the Titanic," and often is exhibited by swapping out team members but keeping the system the same.
    4. The next leverage point might be to try and change how quickly you respond to changes by reducing delays, which in turn alters how quickly the system changes. However, Donella does demonstrate that shorter reaction time can very easily result in greater volatility, and things can become so volatile that they crash. This is what Agile is meant to guard against by locking down a sprint and ensuring priorities aren't changed on a day-by-day basis.
    5. In order to get a grasp on things one may also overlook the balancing feedback loops - or safety measures - that safeguard the system in times of emergency. The excuse is generally that "the worst is unlikely to happen," however this drastically reduces one's survival range. Adaptability is important, and if you take away the ability to adapt you can crash even harder.
    6. Monitoring for reinforcing feedback loops is something that becomes crucially important. This task requires one to watch for runaway chain reactions, which can cause a meltdown if not kept in check. Here bad decisions and bad reactions beget even more bad decisions and bad reactions, causing a runaway system. Look for balance instead of infinite feedback loops; if you can keep pushing your tasks to the next sprint, you're only encouraging a runaway backlog of tasks.
    7. Information flows can save a system. If information is in your face and always available, it influences even small decisions. Look at the Nest thermostat or smoke detector - here are devices whose primary purpose is to give you a nonstop flow of info wherever you are. The more info you have (such as how many hours heat was pumped into your house), the more you make small alterations to find balance. This is another part of the Agile process in the form of burndowns/burnups/velocity graphs. This info is meant to be viewed and reflected upon often.
    8. Rules (incentives, punishments, constraints) often have to be put in place to enforce all the above points. In order to kill feedback loops, ensure emergency systems are maintained and make sure information is shared, some rules of the game have to be established.
    9. Self-organization, which is an odd juxtaposition of the above rule about rules, is something that Donella prizes most about not only the human condition but systems in general. Usually if you let the component pieces of a system find their role, they will find a way to work with other components in harmony. This is the proof against micro-management; the more you manage, the more you can threaten a system's success. Let developers go free within the confines of the sprint, and don't hover over them (aside from a daily standup).
    10. Find the right goals to change a system. If you focus on GDP, you will focus on gross domestic product at the exclusion of other things. Picking the right goal is tougher than it sounds - you need to know what you want first. However if you can clearly identify and communicate a measurable goal, you can have a huge amount of control over the system. Define what the business actually wants to see - and involve them in the decision making process.
    11. Change your mindset. This is effectively what EVERY project management methodology attempts to do - make you think about the same problem in a different way. If it gives you a renewed perspective, this can be helpful. However...
    12. ...ultimately you should transcend paradigms and realize no paradigms are true. This is what supposed "anti-patterns" are meant to exhibit, and it can be helpful to realize that Agile, just like Waterfall, will ultimately come and go. Just ship early, and ship often.

    Just as we have "Gang of Four" or "Enterprise Integration" patterns, the above are system patterns that can help us decompose and deal with a system. Look for the common traps that always happen - and then evaluate your leverage points to counteract them.

    Monday, December 02, 2013

    Massively Parallel Compute as a Service

    Back in the Spring of 2012, I asked several panelists at VMWorld to weigh in on vector processing with GPUs as a big data/big compute solution. The response was a resounding "not yet," as the infrastructure had not yet reached commodity level and GPU processing was greatly constrained by memory paging. It now seems like both obstacles are being removed.

    Amazon Web Services is now offering EC2 "G2" instances backed by virtualized NVIDIA Kepler GPUs. These support H.264 encoding, OpenCL, CUDA and OpenGL, which makes more mature toolsets available for building apps targeted to these vector processing instances. This kind of support brings commodity toolchains and commodity infrastructure to massively parallel processing on demand.

    Memory paging should soon be addressed by NVIDIA via CUDA 6, and should also be addressed by AMD with its upcoming Kaveri architecture. Once memory addressing is unified, the swapping of memory regions should become unnecessary, allowing memory to be addressed locally without paging. This simplifies application development, virtualization and hardware architectures considerably.

    I believe that very soon we will see vector processing at scale garner as much attention as map/reduce clusters currently do. Massive data parsing has been commoditized, and now we have an opportunity to commoditize massive algorithmic crunching.

    Sunday, November 03, 2013

    Retrospective: The Raspberry Pi Garage Door Remote + Security System

    My rinky-dinky garage security system is now online and in operational use. I still have more tweaks to do - for example, I got rid of the metal backplane within the My Book casing that now serves as my board enclosure because it shielded my WiFi signal, killing the network connection. I'm sure I will continue to tweak the Motion configs to increase framerates and decrease sensitivity. Now that e-mail notifications are working, hopefully I can limit the spurious notifications and just notify on the bigger changes of motion over two seconds in length.

    Another measure of success is cost; if I could have purchased a ready-made setup for a marginal increase in cost, it may be better to go with a commercial platform. If the build is overkill and I could have built it with cheaper components, I should scrap this and re-build. Looking at commercial options I couldn't find anything that had both the garage door functionality and the security camera... just one or the other. Chamberlain does sell the MyQ Garage, a pretty nifty home automation product that contains a universal garage door opener and a tilt sensor that is WiFi-enabled and can be paired with a smartphone app. They also sell the MyQ Internet Connectivity Kit, which is more of an Internet-enabled garage door master controller. Neither has a security camera paired with it, but you could easily install a wireless camera separately for around $40. The MyQ solutions are $140 and $120 respectively, giving you a total build cost of $160-$180. Not bad, really.

    If you bought every part new, the build list for my lil' setup is:
    Raspberry Pi B                   $40
    USB Micro-B cable                 $2
    USB AC Adapter                    $5
    8GB Class10 SD Card               $8
    802.11n USB dongle                $9
    Parts for MOSFET switch           $5
    Universal garage door opener     $25
    HP HD-3100 webcam                $14
    Enclosure made of random stuff    $0
    Total                           $108

    I had most of these parts on-hand, so my actual cost was closer to $70. That means a savings of $90 over a commercial solution. I don't know of a cheaper solution than the Raspberry Pi that could handle a 1280x720 webcam feed and perform motion detection, and a $14 webcam is cheaper than Raspberry Pi's own camera expansion card.
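The arithmetic behind those figures is easy to double-check. The short sketch below uses only the prices quoted in this post; the variable names are just for illustration.

```python
# Prices in US dollars, as quoted in the post.
parts = {
    "Raspberry Pi B": 40,
    "USB Micro-B cable": 2,
    "USB AC Adapter": 5,
    "8GB Class10 SD Card": 8,
    "802.11n USB dongle": 9,
    "Parts for MOSFET switch": 5,
    "Universal garage door opener": 25,
    "HP HD-3100 webcam": 14,
    "Enclosure made of random stuff": 0,
}
new_build_cost = sum(parts.values())

actual_cost = 70       # rough out-of-pocket cost with parts already on hand
wireless_camera = 40   # separate camera needed alongside either MyQ product
commercial = {
    "MyQ Garage": 140 + wireless_camera,
    "MyQ Internet Connectivity Kit": 120 + wireless_camera,
}

# Savings of the home-built rig versus each commercial option.
savings = {name: price - actual_cost for name, price in commercial.items()}

print(f"All-new build: ${new_build_cost}")
for name, saved in savings.items():
    print(f"Savings vs {name} + camera: ${saved}")
```

The $90 figure is the savings against the cheaper of the two commercial options; against the MyQ Garage plus a camera the gap is wider.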

    Of course, your time isn't free. The hours spent in construction count - so I tried to estimate how long each step took me:
    Tearing down & wiring up garage remote  1 hour
    Setting up webcam and Motion            2 hours
    Configuring OS & system administration  4 hours
    Building web interface                  3 hours
    Building enclosure                      2 hours

    All told maybe 12 hours of work, a quarter of which was me figuring out how to render an MJPEG stream on an HTML5 canvas. The web interface can be re-used, as can the system administration steps, so I could probably do another in four hours or so. Four hours and $70 isn't too bad for peace of mind.

    Speaking of ease of mind, I'll leave this thread with an ad for Chamberlain's MyQ Garage. I thought I was bad... but these actors have turned garage door anxiety into an existential crisis.