I’ve been catching up on technology news this week, after a few months of completely ignoring it. One of my favorites so far has been Netflix’s brief post on 5 Lessons We’ve Learned Using AWS. In it, they discuss the importance of testing your system’s ability to survive failure in an infrastructure that is inherently chaotic and prone to failure (a fundamental characteristic of massively scalable clouds).
One way Netflix has tested their ability to survive failure is to constantly force failure within their system. They built a system component named ChaosMonkey that randomly kills part of their architecture to ensure that the rest of it can survive and continue offering service to customers. I love this approach. Not only does it show a lot of maturity within the Netflix engineering team about what systems failure really means, but it also shows that they understand the importance of constantly ensuring that failures are expected and recoverable; I still see too many enterprises today that have yet to learn that important lesson.
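Netflix doesn’t share ChaosMonkey’s internals in that post, but the core idea is simple enough to sketch. Here is a minimal, hypothetical version in TypeScript — the instance list and the `terminate` callback are my own assumptions, not Netflix’s actual API:

```typescript
// Hypothetical chaos-monkey sketch: pick one instance at random and kill it.
// `instances` and `terminate` are assumed to be supplied by the surrounding
// infrastructure; this is not Netflix's actual implementation.

function pickVictim(
  instances: string[],
  rand: () => number = Math.random,
): string | null {
  if (instances.length === 0) return null;
  return instances[Math.floor(rand() * instances.length)];
}

function unleashChaos(
  instances: string[],
  terminate: (id: string) => void,
): void {
  const victim = pickVictim(instances);
  // If the architecture is sound, the remaining instances absorb the loss
  // and customers never notice.
  if (victim !== null) terminate(victim);
}
```

Run on a schedule, a loop like this turns “failures are expected” from a slogan into a test that happens every day.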
ChaosMonkey reminds me greatly of what I used to do, albeit manually, when I was in charge of systems architecture at a company in the early part of my career. We had a wonderfully horizontally scalable system architecture, back in an era when most enterprises would only use mainframes for systems of such size and importance. It was a critical part of our ability to provide high levels of service and performance at a relatively low cost. We did a good job of giving our operations team plenty of opportunity to test system failure and ensure that we could survive catastrophic failure, but I found it was never quite enough as the system grew more complex.
About the time we began implementing RAID drive subsystems for our database nodes, I began a systematic approach to destroying parts of our architecture on a regular basis. At first, it was a way to make sure the RAID subsystems would act appropriately in a degraded state – and they did, but we had plenty of software modifications to make to allow our systems to perform better in such a state. My experiments quickly became a little more aggressive… a few times a week, I would randomly pull drives out of their carriers, or kill processes on production nodes, or even cause an entire system crash.
Remarkably, it was rare for any of this activity to even be noticed. Our architecture was so good at surviving single points of failure that the operations team would often not notice that anything had happened at all. When they did, it was usually a sign of a weak point in our architecture that needed to be fixed. Occasionally, hardware actually failed from my little experiments, and I always felt bad about that, but the reality is that the same hardware would have failed at a much more inconvenient time.
Alas, once our company was purchased by a publicly traded company, development was no longer allowed access to the data center. For the most part, this was a very positive thing, but it also prevented people like myself from twiddling with the system to make sure it would work well. Our operations staff was good, however, and many of the lessons development had learned over the years were picked up by them. Today’s trend of dev-ops is a welcome return to some of that way of thinking, as it really does show that you need developer mindsets when running your operations.
I can distill my experiences doing dev-ops work in the past into some key points:
- Build your systems with the assumption that failure happens, and happens regularly.
- You must test your ability to fail and recover, and do it constantly.
- Hardware & software solutions for vertical scaling are inherently inflexible and likely much more expensive in the long run. This was true in the 1980s, and even more true today now that more shops understand the nature of horizontal scalability.
- Beware of the mindset that prizes high uptime for individual systems. It’s cool for your server at home, but in an operations scenario, a system with a long uptime is one that might not start up properly if it does fail. The inability of a system to start up properly was, and remains, a hugely common problem. Reboot all of your systems regularly.
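That last point can be sketched as a rolling restart: cycle every node, one at a time, and treat any node that fails to come back as a weak point found cheaply, on your own schedule. A minimal sketch, assuming hypothetical `restart` and `isHealthy` callbacks supplied by the caller:

```typescript
// Rolling-restart sketch: reboot nodes one at a time, recording any node
// that does not come back healthy. `restart` and `isHealthy` are assumed
// callbacks, not a real operations API.
function rollingRestart(
  nodes: string[],
  restart: (node: string) => void,
  isHealthy: (node: string) => boolean,
): string[] {
  const failedToReturn: string[] = [];
  for (const node of nodes) {
    restart(node);
    if (!isHealthy(node)) {
      // A node that can't start back up is exactly the problem you want
      // to find during a planned reboot, not during a real failure.
      failedToReturn.push(node);
    }
  }
  return failedToReturn;
}
```

Restarting one node at a time means the rest of the fleet keeps serving traffic throughout, which is the same single-point-of-failure tolerance being exercised.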
In January of 2011 I took a deliberate break from my high-information diet. No news, on television or on-line; no blogs, no Facebook, no Twitter for a whole month. It was awesome, and highly effective at completely breaking my high-information habit.
I’ve been a voracious consumer of information since I was a kid. As part of the first home computing generation, I grew up with the assumption that I had access to information on-line without much effort. In the first few years it was just the traditional dial-up bulletin board model, but quickly that gave way to the predecessor of the Internet available through the local universities and proto-ISPs. The amount of information and news we had available then was simply amazing, and it has only continued to grow since then.
It wasn’t difficult to imagine back then that we’d have access to this much information – science fiction authors and nerds have been thinking along those lines for much longer than I’ve been around – but to actually witness it coming to fruition was wonderful, and so easy to take for granted.
The bottom line is that many of us, and especially me, are addicted to information. News is no longer fed to us in an hour-long program on television every night; there are plenty of young adults now who have never lived without 24/7 news channels. We all have the expectation that news and information is freely available anytime, all the time, wherever we are.
As technology continued to advance, so did our ease of accessing this wealth of information. We could get all of our news via the web, and could skip the cable networks if we wanted. Eventually, we could watch the cable networks live over the Internet at the office, or wherever we might be. A plethora of applications let us aggregate and present all of this information in one place for our consumption, in real-time. The upshot being that there is ample opportunity to be distracted by information at any time – there’s no waiting any more.
We certainly don’t need 24/7 news and the Internet to procrastinate, but they both sure make it easy for anyone who works with a computer. As mobile computing devices become more commonplace, I believe we’ll start to see procrastination in the hands of everyone. Just wait until the clerk at Taco Bell is too busy checking Facebook instead of taking your order! Oh wait, you mean that’s already happened?
Adobe’s Flash has a lot of uses, but one of the most impressive to me has been the creation of interactive graphs on a web page. One just has to visit Google Finance to see a great example of this in action; it’s fast, effective and fits seamlessly within the rest of the page.
Many times, however, Flash isn’t an appropriate technology to use. If you’re an open-source product like Zenoss, Flash presents a licensing issue. If you’re targeting mobile platforms like the iPhone, Flash isn’t available. And, sometimes, you may just not like Flash; it does have its own security problems and overhead, for example. What’s a web developer to do? Enter HTML5 to the rescue…
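To make the point concrete: HTML5’s `<canvas>` element can draw the same kind of line chart Flash was once needed for. Here is a minimal sketch in TypeScript – the data scaling is the interesting part, and `Ctx2D` below is just a hand-rolled subset of the browser’s `CanvasRenderingContext2D` so the example stays self-contained:

```typescript
// Minimal subset of the browser's 2D canvas context used here, declared
// locally so the sketch compiles outside a browser.
interface Ctx2D {
  beginPath(): void;
  moveTo(x: number, y: number): void;
  lineTo(x: number, y: number): void;
  stroke(): void;
}

// Map a series of values onto canvas pixel coordinates. Canvas y grows
// downward, so larger values get smaller y.
function scalePoints(
  values: number[],
  width: number,
  height: number,
): [number, number][] {
  const min = Math.min(...values);
  const span = Math.max(...values) - min || 1; // avoid divide-by-zero on flat data
  return values.map((v, i) => [
    values.length > 1 ? (i / (values.length - 1)) * width : 0,
    height - ((v - min) / span) * height,
  ]);
}

// Draw the series as a single connected line, as a Flash chart would.
function drawLine(
  ctx: Ctx2D,
  values: number[],
  width: number,
  height: number,
): void {
  const pts = scalePoints(values, width, height);
  if (pts.length === 0) return;
  ctx.beginPath();
  ctx.moveTo(pts[0][0], pts[0][1]);
  for (const [x, y] of pts.slice(1)) ctx.lineTo(x, y);
  ctx.stroke();
}
```

In a real page you would pass `canvas.getContext("2d")` as `ctx`; the interactivity that makes Google Finance feel alive is then a matter of redrawing in response to mouse events.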