Whether we’re streaming a movie on our phone, analyzing satellite imagery with machine learning models, or just finding a dinner date, we simply expect the technologies we lean on to work perfectly—every time. Information technology (IT) increasingly sits at the center of our social lives, work, schools, healthcare systems, and much more. And as we all have experienced, when disruptions occur, the impacts can be painful.
For the companies building the digital products and services we use, meeting our growing expectations means keeping these services available around the clock, even in the event of spikes in demand, natural disasters, cyberattacks, or human error. The benchmark for that consistency is 99.999% uptime, or "five nines" in engineering parlance, which leaves room for only about five minutes of downtime a year. Amazon Web Services (AWS) tools are enabling more and more customers to achieve five nines of availability.
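As a rough illustration of what an uptime percentage means in practice, the downtime a target allows can be worked out directly. The sketch below is not taken from AWS; the function name and the choice of an average 365.25-day year are assumptions made for the example.

```python
# Assumption: an "average" year of 365.25 days, to account for leap years.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (1 - availability_percent / 100)


# Compare a few common targets, from "two nines" to "five nines".
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

Each extra nine cuts the yearly downtime budget by a factor of ten, which is why the last nine is usually the hardest to reach.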
As with the security of applications that run in the cloud, the responsibility for the availability and resiliency of those applications is a shared one. The AWS part of the bargain is to build a cloud infrastructure that can withstand disruptions of almost any type or scale. AWS does that in large part by building multiple Availability Zones (AZs) in AWS Regions around the world. Each Availability Zone consists of one or more data centers with their own power, cooling, and physical security. AZs within an AWS Region are connected via redundant, ultra-low-latency networks. At latest count, AWS has 96 Availability Zones within 30 geographic regions around the world.
Availability Zones are engineered to be geographically far enough apart to lessen the risk that any one event might impact another data center in the AWS Region. At the same time, they are not so far-flung that business continuity becomes an issue for customers who run workloads in more than one Availability Zone (many do) or who have to switch to another AZ for any reason, whether a huge uptick in demand or an earthquake. AWS infrastructure has backups for the backups, and then some.
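The value of spreading a workload across zones can be sketched with a simple probability model. This is an idealized illustration, not an AWS formula: it assumes zones fail independently of one another (the separation described above is meant to approximate that), and the per-zone availability used here is a made-up figure, not a published one.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one of `zones` independent zones is up."""
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The workload is down only if every zone is down at the same time.
    return 1.0 - (1.0 - zone_availability) ** zones


# Assumption: an illustrative per-zone availability, not a real AWS number.
ILLUSTRATIVE_ZONE_AVAILABILITY = 0.999

for n in (1, 2, 3):
    value = combined_availability(ILLUSTRATIVE_ZONE_AVAILABILITY, n)
    print(f"{n} zone(s): {value:.9f}")
```

The model breaks down when failures are correlated, for example when a single event affects several zones at once, which is the risk that geographic separation is intended to reduce.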
The AWS customer side of the responsibility equation is to make sure the services they are running on the AWS infrastructure are designed with the same continuous availability and resilience in mind. Justin Waite, Vice President of Engineering at Cisco, says that while this kind of complete resilience may not always be achievable, the trick is optimizing the processes that kick in when there are disruptions.
“It's very hard to nail perfection in a technology space,” Waite said. “Ebbs and flows, and ticks and tocks, are happening so quickly that you could work to perfect something and then somebody could come out with a tool that just makes it completely obsolete. Everything's going to fail at some point. The question is, ‘How do we fail gracefully, and how to ensure the right customer experience when failure does occur?’”
Give engineers room to safely experiment
Waite said the cloud has changed the equation for evolving an idea into a product that can be quickly and reliably offered to customers.
“Now, decisions can be made, and lightbulbs can go on in a matter of minutes. It doesn't take a year of planning and wondering how we are going to spin all this up in a certain region, data center, or colocation, with certain hardware, networking, and so forth.”
According to Waite, a crucial element of building products that work “like magic,” with essentially no downtime, is giving engineering teams their own creation space: a place where they have the freedom to safely tinker with cloud tools. These tools function like open canvases, allowing developers to move products forward and start gaming out “what-if” scenarios, without having to worry about breaking anything for real and causing issues in the core business.
“That is much different than management handing out a product roadmap diagram with hardware specs. You start to see developers solve problems, not necessarily in smarter ways, but with more experimentation,” Waite said. “Sure, there will be some failure, but it’s about encouraging a ‘try and buy’ mentality. Want to see what happens when someone pulls that plug or a user presses this button? Great, there you go. Try an Internet of Things (IoT) device or service? Cool, there it is. Because of the cloud, developers have a much bigger toolbox.”
For Roustem Karimov, founder of password manager 1Password, the “bigger toolbox” offered by the AWS Cloud is what enabled him and his co-founders to create the company and provide its customers with constant service.
When Karimov first prototyped 1Password alongside co-founder Dave Teare more than a decade ago, as a side project to their other work, the spectrum of cloud tools he and his team now have at their disposal was not available. As a consequence, 1Password simply wasn't possible yet, he said. But in 2016, with AWS technology at his fingertips, the business model of managing passwords securely, at scale, and in real time finally made sense, he said.
Today, more than 100,000 businesses around the globe rely on 1Password to manage their passwords, and they expect availability at all times. (1Password also offers services for individuals and families.) For Karimov and his team, this means keeping systems running even when traffic is abnormally high, a component of their code fails, or hackers attempt to attack their system. Otherwise, the employees, managers, and executives who make up 1Password’s customers could suddenly be locked out of their critical applications.
So, how have they designed against doomsday scenarios?
"First of all, AWS allowed us to make sure there is no single point of failure. Every component of the infrastructure has a failover option. In addition to that, we can actually build an entire 1Password service—the entire environment—including all the components, databases, caches, and application servers by just running a single script," Karimov said.
The automation of the infrastructure allows 1Password to stand up new customers and serve existing ones quickly, predictably, and reliably.
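The idea of building a whole environment from one script can be sketched in a few lines. This is not 1Password's tooling: the component names echo the quote above, while the stage list, the `build_environment` and `promote` helpers, and the health check are hypothetical stand-ins. A real script would call a provisioning API where this one only records names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Assumptions for illustration: component kinds and an ordered list of stages.
COMPONENTS = ("database", "cache", "application_server")
STAGES = ("development", "testing", "staging", "production")


@dataclass
class Environment:
    name: str
    components: Dict[str, str] = field(default_factory=dict)
    version: Optional[str] = None


def build_environment(name: str) -> Environment:
    """Create every component of one environment from the same definition."""
    env = Environment(name=name)
    for component in COMPONENTS:
        # Placeholder for a real provisioning call.
        env.components[component] = f"{name}-{component}"
    return env


def promote(version: str,
            environments: Dict[str, Environment],
            health_check: Callable[[Environment], bool]) -> List[str]:
    """Roll `version` out stage by stage, stopping at the first failed check."""
    passed = []
    for stage in STAGES:
        env = environments[stage]
        previous = env.version
        env.version = version
        if not health_check(env):
            env.version = previous  # roll back and stop before later stages
            break
        passed.append(stage)
    return passed


# Every environment is built by the same function, so they stay consistent.
environments = {stage: build_environment(stage) for stage in STAGES}
passed = promote("v2", environments,
                 health_check=lambda env: len(env.components) == len(COMPONENTS))
print(passed)
```

Because every environment comes from the same definition, a release that passes the earlier stages is exercised on the same set of components it will meet in production.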
"We have multiple development environments, test environments, staging environments, and production environments," said Karimov. "This allows us to have a process that moves the system from development to testing, staging, and production in real time and never miss a moment of service."
1Password runs all of its services through the cloud, relying on AWS infrastructure and cloud tools that make this kind of constant availability achievable.
"If we were trying to do it manually, without the cloud, our uptime would not be possible," said Karimov.
Prepare for when things head south: Three tips to fend off disruptions
Amazon Distinguished Engineer and Senior Vice President James Hamilton offers three ways to give every company the best shot at the highest level of availability.
Automate as much as possible. According to the Uptime Institute, the majority of outages are caused by human error, commonly introduced in tasks such as testing, backups, and code reviews. Automate as much as you can, so human errors don’t happen in the first place.
Test for knowns and unknowns, and break things before they actually break. Testing can take the form of purposefully breaking things and seeing what happens. The point is to subject your system to real-world scenarios that you can anticipate as well as those that may seem outside the realm of possibility. By testing your system’s limits under controlled circumstances, you can be ready to fix problems and avoid downtime when issues occur. If you practice the actions and steps you would take during a disaster, before a real event occurs—and it will—you will be ready.
Continuously collect and analyze data from your applications and, just as importantly, unify it. Having a single source of truth will make it easier for your development team to spot problems and fix them. You can vastly reduce the time spent troubleshooting and fixing bugs if everyone is reading from the same set of data and using the same analytics tools.
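The second tip, breaking things on purpose, can be tried at small scale before it is applied to real infrastructure. The sketch below is a toy experiment under stated assumptions: `FaultInjector`, `call_with_failover`, and `run_experiment` are hypothetical names, the "primary" and "standby" are plain functions standing in for real components, and failures are injected with a seeded random generator so the run is repeatable.

```python
import random
from typing import Callable, Dict


class FaultInjector:
    """Wraps a callable and makes a chosen fraction of calls fail on purpose."""

    def __init__(self, func: Callable[[], str], failure_rate: float, seed: int = 0):
        self._func = func
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so the experiment is repeatable
        self.injected_failures = 0

    def __call__(self) -> str:
        if self._rng.random() < self._failure_rate:
            self.injected_failures += 1
            raise ConnectionError("injected failure")
        return self._func()


def call_with_failover(primary: Callable[[], str],
                       standby: Callable[[], str]) -> str:
    """Use the primary, and fall back to the standby if the primary fails."""
    try:
        return primary()
    except ConnectionError:
        return standby()


def run_experiment(failure_rate: float, requests: int = 1000,
                   seed: int = 0) -> Dict[str, int]:
    """Count which component served each request while faults are injected."""
    primary = FaultInjector(lambda: "primary", failure_rate, seed)
    served = {"primary": 0, "standby": 0}
    for _ in range(requests):
        served[call_with_failover(primary, lambda: "standby")] += 1
    return served


# Break the primary for roughly 30% of calls and see who served the traffic.
print(run_experiment(failure_rate=0.3))
```

If every request is still served while the primary is failing, the failover path works under the tested conditions; a request that goes unserved points to a gap worth fixing before a real outage finds it.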