As discussed in the previous post, I am now on a team that is developing a Fault Injection Framework to help find weaknesses in a system so that its developers and maintainers can fix them. I mentioned Netflix's Simian Army and how we are taking a different approach: systematically enumerating every failure case instead of picking cases at random.

First off, some definitions:

  • Node - a single service or host (e.g. a web service)
  • Relationship - a node-to-node connection in a system (e.g. a web service that connects to a database)
  • Cluster - a collection of relationships (e.g. a web service that connects to a database and a load balancer)
  • System - a collection of clusters (e.g. the entire system network graph)
  • Plugin - the type of fault we are injecting (e.g. increasing latency, or forcing a shutdown). Plugins are either single-node or multi-node: a single-node plugin acts on one node (e.g. recycling a database), while a multi-node plugin acts on a relationship (e.g. introducing latency between two nodes).
  • Injection Case - a combination of plugins used on a relationship/cluster to simulate failure
  • Injection Suite - a series of injection cases to run on a system
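
To make these definitions concrete, here is a minimal sketch of how they could be modelled as plain data types. The class and field names (Node, Relationship, Plugin, InjectionCase, multi_node) are hypothetical and chosen only for illustration; they are not the framework's actual types.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    name: str  # a single service or host, e.g. "web"


@dataclass(frozen=True)
class Relationship:
    source: Node  # e.g. the web service
    target: Node  # e.g. the database it connects to


@dataclass(frozen=True)
class Plugin:
    name: str         # e.g. "latency" or "recycle"
    multi_node: bool  # True if the fault acts on a relationship, not one node


@dataclass(frozen=True)
class InjectionCase:
    relationship: Relationship
    # One entry per node in (source, target) order; None means "no plugin".
    plugins: Tuple[Optional[Plugin], Optional[Plugin]]


web, db = Node("web"), Node("db")
recycle = Plugin("recycle", multi_node=False)

# One case: leave the web service alone and recycle the database.
case = InjectionCase(Relationship(web, db), (None, recycle))

# In this sketch an injection suite is simply a list of cases.
suite = [case]
```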

We want to be able to run an injection suite on a system so that we can find its weaknesses and allow the system's developers/maintainers to be better prepared for when one of the generated cases happens in a production environment.

In order to accomplish this, we need to find every plugin combination that can apply to each relationship. For example, if we have a 2-node relationship and each node supports the same single plugin (plugin1), then we have 4 different cases:

  1. Node1{none}:Node2{none}*
  2. Node1{none}:Node2{plugin1}
  3. Node1{plugin1}:Node2{none}
  4. Node1{plugin1}:Node2{plugin1}

NOTE: "none" means that no plugin is applied to the node.
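
The four cases above are the Cartesian product of each node's options. Here is a minimal sketch of that enumeration using Python's itertools.product. The helper name injection_cases and the use of the string "none" as a placeholder are assumptions made for this example.

```python
from itertools import product


def injection_cases(plugins_per_node):
    """Return every assignment of one plugin (or "none") to each node.

    plugins_per_node holds one list of plugin names per node.
    """
    options = [["none"] + list(plugins) for plugins in plugins_per_node]
    return list(product(*options))


# A 2-node relationship where each node supports only plugin1.
cases = injection_cases([["plugin1"], ["plugin1"]])

for number, (node1, node2) in enumerate(cases, start=1):
    print(f"{number}. Node1{{{node1}}}:Node2{{{node2}}}")
```

Running it prints the same four cases, in the same order, as the list above.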

Now, adding a second plugin (plugin2) to both nodes increases the 4 cases to 9. Adding a third (plugin3) increases them to 16, a fourth to 25, and so on: with p plugins per node, a 2-node relationship has (p+1)^2 cases. That growth is quadratic in the number of plugins, but it becomes exponential in the number of nodes, since n nodes give (p+1)^n cases. And this is only part of the problem. It grows even further when you start adding clusters and plugins that work with multiple nodes.
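
The counts 4, 9, 16, 25 follow from each node having its plugins plus "none" to choose from. Here is a small sketch of that arithmetic, assuming every node supports the same number of plugins and receives at most one of them. The function name case_count is made up for this example.

```python
def case_count(plugins_per_node, nodes=2):
    """Number of cases when each node gets one plugin or "none"."""
    return (plugins_per_node + 1) ** nodes


# More plugins on a 2-node relationship: 4, 9, 16, 25
two_node_counts = [case_count(p) for p in range(1, 5)]
print(two_node_counts)

# Same 3 plugins, but more nodes: 16, 64, 256
more_node_counts = [case_count(3, nodes=n) for n in range(2, 5)]
print(more_node_counts)
```

Adding nodes makes the count grow much faster than adding plugins does.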

However, some cases do not need to be tested. For example, the case (Node1{none}:Node2{none}) does not need to be tested because nothing is happening. Other generated cases might require shutting down Node A and increasing latency to Node A. That is an illegal case: once Node A is shut down, there is nothing left to add latency to. And herein lies the problem: how do we systematically generate only the legal injection cases, without generating every injection case under the sun?
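
To illustrate the two kinds of unwanted cases, here is a naive baseline sketch that generates everything and then filters. This is exactly the brute-force approach we would like to avoid, not a solution. It assumes a node may receive any subset of its plugins, and the plugin names and the CONFLICTS rule set are hypothetical.

```python
from itertools import combinations, product

# Hypothetical rule: these plugins cannot both be applied to the same node.
CONFLICTS = [frozenset({"shutdown", "latency"})]


def subsets(plugins):
    """Every subset of plugins, including the empty set ("none")."""
    return [
        frozenset(combo)
        for size in range(len(plugins) + 1)
        for combo in combinations(plugins, size)
    ]


def is_legal(case):
    """case is a tuple holding one set of plugin names per node."""
    if all(not plugins for plugins in case):
        return False  # nothing is injected, so there is nothing to test
    # Reject the case if any node carries a conflicting pair of plugins.
    return not any(
        conflict <= plugins for plugins in case for conflict in CONFLICTS
    )


plugins = ["shutdown", "latency"]
all_cases = list(product(subsets(plugins), repeat=2))  # 2-node relationship
legal_cases = [case for case in all_cases if is_legal(case)]

print(len(all_cases), len(legal_cases))
```

Here half of the 16 generated cases are thrown away, which is why generating and then filtering scales poorly as nodes and plugins are added.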

I will go into this problem in more detail in the next post.