Chaos Engineering: What happens when your banking transactions are in a black hole?
After six years of keeping internet banking, mortgage broking, and foreign exchange trading up and running for over 1 million users, I’ve seen many interesting, unusual, and unexpected failures: missing transactions, incorrect balances, duplicate transactions, lost transactions, lost mortgages, and much more. Here’s a repo of fintech failure stories if you’d like to learn more: https://github.com/tammybutow/Fintech-Failure-Stories/
What happens when your banking transactions are in a black hole?
For the demo, we’ll be using the Bank of Anthos repo. It’s an open-source project from Google Cloud Platform.
This is the architecture of our Bank of Anthos demo:
Chaos Engineering Experiment: Balance Reader
We will practice Chaos Engineering in a controlled way, as in any good scientific experiment. The screenshot below is what our Bank of Anthos application looks like when it’s up and running and we’ve logged in as the supplied test user. We’ll be creating a black hole to perform our experiment.
What is a black hole?
A region of a distributed system where gravity is so strong that nothing — no requests or transactions — can escape from it. All IP packets in this region are trapped.
How can we create a black hole?
Gremlin has a built-in black hole attack which you can use via the CLI or UI:
gremlin attack blackhole
This attack targets IP packets at the transport layer, selected by the supplied port and host arguments. It uses traffic policing features in the Linux kernel to drop the targeted packets.
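To get a feel for what a black hole looks like from the calling side, here is a minimal Python sketch. It is a local simulation and not how Gremlin implements the attack: a tiny TCP server accepts connections and then stays silent, so the only signal the client gets is its own timeout. The function names, the fake request, and the 0.5 second timeout are all assumptions made for illustration.

```python
import socket
import threading
import time


def start_silent_server():
    """Listen on a free local port, accept connections, and never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen()
    held = []  # keep accepted sockets open so the client sees silence, not a close

    def accept_forever():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # the listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return server, server.getsockname()[1]


def call_with_timeout(port, timeout_s):
    """Return 'ok' if the peer answers in time, otherwise 'timeout'."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as client:
        client.sendall(b"GET /balance HTTP/1.0\r\n\r\n")  # illustrative request only
        try:
            client.recv(1024)
            return "ok"
        except socket.timeout:
            return "timeout"


server, port = start_silent_server()
started = time.monotonic()
outcome = call_with_timeout(port, timeout_s=0.5)
elapsed = time.monotonic() - started
server.close()
print(outcome, round(elapsed, 1))
```

A caller with no timeout would simply wait, which is why time limits on calls to a dependency matter when that dependency can fall into a black hole.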
Does blackholing a critical path service like the Balance Reader result in a graceful degradation of the customer experience?
Bank of Anthos runs on Kubernetes so we’ll use Gremlin’s Kubernetes feature to set up and run our black hole attack:
https://app.gremlin.com/attacks/new/kubernetes
Results of the Chaos Engineering Experiment: Balance Reader
As a result of the experiment, the balance appears as $---. This could make the user think they have no money in their account. While the attack is running, we should also check other functionality in the internet banking application.
Can we make deposits?
The user is still able to make a deposit of $1000 while the Balance Reader service is in a black hole.
Can we send payments?
The user is unable to send payments. They will see an error that the payment failed due to Balance Reader.
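One way to think about graceful degradation here is to decide, per feature, what the user should see when the balance lookup fails. The sketch below is a hypothetical illustration in Python and is not taken from the Bank of Anthos code: `DependencyUnavailable`, `format_balance`, and `payment_allowed` are invented names, and the rule that payments need a verified balance is an assumption based on the behavior observed above.

```python
from typing import Callable, Tuple


class DependencyUnavailable(Exception):
    """Raised by a hypothetical backend client when a call times out."""


def format_balance(fetch_cents: Callable[[], int]) -> str:
    """Return a display string for the balance, degrading to a clear message."""
    try:
        cents = fetch_cents()
    except DependencyUnavailable:
        # Say explicitly that the data is missing so it is not read as zero.
        return "Balance temporarily unavailable"
    return f"${cents / 100:,.2f}"


def payment_allowed(fetch_cents: Callable[[], int], amount_cents: int) -> Tuple[bool, str]:
    """Decide whether a payment can proceed, with a user-facing reason."""
    try:
        cents = fetch_cents()
    except DependencyUnavailable:
        return False, "Payments are paused because your balance cannot be verified"
    if amount_cents > cents:
        return False, "Insufficient funds"
    return True, "OK"


def blackholed() -> int:
    """Stand-in for a balance lookup whose backend is in a black hole."""
    raise DependencyUnavailable("balance lookup timed out")


print(format_balance(lambda: 123456))
print(format_balance(blackholed))
print(payment_allowed(blackholed, 5000))
```

The design choice is to fail loudly and specifically: an explicit "unavailable" message is harder to mistake for an empty account than a blank placeholder.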
Would you like to recreate this demo?
This is a completely free demo environment to learn about black holes:
- Use this link to install with minikube on Google Cloud Shell: https://ssh.cloud.google.com/cloudshell/editor?show=ide&cloudshell_git_repo=https://github.com/GoogleCloudPlatform/bank-of-anthos&cloudshell_workspace=.&cloudshell_tutorial=extras/cloudshell/tutorial.md
- Click minikube → start
- In Cloud Shell terminal, run kubectl apply -f extras/jwt/jwt-secret.yaml
- Click <> Cloud Code → Run on Kubernetes, change to port 4503
- Sign up for a Gremlin account if you don’t have one already: https://gremlin.com/buttons
- To create black holes, create a namespace for Gremlin and install Gremlin using its Helm chart: https://github.com/gremlin/helm
Chaos Engineering Experiment: Transaction History
Now let’s see how a black hole impacts our Transaction History service.
Does blackholing transaction history result in a graceful degradation of the customer experience?
We’ll select the transactionhistory deployment with Gremlin:
Results of the Chaos Engineering Experiment — Transaction History
As a result of the experiment, transactions no longer appear. Instead, the user sees an error: “Error: Could Not Load Transactions”. This could make the user think their deposits and withdrawals are not accurate, which could result in an increase in phone calls and visits to the bank.
What can we do to mitigate against a black hole?
Depending on the service, scaling the number of replicas in Kubernetes may work well. Use the following command in Google Cloud Shell:
kubectl scale deployment transactionhistory --replicas=2
Next, run the following command to view your additional pod:
kubectl get pods
Now let’s send 50% of the Transaction History pods into a black hole:
Our new architecture for Bank of Anthos — we now have two pods for our Transaction History:
Results of the Chaos Engineering Experiment — Transaction History
There will be a very short outage (less than 1s) and then the other pod will take over. We are still able to see transaction history and no longer receive error messages:
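The short outage followed by recovery can be pictured with a toy model. The Python sketch below is an assumption-laden simplification and not how Kubernetes is implemented: `Replica`, `ToyService`, and `probe` are invented names, and the round-robin routing and single health check stand in for whatever the real platform does to stop sending traffic to an unresponsive pod.

```python
class Replica:
    """A stand-in for one pod of a service."""

    def __init__(self, name):
        self.name = name
        self.blackholed = False

    def handle(self, request):
        if self.blackholed:
            raise TimeoutError(f"{self.name} did not answer")
        return f"{self.name} served {request}"


class ToyService:
    """Routes requests round-robin across replicas believed to be healthy."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.healthy = list(replicas)
        self._next = 0

    def probe(self):
        """Drop replicas that fail a health check from the rotation."""
        self.healthy = [r for r in self.replicas if not r.blackholed]

    def route(self, request):
        if not self.healthy:
            raise RuntimeError("no healthy replicas")
        replica = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return replica.handle(request)


def send(service, count):
    """Send `count` requests and record 'ok' or 'error' for each."""
    results = []
    for i in range(count):
        try:
            service.route(f"request-{i}")
            results.append("ok")
        except TimeoutError:
            results.append("error")
    return results


service = ToyService([Replica("transactionhistory-a"), Replica("transactionhistory-b")])
service.replicas[0].blackholed = True  # black hole 50% of the pods

before = send(service, 4)  # the health check has not run yet
service.probe()            # the unresponsive replica leaves the rotation
after = send(service, 4)
print(before, after)
```

In this model the errors only occur in the window before the health check runs, which mirrors the brief outage seen in the experiment. With a single replica, `probe` would leave nothing to route to.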
What different sizes of black holes can we experience or create?
We can experience and create black holes of all sizes. When creating black holes, start micro and gradually expand the blast radius. For example, start with one pod and expand over time until you black hole an entire region. This is a very safe way to test region failure in a short time (60 seconds) without tearing down or bringing up infrastructure and applications.
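Expanding the blast radius gradually can be written down as a simple plan. The following Python sketch is only an illustration: the percentages, the function names, and the `steady_state_ok` check are assumptions, and in practice the check would be whatever signal tells you the customer experience is still acceptable.

```python
def blast_radius_stages(total_pods, percentages=(10, 25, 50, 100)):
    """Turn percentages into strictly increasing pod counts, starting at 1 or more."""
    stages = []
    for pct in percentages:
        count = max(1, (total_pods * pct + 99) // 100)  # integer ceiling
        if not stages or count > stages[-1]:
            stages.append(count)
    return stages


def run_stages(stages, steady_state_ok):
    """Run stages in order and halt at the first one that breaks steady state."""
    completed = []
    for count in stages:
        if not steady_state_ok(count):
            break  # stop expanding; investigate before going further
        completed.append(count)
    return completed


small = blast_radius_stages(2)
large = blast_radius_stages(20)
# Hypothetical check: pretend the system copes until 10 pods are black holed.
passed = run_stages(large, lambda count: count < 10)
print(small, large, passed)
```

The halt condition is the important part: each stage only runs if the previous, smaller one left the system in a healthy state.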
Additional Questions To Consider
- How do black holes impact observability?
- What can we learn from blackhole-related failures?
- How can we use black holes to learn how to make systems more reliable?
Resources:
- https://github.com/GoogleCloudPlatform/bank-of-anthos
- https://gremlin.com/buttons
- https://github.com/gremlin/helm
Thank you
Get a free copy of my O’Reilly ebook Reducing MTTD for High-Severity Incidents:
Find me on Twitter: @tambryantbutow