Chaos Engineering: What happens when your banking transactions are in a black hole?

Tammy Butow
5 min read · Jun 4, 2021

After 6 years of keeping internet banking, mortgage broking, and foreign exchange trading up and running for over 1 million users, I’ve seen many interesting, unusual, and unexpected failures: missing transactions, incorrect balances, duplicate transactions, lost transactions, lost mortgages, and much more. Here’s a repo of fintech failure stories if you’d like to learn more: https://github.com/tammybutow/Fintech-Failure-Stories/

What happens when your banking transactions are in a black hole?

For the demo, we’ll be using the Bank of Anthos repo. It’s an open-source project from Google Cloud Platform.

This is the architecture of our Bank of Anthos demo:

https://github.com/GoogleCloudPlatform/bank-of-anthos/

Chaos Engineering Experiment: Balance Reader

We will practice Chaos Engineering in a controlled way, as with any good scientific experiment. The screenshot below shows our Bank of Anthos application up and running, logged in as the supplied test user. We’ll create a black hole to perform our experiment.

What is a black hole?

A region of a distributed system where gravity is so strong that nothing — no requests or transactions — can escape from it. All IP packets in this region are trapped.

How can we create a black hole?

Gremlin has a built-in black hole attack which you can use via the CLI or UI:

gremlin attack blackhole

This will capture IP packets at the transport layer, matched by the supplied port and host arguments. Gremlin uses traffic policing features in the Linux kernel to drop the matching packets.
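To make the matching behavior concrete, here is a toy Python model of a black hole rule. It is only a sketch of the idea (drop whatever matches a host and port filter, let everything else through) and not how Gremlin implements the attack. The names `Packet`, `BlackholeRule`, and `deliver`, the service hostnames, and port 8080 are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Packet:
    dst_host: str
    dst_port: int
    payload: str


@dataclass(frozen=True)
class BlackholeRule:
    # Hypothetical rule shape: None means "match any host" or "match any port".
    host: Optional[str] = None
    port: Optional[int] = None

    def matches(self, packet: Packet) -> bool:
        host_ok = self.host is None or self.host == packet.dst_host
        port_ok = self.port is None or self.port == packet.dst_port
        return host_ok and port_ok


def deliver(packets: List[Packet], rule: BlackholeRule) -> List[Packet]:
    """Return the packets that get through; matching packets are dropped silently."""
    return [p for p in packets if not rule.matches(p)]


# Illustrative traffic; hostnames and port are assumptions, not real config.
traffic = [
    Packet("balancereader", 8080, "read balance"),
    Packet("transactionhistory", 8080, "read transactions"),
]

rule = BlackholeRule(host="balancereader", port=8080)
for packet in deliver(traffic, rule):
    print(f"delivered to {packet.dst_host}:{packet.dst_port}")
```

The sender gets no rejection for a dropped packet in this model, which is what makes a black hole different from a service that is down and actively refusing connections.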

Does blackholing a critical path service like the Balance Reader result in a graceful degradation of the customer experience?

Bank of Anthos runs on Kubernetes so we’ll use Gremlin’s Kubernetes feature to set up and run our black hole attack:

https://app.gremlin.com/attacks/new/kubernetes

Running a black hole attack with Gremlin on the Bank of Anthos

Results of the Chaos Engineering Experiment: Balance Reader

As a result of the experiment, your balance appears as $---. This could make the user think they have no money in their account. While the attack is running, we should also check the other functionality of the internet banking application.

When the black hole is created by Gremlin, your bank balance will appear as $---
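The behavior seen here, a placeholder where the balance should be, is one form of graceful degradation. Below is a minimal Python sketch of that pattern, assuming the frontend catches a timeout or connection error and renders a fallback. It is not the Bank of Anthos frontend code; `render_balance`, the two stand-in backends, and the sample balance are hypothetical.

```python
from typing import Callable


def render_balance(fetch_balance_cents: Callable[[], int], placeholder: str = "$---") -> str:
    """Format the balance, or return a placeholder if the backend is unreachable."""
    try:
        cents = fetch_balance_cents()
    except (TimeoutError, ConnectionError):
        # Dropped packets usually surface to the caller as a timeout
        # or a connection error rather than a normal error response.
        return placeholder
    return f"${cents / 100:,.2f}"


def healthy_backend() -> int:
    return 123456  # hypothetical balance, in cents


def blackholed_backend() -> int:
    raise TimeoutError("no response from the balance service")


print(render_balance(healthy_backend))
print(render_balance(blackholed_backend))
print(render_balance(blackholed_backend, placeholder="Balance unavailable"))
```

The `placeholder` parameter is there because the wording matters: a message such as "Balance unavailable" may be less likely to be misread as an empty account than a row of dashes.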

Can we make deposits?

The user is still able to make a deposit of $1000 while the Balance Reader service is in a black hole.

The user is able to make deposits despite the Balance Reader black hole

Can we send payments?

The user is unable to send payments. They will see an error that the payment failed due to Balance Reader.

The user is not able to send payments due to the Balance Reader black hole
Connection Refused error due to the Balance Reader black hole
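One plausible explanation for why deposits succeed while payments fail is that only the payment path needs to read the sender's balance first. The Python sketch below models that assumption; it is not taken from the Bank of Anthos source, and `make_deposit`, `send_payment`, `PaymentFailed`, and the in-memory ledger are hypothetical.

```python
from typing import Callable, List


class PaymentFailed(Exception):
    """Raised when a payment cannot be completed."""


def make_deposit(ledger: List[int], amount_cents: int) -> None:
    # Assumption: a deposit only adds funds, so it needs no balance lookup.
    ledger.append(amount_cents)


def send_payment(ledger: List[int], amount_cents: int, read_balance: Callable[[], int]) -> None:
    # Assumption: a payment must confirm sufficient funds before it is recorded.
    try:
        balance = read_balance()
    except ConnectionError as exc:
        raise PaymentFailed("payment failed: balance service unreachable") from exc
    if amount_cents > balance:
        raise PaymentFailed("payment failed: insufficient funds")
    ledger.append(-amount_cents)


def blackholed_balance_reader() -> int:
    raise ConnectionRefusedError("connection refused")


ledger: List[int] = []
make_deposit(ledger, 100_000)  # works without the balance service
try:
    send_payment(ledger, 5_000, blackholed_balance_reader)
except PaymentFailed as err:
    print(err)
print(ledger)
```

Mapping out which user actions depend on which service, as this sketch does in miniature, helps predict what else will break before running the attack.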

Would you like to recreate this demo?

This is a completely free demo environment to learn about black holes:

  1. Use this link to install with minikube on Google Cloud Shell: https://ssh.cloud.google.com/cloudshell/editor?show=ide&cloudshell_git_repo=https://github.com/GoogleCloudPlatform/bank-of-anthos&cloudshell_workspace=.&cloudshell_tutorial=extras/cloudshell/tutorial.md
  2. Click minikube start
  3. In the Cloud Shell terminal, run kubectl apply -f extras/jwt/jwt-secret.yaml
  4. Click <> Cloud Code, select Run on Kubernetes, and change to port 4503
  5. Sign up for a Gremlin account if you don’t have one already: https://gremlin.com/buttons
  6. To create black holes, create a namespace for Gremlin and install Gremlin using its Helm chart: https://github.com/gremlin/helm

Prefer a video demo?

In this video I will share how to create this demo

Chaos Engineering Experiment: Transaction History

Now let’s see how a black hole impacts our Transaction History service.

https://github.com/GoogleCloudPlatform/bank-of-anthos

Does blackholing transaction history result in a graceful degradation of the customer experience?

Transaction History before the black hole is created

We’ll select the transactionhistory deployment with Gremlin:

Results of the Chaos Engineering Experiment — Transaction History

As a result of the experiment, your transactions will no longer appear. Instead, you’ll see the message “Error: Could Not Load Transactions”. This could make the user doubt that their deposits and withdrawals were recorded accurately, which could result in an increase in phone calls and visits to the bank.

Error: Could Not Load Transactions — Transaction History while the black hole exists

What can we do to mitigate a black hole?

Depending on the service, using the Kubernetes feature to scale replicas may work well. Use the following command in Google Cloud Shell:

kubectl scale deployment transactionhistory --replicas=2

Next run the following command to view your additional pod:

kubectl get pods

Now we have 2 transaction history pods

Now let’s send 50% of the transaction history pods into a black hole:

https://app.gremlin.com/attacks/new/kubernetes

Our new architecture for Bank of Anthos — we now have two pods for our Transaction History:

Results of the Chaos Engineering Experiment — Transaction History

There will be a very short outage (less than 1s) and then the other pod will take over. We are still able to see transaction history and no longer receive error messages:

No error messages during the Black Hole due to additional replicas
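A likely reason the second replica takes over is that Kubernetes stops routing Service traffic to pods that fail their readiness checks. The Python sketch below is a toy model of that idea, not Kubernetes code: `ready_endpoints` and `route` are hypothetical helpers, the pod names are made up, and the readiness check is a simple stand-in for a real probe.

```python
from typing import Callable, List


def ready_endpoints(pods: List[str], is_ready: Callable[[str], bool]) -> List[str]:
    """Keep only the pods that currently pass their readiness check."""
    return [pod for pod in pods if is_ready(pod)]


def route(request_number: int, endpoints: List[str]) -> str:
    """Pick an endpoint round-robin; fail if no pod is ready."""
    if not endpoints:
        raise RuntimeError("no ready endpoints: every replica is unreachable")
    return endpoints[request_number % len(endpoints)]


# Two replicas with made-up names; one of them is in the black hole.
pods = ["transactionhistory-a", "transactionhistory-b"]
blackholed = {"transactionhistory-a"}

endpoints = ready_endpoints(pods, lambda pod: pod not in blackholed)
for request_number in range(3):
    print(f"request {request_number} -> {route(request_number, endpoints)}")
```

In this model the brief outage corresponds to the window before the failing pod is taken out of rotation. With a single replica, the endpoint list would be empty and every request would fail, which matches the first Transaction History experiment.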

What different sizes of black holes can we experience or create?

We can experience and create black holes of all sizes. When creating black holes, start small and gradually expand the blast radius. For example, start with one pod and expand over time to black hole an entire region. This is a safe way to test region failure in a short time (60 seconds), with no need to tear down or bring up infrastructure and applications.
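Expanding the blast radius step by step can be sketched as a small planner plus a runner that stops at the first unhealthy stage. This is a minimal Python illustration and not a Gremlin feature. The percentage steps, the `run_attack` and `is_healthy` callbacks, and the halt-on-failure rule are assumptions added for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple


def blast_radius_stages(total_pods: int, percentages: Sequence[int] = (10, 25, 50, 100)) -> List[int]:
    """Return increasing pod counts to target, always starting with a single pod."""
    if total_pods < 1:
        raise ValueError("total_pods must be at least 1")
    stages = [1]
    for pct in percentages:
        count = min(total_pods, max(1, math.ceil(total_pods * pct / 100)))
        if count > stages[-1]:  # skip steps that would not grow the blast radius
            stages.append(count)
    return stages


def run_staged(
    stages: Sequence[int],
    run_attack: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> Tuple[List[int], bool]:
    """Run each stage in order; halt at the first stage that leaves the system unhealthy."""
    completed: List[int] = []
    for count in stages:
        run_attack(count)
        if not is_healthy():
            return completed, False
        completed.append(count)
    return completed, True


print(blast_radius_stages(20))
print(blast_radius_stages(2))
```

In practice, `run_attack` would start a time-limited black hole against that many pods and `is_healthy` would check whatever customer-facing signal the experiment is measuring.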

Additional Questions To Consider

  • How do black holes impact observability?
  • What can we learn from blackhole-related failures?
  • How can we use black holes to learn how to make systems more reliable?

Resources:

Thank you

Get a free copy of my O’Reilly ebook Reducing MTTD for High-Severity Incidents:

gremlin.com/talk/blackholes

Find me on Twitter: @tambryantbutow
