Nun-db Chaos Testing with Docker and Toxiproxy
I was writing my annual post about how to become a better developer, and I wanted to add a topic about dealing with latency and how to test latency problems in local and dev environments, when I found a bug in Nun-db where, under high latency, two servers in the same cluster could both consider themselves the primary.
Keep calm: at first, it only happened when servers were under heavy load and latency was over 1 second, and with our latest changes to support local-first data, we no longer rely on the primary for most operations.
I decided to fix it, and also to use this post to show how I designed the experiment to replicate the problem and to make sure I actually fixed it.
Experimental Design
I optimized for:
- Running the experiment locally and on the CI
- Quick run and replicable tests/script
- Battle-tested, easy-to-use tools
Tools
- Docker and Docker Compose for running locally and replicating the same setup in CI.
- Toxiproxy to artificially introduce latency between nodes in Nun-db.
What I Would Like to Simulate
In the next diagram, I have drawn what I would like to use in my experiment: 500 ms of latency introduced between nodes, to see how the elections behave. 500 ms is a really large latency; in the cloud you usually see much shorter latencies, in the two-digit or low-100 ms range. But to make the problem much more evident, I decided to go crazy and run three nodes in a single local environment.
- The idea is that externally everything passes through Toxiproxy; therefore, all exported ports (3017, 3018, 3019) belong to Toxiproxy. Here is the relevant service definition:
version: "3.9" # optional since v1.27.0 services: toxiproxy: image: "shopify/toxiproxy" ports: - "3017:3017" - "3018:3018" - "3019:3019"
- Create the proxies that expose each downstream service on the same port; this is done by sending commands to Toxiproxy:
version: "3.9" # optional since v1.27.0 services: # ... toxiproxy-config: image: "shopify/toxiproxy" entrypoint: > sh -c " sleep 1 # Create proxies /go/bin/toxiproxy-cli -h toxiproxy:8474 create primary --listen 0.0.0.0:3017 --upstream nun-db-primary:3017 /go/bin/toxiproxy-cli -h toxiproxy:8474 create secondary-1 --listen 0.0.0.0:3018 --upstream nun-db-secondary-1:3018 /go/bin/toxiproxy-cli -h toxiproxy:8474 create secondary-2 --listen 0.0.0.0:3019 --upstream nun-db-secondary-2:3019 # Add latency up and down stream to primary /go/bin/toxiproxy-cli -h toxiproxy:8474 toxic add -t latency -a latency=2000 primary /go/bin/toxiproxy-cli -h toxiproxy:8474 toxic add --upstream --type=latency --toxicName=latency_upstream --a latency=2000 --toxicity=1 primary # Add latency up and down stream to secondary-1 /go/bin/toxiproxy-cli -h toxiproxy:8474 toxic add -t latency -a latency=500 secondary-1 /go/bin/toxiproxy-cli -h toxiproxy:8474 toxic add --upstream --type=latency --toxicName=latency_upstream --a latency=2000 --toxicity=1 secondary-1 # Add latency up and down stream to secondary-2 /go/bin/toxiproxy-cli -h toxiproxy:8474 toxic add -t latency -a latency=500 secondary-2 /go/bin/toxiproxy-cli -h toxiproxy:8474 toxic add --upstream --type=latency --toxicName=latency_upstream --a latency=2000 --toxicity=1 secondary-2 "
- Finally, configure each Nun-db node to reach the other nodes through the proxy:
version: "3.9" # optional since v1.27.0 services: # ... nun-db-primary: image: mateusfreira/nun-db environment: - NUN_DBS_DIR=/nun_data - NUN_WS_ADDR=0.0.0.0:3058 - NUN_HTTP_ADDR=0.0.0.0:9092 - NUN_TCP_ADDR=nun-db-primary:3017 - NUN_TCP_JOIN_ADDR=toxiproxy:3017 # This defines the external address for this node - NUN_LOG_LEVEL=debug - NUN_USER=mateus - NUN_REPLICASET=toxiproxy:3017,toxiproxy:3018,toxiproxy:3019 # Access the others via toxiproxy - NUN_PWD=mateus - NUN_ELECTION_TIMEOUT=3000
- Now all you have to do is send commands to the external ports. I created a few make targets to quickly check the election state on all nodes; for that, we use the `cluster-state;` command:
```makefile
print-election-primary:
	RUST_BACKTRACE=1 ./../../target/debug/nun-db -p $(password) -u $(user) --host "http://$(primaryHttpAddress)" exec "cluster-state;"

print-election-2:
	RUST_BACKTRACE=1 ./../../target/debug/nun-db -p $(password) -u $(user) --host "http://$(secoundary1HttpAddress)" exec "cluster-state;"

print-election-3:
	RUST_BACKTRACE=1 ./../../target/debug/nun-db -p $(password) -u $(user) --host "http://$(secoundary2HttpAddress)" exec "cluster-state;"
```
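To keep an eye on the cluster while the chaos runs, these targets combine nicely with `watch` (a usage sketch; it assumes the make variables above are set):

```sh
# Re-run the three election checks every second and highlight who is Primary
watch -n 1 'make print-election-primary print-election-2 print-election-3 | grep Primary'
```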
Find the full Docker files on GitHub here.
The bug I found
- The first time I ran the experiment, here is what I found :(
```sh
make print-election-primary print-election-2 print-election-3 | grep Primary
# Response "valid auth\n;cluster-state toxiproxy:3017:Primary, toxiproxy:3018:Secoundary, toxiproxy:3019:Secoundary,\n"
# Response "valid auth\n;cluster-state toxiproxy:3017:Primary, toxiproxy:3018:Secoundary, toxiproxy:3019:Secoundary,\n"
# Response "valid auth\n;cluster-state toxiproxy:3017:Secoundary, toxiproxy:3018:Primary, toxiproxy:3019:Secoundary,\n"
```
- The nodes disagree about who the primary is: the first two responses report toxiproxy:3017 as the primary, while the third reports toxiproxy:3018, so two nodes were acting as primary in the same cluster.
- PR with the fix.
What was the problem?
- Prior to the fix, the election relied on a timeout to claim victory (Here). The fix changed the algorithm to wait for the ack from all nodes and to fall back to the timeout only if some messages are never answered. In other words, the node now waits until the responses come from the nodes or the timeout expires:
```rust
// ...
// Wait until the pending operation (opp) is acknowledged by the other nodes,
// or give up once the election timeout expires.
while opp.is_none() && start_time < *NUN_ELECTION_TIMEOUT {
    log::debug!("Waiting for opp to be registered");
    thread::sleep(time::Duration::from_millis(2));
    start_time = start_time + 2; // milliseconds waited so far
    opp = dbs.get_pending_opp_copy(id);
}
```
- After the fix, the same experiment works as expected:
```sh
make print-election-primary print-election-2 print-election-3 | grep Primary
# Response "valid auth\n;cluster-state toxiproxy:3017(self):Primary , toxiproxy:3018(Connected):Secoundary, toxiproxy:3019(Connected):Secoundary,\n"
# Response "valid auth\n;cluster-state toxiproxy:3017(Connected):Primary, toxiproxy:3018(self):Secoundary , toxiproxy:3019(Connected):Secoundary,\n"
# Response "valid auth\n;cluster-state toxiproxy:3017(Connected):Primary, toxiproxy:3018(Connected):Secoundary, toxiproxy:3019(self):Secoundary ,\n"
```
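Since one of the goals was running this in CI, the same output can back a simple assertion. A minimal sketch, assuming the response format shown above (this script is not part of the repo):

```sh
#!/bin/sh
# Count how many distinct nodes are reported as Primary across all three answers;
# anything other than exactly one means the cluster is split.
primaries=$(make print-election-primary print-election-2 print-election-3 \
  | grep -o 'toxiproxy:[0-9]\{4\}[^,]*:Primary' \
  | grep -o 'toxiproxy:[0-9]\{4\}' \
  | sort -u | wc -l)
if [ "$primaries" -ne 1 ]; then
  echo "Split brain detected: $primaries different primaries reported"
  exit 1
fi
```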
PS: Election message exchange
If you are not familiar with the Nun-db election process, the next diagram can help you understand when the pending operations are resolved and acknowledged by the secondaries. The protocol is based on `rp` messages and an `ack` sent back when each message is received successfully.
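Roughly, the exchange looks like this (an illustrative trace based on the description above; this is not Nun-db's literal wire format):

```
candidate   -> secondary-1 : rp <op-id> <message>
candidate   -> secondary-2 : rp <op-id> <message>
secondary-1 -> candidate   : ack <op-id>
secondary-2 -> candidate   : ack <op-id>
-- the pending op is resolved once every ack arrives (or the election timeout fires)
```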
Conclusion
Toxiproxy made it really easy to replicate the problem locally, and it is now part of my daily driver for working with Nun-db. This bug could have caused me real problems much later down the line, and the simple fact that I set up the experiment to mention it in a post and ended up finding and fixing a bug already shows its power.