Nun-db Chaos Testing with Docker and Toxiproxy
I was writing my annual post about how to become a better developer, and I wanted to add a topic about dealing with latency and how to test latency problems in local and dev environments, when I found a bug in Nun-db where, under high latency, two servers in the same cluster could both consider themselves the primary.
Keep calm: at first, it only happened when servers were under heavy load and latency was over 1 second, and with our latest changes to support local-first data, we no longer rely on the primary for most operations.
I decided to fix it, and also to use this post to show how I designed the experiment to replicate the problem and to make sure I actually fixed it.
Experimental Design
I optimized for:
- Running the experiment locally and on the CI
- Quick run and replicable tests/script
- Battle-tested, easy-to-use tools
Tools
- Docker and Docker Compose for running locally and replicating the same setup in CI.
- Toxiproxy to artificially introduce latency between nodes in Nun-db.
What I Would Like to Simulate
In the next diagram, I have drawn what I would like to use in my experiment: 500 ms of latency introduced between nodes, to see how the elections behave. 500 ms is a really large latency; in the cloud you usually see much shorter latencies, in the two-digit or low-100 ms range. But to make the problem much more evident, I decided to go crazy and run three nodes in a single local environment.
- The idea is that externally everything passes through Toxiproxy; therefore, all exported ports (3017, 3018, 3019) belong to Toxiproxy. Here is the relevant service definition:
version: "3.9" # optional since v1.27.0 services: toxiproxy: image: "shopify/toxiproxy" ports: - "3017:3017" - "3018:3018" - "3019:3019"
- Create the proxies that expose each downstream service on the same port; this is done by sending commands to Toxiproxy:
version: "3.9" # optional since v1.27.0 services: # ... toxiproxy-config: image: "shopify/toxiproxy" entrypoint: > sh -c " sleep 1 # Create proxies /go/bin/toxiproxy-cli -h toxiproxy:8474 create primary --listen 0.0.0.0:3017 --upstream nun-db-primary:3017 /go/bin/toxiproxy-cli -h toxiproxy:8474 create secondary-1 --listen 0.0.0.0:3018 --upstream nun-db-secondary-1:3018 /go/bin/toxiproxy-cli -h toxiproxy:8474 create secondary-2 --listen 0.0.0.0:3019 --upstream nun-db-secondary-2:3019 # Add latency up and down stream to primary /go/bin/toxiproxy-cli -h toxiproxy:8474 toxic add -t latency -a latency=2000 primary /go/bin/toxiproxy-cli -h toxiproxy:8474 toxic add --upstream --type=latency --toxicName=latency_upstream --a latency=2000 --toxicity=1 primary # Add latency up and down stream to secondary-1 /go/bin/toxiproxy-cli -h toxiproxy:8474 toxic add -t latency -a latency=500 secondary-1 /go/bin/toxiproxy-cli -h toxiproxy:8474 toxic add --upstream --type=latency --toxicName=latency_upstream --a latency=2000 --toxicity=1 secondary-1 # Add latency up and down stream to secondary-2 /go/bin/toxiproxy-cli -h toxiproxy:8474 toxic add -t latency -a latency=500 secondary-2 /go/bin/toxiproxy-cli -h toxiproxy:8474 toxic add --upstream --type=latency --toxicName=latency_upstream --a latency=2000 --toxicity=1 secondary-2 "
- Finally, configure each Nun-db node to reach the other nodes through the proxy:
version: "3.9" # optional since v1.27.0 services: # ... nun-db-primary: image: mateusfreira/nun-db environment: - NUN_DBS_DIR=/nun_data - NUN_WS_ADDR=0.0.0.0:3058 - NUN_HTTP_ADDR=0.0.0.0:9092 - NUN_TCP_ADDR=nun-db-primary:3017 - NUN_TCP_JOIN_ADDR=toxiproxy:3017 # This defines the external address for this node - NUN_LOG_LEVEL=debug - NUN_USER=mateus - NUN_REPLICASET=toxiproxy:3017,toxiproxy:3018,toxiproxy:3019 # Access the others via toxiproxy - NUN_PWD=mateus - NUN_ELECTION_TIMEOUT=3000
- Now all you have to do is send commands to the external ports. I created a few make targets to quickly check the election state on all nodes; for that, we use the `cluster-state;` command:
```makefile
print-election-primary:
	RUST_BACKTRACE=1 ./../../target/debug/nun-db -p $(password) -u $(user) --host "http://$(primaryHttpAddress)" exec "cluster-state;"

print-election-2:
	RUST_BACKTRACE=1 ./../../target/debug/nun-db -p $(password) -u $(user) --host "http://$(secoundary1HttpAddress)" exec "cluster-state;"

print-election-3:
	RUST_BACKTRACE=1 ./../../target/debug/nun-db -p $(password) -u $(user) --host "http://$(secoundary2HttpAddress)" exec "cluster-state;"
```
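To keep an eye on the cluster while the chaos runs, these targets combine nicely with `watch` (a usage sketch; it assumes the make variables above are set):

```sh
# Re-run the three election checks every second and highlight who is Primary
watch -n 1 'make print-election-primary print-election-2 print-election-3 | grep Primary'
```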
Find the full Docker files on GitHub here.
The bug I found
- The first time I ran the experiment, here is what I found :(
```sh
make print-election-primary print-election-2 print-election-3 | grep Primary
# Response "valid auth\n;cluster-state toxiproxy:3017:Primary, toxiproxy:3018:Secoundary, toxiproxy:3019:Secoundary,\n"
# Response "valid auth\n;cluster-state toxiproxy:3017:Primary, toxiproxy:3018:Secoundary, toxiproxy:3019:Secoundary,\n"
# Response "valid auth\n;cluster-state toxiproxy:3017:Secoundary, toxiproxy:3018:Primary, toxiproxy:3019:Secoundary,\n"
```
- The nodes disagree about who the primary is: the first two responses report toxiproxy:3017 as the primary, while the third reports toxiproxy:3018, so two nodes were acting as primary in the same cluster.
- PR with the fix.
What was the problem?
- Prior to the fix, the election relied on a timeout to claim victory (Here). The fix changed the algorithm to wait for the ack from all nodes and to fall back to the timeout only if some messages are never answered. In other words, the node now waits until the responses come from the nodes or the timeout expires:
```rust
// ...
// Wait until the pending operation (opp) is acknowledged by the other nodes,
// or give up once the election timeout expires.
while opp.is_none() && start_time < *NUN_ELECTION_TIMEOUT {
    log::debug!("Waiting for opp to be registered");
    thread::sleep(time::Duration::from_millis(2));
    start_time = start_time + 2; // milliseconds waited so far
    opp = dbs.get_pending_opp_copy(id);
}
```
- After the fix, the same experiment works as expected:
```sh
make print-election-primary print-election-2 print-election-3 | grep Primary
# Response "valid auth\n;cluster-state toxiproxy:3017(self):Primary , toxiproxy:3018(Connected):Secoundary, toxiproxy:3019(Connected):Secoundary,\n"
# Response "valid auth\n;cluster-state toxiproxy:3017(Connected):Primary, toxiproxy:3018(self):Secoundary , toxiproxy:3019(Connected):Secoundary,\n"
# Response "valid auth\n;cluster-state toxiproxy:3017(Connected):Primary, toxiproxy:3018(Connected):Secoundary, toxiproxy:3019(self):Secoundary ,\n"
```
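Since one of the goals was running this in CI, the same output can back a simple assertion. A minimal sketch, assuming the response format shown above (this script is not part of the repo):

```sh
#!/bin/sh
# Count how many distinct nodes are reported as Primary across all three answers;
# anything other than exactly one means the cluster is split.
primaries=$(make print-election-primary print-election-2 print-election-3 \
  | grep -o 'toxiproxy:[0-9]\{4\}[^,]*:Primary' \
  | grep -o 'toxiproxy:[0-9]\{4\}' \
  | sort -u | wc -l)
if [ "$primaries" -ne 1 ]; then
  echo "Split brain detected: $primaries different primaries reported"
  exit 1
fi
```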
PS: Election message exchange
If you are not familiar with the Nun-db election process, the next diagram can help you understand when the pending operations are resolved and acknowledged by the secondaries. The protocol is based on `rp` messages and an `ack` sent back when each message is received successfully.
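Roughly, the exchange looks like this (an illustrative trace based on the description above; this is not Nun-db's literal wire format):

```
candidate   -> secondary-1 : rp <op-id> <message>
candidate   -> secondary-2 : rp <op-id> <message>
secondary-1 -> candidate   : ack <op-id>
secondary-2 -> candidate   : ack <op-id>
-- the pending op is resolved once every ack arrives (or the election timeout fires)
```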
Conclusion
Toxiproxy made it really easy to replicate the problem locally, and it is now part of my daily driver for working with Nun-db. This bug could have caused me real problems much later down the line, and the simple fact that I set up the experiment to mention it in a post and ended up finding and fixing a bug already shows its power.