Stop procrastinating and just fix your flaky tests; they may be catching nasty bugs
I believe every developer knows this in their heart, yet sometimes we still ignore it. I procrastinated on one flaky test in my project Nun-db for a long time.
It only happened on GitHub Actions, and I could never replicate it locally, so I got used to re-running the build and moving on. Since I only open or merge one or two PRs a week on Nun-db, it ran like that for longer than I would like.
This Saturday morning, while I was at a coffee shop, there was a broken build on the master branch from a PR I had merged two hours before. It kills me to see a broken build on master in my projects, so I decided to fix it and not accept this flakiness anymore.
To my surprise, it was not a flaky test; it was a real problem. The code was flaky, not the test.
Allow me to give some context first. The test I am talking about starts three Nun-db processes as a replica set and sends a command to make sure the value is replicated correctly to all of them. You can see it here.
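To make that concrete, here is a rough sketch of what a multi-process test like that can look like. This is not the actual Nun-db test: the binary name, CLI flags, ports, and helper below are assumptions for illustration only (the flags mirror the parameters that `start_db` takes, shown later in this post).

```rust
// Hedged sketch of a multi-process replication test, NOT the real
// should_replicate_as_expected_set_safe test. Binary name, flags,
// and ports are illustrative assumptions.
use std::process::{Child, Command};
use std::thread;
use std::time::Duration;

// Hypothetical helper: spawns one nun-db process as part of the replica set.
fn start_replica(tcp_port: u16, replicate_to: Option<&str>) -> Child {
    let mut cmd = Command::new("nun-db"); // assumed binary name
    cmd.args(["-u", "user", "-p", "pwd", "start"]) // assumed auth flags/subcommand
        .arg("--tcp-address")
        .arg(format!("127.0.0.1:{}", tcp_port));
    if let Some(primary) = replicate_to {
        cmd.arg("--replicate-address").arg(primary);
    }
    cmd.spawn().expect("failed to start nun-db process")
}

#[test]
fn should_replicate_as_expected_sketch() {
    // One primary and two secondaries that join it.
    let mut primary = start_replica(3017, None);
    thread::sleep(Duration::from_secs(1));
    let mut secondary_1 = start_replica(3018, Some("127.0.0.1:3017"));
    let mut secondary_2 = start_replica(3016, Some("127.0.0.1:3017"));
    thread::sleep(Duration::from_secs(1));

    // Send a `set` command to the primary, then read the key back from
    // each secondary and assert the value was replicated (protocol
    // details omitted here; the real test does this over TCP).
    // ...

    for child in [&mut primary, &mut secondary_1, &mut secondary_2] {
        child.kill().ok();
    }
}
```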
After reading the test, I was convinced there was nothing wrong with the test itself, so I decided to look at the CI logs, shown next. I omitted most of the log for simplicity, but paying particular attention to the lines starting with # will help you understand what is happening.
```
2022-03-19T13:09:30.4500089Z test tests::should_replicate_as_expected_set_safe ... [2022-03-19T13:09:30.449713727Z nun_db] nundb starting!
# Primary starts fine
2022-03-19T13:09:35.6648602Z [2022-03-19T13:09:35.664777173Z nun_db::lib::process_request] [139909448636160] Server processed message 'join 127.0.0.1:3018
2022-03-19T13:09:35.6671592Z [2022-03-19T13:09:35.667074838Z nun_db::lib::process_request] Setting 127.0.0.1:3017 as primary!
# Secondary 1 starts fine
# Secondary 2 starts
2022-03-19T13:09:40.8938860Z [2022-03-19T13:09:40.893781615Z nun_db::lib::process_request] [139909444421376] Server processed message 'join 127.0.0.1:3016
# Primary fails to connect to Secondary 2
2022-03-19T13:09:40.8949291Z [2022-03-19T13:09:40.894793987Z nun_db::lib::replication_ops] Could not stablish the connection to replicate to 127.0.0.1:3016 error : Connection refused (os error 111)
# Test fails here
```
What does that mean?
The primary process failed to connect to the secondary as if it were offline. But in the Nun-db flow, a new secondary joining the cluster is the one that tells the others it is alive, so it must already be up; this should never happen. Well, it did, so let's see in the code what caused it.
The bug is in the next block of code; let’s check it together.
```rust
fn start_db(
    user: &str,
    pwd: &str,
    ws_address: &str,
    http_address: &str,
    tcp_address: &str,
    replicate_address: &str,
) -> Result<(), String> {
    //... omitted for simplicity
    let db_replication_start = dbs.clone();
    let tcp_address_to_relication = Arc::new(tcp_address.to_string());
    //... start replication threads
    let join_thread = thread::spawn(move || {
        // Here we notify the other nodes that this process has started
        lib::replication_ops::ask_to_join_all_replicas(
            &replicate_address_to_thread,
            &tcp_address_to_election.to_string(),
            &dbs_self_election.user.to_string(),
            &dbs_self_election.pwd.to_string(),
        );
        lib::election_ops::start_inital_election(dbs_self_election)
    });

    // Other steps here
    let _snapshot_thread =
        thread::spawn(|| lib::disk_ops::start_snap_shot_timer(timer, db_snap));

    // Start ws connections
    // Start http connection

    // Only now do we start the TCP listener (nodes communicate over TCP) (Oops!)
    let tcp_thread =
        thread::spawn(move || lib::network::tcp_ops::start_tcp_client(dbs_tcp, &tcp_address));
    //... omitted
}
```
So, looking at it now, it seems evident that I should never notify another process before preparing the current one to respond to it, in this case, before starting the TCP connection listener.
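To see why the ordering matters, here is a minimal, self-contained sketch of the same kind of race, outside of Nun-db and using nothing beyond the standard library: the peer answers the join notification and connects back before the local listener is bound, and gets the same kind of "Connection refused" that showed up in the CI log.

```rust
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

// Minimal sketch of the race, not Nun-db code: the peer connects back
// as soon as it is notified, but the local TCP listener is not up yet.
fn main() {
    let addr = "127.0.0.1:3016";

    // The peer (think: the primary answering a `join`) connects back immediately...
    let peer = thread::spawn(move || match TcpStream::connect(addr) {
        Ok(_) => println!("replication connection established"),
        // ...and loses the race: nothing is listening yet, so we get "Connection refused".
        Err(e) => println!("could not establish the connection: {}", e),
    });

    // The local listener (think: start_tcp_client) only starts afterwards.
    thread::sleep(Duration::from_millis(100));
    let _listener = TcpListener::bind(addr).expect("bind failed");

    peer.join().unwrap();
}
```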
Well, you already know the fix at this point: move the TCP start thread up, so the listener is running before the process asks to join the other replicas. That is what I did in the PR.
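Here is the same reordering applied to the small sketch above (not the actual PR diff): bind the listener before notifying anyone, and the race is gone.

```rust
use std::net::{TcpListener, TcpStream};
use std::thread;

fn main() {
    let addr = "127.0.0.1:3016";

    // Listener first: the process can answer before anyone is told to connect back.
    let _listener = TcpListener::bind(addr).expect("bind failed");

    // Only then notify the peer, which now connects successfully.
    let peer = thread::spawn(move || {
        TcpStream::connect(addr).expect("connection should succeed now");
        println!("replication connection established");
    });

    peer.join().unwrap();
}
```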
Conclusion
Dear future me, flaky tests may not be just flaky tests; they may be hiding a real issue in your system. The code may be flaky, not the test, and that is especially true for race conditions like this one. When dealing with distributed systems, this is even more critical. DO NOT IGNORE FLAKY TESTS, thanks.