The new storage engine of Nun-db
When I started Nun-db, I intended it to be an in-memory database, for several reasons. For obvious reasons, I eventually needed to snapshot the database to disk. At that time, I decided not to spend too much time thinking about how the data would be laid out on disk. Therefore, I decided to use a serialization lib and move on with the more important stuff at the time. I used the lib bincode, which did a good job back then and took less than 2 hours to implement. I knew the time to write my own serialization structure and disk storage would come eventually. Well, that time is now.
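To give an idea of what that first version looked like, here is a minimal sketch of the approach, not the actual Nun-db code: the function name, the map type, and the file handling are my own illustration.

// Sketch of the old approach (not the real Nun-db code): the whole in-memory
// map is serialized with bincode 1.x and written to disk in one shot,
// whether the individual keys changed or not.
use std::collections::HashMap;
use std::fs;

fn snapshot_all(map: &HashMap<String, String>, path: &str) {
    // Serialize the entire map into one byte buffer
    let bytes = bincode::serialize(map).expect("Error serializing the db map");
    fs::write(path, bytes).expect("Error writing the snapshot file");
}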
Why a new storage engine?
As I am working to make Nun-db a leaderless database, I needed to introduce a version to the values of my keys so I can easily find conflicts between nodes.
To add that new field to disk, I would need to change the disk structure, so I used the opportunity to build a more definitive version of the storage engine.
I read the paper Faster: A Concurrent Key-Value Store with In-Place Updates when I was starting Nun-db and have been thinking about it ever since. I decided to bring some of its ideas to Nun-db. You don’t need to read the paper to follow this post, but I will leave the link here in case you are curious.
Goals for the new storage engine
Faster updates
Using bincode made my storage engine save all the values, whether they were updated or not, and that made it “slow”. I would like the new engine to store only the keys that have changed, so it would be faster.
Append-only fast storage
According to the paper, and after doing some experiments, I observed that append-only files write much faster, so I was looking for a structure that could work as append-only.
In-place updates
I would need to support in-place updates to make the faster updates possible.
How?
In-place update implementation
The in-place update is one of the biggest reasons for this new implementation.
To store only the updated keys, I needed to change the model in memory, adding a status. The next block of code shows the changes in the Value struct. The most important point, for now, is the ValueStatus. That status holds the state of the key in memory.
pub struct Value {
    pub value: String,
    pub version: i32,
    pub state: ValueStatus,   // New
    pub value_disk_addr: u64, // New
    pub key_disk_addr: u64,   // New
}

#[derive(Clone, PartialEq, Copy, Debug)]
pub enum ValueStatus {
    /// Value is ok, memory == disk values
    Ok = 0,
    /// Value is updated in memory but not yet stored on disk
    Updated = 2,
    /// Value needs to be deleted on disk
    Deleted = 1,
    /// Value is new, it is not present on disk yet
    New = 3,
}
That model made it possible for us to easily control the flow of information and what needs to be updated or not.
The first major improvement is the ability to filter the keys that need to be updated on disk.
let mut keys_to_update = vec![];
{
    let data = db.map.read().expect("Error getting the db.map.read");
    data.iter()
        .filter(|&(_k, v)| v.state != ValueStatus::Ok)
        .for_each(|(k, v)| keys_to_update.push((k.clone(), v.clone())))
}; // Release the lock
As the next block of code shows, we append to the values file in all cases, update the keys file in place for Deleted or Updated keys, and append to it for New keys.
for (key, value) in keys_to_update {
    match value.state {
        ValueStatus::New => {
            // Append to the values file
            let record_size = write_value(&mut values_file, &value, status);
            log::debug!(
                "Write key: {}, addr: {} value_addr: {} ",
                key,
                next_key_addr,
                value_addr
            );
            // Append to the keys file
            let key_size = write_key(&mut keys_file, &key, &value, value_addr);
            value_addr = value_addr + record_size;
            db.set_value_as_ok(&key, &value, value_addr, next_key_addr);
            next_key_addr = next_key_addr + key_size;
        }
        ValueStatus::Updated => {
            // Append to the values file
            let record_size = write_value(&mut values_file, &value, status);
            // In-place update in the keys file
            update_key(
                &mut keys_file_write,
                &key,
                value.version,
                value_addr,
                value.key_disk_addr,
            );
            value_addr = value_addr + record_size;
            db.set_value_as_ok(&key, &value, value_addr, value.key_disk_addr);
        }
        ValueStatus::Deleted => {
            // In-place update in the keys file
            update_key(
                &mut keys_file_write,
                &key,
                VERSION_DELETED, // Version -1 means deleted
                0,
                value.key_disk_addr,
            );
        }
        ValueStatus::Ok => panic!("Values Ok should never get here"),
    }
}
This strategy is good for speed, but it can leave unused space in the values and keys files over time. I implemented a space-reclaiming alternative: a slower operation that saves space on disk.
The on-disk structure
# keys file
+-------------++---------------------++---------------------+
| key         || version             || value_addr          |
| length:rest || 4 bytes             || 8 bytes             |
+-------------++---------------------++---------------------+
## If the version is -1, the key is deleted

# Values file
+-------------++---------------------+
| value       || status              |
| length:rest || 4 bytes             |
+-------------++---------------------+
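To make the layout concrete, here is a minimal sketch of how a key record could be written following the table above. The function name, the 4-byte length prefix, and the little-endian encoding are my assumptions for illustration, not the actual Nun-db format.

use std::io::{self, Write};

// Sketch: write one key record (length-prefixed key, 4-byte version, 8-byte value address).
fn write_key_record<W: Write>(
    out: &mut W,
    key: &str,
    version: i32,
    value_addr: u64,
) -> io::Result<u64> {
    let key_bytes = key.as_bytes();
    out.write_all(&(key_bytes.len() as u32).to_le_bytes())?; // key length prefix (assumed 4 bytes)
    out.write_all(key_bytes)?;                               // the "rest": the key itself
    out.write_all(&version.to_le_bytes())?;                  // 4 bytes, -1 means deleted
    out.write_all(&value_addr.to_le_bytes())?;               // 8 bytes, offset in the values file
    // Record size, needed to compute the next key_disk_addr
    Ok((4 + key_bytes.len() + 4 + 8) as u64)
}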
The migration
As this is a new implementation of the storage engine, one of the most important points is how we will migrate from the old engine to the new engine.
For that, we made sure to implement a test that ensures the new version can restore data written by the old engine, and as soon as the next snapshot happens, it is stored in the new format.
Basically, the storage changed from a single file $name-nun.data to two files, $name-nun.data.keys and $name-nun.data.values; the block of code after this paragraph sketches the fallback logic involved.
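This is a minimal sketch of that fallback, assuming hypothetical names (RestoreSource, restore_source); it illustrates the idea rather than the actual Nun-db implementation.

use std::path::Path;

// Which format to restore from, decided by the files present on disk.
enum RestoreSource {
    NewFormat, // $name-nun.data.keys + $name-nun.data.values
    OldFormat, // single bincode file $name-nun.data
    Empty,     // nothing stored yet
}

fn restore_source(db_name: &str) -> RestoreSource {
    if Path::new(&format!("{}-nun.data.keys", db_name)).exists() {
        RestoreSource::NewFormat
    } else if Path::new(&format!("{}-nun.data", db_name)).exists() {
        // Old snapshot found: load it into memory; the next snapshot
        // will already be written in the new two-file format.
        RestoreSource::OldFormat
    } else {
        RestoreSource::Empty
    }
}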
Reclaiming space
It is obvious that the current storage engine is much faster than the previous one. Still, it wastes a lot of space in order to do append-only writes in the values file. I need to implement the last part of this algorithm to reclaim that space. Of course, this will be slower than the normal operation. I procrastinated for at least one week on this last part: the gain is very little, and the feature probably won't be needed for a while. I was also not finding an easy way to test it. Now, as I am writing this text, I actually had an idea of how to test it, and I am picking it up again. The idea is to make the storage method return the number of stored keys as a metric.
This change actually made my tests even better, because now I have a fine-grained way to check how many changes the storage method made, and I can assert I am not updating what I did not mean to update.
That decision was easy. The point is, where do I add that to the external API? … My initial idea is to add an optional parameter to the snapshot command, called reclaim or diff, where reclaim makes the command reclaim space.
I think the best way to implement this is to add an additional parameter to the function storage_data_disk and use it from the external-facing APIs, as the sketch below illustrates.
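Here is a minimal sketch of what that could look like. The Database placeholder, the function body, and the meaning of the boolean are my assumptions; only the shape of the call matches the tests shown later.

struct Database; // placeholder for the real Nun-db database type

fn storage_data_disk(_db: &Database, db_name: &str, reclaim_space: bool) -> usize {
    if reclaim_space {
        // Slow path: rewrite $name-nun.data.keys and $name-nun.data.values
        // from scratch, dropping the space left behind by append-only writes.
        println!("Reclaiming space for {}", db_name);
    } else {
        // Fast path: append changed values, update changed keys in place.
        println!("Appending changes for {}", db_name);
    }
    0 // number of keys stored, useful as a metric in the tests
}

fn main() {
    let db = Database;
    storage_data_disk(&db, "sample", false); // regular fast snapshot
    storage_data_disk(&db, "sample", true);  // explicit space-reclaiming snapshot
}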
Performance
The before and after performance of this implementation are not really comparable. I did a lot of comparing while I was coding this feature, but I realized it is not fair to compare: this version has lots of optimizations on what to store, and the previous one had none, so it would store all the data every time.
Instead of comparing with the last version, I decided to add some performance tests to my test suite.
- Make sure 10k changed keys can be stored in less than 100ms.
- Make sure 10k changed keys can be read from disk in less than 100ms.
#[test]
fn restore_should_be_fast() {
    let (dbs, db_name, db) = create_db_with_10k_keys();
    clean_all_db_files(&db_name);

    let start = Instant::now();
    let changed_keys = storage_data_disk(&db, &db_name, false);
    log::info!("Time {:?}", start.elapsed());
    assert!(start.elapsed().as_millis() < 100);
    assert_eq!(changed_keys, 10000);

    let start_load = Instant::now();
    let db_file_name = file_name_from_db_name(&db_name);
    let (loaded_db, _) = create_db_from_file_name(&db_file_name, &dbs);
    let time_in_ms = start_load.elapsed().as_millis();
    log::info!("Time to load {:?}ms", time_in_ms);
    assert!(start_load.elapsed().as_millis() < 100);
    // ...
I did the same to ensure single-key updates are fast.
- One key update in a 10k-key dataset has to take less than 2ms.
//...
let value = loaded_db.get_value(String::from("key_1")).unwrap();
let value_100 = loaded_db.get_value(String::from("key_100")).unwrap();
let value_1000 = loaded_db.get_value(String::from("key_1000")).unwrap();
assert_eq!(value.state, ValueStatus::Updated);
assert_eq!(value_100.state, ValueStatus::Ok);
assert_eq!(value_100.value, String::from("key_100"));
assert_eq!(value_1000.value, String::from("key_1000"));

let start_secount_storage = Instant::now();
let changed_keys = storage_data_disk(&loaded_db, &db_name, false);
assert_eq!(changed_keys, 1);
let time_in_ms = start_secount_storage.elapsed().as_millis();
log::debug!("Time to update {:?}ms", time_in_ms);
assert!(time_in_ms < 1);
//...
Smell of good design
I only needed to change bo (business objects) and disk_ops to implement this feature. That is a good smell: the boundaries of my code are still in good shape up to this point. Good design makes it easy to change.
Thinking about a bigger-than-memory dataset as fast as in-memory
Since I started NunDB, the goal has been to be a fast in-memory database. Today, analyzing the applications running on it, I think that was a very good decision and still is, so NunDB will remain a memory database for the foreseeable future. On the other hand, as I design these disk features and think more and more about how to test their performance, I believe it would be possible to address datasets bigger than memory and still deliver great performance. I don’t want to pursue that for now, but I will keep it in mind as I work on the next features we will implement.
Conclusion
Writing this new storage engine for Nun-db was the most fun I have had in the project for a while. Yet it was exhausting at some points. Having a small test to fix was what kept me going in the hard times. It took me more than one month to shape this up and merge it.
I had only one major issue after merging it here. It was a missing edge case where data was deleted before the database had ever been snapshotted to disk. I call that a success.
The issue was dramatic and cost me at least 1 hour of manual work in my prod environment to recover the corrupted data. However, it was a minor problem overall, and the engine worked like a charm once I deployed it. Deployment was boring, as it should always be.
Now it is time to start working on the leaderless features of NunDB, and it will be ready to scale to infinity. I needed this last change to make it safe to move to that part of the challenge. I am looking forward to starting, see you at the next one.