The new storage engine of Nun-db
When I started Nun-db, I intended it to be an in-memory database, for several reasons. For obvious reasons, I eventually needed to snapshot the database to disk. At that time, I decided not to spend too much time thinking about how the data would be laid out on disk. Therefore, I decided to use a serialization lib and move on with the more important stuff at the time. I used the lib bincode, which did a good job back then and took less than 2 hours to implement. I knew the time to write my own serialization structure and disk storage would come eventually. Well, that time is now.
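To give an idea of what that first version looked like, here is a minimal sketch of the approach, not the actual Nun-db code: the function name, the map type, and the file handling are my own illustration.

// Sketch of the old approach (not the real Nun-db code): the whole in-memory
// map is serialized with bincode 1.x and written to disk in one shot,
// whether the individual keys changed or not.
use std::collections::HashMap;
use std::fs;

fn snapshot_all(map: &HashMap<String, String>, path: &str) {
    // Serialize the entire map into one byte buffer
    let bytes = bincode::serialize(map).expect("Error serializing the db map");
    fs::write(path, bytes).expect("Error writing the snapshot file");
}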
Why a new storage engine?
As I am working to make Nun-db a leaderless database, I needed to introduce a version to the values of my keys so I can easily find conflicts between nodes.
To add that new field to disk, I would need to change the disk structure, so I used the opportunity to build a more definitive version of the storage engine.
I read the paper Faster: A Concurrent Key-Value Store with In-Place Updates when I was starting Nun-db and have been thinking about it ever since. I decided to bring some of its ideas to Nun-db. You don’t need to read the paper to follow this post, but I will leave the link here in case you are curious.
Goals for the new storage engine
Faster updates
Using bincode made my storage engine save all the values, whether they were updated or not, and that made it “slow”. I would like the new engine to store only the keys that have changed, so it would be faster.
Append-only fast storage
According to the paper, and after doing some experiments, I observed that append-only files write much faster, so I was looking for a structure that could work as append-only.
In-place updates
I would need to support in-place updates to make the faster updates possible.
How?
In-place update implementation
The in-place update is one of the biggest reasons for this new implementation.
To store only the updated keys, I needed to change the model in memory, adding a status. The next block of code shows the changes in the Value struct. The most important point, for now, is the ValueStatus. That status holds the state of the key in memory.
pub struct Value {
    pub value: String,
    pub version: i32,
    pub state: ValueStatus,   // New
    pub value_disk_addr: u64, // New
    pub key_disk_addr: u64,   // New
}

#[derive(Clone, PartialEq, Copy, Debug)]
pub enum ValueStatus {
    /// Value is ok, memory == disk values
    Ok = 0,
    /// Value is updated in memory but not yet stored on disk
    Updated = 2,
    /// Value needs to be deleted on disk
    Deleted = 1,
    /// Value is new, it is not present on disk yet
    New = 3,
}
That model made it possible for us to easily control the flow of information and what needs to be updated or not.
The first major improvement is the ability to filter the keys that need to be updated on disk.
let mut keys_to_update = vec![];
{
    let data = db.map.read().expect("Error getting the db.map.read");
    data.iter()
        .filter(|&(_k, v)| v.state != ValueStatus::Ok)
        .for_each(|(k, v)| keys_to_update.push((k.clone(), v.clone())))
}; // Release the lock
As the next block of code shows, we append to the values file in all cases, update the keys file in place for Deleted or Updated keys, and append to it for New keys.
for (key, value) in keys_to_update {
    match value.state {
        ValueStatus::New => {
            // Append to the values file
            let record_size = write_value(&mut values_file, &value, status);
            log::debug!(
                "Write key: {}, addr: {} value_addr: {} ",
                key,
                next_key_addr,
                value_addr
            );
            // Append to the keys file
            let key_size = write_key(&mut keys_file, &key, &value, value_addr);
            value_addr = value_addr + record_size;
            db.set_value_as_ok(&key, &value, value_addr, next_key_addr);
            next_key_addr = next_key_addr + key_size;
        }
        ValueStatus::Updated => {
            // Append to the values file
            let record_size = write_value(&mut values_file, &value, status);
            // In-place update in the keys file
            update_key(
                &mut keys_file_write,
                &key,
                value.version,
                value_addr,
                value.key_disk_addr,
            );
            value_addr = value_addr + record_size;
            db.set_value_as_ok(&key, &value, value_addr, value.key_disk_addr);
        }
        ValueStatus::Deleted => {
            // In-place update in the keys file
            update_key(
                &mut keys_file_write,
                &key,
                VERSION_DELETED, // Version -1 means deleted
                0,
                value.key_disk_addr,
            );
        }
        ValueStatus::Ok => panic!("Values Ok should never get here"),
    }
}
This strategy is good for speed, but it can leave unused space in the values and keys files over time. I implemented a space-reclaiming alternative: a slower operation that saves space on disk.
The on-disk structure
# keys file
+-------------++---------------------++---------------------+
| key         || version             || value_addr          |
| length:rest || 4 bytes             || 8 bytes             |
+-------------++---------------------++---------------------+
## If the version is -1, the key is deleted

# Values file
+-------------++---------------------+
| value       || status              |
| length:rest || 4 bytes             |
+-------------++---------------------+
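To make the layout concrete, here is a minimal sketch of how a key record could be written following the table above. The function name, the 4-byte length prefix, and the little-endian encoding are my assumptions for illustration, not the actual Nun-db format.

use std::io::{self, Write};

// Sketch: write one key record (length-prefixed key, 4-byte version, 8-byte value address).
fn write_key_record<W: Write>(
    out: &mut W,
    key: &str,
    version: i32,
    value_addr: u64,
) -> io::Result<u64> {
    let key_bytes = key.as_bytes();
    out.write_all(&(key_bytes.len() as u32).to_le_bytes())?; // key length prefix (assumed 4 bytes)
    out.write_all(key_bytes)?;                               // the "rest": the key itself
    out.write_all(&version.to_le_bytes())?;                  // 4 bytes, -1 means deleted
    out.write_all(&value_addr.to_le_bytes())?;               // 8 bytes, offset in the values file
    // Record size, needed to compute the next key_disk_addr
    Ok((4 + key_bytes.len() + 4 + 8) as u64)
}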
The migration
As this is a new implementation of the storage engine, one of the most important points is how we will migrate from the old engine to the new engine.
For that, we made sure to implement a test that ensures the new version can restore data written by the old engine, and as soon as the next snapshot happens, it is stored in the new format.
Basically, the storage changed from a single file $name-nun.data to two files, $name-nun.data.keys and $name-nun.data.values; the block of code after this paragraph sketches the fallback logic involved.
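This is a minimal sketch of that fallback, assuming hypothetical names (RestoreSource, restore_source); it illustrates the idea rather than the actual Nun-db implementation.

use std::path::Path;

// Which format to restore from, decided by the files present on disk.
enum RestoreSource {
    NewFormat, // $name-nun.data.keys + $name-nun.data.values
    OldFormat, // single bincode file $name-nun.data
    Empty,     // nothing stored yet
}

fn restore_source(db_name: &str) -> RestoreSource {
    if Path::new(&format!("{}-nun.data.keys", db_name)).exists() {
        RestoreSource::NewFormat
    } else if Path::new(&format!("{}-nun.data", db_name)).exists() {
        // Old snapshot found: load it into memory; the next snapshot
        // will already be written in the new two-file format.
        RestoreSource::OldFormat
    } else {
        RestoreSource::Empty
    }
}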
Reclaiming space
It is obvious that the current storage engine is much faster than the previous one. Still, it wastes a lot of space in order to do append-only writes in the values file. I need to implement the last part of this algorithm to reclaim that space. Of course, this will be slower than the normal operation. I procrastinated for at least one week on this last part: the gain is very little, and the feature probably won't be needed for a while. I was also not finding an easy way to test it. Now, as I am writing this text, I actually had an idea of how to test it, and I am picking it up again. The idea is to make the storage method return the number of stored keys as a metric.
This change actually made my tests even better, because now I have a fine-grained way to check how many changes the storage method made, and I can assert I am not updating what I did not mean to update.
That decision was easy. The point is, where do I add that to the external API? … My initial idea is to add an optional parameter to the snapshot command, called reclaim or diff, where reclaim makes the command reclaim space.
I think the best way to implement this is to add an additional parameter to the function storage_data_disk and use it from the external-facing APIs, as the sketch below illustrates.
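Here is a minimal sketch of what that could look like. The Database placeholder, the function body, and the meaning of the boolean are my assumptions; only the shape of the call matches the tests shown later.

struct Database; // placeholder for the real Nun-db database type

fn storage_data_disk(_db: &Database, db_name: &str, reclaim_space: bool) -> usize {
    if reclaim_space {
        // Slow path: rewrite $name-nun.data.keys and $name-nun.data.values
        // from scratch, dropping the space left behind by append-only writes.
        println!("Reclaiming space for {}", db_name);
    } else {
        // Fast path: append changed values, update changed keys in place.
        println!("Appending changes for {}", db_name);
    }
    0 // number of keys stored, useful as a metric in the tests
}

fn main() {
    let db = Database;
    storage_data_disk(&db, "sample", false); // regular fast snapshot
    storage_data_disk(&db, "sample", true);  // explicit space-reclaiming snapshot
}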
Performance
The before and after performance of this implementation are not really comparable. I did a lot of comparing while I was coding this feature, but I realized it is not fair to compare: this version has lots of optimizations on what to store, and the previous one had none, so it would store all the data every time.
Instead of comparing with the last version, I decided to add some performance tests to my test suite.
- Make sure 10k changed keys can be stored in less than 100ms.
- Make sure 10k changed keys can be read from disk in less than 100ms.
#[test]
fn restore_should_be_fast() {
    let (dbs, db_name, db) = create_db_with_10k_keys();
    clean_all_db_files(&db_name);

    let start = Instant::now();
    let changed_keys = storage_data_disk(&db, &db_name, false);
    log::info!("Time {:?}", start.elapsed());
    assert!(start.elapsed().as_millis() < 100);
    assert_eq!(changed_keys, 10000);

    let start_load = Instant::now();
    let db_file_name = file_name_from_db_name(&db_name);
    let (loaded_db, _) = create_db_from_file_name(&db_file_name, &dbs);
    let time_in_ms = start_load.elapsed().as_millis();
    log::info!("Time to load {:?}ms", time_in_ms);
    assert!(start_load.elapsed().as_millis() < 100);
    // ...
I did the same to ensure single-key updates are fast.
- One key update in a 10k-key dataset has to take less than 2ms.
//...
let value = loaded_db.get_value(String::from("key_1")).unwrap();
let value_100 = loaded_db.get_value(String::from("key_100")).unwrap();
let value_1000 = loaded_db.get_value(String::from("key_1000")).unwrap();
assert_eq!(value.state, ValueStatus::Updated);
assert_eq!(value_100.state, ValueStatus::Ok);
assert_eq!(value_100.value, String::from("key_100"));
assert_eq!(value_1000.value, String::from("key_1000"));

let start_secount_storage = Instant::now();
let changed_keys = storage_data_disk(&loaded_db, &db_name, false);
assert_eq!(changed_keys, 1);
let time_in_ms = start_secount_storage.elapsed().as_millis();
log::debug!("Time to update {:?}ms", time_in_ms);
assert!(time_in_ms < 1);
//...
Smell of good design
I only needed to change bo (business objects) and disk_ops to implement this feature. That is a good smell: the boundaries of my code are still in good shape up to this point. Good design makes it easy to change.
Thinking about a bigger-than-memory dataset as fast as in-memory
Since I started NunDB, the goal has been to be a fast in-memory database. Today, analyzing the applications running on it, I think that was a very good decision and still is, so NunDB will remain a memory database for the foreseeable future. On the other hand, as I design these disk features and think more and more about how to test their performance, I believe it would be possible to address datasets bigger than memory and still deliver great performance. I don’t want to pursue that for now, but I will keep it in mind as I work on the next features we will implement.
Conclusion
Writing this new storage engine for Nun-db was the most fun I have had in the project for a while. Yet it was exhausting at some points. Having a small test to fix was what kept me going in the hard times. It took me more than one month to shape this up and merge it.
I had only one major issue after merging it here. It was a missing edge case where data was deleted before the database had ever been snapshotted to disk. I call that a success.
The issue was dramatic and cost me at least 1 hour of manual work in my prod environment to recover the corrupted data. However, it was a minor problem overall, and the engine worked like a charm once I deployed it. Deployment was boring, as it should always be.
Now it is time to start working on the leaderless features of NunDB, and it will be ready to scale to infinity. I needed this last change to make it safe to move to that part of the challenge. I am looking forward to starting, see you at the next one.