Fastest Vec Update on My Computer
“Premature optimization is the root of all evil.” At least that is what ‘big java’ wants you to think. But if I wasn’t worried about optimizations I’d be using linked lists in Haskell. I use Rust because I want speed. If you are here for speed, then keep reading. In this post you’ll find a detailed account of how I created a problem, solved the problem, and then optimized the solution way too much.
The Problem
The original problem was finding a good problem (yes, I see the irony) to build better intuition for Data-Oriented Programming. Instead of looking up good problems or following a tutorial, I thought that updating a list really fast sounded fun. It is doubtful that anyone will ever use this exact interface to solve this exact problem, but hopefully it is still a fun little exercise for you.
The Problem, Actually
In general we want a function that takes in 2 lists, &[u8] and &[Update], and applies each update to the list in order. The “updated” list can either be a new list or the existing list. Updates come in 2 forms: Remove(usize) and Insert(usize, u8). We don’t need a change variant because Remove(i) combined with Insert(i, x) is the same as Change(i, x).
enum Update {
    Remove(usize),
    Insert(usize, u8),
}
Constraints and Requirements from Up Above
Any good problem has constraints. These are the constraints given to us from on high, or at least the ones I chose to give myself.
- The updates list must be sorted: Insert(i, x) < Remove(i) < Insert(i + 1, y)
- Inserts must be applied in order
  - so [Insert(3, 4), Insert(3, 5)] is essentially equivalent to Insert(3, [4, 5])
- Removes are idempotent
- You cannot remove or insert outside the list
- Inserts are applied before their index
- All updates are applied to the indices of the initial list
- It must be as fast as possible
- It is expected to operate on large lists of data
Pseudo Example
We want it to work like this:
let initial_values = [1, 2, 2, 3, 7];
let updates = [
    Insert(0, 0_u8),
    Remove(1),
    Insert(4, 4_u8),
    Insert(4, 5_u8),
    Remove(4),
];
// output becomes:
// [0, 1, 2, 3, 4, 5]
Source Code
For those interested, here’s the source code. I used Nix to pin the exact version of the nightly compiler I am using.
All of the benchmarks use the Criterion library.
I performed all benchmarks on an M1 MacBook Pro (running Asahi Linux).
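For reference, a single bench looks roughly like this (a simplified sketch: it assumes Update and one of the fast_update implementations below are in scope, and the data generation here is made up):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_update(c: &mut Criterion) {
    // hypothetical data: 100,000 elements and 5,000 sorted, in-bounds removes
    let input: Vec<u8> = (0..100_000u32).map(|i| i as u8).collect();
    let updates: Vec<Update> = (0..5_000usize).map(|i| Update::Remove(i * 20)).collect();
    c.bench_function("fast_update", |b| {
        b.iter(|| fast_update(black_box(&input), black_box(&updates)))
    });
}

criterion_group!(benches, bench_update);
criterion_main!(benches);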
Naive Solution
Rust has a built-in vector type with the methods insert and remove. Exactly what we need.
fn fast_update(input: &mut Vec<u8>, updates: &[Update]) {
    for update in updates {
        match *update {
            Update::Remove(index) => {
                input.remove(index);
            }
            Update::Insert(index, value) => {
                input.insert(index, value);
            }
        }
    }
}
Super simple, except it’s wrong. This will only work for the very first update. We need to keep track of 2 more things: how much the indexes have changed, and if we’ve already removed that index.
More interesting than it being wrong is that it is extremely slow: O(n * m) to be exact (where n is the input length and m is the number of updates). This is because when you remove and insert, you have to shift everything else after it over. Even if we got the “correct” answer I would still consider this solution to be incorrect because it is way too slow. If you’d like to spend your time waiting on a “correct” solution you can implement it yourself, but I’d rather spend my time optimizing a problem no one asked me to.
Big Oh Yeah
Let’s make it linear. Start by making a second list; this will be returned instead of mutating the input. This is an extra nice property to have, and one we will keep up until the end. Then we will take elements from either input or updates depending on what the current update is.
pub fn fast_update(input: &[u8], updates: &[Update]) -> Vec<u8> {
    // you could also calculate the final size and use `with_capacity()`
    let mut v = Vec::new();
    let mut updates = updates.iter();
    let mut inputs = input.iter().enumerate();
    let mut update = updates.next();
    let mut next_input = inputs.next();
    // loop until no updates apply to current idx
    loop {
        match (update, next_input) {
            // skip `next_input` if removing at that index
            (Some(Update::Remove(index)), Some((idx, _))) if *index == idx => {
                update = updates.next();
                next_input = inputs.next();
            }
            // insert the value at `index`
            (Some(Update::Insert(index, insert_val)), Some((idx, _))) if *index == idx => {
                v.push(*insert_val);
                update = updates.next();
            }
            // copy the input value
            (_, Some((_, val))) => {
                v.push(*val);
                next_input = inputs.next();
            }
            // when both lists are empty, break
            // (we ignore some edge cases, like trailing updates, for simplicity)
            _ => break,
        }
    }
    v
}
You could also do this with the Peekable iterator, but I chose this approach because it seemed more spicy.
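If you are curious, a Peekable version might look something like this (an untested sketch using next_if; it ignores trailing inserts at the very end of the list):

pub fn fast_update_peekable(input: &[u8], updates: &[Update]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut updates = updates.iter().peekable();
    for (idx, &val) in input.iter().enumerate() {
        // emit every insert queued at this index
        while let Some(Update::Insert(_, insert_val)) =
            updates.next_if(|u| matches!(u, Update::Insert(i, _) if *i == idx))
        {
            out.push(*insert_val);
        }
        // consume any (idempotent) removes at this index and skip the element
        let mut removed = false;
        while updates
            .next_if(|u| matches!(u, Update::Remove(i) if *i == idx))
            .is_some()
        {
            removed = true;
        }
        if !removed {
            out.push(val);
        }
    }
    out
}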
The match version works, and it is fast. For an input size of 100,000 and an updates size of 5,000, we get the following results:
method | mean speed |
---|---|
slow | 5.2620 ms |
faster | 206.07 µs |
A very large speed up, to say the least. I would argue that this level of optimization is not premature for the problem as stated.
We aren’t Done Yet
If this was a basic programming tutorial created by “Big OOP”, then I’d say we are done. Likely I’d make an excuse about how we aren’t here for premature optimization. But that isn’t why I’m here! Let’s rebel against useless, over-quoted, misunderstood mantras about optimization. Fight Big OOP. The point here is to go supa fast, not just fast. We are using Rust, so let’s use Rust. Don’t stop here, but instead push onward to higher and faster planes of code.
Make it Faster?
What about an iterator? To start, we need a struct to store all of the state. I’m opting to store a slice and an index for each list, but you could also store an iterator. I’m also using a borrowing iterator so our function signature can still take a reference to the slice of updates. We can then collect that struct in our function body.
struct UpdateIter<'a> {
    updates: &'a [Update],
    update_idx: usize,
    input: &'a [u8],
    input_idx: usize,
}

impl<'a> Iterator for UpdateIter<'a> {
    type Item = &'a u8;

    #[inline]
    fn next(&mut self) -> Option<Self::Item> {
        todo!()
    }
}

fn fast_update(input: &[u8], updates: &[Update]) -> Vec<u8> {
    UpdateIter {
        updates,
        update_idx: 0,
        input,
        input_idx: 0,
    }
    .copied()
    .collect()
}
For the body of next we want to do one of 3 things: return an inserted value, return the next input value, or skip an input.
Let’s start with those first 2; they seem much simpler.
let next_update = self.updates.get(self.update_idx);
match next_update {
    // skipping the input value on removes is handled later
    Some(Update::Remove(index)) if *index == self.input_idx => todo!(),
    // insert the value at `index`
    Some(Update::Insert(index, insert_val)) if *index == self.input_idx => {
        self.update_idx += 1;
        Some(insert_val)
    }
    // otherwise, copy the next input value
    _ => {
        self.input_idx += 1;
        self.input.get(self.input_idx - 1)
    }
    // we ignore some edge cases for simplicity
}
The question remains: how do we skip on removes? The trick is to skip all removes before getting to that match statement.
loop {
    match self.updates.get(self.update_idx) {
        Some(Update::Remove(idx)) if self.input_idx == *idx => {
            self.update_idx += 1;
            self.input_idx += 1;
        }
        Some(Update::Remove(idx)) if self.input_idx > *idx => {
            // idempotent removal
            self.update_idx += 1;
        }
        _ => break,
    }
}
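Putting the skip loop and the match together, the whole body of next comes out roughly like this:

#[inline]
fn next(&mut self) -> Option<Self::Item> {
    // skip any removes that apply at (or were already applied before) this index
    loop {
        match self.updates.get(self.update_idx) {
            Some(Update::Remove(idx)) if self.input_idx == *idx => {
                self.update_idx += 1;
                self.input_idx += 1;
            }
            Some(Update::Remove(idx)) if self.input_idx > *idx => {
                // idempotent removal
                self.update_idx += 1;
            }
            _ => break,
        }
    }
    // then either emit an insert or the next input value
    match self.updates.get(self.update_idx) {
        Some(Update::Insert(index, insert_val)) if *index == self.input_idx => {
            self.update_idx += 1;
            Some(insert_val)
        }
        _ => {
            self.input_idx += 1;
            self.input.get(self.input_idx - 1)
        }
    }
}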
But is it fast? No. It was slower: for the same sample size as before, it had a mean time of 248.40 µs.
Splitting it up
What if we try splitting the updates into 2 parts, performing all of the inserts and then all of the removes?
First we will need to split the updates list up:
let (removes, inserts): (Vec<_>, Vec<_>) = updates
    .iter()
    // we have to make sure that removes happen at the correct offset
    .scan(0, |num_inserts, update| {
        Some(match update {
            Update::Remove(idx) => Update::Remove(idx + *num_inserts),
            Update::Insert(idx, val) => {
                *num_inserts += 1;
                Update::Insert(*idx, *val)
            }
        })
    })
    .partition(|update| matches!(update, Update::Remove(_)));
To apply both the inserts and removes, we can loop over the inputs, and either remove or insert when the index of the update matches.
For inserts it looks like this:
let mut output = Vec::with_capacity(input.len() + inserts.len());
let mut inserts = inserts.into_iter();
let mut next_insert = inserts.next();
for (idx, &val) in input.iter().enumerate() {
    while let Some(Update::Insert(insert_idx, insert_val)) = next_insert {
        if insert_idx == idx {
            next_insert = inserts.next();
            output.push(insert_val);
        } else {
            break;
        }
    }
    // we always push the current val
    output.push(val);
}
// we will also need to make sure there aren't any inserts left
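Draining those leftovers looks something like this (assuming the only inserts that can remain are the ones right at the end of the list):

while let Some(Update::Insert(_, insert_val)) = next_insert {
    output.push(insert_val);
    next_insert = inserts.next();
}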
And it turns out this is even slower than the iterator approach. But that makes sense when you think about it. We have increased the number of allocations and conditionals.
Splitting, but Better
While the idea of splitting is good, the execution is terrible. Let’s take a step back and consider what splitting is trying to achieve. When we apply the updates in the original loop, we face the problem that we aren’t being predictable to the CPU. Most of what we are doing is moving data from the original input to the output, but the CPU cannot predict that with so many different branches. The hope with splitting up Updates is to make the operations we are doing more predictable, but currently we aren’t doing that.
The first problem is that we still need to check that we have the correct update variant. The second is that in every iteration of the loop we have to check if the update should be applied.
We can solve the first problem by changing the types. Instead of using the Update enum after splitting, we can instead use usize for removes and (usize, u8) for inserts.
let mut removes = Vec::<usize>::with_capacity(updates.len());
let mut inserts = Vec::<(usize, u8)>::with_capacity(updates.len());
for update in updates {
    match *update {
        Update::Remove(idx) => removes.push(idx + inserts.len()),
        Update::Insert(idx, val) => inserts.push((idx, val)),
    }
}
For the second one, we should iterate over the updates in the outer loop and over the input in the inner loop. Or perhaps we can just copy the part of the input we want to come before the update.
let mut inserted = Vec::with_capacity(input.len() + inserts.len());
let mut prev_idx = 0;
for (insert_idx, val) in inserts {
    inserted.extend_from_slice(&input[prev_idx..insert_idx]);
    inserted.push(val);
    prev_idx = insert_idx;
}
inserted.extend_from_slice(&input[prev_idx..]);
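The removes pass is the same idea in reverse: copy the span before each removed element, then step over it. A sketch, reusing the adjusted removes list from the split above (note the extra check for repeated removes; it matters in a moment):

let mut output = Vec::with_capacity(inserted.len());
let mut prev_idx = 0;
for remove_idx in removes {
    // removes are idempotent, so the same index can appear twice
    if remove_idx < prev_idx {
        continue;
    }
    output.extend_from_slice(&inserted[prev_idx..remove_idx]);
    prev_idx = remove_idx + 1;
}
output.extend_from_slice(&inserted[prev_idx..]);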
And finally, this approach is faster. In fact, it has a mean time of 60.698 µs for the same inputs as before.
We can get even more speedups if we remove all of the repeated Remove updates while splitting. This eliminates the extra check in the remove loop and gets our speed down to 59.209 µs. (The improvement is bigger when we increase our input sizes.)
method | mean speed |
---|---|
slow | 5.2620 ms |
faster | 206.07 µs |
split | 60.698 µs |
removed removes | 59.209 µs |
Unsafe?
We have quite a few allocations currently, and a lot of the standard library code we are using is doing checks we don’t need it to. My hope was that those checks would be avoided by the compiler, but when I looked at the generated assembly, that didn’t seem to be the case. To get around both of these problems we’ll need to use some unsafe Rust.
To remove the allocations, we’ll start by changing the function signature to also pass in a buffer that we can use however we want, but that in the end will contain the output.
fn update_split_new_types_1_2<'a, 'b>(
    input: &'a [u8],
    updates: &'a [Update],
    buffer: &'b mut [MaybeUninit<u8>],
) -> &'b [u8];
To make working with the buffer safer we’ll create a struct for holding slices of the uninitialized buffer. Here len tells us how much of the slice is initialized.
struct PartialInitSlice<'a, T> {
    raw_data: &'a mut [MaybeUninit<T>],
    len: usize,
}
We could use a raw pointer as well, but the slice already contains the capacity and this way the lifetimes are a bit easier.
To create instances of the slice, we need to get slices of the buffer with the proper length.
let (this_buffer, buffer) = buffer.split_at_mut(cap * std::mem::size_of::<T>());
Now we need to convert this_buffer from [MaybeUninit<u8>] to [MaybeUninit<T>]. Luckily the standard library has a nice function for doing that: core::slice::from_raw_parts_mut.
PartialInitSlice {
    raw_data: unsafe {
        core::slice::from_raw_parts_mut(
            this_buffer.as_mut_ptr() as *mut MaybeUninit<T>,
            cap,
        )
    },
    len: 0,
}
The question now becomes: is it safe? In order to use this function safely, the docs tell us that we must make sure that the slice is all part of the same allocation, is properly aligned, is non-null, is properly initialized, and is not accessed through any other pointer.
The last one is actually the easiest to verify. As long as we don’t use this_buffer outside of this struct, and buffer lives as long as PartialInitSlice, we are safe. This can be easily achieved by wrapping the construction in a function.
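That wrapper might look something like this (a sketch; the helper name is mine):

fn take_slice<'b, T>(
    buffer: &'b mut [MaybeUninit<u8>],
    cap: usize,
) -> (PartialInitSlice<'b, T>, &'b mut [MaybeUninit<u8>]) {
    let (this_buffer, rest) = buffer.split_at_mut(cap * std::mem::size_of::<T>());
    let slice = PartialInitSlice {
        // safety: same allocation, promoted alignment, and `this_buffer`
        // is never touched again outside the returned struct
        raw_data: unsafe {
            core::slice::from_raw_parts_mut(
                this_buffer.as_mut_ptr() as *mut MaybeUninit<T>,
                cap,
            )
        },
        len: 0,
    };
    (slice, rest)
}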
We can also convince ourselves that it is all one allocation because the slice has cap * size_of::<T>() bytes and so does this_buffer. If this_buffer is sound, we’re good. We can use similar reasoning for the alignment. From what I can tell, it is safe to promote the alignment up from u8 as long as there is enough memory to hold it. The last thing to verify is that all of the memory is properly initialized. This is the exact reason for using MaybeUninit: all memory is properly initialized for MaybeUninit.
Initializing the data structure doesn’t get us anything unless we can get it to work like a vec. In order to do that we provide a few methods on the struct.
impl<'a, T> PartialInitSlice<'a, T> {
    fn get_slice(self) -> &'a [T];
    fn last(&self) -> Option<&T>;
    unsafe fn push_unchecked(&mut self, el: T);
    unsafe fn append_slice_unchecked(&mut self, slice: &[T]);
}
The first 2 provide ways to get elements from the slice, according to len. These methods can be safe because one of the guarantees of the struct is that len is only updated when all values in the array at indexes less than len are initialized. (Sadly this invariant isn’t enforced by the compiler, and it is up to the programmer to create an interface that upholds it. Ideally I would be able to mark len as unsafe, so that even within module-level code I would still have to verify that invariant, and both functions would be completely sound.)
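Sketched out, those two safe methods lean on that invariant like so:

fn get_slice(self) -> &'a [T] {
    // safety: everything below `len` is initialized (the struct invariant)
    unsafe { core::slice::from_raw_parts(self.raw_data.as_ptr() as *const T, self.len) }
}

fn last(&self) -> Option<&T> {
    // safety: same invariant; index `len - 1` is initialized whenever `len > 0`
    self.len
        .checked_sub(1)
        .map(|i| unsafe { self.raw_data[i].assume_init_ref() })
}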
The other 2 methods are for updating the slice. We mark them as unsafe so the caller must make sure we don’t go outside the buffer. This lets us take some of the checks out of the hot path that the compiler didn’t seem to remove on its own.
push_unchecked is quite straightforward: we merely write the value to self.raw_data[self.len] and update len.
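In code that’s about three lines (a sketch):

unsafe fn push_unchecked(&mut self, el: T) {
    // safety: the caller guarantees `self.len < self.raw_data.len()`
    unsafe { self.raw_data.get_unchecked_mut(self.len).write(el) };
    self.len += 1;
}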
append_slice_unchecked is a bit trickier. What we want to do is pretty much this:
self.raw_data[self.len..self.len + slice.len()] = slice;
self.len += slice.len();
but that approach has a couple of problems. The bigger problem is that raw_data and slice have different types (MaybeUninit<T> and T), meaning the above code won’t even type check. A secondary problem is that the above approach requires a runtime check that we can make the caller verify instead. We can solve the second problem by instead using core::ptr::copy_nonoverlapping.
The first problem can be solved by casting slice.as_ptr() to *const MaybeUninit<T>. We can’t cast raw_data the other way, because we know that the slice we are writing to is uninitialized, and treating it as initialized T is immediate undefined behavior.
let dst = unsafe { self.raw_data.as_mut_ptr().add(self.len) };
let src = slice.as_ptr() as *const MaybeUninit<T>;
unsafe { core::ptr::copy_nonoverlapping(src, dst, slice.len()) };
self.len += slice.len();
This method only pays off if your data is large enough. For the data sizes from before we actually slowed down, but for 50,000 updates and 1,000,000 elements we do see a slight improvement.
Smaller data set:
method | mean speed |
---|---|
slow | 5.2620 ms |
faster | 206.07 µs |
split | 60.698 µs |
removed removes | 59.209 µs |
unsafe | 66.435 µs |
Larger data set: (Only the 3 fastest methods for brevity)
method | mean speed |
---|---|
split | 602.39 µs |
removed removes | 589.36 µs |
unsafe | 565.38 µs |
I’m not sure why it is slower for smaller amounts of data and faster with more data. This approach is also much more consistent: aside from a few outliers, it has a much stronger clustering of data points. My guess is that the consistency comes largely from removing the allocations, which add non-determinism to the system from run to run. Branch misses could also play into it, but I don’t think they are as big of a deal.
Conclusion and Other Ideas
This project likely doesn’t have many “useful” applications in the real world, especially with the exact constraints I put on it. In other tests I ran, I found it to be much faster to not provide inserts and removes in the same list. It would probably be even more advantageous to have them be completely separate functions and add a function for updating. Despite that, I learned quite a bit and I got a huge rush every time I made a speed improvement. I think the findings are quite interesting. I especially found it interesting that splitting the inserts and removes into 2 separate passes made such a big speed up. Hopefully this will give me a better intuition for what sort of things will yield actual speedups. At the very least it has reaffirmed the mantra to benchmark everything.