Bigints

If you want to look at the code, it’s here: Bigints

Let’s talk about writing Bigints. A Bigint is a data structure that lets you do math on integers larger than the computer’s register size. On a 64-bit computer, registers are 64 bits, so if you want to add two numbers larger than fit in 64 bits, you need to spread each number across two or more registers, and then apply each mathematical operation across all of those registers.

Let’s start off with a 128-bit type. That would look something like this:

#[derive(Debug, Clone, Copy)]
pub struct Uint128 {
    pub h: u64, 
    pub l: u64, 
}

For addition, subtraction, and multiplication, Rust thankfully has overflowing_*, wrapping_*, and (on nightly) widening_* methods on integers that allow for a simple implementation.

impl std::ops::Add for Uint128 {
    type Output = Self;

    #[inline(never)]
    fn add(self, rhs: Self) -> Self::Output {
        let (l, carry) = self.l.overflowing_add(rhs.l);
        let h = self.h.wrapping_add(rhs.h).wrapping_add(carry as u64);

        Self { l, h }
    }
}
impl std::ops::Sub for Uint128 {
    type Output = Self;

    #[inline(never)]
    fn sub(self, rhs: Self) -> Self::Output {
        let (l, borrow) = self.l.overflowing_sub(rhs.l);
        let h = self.h.wrapping_sub(rhs.h).wrapping_sub(borrow as u64);

        Self { l, h }
    }
}
impl std::ops::Mul for Uint128 {
    type Output = Self;

    fn mul(self, rhs: Self) -> Self::Output {
        let (p0_lo, p0_hi) = self.l.widening_mul(rhs.l);

        let t1_lo = self.l.wrapping_mul(rhs.h);
        let t2_lo = self.h.wrapping_mul(rhs.l);
        let h = p0_hi.wrapping_add(t1_lo).wrapping_add(t2_lo);
        Self { h, l: p0_lo }
    }
}

For division we have to handwrite and lean on the u128 type, but we get a pretty nice result in the end:

impl std::ops::Div for Uint128 {
    type Output = Self;

    fn div(self, rhs: Self) -> Self::Output {
        let n = (self.h as u128) << 64 | self.l as u128;
        let d = (rhs.h as u128) << 64 | rhs.l as u128;
        let q = n / d;
        Self {
            l: q as u64,
            h: (q >> 64) as u64,
        }
    }
}

The assembly delegates to a compiler builtin called __udivti3, so our code pretty much just loads the operands, and all the hairy work is done for us.

<bigints::u128::Uint128 as core::ops::arith::Div>::div:
    push rax
    mov rax, rdx
    or rax, rcx
    je .LBB_2
    call qword ptr [rip + __udivti3@GOTPCREL]
    pop rcx
    ret
.LBB_2:
    lea rdi, [rip + .Lanon.12]
    call qword ptr [rip + core::panicking::panic_const::panic_const_div_by_zero@GOTPCREL]

And that’s it, right?

Sadly, I wrote a bug at the beginning for my platform (x86_64): the struct has to flip its members, since this is a little-endian platform.

#[derive(Debug, Clone, Copy)]
pub struct Uint128 {
    pub l: u64, 
    pub h: u64, 
}

Otherwise the multiplication incurs two extra movs at the start of the function to move the members into the right registers.

mov rax, rdx
mov rdx, rcx
mulx r8, rdx, rsi
imul rax, rsi
imul rdi, rcx
add rax, rdi
add rax, r8

If you write the struct the other way, i.e.:

#[derive(Debug, Clone, Copy)]
#[cfg(target_endian = "little")]
pub struct Uint128 {
    pub l: u64, // bits 0-63 (lower address)
    pub h: u64, // bits 64-127 (higher address)
}

#[derive(Debug, Clone, Copy)]
#[cfg(target_endian = "big")]
pub struct Uint128 {
    pub h: u64, // bits 64-127 (lower address)
    pub l: u64, // bits 0-63 (higher address)
}

Then the two movs disappear from the generated x86_64 code.

mulx r8, rdx, rsi
imul rax, rsi
imul rdi, rcx
add rax, rdi
add rax, r8

Now I needed a way to make sure all my operations were as close to native as possible, so I wrote up a test harness that used cargo-show-asm to check the generated assembly on a variety of platforms (including a big-endian one, s390x) and make sure no unnecessary moves were being produced.

I used this to compare my implementations against native to see if they were being properly optimized, and found out that for aarch64, using overflowing_sub and wrapping_sub nets this:

<bigints::u128::Uint128 as core::ops::arith::Sub>::sub:
    subs x0, x0, x2
    sub x8, x1, x3
    cset w9, lo
    sub x1, x8, x9
    ret

Whereas the native version looks like so:

<bigints::u128::Uint128 as core::ops::arith::Sub>::sub:
    subs x0, x0, x2
    sbc  x1, x1, x3
    ret

While for x86_64, it’s properly optimized:

<bigints::u128::Uint128 as core::ops::arith::Sub>::sub:
    mov rax, rdi
    sub rax, rdx
    sbb rsi, rcx
    mov rdx, rsi
    ret

As well, for multiplication, there’s an extra instruction due to scheduling. Ours:

<bigints::u128::Uint128 as core::ops::arith::Mul>::mul:
    mul x9, x2, x1
    mul x8, x2, x0
    umulh x10, x2, x0
    madd x9, x3, x0, x9
    mov x0, x8
    add x1, x9, x10
    ret
Native:

<bigints::u128::Uint128 as core::ops::arith::Mul>::mul:
    umulh x10, x0, x2
    mul x9, x1, x2
    madd x9, x3, x0, x9
    mul x0, x0, x2
    add x1, x9, x10
    ret

I assume this is an LLVM backend issue somewhere, but it’s hard to fault them; I doubt anyone would write this code anyway, since the u128 type exists.

Digging into the sub problem

The sub problem on aarch64 is interesting on two accounts: the x86_64 version works fine, and add is being properly optimized. Given that, I figured I’d investigate this a bit more.

The first thing to do was to look at the LLVM IR that’s generated and see whether llc would be able to take that IR and produce the optimal code:

We can get the LLVM IR from the frontend (rustc) for both our implementation and the native one like so:

$ cargo asm --llvm --lib "<bigints::u128::Uint128 as core::ops::arith::Sub>::sub" \
--release --target aarch64-unknown-linux-gnu > sub128.ll
$ cargo asm --llvm --lib "bigints::native_sub" --release --target x86_64-unknown-linux-gnu

And after removing Rust’s symbol mangling, we get the following:

define { i64, i64 } @sub128(i64 %a_lo, i64 %a_hi, i64 %b_lo, i64 %b_hi) {
  %lo = sub i64 %a_lo, %b_lo
  %borrow = icmp ult i64 %a_lo, %b_lo
  %hi_tmp = sub i64 %a_hi, %b_hi
  %borrow_ext = sext i1 %borrow to i64
  %hi = add i64 %hi_tmp, %borrow_ext
  %res0 = insertvalue { i64, i64 } poison, i64 %lo, 0
  %res1 = insertvalue { i64, i64 } %res0, i64 %hi, 1
  ret { i64, i64 } %res1
}
define i128 @sub128_native(i128 %a, i128 %b) {
  %r = sub i128 %a, %b
  ret i128 %r
}

Although the non-native IR looks worse, llc could in principle still optimize it down to the same code. But after running it through llc, we see it doesn’t:

Our version:

    subs    x0, x0, x2
    sub     x8, x1, x3
    cset    w9, lo
    sub     x1, x8, x9
    ret

Native:

    subs    x0, x0, x2
    sbc     x1, x1, x3
    ret

So the problem persists, and we have a way to reproduce it: run rustc as an LLVM IR generator and run llc on its output. We could either make a change in the Rust frontend itself and then feed the generated IR to the same llc version, or make a change only in llc. Either way, it’s time to look through both.

I first looked through the rustc codebase to see if there was a reason why the LLVM IR looks so much worse for a handwritten u128 sub than for native:

// In the logic for checked_*,
if !signed {
    match oop {
        OverflowOp::Sub => {
            // Emit sub and icmp instead of llvm.usub.with.overflow. LLVM considers these
            // to be the canonical form. It will attempt to reform llvm.usub.with.overflow
            // in the backend if profitable.
            let sub = self.sub(lhs, rhs);
            let cmp = self.icmp(IntPredicate::IntULT, lhs, rhs);
            return (sub, cmp);
        }
        OverflowOp::Add => {
            // Like with sub above, using icmp is the preferred form. See
            // <https://rust-lang.zulipchat.com/#narrow/channel/187780-t-compiler.2Fllvm/topic/.60uadd.2Ewith.2Eoverflow.60.20.28again.29/near/533041085>
            let add = self.add(lhs, rhs);
            let cmp = self.icmp(IntPredicate::IntULT, add, lhs);
            return (add, cmp);
        }
        OverflowOp::Mul => {}
    }
}

Checking the Zulip thread, the discussion is about u128 ops, and about changing them from u(add|sub|mul).with.overflow and s(add|sub|mul).with.overflow to icmp + the op. However, the note from the LLVM side is that icmp + the op itself is the preferred form, but the previous form is OK as well. Given that info, the Rust side should be fine, and it’s time to dig into LLVM.

So I combed through the instruction selection code for aarch64 to see if there were any signs of missed optimizations. I pulled down and built llc, and found out that LLVM 23 (Rust is on LLVM 21.8 for me) actually compiles the previous LLVM IR just fine.

I tracked down the fix to this PR, which basically special-cases this for aarch64:

Combine subtract with borrow to SBC

Which means the sub fix will be coming in a future version of Rust soon.