Bigints
If you want to look at the code, it's here: Bigints.
Let’s talk about writing Bigints. A Bigint is a data structure that allows one to do math on integers larger than the computer’s register size. On a 64-bit computer, registers are 64 bits, so if you want to add two numbers larger than fit in 64 bits, you need two or more registers per number, and then each mathematical operation has to be applied across all of those registers.
Let’s start off with a u128 type. That would look
something like this:
#[derive(Debug, Clone, Copy)]
pub struct Uint128 {
    pub h: u64,
    pub l: u64,
}

For addition, subtraction, and multiplication, nightly Rust thankfully has overflowing_*, widening_*, and wrapping_* methods on integers that allow a simple implementation.
impl std::ops::Add for Uint128 {
    type Output = Self;
    #[inline(never)]
    fn add(self, rhs: Self) -> Self::Output {
        let (l, carry) = self.l.overflowing_add(rhs.l);
        let h = self.h.wrapping_add(rhs.h).wrapping_add(carry as u64);
        Self { l, h }
    }
}

impl std::ops::Sub for Uint128 {
    type Output = Self;
    #[inline(never)]
    fn sub(self, rhs: Self) -> Self::Output {
        let (l, borrow) = self.l.overflowing_sub(rhs.l);
        let h = self.h.wrapping_sub(rhs.h).wrapping_sub(borrow as u64);
        Self { l, h }
    }
}

impl std::ops::Mul for Uint128 {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self::Output {
        let (p0_lo, p0_hi) = self.l.widening_mul(rhs.l);
        let t1_lo = self.l.wrapping_mul(rhs.h);
        let t2_lo = self.h.wrapping_mul(rhs.l);
        let h = p0_hi.wrapping_add(t1_lo).wrapping_add(t2_lo);
        Self { h, l: p0_lo }
    }
}

For division we have to handwrite the implementation and lean on the u128 type, but we get a pretty nice result in the end:
impl std::ops::Div for Uint128 {
    type Output = Self;
    fn div(self, rhs: Self) -> Self::Output {
        let n = (self.h as u128) << 64 | self.l as u128;
        let d = (rhs.h as u128) << 64 | rhs.l as u128;
        let q = n / d;
        Self {
            l: q as u64,
            h: (q >> 64) as u64,
        }
    }
}

The assembly delegates to a compiler builtin called __udivti3, so our code pretty much just loads the operands and calls it; all the hairy work is done for us.
<bigints::u128::Uint128 as core::ops::arith::Div>::div:
    push rax
    mov rax, rdx
    or rax, rcx
    je .LBB_2
    call qword ptr [rip + __udivti3@GOTPCREL]
    pop rcx
    ret
.LBB_2:
    lea rdi, [rip + .Lanon.12]
    call qword ptr [rip + core::panicking::panic_const::panic_const_div_by_zero@GOTPCREL]

And that’s it, right?
Sadly, I introduced a problem at the very beginning for my platform (x86_64): the bigint struct has to flip its members, since this is a little-endian platform.
#[derive(Debug, Clone, Copy)]
pub struct Uint128 {
    pub l: u64,
    pub h: u64,
}

Otherwise the multiplication incurs two extra movs at the start of the function to load the members into the right registers.
    mov rax, rdx
    mov rdx, rcx
    mulx r8, rdx, rsi
    imul rax, rsi
    imul rdi, rcx
    add rax, rdi
    add rax, r8

If you write the struct the other way, i.e.:
#[derive(Debug, Clone, Copy)]
#[cfg(target_endian = "little")]
pub struct Uint128 {
    pub l: u64, // bits 0-63 (lower address)
    pub h: u64, // bits 64-127 (higher address)
}

#[derive(Debug, Clone, Copy)]
#[cfg(target_endian = "big")]
pub struct Uint128 {
    pub h: u64, // bits 64-127 (lower address)
    pub l: u64, // bits 0-63 (higher address)
}

Then the two movs disappear from the generated x86_64 code:
    mulx r8, rdx, rsi
    imul rax, rsi
    imul rdi, rcx
    add rax, rdi
    add rax, r8

Now I needed a way to verify that all my operations were close to native, so I wrote up a test harness that used cargo-show-asm to check the generated assembly on a variety of platforms (including a big-endian one, s390x) and make sure no unnecessary moves were being produced.
I used this to compare my implementations against native and see whether they were being properly optimized, and found that on aarch64, using overflowing_sub and wrapping_sub nets this:
<bigints::u128::Uint128 as core::ops::arith::Sub>::sub:
    subs x0, x0, x2
    sub x8, x1, x3
    cset w9, lo
    sub x1, x8, x9
    ret

Whereas the native version looks like so:
<bigints::u128::Uint128 as core::ops::arith::Sub>::sub:
    subs x0, x0, x2
    sbc x1, x1, x3
    ret

While for x86_64, it’s properly optimized:
<bigints::u128::Uint128 as core::ops::arith::Sub>::sub:
    mov rax, rdi
    sub rax, rdx
    sbb rsi, rcx
    mov rdx, rsi
    ret

As well, for multiplication on aarch64, there’s an extra instruction due to scheduling:
Our version:

<bigints::u128::Uint128 as core::ops::arith::Mul>::mul:
    mul x9, x2, x1
    mul x8, x2, x0
    umulh x10, x2, x0
    madd x9, x3, x0, x9
    mov x0, x8
    add x1, x9, x10
    ret

Native:

<bigints::u128::Uint128 as core::ops::arith::Mul>::mul:
    umulh x10, x0, x2
    mul x9, x1, x2
    madd x9, x3, x0, x9
    mul x0, x0, x2
    add x1, x9, x10
    ret

I assume this is an LLVM backend issue somewhere, but it’s hard to fault them; I doubt anyone would write this code anyway, since the u128 type exists.
Digging into the sub problem
The sub problem on aarch64 is interesting on two counts: the x86_64 version works fine, and add is being properly optimized. Given that, I figured I’d investigate a bit more.
The first thing to do was to look at the LLVM IR that’s generated and see whether llc would be able to take that IR and optimize it.

We can get the LLVM IR from the frontend (rustc) for both our implementation and the native one like so:

$ cargo asm --llvm --lib "<bigints::u128::Uint128 as core::ops::arith::Sub>::sub" \
    --release --target aarch64-unknown-linux-gnu > sub128.ll
$ cargo asm --llvm --lib "bigints::native_sub" --release --target x86_64-unknown-linux-gnu

And after removing rust’s native symbol mangling we get the following:
define { i64, i64 } @sub128(i64 %a_lo, i64 %a_hi, i64 %b_lo, i64 %b_hi) {
    %lo = sub i64 %a_lo, %b_lo
    %borrow = icmp ult i64 %a_lo, %b_lo
    %hi_tmp = sub i64 %a_hi, %b_hi
    %borrow_ext = sext i1 %borrow to i64
    %hi = add i64 %hi_tmp, %borrow_ext
    %res0 = insertvalue { i64, i64 } poison, i64 %lo, 0
    %res1 = insertvalue { i64, i64 } %res0, i64 %hi, 1
    ret { i64, i64 } %res1
}

define i128 @sub128_native(i128 %a, i128 %b) {
    %r = sub i128 %a, %b
    ret i128 %r
}

Although the non-native IR looks worse, llc could in principle still optimize it away. But after running llc on it, we see that it doesn’t:
Our version:

    subs x0, x0, x2
    sub x8, x1, x3
    cset w9, lo
    sub x1, x8, x9
    ret

Native:

    subs x0, x0, x2
    sbc x1, x1, x3
    ret

So the problem does persist, and we have a way to reproduce it: run rustc as an LLVM IR generator and run llc on its output. We could either make a change in the rust frontend itself and then pass the generated IR to the same llc version, or make a change only in llc. Either way, it’s time to look through both.

I first looked through the rust codebase to see if there was a reason why the LLVM IR looks so much worse for a handwritten u128 sub than for native:
// In the logic for checked_*,
if !signed {
    match oop {
        OverflowOp::Sub => {
            // Emit sub and icmp instead of llvm.usub.with.overflow. LLVM considers these
            // to be the canonical form. It will attempt to reform llvm.usub.with.overflow
            // in the backend if profitable.
            let sub = self.sub(lhs, rhs);
            let cmp = self.icmp(IntPredicate::IntULT, lhs, rhs);
            return (sub, cmp);
        }
        OverflowOp::Add => {
            // Like with sub above, using icmp is the preferred form. See
            // <https://rust-lang.zulipchat.com/#narrow/channel/187780-t-compiler.2Fllvm/topic/.60uadd.2Ewith.2Eoverflow.60.20.28again.29/near/533041085>
            let add = self.add(lhs, rhs);
            let cmp = self.icmp(IntPredicate::IntULT, add, lhs);
            return (add, cmp);
        }
        OverflowOp::Mul => {}
    }
}

Checking the zulip thread, the discussion is about u128 ops, changing them from u(add|sub|mul).with.overflow and s(add|sub|mul).with.overflow to icmp + op. However, the note from the LLVM side is that icmp plus the op itself is preferred, though the previous form is OK as well. Given that info, the rust side should be fine, and it’s time to dig into LLVM.
So I combed through the aarch64 instruction selection code for signs of a missed optimization. I pulled down and built llc, only to find out that llvm 23 (rust is on llvm 21.8 for me) actually compiles the previous LLVM IR just fine.
I tracked down the fix to this PR, which basically special-cases this for aarch64:

Combine subtract with borrow to SBC

Which means the sub fix will be coming in a future version of rust.