Bigints
Let’s talk about writing Bigints. A Bigint is a data structure that lets you do math on integers larger than the computer’s register size. On a 64-bit computer, registers are 64 bits, so if you want to add two numbers larger than what fits in 64 bits, you need to put each number in two or more registers, and then apply every mathematical operation across all of those registers.
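To make that concrete before narrowing down to a fixed 128-bit type, here is the general shape of the trick, sketched over an arbitrary little-endian slice of u64 limbs (this helper is my illustration, not part of the final code):

```rust
// General form of the idea: numbers as little-endian slices of u64 limbs,
// added limb-by-limb while propagating the carry between limbs.
fn add_limbs(a: &[u64], b: &[u64], out: &mut [u64]) -> bool {
    let mut carry = false;
    for i in 0..out.len() {
        let (s, c1) = a[i].overflowing_add(b[i]);
        let (s, c2) = s.overflowing_add(carry as u64);
        out[i] = s;
        carry = c1 || c2;
    }
    carry // true if the result overflowed the available limbs
}

fn main() {
    // 2^64 - 1 plus 1 across two limbs: the low limb wraps to 0
    // and the carry lands in the high limb.
    let a = [u64::MAX, 0];
    let b = [1, 0];
    let mut out = [0u64; 2];
    let overflow = add_limbs(&a, &b, &mut out);
    assert_eq!(out, [0, 1]);
    assert!(!overflow);
    println!("ok");
}
```

The Uint128 type below is just the fixed two-limb case of this, with the loop unrolled by hand.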
Let’s start off with a u128 type. That would look
something like this:
#[derive(Debug, Clone, Copy)]
pub struct Uint128 {
    pub h: u64,
    pub l: u64,
}

For addition, subtraction, and multiplication, Rust thankfully has overflowing_* and wrapping_* methods on integers (plus widening_* on nightly) that allow for a simple implementation.
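One caveat: widening_mul is still nightly-only (behind the bigint_helper_methods feature gate). If you need a stable stand-in, you can sketch the same operation through u128 (this helper is my illustration):

```rust
// Stable stand-in for the nightly u64::widening_mul: do the multiply in
// u128 and split the 128-bit product back into (low, high) limbs.
fn widening_mul_u64(a: u64, b: u64) -> (u64, u64) {
    let p = (a as u128) * (b as u128);
    (p as u64, (p >> 64) as u64)
}

fn main() {
    let (lo, hi) = widening_mul_u64(u64::MAX, u64::MAX);
    // (2^64 - 1)^2 = 2^128 - 2^65 + 1: low limb is 1, high limb is 2^64 - 2.
    assert_eq!(lo, 1);
    assert_eq!(hi, u64::MAX - 1);
    println!("ok");
}
```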
impl std::ops::Add for Uint128 {
    type Output = Self;
    #[inline(never)]
    fn add(self, rhs: Self) -> Self::Output {
        let (l, carry) = self.l.overflowing_add(rhs.l);
        let h = self.h.wrapping_add(rhs.h).wrapping_add(carry as u64);
        Self { l, h }
    }
}

impl std::ops::Sub for Uint128 {
    type Output = Self;
    #[inline(never)]
    fn sub(self, rhs: Self) -> Self::Output {
        let (l, borrow) = self.l.overflowing_sub(rhs.l);
        let h = self.h.wrapping_sub(rhs.h).wrapping_sub(borrow as u64);
        Self { l, h }
    }
}

impl std::ops::Mul for Uint128 {
    type Output = Self;
    #[inline(never)]
    fn mul(self, rhs: Self) -> Self::Output {
        let (p0_lo, p0_hi) = self.l.widening_mul(rhs.l);
        let t1_lo = self.l.wrapping_mul(rhs.h);
        let t2_lo = self.h.wrapping_mul(rhs.l);
        let h = p0_hi.wrapping_add(t1_lo).wrapping_add(t2_lo);
        Self { h, l: p0_lo }
    }
}

For division we have to write it by hand, leaning on the native u128 type, but we get a pretty nice result in the end:
impl std::ops::Div for Uint128 {
    type Output = Self;
    #[inline(never)]
    fn div(self, rhs: Self) -> Self::Output {
        let n = (self.h as u128) << 64 | self.l as u128;
        let d = (rhs.h as u128) << 64 | rhs.l as u128;
        let q = n / d;
        Self {
            l: q as u64,
            h: (q >> 64) as u64,
        }
    }
}

The generated assembly delegates to a compiler builtin called __udivti3, so our code pretty much just loads the operands and all the hairy work is done for us.
<bigints::u128::Uint128 as core::ops::arith::Div>::div:
    push rax
    mov rax, rdx
    or rax, rcx
    je .LBB_2
    call qword ptr [rip + __udivti3@GOTPCREL]
    pop rcx
    ret
.LBB_2:
    lea rdi, [rip + .Lanon.12]
    call qword ptr [rip + core::panicking::panic_const::panic_const_div_by_zero@GOTPCREL]

And that’s it, right?
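As an aside, the same u128 conversion used in div gives a cheap way to sanity-check the limb arithmetic against the native type. Here is a sketch; the From impls and the standalone copy of the struct are my additions, not part of the original code:

```rust
// A cross-check sketch: convert to/from native u128 and verify the limb
// arithmetic against it. This is a local, self-contained copy of Uint128
// with only Add implemented, for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Uint128 {
    pub h: u64,
    pub l: u64,
}

impl From<u128> for Uint128 {
    fn from(x: u128) -> Self {
        Self { h: (x >> 64) as u64, l: x as u64 }
    }
}

impl From<Uint128> for u128 {
    fn from(x: Uint128) -> u128 {
        (x.h as u128) << 64 | x.l as u128
    }
}

impl std::ops::Add for Uint128 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self::Output {
        let (l, carry) = self.l.overflowing_add(rhs.l);
        let h = self.h.wrapping_add(rhs.h).wrapping_add(carry as u64);
        Self { l, h }
    }
}

fn main() {
    let a: u128 = (1u128 << 64) + 5; // forces a non-zero high limb
    let b: u128 = u64::MAX as u128;  // forces a carry out of the low limb
    let sum = Uint128::from(a) + Uint128::from(b);
    assert_eq!(u128::from(sum), a.wrapping_add(b));
    println!("ok");
}
```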
Sadly, I wrote a bug at the very beginning for my platform (x86_64): the struct has to flip its members, since this is a little-endian platform.
#[derive(Debug, Clone, Copy)]
pub struct Uint128 {
    pub l: u64,
    pub h: u64,
}

Otherwise the multiplication incurs two extra movs at the start of the function to load the members into the right place.
mov rax, rdx
mov rdx, rcx
mulx r8, rdx, rsi
imul rax, rsi
imul rdi, rcx
add rax, rdi
add rax, r8

If you write the struct in an endian-aware way, i.e.:
#[derive(Debug, Clone, Copy)]
#[cfg(target_endian = "little")]
pub struct Uint128 {
    pub l: u64, // bits 0-63 (lower address)
    pub h: u64, // bits 64-127 (higher address)
}

#[derive(Debug, Clone, Copy)]
#[cfg(target_endian = "big")]
pub struct Uint128 {
    pub h: u64, // bits 64-127 (lower address)
    pub l: u64, // bits 0-63 (higher address)
}

then the two movs disappear from the generated x86_64 code.
mulx r8, rdx, rsi
imul rax, rsi
imul rdi, rcx
add rax, rdi
add rax, r8

Now I needed a way to make sure all my operations were close to native, so I wrote up a test harness that used cargo-show-asm to check the generated assembly on a variety of platforms (including a big-endian one, s390x) and make sure no unnecessary moves were being produced.
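The core of that check can be sketched as a small instruction counter over the captured listing. The helper below is my illustration, and assumes the assembly has already been dumped to text (e.g. by cargo-show-asm):

```rust
// Sketch of the harness's core check: count real instructions in an asm
// listing, skipping labels and directives, and tally register-shuffling
// movs that native codegen wouldn't emit.
fn count_insns(asm: &str) -> (usize, usize) {
    let mut total = 0;
    let mut movs = 0;
    for line in asm.lines() {
        let line = line.trim();
        // Skip blank lines, labels ("foo:"), and assembler directives (".cfi_...").
        if line.is_empty() || line.ends_with(':') || line.starts_with('.') {
            continue;
        }
        total += 1;
        if line.starts_with("mov") {
            movs += 1;
        }
    }
    (total, movs)
}

fn main() {
    // The two aarch64 sub listings from later in this post.
    let ours = "subs x0, x0, x2\nsub x8, x1, x3\ncset w9, lo\nsub x1, x8, x9\nret";
    let native = "subs x0, x0, x2\nsbc x1, x1, x3\nret";
    assert_eq!(count_insns(ours), (5, 0));
    assert_eq!(count_insns(native), (3, 0));
    println!("ok");
}
```

The real harness compares these counts between the Uint128 impl and the equivalent native u128 function, and fails when they diverge.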
I used this to compare my implementations against native u128 and see whether they were being properly optimized, and found that on aarch64, using overflowing_sub and wrapping_sub nets this:
<bigints::u128::Uint128 as core::ops::arith::Sub>::sub:
    subs x0, x0, x2
    sub x8, x1, x3
    cset w9, lo
    sub x1, x8, x9
    ret

Whereas the native version looks like so:
<bigints::u128::Uint128 as core::ops::arith::Sub>::sub:
    subs x0, x0, x2
    sbc x1, x1, x3
    ret

While for x86_64, it’s properly optimized:
<bigints::u128::Uint128 as core::ops::arith::Sub>::sub:
    mov rax, rdi
    sub rax, rdx
    sbb rsi, rcx
    mov rdx, rsi
    ret

Likewise, for multiplication on aarch64, there’s an extra instruction due to scheduling:
<bigints::u128::Uint128 as core::ops::arith::Mul>::mul:
    mul x9, x2, x1
    mul x8, x2, x0
    umulh x10, x2, x0
    madd x9, x3, x0, x9
    mov x0, x8
    add x1, x9, x10
    ret

Whereas the native version is:

<bigints::u128::Uint128 as core::ops::arith::Mul>::mul:
    umulh x10, x0, x2
    mul x9, x1, x2
    madd x9, x3, x0, x9
    mul x0, x0, x2
    add x1, x9, x10
    ret

I assume this is an LLVM backend issue somewhere, but it’s hard to fault them; I doubt anyone would write this code anyway, since the native u128 type exists.
The resulting code is here: Bigints