How fast are computers these days

I’ve been doing a lot of reading about distributed systems, but I’ve never had a good sense of how fast modern hardware actually is. Common advice like “I/O is slow” or “system calls are slow” might not matter much, depending on the performance requirements of the system using them.

General performance numbers are useful for building distributed systems. Say you have a system holding 1PB of data and you want to back it all up to one node that can write to disk at 1GB/s. How long does the backup take? About 11.5 days: 1PB / 1GB/s is 1M seconds, and 1M seconds is roughly 11.5 days. If we were using HDDs from 20 years ago instead of SSDs, where sequential writes might be closer to 40MB/s, it would take closer to 290 days, which is absurd.

Armed with this information, we might decide that a total outage must never happen, or we might carefully design the system to partition the data so backups can happen in parallel: with 1,000 nodes, the 1M-second backup shrinks to about 1,000 seconds, or roughly 17 minutes.

For a twist, I decided to benchmark my phone (a Samsung S10) against my personal laptop (a Framework laptop with a 12th-gen Intel i5-1240P, 32GB of DDR4-3200 RAM, and a 2TB Western Digital SN750). The S10 is worth about $200 used these days, and the laptop about $1,000, so we can also get a ballpark estimate of how much a system built from these parts would cost (although I assume building one from NUCs or the like would be much more cost-efficient). I’ll also write down my guesses for how fast I thought each benchmark would be on my phone and my laptop, and see how far off I was. In the end, we’ll discuss some in-the-wild systems and see how feasible they would be to build.

HTTP Servers

The communication layer of a service might use HTTP. Thus, I wanted to benchmark a “hello world” HTTP service to see how fast a networked service could run, given it does no work.

I assumed my laptop would run about 100k req/s, with around 10 microsecond latency, and my phone would do around 10k req/s, with 100 microsecond latency.

Try to guess how fast each device was:

The computer had these results:

Running 5s test @ http://localhost:3000
  256 goroutine(s) running concurrently
975712 requests in 4.929485946s, 105.15MB read
Requests/sec:       197933.82
Transfer/sec:       21.33MB
Avg Req Time:       1.293361ms
Fastest Request:    21.095µs
Slowest Request:    67.116949ms
Number of Errors:   0

And the phone:

Running 5s test @ http://localhost:3000
  32 goroutine(s) running concurrently
107649 requests in 4.909044953s, 11.29MB read
Requests/sec:       21928.71
Transfer/sec:       2.30MB
Avg Req Time:       1.459274ms
Fastest Request:    85.677µs
Slowest Request:    48.922239ms
Number of Errors:   0

The laptop had about 200k req/s and the phone had about 20k req/s. Both were about twice as fast as I expected.

So, if we had to create a “hello world” server and reach 1M req/s, it would take 50 phones or 5 laptops. The 50 phones would cost $10,000, and the 5 laptops about $5,000. Laptops are cheaper.

Redis

What if we put a cache in front of our servers? How fast could it respond?

Since Redis is single-threaded and in-memory, I assumed it would perform about the same as the HTTP servers, since it does a similar amount of work per request: ~200k req/s on the laptop and ~20k req/s on the phone.

The laptop ran at ~200k req/s for read, write, and ping.

The phone ran at ~40k req/s for read, write, and ping.

Pretty good: if we could hold our dataset in memory, we could serve it at ~200k req/s.

Sqlite

We’ll need a database to store our data. Let’s say we use SQLite, and benchmark it using 1,000-byte rows.

I assumed we could write about 50k rows/second on the laptop and 10k rows/second on the phone.

The laptop:

Batching 1000 writes at a time:

$ ./sqlite-bench -batch-count 1000 -batch-size 1000 -row-size 1000 -journal-mode wal -synchronous normal ./bench.db

Inserts:   1000000 rows
Elapsed:   7.824s
Rate:      127817.265 insert/sec
File size: 1026584576 bytes

Writing 1 row at a time:

$ ./sqlite-bench -batch-count 1000000 -batch-size 1 -row-size 1000 -journal-mode wal -synchronous normal ./bench.db

Inserts:   1000000 rows
Elapsed:   43.839s
Rate:      22810.910 insert/sec
File size: 1026584576 bytes

The phone:

Batching 1000 writes at a time:

$ ./sqlite-bench -batch-count 1000 -batch-size 1000 -row-size 1000 -journal-mode wal -synchronous normal ./bench.db

Inserts:   1000000 rows
Elapsed:   66.006s
Rate:      15150.53 insert/sec
File size: 1026584576 bytes

Writing 1 row at a time:

$ ./sqlite-bench -batch-count 1000000 -batch-size 1 -row-size 1000 -journal-mode wal -synchronous normal ./bench.db

Inserts:   1000000 rows
Elapsed:   200.473s
Rate:      4884.369 insert/sec
File size: 1026584576 bytes

Here we see the phone’s weakness: its disk. Inserting row by row, the phone can only write ~5k rows/s, whereas the laptop can do about 20k rows/s, probably because the phone uses slower UFS flash storage while the laptop has an NVMe SSD.

Disk

Since the SQLite benchmark uncovered the weakness of the phone’s disk, let’s test out some numbers for the file system.

I decided to benchmark sequential reads and writes on both devices, with and without fsync for the writes. I used fio as my tool, and since it supports io_uring, I gave that a shot on my laptop. Android doesn’t support io_uring for security reasons, so I only ran io_uring on the computer.
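fio invocations can get long, so a job file is often easier. As a sketch, a job approximating the “sync sequential write + fsync” case might look like this (the post doesn’t show the exact flags used, so these settings are assumptions):

```ini
; seq-write-fsync.fio -- run with: fio seq-write-fsync.fio
[seq-write-fsync]
ioengine=sync
rw=write
bs=1M
numjobs=8
size=1G
fsync=1        ; fsync after every write
directory=/tmp
```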

I’d expect about 4GB/s read and 3GB/s write on the computer (since that’s what the SSD is rated at) and maybe 1GB/s read and 200MB/s write for the phone?

The numbers looked like this, with a block size of 1MB and 8 jobs running in parallel, which seemed to give the best throughput:

Computer:

job name                       p50     p90     p99      p99.99    throughput
sync sequential read           113μs   5014μs  33424μs  37487μs   3387MB/s
sync sequential write - fsync  169μs   529μs   1319μs   333448μs  2830MB/s
sync sequential write + fsync  775μs   1090μs  1369μs   2147μs    1714MB/s
sync readwrite - fsync         204μs   416μs   865μs    14222μs   2556MB/s read, 2665MB/s write
sync readwrite + fsync         416μs   832μs   2769μs   9372μs    1052MB/s read, 1093MB/s write

Phone:

With a blocksize of 256k and running 8 jobs in parallel:

job name                       p50      p90      p99      p99.99    throughput
sync sequential read           1549μs   4490μs   4752μs   22152μs   866MB/s
sync sequential write - fsync  18482μs  39584μs  89654μs  109577μs  100MB/s
sync sequential write + fsync  510μs    865μs    1729μs   1860μs    86.3MB/s
sync readwrite - fsync         314μs    1926μs   41681μs  41681μs   77.4MB/s read, 75.5MB/s write
sync readwrite + fsync         379μs    717μs    10421μs  10421μs   38.5MB/s read, 41.4MB/s write

The phone was substantially slower than I expected.

Hashing

I was expecting about 400MB/s on the laptop and 100MB/s on the phone for a non-cryptographic hash, and about 40MB/s on the laptop and 10MB/s on the phone for a cryptographic hash.

[charts: laptop and phone hashing throughput]

I could not have been more wrong.

The phone hashes very quickly compared to the laptop, maybe because of dedicated hardware instructions (ARM’s cryptographic extensions, perhaps).

Networks

The computer supports 2.5Gbit/s ethernet, and the phone’s modem is rated at 2000Mb/s down and 316Mb/s up, which is pretty fast. For the phone, that’s actually faster than its disk, so the network probably won’t be its bottleneck; for the laptop, though, 2.5Gbit/s is only ~300MB/s, well below its SSD’s throughput, so there the network can be the bottleneck.

Talking about Systems

With some numbers to work with, let’s talk about some famous products and their requirements.

Amazon S3

According to AWS, Amazon S3 holds 280 trillion objects and handles about 100 million req/s. S3 sends 125 billion events per day (~1.4M events/s), and handles 4 billion checksum computations per second.

Assuming that’s 80M reads and 20M writes per second, each of 1MB, that would involve 80TB/s of reads and 20TB/s of writes. For disk throughput alone, you’d need ~7,000 laptops to handle the writes and ~27,000 to handle the reads. But assuming we use the 2.5Gb ethernet, at ~300MB/s, we’d need roughly 10 times that: ~70,000 laptops for the writes and ~270,000 for the reads.

For the hashing, assuming each checksum is over 1MB of data, that’s 4PB/s to hash. Using xxHash at 2GB/s would require 2,000,000 laptops just to hash the data. Big servers can run xxHash at over 100GB/s, so I assume Amazon uses those, reducing the requirement to a more modest ~40,000.

Conclusion

While going through this I learned a lot about how fast my computers are: they can handle hosting lots of services. Computers are also really cheap. An instance with 8GB of RAM and an 80GB SSD on Hetzner is ~$5/month, and an equivalent NUC would cost ~$100 to own, and either can handle thousands of requests per second at the very least, more than enough for the large majority of services.