The database layer

5 min read

I'm learning Rust these days. It was supposed to be my "Summer Vacation Project" - but that turned out to be Beer and Space Opera. So now, after my vacation, I'm trying to work some hour on nsblast every day - and hopefully some weekends with smaller Rust projects.

There is som tension between C++, which is my preferred programming language, and Rust. Enthusiasts for both languages express with confidence that their language is the faster one.

I don't know Rust well enough yet to have an opinion on that. However, I know a ting or two about why C++ programs sometimes perform less than optimal. It's usually not about the language itself - C++ can produce really fast programs. It's about what the programmers do in their programs, and how they do it. For example, in C++ there is a standard library, and in that standard library there is a string class. That string class make C++ code in general a hell of a lot safer than classic C code, where the programmer need to track the bounds and lifetime of every string herself - but it comes with a cost. Each string that contains more than a handful of letters require a memory allocation. The string class is mutable and not thread safe. That leads to a lot of deep copies and race conditions in actual code. Deep copies slows down the applications for multiple reasons, while race conditions makes the applications do weird and unexpected things or even crash!

I worked for a few years deep into the code of a fork of Arangodb. Arangodb is a very nice database server. It expose a HTTP/REST API for clients, which make it very easy to use from pretty much any programming language. Just like nsblast, it use RocksDB as it's database engine. It's data format is Json. Internally it converts Json to a binary format (vpack) that is supposedly faster to work with and more memory efficient than Json itself. However, for every request to the server, it had to convert between Json and vpack. When I ran load-tests (using my restc-cpp library to fully utilize the servers API capabilities) and peeked into the servers hot-spots with the Linux command perf top, this conversion, including its numerous memory allocations, consumed most of the CPU.

The DNS protocol is a binary protocol, designed much the same way as the IP protocol. It was how on-the-wire protocols were designed (and still are being designed) when you push a lot of data using limited computer resources. The servers of the early Internet had less CPU and memory than the cheapest Chinese Android phone you can find. Even a Raspberry Pi looks like a mainframe in comparison. DNS messages are blocks of bytes where individual bits has special meaning, and integers can be just a few bits wide. Strings are compressed by referencing a single 8 bit integer into a raw offset of the block. All multi-byte numbers are in Big Endian order. The thing can be processed very efficiently by weak computers and network hardware - but it's not how you usually represent data in 2023! Also, the data-block you send as a reply to a user cannot be entirely cached, because it need to provide the answer to potentially more than one question. And it need to set a random sequence number found in the request. And it need to copy the question(s) from the request.

My first idea was to store the Resource Records as protobuf messages. Protobuf is a space efficient and fast way to store information in C++. However, if I stored the data this way, I would need to serialize and wrap around the integers for every piece of data sent from the server. In stead I decided to store the information as binary blocks that is very similar to how they are defined by the DNS protocol. I still need to copy the data to the destination block I send as a reply reply to a request, and I still need to "compress" the strings by so that any hostname only occur once in a block (any secondary reference use a "pointer" to it's location). However, this can be done quite efficient in C++. Doing it this way also keep the size of the binary blobs I hand to RocksDB small (no overhead like Json entity names). It's also very fast to load and "parse" on of these binary representations of a Resource Record. A single FQDN (Fully Qualified Domain Name) can host a number of Resource Records. I store one binary record for each unique FQDN. In order to quickly decide if the Resource Record the user asks for is present (the user may ask for a IPv4 or IPv6 address, or a mail server or a DNS server) the most "popular" types are flagged by a bit in the header of the message. That way I can decide early if I even need to parse the record to find whatever information the user wants.

The overall idea is to store as little data as possible, because that will allow the CPU to keep more relevant information in it's caches. It also improves the bandwidth to the disk (records written per second). In addition, I try to keep the processing required to produce a reply as simple as possible, with few data transformations and memory allocations to avoid wasting CPU. Electric power is expensive, and a server should not waste it.

This is the binary representation of an Entry - all the information the server needs in order to serve requests for a fqdn (Fully Qualified Domain Name). An Entry is at it's core just a buffer of bytes. Note that this is the Entry at the time this article was written. It may be changed by the time you read this. There is a link to the latest specification below.

Fields:

version: Version of the data format
flags: Flags to quickly check if a popular type of RR's are present
labelsize: Size of the labels buffer (in the first RR)
zonelen: Offset to the start of labels that identifies the zone. This is used to quickly look up the SOA Entry for a fqdn. For example, www.example.com will have 11 in this field, which means that the last 11 bytes of the fqdn identifies the zone boundary, example.com. Using zonelen, I don't have to store the zone name in the Entry, something that would duplicate data and increase its size.
rrcount: Number of RR's in the RRSet
indexoffset: Offset to the start of the index.
Tenant id: UUID of the tenant that owns the fqdn. Used for access control when someone tries to change something via the REST API, and for usage statistics for a tenant. This is only stored for Entries on zone boundaries. Since all the fqdn's in a zone has the same owner, there is no need to replicate this data in all it's Entries.

The index contains a sorted list of Resource Records, their type, and their position in the Entry. When a C++ Entry class is wrapped around an Entry, it use the index to allow an iterator to iterate (or loop) over the Resource Records.

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | version       | flags         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | rrcount                       |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | labelsize     | zonelen       |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | IndexOffset                   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | Tenant id (uuid) (optional)   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               |
    /        RRset entries          /
    |                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               |
    /            index              /
    |                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

In real applications, its not just about the language you use. It's also about how you think. Java can run in circles around C++ if the Java developer re-use a pool of objects, while the C++ developer use memory allocations and deep copies of data. With Nsblast, the idea is to not waste any resources. I plan to deploy it and allow developers to use it for free. I don't want to pay any more for Cloud VM's than I need to. The less memory, CPU and network resources the application use, the less money I bleed on this project.

This is the design document for the internal data storage format.

The Last Viking LTD

The database layer

Tags