Major Protocol Upgrade - Diving Into CKB-VM V1

A technical overview of the new CKB-VM v1 features from the CKB2021 hard fork.

Introduction

In our previous article, we gave a high-level overview of the CKB-VM v1 improvements and explained the basic process for activating and taking advantage of the new virtual machine environment. Here we dig a little deeper into those improvements and explain, in more technical terms, how they work and how they enhance the developer experience.

RISC-V B Extension

The RISC-V B extension is our first update to the instruction set that powers CKB-VM. This instruction set extension is relatively small, but it proves concretely that this update path works both in theory and in practice.

The B extension covers the four major categories of bit manipulation: counting, extracting, inserting and swapping. A total of 43 new instructions are introduced, each of which consumes only 1 cycle.

Many cryptographic algorithms rely on bit operations such as clz (count leading zeros) and circular shifts. The B extension lets us speed up such calculations by a factor of at least 10. Let’s take clz as an example. Traditionally, we use the following algorithm:

uint64_t clz(uint64_t n)
{
    if (n == 0) return 64;
    int c = 0;
    if (n <= 0x00000000FFFFFFFF) { c += 32; n <<= 32; }
    if (n <= 0x0000FFFFFFFFFFFF) { c += 16; n <<= 16; }
    if (n <= 0x00FFFFFFFFFFFFFF) { c += 8; n <<= 8; }
    if (n <= 0x0FFFFFFFFFFFFFFF) { c += 4; n <<= 4; }
    if (n <= 0x3FFFFFFFFFFFFFFF) { c += 2; n <<= 2; }
    if (n <= 0x7FFFFFFFFFFFFFFF) { c += 1; }
    return c;
}

The above code uses about 30 instructions. With the B extension, a single instruction does the same job!

uint64_t clz(uint64_t n) {
    uint64_t rd;
    __asm__ ("clz %0, %1" : "=r"(rd) : "r"(n));
    return rd;
}
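The inline assembly above only assembles when the target supports the clz instruction. As a minimal sketch, the two variants can be combined behind a compile-time guard so that the same source also builds and can be tested on a host machine. The sketch assumes a toolchain that defines the `__riscv_zbb` and `__riscv_xlen` macros when the extension is enabled, and uses the GCC/Clang builtin `__builtin_clzll` as the fallback; the name `clz64` is ours, not part of any CKB library.

```c
#include <stdint.h>

/* Count leading zeros of a 64-bit value. Returns 64 for n == 0,
 * which is also what the RISC-V clz instruction yields on RV64. */
static inline uint64_t clz64(uint64_t n) {
#if defined(__riscv_zbb) && defined(__riscv_xlen) && (__riscv_xlen == 64)
    /* Assumed: the toolchain defines these macros when Zbb is enabled. */
    uint64_t rd;
    __asm__ ("clz %0, %1" : "=r"(rd) : "r"(n));
    return rd;
#else
    /* Host or non-Zbb build. __builtin_clzll is undefined for 0,
     * so that case is handled explicitly. */
    if (n == 0) return 64;
    return (uint64_t)__builtin_clzll(n);
#endif
}
```

Calling `clz64(1)` gives 63 and `clz64(0)` gives 64 on either path, which makes it straightforward to check the assembly version against the fallback.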

The B extension has been integrated into the latest version of GCC. To compile code with the B extension, you need to add the option -march=rv64imc_zba_zbb_zbc, like this:

riscv64-unknown-elf-gcc -o /tmp/main -march=rv64imc_zba_zbb_zbc main.c

Let’s look at a real-world example in which the B extension improves the performance of the Blake2b algorithm. The change simply replaces the body of one function with a single B extension instruction. The original version is as follows:

static BLAKE2_INLINE uint64_t rotr64(const uint64_t w, const unsigned c) {
    return (w >> c) | (w << (64 - c));
}

Update it with B extension instructions:

static BLAKE2_INLINE uint64_t rotr64(const uint64_t w, const unsigned c) {
    uint64_t rd = 0;
    __asm__ ("ror %0, %1, %2" : "=r"(rd) : "r"(w), "r"(c));
    return rd;
}
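One detail is worth spelling out when comparing the two versions. The plain C expression shifts by 64 when c is 0, which is undefined behaviour in C, whereas the ror instruction on RV64 only reads the low six bits of the shift operand. The following is a sketch of a portable reference with the same wrap-around behaviour, which can be used on a host machine to cross-check the assembly version. The name `rotr64_ref` is hypothetical.

```c
#include <stdint.h>

/* Rotate w right by c bits. Only the low six bits of c are used,
 * mirroring how ror treats its shift operand on RV64. */
static inline uint64_t rotr64_ref(uint64_t w, unsigned c) {
    unsigned s = c & 63u;
    /* Masking the left-shift amount avoids a shift by 64 when s == 0. */
    return (w >> s) | (w << ((64u - s) & 63u));
}
```

For example, `rotr64_ref(1, 1)` moves the lowest bit to the top and yields 0x8000000000000000, while a rotation by 0 or 64 returns the input unchanged.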

With this small change, there is a 15% performance increase. Nervos uses the Blake2b algorithm often, meaning that this is a very significant improvement to our ecosystem.

Mainstream compilers like GCC are currently not smart enough to convert source code into RISC-V B extension instructions automatically. That means developers still need to write assembly code by hand for performance tuning, as was done above. In the future, compilers will have better support for the B extension, which means developers can see major performance increases simply by upgrading their compiler to a newer version.

An early benchmark provided by the community showed an immediate improvement of approximately 10% on a set of computationally intensive cryptographic algorithms.

We hope that by adding more extensions in the future, crypto primitives running in CKB-VM will gain execution efficiency that is comparable to native execution. These types of improvements will keep CKB-VM at the forefront of the industry, and give cryptography developers and dApp developers more options and freedom in this highly specialized application where every bit of efficiency matters.

Macro-Operation Fusion (MOP)

Macro-Operation Fusion is a hardware optimization technique found in many modern computer microarchitectures whereby a series of adjacent macro-operations are merged into a single macro-operation before or during decoding. Those instructions are later decoded into a single fused instruction. If you are unfamiliar with MOP and would like to read more, you can find some helpful links here and here.

Since CKB-VM is an abstraction layer over CPU and memory hardware, MOP can be used just as it would be in the equivalent hardware. Here we will illustrate how CKB-VM can use MOP to fuse multiple instructions into a single instruction. Let’s take multiplication as a first example. Assume we have to calculate the following 128-bit unsigned integer value (result) in C:

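uint64_t a = 0xFFFFFFFFFFFFFFFF;
uint64_t b = 0xFFFFFFFFFFFFFFFF;
/* Widen one operand first; otherwise the product is truncated to 64 bits. */
unsigned __int128 result = (unsigned __int128)a * b;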

No single RISC-V instruction produces a 128-bit product, so the compiler emits several instructions, which must include the following two:

mulhu
mul

The mulhu instruction produces the upper 64 bits of result while the mul instruction produces the lower 64 bits. The idea behind MOP is to combine the RISC-V instructions mulhu/mul into one fused instruction. CKB-VM can implement this fused instruction in assembly. For example, on x86-64, it is implemented by the following two x86-64 instructions:

movq
imulq
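To make the high/low split concrete, here is a small sketch of what the mulhu/mul pair computes together. It assumes a compiler that provides the `unsigned __int128` extension (GCC and Clang do on 64-bit targets); the helper name `mul_wide_u64` is hypothetical.

```c
#include <stdint.h>

/* Split the full 128-bit product of two unsigned 64-bit values
 * into its upper and lower halves. */
static void mul_wide_u64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    /* Widen one operand so the multiplication is carried out in 128 bits. */
    unsigned __int128 product = (unsigned __int128)a * b;
    *hi = (uint64_t)(product >> 64);  /* the half mulhu produces */
    *lo = (uint64_t)product;          /* the half mul produces   */
}
```

With both inputs set to 0xFFFFFFFFFFFFFFFF, the product is 2^128 - 2^65 + 1, so the upper half is 0xFFFFFFFFFFFFFFFE and the lower half is 1.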

The main benefit of MOP is a significant reduction in cycle consumption. Previously, the mulhu/mul pair consumed 10 (5 + 5) cycles. With MOP, the single fused instruction consumes only 5 cycles in total. This halves the cost of the pair, which amounts to a 100% performance gain.

Here is a list of all the MOP instructions included in CKB-VM v1. The multiplication operations noted are critical to cryptography algorithms, so MOP can provide a massive performance increase to cryptography algorithms running on CKB-VM. We hand-optimized bls12-381 and reduced verification from 76.6M cycles to 51.8M cycles. Here is one implementation of the work that was done. You will see many places where the mulhu and mul instructions appear next to each other; this deliberate arrangement is what allows MOP to take effect.
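The effect of adjacency on cycle cost can be illustrated with a toy cycle counter. This is only a sketch under stated assumptions: the `Insn` type, the opcode names, and the register conditions in `can_fuse` are hypothetical simplifications rather than CKB-VM's actual decoder, and the cost of 1 cycle for non-multiplication instructions is assumed for the example. Only the 5-cycle cost of mulhu, mul, and the fused pair comes from the figures above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified instruction representation. */
typedef enum { OP_MUL, OP_MULHU, OP_ADD } Opcode;
typedef struct { Opcode op; uint8_t rd, rs1, rs2; } Insn;

/* A mulhu immediately followed by a mul on the same sources can be
 * treated as one wide multiply, provided mulhu does not overwrite a
 * source register that mul still needs. */
static int can_fuse(const Insn *first, const Insn *second) {
    return first->op == OP_MULHU && second->op == OP_MUL
        && first->rs1 == second->rs1 && first->rs2 == second->rs2
        && first->rd != second->rd
        && first->rd != first->rs1 && first->rd != first->rs2;
}

/* Sum the cycle cost of a straight-line sequence, with or without fusion. */
static uint64_t count_cycles(const Insn *code, size_t n, int fuse) {
    uint64_t cycles = 0;
    size_t i = 0;
    while (i < n) {
        if (fuse && i + 1 < n && can_fuse(&code[i], &code[i + 1])) {
            cycles += 5;  /* one fused instruction replaces the pair */
            i += 2;
        } else {
            cycles += (code[i].op == OP_ADD) ? 1 : 5;  /* add cost assumed */
            i += 1;
        }
    }
    return cycles;
}
```

For the sequence mulhu, mul, add, the counter reports 11 cycles without fusion and 6 with it. If another instruction sits between mulhu and mul, no fusion happens, which is why hand-arranged assembly keeps the pair adjacent.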

Unfortunately, GCC and other compilers aren’t optimized for the RISC-V MOP instructions yet. This means that developers still need to manually arrange assembly code to optimize for MOP today. However, we know that this will improve in the future as compilers begin to add support.

Exec Syscall

The exec syscall is inspired by the Linux exec syscall. It runs executable code that is contained within the data area of a specific cell. This executable is run in the same context as the currently running virtual machine, and it replaces the previously running executable that made the call. The number of cycles already consumed is carried over unchanged, but the code, registers, and memory of the VM are replaced by those of the new program.
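These semantics can be modelled with a toy host-side simulation. The sketch below does not use the real CKB syscall interface; every name in it (`ToyVm`, `toy_exec`, `toy_run`, and the two example programs) is hypothetical, and the cycle amounts are arbitrary. It only illustrates the rule described above: the cycle counter survives an exec, while registers and memory start fresh and control never returns to the caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MEM_SIZE 64

typedef struct ToyVm ToyVm;
typedef int (*ToyProgram)(ToyVm *vm);

struct ToyVm {
    uint64_t   cycles;                /* preserved across exec */
    uint64_t   regs[4];               /* reset on exec         */
    uint8_t    memory[TOY_MEM_SIZE];  /* reset on exec         */
    ToyProgram next;                  /* program to run next   */
};

/* Replace the running program: wipe registers and memory, keep cycles. */
static void toy_exec(ToyVm *vm, ToyProgram target) {
    memset(vm->regs, 0, sizeof vm->regs);
    memset(vm->memory, 0, sizeof vm->memory);
    vm->next = target;
}

/* Shared program that stands in for signature validation. */
static int verify_signature(ToyVm *vm) {
    vm->cycles += 100;
    return 0;
}

/* Lock script that does its own work, then hands over via exec. */
static int lock_script(ToyVm *vm) {
    vm->cycles += 10;
    vm->regs[0] = 42;
    vm->memory[0] = 0xAB;
    toy_exec(vm, verify_signature);
    return -1;  /* discarded: the exec'd program decides the exit code */
}

/* Run programs until none is scheduled; the last one sets the exit code. */
static int toy_run(ToyVm *vm, ToyProgram entry) {
    int code = 0;
    vm->next = entry;
    while (vm->next != NULL) {
        ToyProgram program = vm->next;
        vm->next = NULL;
        code = program(vm);
    }
    return code;
}
```

Running `lock_script` through `toy_run` ends with the exit code of `verify_signature`, a cycle total of 110, and none of the register or memory state that `lock_script` had set up.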

The exec syscall is very helpful for reusing code in scripts. Let’s use a lock script as an example. Almost every lock script shares a common process of validating one or more cryptographic signatures. We can build a single common script that only does signature validation, including SECP256K1, RSA, Schnorr, etc. This script will rarely require changes.

Developers who are creating lock scripts can simplify their process by using the exec syscall to execute the aforementioned script which contains signature validation. This allows them to remove the signature validation code from their script, removing the redundancy and allowing them to focus more exclusively on the logic specific to their dApp.

This pull request demonstrates how this can be done in practice. We encourage developers to continue using this pattern to make scripts more composable and shareable.

ARM64 Assembly Implementation

CKB-VM already has a highly efficient implementation of RISC-V specifications on x86-64, and with the CKB2021 hard fork, we are introducing an ARM64 assembly implementation. This feature gives CKB-VM a huge performance boost on ARM64 platforms. ARM64 is a CPU architecture that is used on billions of devices, such as Apple and Android smartphones and tablets, Raspberry Pi devices, Google Chrome laptops, and Apple’s newest line of M1 and M2 desktop and laptop computers. Microsoft is also testing a version of Windows 11 that runs on ARM64. We believe that the ARM64 architecture will continue to be more and more popular, and we are making sure that CKB-VM will keep up with the trend.

Other Improvements on CKB-VM

Several other improvements have also been made behind the scenes to CKB-VM. The first is lazy initialization during script execution. Previously, when CKB-VM executed a script, the first step was to initialize all memory to zero. If some of that memory was never used during execution, it remained reserved anyway, which is not optimal. Now the initialization of memory is deferred. The program memory is divided into multiple frames, and only when a frame is accessed for the first time is its block of memory reserved and initialized to all zeros. As a result, small scripts that do not need large amounts of memory begin execution faster and require less memory for the duration of their execution.
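The frame-based scheme described above can be sketched as follows. This is an illustration, not CKB-VM's implementation: the type and function names are hypothetical, the frame size and frame count are assumed values chosen for the example, and the sketch only defers the zeroing step (it allocates the backing store up front with `malloc`).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FRAME_SIZE  ((size_t)256 * 1024)  /* assumed frame size  */
#define FRAME_COUNT 16                    /* assumed frame count */
#define MEMORY_SIZE (FRAME_SIZE * FRAME_COUNT)

typedef struct {
    uint8_t *bytes;                     /* backing store, not pre-zeroed  */
    uint8_t  frame_ready[FRAME_COUNT];  /* 1 once a frame has been zeroed */
    size_t   frames_touched;            /* how many frames were zeroed    */
} LazyMemory;

/* Allocate the backing store without clearing it. Returns 0 on success. */
static int lazy_memory_init(LazyMemory *m) {
    m->bytes = malloc(MEMORY_SIZE);
    if (m->bytes == NULL) return -1;
    memset(m->frame_ready, 0, sizeof m->frame_ready);
    m->frames_touched = 0;
    return 0;
}

/* Return a pointer to the byte at addr, zeroing its frame on first access.
 * Returns NULL when addr is out of bounds. */
static uint8_t *lazy_memory_at(LazyMemory *m, size_t addr) {
    if (addr >= MEMORY_SIZE) return NULL;
    size_t frame = addr / FRAME_SIZE;
    if (!m->frame_ready[frame]) {
        memset(m->bytes + frame * FRAME_SIZE, 0, FRAME_SIZE);
        m->frame_ready[frame] = 1;
        m->frames_touched++;
    }
    return m->bytes + addr;
}

static void lazy_memory_free(LazyMemory *m) {
    free(m->bytes);
    m->bytes = NULL;
}
```

A script that only ever touches addresses inside one frame pays for zeroing a single frame; the remaining frames are never cleared at all.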

Another improvement is the addition of “chaos mode”. In production use, all memory is always initialized to zero when first used. However, during testing and debugging, we can use chaos mode to initialize memory to random values instead of zero. This technique creates randomized synthetic scenarios which can aid in finding certain bugs in scripts, such as reads of uninitialized memory. This is a serious issue that can be very difficult to find under normal circumstances, but chaos mode can help identify the problem much faster.
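The following sketch shows why random fill exposes such bugs. It is a toy model with hypothetical names (`FillMode`, `fill_fresh_memory`, `buggy_flag_is_clear`), and it uses a small xorshift generator purely so the example is deterministic; how CKB-VM actually produces its random values is not covered here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { FILL_ZERO, FILL_CHAOS } FillMode;

/* Marsaglia xorshift64 step; the state must be nonzero. */
static uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Prepare memory on first use: zeros in production, noise in chaos mode. */
static void fill_fresh_memory(uint8_t *mem, size_t len, FillMode mode,
                              uint64_t seed) {
    if (mode == FILL_ZERO) {
        memset(mem, 0, len);
        return;
    }
    uint64_t state = seed ? seed : 1;  /* avoid the all-zero state */
    for (size_t i = 0; i < len; i++) {
        mem[i] = (uint8_t)xorshift64(&state);  /* keep the low byte */
    }
}

/* Deliberately buggy script logic: it reads a flag byte that it never
 * wrote, silently relying on fresh memory being zero. */
static int buggy_flag_is_clear(const uint8_t *mem) {
    return mem[0] == 0;
}
```

With zero fill the buggy check always passes, so the mistake goes unnoticed. With chaos fill the same check starts to fail, which points directly at the uninitialized read.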

Closing

There are no shortcuts to success when trailblazing a path and building new technology. It takes a lot of work and painstaking effort to ensure that our technology remains at the leading edge. By continually investing in improvements to our ecosystem’s foundation we are positioning the platform for the best possible trajectory and ensuring our success in the future.

Some of the updates that we have described may be deeper than many developers are interested in, but all developers and users will benefit. It doesn’t matter if you’re a low-level developer on L1 or an EVM developer on L2. It doesn’t matter if you’re a user of a native dApp on Nervos or a user on the far edges of one of our multi-chain ecosystem partners. Even users that don’t realize the dApp they are using is powered by Nervos will benefit from the solid foundation and meticulous optimization that makes Nervos an outlier in the industry.