KPU
===

Some notes about the K210 KPU, which is definitely the weirdest, possibly
most interesting peripheral on this SoC. Documentation doesn't seem to be
available, so the information here has been reconstructed from various vendor
source code.

This kind of custom hardware is pretty much impossible to understand without
knowledge of the domain, in this case Convolutional Neural Networks on images.
My understanding of this is rudimentary (my last brush with it was in uni) so
I may be missing some obvious clues here and there.

From the datasheet
==================

The Kendryte datasheet has the following information on the KPU:

> KPU is a general-purpose neural network processor with built-in convolution,
> batch normalization, activation, and pooling operations. It can detect faces or
> objects in real time. The specific characteristics are as follows:
> 
> - Supports the fixed-point model that the mainstream training framework trains
> according to specific restriction rules
> - There is no direct limit on the number of network layers, and each layer of
> convolutional neural network parameters can be configured separately, including
> the number of input and output channels, and the input and output line width
> and column height
> - Support for 1x1 and 3x3 convolution kernels

1×1 and 3×3 is not a very wide range of supported convolutions, but maybe the most
common ones in this specific application area…

> - Support for any form of activation function

This is definitely true, Normalization functions seem to be represented as an array of 16
segments (`kpu_activate_table_t`).

> - The maximum supported neural network parameter size for real-time work is 5MiB
> to 5.9MiB
> - The maximum supported network parameter size when working in non-real time is
> (flash size - software size)

The flash size specs are somewhat of a red herring as they relate to software
instead of hardare: the KPU does not have logic for loading parameters from
flash.

Some other source mentions:

> 64 KPU which are 576bit width, supports convolution kernel. Offers
> 0.25TOPS@0.3W,400MHz, and when you overclock to 800MHz, it offers 0.5TOPS,
> meaning you can do object recognition 60fps@VGA.

Clock speed
===========

The KPU is clocked from PLL1, with a divisor between 1 and 16.
The usual clock speed in the Sipeed examples is 300, sometimes 400 MHz.
According to some mentions in the data sheet it's possible to clock it to 800 MHz.

Overall execution flow
======================

The overall execution flow is that the KPU runs a neural network layer by
layer. This happens in a sequential fashion. Each layer can be considered a
separate set of instructions for the KPU.

A layer can receive its input in the "AI" memory area (2MB of the memory is reserved for this,
from 0x40600000 to 0x407fffff) as well as write its output there. The input and
output can consist of multiple channels (R/G/B for example).

It is possible to set an interrupt to notify the host CPU when a specific layer has
finished executing.

Looking at `lib/drivers/kpu.c` in the SDK, function `ai_step`, many types of CNN layers are
implemented in software instead of executed by the KPU. I suppose they accelerated the
most common multiplication-intensive layers in hardware, which is `KL_K210_CONV`.

Peripehral layout
=================

The register layout of the peripheral is as folllows. Source: `lib/drivers/include/kpu.h`.
All registers are 64-bit.

| Ofs   | Name              | Description                                                   |
| ----- | ----------------- | ------------------------------------------------------------- |
| 0x00  | `layer_argument_fifo` | Layer arguments (instructions) are submitted here         |
| 0x08  | `interrupt_status` | Status of pending interrupts                                 |
| 0x10  | `interrupt_raw`   |                                                               |
| 0x18  | `interrupt_mask`  | Specifies which global interrupts are enabled                 |
| 0x20  | `interrupt_clear` | Clear pending interrupts                                      |
| 0x28  | `fifo_threshold`  | FIFO interrupt thresholds                                     |
| 0x30  | `fifo_data_out`   | Data output FIFO read register                                |
| 0x38  | `fifo_ctrl`       | Flush FIFOs                                                   |
| 0x40  | `eight_bit_mode`  | Enable 8-bit instead of 16-bit precision                      |

Layer format
============

KPU neural network layers are represented by a series of 12 64-bit values,
submitted to the layer argument FIFO one by one. The overall structure of the bit fields is
available in `lib/drivers/include/kpu.h`.

It looks like the generation of models is supposed to be done offline by a tool called [nnscase](https://github.com/kendryte/nncase),
which compiles TensorFlow models to a specific internal representation.
The k210-specific code parts are [k210_ops.cpp](https://github.com/kendryte/nncase/tree/master/src/codegen/ops/k210/k210_ops.cpp)
and [k210_sim_types.h](https://github.com/kendryte/nncase/blob/master/src/common/include/runtime/k210/k210_sim_types.h)
and [k210_ops_body.h](https://github.com/kendryte/nncase/blob/master/src/common/include/runtime/k210/k210_ops_body.h)
(serialization and deserialization).
src/common/include/kernels/k210/k210_kernels.h (emulation)

0 `interrupt_enabe`
-------------------

    bit    name
    ------ ----------------------
    0      `int_en`               Generate interuupt after layer computation finished
    1      `ram_flag`             ?
    2      `full_add`             Set in `kpu_conv2d_output_full_add`
    3      `depth_wise_layer`     Is a "depth-wise" layer (1 if enabled)
    4..63  reserved

"depth-wise" affects meny of the computations: it likely means that the layer
computation mixes multiple channels so that they cannot be processed one by
one.

1 `image_addr`
--------------

    bit    name
    ------ ----------------------
    0..14  `image_src_addr`       Image source address
    15     reserved
    16..30 `image_dst_addr`       Image destination address
    31..63 reserved

`image_src_addr` and `image_dst_addr` are specified in 64-byte units relative to the base of "AI" memory.

2 `image_channel_num`
---------------------

    bit    name
    ------ ----------------------
    0..9   `i_ch_num`             Number of input channels (minus one)
    10..31 reserved
    32..41 `o_ch_num`             Number of output channels (minus one)
    42..47 reserved
    48..57 `o_ch_num_coef`        Number of output channel coefficients (minus one)
    58..63 reserved

3 `image_size`
--------------

    bit    name
    ------ ----------------------
    0..9   `i_row_wid`            Input row width (minus one)
    10..18 `i_col_high`           Input column height (minus one)
    19..31 reserved
    32..41 `o_row_wid`            Output row width (minus one)
    42..50 `o_col_high`           Output column height (minus one)
    51..63 reserved

4 `kernel_pool_type_cfg`
------------------------

    bit    name
    ------ ----------------------
    0..2   `kernel_type`      `filter_type_t` (see below)
    3      `pad_type`         Always 1
    4..7   `pool_type`        `pool_type_t` (see below)
    8      `first_stride`     ?
    9      `bypass_conv`      ?
    10     `load_para`        Load parameters (1 if enabled)
    11..15 reserved
    16..23 `dma_burst_size`   Always 15
    24..31 `pad_value`        Padding value
    32..63 `bwsx_base_addr`   Batch normalization array base address (8-aligned, `kpu_batchnorm_argument_t`)

`kpu_filter_type`:

    value   enum
    ------  -----------
    0       1x1
    1       3x3

`kpu_pool_type`:

    value   enum           description
    ------  -------------- ------------------
    0       bypass         bypass pooling (filter size 1×1, stride 1)
    1       max_2_s2       max pooling (filter size 2×2, stride 2)
    2       mean_2_s2      mean pooling (filter size 2×2, stride 2)
    3       max_4_s4       max pooling (filter size 4×4, stride 4)
    4       mean_4_s4      mean pooling (filter size 4×4, stride 4)
    5       left_top_2_s2  pick left top (filter size 2×2, stride 2)
    6       right_top_2_s2 pick right top (filter size 2×2, stride 2)
    7       left_top_4_s4  pick left top (filter size 4×4, stride 4)
    8       mean_2_s1      mean pooling (filter size 2×2, stride 1)
    9       max_2_s1       max pooling (filter size 2×2, stride 1)

See `kpu_pool2d` in `src/common/include/kernels/k210/k210_kernels.h`,
as well as `src/common/include/runtime/k210/k210_runtime_op_utility.h` in nncase.

5 `kernel_load_cfg`
-------------------

    bit    name
    ------ ----------------------
    0      `load_coor`       Always 1
    1..6   `load_time`       Parameter load frequency (0=once, 1=per channel?)
    7..14  reserved
    15..31 `para_size`       Parameter (weights) size
    32..63 `para_start_addr` Parameter (weights) start address (128-aligned, one byte per weight)

6 `kernel_offset`
-----------------

    bit    name
    ------ ----------------------
    0..3   `coef_column_offset`  ?
    4..15  `coef_row_offset`     ?
    16..63 reserved

7 `kernel_calc_type_cfg`
------------------------

    bit    name
    ------ ----------------------
    0..14  `channel_switch_addr`  In layout channel length
    15     reserved
    16..19 `row_switch_addr`      In layout row length
    20..27 `coef_size`            ?
    28..30 `coef_group`           ?
    31     `load_act`             Load activation function (1 is enabled)
    32..63 `active_addr`          Activation function address (256-aligned `kpu_activate_table_t`)

8 `write_back_cfg`
------------------

    bit    name
    ------ ----------------------
    0..14  `wb_channel_switch_addr`  Out layout channel length
    15     reserved
    16..19 `wb_row_switch_addr`      Out layout row length
    20..22 `wb_group`                Out layout number of groups
    23..63 reserved

9 `conv_value`
--------------

    bit    name
    ------ ----------------------
    0..3   `shr_w`                   Convolution value shift right w
    4..7   `shr_x`                   Convolution value shift right x
    8..31  `arg_w`                   Convolution value w multiplier
    32..55 `arg_x`                   Convolution value x multiplier
    56..63 reserved

10 `conv_value2`
----------------

    bit    name
    ------ ----------------------
    0..39  `arg_add`                 Convolution value addition/bias
    40..63 reserved

11 `dma_parameter`
------------------

    bit    name
    ------ ----------------------
    0      `send_data_out`           Send data out to DMA (main memory)
    1..15  reserved
    16..31 `channel_byte_num`        Number of bytes per out channel (minus one)
    32..63 `dma_total_byte`          Number of bytes total out (minus one)