KPU
===

Some notes about the K210 KPU, which is definitely the weirdest, possibly
most interesting peripheral on this SoC. Documentation doesn't seem to be
available, so the information here has been reconstructed from various vendor
source code.

This kind of custom hardware is pretty much impossible to understand without
knowledge of the domain, in this case Convolutional Neural Networks on images.
My understanding of this is rudimentary (my last brush with it was in uni) so
I may be missing some obvious clues here and there.

From the datasheet
==================

The Kendryte datasheet has the following information on the KPU:

> KPU is a general-purpose neural network processor with built-in convolution,
> batch normalization, activation, and pooling operations. It can detect faces or
> objects in real time. The specific characteristics are as follows:
>
> - Supports the fixed-point model that the mainstream training framework trains
>   according to specific restriction rules
> - There is no direct limit on the number of network layers, and each layer of
>   convolutional neural network parameters can be configured separately, including
>   the number of input and output channels, and the input and output line width
>   and column height
> - Support for 1x1 and 3x3 convolution kernels

1×1 and 3×3 is not a very wide range of supported convolutions, but these are probably
the most common ones in this specific application area…

> - Support for any form of activation function

This is definitely true: activation functions seem to be represented as an array of 16
segments (`kpu_activate_table_t`).

> - The maximum supported neural network parameter size for real-time work is 5MiB
>   to 5.9MiB
> - The maximum supported network parameter size when working in non-real time is
>   (flash size - software size)

The flash size specs are somewhat of a red herring as they relate to software
instead of hardware: the KPU does not have logic for loading parameters from
flash.

Some other source mentions:

> 64 KPU which are 576bit width, supports convolution kernel. Offers
> 0.25TOPS@0.3W,400MHz, and when you overclock to 800MHz, it offers 0.5TOPS,
> meaning you can do object recognition 60fps@VGA.

Clock speed
===========

The KPU is clocked from PLL1, with a divisor between 1 and 16.
The usual clock speed in the Sipeed examples is 300 MHz, sometimes 400 MHz.
According to some mentions in the datasheet it is possible to clock it at 800 MHz.

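For illustration, a minimal sketch of what selecting this clock might look like with the
Kendryte standalone SDK's `sysctl` API (the function names and the 400 MHz target are
assumptions based on vendor examples, not something documented here):

```c
#include <sysctl.h>

/* Sketch: clock the KPU ("AI" clock domain) from PLL1.
 * Function names follow the Kendryte standalone SDK's sysctl API. */
static void kpu_clock_init(void)
{
    /* Run PLL1 at an assumed 400 MHz; the KPU clock is derived from it. */
    sysctl_pll_set_freq(SYSCTL_PLL1, 400000000UL);
    /* Threshold 0 selects the smallest divisor (divide by 1) for the AI clock. */
    sysctl_clock_set_threshold(SYSCTL_THRESHOLD_AI, 0);
    /* Ungate the AI (KPU) clock. */
    sysctl_clock_enable(SYSCTL_CLOCK_AI);
}
```
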
Overall execution flow
======================

The overall execution flow is that the KPU runs a neural network layer by
layer. This happens in a sequential fashion. Each layer can be considered a
separate set of instructions for the KPU.

A layer can receive its input in the "AI" memory area (2MB of memory is reserved for this,
from 0x40600000 to 0x407fffff) as well as write its output there. The input and
output can consist of multiple channels (R/G/B for example).

It is possible to set an interrupt to notify the host CPU when a specific layer has
finished executing.

Looking at `lib/drivers/kpu.c` in the SDK, function `ai_step`, many types of CNN layers are
implemented in software instead of executed by the KPU. I suppose they accelerated the
most multiplication-intensive layer type, `KL_K210_CONV`, in hardware.

Peripheral layout
=================

The register layout of the peripheral is as follows. Source: `lib/drivers/include/kpu.h`.
All registers are 64-bit.

| Ofs  | Name                  | Description                                       |
| ---- | --------------------- | ------------------------------------------------- |
| 0x00 | `layer_argument_fifo` | Layer arguments (instructions) are submitted here |
| 0x08 | `interrupt_status`    | Status of pending interrupts                      |
| 0x10 | `interrupt_raw`       |                                                   |
| 0x18 | `interrupt_mask`      | Specifies which global interrupts are enabled     |
| 0x20 | `interrupt_clear`     | Clear pending interrupts                          |
| 0x28 | `fifo_threshold`      | FIFO interrupt thresholds                         |
| 0x30 | `fifo_data_out`       | Data output FIFO read register                    |
| 0x38 | `fifo_ctrl`           | Flush FIFOs                                       |
| 0x40 | `eight_bit_mode`      | Enable 8-bit instead of 16-bit precision          |

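As a rough sketch, the register block can be modelled as a C struct of raw 64-bit words
(the SDK's own `kpu_config_t` breaks each register into bit fields, so this is a
simplification):

```c
#include <stdint.h>

/* Simplified view of the KPU register block; every register is a raw
 * 64-bit word here, while the SDK describes each one with bit fields. */
typedef struct {
    volatile uint64_t layer_argument_fifo; /* 0x00: submit layer arguments here */
    volatile uint64_t interrupt_status;    /* 0x08 */
    volatile uint64_t interrupt_raw;       /* 0x10 */
    volatile uint64_t interrupt_mask;      /* 0x18 */
    volatile uint64_t interrupt_clear;     /* 0x20 */
    volatile uint64_t fifo_threshold;      /* 0x28 */
    volatile uint64_t fifo_data_out;       /* 0x30 */
    volatile uint64_t fifo_ctrl;           /* 0x38 */
    volatile uint64_t eight_bit_mode;      /* 0x40 */
} kpu_regs_t;
```
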
Layer format
============

KPU neural network layers are represented by a series of 12 64-bit values,
submitted to the layer argument FIFO one by one. The overall structure of the bit fields is
available in `lib/drivers/include/kpu.h`.

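A minimal sketch of what submitting one layer might look like; the base address below is an
assumption (`AI_BASE_ADDR` in the SDK), and the real driver also sets up DMA and interrupts:

```c
#include <stdint.h>

/* Assumed base address of the KPU register block. */
#define KPU_BASE 0x40800000UL

/* Push one layer descriptor (12 x 64 bit) into the layer argument FIFO. */
static void kpu_submit_layer(const uint64_t layer[12])
{
    /* layer_argument_fifo sits at offset 0x00 of the register block. */
    volatile uint64_t *fifo = (volatile uint64_t *)KPU_BASE;
    for (int i = 0; i < 12; i++)
        *fifo = layer[i]; /* repeated writes to the same FIFO register */
}
```
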
It looks like the generation of models is supposed to be done offline by a tool called [nncase](https://github.com/kendryte/nncase),
which compiles TensorFlow models to a specific internal representation.
The K210-specific code parts are [k210_ops.cpp](https://github.com/kendryte/nncase/tree/master/src/codegen/ops/k210/k210_ops.cpp),
[k210_sim_types.h](https://github.com/kendryte/nncase/blob/master/src/common/include/runtime/k210/k210_sim_types.h) and
[k210_ops_body.h](https://github.com/kendryte/nncase/blob/master/src/common/include/runtime/k210/k210_ops_body.h)
(serialization and deserialization), as well as
`src/common/include/kernels/k210/k210_kernels.h` (emulation).

0 `interrupt_enabe`
-------------------

bit     name                description
------  ------------------  ----------------------------------------------------
0       `int_en`            Generate interrupt after layer computation finished
1       `ram_flag`          ?
2       `full_add`          Set in `kpu_conv2d_output_full_add`
3       `depth_wise_layer`  Is a "depth-wise" layer (1 if enabled)
4..63   reserved

"depth-wise" affects meny of the computations: it likely means that the layer
|
||||
computation mixes multiple channels so that they cannot be processed one by
|
||||
one.
|
||||
|
||||
1 `image_addr`
--------------

bit     name              description
------  ----------------  ---------------------------
0..14   `image_src_addr`  Image source address
15      reserved
16..30  `image_dst_addr`  Image destination address
31..63  reserved

`image_src_addr` and `image_dst_addr` are specified in 64-byte units relative to the base of "AI" memory.

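As a sketch, converting between byte offsets in AI memory and these fields looks like this
(`AI_RAM_BASE` is the start of the AI memory area mentioned above; the helpers are
hypothetical):

```c
#include <stdint.h>

#define AI_RAM_BASE 0x40600000UL /* start of the 2MB "AI" memory area */

/* image_src_addr / image_dst_addr are in 64-byte units relative to AI_RAM_BASE. */
static inline uint64_t ai_offset_to_field(uintptr_t byte_offset)
{
    return byte_offset / 64; /* offsets are expected to be 64-byte aligned */
}

static inline uintptr_t ai_field_to_address(uint64_t field)
{
    return (uintptr_t)(AI_RAM_BASE + field * 64);
}
```
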
2 `image_channel_num`
---------------------

bit     name             description
------  ---------------  -------------------------------------------------
0..9    `i_ch_num`       Number of input channels (minus one)
10..31  reserved
32..41  `o_ch_num`       Number of output channels (minus one)
42..47  reserved
48..57  `o_ch_num_coef`  Number of output channel coefficients (minus one)
58..63  reserved

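A small sketch of packing this word, illustrating the minus-one encoding (the field layout
is taken from the table above; the helper itself is hypothetical):

```c
#include <stdint.h>

/* Pack image_channel_num; channel counts are stored minus one,
 * so e.g. 3 input channels is encoded as 2. */
static uint64_t pack_image_channel_num(unsigned i_ch, unsigned o_ch, unsigned o_ch_coef)
{
    return  ((uint64_t)(i_ch      - 1) & 0x3ff)        |
           (((uint64_t)(o_ch      - 1) & 0x3ff) << 32) |
           (((uint64_t)(o_ch_coef - 1) & 0x3ff) << 48);
}
```
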
3 `image_size`
--------------

bit     name          description
------  ------------  -------------------------------
0..9    `i_row_wid`   Input row width (minus one)
10..18  `i_col_high`  Input column height (minus one)
19..31  reserved
32..41  `o_row_wid`   Output row width (minus one)
42..50  `o_col_high`  Output column height (minus one)
51..63  reserved

4 `kernel_pool_type_cfg`
------------------------

bit     name              description
------  ----------------  ------------------------------------------------------------------------------
0..2    `kernel_type`     `filter_type_t` (see below)
3       `pad_type`        Always 1
4..7    `pool_type`       `pool_type_t` (see below)
8       `first_stride`    ?
9       `bypass_conv`     ?
10      `load_para`       Load parameters (1 if enabled)
11..15  reserved
16..23  `dma_burst_size`  Always 15
24..31  `pad_value`       Padding value
32..63  `bwsx_base_addr`  Batch normalization array base address (8-aligned, `kpu_batchnorm_argument_t`)

`kpu_filter_type`:

value   enum
------  -----
0       1x1
1       3x3

`kpu_pool_type`:

value   enum             description
------  ---------------  -------------------------------------------
0       bypass           bypass pooling (filter size 1×1, stride 1)
1       max_2_s2         max pooling (filter size 2×2, stride 2)
2       mean_2_s2        mean pooling (filter size 2×2, stride 2)
3       max_4_s4         max pooling (filter size 4×4, stride 4)
4       mean_4_s4        mean pooling (filter size 4×4, stride 4)
5       left_top_2_s2    pick left top (filter size 2×2, stride 2)
6       right_top_2_s2   pick right top (filter size 2×2, stride 2)
7       left_top_4_s4    pick left top (filter size 4×4, stride 4)
8       mean_2_s1        mean pooling (filter size 2×2, stride 1)
9       max_2_s1         max pooling (filter size 2×2, stride 1)

See `kpu_pool2d` in `src/common/include/kernels/k210/k210_kernels.h`,
as well as `src/common/include/runtime/k210/k210_runtime_op_utility.h` in nncase.

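The table above can also be condensed into a small lookup; this sketch is based only on the
table here, not on the nncase sources:

```c
/* Filter size and stride for each kpu_pool_type value, per the table above. */
static void pool_type_params(unsigned pool_type, unsigned *filter, unsigned *stride)
{
    switch (pool_type) {
    case 0:                  *filter = 1; *stride = 1; break; /* bypass */
    case 1: case 2:
    case 5: case 6:          *filter = 2; *stride = 2; break; /* 2x2, stride 2 */
    case 3: case 4: case 7:  *filter = 4; *stride = 4; break; /* 4x4, stride 4 */
    case 8: case 9:          *filter = 2; *stride = 1; break; /* 2x2, stride 1 */
    default:                 *filter = 0; *stride = 0; break; /* unknown */
    }
}
```
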
5 `kernel_load_cfg`
-------------------

bit     name               description
------  -----------------  ---------------------------------------------------------------------
0       `load_coor`        Always 1
1..6    `load_time`        Parameter load frequency (0=once, 1=per channel?)
7..14   reserved
15..31  `para_size`        Parameter (weights) size
32..63  `para_start_addr`  Parameter (weights) start address (128-aligned, one byte per weight)

6 `kernel_offset`
-----------------

bit     name                  description
------  --------------------  ------------
0..3    `coef_column_offset`  ?
4..15   `coef_row_offset`     ?
16..63  reserved

7 `kernel_calc_type_cfg`
------------------------

bit     name                   description
------  ---------------------  -------------------------------------------------------------------
0..14   `channel_switch_addr`  In layout channel length
15      reserved
16..19  `row_switch_addr`      In layout row length
20..27  `coef_size`            ?
28..30  `coef_group`           ?
31      `load_act`             Load activation function (1 if enabled)
32..63  `active_addr`          Activation function address (256-aligned, `kpu_activate_table_t`)

8 `write_back_cfg`
------------------

bit     name                      description
------  ------------------------  ------------------------------
0..14   `wb_channel_switch_addr`  Out layout channel length
15      reserved
16..19  `wb_row_switch_addr`      Out layout row length
20..22  `wb_group`                Out layout number of groups
23..63  reserved

9 `conv_value`
--------------

bit     name     description
------  -------  --------------------------------
0..3    `shr_w`  Convolution value shift right w
4..7    `shr_x`  Convolution value shift right x
8..31   `arg_w`  Convolution value w multiplier
32..55  `arg_x`  Convolution value x multiplier
56..63  reserved

10 `conv_value2`
----------------

bit     name       description
------  ---------  --------------------------------
0..39   `arg_add`  Convolution value addition/bias
40..63  reserved

11 `dma_parameter`
------------------

bit     name                description
------  ------------------  --------------------------------------------
0       `send_data_out`     Send data out to DMA (main memory)
1..15   reserved
16..31  `channel_byte_num`  Number of bytes per out channel (minus one)
32..63  `dma_total_byte`    Number of bytes total out (minus one)

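A sketch of how these fields might be derived from the output dimensions, assuming one byte
per output element (8-bit mode); this is an assumption, not taken from the SDK:

```c
#include <stdint.h>

/* Derive the dma_parameter word from the output dimensions,
 * assuming one byte per output element (8-bit mode). */
static uint64_t pack_dma_parameter(unsigned out_w, unsigned out_h,
                                   unsigned out_ch, unsigned send_data_out)
{
    uint64_t channel_byte_num = (uint64_t)out_w * out_h - 1;          /* per-channel bytes, minus one */
    uint64_t dma_total_byte   = (uint64_t)out_w * out_h * out_ch - 1; /* total bytes out, minus one */

    return ((uint64_t)(send_data_out & 1))             |
           ((channel_byte_num & 0xffff)         << 16) |
           ((dma_total_byte   & 0xffffffffULL)  << 32);
}
```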