mirror of
https://github.com/laanwj/k210-sdk-stuff.git
synced 2024-11-25 02:46:18 +04:00
276 lines
11 KiB
Markdown
276 lines
11 KiB
Markdown
|
KPU
|
|||
|
===
|
|||
|
|
|||
|
Some notes about the K210 KPU, which is definitely the weirdest, possibly
|
|||
|
most interesting peripheral on this SoC. Documentation doesn't seem to be
|
|||
|
available, so the information here has been reconstructed from various vendor
|
|||
|
source code.
|
|||
|
|
|||
|
This kind of custom hardware is pretty much impossible to understand without
|
|||
|
knowledge of the domain, in this case Convolutional Neural Networks on images.
|
|||
|
My understanding of this is rudimentary (my last brush with it was in uni) so
|
|||
|
I may be missing some obvious clues here and there.
|
|||
|
|
|||
|
From the datasheet
|
|||
|
==================
|
|||
|
|
|||
|
The Kendryte datasheet has the following information on the KPU:
|
|||
|
|
|||
|
> KPU is a general-purpose neural network processor with built-in convolution,
|
|||
|
> batch normalization, activation, and pooling operations. It can detect faces or
|
|||
|
> objects in real time. The specific characteristics are as follows:
|
|||
|
>
|
|||
|
> - Supports the fixed-point model that the mainstream training framework trains
|
|||
|
> according to specific restriction rules
|
|||
|
> - There is no direct limit on the number of network layers, and each layer of
|
|||
|
> convolutional neural network parameters can be configured separately, including
|
|||
|
> the number of input and output channels, and the input and output line width
|
|||
|
> and column height
|
|||
|
> - Support for 1x1 and 3x3 convolution kernels
|
|||
|
|
|||
|
1×1 and 3×3 is not a very wide range of supported convolutions, but maybe the most
|
|||
|
common ones in this specific application area…
|
|||
|
|
|||
|
> - Support for any form of activation function
|
|||
|
|
|||
|
This is definitely true, Normalization functions seem to be represented as an array of 16
|
|||
|
segments (`kpu_activate_table_t`).
|
|||
|
|
|||
|
> - The maximum supported neural network parameter size for real-time work is 5MiB
|
|||
|
> to 5.9MiB
|
|||
|
> - The maximum supported network parameter size when working in non-real time is
|
|||
|
> (flash size - software size)
|
|||
|
|
|||
|
The flash size specs are somewhat of a red herring as they relate to software
|
|||
|
instead of hardare: the KPU does not have logic for loading parameters from
|
|||
|
flash.
|
|||
|
|
|||
|
Some other source mentions:
|
|||
|
|
|||
|
> 64 KPU which are 576bit width, supports convolution kernel. Offers
|
|||
|
> 0.25TOPS@0.3W,400MHz, and when you overclock to 800MHz, it offers 0.5TOPS,
|
|||
|
> meaning you can do object recognition 60fps@VGA.
|
|||
|
|
|||
|
Clock speed
|
|||
|
===========
|
|||
|
|
|||
|
The KPU is clocked from PLL1, with a divisor between 1 and 16.
|
|||
|
The usual clock speed in the Sipeed examples is 300, sometimes 400 MHz.
|
|||
|
According to some mentions in the data sheet it's possible to clock it to 800 MHz.
|
|||
|
|
|||
|
Overall execution flow
|
|||
|
======================
|
|||
|
|
|||
|
The overall execution flow is that the KPU runs a neural network layer by
|
|||
|
layer. This happens in a sequential fashion. Each layer can be considered a
|
|||
|
separate set of instructions for the KPU.
|
|||
|
|
|||
|
A layer can receive its input in the "AI" memory area (2MB of the memory is reserved for this,
|
|||
|
from 0x40600000 to 0x407fffff) as well as write its output there. The input and
|
|||
|
output can consist of multiple channels (R/G/B for example).
|
|||
|
|
|||
|
It is possible to set an interrupt to notify the host CPU when a specific layer has
|
|||
|
finished executing.
|
|||
|
|
|||
|
Looking at `lib/drivers/kpu.c` in the SDK, function `ai_step`, many types of CNN layers are
|
|||
|
implemented in software instead of executed by the KPU. I suppose they accelerated the
|
|||
|
most common multiplication-intensive layers in hardware, which is `KL_K210_CONV`.
|
|||
|
|
|||
|
Peripehral layout
|
|||
|
=================
|
|||
|
|
|||
|
The register layout of the peripheral is as folllows. Source: `lib/drivers/include/kpu.h`.
|
|||
|
All registers are 64-bit.
|
|||
|
|
|||
|
| Ofs | Name | Description |
|
|||
|
| ----- | ----------------- | ------------------------------------------------------------- |
|
|||
|
| 0x00 | `layer_argument_fifo` | Layer arguments (instructions) are submitted here |
|
|||
|
| 0x08 | `interrupt_status` | Status of pending interrupts |
|
|||
|
| 0x10 | `interrupt_raw` | |
|
|||
|
| 0x18 | `interrupt_mask` | Specifies which global interrupts are enabled |
|
|||
|
| 0x20 | `interrupt_clear` | Clear pending interrupts |
|
|||
|
| 0x28 | `fifo_threshold` | FIFO interrupt thresholds |
|
|||
|
| 0x30 | `fifo_data_out` | Data output FIFO read register |
|
|||
|
| 0x38 | `fifo_ctrl` | Flush FIFOs |
|
|||
|
| 0x40 | `eight_bit_mode` | Enable 8-bit instead of 16-bit precision |
|
|||
|
|
|||
|
Layer format
|
|||
|
============
|
|||
|
|
|||
|
KPU neural network layers are represented by a series of 12 64-bit values,
|
|||
|
submitted to the layer argument FIFO one by one. The overall structure of the bit fields is
|
|||
|
available in `lib/drivers/include/kpu.h`.
|
|||
|
|
|||
|
It looks like the generation of models is supposed to be done offline by a tool called [nnscase](https://github.com/kendryte/nncase),
|
|||
|
which compiles TensorFlow models to a specific internal representation.
|
|||
|
The k210-specific code parts are [k210_ops.cpp](https://github.com/kendryte/nncase/tree/master/src/codegen/ops/k210/k210_ops.cpp)
|
|||
|
and [k210_sim_types.h](https://github.com/kendryte/nncase/blob/master/src/common/include/runtime/k210/k210_sim_types.h)
|
|||
|
and [k210_ops_body.h](https://github.com/kendryte/nncase/blob/master/src/common/include/runtime/k210/k210_ops_body.h)
|
|||
|
(serialization and deserialization).
|
|||
|
src/common/include/kernels/k210/k210_kernels.h (emulation)
|
|||
|
|
|||
|
0 `interrupt_enabe`
|
|||
|
-------------------
|
|||
|
|
|||
|
bit name
|
|||
|
------ ----------------------
|
|||
|
0 `int_en` Generate interuupt after layer computation finished
|
|||
|
1 `ram_flag` ?
|
|||
|
2 `full_add` Set in `kpu_conv2d_output_full_add`
|
|||
|
3 `depth_wise_layer` Is a "depth-wise" layer (1 if enabled)
|
|||
|
4..63 reserved
|
|||
|
|
|||
|
"depth-wise" affects meny of the computations: it likely means that the layer
|
|||
|
computation mixes multiple channels so that they cannot be processed one by
|
|||
|
one.
|
|||
|
|
|||
|
1 `image_addr`
|
|||
|
--------------
|
|||
|
|
|||
|
bit name
|
|||
|
------ ----------------------
|
|||
|
0..14 `image_src_addr` Image source address
|
|||
|
15 reserved
|
|||
|
16..30 `image_dst_addr` Image destination address
|
|||
|
31..63 reserved
|
|||
|
|
|||
|
`image_src_addr` and `image_dst_addr` are specified in 64-byte units relative to the base of "AI" memory.
|
|||
|
|
|||
|
2 `image_channel_num`
|
|||
|
---------------------
|
|||
|
|
|||
|
bit name
|
|||
|
------ ----------------------
|
|||
|
0..9 `i_ch_num` Number of input channels (minus one)
|
|||
|
10..31 reserved
|
|||
|
32..41 `o_ch_num` Number of output channels (minus one)
|
|||
|
42..47 reserved
|
|||
|
48..57 `o_ch_num_coef` Number of output channel coefficients (minus one)
|
|||
|
58..63 reserved
|
|||
|
|
|||
|
3 `image_size`
|
|||
|
--------------
|
|||
|
|
|||
|
bit name
|
|||
|
------ ----------------------
|
|||
|
0..9 `i_row_wid` Input row width (minus one)
|
|||
|
10..18 `i_col_high` Input column height (minus one)
|
|||
|
19..31 reserved
|
|||
|
32..41 `o_row_wid` Output row width (minus one)
|
|||
|
42..50 `o_col_high` Output column height (minus one)
|
|||
|
51..63 reserved
|
|||
|
|
|||
|
4 `kernel_pool_type_cfg`
|
|||
|
------------------------
|
|||
|
|
|||
|
bit name
|
|||
|
------ ----------------------
|
|||
|
0..2 `kernel_type` `filter_type_t` (see below)
|
|||
|
3 `pad_type` Always 1
|
|||
|
4..7 `pool_type` `pool_type_t` (see below)
|
|||
|
8 `first_stride` ?
|
|||
|
9 `bypass_conv` ?
|
|||
|
10 `load_para` Load parameters (1 if enabled)
|
|||
|
11..15 reserved
|
|||
|
16..23 `dma_burst_size` Always 15
|
|||
|
24..31 `pad_value` Padding value
|
|||
|
32..63 `bwsx_base_addr` Batch normalization array base address (8-aligned, `kpu_batchnorm_argument_t`)
|
|||
|
|
|||
|
`kpu_filter_type`:
|
|||
|
|
|||
|
value enum
|
|||
|
------ -----------
|
|||
|
0 1x1
|
|||
|
1 3x3
|
|||
|
|
|||
|
`kpu_pool_type`:
|
|||
|
|
|||
|
value enum description
|
|||
|
------ -------------- ------------------
|
|||
|
0 bypass bypass pooling (filter size 1×1, stride 1)
|
|||
|
1 max_2_s2 max pooling (filter size 2×2, stride 2)
|
|||
|
2 mean_2_s2 mean pooling (filter size 2×2, stride 2)
|
|||
|
3 max_4_s4 max pooling (filter size 4×4, stride 4)
|
|||
|
4 mean_4_s4 mean pooling (filter size 4×4, stride 4)
|
|||
|
5 left_top_2_s2 pick left top (filter size 2×2, stride 2)
|
|||
|
6 right_top_2_s2 pick right top (filter size 2×2, stride 2)
|
|||
|
7 left_top_4_s4 pick left top (filter size 4×4, stride 4)
|
|||
|
8 mean_2_s1 mean pooling (filter size 2×2, stride 1)
|
|||
|
9 max_2_s1 max pooling (filter size 2×2, stride 1)
|
|||
|
|
|||
|
See `kpu_pool2d` in `src/common/include/kernels/k210/k210_kernels.h`,
|
|||
|
as well as `src/common/include/runtime/k210/k210_runtime_op_utility.h` in nncase.
|
|||
|
|
|||
|
5 `kernel_load_cfg`
|
|||
|
-------------------
|
|||
|
|
|||
|
bit name
|
|||
|
------ ----------------------
|
|||
|
0 `load_coor` Always 1
|
|||
|
1..6 `load_time` Parameter load frequency (0=once, 1=per channel?)
|
|||
|
7..14 reserved
|
|||
|
15..31 `para_size` Parameter (weights) size
|
|||
|
32..63 `para_start_addr` Parameter (weights) start address (128-aligned, one byte per weight)
|
|||
|
|
|||
|
6 `kernel_offset`
|
|||
|
-----------------
|
|||
|
|
|||
|
bit name
|
|||
|
------ ----------------------
|
|||
|
0..3 `coef_column_offset` ?
|
|||
|
4..15 `coef_row_offset` ?
|
|||
|
16..63 reserved
|
|||
|
|
|||
|
7 `kernel_calc_type_cfg`
|
|||
|
------------------------
|
|||
|
|
|||
|
bit name
|
|||
|
------ ----------------------
|
|||
|
0..14 `channel_switch_addr` In layout channel length
|
|||
|
15 reserved
|
|||
|
16..19 `row_switch_addr` In layout row length
|
|||
|
20..27 `coef_size` ?
|
|||
|
28..30 `coef_group` ?
|
|||
|
31 `load_act` Load activation function (1 is enabled)
|
|||
|
32..63 `active_addr` Activation function address (256-aligned `kpu_activate_table_t`)
|
|||
|
|
|||
|
8 `write_back_cfg`
|
|||
|
------------------
|
|||
|
|
|||
|
bit name
|
|||
|
------ ----------------------
|
|||
|
0..14 `wb_channel_switch_addr` Out layout channel length
|
|||
|
15 reserved
|
|||
|
16..19 `wb_row_switch_addr` Out layout row length
|
|||
|
20..22 `wb_group` Out layout number of groups
|
|||
|
23..63 reserved
|
|||
|
|
|||
|
9 `conv_value`
|
|||
|
--------------
|
|||
|
|
|||
|
bit name
|
|||
|
------ ----------------------
|
|||
|
0..3 `shr_w` Convolution value shift right w
|
|||
|
4..7 `shr_x` Convolution value shift right x
|
|||
|
8..31 `arg_w` Convolution value w multiplier
|
|||
|
32..55 `arg_x` Convolution value x multiplier
|
|||
|
56..63 reserved
|
|||
|
|
|||
|
10 `conv_value2`
|
|||
|
----------------
|
|||
|
|
|||
|
bit name
|
|||
|
------ ----------------------
|
|||
|
0..39 `arg_add` Convolution value addition/bias
|
|||
|
40..63 reserved
|
|||
|
|
|||
|
11 `dma_parameter`
|
|||
|
------------------
|
|||
|
|
|||
|
bit name
|
|||
|
------ ----------------------
|
|||
|
0 `send_data_out` Send data out to DMA (main memory)
|
|||
|
1..15 reserved
|
|||
|
16..31 `channel_byte_num` Number of bytes per out channel (minus one)
|
|||
|
32..63 `dma_total_byte` Number of bytes total out (minus one)
|