Xilinx AI SDK

The user calls the Xilinx AI SDK through its interface to send an image (cv::Mat) to the neural network. The typical call sequence is listed below, followed by a code sketch.

  • Select an image and read it into an OpenCV cv::Mat using the OpenCV imread method.
  • Call the create method provided by the corresponding library for the desired network to get a class instance.
  • Call getInputWidth() and getInputHeight() to get the input image dimensions required by the given network.
  • Resize the image to those dimensions.
  • Run the image through the network.
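
Below is a minimal sketch of these steps, assuming the face-detection library from the Xilinx AI SDK; the header path, namespace, and model name (xilinx::ai::FaceDetect, FACEDETECT_DENSE_BOX_640x360) are illustrative assumptions and vary per network and SDK version.

    // Sketch of the call sequence above; the header path and FaceDetect
    // names are assumed for illustration, not taken from this page.
    #include <opencv2/opencv.hpp>
    #include <xilinx/ai/facedetect.hpp>  // assumed header

    int main() {
      cv::Mat image = cv::imread("sample.jpg");      // 1. read image into cv::Mat
      auto net = xilinx::ai::FaceDetect::create(
          xilinx::ai::FACEDETECT_DENSE_BOX_640x360); // 2. create a network instance
      cv::Mat resized;
      cv::resize(image, resized,
                 cv::Size(net->getInputWidth(),      // 3. query the required
                          net->getInputHeight()));   //    input dimensions
      auto results = net->run(resized);              // 4.-5. resize, then run
      return 0;
    }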

DNNDK

Deep Neural Network Development Kit
Workflow for deploying a neural network:

  • Compress the neural network model.
  • Compile the neural network model.
  • Program with DNNDK APIs.
  • Compile the hybrid DPU application.
  • Run the hybrid DPU executable.

DPU = Deep-learning Processing Unit. The DNNDK programming model centers on four abstractions, described below:

  • DPU Kernel
  • DPU Task
  • DPU Node
  • DPU Tensor

After being compiled by the Deep Neural Network Compiler (DNNC), the neural network model is transformed into an equivalent DPU assembly file, which is then assembled into an ELF object file by the Deep Neural Network Assembler (DNNAS). This DPU ELF object file is regarded as a DPU kernel.

The kernel is loaded into the DPU's dedicated memory space and allocated hardware resources. After that, each DPU kernel can be instantiated into several DPU tasks by calling dpuCreateTask() to enable multi-threaded programming. Each task has its own private memory space.
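
Below is a minimal sketch of the kernel and task lifecycle, assuming the <dnndk/dnndk.h> umbrella header from later DNNDK releases; the kernel name "resnet50" is a placeholder for whatever name DNNC assigned to the compiled model.

    #include <dnndk/dnndk.h>

    int main() {
      dpuOpen();                                            // attach to the DPU driver
      DPUKernel *kernel = dpuLoadKernel("resnet50");        // load kernel ELF into DPU memory
      DPUTask *task = dpuCreateTask(kernel, T_MODE_NORMAL); // instantiate one task
      dpuRunTask(task);                                     // launch the task on the DPU
      dpuDestroyTask(task);                                 // free the task's private memory
      dpuDestroyKernel(kernel);                             // release the kernel's resources
      dpuClose();                                           // detach from the DPU driver
      return 0;
    }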

A node is a basic building block of a network, with inputs, outputs, and parameters. Nodes are identified by unique names.

A DPU tensor is a collection of multi-dimensional data used to store information during execution. Tensor properties (such as height, width, and channel) can be obtained using APIs exported by DNNDK.
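
Below is a minimal sketch of querying tensor properties, assuming the two-argument form of dpuGetInputTensor() and a placeholder node name "conv1" taken from the compiled model.

    #include <cstdio>
    #include <dnndk/dnndk.h>

    void print_input_shape(DPUTask *task) {
      DPUTensor *in = dpuGetInputTensor(task, "conv1");  // input tensor of node "conv1"
      std::printf("input: %d x %d x %d\n",
                  dpuGetTensorHeight(in),                // tensor height
                  dpuGetTensorWidth(in),                 // tensor width
                  dpuGetTensorChannel(in));              // tensor channel count
    }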

It is common to exchange data between the CPU and the DPU when programming for the DPU. For example, data preprocessed by the CPU can be sent to the DPU for acceleration, and the output produced by the DPU might need to be copied back to the CPU for further processing. To handle this type of operation, DNNDK provides a set of APIs that make data exchange easy.
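
Below is a minimal sketch of such an exchange, assuming the DNNDK FP32 helper APIs and placeholder node names "conv1" (input node) and "fc1000" (output node).

    #include <dnndk/dnndk.h>
    #include <vector>

    void run_once(DPUTask *task, std::vector<float> &pixels) {
      // CPU -> DPU: copy preprocessed float data into the input tensor
      // (the API quantizes it to the tensor's fixed-point format).
      dpuSetInputTensorInHWCFP32(task, "conv1", pixels.data(),
                                 static_cast<int>(pixels.size()));
      dpuRunTask(task);  // run inference on the DPU

      // DPU -> CPU: copy the output tensor back as floats for post-processing.
      int out_size = dpuGetOutputTensorSize(task, "fc1000");
      std::vector<float> logits(out_size);
      dpuGetOutputTensorInHWCFP32(task, "fc1000", logits.data(), out_size);
    }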

The DPU kernel ELF object file is linked with the compiled CPU application code (the output of GCC, for example) into a single hybrid binary that contains everything required to run on both the CPU and the DPU. A DPU loader handles dynamic relocation of the kernel at run time.

To develop deep learning applications on the DPU, three types of work must be done:

  • Use DNNDK APIs to manage DPU kernels.
    • DPU kernel creation and destruction.
    • DPU task creation.
    • Managing input and output tensors.
  • Implement on the CPU any kernels that the DPU does not support.
  • Add pre-processing and post-processing routines to read in data or calculate results (see the sketch below).
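
Below is a minimal sketch of the last item: a plain CPU-side softmax that turns the float logits copied back from the DPU into class probabilities (the function itself is standard C++, independent of DNNDK).

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    std::vector<float> softmax(const std::vector<float> &logits) {
      // Subtract the max logit before exponentiating for numerical stability.
      float max_logit = *std::max_element(logits.begin(), logits.end());
      std::vector<float> probs(logits.size());
      float sum = 0.0f;
      for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
      }
      for (float &p : probs) p /= sum;  // normalize so probabilities sum to 1
      return probs;
    }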
