liblarod  3.1.28
Introduction to larod for app developers

Overview

larod provides a simple unified C API for running machine learning and image preprocessing efficiently. No matter what a particular device can provide in terms of hardware for such operations – be it some specialized deep learning hardware accelerator or simply an ARM CPU – the larod API is all you need to access it.

Purpose and scope

Though the larod API is in some sense quite similar to libraries like Tensorflow Lite and ARMNN, the purpose and scope of larod are actually rather different. The main purpose is to wrap other frameworks (like Tensorflow Lite) and thus let users access various devices (exposing different APIs) through a single unified API. An application could, for example, run both preprocessing on a GPU and neural network inference on an ARM CPU on the same device using only the larod API. The native APIs for these devices may be completely different – for example OpenCL and Tensorflow Lite respectively – but larod abstracts these into one simple and unified API.

Furthermore, the larod stack is designed to solve a number of other problems as well, for instance related to security, power management and asynchronicity. Please see Features and Future features below for more on this.

Architecture

larod is made up of two components:

- liblarod: the library through which applications communicate with the larod service.
- larod: the service (daemon) that schedules and runs the requested jobs on the available devices.

By using the functions exposed by liblarod, an application can load neural network models and define preprocessing operations (crop, scale, image color format conversion). These models are then used together with input data to create job requests, which are sent to the larod service for execution. Model data as well as input and output data are represented by file descriptors in liblarod.

Below we see an example of two apps using larod on a system that has three different devices supported by larod.

      +-----------+      +-----------+
      |           |      |           |
      |   App 1   |      |   App 2   |
      |           |      |           |
      +---------^-+      +-^---------+
                |          |
          Static/Dynamic linking
                |          |
             +--v----------v--+
             |                |
             |    liblarod    |
             |                |
             +-------^--------+
                     |
           D-Bus and Unix socket
                     |
             +-------v--------+
             |                |
             |      larod     |
       +----->     service    <------+
       |     |                |      |
       |     +-------^--------+      |
       |             |               |
       | Device-native SW interfaces |
       |             |               |
+------v-----+ +-----v------+ +------v-----+
|            | |            | |            |
|  Device A  | |  Device B  | |  Device C  |
|            | |            | |            |
+------------+ +------------+ +------------+

Both apps in this example can access all three devices through the same API, even though the devices' native SW interfaces may be completely different. The larod service arbitrates the scheduling of the apps' jobs onto the supported devices.

Features

Easy to get started with

The larod API is very well documented and there are several basic example applications that exhibit how to use it in practice. A simplified workflow is described below together with the essential concepts in liblarod. The `larod-client` command line tool is a good start for interacting with the larod service.

Multiple apps per device

Many runtime libraries shipped with hardware accelerators (devices), e.g. for running neural networks, only let one process (application) access the hardware at a time. This is of course a problem if multiple applications running on the same system want to run inferences on the hardware simultaneously. larod solves this problem by having the larod service act as a proxy process that "owns" the hardware and through which other applications can access the accelerator.

Asynchronous

For any potentially time-consuming calls to larod – such as loading a model or running a neural network inference – there are asynchronous versions. This means that one can, for example, queue up multiple inferences (possibly on different devices) at once without blocking the program execution of one's application. Note that larod provides such an interface even when the device's native SW interface does not support asynchronous calls.

Prioritized jobs

Sometimes certain jobs are more important to finish quickly than others, for example when alternating between an object detection model and a classifier model that runs inferences on the boxes produced by the object detector. If the object detector must run for every frame (say 30 fps) and many boxes have been detected, the classification inferences may block the object detector. To this end, larod supports priorities on inferences: in this example the object detector inferences could be given a higher priority, making larod choose them over classification inferences whenever multiple jobs are in the queue.

Image preprocessing

larod has an interface for easily defining image preprocessing operations such as crop, scale and image color format conversion, and supports several devices that can perform such operations. Read more on this in preprocessing.md.

Minimal overhead

In designing and implementing larod, performance has been of utmost importance. Thus one should expect very little overhead (and indeed benchmarks confirm this) compared to using a native runtime library directly.

Capabilities probing

After loading a model into larod, one can extract information from the model through calls to larod, e.g. about the model's input or output tensors. For example, if one has not constructed the model on one's own, one may want to know what kind of input tensor(s) the model requires: What is the data type? The layout? The dimensions? larod has the answers!
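
As a sketch of what such probing can look like (assuming a model handle obtained as in the larodModel section below, and using larodCreateModelInputs() and larodGetTensorDims() from the larodTensor section; the larodTensorDims fields used here follow our reading of larod.h):

size_t numInputs = 0;
larodTensor** inputs = larodCreateModelInputs(model, &numInputs, &error);
// Inspect the dimensions of the model's first input tensor.
const larodTensorDims* dims = larodGetTensorDims(inputs[0], &error);
for (size_t i = 0; i < dims->len; i++) {
    printf("input dim %zu: %zu\n", i, dims->dims[i]);
}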

Security

In order for an application to use an accelerator, it must have the privileges to do so. Without larod that would mean that, for some accelerators, an application would be granted rights to do things that it should not be allowed to do, like writing to memory that must not be touched. With larod, however, applications require no such permissions: larod acts as a proxy process, which has all the necessary permissions, through which the applications communicate with the hardware.

Zero-copy and map only once

The larod API provides a simple way to avoid all unnecessary copies and mappings of input and output tensor data. The application simply declares how its buffers may be accessed and whether they will be used repeatedly, and larod then takes all possible measures to optimize memory accesses on these buffers. This means both that larod jobs run faster and that larod uses the CPU less, leaving more resources to other processes.

Concepts and workflow

The basic workflow of running a job with larod is as follows:

1. Connect to the larod service (larodConnection).
2. Get a handle to the desired device (larodDevice).
3. Load a model onto the device (larodModel).
4. Set up the input and output tensors (larodTensor).
5. Create a job request and run it (larodJobRequest).

A brief description of each step is given below. For details about each function call, see larod.h.

larodConnection

Connecting to the larod service and creating a larodConnection is as simple as

larodError* error = NULL;
larodConnection* conn = NULL;
if (!larodConnect(&conn, &error)) {
    // Connection failed; a description is found in error->msg.
    larodClearError(&error);
}

The larodConnection represents the session that an application has with the larod service and is used in all the calls that interact with the service.
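
When the application is done using the service, the session should be closed. A minimal sketch (larodDisconnect() is declared in larod.h):

// Close the session; any private models loaded by it are then deleted.
larodDisconnect(&conn, &error);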

larodDevice

The larodDevice struct acts as an abstraction and handle to a specific device and its corresponding native software interface. A larodDevice does not necessarily have to mean something external like a deep-learning accelerator; it can also refer to a software interface that runs on the CPU itself.

A device in larod has a name and an instance number. The name is a string that the user supplies, e.g. in calls to larodGetDevice(). Refer to preprocessing.md and nn-inference to look up the names of the supported preprocessing and inference devices. The instance number of a device (starting from zero) represents a separate entity of that device and serves to distinguish multiple identical devices with the same name.

Before loading a model to run jobs with, a device must be retrieved with e.g. the following API function:

const larodDevice* dev = larodGetDevice(conn, "cpu-tflite", 0, &error);

The example above returns a device handle to the CPU-based TFLite backend. This handle can then be used when loading a model, as shown in the following section.
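
If one does not know beforehand which devices are present, they can be enumerated. A sketch, assuming the larodListDevices() and larodGetDeviceName() declarations in larod.h:

size_t numDevices = 0;
const larodDevice** devices = larodListDevices(conn, &numDevices, &error);
for (size_t i = 0; i < numDevices; i++) {
    // Print the name of every device exposed by the service.
    printf("device: %s\n", larodGetDeviceName(devices[i], &error));
}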

larodModel

A model represents a computational task with a set of inputs and outputs. The model is loaded onto a specific backend, specified by a larodDevice. There are two resources that one can use to configure the model:

- a file descriptor referring to the model data, and
- a larodMap with model parameters.

Both the type of data that is accepted as well as the available parameters are backend specific. Typically, neural networks are loaded from binary data while certain preprocessing operations can be constructed purely from a larodMap.

For example, a TFLite model can be loaded with

FILE* fpModel = fopen("mobilenet.tflite", "rb");
int fd = fileno(fpModel);
larodModel* model = larodLoadModel(conn, fd,
                                   larodGetDevice(conn, "cpu-tflite", 0, NULL),
                                   LAROD_ACCESS_PRIVATE, "Mobilenet",
                                   NULL /*larodMap*/, &error);

Similarly, a preprocessing model can be constructed from a larodMap. The file descriptor argument is omitted by setting it to -1 since there is no model file.

larodMap* modelParams = larodCreateMap(&error);
larodMapSetStr(modelParams, "image.input.format", "nv12", NULL);
larodMapSetIntArr2(modelParams, "image.input.size", 1280, 720, NULL);
larodMapSetStr(modelParams, "image.output.format", "rgb-interleaved", NULL);
larodMapSetIntArr2(modelParams, "image.output.size", 48, 48, NULL);
larodModel* model = larodLoadModel(conn, -1,
                                   larodGetDevice(conn, "cpu-proc", 0, NULL),
                                   LAROD_ACCESS_PRIVATE, "nv12->rgb",
                                   modelParams, &error);

Every model can be made either public or private by setting the access parameter in larodLoadModel(). A private model is only usable by the session that created it, even though it is visible to everyone, and it is automatically deleted in the service when the session closes its connection. Making a model public gives everyone permission to use it, and one must then explicitly call larodDeleteModel() to delete the model from the service's pool of loaded models.
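
For example, a session that has loaded a public model and no longer needs it could remove it as follows (larodDestroyModel() frees the local handle; the exact calls follow our reading of larod.h):

// Delete the model from the service's pool of loaded models...
larodDeleteModel(conn, model, &error);
// ...and free the handle in the application's address space.
larodDestroyModel(&model);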

larodTensor

The input and output of a model are represented by larodTensor objects. Each tensor carries information about the data that is used when running a job, e.g. the file descriptor pointing to the data buffer and various meta data such as the dimensions and data type of the tensor.

There are two main ways of creating the larodTensors required to run a job on a specific model.

If you already have file descriptors representing the input and output buffers you want to use for your job, the easiest method is to use larodCreateModelInputs() and larodCreateModelOutputs(), e.g.

size_t numInputs = 0;
larodTensor** inputTensors = larodCreateModelInputs(model, &numInputs, &error);

In this way, the tensors will be constructed with the right characteristics to match what the model expects. With this method it is also easy to probe the properties of a tensor, i.e. its meta data, with e.g. larodGetTensorDims() and other similar functions. The only fields one then has to set for every tensor are the file descriptor that points to the data, using larodSetTensorFd(), and the properties of that file descriptor, using larodSetTensorFdProps().

For the properties of the file descriptor, you probably want to use LAROD_FD_TYPE_DMA or LAROD_FD_TYPE_DISK, which already specify appropriate access flags for common file descriptor types. Setting invalid flags can result in undefined behavior. For example, if one does not set LAROD_FD_PROP_DMABUF for a dma-buf fd but only LAROD_FD_PROP_MAP, the result may be inconsistent data (since the CPU cache would probably not be synchronized correctly). Therefore, always set the access flags corresponding to the actual file descriptor type (c.f. LAROD_FD_TYPE_DMA and LAROD_FD_TYPE_DISK in larod.h).
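
As a sketch, attaching a buffer to the first input tensor could look as follows. Here inputFd is a placeholder for a file descriptor referring to a regular file on disk, hence LAROD_FD_TYPE_DISK:

// Point the tensor at the data buffer...
larodSetTensorFd(inputTensors[0], inputFd, &error);
// ...and declare what kind of fd it is: a seekable file on disk.
larodSetTensorFdProps(inputTensors[0], LAROD_FD_TYPE_DISK, &error);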

If you, for further convenience, would like larod to also allocate buffers for the tensors, you can instead use larodAllocModelInputs() and larodAllocModelOutputs(), e.g.

size_t numInputs = 0;
larodTensor** inputTensors = larodAllocModelInputs(conn, model, 0, &numInputs,
                                                   NULL, &error);

These larodTensors are also populated with all tensor meta data, and in addition they contain valid file descriptors and the properties of these file descriptors. To obtain the file descriptor from one of these tensors, larodGetTensorFd() should be used.

Notably, for some backends the allocation calls provide the application with options of allocating certain types of buffers (e.g. dma-buf based) that may not otherwise be available to the application (for security reasons).

Any larodTensor, regardless of how it was created, must be destroyed explicitly with larodDestroyTensors() when it is no longer of interest. This removes any cachings and allocations of the tensor inside the service, as well as freeing the memory used for the tensor in the user's address space.
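
A sketch of the cleanup, assuming a larodDestroyTensors() variant that takes the connection, the tensor array and its length (c.f. larod.h for the exact declaration):

// Destroy the input tensors and release any service-side resources.
larodDestroyTensors(conn, &inputTensors, numInputs, &error);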

larodJobRequest

Running a job requires a larodJobRequest that is created from a larodModel handle together with the input and output tensors:

larodJobRequest* jobReq = larodCreateJobRequest(model,
                                                inputTensors, numInputs,
                                                outputTensors, numOutputs,
                                                NULL /*larodMap*/, &error);

The job request is executed simply with

larodRunJob(conn, jobReq, &error);

larod will write the output to the output tensors' file descriptors.
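
Once larodRunJob() has returned successfully, the result can be read back from the output tensors' file descriptors. A sketch, assuming a seekable output fd (e.g. one allocated with larodAllocModelOutputs()) and a hypothetical destination buffer outBuf of size outSize:

// Rewind the output fd and read back the data produced by the job.
int outFd = larodGetTensorFd(outputTensors[0], &error);
lseek(outFd, 0, SEEK_SET);
read(outFd, outBuf, outSize);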

Asynchronous calls

The workflow for asynchronous job execution with larod is almost the same as the synchronous one above. Loading models and executing jobs can be performed asynchronously with the larodLoadModelAsync() and larodRunJobAsync() calls. The addition, in the asynchronous case, is that these require the user to define their own callback functions with the signatures

void (*larodLoadModelCallback)(larodModel* model, void* userData,
                               larodError* error);
void (*larodRunJobCallback)(void* userData, larodError* error);

For example, running larodRunJobAsync() will return as soon as the job request is scheduled for execution and the given callback function will be executed when the job is finished.

The user is free to pass any data to the callback function through the userData argument, but please note that the callback should not carry out relatively extensive blocking tasks, since that would block liblarod's message bus.
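
A sketch of an asynchronous job submission, assuming larodRunJobAsync() takes the connection, the job request, the callback and a user data pointer (c.f. larod.h for the exact declaration):

// Invoked by liblarod when the job has finished (or failed).
static void onJobDone(void* userData, larodError* error) {
    if (error) {
        return; // Handle the error; keep this callback short.
    }
    // The output tensors' file descriptors now contain the result.
}

// Returns as soon as the job request is scheduled for execution.
larodRunJobAsync(conn, jobReq, onJobDone, NULL /*userData*/, &error);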

Tensor tracking for performance

The larod service can track larodTensors in order to, in some cases, improve memory access performance. For instance, a larodTensor with the file descriptor property LAROD_FD_PROP_MAP only needs to be mapped once in the service if it is recurring. This can improve performance considerably.

larodTensors allocated by larod – e.g. using larodAllocModelInputs() – are tracked by default. To enable tracking of larodTensors whose buffers were not allocated by larod, larodTrackTensor() should be used.

Backward compatibility

liblarod is versioned using semantic versioning, and as such the API is only broken in new major releases of liblarod. Further, ABI and API backward compatibility are maintained for a period of time after each liblarod major release in order to ease the transition between API breaks for app developers.

ABI backward compatibility

ABI backward compatibility means that apps compiled against a liblarod with an older major version are still able to link dynamically at runtime with a newer major version of liblarod, and thus do not need to be recompiled to run with that newer version. In other words, an app compiled against a liblarod of major version X < Y still works when linking dynamically with a liblarod of major version Y, as long as liblarod Y is ABI backward compatible with liblarod X.

ABI backward compatibility is handled under the hood by symbol versioning in liblarod.

API backward compatibility

API backward compatibility means that apps written using the API of an older major version of liblarod can, with a minor edit (see below), be compiled against a newer major version of liblarod, and thus do not need to be rewritten to compile and run with the newer version. In other words, an app written using the API of liblarod major version X < Y can with a small edit be made to compile against a liblarod of major version Y, as long as liblarod Y is API backward compatible with liblarod X.

In liblarod, API backward compatibility is handled by special preprocessor defines that can be set before including larod.h to declare which liblarod API version is to be used.

In particular, defining LAROD_API_VERSION_X before including larod.h will declare version X of liblarod's functions and types:

#define LAROD_API_VERSION_X
#include <larod.h>

In this case, any function or type present in the liblarod X.*.* API will behave the same way it did when compiled with liblarod X.*.*, even though a liblarod of a newer major version Y > X is used to build the app, as long as liblarod Y is API backward compatible with liblarod X.

Future features

larod is under active development and the following features are not yet part of larod; however, there are plans to implement them.

Custom layers

At this stage larod does not support models using custom layers on all devices. However, there are plans to implement such support for any hardware that allows custom layers, as long as it does not interfere too much with the larod architecture.

Power management

Many accelerators draw a lot of power when running jobs. Devices typically have a tight power budget which dictates power usage levels that must not be exceeded. If, for example, a camera needs to run a power-hungry task not related to an accelerator, such as switching an IR filter, then jobs on the accelerator must be paused for a while. larod keeps track of the power usage on the camera and will automatically delay running jobs on the hardware in such a scenario; an application using larod does not have to worry about this at all.