liblarod
3.1.28
larod provides a simple unified C API for running machine learning and image preprocessing efficiently. No matter what a particular device can provide in terms of hardware for such operations – be it some specialized deep learning hardware accelerator or simply an ARM CPU – the larod API is all you need to access it.
Though the larod API is in some sense quite similar to libraries like TensorFlow Lite and Arm NN, the purpose and scope of larod are rather different. Its main purpose is to wrap other frameworks (like TensorFlow Lite) and thus let users access various devices (exposing different APIs) through a single unified API. So an application could, for example, run both preprocessing on a GPU and neural network inference on an ARM CPU on the same device using only the larod API. The native APIs for these devices may be completely different, for example OpenCL and TensorFlow Lite respectively, but larod abstracts these into one simple and unified API.
Furthermore the larod stack is designed to solve a number of other problems as well, for instance related to security, power management and asynchronicity. Please see features and future features below for more on this.
larod is made up of two components: the larod service, which owns and arbitrates access to the devices, and the library liblarod, whose header `larod.h` will be your interface. By using the functions exposed by liblarod, an application can load neural network models and define preprocessing operations (crop, scale, image color format convert). These models are then used together with input data to create job requests, which are sent to the larod service for execution. Model data as well as input and output data are represented by file descriptors in liblarod.
Below we see an example of two apps using larod on a system that has three different devices supported by larod.
Both apps in this example can access all three devices through the same API though the devices' native SW interfaces may be completely different. The larod service will arbitrate job scheduling between the apps to the supported devices.
The larod API is very well documented and there are several basic example applications that exhibit how to use it in practice. A simplified workflow is described below together with the essential concepts in liblarod. The `larod-client` command line tool is a good start for interacting with the larod service.
Many runtime libraries shipped with hardware accelerators (devices) for e.g. exercising neural networks will only let one process (application) access the hardware at the same time. This is of course a problem if there are multiple applications running on the device wanting to run inferences on the hardware at the same time. larod solves this problem by having the larod service acting as a proxy process, "owning" the hardware, through which other applications can access the accelerator.
For any potentially time consuming calls to larod – such as loading a model or running a neural network inference – there are asynchronous versions. This implies that one can, for example, queue up multiple inferences (possibly on different devices) at once without blocking the program execution of one's application. Note that larod provides such an interface even when the device's native SW interface does not support asynchronous calls.
Sometimes it is more important for certain jobs to finish quickly than for others. Take, for example, the case of alternating an object detection model with a classifier model that runs inferences on the boxes obtained by the object detector. If the object detector must always run for every frame (say 30 fps) and a lot of boxes have been detected, all the classification inferences may block the object detector. To this end, priorities on inferences are supported in larod. So in this example the object detector inferences could use a higher priority, making larod choose them over classification inferences whenever multiple jobs are in the queue.
larod has an interface for simply defining image preprocessing operations like crop, scale, image color format convert, and supports several devices that can perform such operations. Read more on this in preprocessing.md.
In designing and implementing larod, performance has been of utmost importance. Thus one should expect very little overhead (and indeed benchmarks confirm this) compared to using a native runtime library directly.
After loading a model into larod, one can through calls to larod extract information from the model, e.g. about the model's input or output tensors. For example, if one has not constructed the model oneself, one may want to know what kind of input tensor(s) the model requires: What is the data type? The layout? The dimensions? larod has the answers!
In order for an application to use an accelerator it must have the privileges to do so. Without larod, that would mean that for some accelerators an application would be granted rights to do things that it should not be allowed to do, like writing to memory that must not be touched. With larod, however, the applications require no such permissions: larod acts as a proxy process which has all the necessary permissions, and through which the applications communicate with the hardware.
The larod API provides a simple way to avoid any memory copies or mappings of input and output tensor data that are possible to avoid. The application simply declares how their buffers may be accessed and whether they will be used repeatedly, and then larod will take all measures possible to optimize memory accesses on these buffers. This implies both that larod jobs run faster and that larod uses the CPU less so as to leave more resources to other processes.
The basic workflow of running a job with larod is as follows:
1. Connect to the larod service, creating a `larodConnection`.
2. Get a handle to a `larodDevice`.
3. Load a `larodModel` on the selected device.
4. Create `larodTensor`s and set up their file descriptors pointing to the respective data buffers.
5. Create and run a job request using the `larodModel` and `larodTensor` handles.

A brief description of each step is given below. For details about each function call, see `larod.h`.
Connecting to the larod service and creating a `larodConnection` is as simple as
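a call to `larodConnect()`. A minimal sketch (error handling abbreviated; this assumes the `larodError` struct carries a `msg` string as declared in `larod.h`):

```c
larodConnection* conn = NULL;
larodError* error = NULL;

// Connect to the larod service. On failure, error describes what went wrong.
if (!larodConnect(&conn, &error)) {
    fprintf(stderr, "Could not connect to larod: %s\n", error->msg);
    larodClearError(&error);
}

// ... use the connection ...

// Close the session when done.
larodDisconnect(&conn, NULL);
```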
The `larodConnection` represents the session that an application has with the larod service and is used in all calls that interact with the service.
The `larodDevice` struct acts as an abstraction and handle to a specific device and its corresponding native software interface. A `larodDevice` does not necessarily have to be something external like a deep-learning accelerator; it can also refer to a software interface that runs on the CPU itself.
A device in larod has a name and an instance number. The name is represented as a string and is usually provided by the user. Refer to preprocessing.md and nn-inference to look up the names of the supported preprocessing and inference devices. The instance number of a device (starting from zero) represents a separate entity of that device and serves to distinguish multiple identical devices with the same name.
Before loading a model to run jobs with, a device must be retrieved with e.g. the following API function:
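A sketch using `larodGetDevice()`, assuming the device name "cpu-tflite" (consult nn-inference for the names actually available on your platform):

```c
larodError* error = NULL;

// Get instance 0 of the CPU-based TFLite backend. Instance numbers
// start from zero; the name is looked up in the larod service.
const larodDevice* dev = larodGetDevice(conn, "cpu-tflite", 0, &error);
if (!dev) {
    fprintf(stderr, "Could not get device: %s\n", error->msg);
    larodClearError(&error);
}
```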
The example above will return a device handle to the CPU-based TFLite backend. This handle can then be used when loading a model, as shown in the following section.
A model represents a computational task with a set of inputs and outputs. The model is loaded onto a specific backend, specified by a `larodDevice`. There are two resources that one can use to configure the model:

1. A file descriptor pointing to the model data.
2. A `larodMap` that contains a set of parameters.

Both the type of data that is accepted and the available parameters are backend specific. Typically, neural networks are loaded with binary data, while certain preprocessing operations can be constructed purely from a `larodMap`.
For example, a TFLite model can be loaded with
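a call to `larodLoadModel()`. A sketch, assuming `modelFd` is an open file descriptor to a `.tflite` file and `dev` is the device handle retrieved above; the model name string is arbitrary:

```c
// Load the model onto the selected device. LAROD_ACCESS_PRIVATE makes it
// usable only by this session; the final NULL is an optional larodMap.
larodModel* model = larodLoadModel(conn, modelFd, dev, LAROD_ACCESS_PRIVATE,
                                   "my-tflite-model", NULL, &error);
if (!model) {
    fprintf(stderr, "Could not load model: %s\n", error->msg);
    larodClearError(&error);
}
```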
Similarly, a preprocessing model can be constructed from a `larodMap`. The file descriptor argument is omitted by setting it to -1 since there is no model file.
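A sketch of such a map; the parameter keys, the device name "cpu-proc", and the image sizes below are illustrative assumptions – see preprocessing.md for the keys and device names valid on your platform:

```c
larodError* error = NULL;
larodMap* ppMap = larodCreateMap(&error);

// Describe the preprocessing job: convert NV12 1920x1080 frames
// into 300x300 interleaved RGB output.
larodMapSetStr(ppMap, "image.input.format", "nv12", &error);
larodMapSetIntArr2(ppMap, "image.input.size", 1920, 1080, &error);
larodMapSetStr(ppMap, "image.output.format", "rgb-interleaved", &error);
larodMapSetIntArr2(ppMap, "image.output.size", 300, 300, &error);

const larodDevice* ppDev = larodGetDevice(conn, "cpu-proc", 0, &error);

// No model file: pass -1 as the file descriptor.
larodModel* ppModel = larodLoadModel(conn, -1, ppDev, LAROD_ACCESS_PRIVATE,
                                     "preprocess", ppMap, &error);

larodDestroyMap(&ppMap);
```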
Every model can be made either public or private by setting the access parameter in `larodLoadModel()`. A private model is only usable by the session that created it (although it is visible to everyone), and it is automatically deleted in the service when the session closes its connection. Making a model public gives everyone permission to use it, and one must explicitly call `larodDeleteModel()` to delete it from the service's pool of loaded models.
The input and output of a model are represented by `larodTensor` objects. Each tensor carries information about the data that is used when running a job, e.g. the file descriptor pointing to the data buffer and various metadata such as the dimensions and data type of the tensor.

There are two main ways of creating the `larodTensor`s required to run a job on a specific model.
If you already have file descriptors representing the input and output buffers you want to use for your job, the easiest method is to use `larodCreateModelInputs()` and `larodCreateModelOutputs()`, e.g.
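A sketch; `inputFd` and `outputFd` are hypothetical file descriptors owned by the application:

```c
size_t numInputs = 0;
size_t numOutputs = 0;

// Create tensors matching the model's expected inputs and outputs.
larodTensor** inputTensors = larodCreateModelInputs(model, &numInputs, &error);
larodTensor** outputTensors = larodCreateModelOutputs(model, &numOutputs, &error);

// Point the first input and output tensors at the application's buffers,
// and declare what kind of file descriptors they are.
larodSetTensorFd(inputTensors[0], inputFd, &error);
larodSetTensorFdProps(inputTensors[0], LAROD_FD_TYPE_DISK, &error);
larodSetTensorFd(outputTensors[0], outputFd, &error);
larodSetTensorFdProps(outputTensors[0], LAROD_FD_TYPE_DISK, &error);
```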
In this way, the tensors will be constructed with the right characteristics to match what the model expects. With this method, it is easy to probe properties of the tensor, i.e. the metadata, with e.g. `larodGetTensorDims()` and other similar functions. Now the only fields one has to set for every tensor are the file descriptor that points to the data, with `larodSetTensorFd()`, and the properties of the file descriptor, using `larodSetTensorFdProps()`.
For the properties of the file descriptor, you most likely want to use `LAROD_FD_TYPE_DMA` or `LAROD_FD_TYPE_DISK`, which already specify appropriate access flags for common file descriptor types. Setting invalid flags could result in undefined behavior. For example, if one does not set `LAROD_FD_PROP_DMABUF` for a dma-buf fd but only `LAROD_FD_PROP_MAP`, it could give inconsistent data results (since the CPU cache would probably not be synchronized correctly). Therefore, always set the correct access flags corresponding to the actual file descriptor type (c.f. `LAROD_FD_TYPE_DMA` and `LAROD_FD_TYPE_DISK` in `larod.h`).
If you would additionally like larod to allocate the buffers for the tensors, you can instead use `larodAllocModelInputs()` and `larodAllocModelOutputs()`, e.g.
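A sketch, assuming the allocation calls take the connection, the model, fd property flags, a tensor-count out-parameter, an optional parameter map, and an error pointer (verify the exact signatures in `larod.h`):

```c
size_t numInputs = 0;
size_t numOutputs = 0;

// larod allocates the buffers; passing 0 for the fd property flags lets
// the service pick suitable defaults for the backend.
larodTensor** inputTensors =
    larodAllocModelInputs(conn, model, 0, &numInputs, NULL, &error);
larodTensor** outputTensors =
    larodAllocModelOutputs(conn, model, 0, &numOutputs, NULL, &error);

// The returned tensors already carry valid file descriptors.
int inFd = larodGetTensorFd(inputTensors[0], &error);
```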
These `larodTensor`s are also populated with all tensor metadata, and in addition they contain valid file descriptors and the properties of these file descriptors. To obtain the file descriptor from one of these tensors, `larodGetTensorFd()` should be used.
Notably, for some backends the allocation calls provide the application with options of allocating certain types of buffers (e.g. dma-buf based) that may not otherwise be available to the application (for security reasons).
Any `larodTensor`, regardless of how it was created, must be destroyed explicitly with `larodDestroyTensors()` when it is no longer of interest. This removes any cachings and allocations of the tensor inside the service as well as freeing the memory used for the tensor in the user's address space.
Running a job requires a `larodJobRequest` that is created from a `larodModel` handle together with the input and output tensors:
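A sketch, using the model and tensors from the previous steps (the final `NULL` is an optional `larodMap` with job parameters):

```c
// Bundle the model and its input/output tensors into a job request.
larodJobRequest* jobReq = larodCreateJobRequest(model, inputTensors, numInputs,
                                                outputTensors, numOutputs,
                                                NULL, &error);
```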
The job request is executed simply with
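a call to `larodRunJob()`, sketched here (the call blocks until the job has finished):

```c
if (!larodRunJob(conn, jobReq, &error)) {
    fprintf(stderr, "Job failed: %s\n", error->msg);
    larodClearError(&error);
}
```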
larod will write the output to the output tensors' file descriptors.
The workflow for asynchronous job execution with larod is almost the same as the synchronous one above. Loading models and executing jobs can be performed asynchronously with the `larodLoadModelAsync()` and `larodRunJobAsync()` calls. The addition, in the asynchronous case, is that these require the user to define their own callback functions with the signatures
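shown below. This is a sketch of the callback typedefs as the author recalls them; treat the declarations in `larod.h` as authoritative:

```c
// Called when an asynchronous model load has finished. On failure, model
// is NULL and error describes what went wrong.
typedef void (*larodLoadModelCallback)(larodModel* model, void* userData,
                                       larodError* error);

// Called when an asynchronous job has finished. error is NULL on success.
typedef void (*larodRunJobCallback)(void* userData, larodError* error);
```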
For example, `larodRunJobAsync()` will return as soon as the job request is scheduled for execution, and the given callback function will be executed when the job has finished.
The user is free to pass any data to the callback function via the `userData` argument, but please note that the function should not carry out extensive blocking tasks, since that would block liblarod's message bus.
The larod service can track `larodTensor`s in order to improve memory access performance in certain cases. For instance, a `larodTensor` with the file descriptor property `LAROD_FD_PROP_MAP` need only be mapped once in the service if it is recurring. This can improve performance considerably.
`larodTensor`s allocated by larod – e.g. using `larodAllocModelInputs()` – are tracked by default. To enable tracking on `larodTensor`s using buffers not allocated by larod, `larodTrackTensor()` should be used.
liblarod is versioned using semantic versioning, and as such the API is only broken in new major releases of liblarod. Further, ABI and API backward compatibility is maintained for a period of time after each liblarod major release in order to ease the transition between API breaks for app developers.
ABI backward compatibility means that apps compiled against a liblarod with an older major version are still able to link dynamically at runtime with a newer major version of liblarod, and thus do not need to be recompiled to run with that newer version. In other words, an app compiled against a liblarod of major version X < Y still works when linking dynamically with a liblarod of major version Y, as long as liblarod Y is ABI backward compatible with liblarod X.
ABI backward compatibility is handled under the hood by symbol versioning in liblarod.
API backward compatibility means that apps written using the API of an older major version of liblarod can, with a minor edit (see below), be compiled against a newer major version of liblarod, and thus do not need to be rewritten to compile and run with the newer version. In other words, an app written using the liblarod X < Y major version API can with a small edit be made to compile against a liblarod of major version Y, as long as liblarod Y is API backward compatible with liblarod X.
In liblarod, API backward compatibility is handled by special preprocessor defines that can be set before including `larod.h` to declare which liblarod API version is to be used.
In particular, defining `LAROD_API_VERSION_X` before including `larod.h`, as below, will declare version X of liblarod's functions and types:
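For example, to pin an app to major version 3 of the API (the major version of this document):

```c
// Request the version-3 API before the header is included.
#define LAROD_API_VERSION_3
#include "larod.h"
```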
In this case, any function or type present in the liblarod X.*.* API will behave the same way it did when compiled with liblarod X.*.*, even though a liblarod of a newer major version Y > X is used to build the app, as long as liblarod Y is API backward compatible with liblarod X.
larod is under active development and the following features are not yet part of larod; however, there are plans to implement them.
At this stage larod does not support models using custom layers for all devices. However there are plans to implement such support in the future for any hardware that allows custom layers as long as it does not interfere too much with the larod architecture.
Many accelerators use a lot of power when running jobs on them. Devices typically have a tight power budget which dictates power usage levels that must not be exceeded. If for example a camera needs to run a power hungry task not related to an accelerator such as switching an IR filter, then jobs on the accelerator must be paused for a while. larod will automatically delay running jobs on the hardware in such a scenario by keeping track of the power usage on the camera, and an application using larod does not have to worry about it at all.