The Unity AR Foundation provides a convenient high level way of utilizing ARCore and ARKit in order to implement mixed and augmented reality applications. I used it to implement an iPad app that could access an rt-ai Edge Composable Processing Pipeline (CPP) via the new Conductor Stream Processing Element (SPE). This is the CPP used to test Conductor:
The Conductor SPE provides a Websocket API to mobile devices and is able to pass data from the mobile device to the pipeline and then return the results of the CPP’s processing back to the mobile device. In this case, I am using the CYOLO SPE to perform object detection on the video stream from the mobile device’s camera. The output of the CYOLO SPE goes to three destinations – back to the Conductor, to a MediaView for display locally (for debug) and also to a PutManifold SPE for long term storage and off-line processing.
The iPad Unity app used to test this arrangement uses AR Foundation and ARKit for spatial management and convenient access to camera data. The AR Foundation is especially nice as, if you only need the subset of ARKit functionality currently available, you can do everything in the C# domain without having to get involved with Swift and/or Objective C and all that. The captured camera data is formatted as an rt-ai Edge message and sent via the Websocket API to the Conductor. The Conductor returns detection metadata to the iPad which then uses this to display the labelled detection frames in the Unity space.
Right now, the app draws a labelled frame at a constant distance of 1 meter from the camera to align with the detected object. However, an enhancement would be to use depth information (if there is any) so that the frame could be positioned at the correct depth. Or if that wasn’t useful, the frame label could include depth information.
This setup demonstrates that it is feasible for an XR app to offload inference to an edge compute system and process results in real time. This greatly reduces the load on the mobile device, pointing the way to lightweight, low power, head mounted XR devices that could last for a full workday without recharge. Performing inference on-device (with CoreML for example) is certainly a viable alternative, especially where privacy dictates that raw data (such as video) cannot leave the device. However, processing such data using an edge compute system is hardly the same as sending data out to a remote cloud so, in many cases, privacy requirements can still be satisfied using edge offload.
This particular setup does not require Orchestrator as the iPad test app can go directly to the Conductor, which is part of a statically allocated CPP. The next step to complete the architecture is to add in the Orchestrator interaction so that CPPs can be dynamically instantiated.
Previous rt-ai Edge designs, such as the driveway monitor, are static in the sense that they just sit there, running 24/7. Another mode of operation is dynamic, where stream processing networks are created on demand and accessible via standard interfaces. This is appropriate for offloading inference from mobile devices in a sentient space for example. As users enter the space, apps on their mobile devices (XR headsets, tablets, phones etc) can access inference and other processing resources from the edge compute system supporting the space.
There are three main components in a dynamic rt-ai Edge system:
- Composable Processing Pipeline (CPP). This is the dynamic analog of the static Stream Processing Network (SPN). A CPP is a set of Stream Processing Elements (SPEs) that has been designed using rtaiDesigner. The main difference between a CPP and an SPN is that, in general, the CPP contains no data sources or sinks: these are provided by the user app.
- Conductor. The Conductor is responsible for managing an allocated resource session. User apps interact directly with the Conductor via a Websocket API while the Conductor maps data flowing on the Websocket API to and from the MQTT interfaces on the CPP(s) that have been allocated to that session.
- Orchestrator. The Orchestrator manages the dynamic system. User apps interact with the Orchestrator to request resource. The Orchestrator allocates necessary CPP resources and creates a Conductor instance to act as the source and sink for the CPP(s). The user apps are then redirected to the Websocket API on the new Conductor instance at which point data can flow to and from the user. The Orchestrator is responsible for managing all of the rt-ai Edge nodes that have been allocated to the edge compute system, allocating CPPs to nodes dynamically based on available resources and hardware (e.g. GPU or embedded inference hardware).
The diagram above shows the idle state. The heart of this design is the Orchestrator as it directs all operations. When a user (via an app or browser) wants to use some edge resource, it uses the RESTful API of the Orchestrator to identify itself and define the details of the resources that it requires. The requested resources are then mapped to one or more CPP types. In this example, the Orchestrator maintains a hot pool of CPPs to minimize start up latency. Hot pool CPPs are instantiated but idle as they have no data sources. As the Orchestrator allocates CPPs from the pool, the Orchestrator creates new CPP instances to replace them. This is useful because inference SPEs can have startup times of several seconds. The hot pool hides this delay from the user. Note that the hot pool could consist of multiple types of CPPs that perform different functions – the Orchestrator just selects the correct type to satisfy the resource request. Alternatively, there could be a fixed set of CPP instances and users are just allocated to those. Or, CPPs can be instantiated on demand if startup latency is not an issue.
Once the Orchestrator has identified one or more CPPs to satisfy the resource request, it creates a Conductor instance for the request. The Conductor presents a Websocket API to the user while connecting into rt-ai Edge’s MQTT infrastructure to communicate with the CPPs. If there is only a single CPP involved, the input pin of the CPP is connected to the output pin of the Conductor and the input pin of the Conductor is connected to the output pin(s) of the CPP. If there is more than one CPP required, the CPPs are connected together as required (this can be an arbitrary graph, not just a pipeline) and the input and output pin(s) at the edges connected to the Conductor. Once this is all set up, the Orchestrator redirects the user app to the new Conductor instance and the session can begin as shown below:
As an example, suppose an AR headset user wants to identify and annotate objects in the real world using an AR overlay. In this case, the user app might request a CPP that performs the appropriate object detection and returns the box coordinates of the object and an identified label. The user app would stream the video feed from the AR headset to the Conductor using the Websocket connection. The Conductor would then pass the video frames on to the CPP. The output of the CPP would contain the detected object metadata that is passed via the Conductor onto the Websocket connection back to the user app for rendering.
One of the issues with the GPU-based CYOLO (for example) is that it uses about 8GB of GPU memory meaning that, even on a GTX 1080 ti GPU card, it is only possible to have one instance of the CYOLO SPE on any one GPU card. A way around this is to run multiple streams through a single SPE instance. The architecture of rt-ai Edge always supported fan in (i.e. stream multiplexing) but not fan out (i.e. stream demultiplexing). The new FanOut module solves this problem. The screen capture above shows the new FanOut SPE running with the Intel NCS 2-based CSSD SPE. Video streams from three cameras are multiplexed on the CSSD SPE’s input pin. The multiplexed output is then passed to the FanOut SPE which demultiplexes the composite stream to up to eight individual streams. The screen capture also shows the FanOut configuration dialog – you just enter the source SPE name for the stream to be associated with each output pin.
Since my second NCS 2 has arrived I was able to run the triple NCS configuration shown above. The old NCS didn’t really contribute much in this case – the two NCS 2s were able to get an aggregate throughput of around 26 frames per second. This is shared between the three input streams of course.
The fan in/fan out multiplexing idea fits very well with the NCS 2 as you can just add more NCS 2s (or more likely, a special purpose multiple Myriad X board) to a node to increase aggregate throughput.
I wanted a small and portable rt-ai Edge node using the Neural Compute Stick for demos and decided to base it on a Gigabyte BRi7H-8550 compact PC as it is the lowest cost, smallest footprint, device that I could find with a decent i7 CPU. This is fitted with 16GB of DDR4 DRAM and a 256GB NVMe M2 disk. Previously I needed a mini ITX board along with a GPU which is much bigger and heavier as can be seen below.
The node is running Ubuntu 16.04 along with standard rt-ai node management software and performs very nicely. A second NCS can be fitted on the front USB port and a small USB hub could be used if more than two are required. For demo purposes, a Windows or Ubuntu laptop runs rtaiDesigner for GUI-based control and status with the node acting as a headless inference server.
While this is primarily intended as a demo device, it would actually be quite a nice embedded inference node.
As I had discovered, one Neural Compute Stick 2 (NCS 2) has pretty decent throughput. The question then is: what happens if you connect more than one of these to the same machine? I only have one NCS 2 and one of the older NCS devices to test this out but that combination worked ok with some tuning. OpenVINO manages allocation of requests to physical devices so there is no explicit way for this to be controlled via the API. However, it appears that multiple SPEs on the same node can be supported as then the NCSs are divided up between the SPEs. A reset error message is typically emitted but then everything seems to work fine.
To get the best performance, I ran in async mode using multiple ExecutableNetwork/InferRequest pairs, with the actual number being configurable from the rtaiDesigner GUI. In this case, 5 pairs gave the best results. The throughput is around 18 frames per second running ssd_mobilenet_v2_coco object detection.
Using one NCS at a time, the NCS 2 was able to process 12 frames per second (versus 9 frames per second in synchronous mode using the original SPE code) while the older NCS was able to process 6 frames per second, suggesting that both were being fully utilized.
Now I need to get a second NCS 2…
Turned out to be pretty easy to integrate the ssd_mobilenet_v2_coco model compiled for the Intel NCS 2 into rt-ai Edge. Since it doesn’t use the GPU, I was able to run this and the YOLOv3 SPE on the same machine which is kind of amusing – one YOLOv3 instance tends to chew up most of the GPU memory, unfortunately, so the GPU can’t be shared. I would have liked to have run YOLOv3 on the NCS 2 for direct comparison but could not. The screen capture above shows the MediaView SPE output for both detectors running on the same 1280 x 720 video stream.
This is the design and it is showing the throughput of each detection SPE – 14 fps for the GTX 1080 ti YOLO and 9 fps for the NCS 2 based SSD. Not exactly a fair comparison, however, but still interesting. It would be much better if I had the same model running using a GPU of course. Right now, the GPU-based SPE that can run ssd_mobilenet_v2_coco (and similar models) is Python based and that (not surprisingly) runs a fair bit slower than the compiled C++ versions I am using here.
The Python-based YOLOv3 SPE has been working for a while now but the performance was a little disappointing at 2 or 3 fps using 1280 x 720 frames on an i7 5820K CPU/GTX 1080ti GPU machine. I was interested to see how much effect the Python code was having on overall performance. To do this I implemented the C++ rt-ai SPE API and added the C version of the YOLOv3 demo code. The result is shown above and this version now runs at just over 14 fps (17 fps at 640 x 480) which is very usable.
While Python is very convenient, it is clearly (and unsurprisingly) more efficient to use C/C++ so I will probably do that in the future where possible. The main side-effect is that rtaiDesigner has to deploy the correct compiled SPE for the target node (typically x64 or ARM) and that any shared libraries that are not part of the standard install are included too. A Dockerized version would of course solve the dependency problem and just require a container for each target architecture.