One way to achieve higher video frame inference rates in situations where no state is maintained between frames is to split an incoming video stream across multiple inference pipelines. The new rt-ai Edge Scaler Stream Processing Element (SPE) does exactly that. The screen capture above shows the design and the real time performance information (in the windows on the right). The pipelines in this case are just single SPEs running single shot object detection on the Intel NCS 2. The CSSD SPE is able to process around 13 1280 x 720 frames per second by itself. Using the Scaler SPE to leverage two CSSD SPEs, each with one NCS 2 running on different nodes, the throughput has been doubled. In fact, performance should scale roughly linearly with the number of pipelines attached.
The Scaler SPE implements a health check function that determines the availability of pipelines at run time. Only pipelines that pass the health check are eligible to receive frames to be processed. In the example, Scaler can support eight pipelines but only two are connected (2 and 6) so only these pass the health check and receive frames. Frames from the In port are distributed across the active pipelines in a round robin fashion.
Pipelines are configured with a maximum to the number of in-flight frames in order to maximize pipeline throughput and minimize latency. Without multiple in-flight frames, CSSD performance would be roughly halved. In fact, pipelines can have different processing throughputs – the Scaler SPE automatically adjusts pipeline usage based on achieved throughput. Result messages may be received from pipelines out of sequence and the Scaler SPE ensures that the final output stream on the Out port has been reordered correctly.
The Scaler SPE can actually support any type of data (not just video) and any type of pipeline (not just inference) provided there is no retained state between messages. This makes it a very useful new building block.
As a thought experiment, I considered how rt-ai Edge could be used to implement a next generation gym. The thought was sparked by Orangetheory who make nice use of technology to enhance the gym experience. The question was: where next? My answer is here: rt-ai smart gym. It would be fun to implement some of these ideas!
Following on from the GPU version, I now have OpenPose running in an Intel NCS 2 Stream Processing Element, as shown in the screen capture above. This wasn’t too hard as it is based on an Intel sample and model. The metadata format is consistent with the GPU version (apart from the lack of support for face and hand pose estimation) but that’s fine for a lot of applications.
This is the familiar simple test design. The OpenPoseVINO SPE is running at about 3fps on 1280 x 720 video using an NCS 2 (the GPU version with a GTX 1080ti gets about 17fps in body pose only mode). The current SPE inherited a blocking inference OpenVINO call from the demo rather than an asynchronous inference call – this needs to be changed to be similar to the technique used by the SSD version so that the full capabilities of multiple NCS 2s can be utilized for body pose estimation.
One of the more interesting pieces of open source software for video processing is OpenPose. I used this code as the basis of a new OpenPoseGPU Stream Processing Element for rt-ai Edge and the results can be seen in the screen capture. The metadata produce can be seen partially on the right hand side – it is pretty extensive as it contains all of the detected key points, depending on whether face and hand processing is enabled.
This version is x86/NVIDIA GPU based. The next thing to do is to get the equivalent working with the Intel NCS 2, based on this example, and then compare performance to see if the NCS 2 is practical for applications needing specific frame rates. The goal is to generate metadata that can be used to train a deep neural net to recognize specific activities. This could be used to create a Stream Processing Network that generates high level metadata about what users in the view of the camera are doing. This is in turn could be used to generate feedback to users, generate alerts on anomalous behavior etc.
Following on from the previous post, I thought that it would fun to try adding depth information to the detected objects using surface planes constructed by ARKit. The results are not at all bad. ARKit didn’t always detect the vertical planes correctly but horizontal ones seemed pretty reliable. I just used Unity AR Foundation‘s ray casting function at the center of the detected object to get a depth indication. Of course this is really the distance to the nearest horizontal or vertical plane so it isn’t perfect.
In the end, there’s no replacement for mobile devices with proper depth sensing cameras. Even though Tango didn’t make it, it would be nice to think that real depth sensing could become mainstream one day.
The Unity AR Foundation provides a convenient high level way of utilizing ARCore and ARKit in order to implement mixed and augmented reality applications. I used it to implement an iPad app that could access an rt-ai Edge Composable Processing Pipeline (CPP) via the new Conductor Stream Processing Element (SPE). This is the CPP used to test Conductor:
The Conductor SPE provides a Websocket API to mobile devices and is able to pass data from the mobile device to the pipeline and then return the results of the CPP’s processing back to the mobile device. In this case, I am using the CYOLO SPE to perform object detection on the video stream from the mobile device’s camera. The output of the CYOLO SPE goes to three destinations – back to the Conductor, to a MediaView for display locally (for debug) and also to a PutManifold SPE for long term storage and off-line processing.
The iPad Unity app used to test this arrangement uses AR Foundation and ARKit for spatial management and convenient access to camera data. The AR Foundation is especially nice as, if you only need the subset of ARKit functionality currently available, you can do everything in the C# domain without having to get involved with Swift and/or Objective C and all that. The captured camera data is formatted as an rt-ai Edge message and sent via the Websocket API to the Conductor. The Conductor returns detection metadata to the iPad which then uses this to display the labelled detection frames in the Unity space.
Right now, the app draws a labelled frame at a constant distance of 1 meter from the camera to align with the detected object. However, an enhancement would be to use depth information (if there is any) so that the frame could be positioned at the correct depth. Or if that wasn’t useful, the frame label could include depth information.
This setup demonstrates that it is feasible for an XR app to offload inference to an edge compute system and process results in real time. This greatly reduces the load on the mobile device, pointing the way to lightweight, low power, head mounted XR devices that could last for a full workday without recharge. Performing inference on-device (with CoreML for example) is certainly a viable alternative, especially where privacy dictates that raw data (such as video) cannot leave the device. However, processing such data using an edge compute system is hardly the same as sending data out to a remote cloud so, in many cases, privacy requirements can still be satisfied using edge offload.
This particular setup does not require Orchestrator as the iPad test app can go directly to the Conductor, which is part of a statically allocated CPP. The next step to complete the architecture is to add in the Orchestrator interaction so that CPPs can be dynamically instantiated.
Previous rt-ai Edge designs, such as the driveway monitor, are static in the sense that they just sit there, running 24/7. Another mode of operation is dynamic, where stream processing networks are created on demand and accessible via standard interfaces. This is appropriate for offloading inference from mobile devices in a sentient space for example. As users enter the space, apps on their mobile devices (XR headsets, tablets, phones etc) can access inference and other processing resources from the edge compute system supporting the space.
There are three main components in a dynamic rt-ai Edge system:
- Composable Processing Pipeline (CPP). This is the dynamic analog of the static Stream Processing Network (SPN). A CPP is a set of Stream Processing Elements (SPEs) that has been designed using rtaiDesigner. The main difference between a CPP and an SPN is that, in general, the CPP contains no data sources or sinks: these are provided by the user app.
- Conductor. The Conductor is responsible for managing an allocated resource session. User apps interact directly with the Conductor via a Websocket API while the Conductor maps data flowing on the Websocket API to and from the MQTT interfaces on the CPP(s) that have been allocated to that session.
- Orchestrator. The Orchestrator manages the dynamic system. User apps interact with the Orchestrator to request resource. The Orchestrator allocates necessary CPP resources and creates a Conductor instance to act as the source and sink for the CPP(s). The user apps are then redirected to the Websocket API on the new Conductor instance at which point data can flow to and from the user. The Orchestrator is responsible for managing all of the rt-ai Edge nodes that have been allocated to the edge compute system, allocating CPPs to nodes dynamically based on available resources and hardware (e.g. GPU or embedded inference hardware).
The diagram above shows the idle state. The heart of this design is the Orchestrator as it directs all operations. When a user (via an app or browser) wants to use some edge resource, it uses the RESTful API of the Orchestrator to identify itself and define the details of the resources that it requires. The requested resources are then mapped to one or more CPP types. In this example, the Orchestrator maintains a hot pool of CPPs to minimize start up latency. Hot pool CPPs are instantiated but idle as they have no data sources. As the Orchestrator allocates CPPs from the pool, the Orchestrator creates new CPP instances to replace them. This is useful because inference SPEs can have startup times of several seconds. The hot pool hides this delay from the user. Note that the hot pool could consist of multiple types of CPPs that perform different functions – the Orchestrator just selects the correct type to satisfy the resource request. Alternatively, there could be a fixed set of CPP instances and users are just allocated to those. Or, CPPs can be instantiated on demand if startup latency is not an issue.
Once the Orchestrator has identified one or more CPPs to satisfy the resource request, it creates a Conductor instance for the request. The Conductor presents a Websocket API to the user while connecting into rt-ai Edge’s MQTT infrastructure to communicate with the CPPs. If there is only a single CPP involved, the input pin of the CPP is connected to the output pin of the Conductor and the input pin of the Conductor is connected to the output pin(s) of the CPP. If there is more than one CPP required, the CPPs are connected together as required (this can be an arbitrary graph, not just a pipeline) and the input and output pin(s) at the edges connected to the Conductor. Once this is all set up, the Orchestrator redirects the user app to the new Conductor instance and the session can begin as shown below:
As an example, suppose an AR headset user wants to identify and annotate objects in the real world using an AR overlay. In this case, the user app might request a CPP that performs the appropriate object detection and returns the box coordinates of the object and an identified label. The user app would stream the video feed from the AR headset to the Conductor using the Websocket connection. The Conductor would then pass the video frames on to the CPP. The output of the CPP would contain the detected object metadata that is passed via the Conductor onto the Websocket connection back to the user app for rendering.