I wanted to use the front camera of an iPad to act as the input to OpenPose so that I could track pose in real time with the original idea being to leverage CoreML to run pose estimation on the device. There are a few iOS implementations of OpenPose (such as this one) but they are really designed for offline processing as they are pretty slow. I did try a different pose estimator that runs in real time on my iPad Pro but the estimation is not as good as OpenPose.
So the question was how to run iPad OpenPose in real time in some way – compromise was necessary! I do have an OpenPose SPE as part of rt-ai Edge that runs very nicely so an obvious solution was to run rt-ai Edge OpenPose on a server and just use the iPad as an input and output device. The nice plus of this new iOS app called iOSEdgeRemote is that it really doesn’t care what kind of remote processing is being used. Frames from the camera are sent to an rt-ai Edge Conductor connected to an OpenPose pipeline.
The rt-ai Edge design for this test is shown above. The pipeline optionally annotates the video and returns that and the pose metadata to the iPad for display. However, the pipeline could be doing anything provided it returns some sort of video back to the iPad.
The results are show in the screen captures above. Using a GTX 1080 ti GPU, I was getting around 19fps with just body pose processing turned on and around 9fps with face pose also turned on. Latency is not noticeable with body pose estimation and even with face pose estimation turned on it is entirely usable.
Remote inference and rendering has a lot of advantages over trying to squeeze everything into the iPad and use CoreML for inference if there is a low latency server available – 5G communications is an obvious enabler of this kind of remote inference and rendering in a wide variety of situations. Intrinsic performance of the iPad is also far less important as it is not doing anything too difficult and leaves lots of resource for other processing. The previous Unity/ARKit object detector uses a similar idea but does use more iPad resources and is not general purpose. If Unity and ARKit aren’t needed, iOSEdgeRemote with remote inference and rendering is a very powerful system.
Another nice aspect of this is that I believe that future mixed reality headset will be very lightweight devices that avoid complex processing in the headset (unlike the HoloLens for example) or require cables to an external processor (unlike the Magic Leap One for example). The headset provides cameras, SLAM of some sort, displays and radios. All other complex processing will be performed remotely and video used to drive the displays. This might be the only way to enable MR headsets that can run for 8 hours or more without a recharge and be light enough (and run cool enough) to be worn for extended periods.
Following on from the previous post, I thought that it would fun to try adding depth information to the detected objects using surface planes constructed by ARKit. The results are not at all bad. ARKit didn’t always detect the vertical planes correctly but horizontal ones seemed pretty reliable. I just used Unity AR Foundation‘s ray casting function at the center of the detected object to get a depth indication. Of course this is really the distance to the nearest horizontal or vertical plane so it isn’t perfect.
In the end, there’s no replacement for mobile devices with proper depth sensing cameras. Even though Tango didn’t make it, it would be nice to think that real depth sensing could become mainstream one day.
The Unity AR Foundation provides a convenient high level way of utilizing ARCore and ARKit in order to implement mixed and augmented reality applications. I used it to implement an iPad app that could access an rt-ai Edge Composable Processing Pipeline (CPP) via the new Conductor Stream Processing Element (SPE). This is the CPP used to test Conductor:
The Conductor SPE provides a Websocket API to mobile devices and is able to pass data from the mobile device to the pipeline and then return the results of the CPP’s processing back to the mobile device. In this case, I am using the CYOLO SPE to perform object detection on the video stream from the mobile device’s camera. The output of the CYOLO SPE goes to three destinations – back to the Conductor, to a MediaView for display locally (for debug) and also to a PutManifold SPE for long term storage and off-line processing.
The iPad Unity app used to test this arrangement uses AR Foundation and ARKit for spatial management and convenient access to camera data. The AR Foundation is especially nice as, if you only need the subset of ARKit functionality currently available, you can do everything in the C# domain without having to get involved with Swift and/or Objective C and all that. The captured camera data is formatted as an rt-ai Edge message and sent via the Websocket API to the Conductor. The Conductor returns detection metadata to the iPad which then uses this to display the labelled detection frames in the Unity space.
Right now, the app draws a labelled frame at a constant distance of 1 meter from the camera to align with the detected object. However, an enhancement would be to use depth information (if there is any) so that the frame could be positioned at the correct depth. Or if that wasn’t useful, the frame label could include depth information.
This setup demonstrates that it is feasible for an XR app to offload inference to an edge compute system and process results in real time. This greatly reduces the load on the mobile device, pointing the way to lightweight, low power, head mounted XR devices that could last for a full workday without recharge. Performing inference on-device (with CoreML for example) is certainly a viable alternative, especially where privacy dictates that raw data (such as video) cannot leave the device. However, processing such data using an edge compute system is hardly the same as sending data out to a remote cloud so, in many cases, privacy requirements can still be satisfied using edge offload.
This particular setup does not require Orchestrator as the iPad test app can go directly to the Conductor, which is part of a statically allocated CPP. The next step to complete the architecture is to add in the Orchestrator interaction so that CPPs can be dynamically instantiated.
Previous rt-ai Edge designs, such as the driveway monitor, are static in the sense that they just sit there, running 24/7. Another mode of operation is dynamic, where stream processing networks are created on demand and accessible via standard interfaces. This is appropriate for offloading inference from mobile devices in a sentient space for example. As users enter the space, apps on their mobile devices (XR headsets, tablets, phones etc) can access inference and other processing resources from the edge compute system supporting the space.
There are three main components in a dynamic rt-ai Edge system:
- Composable Processing Pipeline (CPP). This is the dynamic analog of the static Stream Processing Network (SPN). A CPP is a set of Stream Processing Elements (SPEs) that has been designed using rtaiDesigner. The main difference between a CPP and an SPN is that, in general, the CPP contains no data sources or sinks: these are provided by the user app.
- Conductor. The Conductor is responsible for managing an allocated resource session. User apps interact directly with the Conductor via a Websocket API while the Conductor maps data flowing on the Websocket API to and from the MQTT interfaces on the CPP(s) that have been allocated to that session.
- Orchestrator. The Orchestrator manages the dynamic system. User apps interact with the Orchestrator to request resource. The Orchestrator allocates necessary CPP resources and creates a Conductor instance to act as the source and sink for the CPP(s). The user apps are then redirected to the Websocket API on the new Conductor instance at which point data can flow to and from the user. The Orchestrator is responsible for managing all of the rt-ai Edge nodes that have been allocated to the edge compute system, allocating CPPs to nodes dynamically based on available resources and hardware (e.g. GPU or embedded inference hardware).
The diagram above shows the idle state. The heart of this design is the Orchestrator as it directs all operations. When a user (via an app or browser) wants to use some edge resource, it uses the RESTful API of the Orchestrator to identify itself and define the details of the resources that it requires. The requested resources are then mapped to one or more CPP types. In this example, the Orchestrator maintains a hot pool of CPPs to minimize start up latency. Hot pool CPPs are instantiated but idle as they have no data sources. As the Orchestrator allocates CPPs from the pool, the Orchestrator creates new CPP instances to replace them. This is useful because inference SPEs can have startup times of several seconds. The hot pool hides this delay from the user. Note that the hot pool could consist of multiple types of CPPs that perform different functions – the Orchestrator just selects the correct type to satisfy the resource request. Alternatively, there could be a fixed set of CPP instances and users are just allocated to those. Or, CPPs can be instantiated on demand if startup latency is not an issue.
Once the Orchestrator has identified one or more CPPs to satisfy the resource request, it creates a Conductor instance for the request. The Conductor presents a Websocket API to the user while connecting into rt-ai Edge’s MQTT infrastructure to communicate with the CPPs. If there is only a single CPP involved, the input pin of the CPP is connected to the output pin of the Conductor and the input pin of the Conductor is connected to the output pin(s) of the CPP. If there is more than one CPP required, the CPPs are connected together as required (this can be an arbitrary graph, not just a pipeline) and the input and output pin(s) at the edges connected to the Conductor. Once this is all set up, the Orchestrator redirects the user app to the new Conductor instance and the session can begin as shown below:
As an example, suppose an AR headset user wants to identify and annotate objects in the real world using an AR overlay. In this case, the user app might request a CPP that performs the appropriate object detection and returns the box coordinates of the object and an identified label. The user app would stream the video feed from the AR headset to the Conductor using the Websocket connection. The Conductor would then pass the video frames on to the CPP. The output of the CPP would contain the detected object metadata that is passed via the Conductor onto the Websocket connection back to the user app for rendering.
I wanted a small and portable rt-ai Edge node using the Neural Compute Stick for demos and decided to base it on a Gigabyte BRi7H-8550 compact PC as it is the lowest cost, smallest footprint, device that I could find with a decent i7 CPU. This is fitted with 16GB of DDR4 DRAM and a 256GB NVMe M2 disk. Previously I needed a mini ITX board along with a GPU which is much bigger and heavier as can be seen below.
The node is running Ubuntu 16.04 along with standard rt-ai node management software and performs very nicely. A second NCS can be fitted on the front USB port and a small USB hub could be used if more than two are required. For demo purposes, a Windows or Ubuntu laptop runs rtaiDesigner for GUI-based control and status with the node acting as a headless inference server.
While this is primarily intended as a demo device, it would actually be quite a nice embedded inference node.
Turned out to be pretty easy to integrate the ssd_mobilenet_v2_coco model compiled for the Intel NCS 2 into rt-ai Edge. Since it doesn’t use the GPU, I was able to run this and the YOLOv3 SPE on the same machine which is kind of amusing – one YOLOv3 instance tends to chew up most of the GPU memory, unfortunately, so the GPU can’t be shared. I would have liked to have run YOLOv3 on the NCS 2 for direct comparison but could not. The screen capture above shows the MediaView SPE output for both detectors running on the same 1280 x 720 video stream.
This is the design and it is showing the throughput of each detection SPE – 14 fps for the GTX 1080 ti YOLO and 9 fps for the NCS 2 based SSD. Not exactly a fair comparison, however, but still interesting. It would be much better if I had the same model running using a GPU of course. Right now, the GPU-based SPE that can run ssd_mobilenet_v2_coco (and similar models) is Python based and that (not surprisingly) runs a fair bit slower than the compiled C++ versions I am using here.
Fresh from success with YOLOv3 on the desktop, a question came up of whether this could be made to work on the Movidius Neural Compute Stick and therefore run on the Raspberry Pi.
The NCS is a neat little device and because it connects via USB, it is easy to develop on a desktop and then transfer everything needed to the Pi.
The app zoo, on the ncsdk2 branch, has a tiny_yolo_v2 implementation that I used as the basis for this. It only took about an hour to get this working on the desktop – integration with rt-ai was very easy. The Raspberry Pi end was not – all kinds of version number issues and things like that. However, even though not all of the tools would compile, I just moved the compiled graph from the desktop to the Pi and that worked fine.
This is the design. The main difference here from the usual test designs is that the MYOLO SPE is assigned to node pi34 (the Raspberry Pi) rather than the desktop (Default). Just assigning the MYOLO SPE to the Pi saved me from having to connect a Picam or uvc camera to the Pi and also allowed me to get a better feel for the pure performance of the Pi with the NCS.
As can be seen from the first screen capture it worked fine although, because it supports only a subset (20 of 91) of the usual COCO labels, it did not pick up the mouse or the keyboard. Performance-wise, it was running at about 1fps and 30% CPU. Just for reference, I was getting about 8fps on the i7 desktop.