Combining TrueDepth, remote OpenPose inference and local depth map processing to generate spatial 3D pose coordinates


The problem with depth maps for video is that the depth data is very large and can’t be compressed easily. I had previously run OpenPose at 30 FPS using an iPad Pro and remote inference, but that was only for the standard OpenPose (x, y) coordinate output. There’s no way 30 FPS could be achieved while sending out a TrueDepth depth map with every frame. Instead, the depth processing has to be handled locally on the iPad – the depth map never leaves the device.
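
To get a feel for the numbers, here is a back-of-the-envelope calculation in Python. The resolution and pixel format are assumptions for illustration, not the actual TrueDepth output format.

```python
# Rough data-rate estimate with assumed numbers (the real TrueDepth depth
# map resolution and pixel format may differ): even a modest 640 x 480
# depth map of 16-bit values at 30 FPS is a lot of raw data, and unlike
# video it does not compress well.
width, height = 640, 480      # assumed depth map resolution
bytes_per_pixel = 2           # assumed 16-bit depth values
fps = 30
raw_rate = width * height * bytes_per_pixel * fps
print(f"{raw_rate / 1e6:.1f} MB/s of uncompressed depth data")  # ~18.4 MB/s
```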

The screen capture above shows the system running at 30 FPS. I had to turn a lot of lights on in the office – the frame rate from the iPad camera will drop below 30 FPS if it is too dark, which messes up the data!


This is the design. It is the triple-scaled OpenPoseGPU design used previously. iOSOpenPose connects to the Conductor via a websocket connection that is used to send images to and receive processed images from the pipeline.

One issue is that each image frame has its own depth map and that’s the one that has to be used to convert the OpenPose (x, y) coordinates into spatial (x, y, z) distances. The solution, in a new app called iOSOpenPose, is to cache the depth maps locally and re-associate them with the processed images when they return. Each image and depth frame is marked with a unique incrementing index to assist with this. Incidentally, this is why I love using JSON for this kind of work – it is possible to add non-standard fields at any point and they will be carried transparently to their destination.
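
As a rough illustration of the idea – this Python sketch is not the actual iOSOpenPose code (which is an iOS app), and the frameIndex field name and intrinsics handling are my own assumptions – the cache and the (x, y) to (x, y, z) conversion look conceptually like this:

```python
import json

class DepthFrameCache:
    """Sketch: tag each outgoing frame with an incrementing index carried as
    an extra JSON field, cache the matching depth map locally and
    re-associate the two when the processed pose metadata comes back."""

    def __init__(self):
        self.next_index = 0
        self.depth_maps = {}            # frame index -> depth map

    def tag_outgoing_frame(self, metadata, depth_map):
        """Attach a unique index to the frame metadata and cache its depth map."""
        index = self.next_index
        self.next_index += 1
        self.depth_maps[index] = depth_map
        metadata["frameIndex"] = index  # non-standard JSON field, carried transparently
        return json.dumps(metadata)

    def match_returned_frame(self, returned_json):
        """Look up the depth map that belongs to a processed frame."""
        metadata = json.loads(returned_json)
        return metadata, self.depth_maps.pop(metadata["frameIndex"], None)

def backproject(u, v, depth, fx, fy, cx, cy):
    """Convert an OpenPose image point (u, v) plus its depth (in metres) into
    spatial (x, y, z) using the pinhole camera model and the camera
    intrinsics fx, fy, cx, cy."""
    return (u - cx) * depth / fx, (v - cy) * depth / fy, depth
```

The depth value for each joint would be sampled from the cached depth map at (or around) the joint’s pixel position before back-projection, using the intrinsics reported by the TrueDepth camera.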

Empirically, with my current setup there is a six-frame processing lag, which is not too bad. It would probably be better with the dual-scaled pipeline, two-node design that more easily handles 30 FPS, but I did not try that. Another issue is that the processing pipeline can validly lose image frames if it can’t keep up with the offered rate. The depth map cache management software has to take care of nasty details like this and other real-world effects.
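
Because frames can be lost, the cache cannot simply wait for every index to come back. A minimal sketch of the housekeeping, again purely illustrative and assuming processed frames return in order:

```python
def prune_depth_cache(depth_maps, last_returned_index, max_pending=32):
    """Hypothetical cache housekeeping. Assuming processed frames come back
    in order, any cached depth map older than the most recently returned
    index belongs to a frame the pipeline dropped and can be discarded. The
    hard bound (here roughly a second of 30 FPS video) protects against a
    stalled link where nothing comes back at all."""
    for index in [i for i in depth_maps if i < last_returned_index]:
        del depth_maps[index]
    while len(depth_maps) > max_pending:
        del depth_maps[min(depth_maps)]
```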

Real time OpenPose on an iPad…with the help of remote inference and rendering

I wanted to use the front camera of an iPad as the input to OpenPose so that I could track pose in real time; the original idea was to leverage CoreML to run pose estimation on the device. There are a few iOS implementations of OpenPose (such as this one), but they are really designed for offline processing as they are pretty slow. I did try a different pose estimator that runs in real time on my iPad Pro, but its estimation is not as good as OpenPose’s.

So the question was how to run OpenPose on the iPad in real time in some way – compromise was necessary! I do have an OpenPose SPE as part of rt-ai Edge that runs very nicely, so an obvious solution was to run rt-ai Edge OpenPose on a server and just use the iPad as an input and output device. A nice plus of the new iOS app, called iOSEdgeRemote, is that it really doesn’t care what kind of remote processing is being used. Frames from the camera are sent to an rt-ai Edge Conductor connected to an OpenPose pipeline.
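
As a sketch of the frame round trip – illustrative only, since the real iOSEdgeRemote app is iOS code and the Conductor’s actual message format isn’t documented here, so the endpoint URL and field names are invented – the send/receive loop is conceptually something like this, using the third-party websockets package:

```python
import asyncio
import base64
import json

import websockets   # third-party: pip install websockets

async def run_camera_loop(get_jpeg_frame, show_frame,
                          uri="ws://conductor.local:8080"):
    """Send each camera frame (JPEG bytes) in a small JSON envelope to the
    remote pipeline and hand the returned annotated frame plus pose metadata
    to the display callback."""
    index = 0
    async with websockets.connect(uri) as ws:
        while True:
            jpeg = get_jpeg_frame()                 # caller-supplied capture function
            await ws.send(json.dumps({
                "frameIndex": index,                # hypothetical field names
                "image": base64.b64encode(jpeg).decode("ascii"),
            }))
            reply = json.loads(await ws.recv())     # annotated image + pose metadata
            show_frame(reply)
            index += 1

# Usage: asyncio.run(run_camera_loop(capture_jpeg, display_frame))
```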

The rt-ai Edge design for this test is shown above. The pipeline optionally annotates the video and returns the annotated frames and the pose metadata to the iPad for display. However, the pipeline could be doing anything, provided it returns some sort of video to the iPad.

The results are shown in the screen captures above. Using a GTX 1080 Ti GPU, I was getting around 19 FPS with just body pose processing turned on and around 9 FPS with face pose also turned on. Latency is not noticeable with body pose estimation, and even with face pose estimation turned on the system is entirely usable.

If a low-latency server is available, remote inference and rendering has a lot of advantages over trying to squeeze everything into the iPad and using CoreML for inference – 5G communications is an obvious enabler of this kind of remote inference and rendering in a wide variety of situations. The intrinsic performance of the iPad is also far less important as it is not doing anything too difficult, leaving lots of resources for other processing. The previous Unity/ARKit object detector uses a similar idea but does use more iPad resources and is not general purpose. If Unity and ARKit aren’t needed, iOSEdgeRemote with remote inference and rendering is a very powerful system.

Another nice aspect of this is that I believe future mixed reality headsets will be very lightweight devices that neither do complex processing in the headset (unlike the HoloLens, for example) nor require a cable to an external processor (unlike the Magic Leap One, for example). The headset provides cameras, SLAM of some sort, displays and radios. All other complex processing will be performed remotely, with video used to drive the displays. This might be the only way to enable MR headsets that can run for 8 hours or more without a recharge and be light enough (and run cool enough) to be worn for extended periods.

The ZeroSensor – a sentient space point of presence

One application for rt-ai Edge is ubiquitous sensing leading to sentient spaces – spaces that can interact with people moving through them and provide useful functionality, whether learned or programmed. A step on the road to that is the ZeroSensor, four prototypes of which are shown in the photo. Each ZeroSensor consists of a Raspberry Pi Zero W, a Pi camera module v2, an Adafruit BME680 breakout and an Adafruit TSL2561 breakout. The combination gives a video stream and a sensor stream with light, temperature, pressure, humidity and air quality values. The video stream can be used to derive motion sensing and identification while the other sensors provide a general idea of conditions in the space. Notably missing is audio. Microphone support would be useful for general sensing and I might add that in real devices. A 3D-printable case design is underway to allow wide-scale deployment.
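
A rough sketch of the sensor side is shown below, assuming the Adafruit CircuitPython drivers for the BME680 and TSL2561 on the Pi Zero W; the real ZeroSynth SPE publishes into the rt-ai Edge pipeline rather than printing JSON:

```python
import json
import time

import board
import busio
import adafruit_bme680    # pip install adafruit-circuitpython-bme680
import adafruit_tsl2561   # pip install adafruit-circuitpython-tsl2561

i2c = busio.I2C(board.SCL, board.SDA)
bme680 = adafruit_bme680.Adafruit_BME680_I2C(i2c)
tsl2561 = adafruit_tsl2561.TSL2561(i2c)

while True:
    reading = {
        "timestamp": time.time(),
        "light": tsl2561.lux,               # lux
        "temperature": bme680.temperature,  # degrees C
        "pressure": bme680.pressure,        # hPa
        "humidity": bme680.humidity,        # percent relative humidity
        "airQuality": bme680.gas,           # gas resistance in ohms, a proxy for air quality
    }
    print(json.dumps(reading))              # one sensor message per second
    time.sleep(1.0)
```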

Voice-based interaction is a powerful way for users to interact with sentient spaces. However, the assumption here is that people who want to interact are wearing an AR headset of some sort, which itself provides the audio I/O capabilities. Gesture input would be possible via the ZeroSensor’s camera. For privacy reasons, video would not be viewed directly or stored but used only as a source of activity and interaction data.

This is the simple rt-ai design used to test the ZeroSensors. The ZeroSynth modules are rt-ai Edge synth modules containing SPEs that interface with the ZeroSensor’s hardware and generate a video stream and a sensor data stream. Instances of a video viewer and a sensor viewer are connected to each ZeroSynth module.

This is the result of running the ZeroSensor test design, showing a video and sensor window for each ZeroSensor. The cameras are staring at the ceiling because the four sensors were on a table. When the correct case is available, they will be deployed in the corners of rooms in the space.

Real time edge inference monitoring with rt-ai Edge

rt-ai Edge is progressing nicely and now supports multi-node operation (i.e. multiple networked servers participating in a processing network) along with real-time monitoring. The screen capture shows a simple processing network where the video feed from a camera is passed through a DeepLab-v3+ stream processing element (SPE) and then on to two separate media viewers. At the top of each SPE block in the Designer window is some text like Cam(Default). Here, Cam is the name given to the SPE while Default is the name of the node (server) on which the SPE is running. In this design there are two nodes, Default and rtai0.

The code underlying the common SPE API communicates with the Designer window and supplies the stats about bytes and messages in and out. Soon, this path will also allow SPE-specific real-time parameter tweaking from the Designer window.
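
The actual message format is internal to rt-ai Edge, but the accounting itself is simple. A hypothetical sketch, with all field names invented for illustration:

```python
import json
import time

class SPEStats:
    """Per-SPE traffic accounting of the kind the common API layer reports
    back to the Designer window."""

    def __init__(self, spe_name, node_name):
        self.spe_name = spe_name
        self.node_name = node_name
        self.bytes_in = self.bytes_out = 0
        self.messages_in = self.messages_out = 0

    def record_in(self, payload: bytes):
        self.bytes_in += len(payload)
        self.messages_in += 1

    def record_out(self, payload: bytes):
        self.bytes_out += len(payload)
        self.messages_out += 1

    def report(self) -> str:
        """Periodic status message a Designer-style tool could display."""
        return json.dumps({
            "spe": self.spe_name,
            "node": self.node_name,
            "timestamp": time.time(),
            "bytesIn": self.bytes_in,
            "bytesOut": self.bytes_out,
            "messagesIn": self.messages_in,
            "messagesOut": self.messages_out,
        })
```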

To add a node to the system, it just needs to have all of the prerequisites installed and to run a special NodeManager SPE. This also communicates with the Designer and supports SPE deployment and runtime control, activated when the user presses the Deploy design button. Moving an SPE between nodes is just a matter of reassigning it, regenerating the design and deploying it again.

The green outlines around each SPE indicate the state of the SPE and of the node on which it is running. When the outline is all green, as in the first screen capture, both the SPE and its node are running. For the second screen capture, I manually terminated the View2 SPE on rtai0. The inner part of the outline has gone red, indicating that the node is up but the SPE is down. If the outline is all red, the node is down and not communicating with the Designer.

It’s interesting to note that DeepLab-v3+ processes around 5 frames per second on a GTX 1080 GPU while the input rate from the camera is 30 frames per second. The SPE drops frames while it is still processing an earlier frame, ensuring that queues do not build up and latency is kept to a minimum.
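
A minimal sketch of that keep-only-the-latest-frame behaviour (not the actual rt-ai Edge code): a one-slot buffer means a slow stage always works on the newest frame and silently discards whatever it could not keep up with, so latency stays bounded instead of a queue growing.

```python
import threading

class LatestFrameSlot:
    """One-slot frame buffer shared between a fast producer (the 30 FPS
    camera feed) and a slow consumer (e.g. a ~5 FPS segmentation stage)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None

    def put(self, frame):
        """Producer side: overwrite whatever is waiting, dropping any
        unprocessed older frame."""
        with self._lock:
            self._frame = frame

    def take(self):
        """Consumer side: grab the newest frame (or None) when ready for work."""
        with self._lock:
            frame, self._frame = self._frame, None
            return frame
```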

rt-ai: real time stream processing and inference at the edge enables intelligent IoT

For a change, the “rt” part of rt-ai doesn’t just stand for “richardstech” – it also stands for “real-time”. Real-time inference at the edge will allow decision making in the local loop with low latency and no dependence on the cloud. rt-ai includes a flexible and intuitive infrastructure for joining stream processing pipelines together in distributed environments with restricted processing power. It is very easy for anyone to add new pipeline elements that fully integrate with rt-ai pipelines. This leverages some of the concepts originally prototyped in rtndf, while other parts of the rt-ai infrastructure have been in 24/7 use for several years, proving their intrinsic reliability.

Edge processing and control is essential if there is to be scalable use of intelligent IoT. I believe that dumb IoT, where everything has to be sent to a cloud service for processing, is a broken and unscalable model. The bandwidth requirements alone of sending all the data back to a central point will rapidly become unworkable, and latency guarantees are difficult to impossible in this model. rt-ai’s advantages – keeping raw data at the edge where it belongs, upstreaming only salient information to the cloud and minimizing the CPU cycles required in power-constrained environments – are the keys to scalable intelligent IoT.

rt-ai Edge

rt-ai Edge is a new concept in edge processing that makes it easy for anyone to build AI- and ML-enhanced stream processing pipelines in order to close the local loop and offload communications networks and the cloud. Semantic extraction of meaningful data from raw data feeds at the edge ensures that the core only has to deal with actionable information, not noise. rt-ai Edge leverages hardware acceleration within embedded devices to filter raw data into highly salient messages for higher-level processing.

rt-ai Edge is in active development right now.