Platform-independent highly augmented spaces using UWB

The SHAPE project needs to be able to support consistent highly augmented spaces no matter what platform (headset + software) is chosen by any user situated within the physical space. Previously, SHAPE had just used ARKit to design spaces as an interim measure but this was not going to solve the problem in a platform-independent way. What SHAPE needed was a platform-independent way of linking an ARKit spatial map to the real physical environment. UWB technology provides just such a mechanism.

SHAPE breaks a large physical space into multiple subspaces – often mapped to physical rooms. A big problem is that augmentations can be seen through walls unless something prevents this. ARKit is relatively awful at wall detection so I gave up trying to get that to work. It’s not really ARKit’s fault. Using a single camera to reliably map a room’s walls is just not reliable. Another problem concerns windows and doors. Ideally, it should be possible to see augmentations outside of a physical room if they can be viewed through a window. That might be tough for any mapping software to handle correctly.

SHAPE is now able to solve these problems using UWB. The photo above shows part of the process used to link ARKit’s coordinate system to a physical space coordinate system defined by the UWB installation in the space. What this is showing is how the yaw (rotation about the y-axis in Unity terms) offset between ARKit’s coordinate system and UWB’s coordinate system is measured and subsequently corrected. Basically, a UWB tag (which has a known position in the space) is covered by the red ball in the center of the iPad screen and data recorded at that point. The data recorded consists of the virtual AR camera position and rotation along with the iPad’s position in the physical space. Because iPads currently do not support UWB, I attached one of the Decawave tags to the back of the iPad. Separately, a tag in Listener mode is used by a new SHAPE component, EdgeUWB, to provide a service that makes available the positions of all UWB tags in the subspace. EdgeSpace keeps track of these tag positions so when it receives a message to update the ARKit offset, it can combine the AR camera pose (in the message from the SHAPE app to EdgeSpace), iPad position (from the UWB tag via EdgeUWB) and the known location of the target UWB anchor (from a configuration file). With all of this information, EdgeSpace can calculate position and rotation offsets that are sent back to the SHAPE app so that the Unity augmentations can be correctly aligned in the physical space.

As the ARKit spatial map is also saved during this process and reloaded every time the SHAPE app starts up (or enters the room/subspace in a full implementation), the measured offsets remain valid.


Coming back to the issue of making sure that only augmentations in a physical subspace can be seen, except through a window or door, SHAPE now includes the concept of rooms that can have walls, a floor and a ceiling. The screenshot above shows an example of this. The yellow sticky note is outside of the room and so only the part in the “window” is visible. Since it is not at all obvious what is happening, the invisible but occluding walls used in normal mode can be replaced with visible walls so alignment can be visualized more easily.


This screenshot was taken in debug mode. The effect is subtle but there is a blue film representing the normal invisible occluding walls and the cutout for the window can be seen as it is clear. It can also be seen that the alignment isn’t totally perfect – for example, the cutout is a couple of inches higher than the actual transparent part of the physical window. In this case, the full sticky note is visible as the debug walls don’t occlude.

Incidentally, the room walls use the procedural punctured plane technology that was developed a while ago.


This is an example showing a wall with rectangular and elliptical cutouts. Cutouts can be defined to allow augmentations to be seen through windows and doors by configuring the appropriate cutout in terms of the UWB coordinate system.

While the current system only supports ARKit, in principle any equivalent system could be aligned to the UWB coordinate system using some sort of similar alignment process. Once this is done, a user in the space with any supported XR headset type will see a consistent set of augmentations, reliably positioned within the physical space (at least within the alignment accuracy limits of the underlying platform and the UWB location system). Note that, while UWB support in user devices is helpful, it is only required for setting up the space and initial map alignment. After that, user devices can achieve spatial lock (via ARKit for example) and then maintain tracking in the normal way.

An Edge TPU stream processing element for rt-ai Edge using the Coral USB Accelerator


A Coral USB Accelerator turned up yesterday so of course it had to be integrated with rt-ai Edge to see what it could do. Creating a Python-based SPE from the object detection demo in the API download didn’t take too long. I used the MobileNet SSD v2 COCO model as a starting point to generate this example output:

The very basic rt-ai Edge test design looks like this:

Using 1280 x 720 video frames from the webcam, I was getting around 2 frames per second from the CoralSSD SPE. This isn’t as good as the Intel NCS 2 SPE but that is a compiled C++ SPE whereas the Coral SPE is a Python 3 SPE. I haven’t found a C++ API spec for the Edge TPU as yet. Perhaps by investigating the SWIG-generated Python interface I could link the compiled libraries directly but that’s for another day…

Real time OpenPose on an iPad…with the help of remote inference and rendering

I wanted to use the front camera of an iPad to act as the input to OpenPose so that I could track pose in real time with the original idea being to leverage CoreML to run pose estimation on the device. There are a few iOS implementations of OpenPose (such as this one) but they are really designed for offline processing as they are pretty slow. I did try a different pose estimator that runs in real time on my iPad Pro but the estimation is not as good as OpenPose.

So the question was how to run iPad OpenPose in real time in some way – compromise was necessary! I do have an OpenPose SPE as part of rt-ai Edge that runs very nicely so an obvious solution was to run rt-ai Edge OpenPose on a server and just use the iPad as an input and output device. The nice plus of this new iOS app called iOSEdgeRemote is that it really doesn’t care what kind of remote processing is being used. Frames from the camera are sent to an rt-ai Edge Conductor connected to an OpenPose pipeline.

The rt-ai Edge design for this test is shown above. The pipeline optionally annotates the video and returns that and the pose metadata to the iPad for display. However, the pipeline could be doing anything provided it returns some sort of video back to the iPad.

The results are show in the screen captures above. Using a GTX 1080 ti GPU, I was getting around 19fps with just body pose processing turned on and around 9fps with face pose also turned on. Latency is not noticeable with body pose estimation and even with face pose estimation turned on it is entirely usable.

Remote inference and rendering has a lot of advantages over trying to squeeze everything into the iPad and use CoreML  for inference if there is a low latency server available – 5G communications is an obvious enabler of this kind of remote inference and rendering in a wide variety of situations. Intrinsic performance of the iPad is also far less important as it is not doing anything too difficult and leaves lots of resource for other processing. The previous Unity/ARKit object detector uses a similar idea but does use more iPad resources and is not general purpose. If Unity and ARKit aren’t needed, iOSEdgeRemote with remote inference and rendering is a very powerful system.

Another nice aspect of this is that I believe that future mixed reality headset will be very lightweight devices that avoid complex processing in the headset (unlike the HoloLens for example) or require cables to an external processor (unlike the Magic Leap One for example). The headset provides cameras, SLAM of some sort, displays and radios. All other complex processing will be performed remotely and video used to drive the displays. This might be the only way to enable MR headsets that can run for 8 hours or more without a recharge and be light enough (and run cool enough) to be worn for extended periods.

Adding depth to DNN object detection with ARKit and Unity AR Foundation


Following on from the previous post, I thought that it would fun to try adding depth information to the detected objects using surface planes constructed by ARKit. The results are not at all bad. ARKit didn’t always detect the vertical planes correctly but horizontal ones seemed pretty reliable.  I just used Unity AR Foundation‘s ray casting function at the center of the detected object to get a depth indication. Of course this is really the distance to the nearest horizontal or vertical plane so it isn’t perfect.

In the end, there’s no replacement for mobile devices with proper depth sensing cameras. Even though Tango didn’t make it, it would be nice to think that real depth sensing could become mainstream one day.

Using edge inference to detect real world objects with Unity AR Foundation, ARKit and rt-ai Edge

The Unity AR Foundation provides a convenient high level way of utilizing ARCore and ARKit in order to implement mixed and augmented reality applications. I used it to implement an iPad app that could access an rt-ai Edge Composable Processing Pipeline (CPP) via the new Conductor Stream Processing Element (SPE). This is the CPP used to test Conductor:


The Conductor SPE provides a Websocket API to mobile devices and is able to pass data from the mobile device to the pipeline and then return the results of the CPP’s processing back to the mobile device. In this case, I am using the CYOLO SPE to perform object detection on the video stream from the mobile device’s camera. The output of the CYOLO SPE goes to three destinations – back to the Conductor, to a MediaView for display locally (for debug) and also to a PutManifold SPE for long term storage and off-line processing.

The iPad Unity app used to test this arrangement uses AR Foundation and ARKit for spatial management and convenient access to camera data. The AR Foundation is especially nice as, if you only need the subset of ARKit functionality currently available, you can do everything in the C# domain without having to get involved with Swift and/or Objective C and all that. The captured camera data is formatted as an rt-ai Edge message and sent via the Websocket API to the Conductor. The Conductor returns detection metadata to the iPad which then uses this to display the labelled detection frames in the Unity space.

Right now, the app draws a labelled frame at a constant distance of 1 meter from the camera to align with the detected object. However, an enhancement would be to use depth information (if there is any) so that the frame could be positioned at the correct depth. Or if that wasn’t useful, the frame label could include depth information.

This setup demonstrates that it is feasible for an XR app to offload inference to an edge compute system and process results in real time. This greatly reduces the load on the mobile device, pointing the way to lightweight, low power, head mounted XR devices that could last for a full workday without recharge. Performing inference on-device (with CoreML for example) is certainly a viable alternative, especially where privacy dictates that raw data (such as video) cannot leave the device. However, processing such data using an edge compute system is hardly the same as sending data out to a remote cloud so, in many cases, privacy requirements can still be satisfied using edge offload.

This particular setup does not require Orchestrator as the iPad test app can go directly to the Conductor, which is part of a statically allocated CPP. The next step to complete the architecture is to add in the Orchestrator interaction so that CPPs can be dynamically instantiated.

An rt-ai Edge architecture for scalable on-demand edge inference systems


Previous rt-ai Edge designs, such as the driveway monitor, are static in the sense that they just sit there, running 24/7. Another mode of operation is dynamic, where stream processing networks are created on demand and accessible via standard interfaces. This is appropriate for offloading inference from mobile devices in a sentient space for example. As users enter the space, apps on their mobile devices (XR headsets, tablets, phones etc) can access inference and other processing resources from the edge compute system supporting the space.

There are three main components in a dynamic rt-ai Edge system:

  • Composable Processing Pipeline (CPP). This is the dynamic analog of the static Stream Processing Network (SPN). A CPP is a set of Stream Processing Elements (SPEs) that has been designed using rtaiDesigner. The main difference between a CPP and an SPN is that, in general, the CPP contains no data sources or sinks: these are provided by the user app.
  • Conductor. The Conductor is responsible for managing an allocated resource session. User apps interact directly with the Conductor via a Websocket API while the Conductor maps data flowing on the Websocket API to and from the MQTT interfaces on the CPP(s) that have been allocated to that session.
  • Orchestrator. The Orchestrator manages the dynamic system. User apps interact with the Orchestrator to request resource. The Orchestrator allocates necessary CPP resources and creates a Conductor instance to act as the source and sink for the CPP(s). The user apps are then redirected to the Websocket API on the new Conductor instance at which point data can flow to and from the user. The Orchestrator is responsible for managing all of the rt-ai Edge nodes that have been allocated to the edge compute system, allocating CPPs to nodes dynamically based on available resources and hardware (e.g. GPU or embedded inference hardware).

The diagram above shows the idle state. The heart of this design is the Orchestrator as it directs all operations. When a user (via an app or browser) wants to use some edge resource, it uses the RESTful API of the Orchestrator to identify itself and define the details of the resources that it requires. The requested resources are then mapped to one or more CPP types. In this example, the Orchestrator maintains a hot pool of CPPs to minimize start up latency. Hot pool CPPs are instantiated but idle as they have no data sources. As the Orchestrator allocates CPPs from the pool, the Orchestrator creates new CPP instances to replace them. This is useful because inference SPEs can have startup times of several seconds. The hot pool hides this delay from the user. Note that the hot pool could consist of multiple types of CPPs that perform different functions – the Orchestrator just selects the correct type to satisfy the resource request. Alternatively, there could be a fixed set of CPP instances and users are just allocated to those. Or, CPPs can be instantiated on demand if startup latency is not an issue.

Once the Orchestrator has identified one or more CPPs to satisfy the resource request, it creates a Conductor instance for the request. The Conductor presents a Websocket API to the user while connecting into rt-ai Edge’s MQTT infrastructure to communicate with the CPPs. If there is only a single CPP involved, the input pin of the CPP is connected to the output pin of the Conductor and the input pin of the Conductor is connected to the output pin(s) of the CPP. If there is more than one CPP required, the CPPs are connected together as required (this can be an arbitrary graph, not just a pipeline) and the input and output pin(s) at the edges connected to the Conductor. Once this is all set up, the Orchestrator redirects the user app to the new Conductor instance and the session can begin as shown below:


As an example, suppose an AR headset user wants to identify and annotate objects in the real world using an AR overlay. In this case, the user app might request a CPP that performs the appropriate object detection and returns the box coordinates of the object and an identified label. The user app would stream the video feed from the AR headset to the Conductor using the Websocket connection. The Conductor would then pass the video frames on to the CPP. The output of the CPP would contain the detected object metadata that is passed via the Conductor onto the Websocket connection back to the user app for rendering.

Miniature rt-ai Edge node and inference engine


I wanted a small and portable rt-ai Edge node using the Neural Compute Stick for demos and decided to base it on a Gigabyte BRi7H-8550 compact PC as it is the lowest cost, smallest footprint, device that I could find with a decent i7 CPU. This is fitted with 16GB of DDR4 DRAM and a 256GB NVMe M2 disk. Previously I needed a mini ITX board along with a GPU which is much bigger and heavier as can be seen below.


The node is running Ubuntu 16.04 along with standard rt-ai node management software and performs very nicely. A second NCS can be fitted on the front USB port and a small USB hub could be used if more than two are required. For demo purposes, a Windows or Ubuntu laptop runs rtaiDesigner for GUI-based control and status with the node acting as a headless inference server.

While this is primarily intended as a demo device, it would actually be quite a nice embedded inference node.