Platform-independent highly augmented spaces using UWB

The SHAPE project needs to support consistent, highly augmented spaces no matter what platform (headset plus software) is chosen by any user situated within the physical space. Previously, SHAPE used ARKit alone to design spaces as an interim measure, but that was never going to solve the problem in a platform-independent way. What SHAPE needed was a platform-independent way of linking an ARKit spatial map to the real physical environment, and UWB technology provides just such a mechanism.

SHAPE breaks a large physical space into multiple subspaces – often mapped to physical rooms. A big problem is that augmentations can be seen through walls unless something prevents this. ARKit is relatively awful at wall detection, so I gave up trying to get that to work. It’s not really ARKit’s fault: mapping a room’s walls with a single camera is just not reliable. Another problem concerns windows and doors. Ideally, it should be possible to see augmentations outside of a physical room if they can be viewed through a window, which might be tough for any mapping software to handle correctly.

SHAPE is now able to solve these problems using UWB. The photo above shows part of the process used to link ARKit’s coordinate system to the physical space coordinate system defined by the UWB installation in the space. It shows how the yaw offset (rotation about the y-axis in Unity terms) between ARKit’s coordinate system and the UWB coordinate system is measured and subsequently corrected. Basically, a UWB tag with a known position in the space is covered by the red ball in the center of the iPad screen and data is recorded at that point. The recorded data consists of the virtual AR camera position and rotation along with the iPad’s position in the physical space. Because iPads currently do not support UWB, I attached one of the Decawave tags to the back of the iPad. Separately, a tag in Listener mode is used by a new SHAPE component, EdgeUWB, to provide a service that makes available the positions of all UWB tags in the subspace. EdgeSpace keeps track of these tag positions, so when it receives a message to update the ARKit offset it can combine the AR camera pose (in the message from the SHAPE app to EdgeSpace), the iPad position (from the UWB tag via EdgeUWB) and the known location of the target UWB anchor (from a configuration file). With all of this information, EdgeSpace can calculate position and rotation offsets that are sent back to the SHAPE app so that the Unity augmentations can be correctly aligned in the physical space.
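For concreteness, here is a minimal sketch (in Unity C#) of how those offsets can be derived from the recorded data. The class and method names are illustrative rather than the actual SHAPE/EdgeSpace code, and it assumes the camera is not pointing straight up or down at the moment of capture.

```csharp
using UnityEngine;

// Illustrative sketch (not the actual SHAPE code) of deriving the yaw and position
// offsets between the ARKit frame and the UWB frame from a single measurement:
// the AR camera pose when the known UWB anchor is centered on screen, the iPad's
// own position reported by its UWB tag, and the anchor's configured UWB position.
public static class UwbArKitAlignment
{
    public static void ComputeOffsets(
        Vector3 arCameraPosition, Quaternion arCameraRotation,   // ARKit frame
        Vector3 uwbIpadPosition, Vector3 uwbAnchorPosition,      // UWB frame
        out float yawOffsetDegrees, out Vector3 positionOffset)
    {
        // Yaw (rotation about y) of the camera's forward direction in the ARKit frame.
        Vector3 arForward = arCameraRotation * Vector3.forward;
        float arYaw = Mathf.Atan2(arForward.x, arForward.z) * Mathf.Rad2Deg;

        // Yaw of the anchor as seen from the iPad in the UWB frame. Because the anchor
        // was centered on the screen, this is the same physical direction.
        Vector3 uwbDirection = uwbAnchorPosition - uwbIpadPosition;
        float uwbYaw = Mathf.Atan2(uwbDirection.x, uwbDirection.z) * Mathf.Rad2Deg;

        // Rotation that takes ARKit yaw to UWB yaw (shortest signed angle).
        yawOffsetDegrees = Mathf.DeltaAngle(arYaw, uwbYaw);

        // Translation that maps the yaw-corrected AR camera position onto the
        // iPad position measured by UWB.
        Quaternion yawCorrection = Quaternion.Euler(0f, yawOffsetDegrees, 0f);
        positionOffset = uwbIpadPosition - yawCorrection * arCameraPosition;
    }
}
```

With these two values the app can parent its ARKit-tracked content under a transform carrying that yaw and translation, which is all that is needed to express augmentations in UWB coordinates.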

As the ARKit spatial map is also saved during this process and reloaded every time the SHAPE app starts up (or enters the room/subspace in a full implementation), the measured offsets remain valid.


Coming back to the issue of ensuring that augmentations in a physical subspace can only be seen from within it, except through a window or door, SHAPE now includes the concept of rooms that can have walls, a floor and a ceiling. The screenshot above shows an example of this. The yellow sticky note is outside the room, so only the part visible through the “window” can be seen. Since it is not at all obvious what is happening, the invisible but occluding walls used in normal mode can be replaced with visible walls so that the alignment can be visualized more easily.


This screenshot was taken in debug mode. The effect is subtle, but there is a blue film representing the normally invisible occluding walls, and the cutout for the window shows up as a clear area. It can also be seen that the alignment isn’t totally perfect – for example, the cutout is a couple of inches higher than the actual transparent part of the physical window. In this case the full sticky note is visible, as the debug walls don’t occlude.

Incidentally, the room walls use the procedural punctured plane technology that was developed a while ago.


This is an example showing a wall with rectangular and elliptical cutouts. Cutouts, configured in terms of the UWB coordinate system, allow augmentations to be seen through windows and doors.
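To give an idea of what such a definition might look like, here is a hypothetical configuration structure for a wall and its cutouts; the field names and layout are my own illustration, not the actual SHAPE configuration format.

```csharp
using UnityEngine;

// Hypothetical shape of a wall/cutout configuration entry, positioned in the
// UWB coordinate system. The real SHAPE configuration format may differ.
[System.Serializable]
public class WallCutout
{
    public enum CutoutShape { Rectangle, Ellipse }

    public CutoutShape shape = CutoutShape.Rectangle;
    public Vector2 center;   // center of the cutout on the wall plane, in meters
    public Vector2 size;     // width and height (or ellipse axes), in meters
}

[System.Serializable]
public class RoomWall
{
    public Vector3 origin;        // one corner of the wall in UWB coordinates
    public Vector3 normal;        // outward-facing wall normal
    public Vector2 extent;        // wall width and height in meters
    public WallCutout[] cutouts;  // windows and doors punched through the occluder
}
```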

While the current system only supports ARKit, in principle any equivalent system could be aligned to the UWB coordinate system using a similar process. Once this is done, a user in the space with any supported XR headset will see a consistent set of augmentations, reliably positioned within the physical space (at least within the alignment accuracy limits of the underlying platform and the UWB location system). Note that, while UWB support in user devices is helpful, it is only required for setting up the space and the initial map alignment. After that, user devices can achieve spatial lock (via ARKit, for example) and maintain tracking in the normal way.

Connecting to SHAPE-based augmented spaces via QR codes or NFC


The SHAPE concept requires that a single standard SHAPE app works with any SHAPE installation without the user having to do anything particularly special. The first thing the SHAPE app has to do is connect with an EdgeAccess instance (see the SHAPE architecture here), or two (primary and backup) if redundant operation is required. A simple way to do this is to use QR codes, and this is now working in the macOS and iOS Unity SHAPE apps (with the help of the ZXing.Net QR code reader). The idea is that a customer entering a theme park or sports arena, for example, would be given a customized QR code containing their assigned temporary user name and the URLs of the primary and backup EdgeAccess instances. The SHAPE app is then started and begins searching for a valid SHAPE QR code. When one is found, the SHAPE app connects to the specified EdgeAccess(es) and begins normal operation.
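As a rough sketch of the scanning step, something like the following works with the Unity build of ZXing.Net. The payload layout shown (a delimited string carrying the user name and the two EdgeAccess URLs) is an assumption for illustration, not the actual SHAPE QR format.

```csharp
using UnityEngine;
using ZXing;   // ZXing.Net (Unity build)

// Minimal sketch of scanning a SHAPE QR code from camera pixels.
public class ShapeQrScanner
{
    readonly BarcodeReader reader = new BarcodeReader();

    public bool TryDecode(Color32[] pixels, int width, int height,
                          out string userName, out string primaryUrl, out string backupUrl)
    {
        userName = primaryUrl = backupUrl = null;

        // The Unity build of ZXing.Net can decode directly from Color32 pixel data.
        var result = reader.Decode(pixels, width, height);
        if (result == null)
            return false;

        // Assumed payload: "shape|<userName>|<primaryEdgeAccessUrl>|<backupEdgeAccessUrl>"
        var parts = result.Text.Split('|');
        if (parts.Length < 4 || parts[0] != "shape")
            return false;

        userName = parts[1];
        primaryUrl = parts[2];
        backupUrl = parts[3];
        return true;
    }
}
```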

This mode of operation implies a system that creates the QR codes and is tied in to ticket purchasing, which might not always be practical. An alternative is to have standard QR codes for the SHAPE installation that contain URLs to a new SHAPE component called AccessManager. One or two AccessManagers (two for redundancy) serve the entire installation, which means that one or more standard QR codes could be supplied to any customer. The first step for the app is then to connect to the AccessManager (using the URL from the QR code), which redirects the SHAPE app to its assigned primary and backup EdgeAccess instances. This allows for dynamic load sharing between EdgeAccess instances at connection time (rather than at QR code generation time, as in the customized QR code case).

However, there are advantages to generating customized QR codes for every customer. One advantage is that users can be added to groups easily. SHAPE augmentations can be defined to be visible only to members of a group. This means that a group could have private sticky notes left around the SHAPE installation for example. Or, a group assignment could define a specific version of information and augmentations for an event. As an example, if two teams are playing some sort of match in an arena, customers might want to identify with one of the teams and see customized information feeds and augmentations that are most relevant to them.

While QR codes work well, NFC might be a better way to go for real installations. If an AR headset uses a smartphone to run the SHAPE app, the smartphone’s NFC capability could be used to transfer the SHAPE connection information. Or if a headset is able to run the SHAPE app standalone and has an NFC capability, that could also be used.

SHAPE itself is working pretty well now, with sticky notes and whiteboards (essentially as in rt-xr) supporting collaboration and persistence. CoreUniverse, EdgeSpace, EdgeAccess and asset serving are all operational. The QR code system got rid of some of the temporary configuration; there are a few more temporary fixes left to be eliminated before the implementation becomes more generally usable.

Integrating SHAPE with rt-ai: adding AI to highly augmented spaces

A key feature of SHAPE is its ability to leverage the power of external servers in order to enhance the AR experience. The idea of combining relatively simple and cheap AR headsets with low latency communications links (such as 5G wireless) to edge servers is what is driving SHAPE’s architecture. Giving SHAPE access to rt-ai edge systems is a first example of this in action.

The screen capture above gives an idea of the current state of SHAPE development. It was taken using an iPad Pro running the iOS SHAPE app. The polygons with red edges are the planes detected by ARKit. At the bottom right, the monitor shows the same app running on a Mac (in the Unity editor in this case). The macOS version greatly speeds development of everything other than ARKit-related functionality – especially the space synchronization functions (e.g. adding, moving, modifying or deleting object actions that need to be shared between all SHAPE users in the same space). The Unity iOS SHAPE app uses the ARFoundation API to, amongst other things, load and save ARWorldMaps in order to synchronize spatial locations between SHAPE app instances. ARWorldMaps are persisted by the CoreUniverse components and cached for real-time use by EdgeSpace components, one EdgeSpace per physical “room”. SHAPE apps physically entering the room receive the latest map along with the space definition for that room. This includes the directory of augmentation objects, with metadata that allows them all to be downloaded from asset servers (unless already cached), positioned correctly in the physical space and connected to the appropriate external function servers.
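The ARWorldMap save/load path looks roughly like the sketch below, which follows the pattern in Unity's AR Foundation samples rather than the actual SHAPE code; exact API details can vary between ARKit XR Plugin versions.

```csharp
using System.IO;
using Unity.Collections;
using UnityEngine;
using UnityEngine.XR.ARFoundation;
using UnityEngine.XR.ARKit;   // ARKit-only; guard with #if UNITY_IOS in a real project

// Rough sketch of ARWorldMap persistence for spatial synchronization.
public class WorldMapPersistence : MonoBehaviour
{
    [SerializeField] ARSession session;

    string MapPath => Path.Combine(Application.persistentDataPath, "room.worldmap");

    public void SaveWorldMap()
    {
        var subsystem = session.subsystem as ARKitSessionSubsystem;
        if (subsystem == null)
            return;   // not running on an ARKit device

        subsystem.GetARWorldMapAsync((status, worldMap) =>
        {
            if (status.IsError())
                return;

            // Serialize the map to bytes and store it (SHAPE would hand this to CoreUniverse).
            using (var bytes = worldMap.Serialize(Allocator.Temp))
                File.WriteAllBytes(MapPath, bytes.ToArray());
            worldMap.Dispose();
        });
    }

    public void LoadWorldMap()
    {
        var subsystem = session.subsystem as ARKitSessionSubsystem;
        if (subsystem == null || !File.Exists(MapPath))
            return;

        var data = new NativeArray<byte>(File.ReadAllBytes(MapPath), Allocator.Temp);
        if (ARWorldMap.TryDeserialize(data, out ARWorldMap worldMap))
            subsystem.ApplyWorldMap(worldMap);   // relocalize against the saved map
        data.Dispose();
    }
}
```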

Augmentation objects can be moved around the space manually by touching the object with three or more fingers; it sounds awful but it does work. The object can then be dragged around the screen, and the screen can be moved around to position the object in space. Touching the object with two fingers brings up the object menu for that instance. This allows the object to be deleted, resized or rotated. It also allows the object to be stuck to a wall or to the floor. In this context, a wall is an ARKit vertical plane and a floor is an ARKit horizontal plane, so the object could just as easily be placed on a table if a suitable plane has been detected. If not, it can be placed manually. All of these object changes are sent to the room’s EdgeSpace (via EdgeAccess) and shared with the other users in the space to keep everything synchronized. In addition, updates are sent to CoreUniverse for persistence. These become integrated into the persistent space definition for the room, which EdgeSpace instances receive on a regular basis from CoreUniverse (primary and backup). This creates an interesting race condition, since EdgeSpace is modifying its cached space definition in real time and it may take a while for the CoreUniverse version to catch up. The problem is handled using timestamps attached to updates, so that EdgeSpace can correctly integrate new information from CoreUniverse (such as a new object instantiated by a space design tool) while ignoring stale updates for existing objects.
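The timestamp rule can be illustrated with a small sketch; the names here are illustrative, not the actual EdgeSpace code.

```csharp
using System;
using System.Collections.Generic;

// Sketch of the timestamp rule described above: an update is only applied to the
// cached space definition if it is newer than what EdgeSpace already holds, so a
// stale persisted state from CoreUniverse cannot overwrite fresher real-time edits.
public class CachedSpaceDefinition
{
    class ObjectRecord
    {
        public DateTime UpdateTimestamp;
        public string SerializedState;   // pose, scale, metadata, etc.
    }

    readonly Dictionary<string, ObjectRecord> objects = new Dictionary<string, ObjectRecord>();

    // Called both for real-time updates from users and for the periodic
    // space definition refresh arriving from CoreUniverse.
    public bool TryApplyUpdate(string objectId, DateTime updateTimestamp, string serializedState)
    {
        if (objects.TryGetValue(objectId, out var existing) &&
            updateTimestamp <= existing.UpdateTimestamp)
        {
            return false;   // stale: CoreUniverse has not yet caught up for this object
        }

        objects[objectId] = new ObjectRecord
        {
            UpdateTimestamp = updateTimestamp,
            SerializedState = serializedState
        };
        return true;   // new object (e.g. from a space design tool) or newer state
    }
}
```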

The box with the big “M”s is the menu object. Each room has one and it can be placed anywhere convenient in the room. Clicking on it (well, touching it if using an iPad touch screen) pops up a menu that allows the user to add augmentation objects. Right now this only works for the infamous analog clock, but it will eventually present a catalog of available models with thumbnails. The analog clocks are proxy objects, driven by an external analog clock server. Obviously it would be trivial to implement this purely in the Unity app, but it is meant as a simple test of the proxy object concept. The next proxy object to be added will be the sticky note object from rt-xr, and then probably the rt-xr shared whiteboard.
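As an illustration of the proxy object idea, the analog clock proxy only has to apply the hand angles supplied by its function server to its sub-objects. The message layout (three angles in degrees) is an assumption for this sketch, not the actual server format.

```csharp
using UnityEngine;

// Sketch of what an analog clock proxy object does with updates from its function server.
public class AnalogClockProxy : MonoBehaviour
{
    [SerializeField] Transform hourHand;
    [SerializeField] Transform minuteHand;
    [SerializeField] Transform secondHand;

    // Called by the SHAPE messaging layer when an update arrives from the
    // analog clock function server.
    public void OnServerUpdate(float hourAngle, float minuteAngle, float secondAngle)
    {
        // Rotate each hand about the clock face's local z axis.
        hourHand.localRotation = Quaternion.Euler(0f, 0f, -hourAngle);
        minuteHand.localRotation = Quaternion.Euler(0f, 0f, -minuteAngle);
        secondHand.localRotation = Quaternion.Euler(0f, 0f, -secondAngle);
    }
}
```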

Getting back to rt-ai integration, the rt-ai design above shows the simple test design that receives captured frames from the iPad’s rear camera. The frame rate is limited to 5fps so as not to load the WiFi link too much. For simplicity and low latency, motion JPEG is used for this, but of course compressed video could be used (and probably will be in the future). The new rt-ai SPE called SHAPEConductor looks to the SHAPE system like a SHAPE function server while mapping received messages into and out of an rt-ai stream processing network. In this case, the video is simply passed through DeepLab to perform semantic segmentation and the results are then displayed:


Here it is picking up the monitor running the macOS SHAPE app. In practice, more complex processing would be performed and results returned to proxy objects via the SHAPEConductor module and the SHAPE network.
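On the app side, the frame-rate limiting described above amounts to something like this sketch, where SendFrame stands in for the real transport to the SHAPEConductor (which is not shown here).

```csharp
using UnityEngine;

// Simple sketch of throttled motion-JPEG capture: at most 5 frames per second
// are JPEG-encoded and handed to whatever sends them to the SHAPEConductor.
public class ThrottledFrameSender : MonoBehaviour
{
    const float targetFps = 5f;
    float nextCaptureTime;

    public Texture2D cameraTexture;            // filled in elsewhere with the latest camera image
    public System.Action<byte[]> SendFrame;    // e.g. wraps the frame in an rt-ai message

    void Update()
    {
        if (Time.time < nextCaptureTime || cameraTexture == null || SendFrame == null)
            return;

        nextCaptureTime = Time.time + 1f / targetFps;

        // Motion JPEG: each frame is an independent JPEG image.
        byte[] jpeg = cameraTexture.EncodeToJPG(75);
        SendFrame(jpeg);
    }
}
```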

One interesting application for this is to use the captured frames to recognize the physical space, automatically load the correct saved ARWorldMap for that space into the SHAPE app and instantiate all the appropriate augmentation objects, correctly located. Another would be to perform semantic segmentation and return the results to the SHAPE app so that they can be combined with depth data to allow real-time occlusion. ARKit 3 will do this on-device for people, but apparently not in the general case. Offloading the segmentation should allow for a lot more flexibility, albeit with increased latency, and let it work on lower capability devices.

The SHAPE rt-ai integration is very much a work in progress and it will be fun to see what can be achieved with this combination.

The SHAPE architecture: scaling the core using Apache Kafka

SHAPE is being designed from the outset to scale to tens of thousands of simultaneous users or more in a single SHAPE universe, while providing a low latency experience to every AR user.  The current architectural concept is shown in the (somewhat messy) diagram above. A recent change has been the addition of Apache Kafka in the core layer. This helps solve one of the bigger problems: how to keep track of all of the augmentation object changes and interactions reliably and ensure a consistent representation for everyone.

SHAPE functionality is divided into four regions:

  • Core. Core functions are those that may involve significant amounts of data and processing but do not have tight latency requirements; they could be implemented in a remote cloud, for example. CoreUniverse manages all of the spatial maps, proxy object instances, spatial anchors and server configurations for the entire system and can be replicated for redundancy and load sharing. In order to ensure eventual consistency, Apache Kafka is used to keep a permanent record of updates to the space configuration (data flowing along the red arrows), allowing easy recovery from failures along with high reliability and scalability (a sketch of publishing such an update follows this list). Incidentally, the idea of using Kafka for this purpose was triggered by this paper.
  • Proxy. The proxy region contains the servers that drive the proxy objects (i.e. the AR augmentations) in the space. There are two types of servers in this region: asset servers and function servers. Asset servers contain the assets that form a proxy object – a Unity assetbundle, for example. Users go directly to the asset servers (blue arrows – only a few shown for clarity) to obtain assets to instantiate. Function servers interact with the instantiated proxy objects in real time (via EdgeAccess, as described below). For example, in the case of the famous analog clock proxy object (my proxy object equivalent of the classic Utah teapot), the function server drives the hands of the clock by supplying updated angles to the sub-objects within the analog clock asset.
  • Edge. The edge functions are those that have to respond to users with low latency. The first point of contact for SHAPE users is EdgeAccess. During normal operation, all real-time interaction takes place over a single link to an instance of EdgeAccess, which makes management, control and status on a per-user basis very easy. EdgeAccess then makes ongoing connections to EdgeSpace servers and proxy function servers. A key performance enhancement is that EdgeAccess is able to multicast data from function servers if the data has not been customized for a specific proxy object instance. Function server data that can be multicast in this way is called undirected data; function server data intended for a specific proxy object instance is called directed data. The analog clock server generates undirected data, whereas a server that is interacting directly with a user (via proxy object interaction support) has to use directed data. EdgeSpace acts as a local cache for CoreUniverse. Each EdgeSpace instance supports a sub-space of the entire universe. It caches the local spatial maps, object instances and anchors for the sub-space so that users located within that sub-space experience low latency updates. These updates are also forwarded to Kafka so that CoreUniverse instances will eventually correctly reflect the state of the local caches. EdgeSpace instances sync with CoreUniverse at startup and periodically during operation to ensure consistency.
  • User. In this context, users are SHAPE apps running on AR headsets. An important concept is that a standard SHAPE app can be used in any SHAPE universe. The SHAPE app establishes a single connection (black arrows) to an EdgeAccess instance. EdgeAccess provides the user app with the local spatial map to use, proxy object instances, asset server paths and spatial anchors. The user app then fetches the assets from one or more asset servers to populate its augmentation scene. In addition, the user app registers with EdgeAccess for each function server required by the proxy object instances. EdgeAccess is responsible for setting up any connections to function servers (green arrows – only a few shown for clarity) that aren’t already in existence.
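As promised in the Core bullet above, here is a minimal sketch of appending a space update to Kafka, assuming the Confluent.Kafka .NET client. The topic name, keying scheme and payload format are illustrative only.

```csharp
using System.Threading.Tasks;
using Confluent.Kafka;   // Confluent's .NET Kafka client

// Sketch of logging a space configuration update to Kafka so that CoreUniverse
// instances can replay it and eventually reflect the state of the local caches.
public class SpaceUpdateLog
{
    readonly IProducer<string, string> producer;

    public SpaceUpdateLog(string bootstrapServers)
    {
        var config = new ProducerConfig { BootstrapServers = bootstrapServers };
        producer = new ProducerBuilder<string, string>(config).Build();
    }

    public async Task PublishAsync(string subSpaceId, string updateJson)
    {
        // Keying by sub-space keeps each room's updates ordered within a partition.
        var message = new Message<string, string>
        {
            Key = subSpaceId,
            Value = updateJson   // e.g. {"objectId":"...","pose":...,"timestamp":"..."}
        };
        await producer.ProduceAsync("shape-space-updates", message);
    }
}
```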

As an example of operation, consider a set of users physically present in the same sub-space. They may be connected to SHAPE via different EdgeAccess instances but will all use the same EdgeSpace. If one user makes a change to a proxy object instance (rotates it for example), the update information will be sent to EdgeSpace (via EdgeAccess) and then broadcast to the other users in the sub-space so that the changes are reflected in their augmentation scenes in real-time. The updates are also forwarded to Kafka so that CoreUniverse instances can track every local change.

This is very much a work in progress so details may change of course. There are quite a few details that I have glossed over here (such as spatial map management and a user moving from one sub-space to another) and they may well require changes.

Real time OpenPose on an iPad…with the help of remote inference and rendering

I wanted to use the front camera of an iPad to act as the input to OpenPose so that I could track pose in real time with the original idea being to leverage CoreML to run pose estimation on the device. There are a few iOS implementations of OpenPose (such as this one) but they are really designed for offline processing as they are pretty slow. I did try a different pose estimator that runs in real time on my iPad Pro but the estimation is not as good as OpenPose.

So the question was how to run iPad OpenPose in real time in some way – compromise was necessary! I do have an OpenPose SPE as part of rt-ai Edge that runs very nicely so an obvious solution was to run rt-ai Edge OpenPose on a server and just use the iPad as an input and output device. The nice plus of this new iOS app called iOSEdgeRemote is that it really doesn’t care what kind of remote processing is being used. Frames from the camera are sent to an rt-ai Edge Conductor connected to an OpenPose pipeline.

The rt-ai Edge design for this test is shown above. The pipeline optionally annotates the video and returns that and the pose metadata to the iPad for display. However, the pipeline could be doing anything provided it returns some sort of video back to the iPad.

The results are shown in the screen captures above. Using a GTX 1080 Ti GPU, I was getting around 19fps with just body pose processing turned on and around 9fps with face pose also turned on. Latency is not noticeable with body pose estimation alone, and even with face pose estimation turned on it is entirely usable.

If a low latency server is available, remote inference and rendering has a lot of advantages over trying to squeeze everything into the iPad and using CoreML for inference – 5G communications is an obvious enabler of this kind of remote inference and rendering in a wide variety of situations. The intrinsic performance of the iPad also matters far less, as it is not doing anything too difficult and has plenty of resources left for other processing. The previous Unity/ARKit object detector uses a similar idea but consumes more iPad resources and is not general purpose. If Unity and ARKit aren’t needed, iOSEdgeRemote with remote inference and rendering is a very powerful system.

Another nice aspect of this is that I believe future mixed reality headsets will be very lightweight devices that avoid complex processing in the headset (unlike the HoloLens, for example) or cables to an external processor (unlike the Magic Leap One, for example). The headset provides cameras, SLAM of some sort, displays and radios. All other complex processing is performed remotely and video is used to drive the displays. This might be the only way to enable MR headsets that can run for 8 hours or more without a recharge and are light enough (and run cool enough) to be worn for extended periods.

Adding depth to DNN object detection with ARKit and Unity AR Foundation


Following on from the previous post, I thought that it would be fun to try adding depth information to the detected objects using surface planes constructed by ARKit. The results are not at all bad. ARKit didn’t always detect the vertical planes correctly, but horizontal ones seemed pretty reliable. I just used Unity AR Foundation’s ray casting function at the center of the detected object to get a depth indication. Of course, this is really the distance to the nearest horizontal or vertical plane, so it isn’t perfect.
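The depth lookup amounts to a single AR Foundation ray cast through the center of the detection box. A minimal sketch, assuming an ARRaycastManager is present in the scene, looks like this.

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.XR.ARFoundation;
using UnityEngine.XR.ARSubsystems;

// Sketch of the depth lookup: cast a ray through the screen-space center of a
// detected object and take the nearest plane hit as a depth estimate.
public class DetectionDepthEstimator : MonoBehaviour
{
    [SerializeField] ARRaycastManager raycastManager;
    static readonly List<ARRaycastHit> hits = new List<ARRaycastHit>();

    // screenCenter is the center of the detection bounding box in screen pixels.
    public bool TryGetDepth(Vector2 screenCenter, out float depthMeters, out Vector3 worldPoint)
    {
        depthMeters = 0f;
        worldPoint = Vector3.zero;

        if (!raycastManager.Raycast(screenCenter, hits, TrackableType.PlaneWithinPolygon))
            return false;   // no horizontal or vertical plane behind this pixel

        // Hits are sorted by distance, so the first one is the nearest plane.
        depthMeters = hits[0].distance;
        worldPoint = hits[0].pose.position;
        return true;
    }
}
```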

In the end, there’s no replacement for mobile devices with proper depth sensing cameras. Even though Tango didn’t make it, it would be nice to think that real depth sensing could become mainstream one day.

Using edge inference to detect real world objects with Unity AR Foundation, ARKit and rt-ai Edge

The Unity AR Foundation provides a convenient high level way of utilizing ARCore and ARKit in order to implement mixed and augmented reality applications. I used it to implement an iPad app that could access an rt-ai Edge Composable Processing Pipeline (CPP) via the new Conductor Stream Processing Element (SPE). This is the CPP used to test Conductor:


The Conductor SPE provides a Websocket API to mobile devices and is able to pass data from the mobile device to the pipeline and then return the results of the CPP’s processing back to the mobile device. In this case, I am using the CYOLO SPE to perform object detection on the video stream from the mobile device’s camera. The output of the CYOLO SPE goes to three destinations – back to the Conductor, to a MediaView for display locally (for debug) and also to a PutManifold SPE for long term storage and off-line processing.

The iPad Unity app used to test this arrangement uses AR Foundation and ARKit for spatial management and convenient access to camera data. The AR Foundation is especially nice as, if you only need the subset of ARKit functionality currently available, you can do everything in the C# domain without having to get involved with Swift and/or Objective C and all that. The captured camera data is formatted as an rt-ai Edge message and sent via the Websocket API to the Conductor. The Conductor returns detection metadata to the iPad which then uses this to display the labelled detection frames in the Unity space.

Right now, the app draws a labelled frame at a constant distance of 1 meter from the camera, aligned with the detected object. An enhancement would be to use depth information (if any is available) so that the frame could be positioned at the correct depth. Alternatively, the frame label could include the depth information.
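The current fixed-depth placement is essentially the following sketch. The normalized bounding-box center is assumed to come from the Conductor's detection metadata, and the field and prefab names are illustrative rather than the actual app code.

```csharp
using UnityEngine;

// Sketch of placing a labelled detection frame 1 meter in front of the camera,
// aligned with the detection's bounding-box center (normalized 0..1 coordinates).
public class DetectionFramePlacer : MonoBehaviour
{
    [SerializeField] Camera arCamera;
    [SerializeField] GameObject framePrefab;
    const float placementDepth = 1.0f;   // meters from the camera

    public void PlaceFrame(Vector2 normalizedBoxCenter, string label)
    {
        // Viewport coordinates map directly from the normalized box center.
        var viewportPoint = new Vector3(normalizedBoxCenter.x, normalizedBoxCenter.y, placementDepth);
        Vector3 worldPos = arCamera.ViewportToWorldPoint(viewportPoint);

        // Instantiate the frame and orient it along the camera ray so it faces the user.
        var frame = Instantiate(framePrefab, worldPos, Quaternion.identity);
        frame.transform.rotation = Quaternion.LookRotation(worldPos - arCamera.transform.position);
        frame.name = label;   // the real app also renders the label text on the frame
    }
}
```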

This setup demonstrates that it is feasible for an XR app to offload inference to an edge compute system and process results in real time. This greatly reduces the load on the mobile device, pointing the way to lightweight, low power, head mounted XR devices that could last for a full workday without recharge. Performing inference on-device (with CoreML for example) is certainly a viable alternative, especially where privacy dictates that raw data (such as video) cannot leave the device. However, processing such data using an edge compute system is hardly the same as sending data out to a remote cloud so, in many cases, privacy requirements can still be satisfied using edge offload.

This particular setup does not require Orchestrator as the iPad test app can go directly to the Conductor, which is part of a statically allocated CPP. The next step to complete the architecture is to add in the Orchestrator interaction so that CPPs can be dynamically instantiated.