Platform-independent highly augmented spaces using UWB

The SHAPE project must support consistent highly augmented spaces no matter what platform (headset plus software) any user within the physical space has chosen. Previously, SHAPE used ARKit to design spaces as an interim measure, but that was never going to solve the problem in a platform-independent way. What SHAPE needed was a platform-independent way of linking an ARKit spatial map to the real physical environment, and UWB technology provides just such a mechanism.

SHAPE breaks a large physical space into multiple subspaces – often mapped to physical rooms. A big problem is that augmentations can be seen through walls unless something prevents this. ARKit is relatively poor at wall detection, so I gave up trying to get that to work. It’s not really ARKit’s fault: mapping a room’s walls with a single camera is inherently unreliable. Another problem concerns windows and doors. Ideally, it should be possible to see augmentations outside of a physical room if they can be viewed through a window. That might be tough for any mapping software to handle correctly.

SHAPE is now able to solve these problems using UWB. The photo above shows part of the process used to link ARKit’s coordinate system to the physical space coordinate system defined by the UWB installation in the space. It shows how the yaw offset (rotation about the y-axis in Unity terms) between ARKit’s coordinate system and UWB’s coordinate system is measured and subsequently corrected. Basically, a UWB tag with a known position in the space is covered by the red ball in the center of the iPad screen and data is recorded at that point: the virtual AR camera position and rotation, along with the iPad’s position in the physical space. Because iPads do not currently support UWB, I attached one of the Decawave tags to the back of the iPad.

Separately, a tag in Listener mode is used by a new SHAPE component, EdgeUWB, to provide a service that makes the positions of all UWB tags in the subspace available. EdgeSpace keeps track of these tag positions, so when it receives a message to update the ARKit offset it can combine the AR camera pose (in the message from the SHAPE app to EdgeSpace), the iPad position (from the UWB tag via EdgeUWB) and the known location of the target UWB anchor (from a configuration file). With all of this information, EdgeSpace can calculate position and rotation offsets that are sent back to the SHAPE app so that the Unity augmentations can be correctly aligned in the physical space.
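The offset calculation itself isn’t shown, but the geometry can be sketched in a few lines. This is a Python sketch under some assumptions (function and parameter names are mine, the camera is pointed straight at the anchor when the sample is taken, and the small tag-to-camera offset on the back of the iPad is ignored):

```python
import math

def yaw_offset(cam_yaw_ar, ipad_pos_uwb, anchor_pos_uwb):
    """Yaw (rotation about y) that takes the ARKit frame to the UWB frame.

    cam_yaw_ar is the AR camera's yaw in the ARKit frame (radians),
    recorded while the camera is pointed straight at the anchor tag.
    """
    # Bearing from the iPad to the anchor in the UWB frame (x/z plane,
    # Unity-style yaw measured from +z towards +x).
    dx = anchor_pos_uwb[0] - ipad_pos_uwb[0]
    dz = anchor_pos_uwb[2] - ipad_pos_uwb[2]
    bearing_uwb = math.atan2(dx, dz)
    # The camera's heading in the ARKit frame is its yaw, so the
    # difference between the two headings is the frame-to-frame offset.
    return bearing_uwb - cam_yaw_ar

def position_offset(cam_pos_ar, yaw, ipad_pos_uwb):
    """Translation that maps the yaw-corrected ARKit camera position
    onto the iPad's measured UWB position."""
    c, s = math.cos(yaw), math.sin(yaw)
    # Rotate the ARKit position about y (Unity left-handed convention);
    # the offset is whatever translation remains.
    rx = c * cam_pos_ar[0] + s * cam_pos_ar[2]
    rz = -s * cam_pos_ar[0] + c * cam_pos_ar[2]
    return (ipad_pos_uwb[0] - rx,
            ipad_pos_uwb[1] - cam_pos_ar[1],
            ipad_pos_uwb[2] - rz)
```

Applying the yaw offset followed by the position offset to any ARKit pose then expresses it in UWB coordinates, which is all the SHAPE app needs to align its augmentations.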

As the ARKit spatial map is also saved during this process and reloaded every time the SHAPE app starts up (or enters the room/subspace in a full implementation), the measured offsets remain valid.


Coming back to the issue of ensuring that augmentations in a physical subspace can only be seen from within that subspace (except through a window or door), SHAPE now includes the concept of rooms that can have walls, a floor and a ceiling. The screenshot above shows an example. The yellow sticky note is outside of the room, so only the part visible through the “window” can be seen. Since it is not at all obvious what is happening in normal mode, the invisible but occluding walls can be replaced with visible walls so that alignment can be checked more easily.


This screenshot was taken in debug mode. The effect is subtle, but a blue film represents the normally invisible occluding walls, and the cutout for the window shows up as a clear region. It can also be seen that the alignment isn’t perfect – for example, the cutout is a couple of inches higher than the actual transparent part of the physical window. In this mode the full sticky note is visible because the debug walls don’t occlude.

Incidentally, the room walls use the procedural punctured plane technology that was developed a while ago.


This is an example showing a wall with rectangular and elliptical cutouts. Cutouts can be defined to allow augmentations to be seen through windows and doors by configuring the appropriate cutout in terms of the UWB coordinate system.

While the current system only supports ARKit, in principle any equivalent system could be aligned to the UWB coordinate system using some sort of similar alignment process. Once this is done, a user in the space with any supported XR headset type will see a consistent set of augmentations, reliably positioned within the physical space (at least within the alignment accuracy limits of the underlying platform and the UWB location system). Note that, while UWB support in user devices is helpful, it is only required for setting up the space and initial map alignment. After that, user devices can achieve spatial lock (via ARKit for example) and then maintain tracking in the normal way.

Indoor position measurement for XR applications using UWB

Obtaining reasonably accurate absolute indoor position measurements for mobile devices has always been tricky, to say the least. Things like ARKit can determine a relative position within a mapped space reasonably well in ideal circumstances, as can structured IR. What would be much more useful for creating complex XR experiences spread over a large physical space in which lots of people and objects might be moving around and occluding things is something that allows the XR application to determine its absolute position in the space. Ultra-wideband (UWB) provides a mechanism for obtaining an absolute position (with respect to a fixed point in the space) over a large area and in challenging environments and could be an ideal partner for XR applications. This is a useful backgrounder on how the technology works. Interestingly, Apple have added some form of UWB support to the iPhone 11. Hopefully future manufacturers of XR headsets, or phones that pair with XR headsets, will include UWB capability so that they can act as tags in an RTLS system.

UWB RTLS (Real Time Location System) technology seems like an ideal fit for the SHAPE project. An important requirement for SHAPE is that the system can locate a user within one of the subspaces that together cover the entire physical space. The use of subspaces allows the system to grow to cover very large physical spaces just by scaling servers. One idea that works to some extent is to use the AR headset or tablet camera to recognize physical features within a subspace, as in this work. However, once again, this only works in ideal situations where the physical environment has not changed since the reference images were taken. And, in a large space that has no features, such as a warehouse, this just won’t work at all.

Using UWB RTLS with a number of anchors spanning the physical space, it should be possible to determine an absolute location in the space and map this to a specific subspace, regardless of other objects moving through the space or how featureless the space is. To try this out, I am using the Decawave MDEK1001 development kit.

This includes 12 devices that can be used as anchors, tags, or gateways. Anchors are placed at fixed positions in the space – I have mounted four high up in the corners of my office, for example. Tags represent users moving around in the space. Putting a device in gateway mode allows it to be connected to a Raspberry Pi, which then provides web and MQTT access to the tag position data.
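As an illustration of consuming the gateway’s output, here is a minimal Python sketch. The payload shape below is an assumption loosely modelled on the per-tag position messages the gateway publishes – the real topic names and field layout may well differ:

```python
import json

# Hypothetical payload shape for a single tag position update in metres.
# Field names are assumptions, not the documented gateway format.
sample = '{"position": {"x": 1.42, "y": 2.10, "z": 0.95, "quality": 87}}'

def parse_tag_position(payload):
    """Extract an (x, y, z) position tuple plus a quality figure
    from one JSON position message."""
    msg = json.loads(payload)
    pos = msg["position"]
    return (pos["x"], pos["y"], pos["z"]), pos["quality"]

position, quality = parse_tag_position(sample)
```

In a real deployment this parser would sit behind an MQTT subscription to the gateway’s position topic.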

Setting up is pretty easy using an Android app that auto-discovers devices. The app does have an automatic measurement system that tries to determine the relative positions of anchors with respect to an origin, but that didn’t seem to work too well for me, so I resorted to physical measurement. Not a big problem and, in any case, the software cannot determine the system z offset automatically. Another thing that confused me is that tags default to a very slow update rate when they are not moving much, which makes it seem that the system isn’t working. Changing this to a much higher rate probably increases power usage considerably but certainly helps with testing! Speaking of power, the devices can be powered from USB power supplies or internal rechargeable batteries (which, incidentally, were not included).

Anyway, once everything was set up correctly, it seemed to work very well, using the Android app to display the tag locations with respect to the anchors. The next step is to integrate this system into SHAPE so that subspaces can be defined in terms of RTLS coordinates.
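Defining subspaces in terms of RTLS coordinates could be as simple as a table of bounding boxes looked up against each tag position. A minimal Python sketch (the subspace names, table format and 2D bounds are all hypothetical):

```python
# Hypothetical subspace table: axis-aligned bounds in UWB coordinates
# (metres). The real SHAPE configuration format is not defined here.
SUBSPACES = {
    "office":   ((0.0, 0.0), (5.0, 4.0)),   # (xmin, ymin), (xmax, ymax)
    "workshop": ((5.0, 0.0), (9.0, 4.0)),
}

def subspace_for(x, y):
    """Return the name of the subspace containing a tag's (x, y)
    RTLS position, or None if the tag is outside all subspaces."""
    for name, ((xmin, ymin), (xmax, ymax)) in SUBSPACES.items():
        if xmin <= x < xmax and ymin <= y < ymax:
            return name
    return None
```

Because the lookup depends only on the absolute UWB position, it keeps working however featureless the space is or however many people are moving through it.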

An interesting additional feature of UWB is that it also supports data transfers between devices. This could lead to some interesting new concepts for XR experiences…

Using homography to solve the “Where am I?” problem

In SHAPE, a large highly augmented space is broken up into a number of sub-spaces. Each sub-space has its own set of virtual augmentation objects positioned persistently in the real space with which AR device users physically present in the sub-space can interact in a collaborative way. It is necessary to break up the global space in this way in order to keep the number of augmentation objects that any one AR device has to handle down to a manageable number. Take the case of a large museum with very many individual rooms. A user can only experience augmentation objects in the same physical room, so each room becomes a SHAPE sub-space and only the augmentation objects in that particular room need to be processed by the user’s AR device.

This brings up two problems: how to work out which room the user is in when the SHAPE app is started (the “Where am I?” problem) and also detecting that the user has moved from one room to another. It’s desirable to do this without depending on external navigation which, in indoor environments, can be pretty unreliable or completely unavailable.

The goal was to use the video feed from the AR device’s camera (e.g. the rear camera on an iPad running ARKit) to solve these problems. The question was how to make this work. This seemed like something that OpenCV probably had an answer to, which meant that the first place to look was the Learn OpenCV web site. A while ago there was a post there about feature based image alignment, which seemed like the right sort of technique. I used that code as the basis for mine and it ended up working quite nicely.

The approach is to take a set of overlapping reference photos for each room and then pre-process them to extract the necessary keypoints and descriptors. These can then go into a database, labelled with the sub-space to which they belong, for comparison against user generated images. Here are two reference images of my (messy) office for example:

Next, I took another image to represent a user generated image:

It is obviously similar but not the same as any of the reference set. Running this image against the database resulted in the following two results for the two reference images above:

As you can see, the code has done a pretty good job of selecting the overlaps of the test image with the two reference images. This is an example of what you see if the match is poor:

It looks very cool but clearly has nothing to do with a real match! In order to select the best reference image match, I add up the distances for the 10 best feature matches against every reference image and then select the reference image (and therefore sub-space) with the lowest total distance. This can also be thresholded in case there is no good match. For these images, a threshold of around 300 would work.
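The selection logic just described can be sketched in a few lines of Python (function and parameter names are mine; in practice the distances would come from an OpenCV descriptor matcher run against each reference image):

```python
def best_reference(match_distances, n=10, threshold=300.0):
    """Pick the reference image whose n best feature matches sum lowest.

    match_distances maps a reference label (and hence a sub-space) to the
    list of descriptor match distances against the user image. Returns
    (label, total) or (None, None) if no reference beats the threshold.
    """
    best_label, best_total = None, None
    for label, dists in match_distances.items():
        # Sum the n smallest distances for this reference image.
        total = sum(sorted(dists)[:n])
        if best_total is None or total < best_total:
            best_label, best_total = label, total
    if best_total is None or best_total > threshold:
        return None, None   # no good match anywhere
    return best_label, best_total
```

The thresholding is what lets the homography server keep polling images until the user’s room is confidently identified.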

In practice, the SHAPE app will start sending images to a new SHAPE component, the homography server, which will keep processing images until the lowest distance match is under the threshold. At that point, the sub-space has been detected and the augmentation objects and spatial map can be downloaded to the app and used to populate the scene. By continuing this process, if the user moves from one room to another, the room (sub-space) change will be detected and the old set of augmentation objects and spatial map replaced with the ones for the new room.

The SHAPE architecture: scaling the core using Apache Kafka

SHAPE is being designed from the outset to scale to tens of thousands of simultaneous users or more in a single SHAPE universe, while providing a low latency experience to every AR user.  The current architectural concept is shown in the (somewhat messy) diagram above. A recent change has been the addition of Apache Kafka in the core layer. This helps solve one of the bigger problems: how to keep track of all of the augmentation object changes and interactions reliably and ensure a consistent representation for everyone.

SHAPE functionality is divided into four regions:

  • Core. Core functions are those that may involve significant amounts of data and processing but do not have tight latency requirements. Core functions could be implemented in a remote cloud for example. CoreUniverse manages all of the spatial maps, proxy object instances, spatial anchors and server configurations for the entire system and can be replicated for redundancy and load sharing. In order to ensure eventual consistency, Apache Kafka is used to keep a permanent record of updates to the space configuration (data flowing along the red arrows), allowing easy recovery from failures along with high reliability and scalability. The idea of using Kafka for this purpose was triggered by this paper incidentally.
  • Proxy. The proxy region contains the servers that drive the proxy objects (i.e. the AR augmentations) in the space. There are two types of servers in this region: asset servers and function servers. Asset servers contain the assets that form the proxy object – a Unity assetbundle for example. Users go directly to the asset servers (blue arrows – only a few shown for clarity) to obtain assets to instantiate. Function servers interact with the instantiated proxy objects in real time (via EdgeAccess as described below). For example, in the case of the famous analog clock proxy object (my proxy object equivalent of the classic Utah teapot), the function server drives the hands of the clock by supplying updated angles to the sub-objects within the analog clock asset.
  • Edge. The edge functions consist of those that have to respond to users with low latency. The first point of contact for SHAPE users is EdgeAccess. During normal operation, all real-time interaction takes place over a single link to an instance of EdgeAccess. This makes management, control and status on a per user basis very easy. EdgeAccess then makes ongoing connections to EdgeSpace servers and proxy function servers. A key performance enhancement is that EdgeAccess is able to multicast data from function servers if the data has not been customized for a specific proxy object instance. Function server data that can be multicast in this way is called undirected data; function server data intended for a specific proxy object instance is called directed data. The analog clock server generates undirected data, whereas a server that is interacting directly with a user (via proxy object interaction support) has to use directed data. EdgeSpace acts as a sort of local cache for CoreUniverse. Each EdgeSpace instance supports a sub-space of the entire universe. It caches the local spatial maps, object instances and anchors for the sub-space so that users located within that sub-space experience low latency updates. These updates are also forwarded to Kafka so that CoreUniverse instances will eventually correctly reflect the state of the local caches. EdgeSpace instances sync with CoreUniverse at startup and periodically during operation to ensure consistency.
  • User. In this context, users are SHAPE apps running on AR headsets. An important concept is that a standard SHAPE app can be used in any SHAPE universe. The SHAPE app establishes a single connection (black arrows) to an EdgeAccess instance. EdgeAccess provides the user app with the local spatial map to use, proxy object instances, asset server paths and spatial anchors. The user app then fetches the assets from one or more asset servers to populate its augmentation scene. In addition, the user app registers with EdgeAccess for each function server required by the proxy object instances. EdgeAccess is responsible for setting up any connections to function servers (green arrows – only a few shown for clarity) that aren’t already in existence.
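The directed/undirected distinction can be illustrated with a toy Python model of EdgeAccess fan-out (class and method names are mine, not the real SHAPE API):

```python
class EdgeAccessSketch:
    """Toy model of EdgeAccess routing: undirected function-server data
    is multicast to every registered user, while directed data goes only
    to the user owning the target proxy object instance."""

    def __init__(self):
        self.users = {}           # user id -> list of delivered messages
        self.instance_owner = {}  # proxy object instance id -> user id

    def register(self, user_id):
        self.users[user_id] = []

    def bind_instance(self, instance_id, user_id):
        self.instance_owner[instance_id] = user_id

    def undirected(self, payload):
        # e.g. the analog clock's hand angles: same data for everyone.
        for inbox in self.users.values():
            inbox.append(payload)

    def directed(self, instance_id, payload):
        # Interaction responses customized for one instance, hence one user.
        self.users[self.instance_owner[instance_id]].append(payload)
```

The multicast path is the performance win: one clock update from the function server fans out to every user at the edge rather than being generated per user.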

As an example of operation, consider a set of users physically present in the same sub-space. They may be connected to SHAPE via different EdgeAccess instances but will all use the same EdgeSpace. If one user makes a change to a proxy object instance (rotates it for example), the update information will be sent to EdgeSpace (via EdgeAccess) and then broadcast to the other users in the sub-space so that the changes are reflected in their augmentation scenes in real-time. The updates are also forwarded to Kafka so that CoreUniverse instances can track every local change.

This is very much a work in progress so details may change of course. There are quite a few details that I have glossed over here (such as spatial map management and a user moving from one sub-space to another) and they may well require changes.

Introducing SHAPE: Scalable Highly Augmented Physical Environment


This screenshot is an example of a virtual environment augmented with proxy objects created using rt-xr. However, this was always intended to be a VR precursor for an AR solution now called SHAPE – Scalable Highly Augmented Physical Environment. The difference is that the virtual objects used to augment the virtual environment shown above (whiteboards, status displays, sticky notes, camera screens and other static virtual objects in this case) will instead augment real physical environments, with a primary focus on scalability and local collaboration for physically present occupants. The intent is to open source SHAPE in the hope that others might like to contribute to the framework and/or contribute virtual objects to the object library.

Some of the features of SHAPE are:

  • SHAPEs are designed for collaboration. Multiple AR device users present in the same space are able to interact with virtual objects just like real objects, with consistent state maintained for all users.
  • SHAPE users can be grouped so that they see different virtual objects in the same space depending on their assigned group. A simple example of this would be where virtual objects are customized for language support – the virtual object set instantiated would then depend on the language selected by a user.
  • SHAPEs are scalable because they minimize the loading on AR devices. Complex processing is performed using a local edge server or remote cloud. Each virtual object is either static (just for display) or else can be connected to a server function that drives the virtual object and also receives interaction inputs that may modify the state of the virtual object, leaving the AR device to display objects and pass interaction events rather than performing complex functions on-device. Reducing the AR device loading in this way extends battery life and reduces heat, allowing devices to be used for longer sessions.
  • There is a natural fit between SHAPE and artificial intelligence/machine learning. As virtual objects are connected to off-device server functions, they can make use of inference results or supply data for machine learning derived from user interactions while leveraging much more powerful capabilities than are practical on-device.
  • A single universal app can be used for all SHAPEs. Any virtual objects needed for a particular space are downloaded at run time from an object server. However, there would be nothing stopping the creation of a customized app that included hard-coded assets while still leveraging the rest of SHAPE – this might be useful in some applications.
  • New virtual objects can be instantiated by users of the space, configured appropriately (including connection to remote server function) and then made persistent in location by registering with the object server.

A specific goal is to be able to support large scale physical environments such as amusement parks or sports stadiums, where there may be a very large number of users distributed over a very large space. The SHAPE system is being designed to support this level of scalability while being highly responsive to interaction.

In order to turn this into reality, the SHAPE concept requires low cost, lightweight AR headsets that can be worn for extended periods of time, perform reliable spatial localization in changing outdoor environments and provide high quality, wide angle augmentation displays. The technology isn’t there yet, so initial development will use iPads as the AR devices and ARKit for localization. Using iPads for this purpose isn’t ideal ergonomically but does allow all of the required functionality to be developed. When suitable headsets do become available, SHAPE will hopefully be ready to take advantage of them.

Converting screen coordinates + depth into spatial coordinates for OpenPose…or anything else really

Depth cameras are wonderful things but they typically only give a distance associated with each (x, y) coordinate in screen space. To convert to spatial coordinates involves some calculation. One thing to note is that I am ignoring camera calibration which is required to get best accuracy. See this page for details of how to use calibration data in iOS for example. I have implemented this calculation for the iPad TrueDepth camera and also the ZED stereo camera to process OpenPose joint data and it seems to work but I cannot guarantee complete accuracy!

The concept for the conversion is shown in the diagram above. One can think of the 2D camera image as being mapped to a screen plane – the blue plane in the diagram. The width and height of the plane are determined by its distance from the camera and the camera’s field of view. Using the iPad as an example, you can get the horizontal and vertical camera field of view angles (hFOV and vFOV in the diagram) like this:

hFOV = captureDevice.activeFormat.videoFieldOfView * Float.pi / 180.0
vFOV = 2 * atan(height / width * tan(hFOV / 2))
tanHalfHFOV = tan(hFOV / 2)
tanHalfVFOV = tan(vFOV / 2)

where width and height are the width and height of the 2D image. This calculation can be done once at the start of the session since it is defined by the camera itself.

For the Stereolabs ZED camera (this is a partial code extract):

#include <sl_zed/Camera.hpp>

sl::Camera zed;
sl::InitParameters init_params;

// set up params here
if (zed.open(init_params) != sl::SUCCESS) {
    exit(-1);
}

sl::CameraInformation ci = zed.getCameraInformation();
sl::CameraParameters cp = ci.calibration_parameters.left_cam;
// h_fov and v_fov are reported in degrees - convert to radians
hFOV = cp.h_fov * M_PI / 180.0;
vFOV = cp.v_fov * M_PI / 180.0;
tanHalfHFOV = tan(hFOV / 2);
tanHalfVFOV = tan(vFOV / 2);

To pick up the depth value, you just look up the hit point’s (x, y) coordinate in the depth buffer. For the TrueDepth camera and the ZED, this seems to be the perpendicular distance from the camera to the plane through the target point that is perpendicular to the camera’s look-at direction – the yellow plane in the diagram. Other types of depth sensors might give the radial distance from the camera to the hit point, which would require a slightly modified calculation. Here I am assuming that the depth buffer contains the perpendicular distance – call this spatialZ.

What we need now are the tangents of the angles that the ray from the camera to the screen plane hit point makes, horizontally and vertically, with the camera’s look-at ray – call these angles ThetaX (horizontal) and ThetaY (vertical). Given the perpendicular distance to the yellow plane, we can then easily calculate the spatial x and y coordinates using the field of view tangents previously calculated:

tanThetaX = (x - Float32(width / 2)) / Float32(width / 2) * tanHalfHFOV
tanThetaY = (y - Float32(height / 2)) / Float32(height / 2) * tanHalfVFOV

spatialX = spatialZ * tanThetaX
spatialY = spatialZ * tanThetaY

The coordinates (spatialX, spatialY, spatialZ) are in whatever units the depth buffer uses (often meters) and in the camera’s coordinate system. Converting from the camera’s coordinate system to world coordinates is a standard operation given the camera’s pose in the world space.
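Putting the pieces together, the whole conversion can be expressed as one short function. This is a Python sketch of the screen-plane geometry described above (names are mine; it assumes the depth buffer holds perpendicular distance, and ignores camera calibration as noted earlier):

```python
import math

def screen_to_spatial(x, y, depth, width, height, hfov_deg):
    """Convert a screen (x, y) plus perpendicular depth into
    camera-space coordinates using the field of view tangents."""
    hfov = math.radians(hfov_deg)
    tan_half_hfov = math.tan(hfov / 2)
    # The vertical half-FOV tangent scales with the image aspect ratio.
    tan_half_vfov = height / width * tan_half_hfov
    # Normalized offsets from the image center, scaled by the FOV tangents.
    tan_theta_x = (x - width / 2) / (width / 2) * tan_half_hfov
    tan_theta_y = (y - height / 2) / (height / 2) * tan_half_vfov
    return (depth * tan_theta_x, depth * tan_theta_y, depth)
```

A pixel at the image center maps to (0, 0, depth) as expected, and a pixel at the image edge maps to x = depth × tan(hFOV/2), which is a quick sanity check on the math.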

Registering virtual and real worlds with rt-xr, ARKit and Unity


One of the goals for rt-xr is to allow augmented reality users within a space to collaborate with virtual reality users physically outside of the space, with the VR users getting a telepresent sense of being physically within the same space. To this end, VR users see a complete model of the space (my office in this case) including augmentations while physically present AR users just see the augmentations. Some examples of augmentations are virtual whiteboards and virtual sticky notes. Both AR and VR users see avatars representing the position and pose of other users in the space.

Achieving this for AR users requires that their coordinate system corresponds with that of the virtual models of the room. For iOS, ARKit goes a long way to achieving this so the rt-xr app for iOS has been extended to include ARKit and work in AR mode. The screen capture above shows how coordinate systems are synced. A known location in physical space (in this case, the center of the circular control of the fan controller) is selected by touching the iPad screen on the exact center of the control. This identifies position. To avoid multiple control points, the app is currently started in the correct pose so that the yaw rotation is zero relative to the model origin. It is pretty quick and easy to do. The video below shows the process and the result.

After starting the app in the correct orientation, the user is free to move around before touching the control point. Once that’s done, the rt-xr part of the app starts up and loads the virtual model of the room. For this test, the complete model is being shown (i.e. as for VR users rather than AR users) although in real life only the augmentations would be visible – the idea here was to see how the windows lined up. The results are not too bad all things considered, although moving or rotating too fast can cause some drift. However, collaborating using augmentations can tolerate some offset, so this should not be a major problem.

There are just a couple of augmentations in this test. One is the menu switch (the glowing M) which is used to instantiate and control augmentations. There is also a video screen showing the snowy scene from the driveway camera, the feed being generated by an rt-ai design.

Next step is to test out VR and AR collaboration properly by displaying the correct AR scene on the iOS app. Since VR collaboration has worked for some time, extending it to AR users should not be too hard.