SHAPE AssetTags: a different way to create virtual augmentations for XR spaces

The earlier work with UWB tags generated an idea for something I am calling a SHAPE AssetTag. Essentially, this is a tag that is associated with a virtual SHAPE augmentation. The augmentation follows the position and orientation of the tag, making for a very simple way to implement augmented spaces. If engineered properly, the tag itself could be a trivially simple piece of hardware: essentially the UWB hardware along with a MEMS IMU and a battery. Instead of WiFi as in this prototype, pose updates could be sent over the UWB infrastructure to keep things even simpler. Ideally, these tags would be extremely cheap and could be placed anywhere in a space as an easy way of adding augmentations. The augmentations can be proxy objects (every aspect of a proxy object augmentation can be modified by remote servers) and can be as simple or complex as desired.

Note that the SHAPE AssetTag doesn’t need to contain the actual asset data (although it could if desired). All it needs to do is provide the URL of a repository where the asset (either a Unity assetbundle or a glTF blob) can be found; the asset is then streamed dynamically when it needs to be instantiated. The tag also provides information about where to find function servers in the case of a proxy object. The SHAPE device app (in this case an iOS app running on an iPad Pro) doesn’t need to know anything about SHAPE AssetTags – the tags just inject normal-looking (but transient) augmentation updates into the SHAPE system so that augmentations magically appear. Obviously, this kind of flexibility could easily be abused and, in real life, a proper security strategy would need to be implemented in most cases. For development, though, it’s nice for things to just work!
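
To make that a little more concrete, here is a purely hypothetical sketch of the kind of metadata an AssetTag might advertise and the transient update built from it – the field names and structure are my own assumptions, not the actual SHAPE formats:

```python
# Hypothetical sketch of the metadata a SHAPE AssetTag could advertise. The tag
# only carries references (asset URL, function servers), never the asset itself.

asset_tag_descriptor = {
    "tagId": "assettag-0042",                                  # unique tag identity
    "assetUrl": "https://assets.example.com/cesiumman.gltf",   # where to stream the asset from
    "assetType": "gltf",                                       # could also be a Unity assetbundle
    "functionServers": [                                       # only needed for proxy objects
        "https://functions.example.com/cesiumman"
    ],
}

def make_augmentation_update(descriptor, position, orientation):
    """Build the transient augmentation update injected into the SHAPE system."""
    return {
        "tagId": descriptor["tagId"],
        "assetUrl": descriptor["assetUrl"],
        "functionServers": descriptor["functionServers"],
        "position": position,        # e.g. (x, y, z) in metres from the UWB system
        "orientation": orientation,  # e.g. quaternion (w, x, y, z) from the IMU
        "transient": True,           # disappears if the tag stops reporting
    }
```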

One application that I like is a shared space where people can bring along their virtual creations in the form of SHAPE AssetTags and just place them in the SHAPE-enhanced space, so that any user in the space with an XR device can see them.

Another idea is that items in stores could have SHAPE AssetTags attached to them (like security tags today) so that looking at an item with an XR device would perhaps demonstrate some kind of feature. Manufacturers could supply the asset and function servers, freeing the retail store from having to implement something for every stocked item. This could of course be done with QR codes, but then the augmentation would not be physically locked to the item; keeping the two locked together is what enables some very interesting augmentations. The item could be picked up and moved but the augmentation would retain the correct physical pose with respect to the item.

For now, the hardware is a complete hack with multiple components, but it does prove that the concept is viable. In the photo above, the UWB tag (the white box on the floor under the figure’s right foot) controls the location of the augmentation in the physical space. A Raspberry Pi fitted with an IMU provides orientation information and sends the resulting pose via WiFi to the SHAPE servers. The augmentation is the glTF sample CesiumMan, which includes animation data. Here are a couple of videos showing the augmentation tracking the UWB tag as it moves around, and the IMU controlling the augmentation’s orientation.
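
The Raspberry Pi side amounts to reading the IMU, combining it with the UWB position and pushing the pose to the SHAPE servers. This is a rough sketch of that loop; the endpoint, message format and sensor helpers are all assumptions rather than the actual prototype code:

```python
# Sketch of the Pi-side pose loop: combine the UWB tag position with the IMU
# orientation and send the pose to the SHAPE servers over WiFi.

import json
import time
import urllib.request

SHAPE_POSE_ENDPOINT = "http://shape-server.local:8080/assettag/pose"  # assumed URL

def read_imu_quaternion():
    # Placeholder for whatever IMU driver the Pi is using (e.g. an I2C MEMS IMU).
    return (1.0, 0.0, 0.0, 0.0)

def read_uwb_position():
    # Placeholder for the position reported by the UWB infrastructure.
    return (1.2, 0.0, 3.4)

while True:
    pose = {
        "tagId": "assettag-0042",
        "position": read_uwb_position(),
        "orientation": read_imu_quaternion(),
    }
    req = urllib.request.Request(
        SHAPE_POSE_ENDPOINT,
        data=json.dumps(pose).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)   # fire-and-forget pose update
    time.sleep(0.1)               # roughly 10 pose updates per second
```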

By the way, the software didn’t quite work first time…

So, is there any point to this? I am not sure. There are obviously many ways of doing the same thing without any physical hardware. However, the use of UWB makes it easy to achieve consistent results across multiple platforms with different spatial mapping as it provides absolute physical coordinates. Plus, there’s something fun about throwing a small tag on a surface and watching the augmentation appear!

Using homography to solve the “Where am I?” problem

In SHAPE, a large highly augmented space is broken up into a number of sub-spaces. Each sub-space has its own set of virtual augmentation objects positioned persistently in the real space, with which AR device users physically present in the sub-space can interact in a collaborative way. It is necessary to break up the global space in this way in order to keep the number of augmentation objects that any one AR device has to handle down to a manageable number. Take the case of a large museum with very many individual rooms. A user can only experience augmentation objects in the room they are physically in, so each room becomes a SHAPE sub-space and only the augmentation objects in that particular room need to be processed by the user’s AR device.

This brings up two problems: how to work out which room the user is in when the SHAPE app is started (the “Where am I?” problem) and how to detect that the user has moved from one room to another. It’s desirable to do this without depending on external navigation which, in indoor environments, can be pretty unreliable or completely unavailable.

The goal was to use the video feed from the AR device’s camera (e.g. the rear camera on an iPad running ARKit) to solve these problems. The question was how to make this work. This seemed like something that OpenCV probably had an answer to, which meant that the first place to look was the Learn OpenCV web site. A while ago there was a post there about feature-based image alignment, which seemed like the right sort of technique to use for this. I used that code as the basis for my own, and it ended up working quite nicely.
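
The core of that technique is easy to sketch. This is a rough outline along the lines of the Learn OpenCV example (ORB features plus a brute-force Hamming matcher); the parameter values are just illustrative:

```python
# Minimal feature extraction and matching, in the style of the Learn OpenCV
# feature-based image alignment example.

import cv2

MAX_FEATURES = 500

def extract_features(image_path):
    """Return ORB keypoints and descriptors for one image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(MAX_FEATURES)
    keypoints, descriptors = orb.detectAndCompute(img, None)
    return keypoints, descriptors

def match_features(descriptors_a, descriptors_b):
    """Match two descriptor sets, best (lowest distance) matches first."""
    matcher = cv2.DescriptorMatcher_create(cv2.DESCRIPTOR_MATCHER_BRUTEFORCE_HAMMING)
    matches = matcher.match(descriptors_a, descriptors_b)
    return sorted(matches, key=lambda m: m.distance)
```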

The approach is to take a set of overlapping reference photos for each room and then pre-process them to extract the necessary keypoints and descriptors. These can then go into a database, labelled with the sub-space to which they belong, for comparison against user-generated images. Here are two reference images of my (messy) office for example:

Next, I took another image to represent a user-generated image:

It is obviously similar to, but not the same as, any of the reference set. Running this image against the database produced the following two results for the two reference images above:

As you can see, the code has done a pretty good job of selecting the overlaps of the test image with the two reference images. This is an example of what you see if the match is poor:

It looks very cool but clearly has nothing to do with a real match! In order to select the best reference image match, I add up the distances for the 10 best feature matches against every reference image and then select the reference image (and therefore sub-space) with the lowest total distance. This can also be thresholded in case there is no good match. For these images, a threshold of around 300 would work.
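
Here is a minimal sketch of that selection step, reusing the match_features() helper from the earlier sketch; the database layout and threshold handling are my own assumptions:

```python
# Score each reference image by the summed distance of its 10 best matches
# against the test image, pick the lowest total, and reject it over a threshold.

MATCH_THRESHOLD = 300    # roughly the value that worked for these images
BEST_MATCH_COUNT = 10

def best_sub_space(test_descriptors, reference_db):
    """reference_db: list of (sub_space_name, descriptors) for every reference image."""
    best = None
    for sub_space, ref_descriptors in reference_db:
        matches = match_features(test_descriptors, ref_descriptors)
        score = sum(m.distance for m in matches[:BEST_MATCH_COUNT])
        if best is None or score < best[1]:
            best = (sub_space, score)
    if best is None or best[1] > MATCH_THRESHOLD:
        return None          # no good match - keep trying with more images
    return best[0]
```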

In practice, the SHAPE app will start sending images to a new SHAPE component, the homography server, which will keep processing images until the lowest distance match is under the threshold. At that point, the sub-space has been detected and the augmentation objects and spatial map can be downloaded to the app and used to populate the scene. By continuing this process, if the user moves from one room to another, the room (sub-space) change will be detected and the old set of augmentation objects and spatial map replaced with the ones for the new room.
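
Something like the following loop captures the intent on the app side – submit_frame() and load_sub_space() stand in for the real calls to the homography server and the SHAPE back end, which don’t exist in this form:

```python
# Sketch of the detection loop: keep submitting camera frames and switch the
# scene content whenever a different sub-space is confidently identified.

def track_sub_space(camera_frames, submit_frame, load_sub_space):
    current = None
    for frame in camera_frames:
        detected = submit_frame(frame)       # homography server: sub-space name or None
        if detected is not None and detected != current:
            current = detected
            load_sub_space(current)          # fetch the sub-space's objects and spatial map
```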

The SHAPE architecture: scaling the core using Apache Kafka

SHAPE is being designed from the outset to scale to tens of thousands of simultaneous users or more in a single SHAPE universe, while providing a low latency experience to every AR user.  The current architectural concept is shown in the (somewhat messy) diagram above. A recent change has been the addition of Apache Kafka in the core layer. This helps solve one of the bigger problems: how to keep track of all of the augmentation object changes and interactions reliably and ensure a consistent representation for everyone.

SHAPE functionality is divided into four regions:

  • Core. Core functions are those that may involve significant amounts of data and processing but do not have tight latency requirements. Core functions could be implemented in a remote cloud for example. CoreUniverse manages all of the spatial maps, proxy object instances, spatial anchors and server configurations for the entire system and can be replicated for redundancy and load sharing. In order to ensure eventual consistency, Apache Kafka is used to keep a permanent record of updates to the space configuration (data flowing along the red arrows), allowing easy recovery from failures along with high reliability and scalability. The idea of using Kafka for this purpose was triggered by this paper incidentally.
  • Proxy. The proxy region contains the servers that drive the proxy objects (i.e. the AR augmentations) in the space. There are two types of servers in this region: asset servers and function servers. Asset servers contain the assets that form the proxy object – a Unity assetbundle for example. Users go directly to the asset servers (blue arrows – only a few shown for clarity) to obtain assets to instantiate. Function servers interact with the instantiated proxy objects in real time (via EdgeAccess as described below). For example, in the case of the famous analog clock proxy object (my proxy object equivalent of the classic Utah teapot), the function server drives the hands of the clock by supplying updated angles to the sub-objects within the analog clock asset; a rough sketch of this kind of update follows this list.
  • Edge. The edge functions consist of those that have to respond to users with low latency. The first point of contact for SHAPE users is EdgeAccess. During normal operation, all real-time interaction takes place over a single link to an instance of EdgeAccess. This makes management, control and status on a per-user basis very easy. EdgeAccess then makes ongoing connections to EdgeSpace servers and proxy function servers. A key performance enhancement is that EdgeAccess is able to multicast data from function servers if the data has not been customized for a specific proxy object instance. Function server data that can be multicast in this way is called undirected data, while function server data intended for a specific proxy object instance is called directed data. The analog clock server generates undirected data, whereas a server that is interacting directly with a user (via proxy object interaction support) has to use directed data. EdgeSpace acts as a sort of local cache for CoreUniverse. Each EdgeSpace instance supports a sub-space of the entire universe. It caches the local spatial maps, object instances and anchors for the sub-space so that users located within that sub-space experience low latency updates. These updates are also forwarded to Kafka so that CoreUniverse instances will eventually correctly reflect the state of the local caches. EdgeSpace instances sync with CoreUniverse at startup and periodically during operation to ensure consistency.
  • User. In this context, users are SHAPE apps running on AR headsets. An important concept is that a standard SHAPE app can be used in any SHAPE universe. The SHAPE app establishes a single connection (black arrows) to an EdgeAccess instance. EdgeAccess provides the user app with the local spatial map to use, proxy object instances, asset server paths and spatial anchors. The user app then fetches the assets from one or more asset servers to populate its augmentation scene. In addition, the user app registers with EdgeAccess for each function server required by the proxy object instances. EdgeAccess is responsible for setting up any connections to function servers (green arrows – only a few shown for clarity) that aren’t already in existence.
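
As promised above, here is a rough sketch of the kind of undirected update the analog clock function server might emit – the field names and the send_undirected() stub are assumptions for illustration, not the actual SHAPE interfaces:

```python
# Sketch of an analog clock function server emitting undirected data: the same
# hand angles apply to every clock instance, so EdgeAccess can multicast them.

import time
from datetime import datetime

def clock_angles(now):
    """Angles (in degrees) for the hour, minute and second hand sub-objects."""
    return {
        "hourHand":   (now.hour % 12) * 30.0 + now.minute * 0.5,
        "minuteHand": now.minute * 6.0 + now.second * 0.1,
        "secondHand": now.second * 6.0,
    }

def send_undirected(update):
    # Placeholder for however the function server hands updates to EdgeAccess.
    print(update)

while True:
    send_undirected({
        "objectType": "analogClock",           # applies to every instance, hence undirected
        "angles": clock_angles(datetime.now()),
    })
    time.sleep(1.0)
```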

As an example of operation, consider a set of users physically present in the same sub-space. They may be connected to SHAPE via different EdgeAccess instances but will all use the same EdgeSpace. If one user makes a change to a proxy object instance (rotates it for example), the update information will be sent to EdgeSpace (via EdgeAccess) and then broadcast to the other users in the sub-space so that the changes are reflected in their augmentation scenes in real time. The updates are also forwarded to Kafka so that CoreUniverse instances can track every local change.
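
As a sketch of what that forwarding step might look like, here is a minimal kafka-python producer publishing an object update event; the topic name, broker address and event fields are assumptions rather than the real SHAPE schema:

```python
# Sketch of EdgeSpace forwarding a proxy object update to Kafka so that
# CoreUniverse instances can replay it and stay eventually consistent.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka-core:9092",                       # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def forward_update(sub_space, object_id, pose):
    event = {
        "subSpace": sub_space,
        "objectId": object_id,
        "pose": pose,    # e.g. position and orientation after the user's rotation
    }
    # Key by sub-space so all updates for one sub-space land in the same
    # partition and are replayed in order.
    producer.send("shape-object-updates", key=sub_space.encode("utf-8"), value=event)
    producer.flush()
```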

This is very much a work in progress, so details may change of course. There are quite a few things that I have glossed over here (such as spatial map management and a user moving from one sub-space to another) and they may well require changes.