Using homography to solve the “Where am I?” problem

In SHAPE, a large, highly augmented space is broken up into a number of sub-spaces. Each sub-space has its own set of virtual augmentation objects, positioned persistently in the real space, with which AR device users physically present in that sub-space can interact collaboratively. It is necessary to break up the global space in this way in order to keep the number of augmentation objects that any one AR device has to handle down to a manageable number. Take the case of a large museum with very many individual rooms. A user can only experience augmentation objects in the same physical room, so each room becomes a SHAPE sub-space and only the augmentation objects in that particular room need to be processed by the user’s AR device.

This brings up two problems: how to work out which room the user is in when the SHAPE app is started (the “Where am I?” problem) and how to detect that the user has moved from one room to another. It’s desirable to do this without depending on external positioning systems which, in indoor environments, can be pretty unreliable or completely unavailable.

The goal was to use the video feed from the AR device’s camera (e.g. the rear camera on an iPad running ARKit) to solve these problems. The question was how to make this work. This seemed like something that OpenCV probably had an answer to, which meant that the first place to look was the Learn OpenCV web site. A while ago there was a post there about feature-based image alignment which seemed like the right sort of technique to use for this. I used that code as the basis for mine, and it ended up working quite nicely.

The approach is to take a set of overlapping reference photos for each room and then pre-process them to extract the necessary keypoints and descriptors. These can then go into a database, labelled with the sub-space to which they belong, for comparison against user-generated images. Here are two reference images of my (messy) office for example:

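As a rough illustration of the pre-processing step, here is a minimal sketch using OpenCV’s ORB detector (the same kind of feature pipeline as the Learn OpenCV alignment post). The directory layout, the pickle “database” and the function name are assumptions made for the example, not the actual SHAPE code:

```python
import os
import pickle
import cv2

MAX_FEATURES = 500

def build_reference_db(rooms_dir, db_path="reference_db.pkl"):
    """Extract ORB descriptors for every reference photo.

    rooms_dir is assumed to contain one sub-directory per room (sub-space),
    each holding a set of overlapping reference photos of that room.
    """
    orb = cv2.ORB_create(MAX_FEATURES)
    db = []
    for room in sorted(os.listdir(rooms_dir)):
        room_path = os.path.join(rooms_dir, room)
        if not os.path.isdir(room_path):
            continue
        for name in sorted(os.listdir(room_path)):
            img = cv2.imread(os.path.join(room_path, name), cv2.IMREAD_GRAYSCALE)
            if img is None:
                continue
            # Keypoints are only needed later if the matches are to be drawn;
            # the descriptors are what the scoring uses, so only they are kept.
            _, descriptors = orb.detectAndCompute(img, None)
            db.append({"room": room, "image": name, "descriptors": descriptors})
    with open(db_path, "wb") as f:
        pickle.dump(db, f)
    return db
```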
Next, I took another image to represent a user-generated image:

It is obviously similar to the reference set but not the same as any of them. Running this image against the database produced the following results for the two reference images above:

As you can see, the code has done a pretty good job of identifying the overlap between the test image and each of the two reference images. This is an example of what you see when the match is poor:

It looks very cool but clearly has nothing to do with a real match! To select the best reference image, I add up the distances for the 10 best feature matches against every reference image and then choose the reference image (and therefore sub-space) with the lowest total distance. The total can also be thresholded to detect the case where there is no good match at all. For these images, a threshold of around 300 would work.
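Here is a hedged sketch of that scoring step, using OpenCV’s brute-force Hamming matcher against the database built above. The constants and function name are mine; the threshold of 300 is just the value mentioned above:

```python
import cv2

MAX_FEATURES = 500
NUM_BEST_MATCHES = 10
SCORE_THRESHOLD = 300        # roughly what worked for the office images above

def best_room_match(test_image_path, db):
    """Score the test image against every reference entry and return
    (score, room, reference image name) for the best one, or None if
    nothing beats the threshold."""
    orb = cv2.ORB_create(MAX_FEATURES)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

    img = cv2.imread(test_image_path, cv2.IMREAD_GRAYSCALE)
    _, test_desc = orb.detectAndCompute(img, None)
    if test_desc is None:
        return None

    best = None
    for entry in db:
        if entry["descriptors"] is None:
            continue
        matches = sorted(matcher.match(test_desc, entry["descriptors"]),
                         key=lambda m: m.distance)
        if len(matches) < NUM_BEST_MATCHES:
            continue
        # Sum the distances of the 10 best matches; lower is better.
        score = sum(m.distance for m in matches[:NUM_BEST_MATCHES])
        if best is None or score < best[0]:
            best = (score, entry["room"], entry["image"])

    if best is None or best[0] > SCORE_THRESHOLD:
        return None
    return best
```

If the match overlays are wanted as well, this is where cv2.drawMatches (and cv2.findHomography for an actual alignment) would come in, which requires keeping the keypoints too.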

In practice, the SHAPE app will start sending images to a new SHAPE component, the homography server, which will keep processing images until the lowest distance match is under the threshold. At that point, the sub-space has been detected and the augmentation objects and spatial map can be downloaded to the app and used to populate the scene. By continuing this process, a move from one room to another will be detected as a sub-space change and the old set of augmentation objects and spatial map replaced with those for the new room.
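Purely as an illustration of how that detection loop might behave (none of this is the actual homography server implementation), the room detection and room-change logic could look something like this, reusing the scoring sketch above:

```python
def detect_sub_space(frame_paths, db, current_room=None):
    """Yield the detected room (sub-space) whenever it changes.

    frame_paths is any iterable of image paths sent up from the SHAPE app;
    best_room_match is the scoring function from the earlier sketch.
    """
    for path in frame_paths:
        result = best_room_match(path, db)
        if result is None:
            continue                  # no confident match yet, keep processing
        _, room, _ = result
        if room != current_room:
            current_room = room
            yield room                # app should load this sub-space's objects and map
```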

The SHAPE architecture: scaling the core using Apache Kafka

SHAPE is being designed from the outset to scale to tens of thousands of simultaneous users or more in a single SHAPE universe, while providing a low-latency experience to every AR user. The current architectural concept is shown in the (somewhat messy) diagram above. A recent change has been the addition of Apache Kafka in the core layer. This helps solve one of the bigger problems: how to keep track of all of the augmentation object changes and interactions reliably, and how to ensure a consistent representation for everyone.

SHAPE functionality is divided into four regions:

  • Core. Core functions are those that may involve significant amounts of data and processing but do not have tight latency requirements; they could be implemented in a remote cloud, for example. CoreUniverse manages all of the spatial maps, proxy object instances, spatial anchors and server configurations for the entire system and can be replicated for redundancy and load sharing. To ensure eventual consistency, Apache Kafka is used to keep a permanent record of updates to the space configuration (data flowing along the red arrows), allowing easy recovery from failures along with high reliability and scalability. Incidentally, the idea of using Kafka for this purpose was triggered by this paper.
  • Proxy. The proxy region contains the servers that drive the proxy objects (i.e. the AR augmentations) in the space. There are two types of servers in this region: asset servers and function servers. Asset servers hold the assets that form the proxy object – a Unity AssetBundle, for example. Users go directly to the asset servers (blue arrows – only a few shown for clarity) to obtain assets to instantiate. Function servers interact with the instantiated proxy objects in real time (via EdgeAccess, as described below). For example, in the case of the famous analog clock proxy object (my proxy object equivalent of the classic Utah teapot), the function server drives the hands of the clock by supplying updated angles to the sub-objects within the analog clock asset (see the sketch after this list).
  • Edge. The edge functions are those that have to respond to users with low latency. The first point of contact for SHAPE users is EdgeAccess. During normal operation, all real-time interaction takes place over a single link to an instance of EdgeAccess. This makes management, control and status on a per-user basis very easy. EdgeAccess then makes ongoing connections to EdgeSpace servers and proxy function servers. A key performance enhancement is that EdgeAccess is able to multicast data from function servers if the data has not been customized for a specific proxy object instance. Function server data that can be multicast in this way is called undirected data; function server data intended for a specific proxy object instance is called directed data. The analog clock server generates undirected data, whereas a server that is interacting directly with a user (via proxy object interaction support) has to use directed data. EdgeSpace acts as a sort of local cache for CoreUniverse. Each EdgeSpace instance supports a sub-space of the entire universe. It caches the local spatial maps, object instances and anchors for the sub-space so that users located within that sub-space experience low-latency updates. These updates are also forwarded to Kafka so that CoreUniverse instances will eventually correctly reflect the state of the local caches. EdgeSpace instances sync with CoreUniverse at startup and periodically during operation to ensure consistency.
  • User. In this context, users are SHAPE apps running on AR headsets. An important concept is that a standard SHAPE app can be used in any SHAPE universe. The SHAPE app establishes a single connection (black arrows) to an EdgeAccess instance. EdgeAccess provides the user app with the local spatial map to use, proxy object instances, asset server paths and spatial anchors. The user app then fetches the assets from one or more asset servers to populate its augmentation scene. In addition, the user app registers with EdgeAccess for each function server required by its proxy object instances. EdgeAccess is responsible for setting up any connections to function servers (green arrows – only a few shown for clarity) that don’t already exist.
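To make the directed/undirected distinction a little more concrete, here is an illustrative sketch based on the analog clock example. The message format, field names and transport are all assumptions for the purpose of the example, not the actual SHAPE protocol:

```python
import json
import time
from datetime import datetime

def clock_angles(now):
    """Hand angles in degrees for an analog clock face."""
    second = now.second * 6.0                          # 360 degrees / 60 seconds
    minute = now.minute * 6.0 + now.second * 0.1
    hour = (now.hour % 12) * 30.0 + now.minute * 0.5
    return {"hour": hour, "minute": minute, "second": second}

def undirected_update(object_type, payload):
    # No target instance id: EdgeAccess is free to multicast this to every
    # user whose scene contains an instance of object_type.
    return json.dumps({"objectType": object_type, "data": payload})

def directed_update(instance_id, payload):
    # Target instance id present: this must only go to the user(s) holding
    # that specific proxy object instance.
    return json.dumps({"instanceId": instance_id, "data": payload})

if __name__ == "__main__":
    # The clock function server would emit something like this once a second
    # (the transport to EdgeAccess is omitted here).
    while True:
        print(undirected_update("analog-clock", clock_angles(datetime.now())))
        time.sleep(1)
```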

As an example of operation, consider a set of users physically present in the same sub-space. They may be connected to SHAPE via different EdgeAccess instances but will all use the same EdgeSpace. If one user makes a change to a proxy object instance (rotates it, for example), the update information will be sent to EdgeSpace (via EdgeAccess) and then broadcast to the other users in the sub-space so that the changes are reflected in their augmentation scenes in real time. The updates are also forwarded to Kafka so that CoreUniverse instances can track every local change.
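A hedged sketch of what that update path might look like at an EdgeSpace instance, using the kafka-python client: the topic name, message schema and connection objects are assumptions made for illustration, not the actual implementation:

```python
import json
from kafka import KafkaProducer   # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

local_cache = {}        # instance id -> latest cached state for this sub-space
connected_users = []    # hypothetical EdgeAccess connections for users in the sub-space

def handle_update(sub_space, instance_id, update, from_user):
    # 1. Apply the change to the local cache so late joiners see current state.
    local_cache.setdefault(instance_id, {}).update(update)

    # 2. Broadcast to the other users in the sub-space in real time.
    message = json.dumps({"instanceId": instance_id, "update": update})
    for user in connected_users:
        if user is not from_user:
            user.send(message)      # .send() on a hypothetical connection object

    # 3. Forward to Kafka as the permanent record that CoreUniverse consumes.
    producer.send("shape-object-updates", {
        "subSpace": sub_space,
        "instanceId": instance_id,
        "update": update,
    })
```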

This is very much a work in progress, so details may change of course. There are quite a few things I have glossed over here (such as spatial map management and a user moving from one sub-space to another) and they may well require changes.