Using homography to solve the “Where am I?” problem

In SHAPE, a large highly augmented space is broken up into a number of sub-spaces. Each sub-space has its own set of virtual augmentation objects positioned persistently in the real space with which AR device users physically present in the sub-space can interact in a collaborative way. It is necessary to break up the global space in this way in order to keep the number of augmentation objects that any one AR device has to handle down to a manageable number. Take the case of a large museum with very many individual rooms. A user can only experience augmentation objects in the same physical room so each room becomes a SHAPE sub-space and only the augmentation objects in that particular room need to be processed by the user’s AR device.

This brings up two problems: how to work out which room the user is in when the SHAPE app is started (the “Where am I?” problem) and also detecting that the user has moved from one room to another. It’s desirable to do this without depending on external navigation which, in indoor environments, can be pretty unreliable or completely unavailable.

The goal was to use the video feed from the AR device’s camera (e.g. the rear camera on an iPad running ARKit) to solve these problems. The question was how to make this work. This seemed like something that OpenCV probably had an answer to which meant that the first place to look was the Learn OpenCV web site. A while ago there was a post about feature based image alignment which seemed like the right sort of technique to use for this. I used the code as the basis for my code which ended up working quite nicely.

The approach is to take a set of overlapping reference photos for each room and then pre-process them to extract the necessary keypoints and descriptors. These can then go into a database, labelled with the sub-space to which they belong, for comparison against user generated images. Here are two reference images of my (messy) office for example:

Next, I took another image to represent a user generated image:

It is obviously similar but not the same as any of the reference set. Running this image against the database resulted in the following two results for the two reference images above:

As you can see, the code has done a pretty good job of selecting the overlaps of the test image with the two reference images. This is an example of what you see if the match is poor:

It looks very cool but clearly has nothing to do with a real match! In order to select the best reference image match, I add up the distances for the 10 best feature matches against every reference image and then select the reference image (and therefore sub-space) with the lowest total distance. This can also be thresholded in case there is no good match. For these images, a threshold of around 300 would work.
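The selection step described above can be sketched in a few lines of C++. This is only an illustration and not the code I actually run: the ReferenceMatch struct and both function names are made up for the example, and the distances are assumed to have been produced already by whatever feature matcher is in use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical record: the feature-match distances between the user image
// and one reference image, plus the sub-space that reference belongs to.
struct ReferenceMatch {
    std::string subSpace;
    std::vector<float> distances;  // one entry per feature match
};

// Sum of the `count` smallest distances. A reference with fewer than
// `count` matches scores infinity so that it can never be selected.
float bestMatchScore(std::vector<float> distances, std::size_t count = 10) {
    assert(count > 0);
    if (distances.size() < count) {
        return std::numeric_limits<float>::infinity();
    }
    std::partial_sort(distances.begin(), distances.begin() + count,
                      distances.end());
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        total += distances[i];
    }
    return total;
}

// Returns the sub-space of the lowest-scoring reference image, or an
// empty string when no reference scores below the threshold.
std::string selectSubSpace(const std::vector<ReferenceMatch>& matches,
                           float threshold = 300.0f) {
    float bestScore = std::numeric_limits<float>::infinity();
    std::string best;
    for (const ReferenceMatch& m : matches) {
        float score = bestMatchScore(m.distances);
        if (score < bestScore) {
            bestScore = score;
            best = m.subSpace;
        }
    }
    return bestScore < threshold ? best : std::string();
}
```

The default threshold of 300 is just the value that suited the images above; a different camera or descriptor type would probably need a different number.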

In practice, the SHAPE app will start sending images to a new SHAPE component, the homography server, which will keep processing images until the lowest distance match is under the threshold. At that point, the sub-space has been detected and the augmentation objects and spatial map can be downloaded to the app and used to populate the scene. By continuing this process, if the user moves from one room to another, the room (sub-space) change will be detected and the old set of augmentation objects and spatial map replaced with the ones for the new room.
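To make the detection loop a bit more concrete, here is a minimal sketch of the per-user state the homography server might keep. The SubSpaceTracker class is hypothetical (nothing like it exists in SHAPE yet) and it assumes each processed image has already been reduced to a best-matching sub-space and a total distance.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-user tracker fed with the result of each image match.
class SubSpaceTracker {
public:
    explicit SubSpaceTracker(float threshold) : threshold_(threshold) {
        assert(threshold > 0.0f);
    }

    // Feed the best-matching sub-space and its total distance for one image.
    // Returns true when the user's sub-space has changed, which is the cue
    // to download the new augmentation objects and spatial map.
    bool update(const std::string& bestSubSpace, float totalDistance) {
        if (totalDistance >= threshold_) {
            return false;  // no confident match: keep the current sub-space
        }
        if (bestSubSpace == current_) {
            return false;  // still in the same room
        }
        current_ = bestSubSpace;
        return true;
    }

    // Empty until the first confident match has been seen.
    const std::string& current() const { return current_; }

private:
    float threshold_;
    std::string current_;
};
```

A real implementation would probably want a few consecutive confident matches before switching rooms, to avoid flapping when the camera is pointed through a doorway.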

The SHAPE architecture: scaling the core using Apache Kafka

SHAPE is being designed from the outset to scale to tens of thousands of simultaneous users or more in a single SHAPE universe, while providing a low latency experience to every AR user. The current architectural concept is shown in the (somewhat messy) diagram above. A recent change has been the addition of Apache Kafka in the core layer. This helps solve one of the bigger problems: how to keep track of all of the augmentation object changes and interactions reliably and ensure a consistent representation for everyone.

SHAPE functionality is divided into four regions:

  • Core. Core functions are those that may involve significant amounts of data and processing but do not have tight latency requirements. Core functions could be implemented in a remote cloud for example. CoreUniverse manages all of the spatial maps, proxy object instances, spatial anchors and server configurations for the entire system and can be replicated for redundancy and load sharing. In order to ensure eventual consistency, Apache Kafka is used to keep a permanent record of updates to the space configuration (data flowing along the red arrows), allowing easy recovery from failures along with high reliability and scalability. The idea of using Kafka for this purpose was triggered by this paper incidentally.
  • Proxy. The proxy region contains the servers that drive the proxy objects (i.e. the AR augmentations) in the space. There are two types of servers in this region: asset servers and function servers. Asset servers contain the assets that form the proxy object – a Unity assetbundle for example. Users go directly to the asset servers (blue arrows – only a few shown for clarity) to obtain assets to instantiate. Function servers interact with the instantiated proxy objects in real time (via EdgeAccess as described below). For example, in the case of the famous analog clock proxy object (my proxy object equivalent of the classic Utah teapot), the function server drives the hands of the clock by supplying updated angles to the sub-objects within the analog clock asset.
  • Edge. The edge functions consist of those that have to respond to users with low latency. The first point of contact for SHAPE users is EdgeAccess. During normal operation, all real-time interaction takes place over a single link to an instance of EdgeAccess. This makes management, control and status on a per user basis very easy. EdgeAccess then makes ongoing connections to EdgeSpace servers and proxy function servers. A key performance enhancement is that EdgeAccess is able to multicast data from function servers if the data has not been customized for a specific proxy object instance. Function server data that can be multicast in this way is called undirected data; function server data intended for a specific proxy object instance is called directed data. The analog clock server generates undirected data whereas a server that is interacting directly with a user (via proxy object interaction support) has to use directed data. EdgeSpace acts as a sort of local cache for CoreUniverse. Each EdgeSpace instance supports a sub-space of the entire universe. It caches the local spatial maps, object instances and anchors for the sub-space so that users located within that sub-space experience low latency updates. These updates are also forwarded to Kafka so that CoreUniverse instances will eventually correctly reflect the state of the local caches. EdgeSpace instances sync with CoreUniverse at startup and periodically during operation to ensure consistency.
  • User. In this context, users are SHAPE apps running on AR headsets. An important concept is that a standard SHAPE app can be used in any SHAPE universe. The SHAPE app establishes a single connection (black arrows) to an EdgeAccess instance. EdgeAccess provides the user app with the local spatial map to use, proxy object instances, asset server paths and spatial anchors. The user app then fetches the assets from one or more asset servers to populate its augmentation scene. In addition, the user app registers with EdgeAccess for each function server required by the proxy object instances. EdgeAccess is responsible for setting up any connections to function servers (green arrows – only a few shown for clarity) that aren’t already in existence.
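The directed versus undirected distinction in the Edge region is easier to see in code. The following is a rough sketch only: EdgeAccessRouter, FunctionMessage and Registration are names invented for this example, and using an empty target instance to mean "undirected" is an assumption rather than a description of the real protocol.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical message from a function server. An empty targetInstance
// marks undirected data; otherwise the data is directed at one instance.
struct FunctionMessage {
    std::string functionServer;
    std::string targetInstance;
    std::string payload;
};

// Hypothetical record of one proxy object instance registered by a user.
struct Registration {
    std::string user;
    std::string instance;
};

class EdgeAccessRouter {
public:
    // Called when a user app registers a proxy object instance for a
    // function server.
    void registerInstance(const std::string& functionServer,
                          const std::string& user,
                          const std::string& instance) {
        assert(!instance.empty());
        registrations_[functionServer].push_back(Registration{user, instance});
    }

    // Returns one recipient entry per registered instance that should
    // receive the message: all of them for undirected data, only the
    // matching instance for directed data.
    std::vector<std::string> recipients(const FunctionMessage& msg) const {
        std::vector<std::string> users;
        auto it = registrations_.find(msg.functionServer);
        if (it == registrations_.end()) {
            return users;
        }
        const bool undirected = msg.targetInstance.empty();
        for (const Registration& r : it->second) {
            if (undirected || r.instance == msg.targetInstance) {
                users.push_back(r.user);
            }
        }
        return users;
    }

private:
    std::map<std::string, std::vector<Registration>> registrations_;
};
```

With this arrangement the analog clock server sends one message per tick regardless of how many users have a clock on their wall, which is the whole point of the multicast.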

As an example of operation, consider a set of users physically present in the same sub-space. They may be connected to SHAPE via different EdgeAccess instances but will all use the same EdgeSpace. If one user makes a change to a proxy object instance (rotates it for example), the update information will be sent to EdgeSpace (via EdgeAccess) and then broadcast to the other users in the sub-space so that the changes are reflected in their augmentation scenes in real-time. The updates are also forwarded to Kafka so that CoreUniverse instances can track every local change.
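As a toy model of that flow, the sketch below keeps everything in memory. EdgeSpaceSketch, ObjectUpdate and replay are hypothetical names, the "log" vector is only a stand-in for a Kafka topic (no Kafka client API is used or implied), and the last-writer-wins state is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical update to one proxy object instance.
struct ObjectUpdate {
    std::string fromUser;
    std::string instance;
    std::string change;
};

class EdgeSpaceSketch {
public:
    // A user enters the sub-space and gets an (initially empty) inbox.
    void join(const std::string& user) { inboxes_[user]; }

    // Apply a user's change: refresh the local cache, broadcast to every
    // other user in the sub-space, then append to the update log.
    void applyUpdate(const ObjectUpdate& update) {
        assert(inboxes_.count(update.fromUser) == 1);
        state_[update.instance] = update.change;
        for (auto& entry : inboxes_) {
            if (entry.first != update.fromUser) {
                entry.second.push_back(update);
            }
        }
        log_.push_back(update);  // stand-in for publishing to Kafka
    }

    const std::vector<ObjectUpdate>& inbox(const std::string& user) const {
        return inboxes_.at(user);
    }
    const std::vector<ObjectUpdate>& log() const { return log_; }

private:
    std::map<std::string, std::string> state_;
    std::map<std::string, std::vector<ObjectUpdate>> inboxes_;
    std::vector<ObjectUpdate> log_;
};

// Rebuilds object state by replaying the log in order, roughly how a
// CoreUniverse instance might catch up or recover after a failure.
std::map<std::string, std::string> replay(const std::vector<ObjectUpdate>& log) {
    std::map<std::string, std::string> state;
    for (const ObjectUpdate& u : log) {
        state[u.instance] = u.change;
    }
    return state;
}
```

The useful property is that the log alone is enough to reconstruct the state, so a CoreUniverse replica never needs to ask an EdgeSpace what it missed.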

This is very much a work in progress so details may change of course. There are quite a few details that I have glossed over here (such as spatial map management and a user moving from one sub-space to another) and they may well require changes.

Introducing SHAPE: Scalable Highly Augmented Physical Environment


This screenshot is an example of a virtual environment augmented with proxy objects created using rt-xr. However, this was always intended to be a VR precursor for an AR solution now called SHAPE – Scalable Highly Augmented Physical Environment. The difference is that, in SHAPE, the kinds of virtual objects being used to augment the virtual environment shown above (such as whiteboards, status displays, sticky notes, camera screens and other static virtual objects in this case) are instead used to augment real physical environments, with a primary focus on scalability and local collaboration for physically present occupants. The intent is to open source SHAPE in the hope that others might like to contribute to the framework and/or contribute virtual objects to the object library.

Some of the features of SHAPE are:

  • SHAPEs are designed for collaboration. Multiple AR device users present in the same space are able to interact with virtual objects just like real objects, with consistent state maintained for all users.
  • SHAPE users can be grouped so that they see different virtual objects in the same space depending on their assigned group. A simple example of this would be where virtual objects are customized for language support – the virtual object set instantiated would then depend on the language selected by a user.
  • SHAPEs are scalable because they minimize the loading on AR devices. Complex processing is performed using a local edge server or remote cloud. Each virtual object is either static (just for display) or else can be connected to a server function that drives the virtual object and also receives interaction inputs that may modify the state of the virtual object, leaving the AR device to display objects and pass interaction events rather than performing complex functions on-device. Reducing the AR device loading in this way extends battery life and reduces heat, allowing devices to be used for longer sessions.
  • There is a natural fit between SHAPE and artificial intelligence/machine learning. As virtual objects are connected to off-device server functions, they can make use of inference results or supply data for machine learning derived from user interactions while leveraging much more powerful capabilities than are practical on-device.
  • A single universal app can be used for all SHAPEs. Any virtual objects needed for a particular space are downloaded at run time from an object server. However, there would be nothing stopping the creation of a customized app that included hard-coded assets while still leveraging the rest of SHAPE – this might be useful in some applications.
  • New virtual objects can be instantiated by users of the space, configured appropriately (including connection to a remote server function) and then made persistent in location by registering with the object server.
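The user grouping feature in the list above boils down to a filter applied when the object set for a space is handed to a device. A minimal sketch, assuming a hypothetical VirtualObject record in which an empty group means "visible to everyone":

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of a virtual object registered in a space.
struct VirtualObject {
    std::string name;
    std::string group;  // empty: visible to every group
};

// Returns the objects that a user in `userGroup` should instantiate.
std::vector<VirtualObject> objectsForGroup(const std::vector<VirtualObject>& all,
                                           const std::string& userGroup) {
    assert(!userGroup.empty());
    std::vector<VirtualObject> visible;
    for (const VirtualObject& object : all) {
        if (object.group.empty() || object.group == userGroup) {
            visible.push_back(object);
        }
    }
    return visible;
}
```

For the language example, a shared whiteboard would have an empty group while each translated sign would carry its language code as the group.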

A specific goal is to be able to support large scale physical environments such as amusement parks or sports stadiums, where there may be a very large number of users distributed over a very large space. The SHAPE system is being designed to support this level of scalability while being highly responsive to interaction.

In order to turn this into reality, the SHAPE concept requires low cost, lightweight AR headsets that can be worn for extended periods of time and perform reliable spatial localization in changing outdoor environments, while also providing high quality, wide angle augmentation displays. The technology isn’t there yet, so initial development will use iPads as the AR devices and ARKit for localization. Using iPads for this purpose isn’t ideal ergonomically but does allow all of the required functionality to be developed. When suitable headsets do become available, SHAPE will hopefully be ready to take advantage of them.

Converting screen coordinates + depth into spatial coordinates for OpenPose…or anything else really

Depth cameras are wonderful things but they typically only give a distance associated with each (x, y) coordinate in screen space. To convert to spatial coordinates involves some calculation. One thing to note is that I am ignoring camera calibration which is required to get best accuracy. See this page for details of how to use calibration data in iOS for example. I have implemented this calculation for the iPad TrueDepth camera and also the ZED stereo camera to process OpenPose joint data and it seems to work but I cannot guarantee complete accuracy!

The concept for the conversion is shown in the diagram above. One can think of the 2D camera image as being mapped to a screen plane – the blue plane in the diagram. The width and height of the plane are determined by its distance from the camera and the camera’s field of view. Using the iPad as an example, you can get the horizontal and vertical camera field of view angles (hFOV and vFOV in the diagram) like this:

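hFOV = captureDevice.activeFormat.videoFieldOfView * Float.pi / 180.0
tanHalfHFOV = tan(hFOV / 2)
// the half-angle tangents scale with the image aspect ratio (square pixels assumed)
tanHalfVFOV = tanHalfHFOV * Float(height) / Float(width)
vFOV = 2 * atan(tanHalfVFOV)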

where width and height are the width and height of the 2D image. This calculation can be done once at the start of the session since it is defined by the camera itself.

For the Stereolabs ZED camera (this is a partial code extract):

#include <sl_zed/Camera.hpp>

sl::Camera zed;
sl::InitParameters init_params;

// set up params here
if (zed.open(init_params) != sl::SUCCESS) {
    exit(-1);
}

sl::CameraInformation ci = zed.getCameraInformation();
sl::CameraParameters cp = ci.calibration_parameters.left_cam;
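// the ZED SDK documents h_fov and v_fov in degrees, so convert to radians first
const float degToRad = 3.14159265f / 180.0f;
hFOV = cp.h_fov * degToRad;
vFOV = cp.v_fov * degToRad;
tanHalfHFOV = tan(hFOV / 2);
tanHalfVFOV = tan(vFOV / 2);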

To pick up the depth value, you just look up the hit point (x, y) coordinate in the depth buffer. For the TrueDepth camera and the ZED, this seems to be the perpendicular distance from the center of the camera to the plane that passes through the target point and is perpendicular to the camera’s look-at direction – the yellow plane in the diagram. Other types of depth sensors might give the radial distance from the center of the camera to the hit point which will obviously require a slightly modified calculation. Here I am assuming that the depth buffer contains the perpendicular distance – call this spatialZ.

What we need now are the tangents of the reduced angles that correspond to the horizontal and vertical angle components between the ray from the camera to the screen plane hit point and the ray along the camera’s look-at direction; call these angles ThetaX (horizontal) and ThetaY (vertical). Given the perpendicular distance to the yellow plane, we can then easily calculate the spatial x and y coordinates using the field of view tangents previously calculated:

tanThetaX = (x - Float32(width) / 2) / (Float32(width) / 2) * tanHalfHFOV
tanThetaY = (y - Float32(height) / 2) / (Float32(height) / 2) * tanHalfVFOV

spatialX = spatialZ * tanThetaX
spatialY = spatialZ * tanThetaY

The coordinates (spatialX, spatialY, spatialZ) are in whatever units the depth buffer uses (often meters) and in the camera’s coordinate system. To convert the camera’s coordinate system to world coordinates is a standard operation given the camera’s pose in the world space.
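For anyone who wants the whole conversion in one place, here is a self-contained C++ sketch. The names (Vec3, FovTangents and the four functions) are invented for this example. It assumes square pixels, so the vertical half-angle tangent is the horizontal one scaled by the aspect ratio, and it follows the axis directions of the formulas above, so a sign flip on y may be needed depending on the coordinate convention in use. The radial variant and the row-major rotation matrix for the camera pose are likewise assumptions rather than anything tied to a particular SDK.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Tangents of half the horizontal and vertical field of view.
struct FovTangents { float halfH, halfV; };

// Derives both tangents from the horizontal field of view in degrees.
FovTangents fovTangents(float hFovDegrees, int width, int height) {
    assert(width > 0 && height > 0);
    const float kPi = 3.14159265358979f;
    float halfH = std::tan(hFovDegrees * kPi / 180.0f / 2.0f);
    return FovTangents{halfH, halfH * float(height) / float(width)};
}

// Pixel (x, y) plus perpendicular depth -> camera-space coordinates.
Vec3 pixelToCamera(float x, float y, float depth,
                   int width, int height, const FovTangents& fov) {
    assert(width > 0 && height > 0);
    float halfW = width / 2.0f;
    float halfHt = height / 2.0f;
    float tanThetaX = (x - halfW) / halfW * fov.halfH;
    float tanThetaY = (y - halfHt) / halfHt * fov.halfV;
    return Vec3{depth * tanThetaX, depth * tanThetaY, depth};
}

// Same conversion for sensors that report radial distance: the ray through
// the pixel has direction (tanThetaX, tanThetaY, 1), so dividing the radial
// distance by that vector's length recovers the perpendicular depth.
Vec3 pixelToCameraRadial(float x, float y, float radial,
                         int width, int height, const FovTangents& fov) {
    Vec3 dir = pixelToCamera(x, y, 1.0f, width, height, fov);
    float length = std::sqrt(dir.x * dir.x + dir.y * dir.y + 1.0f);
    float depth = radial / length;
    return Vec3{dir.x * depth, dir.y * depth, depth};
}

// Camera -> world coordinates given the camera pose as a row-major 3x3
// rotation matrix and a translation: world = R * p + t.
Vec3 cameraToWorld(const Vec3& p, const float rotation[9],
                   const Vec3& translation) {
    return Vec3{
        rotation[0] * p.x + rotation[1] * p.y + rotation[2] * p.z + translation.x,
        rotation[3] * p.x + rotation[4] * p.y + rotation[5] * p.z + translation.y,
        rotation[6] * p.x + rotation[7] * p.y + rotation[8] * p.z + translation.z};
}
```

The field of view tangents only need to be computed once per session; the per-joint work is then just a couple of multiplies.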

Registering virtual and real worlds with rt-xr, ARKit and Unity


One of the goals for rt-xr is to allow augmented reality users within a space to collaborate with virtual reality users physically outside of the space, with the VR users getting a telepresent sense of being physically within the same space. To this end, VR users see a complete model of the space (my office in this case) including augmentations while physically present AR users just see the augmentations. Some examples of augmentations are virtual whiteboards and virtual sticky notes. Both AR and VR users see avatars representing the position and pose of other users in the space.

Achieving this for AR users requires that their coordinate system corresponds with that of the virtual models of the room. For iOS, ARKit goes a long way to achieving this so the rt-xr app for iOS has been extended to include ARKit and work in AR mode. The screen capture above shows how coordinate systems are synced. A known location in physical space (in this case, the center of the circular control of the fan controller) is selected by touching the iPad screen on the exact center of the control. This identifies position. To avoid multiple control points, the app is currently started in the correct pose so that the yaw rotation is zero relative to the model origin. It is pretty quick and easy to do. The video below shows the process and the result.
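The arithmetic behind this kind of single control point registration is tiny. The sketch below is an illustration rather than the rt-xr code: Vec3, registrationOffset and sessionToModel are hypothetical names, and it relies on the assumption stated above that the app was started with zero yaw relative to the model origin, so only a translation needs to be recovered.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Translation that maps AR session coordinates onto model coordinates,
// assuming both frames already share the same orientation.
// controlPointInModel: known position of the control point in the model.
// controlPointInSession: where the touch hit test placed it in the session.
Vec3 registrationOffset(const Vec3& controlPointInModel,
                        const Vec3& controlPointInSession) {
    // A failed hit test should never be used for registration.
    assert(std::isfinite(controlPointInSession.x) &&
           std::isfinite(controlPointInSession.y) &&
           std::isfinite(controlPointInSession.z));
    return Vec3{controlPointInModel.x - controlPointInSession.x,
                controlPointInModel.y - controlPointInSession.y,
                controlPointInModel.z - controlPointInSession.z};
}

// Converts any AR session position into model coordinates.
Vec3 sessionToModel(const Vec3& p, const Vec3& offset) {
    return Vec3{p.x + offset.x, p.y + offset.y, p.z + offset.z};
}
```

Supporting an arbitrary starting orientation would need a second control point (or some other yaw reference), which is exactly what starting the app in a known pose avoids.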

After starting the app in the correct orientation, the user is then free to move and tap the control point. Once that’s done, the rt-xr part of the app starts up and loads the virtual model of the room. For this test, the complete model is being shown (i.e. as for VR users rather than AR users) although in real life only the augmentations would be visible – the idea here was to see how the windows lined up. The results are not too bad all things considered although moving or rotating too fast can cause some drift. However, collaborating using augmentations can tolerate some offset so this should not be a major problem.

There are just a couple of augmentations in this test. One is the menu switch (the glowing M) which is used to instantiate and control augmentations. There is also a video screen showing the snowy scene from the driveway camera, the feed being generated by an rt-ai design.

The next step is to test out VR and AR collaboration properly by displaying the correct AR scene on the iOS app. Since VR collaboration has worked for some time, extending it to AR users should not be too hard.

MobileNet SSD object detection with Unity, ARKit and Core ML


This iOS app is really step 1 on the road to integrating Core ML enabled iOS devices with rt-ai Edge. The screenshot shows the MobileNet SSD object detector running within the ARKit-enabled Unity app on an iPad Pro. If anyone wants to try this, code is here. I put this together pretty quickly so apologies if it is a bit rough but it is early days. Detection box registration isn’t perfect as you can see (especially for the mouse) but it is not too bad. This is probably a field of view mismatch somewhere and will need to be investigated.

Next, this code needs to be integrated with the Manifold C# Unity client. Following that, I will need to write the PutManifold SPE for rt-ai Edge. When this is done, the video and object detection data stream from the iOS device will appear within an rt-ai Edge stream processing network and look exactly the same as the stream from the CYOLO SPE.

The app is based on two repos that were absolutely invaluable in putting it together:

Many thanks to the authors of those repos.

An rt-xr SpaceObjects tour de force

rt-xr SpaceObjects are now working very nicely. It’s easy to create, configure and delete SpaceObjects as needed using the menu switch which has been placed just above the light switch in my office model above.

The video below shows all of this in operation.

The typical process is to instantiate an object, place and size it and then attach it to a Manifold stream if it is a Proxy Object. Persistence, sharing and collaboration work for all relevant SpaceObjects across the supported platforms (Windows and macOS desktop, Windows MR, Android and iOS).

This is a good place to leave rt-xr for the moment while I wait for the arrival of some sort of AR headset in order to support local users of an rt-xr enhanced sentient space. Unfortunately, Magic Leap won’t deliver to my zip code (sigh) so that’s that for the moment. Lots of teasers about the HoloLens 2 right now and this might be the best way to go…eventually.

Now the focus moves back to rt-ai Edge. While this is working pretty well, it needs a few bugs fixed and some production modes added (such as auto-starting SPNs when server nodes are started). Then begins the process of data collection for machine learning. ZeroSensors will collect data from each monitored room and this will be saved by ManifoldStore for later use. The idea is to classify normal and abnormal situations and also to be proactive in responding to the needs of occupants of the sentient space.