Using homography to solve the “Where am I?” problem

In SHAPE, a large highly augmented space is broken up into a number of sub-spaces. Each sub-space has its own set of virtual augmentation objects positioned persistently in the real space, which AR device users physically present in the sub-space can interact with collaboratively. It is necessary to break up the global space in this way in order to keep the number of augmentation objects that any one AR device has to handle at a manageable level. Take the case of a large museum with very many individual rooms. A user can only experience augmentation objects in the same physical room, so each room becomes a SHAPE sub-space and only the augmentation objects in that particular room need to be processed by the user’s AR device.

This brings up two problems: working out which room the user is in when the SHAPE app is started (the “Where am I?” problem) and detecting that the user has moved from one room to another. It’s desirable to do this without depending on external navigation systems which, in indoor environments, can be pretty unreliable or completely unavailable.

The goal was to use the video feed from the AR device’s camera (e.g. the rear camera on an iPad running ARKit) to solve these problems. The question was how to make this work. This seemed like something that OpenCV probably had an answer to, which meant that the first place to look was the Learn OpenCV website. A while ago there was a post there about feature-based image alignment, which seemed like the right sort of technique, so I used its code as the basis for mine and it ended up working quite nicely.

The approach is to take a set of overlapping reference photos for each room and then pre-process them to extract the necessary keypoints and descriptors. These can then go into a database, labelled with the sub-space to which they belong, for comparison against user-generated images. For example, here are two reference images of my (messy) office:

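As a rough sketch of this pre-processing step, the Python below uses ORB keypoints and descriptors (as in the Learn OpenCV alignment example) and pickles the labelled descriptors to act as the database. The function names, file names and feature count are just illustrative, not the actual SHAPE code.

```python
# Sketch of reference image pre-processing: extract ORB descriptors from each
# reference photo and store them labelled with their sub-space. Keypoint
# coordinates would also be stored if the homography itself is needed later.
import cv2
import pickle

orb = cv2.ORB_create(nfeatures=5000)

def extract_descriptors(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = orb.detectAndCompute(img, None)
    return descriptors

def build_reference_db(reference_images):
    """reference_images: list of (image_path, sub_space_label) tuples."""
    return [{"sub_space": label, "descriptors": extract_descriptors(path)}
            for path, label in reference_images]

# Descriptors are plain numpy arrays, so the database can simply be pickled.
db = build_reference_db([("office_1.jpg", "office"),
                         ("office_2.jpg", "office")])
with open("reference_db.pkl", "wb") as f:
    pickle.dump(db, f)
```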
Next, I took another image to represent a user-generated image:

It is obviously similar to the reference set but not the same as any image in it. Running this image against the database produced the following results for the two reference images above:

As you can see, the code has done a pretty good job of identifying the overlap between the test image and the two reference images. This is an example of what you see if the match is poor:

It looks very cool but clearly has nothing to do with a real match! In order to select the best reference image match, I add up the distances for the 10 best feature matches against every reference image and then select the reference image (and therefore sub-space) with the lowest total distance. This can also be thresholded in case there is no good match. For these images, a threshold of around 300 would work.
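A minimal sketch of that scoring step, assuming ORB descriptors matched with a brute-force Hamming matcher, might look like this (the database layout follows on from the pre-processing sketch above; the 300 threshold is just the value that worked for these test images):

```python
# Score a test image against each reference entry by summing the distances of
# its 10 best feature matches; the entry with the lowest total wins, subject
# to a threshold that rejects the case where no reference image matches.
import cv2

MATCH_THRESHOLD = 300

def match_score(test_descriptors, ref_descriptors, top_n=10):
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(test_descriptors, ref_descriptors),
                     key=lambda m: m.distance)
    return sum(m.distance for m in matches[:top_n])

def find_sub_space(test_descriptors, db):
    scores = [(match_score(test_descriptors, entry["descriptors"]),
               entry["sub_space"]) for entry in db]
    best_score, best_sub_space = min(scores)
    return best_sub_space if best_score <= MATCH_THRESHOLD else None
```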

In practice, the SHAPE app will start sending images to a new SHAPE component, the homography server, which will keep processing images until the lowest distance match is under the threshold. At that point, the sub-space has been detected and the augmentation objects and spatial map can be downloaded to the app and used to populate the scene. By continuing this process, if the user moves from one room to another, the room (sub-space) change will be detected and the old set of augmentation objects and spatial map will be replaced with the ones for the new room.
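The detection loop on the homography server might then look roughly as below. This builds on the find_sub_space sketch above; receive_frame and load_sub_space are placeholders standing in for the app-to-server transport and the augmentation download, neither of which is shown here.

```python
# Rough sketch of the homography server's detection loop: keep scoring frames
# from the app until a sub-space is matched under the threshold, and swap the
# loaded augmentations whenever the detected sub-space changes.
import cv2

orb = cv2.ORB_create(nfeatures=5000)

def detection_loop(db, receive_frame, load_sub_space):
    current_sub_space = None
    while True:
        frame = receive_frame()                       # next camera image from the SHAPE app
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, descriptors = orb.detectAndCompute(gray, None)
        sub_space = find_sub_space(descriptors, db)   # None until a match is under the threshold
        if sub_space is not None and sub_space != current_sub_space:
            # Sub-space detected, or the user has moved to a new room: replace
            # the old augmentation objects and spatial map with the new set.
            load_sub_space(sub_space)
            current_sub_space = sub_space
```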

Registering virtual and real worlds with rt-xr, ARKit and Unity

One of the goals for rt-xr is to allow augmented reality users within a space to collaborate with virtual reality users physically outside the space, with the VR users getting a telepresent sense of being physically within the same space. To this end, VR users see a complete model of the space (my office in this case) including augmentations, while physically present AR users just see the augmentations. Some examples of augmentations are virtual whiteboards and virtual sticky notes. Both AR and VR users see avatars representing the position and pose of other users in the space.

Achieving this for AR users requires that their coordinate system corresponds with that of the virtual model of the room. For iOS, ARKit goes a long way towards achieving this, so the rt-xr app for iOS has been extended to include ARKit and work in AR mode. The screen capture above shows how the coordinate systems are synced. A known location in physical space (in this case, the center of the circular control of the fan controller) is selected by touching the iPad screen on the exact center of the control, which fixes the position offset. To avoid needing multiple control points, the app is currently started in the correct pose so that the yaw rotation is zero relative to the model origin. It is pretty quick and easy to do. The video below shows the process and the result.
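Purely for illustration, the arithmetic behind the sync is simple once the yaw has been zeroed by starting in the correct pose: a single known point gives the translation between the two coordinate systems. The numbers below are placeholders, and the real app of course does this inside Unity/ARKit rather than in numpy.

```python
# With zero relative yaw, mapping device (ARKit) coordinates into model
# coordinates is a pure translation fixed by one known control point.
import numpy as np

control_point_model = np.array([1.20, 0.95, 2.40])    # known position of the control in the room model (placeholder)
control_point_device = np.array([0.35, -0.10, 1.05])  # position ARKit reports when the user taps the control (placeholder)

offset = control_point_model - control_point_device

def device_to_model(p_device):
    return np.asarray(p_device) + offset
```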

After starting the app in the correct orientation, the user is free to move around before tapping the control point. Once that’s done, the rt-xr part of the app starts up and loads the virtual model of the room. For this test, the complete model is being shown (i.e. as VR users would see it rather than AR users) although in real life only the augmentations would be visible – the idea here was to see how the windows lined up. The results are not too bad, all things considered, although moving or rotating too fast can cause some drift. However, collaborating using augmentations can tolerate some offset, so this should not be a major problem.

There are just a couple of augmentations in this test. One is the menu switch (the glowing M) which is used to instantiate and control augmentations. There is also a video screen showing the snowy scene from the driveway camera, the feed being generated by an rt-ai design.

The next step is to test out VR and AR collaboration properly by displaying the correct AR scene on the iOS app. Since VR collaboration has worked for some time, extending it to AR users should not be too hard.

Using Windows Mixed Reality to visualize sentient spaces with rtXRView

The Windows Mixed Reality version of 3DView is now working nicely. I had a few problems with my Windows development PC, which is a few years old and didn’t have adequate USB ports. In the end, this PCI-e USB 3.1 card solved the problem; otherwise a complete upgrade might have been required. A different USB 3.0 card did not work, however.

Hopefully this is the last time that I see the displays all lined up like that. The space modeling software is coming along and soon it will be possible to model a space with a (relatively) simple procedural definition file. Potentially, this could be texture-mapped from a 3D scan of the rooms, but the simplified models generated procedurally with simple textures might well be good enough. Then it will be possible to position versions of these displays (and lots of other things) in the correct rooms.

XRView is intended to be runnable both on Windows MR headsets (I am using the Samsung Odyssey as it has a good display and built-in audio) and HoloLens. Now clearly VR modes and AR modes have to be completely different. In VR, you navigate and interact with the motion controllers and see the modeled space whereas in AR you navigate by walking around, interact using the clicker and don’t see the modeled space directly. However, the modeled space will still be there and will be used instead of the spatially mapped surfaces that the HoloLens might normally use. This means that objects placed in the model by a VR user will appear to AR users correctly positioned and vice versa. One key advantage of using the modeled space rather than the dynamically mapped space generated by the HoloLens itself is that it is easy to add context to the surfaces using the procedural model language. Another is the ability to interwork with non-HoloLens AR headsets that can share the HoloLens spatial map data. The procedural model becomes a platform-independent spatial mapping that “just” leaves the problem of spatial synchronization to the individual headsets.

I am sure that there will be some fun challenges in getting spatial synchronization working properly, but that’s something for later.