In SHAPE, a large highly augmented space is broken up into a number of sub-spaces. Each sub-space has its own set of virtual augmentation objects positioned persistently in the real space with which AR device users physically present in the sub-space can interact in a collaborative way. It is necessary to break up the global space in this way in order keep the number of augmentation objects that any one AR device has to handle down to a manageable number. Take the case of a large museum with very many individual rooms. A user can only experience augmentation objects in the same physical room so each room becomes a SHAPE sub-space and only the augmentation objects in that particular room need to be processed by the user’s AR device.
This brings up two problems: how to work out which room the user is in when the SHAPE app is started (the “Where am I?” problem) and also detecting that the user has moved from one room to another. It’s desirable to do this without depending on external navigation which, in indoor environments, can be pretty unreliable or completely unavailable.
The goal was to use the video feed from the AR device’s camera (e.g. the rear camera on an iPad running ARKit) to solve these problems. The question was how to make this work. This seemed like something that OpenCV probably had an answer to which meant that the first place to look was the Learn OpenCV web site. A while ago there was a post about feature based image alignment which seemed like the right sort of technique to use for this. I used the code as the basis for my code which ended up working quite nicely.
The approach is to take a set of overlapping reference photos for each room and then pre-process them to extract the necessary keypoints and descriptors. These can then go into a database, labelled with the sub-space to which they belong, for comparison against user generated images. Here are two reference images of my (messy) office for example:
Next, I took another image to represent a user generated image:
It is obviously similar but not the same as any of the reference set. Running this image against the database resulted in the following two results for the two reference images above:
As you can see, the code has done a pretty good job of selecting the overlaps of the test image with the two reference images. This is an example of what you see if the match is poor:
It looks very cool but clearly has nothing to do with a real match! In order to select the best reference image match, I add up the distances for the 10 best feature matches against every reference image and then select the reference image (and therefore sub-space) with the lowest total distance. This can also be thresholded in case there is no good match. For these images, a threshold of around 300 would work.
In practice, the SHAPE app will start sending images to a new SHAPE component, the homography server, which will keep processing images until the lowest distance match is under the threshold. At that point, the sub-space has been detected and the augmentation objects and spatial map can be downloaded to the app and use to populate the scene. By continuing this process, if the user moves from one room to another, the room (sub-space) change will be detected and the old set of augmentation objects and spatial map replaced with the ones for the new room.