Using homography to solve the “Where am I?” problem

In SHAPE, a large highly augmented space is broken up into a number of sub-spaces. Each sub-space has its own set of virtual augmentation objects positioned persistently in the real space with which AR device users physically present in the sub-space can interact in a collaborative way. It is necessary to break up the global space in this way in order keep the number of augmentation objects that any one AR device has to handle down to a manageable number. Take the case of a large museum with very many individual rooms. A user can only experience augmentation objects in the same physical room so each room becomes a SHAPE sub-space and only the augmentation objects in that particular room need to be processed by the user’s AR device.

This brings up two problems: how to work out which room the user is in when the SHAPE app is started (the “Where am I?” problem) and also detecting that the user has moved from one room to another. It’s desirable to do this without depending on external navigation which, in indoor environments, can be pretty unreliable or completely unavailable.

The goal was to use the video feed from the AR device’s camera (e.g. the rear camera on an iPad running ARKit) to solve these problems. The question was how to make this work. This seemed like something that OpenCV probably had an answer to which meant that the first place to look was the Learn OpenCV web site. A while ago there was a post about feature based image alignment which seemed like the right sort of technique to use for this. I used the code as the basis for my code which ended up working quite nicely.

The approach is to take a set of overlapping reference photos for each room and then pre-process them to extract the necessary keypoints and descriptors. These can then go into a database, labelled with the sub-space to which they belong, for comparison against user generated images. Here are two reference images of my (messy) office for example:

Next, I took another image to represent a user generated image:

It is obviously similar but not the same as any of the reference set. Running this image against the database resulted in the following two results for the two reference images above:

As you can see, the code has done a pretty good job of selecting the overlaps of the test image with the two reference images. This is an example of what you see if the match is poor:

It looks very cool but clearly has nothing to do with a real match! In order to select the best reference image match, I add up the distances for the 10 best feature matches against every reference image and then select the reference image (and therefore sub-space) with the lowest total distance. This can also be thresholded in case there is no good match. For these images, a threshold of around 300 would work.

In practice, the SHAPE app will start sending images to a new SHAPE component, the homography server, which will keep processing images until the lowest distance match is under the threshold. At that point, the sub-space has been detected and the augmentation objects and spatial map can be downloaded to the app and use to populate the scene. By continuing this process, if the user moves from one room to another, the room (sub-space) change will be detected and the old set of augmentation objects and spatial map replaced with the ones for the new room.

Containerizing of Manifold and rtndf (almost) complete

sensorviewI’ve certainly been learning a fair bit about Docker lately. Didn’t realize that it is reasonably easy to containerize GUI nodes as well as console mode nodes so now rtnDocker contains scripts to build and run almost every rtndf and Manifold node. There are only a few that haven’t been successfully moved yet. imuview, which is an OpenGL node to view data from IMUs, doesn’t work for some reason. The audio capture node (audio) and the audio part of avview (the video and audio viewer node) also don’t work as there’s something wrong with mapping the audio devices. It’s still possibly to run these outside of a container so it isn’t the end of the world but it is definitely a TODO.

Settings files for relevant containerized nodes are persisted at the same locations as the un-containerized versions making it very easy to switch between the two.

rtnDocker has an all script that builds all of the containers locally. These include:

  • manifoldcore. This is the base Manifold core built on Ubuntu 16.04.
  • manifoldcoretf. This uses the TensorFlow container as the base instead of raw Ubuntu.
  • manifoldcoretfgpu. This uses the TensorFlow GPU-enabled container as the base.
  • manifoldnexus. This is the core node that constructs the Manifold.
  • manifoldmanager. A management tool for Manifold nodes.
  • rtndfcore. The core rtn data flow container built on manifoldcore.
  • rtndfcoretf. The core rtn data flow container built on manifoldcoretf.
  • rtndfcoretfgpu. The core rtn data flow container built on manifoldcoretfgpu.
  • rtndfcoretfcv2. The core rtn data flow container built on rtndfcoretf and adding OpenCV V3.0.0.
  • rtndfcoretfgpucv2. The core rtn data flow container built on rtndfcoretfgpu and adding OpenCV V3.0.0.

The last two are good bases to use for anything combining machine learning and image processing in an rtn data flow PPE. The OpenCV build instructions were based on the very helpful example here. For example, the recognize PPE node, an encapsulation of Inception-v3, is based on rtndfcoretfgpucv2. The easiest way to build these is to use the scripts in the rtnDocker repo.

Motion detection pipeline processor using Python and OpenCV

MotionDetectI found this interesting tutorial describing ways to use OpenCV to implement motion detection. I thought that this might form the basis of a nice pipeline processing element for rtnDataFlow. Pipeline processing elements receive a stream from an MQTT topic, process it in some way and then output the modified stream on a new MQTT topic, usually in the same form but with appropriate changes. The new script is called and it takes a Jpeg over MQTT video stream and performs motion detection using OpenCV’s BackgroundSubtractorMOG2. The output stream consists of the input frames annotated with boxes around objects in motion in the frame. The screenshot shows an example. The small box is actually where the code has detected a moving screen saver on the monitor.

It can be tricky to get stable, large boxes rather than a whole bunch of smaller ones that percolate around. The code contains seven tunable parameters that can be modified as required – comments are in the code. Some will be dependent on frame size, some on frame rate. I tuned these parameters for 1280 x 720 frames at 30 frames per second, the default for the uvccam script.

The pipeline I was using for this test looked like this:

uvccam -> modet -> avview

I also tried it with the imageproc pipeline processor just for fun:

uvccam -> imageproc -> modet ->avview

This actually works pretty well too.