Platform-independent highly augmented spaces using UWB

The SHAPE project needs to support consistent highly augmented spaces regardless of which platform (headset plus software) any user in the physical space has chosen. Previously, SHAPE had used ARKit alone to design spaces as an interim measure, but that was never going to solve the problem in a platform-independent way. What SHAPE needed was a platform-independent means of linking an ARKit spatial map to the real physical environment, and UWB technology provides just such a mechanism.

SHAPE breaks a large physical space into multiple subspaces – often mapped to physical rooms. One big problem is that augmentations can be seen through walls unless something prevents this. ARKit is poor at wall detection, so I gave up trying to make that work. It's not really ARKit's fault: mapping a room's walls with a single camera is simply not reliable. Another problem concerns windows and doors. Ideally, it should be possible to see augmentations outside of a physical room if they can be viewed through a window. That would be tough for any mapping software to handle correctly.

SHAPE is now able to solve these problems using UWB. The photo above shows part of the process used to link ARKit's coordinate system to the physical space coordinate system defined by the UWB installation. Specifically, it shows how the yaw offset (rotation about the y-axis in Unity terms) between ARKit's coordinate system and UWB's coordinate system is measured and subsequently corrected. A UWB tag with a known position in the space is covered by the red ball in the center of the iPad screen and data is recorded at that point: the virtual AR camera position and rotation, along with the iPad's position in the physical space. Because iPads do not currently support UWB, I attached one of the Decawave tags to the back of the iPad.

Separately, a tag in Listener mode is used by a new SHAPE component, EdgeUWB, to provide a service that makes the positions of all UWB tags in the subspace available. EdgeSpace keeps track of these tag positions, so when it receives a message to update the ARKit offset, it can combine the AR camera pose (in the message from the SHAPE app to EdgeSpace), the iPad position (from the UWB tag via EdgeUWB) and the known location of the target UWB anchor (from a configuration file). With all of this information, EdgeSpace can calculate position and rotation offsets that are sent back to the SHAPE app so that the Unity augmentations can be correctly aligned in the physical space.
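As a rough sketch of the calculation EdgeSpace performs – the function name, the 2D (ground plane) simplification and the yaw convention are my assumptions, not SHAPE's actual code:

```python
import math

def compute_offsets(cam_pos_ar, cam_yaw_ar, ipad_pos_uwb, anchor_pos_uwb):
    """Hypothetical sketch of the alignment step, reduced to 2D (x, z).
    cam_pos_ar / cam_yaw_ar come from ARKit at the moment the anchor is
    centered on screen; ipad_pos_uwb comes from the tag on the iPad;
    anchor_pos_uwb comes from the configuration file."""
    # Bearing from the iPad to the anchor in the UWB frame
    # (yaw measured from +z toward +x, as in Unity)
    bearing_uwb = math.atan2(anchor_pos_uwb[0] - ipad_pos_uwb[0],
                             anchor_pos_uwb[1] - ipad_pos_uwb[1])
    # The camera yaw is the same bearing expressed in ARKit's frame,
    # so the difference is the yaw offset between the two frames
    yaw_off = bearing_uwb - cam_yaw_ar

    # Rotate the AR camera position into the UWB frame; the residual
    # translation is the position offset
    c, s = math.cos(yaw_off), math.sin(yaw_off)
    rotated = (c * cam_pos_ar[0] + s * cam_pos_ar[1],
               -s * cam_pos_ar[0] + c * cam_pos_ar[1])
    return yaw_off, (ipad_pos_uwb[0] - rotated[0],
                     ipad_pos_uwb[1] - rotated[1])
```

Applying the returned yaw and position offsets to the Unity scene root would then bring ARKit's coordinates into line with the UWB coordinates.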

As the ARKit spatial map is also saved during this process and reloaded every time the SHAPE app starts up (or enters the room/subspace in a full implementation), the measured offsets remain valid.


Coming back to the problem of ensuring that augmentations in a physical subspace can only be seen from within it, except through a window or door, SHAPE now includes the concept of rooms that can have walls, a floor and a ceiling. The screenshot above shows an example of this. The yellow sticky note is outside the room, so only the part framed by the "window" is visible. Since it is not at all obvious what is happening, the invisible but occluding walls used in normal mode can be replaced with visible walls so that alignment can be visualized more easily.


This screenshot was taken in debug mode. The effect is subtle: a blue film represents the normally invisible occluding walls, and the cutout for the window shows up as a clear area. It can also be seen that the alignment isn't totally perfect – the cutout, for example, is a couple of inches higher than the actual transparent part of the physical window. In this case, the full sticky note is visible because the debug walls don't occlude.

Incidentally, the room walls use the procedural punctured plane technology that was developed a while ago.


This is an example showing a wall with rectangular and elliptical cutouts. Cutouts are configured in terms of the UWB coordinate system so that augmentations remain visible through windows and doors.
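For illustration, a wall with its cutouts might be described like this – the field names, structure and units are invented, not SHAPE's actual schema:

```python
# Hypothetical room definition: walls are planes positioned in UWB
# coordinates (metres) and cutouts are punched through them where
# windows and doors sit.
room = {
    "name": "office",
    "walls": [
        {
            # wall plane corners in UWB coordinates
            "corners": [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0],
                        [4.0, 2.5, 0.0], [0.0, 2.5, 0.0]],
            "cutouts": [
                # a window: center and size in wall-local metres
                {"shape": "rectangle", "center": [2.0, 1.5],
                 "size": [1.2, 0.9]},
                # an elliptical opening
                {"shape": "ellipse", "center": [3.4, 1.4],
                 "size": [0.5, 0.5]},
            ],
        },
    ],
}
```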

While the current system only supports ARKit, in principle any equivalent system could be aligned to the UWB coordinate system using some sort of similar alignment process. Once this is done, a user in the space with any supported XR headset type will see a consistent set of augmentations, reliably positioned within the physical space (at least within the alignment accuracy limits of the underlying platform and the UWB location system). Note that, while UWB support in user devices is helpful, it is only required for setting up the space and initial map alignment. After that, user devices can achieve spatial lock (via ARKit for example) and then maintain tracking in the normal way.

Indoor position measurement for XR applications using UWB

Obtaining reasonably accurate absolute indoor position measurements for mobile devices has always been tricky, to say the least. Things like ARKit can determine a relative position within a mapped space reasonably well in ideal circumstances, as can structured IR. What would be much more useful for creating complex XR experiences spread over a large physical space – one in which lots of people and objects might be moving around and occluding things – is something that allows the XR application to determine its absolute position in the space. Ultra-wideband (UWB) provides a mechanism for obtaining an absolute position (with respect to a fixed point in the space) over a large area and in challenging environments, and could be an ideal partner for XR applications. This is a useful backgrounder on how the technology works. Interestingly, Apple has added some form of UWB support to the iPhone 11. Hopefully, future manufacturers of XR headsets, or of phones that pair with XR headsets, will include UWB capability so that the devices can act as tags in an RTLS system.

UWB RTLS (Real Time Location System) technology seems like an ideal fit for the SHAPE project. An important requirement for SHAPE is that the system can locate a user within one of the subspaces that together cover the entire physical space. The use of subspaces allows the system to grow to cover very large physical spaces just by scaling servers. One idea that works to some extent is to use the AR headset or tablet camera to recognize physical features within a subspace, as in this work. However, once again, this only works in ideal situations where the physical environment has not changed since the reference images were taken. And, in a large space that has no features, such as a warehouse, this just won't work at all.

Using UWB RTLS with a number of anchors spanning the physical space, it should be possible to determine an absolute location in the space and map this to a specific subspace, regardless of other objects moving through the space or how featureless the space is. To try this out, I am using the Decawave MDEK1001 development kit.
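The mapping from absolute position to subspace can be as simple as a bounds lookup; the subspace names and bounds below are made up for illustration:

```python
def find_subspace(x, y, subspaces):
    """Map an absolute UWB position (metres) to a subspace.
    `subspaces` maps a name to 2D bounds (xmin, ymin, xmax, ymax)
    expressed in the RTLS coordinate system."""
    for name, (xmin, ymin, xmax, ymax) in subspaces.items():
        if xmin <= x < xmax and ymin <= y < ymax:
            return name
    return None  # position lies outside every defined subspace

# Example layout: two rooms side by side
subspaces = {"room1": (0.0, 0.0, 5.0, 4.0),
             "room2": (5.0, 0.0, 10.0, 4.0)}
```

A tag position reported by the RTLS then selects which subspace's augmentation objects the user's device should load.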

This includes 12 devices that can be used as anchors, tags, or gateways. Anchors are placed at fixed positions in the space – I have mounted four high up in the corners of my office, for example. Tags represent users moving around in the space. Putting a device in gateway mode allows it to be connected to a Raspberry Pi, which then provides web and MQTT access to the tag position data.

Setting up is pretty easy using an Android app that auto-discovers devices. The app has an automatic measurement mode that tries to determine the relative positions of the anchors with respect to an origin, but that didn't work too well for me, so I resorted to physical measurement – not a big problem and, in any case, the software cannot determine the system z offset automatically. Another thing that confused me is that, by default, tags use a very slow update rate when they are not moving much, which makes it look as though the system isn't working. Changing this to a much higher rate probably increases power usage considerably but certainly helps with testing! Speaking of power, the devices can be powered from USB power supplies or internal rechargeable batteries (which, incidentally, were not included).

Anyway, once everything was set up correctly, it seemed to work very well, using the Android app to display the tag locations with respect to the anchors. The next step is to integrate this system into SHAPE so that subspaces can be defined in terms of RTLS coordinates.
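As a sketch of that integration, position uplinks from the gateway could be parsed like this. The payload shape (a JSON object with a "position" member) is an assumption based on the Decawave gateway's MQTT uplink and should be checked against the actual installation:

```python
import json

def parse_location(payload):
    """Parse one location uplink message from the RTLS gateway.
    Returns the (x, y, z) position and a quality figure, assuming a
    payload like {"position": {"x": ..., "y": ..., "z": ...,
    "quality": ...}} -- verify the real topic and format first."""
    msg = json.loads(payload)
    p = msg["position"]
    return (p["x"], p["y"], p["z"]), p.get("quality")
```

A subscriber (e.g. using an MQTT client pointed at the Raspberry Pi gateway) would call this for each message on a tag's location topic and feed the result into the subspace lookup.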

An interesting additional feature of UWB is that it also supports data transfers between devices. This could lead to some interesting new concepts for XR experiences…

The ghost in the AI machine

The driveway monitoring system has been running full time for months now and it's great to know when a vehicle or a person is moving on the driveway up to the house. The only bad thing is that it gives occasional false detections like the one above. This only happens at night, and I guess there is enough texture of the right kind to trigger the "person" response with a very high confidence. The white streaks might be rain or bugs illuminated by the IR light. It also only seems to happen when the trash can is out for collection – it sits in the frame about halfway from the center to the right edge.

It is well known that the image recognition capabilities of convolutional networks aren't always what they seem, and this is a good example of the problem. Clearly, in this case, MobileNet's feature detectors have found things in small areas with a particular spatial relationship and combined them to reach the completely wrong conclusion. My problem is how to deal with these false detections. A couple of ideas come to mind. One is to run a different model in parallel and only generate an alert if both detect the same object at (roughly) the same place in the frame. Or, instead of another CNN, use semantic segmentation to detect the object in a somewhat different way.
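The first idea – requiring two models to agree – could be sketched as follows; the detection tuple format and the IoU threshold are my assumptions:

```python
def boxes_agree(det_a, det_b, iou_min=0.5):
    """Cross-check detections from two independent models. Each
    detection is a hypothetical (label, confidence, (x0, y0, x1, y1))
    tuple; alert only when labels match and the boxes overlap enough."""
    label_a, _, a = det_a
    label_b, _, b = det_b
    if label_a != label_b:
        return False
    # Intersection rectangle of the two boxes
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    if ix0 >= ix1 or iy0 >= iy1:
        return False  # no overlap at all
    inter = (ix1 - ix0) * (iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    # Intersection-over-union must clear the agreement threshold
    return inter / union >= iou_min
```

The assumption is that two different architectures are unlikely to hallucinate the same object in the same place, so requiring agreement should suppress most one-off ghosts.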

In any case, it is a good practical demonstration of the fact that these simple neural networks don't in any way understand what they are seeing. However, they can certainly be used as the basis of a more sophisticated system that adds higher-level understanding to raw detections.

Using homography to solve the "Where am I?" problem

In SHAPE, a large highly augmented space is broken up into a number of sub-spaces. Each sub-space has its own set of virtual augmentation objects, positioned persistently in the real space, with which AR device users physically present in the sub-space can interact in a collaborative way. It is necessary to break up the global space in this way in order to keep the number of augmentation objects that any one AR device has to handle down to a manageable number. Take the case of a large museum with very many individual rooms. A user can only experience augmentation objects in the same physical room, so each room becomes a SHAPE sub-space and only the augmentation objects in that particular room need to be processed by the user's AR device.

This brings up two problems: how to work out which room the user is in when the SHAPE app is started (the "Where am I?" problem) and also detecting that the user has moved from one room to another. It's desirable to do this without depending on external navigation which, in indoor environments, can be pretty unreliable or completely unavailable.

The goal was to use the video feed from the AR device's camera (e.g. the rear camera on an iPad running ARKit) to solve these problems; the question was how. This seemed like something OpenCV probably had an answer to, which meant that the first place to look was the Learn OpenCV web site. A while ago there was a post there about feature-based image alignment, which seemed like the right sort of technique, and I used its code as the basis for mine, which ended up working quite nicely.

The approach is to take a set of overlapping reference photos for each room and then pre-process them to extract the necessary keypoints and descriptors. These can then go into a database, labelled with the sub-space to which they belong, for comparison against user generated images. Here are two reference images of my (messy) office for example:

Next, I took another image to represent a user generated image:

It is obviously similar but not the same as any of the reference set. Running this image against the database resulted in the following two results for the two reference images above:

As you can see, the code has done a pretty good job of selecting the overlaps of the test image with the two reference images. This is an example of what you see if the match is poor:

It looks very cool but clearly has nothing to do with a real match! In order to select the best reference image match, I add up the distances for the 10 best feature matches against every reference image and then select the reference image (and therefore sub-space) with the lowest total distance. This can also be thresholded in case there is no good match. For these images, a threshold of around 300 would work.
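The selection logic described above reduces to something like this (the data layout is illustrative):

```python
def match_score(distances, k=10):
    """Sum of the k best (smallest) feature-match distances against
    one reference image -- lower means a better match."""
    return sum(sorted(distances)[:k])

def best_reference(match_distances, threshold=300):
    """match_distances maps a reference image name to the list of
    match distances for it. Returns the best-matching reference (and
    therefore sub-space), or None if nothing beats the threshold."""
    best = min(match_distances,
               key=lambda ref: match_score(match_distances[ref]))
    if match_score(match_distances[best]) > threshold:
        return None  # no good match anywhere
    return best
```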

In practice, the SHAPE app will start sending images to a new SHAPE component, the homography server, which will keep processing images until the lowest-distance match is under the threshold. At that point, the sub-space has been detected and the augmentation objects and spatial map can be downloaded to the app and used to populate the scene. By continuing this process, a move from one room to another is detected as a sub-space change, and the old set of augmentation objects and spatial map are replaced with the ones for the new room.

Object detection on the Raspberry Pi 4 with the Neural Compute Stick 2


Following on from the Coral USB experiment, the next step was to try it out with the NCS 2. Installation of OpenVINO on Raspbian Buster was straightforward. The rt-ai design was basically the same as for the Coral USB experiment but with the CoralSSD SPE replaced with the OpenVINO equivalent called CSSDPi. Both SPEs run ssd_mobilenet_v2_coco object detection.

Performance was pretty good – 17fps with 1280 x 720 frames. This is a little better than the Coral USB accelerator attained, but then the OpenVINO SPE is a C++ SPE while the Coral USB SPE is a Python SPE, and image preparation and post-processing take their toll on performance. One day I am really going to use the C++ API to produce a new Coral USB SPE so that the two are on a level playing field. The raw inference time on the Coral USB accelerator is about 40 ms or so, meaning that there is plenty of opportunity for higher throughputs.

Object detection on the Raspberry Pi 4 with the Coral USB accelerator


SSD object detection with the Coral USB accelerator had been running on a Raspberry Pi 3 but the performance was disappointing and I was curious to see what would happen on the Raspberry Pi 4.


This is the test rt-ai design. The UVCCam and MediaView SPEs are running on an Ubuntu desktop, the CoralSSD SPE is running on the Raspberry Pi 4. It is getting a respectable 12fps with 1280 x 720 frames (an earlier version of this post had reported much worse performance but that was due to some silly image loading code). The utilization of one CPU core is around 93% which is fair enough for a Python SPE. I am sure that a C++ version of this SPE would be considerably faster again.

Getting this running at all was interesting as the Pi 4 requires Raspbian Buster and that comes with Python 3.7 which is not supported by the edgetpu_api toolkit at this point in time.

After writing the original blog post I discovered that in fact it is trivial to convert the edgetpu_api installation to work with Python 3.7. Without doing any virtualenv and Python 3.5 stuff, just run install.sh (modified as described below to recognize the Pi 4 and fix the sudo bug) and enter these commands:

cd /usr/local/lib/python3.7/dist-packages/edgetpu/swig
sudo cp _edgetpu_cpp_wrapper.cpython-35m-arm-linux-gnueabihf.so _edgetpu_cpp_wrapper.cpython-37m-arm-linux-gnueabihf.so

Turns out all it needed was a correctly named .so file to match the Python version. Anyway, if you want to go the Python 3.5 route…

The ARM version of the Python library is only compiled for Python 3.5. So, Python 3.5 needs to be installed alongside Python 3.7. To do this, download the GZipped source from here and expand and build with:

tar xzf Python-3.5.7.tgz
cd Python-3.5.7
sudo apt-get install libssl-dev
./configure --enable-optimizations
sudo make -j4 altinstall
virtualenv --python=python3.5 venv
source venv/bin/activate

The result of all of this should be Python 3.5 available in a virtual environment. Any specific packages that need to be installed should be installed using pip3.5 as required. Regarding numpy, I found that the install didn't work for some reason (there were missing dependencies when imported) and I had to use this command (as described here):

pip3.5 install numpy --upgrade --no-binary :all:

Now it is time to install the edgetpu_api which is basically a case of following the instructions here. However, install.sh has a small bug and also will not recognize the Pi 4.

Modify install.sh to recognize the Pi 4 by adding this after line 59:

  elif [[ "${MODEL}" == "Raspberry Pi 4 Model B Rev"* ]]; then
    info "Recognized as Raspberry Pi 4 B."
    LIBEDGETPU_SUFFIX=arm32
    HOST_GNU_TYPE=arm-linux-gnueabihf

Once that is added, go to line 128 and replace it with:

sudo udevadm control --reload-rules && sudo udevadm trigger

The original is missing the second sudo. Once that is done, the Coral USB accelerator should be able to run the bird classifier example.

MobileNet SSD object detection using the Intel Neural Compute Stick 2 and a Raspberry Pi

I had successfully run ssd_mobilenet_v2_coco object detection using an Intel NCS 2 on an Ubuntu PC in the past, but had not tried this on a Raspberry Pi running Raspbian, as it was not supported at that time (if I remember correctly). Now that OpenVINO runs on Raspbian, I thought it would be fun to get this working on the Pi. The main task consisted of getting the CSSD rt-ai Stream Processing Element (SPE) compiling and running using Raspbian and its version of OpenVINO rather than the usual x86 64 Ubuntu system.

Compiled rt-ai SPEs use Qt so it was a case of putting together a different .pro qmake file to reflect the particular requirements of the Raspbian environment. Once I had sorted out the slight link command changes, the SPE crashed as soon as it tried to read in the model .xml file. I got stuck here for quite a long time until I realized that I was missing a compiler argument that meant that my binary was incompatible with the OpenVINO inference engine. This was fixed by adding the following line to the Raspbian .pro file:

QMAKE_CXXFLAGS += -march=armv7-a

Once that was added, the code worked perfectly. To test, I set up a simple rt-ai design:


For this test, the CSSDPi SPE was the only thing running on the Pi itself (rtai1), the other two SPEs were running on a PC (default). The incoming captured frames from the webcam to the CSSDPi SPE were 1280 x 720 at 30fps. The CSSDPi SPE was able to process 17 frames per second, not at all bad for a Raspberry Pi 3 model B! Incidentally, I had tried a similar setup using the Coral Edge TPU device and its version of the SSD SPE, CoralSSD, but the performance was nowhere near as good. One obvious difference is that CoralSSD is a Python SPE because, at that time, the C++ API was not documented. One day I may change this to a C++ SPE and then the comparison will be more representative.

Of course you can use multiple NCS 2s to get better performance if required, although I haven't tried this on the Pi as yet. Still, the same can be done with Coral with suitable code. In any case, rt-ai has the Scaler SPE that allows any number of edge inference devices on any number of hosts to be used together to accelerate processing of a single flow. I have to say, the ability to use rt-ai and rtaiDesigner to quickly deploy distributed stream processing networks to heterogeneous hosts is a lot of fun!

The motivation for all of this is to move from x86 processors with big GPUs to Raspberry Pis with edge inference accelerators to save power. The driveway project has been running for months now, heating up the basement very nicely. Moving from YOLOv3 on a GTX 1080 to MobileNet SSD and a Coral edge TPU saved about 60W, moving the entire thing from that system to the Raspberry Pi has probably saved a total of 80W or so.

This is the design now running full time on the Pi:


CPU utilization for the CSSDPi SPE is around 21% and it uses around 23% of the RAM. The raw output of the CSSDPi SPE is fed through a filter SPE that only outputs a message when a detection has passed certain criteria, to avoid false alarms. Then, I get an email with a frame showing what triggered the system. The View module is really just for debugging – this is the kind of thing it displays:


The metadata displayed on the right is what the SSDFilter SPE uses to determine whether the detection should be reported or not. It requires a configurable number of sequential frames with a similar detection (e.g. car rather than something else) over a configurable confidence level before emitting a message. Then, it has a hold-off in case the detected object remains in the frame for a long time and, even then, requires a defined gap before that detection is re-armed. It seems to work pretty well.
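The filter's behavior can be sketched as a small state machine; the class name, parameter names and defaults are illustrative, not SSDFilter's actual configuration:

```python
class DetectionFilter:
    """Require `min_frames` consecutive frames of the same class above
    `min_conf` before alerting; then stay disarmed (the hold-off, in
    case the object lingers in frame) until `rearm_gap` consecutive
    empty frames have passed."""

    def __init__(self, min_frames=3, min_conf=0.6, rearm_gap=10):
        self.min_frames = min_frames
        self.min_conf = min_conf
        self.rearm_gap = rearm_gap
        self.streak = 0      # consecutive frames with the same class
        self.last = None     # class seen in the previous frame
        self.gap = 0         # consecutive frames with no detection
        self.armed = True

    def update(self, label, confidence=0.0):
        """Call once per frame (label=None when nothing was detected).
        Returns True when an alert message should be emitted."""
        if label is None or confidence < self.min_conf:
            self.streak = 0
            self.last = None
            self.gap += 1
            if self.gap >= self.rearm_gap:
                self.armed = True  # enough empty frames: re-arm
            return False
        self.gap = 0
        self.streak = self.streak + 1 if label == self.last else 1
        self.last = label
        if self.armed and self.streak >= self.min_frames:
            self.armed = False     # hold off until re-armed
            return True
        return False
```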

One advantage of using CSSD rather than CYOLO as before is that, while I don't get specific messages for things like a USPS van, it can detect a wider range of objects:


Currently the filter only accepts all the COCO vehicle classes and the person class while rejecting others, all in the interest of reducing false detection messages.

I had expected to need a Raspberry Pi 4 (mine is on its way 🙂) to get decent performance but clearly the Pi 3 is well able to cope with the help of the NCS 2.