The ghost in the AI machine

The driveway monitoring system has been running full time for months now and it’s great to know if a vehicle or a person is moving on the driveway up to the house. The only bad thing is that it will give occasional false detections like the one above. This only happens at night and I guess there’s enough correct texture to trigger the “person” response with a very high confidence. Those white streaks might be rain or bugs being illuminated by the IR light. It also only seems to happen when the trash can is out for collection – it is in the frame about half way out from the center to the right.

It is well known that the image recognition capabilities of convolutional networks aren’t always exactly what they seem and this is a good example of the problem. Clearly, in this case, MobileNet feature detectors have detected things in small areas with a particular spatial relationship and added these together to come to the completely wrong conclusion. My problem is how to deal with these false detections. A couple of ideas come to mind. One is to use a different model in parallel and only generate an alert if both detect the same object at (roughly) the same place in the frame. Or instead of another CNN, use semantic segmentation to detect the object in a somewhat different way.

Whatever, it is a good practical demonstration of the fact that these simple neural networks don’t in any way understand what they are seeing. However, they can certainly be used as the basis of a more sophisticated system which adds higher level understanding to raw detections.

Object detection on the Raspberry Pi 4 with the Neural Compute Stick 2


Following on from the Coral USB experiment, the next step was to try it out with the NCS 2. Installation of OpenVINO on Raspbian Buster was straightforward. The rt-ai design was basically the same as for the Coral USB experiment but with the CoralSSD SPE replaced with the OpenVINO equivalent called CSSDPi. Both SPEs run ssd_mobilenet_v2_coco object detection.

Performance was pretty good – 17fps with 1280 x 720 frames. This is a little better than the Coral USB accelerator attained but then again the OpenVINO SPE is a C++ SPE while the Coral USB SPE is a Python SPE and image preparation and post processing takes its toll on performance. One day I am really going to use the C++ API to produce a new Coral USB SPE so that the two are on a level playing field. The raw inference time on the Coral USB accelerator is about 40mS or so meaning that there is plenty of opportunity for higher throughputs.

MobileNet SSD object detection using the Intel Neural Compute Stick 2 and a Raspberry Pi

I had successfully run ssd_mobilenet_v2_coco object detection using an Intel NCS2 running on an Ubuntu PC in the past but had not tried this using a Raspberry Pi running Raspbian as it was not supported at that time (if I remember correctly). Now, OpenVINO does run on Raspbian so I thought it would be fun to get this working on the Pi. The main task consisted of getting the CSSD rt-ai Stream Processing Element (SPE) compiling and running using Raspbian and its version of OpenVINO rather then the usual x86 64 Ubuntu system.

Compiled rt-ai SPEs use Qt so it was a case of putting together a different .pro qmake file to reflect the particular requirements of the Raspbian environment. Once I had sorted out the slight link command changes, the SPE crashed as soon as it tried to read in the model .xml file. I got stuck here for quite a long time until I realized that I was missing a compiler argument that meant that my binary was incompatible with the OpenVINO inference engine. This was fixed by adding the following line to the Raspbian .pro file:

QMAKE_CXXFLAGS += -march=armv7-a

Once that was added, the code worked perfectly. To test, I set up a simple rt-ai design:


For this test, the CSSDPi SPE was the only thing running on the Pi itself (rtai1), the other two SPEs were running on a PC (default). The incoming captured frames from the webcam to the CSSDPi SPE were 1280 x 720 at 30fps. The CSSDPi SPE was able to process 17 frames per second, not at all bad for a Raspberry Pi 3 model B! Incidentally, I had tried a similar setup using the Coral Edge TPU device and its version of the SSD SPE, CoralSSD, but the performance was nowhere near as good. One obvious difference is that CoralSSD is a Python SPE because, at that time, the C++ API was not documented. One day I may change this to a C++ SPE and then the comparison will be more representative.

Of course you can use multiple NCS 2s to get better performance if required although I haven’t tried this on the Pi as yet. Still, the same can be done with Coral with suitable code. In any case, rt-ai has the Scaler SPE that allows any number of edge inference devices on any number of hosts to be used together to accelerate processing of a single flow. I have to say, the ability to use rt-ai and rtaiDesigner to quickly deploy distributed stream processing networks to heterogeneous hosts is a lot of fun!

The motivation for all of this is to move from x86 processors with big GPUs to Raspberry Pis with edge inference accelerators to save power. The driveway project has been running for months now, heating up the basement very nicely. Moving from YOLOv3 on a GTX 1080 to MobileNet SSD and a Coral edge TPU saved about 60W, moving the entire thing from that system to the Raspberry Pi has probably saved a total of 80W or so.

This is the design now running full time on the Pi:


CPU utilization for the CSSDPi SPE is around 21% and it uses around 23% of the RAM. The raw output of the CSSDPi SPE is fed through a filter SPE that only outputs a message when a detection has passed certain criteria to avoid false alarms. Then, I get an email with a frame showing what triggered the system. The View module is really just for debugging – this is the kind of thing it displays:


The metadata displayed on the right is what the SSDFilter SPE uses to determine whether the detection should be reported or not. It requires a configurable number of sequential frames with a similar detection (e.g. car rather than something else) over a configurable confidence level before emitting a message. Then, it has a hold-off in case the detected object remains in the frame for a long time and, even then, requires a defined gap before that detection is re-armed. It seems to work pretty well.

One advantage of using CSSD rather than CYOLO as before is that, while I don’t get specific messages for things like a USPS van, it can detect a wider range of objects:


Currently the filter only accepts all the COCO vehicle classes and the person class while rejecting others, all in the interest of reducing false detection messages.

I had expected to need a Raspberry Pi 4 (mine is on its way đŸ™‚ ) to get decent performance but clearly the Pi 3 is well able to cope with the help fo the NCS 2.

Integrating SHAPE with rt-ai: adding AI to highly augmented spaces

A key feature of SHAPE is its ability to leverage the power of external servers in order to enhance the AR experience. The idea of combining relatively simple and cheap AR headsets with low latency communications links (such as 5G wireless) to edge servers is what is driving SHAPE’s architecture. Giving SHAPE access to rt-ai edge systems is a first example of this in action.

The screen capture above gives an idea of the current state of SHAPE development. This was taken using an iPad Pro running the iOS SHAPE app. The polygons with red edges are the planes that have been detected by ARKit. At the bottom right the monitor shows the same app running on a Mac (in the Unity editor in this case). The macOS version greatly speeds development of everything other than ARKit-related functionality – especially space synchronization functions (e.g. adding, moving, modifying or deleting object actions that need to be shared between all SHAPE users in the same space). The Unity iOS SHAPE app uses the ARFoundation API to, amongst other things,  load and save ARWorldMaps in order to synchronize spatial locations between SHAPE app instances. ARWorldMaps are persisted by the CoreUniverse components and cached for real-time use by EdgeSpace components, one EdgeSpace per physical “room”. SHAPE apps physically entering the room receive the latest map along with the space definition for that room. This includes the directory of augmentation objects with metadata that allows them all to be downloaded from asset servers (unless already cached) and then positioned correctly in the physical space and connected to the appropriate external function servers.

Augmentation objects can be moved around the space manually by touching the object with three or more fingers – sounds awful but it does work. It can then be dragged around the screen and the screen can be moved around to position the objects in space. Touching the object with two fingers brings up the object menu for that instance. This allows the object to be deleted, resized or rotated. It also allows the object to be stuck to a wall or stuck to the floor. in this context, a wall is an ARKit vertical plane, a floor is an ARKit horizontal plane so the object could easily be placed on a table if a suitable plane has been detected. If not, it can be placed manually. All of these object changes are sent to the room’s EdgeSpace (via EdgeAccess) and shared between other users in the space to keep everything synchronized. In addition, updates are sent to CoreUniverse for persistence. These become integrated into the persistent space definition for the room which EdgeSpace instances receive on a regular basis from CoreUniverse (primary and backup). Now this creates an interesting race condition since EdgeSpace is modifying its cached space definition in real-time and it may take a while for the CoreUniverse version to catch up. This problem is handled using timestamps attached to updates so that EdgeSpace can correctly integrate new information from CoreUniverse (such a new object instantiated by a space design tool) while ignoring stale updates for existing objects.

The box with big “M”s is the menu object. Each room has one and it can be placed anywhere convenient in the room. You can click on it (well touch it actually if using an iPad touch screen) and this pops up a menu that allows the user to add augmentation objects. Right now this is just working for the infamous analog clock but will eventually present a catalog of available models with thumbnails. The analog clocks are proxy objects and being driven by an external analog clock server. Obviously it is trivial to implement this purely in the Unity app but it is meant as a simple test of the proxy object concept. The next proxy object to be added will be the sticky note object from rt-xr and then probably the rt-xr shared whiteboard.

Getting back to rt-ai integration, the rt-ai design above shows the simple test design that receives captured frames from the iPad’s rear camera. The frame rate is limited to 5fps so as not to load the WiFi link too much. For simplicity and low latency motion jpegs are used for this but of course compressed video could be used (and probably will be in the future). The new rt-ai SPE called SHAPEConductor looks to the SHAPE system like a SHAPE function server while mapping received messages into and out of an rt-ai stream processing network. In this case, the video is simply being passed through DeepLab to perform semantic segmentation and then the results displayed:


Here it is picking up the monitor running the macOS SHAPE app. In practice, more complex processing would be performed and results returned to proxy objects via the SHAPEConductor module and the SHAPE network.

One interesting application for this is to use the captured frames to recognize the physical space and automatically load the correct saved ARWorldMap for that physical space into the SHAPE app and instantiate all the appropriate augmentation objects, correctly located. Another would be to perform semantic segmentation and return the results to the SHAPE app so that it can be married to depth data and allow real time occlusion to be performed. ARKit 3 will do this on-device for people but apparently not in general. Offloading the segmentation should allow for a lot more flexibility, albeit with increased latency, and work on lower capability devices.

The SHAPE rt-ai integration is very much a work in progress and it will be fun to see what can be achieved with this combination.

Introducing SHAPE: Scalable Highly Augmented Physical Environment


This screenshot is an example of a  virtual environment augmented with proxy objects created using rt-xr. However, this was always intended to be a VR precursor for an AR solution now called SHAPE – Scalable Highly Augmented Physical Environment. The difference is that the virtual objects being used to augment the virtual environment shown above (such as whiteboards, status displays, sticky notes, camera screens and other static virtual objects in this case) are used to augment real physical environments with a primary focus on scalability and local collaboration for physically present occupants. The intent is to open source SHAPE in the hope that others might like to contribute to the framework and/or contribute virtual objects to the object library.

Some of the features of SHAPE are:

  • SHAPEs are designed for collaboration. Multiple AR device users, present in the same space are able to interact with virtual objects just like real objects with consistent state maintained for all users.
  • SHAPE users can be grouped so that they see different virtual objects in the same space depending on their assigned group. A simple example of this would be where virtual objects are customized for language support – the virtual object set instantiated would then depend on the language selected by a user.
  • SHAPEs are scalable because they minimize the loading on AR devices. Complex processing is performed using a local edge server or remote cloud. Each virtual object is either static (just for display) or else can be connected to a server function that drives the virtual object and also receives interaction inputs that may modify the state of the virtual object, leaving the AR device to display objects and pass interaction events rather than performing complex functions on-device. Reducing the AR device loading in this way extends battery life and reduces heat, allowing devices to be used for longer sessions.
  • There is a natural fit between SHAPE and artificial intelligence/machine learning. As virtual objects are connected to off-device server functions, they can make use of inference results or supply data for machine learning derived from user interactions while leveraging much more powerful capabilities than are practical on-device.
  • A single universal app can be used for all SHAPEs. Any virtual objects needed for a particular space are downloaded at run time from an object server. However, there would be nothing stopping the creation of a customized app that included hard-coded assets while still leveraging the rest of SHAPE – this might be useful in some applications.
  • New virtual objects can be instantiated by users of the space, configured appropriately (including connection to remote server function) and then made persistent in location by registering with the object server.

A specific goal is to be able to support large scale physical environments such as amusement parks or sports stadiums, where there may be a very large number of users distributed over a very large space. The SHAPE system is being designed to support this level of scalability while being highly responsive to interaction.

In order to turn this into reality, the SHAPE concept requires low cost, lightweight AR headsets that can be worn for extended periods of time, perform reliable spatial localization in changing outdoor environments while also providing high quality, wide angle augmentation displays. Technology isn’t there yet so initially development will use iPads as the AR devices and ARKit for localization. Using iPads for this purpose isn’t ideal ergonomically but does allow all of the required functionality to be developed. When suitable headsets do become available, SHAPE will hopefully be ready to take advantage of them.

Detailed remote node status and distributed logging in rt-ai Edge

Something that had been missing from rt-ai Edge was the ability to see easily the state of remote nodes. That’s now been corrected with the new node status display. Each node has its own tab displaying key real-time usage information and the plan is to add more information to this in future versions – such as precisely which modules are running and also to support multiple GPUs.


Another missing feature was any sort of distributed logging, so that an app could receive and process log messages generated by any module in a design. This is now working and a module log display has been added to the design GUI in rtaiDesigner. As the logging system uses the same communications infrastructure as the rest of the management data, it will be easy to add redundant persistent storage and review of historic log information.

Just as an irrelevant footnote, the node status code came from the BioTestBench, something that I wrote some years ago when I was interested in bioinformatics. I was working on whole genome alignment and wanted a convenient single window that displayed all needed resource utilization information. Haven’t used this for ages but I am glad that some of the code finally came in handy for something else.

Raspberry Pi 3 Model B with Coral Edge TPU acceleration running SSD object detection


It wasn’t too hard to go from the inline rt-ai Edge Stream Processing Element using the Coral Edge TPU accelerator to an embedded version running on a Raspberry Pi 3 Model B with Pi camera.  The rt-ai Edge test design for this SPE is pretty simple again:


As can be seen, the Pi + Coral runs at about 4 fps with 1280 x 720 frames which is not too bad at all. In this example, I am running the PiCoral camera SPE on the Raspberry Pi node (Pi7) and the View SPE on the Default node (an i7 Ubuntu machine). Also, I’m using the combined video and metadata output which contains both the detection data and the associated JPEG video frame. However, the PiCoral SPE also has a metadata-only output. This contains all the frame information and detection data (scores, boxes etc) but not the JPEG frame itself. This can be useful for a couple of reasons. First, especially if the Raspberry Pi is connected via WiFi, transmitting the JPEGs can be a bit onerous and, if they are not needed, very wasteful. Secondly, it satisfies a potential privacy issue in that the raw video data never leaves the Raspberry Pi. Provided the metadata contains enough information for useful downstream processing, this can be a very efficient way to configure a system.