AIY Vision Kit + MobileNet+SSD: a smart camera for rt-ai Edge

I decided that it would be fun to try out a Google AIY Vision Kit as a sort of warm-up for the potentially much more significant Edge TPU.

The Vision Kit is basically the same configuration as the ZeroSensor camera except with an extra board in the camera path that can perform inference on the captured images. The kit comes with some frozen graphs that can be used to detect a few things, but I thought it would be interesting to try training a MobileNet SSD network on the Pascal VOC 2012 training data, which covers 20 different object classes. The instructions for how to do this are here.

Once that was all running, the next step was to integrate it with rt-ai Edge. It’s pretty similar to the earlier full-blown TensorFlow version so it didn’t take too long to get working.

The design is much the same as usual except with the new VisionKit object detection SPE instead of TFObjectDetect or Deeplab. Note that the PiCam and VisionKit SPEs are running on the AIY Vision Kit, whereas the MediaView SPE is running on a desktop.

This is the output from the MediaView SPE. The metadata has been formatted to look exactly the same as that produced by the previous TensorFlow detector so that the two can be used interchangeably in stream processing networks. I am getting about 2 fps with 640 x 360 images, which is actually better than I expected.
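To give an idea of what that metadata looks like, here is a purely illustrative sketch. The field names are my assumptions rather than the actual rt-ai Edge schema; the point is simply that the VisionKit SPE emits the same structure as TFObjectDetect, so downstream SPEs need no changes.

```python
# Illustrative sketch only: the field names are assumptions, not the actual
# rt-ai Edge metadata schema. Each frame carries a list of labelled, scored
# bounding boxes.
detection_metadata = {
    "timestamp": 1234567890.0,           # capture time of the frame
    "width": 640,                        # frame dimensions
    "height": 360,
    "detections": [
        {
            "label": "person",           # one of the 20 Pascal VOC classes
            "confidence": 0.87,          # detector score, 0..1
            "box": [120, 40, 310, 350],  # pixel bounding box (xmin, ymin, xmax, ymax)
        }
    ],
}
```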

Validating long-term sensor data collection with rtaiView

In order to avoid the whole garbage-in, garbage-out problem, I want to make sure that the long-term data that I am collecting from the ZeroSensors and IP cameras is actually what I think it is. rtaiView is an rt-ai Edge app that allows both real-time streams and historic stored data to be reviewed in a convenient way. It can be used to check both raw data and extracted metadata for each stream. The screen capture above shows an rtaiView instance monitoring the real-time streams from two external IP cameras and the three streams (video, audio and multi-channel sensor) from a ZeroSensor. I am going to add a second ZeroSensor at another internal location and then let the whole thing run for a while. Just to make life complicated, I am using the straight TensorFlow object detector for the external cameras and DeepLabv3 segmentation in color map mode for the internal cameras so that I don’t upset the other inhabitants, who aren’t keen on me putting cameras everywhere.

The focus now moves on to extracting interesting sequences from the stored data and then using these to train an anomaly detector. The exact architecture of the anomaly detector is still a bit unclear – definitely a research project.

Completed ZeroSensors all ready for long term data collection

Finally, this is a ZeroSensor all ready to go into full-time service, capturing video, audio and environmental data. The goal is to use this data, and that from other cameras around the space, as training data for machine learning systems.

One specific goal is to create an anomaly detector with minimal supervision. As much as possible, it will learn from experience. This is kind of tricky as it requires detection of unknown-length sequences, depending on the circumstances. I am intrigued by the ideas behind the Universal Translator but not sure how much could carry over to this application. This paper reviews some of the techniques usually applied, at least for video processing. The situation here is a little different as there are quite different types of features involved. My plan is to preprocess the video and audio to recognize salient features (using object detection or whatever) and then input these features, along with the environmental sensor data, to the anomaly detector in the form of uniform time-slotted data sets. This doesn’t help with detecting the length of an interesting sequence – that’s the fun part of the project.
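To make the time-slotted idea concrete, here is a hypothetical sketch of what one timeslot’s worth of features might look like before being fed to the anomaly detector. The slot length, feature choices and names are all assumptions at this stage, not a settled design.

```python
# Hypothetical sketch of one timeslot of features for the anomaly detector.
# The slot length, feature set and field names are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Dict, List

SLOT_SECONDS = 10  # length of one timeslot (an assumption)

@dataclass
class TimeSlotFeatures:
    start_time: float                                             # timeslot start (Unix seconds)
    object_counts: Dict[str, int] = field(default_factory=dict)   # e.g. {"person": 2}
    audio_level: float = 0.0                                      # mean audio amplitude in the slot
    temperature: float = 0.0                                      # environmental sensor averages
    humidity: float = 0.0
    pressure: float = 0.0
    light: float = 0.0
    air_quality: float = 0.0

    def vector(self, classes: List[str]) -> List[float]:
        """Flatten to a fixed-length vector for the anomaly detector."""
        counts = [float(self.object_counts.get(c, 0)) for c in classes]
        return counts + [self.audio_level, self.temperature, self.humidity,
                         self.pressure, self.light, self.air_quality]
```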

Adding audio support to the ZeroSensor

Something conspicuously missing from the original ZeroSensor concept was audio support. That has now been remedied with the SPH0645 I2S microphone board from Adafruit. I was originally a bit put off by the somewhat extended software installation but, in the end, this turned out to be the best approach.

This is the new ZeroSynth synth module for the ZeroSensor. Basically, it just consists of an audio capture stream processing element (SPE) added to the existing video and sensor capture SPEs.
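Conceptually, the audio capture SPE just reads fixed-size blocks of samples from the I2S microphone (which appears as an ALSA capture device once the SPH0645 driver is set up) and passes them downstream with timestamps. A minimal PyAudio sketch of that loop is below; the sample rate, format and block size are assumptions and the real SPE may well differ.

```python
# Minimal audio capture sketch for the SPH0645 I2S microphone on a Pi Zero.
# Assumes the I2S overlay/driver is installed so the mic shows up as an ALSA
# capture device; the rate, format and block size here are assumptions.
import pyaudio

RATE = 48000          # SPH0645 is typically run at 48 kHz
BLOCK = 4800          # 100 ms of samples per block

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt32,   # the I2S driver delivers 32-bit samples
                 channels=1,
                 rate=RATE,
                 input=True,
                 frames_per_buffer=BLOCK)

try:
    while True:
        # In the real SPE this block would be timestamped and emitted on the
        # audio output pin; here it is simply read and discarded.
        data = stream.read(BLOCK, exception_on_overflow=False)
except KeyboardInterrupt:
    pass
finally:
    stream.stop_stream()
    stream.close()
    pa.terminate()
```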

The capture above shows the simple test design with the ZeroSynth module, which once again uses the DeepLabv3 SPE along with three PutManifold blocks to map the rt-ai Edge data into the Manifold for storage and offline processing.

rtaiView can be used to view and review the three streams captured from the ZeroSensor. The audio can also be played out of the rtaiView host’s speakers if desired.

The intention is not so much to process the audio for content, like words, at this stage. It’s more that the presence or absence of sounds or transients, in conjunction with other sensed events, could be useful. There seem to be two basic choices: either use just an average amplitude level during a specific timeslot or actually try to recognize sound sources during the timeslot. Either of these then becomes a feature for input into an inference engine along with other detected features from the video and sensor streams.
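The first option is cheap to compute. As a purely illustrative sketch, a timeslot’s worth of raw samples could be reduced to a single RMS level like this:

```python
# Sketch of the "average amplitude per timeslot" option: reduce all the audio
# samples that fall in a timeslot to a single RMS level. Purely illustrative.
import numpy as np

def timeslot_level(samples: np.ndarray) -> float:
    """Return the RMS amplitude of the samples in one timeslot.

    samples: 1-D array of raw integer samples (e.g. int32 from the I2S mic).
    """
    if samples.size == 0:
        return 0.0
    x = samples.astype(np.float64)
    return float(np.sqrt(np.mean(x * x)))

# Example: 10 seconds of 48 kHz audio collected during one timeslot.
slot_samples = np.zeros(48000 * 10, dtype=np.int32)   # silence, for illustration
level = timeslot_level(slot_samples)                  # -> 0.0
```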

rtaiView: an rt-ai app for viewing real-time and historic sensor data

I am now pulling things together so that I can use the ZeroSensors to perform long-term data collection. Data generated by the rt-ai Edge design is passed into the Manifold and then captured by ManifoldStore, one of the standard Manifold nodes. Obviously it would be nice to know that meaningful data is being stored, and that’s where rtaiView comes in. The screen capture above shows the real-time display when it has been configured to receive the video and data components of the ZeroSensor streams. It is showing the streams from a couple of ZeroSensors, but more can be added and the display adjusts accordingly.

This is the simple ZeroSpace design as seen in the rtaiDesigner editor window. The hardware setup consists of the ZeroSensors running the SensorZero synth stream processing element (SPE) and a server running the DeepLabv3 SPEs and the ManifoldZero synths. The ManifoldZero synths consist of a couple of PutManifold SPEs that take each stream from the ZeroSensor and map it to a Manifold stream.

ManifoldStore captures these streams and persists them to disk as can be seen from the screen capture above.
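ManifoldStore’s internals and on-disk format aren’t described here, but the general idea it has to implement can be sketched as per-stream, timestamped, append-only storage that can later be scanned by timecode. The following is an illustration of that idea only, not ManifoldStore’s actual format.

```python
# Illustrative sketch only: this is not ManifoldStore's actual on-disk format,
# just the general idea of per-stream, timestamped, append-only storage that
# can later be scanned by timecode for DVR-style review in rtaiView.
import json
import time

def append_record(stream_file: str, payload: dict) -> None:
    """Append one timestamped record to a stream's JSON-lines file."""
    record = {"timecode": time.time(), "payload": payload}
    with open(stream_file, "a") as f:
        f.write(json.dumps(record) + "\n")

def records_between(stream_file: str, start: float, end: float):
    """Yield the records whose timecode falls in [start, end)."""
    with open(stream_file) as f:
        for line in f:
            record = json.loads(line)
            if start <= record["timecode"] < end:
                yield record
```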

This allows rtaiView to display the real-time data coming from the ZeroSensors and historic data based on timecode.

The screen capture above shows rtaiView in historic (or DVR) mode. The control widget (at the top right) allows the user to scan through periods of time and visualize the data. The same timecode is used for all streams displayed, making it easy to correlate events between them.

rtaiView is a useful tool for checking that the rt-ai Edge design is operating correctly and that the data stored is useful. In these examples, I have set DeepLabv3 to color map recognized objects. However, this is not the desired mode: I just want to store images in which people have been detected, masked so that the images contain only the people. The ultimate goal is to use these image sequences along with other sensor data to detect anomalous behavior and also to predict actions so that the rt-ai Edge enabled sentient space can be proactive in taking actions.

ZeroSensor case design

It has taken a while to get to this point but, now that the focus is back on rt-ai Edge, it is time to get the ZeroSensors sorted out properly. The design above is the prototype 3D printed case. It’s a free-standing case, about 3 inches by 2.7 inches and 1.4 inches deep. The biggest problem with these things is getting enough thermal isolation that the temperature reading comes from the outside air rather than air heated by the Raspberry Pi Zero. The big baffle on the rear (on the right of the image above) is intended to keep the air in the two halves separate. The little slot allows four thin cables to run between the Pi and the sensor boards. Right now the back has no holes, so air flow is entirely bottom-to-top convection on both the sensor side and the Pi side. However, this might need to be changed if the initial design doesn’t work. The plastic material will conduct heat, so it may be necessary to add more thermal isolation using holes or slots in the back.

The ZeroSensor – a sentient space point of presence

One application for rt-ai Edge is ubiquitous sensing leading to sentient spaces – spaces that can interact with the people moving through them and provide useful functionality, whether learned or programmed. A step on the road to that is the ZeroSensor, four prototypes of which are shown in the photo. Each ZeroSensor consists of a Raspberry Pi Zero W, a Pi camera module v2, an Adafruit BME680 breakout and an Adafruit TSL2561 breakout. The combination gives a video stream and a sensor stream with light, temperature, pressure, humidity and air quality values. The video stream can be used to derive motion sensing and identification while the other sensors provide a general idea of conditions in the space. Notably missing is audio. Microphone support would be useful for general sensing, and I might add that in real devices. A 3D-printable case design is underway in order to allow wide-scale deployment.
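For reference, both breakouts are straightforward to read from Python using Adafruit’s CircuitPython libraries (adafruit-circuitpython-bme680 and adafruit-circuitpython-tsl2561 via Blinka). The sketch below shows the idea; property names can vary slightly between library versions, so treat it as a guide rather than exact code.

```python
# Sketch of reading the ZeroSensor's environmental sensors over I2C using
# Adafruit's CircuitPython libraries via Blinka. Property names may differ
# slightly between library versions.
import time
import board
import busio
import adafruit_bme680
import adafruit_tsl2561

i2c = busio.I2C(board.SCL, board.SDA)
bme680 = adafruit_bme680.Adafruit_BME680_I2C(i2c)
tsl2561 = adafruit_tsl2561.TSL2561(i2c)

while True:
    reading = {
        "temperature": bme680.temperature,   # degrees C
        "pressure": bme680.pressure,         # hPa
        "humidity": bme680.humidity,         # %RH (relative_humidity in newer versions)
        "airquality": bme680.gas,            # gas resistance in ohms (air quality proxy)
        "light": tsl2561.lux,                # lux (None if the sensor saturates)
    }
    print(reading)
    time.sleep(1)
```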

Voice-based interaction is a powerful way for users to interact with sentient spaces. However, it is assumed that people who want to interact are using an AR headset of some sort, which itself provides the audio I/O capabilities. Gesture input would be possible via the ZeroSensor’s camera. For privacy reasons, video would not be viewed directly or stored but just used as a source of activity and interaction data.

This is the simple rt-ai design used to test the ZeroSensors. The ZeroSynth modules are rt-ai Edge synth modules that contain SPEs that interface with the ZeroSensor’s hardware and generate a video stream and a sensor data stream. A video viewer and a sensor viewer are connected to each ZeroSynth module.

This is the result of running the ZeroSensor test design, showing a video and sensor window for each ZeroSensor. The cameras are staring at the ceiling because the four sensors were on a table. When the correct case is available, they will be deployed in the corners of rooms in the space.