Fresh from success with YOLOv3 on the desktop, a question came up of whether this could be made to work on the Movidius Neural Compute Stick and therefore run on the Raspberry Pi.
The NCS is a neat little device and because it connects via USB, it is easy to develop on a desktop and then transfer everything needed to the Pi.
The app zoo, on the ncsdk2 branch, has a tiny_yolo_v2 implementation that I used as the basis for this. It only took about an hour to get this working on the desktop – integration with rt-ai was very easy. The Raspberry Pi end was not – all kinds of version number issues and things like that. However, even though not all of the tools would compile, I just moved the compiled graph from the desktop to the Pi and that worked fine.
This is the design. The main difference here from the usual test designs is that the MYOLO SPE is assigned to node pi34 (the Raspberry Pi) rather than the desktop (Default). Just assigning the MYOLO SPE to the Pi saved me from having to connect a Picam or uvc camera to the Pi and also allowed me to get a better feel for the pure performance of the Pi with the NCS.
As can be seen from the first screen capture it worked fine although, because it supports only a subset (20 of 91) of the usual COCO labels, it did not pick up the mouse or the keyboard. Performance-wise, it was running at about 1fps and 30% CPU. Just for reference, I was getting about 8fps on the i7 desktop.
I decided that it would be fun to try out a Google AIY Vision Kit as a sort of warm-up for the potentially much more significant Edge TPU.
The Vision Kit is basically the same configuration as the ZeroSensor camera except with an extra board in the camera path that can perform inference on the captured images. The kit comes with some frozen graphs that can be used to detect a few things but I thought it would be interesting to try training a MobileNet SSD network with the Pascal VOC 2012 training data which can identify 20 different objects. The instructions for how to do this are here.
Once that was all running, the next step was to integrate it with rt-ai Edge. It’s pretty similar to the earlier full-blown TensorFlow version so it didn’t take too long to get working.
The design is much the same as usual except with the new VisionKit object detection SPE instead of TFObjectDetect or Deeplab. Note that the PiCam and VisionKit SPEs are running on the AIY Vision Kit, whereas the MediaView SPE is running on a desktop.
This is the output from the MediaView SPE. The metadata has been formatted to look exactly the same as the previous TensorFlow detector so that they can be used interchangeably in stream processing networks. I am getting about 2 fps with 640 x 360 images which is actually better than I expected.
Finally this is a ZeroSensor all ready to go into full time service, capturing video, audio and environmental data. The goal is to use this data, and that from other cameras around the space, as training data for machine learning systems.
One specific goal is to create an anomaly detector with minimal supervision. As much as possible, it will learn from experience. This is kind of tricky as it requires detection of unknown length sequences depending on the circumstances. I am intrigued by the ideas behind the Universal Translator but not sure how much could carry over to this application. This paper reviews some of the techniques usually applied, at least for video processing. The situation here is a little different as there are quite different types of features involved. My plan is to preprocess video and audio to recognize salient features (using object detection or whatever) and then input these features, along with environmental sensor data, in the form of uniform time-slotted data sets to the anomaly detector. This doesn’t help with detecting the length of an interesting sequence – that’s the fun part of the project.
I have been using DeepLabv3 for a while now for object detection but I thought it would be interesting to try some examples from the TensorFlow object detection repo. I now have an rt-ai Edge stream processing element that is based on the Jupyter notebook example in the repo. Presumably this will work with any of the models in the model zoo although I am just using the default one for now.
As you can see from the preview capture above (apart from the nasty looking grass on the left) it picks out the car happily, although not with a great confidence level. Maybe it doesn’t like the elevated camera position or the car is a bit too far away or a difficult pose – I will need to do some more experiments. With the preview display on (using PyGame) I am only getting 1 fps with 1280 x 720 frames from the camera which is a little disappointing. However, with preview turned off (the normal production mode anyway), I am getting over 15fps which is entirely adequate.
The capture above shows the raw image along with the object recognition data in the form of metadata rather than drawn on the image. This is actually pretty useful for both real-time and offline processing (such as a machine learning run). Capturing the original image does have the advantage that alternate object detectors could be run at any time, at the expense of having to store more data. Real-time actions can be based on the metadata and the raw image just discarded.
Anyway, definitely a work in progress. It will be interesting to see how it compares with the DeepLabv3 version as the implementation gets more efficient. What’s nice is that it is trivial to swap out one object detector for another or run them in parallel in order to run tests. Just takes a few seconds with the rtaiDesigner GUI.
These days, machine learning techniques have led to the ability to create very realistic but fake video and audio that can be tough to distinguish from the real thing. The video above shows a very interesting example of this capability. The problem with this technology is that it will become impossible to determine if anything is genuine at all. What’s needed is some verification that a video of someone (for example) really is that person. Blockchain technology would seem to provide a solution for this.
Many years ago I was working on a digital watermarking-based system for detecting tampering in video records. Essentially, this embedded error-correcting codes in each frame that could be used to determine if any region of a frame had been modified after the digital watermark had been added. Cameras would add the digital watermark at source, limiting the opportunity for modification prior to watermarking.
One problem with this is that it worked on a frame by frame basis but didn’t ensure the integrity of an entire sequence. In theory this could be done with temporally distributed watermarks but blockchain technology provides a very nice alternative.
A simple strategy would be to have the sensor (camera, microphone, motion detector, whatever) create a hash for each unit of data (video frame, chunk of audio etc) and add this to a blockchain. Then a review app could create new hashes from the sensor data itself (stored elsewhere) and compare them to those in the blockchain. It could also determine that the account owner or device is who or what it is supposed to be in order to avoid spoofing. It’s easy to envisage an Etherium smart contract being the basis of such a system.
One issue with this is the potential rate at which hashes need to be added to the blockchain. This rate could be reduce by collecting more data (e.g. accumulating one second’s worth of data to generate one hash) or creating a hash of hashes at an appropriate rate. The only downside to this is losing temporal resolution of where changes have been made.
It’s worth considering the effects of lossy compression. Obviously if a stream is uncompressed or only uses lossless compression, watermarking and hash generation can be done at a very early stage. Watermarking of video is designed to withstand compression so that can still be done at a very early stage, even with lossy compression. The hash has to be be bit-accurate with the stream as stored on the video storage medium though so the hash must be computed after lossy compression.
It seems as though this blockchain concept could definitely be made to work and possibly combined with the digital watermarking technique in the case of video to provide temporal and spatial resolution of tampering. I am sure that variations of this concept are out there already or being developed and maybe, one day, it will be possible for anybody to check if a video of a well-known person is real or fake.
The MQTT-based heart of rt-ai Edge is ideal for constructing stream processing networks (SPNs) that are intended to run continuously. rt-ai Edge tools (such as rtaiDesigner) make it easy to modify and re-deploy SPNs across multiple nodes during the design phase but, once in full time operation, these SPNs just run by themselves. An existing stream processing element (SPE), PutNiFi, allows data from an rt-ai Edge network to be stored and processed by big data tools – using Elasticsearch for example. However, these types of big data tools aren’t always appropriate, especially if low latency access is required as Java garbage collection can cause random delays.
For many applications, much simpler but reliably low latency storage is desirable. The Manifold system already has a storage app, ManifoldStore, that is optimized for timestamp-based searches of historical data. A new SPE called PutManifold allows data from an SPN to flow into a Manifold networking surface. The SPN screen capture above shows two instances of the PutManifold SPE used to transfer audio and video data from the SPN. ManifoldStore grabs passing data and stores it using timestamp as the key. Manifold applications can then access historical data flows using streamId/timestamp pairs. It is particularly simple to coordinate access across multiple data streams. This is very useful when trying to correlate events across multiple data sources at a particular point or window in time.
ManifoldStore is intrinsically schemaless in that it can store anything that consists of a JSON part and a binary data part, as used in rt-ai Edge. A new application called rtaiView is a universal viewer that allows multiple streams of all types to be displayed in a traditional split-screen monitoring format. It uses ManifoldStore for its underlying storage and provides a window into the operation of the SPN.
Manifold is designed to be very flexible with various features that reduce configuration for ad-hoc uses. This makes it very easy to perform offline processing of stored data as and when required which is ideal for offline machine learning applications.
rt-ai Edge is a new concept in edge processing that makes it easy for anyone to build AI and ML enhanced stream processing pipelines in order to close the local loop and offload communications networks and the cloud. Semantic extraction of meaningful data from raw data feeds at the edge ensures that the core only has to deal with actionable information, not noise. rt-ai Edge leverages hardware acceleration within embedded devices to filter raw data into highly salient messages for higher level processing.
rt-ai Edge is in active development right now.