It wasn’t too hard to go from the inline rt-ai Edge Stream Processing Element using the Coral Edge TPU accelerator to an embedded version running on a Raspberry Pi 3 Model B with Pi camera. The rt-ai Edge test design for this SPE is pretty simple again:
As can be seen, the Pi + Coral runs at about 4 fps with 1280 x 720 frames, which is not too bad at all. In this example, I am running the PiCoral camera SPE on the Raspberry Pi node (Pi7) and the View SPE on the Default node (an i7 Ubuntu machine). Also, I’m using the combined video and metadata output which contains both the detection data and the associated JPEG video frame. However, the PiCoral SPE also has a metadata-only output. This contains all the frame information and detection data (scores, boxes, etc.) but not the JPEG frame itself. This can be useful for a couple of reasons. First, especially if the Raspberry Pi is connected via WiFi, transmitting the JPEGs can be a bit onerous and, if they are not needed, very wasteful. Second, it addresses a potential privacy issue in that the raw video data never leaves the Raspberry Pi. Provided the metadata contains enough information for useful downstream processing, this can be a very efficient way to configure a system.
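As a rough sketch of the idea (the field names here are my assumptions for illustration, not the actual rt-ai message format), a per-frame message might look something like this, with the JPEG attached only when the combined output is selected:

```python
import json
import time

def build_frame_message(detections, frame_width, frame_height, jpeg_bytes=None):
    """Assemble a per-frame metadata record. When jpeg_bytes is None this is
    the metadata-only output: just frame info and detection data."""
    message = {
        "timestamp": time.time(),
        "width": frame_width,
        "height": frame_height,
        "detections": [
            {
                "label": d["label"],
                "score": d["score"],   # confidence in [0, 1]
                "box": d["box"],       # [xmin, ymin, xmax, ymax] in pixels
            }
            for d in detections
        ],
    }
    if jpeg_bytes is not None:
        # In a real system the JPEG would travel as a binary attachment
        # rather than base64 text; this is just for illustration.
        import base64
        message["jpeg"] = base64.b64encode(jpeg_bytes).decode("ascii")
    return json.dumps(message)

# Metadata-only: a few hundred bytes per frame, and the video never leaves the Pi.
msg = build_frame_message(
    [{"label": "person", "score": 0.87, "box": [100, 50, 400, 700]}],
    1280, 720)
```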
A Coral USB Accelerator turned up yesterday so of course it had to be integrated with rt-ai Edge to see what it could do. Creating a Python-based SPE from the object detection demo in the API download didn’t take too long. I used the MobileNet SSD v2 COCO model as a starting point to generate this example output:
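Most of the SPE work is just mapping the detection candidates that the Edge TPU API returns (class id, score, and a box in relative coordinates) into the SPE's pixel-coordinate output format. A minimal sketch of that conversion step, with the candidate tuple layout and label map as assumptions:

```python
def candidates_to_detections(candidates, labels, frame_width, frame_height,
                             threshold=0.5):
    """Convert detection candidates of the form (label_id, score,
    (x0, y0, x1, y1)) with relative coordinates into pixel-coordinate
    records, dropping anything below the score threshold."""
    detections = []
    for label_id, score, box in candidates:
        if score < threshold:
            continue
        x0, y0, x1, y1 = box
        detections.append({
            "label": labels.get(label_id, "unknown"),
            "score": score,
            "box": [int(x0 * frame_width), int(y0 * frame_height),
                    int(x1 * frame_width), int(y1 * frame_height)],
        })
    return detections
```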
The very basic rt-ai Edge test design looks like this:
Using 1280 x 720 video frames from the webcam, I was getting around 2 frames per second from the CoralSSD SPE. This isn’t as good as the Intel NCS 2 SPE but that is a compiled C++ SPE whereas the Coral SPE is a Python 3 SPE. I haven’t found a C++ API spec for the Edge TPU as yet. Perhaps by investigating the SWIG-generated Python interface I could link the compiled libraries directly but that’s for another day…
Turned out to be pretty easy to integrate the ssd_mobilenet_v2_coco model compiled for the Intel NCS 2 into rt-ai Edge. Since it doesn’t use the GPU, I was able to run this and the YOLOv3 SPE on the same machine, which is kind of amusing – one YOLOv3 instance tends to chew up most of the GPU memory, unfortunately, so the GPU can’t be shared. I would have liked to run YOLOv3 on the NCS 2 for a direct comparison but could not. The screen capture above shows the MediaView SPE output for both detectors running on the same 1280 x 720 video stream.
This is the design and it is showing the throughput of each detection SPE – 14 fps for the GTX 1080 ti YOLO and 9 fps for the NCS 2 based SSD. Not exactly a fair comparison, but still interesting. It would be much better if I had the same model running on a GPU, of course. Right now, the GPU-based SPE that can run ssd_mobilenet_v2_coco (and similar models) is Python based and that (not surprisingly) runs a fair bit slower than the compiled C++ versions I am using here.
An Intel Neural Compute Stick 2 (NCS 2) just turned up. It will be interesting to see how it compares to the earlier version. The software supporting it is now OpenVINO which is pretty interesting as it makes it relatively easy to move models across multiple hardware platforms.
The NCS 2 is really the best edge inference hardware engine that is generally available as far as I am aware. Hopefully, one day, the Edge TPU will be generally available, and there seem to be many more devices in the pipeline. Typically these edge inference devices do not support RNNs or related architectures but, in the short term, that isn’t a problem as CNNs and DNNs are probably most useful at the edge at the moment, being very effective at compressing audio and video streams down to low-rate information streams, for example.
A nice feature of the NCS 2 is that it is easy to connect multiple devices to a single powerful CPU. The combination of a reasonably powerful CPU along with dedicated inference hardware is pretty interesting and is an ideal architecture for an rt-ai Edge node as it happens.
The Stereolabs ZED camera is a quite effective way of generating depth-enhanced video streams and it seemed like it was time to get one and integrate it with rt-ai Edge. I have worked with one of these before in a different context and I knew that using the ZED was pretty straightforward.
The screen capture above shows the ZED YOLO C++ example code running. The mug in the shot was a bit too close to the monitor to get picked up and my hand was probably too close in general hence the strange 4.92m depth reading. However, it does seem to work pretty well. It even picked up the image of the monitor on the screen as a monitor.
Just as a note, I did have to modify the main.cpp code to run. At line 49, I had to add a std:: in front of an isfinite() call for some reason. Maybe something odd on my Ubuntu system. Also, to get the standard samples to build, I had to add libxmu-dev as another dependency.
Now comes the task of adding this to rt-ai Edge. I am going to split this into two: the first is to produce a new camera SPE that works with the ZED and outputs the depth image in addition to the normal camera image. Then, the CYOLO SPE will be modified to accept optional depth information and perform the processing to generate the actual object depth value. This seems like a more general solution as the ZED SPE then looks like a standard depth camera while the upgraded CYOLO will be able to work with any depth camera.
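The depth processing in the upgraded CYOLO could be fairly simple: sample the depth image inside each detection box, discard non-finite values (the ZED reports inf/nan for unmeasurable pixels, as the 4.92m reading above suggests), and take the median of the central region so the background contributes less. A sketch of that idea, assuming the depth image arrives as a float array in metres:

```python
import numpy as np

def object_depth(depth_image, box, central_fraction=0.5):
    """Estimate the depth of a detected object from a per-pixel depth map.
    box is [xmin, ymin, xmax, ymax] in pixels. Only the central region of
    the box is sampled, to reduce background contamination, and non-finite
    depth values (unmeasured pixels) are ignored."""
    xmin, ymin, xmax, ymax = box
    w, h = xmax - xmin, ymax - ymin
    inset_x = int(w * (1 - central_fraction) / 2)
    inset_y = int(h * (1 - central_fraction) / 2)
    region = depth_image[ymin + inset_y:ymax - inset_y,
                         xmin + inset_x:xmax - inset_x]
    valid = region[np.isfinite(region)]
    if valid.size == 0:
        return None  # no usable depth inside the box
    return float(np.median(valid))
```

Keeping this in the detector SPE rather than the camera SPE means any depth camera that outputs a depth image in the same form can feed it.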
Following on from the previous post, I have now put together a pretty usable workflow for creating custom YOLOv3 models – the code and instructions are here. There are quite a few alternatives out there already but it was interesting putting this together from a learning point of view. The screen capture above was taken during some testing. I stopped the training early (which is why the probabilities are pretty low) so that I could test the weights with an rt-ai stream processing network design and then restarted the training. The tools automatically generate customized scripts to train and restart training, making this pretty painless.
There is a tremendous amount of valuable information here, including the code for the custom anchor generator that I have integrated into my workflow. I haven’t tried this enhanced version of Darknet yet but will do that soon. One thing I did learn from that repo is that there is an option to treat mirror image objects as distinct objects – no doubt that was what was hindering the accurate detection of the left and right motion controllers previously.
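For reference, the usual approach to custom anchor generation for YOLO (which I believe the code in that repo follows too) is k-means over the training boxes' widths and heights, with 1 − IoU as the distance rather than Euclidean distance. A simplified sketch:

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes (N, 2) and anchors (K, 2), compared as if
    co-centred so only width and height matter."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iterations=100, seed=0):
    """k-means on (width, height) pairs: assign each box to the anchor it
    overlaps most, then move each anchor to its cluster mean."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iterations):
        assignment = np.argmax(iou_wh(boxes, anchors), axis=1)
        new = np.array([boxes[assignment == i].mean(axis=0)
                        if np.any(assignment == i) else anchors[i]
                        for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    # Darknet expects anchors sorted from smallest to largest
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```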
I have an application that requires a custom object detector for rt-ai and YOLOv3 seemed like a good base from which to start. The challenge as always is to capture and prepare suitable training data. I followed the guide here which certainly saved a lot of work. For this test, I used about 50 photos each of the left and right controllers from a Windows MR headset. The result from the rt-ai SPE is shown in the capture above. I was interested to see how well it could distinguish between the left and right controllers as they are just mirror images of each other. It’s a bit random but not terrible. Certainly it is very good at detecting the presence or absence of controllers, even if it is not sure which one it is. No doubt adding more samples for training would improve this substantially.
The guide I followed to create the training data works but has a number of steps that need to be done correctly and in the right order. I am going to modify the Python code to consolidate this into a smaller number of (hopefully) idiot-proof steps and put the results up on GitHub in case anyone else finds it useful.
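One of the easy-to-get-wrong steps is generating the train/test image lists that Darknet expects, so that is a natural candidate for consolidation. A sketch of what that step might look like (the directory layout and split fraction are placeholders, not the guide's exact values):

```python
import os
import random

def write_darknet_lists(image_dir, out_dir, test_fraction=0.1, seed=42):
    """Split the labelled images into Darknet-style train.txt / test.txt
    list files, one absolute image path per line. Shuffling is seeded so
    reruns produce the same split."""
    images = sorted(
        os.path.join(os.path.abspath(image_dir), f)
        for f in os.listdir(image_dir)
        if f.lower().endswith((".jpg", ".jpeg", ".png")))
    random.Random(seed).shuffle(images)
    split = int(len(images) * test_fraction)
    test, train = images[:split], images[split:]
    os.makedirs(out_dir, exist_ok=True)
    for name, paths in (("train.txt", train), ("test.txt", test)):
        with open(os.path.join(out_dir, name), "w") as f:
            f.write("\n".join(paths) + "\n")
    return len(train), len(test)
```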