Turned out to be pretty easy to integrate the ssd_mobilenet_v2_coco model compiled for the Intel NCS 2 into rt-ai Edge. Since it doesn’t use the GPU, I was able to run this and the YOLOv3 SPE on the same machine which is kind of amusing – one YOLOv3 instance tends to chew up most of the GPU memory, unfortunately, so the GPU can’t be shared. I would have liked to have run YOLOv3 on the NCS 2 for direct comparison but could not. The screen capture above shows the MediaView SPE output for both detectors running on the same 1280 x 720 video stream.
This is the design and it is showing the throughput of each detection SPE – 14 fps for the GTX 1080 ti YOLO and 9 fps for the NCS 2 based SSD. Not exactly a fair comparison, however, but still interesting. It would be much better if I had the same model running using a GPU of course. Right now, the GPU-based SPE that can run ssd_mobilenet_v2_coco (and similar models) is Python based and that (not surprisingly) runs a fair bit slower than the compiled C++ versions I am using here.
An Intel Neural Compute Stick 2 (NCS 2) just turned up. It will be interesting to see how it compares to the earlier version. The software supporting it is now OpenVINO which is pretty interesting as it makes it relatively easy to move models across multiple hardware platforms.
The NCS 2 is really the best edge inference hardware engine that is generally available as far as I am aware. Hopefully, one day, the Edge TPU will be generally available and there seem to be many more in the pipeline. Typically these edge inference devices do not support RNNs or related architectures but, in the short term, that isn’t a problem as CNNs and DNNs are probably most useful at the edge at the moment, being very effective at compressing audio and video streams down to low rate information streams for example.
A nice feature of the NCS 2 is that it is easy to connect multiple examples to a single powerful CPU. The combination of a reasonably powerful CPU along with dedicated inference hardware is pretty interesting and is an ideal architecture for an rt-ai Edge node as it happens.
The Stereolabs ZED camera is a quite effective way of generating depth-enhanced video streams and it seemed like it was time to get one and integrate it with rt-ai Edge. I have worked with one of these before in a different context and I knew that using the ZED was pretty straightforward.
The screen capture above shows the ZED YOLO C++ example code running. The mug in the shot was a bit too close to the monitor to get picked up and my hand was probably too close in general hence the strange 4.92m depth reading. However, it does seem to work pretty well. It even picked up the image of the monitor on the screen as a monitor.
Just as a note, I did have to modify the main.cpp code to run. At line 49, I had to add a std:: in front of an isfinite() call for some reason. Maybe something odd on my Ubuntu system. Also, to get the standard samples to build, I had to add libxmu-dev as another dependency.
Now comes the task of adding this to rt-ai Edge. I am going to split this into two: the first is to produce a new camera SPE that works with the ZED and outputs the depth image in addition to the normal camera image. Then, the CYOLO SPE will be modified to accept optional depth information and perform the processing to generate the actual object depth value. This seems like a more general solution as the ZED SPE then looks like a standard depth camera while the upgraded CYOLO will be able to work with any depth camera.
Following on from the previous post, I have now put together a pretty usable workflow for creating custom YOLOv3 models – the code and instructions are here. There are quite a few alternatives out there already but it was interesting putting this together from a learning point of view. The screen capture above was taken during some testing. I stopped the training early (which is why the probabilities are pretty low) so that I could test the weights with an rt-ai stream processing network design and then restarted the training. The tools automatically generate customized scripts to train and restart training, making this pretty painless.
There is a tremendous amount of valuable information here, including the code for the custom anchor generator that I have integrated into my workflow. I haven’t yet tried this enhanced version of Darknet yet but will do that soon. One thing I did learn from that repo is that there is an option to treat mirror image objects as distinct objects – no doubt that was what was hindering the accurate detection of the left and right motion controllers previously.
I have an application that requires a custom object detector for rt-ai and YOLOv3 seemed liked a good base from which to start. The challenge as always is to capture and prepare suitable training data. I followed the guide here which certainly saved a lot of work. For this test, I used about 50 photos each of the left and right controllers from a Windows MR headset. The result from the rt-ai SPE is shown in the capture above. I was interested to see how well it could determine between the left and right controllers as they are just mirror images of each other. It’s a bit random but not terrible. Certainly it is very good at detecting the presence or absence of controllers, even if it is not sure which one it is. No doubt adding more samples for training would improve this substantially.
The guide I followed to create the training data works but has a number of steps that need to be done correctly and in the right order. I am going to modify the Python code to consolidate this into a smaller number of (hopefully) idiot-proof steps and put the results up on GitHub in case anyone else finds it useful.
Fresh from success with YOLOv3 on the desktop, a question came up of whether this could be made to work on the Movidius Neural Compute Stick and therefore run on the Raspberry Pi.
The NCS is a neat little device and because it connects via USB, it is easy to develop on a desktop and then transfer everything needed to the Pi.
The app zoo, on the ncsdk2 branch, has a tiny_yolo_v2 implementation that I used as the basis for this. It only took about an hour to get this working on the desktop – integration with rt-ai was very easy. The Raspberry Pi end was not – all kinds of version number issues and things like that. However, even though not all of the tools would compile, I just moved the compiled graph from the desktop to the Pi and that worked fine.
This is the design. The main difference here from the usual test designs is that the MYOLO SPE is assigned to node pi34 (the Raspberry Pi) rather than the desktop (Default). Just assigning the MYOLO SPE to the Pi saved me from having to connect a Picam or uvc camera to the Pi and also allowed me to get a better feel for the pure performance of the Pi with the NCS.
As can be seen from the first screen capture it worked fine although, because it supports only a subset (20 of 91) of the usual COCO labels, it did not pick up the mouse or the keyboard. Performance-wise, it was running at about 1fps and 30% CPU. Just for reference, I was getting about 8fps on the i7 desktop.
I decided that it would be fun to try out a Google AIY Vision Kit as a sort of warm-up for the potentially much more significant Edge TPU.
The Vision Kit is basically the same configuration as the ZeroSensor camera except with an extra board in the camera path that can perform inference on the captured images. The kit comes with some frozen graphs that can be used to detect a few things but I thought it would be interesting to try training a MobileNet SSD network with the Pascal VOC 2012 training data which can identify 20 different objects. The instructions for how to do this are here.
Once that was all running, the next step was to integrate it with rt-ai Edge. It’s pretty similar to the earlier full-blown TensorFlow version so it didn’t take too long to get working.
The design is much the same as usual except with the new VisionKit object detection SPE instead of TFObjectDetect or Deeplab. Note that the PiCam and VisionKit SPEs are running on the AIY Vision Kit, whereas the MediaView SPE is running on a desktop.
This is the output from the MediaView SPE. The metadata has been formatted to look exactly the same as the previous TensorFlow detector so that they can be used interchangeably in stream processing networks. I am getting about 2 fps with 640 x 360 images which is actually better than I expected.