SSD object detection using the Neural Compute Stick 2 now has its own rt-ai stream processing element

Turned out to be pretty easy to integrate the ssd_mobilenet_v2_coco model compiled for the Intel NCS 2 into rt-ai Edge. Since it doesn’t use the GPU, I was able to run this and the YOLOv3 SPE on the same machine which is kind of amusing – one YOLOv3 instance tends to chew up most of the GPU memory, unfortunately, so the GPU can’t be shared. I would have liked to have run YOLOv3 on the NCS 2 for direct comparison but could not. The screen capture above shows the MediaView SPE output for both detectors running on the same 1280 x 720 video stream.

This is the design and it is showing the throughput of each detection SPE – 14 fps for the GTX 1080 ti YOLO and 9 fps for the NCS 2 based SSD. Not exactly a fair comparison, however, but still interesting. It would be much better if I had the same model running using a GPU of course. Right now, the GPU-based SPE that can run ssd_mobilenet_v2_coco (and similar models) is Python based and that (not surprisingly) runs a fair bit slower than the compiled C++ versions I am using here.

ssd_mobilenet_v2_coco running on the Intel Neural Compute Stick 2

I had more luck running the ssd_mobilenet_v2_coco model from the TensorFlow model detection zoo on the NCS 2 than I did with YOLOv3. To convert from the .pb file to the OpenVINO-friendly files I used:

python3 --input_model ssdmv2.pb --tensorflow_use_custom_operations_config ./extensions/front/tf/ssd_v2_support.json --tensorflow_object_detection_api_pipeline_config ssdmv2_pipeline.config --data_type FP16

In this case, I had renamed the frozen_instance_graph.pb from the download as ssdmv2.pb and renamed the pipeline.config file from the download as ssdmv2_pipeline.config. The screen capture above shows the object_detection_demo_ssd_async demo app running with the NCS 2. I didn’t sort out the labels for this test which is why it is just displaying numbers for the detected objects.

I also tried this using the CPU (using –data_type FP32) with this result:

It is worth noting that the video was running at 1920 x 1080 which is a significant challenge for just about anything. The CPU (an i7 5820K) is obviously a fair bit faster than the NCS 2 but a real advantage is the small physical footprint, low price, low power and CPU offload that the Myriad X VPU in the NCS 2 offers.

Running YOLOv3 with OpenVINO on CPU and (not) NCS 2

Since OpenVINO is the software framework for the Neural Compute Stick 2, I thought it would be interesting to get the OpenVINO YOLOv3 example up and running. While the toolkit download does include a number of models, YOLOv3 isn’t one of them. Instead, the model has to be created from a TensorFlow version.

The instructions here describe how to do this. Steps 1 and 2 are fine but it is kind of awkward how the .pb file is generated so I created a new simple script to do this:

# -*- coding: utf-8 -*-

import numpy as np
import tensorflow as tf
from tensorflow.python.framework import graph_io

from yolo_v3 import yolo_v3, load_weights, detections_boxes, non_max_suppression

def load_coco_names(file_name):
    names = {}
    with open(file_name) as f:
        for id, name in enumerate(f):
            names[id] = name
    return names
def main(argv):

    classes = load_coco_names("coco.names")

    # placeholder for detector inputs
    inputs = tf.placeholder(tf.float32, [None, 416, 416, 3])

    with tf.variable_scope('detector'):
        detections = yolo_v3(inputs, len(classes), data_format='NHWC')
        load_ops = load_weights(tf.global_variables(scope='detector'), "yolov3.weights")

    boxes = detections_boxes(detections)

    with tf.Session() as sess:
        frozen = tf.graph_util.convert_variables_to_constants(sess, sess.graph_def, ['concat_1'])
        graph_io.write_graph(frozen, './', 'yolo_v3.pb', as_text=False)

if __name__ == '__main__':

This has the important filenames hardcoded – you just need to put yolo_v3.weights and coco.names in the tensorflow-yolo-v3 directory. Run the script above with:


and the yolo_v3.pb file should be created. Copy this into the model_optimizer directory, set that as the current directory and run:

python3 --input_model yolo_v3.pb --tensorflow_use_custom_operations_config ./extensions/front/tf/yolo_v3.json --input_shape [1,416,416,3]

The –input_shape parameter is needed as otherwise it blows up due to getting -1 for the mini-batch size. I just forced this to 1 and it was happy.

The result is in yolo_v3.xml and yolo_v3.bin. These can be used with the demo object_detection_demo_yolov3_async and an example output is shown in the screen capture above. Note that it is necessary to run the following:


in the same terminal session as the demo will be run in order for CPU mode to work.

By default, the output just annotates the boxes with label numbers rather than readable labels. To get readable labels, copy coco.names to yolo_v3.labels and put it in the same directory as the xml file. One problem is that the label file reader doesn’t handle spaces in the labels. Rather than mess with the code, I just changed the spaces in the yolo_v3.labels file to underlines. Otherwise it thinks a mouse is a donut and a monitor a dog which is a little confusing.

However, what I really wanted to do was to run this on the NCS 2. The model as generated is FP32 and the NCS 2 wants FP16. Adding –data_type FP16 to the command line fixes that but unfortunately it reports that the NCS 2 doesn’t support the Resample layer which is used by YOLOv3. If I had been smart I would have noticed that the usage info only mentions CPU and GPU :-(. Interestingly, the table of supported layers indicates that both Resample and Interp are supported on MYRIAD so I do not know what is going on here.

I did try changing the offending tf.image_resize_nearest_neighbor call into a tf.image.resize.bilinear call (by editing in the tensorflow-yolo-v3 directory). This maps to Interp instead of Resample in the OpenVINO IR.  This worked fine in CPU mode but still failed to run on the NCS 2 except in a different way:

Not sure if that is a bug or intended. Anyway, that seems to be the end of the road with running YOLOv3 on the NCS 2 for the moment at least. However, there are a lot of things that do run on the NCS 2 very nicely. Still, YOLOv3 had started to become my standard way of checking inference things out, just like my strategy of evaluating restaurants by the quality of their Caesar salad – at least in the days when you could still get them!

Dockerized YOLOv3 rt-ai SPE = YAOD (yet another object detector)

I had intended to be doing something completely different today (working on auto-compiling highlight reels of interesting events generated from the prototype production rt-ai Edge object detection system) but managed to get sidetracked by reading about Darknet-based YOLOv3.  As Darknet itself is in C and compiles to a shared library this was a good candidate for a Dockerized stream processing element. I used a cuDNN image from NVIDIA as the base since it provides pretty much everything required – I just had to add in the rt-ai SPE library software and compile Darknet on top of that.

The results are pretty good. The preview above shows some detected objects. I discovered that it could detect toothbrushes which is why I am waving one around. It also did a good job of picking up the second mouse just by my left shoulder. 2fps with 1280 x 720 frame size is a little disappointing but this seems to be due to the Python parts of the code since the C demo provided with the library runs much faster. It is a little faster with preview turned off, however (which would be the production mode anyway).

Speaking of production, it does have a problem as it consumes just over 7GB of memory on my GTX 1080 ti GPU card. This means that one GPU card can’t run two instances simultaneously, unlike with the TensorFlow SSD detector. In fact, I can get two instances of that working on a GTX 1080 card with 8GB total memory.

Just for completeness, this is the design which looks just like the usual test designs. The Docker container is built and pushed to a private Docker registry automatically when the design is generated. The target node then just pulls the image from the registry when the design starts up.

This is the MediaView output showing the metadata. The metadata format is equivalent to that generated by the TensorFlow object detector so that they are completely interchangeable.

rt-ai stream processing elements in Docker containers

Docker containers are a great way of reducing the headaches caused by pre-requisites and software versions when deploying code in general and rt-ai SPEs in particular. So it made sense to add support for SPEs in Docker containers in addition to the existing bare metal SPEs. The screen capture above shows the test design in rtaiDesigner using the Docker containerized version of the existing TensorFlow object detector. It is essentially identical to the bare metal version, just with the object detection SPE replaced with the Dockerized version. The container was based on the TensorFlow GPU image.

SPE code is deployed to nodes as a package that includes start and stop scripts. Normally, the start script is something very simple: a single line kicking off a Python script for example. Docker SPEs use a slightly more complex start script that first tries to pull the required Docker image from a defined registry location and then invokes the container in the required manner (using nvidia-docker if necessary).

No changes were required to the SPE code itself in this case – just customization of the start and stop scripts and I added some files used to build the container and install it in the local registry so that the build and update process is very straightforward. Plus, as this test design shows, bare metal and containerized SPEs can be mixed without limitation as the stream interfaces are identical in all cases.

AIY Vision Kit + MobileNet+SSD: a smart camera for rt-ai Edge

I decided that it would be fun to try out a Google AIY Vision Kit as a sort of warm-up for the potentially much more significant Edge TPU.

The Vision Kit is basically the same configuration as the ZeroSensor camera except with an extra board in the camera path that can perform inference on the captured images. The kit comes with some frozen graphs that can be used to detect a few things but I thought it would be interesting to try training a MobileNet SSD network with the Pascal VOC 2012 training data which can identify 20 different objects. The instructions for how to do this are here.

Once that was all running, the next step was to integrate it with rt-ai Edge. It’s pretty similar to the earlier full-blown TensorFlow version so it didn’t take too long to get working.

The design is much the same as usual except with the new VisionKit object detection SPE instead of TFObjectDetect or Deeplab. Note that the PiCam and VisionKit SPEs are running on the AIY Vision Kit, whereas the MediaView SPE is running on a desktop.

This is the output from the MediaView SPE. The metadata has been formatted to look exactly the same as the previous TensorFlow detector so that they can be used interchangeably in stream processing networks. I am getting about 2 fps with 640 x 360 images which is actually better than I expected.

Integrating TensorFlow object detection into rt-ai Edge

I have been using DeepLabv3 for a while now for object detection but I thought it would be interesting to try some examples from the TensorFlow object detection repo. I now have an rt-ai Edge stream processing element that is based on the Jupyter notebook example in the repo. Presumably this will work with any of the models in the model zoo although I am just using the default one for now.

As you can see from the preview capture above (apart from the nasty looking grass on the left) it picks out the car happily, although not with a great confidence level. Maybe it doesn’t like the elevated camera position or the car is a bit too far away or a difficult pose – I will need to do some more experiments. With the preview display on (using PyGame) I am only getting 1 fps with 1280 x 720 frames from the camera which is a little disappointing. However, with preview turned off (the normal production mode anyway), I am getting over 15fps which is entirely adequate.

The capture above shows the raw image along with the object recognition data in the form of metadata rather than drawn on the image. This is actually pretty useful for both real-time and offline processing (such as a machine learning run). Capturing the original image does have the advantage that alternate object detectors could be run at any time, at the expense of having to store more data. Real-time actions can be based on the metadata and the raw image just discarded.

Anyway, definitely a work in progress. It will be interesting to see how it compares with the DeepLabv3 version as the implementation gets more efficient. What’s nice is that it is trivial to swap out one object detector for another or run them in parallel in order to run tests. Just takes a few seconds with the rtaiDesigner GUI.