Using multiple Neural Compute Sticks with OpenVINO

As I had discovered, one Neural Compute Stick 2 (NCS 2) has pretty decent throughput. The question then is: what happens if you connect more than one of these to the same machine? I only have one NCS 2 and one of the older NCS devices to test this out but that combination worked ok with some tuning. OpenVINO manages allocation of requests to physical devices so there is no explicit way for this to be controlled via the API. However, it appears that multiple SPEs on the same node can be supported as then the NCSs are divided up between the SPEs. A reset error message is typically emitted but then everything seems to work fine.

To get the best performance, I ran in async mode using multiple ExecutableNetwork/InferRequest pairs, with the actual number being configurable from the rtaiDesigner GUI. In this case, 5 pairs gave the best results. The throughput is around 18 frames per second running ssd_mobilenet_v2_coco object detection.

Using one NCS at a time, the NCS 2 was able to process 12 frames per second (versus 9 frames per second in synchronous mode using the original SPE code) while the older NCS was able to process 6 frames per second,  suggesting that both were being fully utilized.

Now I need to get a second NCS 2…

SSD object detection using the Neural Compute Stick 2 now has its own rt-ai stream processing element

Turned out to be pretty easy to integrate the ssd_mobilenet_v2_coco model compiled for the Intel NCS 2 into rt-ai Edge. Since it doesn’t use the GPU, I was able to run this and the YOLOv3 SPE on the same machine which is kind of amusing – one YOLOv3 instance tends to chew up most of the GPU memory, unfortunately, so the GPU can’t be shared. I would have liked to have run YOLOv3 on the NCS 2 for direct comparison but could not. The screen capture above shows the MediaView SPE output for both detectors running on the same 1280 x 720 video stream.

This is the design and it is showing the throughput of each detection SPE – 14 fps for the GTX 1080 ti YOLO and 9 fps for the NCS 2 based SSD. Not exactly a fair comparison, however, but still interesting. It would be much better if I had the same model running using a GPU of course. Right now, the GPU-based SPE that can run ssd_mobilenet_v2_coco (and similar models) is Python based and that (not surprisingly) runs a fair bit slower than the compiled C++ versions I am using here.

ssd_mobilenet_v2_coco running on the Intel Neural Compute Stick 2

I had more luck running the ssd_mobilenet_v2_coco model from the TensorFlow model detection zoo on the NCS 2 than I did with YOLOv3. To convert from the .pb file to the OpenVINO-friendly files I used:

python3 --input_model ssdmv2.pb --tensorflow_use_custom_operations_config ./extensions/front/tf/ssd_v2_support.json --tensorflow_object_detection_api_pipeline_config ssdmv2_pipeline.config --data_type FP16

In this case, I had renamed the frozen_instance_graph.pb from the download as ssdmv2.pb and renamed the pipeline.config file from the download as ssdmv2_pipeline.config. The screen capture above shows the object_detection_demo_ssd_async demo app running with the NCS 2. I didn’t sort out the labels for this test which is why it is just displaying numbers for the detected objects.

I also tried this using the CPU (using –data_type FP32) with this result:

It is worth noting that the video was running at 1920 x 1080 which is a significant challenge for just about anything. The CPU (an i7 5820K) is obviously a fair bit faster than the NCS 2 but a real advantage is the small physical footprint, low price, low power and CPU offload that the Myriad X VPU in the NCS 2 offers.

Running YOLOv3 with OpenVINO on CPU and (not) NCS 2

Since OpenVINO is the software framework for the Neural Compute Stick 2, I thought it would be interesting to get the OpenVINO YOLOv3 example up and running. While the toolkit download does include a number of models, YOLOv3 isn’t one of them. Instead, the model has to be created from a TensorFlow version.

The instructions here describe how to do this. Steps 1 and 2 are fine but it is kind of awkward how the .pb file is generated so I created a new simple script to do this:

# -*- coding: utf-8 -*-

import numpy as np
import tensorflow as tf
from tensorflow.python.framework import graph_io

from yolo_v3 import yolo_v3, load_weights, detections_boxes, non_max_suppression

def load_coco_names(file_name):
    names = {}
    with open(file_name) as f:
        for id, name in enumerate(f):
            names[id] = name
    return names
def main(argv):

    classes = load_coco_names("coco.names")

    # placeholder for detector inputs
    inputs = tf.placeholder(tf.float32, [None, 416, 416, 3])

    with tf.variable_scope('detector'):
        detections = yolo_v3(inputs, len(classes), data_format='NHWC')
        load_ops = load_weights(tf.global_variables(scope='detector'), "yolov3.weights")

    boxes = detections_boxes(detections)

    with tf.Session() as sess:
        frozen = tf.graph_util.convert_variables_to_constants(sess, sess.graph_def, ['concat_1'])
        graph_io.write_graph(frozen, './', 'yolo_v3.pb', as_text=False)

if __name__ == '__main__':

This has the important filenames hardcoded – you just need to put yolo_v3.weights and coco.names in the tensorflow-yolo-v3 directory. Run the script above with:


and the yolo_v3.pb file should be created. Copy this into the model_optimizer directory, set that as the current directory and run:

python3 --input_model yolo_v3.pb --tensorflow_use_custom_operations_config ./extensions/front/tf/yolo_v3.json --input_shape [1,416,416,3]

The –input_shape parameter is needed as otherwise it blows up due to getting -1 for the mini-batch size. I just forced this to 1 and it was happy.

The result is in yolo_v3.xml and yolo_v3.bin. These can be used with the demo object_detection_demo_yolov3_async and an example output is shown in the screen capture above. Note that it is necessary to run the following:


in the same terminal session as the demo will be run in order for CPU mode to work.

By default, the output just annotates the boxes with label numbers rather than readable labels. To get readable labels, copy coco.names to yolo_v3.labels and put it in the same directory as the xml file. One problem is that the label file reader doesn’t handle spaces in the labels. Rather than mess with the code, I just changed the spaces in the yolo_v3.labels file to underlines. Otherwise it thinks a mouse is a donut and a monitor a dog which is a little confusing.

However, what I really wanted to do was to run this on the NCS 2. The model as generated is FP32 and the NCS 2 wants FP16. Adding –data_type FP16 to the command line fixes that but unfortunately it reports that the NCS 2 doesn’t support the Resample layer which is used by YOLOv3. If I had been smart I would have noticed that the usage info only mentions CPU and GPU :-(. Interestingly, the table of supported layers indicates that both Resample and Interp are supported on MYRIAD so I do not know what is going on here.

I did try changing the offending tf.image_resize_nearest_neighbor call into a tf.image.resize.bilinear call (by editing in the tensorflow-yolo-v3 directory). This maps to Interp instead of Resample in the OpenVINO IR.  This worked fine in CPU mode but still failed to run on the NCS 2 except in a different way:

Not sure if that is a bug or intended. Anyway, that seems to be the end of the road with running YOLOv3 on the NCS 2 for the moment at least. However, there are a lot of things that do run on the NCS 2 very nicely. Still, YOLOv3 had started to become my standard way of checking inference things out, just like my strategy of evaluating restaurants by the quality of their Caesar salad – at least in the days when you could still get them!

Intel Neural Compute Stick 2

An Intel Neural Compute Stick 2 (NCS 2) just turned up. It will be interesting to see how it compares to the earlier version. The software supporting it is now OpenVINO which is pretty interesting as it makes it relatively easy to move models across multiple hardware platforms.

The NCS 2 is really the best edge inference hardware engine that is generally available as far as I am aware. Hopefully, one day, the Edge TPU will be generally available and there seem to be many more in the pipeline. Typically these edge inference devices do not support RNNs or related architectures but, in the short term, that isn’t a problem as CNNs and DNNs are probably most useful at the edge at the moment, being very effective at compressing audio and video streams down to low rate information streams for example.

A nice feature of the NCS 2 is that it is easy to connect multiple examples to a single powerful CPU. The combination of a reasonably powerful CPU along with dedicated inference hardware is pretty interesting and is an ideal architecture for an rt-ai Edge node as it happens.

Detecting what’s coming up the driveway with YOLOv3

It is hardly an original desire to want to know who or what is coming up the driveway. As a step along that road (as it were), I used my YOLO workflow to train YOLOv3 on a few things likely to be seen there. With my usual impatience, this test captured above was performed with an early set of weights (at around 1200 iterations) but actually seemed to work reasonably well and was easily able to differentiate between the different vehicle types and makes. Training is continuing again now but it is nice to know that it is going to work. I am training it to detect a range of vehicles, including UPS trucks and mail vans.

One thing I don’t know as yet is the situation with false positives – will random cars and trucks trigger one of the learned classes or not? Time will tell. If so, I’ll probably have to include some negative examples in the training set that includes examples of other types of vehicles that I don’t want to detect. Or, put all these other examples into a new general vehicle class. Not sure which is best at this point.

This is the fairly boring rt-ai Edge design that’s using the new model. It is basically passing the video frames through CYOLO and then pushing out to Manifold where it is being stored and can be viewed in real-time. This is running full time now so I will be able to look back and see how the detection performs in real life. In addition, selected and annotated frames from the stored data can be recycled to add to the training data in a future training cycle.

I could go crazy and use the license reading SPE to be much more specific about the individual vehicles. However, I still don’t have the right sort of cameras to make that work effectively.

Ok, so now that I have YOLO producing metadata indicating what is moving on the driveway, I then need to process that into useful information. That’s going to require a new SPE to process and filter the raw detections so that I can get real-time alerts for interesting events.

CYOLO – a pure C++ implementation of a YOLOv3 SPE for rt-ai

The Python-based YOLOv3 SPE has been working for a while now but the performance was a little disappointing at 2 or 3 fps using 1280 x 720 frames on an i7 5820K CPU/GTX 1080ti GPU machine. I was interested to see how much effect the Python code was having on overall performance. To do this I implemented the C++ rt-ai SPE API and added the C version of the YOLOv3 demo code. The result is shown above and this version now runs at just over 14 fps (17 fps at 640 x 480) which is very usable.

While Python is very convenient, it is clearly (and unsurprisingly) more efficient to use C/C++ so I will probably do that in the future where possible. The main side-effect is that rtaiDesigner has to deploy the correct compiled SPE for the target node (typically x64 or ARM) and that any shared libraries that are not part of the standard install are included too. A Dockerized version would of course solve the dependency problem and just require a container for each target architecture.