Running YOLOv3 with OpenVINO on CPU and (not) NCS 2


Since OpenVINO is the software framework for the Neural Compute Stick 2, I thought it would be interesting to get the OpenVINO YOLOv3 example up and running. While the toolkit download does include a number of models, YOLOv3 isn’t one of them. Instead, the model has to be created from a TensorFlow version.

The instructions here describe how to do this. Steps 1 and 2 are fine, but the way the .pb file is generated is a bit awkward, so I created a simple script to do it:

# -*- coding: utf-8 -*-

import numpy as np
import tensorflow as tf
from tensorflow.python.framework import graph_io

from yolo_v3 import yolo_v3, load_weights, detections_boxes, non_max_suppression

def load_coco_names(file_name):
    names = {}
    with open(file_name) as f:
        for id, name in enumerate(f):
            names[id] = name
    return names
    
def main(argv):

    classes = load_coco_names("coco.names")

    # placeholder for detector inputs
    inputs = tf.placeholder(tf.float32, [None, 416, 416, 3])

    with tf.variable_scope('detector'):
        detections = yolo_v3(inputs, len(classes), data_format='NHWC')
        load_ops = load_weights(tf.global_variables(scope='detector'), "yolov3.weights")

    boxes = detections_boxes(detections)

    with tf.Session() as sess:
        # load the original Darknet weights into the TensorFlow variables
        sess.run(load_ops)
        # freeze the graph ('concat_1' is the output node) and write out the .pb file
        frozen = tf.graph_util.convert_variables_to_constants(sess, sess.graph_def, ['concat_1'])
        graph_io.write_graph(frozen, './', 'yolo_v3.pb', as_text=False)

if __name__ == '__main__':
    tf.app.run()

This has the important filenames hardcoded – you just need to put yolov3.weights and coco.names in the tensorflow-yolo-v3 directory. Run the script above with:

python3 script.py

and the yolo_v3.pb file should be created. Copy this into the model_optimizer directory, set that as the current directory and run:

python3 mo_tf.py --input_model yolo_v3.pb --tensorflow_use_custom_operations_config ./extensions/front/tf/yolo_v3.json --input_shape [1,416,416,3]

The --input_shape parameter is needed because otherwise the conversion blows up on getting -1 for the mini-batch size. I just forced this to 1 and it was happy.

The result is in yolo_v3.xml and yolo_v3.bin. These can be used with the demo object_detection_demo_yolov3_async and an example output is shown in the screen capture above. Note that it is necessary to source the setup script:

source ~/intel/computer_vision_sdk/bin/setupvars.sh

in the same terminal session in which the demo will be run, otherwise CPU mode will not work.

By default, the output just annotates the boxes with label numbers rather than readable labels. To get readable labels, copy coco.names to yolo_v3.labels and put it in the same directory as the xml file. One problem is that the label file reader doesn’t handle spaces in the labels. Rather than mess with the code, I just changed the spaces in the yolo_v3.labels file to underscores. Otherwise it thinks a mouse is a donut and a monitor a dog, which is a little confusing.
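If you want to script that step, something like this minimal sketch does the copy and the substitution in one go (filenames as above; the script itself is just my own convenience):

# Copy coco.names to yolo_v3.labels, replacing spaces with underscores
with open("coco.names") as fin, open("yolo_v3.labels", "w") as fout:
    for line in fin:
        fout.write(line.rstrip("\n").replace(" ", "_") + "\n")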

However, what I really wanted to do was to run this on the NCS 2. The model as generated is FP32 and the NCS 2 wants FP16. Adding --data_type FP16 to the mo_tf.py command line fixes that, but unfortunately it then reports that the NCS 2 doesn’t support the Resample layer which is used by YOLOv3. If I had been smart I would have noticed that the usage info only mentions CPU and GPU :-(. Interestingly, the table of supported layers indicates that both Resample and Interp are supported on MYRIAD, so I do not know what is going on here.
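For reference, the FP16 conversion is just the earlier command with the extra flag added:

python3 mo_tf.py --input_model yolo_v3.pb --tensorflow_use_custom_operations_config ./extensions/front/tf/yolo_v3.json --input_shape [1,416,416,3] --data_type FP16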

I did try changing the offending tf.image.resize_nearest_neighbor call into a tf.image.resize_bilinear call (by editing yolo_v3.py in the tensorflow-yolo-v3 directory). This maps to Interp instead of Resample in the OpenVINO IR. This worked fine in CPU mode but it still failed to run on the NCS 2, although with a different error.
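The edit itself is a one-line change in the upsampling code, something along these lines (the variable names here are illustrative rather than copied from yolo_v3.py):

# before: nearest neighbour upsampling, which becomes a Resample layer in the IR
inputs = tf.image.resize_nearest_neighbor(inputs, (new_height, new_width))

# after: bilinear upsampling, which becomes an Interp layer instead
inputs = tf.image.resize_bilinear(inputs, (new_height, new_width))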


I am not sure if that NCS 2 failure is a bug or intended. Anyway, that seems to be the end of the road for running YOLOv3 on the NCS 2, for the moment at least. However, there are a lot of things that do run on the NCS 2 very nicely. Still, YOLOv3 had started to become my standard way of checking out inference setups – rather like my strategy of evaluating restaurants by the quality of their Caesar salad, at least in the days when you could still get them!

Simplified workflow for YOLOv3 retraining

Following on from the previous post, I have now put together a pretty usable workflow for creating custom YOLOv3 models – the code and instructions are here. There are quite a few alternatives out there already but it was interesting putting this together from a learning point of view. The screen capture above was taken during some testing. I stopped the training early (which is why the probabilities are pretty low) so that I could test the weights with an rt-ai stream processing network design and then restarted the training. The tools automatically generate customized scripts to train and restart training, making this pretty painless.

There is a tremendous amount of valuable information here, including the code for the custom anchor generator that I have integrated into my workflow. I haven’t tried this enhanced version of Darknet yet but will do that soon. One thing I did learn from that repo is that there is an option to treat mirror-image objects as distinct objects – no doubt that was what was hindering accurate detection of the left and right motion controllers previously.

Accessing the iOS WiFi IP address within a Unity app

At the moment, Manifold requires that a client supplies an appropriate IP address (although I might change this in the future). Mostly it is pretty easy to do but the .NET way of doing things (using Dns.GetHostEntry()) didn’t seem to pick up the WiFi IP address on an iPad. After wasting a lot of time, I decided to go back to basics and create a native plugin.

The basis of the code comes from here – it just needed the right wrapping to get it to work with Unity. I will be the first to admit that I know nothing about native iOS coding but, following the Bonjour example here, the code below seemed to work just fine when placed in the Assets/Plugins/iOS directory of the project.

IPAddress.h:

#import <Foundation/Foundation.h>

@interface IPAddressDelegate : NSObject

- (NSString *)getAddress;
@end

IPAddress.m:

#include <ifaddrs.h>
#include <arpa/inet.h>

#import "IPAddress.h"
@implementation IPAddressDelegate

- (id)init
{
    self = [super init];
    return self;
}

- (NSString *)getAddress {
    NSString *address = @"error";
    struct ifaddrs *interfaces = NULL;
    struct ifaddrs *temp_addr = NULL;
    int success = 0;
    success = getifaddrs(&interfaces);
    if (success == 0) {
        temp_addr = interfaces;
        while(temp_addr != NULL) {
            if(temp_addr->ifa_addr->sa_family == AF_INET) {
                // en0 is the WiFi interface on iOS devices
                if([[NSString stringWithUTF8String:temp_addr->ifa_name] isEqualToString:@"en0"]) {
                    address = [NSString stringWithUTF8String:inet_ntoa(((struct sockaddr_in *)temp_addr->ifa_addr)->sin_addr)];
                }
            }
            temp_addr = temp_addr->ifa_next;
        }
    }
    // Free memory
    freeifaddrs(interfaces);
    return address;
}
@end

static IPAddressDelegate* delegateObject = nil;

// Return a heap-allocated copy since the managed (C#) side takes ownership of the returned string
char* MakeStringCopy (const char* string)
{
    if (string == NULL)
        return NULL;
    
    char* res = (char*)malloc(strlen(string) + 1);
    strcpy(res, string);
    return res;
}

const char * getLocalWifiIpAddress()
{
    if (delegateObject == nil)
        delegateObject = [[IPAddressDelegate alloc] init];
    
    return MakeStringCopy([[delegateObject getAddress] UTF8String]);
}

To use the plugin is pretty straightforward. Just add this declaration to a C# class:

	[DllImport ("__Internal")]
	private static extern string getLocalWifiIpAddress();

Then call getLocalWifiIpAddress() to get the dotted address string.
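For completeness, here is a minimal sketch of how this might be wrapped in a MonoBehaviour (the class itself is my own illustration, not part of Manifold; note the System.Runtime.InteropServices using directive needed for DllImport):

using System.Runtime.InteropServices;
using UnityEngine;

public class WifiAddressExample : MonoBehaviour
{
    // Maps onto getLocalWifiIpAddress() in the native iOS plugin
    [DllImport("__Internal")]
    private static extern string getLocalWifiIpAddress();

    void Start()
    {
#if UNITY_IOS && !UNITY_EDITOR
        // The native plugin is only available in the iOS player build
        Debug.Log("WiFi IP address: " + getLocalWifiIpAddress());
#endif
    }
}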

rt-xr sentient space visualization now on iOS!

I have to admit, I am in a state of shock right now. For some reason today I decided to try to get the rt-xr Viewer software working on iOS. After all, it worked fine on Windows desktop, UWP (Windows MR), macOS and Android, so why not? However, I expected endless trouble with the Manifold library but, as it turned out, getting it to work on iOS was trivial. I guess Unity and .NET magic came together so that, once again, I didn’t have to do too much work. In fact, the hardest part was sorting out microphone permission, and even that wasn’t too hard – this thread certainly helped with that. Avatar pose sharing, audio sharing, proxy objects, video and sensor feeds all work perfectly.

The nice thing now is that most (if not all) of the further development is intrinsically multi-platform.

Streaming PCM audio from Unity on Android

The final step in adding audio support to rt-xr visualization was to make it work with Android. Supporting audio capture natively on Windows desktop and Windows UWP was relatively easy since it could all be done in C#. However, I didn’t really want to implement a native capture plugin for Android, and it turns out that the Unity capture technique works pretty well, albeit with noticeable latency.

The Inspector view in the screen capture shows the idea. The MicrophoneFilter script starts up the Unity Microphone and adds it to the AudioSource. When running, the output of the AudioSource is passed to MicrophoneFilter via the OnAudioFilterRead method that gives access to the PCM stream from the microphone.

The resulting stream needs some processing, however. I am sending single-channel PCM audio at 16000 samples per second on the network, whereas the output of the AudioSource is stereo, at either 16000 or 48000 samples per second depending on the platform, and uses floating point rather than 16 bit values, so the code has to convert all of this. It also needs to zero out the output of the filter, otherwise it will be picked up by the listener on the main camera, which is certainly not desirable! There is an alternative way of doing this that uses the AudioSource.clip.GetData call directly, but I had problems with that and in any case prefer the asynchronous OnAudioFilterRead callback to polling from Update or FixedUpdate. The complete MicrophoneFilter script looks like this:

using UnityEngine;

[RequireComponent(typeof(AudioSource))]
public class MicrophoneFilter : MonoBehaviour
{
    [Tooltip("Index of microphone to use")]
    public int deviceIndex = 0;

    private StatusUpdate statusUpdate;
    private bool running = false;
    private byte[] buffer = new byte[32000];
    private int scale;

    // Use this for initialization
    void Start()
    {

        AudioSource source = GetComponent<AudioSource>();

        if (deviceIndex >= Microphone.devices.Length)
            deviceIndex = 0;

        GameObject scripts = GameObject.Find("Scripts");
        statusUpdate = scripts.GetComponent<StatusUpdate>();

        int sampleRate = AudioSettings.outputSampleRate;

        // The output rate is assumed to be either 16000 or 48000; at 48000
        // every third sample is kept to get down to 16000 samples per second
        if (sampleRate > 16000)
            scale = 3;
        else
            scale = 1;

        source.clip = Microphone.Start(Microphone.devices[deviceIndex], true, 1, sampleRate);
        source.Play();
        running = true;
    }

    // Called on the audio thread with the AudioSource output; converts it to
    // 16 bit mono PCM and zeroes the samples so they are not played out locally
    private void OnAudioFilterRead(float[] data, int channels)
    {
        if (!running)
            return;

        int byteIndex = 0;
        if (channels == 1) {
            for (int i = 0; i < data.Length;) {
                short val = (short)((data[i]) * 32767.0f);
                for (int offset = 0; offset < scale; offset++) {
                    if (i < data.Length) 
                        data[i++] = 0; 
                } 
                buffer[byteIndex++] = (byte)(val & 0xff); 
                buffer[byteIndex++] = (byte)((val >> 8) & 0xff);
            }
        } else {
            for (int i = 0; i < data.Length;) {
                short val = (short)((data[i] + data[i + 1]) * 32767.0f / 2.0f);
                for (int offset = 0; offset < 2 * scale; offset++) {
                    if (i < data.Length) 
                        data[i++] = 0; 
                } 
                buffer[byteIndex++] = (byte)(val & 0xff); 
                buffer[byteIndex++] = (byte)((val >> 8) & 0xff);
            }
        }
        statusUpdate.newAudioData(buffer, byteIndex);
    }
}

Note the fixed, maximum-size buffer allocation, which is there to avoid excessive garbage collection. In general, the code uses fixed maximum-size buffers wherever possible.

The SharingServer has now been updated to generate separate feeds for VR and AR/MR users with all user audio feeds in the VR version and just VR headset users’ audio in the MR version. The audio update rate has also been decoupled from the avatar pose update rate. This allows a faster update rate for pose updates than makes sense for audio.

Just a note on why I am using single channel 16 bit PCM at 16000 samples per second rather than sending single channel floats at 48000 samples per second, which would be a better fit in many cases. The problem is that this makes the data rate 6 times higher: 16000 samples per second at 16 bits is 256kbps, while 48000 samples per second at 32 bits per float is 1.536Mbps. Using uncompressed 16 bit audio and dealing with the consequences seemed like a better trade than either the higher data rate or moving to compressed audio. This decision may have to be revisited when running on real MR headset hardware, however.

Proxy objects: Unity assets that are UI extensions of remote servers

For some reason I often end up back at the analog clock for trying out new ideas. I guess it is because it is pretty trivial to operate a clock – just supply three angles. In this case, the clock is a proxy object which is in many ways just a simple extension of the system that animates the avatars for other occupants of a sentient space. A proxy object is a conventional Unity GameObject hierarchy that has certain specially named child nodes. By itself, there’s nothing special about the Unity asset part of a proxy object – it could be an asset included in the app or an asset downloaded from a server using Unity’s asset bundle system. Either way, these specially named nodes can be linked to external servers. In this case, the SharingServer generates an analog clock stream that animates the clock hands. The clock definition is contained in the space definition file that instantiates all the other parts of the scene.

In principle, interaction (i.e. sending stuff back to the remote server) can be added by using specially named nodes to attach scripts that are hard-coded in the app. I haven’t tried this yet but see no reason why it wouldn’t work. The key point is that proxy objects leverage standard scripts in the app as opposed to customized scripts for every asset.

Right now, you can modify the local scale, local position, local orientation, color and text (if associated with a TextMesh) of any of the GameObjects in an asset’s hierarchy. This could easily be extended to other things, including updating a texture with a new image. For example, a virtual fireplace could be created where the flames are animated by constantly varying the textures being displayed. The system is still simplistic, however, as there are no mechanisms for controlling transitions (such as lerping between positions or fading between textures), but this could certainly be added without too much difficulty.

Just for reference, the analog clock stream message looks like this:

{
    "type": "proxyobject",
    "updateList": [
        {
            "name": "PO_AnalogClock_Second",
            "orientation": {
                "x": 0,
                "y": 222,
                "z": 0
            },
            "orientationValid": true
        },
        {
            "name": "PO_AnalogClock_Minute",
            "orientation": {
                "x": 0,
                "y": 342,
                "z": 0
            },
            "orientationValid": true
        },
        {
            "name": "PO_AnalogClock_Hour",
            "orientation": {
                "x": 0,
                "y": 568,
                "z": 0
            },
            "orientationValid": true
        }
    ]
}

Here the y value encodes the relevant hand angle. The hour angle is greater than 360 degrees because the hour hand position is computed from the 24-hour time, but since rotations wrap at 360 degrees the result is the same either way (568 degrees is equivalent to 208 degrees, i.e. a few minutes before seven o'clock).

Using blockchain technology to create verifiable sensor records and detect fakes

These days, machine learning techniques have led to the ability to create very realistic but fake video and audio that can be tough to distinguish from the real thing. The video above shows a very interesting example of this capability. The problem with this technology is that it will become impossible to determine if anything is genuine at all. What’s needed is some verification that a video of someone (for example) really is that person. Blockchain technology would seem to provide a solution for this.

Many years ago I was working on a digital watermarking-based system for detecting tampering in video records. Essentially, this embedded error-correcting codes in each frame that could be used to determine if any region of a frame had been modified after the digital watermark had been added. Cameras would add the digital watermark at source, limiting the opportunity for modification prior to watermarking.

One problem with this is that it worked on a frame by frame basis but didn’t ensure the integrity of an entire sequence. In theory this could be done with temporally distributed watermarks but blockchain technology provides a very nice alternative.

A simple strategy would be to have the sensor (camera, microphone, motion detector, whatever) create a hash for each unit of data (video frame, chunk of audio etc) and add this to a blockchain. Then a review app could create new hashes from the sensor data itself (stored elsewhere) and compare them to those in the blockchain. It could also determine that the account owner or device is who or what it is supposed to be in order to avoid spoofing. It’s easy to envisage an Ethereum smart contract being the basis of such a system.
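As a minimal sketch of the hashing side of this (ignoring the blockchain storage itself, which could be anything from a smart contract to a private ledger), the capture and review steps might look something like:

import hashlib

def chunk_hash(chunk: bytes) -> str:
    # One digest per unit of sensor data (video frame, block of audio samples, ...)
    return hashlib.sha256(chunk).hexdigest()

def record_hashes(chunks):
    # Capture side: this list stands in for the hashes committed to the blockchain
    return [chunk_hash(c) for c in chunks]

def verify(chunks, recorded_hashes):
    # Review side: recompute the hashes from the stored sensor data and compare
    # them with the ones committed at capture time
    if len(chunks) != len(recorded_hashes):
        return False
    return all(chunk_hash(c) == h for c, h in zip(chunks, recorded_hashes))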

One issue with this is the potential rate at which hashes need to be added to the blockchain. This rate could be reduced by collecting more data (e.g. accumulating one second’s worth of data to generate one hash) or creating a hash of hashes at an appropriate rate. The only downside to this is losing temporal resolution of where changes have been made.
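The hash-of-hashes idea is just a roll-up of the per-chunk digests; continuing the sketch above:

def rollup_hash(chunk_hashes):
    # Combine a batch of per-chunk hashes into a single digest so that only this
    # value needs to be committed to the blockchain for the whole batch
    return hashlib.sha256("".join(chunk_hashes).encode()).hexdigest()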

It’s worth considering the effects of lossy compression. Obviously if a stream is uncompressed or only uses lossless compression, watermarking and hash generation can be done at a very early stage. Watermarking of video is designed to withstand compression, so that can still be done at a very early stage, even with lossy compression. The hash has to be bit-accurate with the stream as stored on the video storage medium though, so the hash must be computed after lossy compression.

It seems as though this blockchain concept could definitely be made to work and possibly combined with the digital watermarking technique in the case of video to provide temporal and spatial resolution of tampering. I am sure that variations of this concept are out there already or being developed and maybe, one day, it will be possible for anybody to check if a video of a well-known person is real or fake.