Converting screen coordinates + depth into spatial coordinates for OpenPose…or anything else really

Depth cameras are wonderful things but they typically only give a distance associated with each (x, y) coordinate in screen space. Converting to spatial coordinates involves some calculation. One thing to note is that I am ignoring camera calibration, which is required to get the best accuracy. See this page for details of how to use calibration data in iOS for example. I have implemented this calculation for the iPad TrueDepth camera and also the ZED stereo camera to process OpenPose joint data, and it seems to work, but I cannot guarantee complete accuracy!

The concept for the conversion is shown in the diagram above. One can think of the 2D camera image as being mapped to a screen plane – the blue plane in the diagram. The width and height of the plane are determined by its distance from the camera and the camera’s field of view. Using the iPad as an example, you can get the horizontal and vertical camera field of view angles (hFOV and vFOV in the diagram) like this:

hFOV = captureDevice.activeFormat.videoFieldOfView * Float.pi / 180.0
vFOV = 2 * atan(height / width * tan(hFOV / 2))
tanHalfHFOV = tan(hFOV / 2) 
tanHalfVFOV = tan(vFOV / 2)

where width and height are the width and height of the 2D image. This calculation can be done once at the start of the session since it is defined by the camera itself.

For the Stereolabs ZED camera (this is a partial code extract):

#include <sl_zed/Camera.hpp>

sl::Camera zed;
sl::InitParameters init_params;

// set up params here
if (zed.open(init_params) != sl::SUCCESS) {
    exit(-1);
}

sl::CameraInformation ci = zed.getCameraInformation();
sl::CameraParameters cp = ci.calibration_parameters.left_cam;
hFOV = cp.h_fov * M_PI / 180.0f;   // h_fov and v_fov are reported in degrees
vFOV = cp.v_fov * M_PI / 180.0f;
tanHalfHFOV = tan(hFOV / 2);
tanHalfVFOV = tan(vFOV / 2);

To pick up the depth value, you just look up the hit point (x, y) coordinate in the depth buffer. For the TrueDepth camera and the ZED, this appears to be the perpendicular distance from the camera to the plane that contains the target point and is perpendicular to the camera's look-at direction – the yellow plane in the diagram. Other types of depth sensor might instead give the radial distance from the camera to the hit point, which will obviously require a slightly modified calculation. Here I am assuming that the depth buffer contains the perpendicular distance – call this spatialZ.

What we need now are the tangents of the angles between the ray from the camera to the screen plane hit point and the camera's look-at direction, split into horizontal and vertical components – call these angles ThetaX (horizontal) and ThetaY (vertical). Given the perpendicular distance to the yellow plane, we can then easily calculate the spatial x and y coordinates using the field of view tangents previously calculated:

tanThetaX = (x - Float32(width / 2)) / Float32(width / 2) * tanHalfHFOV
tanThetaY = (y - Float32(height / 2)) / Float32(height / 2) * tanHalfVFOV

spatialX = spatialZ * tanThetaX
spatialY = spatialZ * tanThetaY

The coordinates (spatialX, spatialY, spatialZ) are in whatever units the depth buffer uses (often meters) and are in the camera's coordinate system. Converting from the camera's coordinate system to world coordinates is a standard operation given the camera's pose in world space.
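
Putting the pieces together, here is a minimal sketch of the whole conversion as a single function (written in C# purely for convenience; the class, function and parameter names are mine, and the radial-distance handling in the comment is just the modification mentioned above):

using System;

public static class DepthToSpatial
{
    // Converts a screen-space point (x, y) plus its depth reading into
    // camera-space coordinates. Assumes the depth buffer holds the
    // perpendicular distance (spatialZ), as with the TrueDepth and ZED cameras.
    public static void ScreenToSpatial(
        float x, float y, float depth,
        float width, float height,
        float tanHalfHFOV, float tanHalfVFOV,
        out float spatialX, out float spatialY, out float spatialZ)
    {
        float tanThetaX = (x - width / 2) / (width / 2) * tanHalfHFOV;
        float tanThetaY = (y - height / 2) / (height / 2) * tanHalfVFOV;

        spatialZ = depth;
        // If a sensor reports the radial distance to the hit point instead,
        // recover the perpendicular distance first:
        // spatialZ = depth / (float)Math.Sqrt(1 + tanThetaX * tanThetaX + tanThetaY * tanThetaY);

        spatialX = spatialZ * tanThetaX;
        spatialY = spatialZ * tanThetaY;
    }
}

Depending on the screen coordinate convention (y usually increases downwards) and the handedness of the target coordinate system, spatialY may also need to be negated.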

Running YOLOv3 with OpenVINO on CPU and (not) NCS 2


Since OpenVINO is the software framework for the Neural Compute Stick 2, I thought it would be interesting to get the OpenVINO YOLOv3 example up and running. While the toolkit download does include a number of models, YOLOv3 isn’t one of them. Instead, the model has to be created from a TensorFlow version.

The instructions here describe how to do this. Steps 1 and 2 are fine, but the way the .pb file is generated is kind of awkward, so I created a simple new script to do it:

# -*- coding: utf-8 -*-

import tensorflow as tf
from tensorflow.python.framework import graph_io

# helper functions from the tensorflow-yolo-v3 repo
from yolo_v3 import yolo_v3, load_weights, detections_boxes

def load_coco_names(file_name):
    # build a dictionary mapping class id -> class name from coco.names
    names = {}
    with open(file_name) as f:
        for id, name in enumerate(f):
            names[id] = name
    return names

def main(argv):

    classes = load_coco_names("coco.names")

    # placeholder for detector inputs
    inputs = tf.placeholder(tf.float32, [None, 416, 416, 3])

    # build the YOLOv3 graph and the ops that load the darknet weights into it
    with tf.variable_scope('detector'):
        detections = yolo_v3(inputs, len(classes), data_format='NHWC')
        load_ops = load_weights(tf.global_variables(scope='detector'), "yolov3.weights")

    # convert the raw detections into box coordinates; this adds the output
    # node ('concat_1') that the graph is frozen at below
    boxes = detections_boxes(detections)

    with tf.Session() as sess:
        # load the weights, freeze the graph and write it out as yolo_v3.pb
        sess.run(load_ops)
        frozen = tf.graph_util.convert_variables_to_constants(sess, sess.graph_def, ['concat_1'])
        graph_io.write_graph(frozen, './', 'yolo_v3.pb', as_text=False)

if __name__ == '__main__':
    tf.app.run()

This has the important filenames hardcoded – you just need to put yolov3.weights and coco.names in the tensorflow-yolo-v3 directory. Run the script above with:

python3 script.py

and the yolo_v3.pb file should be created. Copy this into the model_optimizer directory, set that as the current directory and run:

python3 mo_tf.py --input_model yolo_v3.pb --tensorflow_use_custom_operations_config ./extensions/front/tf/yolo_v3.json --input_shape [1,416,416,3]

The --input_shape parameter is needed as otherwise it blows up due to getting -1 for the mini-batch size. I just forced this to 1 and it was happy.

The result is in yolo_v3.xml and yolo_v3.bin. These can be used with the demo object_detection_demo_yolov3_async and an example output is shown in the screen capture above. Note that it is necessary to run the following:

source ~/intel/computer_vision_sdk/bin/setupvars.sh

in the same terminal session in which the demo will be run in order for CPU mode to work.

By default, the output just annotates the boxes with label numbers rather than readable labels. To get readable labels, copy coco.names to yolo_v3.labels and put it in the same directory as the xml file. One problem is that the label file reader doesn't handle spaces in the labels. Rather than mess with the code, I just changed the spaces in the yolo_v3.labels file to underscores. Otherwise it thinks a mouse is a donut and a monitor is a dog, which is a little confusing.

However, what I really wanted to do was to run this on the NCS 2. The model as generated is FP32 and the NCS 2 wants FP16. Adding --data_type FP16 to the mo_tf.py command line fixes that but unfortunately it reports that the NCS 2 doesn't support the Resample layer, which is used by YOLOv3. If I had been smart I would have noticed that the usage info only mentions CPU and GPU :-(. Interestingly, the table of supported layers indicates that both Resample and Interp are supported on MYRIAD so I do not know what is going on here.

I did try changing the offending tf.image.resize_nearest_neighbor call into a tf.image.resize_bilinear call (by editing yolo_v3.py in the tensorflow-yolo-v3 directory). This maps to Interp instead of Resample in the OpenVINO IR. This worked fine in CPU mode but still failed to run on the NCS 2, just in a different way.


Not sure if that is a bug or intended. Anyway, that seems to be the end of the road with running YOLOv3 on the NCS 2 for the moment at least. However, there are a lot of things that do run on the NCS 2 very nicely. Still, YOLOv3 had started to become my standard way of checking inference things out, just like my strategy of evaluating restaurants by the quality of their Caesar salad – at least in the days when you could still get them!

*** Update: YOLOv3 does now work on the NCS 2 using the latest OpenVINO release.

Simplified workflow for YOLOv3 retraining

Following on from the previous post, I have now put together a pretty usable workflow for creating custom YOLOv3 models – the code and instructions are here. There are quite a few alternatives out there already but it was interesting putting this together from a learning point of view. The screen capture above was taken during some testing. I stopped the training early (which is why the probabilities are pretty low) so that I could test the weights with an rt-ai stream processing network design and then restarted the training. The tools automatically generate customized scripts to train and restart training, making this pretty painless.

There is a tremendous amount of valuable information here, including the code for the custom anchor generator that I have integrated into my workflow. I haven't tried this enhanced version of Darknet yet but will do that soon. One thing I did learn from that repo is that there is an option to treat mirror image objects as distinct objects – no doubt that was what was hindering the accurate detection of the left and right motion controllers previously.

Accessing the iOS WiFi IP address within a Unity app

At the moment, Manifold requires that a client supplies an appropriate IP address (although I might change this in the future). Mostly it is pretty easy to do but the .NET way of doing things (using Dns.GetHostEntry()) didn’t seem to pick up the WiFi IP address on an iPad. After wasting a lot of time, I decided to go back to basics and create a native plugin.

The basis of the code comes from here – it just needed the right wrapping to get it to work with Unity. I will be the first to admit that I know nothing about native iOS coding but, following the Bonjour example here, the code below seemed to work just fine when placed in the Assets/Plugins/iOS directory of the project.

IPAddress.h:

#import <Foundation/Foundation.h>

@interface IPAddressDelegate : NSObject

- (NSString *)getAddress;
@end

IPAddress.m:

#include <ifaddrs.h>
#include <arpa/inet.h>

#import "IPAddress.h"
@implementation IPAddressDelegate

- (id)init
{
    self = [super init];
    return self;
}

- (NSString *)getAddress {
    NSString *address = @"error";
    struct ifaddrs *interfaces = NULL;
    struct ifaddrs *temp_addr = NULL;
    int success = 0;
    success = getifaddrs(&interfaces);
    if (success == 0) {
        temp_addr = interfaces;
        while(temp_addr != NULL) {
            if(temp_addr->ifa_addr != NULL && temp_addr->ifa_addr->sa_family == AF_INET) {
                if([[NSString stringWithUTF8String:temp_addr->ifa_name] isEqualToString:@"en0"]) {
                    address = [NSString stringWithUTF8String:inet_ntoa(((struct sockaddr_in *)temp_addr->ifa_addr)->sin_addr)];
                }
            }
            temp_addr = temp_addr->ifa_next;
        }
    }
    // Free memory
    freeifaddrs(interfaces);
    return address;
}
@end

static IPAddressDelegate* delegateObject = nil;

char* MakeStringCopy (const char* string)
{
    if (string == NULL)
        return NULL;
    
    char* res = (char*)malloc(strlen(string) + 1);
    strcpy(res, string);
    return res;
}

const char * getLocalWifiIpAddress()
{
    if (delegateObject == nil)
        delegateObject = [[IPAddressDelegate alloc] init];
    
    return MakeStringCopy([[delegateObject getAddress] UTF8String]);
}

To use the plugin is pretty straightforward. Just add this declaration to a C# class:

	[DllImport ("__Internal")]
	private static extern string getLocalWifiIpAddress();

Then call getLocalWifiIpAddress() to get the dotted address string.
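
If it helps, here is a sketch of how the call could be wrapped so that the plugin is only used in iOS builds, with other platforms falling back to the Dns.GetHostEntry() approach mentioned earlier (the class name and fallback behavior are just illustrative, not part of the plugin):

using System.Net;
using System.Net.Sockets;
using System.Runtime.InteropServices;

public static class LocalAddress
{
#if UNITY_IOS && !UNITY_EDITOR
	[DllImport ("__Internal")]
	private static extern string getLocalWifiIpAddress();
#endif

	// returns a dotted IPv4 address string for the local device
	public static string Get()
	{
#if UNITY_IOS && !UNITY_EDITOR
		// use the native plugin on an iOS device
		return getLocalWifiIpAddress();
#else
		// fall back to the standard .NET lookup elsewhere
		foreach (IPAddress addr in Dns.GetHostEntry(Dns.GetHostName()).AddressList) {
			if (addr.AddressFamily == AddressFamily.InterNetwork)
				return addr.ToString();
		}
		return "127.0.0.1";
#endif
	}
}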

rt-xr sentient space visualization now on iOS!

I have to admit, I am in a state of shock right now. For some reason today I decided to try to get the rt-xr Viewer software working on iOS. After all, it worked fine on Windows desktop, UWP (Windows MR), macOS and Android so why not? However, I expected endless trouble with the Manifold library but, as it turned out, getting it to work on iOS was trivial. I guess Unity and .NET magic came together so I didn’t have to do too much work once again. In fact, the hardest part was working out how to sort out microphone permission and that wasn’t too hard – this thread certainly helped with that. Avatar pose sharing, audio sharing, proxy objects, video and sensor feeds all work perfectly.

The nice thing now is that most (if not all) of the further development is intrinsically multi-platform.

Streaming PCM audio from Unity on Android

The final step in adding audio support to rt-xr visualization was to make it work with Android. Supporting audio capture natively on Windows desktop and Windows UWP was relatively easy since it could all be done in C#. However, I didn't really want to implement a native capture plugin for Android, and it turns out that the Unity capture technique works pretty well, albeit with noticeable latency.

The Inspector view in the screen capture shows the idea. The MicrophoneFilter script starts up the Unity Microphone and adds it to the AudioSource. When running, the output of the AudioSource is passed to MicrophoneFilter via the OnAudioFilterRead method, which gives access to the PCM stream from the microphone.

The resulting stream needs some processing, however. I am sending single channel PCM audio at 16000 samples per second over the network, whereas the output of the AudioSource is stereo, at either 16000 or 48000 samples per second depending on the platform, and uses floating point rather than 16 bit values, so the code has to convert this. It also needs to zero out the output of the filter, otherwise it will be picked up by the listener on the main camera, which is certainly not desirable! There is an alternate way of running this that uses the AudioSource.clip.GetData call directly, but I had problems with that and also prefer the asynchronous callback used for OnAudioFilterRead rather than using Update or FixedUpdate to poll. The complete MicrophoneFilter script looks like this:

using UnityEngine;

[RequireComponent(typeof(AudioSource))]
public class MicrophoneFilter : MonoBehaviour
{
    [Tooltip("Index of microphone to use")]
    public int deviceIndex = 0;

    private StatusUpdate statusUpdate;
    private bool running = false;
    private byte[] buffer = new byte[32000];
    private int scale;

    // Use this for initialization
    void Start()
    {

        AudioSource source = GetComponent<AudioSource>();

        if (deviceIndex >= Microphone.devices.Length)
            deviceIndex = 0;

        GameObject scripts = GameObject.Find("Scripts");
        statusUpdate = scripts.GetComponent<StatusUpdate>();

        int sampleRate = AudioSettings.outputSampleRate;

        if (sampleRate > 16000)
            scale = 3;
        else
            scale = 1;

        source.clip = Microphone.Start(Microphone.devices[deviceIndex], true, 1, sampleRate);
        source.Play();
        running = true;
    }

    private void OnAudioFilterRead(float[] data, int channels)
    {
        if (!running)
            return;

        int byteIndex = 0;
        if (channels == 1) {
            for (int i = 0; i < data.Length;) {
                short val = (short)((data[i]) * 32767.0f);
                for (int offset = 0; offset < scale; offset++) {
                    if (i < data.Length) 
                        data[i++] = 0; 
                } 
                buffer[byteIndex++] = (byte)(val & 0xff); 
                buffer[byteIndex++] = (byte)((val >> 8) & 0xff);
            }
        } else {
            for (int i = 0; i < data.Length;) {
                short val = (short)((data[i] + data[i + 1]) * 32767.0f / 2.0f);
                for (int offset = 0; offset < 2 * scale; offset++) {
                    if (i < data.Length) 
                        data[i++] = 0; 
                } 
                buffer[byteIndex++] = (byte)(val & 0xff); 
                buffer[byteIndex++] = (byte)((val >> 8) & 0xff);
            }
        }
        statusUpdate.newAudioData(buffer, byteIndex);
    }
}

Note the fixed, maximum-size buffer allocation to try to prevent too much garbage collection. In general, the code uses fixed maximum-size buffers wherever possible.

The SharingServer has now been updated to generate separate feeds for VR and AR/MR users with all user audio feeds in the VR version and just VR headset users’ audio in the MR version. The audio update rate has also been decoupled from the avatar pose update rate. This allows a faster update rate for pose updates than makes sense for audio.

Just a note on why I am using single channel 16 bit PCM at 16000 samples per second rather than sending single channel floats at 48000 samples per second which would be a better fit in many cases. The problem is that this makes the data rate 6 times higher – it goes from 256kbps to 1.536Mbps. Using uncompressed 16 bit audio and dealing with the consequences seemed like a better trade than either the higher data rate or moving to compressed audio. This decision may have to be revisited when running on real MR headset hardware however.

Proxy objects: Unity assets that are UI extensions of remote servers

For some reason I often end up back at the analog clock for trying out new ideas. I guess it is because it is pretty trivial to operate a clock – just supply three angles. In this case, the clock is a proxy object which is in many ways just a simple extension of the system that animates the avatars for other occupants of a sentient space. A proxy object is a conventional Unity GameObject hierarchy that has certain specially named child nodes. By itself, there’s nothing special about the Unity asset part of a proxy object – it could be an asset included in the app or an asset downloaded from a server using Unity’s asset bundle system. Either way, these specially named nodes can be linked to external servers. In this case, the SharingServer generates an analog clock stream that animates the clock hands. The clock definition is contained in the space definition file that instantiates all the other parts of the scene.

In principle, interaction (i.e. sending stuff back to the remote server) can be added by using specially named nodes to attach scripts that are hard-coded in the app. I haven’t tried this yet but see no reason why it wouldn’t work. The key point is that proxy objects leverage standard scripts in the app as opposed to customized scripts for every asset.

Right now, you can modify the local scale, local position, local orientation, color and text (if associated with a TextMesh) of any of the GameObjects in an asset’s hierarchy. This could easily be extended to other things including updating a texture with a new image. For example, a virtual fireplace could be created where the flames are animated by constantly varying the textures being displayed. The system is still simplistic however as there are no mechanisms for controlling transitions (such as lerping between positions or fading between textures) but this could certainly be added without too much difficulty.

Just for reference, the analog clock stream message looks like this:

{
    "type": "proxyobject",
    "updateList": [
        {
            "name": "PO_AnalogClock_Second",
            "orientation": {
                "x": 0,
                "y": 222,
                "z": 0
            },
            "orientationValid": true
        },
        {
            "name": "PO_AnalogClock_Minute",
            "orientation": {
                "x": 0,
                "y": 342,
                "z": 0
            },
            "orientationValid": true
        },
        {
            "name": "PO_AnalogClock_Hour",
            "orientation": {
                "x": 0,
                "y": 568,
                "z": 0
            },
            "orientationValid": true
        }
    ]
}

Here the y value encodes the relevant hand angle. The hour angle is greater than 360 degrees as the system uses a 24 hour clock but the result is the same either way.
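
Just to make the idea concrete, a minimal sketch of how an entry like this could be applied on the Unity side might look something like the code below (the class and method names are hypothetical, not the actual rt-xr scripts):

using UnityEngine;

// Hypothetical sketch: applies an orientation update from a proxy object
// message to the correspondingly named node in the asset's hierarchy.
public class ProxyObjectUpdater : MonoBehaviour
{
    // eulerAngles comes straight from the message (in degrees); values over
    // 360 degrees, like the hour hand's, simply wrap around.
    public void ApplyOrientation(string nodeName, Vector3 eulerAngles)
    {
        Transform node = FindDeepChild(transform, nodeName);
        if (node != null)
            node.localRotation = Quaternion.Euler(eulerAngles);
    }

    // depth-first search of the hierarchy for a child with the given name
    private static Transform FindDeepChild(Transform parent, string name)
    {
        foreach (Transform child in parent) {
            if (child.name == name)
                return child;
            Transform found = FindDeepChild(child, name);
            if (found != null)
                return found;
        }
        return null;
    }
}

Local scale, position, color and text updates would follow the same pattern, keyed off the corresponding ...Valid flags in the message.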