Converting screen coordinates + depth into spatial coordinates for OpenPose…or anything else really

Depth cameras are wonderful things but they typically only give a distance associated with each (x, y) coordinate in screen space. Converting to spatial coordinates involves some calculation. One thing to note is that I am ignoring camera calibration, which is required for best accuracy – see this page for details of how to use calibration data in iOS, for example. I have implemented this calculation for the iPad TrueDepth camera and also for the ZED stereo camera to process OpenPose joint data and it seems to work, but I cannot guarantee complete accuracy!

The concept for the conversion is shown in the diagram above. One can think of the 2D camera image as being mapped to a screen plane – the blue plane in the diagram. The width and height of the plane are determined by its distance from the camera and the camera’s field of view. Using the iPad as an example, you can get the horizontal and vertical camera field of view angles (hFOV and vFOV in the diagram) like this:

hFOV = captureDevice.activeFormat.videoFieldOfView * Float.pi / 180.0
vFOV = 2 * atan(height / width * tan(hFOV / 2))
tanHalfHFOV = tan(hFOV / 2)
tanHalfVFOV = tan(vFOV / 2)

where width and height are the width and height of the 2D image. This calculation can be done once at the start of the session since it is defined by the camera itself.

For the Stereolabs ZED camera (this is a partial code extract):

#include <sl_zed/Camera.hpp>

sl::Camera zed;
sl::InitParameters init_params;

// set up params here
if (zed.open(init_params) != sl::SUCCESS) {
    // handle the error and bail out
}

sl::CameraInformation ci = zed.getCameraInformation();
sl::CameraParameters cp = ci.calibration_parameters.left_cam;
hFOV = cp.h_fov * M_PI / 180.0f;    // h_fov and v_fov are reported in degrees
vFOV = cp.v_fov * M_PI / 180.0f;
tanHalfHFOV = tan(hFOV / 2);
tanHalfVFOV = tan(vFOV / 2);

To pick up the depth value, you just look up the hit point's (x, y) coordinate in the depth buffer. For the TrueDepth camera and the ZED, this seems to be the perpendicular distance from the center of the camera to the plane through the target point that is perpendicular to the camera's look-at direction – the yellow plane in the diagram. Other types of depth sensors might give the radial distance from the center of the camera to the hit point, which will obviously require a slightly modified calculation. Here I am assuming that the depth buffer contains the perpendicular distance – call this spatialZ.

What we need now are the tangents of the reduced angles that correspond to the horizontal and vertical angle components between the ray from the camera to the screen-plane hit point and the camera's look-at ray – call these angles ThetaX (horizontal) and ThetaY (vertical). Given the perpendicular distance to the yellow plane, we can then easily calculate the spatial x and y coordinates using the field of view tangents previously calculated:

tanThetaX = (x - Float32(width / 2)) / Float32(width / 2) * tanHalfHFOV
tanThetaY = (y - Float32(height / 2)) / Float32(height / 2) * tanHalfVFOV

spatialX = spatialZ * tanThetaX
spatialY = spatialZ * tanThetaY

The coordinates (spatialX, spatialY, spatialZ) are in whatever units the depth buffer uses (often meters) and in the camera's coordinate system. Converting from the camera's coordinate system to world coordinates is a standard operation given the camera's pose in world space.
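The whole conversion can be sketched end to end in Python (the function name and signature are mine, not from any of the implementations mentioned above):

```python
import math

def screen_to_spatial(x, y, depth, width, height, hfov):
    """Convert a screen-space point plus perpendicular depth into
    camera-space spatial coordinates. hfov is in radians; depth and
    the returned coordinates are in the depth buffer's units."""
    # the vertical FOV follows from the aspect ratio
    vfov = 2.0 * math.atan(height / width * math.tan(hfov / 2.0))
    tan_half_hfov = math.tan(hfov / 2.0)
    tan_half_vfov = math.tan(vfov / 2.0)
    # reduced-angle tangents for this pixel relative to the image center
    tan_theta_x = (x - width / 2.0) / (width / 2.0) * tan_half_hfov
    tan_theta_y = (y - height / 2.0) / (height / 2.0) * tan_half_vfov
    return (depth * tan_theta_x, depth * tan_theta_y, depth)
```

A point at the image center maps to (0, 0, depth) as expected, and a point at the right edge maps to spatialX = depth * tanHalfHFOV.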

Raspberry Pi 3 Model B with Coral Edge TPU acceleration running SSD object detection

It wasn’t too hard to go from the inline rt-ai Edge Stream Processing Element using the Coral Edge TPU accelerator to an embedded version running on a Raspberry Pi 3 Model B with Pi camera.  The rt-ai Edge test design for this SPE is pretty simple again:

As can be seen, the Pi + Coral runs at about 4 fps with 1280 x 720 frames which is not too bad at all. In this example, I am running the PiCoral camera SPE on the Raspberry Pi node (Pi7) and the View SPE on the Default node (an i7 Ubuntu machine). Also, I’m using the combined video and metadata output which contains both the detection data and the associated JPEG video frame. However, the PiCoral SPE also has a metadata-only output. This contains all the frame information and detection data (scores, boxes etc) but not the JPEG frame itself. This can be useful for a couple of reasons. First, especially if the Raspberry Pi is connected via WiFi, transmitting the JPEGs can be a bit onerous and, if they are not needed, very wasteful. Secondly, it satisfies a potential privacy issue in that the raw video data never leaves the Raspberry Pi. Provided the metadata contains enough information for useful downstream processing, this can be a very efficient way to configure a system.
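A metadata-only record along these lines is easy to build in Python – note that the field names here are purely illustrative, not the actual rt-ai Edge message schema:

```python
import json
import time

def make_metadata_message(frame_index, detections, width, height):
    """Build a metadata-only record for a frame: detection labels,
    scores and boxes, but no JPEG data. Field names are illustrative
    only, not the real rt-ai Edge schema."""
    return json.dumps({
        "frameIndex": frame_index,
        "timestamp": time.time(),
        "width": width,
        "height": height,
        "detections": [
            {"label": label, "score": score, "box": box}
            for (label, score, box) in detections
        ],
    })
```

A record like this is a few hundred bytes per frame instead of tens of kilobytes for a JPEG, which is what makes the metadata-only output attractive over WiFi.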

An Edge TPU stream processing element for rt-ai Edge using the Coral USB Accelerator

A Coral USB Accelerator turned up yesterday so of course it had to be integrated with rt-ai Edge to see what it could do. Creating a Python-based SPE from the object detection demo in the API download didn’t take too long. I used the MobileNet SSD v2 COCO model as a starting point to generate this example output:

The very basic rt-ai Edge test design looks like this:

Using 1280 x 720 video frames from the webcam, I was getting around 2 frames per second from the CoralSSD SPE. This isn’t as good as the Intel NCS 2 SPE but that is a compiled C++ SPE whereas the Coral SPE is a Python 3 SPE. I haven’t found a C++ API spec for the Edge TPU as yet. Perhaps by investigating the SWIG-generated Python interface I could link the compiled libraries directly but that’s for another day…

Creating a new plugin for the Janus WebRTC server

I am working on a system to support multi-site podcasting using WebRTC and the Janus Server seemed like a good place to start. None of the example plugins does exactly what I want so, rather than modify an existing plugin, I decided to create a new one based on an existing one (videoroom). The screen capture shows the result. At this stage, it is identical to the video room plugin hence the identical look of the test. There are a few steps to doing this such that it is integrated into the configuration and build system and there’s no way I will remember them, hence this aide-memoire!

One thing I noticed which has nothing to do with a new plugin is that I needed to install gtk-doc-tools before I could compile libnice as described in the dependency section of the readme.

Anyway, the janus-gateway repo has a plugins directory that contains c source (amongst other things) of the various plugins. I decided to base my new plugin on the videoroom plugin so I copied janus_videoroom.c into rt_podcall.c for the new plugin. Then, using a text editor, I changed all forms of text involving “videoroom” into “podcall”.
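The copy-and-rename step can be sketched as a small Python helper (the helper itself is mine; the blanket replace works because the plugin name is used consistently in the source):

```python
from pathlib import Path

def clone_plugin_source(src, dst, old="videoroom", new="podcall"):
    """Copy a plugin source file, rewriting every occurrence of the
    old plugin name (lower- and upper-case variants) to the new one."""
    text = Path(src).read_text()
    text = text.replace(old, new).replace(old.upper(), new.upper())
    Path(dst).write_text(text)
```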

Once the source is created, it needs to be referenced in the build files in the root of the repo. Basically, in each of them I copied anything involving “videoroom” and changed the text from “videoroom” to “podcall”.

It is also necessary to create a configuration file for the new plugin. The repo root has a directory called conf which is where all of the configurations are held. I copied the janus.plugin.videoroom.jcfg.sample into janus.plugin.podcall.jcfg.sample to satisfy that requirement.

In order to test the plugin, it’s useful to add code into the existing demo system. The repo root has a directory called html that contains the test code. I copied videoroomtest.html and videoroomtest.js into podcall.html and podcall.js and edited the files to fix the references (such as plugin name) from videoroom to podcall.

To make the test available in the Demos dropdown, edit navbar.html and add the appropriate line in the dropdown menu.

Once all that’s done, it should be possible to build and install the modified Janus server:

./configure --prefix=/opt/janus
sudo make install
sudo make configs

The Janus server needs a webserver in order to run these tests. I used a very simple Python server to do this:

from http.server import HTTPServer, SimpleHTTPRequestHandler
import ssl

server_address = ('localhost', 8080)
httpd = HTTPServer(server_address, SimpleHTTPRequestHandler)
# certfile and keyfile point at the sample Janus certificates -
# adjust the paths to wherever they live in your tree
httpd.socket = ssl.wrap_socket(httpd.socket,
                               certfile='cert.pem',
                               keyfile='key.pem',
                               server_side=True)
httpd.serve_forever()

This is run with Python 3 in the html directory and borrows the sample Janus certificates to support SSL. Replace localhost with a real IP address to allow access to this server from outside the local machine.

Combining TrueDepth, remote OpenPose inference and local depth map processing to generate spatial 3D pose coordinates

The problem with depth maps for video is that the depth data is very large and can’t be compressed easily. I had previously run OpenPose at 30 FPS using an iPad Pro and remote inference but that was just for the standard OpenPose (x, y) coordinate output. There’s no way that 30 FPS could be achieved by sending out TrueDepth depth maps with each frame. Instead, the depth processing has to be handled locally on the iPad – the depth map never leaves the device.

The screen capture above shows the system running at 30 FPS. I had to turn a lot of lights on in the office – the frame rate from the iPad camera will drop below 30 FPS if it is too dark which messes up the data!

This is the design. It is the triple scaled OpenPoseGPU design used previously. iOSOpenPose connects to the Conductor via a websocket connection that is used to send images to and receive processed images from the pipeline.

One issue is that each image frame has its own depth map and that’s the one that has to be used to convert the OpenPose (x, y) coordinates into spatial (x, y, z) distances. The solution, in a new app called iOSOpenPose, is to cache the depth maps locally and re-associate them with the processed images when they return. Each image and depth frame is marked with a unique incrementing index to assist with this. Incidentally, this is why I love using JSON for this kind of work – it is possible to add non-standard fields at any point and they will be carried transparently to their destination.
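The cache-and-reassociate idea can be sketched like this (a minimal version of my own; the real iOSOpenPose code has to deal with more real-world details):

```python
class DepthMapCache:
    """Cache depth maps by frame index so they can be re-associated
    with processed frames returning from the remote pipeline. Returned
    frames may arrive having skipped indices (the pipeline can drop
    frames), so stale entries are evicted as results come back."""
    def __init__(self, max_size=64):
        self.maps = {}
        self.max_size = max_size

    def put(self, index, depth_map):
        self.maps[index] = depth_map
        # bound memory use if results stop coming back
        while len(self.maps) > self.max_size:
            del self.maps[min(self.maps)]

    def take(self, index):
        depth_map = self.maps.pop(index, None)
        # anything older than this frame will never be matched now
        for stale in [i for i in self.maps if i < index]:
            del self.maps[stale]
        return depth_map
```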

Empirically with my current setup, there is a six frame processing lag which is not too bad. It would probably be better with the dual scaled pipeline, two node design that more easily handles 30 FPS but I did not try that. Another issue is that the processing pipeline can validly lose image frames if it can’t keep up with the offered rate. The depth map cache management software has to take care of all of the nasty details like this and other real-world effects.

Generating 3D spatial coordinates from OpenPose with the help of the Stereolabs ZED camera

OpenPose does a great job of estimating the (x, y) coordinates of body points. However, in many situations, the spatial (3D) coordinates of the body joints are what’s required. To get those, the z coordinate has to be provided in some way. There are two common ways of doing that: using multiple cameras or using a depth camera. In this case, I chose to use RGBD data from a Stereolabs ZED camera. An example of the result is shown in the screen capture above and another below. Coordinates are in units of meters.

The (x, y) 2D coordinates within the image (generated by OpenPose) along with the depth information at that (x, y) point in the image are used to calculate a spatial (sx, sy, sz) coordinate with origin at the camera and defined by the camera’s orientation. The important thing is that the spatial relationship between the joints is then trivial to calculate. This can be used by downstream inference blocks to discriminate higher level motions.
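For example, once each joint has an (sx, sy, sz), the distance between two joints is just the Euclidean norm (a trivial helper of my own):

```python
import math

def joint_distance(a, b):
    """Euclidean distance between two spatial joints (sx, sy, sz),
    in the same units as the depth buffer (usually meters)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
```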

Incidentally I don’t have a leprechaun sitting on my computers to the right of the first screen capture – OpenPose was picking up my reflection in the window as another person.

The ZED is able to produce a depth map or a point cloud, but the depth map is more practical in this case as it is necessary to transmit the data between processes (possibly on different machines). Even so, it is large and difficult to compress. The trick is to extract the meaningful data and then discard the depth information as soon as possible! The ZED camera also sends along the calibrated horizontal and vertical fields of view, as these are essential to constructing (sx, sy, sz) from (x, y) and depth. Since the ZED doesn’t seem to produce a depth value for every pixel, the code samples an area around the (x, y) coordinate to evaluate a depth figure. If it fails to do this, the spatial coordinate is returned as (0, 0, 0).
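One way to do that sampling is a median over a small window, skipping invalid entries (this is my own sketch, not the actual code; the window size and the median are assumptions):

```python
import numpy as np

def sample_depth(depth_map, x, y, radius=3):
    """Median of valid depth values in a small window around (x, y).
    NaN, inf and zero entries are treated as missing, since the ZED
    depth map can contain them; returns 0.0 when nothing valid is
    found, so the caller can fall back to a (0, 0, 0) coordinate."""
    h, w = depth_map.shape
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    patch = depth_map[y0:y1, x0:x1].astype(float).ravel()
    valid = patch[np.isfinite(patch) & (patch > 0.0)]
    return float(np.median(valid)) if valid.size else 0.0
```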

This is the design I ended up using. Basically a dual OpenPose pipeline with scaler as for standard OpenPose. It averaged around 16 FPS with 1280 x 720 images (24 FPS with VGA images) using JPEG for the image part and raw depth map for the depth part. Using just one pipeline achieved about 13 FPS so the speed up from the second pipeline was disappointing. I expect that this was largely due to the communications overhead of moving the depth map around between nodes. Better network interfaces might improve this.

30 FPS OpenPose using rt-ai Edge scaling and iOSEdgeRemote

Combining the rt-ai Edge Scaler SPE, the OpenPose GPU SPE and iOSEdgeRemote running on an iPad as a camera/display generated some pretty good results, shown in the screen capture above. Full frame rate (30 frames per second) in OpenPose Body mode was obtained running one OpenPoseGPU SPE instance on each of two nodes: Default (equipped with a GTX 1080 ti GPU) and Node110 (equipped with a GTX 1080 GPU). The Scaler SPE divided up the video stream between the two OpenPoseGPU SPEs in order to share the load between the GPUs, performing its usual reassembly and reordering to generate a complete output stream after parallel processing. Latency was not noticeable.

As another experiment, I tried to achieve the same result with just one node, Default, the GTX 1080 ti node. The resulting configuration that ran at the full 30 FPS is shown above. Three OpenPoseGPU SPEs were required to achieve 30 FPS in this case; two topped out at 27 FPS.

In addition, 22 FPS was obtained in OpenPose Body and Face mode, this time using the second node (Node110) for the OpenPoseGPU2 block. Running OpenPoseGPU2 on the Default node along with the other two did not improve performance, presumably because the GPU was saturating.