Creating a new plugin for the Janus WebRTC server

I am working on a system to support multi-site podcasting using WebRTC and the Janus Server seemed like a good place to start. None of the example plugins does exactly what I want so, rather than modify an existing plugin, I decided to create a new one based on an existing one (videoroom). The screen capture shows the result. At this stage, it is identical to the video room plugin hence the identical look of the test. There are a few steps to doing this such that it is integrated into the configuration and build system and there’s no way I will remember them, hence this aide-memoire!

One thing I noticed which has nothing to do with a new plugin is that I needed to install gtk-doc-tools before I could compile libnice as described in the dependency section of the readme.

Anyway, the janus-gateway repo has a plugins directory that contains the C source (amongst other things) of the various plugins. I decided to base my new plugin on the videoroom plugin so I copied janus_videoroom.c into rt_podcall.c for the new plugin. Then, using a text editor, I changed all forms of text involving “videoroom” into “podcall”.
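
The copy-and-rename step could also be scripted rather than done by hand. The following is only a sketch: the set of spellings in FORMS, and in particular the capitalised "PodCall" form, are assumptions for illustration, and the helper names are hypothetical.

```python
import re

# Hypothetical mapping of the spellings that may occur in the plugin source.
# The capitalised "PodCall" form is an assumption, not taken from the post.
FORMS = {
    "videoroom": "podcall",
    "VIDEOROOM": "PODCALL",
    "VideoRoom": "PodCall",
}


def rename_forms(text, forms=FORMS):
    """Replace every listed spelling of the old plugin name with the new one."""
    pattern = re.compile("|".join(re.escape(old) for old in forms))
    return pattern.sub(lambda m: forms[m.group(0)], text)


def copy_plugin(src_path, dst_path, forms=FORMS):
    """Write a renamed copy of src_path to dst_path, leaving the original intact."""
    with open(src_path, encoding="utf-8") as src:
        renamed = rename_forms(src.read(), forms)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(renamed)
```

Run from the repo root, copy_plugin("plugins/janus_videoroom.c", "plugins/rt_podcall.c") would produce the new source file. Any spelling not listed in FORMS is left untouched, so the result still deserves a quick search for leftovers.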

Once the source is created, it can be added into the file which is in the root of the repo. Basically, I copied anything involving “videoroom” and changed the text from “videoroom” to “podcall”. The same also needs to be done for

It is also necessary to create a configuration file for the new plugin. The repo root has a directory called conf which is where all of the configurations are held. I copied the janus.plugin.videoroom.jcfg.sample into janus.plugin.podcall.jcfg.sample to satisfy that requirement.

In order to test the plugin, it’s useful to add code into the existing demo system. The repo root has a directory called html that contains the test code. I copied videoroomtest.html and videoroomtest.js into podcall.html and podcall.js and edited the files to fix the references (such as plugin name) from videoroom to podcall.

To make the test available in the Demos dropdown, edit navbar.html and add the appropriate line in the dropdown menu.

Once all that’s done, it should be possible to build and install the modified Janus server:

./configure --prefix=/opt/janus
sudo make install
sudo make configs

The Janus server needs a webserver in order to run these tests. I used a very simple Python server to do this:

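from http.server import HTTPServer, SimpleHTTPRequestHandler
import ssl

server_address = ('localhost', 8080)
httpd = HTTPServer(server_address, SimpleHTTPRequestHandler)

# Placeholder paths (assumption): point these at the sample Janus certificate and key.
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile='/path/to/cert.pem', keyfile='/path/to/cert.key')
httpd.socket = context.wrap_socket(httpd.socket, server_side=True)
httpd.serve_forever()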

This is run with Python 3 in the html directory and borrows the sample Janus certificates to support SSL. Replace localhost with a real IP address to allow access to this server from outside the local machine.

Combining TrueDepth, remote OpenPose inference and local depth map processing to generate spatial 3D pose coordinates

The problem with depth maps for video is that the depth data is very large and can’t be compressed easily. I had previously run OpenPose at 30 FPS using an iPad Pro and remote inference but that was just for the standard OpenPose (x, y) coordinate output. There’s no way that 30 FPS could be achieved by sending out TrueDepth depth maps with each frame. Instead, the depth processing has to be handled locally on the iPad – the depth map never leaves the device.

The screen capture above shows the system running at 30 FPS. I had to turn a lot of lights on in the office – the frame rate from the iPad camera will drop below 30 FPS if it is too dark which messes up the data!

This is the design. It is the triple scaled OpenPoseGPU design used previously. iOSOpenPose connects to the Conductor via a websocket connection that is used to send images to and receive processed images from the pipeline.

One issue is that each image frame has its own depth map and that’s the one that has to be used to convert the OpenPose (x, y) coordinates into spatial (x, y, z) distances. The solution, in a new app called iOSOpenPose, is to cache the depth maps locally and re-associate them with the processed images when they return. Each image and depth frame is marked with a unique incrementing index to assist with this. Incidentally, this is why I love using JSON for this kind of work – it is possible to add non-standard fields at any point and they will be carried transparently to their destination.

Empirically with my current setup, there is a six frame processing lag which is not too bad. It would probably be better with the dual scaled pipeline, two node design that more easily handles 30 FPS but I did not try that. Another issue is that the processing pipeline can validly lose image frames if it can’t keep up with the offered rate. The depth map cache management software has to take care of all of the nasty details like this and other real-world effects.
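
A depth map cache of this kind might look roughly like the sketch below. It is not the iOSOpenPose code: the class name, the max_entries bound and the assumption that processed frames come back in increasing index order are all illustrative.

```python
from collections import OrderedDict


class DepthMapCache:
    """Holds depth maps by frame index until the processed frame returns."""

    def __init__(self, max_entries=30):
        # max_entries is an assumed bound; it only needs to exceed the pipeline lag.
        self.max_entries = max_entries
        self._maps = OrderedDict()

    def store(self, index, depth_map):
        """Cache the depth map captured alongside the frame with this index."""
        self._maps[index] = depth_map
        while len(self._maps) > self.max_entries:
            self._maps.popitem(last=False)  # evict the oldest entry

    def match(self, index):
        """Return the depth map for a returned frame, or None if it is gone.

        Entries older than the returned index belong to frames that the
        pipeline dropped, so they are discarded at the same time.
        """
        for stale in [i for i in self._maps if i < index]:
            del self._maps[stale]
        return self._maps.pop(index, None)
```

With a six frame lag the cache stays small. The bound on entries only matters if the pipeline stops returning frames altogether, and a frame whose depth map has already been evicted simply gets None back.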

Generating 3D spatial coordinates from OpenPose with the help of the Stereolabs ZED camera

OpenPose does a great job of estimating the (x, y) coordinates of body points. However, in many situations, the spatial (3D) coordinates of the body joints are what’s required. To do that, the z coordinate has to be provided in some way. There are two common ways of doing that: using multiple cameras or using a depth camera. In this case, I chose to use RGBD data from a Stereolabs ZED camera. An example of the result is shown in the screen capture above and another below. Coordinates are in units of meters.

The (x, y) 2D coordinates within the image (generated by OpenPose) along with the depth information at that (x, y) point in the image are used to calculate a spatial (sx, sy, sz) coordinate with origin at the camera and defined by the camera’s orientation. The important thing is that the spatial relationship between the joints is then trivial to calculate. This can be used by downstream inference blocks to discriminate higher level motions.
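
As an illustration of how simple the joint relationships become once spatial coordinates are available, here is a small sketch. The function names are hypothetical; the inputs are (sx, sy, sz) tuples in meters.

```python
import math


def joint_distance(a, b):
    """Euclidean distance in meters between two (sx, sy, sz) points."""
    return math.dist(a, b)


def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by the segments b->a and b->c."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0.0:
        return None  # coincident points, so the angle is undefined
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
```

For example, joint_angle(shoulder, elbow, wrist) would give the elbow angle, which is the sort of feature a downstream block could use. This needs Python 3.8 or later for math.dist and multi-argument math.hypot.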

Incidentally I don’t have a leprechaun sitting on my computers to the right of the first screen capture – OpenPose was picking up my reflection in the window as another person.

The ZED is able to produce a depth map or point cloud but the depth map is more practical in this case as it is necessary to transmit the data between processes (possibly on different machines). Even so, it is large and difficult to compress. The trick is to extract the meaningful data and then discard the depth information as soon as possible! The ZED camera also sends along the calibrated horizontal and vertical fields of view as this is essential to constructing (sx, sy, sz) from (x, y) and depth. Since the ZED doesn’t seem to produce a depth value for every pixel, the code samples an area around the (x, y) coordinate to evaluate a depth figure. If it fails to do this, the spatial coordinate is returned as (0, 0, 0).
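
One way the conversion could work is sketched below. This is a pinhole camera approximation and not necessarily the actual implementation. It assumes the depth value is the distance along the optical axis, that missing depths appear as non-finite or non-positive values, that the window median is an acceptable depth figure, and that sy grows downwards like image y. The function names and the radius parameter are hypothetical.

```python
import math
from statistics import median


def sample_depth(depth_map, x, y, radius=2):
    """Median of the valid depths in a square window around pixel (x, y).

    depth_map is a row-major list of rows in meters. Non-finite or
    non-positive values are treated as missing. Returns None if the
    whole window is missing.
    """
    height, width = len(depth_map), len(depth_map[0])
    values = []
    for row in range(max(0, y - radius), min(height, y + radius + 1)):
        for col in range(max(0, x - radius), min(width, x + radius + 1)):
            d = depth_map[row][col]
            if math.isfinite(d) and d > 0.0:
                values.append(d)
    return median(values) if values else None


def to_spatial(x, y, depth_map, hfov_deg, vfov_deg, radius=2):
    """Convert an image point to camera-centred (sx, sy, sz) in meters."""
    depth = sample_depth(depth_map, x, y, radius)
    if depth is None:
        return (0.0, 0.0, 0.0)  # failure value, as described in the text
    height, width = len(depth_map), len(depth_map[0])
    # Focal lengths in pixels derived from the fields of view.
    fx = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    fy = (height / 2.0) / math.tan(math.radians(vfov_deg) / 2.0)
    sx = (x - width / 2.0) * depth / fx
    sy = (y - height / 2.0) * depth / fy
    return (sx, sy, depth)
```

Only the handful of joint coordinates needs to travel further down the pipeline after this step, which is what allows the depth map to be discarded early.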

This is the design I ended up using. Basically a dual OpenPose pipeline with scaler as for standard OpenPose. It averaged around 16 FPS with 1280 x 720 images (24 FPS with VGA images) using JPEG for the image part and raw depth map for the depth part. Using just one pipeline achieved about 13 FPS so the speed up from the second pipeline was disappointing. I expect that this was largely due to the communications overhead of moving the depth map around between nodes. Better network interfaces might improve this.

30 FPS OpenPose using rt-ai Edge scaling and iOSEdgeRemote

Combining the rt-ai Edge Scaler SPE, the OpenPose GPU SPE and iOSEdgeRemote running on an iPad as a camera/display generated some pretty good results, shown in the screen capture above. Full frame rate (30 frames per second) in OpenPose Body mode was obtained running one OpenPoseGPU SPE instance on each of two nodes: Default (equipped with a GTX 1080 Ti GPU) and Node110 (equipped with a GTX 1080 GPU). The Scaler SPE divided up the video stream between the two OpenPoseGPU SPEs in order to share the load between the GPUs, performing its usual reassembly and reordering to generate a complete output stream after parallel processing. Latency was not noticeable.

As another experiment, I tried to achieve the same result with just one node, Default, the GTX 1080 Ti node. The resulting configuration that ran at the full 30 FPS is shown above. Three OpenPoseGPU SPEs were required to achieve 30 FPS in this case; two topped out at 27 FPS.

In addition, 22 FPS was obtained in OpenPose Body and Face mode, this time using the second node (Node110) for the OpenPoseGPU2 block. Running OpenPoseGPU2 on the Default node along with the other two did not improve performance, presumably because the GPU was saturating.


Fun with Apple’s TrueDepth camera

Somehow I had completely missed the fact that Apple’s TrueDepth camera isn’t just for Face ID but can be used by any app. The screen captures here come from a couple of example apps. The one above is the straight depth data being used in different ways. The one below uses the TrueDepth camera based face tracking functions in ARKit to do things like replace my face with a box. I am winking at the camera and my jaw has been suitably dropped in the image on the right.

What’s really intriguing is that this depth data could be combined with OpenPose 2D joint positions to create 3D spatial coordinates for the joints. A very useful addition to the rt-ai Edge next generation gym concept perhaps, as it would enable much better form estimation and analysis.

Real time OpenPose on an iPad…with the help of remote inference and rendering

I wanted to use the front camera of an iPad to act as the input to OpenPose so that I could track pose in real time with the original idea being to leverage CoreML to run pose estimation on the device. There are a few iOS implementations of OpenPose (such as this one) but they are really designed for offline processing as they are pretty slow. I did try a different pose estimator that runs in real time on my iPad Pro but the estimation is not as good as OpenPose.

So the question was how to run iPad OpenPose in real time in some way – compromise was necessary! I do have an OpenPose SPE as part of rt-ai Edge that runs very nicely so an obvious solution was to run rt-ai Edge OpenPose on a server and just use the iPad as an input and output device. The nice plus of this new iOS app called iOSEdgeRemote is that it really doesn’t care what kind of remote processing is being used. Frames from the camera are sent to an rt-ai Edge Conductor connected to an OpenPose pipeline.

The rt-ai Edge design for this test is shown above. The pipeline optionally annotates the video and returns that and the pose metadata to the iPad for display. However, the pipeline could be doing anything provided it returns some sort of video back to the iPad.

The results are shown in the screen captures above. Using a GTX 1080 Ti GPU, I was getting around 19 FPS with just body pose processing turned on and around 9 FPS with face pose also turned on. Latency is not noticeable with body pose estimation and even with face pose estimation turned on it is entirely usable.

Remote inference and rendering has a lot of advantages over trying to squeeze everything into the iPad and use CoreML for inference if there is a low latency server available – 5G communications is an obvious enabler of this kind of remote inference and rendering in a wide variety of situations. Intrinsic performance of the iPad is also far less important as it is not doing anything too difficult and leaves lots of resource for other processing. The previous Unity/ARKit object detector uses a similar idea but does use more iPad resources and is not general purpose. If Unity and ARKit aren’t needed, iOSEdgeRemote with remote inference and rendering is a very powerful system.

Another nice aspect of this is that I believe that future mixed reality headsets will be very lightweight devices that neither perform complex processing in the headset (unlike the HoloLens for example) nor require cables to an external processor (unlike the Magic Leap One for example). The headset provides cameras, SLAM of some sort, displays and radios. All other complex processing will be performed remotely and video used to drive the displays. This might be the only way to enable MR headsets that can run for 8 hours or more without a recharge and be light enough (and run cool enough) to be worn for extended periods.

rt-ai Edge dynamic and adaptive parallel inference using the new Scaler SPE

One way to achieve higher video frame inference rates in situations where no state is maintained between frames is to split an incoming video stream across multiple inference pipelines. The new rt-ai Edge Scaler Stream Processing Element (SPE) does exactly that. The screen capture above shows the design and the real time performance information (in the windows on the right). The pipelines in this case are just single SPEs running single shot object detection on the Intel NCS 2. The CSSD SPE is able to process around 13 frames per second at 1280 x 720 by itself. Using the Scaler SPE to leverage two CSSD SPEs, each with one NCS 2 running on different nodes, the throughput has been doubled. In fact, performance should scale roughly linearly with the number of pipelines attached.

The Scaler SPE implements a health check function that determines the availability of pipelines at run time. Only pipelines that pass the health check are eligible to receive frames to be processed. In the example, Scaler can support eight pipelines but only two are connected (2 and 6) so only these pass the health check and receive frames. Frames from the In port are distributed across the active pipelines in a round robin fashion.
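
The distribution logic described above can be sketched as follows. This is an illustration of round robin selection over healthy pipelines, not the Scaler SPE source; the class and method names are hypothetical.

```python
class RoundRobinDistributor:
    """Hands frames to healthy pipelines in rotation."""

    def __init__(self, pipeline_count=8):
        self.pipeline_count = pipeline_count
        self.healthy = set()  # indices of pipelines that passed the health check
        self._next = 0        # where the next search starts

    def set_health(self, pipeline, ok):
        """Record the latest health check result for one pipeline."""
        if ok:
            self.healthy.add(pipeline)
        else:
            self.healthy.discard(pipeline)

    def choose(self):
        """Return the next healthy pipeline index, or None if there is none."""
        for offset in range(self.pipeline_count):
            candidate = (self._next + offset) % self.pipeline_count
            if candidate in self.healthy:
                self._next = (candidate + 1) % self.pipeline_count
                return candidate
        return None
```

With only pipelines 2 and 6 marked healthy, successive calls to choose() alternate between 2 and 6, matching the example in the text.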

Pipelines are configured with a maximum number of in-flight frames in order to maximize pipeline throughput and minimize latency. Without multiple in-flight frames, CSSD performance would be roughly halved. In fact, pipelines can have different processing throughputs – the Scaler SPE automatically adjusts pipeline usage based on achieved throughput. Result messages may be received from pipelines out of sequence and the Scaler SPE ensures that the final output stream on the Out port has been reordered correctly.
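
The in-flight limit and the reordering step can be sketched like this. It is a minimal illustration under the assumption that every frame carries a sequence number; the class names and the default limit of two are hypothetical, and a real implementation would also need a timeout so that a lost result cannot stall the output.

```python
class PipelineSlot:
    """Tracks how many frames one pipeline currently has in flight."""

    def __init__(self, max_in_flight=2):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def can_send(self):
        return self.in_flight < self.max_in_flight

    def sent(self):
        self.in_flight += 1

    def completed(self):
        self.in_flight -= 1


class ReorderBuffer:
    """Releases results in sequence order even when they arrive out of order."""

    def __init__(self, first_seq=0):
        self._next_seq = first_seq
        self._pending = {}

    def push(self, seq, result):
        """Accept one result and return the list that can now be emitted."""
        self._pending[seq] = result
        ready = []
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready
```

A faster pipeline frees its slots sooner and so is offered more frames, which is one simple way for usage to follow achieved throughput.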

The Scaler SPE can actually support any type of data (not just video) and any type of pipeline (not just inference) provided there is no retained state between messages. This makes it a very useful new building block.