OpenPose does a great job of estimating the (x, y) coordinates of body points. However, in many situations, the spatial (3D) coordinates of the body joints are what’s required. To do that, the z coordinate has to be provided in some way. There are two common ways of doing that: using multiple cameras or using a depth camera. In this case, I chose to use RGBD data from a StereoLabs ZED camera. An example of the result is shown in the screen capture above and another below. Coordinates are in units of meters.
The (x, y) 2D coordinates within the image (generated by OpenPose) along with the depth information at that (x, y) point in the image are used to calculate a spatial (sx, sy, sz) coordinate with origin at the camera and defined by the camera’s orientation. The important thing is that the spatial relationship between the joints is then trivial to calculate. This can be used by downstream inference blocks to discriminate higher level motions.
Incidentally I don’t have a leprechaun sitting on my computers to the right of the first screen capture – OpenPose was picking up my reflection in the window as another person.
The ZED is able to produce a depth map or point cloud but the depth map is more practical in this case as it is necessary to transmit the data between processes (possibly on different machines). Even so, it is large and difficult to compress. The trick is to extract the meaningful data and then discard the depth information as soon as possible! The ZED camera also sends along the calibrated horizontal and vertical fields of view as these are essential to constructing (sx, sy, sz) from (x, y) and depth. Since the ZED doesn’t seem to produce a depth value for every pixel, the code samples an area around the (x, y) coordinate to evaluate a depth figure. If it fails to do this, the spatial coordinate is returned as (0, 0, 0).
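As a rough illustration of the calculation described above, here is a minimal Python sketch. It is not the actual rt-ai Edge code. It assumes a pinhole camera model with the principal point at the image center, a depth map holding distance along the optical axis in meters, invalid depth marked as NaN, infinity or a non-positive value, and the median of a small window as the depth figure. The function names, the `radius` parameter and the axis convention (+x right, +y down) are all hypothetical.

```python
import math
import numpy as np


def sample_depth(depth_map, x, y, radius=2):
    """Median of the valid depths in a window around (x, y), or None."""
    h, w = depth_map.shape
    if not (0 <= x < w and 0 <= y < h):
        return None
    # Clip the sampling window to the image bounds.
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    window = depth_map[y0:y1, x0:x1]
    # Assumption: invalid pixels are NaN, inf or <= 0.
    valid = window[np.isfinite(window) & (window > 0)]
    if valid.size == 0:
        return None
    return float(np.median(valid))


def to_spatial(x, y, depth_map, hfov_deg, vfov_deg, radius=2):
    """Convert an image point plus depth into (sx, sy, sz) in meters."""
    h, w = depth_map.shape
    xi, yi = int(round(x)), int(round(y))
    z = sample_depth(depth_map, xi, yi, radius)
    if z is None:
        # No usable depth near the point: report (0, 0, 0).
        return (0.0, 0.0, 0.0)
    # Focal lengths in pixels derived from the calibrated fields of view.
    fx = (w / 2) / math.tan(math.radians(hfov_deg) / 2)
    fy = (h / 2) / math.tan(math.radians(vfov_deg) / 2)
    sx = (x - w / 2) * z / fx
    sy = (y - h / 2) * z / fy
    return (sx, sy, z)
```

With a 1280 x 720 depth map and a 90 degree horizontal field of view, a joint at the image center with a sampled depth of 2 m comes out at (0, 0, 2), and a joint 320 pixels to the right of center at the same depth comes out about 1 m to the right.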
This is the design I ended up using. Basically a dual OpenPose pipeline with scaler as for standard OpenPose. It averaged around 16 FPS with 1280 x 720 images (24 FPS with VGA images) using JPEG for the image part and raw depth map for the depth part. Using just one pipeline achieved about 13 FPS so the speed up from the second pipeline was disappointing. I expect that this was largely due to the communications overhead of moving the depth map around between nodes. Better network interfaces might improve this.
Combining the rt-ai Edge Scaler SPE, the OpenPose GPU SPE and iOSEdgeRemote running on an iPad as a camera/display generated some pretty good results, shown in the screen capture above. Full frame rate (30 frames per second) in OpenPose Body mode was obtained running one OpenPoseGPU SPE instance on each of two nodes: Default (equipped with a GTX 1080 ti GPU) and Node110 (equipped with a GTX 1080 GPU). The Scaler SPE divided up the video stream between the two OpenPoseGPU SPEs in order to share the load between the GPUs, performing its usual reassembly and reordering to generate a complete output stream after parallel processing. Latency was not noticeable.
As another experiment, I tried to achieve the same result with just one node, Default, the GTX 1080 ti node. The resulting configuration that ran at the full 30 FPS is shown above. Three OpenPoseGPU SPEs were required to achieve 30 FPS in this case; two topped out at 27 FPS.
In addition, 22 FPS was obtained in OpenPose Body and Face mode, this time using the second node (Node110) for the OpenPoseGPU2 block. Running OpenPoseGPU2 on the Default node along with the other two did not improve performance, presumably because the GPU was saturating.
Somehow I had completely missed the fact that Apple’s TrueDepth camera isn’t just for Face ID but can be used by any app. The screen captures here come from a couple of example apps. The one above is the straight depth data being used in different ways. The one below uses the TrueDepth camera based face tracking functions in ARKit to do things like replace my face with a box. I am winking at the camera and my jaw has been suitably dropped in the image on the right.
What’s really intriguing is that this depth data could be combined with OpenPose 2D joint positions to create 3D spatial coordinates for the joints. A very useful addition to the rt-ai Edge next generation gym concept perhaps, as it would enable much better form estimation and analysis.
I wanted to use the front camera of an iPad to act as the input to OpenPose so that I could track pose in real time with the original idea being to leverage CoreML to run pose estimation on the device. There are a few iOS implementations of OpenPose (such as this one) but they are really designed for offline processing as they are pretty slow. I did try a different pose estimator that runs in real time on my iPad Pro but the estimation is not as good as OpenPose.
So the question was how to run iPad OpenPose in real time in some way – compromise was necessary! I do have an OpenPose SPE as part of rt-ai Edge that runs very nicely so an obvious solution was to run rt-ai Edge OpenPose on a server and just use the iPad as an input and output device. The nice plus of this new iOS app called iOSEdgeRemote is that it really doesn’t care what kind of remote processing is being used. Frames from the camera are sent to an rt-ai Edge Conductor connected to an OpenPose pipeline.
The rt-ai Edge design for this test is shown above. The pipeline optionally annotates the video and returns that and the pose metadata to the iPad for display. However, the pipeline could be doing anything provided it returns some sort of video back to the iPad.
The results are shown in the screen captures above. Using a GTX 1080 ti GPU, I was getting around 19fps with just body pose processing turned on and around 9fps with face pose also turned on. Latency is not noticeable with body pose estimation and even with face pose estimation turned on it is entirely usable.
Remote inference and rendering has a lot of advantages over trying to squeeze everything into the iPad and use CoreML for inference if there is a low latency server available – 5G communications is an obvious enabler of this kind of remote inference and rendering in a wide variety of situations. Intrinsic performance of the iPad is also far less important as it is not doing anything too difficult and leaves lots of resource for other processing. The previous Unity/ARKit object detector uses a similar idea but does use more iPad resources and is not general purpose. If Unity and ARKit aren’t needed, iOSEdgeRemote with remote inference and rendering is a very powerful system.
Another nice aspect of this is that I believe that future mixed reality headsets will be very lightweight devices that avoid both complex processing in the headset (unlike the HoloLens for example) and cables to an external processor (unlike the Magic Leap One for example). The headset provides cameras, SLAM of some sort, displays and radios. All other complex processing will be performed remotely and video used to drive the displays. This might be the only way to enable MR headsets that can run for 8 hours or more without a recharge and be light enough (and run cool enough) to be worn for extended periods.
One way to achieve higher video frame inference rates in situations where no state is maintained between frames is to split an incoming video stream across multiple inference pipelines. The new rt-ai Edge Scaler Stream Processing Element (SPE) does exactly that. The screen capture above shows the design and the real time performance information (in the windows on the right). The pipelines in this case are just single SPEs running single shot object detection on the Intel NCS 2. The CSSD SPE is able to process around 13 1280 x 720 frames per second by itself. Using the Scaler SPE to leverage two CSSD SPEs, each with one NCS 2 running on different nodes, the throughput has been doubled. In fact, performance should scale roughly linearly with the number of pipelines attached.
The Scaler SPE implements a health check function that determines the availability of pipelines at run time. Only pipelines that pass the health check are eligible to receive frames to be processed. In the example, Scaler can support eight pipelines but only two are connected (2 and 6) so only these pass the health check and receive frames. Frames from the In port are distributed across the active pipelines in a round robin fashion.
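The distribution rule described above can be sketched in a few lines of Python. This is only an illustration of round robin selection over healthy pipelines, not the Scaler SPE's actual code. The class and method names are hypothetical, and how the health check itself is performed is left out.

```python
class RoundRobinDistributor:
    """Pick the next healthy pipeline port in round robin order."""

    def __init__(self, num_ports=8):
        # Every port starts as unavailable until a health check passes.
        self.healthy = [False] * num_ports
        self._next = 0

    def set_health(self, port, ok):
        """Record the latest health check result for a port."""
        self.healthy[port] = ok

    def pick(self):
        """Return the next healthy port, or None if none are available."""
        n = len(self.healthy)
        for offset in range(n):
            port = (self._next + offset) % n
            if self.healthy[port]:
                # Resume the search just after the port chosen this time.
                self._next = (port + 1) % n
                return port
        return None


scaler = RoundRobinDistributor(num_ports=8)
scaler.set_health(2, True)
scaler.set_health(6, True)
ports = [scaler.pick() for _ in range(5)]  # alternates between 2 and 6
```

With only ports 2 and 6 passing the health check, as in the example, successive frames alternate between those two pipelines and the other six ports are never selected.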
Pipelines are configured with a maximum number of in-flight frames in order to maximize pipeline throughput and minimize latency. Without multiple in-flight frames, CSSD performance would be roughly halved. In fact, pipelines can have different processing throughputs – the Scaler SPE automatically adjusts pipeline usage based on achieved throughput. Result messages may be received from pipelines out of sequence and the Scaler SPE ensures that the final output stream on the Out port has been reordered correctly.
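One common way to restore ordering when results arrive out of sequence is a small reorder buffer keyed on a per-frame sequence number. The Python sketch below shows that general technique and is not taken from the Scaler SPE. It assumes each frame is tagged with a consecutive integer sequence number when it is dispatched. The class name and `push` method are hypothetical.

```python
class ReorderBuffer:
    """Release results strictly in sequence number order."""

    def __init__(self, first_seq=0):
        self._expected = first_seq  # next sequence number to release
        self._pending = {}          # early results waiting for their turn

    def push(self, seq, result):
        """Store a result and return any results that can now be released."""
        self._pending[seq] = result
        released = []
        # Drain every consecutive result starting at the expected number.
        while self._expected in self._pending:
            released.append(self._pending.pop(self._expected))
            self._expected += 1
        return released


buf = ReorderBuffer()
out = []
# Results arrive as 1, 0, 3, 2 but are released as 0, 1, 2, 3.
for seq, result in [(1, "b"), (0, "a"), (3, "d"), (2, "c")]:
    out.extend(buf.push(seq, result))
```

A production version would also need some policy for frames that never come back, such as a timeout that skips a missing sequence number, which this sketch leaves out.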
The Scaler SPE can actually support any type of data (not just video) and any type of pipeline (not just inference) provided there is no retained state between messages. This makes it a very useful new building block.
As a thought experiment, I considered how rt-ai Edge could be used to implement a next generation gym. The thought was sparked by Orangetheory who make nice use of technology to enhance the gym experience. The question was: where next? My answer is here: rt-ai smart gym. It would be fun to implement some of these ideas!
Following on from the GPU version, I now have OpenPose running in an Intel NCS 2 Stream Processing Element, as shown in the screen capture above. This wasn’t too hard as it is based on an Intel sample and model. The metadata format is consistent with the GPU version (apart from the lack of support for face and hand pose estimation) but that’s fine for a lot of applications.
This is the familiar simple test design. The OpenPoseVINO SPE is running at about 3fps on 1280 x 720 video using an NCS 2 (the GPU version with a GTX 1080ti gets about 17fps in body pose only mode). The current SPE inherited a blocking inference OpenVINO call from the demo rather than an asynchronous inference call – this needs to be changed to be similar to the technique used by the SSD version so that the full capabilities of multiple NCS 2s can be utilized for body pose estimation.