I wanted to use the front camera of an iPad to act as the input to OpenPose so that I could track pose in real time with the original idea being to leverage CoreML to run pose estimation on the device. There are a few iOS implementations of OpenPose (such as this one) but they are really designed for offline processing as they are pretty slow. I did try a different pose estimator that runs in real time on my iPad Pro but the estimation is not as good as OpenPose.
So the question was how to run iPad OpenPose in real time in some way – compromise was necessary! I do have an OpenPose SPE as part of rt-ai Edge that runs very nicely so an obvious solution was to run rt-ai Edge OpenPose on a server and just use the iPad as an input and output device. The nice plus of this new iOS app called iOSEdgeRemote is that it really doesn’t care what kind of remote processing is being used. Frames from the camera are sent to an rt-ai Edge Conductor connected to an OpenPose pipeline.
The rt-ai Edge design for this test is shown above. The pipeline optionally annotates the video and returns that and the pose metadata to the iPad for display. However, the pipeline could be doing anything provided it returns some sort of video back to the iPad.
The results are show in the screen captures above. Using a GTX 1080 ti GPU, I was getting around 19fps with just body pose processing turned on and around 9fps with face pose also turned on. Latency is not noticeable with body pose estimation and even with face pose estimation turned on it is entirely usable.
Remote inference and rendering has a lot of advantages over trying to squeeze everything into the iPad and use CoreML for inference if there is a low latency server available – 5G communications is an obvious enabler of this kind of remote inference and rendering in a wide variety of situations. Intrinsic performance of the iPad is also far less important as it is not doing anything too difficult and leaves lots of resource for other processing. The previous Unity/ARKit object detector uses a similar idea but does use more iPad resources and is not general purpose. If Unity and ARKit aren’t needed, iOSEdgeRemote with remote inference and rendering is a very powerful system.
Another nice aspect of this is that I believe that future mixed reality headset will be very lightweight devices that avoid complex processing in the headset (unlike the HoloLens for example) or require cables to an external processor (unlike the Magic Leap One for example). The headset provides cameras, SLAM of some sort, displays and radios. All other complex processing will be performed remotely and video used to drive the displays. This might be the only way to enable MR headsets that can run for 8 hours or more without a recharge and be light enough (and run cool enough) to be worn for extended periods.
One way to achieve higher video frame inference rates in situations where no state is maintained between frames is to split an incoming video stream across multiple inference pipelines. The new rt-ai Edge Scaler Stream Processing Element (SPE) does exactly that. The screen capture above shows the design and the real time performance information (in the windows on the right). The pipelines in this case are just single SPEs running single shot object detection on the Intel NCS 2. The CSSD SPE is able to process around 13 1280 x 720 frames per second by itself. Using the Scaler SPE to leverage two CSSD SPEs, each with one NCS 2 running on different nodes, the throughput has been doubled. In fact, performance should scale roughly linearly with the number of pipelines attached.
The Scaler SPE implements a health check function that determines the availability of pipelines at run time. Only pipelines that pass the health check are eligible to receive frames to be processed. In the example, Scaler can support eight pipelines but only two are connected (2 and 6) so only these pass the health check and receive frames. Frames from the In port are distributed across the active pipelines in a round robin fashion.
Pipelines are configured with a maximum to the number of in-flight frames in order to maximize pipeline throughput and minimize latency. Without multiple in-flight frames, CSSD performance would be roughly halved. In fact, pipelines can have different processing throughputs – the Scaler SPE automatically adjusts pipeline usage based on achieved throughput. Result messages may be received from pipelines out of sequence and the Scaler SPE ensures that the final output stream on the Out port has been reordered correctly.
The Scaler SPE can actually support any type of data (not just video) and any type of pipeline (not just inference) provided there is no retained state between messages. This makes it a very useful new building block.
As a thought experiment, I considered how rt-ai Edge could be used to implement a next generation gym. The thought was sparked by Orangetheory who make nice use of technology to enhance the gym experience. The question was: where next? My answer is here: rt-ai smart gym. It would be fun to implement some of these ideas!
Following on from the GPU version, I now have OpenPose running in an Intel NCS 2 Stream Processing Element, as shown in the screen capture above. This wasn’t too hard as it is based on an Intel sample and model. The metadata format is consistent with the GPU version (apart from the lack of support for face and hand pose estimation) but that’s fine for a lot of applications.
This is the familiar simple test design. The OpenPoseVINO SPE is running at about 3fps on 1280 x 720 video using an NCS 2 (the GPU version with a GTX 1080ti gets about 17fps in body pose only mode). The current SPE inherited a blocking inference OpenVINO call from the demo rather than an asynchronous inference call – this needs to be changed to be similar to the technique used by the SSD version so that the full capabilities of multiple NCS 2s can be utilized for body pose estimation.
One of the more interesting pieces of open source software for video processing is OpenPose. I used this code as the basis of a new OpenPoseGPU Stream Processing Element for rt-ai Edge and the results can be seen in the screen capture. The metadata produce can be seen partially on the right hand side – it is pretty extensive as it contains all of the detected key points, depending on whether face and hand processing is enabled.
This version is x86/NVIDIA GPU based. The next thing to do is to get the equivalent working with the Intel NCS 2, based on this example, and then compare performance to see if the NCS 2 is practical for applications needing specific frame rates. The goal is to generate metadata that can be used to train a deep neural net to recognize specific activities. This could be used to create a Stream Processing Network that generates high level metadata about what users in the view of the camera are doing. This is in turn could be used to generate feedback to users, generate alerts on anomalous behavior etc.
It seems that the problem preventing YOLOv3 working on the NCS 2 has been fixed in the latest OpenVINO version (2018.5.445) – thanks to a commenter on the previous post for pointing that out. The screen capture above was obtained using the supplied Python demo running on 1920 x 1080 webcam video.
Following on from the previous post, I thought that it would fun to try adding depth information to the detected objects using surface planes constructed by ARKit. The results are not at all bad. ARKit didn’t always detect the vertical planes correctly but horizontal ones seemed pretty reliable. I just used Unity AR Foundation‘s ray casting function at the center of the detected object to get a depth indication. Of course this is really the distance to the nearest horizontal or vertical plane so it isn’t perfect.
In the end, there’s no replacement for mobile devices with proper depth sensing cameras. Even though Tango didn’t make it, it would be nice to think that real depth sensing could become mainstream one day.