It wasn’t too hard to go from the inline rt-ai Edge Stream Processing Element using the Coral Edge TPU accelerator to an embedded version running on a Raspberry Pi 3 Model B with Pi camera. The rt-ai Edge test design for this SPE is pretty simple again:
As can be seen, the Pi + Coral runs at about 4 fps with 1280 x 720 frames which is not too bad at all. In this example, I am running the PiCoral camera SPE on the Raspberry Pi node (Pi7) and the View SPE on the Default node (an i7 Ubuntu machine). Also, I’m using the combined video and metadata output which contains both the detection data and the associated JPEG video frame. However, the PiCoral SPE also has a metadata-only output. This contains all the frame information and detection data (scores, boxes etc) but not the JPEG frame itself. This can be useful for a couple of reasons. First, especially if the Raspberry Pi is connected via WiFi, transmitting the JPEGs can be a bit onerous and, if they are not needed, very wasteful. Secondly, it satisfies a potential privacy issue in that the raw video data never leaves the Raspberry Pi. Provided the metadata contains enough information for useful downstream processing, this can be a very efficient way to configure a system.
A Coral USB Accelerator turned up yesterday so of course it had to be integrated with rt-ai Edge to see what it could do. Creating a Python-based SPE from the object detection demo in the API download didn’t take too long. I used the MobileNet SSD v2 COCO model as a starting point to generate this example output:
The very basic rt-ai Edge test design looks like this:
Using 1280 x 720 video frames from the webcam, I was getting around 2 frames per second from the CoralSSD SPE. This isn’t as good as the Intel NCS 2 SPE but that is a compiled C++ SPE whereas the Coral SPE is a Python 3 SPE. I haven’t found a C++ API spec for the Edge TPU as yet. Perhaps by investigating the SWIG-generated Python interface I could link the compiled libraries directly but that’s for another day…
The problem with depth maps for video is that the depth data is very large and can’t be compressed easily. I had previously run OpenPose at 30 FPS using an iPad Pro and remote inference but that was just for the standard OpenPose (x, y) coordinate output. There’s no way that 30 FPS could be achieved by sending out TrueDepth depth maps with each frame. Instead, the depth processing has to be handled locally on the iPad – the depth map never leaves the device.
The screen capture above shows the system running at 30 FPS. I had to turn a lot of lights on in the office – the frame rate from the iPad camera will drop below 30 FPS if it is too dark which messes up the data!
This is the design. It is the triple scaled OpenPoseGPU design used previously. iOSOpenPose connects to the Conductor via a websocket connection that is used to send images to and receive processed images from the pipeline.
One issue is that each image frame has its own depth map and that’s the one that has to be used to convert the OpenPose (x, y) coordinates into spatial (x, y, z) distances. The solution, in a new app called iOSOpenPose, is to cache the depth maps locally and re-associate them with the processed images when they return. Each image and depth frame is marked with a unique incrementing index to assist with this. Incidentally, this is why I love using JSON for this kind of work – it is possible to add non-standard fields at any point and they will be carried transparently to their destination.
Empirically with my current setup, there is a six frame processing lag which is not too bad. It would probably be better with the dual scaled pipeline, two node design that more easily handles 30 FPS but I did not try that. Another issue is that the processing pipeline can validly lose image frames if it can’t keep up with the offered rate. The depth map cache management software has to take care of all of the nasty details like this and other real-world effects.
OpenPose does a great job of estimating the (x, y) coordinates of body points. However, in many situations, the spatial (3D) coordinates of the body joints is what’s required. To do that, the z coordinate has to be provided in some way. There are two common ways of doing that: using multiple cameras or using a depth camera. In this case, I chose using RGBD data from a StereoLabs ZED camera. An example of the result is shown in the screen capture above and another below. Coordinates are in units of meters.
The (x, y) 2D coordinates within the image (generated by OpenPose) along with the depth information at that (x, y) point in the image are used to calculate a spatial (sx, sy, sz) coordinate with origin at the camera and defined by the camera’s orientation. The important thing is that the spatial relationship between the joints is then trivial to calculate. This can be used by downstream inference blocks to discriminate higher level motions.
Incidentally I don’t have a leprechaun sitting on my computers to the right of the first screen capture – OpenPose was picking up my reflection in the window as another person.
The ZED is able to produce a depth map or point cloud but the depth map is more practical in this case as it necessary to transmit the data between processes (possibly on different machines). Even so, it is large and difficult to compress. The trick is to extract the meaningful data and then discard the depth information as soon as possible! The ZED camera also sends along the calibrated horizontal and vertical fields of view as this is essential to constructing (sx, sy, sz) from (x, y) and depth. Since the ZED doesn’t seem to produce a depth value for every pixel, the code samples an area around the (x, y) coordinate to evaluate a depth figure. If it fails to do this, the spatial coordinate is returned as (0, 0, 0).
This is the design I ended up using. Basically a dual OpenPose pipeline with scaler as for standard OpenPose. It averaged around 16 FPS with 1280 x 720 images (24 FPS with VGA images) using JPEG for the image part and raw depth map for the depth part. Using just one pipeline achieved about 13 FPS so the speed up from the second pipeline was disappointing. I expect that this was largely due to the communications overhead of moving the depth map around between nodes. Better network interfaces might improve this.
One way to achieve higher video frame inference rates in situations where no state is maintained between frames is to split an incoming video stream across multiple inference pipelines. The new rt-ai Edge Scaler Stream Processing Element (SPE) does exactly that. The screen capture above shows the design and the real time performance information (in the windows on the right). The pipelines in this case are just single SPEs running single shot object detection on the Intel NCS 2. The CSSD SPE is able to process around 13 1280 x 720 frames per second by itself. Using the Scaler SPE to leverage two CSSD SPEs, each with one NCS 2 running on different nodes, the throughput has been doubled. In fact, performance should scale roughly linearly with the number of pipelines attached.
The Scaler SPE implements a health check function that determines the availability of pipelines at run time. Only pipelines that pass the health check are eligible to receive frames to be processed. In the example, Scaler can support eight pipelines but only two are connected (2 and 6) so only these pass the health check and receive frames. Frames from the In port are distributed across the active pipelines in a round robin fashion.
Pipelines are configured with a maximum to the number of in-flight frames in order to maximize pipeline throughput and minimize latency. Without multiple in-flight frames, CSSD performance would be roughly halved. In fact, pipelines can have different processing throughputs – the Scaler SPE automatically adjusts pipeline usage based on achieved throughput. Result messages may be received from pipelines out of sequence and the Scaler SPE ensures that the final output stream on the Out port has been reordered correctly.
The Scaler SPE can actually support any type of data (not just video) and any type of pipeline (not just inference) provided there is no retained state between messages. This makes it a very useful new building block.
Following on from the previous post, I thought that it would fun to try adding depth information to the detected objects using surface planes constructed by ARKit. The results are not at all bad. ARKit didn’t always detect the vertical planes correctly but horizontal ones seemed pretty reliable. I just used Unity AR Foundation‘s ray casting function at the center of the detected object to get a depth indication. Of course this is really the distance to the nearest horizontal or vertical plane so it isn’t perfect.
In the end, there’s no replacement for mobile devices with proper depth sensing cameras. Even though Tango didn’t make it, it would be nice to think that real depth sensing could become mainstream one day.
As I had discovered, one Neural Compute Stick 2 (NCS 2) has pretty decent throughput. The question then is: what happens if you connect more than one of these to the same machine? I only have one NCS 2 and one of the older NCS devices to test this out but that combination worked ok with some tuning. OpenVINO manages allocation of requests to physical devices so there is no explicit way for this to be controlled via the API. However, it appears that multiple SPEs on the same node can be supported as then the NCSs are divided up between the SPEs. A reset error message is typically emitted but then everything seems to work fine.
To get the best performance, I ran in async mode using multiple ExecutableNetwork/InferRequest pairs, with the actual number being configurable from the rtaiDesigner GUI. In this case, 5 pairs gave the best results. The throughput is around 18 frames per second running ssd_mobilenet_v2_coco object detection.
Using one NCS at a time, the NCS 2 was able to process 12 frames per second (versus 9 frames per second in synchronous mode using the original SPE code) while the older NCS was able to process 6 frames per second, suggesting that both were being fully utilized.
Now I need to get a second NCS 2…