One way to achieve higher video frame inference rates in situations where no state is maintained between frames is to split an incoming video stream across multiple inference pipelines. The new rt-ai Edge Scaler Stream Processing Element (SPE) does exactly that. The screen capture above shows the design and the real time performance information (in the windows on the right). The pipelines in this case are just single SPEs running single shot object detection on the Intel NCS 2. The CSSD SPE is able to process around 13 frames per second of 1280 x 720 video by itself. Using the Scaler SPE to leverage two CSSD SPEs, each with one NCS 2 running on different nodes, the throughput has been doubled. In fact, performance should scale roughly linearly with the number of pipelines attached.
The Scaler SPE implements a health check function that determines the availability of pipelines at run time. Only pipelines that pass the health check are eligible to receive frames to be processed. In the example, Scaler can support eight pipelines but only two are connected (2 and 6) so only these pass the health check and receive frames. Frames from the In port are distributed across the active pipelines in a round robin fashion.
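The distribution rule described above can be sketched in a few lines of Python. This is only an illustration of round robin over healthy slots, not the Scaler SPE's actual code: the class name `RoundRobinDistributor` and its methods are hypothetical, and the health check is reduced to a flag that something else is assumed to set.

```python
from typing import List, Optional


class RoundRobinDistributor:
    """Cycle through pipeline slots, skipping any that fail the health check."""

    def __init__(self, num_slots: int = 8) -> None:
        # One flag per slot; assumed to be updated by a separate health check.
        self.healthy: List[bool] = [False] * num_slots
        self._next = 0  # slot at which the next search starts

    def set_health(self, slot: int, ok: bool) -> None:
        self.healthy[slot] = ok

    def pick(self) -> Optional[int]:
        """Return the next healthy slot index, or None if none are available."""
        n = len(self.healthy)
        for offset in range(n):
            slot = (self._next + offset) % n
            if self.healthy[slot]:
                self._next = (slot + 1) % n
                return slot
        return None


# Eight slots, with only pipelines 2 and 6 connected, as in the example.
dist = RoundRobinDistributor(num_slots=8)
dist.set_health(2, True)
dist.set_health(6, True)
assignments = [dist.pick() for _ in range(6)]
print(assignments)  # [2, 6, 2, 6, 2, 6]
```

Frames simply alternate between the two slots that pass the check. If another pipeline came online, it would join the rotation on the next pass.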
Pipelines are configured with a maximum number of in-flight frames in order to maximize pipeline throughput and minimize latency. Without multiple in-flight frames, CSSD performance would be roughly halved. Pipelines can also have different processing throughputs: the Scaler SPE automatically adjusts pipeline usage based on achieved throughput. Result messages may be received from pipelines out of sequence, so the Scaler SPE reorders them to ensure that the final output stream on the Out port is in the correct sequence.
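One common way to restore ordering when results come back out of sequence is a small reorder buffer keyed by sequence number. The sketch below assumes each frame is tagged with an incrementing sequence number when it is sent to a pipeline. The `ReorderBuffer` class is hypothetical and is not taken from the Scaler SPE source.

```python
from typing import Any, Dict, List


class ReorderBuffer:
    """Hold out-of-order results until all earlier ones have arrived."""

    def __init__(self) -> None:
        self._pending: Dict[int, Any] = {}  # results waiting for predecessors
        self._next_seq = 0                  # next sequence number to release

    def push(self, seq: int, result: Any) -> List[Any]:
        """Store a result and return every result that can now be released in order."""
        self._pending[seq] = result
        released: List[Any] = []
        while self._next_seq in self._pending:
            released.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return released


# Results arrive in the order 1, 0, 3, 2, 4 from pipelines of differing speed.
buf = ReorderBuffer()
output: List[str] = []
for seq in [1, 0, 3, 2, 4]:
    output.extend(buf.push(seq, f"frame-{seq}"))
print(output)  # ['frame-0', 'frame-1', 'frame-2', 'frame-3', 'frame-4']
```

A real implementation would also need a policy for frames that never come back (for example, a timeout), which this sketch leaves out.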
The Scaler SPE can actually support any type of data (not just video) and any type of pipeline (not just inference) provided there is no retained state between messages. This makes it a very useful new building block.
Following on from the GPU version, I now have OpenPose running in an Intel NCS 2 Stream Processing Element, as shown in the screen capture above. This wasn’t too hard as it is based on an Intel sample and model. The metadata format is consistent with the GPU version (apart from the lack of support for face and hand pose estimation) but that’s fine for a lot of applications.
This is the familiar simple test design. The OpenPoseVINO SPE is running at about 3fps on 1280 x 720 video using an NCS 2 (the GPU version with a GTX 1080ti gets about 17fps in body pose only mode). The current SPE inherited a blocking inference OpenVINO call from the demo rather than an asynchronous inference call – this needs to be changed to be similar to the technique used by the SSD version so that the full capabilities of multiple NCS 2s can be utilized for body pose estimation.
It seems that the problem preventing YOLOv3 working on the NCS 2 has been fixed in the latest OpenVINO version (2018.5.445) – thanks to a commenter on the previous post for pointing that out. The screen capture above was obtained using the supplied Python demo running on 1920 x 1080 webcam video.
One of the issues with the GPU-based CYOLO (for example) is that it uses about 8GB of GPU memory meaning that, even on a GTX 1080 ti GPU card, it is only possible to have one instance of the CYOLO SPE on any one GPU card. A way around this is to run multiple streams through a single SPE instance. The architecture of rt-ai Edge always supported fan in (i.e. stream multiplexing) but not fan out (i.e. stream demultiplexing). The new FanOut module solves this problem. The screen capture above shows the new FanOut SPE running with the Intel NCS 2-based CSSD SPE. Video streams from three cameras are multiplexed on the CSSD SPE’s input pin. The multiplexed output is then passed to the FanOut SPE which demultiplexes the composite stream to up to eight individual streams. The screen capture also shows the FanOut configuration dialog – you just enter the source SPE name for the stream to be associated with each output pin.
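The demultiplexing step can be pictured as a lookup from source SPE name to output pin. The following is a minimal sketch under the assumption that every message in the composite stream carries the name of the SPE that originated it. The function names, camera names and message layout are all made up for illustration and are not the rt-ai Edge API.

```python
from typing import Any, Dict, List, Tuple


def build_routes(pin_sources: List[str]) -> Dict[str, int]:
    """Map each configured source name to its output pin (empty string = unused pin)."""
    return {name: pin for pin, name in enumerate(pin_sources) if name}


def fan_out(messages: List[Tuple[str, Any]],
            routes: Dict[str, int],
            num_pins: int = 8) -> List[List[Any]]:
    """Split a composite stream into per-pin streams, dropping unknown sources."""
    outputs: List[List[Any]] = [[] for _ in range(num_pins)]
    for source, payload in messages:
        pin = routes.get(source)
        if pin is not None:
            outputs[pin].append(payload)
    return outputs


# Hypothetical configuration: three cameras assigned to the first three pins.
pin_sources = ["CamA", "CamB", "CamC", "", "", "", "", ""]
routes = build_routes(pin_sources)

composite = [("CamA", "a0"), ("CamB", "b0"), ("CamC", "c0"),
             ("CamA", "a1"), ("CamX", "x0")]  # CamX is not configured
pins = fan_out(composite, routes)
print(pins[0], pins[1], pins[2])  # ['a0', 'a1'] ['b0'] ['c0']
```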
Since my second NCS 2 has arrived I was able to run the triple NCS configuration shown above. The old NCS didn’t really contribute much in this case – the two NCS 2s were able to get an aggregate throughput of around 26 frames per second. This is shared between the three input streams of course.
The fan in/fan out multiplexing idea fits very well with the NCS 2 as you can just add more NCS 2s (or more likely, a special purpose multiple Myriad X board) to a node to increase aggregate throughput.
I wanted a small and portable rt-ai Edge node using the Neural Compute Stick for demos and decided to base it on a Gigabyte BRi7H-8550 compact PC as it is the lowest cost, smallest footprint device that I could find with a decent i7 CPU. This is fitted with 16GB of DDR4 DRAM and a 256GB NVMe M.2 disk. Previously I needed a mini ITX board along with a GPU, which is much bigger and heavier, as can be seen below.
The node is running Ubuntu 16.04 along with standard rt-ai node management software and performs very nicely. A second NCS can be fitted on the front USB port and a small USB hub could be used if more than two are required. For demo purposes, a Windows or Ubuntu laptop runs rtaiDesigner for GUI-based control and status with the node acting as a headless inference server.
While this is primarily intended as a demo device, it would actually be quite a nice embedded inference node.
As I had discovered, one Neural Compute Stick 2 (NCS 2) has pretty decent throughput. The question then is: what happens if you connect more than one of these to the same machine? I only have one NCS 2 and one of the older NCS devices to test this out but that combination worked ok with some tuning. OpenVINO manages allocation of requests to physical devices so there is no explicit way for this to be controlled via the API. However, it appears that multiple SPEs on the same node can be supported as then the NCSs are divided up between the SPEs. A reset error message is typically emitted but then everything seems to work fine.
To get the best performance, I ran in async mode using multiple ExecutableNetwork/InferRequest pairs, with the actual number being configurable from the rtaiDesigner GUI. In this case, 5 pairs gave the best results. The throughput is around 18 frames per second running ssd_mobilenet_v2_coco object detection.
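To see why keeping several requests in flight helps, here is a toy simulation in plain Python. It deliberately does not use the OpenVINO API. `fake_infer` stands in for a device call with a fixed latency, and a thread pool with `num_requests` workers stands in for the pool of ExecutableNetwork/InferRequest pairs. All names and the 50 ms latency are assumptions for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_infer(frame_id: int, latency_s: float = 0.05) -> int:
    """Stand-in for one inference request; sleeps to mimic device latency."""
    time.sleep(latency_s)
    return frame_id


def run(num_requests: int, num_frames: int = 20) -> float:
    """Process frames with at most num_requests in flight; return frames per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_requests) as pool:
        # pool.map returns results in submission order, whatever the finish order.
        results = list(pool.map(fake_infer, range(num_frames)))
    elapsed = time.perf_counter() - start
    assert results == list(range(num_frames))
    return num_frames / elapsed


sync_fps = run(num_requests=1)   # one request at a time (blocking style)
async_fps = run(num_requests=5)  # five requests in flight
print(f"1 request: {sync_fps:.1f} fps, 5 requests: {async_fps:.1f} fps")
```

In this toy model throughput keeps rising with the number of workers because `sleep` has unlimited parallelism. On real hardware the gain levels off once the device itself is saturated, which is presumably why a particular pair count (5 here) turned out to be best.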
Using one NCS at a time, the NCS 2 was able to process 12 frames per second (versus 9 frames per second in synchronous mode using the original SPE code) while the older NCS was able to process 6 frames per second. The combined figure of around 18 frames per second matches the sum of the two individual rates (12 + 6), suggesting that both were being fully utilized.
Now I need to get a second NCS 2…