The Python-based YOLOv3 SPE has been working for a while now but the performance was a little disappointing at 2 or 3 fps using 1280 x 720 frames on an i7 5820K CPU/GTX 1080ti GPU machine. I was interested to see how much effect the Python code was having on overall performance. To do this I implemented the C++ rt-ai SPE API and added the C version of the YOLOv3 demo code. The result is shown above and this version now runs at just over 14 fps (17 fps at 640 x 480) which is very usable.
While Python is very convenient, it is clearly (and unsurprisingly) more efficient to use C/C++, so I will probably do that where possible in the future. The main side effect is that rtaiDesigner has to deploy the correct compiled SPE for the target node (typically x64 or ARM) and also ensure that any shared libraries that are not part of the standard install are included. A Dockerized version would of course solve the dependency problem and just require a container for each target architecture.
Fresh from the success with YOLOv3 on the desktop, the obvious question was whether this could be made to work on the Movidius Neural Compute Stick (NCS) and therefore run on the Raspberry Pi.
The NCS is a neat little device and because it connects via USB, it is easy to develop on a desktop and then transfer everything needed to the Pi.
The app zoo, on the ncsdk2 branch, has a tiny_yolo_v2 implementation that I used as the basis for this. It only took about an hour to get this working on the desktop – integration with rt-ai was very easy. The Raspberry Pi end was not so easy – all kinds of version number issues and things like that. However, even though not all of the tools would compile, I just moved the compiled graph from the desktop to the Pi and that worked fine.
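For anyone curious what driving the NCS from Python looks like, this is roughly the NCSDK2 sequence for loading a compiled graph and pushing a frame through it. It's a minimal sketch rather than the actual SPE code, and the graph filename and input size are assumptions based on tiny YOLO v2:

```python
import numpy as np
from mvnc import mvncapi

# Open the first attached Neural Compute Stick.
devices = mvncapi.enumerate_devices()
device = mvncapi.Device(devices[0])
device.open()

# Load the graph produced by mvNCCompile on the desktop (filename assumed).
with open('tiny_yolo_v2.graph', 'rb') as f:
    graph_buffer = f.read()
graph = mvncapi.Graph('tiny_yolo_v2')
input_fifo, output_fifo = graph.allocate_with_fifos(device, graph_buffer)

# A preprocessed frame would go here; tiny YOLO v2 expects a 416 x 416 input.
input_tensor = np.zeros((416, 416, 3), dtype=np.float32)

# Queue the inference and read back the raw detection tensor.
graph.queue_inference_with_fifo_elem(input_fifo, output_fifo, input_tensor, None)
output, user_obj = output_fifo.read_elem()

input_fifo.destroy()
output_fifo.destroy()
graph.destroy()
device.close()
```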
This is the design. The main difference from the usual test designs is that the MYOLO SPE is assigned to node pi34 (the Raspberry Pi) rather than the desktop (Default). Assigning just the MYOLO SPE to the Pi meant that I didn't have to connect a PiCam or UVC camera to the Pi and also allowed me to get a better feel for the pure performance of the Pi with the NCS.
As can be seen from the first screen capture, it worked fine although, because it supports only a subset (20 of 91) of the usual COCO labels, it did not pick up the mouse or the keyboard. Performance-wise, it was running at about 1 fps with roughly 30% CPU usage. Just for reference, I was getting about 8 fps on the i7 desktop.
I had intended to be doing something completely different today (working on auto-compiling highlight reels of interesting events generated from the prototype production rt-ai Edge object detection system) but managed to get sidetracked by reading about Darknet-based YOLOv3. As Darknet itself is written in C and compiles to a shared library, it was a good candidate for a Dockerized stream processing element. I used a cuDNN image from NVIDIA as the base since it provides pretty much everything required – I just had to add in the rt-ai SPE library software and compile Darknet on top of that.
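For reference, the usual way to drive the Darknet shared library from Python is the ctypes wrapper (darknet.py) that ships with the repository, along these lines. The file paths are illustrative and this is not the actual SPE code:

```python
# Assumes darknet.py (the ctypes wrapper from the Darknet repo) and
# libdarknet.so are available inside the container.
import darknet

net = darknet.load_net(b"cfg/yolov3.cfg", b"yolov3.weights", 0)
meta = darknet.load_meta(b"cfg/coco.data")

# detect() returns a list of (label, confidence, (x, y, w, h)) tuples.
results = darknet.detect(net, meta, b"frame.jpg")
for label, confidence, bbox in results:
    print(label, confidence, bbox)
```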
The results are pretty good. The preview above shows some detected objects. I discovered that it could detect toothbrushes, which is why I am waving one around. It also did a good job of picking up the second mouse just by my left shoulder. 2 fps at 1280 x 720 frame size is a little disappointing, but this seems to be due to the Python parts of the code since the C demo provided with the library runs much faster. It is a little faster with the preview turned off, however (which would be the production mode anyway).
Speaking of production, it does have a problem: it consumes just over 7 GB of memory on my GTX 1080 Ti GPU card. This means that one GPU card can't run two instances simultaneously, unlike with the TensorFlow SSD detector. In fact, I can get two instances of that working on a GTX 1080 card with 8 GB of total memory.
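I haven't described how the TensorFlow SSD SPE manages its memory here, but in TensorFlow 1.x the usual way to stop a detector from grabbing the whole card, so that two instances can share one GPU, is something like the sketch below. The fraction value is just an illustration:

```python
import tensorflow as tf

# Cap this process at roughly 40% of the card's memory so a second
# detector instance can run alongside it on the same GPU.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    # The frozen SSD graph would be loaded and run here.
    pass
```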
For completeness, this is the design, which looks just like the usual test designs. The Docker container is built and pushed to a private Docker registry automatically when the design is generated. The target node then pulls the image from the registry when the design starts up.
This is the MediaView output showing the metadata. The metadata format is equivalent to that generated by the TensorFlow object detector, so the two detectors are completely interchangeable.
I decided that it would be fun to try out a Google AIY Vision Kit as a sort of warm-up for the potentially much more significant Edge TPU.
The Vision Kit is basically the same configuration as the ZeroSensor camera except with an extra board in the camera path that can perform inference on the captured images. The kit comes with some frozen graphs that can be used to detect a few things, but I thought it would be interesting to try training a MobileNet SSD network on the Pascal VOC 2012 training data, which covers 20 different object classes. The instructions for how to do this are here.
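Once the retrained graph has been compiled for the Vision Bonnet, running it from Python goes through the AIY inference API, roughly as sketched below. The model name, input shape, normalizer values and file name here are hypothetical placeholders; the real values come out of the retraining and compilation steps:

```python
from PIL import Image
from aiy.vision.inference import ImageInference, ModelDescriptor

# Describe the compiled graph for the Vision Bonnet (all values hypothetical).
with open('mobilenet_ssd_voc.binaryproto', 'rb') as f:
    compute_graph = f.read()

model = ModelDescriptor(
    name='mobilenet_ssd_voc',
    input_shape=(1, 256, 256, 3),
    input_normalizer=(128.0, 128.0),
    compute_graph=compute_graph)

with ImageInference(model) as inference:
    image = Image.open('frame.jpg')
    result = inference.run(image)
    # result holds the raw SSD output; decoding boxes and labels is model specific.
```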
Once that was all running, the next step was to integrate it with rt-ai Edge. It’s pretty similar to the earlier full-blown TensorFlow version so it didn’t take too long to get working.
The design is much the same as usual except with the new VisionKit object detection SPE instead of TFObjectDetect or Deeplab. Note that the PiCam and VisionKit SPEs are running on the AIY Vision Kit, whereas the MediaView SPE is running on a desktop.
This is the output from the MediaView SPE. The metadata has been formatted to look exactly the same as that from the previous TensorFlow detector so that the two can be used interchangeably in stream processing networks. I am getting about 2 fps with 640 x 360 images, which is actually better than I expected.
Most IP cameras, including security and surveillance cameras, support RTSP H.264 streaming so it made sense to implement a compatible stream processing element (SPE) for rt-ai Edge. The design above is a simple test design. The video stream from the camera is converted into JPEG frames using GStreamer within the SPE and then passed to the DeepLabv3 SPE. The output from DeepLabv3 is then passed to a MediaView SPE for display.
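The exact pipeline inside the SPE isn't shown here, but the general shape of an RTSP H.264 to JPEG conversion with GStreamer looks something like this. The sketch uses OpenCV's GStreamer backend for brevity (OpenCV must be built with GStreamer support) and the camera URL is a placeholder:

```python
import cv2

# rtspsrc pulls the H.264 stream, avdec_h264 decodes it and the raw frames
# are handed to OpenCV via appsink; the RTSP URL below is made up.
pipeline = (
    "rtspsrc location=rtsp://192.168.1.10:554/stream latency=200 ! "
    "rtph264depay ! h264parse ! avdec_h264 ! videoconvert ! "
    "video/x-raw,format=BGR ! appsink"
)

cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
ok, frame = cap.read()
if ok:
    ok, jpeg = cv2.imencode(".jpg", frame)  # JPEG bytes to pass downstream
```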
I have a few ONVIF/RTSP cameras around the property and the screen capture above shows the results from one of these. There’s a car sitting in its field of view that’s picked out very nicely. I am using the DeepLabv3 SPE here in its masked image mode where the output frames just consist of recognized object images and nothing else. Just for reference, this is the original frame:
Clearly the segmented image only retains what is important for later processing.
rt-xr SpaceObjects are now working very nicely. It’s easy to create, configure and delete SpaceObjects as needed using the menu switch which has been placed just above the light switch in my office model above.
The video below shows all of this in operation.
The typical process is to instantiate an object, place and size it, and then attach it to a Manifold stream if it is a Proxy Object. Persistence, sharing and collaboration work for all relevant SpaceObjects across the supported platforms (Windows and macOS desktop, Windows MR, Android and iOS).
This is a good place to leave rt-xr for the moment while I wait for the arrival of some sort of AR headset in order to support local users of an rt-xr enhanced sentient space. Unfortunately, Magic Leap won’t deliver to my zip code (sigh) so that’s that for the moment. Lots of teasers about the HoloLens 2 right now and this might be the best way to go…eventually.
Now the focus moves back to rt-ai Edge. While this is working pretty well, it needs a few bugs fixed and some production modes added (such as auto-starting SPNs when server nodes are started). Then begins the process of data collection for machine learning. ZeroSensors will collect data from each monitored room and this will be saved by ManifoldStore for later use. The idea is to classify normal and abnormal situations and also to be proactive in responding to the needs of occupants of the sentient space.
One of the goals of the rt-ai Edge system is that users can interact with it and extract value from it using whatever device they have available. Unity is a tremendous help here given that Unity apps can be run on pretty much everything. The main task was integration with Manifold so that all apps can receive and interact with everything else in the system. Manifold currently supports Windows, UWP, Linux, Android and macOS. iOS is a notable absentee and will hopefully be added at some point in the future. However, I see Android support as more significant since it also leads to support for multiple MR headsets.
The screen shot above and video below show three instances of the rt-ai viewer apps running on Windows desktop, Windows Mixed Reality and Android interacting in a shared sentient space. Ok, so the avatars are rubbish (I call them Sad Robots) but that’s just a detail and can be improved later. The wall panels are receiving sensor and video data from ZeroSensors via an rt-ai Edge stream processing network while the light switch is operated via a home automation server and Insteon.
Sharing is mediated by a SharingServer that is part of Manifold. The SharingServer uses Manifold multicast and end-to-end services to implement scalable sharing while minimizing the load on each individual device. Ultimately, the SharingServer will also supply the space definition file when a user enters a sentient space and provide details of virtual objects that may have been placed in the space by other users. This allows a new user with a standard app to enter a space and quickly create a view of the sentient space consistent with that of existing users.
While this is all kind of fun, the more interesting thing is when this is combined with a HoloLens or similar MR headset. The MR headset user in a space would see any VR users in the space represented by their avatars. Likewise, VR users in a space would see avatars representing MR users in the space. The idea is to get as close to a telepresent experience for VR users as possible without very complex setups. It would be much nicer to use Holoportation, but that would require every room in the space to have a very complex and expensive setup, which really isn't the point. The idea is to make it very easy and low cost to implement an rt-ai Edge based sentient space.
Still lots to do of course. One big thing is audio. Another is representing interaction devices (pointers, motion controllers etc) to all users. Right now, each app just sends its camera transform to the SharingServer, which then distributes it to all other users. This will be extended to include PCM audio chunks and transforms for interaction devices so that everyone will be able to create a meaningful scene. Each user will receive the audio stream from every other user. The reason for this is that each individual audio stream can then be attached to that user's avatar, giving a spatialized sound effect using Unity's audio capabilities (that's the hope anyway). Another very important point is that the apps work differently depending on whether they are running on VR type devices or AR/MR type devices. In the latter case, the walls and related objects are not drawn and only their colliders are instantiated, although virtual objects and avatars are still visible. Obviously AR/MR users want to see the real walls, light switches and so on, not the virtual representations. However, they will still be able to interact in exactly the same way as a VR user.