The latest iPads and iPhones have some pretty serious edge neural network capabilities that are a natural fit with ARKit and Unity. AR and Unity go together quite nicely as AR provides an excellent way of communicating back to the user the results of intelligently processing sensor data from the user, other users and static (infrastructure) sensors in a space. The screen capture above was obtained from code largely based on this repo which integrates Core ML models with Unity. In this case, Inceptionv3 was used. While it isn’t perfect, it does ably demonstrate that this can be done. Getting the plugin to work was quite straightforward – you just have to include the mlmodel file in XCode via the Files -> Add Files menu option rather than dragging the file into the project. The development cycle is pretty annoying as the plugin won’t run in the Unity Editor and compile (on my old Mac Mini) is painfully slow but I guess a decent Mac would do a better job.
This all brings up the point that there seem to be different perceptions of what the edge actually is. rt-ai Edge can be perceived as a local aggregation and compute facility for inference-capable or conventional mobile and infrastructure devices (such as security cameras) – basically an edge compute facility supporting edge devices. A particular advantage of edge compute is that it is possible to integrate legacy devices (such as dumb cameras) into an AI-enhanced system by utilizing edge compute inference capabilities. In a sense, edge compute is a local mini-cloud, providing high capacity compute and inference a short distance in time away from sensors and actuators. This minimizes backhaul and latency, not to mention securing data in the local area rather than dispersing it in a cloud. It can also be very cost-effective when compared to the costs of running multiple cloud CPU instances 24/7.
Given the latest developments in tablets and smart phones, it is essential that rt-ai Edge be able to incorporate inference-capable devices into its stream processing networks. Inference-capable, per user devices make scaling very straightforward as capability increases in direct proportion to the number of users of an edge system. The normal rt-ai Edge deployment system can’t be used with mobile devices which requires (at the very least) framework apps to make use of AI models within the devices themselves. However, with that proviso, it is certainly possible to incorporate smart edge devices into edge networks with rt-ai Edge.
There’s a new TensorFlow model for image captioning available here. It combines a deep convolutional neural network (Inception-v3) with an LSTM-based decoder network. LSTM is cropping up just about everywhere now…
Yes, that is me waving my Taylor (made in San Diego 🙂 ) guitar around in a very careless manner. It’s all in a good cause though. Turns out that Inception-v3 is very good at recognizing acoustic and electric guitars. I put together a new rtndf PPE called recognize based on the code here from the TensorFlow repo.
In its simplest mode, the recognize PPE takes an incoming video stream and tries to recognize an object in the entire frame. If it finds something, it adds a label in the bottom left corner of the image and uses that to generate a new output stream. That’s ok, but what’s more interesting is when it works with another PPE, modet. modet detects moving objects in the stream and draws a box around them. It now also adds metadata to the outgoing pipeline messages that can be used by downstream PPEs to do something with the regions where motion has been detected.
recognize can work in a mode where it uses the modet metadata to recognize moving objects in the stream. The screen capture with the guitar is an example. That’s why I was waving it around – it had to be in motion to get detected and recognized. The box is that big because I am in motion too! However, Inception-v3 seems quite able to recognize the dominant object in the image segment. While there is only one recognized object in this example, if there were more regions they would be individually recognized.
Of course, the example data set for Inception-v3 only knows so many things, guitars being an example. However, something I want to use this for is to detect a UPS truck coming up the drive. I’ll probably have to try retraining the final layer to do this.
I am currently working with TensorFlow and I thought it’d be interesting to see what kind of performance I could get when processing video and trying to recognize objects with Inception-v3. While I’d like to get TensorFlow integrated with some of my Qt apps, the whole “build with Bazel” thing is holding that up right now (problems with Eigen includes – one day I’ll get back to that). As a way of taking the path of least resistance, I included TensorFlow in an inline MQTT filter written in Python. It subscribes to a video topic sourced from a webcam and outputs recognized objects in the stream.
As can be seen from the screen capture, it’s currently achieving 11 frames per second using 640 x 480 frames with a GTX 970 GPU. With a GTX 960 GPU, the rate falls to around 8 frames per second. This is pretty much what I have seen with other TensorFlow graphs – the GTX 970 is about 50% faster than a GTX 960, probably due to the restricted memory bus width on the GTX 960.
Hopefully I’ll soon have a 10 series GPU – that should be an interesting comparison.