Integrating SHAPE with rt-ai: adding AI to highly augmented spaces

A key feature of SHAPE is its ability to leverage the power of external servers in order to enhance the AR experience. The idea of combining relatively simple and cheap AR headsets with low latency communications links (such as 5G wireless) to edge servers is what is driving SHAPE’s architecture. Giving SHAPE access to rt-ai edge systems is a first example of this in action.

The screen capture above gives an idea of the current state of SHAPE development. This was taken using an iPad Pro running the iOS SHAPE app. The polygons with red edges are the planes that have been detected by ARKit. At the bottom right the monitor shows the same app running on a Mac (in the Unity editor in this case). The macOS version greatly speeds development of everything other than ARKit-related functionality – especially space synchronization functions (e.g. adding, moving, modifying or deleting object actions that need to be shared between all SHAPE users in the same space). The Unity iOS SHAPE app uses the ARFoundation API to, amongst other things,  load and save ARWorldMaps in order to synchronize spatial locations between SHAPE app instances. ARWorldMaps are persisted by the CoreUniverse components and cached for real-time use by EdgeSpace components, one EdgeSpace per physical “room”. SHAPE apps physically entering the room receive the latest map along with the space definition for that room. This includes the directory of augmentation objects with metadata that allows them all to be downloaded from asset servers (unless already cached) and then positioned correctly in the physical space and connected to the appropriate external function servers.

Augmentation objects can be moved around the space manually by touching the object with three or more fingers – sounds awful but it does work. It can then be dragged around the screen and the screen can be moved around to position the objects in space. Touching the object with two fingers brings up the object menu for that instance. This allows the object to be deleted, resized or rotated. It also allows the object to be stuck to a wall or stuck to the floor. in this context, a wall is an ARKit vertical plane, a floor is an ARKit horizontal plane so the object could easily be placed on a table if a suitable plane has been detected. If not, it can be placed manually. All of these object changes are sent to the room’s EdgeSpace (via EdgeAccess) and shared between other users in the space to keep everything synchronized. In addition, updates are sent to CoreUniverse for persistence. These become integrated into the persistent space definition for the room which EdgeSpace instances receive on a regular basis from CoreUniverse (primary and backup). Now this creates an interesting race condition since EdgeSpace is modifying its cached space definition in real-time and it may take a while for the CoreUniverse version to catch up. This problem is handled using timestamps attached to updates so that EdgeSpace can correctly integrate new information from CoreUniverse (such a new object instantiated by a space design tool) while ignoring stale updates for existing objects.

The box with big “M”s is the menu object. Each room has one and it can be placed anywhere convenient in the room. You can click on it (well touch it actually if using an iPad touch screen) and this pops up a menu that allows the user to add augmentation objects. Right now this is just working for the infamous analog clock but will eventually present a catalog of available models with thumbnails. The analog clocks are proxy objects and being driven by an external analog clock server. Obviously it is trivial to implement this purely in the Unity app but it is meant as a simple test of the proxy object concept. The next proxy object to be added will be the sticky note object from rt-xr and then probably the rt-xr shared whiteboard.

Getting back to rt-ai integration, the rt-ai design above shows the simple test design that receives captured frames from the iPad’s rear camera. The frame rate is limited to 5fps so as not to load the WiFi link too much. For simplicity and low latency motion jpegs are used for this but of course compressed video could be used (and probably will be in the future). The new rt-ai SPE called SHAPEConductor looks to the SHAPE system like a SHAPE function server while mapping received messages into and out of an rt-ai stream processing network. In this case, the video is simply being passed through DeepLab to perform semantic segmentation and then the results displayed:

Here it is picking up the monitor running the macOS SHAPE app. In practice, more complex processing would be performed and results returned to proxy objects via the SHAPEConductor module and the SHAPE network.

One interesting application for this is to use the captured frames to recognize the physical space and automatically load the correct saved ARWorldMap for that physical space into the SHAPE app and instantiate all the appropriate augmentation objects, correctly located. Another would be to perform semantic segmentation and return the results to the SHAPE app so that it can be married to depth data and allow real time occlusion to be performed. ARKit 3 will do this on-device for people but apparently not in general. Offloading the segmentation should allow for a lot more flexibility, albeit with increased latency, and work on lower capability devices.

The SHAPE rt-ai integration is very much a work in progress and it will be fun to see what can be achieved with this combination.

Introducing SHAPE: Scalable Highly Augmented Physical Environment

This screenshot is an example of a  virtual environment augmented with proxy objects created using rt-xr. However, this was always intended to be a VR precursor for an AR solution now called SHAPE – Scalable Highly Augmented Physical Environment. The difference is that the virtual objects being used to augment the virtual environment shown above (such as whiteboards, status displays, sticky notes, camera screens and other static virtual objects in this case) are used to augment real physical environments with a primary focus on scalability and local collaboration for physically present occupants. The intent is to open source SHAPE in the hope that others might like to contribute to the framework and/or contribute virtual objects to the object library.

Some of the features of SHAPE are:

  • SHAPEs are designed for collaboration. Multiple AR device users, present in the same space are able to interact with virtual objects just like real objects with consistent state maintained for all users.
  • SHAPE users can be grouped so that they see different virtual objects in the same space depending on their assigned group. A simple example of this would be where virtual objects are customized for language support – the virtual object set instantiated would then depend on the language selected by a user.
  • SHAPEs are scalable because they minimize the loading on AR devices. Complex processing is performed using a local edge server or remote cloud. Each virtual object is either static (just for display) or else can be connected to a server function that drives the virtual object and also receives interaction inputs that may modify the state of the virtual object, leaving the AR device to display objects and pass interaction events rather than performing complex functions on-device. Reducing the AR device loading in this way extends battery life and reduces heat, allowing devices to be used for longer sessions.
  • There is a natural fit between SHAPE and artificial intelligence/machine learning. As virtual objects are connected to off-device server functions, they can make use of inference results or supply data for machine learning derived from user interactions while leveraging much more powerful capabilities than are practical on-device.
  • A single universal app can be used for all SHAPEs. Any virtual objects needed for a particular space are downloaded at run time from an object server. However, there would be nothing stopping the creation of a customized app that included hard-coded assets while still leveraging the rest of SHAPE – this might be useful in some applications.
  • New virtual objects can be instantiated by users of the space, configured appropriately (including connection to remote server function) and then made persistent in location by registering with the object server.

A specific goal is to be able to support large scale physical environments such as amusement parks or sports stadiums, where there may be a very large number of users distributed over a very large space. The SHAPE system is being designed to support this level of scalability while being highly responsive to interaction.

In order to turn this into reality, the SHAPE concept requires low cost, lightweight AR headsets that can be worn for extended periods of time, perform reliable spatial localization in changing outdoor environments while also providing high quality, wide angle augmentation displays. Technology isn’t there yet so initially development will use iPads as the AR devices and ARKit for localization. Using iPads for this purpose isn’t ideal ergonomically but does allow all of the required functionality to be developed. When suitable headsets do become available, SHAPE will hopefully be ready to take advantage of them.

Detailed remote node status and distributed logging in rt-ai Edge

Something that had been missing from rt-ai Edge was the ability to see easily the state of remote nodes. That’s now been corrected with the new node status display. Each node has its own tab displaying key real-time usage information and the plan is to add more information to this in future versions – such as precisely which modules are running and also to support multiple GPUs.

Another missing feature was any sort of distributed logging, so that an app could receive and process log messages generated by any module in a design. This is now working and a module log display has been added to the design GUI in rtaiDesigner. As the logging system uses the same communications infrastructure as the rest of the management data, it will be easy to add redundant persistent storage and review of historic log information.

Just as an irrelevant footnote, the node status code came from the BioTestBench, something that I wrote some years ago when I was interested in bioinformatics. I was working on whole genome alignment and wanted a convenient single window that displayed all needed resource utilization information. Haven’t used this for ages but I am glad that some of the code finally came in handy for something else.

Raspberry Pi 3 Model B with Coral Edge TPU acceleration running SSD object detection

It wasn’t too hard to go from the inline rt-ai Edge Stream Processing Element using the Coral Edge TPU accelerator to an embedded version running on a Raspberry Pi 3 Model B with Pi camera.  The rt-ai Edge test design for this SPE is pretty simple again:

As can be seen, the Pi + Coral runs at about 4 fps with 1280 x 720 frames which is not too bad at all. In this example, I am running the PiCoral camera SPE on the Raspberry Pi node (Pi7) and the View SPE on the Default node (an i7 Ubuntu machine). Also, I’m using the combined video and metadata output which contains both the detection data and the associated JPEG video frame. However, the PiCoral SPE also has a metadata-only output. This contains all the frame information and detection data (scores, boxes etc) but not the JPEG frame itself. This can be useful for a couple of reasons. First, especially if the Raspberry Pi is connected via WiFi, transmitting the JPEGs can be a bit onerous and, if they are not needed, very wasteful. Secondly, it satisfies a potential privacy issue in that the raw video data never leaves the Raspberry Pi. Provided the metadata contains enough information for useful downstream processing, this can be a very efficient way to configure a system.

An Edge TPU stream processing element for rt-ai Edge using the Coral USB Accelerator

A Coral USB Accelerator turned up yesterday so of course it had to be integrated with rt-ai Edge to see what it could do. Creating a Python-based SPE from the object detection demo in the API download didn’t take too long. I used the MobileNet SSD v2 COCO model as a starting point to generate this example output:

The very basic rt-ai Edge test design looks like this:

Using 1280 x 720 video frames from the webcam, I was getting around 2 frames per second from the CoralSSD SPE. This isn’t as good as the Intel NCS 2 SPE but that is a compiled C++ SPE whereas the Coral SPE is a Python 3 SPE. I haven’t found a C++ API spec for the Edge TPU as yet. Perhaps by investigating the SWIG-generated Python interface I could link the compiled libraries directly but that’s for another day…

Combining TrueDepth, remote OpenPose inference and local depth map processing to generate spatial 3D pose coordinates

The problem with depth maps for video is that the depth data is very large and can’t be compressed easily. I had previously run OpenPose at 30 FPS using an iPad Pro and remote inference but that was just for the standard OpenPose (x, y) coordinate output. There’s no way that 30 FPS could be achieved by sending out TrueDepth depth maps with each frame. Instead, the depth processing has to be handled locally on the iPad – the depth map never leaves the device.

The screen capture above shows the system running at 30 FPS. I had to turn a lot of lights on in the office – the frame rate from the iPad camera will drop below 30 FPS if it is too dark which messes up the data!

This is the design. It is the triple scaled OpenPoseGPU design used previously. iOSOpenPose connects to the Conductor via a websocket connection that is used to send images to and receive processed images from the pipeline.

One issue is that each image frame has its own depth map and that’s the one that has to be used to convert the OpenPose (x, y) coordinates into spatial (x, y, z) distances. The solution, in a new app called iOSOpenPose, is to cache the depth maps locally and re-associate them with the processed images when they return. Each image and depth frame is marked with a unique incrementing index to assist with this. Incidentally, this is why I love using JSON for this kind of work – it is possible to add non-standard fields at any point and they will be carried transparently to their destination.

Empirically with my current setup, there is a six frame processing lag which is not too bad. It would probably be better with the dual scaled pipeline, two node design that more easily handles 30 FPS but I did not try that. Another issue is that the processing pipeline can validly lose image frames if it can’t keep up with the offered rate. The depth map cache management software has to take care of all of the nasty details like this and other real-world effects.

Generating 3D spatial coordinates from OpenPose with the help of the Stereolabs ZED camera

OpenPose does a great job of estimating the (x, y) coordinates of body points. However, in many situations, the spatial (3D) coordinates of the body joints is what’s required. To do that, the z coordinate has to be provided in some way. There are two common ways of doing that: using multiple cameras or using a depth camera. In this case, I chose using RGBD data from a StereoLabs ZED camera. An example of the result is shown in the screen capture above and another below. Coordinates are in units of meters.

The (x, y) 2D coordinates within the image (generated by OpenPose) along with the depth information at that (x, y) point in the image are used to calculate a spatial (sx, sy, sz) coordinate with origin at the camera and defined by the camera’s orientation. The important thing is that the spatial relationship between the joints is then trivial to calculate. This can be used by downstream inference blocks to discriminate higher level motions.

Incidentally I don’t have a leprechaun sitting on my computers to the right of the first screen capture – OpenPose was picking up my reflection in the window as another person.

The ZED is able to produce a depth map or point cloud but the depth map is more practical in this case as it necessary to transmit the data between processes (possibly on different machines). Even so, it is large and difficult to compress. The trick is to extract the meaningful data and then discard the depth information as soon as possible! The ZED camera also sends along the calibrated horizontal and vertical fields of view as this is essential to constructing (sx, sy, sz) from (x, y) and depth. Since the ZED doesn’t seem to produce a depth value for every pixel, the code samples an area around the (x, y) coordinate to evaluate a depth figure. If it fails to do this, the spatial coordinate is returned as (0, 0, 0).

This is the design I ended up using. Basically a dual OpenPose pipeline with scaler as for standard OpenPose. It averaged around 16 FPS with 1280 x 720 images (24 FPS with VGA images) using JPEG for the image part and raw depth map for the depth part. Using just one pipeline achieved about 13 FPS so the speed up from the second pipeline was disappointing. I expect that this was largely due to the communications overhead of moving the depth map around between nodes. Better network interfaces might improve this.