Connecting to SHAPE-based augmented spaces via QR codes or NFC


The SHAPE concept requires that a single standard SHAPE app works with any SHAPE installation without the user having to do anything particularly special. The first thing the SHAPE app has to be able to do is connect to an EdgeAccess instance (see the SHAPE architecture here), or two (primary and backup) if redundant operation is required. A simple way to do this is to use QR codes, and this is now working in the macOS and iOS Unity SHAPE apps (with the help of the ZXing.Net QR code reader). The idea is that a customer entering a theme park or sports arena, for example, would be given a customized QR code containing their assigned temporary user name and the URLs of the primary and backup EdgeAccess instances. The SHAPE app is then started and begins searching for a valid SHAPE QR code. When one is found, the SHAPE app connects to the specified EdgeAccess(es) and begins normal operation.
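As a rough illustration of the scanning step, here is a minimal Unity sketch that decodes a camera frame with the ZXing.Net Unity build and parses a hypothetical JSON payload. The payload field names (userName, primaryEdgeAccess, backupEdgeAccess) are purely illustrative and not the actual SHAPE QR code format.

```csharp
using UnityEngine;
using ZXing; // ZXing.Net Unity build

// Hypothetical QR payload: field names are illustrative, not the actual SHAPE format.
[System.Serializable]
public class ShapeQrPayload
{
    public string userName;          // temporary user name assigned at ticketing
    public string primaryEdgeAccess; // e.g. wss://edgeaccess1.example.com
    public string backupEdgeAccess;  // optional backup instance
}

public class ShapeQrScanner
{
    private readonly IBarcodeReader reader = new BarcodeReader();

    // Decode one camera frame; returns null until a valid SHAPE QR code is seen.
    public ShapeQrPayload TryDecode(Color32[] pixels, int width, int height)
    {
        Result result = reader.Decode(pixels, width, height);
        if (result == null)
            return null;

        ShapeQrPayload payload;
        try
        {
            payload = JsonUtility.FromJson<ShapeQrPayload>(result.Text);
        }
        catch (System.ArgumentException)
        {
            return null; // QR code found but it wasn't JSON at all
        }

        // Reject QR codes that are not SHAPE connection codes.
        if (payload == null || string.IsNullOrEmpty(payload.primaryEdgeAccess))
            return null;
        return payload;
    }
}
```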

This mode of operation implies a system that creates the QR codes and is tied in to ticket purchasing, which might not always be practical. An alternative is to have standard QR codes for the SHAPE installation that contain URLs pointing to a new SHAPE component called AccessManager. One or two AccessManagers (two for redundancy) serve the entire installation, which means that one or more standard QR codes could be supplied to any customer. The first step for the app is then to connect to the AccessManager (using the URL from the QR code), which redirects the SHAPE app to its assigned primary and backup EdgeAccess instances. This allows for dynamic load sharing between EdgeAccess instances at connection time (rather than at QR code generation time as in the customized QR code case).
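A sketch of what that first step might look like from the app’s side, assuming the AccessManager exposes a simple HTTP endpoint returning JSON (the /assign path and field names are made up for illustration):

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

// Hypothetical response format from AccessManager; the real SHAPE API may differ.
[System.Serializable]
public class EdgeAccessAssignment
{
    public string primaryEdgeAccess;
    public string backupEdgeAccess;
}

public class AccessManagerClient : MonoBehaviour
{
    // Ask the AccessManager (URL taken from the standard QR code) for an
    // EdgeAccess assignment, allowing load sharing at connection time.
    public IEnumerator RequestAssignment(string accessManagerUrl, string userName,
                                         System.Action<EdgeAccessAssignment> onAssigned)
    {
        string url = accessManagerUrl + "/assign?user=" + UnityWebRequest.EscapeURL(userName);
        using (UnityWebRequest req = UnityWebRequest.Get(url))
        {
            yield return req.SendWebRequest();
            if (req.result != UnityWebRequest.Result.Success)
            {
                Debug.LogWarning("AccessManager request failed: " + req.error);
                yield break;
            }
            onAssigned(JsonUtility.FromJson<EdgeAccessAssignment>(req.downloadHandler.text));
        }
    }
}
```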

However, there are advantages to generating customized QR codes for every customer. One advantage is that users can be added to groups easily. SHAPE augmentations can be defined to be visible only to members of a group, which means that a group could have private sticky notes left around the SHAPE installation, for example. Or, a group assignment could define a specific version of information and augmentations for an event. As an example, if two teams are playing some sort of match in an arena, customers might want to identify with one of the teams and see the customized information feeds and augmentations that are most relevant to them.
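The visibility rule itself can be very simple. The sketch below shows one plausible way to express it; the class and field names are hypothetical, not the real SHAPE data model:

```csharp
using System.Collections.Generic;
using System.Linq;

// Minimal sketch of group-based visibility: an augmentation with no group list
// is public, otherwise it is shown only to members of a listed group.
public class AugmentationRecord
{
    public string objectId;
    public List<string> visibleToGroups = new List<string>();
}

public static class GroupVisibility
{
    public static bool IsVisibleTo(AugmentationRecord aug, ISet<string> userGroups)
    {
        if (aug.visibleToGroups.Count == 0)
            return true;                                   // public augmentation
        return aug.visibleToGroups.Any(userGroups.Contains); // private to a group
    }
}
```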

While QR codes work well, NFC might be a better way to go for real installations. If an AR headset uses a smartphone to run the SHAPE app, the smartphone’s NFC capability could be used to transfer the SHAPE connection information. Or, if a headset can run the SHAPE app standalone and has NFC capability, that could be used instead.

SHAPE itself is now working pretty well: sticky notes and whiteboards (essentially as in rt-xr) work fine with collaboration and persistence, and CoreUniverse, EdgeSpace, EdgeAccess and asset serving are all operational. The QR code system got rid of some of the temporary configuration – there are a few more temporary fixes left to eliminate before the implementation becomes more generally usable.

Integrating SHAPE with rt-ai: adding AI to highly augmented spaces

A key feature of SHAPE is its ability to leverage the power of external servers to enhance the AR experience. The idea of combining relatively simple and cheap AR headsets with low-latency communication links (such as 5G wireless) to edge servers is what is driving SHAPE’s architecture. Giving SHAPE access to rt-ai edge systems is a first example of this in action.

The screen capture above gives an idea of the current state of SHAPE development. This was taken using an iPad Pro running the iOS SHAPE app. The polygons with red edges are the planes that have been detected by ARKit. At the bottom right, the monitor shows the same app running on a Mac (in the Unity editor in this case). The macOS version greatly speeds development of everything other than ARKit-related functionality – especially the space synchronization functions (e.g. the add, move, modify and delete object actions that need to be shared between all SHAPE users in the same space). The Unity iOS SHAPE app uses the ARFoundation API to, amongst other things, load and save ARWorldMaps in order to synchronize spatial locations between SHAPE app instances. ARWorldMaps are persisted by the CoreUniverse components and cached for real-time use by EdgeSpace components, one EdgeSpace per physical “room”. SHAPE apps physically entering the room receive the latest map along with the space definition for that room. This includes the directory of augmentation objects, with metadata that allows them all to be downloaded from asset servers (unless already cached), positioned correctly in the physical space and connected to the appropriate external function servers.
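The save/load side of this follows the usual ARFoundation pattern on iOS. The sketch below is modelled on the ARFoundation sample code for ARWorldMap capture and restore; the surrounding SHAPE plumbing (where the serialized bytes actually go) is abstracted into a callback and is an assumption.

```csharp
using System.Collections;
using Unity.Collections;
using UnityEngine;
using UnityEngine.XR.ARFoundation;
using UnityEngine.XR.ARKit; // iOS only

public class WorldMapSync : MonoBehaviour
{
    public ARSession arSession;

    // Capture the current ARWorldMap and hand the serialized bytes to a callback
    // (in SHAPE these bytes would be persisted by CoreUniverse via EdgeSpace).
    public IEnumerator SaveWorldMap(System.Action<byte[]> onSerialized)
    {
        var subsystem = (ARKitSessionSubsystem)arSession.subsystem;
        var request = subsystem.GetARWorldMapAsync();
        while (!request.status.IsDone())
            yield return null;
        if (request.status.IsError())
        {
            Debug.LogWarning("World map capture failed: " + request.status);
            yield break;
        }
        using (var worldMap = request.GetWorldMap())
        using (var data = worldMap.Serialize(Allocator.Temp))
            onSerialized(data.ToArray());
        request.Dispose();
    }

    // Apply a previously saved map so this app instance shares the room's coordinate frame.
    public void LoadWorldMap(byte[] bytes)
    {
        var subsystem = (ARKitSessionSubsystem)arSession.subsystem;
        using (var data = new NativeArray<byte>(bytes, Allocator.Temp))
        {
            if (ARWorldMap.TryDeserialize(data, out ARWorldMap worldMap))
                subsystem.ApplyWorldMap(worldMap);
        }
    }
}
```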

Augmentation objects can be moved around the space manually by touching the object with three or more fingers – sounds awful, but it does work. The object can then be dragged around the screen, and the screen itself can be moved around to position objects in space. Touching the object with two fingers brings up the object menu for that instance. This allows the object to be deleted, resized or rotated. It also allows the object to be stuck to a wall or to the floor. In this context, a wall is an ARKit vertical plane and a floor is an ARKit horizontal plane, so the object could just as easily be placed on a table if a suitable plane has been detected. If not, it can be placed manually. All of these object changes are sent to the room’s EdgeSpace (via EdgeAccess) and shared with the other users in the space to keep everything synchronized. In addition, updates are sent to CoreUniverse for persistence. These become integrated into the persistent space definition for the room, which EdgeSpace instances receive on a regular basis from CoreUniverse (primary and backup). This creates an interesting race condition, since EdgeSpace is modifying its cached space definition in real time and it may take a while for the CoreUniverse version to catch up. The problem is handled using timestamps attached to updates so that EdgeSpace can correctly integrate new information from CoreUniverse (such as a new object instantiated by a space design tool) while ignoring stale updates for existing objects.
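In essence the merge rule only needs a per-object comparison of timestamps. A minimal sketch, assuming updates carry an object ID and a timestamp set by whoever made the change (the field names are illustrative, not the real SHAPE message format):

```csharp
using System.Collections.Generic;

// Sketch of the timestamp rule EdgeSpace might use when a space definition
// arrives from CoreUniverse: keep locally cached objects that are newer,
// accept anything from CoreUniverse that is new or more recent.
public class ObjectUpdate
{
    public string objectId;
    public long timestamp;   // e.g. milliseconds since epoch, set by the updater
    public string state;     // serialized pose/scale/etc. (illustrative)
}

public static class SpaceMerge
{
    public static void MergeFromCoreUniverse(
        Dictionary<string, ObjectUpdate> cache, IEnumerable<ObjectUpdate> fromCore)
    {
        foreach (var incoming in fromCore)
        {
            ObjectUpdate cached;
            if (!cache.TryGetValue(incoming.objectId, out cached))
                cache[incoming.objectId] = incoming;   // new object (e.g. from a design tool)
            else if (incoming.timestamp > cached.timestamp)
                cache[incoming.objectId] = incoming;   // CoreUniverse has newer state
            // otherwise the cached real-time update is newer: ignore the stale copy
        }
    }
}
```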

The box with the big “M”s is the menu object. Each room has one, and it can be placed anywhere convenient in the room. You can click on it (well, touch it actually if using an iPad touch screen) and this pops up a menu that allows the user to add augmentation objects. Right now this only works for the infamous analog clock but it will eventually present a catalog of available models with thumbnails. The analog clocks are proxy objects, driven by an external analog clock server. Obviously it would be trivial to implement this purely in the Unity app, but it is meant as a simple test of the proxy object concept. The next proxy object to be added will be the sticky note object from rt-xr and then probably the rt-xr shared whiteboard.
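On the app side, a proxy object like the clock just applies whatever the function server sends to its sub-objects. A minimal sketch, assuming the server sends hand angles as JSON (the message fields and sub-object layout are illustrative, not the actual clock server protocol):

```csharp
using UnityEngine;

// Sketch of a proxy object script driven by an external function server.
[System.Serializable]
public class ClockUpdate
{
    public float hourAngle;    // degrees
    public float minuteAngle;
    public float secondAngle;
}

public class AnalogClockProxy : MonoBehaviour
{
    public Transform hourHand;
    public Transform minuteHand;
    public Transform secondHand;

    // Called when EdgeAccess delivers a message from the clock function server.
    public void OnServerMessage(string json)
    {
        ClockUpdate update = JsonUtility.FromJson<ClockUpdate>(json);
        hourHand.localRotation = Quaternion.Euler(0f, 0f, -update.hourAngle);
        minuteHand.localRotation = Quaternion.Euler(0f, 0f, -update.minuteAngle);
        secondHand.localRotation = Quaternion.Euler(0f, 0f, -update.secondAngle);
    }
}
```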

Getting back to rt-ai integration, the rt-ai design above shows the simple test design that receives captured frames from the iPad’s rear camera. The frame rate is limited to 5fps so as not to load the WiFi link too much. For simplicity and low latency, motion JPEGs are used for this, but compressed video could of course be used (and probably will be in the future). The new rt-ai SPE, called SHAPEConductor, looks to the SHAPE system like a SHAPE function server while mapping received messages into and out of an rt-ai stream processing network. In this case, the video is simply being passed through DeepLab to perform semantic segmentation and the results displayed:


Here it is picking up the monitor running the macOS SHAPE app. In practice, more complex processing would be performed and results returned to proxy objects via the SHAPEConductor module and the SHAPE network.
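On the capture side, the rate limiting and JPEG encoding are straightforward in Unity. A minimal sketch, with the transport to SHAPEConductor abstracted into a delegate since the real SHAPE message format isn’t shown here:

```csharp
using UnityEngine;

// Minimal sketch of the 5fps camera feed to the SHAPEConductor function server.
public class FrameSender : MonoBehaviour
{
    public System.Action<byte[]> sendToConductor;  // supplied by the SHAPE networking layer
    public float targetFps = 5f;

    private float nextSendTime;

    // Call with the latest rear camera image each frame; only ~5 are sent per second.
    public void SubmitFrame(Texture2D cameraImage)
    {
        if (Time.time < nextSendTime || sendToConductor == null)
            return;
        nextSendTime = Time.time + 1f / targetFps;

        byte[] jpeg = cameraImage.EncodeToJPG(75);  // motion JPEG keeps latency low
        sendToConductor(jpeg);
    }
}
```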

One interesting application for this is to use the captured frames to recognize the physical space and automatically load the correct saved ARWorldMap for that space into the SHAPE app, instantiating all the appropriate augmentation objects in their correct locations. Another would be to perform semantic segmentation and return the results to the SHAPE app so that they can be married to depth data, allowing real-time occlusion to be performed. ARKit 3 will do this on-device for people but apparently not for objects in general. Offloading the segmentation should allow for a lot more flexibility, albeit with increased latency, and work on lower-capability devices.

The SHAPE rt-ai integration is very much a work in progress and it will be fun to see what can be achieved with this combination.

The SHAPE architecture: scaling the core using Apache Kafka

SHAPE is being designed from the outset to scale to tens of thousands of simultaneous users or more in a single SHAPE universe, while providing a low-latency experience to every AR user. The current architectural concept is shown in the (somewhat messy) diagram above. A recent change has been the addition of Apache Kafka in the core layer. This helps solve one of the bigger problems: how to keep track of all of the augmentation object changes and interactions reliably and ensure a consistent representation for everyone.

SHAPE functionality is divided into four regions:

  • Core. Core functions are those that may involve significant amounts of data and processing but do not have tight latency requirements. Core functions could be implemented in a remote cloud for example. CoreUniverse manages all of the spatial maps, proxy object instances, spatial anchors and server configurations for the entire system and can be replicated for redundancy and load sharing. In order to ensure eventual consistency, Apache Kafka is used to keep a permanent record of updates to the space configuration (data flowing along the red arrows), allowing easy recovery from failures along with high reliability and scalability. The idea of using Kafka for this purpose was triggered by this paper incidentally.
  • Proxy. The proxy region contains the servers that drive the proxy objects (i.e. the AR augmentations) in the space. There are two types of servers in this region: asset servers and function servers. Asset servers contain the assets that form the proxy object – a Unity asset bundle, for example. Users go directly to the asset servers (blue arrows – only a few shown for clarity) to obtain assets to instantiate. Function servers interact with the instantiated proxy objects in real time (via EdgeAccess, as described below). For example, in the case of the famous analog clock proxy object (my proxy object equivalent of the classic Utah teapot), the function server drives the hands of the clock by supplying updated angles to the sub-objects within the analog clock asset.
  • Edge. The edge functions are those that have to respond to users with low latency. The first point of contact for SHAPE users is EdgeAccess. During normal operation, all real-time interaction takes place over a single link to an instance of EdgeAccess. This makes management, control and status on a per-user basis very easy. EdgeAccess then makes ongoing connections to EdgeSpace servers and proxy function servers. A key performance enhancement is that EdgeAccess is able to multicast data from function servers if the data has not been customized for a specific proxy object instance. Function server data that can be multicast in this way is called undirected data, while function server data intended for a specific proxy object instance is called directed data (see the sketch after this list). The analog clock server generates undirected data, whereas a server that is interacting directly with a user (via proxy object interaction support) has to use directed data. EdgeSpace acts as a sort of local cache for CoreUniverse. Each EdgeSpace instance supports a sub-space of the entire universe. It caches the local spatial maps, object instances and anchors for the sub-space so that users located within that sub-space experience low-latency updates. These updates are also forwarded to Kafka so that CoreUniverse instances will eventually correctly reflect the state of the local caches. EdgeSpace instances sync with CoreUniverse at startup and periodically during operation to ensure consistency.
  • User. In this context, users are SHAPE apps running on AR headsets. An important concept is that a standard SHAPE app can be used in any SHAPE universe. The SHAPE app establishes a single connection (black arrows) to an EdgeAccess instance. EdgeAccess provides the user app with the local spatial map to use, proxy object instances, asset server paths and spatial anchors. The user app then fetches the assets from one or more asset servers to populate its augmentation scene. In addition, the user app registers with EdgeAccess for each function server required by its proxy object instances. EdgeAccess is responsible for setting up any connections to function servers (green arrows – only a few shown for clarity) that aren’t already in existence.
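The directed/undirected split mentioned above boils down to a routing decision inside EdgeAccess. A minimal sketch of that decision; all the class names and bookkeeping structures here are invented for illustration:

```csharp
using System.Collections.Generic;

// Sketch of how an EdgeAccess instance might fan out function-server data.
// Undirected data is multicast to every local user of the proxy object type;
// directed data goes only to the user owning the target object instance.
public class FunctionServerMessage
{
    public string proxyObjectType;   // e.g. "AnalogClock"
    public string targetInstanceId;  // null or empty for undirected data
    public byte[] payload;
}

public class EdgeAccessFanout
{
    // userId -> send callback for that user's SHAPE app connection
    private readonly Dictionary<string, System.Action<byte[]>> userLinks =
        new Dictionary<string, System.Action<byte[]>>();
    // proxyObjectType -> users that have registered an instance of it
    private readonly Dictionary<string, List<string>> subscribers =
        new Dictionary<string, List<string>>();
    // instanceId -> owning user
    private readonly Dictionary<string, string> instanceOwners =
        new Dictionary<string, string>();

    public void Dispatch(FunctionServerMessage msg)
    {
        if (string.IsNullOrEmpty(msg.targetInstanceId))
        {
            // Undirected: one copy per subscribed user.
            List<string> users;
            if (!subscribers.TryGetValue(msg.proxyObjectType, out users))
                return;
            foreach (var user in users)
            {
                System.Action<byte[]> send;
                if (userLinks.TryGetValue(user, out send))
                    send(msg.payload);
            }
        }
        else
        {
            // Directed: only the user that owns the target instance.
            string owner;
            System.Action<byte[]> sendToOwner;
            if (instanceOwners.TryGetValue(msg.targetInstanceId, out owner) &&
                userLinks.TryGetValue(owner, out sendToOwner))
                sendToOwner(msg.payload);
        }
    }
}
```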

As an example of operation, consider a set of users physically present in the same sub-space. They may be connected to SHAPE via different EdgeAccess instances but will all use the same EdgeSpace. If one user makes a change to a proxy object instance (rotates it, for example), the update information will be sent to EdgeSpace (via EdgeAccess) and then broadcast to the other users in the sub-space so that the change is reflected in their augmentation scenes in real time. The updates are also forwarded to Kafka so that CoreUniverse instances can track every local change.
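The Kafka side of this is a standard produce-to-topic operation. A sketch using the Confluent.Kafka .NET client, with the topic name and update format assumed for illustration; keying by object ID keeps each object’s updates ordered within a partition:

```csharp
using Confluent.Kafka;

// Sketch of how EdgeSpace might forward object updates to Kafka so that
// CoreUniverse instances eventually converge on the same space state.
public class UpdateLog
{
    private readonly IProducer<string, string> producer;

    public UpdateLog(string bootstrapServers)
    {
        var config = new ProducerConfig { BootstrapServers = bootstrapServers };
        producer = new ProducerBuilder<string, string>(config).Build();
    }

    // Key by object ID so all updates for one object land in the same partition,
    // preserving their order for consumers.
    public void RecordUpdate(string objectId, string updateJson)
    {
        producer.Produce("shape-space-updates",
            new Message<string, string> { Key = objectId, Value = updateJson });
    }

    public void Close()
    {
        producer.Flush(System.TimeSpan.FromSeconds(5));
        producer.Dispose();
    }
}
```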

This is very much a work in progress so details may change of course. There are quite a few details that I have glossed over here (such as spatial map management and a user moving from one sub-space to another) and they may well require changes.

Real time OpenPose on an iPad…with the help of remote inference and rendering

I wanted to use the front camera of an iPad to act as the input to OpenPose so that I could track pose in real time, with the original idea being to leverage CoreML to run pose estimation on the device. There are a few iOS implementations of OpenPose (such as this one) but they are really designed for offline processing as they are pretty slow. I did try a different pose estimator that runs in real time on my iPad Pro, but the estimation is not as good as OpenPose’s.

So the question was how to run iPad OpenPose in real time in some way – compromise was necessary! I do have an OpenPose SPE as part of rt-ai Edge that runs very nicely, so an obvious solution was to run rt-ai Edge OpenPose on a server and just use the iPad as an input and output device. The nice plus of the new iOS app, called iOSEdgeRemote, is that it really doesn’t care what kind of remote processing is being used. Frames from the camera are sent to an rt-ai Edge Conductor connected to an OpenPose pipeline.

The rt-ai Edge design for this test is shown above. The pipeline optionally annotates the video and returns that and the pose metadata to the iPad for display. However, the pipeline could be doing anything provided it returns some sort of video back to the iPad.
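Conceptually the exchange over the link is just “send a frame, get back an annotated frame plus metadata”. Here is a rough sketch using a plain .NET ClientWebSocket; the message framing is invented for illustration and is not the actual rt-ai Edge Conductor protocol:

```csharp
using System;
using System.Net.WebSockets;
using System.Threading;
using System.Threading.Tasks;

// Sketch of the kind of Websocket exchange iOSEdgeRemote might use with the
// rt-ai Edge Conductor: send a camera frame, read back the annotated frame
// and pose metadata.
public class ConductorLink
{
    private readonly ClientWebSocket socket = new ClientWebSocket();

    public async Task ConnectAsync(Uri conductorUri)
    {
        await socket.ConnectAsync(conductorUri, CancellationToken.None);
    }

    public async Task SendFrameAsync(byte[] jpegFrame)
    {
        await socket.SendAsync(new ArraySegment<byte>(jpegFrame),
            WebSocketMessageType.Binary, true, CancellationToken.None);
    }

    // Read one complete binary message (annotated video frame plus metadata).
    // The buffer is sized for a single modest-resolution JPEG frame.
    public async Task<byte[]> ReceiveResultAsync()
    {
        var buffer = new byte[256 * 1024];
        int offset = 0;
        while (true)
        {
            var result = await socket.ReceiveAsync(
                new ArraySegment<byte>(buffer, offset, buffer.Length - offset),
                CancellationToken.None);
            offset += result.Count;
            if (result.EndOfMessage)
                break;
        }
        var message = new byte[offset];
        Array.Copy(buffer, message, offset);
        return message;
    }
}
```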

The results are shown in the screen captures above. Using a GTX 1080 Ti GPU, I was getting around 19fps with just body pose processing turned on and around 9fps with face pose also turned on. Latency is not noticeable with body pose estimation and, even with face pose estimation turned on, it is entirely usable.

If a low-latency server is available, remote inference and rendering has a lot of advantages over trying to squeeze everything into the iPad and using CoreML for inference – 5G communications is an obvious enabler of this kind of remote inference and rendering in a wide variety of situations. The intrinsic performance of the iPad also becomes far less important, as it is not doing anything too difficult and plenty of resources are left for other processing. The previous Unity/ARKit object detector uses a similar idea but does use more iPad resources and is not general purpose. If Unity and ARKit aren’t needed, iOSEdgeRemote with remote inference and rendering is a very powerful system.

Another nice aspect of this is that I believe future mixed reality headsets will be very lightweight devices that avoid complex processing in the headset (unlike the HoloLens, for example) and don’t require cables to an external processor (unlike the Magic Leap One, for example). The headset provides cameras, SLAM of some sort, displays and radios. All other complex processing is performed remotely and video used to drive the displays. This might be the only way to enable MR headsets that can run for 8 hours or more without a recharge and are light enough (and run cool enough) to be worn for extended periods.

Adding depth to DNN object detection with ARKit and Unity AR Foundation


Following on from the previous post, I thought that it would be fun to try adding depth information to the detected objects using the surface planes constructed by ARKit. The results are not at all bad. ARKit didn’t always detect the vertical planes correctly, but horizontal ones seemed pretty reliable. I just used Unity AR Foundation’s ray casting function at the center of the detected object to get a depth indication. Of course, this is really the distance to the nearest horizontal or vertical plane, so it isn’t perfect.
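The ray cast itself is essentially a one-liner against ARRaycastManager. A minimal sketch, assuming the detection box center is available in screen coordinates:

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.XR.ARFoundation;
using UnityEngine.XR.ARSubsystems;

// Sketch of the depth lookup: ray cast from the center of a detection box
// against the planes ARKit has found and report the hit distance.
public class DetectionDepth : MonoBehaviour
{
    public ARRaycastManager raycastManager;
    public Camera arCamera;

    private static readonly List<ARRaycastHit> hits = new List<ARRaycastHit>();

    // boxCenter is the detection box center in screen (pixel) coordinates.
    public bool TryGetDepth(Vector2 boxCenter, out float distance)
    {
        distance = 0f;
        if (!raycastManager.Raycast(boxCenter, hits, TrackableType.PlaneWithinPolygon))
            return false;  // no horizontal or vertical plane behind this object

        // Hits are sorted by distance; take the nearest plane.
        distance = Vector3.Distance(arCamera.transform.position, hits[0].pose.position);
        return true;
    }
}
```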

In the end, there’s no replacement for mobile devices with proper depth sensing cameras. Even though Tango didn’t make it, it would be nice to think that real depth sensing could become mainstream one day.

Using edge inference to detect real world objects with Unity AR Foundation, ARKit and rt-ai Edge

Unity AR Foundation provides a convenient, high-level way of utilizing ARCore and ARKit in order to implement mixed and augmented reality applications. I used it to implement an iPad app that can access an rt-ai Edge Composable Processing Pipeline (CPP) via the new Conductor Stream Processing Element (SPE). This is the CPP used to test Conductor:


The Conductor SPE provides a Websocket API to mobile devices and is able to pass data from the mobile device to the pipeline and then return the results of the CPP’s processing back to the mobile device. In this case, I am using the CYOLO SPE to perform object detection on the video stream from the mobile device’s camera. The output of the CYOLO SPE goes to three destinations – back to the Conductor, to a MediaView for local display (for debugging) and to a PutManifold SPE for long-term storage and offline processing.

The iPad Unity app used to test this arrangement uses AR Foundation and ARKit for spatial management and convenient access to camera data. AR Foundation is especially nice because, if you only need the subset of ARKit functionality currently available, you can do everything in the C# domain without having to get involved with Swift and/or Objective-C and all that. The captured camera data is formatted as an rt-ai Edge message and sent via the Websocket API to the Conductor. The Conductor returns detection metadata to the iPad, which then uses it to display the labelled detection frames in the Unity space.

Right now, the app draws a labelled frame at a constant distance of 1 meter from the camera, aligned with the detected object. An enhancement would be to use depth information (if there is any) so that the frame could be positioned at the correct depth. Or, if that isn’t practical, the frame label could include the depth information.
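Placing the frame amounts to projecting the detection center out to a fixed depth along the camera ray. A minimal Unity sketch; the prefab and the form of the detection metadata are assumptions, not the actual app code:

```csharp
using UnityEngine;

// Sketch of positioning a labelled detection frame at a fixed 1 m from the
// camera, aligned with the detection box center returned by the CYOLO pipeline.
public class DetectionFramePlacer : MonoBehaviour
{
    public Camera arCamera;
    public GameObject framePrefab;       // a quad plus a text label
    public float placementDepth = 1.0f;  // meters from the camera

    // boxCenter is the detection center in screen (pixel) coordinates.
    public GameObject PlaceFrame(Vector2 boxCenter, string label)
    {
        Vector3 screenPoint = new Vector3(boxCenter.x, boxCenter.y, placementDepth);
        Vector3 worldPos = arCamera.ScreenToWorldPoint(screenPoint);

        // Orient the frame along the camera ray so it faces the viewer.
        GameObject frame = Instantiate(framePrefab, worldPos,
            Quaternion.LookRotation(worldPos - arCamera.transform.position));
        frame.GetComponentInChildren<TextMesh>().text = label;
        return frame;
    }
}
```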

This setup demonstrates that it is feasible for an XR app to offload inference to an edge compute system and process the results in real time. This greatly reduces the load on the mobile device, pointing the way to lightweight, low-power, head-mounted XR devices that could last for a full workday without a recharge. Performing inference on-device (with CoreML for example) is certainly a viable alternative, especially where privacy dictates that raw data (such as video) cannot leave the device. However, processing such data using an edge compute system is hardly the same as sending data out to a remote cloud so, in many cases, privacy requirements can still be satisfied using edge offload.

This particular setup does not require Orchestrator as the iPad test app can go directly to the Conductor, which is part of a statically allocated CPP. The next step to complete the architecture is to add in the Orchestrator interaction so that CPPs can be dynamically instantiated.

Registering virtual and real worlds with rt-xr, ARKit and Unity


One of the goals for rt-xr is to allow augmented reality users within a space to collaborate with virtual reality users physically outside of the space, with the VR users getting a telepresent sense of being physically within the same space. To this end, VR users see a complete model of the space (my office in this case) including augmentations while physically present AR users just see the augmentations. Some examples of augmentations are virtual whiteboards and virtual sticky notes. Both AR and VR users see avatars representing the position and pose of other users in the space.

Achieving this for AR users requires that their coordinate system corresponds to that of the virtual models of the room. For iOS, ARKit goes a long way towards achieving this, so the rt-xr app for iOS has been extended to include ARKit and work in AR mode. The screen capture above shows how the coordinate systems are synced. A known location in physical space (in this case, the center of the circular control of the fan controller) is selected by touching the iPad screen on the exact center of the control. This identifies the position. To avoid needing multiple control points, the app is currently started in the correct pose so that the yaw rotation is zero relative to the model origin. It is pretty quick and easy to do. The video below shows the process and the result.
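One way to implement the registration step is to ray cast through the tap, take the hit as the physical location of the known control point, and translate the model so its corresponding point lands there (yaw is assumed already correct because of the startup orientation). The sketch below is one plausible implementation, not the actual rt-xr code:

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.XR.ARFoundation;
using UnityEngine.XR.ARSubsystems;

// Sketch of the registration step: the user taps the known control point and
// the virtual room model is offset so that its corresponding anchor lands on
// that physical position.
public class ModelRegistration : MonoBehaviour
{
    public ARRaycastManager raycastManager;
    public Transform roomModel;            // root of the virtual room model
    public Vector3 controlPointInModel;    // known control point in model coordinates

    private static readonly List<ARRaycastHit> hits = new List<ARRaycastHit>();

    void Update()
    {
        if (Input.touchCount != 1 || Input.GetTouch(0).phase != TouchPhase.Began)
            return;

        // Find where the tap lands in the physical space.
        if (!raycastManager.Raycast(Input.GetTouch(0).position, hits,
                                    TrackableType.PlaneWithinPolygon))
            return;

        // Shift the model so its control point coincides with the tapped position.
        Vector3 physicalPoint = hits[0].pose.position;
        roomModel.position += physicalPoint - roomModel.TransformPoint(controlPointInModel);
    }
}
```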

After starting the app in the correct orientation, the user is free to move around and tap on the control point. Once that’s done, the rt-xr part of the app starts up and loads the virtual model of the room. For this test, the complete model is being shown (i.e. as for VR users rather than AR users), although in real life only the augmentations would be visible – the idea here was to see how the windows lined up. The results are not too bad, all things considered, although moving or rotating too fast can cause some drift. However, collaborating using augmentations can tolerate some offset, so this should not be a major problem.

There are just a couple of augmentations in this test. One is the menu switch (the glowing M), which is used to instantiate and control augmentations. There is also a video screen showing the snowy scene from the driveway camera, with the feed being generated by an rt-ai design.

The next step is to test out VR and AR collaboration properly by displaying the correct AR scene in the iOS app. Since VR collaboration has worked for some time, extending it to AR users should not be too hard.