Somehow I had completely missed the fact that Apple’s TrueDepth camera isn’t just for Face ID but can be used by any app. The screen captures here come from a couple of example apps. The one above is the straight depth data being used in different ways. The one below uses the TrueDepth camera based face tracking functions in ARKit to do things like replace my face with a box. I am winking at the camera and my jaw has been suitably dropped in the image on the right.
What’s really intriguing is that this depth data could be combined with OpenPose 2D joint positions to create 3D spatial coordinates for the joints. A very useful addition to the rt-ai Edge next generation gym concept perhaps, as it would enable much better form estimation and analysis.
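As a rough sketch of how that combination might work: if the depth map is registered to the same image that the 2D joints come from, each joint can be back-projected through a pinhole camera model. Everything here is an assumption for illustration: the function names, the joint dictionary layout, depth in meters, and the intrinsics (fx, fy, cx, cy) are hypothetical and are not taken from OpenPose or the TrueDepth API.

```python
from typing import Dict, List, Optional, Tuple

Point3D = Tuple[float, float, float]


def backproject_joint(u: float, v: float, depth_m: float,
                      fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Pinhole back-projection of pixel (u, v) at the given depth.

    Returns camera-space coordinates in meters (x right, y down, z forward).
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def joints_to_3d(joints_2d: Dict[str, Tuple[float, float]],
                 depth_map: List[List[float]],
                 fx: float, fy: float, cx: float, cy: float
                 ) -> Dict[str, Optional[Point3D]]:
    """Look up the depth under each 2D joint and back-project it.

    Joints that fall outside the depth map, or land on an invalid depth
    sample (zero, negative or NaN), map to None.
    """
    height = len(depth_map)
    width = len(depth_map[0]) if height else 0
    joints_3d: Dict[str, Optional[Point3D]] = {}
    for name, (u, v) in joints_2d.items():
        col, row = int(round(u)), int(round(v))
        if not (0 <= row < height and 0 <= col < width):
            joints_3d[name] = None
            continue
        depth = depth_map[row][col]
        if depth != depth or depth <= 0:  # NaN or missing sample
            joints_3d[name] = None
        else:
            joints_3d[name] = backproject_joint(u, v, depth, fx, fy, cx, cy)
    return joints_3d
```

In practice a single depth sample under a joint is likely to be noisy, so taking the median over a small window around the joint would probably be a sensible refinement.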
I wanted to use the front camera of an iPad as the input to OpenPose so that I could track pose in real time. The original idea was to leverage CoreML to run pose estimation on the device. There are a few iOS implementations of OpenPose (such as this one) but they are really designed for offline processing as they are pretty slow. I did try a different pose estimator that runs in real time on my iPad Pro but the estimation is not as good as OpenPose.
So the question was how to run iPad OpenPose in real time in some way – compromise was necessary! I do have an OpenPose SPE as part of rt-ai Edge that runs very nicely so an obvious solution was to run rt-ai Edge OpenPose on a server and just use the iPad as an input and output device. The nice plus of this new iOS app called iOSEdgeRemote is that it really doesn’t care what kind of remote processing is being used. Frames from the camera are sent to an rt-ai Edge Conductor connected to an OpenPose pipeline.
The rt-ai Edge design for this test is shown above. The pipeline optionally annotates the video and returns that and the pose metadata to the iPad for display. However, the pipeline could be doing anything provided it returns some sort of video back to the iPad.
The results are shown in the screen captures above. Using a GTX 1080 Ti GPU, I was getting around 19fps with just body pose processing turned on and around 9fps with face pose also turned on. Latency is not noticeable with body pose estimation and even with face pose estimation turned on it is entirely usable.
Remote inference and rendering have a lot of advantages over trying to squeeze everything into the iPad and use CoreML for inference if there is a low latency server available – 5G communications is an obvious enabler of this kind of remote inference and rendering in a wide variety of situations. Intrinsic performance of the iPad is also far less important as it is not doing anything too difficult, which leaves lots of resources for other processing. The previous Unity/ARKit object detector uses a similar idea but does use more iPad resources and is not general purpose. If Unity and ARKit aren’t needed, iOSEdgeRemote with remote inference and rendering is a very powerful system.
Another nice aspect of this is that I believe that future mixed reality headsets will be very lightweight devices that avoid complex processing in the headset (unlike the HoloLens for example) and do not require cables to an external processor (unlike the Magic Leap One for example). The headset provides cameras, SLAM of some sort, displays and radios. All other complex processing will be performed remotely and video used to drive the displays. This might be the only way to enable MR headsets that can run for 8 hours or more without a recharge and be light enough (and run cool enough) to be worn for extended periods.
Following on from the previous post, I thought that it would be fun to try adding depth information to the detected objects using surface planes constructed by ARKit. The results are not at all bad. ARKit didn’t always detect the vertical planes correctly but horizontal ones seemed pretty reliable. I just used Unity AR Foundation’s ray casting function at the center of the detected object to get a depth indication. Of course this is really the distance to the nearest horizontal or vertical plane so it isn’t perfect.
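To make the "distance to the nearest plane" idea concrete, here is a minimal sketch of the underlying geometry: intersect a ray from the camera through the detection center with each detected plane and keep the closest hit. This is not AR Foundation's ray casting API. The function names are hypothetical, the ray direction is assumed to be unit length, and planes are treated as infinite, whereas ARKit planes have bounded extents.

```python
from typing import Iterable, Optional, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def ray_plane_distance(origin: Vec3, direction: Vec3,
                       plane_point: Vec3, plane_normal: Vec3,
                       eps: float = 1e-6) -> Optional[float]:
    """Distance along a unit-length ray to an infinite plane.

    Returns None if the ray is parallel to the plane or the plane is
    behind the ray origin.
    """
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:
        return None  # ray runs parallel to the plane
    diff = (plane_point[0] - origin[0],
            plane_point[1] - origin[1],
            plane_point[2] - origin[2])
    t = dot(diff, plane_normal) / denom
    return t if t >= 0 else None


def nearest_plane_distance(origin: Vec3, direction: Vec3,
                           planes: Iterable[Tuple[Vec3, Vec3]]
                           ) -> Optional[float]:
    """Smallest hit distance over (point, normal) planes, or None if no hit."""
    hits = [d for d in (ray_plane_distance(origin, direction, p, n)
                        for p, n in planes) if d is not None]
    return min(hits) if hits else None
```

The sketch also shows why the depth is only an indication: an object standing in front of a wall reports the distance to the wall, not to the object itself.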
In the end, there’s no replacement for mobile devices with proper depth sensing cameras. Even though Tango didn’t make it, it would be nice to think that real depth sensing could become mainstream one day.
The Unity AR Foundation provides a convenient high level way of utilizing ARCore and ARKit in order to implement mixed and augmented reality applications. I used it to implement an iPad app that could access an rt-ai Edge Composable Processing Pipeline (CPP) via the new Conductor Stream Processing Element (SPE). This is the CPP used to test Conductor:
The Conductor SPE provides a Websocket API to mobile devices and is able to pass data from the mobile device to the pipeline and then return the results of the CPP’s processing back to the mobile device. In this case, I am using the CYOLO SPE to perform object detection on the video stream from the mobile device’s camera. The output of the CYOLO SPE goes to three destinations – back to the Conductor, to a MediaView for display locally (for debug) and also to a PutManifold SPE for long term storage and off-line processing.
The iPad Unity app used to test this arrangement uses AR Foundation and ARKit for spatial management and convenient access to camera data. The AR Foundation is especially nice as, if you only need the subset of ARKit functionality currently available, you can do everything in the C# domain without having to get involved with Swift and/or Objective-C and all that. The captured camera data is formatted as an rt-ai Edge message and sent via the Websocket API to the Conductor. The Conductor returns detection metadata to the iPad which then uses this to display the labelled detection frames in the Unity space.
Right now, the app draws a labelled frame at a constant distance of 1 meter from the camera to align with the detected object. However, an enhancement would be to use depth information (if there is any) so that the frame could be positioned at the correct depth. Or if that wasn’t useful, the frame label could include depth information.
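A small sketch of what that enhancement might look like, assuming a pinhole camera model: use the measured depth when there is one, fall back to the fixed 1 meter otherwise, scale the frame so that it still covers the detection on screen, and append the depth to the label. All of the names and the intrinsics (fx, fy) are hypothetical and the app itself does not work this way at present.

```python
from typing import Optional, Tuple

DEFAULT_FRAME_DISTANCE_M = 1.0  # the constant distance currently used


def frame_distance(depth_m: Optional[float]) -> float:
    """Use the measured depth if it is valid, otherwise the default."""
    if depth_m is None or depth_m <= 0:
        return DEFAULT_FRAME_DISTANCE_M
    return depth_m


def frame_size_m(bbox_w_px: float, bbox_h_px: float, distance_m: float,
                 fx: float, fy: float) -> Tuple[float, float]:
    """Physical frame size (meters) that spans the pixel bounding box
    when the frame is placed distance_m in front of the camera."""
    return (bbox_w_px * distance_m / fx, bbox_h_px * distance_m / fy)


def frame_label(name: str, depth_m: Optional[float]) -> str:
    """Append the depth to the label when a valid depth is available."""
    if depth_m is None or depth_m <= 0:
        return name
    return f"{name} ({depth_m:.1f} m)"
```

Because the frame size scales linearly with distance, a frame placed at the measured depth looks the same on screen as one at 1 meter; the difference only becomes apparent once the camera moves.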
This setup demonstrates that it is feasible for an XR app to offload inference to an edge compute system and process results in real time. This greatly reduces the load on the mobile device, pointing the way to lightweight, low power, head mounted XR devices that could last for a full workday without recharge. Performing inference on-device (with CoreML for example) is certainly a viable alternative, especially where privacy dictates that raw data (such as video) cannot leave the device. However, processing such data using an edge compute system is hardly the same as sending data out to a remote cloud so, in many cases, privacy requirements can still be satisfied using edge offload.
This particular setup does not require Orchestrator as the iPad test app can go directly to the Conductor, which is part of a statically allocated CPP. The next step to complete the architecture is to add in the Orchestrator interaction so that CPPs can be dynamically instantiated.
One of the goals for rt-xr is to allow augmented reality users within a space to collaborate with virtual reality users physically outside of the space, with the VR users getting a telepresent sense of being physically within the same space. To this end, VR users see a complete model of the space (my office in this case) including augmentations while physically present AR users just see the augmentations. Some examples of augmentations are virtual whiteboards and virtual sticky notes. Both AR and VR users see avatars representing the position and pose of other users in the space.
Achieving this for AR users requires that their coordinate system corresponds with that of the virtual models of the room. For iOS, ARKit goes a long way to achieving this so the rt-xr app for iOS has been extended to include ARKit and work in AR mode. The screen capture above shows how coordinate systems are synced. A known location in physical space (in this case, the center of the circular control of the fan controller) is selected by touching the iPad screen on the exact center of the control. This identifies position. To avoid multiple control points, the app is currently started in the correct pose so that the yaw rotation is zero relative to the model origin. It is pretty quick and easy to do. The video below shows the process and the result.
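With yaw fixed at zero, syncing the two coordinate systems reduces to a single translation. The sketch below assumes both systems use the same axis conventions and units, and the function names are hypothetical rather than anything from the rt-xr code.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def compute_offset(control_in_model: Vec3, control_in_arkit: Vec3) -> Vec3:
    """Translation that maps ARKit coordinates onto model coordinates.

    control_in_model: known position of the control point in the room model.
    control_in_arkit: where the touch hit the same point in ARKit's frame.
    Valid only when the two frames share the same yaw (zero relative rotation).
    """
    return (control_in_model[0] - control_in_arkit[0],
            control_in_model[1] - control_in_arkit[1],
            control_in_model[2] - control_in_arkit[2])


def arkit_to_model(point_in_arkit: Vec3, offset: Vec3) -> Vec3:
    """Apply the translation to any ARKit-frame point."""
    return (point_in_arkit[0] + offset[0],
            point_in_arkit[1] + offset[1],
            point_in_arkit[2] + offset[2])
```

Supporting an arbitrary starting orientation would need a second control point (or some other yaw reference) so that a rotation could be solved for as well, which is exactly what starting the app in the correct pose avoids.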
After starting the app in the correct orientation, the user is then free to move to click on the control point. Once that’s done, the rt-xr part of the app starts up and loads the virtual model of the room. For this test, the complete model is being shown (i.e. as for VR users rather than AR users) although in real life only the augmentations would be visible – the idea here was to see how the windows lined up. The results are not too bad all things considered although moving or rotating too fast can cause some drift. However, collaborating using augmentations can tolerate some offset so this should not be a major problem.
There are just a couple of augmentations in this test. One is the menu switch (the glowing M) which is used to instantiate and control augmentations. There is also a video screen showing the snowy scene from the driveway camera, the feed being generated by an rt-ai design.
Next step is to test out VR and AR collaboration properly by displaying the correct AR scene on the iOS app. Since VR collaboration has worked for some time, extending it to AR users should not be too hard.
Anything that speeds up the development cycle is interesting and the Unity ARKit Remote manages to avoid having to go through Xcode every time around the loop. Provided the app can be run in the Editor, any changes to objects or scripts can be tested very quickly. The iPhone (in this case) runs a special remote app that passes ARKit data back to the app running in the Editor. You don’t see any of the Unity stuff in the phone itself, just the camera feed. The composite frames are shown in the Editor window as above.
The only problem is that I also want to support WebRTC in the app. There is a React Native WebRTC implementation but as far as I can tell it requires that the app be detached from Expo to ExpoKit so that it can be included in Xcode. Unfortunately, that didn’t work as AR support didn’t seem to be included in the automatically generated project.
Including ARKit support requires modifying the Podfile in the project’s ios directory. The first section should look like this:
platform :ios, '9.0'
target 'test' do
:git => "http://github.com/expo/expo.git",
:tag => "ios/2.0.3",
:subspecs => [
Basically “AR” is added as an extra subspec. Then ARKit seems to work quite happily with ExpoKit.