Multi-platform interaction styles for rt-xr and Unity

The implementation of sticky notes in rt-xr opened a whole can of worms, but really it just forced the development of a set of capabilities that will be needed for the general case, where occupants of a sentient space can download assets from anywhere, instantiate them in a space and then interact with them. In particular, being able to create a sticky note, position it and add text to it on all of the supported platforms (Windows and macOS desktop, Android, iOS and Windows Mixed Reality) required a surprising amount of work. I decided to standardize on a three button mouse model and map the various interaction styles onto mouse events. This means that the bulk of the code doesn’t need to care about the interaction style (mouse, motion controller, touch screen etc.) as all the complexity is housed in one script.
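
Purely to illustrate the idea (this is a minimal sketch, not the actual rt-xr script), the abstraction can reduce every platform’s input to three virtual mouse buttons that the rest of the code subscribes to:

using UnityEngine;

// Sketch only: platform-specific input is reduced to three mouse-style buttons
// so that object scripts never need to know about the real device.
public enum VirtualButton { Left, Middle, Right }

public class InteractionMapper : MonoBehaviour
{
    public delegate void ButtonEvent(VirtualButton button, Vector3 screenPosition);
    public event ButtonEvent ButtonDown;    // object scripts subscribe to this

    void Update()
    {
        // Desktop case: Unity mouse button indices are 0 = left, 1 = right, 2 = middle.
        if (Input.GetMouseButtonDown(0)) Emit(VirtualButton.Left);
        if (Input.GetMouseButtonDown(1)) Emit(VirtualButton.Right);
        if (Input.GetMouseButtonDown(2)) Emit(VirtualButton.Middle);
        // Touch and motion controller input would be mapped onto the same three
        // buttons here, as described in the platform sections below.
    }

    void Emit(VirtualButton button)
    {
        if (ButtonDown != null)
            ButtonDown(button, Input.mousePosition);
    }
}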

The short video below shows this in operation on a Windows desktop.

It ended up running a bit fast but that was due to the video recorder setup – I can’t really do things that fast!

I am still just using opaque devices – where is my HoloLens 2 or Magic Leap?!!! However, things should map across pretty well. Note how the current objects are glued to the virtual walls. On MR devices, the spatial map would be used for raycasting so that the objects would be glued to the real walls instead. I do need to add a mode where you can pull things off walls and position them arbitrarily in space, but that’s just a TODO right now.
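
As an illustration of the placement step, a helper along these lines could raycast from the interaction pointer and snap the grabbed object to whatever surface is hit; on an MR device the same raycast would hit the spatial map colliders rather than the virtual walls. The layer name and distance here are assumptions, not the actual rt-xr setup.

using UnityEngine;

// Minimal sketch: glue a grabbed object to whatever surface the pointer ray hits.
public static class WallPlacement
{
    public static bool PlaceOnWall(Transform grabbedObject, Ray pointerRay)
    {
        RaycastHit hit;
        int wallMask = LayerMask.GetMask("Walls");    // illustrative layer name

        if (Physics.Raycast(pointerRay, out hit, 20.0f, wallMask)) {
            // Snap to the hit point and orient to the surface normal (assumes the
            // object's forward axis should point out from the wall).
            grabbedObject.position = hit.point;
            grabbedObject.rotation = Quaternion.LookRotation(hit.normal);
            return true;
        }
        return false;
    }
}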

What doesn’t yet work is sharing of these actions. When an object is moved, that move should be visible to all occupants of the space. Likewise, when a new object is created or text updated on a sticky note, everyone should see that. That’s the next step, followed by making all of this persistent.

Anyway, here is how interaction works on the various platforms.

Windows and macOS desktop

For Windows, the assumption is that a three button (middle wheel) mouse is used. The middle button is used to grab and position objects, the right button opens up the menu for that object and the left button is used for selection and resizing. On the Mac, which doesn’t have a middle button, holding down the Command key simulates it by mapping the left button onto the middle button.
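
A hedged sketch of the Command key trick might look like this (button indices are Unity’s standard 0 = left, 1 = right, 2 = middle):

using UnityEngine;

// Sketch of the macOS middle button simulation: if Command is held, a left
// click is treated as a middle click.
public static class DesktopButtons
{
    static bool CommandHeld()
    {
        return Input.GetKey(KeyCode.LeftCommand) || Input.GetKey(KeyCode.RightCommand);
    }

    public static bool MiddleButtonDown()
    {
        return Input.GetMouseButtonDown(2) || (CommandHeld() && Input.GetMouseButtonDown(0));
    }

    public static bool LeftButtonDown()
    {
        // Suppress the left button while it is being used to simulate the middle button.
        return !CommandHeld() && Input.GetMouseButtonDown(0);
    }
}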

Navigation is via the SpaceMouse on both platforms.

Windows Mixed Reality

The motion controllers have quite a few controls and buttons available. I am using the Grab button to grab and position objects. The trigger is used for selection and resizing while the menu button is used to bring up the object menu. Pointing at the sticky note and pressing the trigger causes the virtual keyboard to appear.
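
As a rough sketch (using the Windows MR interaction source events that Unity exposes, not the actual rt-xr code), the mapping looks something like this:

using UnityEngine;
using UnityEngine.XR.WSA.Input;

// Sketch of the motion controller mapping: Grasp acts as the middle (grab)
// button, Select (trigger) as the left button and Menu as the right button.
public class MotionControllerMapper : MonoBehaviour
{
    void OnEnable()
    {
        InteractionManager.InteractionSourcePressed += OnSourcePressed;
    }

    void OnDisable()
    {
        InteractionManager.InteractionSourcePressed -= OnSourcePressed;
    }

    void OnSourcePressed(InteractionSourcePressedEventArgs args)
    {
        switch (args.pressType) {
            case InteractionSourcePressType.Grasp:
                // Grab and position objects (simulated middle button).
                break;
            case InteractionSourcePressType.Select:
                // Selection and resizing; on a sticky note this also pops up the keyboard.
                break;
            case InteractionSourcePressType.Menu:
                // Open the object menu (simulated right button).
                break;
        }
    }
}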

Navigation uses the standard joystick-based teleport system.

Android and iOS

My solution here is a little ugly. Basically, the number of fingers used for a tap and/or hold dictates which mouse button the action maps to: a single touch means the left mouse button, two touches mean the right mouse button and three touches mean the middle button. It works, but it is pretty amusing trying to get three simultaneous touches on an object to initiate a grab on a small screen device like a phone!
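
A minimal sketch of the finger-count mapping (the simulated button index would be fed into the same mouse event path the desktop code uses):

using UnityEngine;

// Sketch only: one finger simulates the left button (0), two fingers the right
// button (1) and three or more the middle (grab) button (2).
public class TouchButtonMapper : MonoBehaviour
{
    void Update()
    {
        if (Input.touchCount == 0 || Input.GetTouch(0).phase != TouchPhase.Began)
            return;

        int simulatedButton;
        switch (Input.touchCount) {
            case 1: simulatedButton = 0; break;     // left
            case 2: simulatedButton = 1; break;     // right
            default: simulatedButton = 2; break;    // middle (grab)
        }
        Debug.Log("Simulated mouse button " + simulatedButton + " at " + Input.GetTouch(0).position);
    }
}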

Navigation is via single or dual touch. Single touch and slide moves in x and y directions. Dual touch and slide rotates around the y axis. Since touches are used for other things, navigation touches need to be made away from objects or else they will be misinterpreted. Probably there is a better way of doing this. However, in the longer term, see-through mode using something like ARCore or ARKit will eliminate the navigation issue which is only a problem in VR (opaque) mode. I assume the physical occupants of a space will use see-through mode with only remote occupants using VR mode.
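
Something along these lines handles the navigation touches, assuming the script sits on the camera rig; the speed constants are arbitrary and the check that the touch did not start on an object is omitted:

using UnityEngine;

// Sketch of touch navigation: one finger dragging translates in x and y,
// two fingers dragging rotates around the y axis.
public class TouchNavigation : MonoBehaviour
{
    public float moveSpeed = 0.002f;
    public float rotateSpeed = 0.1f;

    void Update()
    {
        if (Input.touchCount == 1) {
            Vector2 delta = Input.GetTouch(0).deltaPosition;
            transform.Translate(delta.x * moveSpeed, delta.y * moveSpeed, 0.0f);
        } else if (Input.touchCount == 2) {
            float twist = Input.GetTouch(0).deltaPosition.x;
            transform.Rotate(0.0f, twist * rotateSpeed, 0.0f);
        }
    }
}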

I haven’t been using ARCore or ARKit yet, mainly because they haven’t seemed good enough to create a spatial map that is useful for rt-xr. This is changing (ARKit 2 for example) but the question is whether it can cope with multiple rooms. For example, objects behind a real wall should not be visible – they need to be occluded by the spatial map. The HoloLens can do this however and is the best available option right now for multi-room MR with persistence.


Streaming PCM audio from Unity on Android

The final step in adding audio support to rt-xr visualization was to make it work with Android. Supporting audio capture natively on Windows desktop and Windows UWP was relatively easy since it could all be done in C#. However, I didn’t really want to implement a native capture plugin for Android, and it turns out that the Unity capture technique works pretty well, albeit with noticeable latency.

The Inspector view in the screen capture shows the idea. The MicrophoneFilter script starts up the Unity Microphone and assigns it as the clip of the AudioSource. When running, the output of the AudioSource is passed to MicrophoneFilter via the OnAudioFilterRead callback, which gives access to the PCM stream from the microphone.

The resulting stream needs some processing, however. I am sending single channel PCM audio at 16000 samples per second on the network, whereas the output of the AudioSource is stereo, at either 16000 or 48000 samples per second depending on the platform, and uses floating point rather than 16 bit values, so the code has to convert between the two. It also needs to zero out the output of the filter, otherwise the microphone audio would be picked up by the listener on the main camera, which is certainly not desirable! There is an alternative way of doing this that uses the AudioSource.clip.GetData call directly, but I had problems with that and in any case prefer the asynchronous OnAudioFilterRead callback to polling from Update or FixedUpdate. The complete MicrophoneFilter script looks like this:

using UnityEngine;

[RequireComponent(typeof(AudioSource))]
public class MicrophoneFilter : MonoBehaviour
{
    [Tooltip("Index of microphone to use")]
    public int deviceIndex = 0;

    private StatusUpdate statusUpdate;
    private bool running = false;
    private byte[] buffer = new byte[32000];
    private int scale;

    // Use this for initialization
    void Start()
    {

        AudioSource source = GetComponent<AudioSource>();

        if (deviceIndex >= Microphone.devices.Length)
            deviceIndex = 0;

        // The StatusUpdate script (on the Scripts object) forwards the converted audio to the network
        GameObject scripts = GameObject.Find("Scripts");
        statusUpdate = scripts.GetComponent<StatusUpdate>();

        int sampleRate = AudioSettings.outputSampleRate;

        // The network format is 16000 samples per second; the audio engine runs at
        // either 16000 or 48000 depending on the platform, so decimate by 3 if needed
        if (sampleRate > 16000)
            scale = 3;
        else
            scale = 1;

        // Route the microphone into the AudioSource using a looping one second clip
        source.clip = Microphone.Start(Microphone.devices[deviceIndex], true, 1, sampleRate);
        source.Play();
        running = true;
    }

    private void OnAudioFilterRead(float[] data, int channels)
    {
        if (!running)
            return;

        int byteIndex = 0;
        if (channels == 1) {
            for (int i = 0; i < data.Length;) {
                // Convert the float sample to 16 bit little-endian PCM
                short val = (short)((data[i]) * 32767.0f);
                // Zero out (and skip) scale samples so nothing reaches the listener
                // and the rate is decimated to 16000 samples per second
                for (int offset = 0; offset < scale; offset++) {
                    if (i < data.Length)
                        data[i++] = 0;
                }
                buffer[byteIndex++] = (byte)(val & 0xff);
                buffer[byteIndex++] = (byte)((val >> 8) & 0xff);
            }
        } else {
            for (int i = 0; i < data.Length;) {
                // Average the left and right channels into a single 16 bit sample
                short val = (short)((data[i] + data[i + 1]) * 32767.0f / 2.0f);
                // Zero out (and skip) a whole group of interleaved samples per output sample
                for (int offset = 0; offset < 2 * scale; offset++) {
                    if (i < data.Length)
                        data[i++] = 0;
                }
                buffer[byteIndex++] = (byte)(val & 0xff);
                buffer[byteIndex++] = (byte)((val >> 8) & 0xff);
            }
        }
        // Hand the converted chunk to the network sender
        statusUpdate.newAudioData(buffer, byteIndex);
    }
}

Note the fixed maximal size buffer allocation to try to prevent too much garbage collection. In general, the code uses maximal sized fixed buffers wherever possible.

The SharingServer has now been updated to generate separate feeds for VR and AR/MR users with all user audio feeds in the VR version and just VR headset users’ audio in the MR version. The audio update rate has also been decoupled from the avatar pose update rate. This allows a faster update rate for pose updates than makes sense for audio.

Just a note on why I am using single channel 16 bit PCM at 16000 samples per second rather than sending single channel floats at 48000 samples per second which would be a better fit in many cases. The problem is that this makes the data rate 6 times higher – it goes from 256kbps to 1.536Mbps. Using uncompressed 16 bit audio and dealing with the consequences seemed like a better trade than either the higher data rate or moving to compressed audio. This decision may have to be revisited when running on real MR headset hardware however.

rt-xr visualization with spatialized sound

An important goal of the rt-xr project is to allow MR and AR headset wearing physical occupants of a sentient space to interact as naturally as possible with virtual users in the same space. A component of this is spatialized sound, where a sound or someone’s voice appears to originate from where it should in the scene. Unity has a variety of tools for achieving this, depending on the platform.

I have standardized on 16 bit, single channel PCM at 16000 samples per second for audio within rt-xr in order to keep implementation simple (no need for codecs) but still keep the required bit rate down. The problem is that the SharingServer has to send all audio feeds to all users – each user needs all the other users’ feeds so that they can spatialize them correctly. If spatialized sound wasn’t required, the SharingServer could just mix them all together on some basis. Another solution is for the SharingServer to just forward the dominant speaker, but this assumes that only intermittent speakers are supported. Plus it leads to the “half-duplex” effect where the loudest speaker blocks everyone else. Mixing them all is a lot more democratic.

Another question is how to deal with occupants in different rooms within the same sentient space. Some things (such as video) are turned off to reduce bit rate if the user isn’t in the same room as the video panel. However, it makes sense that you can hear users in other rooms at an appropriate level. The AudioSource in Unity has tools for ensuring that sound levels drop off appropriately.
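
For example, an avatar voice AudioSource might be configured something like this (the distances are purely illustrative):

using UnityEngine;

// Minimal sketch of setting up an avatar voice AudioSource for spatialized
// playback with distance rolloff.
public static class VoiceAudioSetup
{
    public static void Configure(AudioSource voice)
    {
        voice.spatialBlend = 1.0f;                          // fully 3D
        voice.spatialize = true;                            // hand off to the spatializer plugin
        voice.rolloffMode = AudioRolloffMode.Logarithmic;   // natural-sounding falloff
        voice.minDistance = 1.0f;                           // full volume within 1m
        voice.maxDistance = 15.0f;                          // fades out at roughly room scale
    }
}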

Spatialized sound currently works on Windows desktop and Windows MR. The desktop version uses the Oculus spatializer as this can support 16000 samples per second. The Windows MR version uses the Microsoft HRTF spatializer which unfortunately requires 48000 samples per second, so I have to upsample to do this. This does mess up the quality a bit – better upsampling is a TODO.
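
The upsampling itself is currently about as crude as it gets; a naive sketch (not the actual rt-xr code) linearly interpolates each input sample into three output samples, and a properly filtered resampler is exactly the better upsampling referred to above:

// Naive 16000 to 48000 samples per second upsampler: each input sample becomes
// three output samples by linear interpolation.
public static class Upsampler
{
    public static float[] Upsample16kTo48k(float[] input)
    {
        float[] output = new float[input.Length * 3];

        for (int i = 0; i < input.Length; i++) {
            float current = input[i];
            float next = (i + 1 < input.Length) ? input[i + 1] : current;

            output[3 * i] = current;
            output[3 * i + 1] = current + (next - current) / 3.0f;
            output[3 * i + 2] = current + 2.0f * (next - current) / 3.0f;
        }
        return output;
    }
}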

Right now, the SharingServer just broadcasts a standard feed with all audio sources. Individual users filter these in two ways. First of all, they discard their own audio feed. Secondly, if the user is a physical occupant of the space, feeds from other physical occupants are omitted so as to just leave the VR user feeds. Whether or not it would be better to send customized feeds to each user is an interesting question – this could certainly be done if necessary. For example, a simple optimization would be to have two feeds – one for AR and MR users that only contains VR user audio, and the current complete feed for VR users. This has the great benefit of cutting down the bit rate to AR and MR users, whose headsets are better off not having to deal with unnecessary data. In fact, this idea sounds so good that I think I am going to implement it!

Next up is getting something to work on Android. I am using native audio capture code on the two Windows platforms and something is needed for Android. There is a Unity technique using the Microphone that, coupled with a custom audio filter, might work. If not, I might have to brush up on JNI. Probably spatialized sound is going to be difficult in terms of panning. Volume rolloff with distance should work however.

Sentient space sharing avatars with Windows desktop, Windows Mixed Reality and Android apps


One of the goals of the rt-ai Edge system is that users of the system can use whatever device they have available to interact with it and extract value from it. Unity is a tremendous help given that Unity apps can be run on pretty much everything. The main task was integration with Manifold so that all apps can receive and interact with everything else in the system. Manifold currently supports Windows, UWP, Linux, Android and macOS. iOS is a notable absentee and will hopefully be added at some point in the future. However, I see Android support as more significant since it also leads to support for multiple MR headsets.

The screen shot above and video below show three instances of the rt-ai viewer apps running on Windows desktop, Windows Mixed Reality and Android interacting in a shared sentient space. Ok, so the avatars are rubbish (I call them Sad Robots) but that’s just a detail and can be improved later. The wall panels are receiving sensor and video data from ZeroSensors via an rt-ai Edge stream processing network while the light switch is operated via a home automation server and Insteon.

Sharing is mediated by a SharingServer that is part of Manifold. The SharingServer uses Manifold multicast and end to end services to implement scalable sharing while minimizing the load on each individual device. Ultimately, the SharingServer will also supply the space definition file when a user enters a sentient space, along with details of any virtual objects that have been placed in the space by other users. This allows a new user with a standard app to enter a space and quickly create a view of the sentient space consistent with existing users.

While this is all kind of fun, the more interesting thing is when this is combined with a HoloLens or similar MR headset. The MR headset user in a space would see any VR users in the space represented by their avatars. Likewise, VR users in a space would see avatars representing MR users in the space. The idea is to get as close to a telepresent experience for VR users as possible without very complex setups. It would be much nicer to use Holoportation, but that would require that every room in the space have a very complex and expensive setup, which really isn’t the point. The idea is to make it very easy and low cost to implement an rt-ai Edge based sentient space.

Still lots to do of course. One big thing is audio. Another is representing interaction devices (pointers, motion controllers etc.) to all users. Right now, each app just sends out the camera transform to the SharingServer, which then distributes it to all other users. This will be extended to include PCM audio chunks and transforms for interaction devices so that everyone will be able to create a meaningful scene. Each user will receive the audio stream from every other user, so that each individual audio stream can be attached to the corresponding avatar, giving a spatialized sound effect using Unity capabilities (that’s the hope anyway). Another very important thing is that the apps work differently depending on whether they are running on VR type devices or AR/MR type devices. In the latter case, the walls and related objects are not drawn and just their colliders are instantiated, although virtual objects and avatars are still visible. Obviously AR/MR users want to see the real walls, light switches etc., not the virtual representations. However, they will still be able to interact in exactly the same way as a VR user.
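
Purely as a hypothetical illustration (none of these field names come from the actual Manifold message format), the per-user update might eventually carry something like this:

using System;
using UnityEngine;

// Hypothetical sketch of a per-user sharing update once audio and interaction
// devices are added.
[Serializable]
public class SharedPose
{
    public Vector3 position;
    public Quaternion rotation;
}

[Serializable]
public class UserUpdate
{
    public string userId;              // identifies whose avatar to drive
    public SharedPose head;            // camera transform, as sent today
    public SharedPose[] controllers;   // pointers / motion controllers (future)
    public byte[] audioPcm;            // 16 bit, 16000 samples per second PCM chunk (future)
}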

Using blockchain technology to create verifiable sensor records and detect fakes

These days, machine learning techniques have led to the ability to create very realistic but fake video and audio that can be tough to distinguish from the real thing. The video above shows a very interesting example of this capability. The problem with this technology is that it will become impossible to determine if anything is genuine at all. What’s needed is some verification that a video of someone (for example) really is that person. Blockchain technology would seem to provide a solution for this.

Many years ago I was working on a digital watermarking-based system for detecting tampering in video records. Essentially, this embedded error-correcting codes in each frame that could be used to determine if any region of a frame had been modified after the digital watermark had been added. Cameras would add the digital watermark at source, limiting the opportunity for modification prior to watermarking.

One problem with this is that it worked on a frame by frame basis but didn’t ensure the integrity of an entire sequence. In theory this could be done with temporally distributed watermarks but blockchain technology provides a very nice alternative.

A simple strategy would be to have the sensor (camera, microphone, motion detector, whatever) create a hash for each unit of data (video frame, chunk of audio etc.) and add this to a blockchain. Then a review app could create new hashes from the sensor data itself (stored elsewhere) and compare them to those in the blockchain. It could also determine that the account owner or device is who or what it is supposed to be in order to avoid spoofing. It’s easy to envisage an Ethereum smart contract being the basis of such a system.
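
As a sketch of the per-unit hashing step (independent of any particular blockchain API), each chunk of sensor data could be reduced to a SHA-256 digest before being committed:

using System.Security.Cryptography;

// Sketch only: reduce a unit of sensor data (video frame, audio chunk etc.) to
// a digest. Verification recomputes the digest from the stored data and
// compares it with the one recorded in the blockchain.
public static class SensorHash
{
    public static byte[] HashChunk(byte[] sensorData)
    {
        using (SHA256 sha = SHA256.Create())
        {
            return sha.ComputeHash(sensorData);
        }
    }
}

The same helper could be applied to a concatenation of chunk hashes to produce the hash of hashes mentioned below.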

One issue with this is the potential rate at which hashes need to be added to the blockchain. This rate could be reduced by hashing larger units of data (e.g. accumulating one second’s worth of data to generate one hash) or by creating a hash of hashes at an appropriate rate. The only downside to this is losing temporal resolution of where changes have been made.

It’s worth considering the effects of lossy compression. Obviously if a stream is uncompressed or only uses lossless compression, watermarking and hash generation can be done at a very early stage. Watermarking of video is designed to withstand compression, so that can still be done at a very early stage even with lossy compression. The hash has to be bit-accurate with the stream as stored on the video storage medium, though, so the hash must be computed after lossy compression.

It seems as though this blockchain concept could definitely be made to work and possibly combined with the digital watermarking technique in the case of video to provide temporal and spatial resolution of tampering. I am sure that variations of this concept are out there already or being developed and maybe, one day, it will be possible for anybody to check if a video of a well-known person is real or fake.

Real time edge inference monitoring with rt-ai Edge

rt-ai Edge is progressing nicely and now supports multi-node operation (i.e. multiple networked servers participating in a processing network) along with real-time monitoring. The screen capture shows a simple processing network where the video feed from a camera is passed through a DeepLab-v3+ stream processing element (SPE) and then on to two separate media viewers. At the top of each SPE block in the Designer window is some text like Cam(Default). Here, Cam is the name given to the SPE while Default is the name of the node (server) on which the SPE is running. In this design there are two nodes, Default and rtai0.

The code underlying the common SPE API communicates with the Designer window and supplies the stats about bytes and messages in and out. Soon, this path will also allow SPE-specific real-time parameter tweaking from the Designer window.

To add a node to the system, it just needs to have all of the prerequisites installed and run a special NodeManager SPE. This also communicates with the Designer and supports SPE deployment and runtime control, activated when the user presses the Deploy design button. Moving an SPE between nodes is just a case of reassigning it, generating the design and then deploying the design again.

The green outlines around each SPE indicate the state of the SPE and the node on which it is running. When it is all green, as in the first screen capture, this indicates that both SPE and node are running. For the second screen capture, I manually terminated the View2 SPE on rtai0. The inner part of the outline has now gone red. This indicates that the node is up but the SPE is down. If the outline is all red, it means that the node is down and not communicating with the Designer.

It’s interesting to note that DeepLab-v3+ is processing around 5 frames per second using a GTX-1080 GPU. The input rate from the camera is 30 frames per second. The processor drops frames while it is still processing an earlier frame, ensuring that queues do not build up and latency is kept to a minimum.
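
As an illustration of that policy (this is not the actual rt-ai Edge code), a drop-if-busy gate can be as simple as this:

using System.Threading;

// Sketch of the drop-if-busy policy: a new frame is only accepted if the worker
// has finished the previous one, so at 30 frames per second in and around 5
// frames per second of processing, most frames are simply discarded and no
// queue can build up.
public class FrameDropper
{
    private int busy = 0;     // 0 = idle, 1 = processing

    public void OnFrame(byte[] frame)
    {
        // Atomically claim the worker; if it is already busy, drop the frame.
        if (Interlocked.CompareExchange(ref busy, 1, 0) != 0)
            return;

        ThreadPool.QueueUserWorkItem(_ =>
        {
            try {
                Process(frame);                   // the slow inference step
            } finally {
                Interlocked.Exchange(ref busy, 0);
            }
        });
    }

    private void Process(byte[] frame) { /* run the model on the frame */ }
}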

rtndf – Python scripts for creating streaming data flow processing pipelines

The idea of joining together separate, lightweight processing elements to form complex pipelines is nothing new. DirectX and GStreamer have been doing this kind of thing for a long time. More recently, Apache NiFi has done a similar kind of thing but with Java classes. While Apache NiFi does have a lot of nice features, I really don’t want to live in Java hell.

I have been playing with MQTT for some time now and it is a very easy to use publish/subscribe system that’s used in all kinds of places. It seemed like it could be the glue for something…

So that’s really the background for rtnDataFlow, or rtndf as it is now called. It currently uses MQTT as its pub/sub infrastructure but there’s nothing too specific there – MQTT could easily be swapped out for something else if required. The repo consists of a number of pipeline processing elements that can be used to do some (hopefully) useful things. The primary language is Python, although there’s nothing stopping other languages being used provided an MQTT client is available and the JSON messages are handled correctly. It will even be possible to include pipeline processing elements in Docker containers, which will make deployment of new, complex pipeline processing elements very simple.

The pipeline processing elements are all joined up using topics. Pipeline processing elements can publish to one or more topics and/or subscribe to one or more topics. Because pub/sub systems are intrinsically multicasting, it’s very easy to process data in multiple ways in parallel (for redundancy, performance or functionality). MQTT also allows pipeline processing elements to be distributed on multiple systems, allowing load sharing and heterogeneous computing systems (where only some machines might be fitted with GPUs for example).
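
As noted above, any language with an MQTT client can take part; purely as an illustration, a minimal pass-through pipeline processing element written in C# with the M2Mqtt client library might look like this (topic names and broker address are placeholders):

using System;
using System.Text;
using uPLibrary.Networking.M2Mqtt;
using uPLibrary.Networking.M2Mqtt.Messages;

// Sketch of a pipeline processing element: subscribe to an input topic,
// transform each JSON message and republish it on an output topic.
public class PassThroughElement
{
    public static void Main()
    {
        MqttClient client = new MqttClient("localhost");

        client.MqttMsgPublishReceived += (sender, e) =>
        {
            string json = Encoding.UTF8.GetString(e.Message);
            // ... process the JSON payload here ...
            client.Publish("rtndf/output", Encoding.UTF8.GetBytes(json));
        };

        client.Connect(Guid.NewGuid().ToString());
        client.Subscribe(new string[] { "rtndf/input" },
                         new byte[] { MqttMsgBase.QOS_LEVEL_AT_MOST_ONCE });

        Console.ReadLine();      // keep the element running
    }
}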

Obviously, tools are required to design the pipelines and also to manage them at runtime. The design aspect will come from an old code generation project. While that actually generates C and Python code from a design that the user inputs via a graphical interface, the rtnDataFlow version will just make sure all topic names and broker addresses line up correctly and then produce a pipeline configuration file. A special app, rtnFlowControl, will run on each system and will be responsible for implementing the pipeline design specified.

So what’s the point of all of this? I’m tired of writing (or reworking) code multiple times for slightly different applications. My goal is to keep the pipeline processing elements simple enough and tightly focused so that the specific application can be achieved by just wiring together pipeline processing elements. There’ll end up being quite a few of these of course and probably most applications will still need custom elements but it’s better than nothing. My initial use of rtnDataFlow will be to assist with experiments to see how machine learning tools can be used with IoT devices to do interesting things.