An important goal of the rt-xr project is to allow MR and AR headset wearing physical occupants of a sentient space to interact as naturally as possible with virtual users in the same space. A component of this is spatialized sound, where a sound or someone’s voice appears to originate from where it should in the scene. Unity has a variety of tools for achieving this, depending on the platform.
I have standardized on 16 bit, single channel PCM at 16000 samples per second for audio within rt-xr in order to keep implementation simple (no need for codecs) but still keep the required bit rate down. The problem is that the SharingServer has to send all audio feeds to all users – each user needs all the other user’s feeds so that they can spatialize them correctly. If spatialized sound wasn’t required, the SharingServer could just mix them all together on some basis. Another solution is for the SharingServer to just forward the dominant speaker but this assumes that only intermittent speakers are supported. Plus it leads to the “half-duplex” effect where the loudest speaker blocks everyone else. Mixing them all is a lot more democratic.
Another question is how to deal with occupants in different rooms within the same sentient space. Some things (such as video) are turned off to reduce bit rate if the user isn’t in the same room as the video panel. However, it makes sense that you can hear users in other rooms at an appropriate level. The AudioSource in Unity has tools for ensuring that sound levels drop off appropriately.
Spatialized sound currently works on Windows desktop and Windows MR. The desktop version uses the Oculus spatializer as this can support 16000 samples per second. The Windows MR version uses the Microsoft HRTF spatializer which unfortunately requires 48000 samples per second so I have to upsample to do this. This does mess up the quality a bit – better upsampling is a todo.
Right now, the SharingServer just broadcasts a standard feed with all audio sources. Individual users filter these in two ways. First of all, they discard their own audio feed. Secondly, if the user is a physical occupant of the space, feeds from other physical occupants are omitted so as to just leave the VR user feeds. Whether or not it would be better to send customized feeds to each user is an interesting question – this could certainly be done if necessary. For example, a simple optimization would be to have two feeds – one for AR and MR users that only contains VR user audio and the current complete feed for VR users. This has the great benefit of cutting down bit rate to AR and MR users whose headsets may benefit from not having to deal with unnecessary data. In fact, this idea sounds so good that I think I am going to implement it!
Next up is getting something to work on Android. I am using native audio capture code on the two Windows platforms and something is needed for Android. There is a Unity technique using the Microphone that, coupled with a custom audio filter, might work. If not, I might have to brush up on JNI. Probably spatialized sound is going to be difficult in terms of panning. Volume rolloff with distance should work however.