AUI Design Specs

Technical Specifications

I have in mind that each "widget" in an AUI will have a) a 3D location with respect to either the speakers/headphones or the user's anatomy, and b) a sound source, whether a pre-recorded file or a software-generated stream of audio. In addition there will be c) a volume control associated with each one, as well as, to be perfectly general, d) a general-purpose, extensible method of providing additional controls to the sound source. (One might want to tell the widget to add an audio filter, change the voice, speed up or slow down, switch to a different audio file as the data source, or apply any of the entire range of text-to-speech system controls, etc. Alternatively, if an audio source has its own API, as in the case of a TTS system, then it may be best for the application program to control the source directly through that API, so that the AUI API is kept minimal.)
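As a concrete illustration only, the per-widget record might look something like the following C sketch; every type and field name here is hypothetical, not part of any existing API.

/* Illustrative sketch of one possible per-widget record in C, covering
 * the four attributes (a)-(d) above.  All names are hypothetical. */

typedef struct {
    float x, y, z;                /* 3D location relative to head or speakers */
} AuiLocation;

typedef enum {
    AUI_SOURCE_FILE,              /* (b) pre-recorded audio file */
    AUI_SOURCE_STREAM             /* (b) software-generated audio stream */
} AuiSourceKind;

typedef struct AuiControl {       /* (d) open-ended list of extra controls */
    const char *name;             /* e.g. "filter", "voice", "rate" */
    const char *value;
    struct AuiControl *next;
} AuiControl;

typedef struct {
    AuiLocation   location;       /* (a) 3D location */
    AuiSourceKind kind;           /* (b) kind of sound source */
    const char   *source;         /* file name, or a handle for a stream */
    float         volume;         /* (c) per-widget volume, 0.0 .. 1.0 */
    AuiControl   *controls;       /* (d) extensible, source-specific controls */
} AuiWidget;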

System Design Features

An AUI infrastructure should be implemented as an asynchronous programming system which controls the state of the current AUI scene. The scene can be thought of as a set of audio widgets, each with its own location, audio media source, playback state, and parameter set. The API functionality, then, can be thought of as putting a speaker at some location in auditory space, wiring it to some sound source, setting its volume level and other parameters, and, when desired, telling it to start playback, as well as to pause, seek, flush, or stop (where stop = pause, then seek to end).
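Reusing the hypothetical AuiWidget record sketched above, the scene state and the transport operations might be declared roughly as follows; the function bodies are elided, so this is a sketch of the shape of the API, not an implementation.

/* Hypothetical scene state and per-widget transport control. */

typedef enum {
    AUI_STOPPED,
    AUI_PLAYING,
    AUI_PAUSED
} AuiPlayState;

typedef struct {
    AuiWidget *widgets;           /* the widgets currently in the scene */
    int        nwidgets;          /* bounded; see the limits discussed below */
} AuiScene;

/* Transport operations on a single widget (bodies elided in this sketch). */
void aui_start(AuiWidget *w);                  /* begin playback            */
void aui_pause(AuiWidget *w);                  /* hold the current position */
void aui_seek (AuiWidget *w, double seconds);  /* move the playback cursor  */
void aui_flush(AuiWidget *w);                  /* discard pending audio     */
void aui_stop (AuiWidget *w);                  /* = pause, then seek to end */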

An AUI scene is necessarily limited in the number of these widgets it can handle, with the limits being both real (up to the number of 3D sound sources that the sound hardware can generate and mix together) and virtual (additional sources at different locations can be mixed in if they take turns being silent).

The audio output system associated with a single speaker set must be (at least) double-threaded. The low-level output thread, let's call it DAPump(), should be highly real-time and should basically be a copying function; it should have an output buffer size and loop time-scale equivalent to only 30 to 100ms of audio. The higher-level output buffer manager, let's call it DAMixer(), can schedule long sequences into the current audio output flow. With an infrastructure like this, when a generic AUI API program calls a function to D/A a buffer of samples, it effectively calls a DAMixer() method, giving it a pointer to a buffer in memory and instructions on how to mix it in (volume, delay or priority, 3D location, sample rate). The mixer then mixes that signal into its output buffer, which is shared with DAPump(), and the DAPump() thread, because it is real-time, starts sending out the audio with the new material mixed in within 30-100ms.
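The following is a rough C sketch of that two-thread arrangement, using hypothetical names; it shows only the shared buffer and the division of labor, and the actual audio-driver call is left as a comment.

/* Hypothetical sketch of the DAMixer()/DAPump() split.  The shared output
 * buffer holds well under 100ms of audio, so anything the mixer adds
 * becomes audible within roughly one pump cycle. */

#include <pthread.h>
#include <string.h>

#define PUMP_FRAMES 2048          /* about 46ms at 44.1kHz, inside the 30-100ms target */

static short           outbuf[PUMP_FRAMES * 2];   /* stereo, shared with DAPump() */
static pthread_mutex_t outbuf_lock = PTHREAD_MUTEX_INITIALIZER;

/* DAMixer(): called when an AUI program wants a buffer of samples D/A'd.
 * It mixes the caller's samples into the shared buffer at the given gain.
 * 3D placement, resampling, and delay/priority handling are omitted here. */
void DAMixer(const short *samples, int nframes, float gain)
{
    pthread_mutex_lock(&outbuf_lock);
    for (int i = 0; i < nframes * 2 && i < PUMP_FRAMES * 2; i++)
        outbuf[i] += (short)(samples[i] * gain);  /* additive mix */
    pthread_mutex_unlock(&outbuf_lock);
}

/* DAPump(): the near-real-time thread.  It only copies: grab the shared
 * buffer, hand it to the audio driver, clear it for the next mix, loop. */
void *DAPump(void *arg)
{
    short local[PUMP_FRAMES * 2];
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&outbuf_lock);
        memcpy(local, outbuf, sizeof outbuf);
        memset(outbuf, 0, sizeof outbuf);
        pthread_mutex_unlock(&outbuf_lock);
        /* write_to_audio_device(local, PUMP_FRAMES);  -- driver call omitted */
    }
    return NULL;
}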

API

The API should include widget-specific functions (creation, destruction, and the setting and getting of location, volume, state, and input source; see the draft definitions below). It should also include widget-group functions to create a group of widgets which can be manipulated as a set (corresponding in some ways to the concept of container windows in GUIs), along with corresponding joint-manipulation functions.

A natural grouping is the set of widgets which take turns being silent while the others in the group have their audio signal presented. Turn-taking rules for a group should be established, covering both time-slicing and ordering. For ordering, the order in which a widget is added to a group is its playback order within the group. For time-slicing, various policies might be supported; one simple round-robin policy is sketched below.
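For illustration, here is a hypothetical C sketch of such a group, built on the AuiWidget record and transport declarations above: insertion order is playback order, and a fixed time slice rotates audibility round-robin through the members. The names, the fixed-slice policy, and the size limit are all assumptions of this sketch.

/* Hypothetical widget group with round-robin turn-taking. */

#define AUI_GROUP_MAX 16

typedef struct {
    AuiWidget *members[AUI_GROUP_MAX];  /* in the order they were added */
    int        nmembers;
    int        current;                 /* index of the member now audible */
    double     slice_ms;                /* time slice per member (one possible policy) */
} AuiGroup;

/* Adding a widget appends it, so insertion order == playback order. */
int aui_group_add(AuiGroup *g, AuiWidget *w)
{
    if (g->nmembers >= AUI_GROUP_MAX)
        return -1;
    g->members[g->nmembers++] = w;
    return 0;
}

/* Called by a scheduler every slice_ms: silence the current member and
 * un-mute the next one in insertion order. */
void aui_group_take_turn(AuiGroup *g)
{
    if (g->nmembers == 0)
        return;
    aui_pause(g->members[g->current]);
    g->current = (g->current + 1) % g->nmembers;
    aui_start(g->members[g->current]);
}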

The infrastructure underneath the API needs to handle dynamic updates as well. Note that the auditory-space locations of the widgets can be changed in real time (to simulate flying over the auditory landscape, for example); those changes should be instantly reflected in the synthesis process.
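A minimal sketch of how that might look, again with hypothetical names built on the AuiWidget and AuiLocation types above: a trajectory function maps elapsed time to a location, and the synthesis loop re-evaluates it once per pump cycle. This corresponds to the setLocation(LOCATION l, TRAJECTORYFUNCTION *f) entry in the draft API below.

/* Hypothetical real-time location updates via a trajectory function. */

typedef AuiLocation (*AuiTrajectoryFn)(double elapsed_seconds);

/* Example trajectory: fly left-to-right across the auditory landscape,
 * staying somewhat ahead of and above the listener. */
static AuiLocation fly_across(double t)
{
    AuiLocation loc = { -10.0f + (float)(2.0 * t),   /* left to right */
                         5.0f,                       /* ahead         */
                         3.0f };                     /* above         */
    return loc;
}

/* Called once per DAPump() cycle (every 30-100ms), so the widget's
 * position in auditory space tracks the trajectory with no audible lag. */
void aui_update_location(AuiWidget *w, AuiTrajectoryFn f, double elapsed)
{
    if (f != NULL)
        w->location = f(elapsed);
}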

This infrastructure could be implemented in Java or C, on PCs, workstations, or enhanced stereo systems, or eventually even on portable walkman-type stereos.

Reference dimensions:

Let (0,0,0) be located in the middle of the head (at the third eye, behind and between the user's eyebrows, or, more or less equivalently, in the center of stereo audio space halfway between the ears). Let the dimensions be ordered by perceptual importance, so that the first dimension is the stereo dimension, i.e., left-to-right (going from negative to positive as real-number graphs typically do); the second dimension is back-to-front (going from negative to positive, with behind as negative and ahead as positive); and the third dimension is below-to-above (also going from negative to positive; here the orientation is the same as for altitude).
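To make the convention concrete, here are a few illustrative coordinates using the hypothetical AuiLocation type from the sketches above; the units are arbitrary.

/* Illustrative points under the (left-right, back-front, below-above)
 * ordering, with (0,0,0) at the center of the head. */

AuiLocation at_center      = {  0.0f,  0.0f,  0.0f };  /* middle of the head    */
AuiLocation hard_right     = {  1.0f,  0.0f,  0.0f };  /* fully to the right    */
AuiLocation behind_left    = { -0.5f, -1.0f,  0.0f };  /* behind, somewhat left */
AuiLocation overhead_front = {  0.0f,  1.0f,  1.0f };  /* ahead and above       */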

Requirements of the Software Infrastructure/API

An AUI infrastructure ought to cleanly handle the widgets, groups, real-time location updates, and mixing limits described above; the draft definitions below sketch one possible interface.

API definitions (draft, not implemented)

object AUIWidget {
  public method create();
  public method destroy();

  public method setLocation(LOCATION l);
  public method setLocation(LOCATION l, TRAJECTORYFUNCTION *f);
  public method LOCATION getLocation();        /* return NULL on error */

  public method setVolume(VOLUME v);           /* return -1 on error */
  public method VOLUME getVolume();            /* return NULL on error */

  public method setState(AUIWIDGETSTATE s);    /* return -1 on error */
  public method AUIWIDGETSTATE getState();     /* return NULL on error */

  public method setInput(FILE *f, SAMPLEINFO i);                  /* return -1 on error */
  public method setInput(SOCKET *socket, SAMPLEINFO i);           /* return -1 on error */
  public method setInput(AUDIOSTRUCT *audioStruct, SAMPLEINFO i); /* return -1 on error */

  private LOCATION location;
  private VOLUME volume;
  private AUIWIDGETSTATE state;
  private SHORT buffer[BUFSIZE];
  private int index = -1;

  in create() {
        location = NULL;
        volume = NULL;
        state = NULL;
        /* Start a thread which waits for state to be unlocked for playback
           and then spits data from buffer out to the kernel audio buffers,
           up to some limited duration (10ms?), then unlocks them again. */

        /* Don't let state be unlocked without volume and location having
           been set, at least to reasonable defaults. */
  }
}

kernel: wait on the kernel audio buffers to be unlocked, then spit them out
        to the audio drivers until some limit is attained, then relinquish
        the kernel audio buffer lock, and loop.
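By way of illustration, here is how the pieces sketched in this document might be exercised end to end, using the hypothetical C types and functions introduced above (AuiWidget, AuiGroup, and the transport declarations). Since the transport bodies were elided, this is a usage sketch rather than a buildable program, and the file names are placeholders.

/* Hypothetical end-to-end usage: two widgets placed in auditory space,
 * grouped, and played with round-robin turn-taking. */

#include <stdio.h>

int main(void)
{
    AuiWidget news = { { -1.0f, 0.5f, 0.0f },           /* left, slightly ahead  */
                       AUI_SOURCE_FILE, "news.wav", 0.8f, NULL };
    AuiWidget mail = { {  1.0f, 0.5f, 0.0f },           /* right, slightly ahead */
                       AUI_SOURCE_FILE, "mail.wav", 0.6f, NULL };

    AuiGroup g = { { NULL }, 0, 0, 500.0 };             /* 500ms slice per member */
    aui_group_add(&g, &news);
    aui_group_add(&g, &mail);

    aui_start(g.members[g.current]);                    /* first member speaks ...      */
    aui_group_take_turn(&g);                            /* ... then the next takes over */

    printf("now audible: %s\n", g.members[g.current]->source);
    return 0;
}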

GUI vs AUI

All modern GUI software development, including X-Windows, MS-Windows, Apple's UI, tcl-tk, Java AWT and JFC, Berlin, and GGI, has as its motivation providing a layer of functionality that GUI programmers really need, namely a way to get their software entities (data input dialogs, data representations, action choice selection methods) displayed on the screen. That is, each provides an implemented API that includes ways of creating and controlling the various GUI widgets that people like to have: scrollbars, buttons, menus, etc.

Similarly, AUI programmers need a way to get their own software entities (data input dialogs, data representations, action choice selection methods) represented audibly through stereo or 3D audio playback devices, which is to say, an API that includes ways of creating and controlling the various AUI features that people might wish to control.

Thus GUI work and AUI work are related at some highly abstract level, but practically speaking, for most programming purposes, they are distinct. For example, the sound card drivers that need to be written to form the basis for an AUI system are separate from the video card drivers that need to be written to form the basis of a GUI system.

Needs, Limitations

Developing applications that make use of an AUI infrastructure means application developers themselves must think in terms of the audio sources which could correspond to their various interface and logical entities in the application domain.

Everyone thinks they are a graphic designer and can usefully critique anyone's GUI layout; now we all have to become audio environment designers and cultivate our taste regarding AUI features! The closest professional qualification might be that of a musical composer, but the concerns of music are quite different from those of computer software applications in general, so it's up to us lay people and inventors!

And it's up to the users who need or want an AUI to develop it for themselves!

 

Copyright © 1998-2003 Tom Veatch All rights reserved.
Modified: April 8, 2003