PP Remote Testing Wiki | Issues / Calibration

Stimulus Calibration

Audio Stimuli Calibration

Within the scope of psychological and physiological acoustics, the goal of many experiments using remote testing is to collect responses to a set of acoustic stimuli from test subjects at remote sites. It is crucial to calibrate the stimuli so that the obtained responses are not dominated by variations in presentation level or by the specific transducers used (e.g., earphones, loudspeakers). The appropriate calibration of the sound pressure level is especially important because (1) the experimenter has to make sure that the stimuli are audible to the subjects; (2) the sound level needs to be kept within a safety limit (see Compliance); and (3) many auditory and speech perception phenomena are level dependent.

Remote testing poses many challenges to the calibration of acoustic stimuli, mostly because the experimenter may not have full knowledge of the sound delivery system and environment at the subject's end. Therefore, when selecting an appropriate platform for a remote experiment, one of the first considerations should be how critical calibration is to the research question under investigation (see PlatformConsiderations). Some platform-specific information with regard to calibration can be found under HardwareAndCalibration. General approaches for addressing this problem include:

Approach 0, No formal calibration: Many experiments involving supra-threshold and/or high-level processes may be conducted with limited influence from inexact audio calibration. In such cases, a browser-based experimentation platform (e.g., Gorilla, jsPsych, PsychoPy; see PlatformDescriptions for descriptions of these platforms) may be preferable: although these platforms do not provide precise calibration of audio presentation, they can be cost-effective and hence enable a relatively large sample size. It should be noted that even for experiments without formal calibration, simple verifications are recommended to ensure the audibility of the stimuli and listening safety.

Approach 1, Electroacoustic calibration: Experiments that involve subjects with hearing impairment, or that address research questions expected to depend on stimulus characteristics such as level and spectrum, may consider an experimentation platform that involves sending pre-calibrated systems (e.g., tablets with headphones) to the subjects. Examples of research platforms in this category include hearX and PART (see PlatformDescriptions for descriptions of these platforms). Although Approach 1 enables precise control of the test stimuli, the logistics (i.e., teaching subjects to use the equipment, shipping and receiving the equipment, troubleshooting remotely, and answering ongoing questions) may be time-consuming and costly.

Approach 2, Limited calibration involving reports from the subject: Besides the two scenarios described above, there may be situations in which some degree of control over the stimulus characteristics is preferred but not critical. In these cases, browser-based platforms may be used, and a moderate degree of stimulus control may be achieved with the involvement of the subject. For example, the subject may be instructed to report the manufacturers and models of the devices used in the experiment. Calibration can then be completed using the known specifications for the devices.

Approach 3, Limited calibration using psychophysical techniques: One specific technique for calibrating the acoustic stimuli with the participation of the subjects is psychophysical (or perceptual) calibration. This calibration method is especially useful for experiments aimed at healthy, normal-hearing adults. For this population, normative performance ranges (and psychophysical models) are well established for many basic auditory tasks. Incorporating some of these basic auditory tasks into the experimental protocol provides a way to probe the fidelity of the stimulus presentation system and the background noise level at the subject's end. Additionally, there are many binaural phenomena that require the appropriate placement of headphones (binaural beating, binaural masking level difference, binaural pitch, etc.). Tasks based on these phenomena may be implemented to check whether headphones are used and correctly placed during the experiment.

Approach 1, Electroacoustic Calibration

For this approach, the experimenter is responsible for calibrating the hardware systems before sending them to the subjects. The calibration procedure should follow the relevant ISO/ANSI standards (e.g., ANSI S3.7) and/or the best practices of the field. Some commercially available platforms include a hardware calibration service in their annual subscriptions (e.g., hearX and SHOEBOX).

Calibration setup: A typical setup for level calibration of headphones consists of (1) a coupler (or artificial ear) for the type of transducer used and (2) a sound level meter with its measurement microphone attached to the coupler. The purpose of the coupler is to simulate the impedance of the ear canal, so that the readings from the sound level meter approximate the sound pressure level that would be measured at the eardrum of a typical subject. Depending on the type of transducer (e.g., supra-aural earphones, circum-aural earphones, insert earphones, or hearing aids), a corresponding type of coupler should be used. As a rule of thumb, supra- and circum-aural earphones require couplers with a larger volume (~6 cc), while insert earphones and insert-type hearing aid receivers require couplers with a smaller volume (~2 cc).

Calibration stimuli: Most stimulus presentation systems (e.g., headphones connected to sound cards) are designed to be linear. Pure tones at various frequencies are typically used as the calibration stimuli. With the calibration tone presented above the noise floor, the correspondence between the rms amplitude of the digital signal and the measured sound pressure level can be found. Because the system is linear, this correspondence holds at other levels within the dynamic range at that frequency. In some special circumstances, the stimuli are presented via a device with nonlinear dynamic processing (e.g., digital hearing aids). Such systems provide different amounts of amplification to the stimuli in a frequency- and intensity-dependent fashion. For such applications, the stimuli used for calibration should resemble the level and spectrotemporal characteristics of the test stimuli. For example, for a speech-recognition experiment, the International Speech Test Signal (ISTS), presented at the same level as the test speech in the experiment, may be used as the calibration stimulus.
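The correspondence between digital rms amplitude and measured sound pressure level can be sketched as a single offset in dB, assuming a linear system and a single calibration frequency. The function names, the 48-kHz sampling rate, the tone amplitude, and the 80 dB SPL coupler reading below are illustrative assumptions, not values from any particular platform. Digital levels are expressed here in dB re an rms amplitude of 1.0; other dBFS conventions differ by a constant.

```python
import math

def rms_db(samples):
    """RMS level of a digital signal in dB re an rms amplitude of 1.0."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)

def calibration_offset(tone_samples, measured_db_spl):
    """Offset (dB) that converts a digital rms level into dB SPL."""
    return measured_db_spl - rms_db(tone_samples)

def gain_for_target(stimulus_samples, target_db_spl, offset_db):
    """Linear scale factor that brings a stimulus to the target dB SPL."""
    current_db_spl = rms_db(stimulus_samples) + offset_db
    return 10.0 ** ((target_db_spl - current_db_spl) / 20.0)

# Hypothetical measurement: a 1-kHz tone with peak amplitude 0.1 reads
# 80 dB SPL on the sound level meter attached to the coupler.
fs = 48000
tone = [0.1 * math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(fs)]
offset = calibration_offset(tone, measured_db_spl=80.0)

# Scale factor needed to present the same tone at 65 dB SPL instead.
scale = gain_for_target(tone, target_db_spl=65.0, offset_db=offset)
```

The offset is only valid at the calibration frequency and for the volume settings in effect during the measurement; a nonlinear device such as a hearing aid would not be described by a single offset.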

Device verification by the subject: For an experiment that utilizes multiple experimental devices (e.g., tablet-headphone pairs), each device should be properly identified (e.g., by a device ID) and calibrated separately. During the experiment, it is useful to have the subject report the device ID to ensure the accurate pairing of the calibration data and the device. Even when the testing systems are appropriately calibrated at the site of the experimenter, the stimuli may still be off-calibration due to damage to the device during shipment, improper connection between various components of the device (e.g., the headphones are disconnected from the tablet), the wrong experimental software being run, or improper placement of the headphones. To prevent such incidents, a simple psychophysical verification procedure may be used, which may include verifying whether stimuli presented at suprathreshold levels are indeed audible to the subject and, for binaural experiments, whether the stimuli delivered through the two sides of the headphones are properly balanced and synchronized.

Beyond level calibration: Besides calibrating the presentation level, other electroacoustic measurements may also be useful and informative to the experimenter. Some common measurements include the dynamic range, the frequency response, and the crosstalk between channels.

* Dynamic range: The dynamic range is the range of levels within which the test stimuli can be presented without significant distortion. The lower limit of the dynamic range is typically the noise floor, i.e., the output level when no stimulus is presented. It is worth pointing out that the noise floor is not the reading from the sound level meter when the system is powered off. Rather, it is the noise level expected during stimulus presentation. It is recommended that an all-zero array be used as the "calibration stimulus" during the measurement of the noise floor. The upper limit of the dynamic range is the maximum output level without distortion (e.g., clipping). The experimenter should choose the appropriate devices and transducers so that the test stimuli are not too close to either the lower or upper limit of the dynamic range. In other words, an experimental system with a larger dynamic range can accommodate a greater variety of experiments.

* Frequency response: The frequency response is the output of the testing system for a unit-amplitude input, as a function of frequency. An ideal sound delivery system would have a flat frequency response, so that it does not introduce additional coloration to the experimental stimuli. The frequency response is typically measured by analyzing the response to a broadband stimulus. Depending on the specific methodology used, the test stimulus may be a sequence of pure tones, a frequency-varying tone sweep, Gaussian noise, a maximum-length sequence, or Golay codes.

* Crosstalk between channels: Crosstalk refers to the signal leakage from one channel to another. Crosstalk between channels can be problematic for auditory experiments when the stimulus meant for one test ear is audible from the headphone for the opposite ear. Portable testing platforms may be more susceptible to crosstalk because (1) the left and right channels of common portable computers and mobile phones usually have a shared ground return, and (2) low-impedance earphones are typically used with portable devices to ensure sufficiently high output levels. Crosstalk is measured by applying a test signal to one channel, measuring that signal's level from the other channel, and then expressing the measured level as a ratio (in dB) relative to the source signal. Since crosstalk is often frequency dependent, the measurement is typically conducted as a function of frequency.
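The dynamic-range and crosstalk definitions in the list above reduce to simple dB arithmetic. The sketch below assumes that levels have already been measured (e.g., with a sound level meter or as rms amplitudes from a recording); the function names and all numeric values are hypothetical placeholders rather than measurements from a real device.

```python
import math

def dynamic_range_db(max_undistorted_db, noise_floor_db):
    """Span between the highest undistorted level and the noise floor."""
    return max_undistorted_db - noise_floor_db

def crosstalk_db(driven_rms, leaked_rms):
    """Level of the leaked signal relative to the driven channel, in dB."""
    return 20.0 * math.log10(leaked_rms / driven_rms)

def crosstalk_by_frequency(measurements):
    """Map {frequency_hz: (driven_rms, leaked_rms)} to {frequency_hz: dB}."""
    return {f: crosstalk_db(d, l) for f, (d, l) in measurements.items()}

# Placeholder values: noise floor measured with an all-zero stimulus,
# maximum level measured just below the onset of clipping.
span = dynamic_range_db(max_undistorted_db=105.0, noise_floor_db=25.0)

# Placeholder crosstalk measurements at three frequencies.
xt = crosstalk_by_frequency({
    250: (1.0, 0.001),
    1000: (1.0, 0.002),
    4000: (1.0, 0.005),
})
```

More negative crosstalk values indicate better channel separation; reporting the result per frequency follows the recommendation above that crosstalk be measured as a function of frequency.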

In-situ or self-calibration: For send-home testing systems that also carry built-in microphones, the microphone on the device can be calibrated so that it can be used for in-situ or self-calibration. In-situ calibration is required for most experiments that involve free-field stimulus presentation (using loudspeakers). In such cases, the microphone should be positioned near where the subject's head would be during the experiment. A calibration stimulus is then presented via the loudspeaker(s), and the sound level or frequency response can be derived from the microphone recording of the stimulus. Another application of built-in microphones is to measure the noise level (or the power spectrum of the ambient noise) in the subject's test environment. It should be noted that the raw audio recordings made during the in-situ measurements should not be stored without appropriate IRB approval and the subject's consent (see Compliance).

Approach 2, Limited calibration involving reports from the subject

For this approach, the subject reports the manufacturer and model of the devices used in the experiment, and the experimenter configures the stimuli based on the available specifications for these devices. For examples of databases of headphone specifications, see Earphones. It should be noted that some of the available headphone specifications are measured without the use of a coupler.

Level calibration based on headphone sensitivity: The most relevant specification for level calibration is the headphones' sensitivity. It is the sound pressure level that would be produced by the headphones with unit input voltage (i.e., 1 volt) at 1 kHz. Therefore, if the maximum output level of the sound card (the output voltage in dBu at 0 dBFS, where 0 dBu = sqrt(0.6) V ≈ 0.775 V) is also available, the maximum output sound pressure level can be derived and the conversion from dBFS to dB SPL can be achieved. Sometimes, the sensitivity of the headphones is given for unit input power (i.e., 1 milliwatt). In this case, the sound pressure level needs to be calculated based on both the voltage that the headphones receive and the headphones' impedance. For example, if the sound card has a maximum output level of 4 dBu at 0 dBFS, the sensitivity of the headphones is 102 dB SPL/mW, and the impedance of the headphones is 64 Ohm, then the maximum output voltage is sqrt(0.6)*10^(4/20) = 1.23 V, the maximum output power is (1.23)^2/64 = 0.0235 W = 23.5 mW, and the maximum output sound pressure level (corresponding to 0 dBFS) is 102+10*log10(23.5/1) = 115.7 dB SPL.
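The worked example above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function names are illustrative, and it assumes that the headphones receive the full open-circuit output voltage of the sound card (i.e., that the output impedance of the sound card is negligible relative to the headphone impedance), which may not hold for every device.

```python
import math

DBU_REF_VOLTS = math.sqrt(0.6)  # 0 dBu, approximately 0.775 V

def dbu_to_volts(level_dbu):
    """Convert a level in dBu to rms volts."""
    return DBU_REF_VOLTS * 10.0 ** (level_dbu / 20.0)

def max_spl_from_power_sensitivity(max_out_dbu, sens_db_spl_per_mw, impedance_ohm):
    """Level at 0 dBFS when sensitivity is specified per 1 mW of input power."""
    volts = dbu_to_volts(max_out_dbu)
    power_mw = 1000.0 * volts ** 2 / impedance_ohm
    return sens_db_spl_per_mw + 10.0 * math.log10(power_mw)

def max_spl_from_voltage_sensitivity(max_out_dbu, sens_db_spl_per_volt):
    """Level at 0 dBFS when sensitivity is specified per 1 V of input voltage."""
    return sens_db_spl_per_volt + 20.0 * math.log10(dbu_to_volts(max_out_dbu))

def dbfs_to_db_spl(level_dbfs, max_spl):
    """Convert a digital level in dBFS to dB SPL given the level at 0 dBFS."""
    return max_spl + level_dbfs

# Values from the example in the text: 4 dBu at 0 dBFS, 102 dB SPL/mW, 64 Ohm.
max_spl = max_spl_from_power_sensitivity(4.0, 102.0, 64.0)  # about 115.7 dB SPL
level_at_minus_40 = dbfs_to_db_spl(-40.0, max_spl)          # about 75.7 dB SPL
```

Because published sensitivities are nominal values at 1 kHz, the result is best treated as an estimate rather than as a substitute for a coupler measurement.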

Approach 3, Limited calibration using psychophysical techniques

For this approach, a psychophysical procedure is conducted for the purpose of calibration and system verification.

Loudness-based level adjustment: One of the simplest ways to perform perceptual calibration is to instruct the subject to adjust the volume control of the sound delivery system so that the test stimuli are presented at the most comfortable level. Alternatively, the subject may adjust the level of an anchor stimulus to the most comfortable level, and the presentation levels of the test stimuli are then set relative to that of the anchor stimulus. This is a relatively quick procedure, typically taking 2-3 minutes. For pure tones presented in quiet, the expected standard deviation in the listeners' most comfortable loudness (MCL) levels is typically greater than 10 dB (see Punch et al., 2004 for detailed discussions on the various measurement considerations).

Besides adjusting a stimulus to the most comfortable loudness level, a procedure to measure the loudness growth at 1 kHz may be conducted for calibration purposes. The loudness growth function at 1 kHz, i.e., how the loudness rating grows with the stimulus level, is well established for normal-hearing adults. The procedure for measuring the loudness growth function has been standardized (ISO 16832). Therefore, it is possible to first measure the loudness growth function using uncalibrated levels and then compare the obtained data with the published normative results to estimate the conversion factor for calibration.

Threshold-based calibration: sensation level: For an experimental system with unknown calibration, absolute thresholds may be measured first for the test stimuli using an uncalibrated arbitrary unit. The test stimuli can then be presented at a desired sensation level (in dB SL). For example, 10 dB SL indicates a level that is 10 dB above the absolute threshold for the stimulus. The advantage of configuring the stimulus level in dB SL is that the audibility of the stimuli is ensured for each individual subject and for their specific sound delivery system. However, there are a few disadvantages associated with this approach that researchers need to consider. First, threshold measurements add testing time. Second, the subject's testing system and environment may be sub-optimal for threshold measurements, and the measured thresholds may be dominated by masking from electronic or ambient noise. Third, for a sound delivery system with a limited dynamic range, it may be difficult both to conduct threshold measurements and to present stimuli at high sensation levels. For example, consider an experiment with a pure-tone stimulus at 1 kHz and 50 dB SL. Measuring the absolute threshold for the tone requires adjusting the volume control of the system so that the noise floor is reasonably lower than the threshold (the hardware noise should not be audible). If the noise floor is 10 dB below the absolute threshold, then presenting the tone at 50 dB SL requires at least 60 dB of dynamic range. Last, for subjects with more than a moderate degree of hearing loss, setting the stimulus level at a fixed sensation level may lead to very high, unsafe sound pressure levels (if still within the dynamic range of the system) or severe distortion (if the level exceeds the dynamic range).
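The sensation-level bookkeeping described above can be sketched as follows. All levels are in the same uncalibrated unit (here, dB re digital full scale); the function names, the threshold of -70 dB, and the choice of 0 dB as the maximum level are hypothetical. Note that this sketch only guards against exceeding the system's maximum output; it cannot by itself guarantee a safe sound pressure level, because the absolute level is unknown in this approach.

```python
def required_dynamic_range_db(sensation_level_db, noise_margin_db):
    """Dynamic range needed to measure threshold and present at the target SL.

    noise_margin_db is how far the noise floor sits below the threshold.
    """
    return sensation_level_db + noise_margin_db

def presentation_level(threshold_level, sensation_level_db, max_level):
    """Uncalibrated presentation level for a target sensation level."""
    level = threshold_level + sensation_level_db
    if level > max_level:
        raise ValueError("requested sensation level exceeds the maximum output")
    return level

# Worked example from the text: 50 dB SL, noise floor 10 dB below threshold.
needed = required_dynamic_range_db(50.0, 10.0)          # 60.0 dB

# Hypothetical subject whose measured threshold is -70 dB re full scale.
level = presentation_level(-70.0, 50.0, max_level=0.0)  # -20.0 dB re full scale
```

A subject with an elevated threshold (e.g., -30 dB re full scale in this sketch) would trigger the error for the same 50 dB SL target, which corresponds to the last disadvantage listed above.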

Headphone verification using binaural phenomena: In many experiments, it is crucially important to verify that the subject is wearing headphones with the correct orientation during the experiment. This can be achieved by conducting a psychophysical procedure involving an auditory percept (such as binaural pitch) that is strongly dependent on specific interaural phase relationships. Examples of using binaural phenomena to verify headphone connection/placement can be found in Woods et al. (2017) and Milne et al. (2020).
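As an illustration of a stimulus whose percept depends on the interaural phase relationship, the sketch below generates a stereo tone with a chosen interaural phase difference. It is loosely in the spirit of antiphase-tone checks and does not reproduce the exact procedure or parameters of either cited study; the 200-Hz frequency, 0.5-s duration, 44.1-kHz sampling rate, and function names are assumptions. Summing the two channels is used only as a crude stand-in for the acoustic mixing that occurs when the stimulus is played over loudspeakers.

```python
import math

def stereo_tone(freq_hz, dur_s, fs, interaural_phase_rad=0.0, amplitude=0.5):
    """Return (left, right) sample lists for a tone with an interaural phase shift."""
    n = int(round(dur_s * fs))
    left, right = [], []
    for i in range(n):
        phase = 2.0 * math.pi * freq_hz * i / fs
        left.append(amplitude * math.sin(phase))
        right.append(amplitude * math.sin(phase + interaural_phase_rad))
    return left, right

def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mono_mix_rms(left, right):
    """RMS of the summed channels (crude model of loudspeaker mixing)."""
    return rms([l + r for l, r in zip(left, right)])

in_phase = stereo_tone(200.0, 0.5, 44100)
anti_phase = stereo_tone(200.0, 0.5, 44100, interaural_phase_rad=math.pi)

# Each ear receives the same level in both conditions, but the antiphase
# tone cancels when the two channels are mixed.
mix_in = mono_mix_rms(*in_phase)
mix_anti = mono_mix_rms(*anti_phase)
```

Over headphones the two tones have equal level at each ear, whereas over loudspeakers the antiphase tone is attenuated, so a task that depends on this difference can indicate whether headphones are in use.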

Visual Stimuli Calibration

Dimensions of calibration

* Color

* Luminance

* Contrast

* Spatial variation

Kollbaum, P. S., Jansen, M. E., Kollbaum, E. J., & Bullimore, M. A. (2014). Validation of an iPad test of letter contrast sensitivity. Optometry and Vision Science, 91(3), 291-296.

Dorr, M., Lesmes, L. A., Lu, Z. L., & Bex, P. J. (2013). Rapid and reliable assessment of the contrast sensitivity function on an iPad. Investigative Ophthalmology & Visual Science, 54(12), 7266-7273.

de Fez, D., Luque, M. J., García-Domene, M. C., Camps, V., & Piñero, D. (2016). Colorimetric characterization of mobile devices for vision applications. Optometry and Vision Science, 93(1), 85-93.

Dorr, M., Lesmes, L. A., Elze, T., Wang, H., Lu, Z. L., & Bex, P. J. (2017). Evaluation of the precision of contrast sensitivity function assessment on a tablet device. Scientific Reports, 7, 46706.

Ozgur, O. K., Emborgo, T. S., Vieyra, M. B., Huselid, R. F., & Banik, R. (2018). Validity and acceptance of color vision testing on smartphones. Journal of Neuro-Ophthalmology, 38(1), 13-16.

Audiovisual Stimuli Calibration

In many situations, speech stimuli are presented in both the auditory and visual modalities, and the subjects' ability to recognize speech is assessed. The synchronization between auditory and visual speech cues can influence speech understanding; therefore, it is crucial to ensure the synchronization between the audio and video displays. In most cases, AV speech is stored in a compressed file in order to constrain the file size. A compressed video file consists of both audio and video signals compressed using separate codecs.

During an experiment, when presenting a compressed video file, the hardware on the subject's end needs to decode both the audio and video portions of the file, which may cause unintended asynchronies between the audio and video displays. Because the amount of AV asynchrony at the subject's end is often unpredictable, remote experiments involving AV speech stimuli would need to ship pre-calibrated systems (e.g., tablet + headphones) to the subjects. The calibration procedure would consist of three steps: (1) audio stimuli calibration, (2) visual stimuli calibration, and (3) a test of AV synchronization. The first two steps can be conducted following the procedures described in the preceding sections. Here, an example procedure for measuring AV synchronization is described.

Measuring AV synchronization: If possible, the stimulus files should be stored locally on the mobile testing system. The system applications running in the background should be kept to a minimum. Then, a calibration stimulus is presented via the same software environment as in the experiment. The calibration stimulus is an AV file with specifications identical to those of the experimental stimuli (i.e., the same audio and video codecs, the same screen size, etc.). The calibration stimulus is generated so that the audio and video components share common onsets. For example, the stimulus may be periodic audio clicks and video flashes with the same onset times. During the presentation of the calibration stimulus, a passive photo sensor is placed on the screen, and the output from the photo sensor and the output from the sound card are fed into the two channels of an oscilloscope. The asynchrony between the audio and video outputs in milliseconds can then be measured. The above procedure should be repeated several times to check the consistency of the AV asynchrony. As an alternative to the clicks and flashes, the amplitudes of the audio and video signals can be defined analytically by the same sine function. The AV synchrony can then be verified by viewing the audio and video outputs for this calibration stimulus using the XY display mode of the oscilloscope.
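If the two oscilloscope traces are exported as sampled data, the click/flash asynchrony can be estimated by simple onset detection. The sketch below is one possible way to do this, not a prescribed method: the function names, the thresholds, the 50-ms minimum gap between events, and the synthetic traces are all assumptions made for illustration.

```python
import statistics

def onset_times(trace, fs, threshold, min_gap_s=0.05):
    """Times (s) at which |trace| rises above threshold after a quiet gap."""
    onsets = []
    last_active = None  # time of the most recent supra-threshold sample
    for i, value in enumerate(trace):
        if abs(value) >= threshold:
            t = i / fs
            if last_active is None or t - last_active >= min_gap_s:
                onsets.append(t)
            last_active = t
    return onsets

def av_asynchrony_ms(audio_trace, photo_trace, fs, audio_thr, photo_thr):
    """Per-event asynchrony in ms; positive values mean the audio lags the video."""
    audio_on = onset_times(audio_trace, fs, audio_thr)
    video_on = onset_times(photo_trace, fs, photo_thr)
    if len(audio_on) != len(video_on):
        raise ValueError("different numbers of audio and video events detected")
    return [1000.0 * (a - v) for a, v in zip(audio_on, video_on)]

# Synthetic traces sampled at 10 kHz: three 20-ms flashes, each followed
# 40 ms later by a 1-ms click (a made-up asynchrony for demonstration).
fs = 10000
photo = [0.0] * 7000
audio = [0.0] * 7000
for start in (1000, 3000, 5000):
    for i in range(start, start + 200):
        photo[i] = 1.0
    for i in range(start + 400, start + 410):
        audio[i] = 1.0

delays = av_asynchrony_ms(audio, photo, fs, audio_thr=0.5, photo_thr=0.5)
mean_delay = statistics.mean(delays)
spread = statistics.stdev(delays)
```

The mean gives the asynchrony to compensate, and the standard deviation across events gives a sense of its consistency, in line with the recommendation to repeat the measurement.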

Synchronizing AV stimuli: Once the average asynchrony (in milliseconds) is measured, the stimulus files can be modified to compensate for the asynchrony. This means delaying the audio component if the audio output is leading the video, or delaying the video component if the video output is leading the audio. This can be done in video editing software (e.g., Final Cut or Adobe Premiere). First, import the original stimulus file into the video editing software. Then, apply a delay according to the measured asynchrony to the appropriate stream. Finally, export a new stimulus file using the original codecs. This compensation procedure can also be applied to the calibration stimulus. When the synchrony measurement is repeated using the modified calibration stimulus, the average asynchrony should be near 0 ms. When the sine function is used as the calibration stimulus, then after compensation the audio and video outputs, when viewed using the XY mode of the oscilloscope, should form a diagonal line, rather than an ellipse or circle, indicating that the two outputs are in phase.
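The relationship between a residual asynchrony and the shape of the XY figure can be sketched with the standard Lissajous relation for two equal-frequency sinusoids: the trace height at the center of the figure, divided by the peak height, equals the magnitude of the sine of the phase difference. The function names and the 1-Hz modulation frequency and 40-ms delay below are hypothetical, and the sketch assumes that both the photo sensor and the audio output respond linearly.

```python
import math

def phase_shift_rad(delay_ms, sine_freq_hz):
    """Phase difference between the two traces produced by a given AV delay."""
    return 2.0 * math.pi * sine_freq_hz * delay_ms / 1000.0

def lissajous_opening(delay_ms, sine_freq_hz):
    """Relative opening of the XY figure: |y| at x = 0 divided by the peak |y|."""
    return abs(math.sin(phase_shift_rad(delay_ms, sine_freq_hz)))

def smallest_delay_ms(opening, sine_freq_hz):
    """Smallest delay magnitude consistent with a measured relative opening."""
    return 1000.0 * math.asin(opening) / (2.0 * math.pi * sine_freq_hz)

before = lissajous_opening(40.0, 1.0)  # ellipse: hypothetical 40-ms asynchrony
after = lissajous_opening(0.0, 1.0)    # line: asynchrony fully compensated
recovered = smallest_delay_ms(before, 1.0)
```

The opening is also zero when the delay equals a multiple of half the sine period, so the period of the calibration sine should be much longer than the largest plausible asynchrony; the opening alone does not indicate which stream is leading.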

Response Calibration (see also Response)

An advantage of most survey-type responses used in remote testing (buttons, multiple choice, rating scale) is that each response should be interpretable in an absolute sense, requiring no calibration to determine its value. That is, clicking "Yes" has the same meaning in every session and for every participant. Some response data, however, require additional calibration.

Calibration of participant response scale refers to the calibration of response scales or ranges across sessions and/or participants. This can be accomplished by instruction (e.g., using a labeled Likert scale or identifying endpoints such as "Inaudible" and "Painfully Loud"). Similar considerations would seem to apply to both in-lab and remote testing, although the need for clear instruction may be more acute in remote testing.

Calibration of response hardware may be desired or necessary for some types of on-device sensors (touch displays, tilt/force sensors) and external hardware (pointing devices, physiological measurements).

Psychophysical calibration optional: In some cases, a rough calibration can be assumed with confidence; for example, the X and Y coordinates of a tablet-based touch response should be presented in standard units (pixels, perhaps) with only small offset away from the actual finger location. In such cases, a psychophysical procedure may be included to verify the calibration. For example, at the start of testing, the participant could be asked to touch a series of target locations indicated visually on the display. Offset from expected values can be measured to confirm or adjust calibration, although any measured offset will incorporate contributions of the response hardware and response biases of the participant (e.g. reaching with the right hand introducing a rightward touch bias).

Psychophysical calibration required: In other cases, a calibration step will be necessary to interpret the response values at all. For example, a head-mounted display could be used to measure head orientation in a pointing task. The angle reported by the device is relative to its position when initially set up, and may vary considerably across sessions and participants. A calibration task can be used to determine and correct for the values reported for known and repeatable target directions (e.g. straight ahead). As noted above, measured offsets will include both hardware and participant contributions. Access to hardware-based calibration (e.g. a physical target) may help to isolate these contributions but may be difficult to implement in remote-testing scenarios.

Hardware calibration required: Note that some devices may require more detailed calibration to a physical standard in order to provide reliable data (generally, absolute measures). Such cases are likely to be highly dependent on the specific device and calibrator, and may be difficult to implement in remote-testing scenarios.
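The touch-target verification described above (under "Psychophysical calibration optional") amounts to estimating a mean offset between the indicated targets and the recorded touches. The following is a minimal sketch under the assumption that a constant offset is an adequate model; the function names and pixel coordinates are hypothetical, and the estimated offset combines hardware error with participant bias, as noted in the text.

```python
import statistics

def mean_touch_offset(targets, touches):
    """Mean (dx, dy) of recorded touches relative to their targets, in pixels."""
    dx = [touch[0] - target[0] for target, touch in zip(targets, touches)]
    dy = [touch[1] - target[1] for target, touch in zip(targets, touches)]
    return statistics.mean(dx), statistics.mean(dy)

def correct_touch(touch, offset):
    """Remove the estimated constant offset from a later touch response."""
    return touch[0] - offset[0], touch[1] - offset[1]

# Hypothetical calibration trial: four visual targets and the touches recorded.
targets = [(100.0, 100.0), (500.0, 100.0), (100.0, 400.0), (500.0, 400.0)]
touches = [(104.0, 98.0), (505.0, 99.0), (103.0, 397.0), (504.0, 398.0)]

offset = mean_touch_offset(targets, touches)      # (4.0, -2.0)
corrected = correct_touch((254.0, 248.0), offset)  # (250.0, 250.0)
```

The same pattern could be applied to an angular response, such as a head-orientation reading taken while the participant faces a known "straight ahead" target, with the measured value subtracted from later readings.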