Issues in Data Analysis

Elevated random error

In the lab, we typically have a great deal of control over the testing environment. Subjects are usually seated comfortably in a sound booth with minimal auditory and visual distractions. The hardware in the booth is usually limited to what is necessary for subjects to respond to the task and is positioned to support a good testing experience. The experimenter is typically present, with ample opportunity to provide task-related instructions, and checks in frequently to make sure that subjects remain engaged in the task. This can be critical when testing special populations such as young children and older adults (see Special Population). Remote testing platforms outside the traditional lab setting are unlikely to provide this level of control, inevitably introducing greater random error, or “noise,” into the behavioral data. Below is a list of factors that may lead to greater variability in behavioral data.

* Environmental factors: Subjects are likely to be in an environment with more auditory/visual distractions and higher ambient noise than a lab sound booth. Ambient noise may also fluctuate over the course of the experiment.

* Device factors: The consumer electronics (e.g., computers, headphones) that subjects have access to may introduce variability in stimulus quality, such as emphasis/de-emphasis of spectral regions in the output frequency response. Noise-canceling headphones may in fact introduce an elevated noise floor when worn improperly.

* Subject factors: In remote testing, subjects are more likely to run the experiment at a time of their choosing, which may mean late evenings or other times outside typical business hours. Their attentional state at these times may also affect how they respond to psychoacoustic tasks in ways that differ from typical in-lab testing. In the absence of a task proctor, participants have limited opportunity to ask questions about the task instructions; for child and elderly participants, this can be an important part of behavioral testing.

Note that even though these factors may seem mainly to increase variability between subjects, they are also likely to affect within-subject performance because of the highly dynamic environments in which remote testing occurs. What we do not know is whether, and to what degree, these factors contribute to the difference (random error) between data collected in remote testing environments and data collected in the lab. The elevated error may influence behavioral outcomes in the following categories:

* Test-retest reliability, particularly important for experimental protocols aimed at indicating diagnostic and training outcomes

* Baseline performance (e.g., threshold for speech in noise recognition)

* Effect size of experimental manipulation (e.g., amount of speech-in-noise masking due to different types of noise masker)

The following illustration provides a simple example of how remote data collection may influence behavioral outcomes as compared to data collected in the lab environment.
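A minimal simulation sketch along these lines, with all parameter values hypothetical, might look as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_thresholds(n_subjects, true_effect_db, within_sd_db):
    """Simulate one two-condition threshold experiment (values in dB)."""
    baseline = rng.normal(0.0, 2.0, n_subjects)                        # between-subject spread
    cond_a = baseline + rng.normal(0.0, within_sd_db, n_subjects)      # reference condition
    cond_b = baseline + true_effect_db + rng.normal(0.0, within_sd_db, n_subjects)
    diff = cond_b - cond_a                                             # per-subject effect
    return diff.mean(), diff.std(ddof=1)

# Hypothetical numbers: a 3-dB masking effect, measured in the lab vs. remotely.
for label, noise_sd in [("in-lab", 1.0), ("remote", 3.0)]:
    mean_eff, sd_eff = simulate_thresholds(20, 3.0, noise_sd)
    print(f"{label:7s}: mean effect = {mean_eff:4.1f} dB, SD = {sd_eff:4.1f} dB, "
          f"standardized effect ≈ {mean_eff / sd_eff:.2f}")
```

With the same underlying 3-dB effect, the larger within-subject noise term in the “remote” case yields a smaller standardized effect size, and hence lower power for a fixed sample size.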

The overview below lists the general types of studies that are progressively more susceptible to the variability introduced by remote platforms, along with the factors likely to influence behavioral outcomes and potential solutions.

Non-threshold studies (e.g., talker discrimination)

Factors influencing outcomes:
  • More distractions (e.g., visual, auditory)
  • Potential disconnection of the audio device during the task

Potential solutions:
  • Provide examples of a good home environment (quiet, few distractions) in the participant instructions
  • Recommend that participants use audio devices with consistent connectivity (wired headphones/earphones and loudspeakers)
  • Implement a perceptual screening (e.g., proper headphone/earphone wear, stimulus presentation at a comfortable level)

Studies measuring pitch-based absolute thresholds (e.g., pitch discrimination)

Factors influencing outcomes:
  • All of the above, plus:
  • Audio device characteristics (e.g., narrower output frequency range, fluctuating frequency responses)

Potential solutions:
  • For calibrated devices: consider compensating for the audio device's frequency response (including all hardware in the output chain)
  • For browser-based implementations: consider the upper bound of the output frequency response of most consumer electronics

Studies measuring loudness-based absolute thresholds (e.g., loudness threshold)

Factors influencing outcomes:
  • All of the above, plus:
  • Elevated ambient noise (very likely to lead to generally elevated thresholds)
  • Lack of audio output level measures (e.g., SPL) unless the device is calibrated

Potential solutions:
  • All of the above, plus:
  • When the absolute SPL cannot be obtained, consider whether sensation level is acceptable for the study design

Studies measuring relative thresholds (e.g., speech in noise)

Factors influencing outcomes:
  • All of the above, plus:
  • Absolute thresholds may have a smaller but still substantial impact on relative thresholds
  • Smaller dynamic range of the audio device, which may lead to clipping more easily

Potential solutions:
  • All of the above, plus:
  • On the software side, consider implementing additional safety measures to ensure sounds do not exceed a maximum output level
  • Consider adding steps to the perceptual screening to check for the “most comfortable level” as another upper bound to guide the definition of the maximum output level

Studies involving binaural hearing (e.g., interaural level difference discrimination)

Factors influencing outcomes:
  • All of the above, plus:
  • Additional audio device characteristics:
    • Accurate delivery of very short delays may require specialized synthesis that may not be verifiable on all platforms. Inserting blank samples, for example, can only generate delays corresponding to multiples of the audio sampling period (e.g., delays of 23, 45, 68, … µs at 44.1 kHz; see the sketch following this overview) and should be avoided.
    • Left/right channel imbalance (e.g., level, spectral, or even user error)

Potential solutions:
  • All of the above, plus:
  • Additional perceptual screening to ensure:
    • Proper headphone/earphone wear for the left/right channels
    • A left/right balance check that does not exceed a maximum output discrepancy; use this information either to reject participation or to compensate for the left/right imbalance
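To make the delay-quantization point concrete, here is a minimal sketch (assuming a naive whole-sample shift, not any particular experiment platform) of the interaural delays reachable at common sampling rates:

```python
import numpy as np

def achievable_itds_us(sample_rate_hz, max_samples=5):
    """Interaural delays (in µs) reachable by whole-sample shifts only."""
    period_us = 1e6 / sample_rate_hz          # duration of one audio sample
    return period_us * np.arange(1, max_samples + 1)

print(achievable_itds_us(44_100))   # ≈ [22.7, 45.4, 68.0, 90.7, 113.4] µs
print(achievable_itds_us(48_000))   # ≈ [20.8, 41.7, 62.5, 83.3, 104.2] µs
```

Delays between these values require interpolation or frequency-domain phase shifts, which is why the whole-sample approach is discouraged above.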

Validation studies

Ideally, the effects of remote testing on baseline performance, effect size, and random error would be known (or at least measurable) for every study question. In general, however, that information is not available. Alternative approaches can be used to estimate these effects, and the inclusion of one or more validation conditions is one of the few easily identifiable best practices for remote testing.

Replication using in-lab methods. One option is the inclusion of in-lab testing for a subset of conditions or participants. Where appropriate, for example, lab personnel might collect data on themselves in both remote and in-lab settings, or in-lab data might already exist from pilot stages of the project. Directly comparing performance across settings can provide some reassurance of the validity of remotely collected data.

Remote replication of in-lab tests. If an existing in-lab dataset closely related to the study procedures can be obtained, replication via remote testing can provide additional assurance that remote methods are capable of producing results comparable to in-lab approaches. For example, a condition from a prior in-lab study can be included to estimate baseline performance across test settings. While this approach might not be capable of detecting changes in effect size, it can be used to verify the baseline performance level and its variation.
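A minimal sketch of such a comparison, assuming two small sets of speech-in-noise thresholds (the numbers below are hypothetical) from matched in-lab and remote runs of the same condition:

```python
import numpy as np
from scipy import stats

# Hypothetical speech-in-noise thresholds (dB SNR) for the same condition in two settings.
in_lab = np.array([-6.2, -5.8, -7.1, -6.5, -6.9, -5.4, -6.0, -6.6])
remote = np.array([-5.9, -5.1, -6.8, -6.2, -4.9, -5.6, -6.4, -5.3])

t_stat, p_mean = stats.ttest_ind(in_lab, remote, equal_var=False)  # Welch's t-test on means
w_stat, p_var = stats.levene(in_lab, remote)                       # test of equal variances

print(f"means: in-lab {in_lab.mean():.2f} vs remote {remote.mean():.2f} dB SNR (Welch p = {p_mean:.3f})")
print(f"SDs:   in-lab {in_lab.std(ddof=1):.2f} vs remote {remote.std(ddof=1):.2f} dB (Levene p = {p_var:.3f})")
```

A non-significant difference does not by itself establish equivalence; with small samples, an explicit equivalence test (e.g., two one-sided tests against a pre-specified tolerance) is more informative.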

Analytical approaches

The random error is expected to be higher in datasets collected on remote testing platforms because of the highly dynamic testing environments (see the contributing factors listed in Section 1). While validation studies may provide insight into the types of studies that are generally more resistant to remote testing environments, the considerations below can help ensure robustness of individual remotely collected datasets at the data-analysis stage.

Removal of outlier participants from data analysis. The potentially large degree of random error in remote testing suggests high variance across participants, which will have negative impacts on study power. Some variation might be attributed to participant factors related to the task. Eliminating non-compliant participants from data analysis is often necessary to preserve study power, but such participants often cannot be easily or cleanly identified from the task data alone. Investigators are encouraged to carefully consider potential sources of inter-participant variation and to include additional tests, such as perceptual/cognitive screening, attention checks, and catch trials, within the remote testing protocol. Predefined levels of performance on these additional measures can be used to censor participant data independently of the primary data, improving the power and sensitivity of the study in a balanced and rigorous way.

Bootstrapping and reporting. The (unknown) underlying effect of interest is likely more susceptible to sampling error because of the elevated random error in datasets collected in highly dynamic environments. For modeling psychometric functions, Wichmann & Hill (2001) provide insights on estimating the variability of fitted parameters through bootstrapping in the psignifit toolbox. In essence, bootstrapping provides a range of values for the parameter estimate(s) of any statistical model fitted to a dataset, both for descriptive statistics (e.g., mean, standard deviation) and inferential statistics (e.g., regression estimates, effect size). It uses a Monte Carlo approach to simulate resampling from the dataset with replacement, under the assumption that the original sample represents the population (random sample).
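As a minimal illustration of the resampling idea (plain NumPy, not the psignifit toolbox; the thresholds below are hypothetical), the same loop applies to any fitted parameter by replacing the mean with a model-fitting step run on each resample:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-participant thresholds (dB SNR) from a remote study.
thresholds = np.array([-4.1, -5.3, -2.8, -6.0, -4.7, -3.9, -5.5, -4.4, -3.1, -5.0])

n_boot = 10_000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Resample with replacement, then recompute the statistic of interest.
    resample = rng.choice(thresholds, size=thresholds.size, replace=True)
    boot_means[i] = resample.mean()   # swap in a model-fitting step for other parameters

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {thresholds.mean():.2f} dB, 95% bootstrap CI = [{ci_low:.2f}, {ci_high:.2f}] dB")
```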

Bootstrapping can also provide a sanity check for sample size (Chihara & Hesterberg, 2011). Under the Central Limit Theorem, the sample mean of a sufficiently large random sample will have a normal distribution regardless of the distribution of the population. The distribution of the bootstrapped (re)samples has approximately the same spread and skew as the original random sample. So if the distribution of the bootstrapped parameter estimate(s) is not normal, it is highly likely that the original dataset is under-sampled.
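Continuing the sketch above, one rough way to inspect the shape of the bootstrap distribution is to look at its skewness and kurtosis; the 0.5 cutoff mentioned in the comment is only a heuristic, not a formal criterion:

```python
from scipy import stats

# boot_means comes from the bootstrap sketch above.
print(f"bootstrap skewness        = {stats.skew(boot_means):.3f}")
print(f"bootstrap excess kurtosis = {stats.kurtosis(boot_means):.3f}")
# Values far from 0 (a rough heuristic, e.g., |skewness| well above 0.5)
# suggest the bootstrap distribution is not close to normal and the
# original sample may be too small.
```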