Issues in Data Analysis
Elevated random error
In the lab, we typically have a lot of control over the testing environment. Subjects are usually seated comfortably in a sound booth with minimal auditory and visual distractions. The hardware in the booth is usually limited to what is necessary for subjects to respond to the tasks and is positioned in a way that enhances the overall testing experience. The experimenter is typically present, with many opportunities to provide task-related instructions, and checks in frequently to make sure that subjects remain engaged in the task. This is sometimes critical for testing with special populations such as young children and older adults (see Special Population). Any remote testing platform outside of the traditional lab setting is unlikely to provide such control, inevitably introducing greater random error, or “noise,” into the behavioral data. Below is a list of factors to consider that may lead to more variability in behavioral data.
* Environmental factors: Subjects are likely in an environment with more auditory/visual distractions and higher ambient noise than in the lab sound booth. Ambient noise may also fluctuate over the course of the experiment.
* Device factors: The consumer electronics (e.g., computers, headphones) that subjects have access to may introduce variability in stimulus quality, such as emphasis/de-emphasis of spectral regions in the output frequency response. Noise-canceling headphones may in fact introduce an elevated noise floor when worn improperly.
* Subject factors: For remote testing, subjects are more likely to run the experiment at a time of their own choosing – this may mean late evenings or other times outside of typical business hours. Their attentional state at these times may also influence how they respond to psychoacoustic tasks in ways that differ from typical in-lab testing. In the absence of a task proctor, participants have limited opportunities to raise questions about the task instructions, which may be especially important for child and elderly participants.
Note that even though these factors may appear to increase variability between subjects, they are also likely to affect within-subject performance because of the highly dynamic environments in which remote testing occurs. What we do not know is whether, and to what degree, these factors contribute to the difference in random error between data collected in remote testing environments and data collected in the lab. The elevated error may influence behavioral outcomes in the following categories:
* Test-retest reliability, which is particularly important for experimental protocols intended to indicate diagnostic or training outcomes
* Baseline performance (e.g., threshold for speech in noise recognition)
* Effect size of experimental manipulation (e.g., amount of speech-in-noise masking due to different types of noise masker)
The following illustration provides a simple example of how remote data collection may influence behavioral outcomes as compared to data collected in the lab environment.
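As a hedged, minimal sketch of this point (all values below are hypothetical and purely illustrative), the following Python simulation assumes that remote testing adds extra measurement noise to each subject's thresholds and shows how that added noise can shrink the apparent within-subject effect size of an experimental manipulation:

```python
import numpy as np

rng = np.random.default_rng(0)

n_subjects = 30
true_effect = 2.0  # hypothetical effect of the manipulation, in dB
subject_baseline = rng.normal(0.0, 1.0, n_subjects)  # stable individual differences

def effect_size_with_noise(measurement_noise_sd):
    """Simulate two conditions per subject and return the within-subject Cohen's d."""
    cond_a = subject_baseline + rng.normal(0.0, measurement_noise_sd, n_subjects)
    cond_b = subject_baseline + true_effect + rng.normal(0.0, measurement_noise_sd, n_subjects)
    diff = cond_b - cond_a
    return diff.mean() / diff.std(ddof=1)

print("lab-like noise    (SD = 0.5 dB): d =", round(effect_size_with_noise(0.5), 2))
print("remote-like noise (SD = 2.0 dB): d =", round(effect_size_with_noise(2.0), 2))
```

With more measurement noise per observation, the same underlying effect yields a smaller standardized effect size and, by extension, lower study power.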
The table below lists general types of studies that are progressively more susceptible to the variability introduced by remote platforms, along with the factors likely to influence behavioral outcomes and potential solutions.
| Type of study | Factors influencing outcomes | Potential solutions |
| --- | --- | --- |
| Non-threshold studies (e.g., talker discrimination) | | |
| Studies measuring pitch-based absolute thresholds (e.g., pitch discrimination) | | |
| Studies measuring loudness-based absolute thresholds (e.g., loudness threshold) | | |
| Studies measuring relative thresholds (e.g., speech in noise) | | |
| Studies involving binaural hearing (e.g., interaural level difference discrimination) | | |
Validation studies
Ideally, the effects of remote testing on baseline performance, effect size, and random error would be known (or at least measurable) for every study question. In general, however, that information is not available. Alternative approaches can be used to estimate these effects, and inclusion of one or more validation conditions is one of the few easily identifiable best practices for remote testing.

**Replication using in-lab methods.** One option is to include in-lab testing for a subset of conditions or participants. Where appropriate, for example, lab personnel might collect data on themselves in both remote and in-lab settings, or in-lab data might already exist from pilot stages of the project. Directly comparing performance across settings can provide some reassurance of the validity of remotely collected data.
**Remote replication of in-lab tests.** If an existing in-lab dataset closely related to the study procedures can be obtained, replication via remote testing can provide additional assurance that remote methods produce results comparable to in-lab approaches. For example, a condition from a prior in-lab study can be included to estimate baseline performance across test settings. While this approach might not be capable of detecting changes in effect size, it can be used to verify baseline performance level and variation.
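As a minimal sketch of such a comparison (the threshold values and the choice of tests here are assumptions, not a prescribed procedure), one might compare both the mean and the spread of thresholds obtained in the two settings for a shared condition:

```python
import numpy as np
from scipy import stats

# Hypothetical speech-in-noise thresholds (dB SNR) for a shared condition;
# in practice these would come from the prior in-lab dataset and the remote replication.
in_lab = np.array([-6.1, -5.8, -6.5, -5.2, -6.0, -5.7, -6.3, -5.9])
remote = np.array([-5.4, -6.2, -4.9, -5.6, -6.8, -4.7, -5.1, -6.0])

# Compare baseline performance (means) across settings.
t_stat, p_mean = stats.ttest_ind(in_lab, remote, equal_var=False)

# Compare variability (spread) across settings, which speaks to random error.
w_stat, p_var = stats.levene(in_lab, remote)

print(f"mean in-lab {in_lab.mean():.2f} dB vs remote {remote.mean():.2f} dB, p = {p_mean:.3f}")
print(f"SD   in-lab {in_lab.std(ddof=1):.2f} dB vs remote {remote.std(ddof=1):.2f} dB, p = {p_var:.3f}")
```

Comparable means speak to baseline performance, while comparable spread speaks to the amount of random error introduced by the remote setting.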
Analytical approaches
The random error is expected to be higher in datasets collected on remote testing platforms because of the highly dynamic testing environments (see the contributing factors listed under Elevated random error above). While validation studies may provide insight into the types of studies that are generally more resistant to remote testing environments, the considerations below can help ensure robustness of individual remotely collected datasets through data analysis.

**Removal of outlier participants from data analysis.** The potentially large degree of random error in remote testing suggests high variance across participants, which will have negative impacts on study power. Some of this variation may be attributed to participant factors related to the task. Eliminating non-compliant participants from data analysis is often necessary to preserve study power, but such participants often cannot be easily or cleanly identified from the task data alone. Investigators are encouraged to carefully consider potential sources of inter-participant variation and to include additional tests, such as perceptual/cognitive screening, attention checks, and catch trials, within the remote testing protocol. Predefined levels of performance on these additional measures can be used to censor participant data independently of the primary data, improving the power and sensitivity of the study in a balanced and rigorous way.

**Bootstrapping and reporting.** The (unknown) underlying effect of interest is likely more susceptible to sampling error because of the elevated random error in datasets collected in highly dynamic environments. When modeling psychometric functions, Wichmann & Hill (2001) provide insights on estimating the variability of fitted parameters through bootstrapping in the psignifit toolbox. In essence, bootstrapping provides a range of values for the parameter estimate(s) of any statistical model fitted to a dataset, both for descriptive statistics (e.g., mean, standard deviation) and for inferential statistics (e.g., regression estimates, effect size). It uses a Monte Carlo approach to simulate resampling from the dataset with replacement, under the assumption that the original sample represents the population (random sample).
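As a minimal sketch of the resampling idea (this is not the psignifit implementation, and the threshold values are hypothetical), the example below bootstraps a simple descriptive statistic; the same loop generalizes to any fitted model parameter:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-participant thresholds (dB SNR) from a remotely collected condition.
thresholds = np.array([-4.8, -6.1, -5.5, -3.9, -6.4, -5.0, -4.2, -5.8, -4.6, -5.3])

n_boot = 10_000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Resample with replacement, treating the sample as a stand-in for the population.
    resample = rng.choice(thresholds, size=thresholds.size, replace=True)
    boot_means[i] = resample.mean()

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap mean: {boot_means.mean():.2f} dB")
print(f"95% percentile CI: [{ci_low:.2f}, {ci_high:.2f}] dB")
```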
Bootstrapping can also provide a sanity check on sample size (Chihara & Hesterberg, 2011). Under the Central Limit Theorem, the mean of a sufficiently large random sample will be approximately normally distributed regardless of the distribution of the population. The distribution of the bootstrapped (re)sampled estimates has approximately the same spread and skew as the sampling distribution of the statistic computed from the original random sample. So if the distribution of the bootstrapped parameter estimate(s) is clearly not normal, it is highly likely that the original dataset is under-sampled.
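A minimal sketch of this sanity check, assuming hypothetical thresholds from a small remote sample and using SciPy's skewness and normality tests:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical per-participant thresholds (dB SNR) from a small remote sample.
thresholds = np.array([-4.8, -6.1, -5.5, -3.9, -6.4, -5.0, -4.2, -5.8])

# Bootstrap the sample mean and inspect the shape of its distribution.
boot_means = np.array([
    rng.choice(thresholds, size=thresholds.size, replace=True).mean()
    for _ in range(10_000)
])

skewness = stats.skew(boot_means)
_, p_normal = stats.normaltest(boot_means)  # D'Agostino-Pearson omnibus test

print(f"bootstrap skew = {skewness:.2f}, normality test p = {p_normal:.3f}")
# Pronounced skew or strong non-normality in the bootstrap distribution suggests
# the original sample may be too small to rely on normal-theory inference.
```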