This page presents a concatenated version of the "issues" pages hyperlinked throughout from the Issues page.
Issues and Best Practices
Shifting a laboratory-based study to one that employs a remote-testing approach, or designing a new remote study, requires the investigator to consider numerous issues that could impact the accuracy, reliability, validity, or legality of the work. In this section, we aim to explore the many questions that an investigator might consider and, where possible, identify some of the clearest options. Although a long-term goal of the task force is to identify best practices or guidelines for remote testing, the necessary evidence base (in terms of what works well and what does not) does not yet exist. Thus, for the time being you will find few explicit recommendations here, and investigators are encouraged to carefully consider each issue as it pertains to the specific goals of their own research. Please refer to Examples for additional specific information provided by investigators regarding the successes and failures of individual projects.
Reasons and motivations for remote testing
The need for remote testing predates 2020, but it has been accelerated by the COVID-19 global pandemic. Lab closures or restrictions on in-lab testing due to quarantine or social-distancing recommendations have forced many investigators to decide between suspending progress of human-participant research and turning to alternatives to in-lab data collection such as meta-analyses, use of computational models, and remote testing of human participants.
Access to rare populations
Previous remote testing needs have included testing individuals in rural populations and recruitment of rare populations. Specifically, researchers and hearing healthcare providers may need to reach clinical patients and potential research subjects located in different counties or even states from specialized health care clinics and research laboratories. In these situations, financial stipends are often offered to incentivize patients or research subjects to travel to specialized clinics and laboratories. However, there are many situations where even stipend-supported travel is not practical. Families with young children, caregivers for older adults with mobility and health issues, and individuals with special needs may not be able to visit specialized clinics and laboratories. Additionally, patients and research subjects at a distant location may only be able to make the trip once and would be unable to participate in studies requiring multiple visits or longitudinal components.
While there are obvious benefits in testing rare clinical and rural populations, there are also many conveniences in remote testing for families. Families with multiple children have very busy schedules and are often unable to work around parents’ work schedules, multiple children’s school schedules, and extra-curricular activities. Additionally, many families may be caring for their elderly parents or other dependents with various mobility or other needs. All of these scenarios may cause appointments to be delayed or limit the amount of in-person testing completed per visit, particularly if reaching the testing location involves heavy traffic or poor parking. The ability to test remotely would be convenient for families and caretakers and could potentially be more efficient when several transportation-related barriers are eliminated in the process.
Remote applications of auditory research and select clinical audiology services may begin to alter how people think about their hearing healthcare in the future. Mobile applications of hearing tests have been developed and refined for decades, catering mostly to industrial audiology applications in accordance with the Occupational Safety and Health Administration (OSHA) regulations for hearing conservation programs. However, patient-oriented healthcare for personal use has been expanding in recent years. Unmonitored hearing tests are far from gold-standard audiological testing, but it could be argued that access to mobile pure-tone testing at home or at locations such as pharmacies could help identify those with hearing loss before they are ready to visit an audiology clinic. In terms of patient care, there may be unexplored auditory analogs to white-coat hypertension that researchers and clinicians are unaware of because of the constraints of the current diagnostic hearing testing protocol. However, monitoring the ambient and other noises in a mobile test environment is potentially much more complex than other patient-centered healthcare applications.
Motivations for remote testing of hearing currently vary from continuing research testing outside of the laboratory to remote clinical applications including hearing device fittings and adjustments as well as aural rehabilitation or auditory training plans. During this period of COVID-19, most researchers and clinicians simply hope to continue their research data collection to improve clinical practice and to minimize the disruption of safely diagnosing and treating their patients. Conducting research experiments and treating clinical audiology patients in the comfort of their homes is in everyone’s best interest during a global pandemic. Further, lessons learned from remote testing now may open opportunities for more efficient data collection and more realistic hearing healthcare in the future.
Remote testing can also be used to support data collection in real-world environments, which may be an important issue for some questions of ecological validity. Longitudinal studies or trials of at-home training paradigms could also benefit from robust approaches to remote testing with online or portable platforms. Future developments aimed at enhancing the rigor and robustness of at-home tests could pay significant dividends for a wide variety of studies which might be freed from lab-based constraints and therefore achieve greater validity in terms of the listening environment, sampled population, etc.
Institutional Review Boards (IRBs) are tasked with protecting the rights and welfare of human subjects who participate in research. The procedures used to achieve that goal vary for different IRBs. When preparing a protocol for remote testing, or modifying an existing protocol, you may want to use text from approved IRB protocols as a guide.
When adding remote testing to a protocol, your IRB may want to know whether these procedures will be used for a limited period of time (e.g., during a shelter-in-place order), or if they will be used indefinitely. If in-person testing is temporarily suspended, your IRB may ask that you retain in-person testing procedures in anticipation of resuming those activities in the future.
In addition to considerations related to in-person testing, a protocol that includes remote testing may also need to consider:
* modified procedures for recruiting and obtaining informed consent remotely
* reduced stimulus control and/or data integrity, and resultant changes in the balance of risk vs. benefit
* additional risks of harm to the subject (e.g., sounds that are louder than intended)
* additional risk with respect to loss of confidentiality associated with transferring data from the remote test site
* procedures for providing hardware or verifying hardware already in the subject’s possession
* liability associated with asking a subject to download software onto a personal computer or remotely accessing a subject’s computer
* procedures for subject payment
HIPAA (see also data handling)
The Health Insurance Portability and Accountability Act (HIPAA) is a federal law that sets out standards for protecting a patient’s health information from disclosure without their consent or knowledge. A HIPAA release form gives a researcher permission to access Protected Health Information (PHI). The researcher is responsible for protecting that PHI. Your IRB will want to see procedures for doing that, including the use of secure communication channels and appropriate data handling procedures.
General guidelines for obtaining consent remotely are the same as for in-person research. If the research presents no more than minimal risk of harm to the participant, your IRB may waive the requirement of obtaining documented (signed) informed consent. In these cases, the researcher is often asked to provide the subject with a study information sheet and/or a verbal description of the study, following an IRB-approved script. In some cases, a waiver of informed consent may be granted; this can occur if the risk is no more than minimal and if the research would not be feasible without a waiver.
If documented informed consent is required, procedures for obtaining remote consent may make use of phone, video-chat, or web-based applications. Many institutions have preferred software and procedures in place for secure, confidential communication. In some cases, procedures related to obtaining consent in the context of telemedicine may be appropriate (e.g., Bobb et al., 2016; Welch et al., 2016).
Data and safety monitoring
Collecting data remotely introduces additional security concerns that are often avoided with in-person testing. Encrypting data, deidentifying data, and using HIPAA compliant communication software are all steps that can mitigate risk. Your IRB will want to see that you have a plan in place to ensure data security and integrity. Many institutions have preferred software and procedures for handling remote data collection.
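One piece of the de-identification step mentioned above can be sketched as follows: replacing a participant identifier with a keyed hash before data leave the remote test site. This is a minimal illustration, not an institutional procedure. The function name `pseudonymize`, the example key, and the 12-character code length are all assumptions made for the example, and whether such a scheme satisfies your IRB or HIPAA requirements is a question for your institution.

```python
import hashlib
import hmac


def pseudonymize(participant_id: str, secret_key: bytes) -> str:
    """Return a stable study code derived from a participant identifier.

    The same identifier and key always give the same code, so data from
    several sessions can be linked without storing the identifier itself.
    """
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    # Truncated to 12 hex characters for readability (an arbitrary choice).
    return digest.hexdigest()[:12]


# Hypothetical key for illustration only; a real key would be long, random,
# and stored separately from the data files.
key = b"replace-with-a-long-random-secret"
code = pseudonymize("jane.doe@example.org", key)
print(code)
```

Because the code depends on the secret key, someone who obtains the data files but not the key cannot regenerate the mapping simply by hashing a list of known email addresses.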
Compliance with regional and international law
Laws regulating human subjects research can vary widely across countries and even across states within the US. For example, most states require parental consent for children ≤ 17 years of age to participate in research, but this range is ≤ 18 years of age in Nebraska. In this case, the age of consent is determined by state law associated with the location of the institution carrying out the research, not the location of the subject. In other cases, you may need to consider local regulations as they relate to remote testing. Be sure to obtain approval from your IRB before testing subjects who are outside the US.
Other categories to consider
- Information technology compliance (e.g. screen sharing) – institutional requirements may vary
- Local/institutional issues
- Platforms approved for use by local IT / IRB
Important factors to consider with recruitment include identifying participants that meet the inclusion criteria and communication. An advantage of remote testing is the ability to create new subject pools. As participants are not isolated to a geographical area, there is greater opportunity for diversity in the sample group. How a researcher recruits participants should be guided by factors such as sample size, geographic region, and any additional required details.
Identifying participants may make use of the following platforms and approaches:
* Existing subject pools on lab and institution levels
* Family and friends of lab members
* Via social media, newsletters, and other advertisements
* Patients at a local clinic
Communicating with potential participants occurs on a continuum and can include emailing, talking on the phone, talking via a secure video call, text messaging, and automatic messaging to participants via various platforms.
Email communication provides participants with a written version of the study’s details and allows potential participants time to consider their answers to eligibility screening questions and whether or not they would like to proceed with the consenting process. However, emailing back and forth with participants can be time-consuming. Examples of email templates include an initial recruitment email describing the study, a recruitment email for existing participants in the lab or institution’s subject pool, a follow-up email for interested participants that includes eligibility questions, a confirmation email, an instructions email, a reminder email, a sorry-we-missed-you email if the participant forgets about their scheduled participation time, a post-study payment email, and a thank-you email for the individual’s participation.
Phone or video call communications allow for speedier delivery of information and a real-time opportunity for participants to ask questions and get an appointment scheduled. If you elect to communicate with participants via telephone, consider using an anonymous, encrypted, or platform-based phone number to avoid distributing your personal phone number to participants.
Software programs can assist in providing automatic and scheduled communications with potential participants. For example, Gorilla can provide automatic replies to participants. Scheduling platforms and calendars such as TimeTap and Outlook Calendar allow the researcher to input curated reminder emails to be sent automatically to participants at specified times. If Outlook Calendar is used, keep in mind that the Outlook account needs to be open for that reminder to send.
The use of Institutional Review Board-approved templates for these communications, particularly in the initial recruitment stage, is suggested to ensure that all participants receive the same fair, straightforward description of the study.
Consenting can be performed in person, over the phone, or using a video platform. There may be institutional considerations based on each Institutional Review Board regarding which are preferred and allowed.
Video platforms commonly used for consenting include:
Electronic consent (henceforth referred to as e-consent) platforms collect the participant’s signature remotely. This eliminates the need to print consent forms, obtain in-person signatures, and email forms back and forth. These platforms can house each form needed, including the consent form, HIPAA form, assent form, intake form, and demographics form.
Obtaining electronic consent may be facilitated by the following platforms:
* G Suite
An additional option for consent is a checkbox or statement indicating that, by clicking the “next” button, the participant consents and agrees to the terms stated on the page. For this option, an important consideration is how much documentation is required for consent.
Signatures can be collected by:
* Emailing forms and asking for scanned copy with a signature
* Typed signatures
* Trackpad signature through e-consent platforms
As with in-person testing, there are special considerations for pediatric testing. Assent forms can be signed using the same platforms as forms for adult patients. Children may not always be visible on the webcam during consenting, which makes it difficult for a researcher to identify body language associated with a child withdrawing their assent. Should the intake process be lengthy, the child may need a break in order to refocus. The consenting session should also occur at a time when there are minimal distractions in the room.
Instructions for the study can be provided in many ways. These can be typed up in a document and emailed to participants, laminated and included in a remote testing kit, or mailed to participants. Verbal instructions via a phone or video call can also supplement written instructions.
Accessibility is critical when providing instructions so that participants fully understand the task in which they are partaking. Consider writing instructions at a 2nd grade reading level so potential participants, regardless of literacy level, can clearly understand what the study involves. Videos may be useful as a supplement for demonstration purposes. Ensure the text on the instruction documents is of a large enough font size for individuals to read easily.
Consider using both text and visuals in instructions. This may involve listing step-by-step instructions along with screenshots of the programs used and/or diagrams of the study’s setup. For children, cartoons may be helpful in conveying the main points. Instructions should also include reminders about how long each component is expected to take, any environmental modifications the participant may need to make (e.g., avoiding the washing machine running if the environment needs to be quiet), and contact information if the participant has questions.
There are many professional considerations on the researcher’s end when engaging in remote consenting, including the modality of the consent process, confidentiality, whether or not a witnessing signature is required, ways to reach potential participants who do not have access to Internet or technological resources, and whether or not consent forms should be sent to the potential participant for review prior to the consenting call.
Confidentiality may be protected with the following procedures:
* Consent while alone in a closed room
* Make sure the background is quiet
* Wear headphones
* Assure the potential participant that you are the only one seeing and hearing him/her
* Using a blank wall as a background
* Blurring the background or adding a custom background
* Positioning yourself with windows or lighting in front of you
* Asking ahead of time if the potential participant would benefit from captions
Delivery of Materials (If Applicable)
Take-home materials can be distributed in a variety of ways. Materials can be shipped from the lab or an online ordering system such as Amazon, dropped off at the participant’s doorstep by research staff, or sent electronically (such as by downloading computer software).
Shipment can be useful for getting materials to participants who are not in close proximity to your lab.
Distribution of testing materials by lab staff can be useful for reaching participants in close proximity. Research staff and the participant can arrange a drop-off time in advance, allowing the participant a period of time (e.g., 24 hours) to complete the experiment before contacting the research staff for pickup of materials.
Sanitation and social distancing may require you to:
* Wear a face mask and gloves when exchanging materials in person
* Exchange materials by arranging with the subject to leave (or retrieve) items from an accessible location (e.g., outside of their door)
* Include sanitation products with the take-home materials
* Following material use, sanitize all surfaces of the materials with sterile alcohol prep pads and set aside to dry before use
In remote testing, the researcher is often not present during the session. If participants have questions or difficulties with consent, intake forms, or accessing the materials for the study, they will need a point of contact within the lab. On-call troubleshooting assistance can be available via email, video call, phone call, or text, depending on the institution. Video calls allow the participant to share their screen with the researcher, which can help resolve problems rapidly. While these are viable options during business hours, it may be necessary for a lab member to be available after hours depending on when the participant completes the study.
Challenges unique to remote testing may include internal factors related directly to the participant. Considerations are participant compliance with instructions, familiarity level with the technology, strength of the WiFi signal, and additional considerations for pediatrics. More information on this subject can be found in (link section of Wiki).
Type of payment and documentation of time should be considered when determining reimbursement of participants. Reimbursement methods for remote testing can be similar to those used for in-person lab testing, or they can be entirely electronic. Most electronic payment methods are simple to use and notify the researcher when the payment has been received by the participant.
Gift card (e.g., Visa gift cards, other electronic gift cards)
This form of payment can restrict the participant to using their earned funds solely at one business. Visa gift cards remove this limitation but cost more than the amount disbursed to participants.
Not all participants may have access to the payment method. Requiring a participant to create an account with a third-party service may not be the most user-friendly option.
This method may be restrictive due to institutional policy as they require a great amount of personal information prior to sending.
Each participant is entered following completion and a name is selected. Could be useful when time spent on the study cannot be verified.
Earphones (see also Resources.Earphones )
Commercially available headphones will vary in their frequency response and ability to isolate external sounds. For example, Apple EarPods have a somewhat poor ability to reproduce frequencies below 100 Hz and above 10 kHz, but within this range the reproduction accuracy is fairly good. If particularly low or high frequency responses are needed, then the experimenter should consider shipping headphones known to have good frequency responses in the desired range to participants. If perfect frequency reproduction is not particularly important for the experiment, then typical commercially available headphones may be sufficient. Additionally, earbuds tend to be worse at isolating environmental noise, so unless participants are guaranteed to be in a quiet environment, in-ear headphones or closed-back headphones may be better to ensure the experimental sounds are clearly audible relative to the participant’s environment. Finally, wireless Bluetooth headphones may receive interference from other Bluetooth devices in the area and, as a result, may lose segments of the auditory signal. This is undesirable, so participants should use wired headphones whenever possible.
Loudspeakers vary more in their ability to faithfully recreate stimuli and also interact with the acoustic characteristics of the listening environment. Generally speaking, small speakers with a single driver will not be able to recreate the full spectrum of sounds. In particular, laptop loudspeakers often suffer from poor quality low frequency reproduction, and as such should be considered with caution for most experiments. Encourage participants to avoid listening to loudspeakers in rooms that have bare walls and floors, as the reverberation from the room may interfere with hearing the stimuli. If the participant is expected to be a certain distance from the loudspeaker to control for level or speaker characteristics one option is to send something like a yoga or floor mat along with the speaker to show exactly where the participant should sit and where the speaker stand should be set up.
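To give a sense of why listening distance matters for level, the sketch below applies the free-field inverse-square law, under which level falls by about 6 dB for each doubling of distance from a small source. This is an idealization: in a real room, reflections reduce the change with distance, so the numbers should be read as a rough upper bound. The function name and example distances are illustrative assumptions.

```python
import math


def level_change_db(reference_distance_m: float, actual_distance_m: float) -> float:
    """Free-field level change (dB) when the listener sits at actual_distance_m
    instead of the intended reference_distance_m from the loudspeaker."""
    if reference_distance_m <= 0 or actual_distance_m <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(reference_distance_m / actual_distance_m)


# Sitting at 1.5 m instead of the intended 1.0 m lowers the level by roughly 3.5 dB.
print(round(level_change_db(1.0, 1.5), 1))
```

A fixed seating marker such as the floor mat described above keeps this term constant across sessions and participants.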
The only way to precisely control audio levels or calibrate the frequency response of the device is to use a known combination of headphones/loudspeakers and a sound card. If the experimenter wants to provide all of the equipment necessary to complete an experiment, then they can ship a whole computer/tablet with earphones or a loudspeaker to participants (e.g. PART). Alternatively, if the experimenter wants to control the auditory stimulus but allow participants to run research software on their own computer, they can ship an external sound card and earphones to the participant. In this case, the sound card and earphones can be calibrated together by the experimenter so that the level and frequency response of the audio hardware is known and can be precisely controlled in the experiment. The experimenter should provide easy-to-understand instructions for how to connect hardware to the participant’s computer and be available to help troubleshoot any issues the participant has. Relying solely on the participant’s computer is possible, albeit with less precise control of stimuli. One particular issue to watch for is that the standard sound drivers in Windows 10 try to shape the amplitude of sounds to avoid sudden onsets. This means it is possible for Windows to decide to ramp up the audio amplitude of a program during stimulus presentation, which is problematic for short (less than approximately 1 s) sounds, and in some cases may even render short sounds inaudible. In some cases, Windows OS power settings may affect this behavior (see, for example, https://appuals.com/audio-crackling-windows-10/). Consider testing stimulus playback on a variety of computer platforms before running an experiment with participant-supplied hardware to ensure that specific platforms will not interfere with stimulus playback.
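When a sound card and earphones have been calibrated together, presentation level can be set by scaling the digital signal relative to the calibration signal. The sketch below assumes that a reference signal was played at a known digital amplitude through the shipped hardware and measured at `reference_db_spl`; the stimulus is then scaled by 10^((target − reference)/20). The function name, the example levels, and the 85 dB SPL ceiling are hypothetical values chosen for illustration, not recommendations.

```python
def scale_for_target_level(target_db_spl: float,
                           reference_db_spl: float,
                           max_db_spl: float = 85.0) -> float:
    """Amplitude scale factor, relative to the calibration signal, that
    produces target_db_spl on hardware measured at reference_db_spl.

    max_db_spl is a hypothetical safety ceiling; requests above it are refused.
    """
    if target_db_spl > max_db_spl:
        raise ValueError("target level exceeds the safety ceiling")
    return 10.0 ** ((target_db_spl - reference_db_spl) / 20.0)


# If the calibration signal measured 94 dB SPL, presenting at 74 dB SPL
# requires scaling the same signal by a factor of about 0.1.
print(scale_for_target_level(74.0, 94.0))
```

The scale factor is only meaningful if the participant’s system volume and any software gain are fixed at the same settings used during calibration.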
Sound file formats
Sound files can be saved in many formats. For experimental purposes, most labs use wav files, which save the exact signal that will be sent to the audio device. These files are ideal for reproducing stimuli, but take up the most storage and bandwidth. If a large amount of audio needs to be sent through an online server to the participant, you may want to consider some form of compression. The compression format you are likely most familiar with is mp3, which is a format designed to compress sound files by removing information that is unlikely to be important for what most people can hear in music. This compression is lossy, which means that it does not perfectly preserve all of the details of the audio signal. mp3 has largely been superseded by m4a/mp4. As an example, Matlab’s audiowrite command can write mp4 files, but not mp3s. These lossy compression formats are based on assumptions about what music usually sounds like and what people are capable of hearing, so they may create unexpected artifacts when attempting to compress more psychophysical stimuli, such as quiet or bandlimited sounds. An alternative would be to use lossless compression, such as flac. This will save some space, but whether converting audio files to flac from whatever format the lab normally uses is worth the effort is something the experimenter will need to decide.
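As a small illustration of the point that wav files store the exact samples sent to the audio device, the sketch below writes a 16-bit mono tone with Python’s standard `wave` module and reads it back. The function names, file name, tone frequency, and duration are arbitrary choices for the example.

```python
import math
import os
import struct
import tempfile
import wave


def write_tone_wav(path, freq_hz=1000.0, dur_s=0.5, rate=44100, amplitude=0.5):
    """Write a mono 16-bit PCM sine tone and return the integer samples written."""
    n = int(dur_s * rate)
    samples = [int(round(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / rate)))
               for i in range(n)]
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)      # mono
        wf.setsampwidth(2)      # 2 bytes per sample = 16-bit
        wf.setframerate(rate)
        wf.writeframes(struct.pack("<%dh" % n, *samples))
    return samples


def read_wav(path):
    """Read back the samples of a mono 16-bit PCM wav file as a list of ints."""
    with wave.open(path, "rb") as wf:
        n = wf.getnframes()
        return list(struct.unpack("<%dh" % n, wf.readframes(n)))


path = os.path.join(tempfile.gettempdir(), "tone_1khz.wav")
original = write_tone_wav(path)
identical = read_wav(path) == original  # uncompressed PCM round-trips sample for sample
print(identical)
```

Running the same comparison on a file that has passed through a lossy codec would generally not give identical samples, which is one way to check how much a given compression setting alters a stimulus.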
Providing Audio Online
It may be useful to record experimental sessions. On Windows computers, one default audio input device is called ‘Stereo Mix’, which is essentially a recording of any audio the computer is playing. If you have a video or audio call open with a research participant, recording from the Stereo Mix device will record the audio signal from the call. Note that you should have informed consent from the participant to make these recordings. These recordings can be used post hoc to determine whether the background noise in the participant’s test environment is acceptable and to re-analyze verbal responses provided during the experiment. It is also possible to record the audio the participant hears. This can be useful to determine if stimuli are distorted or to check that the participant was hearing the correct stimuli. To do this, have the participant share, through the video or audio call, both the browser in which the experiment is running and their computer audio. That way, the audio stream the participant sends to the experimenter will contain what they were hearing, any environmental noise that their microphone picks up, and the participant’s verbal responses in the same audio stream.
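One simple way to compare background noise across recorded sessions is to compute the RMS level of a silent stretch of the recording. The sketch below assumes the recording has already been loaded as 16-bit integer samples; the function name is illustrative. The result is in dB relative to digital full scale (dBFS), not dB SPL, so it only supports relative comparisons unless the recording chain has been calibrated. Call software may also apply noise suppression or automatic gain, which would change the values.

```python
import math


def rms_dbfs(samples, full_scale=32768.0):
    """RMS level of integer PCM samples in dB relative to digital full scale."""
    if not samples:
        raise ValueError("no samples provided")
    mean_square = sum((s / full_scale) ** 2 for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square)


# A signal alternating between +/- half of full scale sits about 6 dB below full scale.
print(round(rms_dbfs([16384, -16384, 16384, -16384]), 1))
```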
Modern computer monitors are capable of high-quality display, but there is a wide variety of video processing and rendering options at the operating system, video card, and hardware levels. Critical aspects of image rendering (e.g. colors and contrast that differentiate stimuli from one another and from backgrounds, legibility of text, absence of video artifacts) should be checked on the test hardware prior to an experiment. Rendering accuracy can be affected by screen resolution and by anti-aliasing. With low resolutions, stimuli may appear pixelated and will lose some fidelity. Most modern computers default to reasonably good resolutions, but if a participant is using an older computer or their video drivers are misconfigured, it is possible their display will have a low resolution. Anti-aliasing is a rendering technique which smooths transitions between adjacent pixels to make a picture look higher fidelity. This is usually a good thing, but variability in anti-aliasing across displays could alter the clarity of images or the readability of text. Modern screens often include different processing modes that are optimized for games or movies. These modes vary from manufacturer to manufacturer, but usually alter the throughput delay of the screen and may notably affect the color balance of images. If a known, fixed timing between auditory and visual events is desired, as is often the case in studies of audio-visual integration, it may be necessary to use the same calibrated hardware across participants. Additionally, keep in mind that video screens update and refresh at a much slower rate (usually 60 Hz) than audio devices, so there will be some variability in the timing between visual and auditory events across screens.
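The effect of refresh rate on timing can be made concrete with a short calculation. The sketch below assumes an idealized display that can only change its image at refresh boundaries, and it ignores any additional processing delay in the screen; the function name and example values are illustrative. Under that assumption, a visual event requested at an arbitrary time is delayed until the next refresh, by up to one frame (about 16.7 ms at 60 Hz).

```python
import math


def visual_onset_delay_ms(requested_ms: float, refresh_hz: float = 60.0) -> float:
    """Delay (ms) between a requested onset time and the next screen refresh,
    assuming refreshes occur at multiples of the frame duration from time zero."""
    frames = requested_ms * refresh_hz / 1000.0   # requested time in units of frames
    next_frame = math.ceil(frames)                # first refresh at or after the request
    return (next_frame - frames) * 1000.0 / refresh_hz


# An onset requested at 110 ms falls between refreshes on a 60 Hz display,
# so it appears at the next refresh, roughly 6.7 ms later.
print(round(visual_onset_delay_ms(110.0), 1))
```

Audio samples, by contrast, are typically spaced a small fraction of a millisecond apart, which is why the visual side usually dominates audio-visual timing uncertainty.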
Participants should be sitting so they are directly facing the screen. Stimuli should be designed to accommodate some back-and-forth sway of the participant’s head relative to the screen, as they will not be completely still while performing a task. If attending to the visual stimulus is essential to the task, it may be helpful to have the experimenter monitor eye and head position through a video call during the experiment. Have the participant look at a fixation cross at the center of the screen and note the angle of the head and eyes on video, then monitor for obvious deviations from this position during stimulus presentation.
Concerns about screens and placement apply to tablets, with the added concern that participants have more degrees of freedom for orienting their eyes and head relative to the display. Make sure stimuli are clearly visible even on small displays that are held far from the head, and consider what the experimenter should do if a participant accidentally drops the tablet or looks away while adjusting position.
Head mounted displays
Another approach to visual stimulation is the use of commercially available head-mounted displays (HMD) intended for virtual reality (VR), such as Oculus Rift, Quest, HTC Vive, etc. Some advantages of this approach include known and reproducible placement of the display and head (and possibly eye) tracking. Although the field of view is generally much narrower than natural vision, large visual displays can be simulated by tracking the head position and updating the eye-fixed display appropriately. This functionality is built into these devices, and can be exploited using standard 3D game-programming techniques in development platforms such as Unity 3D and Unreal Engine. Calibration of video, tracking, and audio/video sync with HMD is beyond the scope of this article. However, although the capabilities of a specific device type and model should be assessed, modern manufacturing tends to produce units with very similar performance (as is the case for many tablet devices), so it may be reasonable to assume a standard level of performance, and specific calibration of each unit in the field may not be required.
Most head-mounted displays also feature some means to deliver audio stimulation, either through earphones attached to the unit or small HMD-mounted loudspeakers. Where possible, earphones with good passive attenuation and a direct electrical (rather than acoustical) path to the ear are preferred. These may interface directly with the HMD or with a host PC, in which case audio concerns are similar to those discussed above. Bear in mind that software-based "3D audio" features may distort the binaural and spectral features of the audio in an attempt to compensate for head movements and simulate virtual audio sources. In general, commercial 3D-audio algorithms may not be well suited to research purposes, and investigators should consider whether "3D" audio is important to the goals of the study or should be disabled. Convincing (i.e. true) 3D audio can also be achieved using loudspeakers, although the HMD will interfere to some degree with the spatial acoustics, particularly at high frequencies.
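In each case, consider whether the device type will be specified by the experimenter, provided to the participant, or supplied by the participant (bring your own, BYO).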
Many of the above issues can be handled with good precision on known devices. If participants use their own devices, consider adding quick checks at the beginning of the experiment to ensure essential details are visible. It may also help to ask participants if they use any additional video processing (e.g. accessibility options or night display modes) and to disable that processing if it interferes with the experimental stimuli.
Compatibility with clinical devices
Earphones are generally not an option for aided listening. If a participant’s audiogram is known in advance, stimuli presented through earphones can be amplified to improve audibility when listening without hearing aids. The experimenter should take care to check that amplification does not produce uncomfortably loud stimuli. Participants with hearing aids that fit entirely in the ear canal may be able to use their hearing aids while listening through earphones, although this should be tested to check for undesired physical or acoustic interactions between the earphone and the hearing aid. Loudspeaker presentation is an option for aided listening, although the experimenter should take care to ensure that participants are oriented consistently relative to the loudspeaker to avoid confounding differences in hearing aid directionality. Some hearing aids also have streaming capabilities through Bluetooth, which could enable a direct connection from a computer to the hearing aid. It may be helpful to obtain permission from participants to contact their audiologists for information on how the device is programmed, as various settings (noise reduction algorithms, directional microphones, compression) will alter how acoustic stimuli are processed across individuals.
As with hearing aids, headphones do not provide good aided listening for participants with cochlear implants. However, there are published studies that used circumaural headphones to present stimuli (Grantham et al., 2008 and Goupell et al., 2018). Loudspeaker presentation is an option, and some cochlear implants have direct connection audio jacks and/or Bluetooth streaming capabilities. Implants tend to process a narrower frequency range than acoustic hearing, so at-home audio devices (e.g. laptop speakers) may have a sufficient frequency response in the range that the cochlear implant processes, but this should be experimentally verified.
The advice for hearing aids and cochlear implants may generalize to other assistive devices (e.g. bone-anchored hearing aids, auditory brainstem implants), but it is up to the experimenter to determine whether stimuli are being heard as intended. If you have experience with remote testing of specific devices, please share your advice here.
Issues related to collection of response data
Remote testing, as understood by this Task Force, involves the collection of data ("responses") from research participants interacting with a response device such as a paper form, web survey, tablet app, or VR game. This article attempts to describe some considerations that might apply to the collection and processing of response data.
Comparison to in-lab response collection (see also Task Performance)
Assuming that appropriate hardware/software resources can be provided to the participant, the types of response data that can be collected during remote testing do not, in principle, differ from those available during in-lab testing. However, the types of response data that are most easily accessed will depend on the remote-testing platform and the types of tasks the platform is intended to implement.
Thus, a superficial but useful distinction can be made between two major types of tasks:
- survey-based tasks include a series of different question/answer (or stimulus/response) pairs
- trial-based tasks present a repeating series of similar stimulus/response pairs
For example, a typical survey might include a series of questions and question-specific response choices:
- How loud would you consider your workplace? [Scale of 0-100]
- How often do you use hearing protection at work? [Five-point Likert scale, "Never" to "Always"]
- Does your workplace provide earplugs? [Yes / No]
- List the main noise sources present in your workplace: [free response]
A typical trial-based task would present the same type of trial repeatedly to assess a distribution of responses:
Note that either type of task could be administered in the lab or through remote testing. Many web-based platforms, however, are oriented primarily toward survey-based tasks. Trial-based tasks are more often implemented as standalone programs (e.g. MATLAB, PC, or tablet apps). Luckily this is not likely to be an issue for remote testing; most trial-based tasks can be easily reframed as survey-based tasks (by treating each "trial" as a survey "question"), although platforms vary in their support for common trial-based approaches such as randomized presentation order, adaptive presentation (using performance to select the next trial), etc.
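To make "adaptive presentation (using performance to select the next trial)" concrete, the following is a minimal sketch of one common adaptive rule, a two-down one-up staircase. The class name, starting level, and step size are illustrative assumptions and are not taken from any particular platform.

```python
class Staircase:
    """Two-down one-up rule (hypothetical helper for illustration).

    The level decreases after `n_down` consecutive correct responses
    and increases after every incorrect response.
    """

    def __init__(self, start_level, step, n_down=2):
        self.level = start_level
        self.step = step
        self.n_down = n_down
        self.run = 0  # consecutive correct responses so far

    def update(self, correct):
        """Record one response and return the level for the next trial."""
        if correct:
            self.run += 1
            if self.run == self.n_down:
                self.level -= self.step  # make the task harder
                self.run = 0
        else:
            self.level += self.step      # make the task easier
            self.run = 0
        return self.level


# Simulated sequence of responses (True = correct)
track = Staircase(start_level=60, step=2)
levels = [track.update(r) for r in [True, True, True, False, True, True]]
print(levels)
```

On a survey-oriented platform, the same logic would need to run between "questions", which is only possible if the platform allows the next item to depend on earlier responses.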
Another difference between survey- and trial-based tasks has to do with whether individual participants complete the task once (as typical for a survey) or many times (as typical for trial-based tasks). Different considerations may apply to data handling (managing one vs. many data files per participant), counter-balancing conditions across repeated trial-based runs, randomizing question order across survey versions assigned to different participants, etc. Platforms may vary in their suitability for administering a survey task once to each of many (possibly anonymous) participants versus tracking a smaller number of participants across multiple sessions of trial-based tasks.
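As a rough illustration of counter-balancing and per-participant randomization, the sketch below rotates condition order across participants (a cyclic Latin-square style assignment) and shuffles question order reproducibly from a participant identifier. The function names and the seeding scheme are assumptions made for this example only.

```python
import random


def rotated_condition_order(conditions, participant_index):
    """Rotate the condition list so each condition appears in each
    serial position equally often across successive participants."""
    k = participant_index % len(conditions)
    return conditions[k:] + conditions[:k]


def shuffled_question_order(questions, participant_id, study_label="study"):
    """Shuffle questions with a seed derived from the participant ID,
    so the same participant always receives the same order."""
    rng = random.Random(f"{study_label}-{participant_id}")
    order = list(questions)
    rng.shuffle(order)
    return order


conditions = ["quiet", "noise", "reverb"]
for index in range(3):
    print(index, rotated_condition_order(conditions, index))

print(shuffled_question_order(["Q1", "Q2", "Q3", "Q4"], participant_id="P007"))
```

A simple rotation balances serial position but not which condition precedes which; studies sensitive to carry-over effects may need a fully balanced design.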
Types of response data that may be collected during remote testing
Most relevant to the purpose of this article are the types of response data collected during survey- vs trial-based tasks. There is no hard distinction between these, and most (all?) response types (multiple choice, rating scale, head pointing, pupil dilation) could, in principle, be used in either survey- or trial-based tasks. However, certain types of responses are commonly encountered in survey-based tasks, and these are the most commonly supported by many remote-testing platforms.
Note that the availability of specific response devices (buttons, sliders, etc.) and response data types may be limited by platform. For example, Gorilla (see Platforms) supports the following response types:
- Continue Button
- Response Button (optionally featuring Text, Paragraph, or Image content)
- Rating Scale/Likert
- Response Slider
- Keyboard Response (Single or Multi key press)
- Keyboard Space to Continue
- Response Text Entry (Single/Multi line / Area)
Some of these options can provide immediate response processing (e.g. response recorded when button clicked), which may support some degree of timing data or even conditional/adaptive processing. Other response types (e.g. text entry) require a second step, such as "click to continue" to record the response.
Other platforms, particularly non-browser platforms such as PC or tablet apps, may offer a wider range of response types, including:
- Touch Response (one or two dimensions)
- Multi-touch / Gesture Response (e.g. swipe left or right)
- Tilt / Acceleration Measures
- Special Hardware Support
- Camera or Depth Camera
- Tracked Controllers (head-mounted display / VR touch controllers)
- Physiological Sensors (heart rate, GSR, EKG, EEG, pupillometry, eye-tracking, etc.)
Platforms that provide additional support for accessing on-device sensors or hardware controllers include:
- Most PC or custom app frameworks (MATLAB, Windows, Android Studio, Xcode, Unity, etc.)
- Some commercial platforms for physiological measurement, audiology, telehealth
Speech / Audio response collection
In-lab testing of speech perception (for example) often combines open-set responding ("Repeat the word BLUE") with in-person scoring by a human observer. Some platforms allow synchronous interaction between experimenter and remote participant which can support a similar approach. However, low quality audio/video streaming, dropouts, or distortion might disrupt accurate scoring. A few approaches may be used to support open-set data collection:
- Audio or AV recording and storage of responses for later verification
- Potential challenges: defining the response window, storing/transmitting audio/AV data files, ensuring participant privacy.
- Redundant response collection or self-scoring after feedback (e.g. "Say the word, then type it", or "Check box if you were correct")
- Potential challenges: reconciling mismatched responses
Response Calibration (see also Calibration)
Although most survey-type responses should be interpretable in an absolute sense and thus require no calibration to determine value, some continuously variable response data (for example, touch displays, tilt/force sensors) may require psychophysical or hardware calibration. See Calibration for more details.
Audio Stimuli Calibration
Within the scope of psychological and physiological acoustics, the goal of many experiments using remote testing is to collect responses to a set of acoustic stimuli from test subjects at remote sites. It is crucial to calibrate the stimuli so that the obtained responses are not dominated by variations in presentation levels and specific transducers (e.g., earphones, loudspeakers, etc.). The appropriate calibration of the sound pressure level is especially important because (1) the experimenter has to make sure that the stimuli are audible to the subjects; (2) the sound level needs to be kept within a safety limit (see Compliance); and (3) many auditory and speech perception phenomena are level dependent.
Remote testing poses many challenges to the calibration of acoustic stimuli, mostly because the experimenter may not have full knowledge of the sound delivery system and environment at the subject’s end. Therefore, when selecting an appropriate platform for a remote experiment, one of the first considerations would be how critical calibration is to the research question under investigation (see PlatformConsiderations). Some platform-specific information with regard to calibration can be found under HardwareAndCalibration.
General approaches for addressing this problem include:
Approach 0, No formal calibration: Many experiments involving supra-threshold and/or high-level processes may be conducted with limited influence from inexact audio calibration. In such cases, a browser-based experimentation platform (e.g., Gorilla, jsPsych, PsychoPy; see PlatformDescriptions for descriptions of these platforms) may be preferable, because these platforms, although they do not provide precise calibration of audio presentation, can be cost effective and hence enable a relatively large sample size. It should be noted that even for experiments without formal calibration, simple verifications are recommended to ensure the audibility of the stimuli and listening safety.
Approach 1, Electroacoustic calibration: Experiments that involve subjects with hearing impairment or address research questions that are expected to depend on stimulus characteristics such as level and spectrum may consider an experimentation platform that involves sending pre-calibrated systems (e.g., tablets with headphones) to the subjects. Some examples of research platforms in this category include hearX, PART, etc. (see PlatformDescriptions for descriptions of these platforms). Although Approach 1 enables precise control of the test stimuli, the logistics (i.e., teaching subjects to use the equipment, shipping and receiving the equipment, troubleshooting remotely, and answering ongoing questions) may be time-consuming and costly.
Approach 2, Limited calibration involving reports from the subject: Besides the two scenarios described above, there may be situations in which some degree of control over the stimulus characteristics is preferred but not critical. In these cases, browser-based platforms may be used and a moderate degree of stimulus control may be achieved with the involvement of the subject. For example, the subject may be instructed to report the manufacturers and models of the devices used in the experiment. Calibration can then be completed using the known specifications for the devices.
Approach 3, Limited calibration using psychophysical techniques: One specific technique to calibrate the acoustic stimuli with the participation of the subjects is psychophysical calibration or perceptual calibration. This calibration method is especially useful for experiments that recruit healthy, normal-hearing adults as subjects. For this population, normative performance ranges (and psychophysical models) are well-established for many basic auditory tasks. Incorporating some of these basic auditory tasks into the experimental protocol provides a way to probe the fidelity of the stimulus presentation system and the background noise level at the subject’s end. Additionally, there are many binaural phenomena that require the appropriate placement of headphones (binaural beating, binaural masking level difference, binaural pitch, etc.). These tasks may be implemented to check whether headphones are used and correctly placed during the experiment.
Approach 1, Electroacoustic Calibration
For this approach, the experimenter will be responsible for calibrating the hardware systems before sending them to the subjects. In this case, the calibration procedure should follow the relevant ISO/ANSI standards (e.g., ANSI S3.7) and/or the best practice of the field. Some commercially available platforms include a hardware calibration service in their annual subscriptions (e.g., hearX and SHOEBOX).
Calibration setup: A typical setup for level calibration for headphones consists of (1) a coupler (or artificial ear) for the type of transducer used, and (2) a sound level meter with its measurement microphone attached to the coupler. The purpose of the coupler is to simulate the impedance of the ear canal, so that the readings from the sound level meter approximate the sound pressure level that would be measured at the ear drum of a typical subject. Depending on the type of the transducer (e.g., supra-aural earphones, circum-aural earphones, insert earphones, or hearing aids), a corresponding type of coupler should be used. As a rule of thumb, supra- and circum-aural earphones require couplers with a larger volume (~6 cc) while insert earphones and insert-type hearing aid receivers require couplers with a smaller volume (~2 cc).
Calibration stimuli: Most stimulus presentation systems (e.g., headphones connected to sound cards) are designed to be linear. Pure tones at various frequencies are typically used as the calibration stimuli. With the calibration tone presented above the noise floor, the correspondence between the rms amplitude of the digital signal and the measured sound pressure level can be found. In some special circumstances, the stimuli are presented via a device with nonlinear dynamic processing (e.g., digital hearing aids). Such systems provide different amounts of amplification to the stimuli in a frequency- and intensity-dependent fashion. For such applications, the stimuli used for calibration should resemble the level and spectrotemporal characteristics of the test stimuli. For example, for a speech-recognition experiment, the International Speech Test Signal (ISTS), presented at the same level as the test speech in the experiment, may be used as the calibration stimulus.
Device verification by the subject: For an experiment that utilizes multiple experimental devices (e.g., tablet-headphone pairs), each device should be properly identified (e.g., by a device ID) and calibrated separately. During the experiment, it is useful to have the subject report the device ID to ensure the accurate pairing of the calibration data and the device. Even when the testing systems are appropriately calibrated at the site of the experimenter, the stimuli may still be off-calibration due to damage to the device during shipment, improper connection between various components of the device (e.g., the headphone is disconnected from the tablet), the wrong experimental software being run, or improper placement of the headphones. To prevent such incidents, a simple psychophysical verification procedure may be used, which may include verifying whether stimuli presented at suprathreshold levels are indeed audible to the subject and whether the stimuli delivered through the two sides of the headphones are properly balanced and synchronized for binaural experiments.
Beyond level calibration: Besides calibrating the presentation level, other electroacoustic measurements may also be useful and informative to the experimenter. Some common measurements include the dynamic range, frequency response, and crosstalk between channels.
* Dynamic range: The dynamic range is the range of levels within which the test stimuli can be presented without significant distortion. The lower limit of the dynamic range is typically the noise floor, i.e. the output level when no stimulus is presented. It is worth pointing out that the noise floor is not the reading from the sound level meter when the system is powered off. Rather, it is the noise level expected during stimulus presentations. It is recommended that an all-zero array be used as the "calibration stimulus" during the measurement of the noise floor. The upper limit of the dynamic range is the maximum output level without distortion (e.g., clipping). The experimenter should choose the appropriate devices and transducers so that the test stimuli are not too close to either the lower or upper limit of the dynamic range. In other words, an experimental system with a larger dynamic range can accommodate a greater variety of experiments.
* Frequency response: The frequency response is the response of the testing system to a unit input as a function of frequency. An ideal sound delivery system would have a flat frequency response, so that it does not introduce additional coloration to the experimental stimuli. The frequency response is typically measured by analyzing the response to a broadband stimulus. Depending on the specific methodology used, the test stimulus may be a sequence of pure tones, a frequency-varying tone sweep, a Gaussian noise, a Maximum-Length Sequence, or Golay codes.
* Crosstalk between channels: Crosstalk refers to signal leakage from one channel to another. Crosstalk between channels could be problematic for auditory experiments when the stimulus meant for one test ear is audible from the headphone for the opposite ear. Portable testing platforms may be more susceptible to crosstalk because (1) the left and right channels of common portable computers and mobile phones usually have a shared ground return, and (2) low-impedance earphones are typically used with portable devices to ensure sufficiently high output levels. Crosstalk is measured by applying a test signal to one channel, measuring that signal’s level from the other channel, and then expressing the measured level as a ratio (in dB) relative to the source signal. Since crosstalk is often frequency dependent, the measurement is typically conducted as a function of frequency.
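The level-calibration and crosstalk measurements described above reduce to a few decibel computations. The sketch below assumes digital samples scaled to the range -1 to 1, takes an rms of 1.0 as the 0 dB digital reference (an assumed convention), and uses made-up meter readings. All function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def digital_level_db(samples):
    """Digital rms level in dB re an rms of 1.0 (assumed convention)."""
    return 20 * math.log10(rms(samples))


def calibration_offset_db(measured_db_spl, tone_samples):
    """Offset to add to a digital rms level (dB) to obtain dB SPL,
    valid only for the frequency and transducer that were measured."""
    return measured_db_spl - digital_level_db(tone_samples)


def crosstalk_db(driven_channel_db, other_channel_db):
    """Crosstalk as the level in the undriven channel relative to the
    driven channel (more negative means less leakage)."""
    return other_channel_db - driven_channel_db


# A 1 kHz calibration tone, 1 s long, amplitude 0.1, sampled at 48 kHz
fs = 48000
tone = [0.1 * math.sin(2 * math.pi * 1000 * n / fs) for n in range(fs)]

# Suppose the sound level meter on the coupler reads 70 dB SPL (made-up value)
offset = calibration_offset_db(70.0, tone)
print(round(offset, 2))            # digital dB + offset = dB SPL

# Suppose the driven ear measures 80 dB SPL and the other ear 25 dB SPL
print(crosstalk_db(80.0, 25.0))    # -55.0
```

Because the offset is derived from a single tone, it would need to be measured again at each frequency of interest if the frequency response is not flat.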
In-situ or self calibration: For send-home testing systems that also carry built-in microphones, the microphone on the device can be calibrated so that it can be used for in-situ or self calibration. In-situ calibration is required for most experiments that involve free-field stimulus presentation (using loudspeakers). In such cases, the microphone should be positioned near where the subject’s head would be during the experiment. A calibration stimulus is then presented via the loudspeaker(s) and the sound level or frequency response can be derived from the recording of the stimulus using the microphone. Another application of built-in microphones is to measure the noise level (or the power spectrum of the ambient noise) in the subject’s test environment. It should be noted that the raw audio recordings made during the in-situ measurements should not be stored without appropriate IRB approval and the subject’s consent (see Compliance).
Approach 2, Limited calibration involving reports from the subject
For this approach, the subject reports the manufacturer and model of the devices used in the experiment, and the experimenter configures the stimuli based on the available specifications for these devices. For examples of databases of headphone specifications, see Earphones. It should be noted that some of the available headphone specifications are measured without the use of a coupler.
Level calibration based on headphone sensitivity: The most relevant specification for level calibration is the headphones’ sensitivity. It is the sound pressure level that would be produced by the headphone with unit input voltage (i.e. 1 volt) at 1 kHz. Therefore, if the maximum output level of the sound card (the output voltage in dBu at 0 dBFS, where 0 dBu = sqrt(0.6) V, approximately 0.775 V) is also available, the maximum output sound pressure level can be derived and the conversion from dBFS to dB SPL can be achieved. Sometimes, the sensitivity of the headphones is given for unit input power (i.e. 1 milliwatt). In this case, the sound pressure level needs to be calculated based on both the voltage that the headphones receive and the headphones’ impedance. For example, if the sound card has a maximum output level of 4 dBu at 0 dBFS, the sensitivity of the headphones is 102 dB SPL/mW, and the impedance of the headphones is 64 Ohm, then the maximum output voltage is sqrt(0.6)*10^(4/20) = 1.23 V, the maximum output power is (1.23)^2/64 = 0.0235 W = 23.5 mW, and the maximum output sound pressure level (corresponding to 0 dBFS) is 102+10*log10(23.5/1) = 115.7 dB SPL.
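The worked example can be reproduced with a few lines of code. This is a minimal sketch under the same assumptions as the text (a resistive headphone load and a sound card that actually delivers its rated voltage into that load). The function names are hypothetical.

```python
import math

DBU_REF_VOLTS = math.sqrt(0.6)  # 0 dBu, approximately 0.775 V rms


def max_output_voltage(max_level_dbu):
    """Voltage (V rms) at the sound card output for a 0 dBFS signal."""
    return DBU_REF_VOLTS * 10 ** (max_level_dbu / 20)


def max_spl_power_sensitivity(max_level_dbu, sens_db_spl_per_mw, impedance_ohm):
    """Maximum SPL when sensitivity is specified in dB SPL for 1 mW."""
    volts = max_output_voltage(max_level_dbu)
    power_mw = 1000 * volts ** 2 / impedance_ohm
    return sens_db_spl_per_mw + 10 * math.log10(power_mw)


def max_spl_voltage_sensitivity(max_level_dbu, sens_db_spl_per_volt):
    """Maximum SPL when sensitivity is specified in dB SPL for 1 V."""
    volts = max_output_voltage(max_level_dbu)
    return sens_db_spl_per_volt + 20 * math.log10(volts)


def dbfs_to_db_spl(level_dbfs, max_spl):
    """Convert a digital level in dBFS to dB SPL, assuming a linear system."""
    return max_spl + level_dbfs


# Values from the example: 4 dBu at 0 dBFS, 102 dB SPL/mW, 64 Ohm
max_spl = max_spl_power_sensitivity(4.0, 102.0, 64.0)
print(round(max_output_voltage(4.0), 2))        # 1.23
print(round(max_spl, 1))                        # 115.7
print(round(dbfs_to_db_spl(-50.0, max_spl), 1)) # 65.7
```

Estimates of this kind inherit any error in the published specifications, so a safety margin below the computed maximum level is advisable.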
Approach 3, Limited calibration using psychophysical techniques
For this approach, a psychophysical procedure is conducted for the purpose of calibration and system verification.
Loudness-based level adjustment: One of the simplest ways to perform perceptual calibration is to instruct the subject to adjust the volume control of the sound delivery system so that the test stimuli are presented at a most comfortable level. Alternatively, the subject may adjust the level of an anchor stimulus to the most comfortable level, and the presentation levels of the test stimuli are set relative to that of the anchor stimulus. This is a relatively quick procedure, typically taking 2-3 minutes. For pure tones presented in quiet, the expected standard deviation in the listeners’ most comfortable loudness (MCL) levels is typically greater than 10 dB (see Punch et al., 2004 for detailed discussions on the various measurement considerations).
Besides adjusting a stimulus to the most comfortable loudness level, a procedure to measure the loudness growth at 1 kHz may be conducted for calibration purposes. The loudness growth function at 1 kHz, i.e. how the loudness rating grows with the stimulus level, is well-established for normal-hearing adults. The procedure for measuring the loudness growth function has been standardized (ISO 16832). Therefore, it is possible to first measure the loudness growth function using uncalibrated levels and then compare the obtained data with the published normative results to estimate the conversion factor for calibration.
Threshold-based calibration: sensation level: For an experimental system with unknown calibration, absolute thresholds may be measured first for the test stimuli using an uncalibrated arbitrary unit. The test stimuli can then be presented at a desired sensation level (in dB SL). For example, 10 dB SL indicates a level that is 10 dB above the absolute threshold for the stimulus. The advantage of configuring the stimulus level in dB SL is that the audibility of the stimuli is ensured for each individual subject and for their specific sound delivery system. However, there are a few disadvantages associated with this approach that researchers need to consider. First, threshold measurements add testing time. Second, the subject’s testing system and environment may be sub-optimal for threshold measurements, and the measured thresholds may be dominated by masking from electronic or ambient noise. Third, for a sound delivery system with a limited dynamic range, it may be difficult both to conduct threshold measurements and to present stimuli at high sensation levels. For example, consider an experiment with a pure-tone stimulus at 1 kHz and 50 dB SL. Measuring the absolute threshold for the tone requires adjusting the volume control of the system so that the noise floor is reasonably lower than the threshold (the hardware noise should not be audible). If the noise floor is 10 dB below the absolute threshold, then presenting the tone at 50 dB SL requires at least 60 dB of dynamic range. Last, for subjects with more than moderate degrees of hearing loss, setting the stimulus level at a fixed sensation level may lead to very high, unsafe sound pressure levels (if still within the dynamic range of the system) or severe distortions (if the level exceeds the dynamic range).
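The dynamic-range reasoning for sensation-level presentation can be checked before testing with a small helper. In this sketch all levels are in the uncalibrated dB units of the subject's system (for example, dB re digital full scale). The function name and example values are assumptions.

```python
def plan_sensation_level(threshold_db, target_sl_db, noise_floor_db, max_level_db):
    """Check whether a target sensation level fits the system's range.

    threshold_db   : measured absolute threshold (uncalibrated dB)
    target_sl_db   : desired sensation level in dB SL
    noise_floor_db : system noise floor in the same units
    max_level_db   : highest undistorted output level in the same units
    """
    target_db = threshold_db + target_sl_db
    return {
        "target_db": target_db,
        "required_range_db": target_db - noise_floor_db,
        "available_range_db": max_level_db - noise_floor_db,
        "fits": target_db <= max_level_db,
    }


# Threshold at -70 dB, noise floor 10 dB below it, target of 50 dB SL
plan = plan_sensation_level(threshold_db=-70.0, target_sl_db=50.0,
                            noise_floor_db=-80.0, max_level_db=0.0)
print(plan["required_range_db"])  # 60.0, matching the example in the text
print(plan["fits"])               # True
```

Note that this check says nothing about safety in dB SPL, which cannot be assessed from uncalibrated units alone.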
Headphone verification using binaural phenomena: In many experiments, it is crucial to verify that the subject is wearing headphones with the correct orientation during the experiment. This can be achieved by conducting a psychophysical procedure involving an auditory percept (such as binaural pitch) that is strongly dependent on specific interaural phase relationships. Examples of using binaural phenomena to verify headphone connection/placement are Woods et al. (2017) and Milne et al. (2020).
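One simple stimulus whose percept depends on the interaural phase relationship is a tone presented in antiphase across the two channels. Over headphones each ear receives the full-level tone, whereas over loudspeakers the two channels partially cancel in the air. The sketch below only illustrates that idea and does not reproduce any published procedure. The frequency, duration, and amplitude are assumptions, and summing the channels is a deliberately crude stand-in for acoustic mixing.

```python
import math


def stereo_tone(freq_hz, dur_s, fs, amplitude=0.5, antiphase=False):
    """Return a list of (left, right) sample pairs for a pure tone.
    With antiphase=True the right channel is polarity-inverted."""
    n_samples = int(round(dur_s * fs))
    sign = -1.0 if antiphase else 1.0
    frames = []
    for i in range(n_samples):
        s = amplitude * math.sin(2 * math.pi * freq_hz * i / fs)
        frames.append((s, sign * s))
    return frames


def channel_rms(frames, channel):
    """RMS of one channel (0 = left, 1 = right), as heard by one ear."""
    return math.sqrt(sum(f[channel] ** 2 for f in frames) / len(frames))


def summed_rms(frames):
    """RMS of left + right, idealizing perfect mixing over loudspeakers."""
    return math.sqrt(sum((l + r) ** 2 for l, r in frames) / len(frames))


in_phase = stereo_tone(200.0, 0.5, 44100)
anti_phase = stereo_tone(200.0, 0.5, 44100, antiphase=True)

# Per-ear level is identical for both stimuli (headphone listening)
print(round(channel_rms(in_phase, 0), 4), round(channel_rms(anti_phase, 0), 4))
# The idealized loudspeaker sum cancels for the antiphase tone
print(round(summed_rms(in_phase), 4), summed_rms(anti_phase))
```

In a real room the cancellation is only partial and depends on loudspeaker spacing and listener position, so any such check should be validated with the intended population before it is relied upon.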
Visual Stimuli Calibration
Dimensions of calibration
* Spatial variation
Kollbaum, P. S., Jansen, M. E., Kollbaum, E. J., & Bullimore, M. A. (2014). Validation of an iPad test of letter contrast sensitivity. Optometry and Vision Science, 91(3), 291-296.
Dorr, M., Lesmes, L. A., Lu, Z. L., & Bex, P. J. (2013). Rapid and reliable assessment of the contrast sensitivity function on an iPad. Investigative Ophthalmology & Visual Science, 54(12), 7266-7273.
de Fez, D., Luque, M. J., García-Domene, M. C., Camps, V., & Piñero, D. (2016). Colorimetric characterization of mobile devices for vision applications. Optometry and Vision Science, 93(1), 85-93.
Dorr, M., Lesmes, L. A., Elze, T., Wang, H., Lu, Z. L., & Bex, P. J. (2017). Evaluation of the precision of contrast sensitivity function assessment on a tablet device. Scientific Reports, 7, 46706.
Ozgur, O. K., Emborgo, T. S., Vieyra, M. B., Huselid, R. F., & Banik, R. (2018). Validity and acceptance of color vision testing on smartphones. Journal of Neuro-ophthalmology, 38(1), 13-16.
Audiovisual Stimuli Calibration
In many situations, speech stimuli are presented in both the auditory and visual modalities and the subjects’ abilities in recognizing speech are assessed. The synchronization between auditory and visual speech cues can influence speech understanding; therefore, it is crucial to ensure the synchronization between the audio and video displays. In most cases, AV speech is stored in a compressed file in order to constrain the file size. A compressed video file consists of both audio and video signals compressed using separate codecs.
During an experiment, when presenting a compressed video file, the hardware on the subject’s end will need to decode both the audio and video portions of the file, which may cause unintended asynchronies between the audio and video displays. Because the amount of AV asynchrony at the subject’s end is often unpredictable, remote experiments involving AV speech stimuli would need to ship pre-calibrated systems (e.g., tablet + headphones) to the subjects. The calibration procedure would consist of three steps: (1) audio stimuli calibration, (2) visual stimuli calibration, and (3) test of AV synchronization. The first two steps can be conducted following the procedures described in the preceding sections. Here, an example procedure for measuring AV synchronization is described.
Measuring AV synchronization: If possible, the stimulus files should be stored locally on the mobile testing system. The system applications that are running in the background should be kept to a minimum. Then, a calibration stimulus is presented via the same software environment as in the experiment. The calibration stimulus is an AV file with the same specifications as the experimental stimuli (i.e. the same audio and video codecs, the same screen size, etc.). The calibration stimulus is generated so that the audio and video components share common onsets. For example, the stimulus may be periodic audio clicks and video flashes with the same onset times. During the presentation of the calibration stimulus, a passive photo sensor is placed on the screen, and the output from the photo sensor and the output from the sound card are fed into the two channels of an oscilloscope. The asynchrony between the audio and video outputs in milliseconds can then be measured. The above procedure should be repeated several times to check the consistency of the AV asynchrony. As an alternative to the clicks and flashes, the amplitudes of the audio and video signals can be defined analytically by the same sine function. The AV synchrony can then be verified by viewing the audio and video outputs for this calibration stimulus using the XY display mode of the oscilloscope.
Synchronizing AV stimuli: Once the average asynchrony (in milliseconds) is measured, the stimulus files can be modified to compensate for the asynchrony. This means delaying the audio component if the audio output is leading the video, or delaying the video component if the video output is leading the audio. This can be done in video editing software (e.g., Final Cut or Adobe Premiere). First, import the original stimulus file into the video editing software. Then, apply a delay according to the measured asynchrony to the appropriate stream. Finally, export a new stimulus file using the original codecs. This compensation procedure can also be applied to the calibration stimulus. When the synchrony measurement is repeated using the modified calibration stimulus, the average asynchrony should be near 0 ms. When the sine function is used as the calibration stimulus, then after compensation the audio and video outputs, when viewed using the XY mode of the oscilloscope, should form a diagonal line, rather than an ellipse or circle, indicating that the two outputs are in phase.
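The bookkeeping for this compensation step can be sketched as follows. The sign convention (audio onset minus video onset, so that negative values mean the audio leads) and the function name are assumptions chosen for this example. The measurements are made-up values.

```python
from statistics import mean, stdev


def av_compensation(asynchronies_ms):
    """Summarize repeated asynchrony measurements and decide which stream
    to delay. Each value is audio onset minus video onset in ms, so a
    negative value means the audio output leads the video."""
    avg = mean(asynchronies_ms)
    spread = stdev(asynchronies_ms) if len(asynchronies_ms) > 1 else 0.0
    if avg < 0:
        stream = "audio"   # audio leads, so delay the audio
    elif avg > 0:
        stream = "video"   # video leads, so delay the video
    else:
        stream = None      # already synchronized on average
    return {"mean_ms": avg, "sd_ms": spread,
            "delay_stream": stream, "delay_ms": abs(avg)}


# Three repeated oscilloscope readings (made-up values)
result = av_compensation([-42.0, -38.0, -40.0])
print(result)
```

A large standard deviation relative to the mean would suggest that a fixed delay cannot fully correct the asynchrony on that device.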
Response Calibration (see also Response)
An advantage of most survey-type responses used in remote testing (buttons, multiple choice, rating scale) is that each response should be interpretable in an absolute sense, requiring no calibration to determine its value. That is, clicking "Yes" has the same meaning in every session and for every participant. Some response data, however, require additional calibration.
Calibration of participant response scale refers to the calibration of response scales or range across sessions and/or participants. This can be accomplished by instruction (e.g. using a labeled Likert scale, identifying endpoints such as "Inaudible" to "Painfully Loud", etc.). Similar considerations would seem to apply to both in-lab and remote testing, although the need for clear instruction may be more acute in remote testing.
Calibration of response hardware may be desired or necessary for some types of on-device sensors (touch displays, tilt/force sensors) and external hardware (pointing devices, physiological measurements).
- Psychophysical calibration optional: In some cases, a rough calibration can be assumed with confidence; for example, the X and Y coordinates of a tablet-based touch response should be presented in standard units (pixels, perhaps) with only small offset away from the actual finger location. In such cases, a psychophysical procedure may be included to verify the calibration. For example, at the start of testing, the participant could be asked to touch a series of target locations indicated visually on the display. Offset from expected values can be measured to confirm or adjust calibration, although any measured offset will incorporate contributions of the response hardware and response biases of the participant (e.g. reaching with the right hand introducing a rightward touch bias).
- Psychophysical calibration required: In other cases, a calibration step will be necessary to interpret the response values at all. For example, a head-mounted display could be used to measure head orientation in a pointing task. The angle reported by the device is relative to its position when initially set up, and may vary considerably across sessions and participants. A calibration task can be used to determine and correct for the values reported for known and repeatable target directions (e.g. straight ahead). As noted above, measured offsets will include both hardware and participant contributions. Access to hardware-based calibration (e.g. a physical target) may help to isolate these contributions but may be difficult to implement in remote-testing scenarios.
- Hardware calibration required: Note that some devices may require more detailed calibration to a physical standard in order to provide reliable data (generally, absolute measures). Such cases are likely to be highly dependent on the specific device and calibrator, and may be difficult to implement in remote-testing scenarios.
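As a minimal sketch of the optional psychophysical check described above for touch responses, the following estimates the average offset between visually indicated targets and the participant's touches, then subtracts it from later responses. The function names, target positions, and touch coordinates are hypothetical, and, as noted above, the estimated offset mixes hardware error with the participant's own response bias.

```python
from statistics import mean

def touch_offset(targets, touches):
    """Mean (dx, dy) offset in pixels between touched and target locations."""
    dx = mean(t[0] - g[0] for g, t in zip(targets, touches))
    dy = mean(t[1] - g[1] for g, t in zip(targets, touches))
    return dx, dy

def correct(touch, offset):
    """Subtract the estimated offset from a raw touch coordinate."""
    return (touch[0] - offset[0], touch[1] - offset[1])

# Hypothetical calibration block: four on-screen targets and the touches recorded for them.
targets = [(100, 100), (500, 100), (100, 400), (500, 400)]
touches = [(104, 98), (506, 99), (103, 397), (507, 398)]

offset = touch_offset(targets, touches)
print(offset)                        # average offset in pixels
print(correct((304, 248), offset))   # a later response, corrected
```

The same idea carries over to the head-orientation example: the angle reported while the participant faces a known direction (e.g. straight ahead) can be stored and subtracted from later readings.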
Issues related to participants’ performance of the required tasks
Potential effects of the testing context
When testing outside a sound booth, it is important to consider both the cognitive and acoustic effects that the testing environment may have on task performance. While it is possible for remote testing to be conducted in environments with limited distractions, it is also possible that participants may not be alone in the testing environment or that the environment has distracting elements. Further, the participant could attempt to multitask during the testing. For headphone-based studies, passive noise-attenuating headphones may be advantageous, as may the use of a moderate-level masking noise. For remote testing with speakers, background noise, room acoustics, and the positioning of the speakers can influence performance. The use of moderate-level masking noise to overcome background noise may inconvenience individuals near the participant. Brungart et al (in press) measured speech perception in crowded public spaces while simultaneously measuring the background noise level on every trial. They were then able to compare performance as a function of the background noise level.
In conventional testing environments, after explaining the task, participants often have the opportunity to ask questions. Further, it is often possible to observe the data in near real time, allowing the experimenter to correct obviously incorrect behavior. During remote testing, this is often not the case. Subjects may ignore simple instructions like ensuring that headphones are placed on the correct ears and may not fully comprehend more complicated instructions. Multiple versions of the instructions may be required depending on the number of platforms the experiment is compatible with and on whether all subjects are required to be fluent in the same language.
Apart from standard considerations related to the relationship between age, hearing impairment, and cognitive decline, remote testing performance may depend upon the participant's comfort and skill in using a computer or tablet. Auditory remote testing presents a unique challenge since there is a complex relationship between computer skills, hearing impairment, and age (see Henshaw et al., 2012).
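When a noise level is logged on every trial, performance can be summarized as a function of that level. The sketch below is a generic illustration, not the analysis used in the study cited above; the bin width, function name, and trial data are assumptions.

```python
from collections import defaultdict

def accuracy_by_noise(trials, bin_width_db=5):
    """Group trials into noise-level bins and return proportion correct per bin.

    trials: iterable of (noise_level_db, correct) pairs, one per trial.
    """
    counts = defaultdict(lambda: [0, 0])  # bin lower edge -> [n_correct, n_total]
    for level_db, correct in trials:
        lower_edge = int(level_db // bin_width_db) * bin_width_db
        counts[lower_edge][0] += int(correct)
        counts[lower_edge][1] += 1
    return {edge: n_ok / n for edge, (n_ok, n) in sorted(counts.items())}

# Hypothetical per-trial log: measured background level (dB) and response correctness.
trials = [(52.1, True), (53.8, True), (58.0, False),
          (57.2, True), (61.5, False), (63.0, False)]
print(accuracy_by_noise(trials))
```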
Linguistic considerations (translation, etc)
Remote testing provides access to participant populations who speak a much wider variety of languages than may be available in a traditional single-site experimental design. While this can provide benefits, it can also affect performance on the task if the testing material is not appropriately translated and modified for the range of languages to be tested.
Technological literacy of participants
Remote testing may involve diverse levels of participants’ familiarity, experience, and facility with the technologies employed. Careful consideration should be given to maximizing accessibility across the targeted population when selecting a platform and designing a study. Consider, for example, a tablet preconfigured to run a single app with settings for that participant versus a laptop that requires signing in to wi-fi, downloading an update, and saving/uploading a data file. A typically developing college-age cohort might reasonably be expected to complete either study with minimal extra intervention (see participant administration), but the first option might be more appropriate for a broader cohort. The latter approach also risks confounding results by reducing the likelihood that some subject groups complete the full task and/or by introducing additional cognitive load unrelated to the research question.
Supervision of performance
Remote testing presents challenges for the experimenter in supervising the participant and their performance. Adequate supervision of the participants can be critical for keeping the participant motivated and engaged with the task. Supervision can also be critical for identifying situations where the participant may be confused by the instructions. Finally, while some experimental procedures can be fully automated, in some cases experimenter intervention is critical. For example, open-set speech-in-noise testing requires either that the participants self-score their performance or that the experimenter observe the responses in near real time. Supervision of remote testing can range from the experimenter being at the remote testing site, either physically or virtually, to automated help systems being built into the testing platforms, to asynchronous supervision and help via phone, etc.
Evaluation of participant experience
Remote testing participant populations potentially have a wider range of experience with auditory and/or behavioral testing. The experience of the participant with standard behavioral paradigms may impact the performance on the task.
Kinds of data
Remote testing is associated with various kinds of data that need to be exchanged between participants and researchers. These typically include
- Participant information, including personally identifiable information (PII) and protected health information (PHI) that may be subject to regulatory compliance constraints
- Stimuli (e.g., audio, image and video data) and experiment parameters. Although often fixed across participants, these may be individualized, for instance when participants are randomly assigned to different "conditions", or if the measurements are adaptive.
- Response data, which will likely be the bulk of the data that is of interest to the experimenters. Access to single-trial response data may be needed during testing for progress monitoring and verification of task compliance, to provide feedback to participants, and/or to calculate summary performance metrics to make decisions about task flow. The full set of final responses will then need to be assembled for detailed analyses. Long-term archival and sharing with collaborators or the broader research community are also considerations that may apply.
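For the individualized experiment parameters mentioned in the list above, one common way to assign participants to conditions is block randomization with a fixed seed, so that group sizes stay balanced and the assignment can be regenerated. The sketch below is one possible approach, not a prescribed method; the function name, participant IDs, and condition labels are hypothetical.

```python
import random

def block_randomize(participant_ids, conditions, seed=0):
    """Assign conditions in shuffled blocks so group sizes stay balanced.

    Using a fixed seed makes the assignment reproducible for record keeping.
    """
    rng = random.Random(seed)
    assignment = {}
    block = []
    for pid in participant_ids:
        if not block:
            # Start a new block containing each condition once, in random order.
            block = list(conditions)
            rng.shuffle(block)
        assignment[pid] = block.pop()
    return assignment

ids = ["P01", "P02", "P03", "P04", "P05", "P06"]  # hypothetical anonymous IDs
assignment = block_randomize(ids, ["quiet", "noise"], seed=1)
print(assignment)
```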
In some cases PII may incidentally be linked to the response data, thus requiring special considerations. For instance, when verbal responses are recorded, or if there is live video interaction between the participant and experimenter, raw audio/video data will contain identifying information.
Server-side versus client-side data handling
One advantage of handling data primarily on the client side (i.e., on the participant device) is that internet access is not necessary except for the initial download of the app and task material, and the upload of data at the end. Furthermore, when computations are done on the client side with pre-loaded stimuli, better timing control can be achieved compared to loading stimuli from the server on a trial-by-trial basis. Another advantage of client-side data handling is that some privacy/security issues may be circumvented, as described in the next section. On the other hand, server-side handling of data typically allows for greater standardization, near real-time logging of progress and aggregation of data, and, perhaps most importantly, a simpler experience for the participants because their involvement beyond completing the task itself is minimal (e.g., no need for participant involvement in installing the app or uploading the data).
Privacy and Security
Another layer of security may be achieved by encrypting all stored data (i.e., encryption at rest). Major databases (e.g., PostgreSQL, MySQL, SQLite) and cloud computing service providers (e.g., AWS, GCP, Azure) offer options for encryption at rest, whether built in, through extensions, or via encrypted storage volumes. However, it may be desirable to have public "clear-text" copies of the de-identified research data in the interest of open science. When sharing data to public repositories, it is good practice to use anonymous participant IDs that differ from those used during data collection. Finally, it is important that all communication between participant devices and servers is encrypted. This is especially the case for browser-based communications with form fields where the participant can type in information. This can be achieved using SSL/TLS. Certificates for TLS/SSL may be obtained from certificate authorities. A popular free option for TLS/SSL certification is Let’s Encrypt.
Remote testing platforms vary in their support for automatic data backup. If setting up a custom platform, major databases (e.g., PostgreSQL, MySQL, SQLite) also come with support for backup snapshots that may be executed by scripts scheduled to run at specific times (e.g., cron jobs on Linux). Multiple clones of the database may be used to reduce server downtime in the event of a database crash. Otherwise, the same data backup considerations apply as for in-person studies.
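The practice of using different participant IDs for public sharing can be sketched as follows. This is a minimal illustration under assumed data structures (records as dictionaries with a `participant_id` field); the function names and ID format are hypothetical, and the mapping itself should be treated as sensitive.

```python
import secrets

def make_public_id_map(collection_ids):
    """Map each data-collection ID to a fresh random ID for public release.

    The returned mapping links the two ID sets, so it should be stored securely
    (or discarded) and never shared alongside the public data.
    """
    mapping = {}
    used = set()
    for cid in collection_ids:
        public_id = "P" + secrets.token_hex(4)
        while public_id in used:  # very unlikely, but guard against collisions
            public_id = "P" + secrets.token_hex(4)
        used.add(public_id)
        mapping[cid] = public_id
    return mapping

def rekey(records, mapping, id_field="participant_id"):
    """Return copies of the records with the ID field replaced by the public ID."""
    return [{**rec, id_field: mapping[rec[id_field]]} for rec in records]

# Hypothetical de-identified records keyed by the IDs used during collection.
records = [
    {"participant_id": "s01", "threshold_db": -5.2},
    {"participant_id": "s02", "threshold_db": -3.9},
]
mapping = make_public_id_map(["s01", "s02"])
shared = rekey(records, mapping)
print(shared)
```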
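For a custom platform that stores responses in SQLite, a scheduled snapshot script might look like the sketch below. It uses the `backup` method of Python's standard `sqlite3` connections; the file names, table layout, and function name are assumptions for illustration, and other databases have their own backup tools.

```python
import os
import sqlite3
import tempfile

def snapshot(source_path, backup_path):
    """Copy a SQLite database to a backup file using the online backup API."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)  # sqlite3.Connection.backup (Python 3.7+)
    finally:
        dst.close()
        src.close()

# Demonstration with a throwaway database in a temporary directory.
workdir = tempfile.mkdtemp()
live = os.path.join(workdir, "responses.db")
backup = os.path.join(workdir, "responses_backup.db")

con = sqlite3.connect(live)
con.execute("CREATE TABLE responses (participant TEXT, trial INTEGER, correct INTEGER)")
con.execute("INSERT INTO responses VALUES ('P01', 1, 1)")
con.commit()
con.close()

snapshot(live, backup)

check = sqlite3.connect(backup)
rows = check.execute("SELECT participant, trial, correct FROM responses").fetchall()
check.close()
print(rows)
```

A script containing only the `snapshot` call, with real paths, could then be run by a scheduler such as cron.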
Issues in Data Analysis
Elevated random error
In the lab, we typically have a great deal of control over the testing environment. Subjects are usually seated comfortably in a sound booth with minimal auditory and visual distractions. In the same space, the hardware is usually limited to what is necessary for the subjects to respond to tasks, and is positioned in a way that enhances the overall experience during testing. The experimenter is typically present, with many opportunities to provide task-related instructions, and checks in frequently to make sure that subjects are engaged in the task. This is sometimes critical for testing with special populations such as young children and older adults (see Special Population). A remote testing platform outside of the traditional lab setting is unlikely to provide such control, inevitably introducing greater random error or “noise” in the behavioral data. Below is a list of factors to consider that may lead to more variability in behavioral data.
* Environmental factors: Subjects are likely to be in an environment with more auditory/visual distractions and higher ambient noise than in the lab sound booth. Ambient noise may also fluctuate over the course of the experiment.
* Device factors: Consumer electronics (e.g., computer, headphones) that subjects have access to may introduce variability in stimulus quality, such as emphasis/de-emphasis of spectral regions of the output frequency response. Noise-canceling headphones may in fact raise the noise floor if worn improperly.
* Subject factors: For remote testing, subjects will more likely run the experiment at a time of their choice. This may mean late evenings or other times outside of typical business hours. Their attention state during these times may also affect how they respond to psychoacoustic tasks, in ways unlike when testing is done in the lab. In the absence of a task proctor, participants may have limited opportunity to raise questions regarding the task instructions. For child and elderly participants, such opportunities may be an important part of behavioral testing.
Note that even though these factors may seem mainly to increase variability between subjects, they are also highly likely to affect within-subject performance because of the highly dynamic environments where remote testing occurs. What we do not know is whether, and to what degree, these factors contribute to the difference (random error) between data collected in remote testing environments and data collected in the lab. The elevated error may influence behavioral outcomes in the following categories:
* Test-retest reliability, particularly important for experimental protocols aimed at indicating diagnostic and training outcomes
* Baseline performance (e.g., threshold for speech in noise recognition)
* Effect size of experimental manipulation (e.g., amount of speech-in-noise masking due to different types of noise masker)
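To make the last category concrete, here is a rough numerical sketch of how added random error shrinks a standardized effect size. It assumes the extra remote-testing error is independent of, and simply adds to, the in-lab variability, which is an assumption rather than an established property of remote data; all numbers are hypothetical.

```python
import math

def standardized_effect(mean_difference, sd_lab, sd_extra=0.0):
    """Cohen's d when independent extra noise is added to the in-lab variability."""
    total_sd = math.sqrt(sd_lab**2 + sd_extra**2)  # variances of independent sources add
    return mean_difference / total_sd

# Hypothetical 2 dB difference between two masker conditions.
d_lab = standardized_effect(2.0, sd_lab=3.0)
d_remote = standardized_effect(2.0, sd_lab=3.0, sd_extra=4.0)
print(round(d_lab, 2), round(d_remote, 2))
```

Under these assumed values the same raw difference yields a noticeably smaller standardized effect, so more participants or trials would be needed to detect it.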
The following illustration provides a simple example of how remote data collection may influence behavioral outcomes as compared to data collected in the lab environment.
The table below shows the general types of studies that are progressively more susceptible to variabilities introduced by remote platforms, including the factors that will likely influence behavioral outcomes and potential solutions.
Factors influencing outcomes
Non-threshold studies (e.g., talker discrimination)
Studies measuring pitch-based absolute thresholds (e.g., pitch discrimination)
Studies measuring loudness-based absolute thresholds (e.g., loudness threshold)
Studies measuring relative thresholds (e.g., speech in noise)
Studies involving binaural hearing (e.g., interaural level difference discrimination)
Ideally, the effects of remote testing on baseline performance, effect size, and random error would be known (or at least measurable) for every study question. In general, however, that information is not available. Alternative approaches can be used to estimate these effects, and inclusion of one or more validation conditions is one of the few easily identifiable best practices for remote testing:
Replication using in-lab methods. One option could be inclusion of in-lab testing in a subset of conditions or participants. Where appropriate, for example, lab personnel might collect data on themselves in both remote and in-lab settings, or in-lab data might already exist from pilot stages of the project. Directly comparing performance across settings can provide some reassurance of the validity of remotely collected data.
Remote replication of in-lab tests. If an existing in-lab dataset closely related to the study procedures can be obtained, replication via remote testing can provide additional assurance that remote methods are capable and comparable to in-lab approaches. For example, a condition from a prior in-lab study can be included to estimate baseline performance across test settings. While this approach might not be capable of detecting changes in effect size, it can be used to verify baseline performance level and variation.
The random error is expected to be higher in datasets collected on remote testing platforms due to the highly dynamic testing environments (see the contributing factors listed in Section 1). While validation studies may provide insights into the types of studies that are generally more resistant to the remote testing environments, below are some considerations for ensuring robustness through data analysis of individual datasets collected remotely.
Removal of outlier participants from data analysis. The potentially large degree of random error in remote testing suggests high variance across participants, which will have negative impacts on study power. Some variation might be attributed to participant factors related to the task. Eliminating non-compliant participants from data analysis is often necessary to preserve study power, but often such participants cannot be easily or cleanly identified from the task data alone. Investigators are encouraged to carefully consider potential sources of inter-participant variation and include additional tests, such as perceptual/cognitive screening, attention checks, and catch trials within the remote testing protocol. Predefined levels of performance on these additional measures can be used to censor participant data independently of the primary data, improving the power and sensitivity of the study in a balanced and rigorous way.
Bootstrapping and reporting. The (unknown) underlying effect of interest is likely more susceptible to sampling error because of the elevated random error in datasets collected in highly dynamic environments. When modeling psychometric functions, Wichmann & Hill (2001) provide insights on estimating variability in fitted parameters through bootstrapping in the psignifit toolbox.
In essence, bootstrapping provides a range of values for the parameter estimate(s) of any statistical model fitted to a dataset, both for descriptive statistics (e.g., mean, standard deviation) and inferential statistics (e.g., regression estimates, effect size). It uses a Monte Carlo approach to simulate resampling from the dataset with replacement, under the assumption that the original sample represents the population (random sample).
Bootstrapping can also provide a sanity check for sample size (Chihara & Hesterberg, 2011). Under the Central Limit Theorem, the sample mean from a sufficiently large random sample will have an approximately normal distribution regardless of the distribution of the population. The distribution of the bootstrapped (re)samples has approximately the same spread and skew as the original random sample. So if the distribution of the bootstrapped parameter estimate(s) is not normal, it is highly likely that the original dataset is under-sampled.
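The resampling idea can be sketched with a generic percentile bootstrap of the mean. This is not the psignifit procedure (which bootstraps fitted psychometric-function parameters); it is a minimal illustration using only the Python standard library, with hypothetical threshold values and an arbitrary number of resamples.

```python
import random
from statistics import mean

def bootstrap_ci(data, stat=mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(data)
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(n_boot)
    )
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical speech reception thresholds (dB SNR) from ten remote participants.
thresholds_db = [-6.1, -4.8, -7.3, -5.5, -2.9, -6.6, -5.0, -4.2, -8.0, -5.7]
low, high = bootstrap_ci(thresholds_db)
print(low, high)
```

Passing a different function as `stat` (e.g. a median or a fitted slope) reuses the same resampling loop; inspecting a histogram of the sorted estimates is one way to carry out the normality check described above.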
The primary dimension along which approaches to remote testing vary is the trade-off between experimental control and convenience/accessibility. This trade-off impacts all aspects of the study design, although different balances may be more appropriate for different aspects (e.g. combining careful stimulus control with convenient sampling of participants). There are many different research Platforms available to support remote testing; please refer to Platform Descriptions for detailed information about specific platforms.
What approach should I use for remote testing?
There are really three big questions you need to answer when deciding how to set up an auditory experiment for remote data collection: what hardware will I use, what software will I use, and who are the subjects I want to test. In all three cases, the alternatives range from convenient and less controlled to more time consuming and well specified.
hardware: calibration & interfaces
Any user hardware
software: data handling and experimental control
Preconfigured hearing research packages
fully custom scripts
subjects: demographics & instruction
anonymous & unsupervised
“by invitation” access
supervision by proxy (e.g., parent)
Some other dimensions along which platforms vary:
- Settings (In-lab, kiosk, at-home, in-the-wild)
- At this time, most of the platforms identified with remote-testing appear optimized for testing in remote but isolated settings, such as in a participant’s home. Most are equally deployable to in-lab settings, possibly with greater control over computing and stimulus hardware. Depending on the study, it might be feasible to utilize a single design for both remote and in-lab validation studies. Keep in mind, however, that for commercial platforms the pricing structure may not be ideal for in-lab deployment.
Portable systems configured for standalone use with minimal experimenter intervention could also be deployed in a kiosk setting, i.e. semi-permanently installed for unsupervised walk-up use. Depending on the motivation for remote testing, kiosk deployment could provide numerous advantages such as sampling of geographically targeted populations in health-care offices, at music events, etc. Study design for kiosk-based testing is likely to share many elements of design for take-home / tablet-based studies where simplicity and clear instruction are prioritized over controlled sampling and experimenter supervision.
Finally, some of the platforms identified for remote testing may be suitable for use in everyday / "real-world" settings such as bars, cafes, classrooms, and outdoor settings. Again, depending on the motivation for remote testing, this use could enhance the ecological validity of a study by measuring performance in behaviorally relevant backgrounds rather than in controlled lab settings.
- Supervision (experimenter present in person or remote, vs standalone task)
- Instruction about the task, clarification when questions or malfunctions occur, and debriefing are all common interactions between experimenters and participants in lab-based testing. Shifting to remote testing requires careful consideration of how (or if) such communication must be facilitated, and identification of a research platform capable of supporting it. At one extreme of this dimension lies in-lab testing, with constant in-person interaction available as needed. At the other lie completely standalone tasks. Detailed and effective instruction can take the place of many interactions, but is less helpful for unexpected errors in the task or in the research hardware/software itself, or for special populations which experience specific challenges. An intermediate solution is for an experimenter to provide direct real-time supervision remotely. Some platforms support this feature directly. Others may require use of a secondary service (telephone / video calling, screen sharing, etc.) running alongside and independently of the research platform itself.
- Whose device? (experimenter provided / take-home vs participant BYO)
- Another important dimension is that of the hardware selection and control. Laboratory-owned equipment (e.g. a lab-configured PC or tablet) can be used for in-lab testing, and for remote testing, by delivering or shipping the equipment to the participant’s location. Greater control is obviously achievable with experimenter-provided equipment, which could be delivered with earphones, displays, and response devices. In that case, device calibration can be done prior to delivery, and verified after use and return.
Participant-controlled equipment offers less control, but may allow greater flexibility in accessing the research paradigm online or by direct download. In that case, other procedures will be necessary to verify the correct operation and stimulus delivery (e.g. psychophysical calibration), and preparations should be made to provide technical support to participants attempting to download and install research software for participation.
- Platform device type (browser, tablet, headset, PC, custom hardware)
- Similarly, device types vary significantly across research platforms, from entirely software-based platforms accessed online (via a web browser) or by download, to standalone devices such as tablets and VR headsets, to general purpose or custom computing hardware. Some approaches (e.g. physiological data collection, custom earphones) may require additional custom hardware for control or calibration.
- Special hardware support (headphones, soundcard, UI response, etc)
- Many online platforms are capable of presenting auditory stimuli via the interface or sound card built into the participant’s hardware, and using whatever earphones the participant has on hand. Few of these offer the level of stimulus control and calibration that experimenters may be used to working with in the lab. For this reason, it may be worthwhile to consider platforms capable of working with off-board audio interfaces (e.g. USB sound cards) which can be delivered, along with a standard model earphone, to the participant’s location even if the research platform itself is fully online. Other types of specialized hardware may be required for certain types of response data, such as physiological data (heart rate, GSR, EKG / EEG), head and hand tracking (possible using VR headsets), etc.
- Platform OS environment (e.g. Windows, iOS, Android)
- Platforms also vary according to the operating system environment in which they run. Tablet-based systems may be compatible with Apple’s iOS (iPhone / iPad) or Google’s Android, which also supports a range of other devices including VR headsets; PC-based systems may run on Microsoft Windows or Apple MacOS. Few platforms run on multiple OSes, aside from fully online platforms, which may be compatible with any OS and a wide range of web browsers.
- Software environment type (bespoke app, customizable app, MATLAB, js, Python, etc)
- Costs involved (software costs, subscriptions, required services)
- Finally, the costs associated with various remote-testing platforms vary significantly and follow a number of different models. On the one hand are in-house and open-source programs with minimal or no acquisition costs. On the other are online platforms with subscription models that charge by the year, study, or participant. Some research platforms also require specialized hardware, which may be available from the platform vendor or third parties.
Special considerations may apply to specific populations of research participants, including children, the elderly, people with sensory impairment (e.g., hearing loss or low vision), users of hearing aids and cochlear implants, and patients experiencing cognitive and/or neurological challenges. Limited literacy or limited proficiency in the primary language of the test can also pose a challenge. Although each individual or group of participants will experience their own pattern of challenges for in-lab and remote testing, some consideration can be given to common features of remote testing that might particularly affect special populations.
Remote testing imposes additional challenges compared to in-person testing. These additional demands include:
- the ability to communicate using remote technologies, such as written instructions or information conveyed via video conferencing
- practical knowledge required to set up the test environment, such as turning off devices that generate background noise or asking family members not to interrupt testing
- technical knowledge required to set up hardware or software
- the ability to maintain attention without direct supervision
Special populations may require additional accommodations to ensure consistency and quality of data collected remotely. These accommodations might include:
video-chat for obtaining consent or assent
Video may be particularly beneficial for use with special populations, because it provides a rich set of interpersonal cues to ensure understanding and guard against coercion. Closed captioning may also be appropriate for some subjects.
While verbal instructions may be sufficient, some participants benefit from additional materials showing concrete examples of the task and what they will be asked to do.
a progress chart or visual schedule
Like a progress bar, these tools help the subject track their progress through a task or set of tasks.
an experimenter available when testing occurs
Having someone on call during data collection increases the chances that data will be collected following the protocol.
recruiting a parent or other helper to provide in-person support
A “wingman” can be trained to fulfill some of the same functions as an in-person experimenter.
blocking data collection into short segments
Providing frequent opportunities for feedback and breaks is common practice when working with special populations, but it could be particularly important for remote testing because the experimenter cannot monitor progress for signs of fatigue or flagging motivation.
including task training and probes
Training and probes may be even more critical for remote testing than in-person testing due to the reduced supervision and opportunities for the experimenter to notice confusion or flagging attention on the part of the subject.
user friendly response interface
multiple methods for delivering feedback and reinforcement
For special populations who may find prizes important to sustain motivation during the task, the experimenter may want to design various methods to effectively deliver incentives that meet compliance requirements.
interpretation of standardized tests
Administering standardized tests, such as batteries of IQ, cognitive, and language abilities, is usually part of the protocol with special populations. There are some implementations online (Gorilla Sample Tests). For most standardized batteries, normative data are collected through in-person interactions and may not be valid for remote implementation. The experimenter should be careful in interpreting individual data if they will be transformed based on normative data.
Issues related to peer review
What should the standard be for publishing remote research?
There aren’t any hard and fast rules or boxes to check, just as there are no universal standards for in-person research. Experimental methods should be considered in the context of the protocol and the research question. Given the new pressures to adopt remote testing, reviewers will need to think critically and avoid rejecting a new methodology simply because it deviates from previous conventions. The focus should be on whether the hardware and test protocol are sufficient to support reliable and valid data that inform the specific question being asked.
As an author, what steps should I take to demonstrate rigor of my remote research methods?
Steps for demonstrating rigor of remote research are the same as those for in-person research, with the caveat that novel methods require additional explanation and explicit justification. Some specific considerations appear in the section describing Best Practices.
My remote methodology offers less stimulus control than in-person testing. Is that a fatal flaw?
Not necessarily. If you can make a case that stimulus control is good enough to observe the effects being evaluated, then it may be sufficient to describe the methods and note relevant limitations of the methods.
Identifying best practices
A long-term goal of the task force is to identify best practices or guidelines for remote testing. For the most part, however, the necessary evidence base (in terms of what works well and what does not) does not yet exist. As such evidence becomes available (see Examples), we expect to add to the list of "best practices" which can be identified from investigators’ experiences. Here, we consider the motivations and challenges to identifying best practices in the first place.
Why should we attempt to identify best practices?
Best practices can form the basis of formal or informal guidelines for research practice with remote testing. The many tradeoffs evident in remote testing demonstrate the potential for fundamental weaknesses if the remote-testing approach is not designed appropriately for the study. Identifying best practices can help investigators select the features which will best ensure rigor and reproducibility of the research.
What issues stand in the way of establishing a single set of best approaches?
- Outside of specific examples where remote methodology has been a key focus (see Resources), very few remote-testing studies have yet been completed. There are many research questions which ought to be addressable via remote testing, but the unavailability of results means that critical challenges and confounds remain unidentified. This barrier is likely to be overcome as investigators complete studies and gain experience with the relevant approaches.
- More fundamentally, because experimental questions differ widely in the level of control required, no single approach is likely to be optimal for all studies. The specific hardware, software, and procedures used in any remote-testing study will impact the degree of experimental control and the information that can be collected from a test session. For example, accurate calibration is critical when evaluating detection in quiet, and less so when evaluating memory for a melodic tone sequence. Investigators are encouraged to carefully consider the methodological strengths and weaknesses as they pertain to the specific goals of their own research. The best approach depends on the phenomena being evaluated.
Candidate Best Practices that can be identified at this time:
Align strengths to research goals: Prior to conducting a remote-testing study, enumerate the specific tradeoffs associated with each identified approach. Be certain to align the strengths to the goals of the specific research question. Familiarity with the questions raised on this Wiki (see Issues) and with feature comparisons across remote-testing Platforms could help.
Measure and document calibration: Incorporate the most accurate form of stimulus calibration that is achievable within the selected approach. In some cases (e.g., browser-based testing with participants’ own computer and headphones) this may be very limited, but even a simple psychophysical validation using tone detection or binaural comparison could provide important verification of the stimulus setup, such as whether earphones were worn correctly or if stimulus levels were appropriate for the test setting. More elaborate procedures involving acoustical measurement before, after, or during the tests might alleviate many performance concerns about testing outside of a controlled sound booth.
Validation: If possible, include a replication or validation condition which matches, as closely as possible, an approach for which standard in-lab data exist or may easily be obtained. Close replication across in-lab and remote-testing procedures is one of the strongest approaches available to ensure the reliability and validity of new data. See Data Analysis. Unexpected results could indicate an unacceptable deviation from ideal conditions, and could help to identify previously unanticipated limitations of the selected approach.
Inclusion of independent measures and predetermined criteria for outlier removal: Incorporating additional measures, such as cognitive screens, attention checks, and catch trials, into the study procedures can provide important independent data for identifying non-compliant or poorly performing participants who contribute excessively to random error and thus should be removed from data analysis to preserve statistical power (see, e.g., McPherson & McDermott 2020). A set of independent, predetermined criteria for data removal is required to avoid introducing experimental bias that could result from identifying "outliers" based on the study data themselves. Alternatively, screening measures can provide covariate measures that aid the interpretation of study data when all participants are retained in the final analysis. See Data Analysis.
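As an illustration of predetermined exclusion criteria, the following is a minimal Python sketch, not a recommended procedure. The thresholds (MIN_CATCH_ACCURACY, MIN_ATTENTION_CHECKS_PASSED), the ScreeningRecord fields, and the function names are all hypothetical and chosen only for this example; appropriate values depend on the study. The key point it shows is that the criteria are fixed in advance and are applied only to the independent screening measures, never to the primary study data.

```python
from dataclasses import dataclass

# Hypothetical criteria, assumed to be fixed (e.g., preregistered)
# before any data are collected. The values are illustrative only.
MIN_CATCH_ACCURACY = 0.80
MIN_ATTENTION_CHECKS_PASSED = 2


@dataclass
class ScreeningRecord:
    """Independent screening measures for one participant (hypothetical fields)."""
    participant_id: str
    catch_correct: int            # catch trials answered correctly
    catch_total: int              # catch trials presented
    attention_checks_passed: int  # attention checks passed


def meets_criteria(record: ScreeningRecord) -> bool:
    """Return True if the participant satisfies every predetermined criterion."""
    if record.catch_total == 0:
        # No catch trials completed, so compliance cannot be verified.
        return False
    catch_accuracy = record.catch_correct / record.catch_total
    return (
        catch_accuracy >= MIN_CATCH_ACCURACY
        and record.attention_checks_passed >= MIN_ATTENTION_CHECKS_PASSED
    )


def split_participants(records):
    """Partition participant IDs into (retained, excluded) lists."""
    retained = [r.participant_id for r in records if meets_criteria(r)]
    excluded = [r.participant_id for r in records if not meets_criteria(r)]
    return retained, excluded


if __name__ == "__main__":
    example_records = [
        ScreeningRecord("P01", catch_correct=9, catch_total=10, attention_checks_passed=3),
        ScreeningRecord("P02", catch_correct=5, catch_total=10, attention_checks_passed=3),
        ScreeningRecord("P03", catch_correct=10, catch_total=10, attention_checks_passed=1),
    ]
    retained, excluded = split_participants(example_records)
    print("Retained:", retained)
    print("Excluded:", excluded)
```

If all participants are to be retained instead, the same screening values (for example, catch-trial accuracy) could be carried forward as covariates rather than used for exclusion.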