This page presents a concatenated version of the "issues" pages hyperlinked from the Issues page.

Issues and Best Practices

Shifting a laboratory-based study to one that employs a remote-testing approach, or designing a new remote study, requires the investigator to consider numerous issues that could impact the accuracy, reliability, validity, or legality of the work. In this section, we aim to explore the many questions that an investigator might consider and, where possible, identify some of the clearest options. Although a long-term goal of the task force is to identify best practices or guidelines for remote testing, the necessary evidence base (in terms of what works well and what does not) does not yet exist. Thus, for the time being you will find few explicit recommendations here, and investigators are encouraged to carefully consider each issue as it pertains to the specific goals of their own research. Please refer to Examples for additional specific information provided by investigators regarding the successes and failures of individual projects.




Reasons and motivations for remote testing

The need for remote testing predates 2020, but it has been greatly accelerated by the COVID-19 global pandemic. Lab closures or restrictions on in-lab testing due to quarantine or social-distancing recommendations have forced many investigators to decide between suspending progress of human-participant research and turning to alternatives to in-lab data collection such as meta-analyses, use of computational models, and remote testing of human participants.

Access to rare populations

Previous remote-testing needs have included testing individuals in rural areas and recruiting members of rare populations. Specifically, researchers and hearing healthcare providers may need to reach clinical patients and potential research subjects located in different counties or even states from specialized health care clinics and research laboratories. Financial stipends may be offered in these situations to incentivize patients or research subjects to travel to specialized clinics and laboratories. However, there are many situations where even stipended travel is not justified. Families with young children, caregivers for older adults with mobility and health issues, and individuals with special needs may not be able to visit specialized clinics and laboratories. Additionally, patients and research subjects at a distant location may only be able to make the trip once and would be unable to participate in studies requiring multiple visits or longitudinal components.

Convenience

While there are obvious benefits to testing rare clinical and rural populations, remote testing also offers many conveniences for families. Families with multiple children have very busy schedules and are often unable to schedule visits around parents' work schedules, multiple children's school schedules, and extra-curricular activities. Additionally, many families may be caring for elderly parents or other dependents with various mobility or other needs. All of these scenarios may cause appointments to be delayed or limit the amount of in-person testing completed per visit, particularly if testing locations involve commutes with heavy traffic and poor parking. The ability to test remotely would be convenient for families and caretakers and could potentially be more efficient because several transportation-related barriers are eliminated in the process.

Audiological telehealth

Remote applications of auditory research and select clinical audiology services may begin to alter how people think about their hearing healthcare in the future. Mobile applications of hearing tests have been developed and refined for decades, catering mostly to industrial audiology applications in accordance with the Occupational Safety and Health Administration (OSHA) regulations for hearing conservation programs. However, patient-oriented healthcare for personal use has been expanding in recent years. Unmonitored hearing tests are far from gold-standard audiological testing, but it could be argued that access to mobile pure-tone testing at home or at locations such as pharmacies could help identify those with hearing loss before they are ready to visit an audiology clinic. In terms of patient care, there may be unexplored auditory analogies to white-coat hypertension that researchers and clinicians are unaware of because of the constraints of current diagnostic hearing-testing protocols. However, monitoring ambient and other noises in a mobile test environment is potentially much more complex than other patient-centered healthcare applications.

Motivations for remote testing of hearing currently vary from continuing research testing outside of the laboratory to remote clinical applications, including hearing device fittings and adjustments as well as aural rehabilitation or auditory training plans. During this period of COVID-19, most researchers and clinicians simply hope to continue their research data collection to improve clinical practice and to minimize the disruption of safely diagnosing and treating their patients. Conducting research experiments and treating clinical audiology patients in the comfort of their homes is in everyone's best interest during a global pandemic. Further, lessons learned from remote testing now may open opportunities for more efficient data collection and more realistic hearing healthcare in the future.

Other considerations

Remote testing can also be used to support data collection in real-world environments, which may be an important issue for some questions of ecological validity. Longitudinal studies or trials of at-home training paradigms could also benefit from robust approaches to remote testing with online or portable platforms. Future developments aimed at enhancing the rigor and robustness of at-home tests could pay significant dividends for a wide variety of studies which might be freed from lab-based constraints and therefore achieve greater validity in terms of the listening environment, sampled population, etc.




Compliance/legal/ethical issues

IRB approvals

Institutional Review Boards (IRBs) are tasked with protecting the rights and welfare of human subjects who participate in research. The procedures used to achieve that goal vary across IRBs. When preparing a protocol for remote testing, or modifying an existing protocol, you may want to use text from approved IRB protocols as a guide. When adding remote testing to a protocol, your IRB may want to know whether these procedures will be used for a limited period of time (e.g., during a shelter-in-place order) or indefinitely. If in-person testing is temporarily suspended, your IRB may ask that you retain in-person testing procedures in anticipation of resuming those activities in the future. In addition to considerations related to in-person testing, a protocol that includes remote testing may also need to consider:

* modified procedures for recruiting and obtaining informed consent remotely

* reduced stimulus control and/or data integrity, and resultant changes in the balance of risk vs. benefit

* additional risks of harm to the subject (e.g., sounds that are louder than intended)

* additional risk with respect to loss of confidentiality associated with transferring data from the remote test site

* procedures for providing hardware or verifying hardware already in the subject’s possession

* liability associated with asking a subject to download software onto a personal computer or remotely accessing a subject’s computer

* procedures for subject payment

HIPAA (see also data handling)

The Health Insurance Portability and Accountability Act (HIPAA) is a federal law that sets out standards for protecting a patient's health information from disclosure without their consent or knowledge. A HIPAA release form gives a researcher permission to access Protected Health Information (PHI). The researcher is responsible for protecting that PHI. Your IRB will want to see procedures for doing so, including the use of secure communication channels and appropriate data handling procedures.

Consent procedures

General guidelines for obtaining consent remotely are the same as for in-person research. If the research presents no more than minimal risk of harm to the participant, your IRB may waive the requirement of obtaining documented (signed) informed consent. In these cases, the researcher is often asked to provide the subject with a study information sheet and/or a verbal description of the study, following an IRB-approved script. In some cases, a waiver of informed consent may be granted; this can occur if the risk is no more than minimal and if the research would not be feasible without a waiver.

If documented informed consent is required, procedures for obtaining remote consent may make use of phone, video-chat, or web-based applications. Many institutions have preferred software and procedures in place for secure, confidential communication. In some cases, procedures related to obtaining consent in the context of telemedicine may be appropriate (e.g., Bobb et al., 2016; Welch et al., 2016).

Data and safety monitoring

Collecting data remotely introduces additional security concerns that are often avoided with in-person testing. Encrypting data, deidentifying data, and using HIPAA-compliant communication software are all steps that can mitigate risk. Your IRB will want to see that you have a plan in place to ensure data security and integrity. Many institutions have preferred software and procedures for handling remote data collection.

Compliance with regional and international law

Laws regulating human subjects research can vary widely across countries and even across states within the US. For example, most states require parental consent for children ≤ 17 years of age to participate in research, but the cutoff is ≤ 18 years of age in Nebraska. In this case, the age of consent is determined by the state law associated with the location of the institution carrying out the research, not the location of the subject. In other cases, you may need to consider local regulations as they relate to remote testing. Be sure to obtain approval from your IRB before testing subjects who are outside the US.

Other categories to consider

  • Information technology compliance (e.g. screen sharing) – institutional requirements may vary
  • Local/institutional issues
  • Platforms approved for use by local IT / IRB



Participant Administration

Recruitment

Important factors to consider in recruitment include identifying participants who meet the inclusion criteria and communicating with them. An advantage of remote testing is the ability to create new subject pools. Because participants are not restricted to a single geographic area, there is greater opportunity for diversity in the sample. How a researcher recruits participants should be guided by factors such as sample size, geographic region, and any additional required details. Identifying participants may make use of the following platforms and approaches:

* Prolific

* Amazon Mechanical Turk

* Children Helping Science

* Lookit the Online Child Lab

* Existing subject pools on lab and institution levels

* Family and friends of lab members

* Via social media, newsletters, and other advertisements

* Patients at a local clinic

Communicating with potential participants occurs on a continuum and can include emailing, talking on the phone, talking via a secure video call, text messaging, and automatic messaging to participants via various platforms.

Email communication provides participants with a written version of the study's details and allows potential participants time to consider their answers to eligibility screening questions and whether or not they would like to proceed with the consenting process. On the other hand, emailing back and forth with participants can be time-consuming. Examples of email templates include an initial recruitment email describing the study, a recruitment email for existing participants in the lab or institution's subject pool, a follow-up email for interested participants that includes eligibility questions, a confirmation email, an instructions email, a reminder email, a sorry-we-missed-you email if the participant forgets about their scheduled participation time, a post-study payment email, and a thank-you email for the individual's participation.

Phone or video call communications allow for speedier delivery of information and a real-time opportunity for participants to ask questions and get an appointment scheduled. If you elect to communicate with participants via telephone, use anonymous, encrypted, or platform-based phone numbers to avoid distributing your personal phone number to participants.

Software programs can assist in providing automatic and scheduled communications with potential participants. For example, Gorilla can provide automatic replies to participants. Scheduling platforms and calendars such as TimeTap and Outlook Calendar allow the researcher to set up curated reminder emails to be sent automatically to participants at specified times. If Outlook Calendar is used, keep in mind that the Outlook account needs to be open for the reminder to send.

The use of Institutional Review Board-approved templates for these communications, particularly in the initial recruitment stage, is suggested to ensure that participants receive the same fair, straightforward descriptions of the study.

Consenting

Consenting can be performed in person, over the phone, or using a video platform. There may be institutional considerations, based on each Institutional Review Board, regarding which methods are preferred and allowed. Video platforms commonly used for consenting include:

* Webex

* Zoom

* Microsoft Teams

Electronic consent (henceforth referred to as e-consent) platforms collect the participant's signature remotely. This eliminates the need to print consent forms, obtain in-person signatures, and email forms back and forth. These platforms can house each form needed, including the consent form, HIPAA form, assent form, intake form, and demographics form. Obtaining electronic consent may be facilitated by the following platforms:

* REDCap

* DocuSign

* Prolific

* G Suite

* Adobe Sign

* Qualtrics

An additional option for consent is a checkbox stating that, by clicking the "next" button, the participant consents and agrees to the terms stated on the page. For this option, an important consideration is how much documentation is required for consent. Signatures can be collected by:

* Emailing forms and asking for a scanned copy with a signature

* Typed signatures

* Trackpad signature through e-consent platforms

As with in-person testing, there are special considerations for pediatric testing. Assent forms can be signed using the same platforms as forms for adult participants. Children may not always be visible on the webcam during consenting, which makes it difficult for a researcher to identify body language associated with a child withdrawing their assent. Should the intake process be lengthy, the child may need a break in order to refocus. The consenting session should also occur at a time when there are minimal distractions in the room.

Instructions

Instructions for the study can be provided in many ways. They can be typed up in a document and emailed to participants, laminated and included in a remote testing kit, or mailed to participants. Verbal instructions via a phone or video call can also supplement written instructions. Accessibility is critical when providing instructions so that participants fully understand the task they are being asked to complete. Consider writing instructions at a 2nd-grade reading level so potential participants, regardless of literacy level, can clearly understand what the study involves. Videos may be useful as a supplement for demonstration purposes. Ensure the text on the instruction documents is of a large enough font size for individuals to read easily.

Consider using both text and visuals in instructions. This may involve listing step-by-step instructions along with screenshots of the programs used and/or diagrams of the study’s setup. For children, cartoons may be helpful in conveying the main points. Instructions should also include reminders about how long each component is expected to take, any environmental modifications the participant may need to make (e.g., avoiding the washing machine running if the environment needs to be quiet), and contact information if the participant has questions.

Professional Considerations

There are many professional considerations on the researcher's end when engaging in remote consenting, including the modality of the consent process, confidentiality, whether a witnessing signature is required, ways to reach potential participants who do not have access to Internet or technological resources, and whether consent forms should be sent to the potential participant for review prior to the consenting call. Confidentiality may be protected with the following procedures:

* Consent while alone in a closed room

* Make sure the background is quiet

* Wear headphones

* Assure the potential participant that you are the only one seeing and hearing him/her

Minimize distractions by:

* Using a blank wall as a background

* Blurring or adding a custom background

* Positioning yourself with windows or lighting in front of you

* Asking ahead of time if the potential participant would benefit from captions

Delivery of Materials (If Applicable)

Take-home materials can be distributed in a variety of ways. Materials can be shipped from the lab or through an online ordering system such as Amazon, dropped off at the participant's doorstep by research staff, or sent electronically (e.g., as downloadable computer software). Shipment can be useful for getting materials to participants who are not in close proximity to your lab. Distribution of testing materials by lab staff can be useful for reaching participants in close proximity. Research staff and the participant can arrange a drop-off time in advance, allowing the participant a period of time (e.g., 24 hours) to complete the experiment before contacting the research staff for pickup of the materials. Sanitation and social distancing may require you to:

* Wear a face mask and gloves when exchanging materials in person

* Exchange materials by arranging with the subject to leave (or retrieve) items from an accessible location (e.g., outside of their door)

* Include sanitation products with the take-home materials

* Following material use, sanitize all surfaces of the materials with sterile alcohol prep pads and set them aside to dry before the next use

Troubleshooting

In remote testing, the researcher is often removed from the session. If participants have questions or difficulties with consent, intake forms, or accessing the materials for the study, they will need a point of contact within the lab. On-call troubleshooting assistance can be available via email, video call, phone call, or text, depending on the institution. Video calls allow the participant to share their screen with the researcher, which can be a useful tool to rapidly resolve the problem. While these are viable options during business hours, it may be necessary for a lab member to be available after hours depending on when the participant completes the study.

Challenges unique to remote testing may include internal factors related directly to the participant, such as compliance with instructions, familiarity with the technology, strength of the WiFi signal, and additional considerations for pediatric participants. More information on this subject can be found in (link section of Wiki).

Reimbursement

Type of payment and documentation of time should be considered when determining reimbursement of participants. Reimbursement methods for remote testing can be similar to those used for in-person testing, or they can be entirely electronic. Most electronic payment methods are simple to use and provide the researcher with a notification when the payment has been received by the participant.

Payment methods and associated considerations include:

* Gift card (e.g., Visa gift cards, other electronic gift cards): This form of payment can restrict the participant to using their earned funds solely at one business. Visa gift cards remove this limitation but cost more than the amount disbursed to participants.

* Third-party payor (e.g., Cashapp, Venmo, Zelle, PayPal): Not all participants may have access to the payment method, and requiring a participant to create an account with a third-party service may not be the most user-friendly option.

* Electronic check: This method may be restricted by institutional policy, as it requires a large amount of personal information prior to sending.

* Raffle: Each participant is entered following completion and a name is selected. This can be useful when time spent on the study cannot be verified.




Stimulus presentation

Audio Hardware

Earphones (see also Resources.Earphones)

Commercially available headphones will vary in their frequency response and ability to isolate external sounds. For example, Apple EarPods have somewhat poor ability to reproduce frequencies below 100 Hz and above 10 kHz, but within this range the reproduction accuracy is fairly good. If particularly low or high frequency responses are needed, the experimenter should consider shipping participants headphones known to have good frequency responses in the desired range. If perfect frequency reproduction is not particularly important for the experiment, typical commercially available headphones may be sufficient. Additionally, earbuds tend to be worse at isolating environmental noise, so unless participants are guaranteed to be in a quiet environment, in-ear (insert) headphones or closed-back headphones may be better for ensuring that the experimental sounds are clearly audible relative to the participant's environment. Wireless Bluetooth headphones may receive interference from other Bluetooth devices in the area and, as a result, may lose segments of the auditory signal. This is undesirable, so participants should use wired headphones whenever possible.

Loudspeakers

Loudspeakers vary more in their ability to faithfully recreate stimuli and also interact with the acoustic characteristics of the listening environment. Generally speaking, small speakers with a single driver will not be able to recreate the full spectrum of sounds. In particular, laptop loudspeakers often suffer from poor-quality low-frequency reproduction, and as such should be considered with caution for most experiments. Encourage participants to avoid listening to loudspeakers in rooms that have bare walls and floors, as the reverberation of the room may interfere with hearing the stimuli. If the participant is expected to be a certain distance from the loudspeaker to control for level or speaker characteristics, one option is to send something like a yoga or floor mat along with the speaker to show exactly where the participant should sit and where the speaker stand should be set up.

Sound Cards

The only way to precisely control audio levels or calibrate the frequency response of the device is to use a known combination of headphones/loudspeakers and a sound card. If the experimenter wants to provide all of the equipment necessary to complete an experiment, they can ship a whole computer/tablet with earphones or a loudspeaker to participants (e.g., PART). Alternatively, if the experimenter wants to control the auditory stimulus but allow participants to run research software on their own computer, an external sound card and earphones can be shipped to the participant. In this case, the sound card and earphones can be calibrated together by the experimenter so that the level and frequency response of the audio hardware are known and can be precisely controlled in the experiment. The experimenter should provide easy-to-understand instructions for connecting the hardware to the participant's computer and be available to help troubleshoot any issues the participant has. Relying solely on the participant's computer is possible, albeit with less precise control of stimuli. One particular issue to watch for is that the standard sound drivers in Windows 10 try to shape the amplitude of sounds to avoid sudden onsets. This means it is possible for Windows to ramp up the audio amplitude of a program during stimulus presentation, which is problematic for short (less than approximately 1 s) sounds and in some cases may even render short sounds inaudible. In some cases, Windows OS power settings may affect this behavior (see, for example, https://appuals.com/audio-crackling-windows-10/). Consider testing stimulus playback on a variety of computer platforms before running an experiment with participant-supplied hardware to ensure that specific platforms will not interfere with stimulus playback (see the sketch below).
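As a quick check of the onset-shaping issue described above, a minimal MATLAB sketch is given here; it plays a brief tone with and without leading silence so the experimenter can listen for onset attenuation on a given machine. The sampling rate, tone parameters, and padding duration are arbitrary choices for illustration, not validated settings.

    % Minimal sketch: listen for onset attenuation of short sounds on a test machine.
    fs   = 48000;                                   % sampling rate (Hz)
    t    = (0:round(0.3*fs)-1)'/fs;                 % 300-ms time vector
    tone = 0.1 * sin(2*pi*1000*t);                  % quiet 1-kHz tone
    sound(tone, fs);                                % short sound presented "cold"
    pause(1.5);
    sound([zeros(round(0.5*fs),1); tone], fs);      % same tone after 500 ms of silence
    % If the first presentation sounds quieter or partially missing, the playback
    % chain is shaping onsets and may distort brief experimental stimuli.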

Audio playback

Sound file formats

Sound files can be saved in many formats. For experimental purposes, most labs use wav files, which store the exact signal that will be sent to the audio device. These files are ideal for reproducing stimuli but take the most storage space. If a large amount of audio needs to be sent through an online server to the participant, you may want to consider some form of compression. The compression format you are likely most familiar with is mp3, which is designed to compress sound files by removing information that is unlikely to be important for what most people can hear in music. This compression is lossy, which means that it does not perfectly preserve all of the details of the audio signal. mp3 has been superseded by m4a/mp4; as an example, Matlab's audiowrite command can write mp4 files but not mp3s. These lossy compression formats are based on assumptions about what music usually sounds like and what people are capable of hearing, so they may create artifacts when compressing psychophysical stimuli, such as quiet or band-limited sounds. An alternative is to use lossless compression, such as flac. This will save some space, but whether the conversion of audio files to flac from whatever format the lab normally uses is worth the effort is something the experimenter will need to decide (see the sketch below).
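For illustration, here is a minimal MATLAB sketch that writes the same stimulus in the three formats discussed above; the file names and stimulus are arbitrary, and the lossy version should be auditioned for artifacts before use.

    % Sketch: save one stimulus as uncompressed WAV, lossless FLAC, and lossy M4A.
    fs = 44100;
    y  = 0.05 * randn(2*fs, 1);                           % 2 s of low-level Gaussian noise
    audiowrite('stim.wav',  y, fs, 'BitsPerSample', 24);  % exact samples, largest file
    audiowrite('stim.flac', y, fs, 'BitsPerSample', 24);  % lossless, smaller file
    audiowrite('stim.m4a',  y, fs);                       % lossy (AAC); check for artifacts
    % Note: m4a/mp4 writing is not supported by audiowrite on all operating systems.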

Providing Audio Online

When listening to most media sources online, the audio is streamed, which means that portions of the signal are sent to the listener as they listen. This ensures that the listener doesn't need to download an entire file before they can start listening and saves bandwidth if the listener skips the song or stops listening partway through. The downside of this approach is that playback can proceed faster than the listener receives the signal, which can lead to waits while the signal buffers. In most experiments this is undesirable. An alternative is to send the entire sound file in advance, which is what most JavaScript-based implementations, such as jsPsych and Matlab Web Server, do. These programs pause while sound files are sent to the participant, and do not proceed until the participant has the entire file.

Some experiments generate audio files on the fly. This is somewhat difficult in an online format, as the website backend has to be able to create sound files (something that is possible but with limited functionality in JavaScript, and by extension jsPsych) and then send those sound files to the participant. If the sound files to be created are particularly complex, consider using Matlab Web Server, because it can handle both of these needs. One thing to note is that HTML tries to be efficient with the files it sends. If an experiment renders a sound file and sends it to the participant, then the next time a sound file with the same name is requested the web browser will assume it already has that file and play the one that was originally sent. There are ways around this by making the request for a sound file look unique every time the request is made, often by adding an irrelevant seed to the request such as the current date and time (see the sketch below).

If stimuli are short (less than 45 seconds in duration), the audio can be generated on the fly within the browser using the AudioBuffer function in JavaScript. Code for stimulus generation and playback (via AudioBuffer) can be run (for example) within a jsPsych experiment using jsPsych's event-related callback functions. An example of using this functionality to implement a three-alternative forced choice modulation detection task can be found here. The psychophysical task example used this method as well.

Some experiments require audio to go to a single ear. This can be readily achieved in the lab using hardware controls and routing signals to a mono output to the left or right ear, but such control is not possible without sending participants hardware. An easy alternative is to create a mono channel audio file, then combine that mono channel with a second channel of silence in a stereo recording. This will ensure that the audio goes to one channel of a stereo device, which will cause the sound to play in only one ear of a pair of headphones (see the MATLAB sketch below).
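The caching issue mentioned above can also be side-stepped on the server side by never re-using a file name. The MATLAB sketch below (as might be used with something like Matlab Web Server) writes each rendered stimulus to a uniquely named, timestamped file so the browser has nothing cached to fall back on; the file-naming scheme and stimulus are illustrative only.

    % Sketch: give each on-the-fly stimulus a unique, timestamped file name so the
    % browser cannot substitute a previously cached copy.
    fs = 44100;
    t  = (0:fs-1)'/fs;
    y  = 0.1 * sin(2*pi*1000*t);                               % placeholder stimulus
    fname = sprintf('stim_%s.wav', datestr(now, 'yyyymmdd_HHMMSS_FFF'));
    audiowrite(fname, y, fs);      % serve this uniquely named file to the participant

The requested MATLAB example for single-ear presentation might look like the following sketch: a mono stimulus is paired with a silent channel so that, over headphones, the sound is heard in the left ear only (swap the columns for the right ear). The tone parameters are arbitrary.

    % Sketch: route a mono stimulus to the left ear by pairing it with silence.
    fs   = 44100;
    t    = (0:fs-1)'/fs;
    mono = 0.1 * sin(2*pi*500*t);                  % 1-s, 500-Hz tone
    stereo = [mono, zeros(size(mono))];            % column 1 = left, column 2 = right
    audiowrite('left_ear_only.wav', stereo, fs);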

It may be useful to record experimental sessions. On Windows computers, one default audio input device is called 'Stereo Mix', which is essentially a recording of any audio the computer is playing. If you have a video or audio call open with a research participant, recording from the Stereo Mix device will capture the audio signal from the call. Note that you should have informed consent from the participant to make these recordings. These recordings can be used post hoc to determine whether the background noise in the participant's test environment is acceptable and to re-analyze verbal responses provided during the experiment. It is also possible to record the audio the participant hears, which can be useful to determine if stimuli are distorted or to check that the participant was hearing the correct stimuli. This can be done by having the participant share, through the video or audio call, the browser window the experiment is running in along with their computer audio. That way, the audio stream the participant sends to the experimenter will contain what they were hearing, any environmental noise that their microphone picks up, and the participant's verbal responses in the same audio stream.

Video

Desktop screens

Modern computer monitors are capable of high-quality display, but there is a wide variety of video processing and rendering options at the operating system, video card, and hardware levels. Critical aspects of image rendering (e.g., colors and contrast that distinguish different stimuli from one another and from backgrounds, legibility of text, absence of video artifacts) should be checked on the test hardware prior to an experiment. Rendering accuracy can be affected by screen resolution and by anti-aliasing. At low resolutions, stimuli may appear pixelated and will lose some fidelity. Most modern computers default to reasonably good resolutions, but if a participant is using an older computer or their video drivers are misconfigured, it is possible their display will have a low resolution. Anti-aliasing is a rendering technique that smooths transitions between adjacent pixels to make a picture look higher fidelity. This is usually a good thing, but variability in anti-aliasing across displays could alter the clarity of images or the readability of text. Modern screens often include different processing modes that are optimized for games or movies. These modes vary from manufacturer to manufacturer, but usually alter the throughput delay of the screen and may notably affect the color balance of images. If a known, fixed timing between auditory and visual events is desired, as is often the case in studies of audio-visual integration, it may be necessary to use the same calibrated hardware across participants. Additionally, keep in mind that video screens have a slower update and refresh rate (usually 60 Hz) than audio devices, so there will be some variability in the timing between visual and auditory events across screens.

Placement

Participants should be seated so they are directly facing the screen. Stimuli should be designed to accommodate some back-and-forth sway of the participant's head relative to the screen, as they will not be completely still while performing a task. If attending to the visual stimulus is essential to the task, it may be helpful to have the experimenter monitor eye and head position through a video call during the experiment. Have the participant look at a fixation cross at the center of the screen and note the angle of the head and eyes on video, then monitor for obvious deviations from this position during stimulus presentation.

Tablet displays

Concerns about screens and placement apply to tablets, with the added concern that participants have more degrees of freedom for orienting their eyes and head relative to the display. Make sure stimuli are clearly visible even on small displays that are held far from the head, and consider what the experimenter should do if a participant accidentally drops the tablet or looks away while adjusting position.

Head mounted displays

Another approach to visual stimulation is the use of commercially available head-mounted displays (HMDs) intended for virtual reality (VR), such as Oculus Rift, Quest, HTC Vive, etc. Advantages of this approach include known and reproducible placement of the display and head (and possibly eye) tracking. Although the field of view is generally much narrower than natural vision, large visual displays can be simulated by tracking the head position and updating the eye-fixed display appropriately. This functionality is built into these devices and can be exploited using standard 3D game-programming techniques in development platforms such as Unity 3D and Unreal Engine. Calibration of video, tracking, and audio/video sync with HMDs is beyond the scope of this article. However, although the capabilities of a specific device type and model should be assessed, modern manufacturing tends to produce units with very similar performance (as is the case for many tablet devices), so it may be reasonable to assume a standard level of performance, and specific calibration in the field of each unit may not be required. Most head-mounted displays also feature some means to deliver audio stimulation, either through earphones attached to the unit or small HMD-mounted loudspeakers. Where possible, earphones with good passive attenuation and a direct electrical (rather than acoustical) path to the ear are preferred. These may interface directly with the HMD or with a host PC, in which case audio concerns are similar to those discussed above. Bear in mind that software-based "3D audio" features may distort the binaural and spectral features of the audio in an attempt to compensate for head movements and simulate virtual audio sources. In general, commercial 3D-audio algorithms may not be well suited to research purposes, and investigators should consider whether "3D" audio is important to the goals of the study or should be disabled. Convincing (i.e., true) 3D audio can also be achieved using loudspeakers, although the HMD will interfere to some degree with the spatial acoustics, particularly at high frequencies.

In each case, consider whether the device type is specified, provided, or bring-your-own (BYO)

Many of the above issues can be handled with good precision on known devices. If participants use their own devices, consider adding quick checks at the beginning of the experiment to ensure essential details are visible. It may also help to ask participants if they use any additional video processing (e.g. accessibility options or night display modes) and to disable that processing if it interferes with the experimental stimuli.

Compatibility with clinical devices

Hearing aids

Earphones are generally not an option for aided listening. If a participant's audiogram is known in advance, stimuli presented through earphones can be amplified to improve audibility when listening without hearing aids. The experimenter should take care to check that amplification does not produce uncomfortably loud stimuli. Participants with hearing aids that fit entirely in the ear canal may be able to use their hearing aids while listening through earphones, although this should be tested to check for undesired physical or acoustic interactions between the earphone and the hearing aid. Loudspeaker presentation is an option for aided listening, although the experimenter should take care to ensure that participants are oriented consistently relative to the loudspeaker to avoid confounding differences in hearing aid directionality. Some hearing aids also have streaming capabilities through Bluetooth, which could enable a direct connection from a computer to the hearing aid. It may be helpful to obtain permission from participants to contact their audiologists for information on how the device is programmed, as various settings (noise reduction algorithms, directional microphones, compression) will alter how acoustic stimuli are processed across individuals.

Cochlear implants

Similar to hearing aids, headphones do not provide good aided listening for participants with cochlear implants. However, there are published studies that used circumaural headphones to present stimuli (Grantham et al., 2008 and Goupell et al., 2018). Loudspeaker presentation is an option, and some cochlear implants have direct connection audio jacks and/or Bluetooth streaming capabilities. Implants tend to process a narrower frequency range than acoustic hearing, so at-home audio devices (e.g., laptop speakers) may have a sufficient frequency response in the range that the cochlear implant processes, but this should be experimentally verified.

Other devices

The advice for hearing aids and cochlear implants may generalize to other assistive devices (e.g., bone-anchored hearing aids, auditory brainstem implants), but it is up to the experimenter to determine whether stimuli are being heard as intended. If you have experience with remote testing of specific devices, please share your advice here.




Issues related to collection of response data

Remote testing, as understood by this Task Force, involves the collection of data ("responses") from research participants interacting with a response device such as a paper form, web survey, tablet app, or VR game. This article attempts to describe some considerations that might apply to the collection and processing of response data.

Comparison to in-lab response collection (see also Task Performance)

Assuming that appropriate hardware/software resources can be provided to the participant, the types of response data that can be collected during remote testing do not, in principle, differ from those available during in-lab testing. However, the types of response data that are most easily accessed will depend on the remote-testing platform and the types of tasks the platform is intended to implement.

Thus, a superficial but useful distinction can be made between two major types of tasks:

  • survey-based tasks include a series of different question/answer (or stimulus/response) pairs
  • trial-based tasks present a repeating series of similar stimulus/response pairs

For example, a typical survey might include a series of questions and question-specific response choices:

  1. How loud would you consider your workplace? [Scale of 0-100]
  2. How often do you use hearing protection at work? [Five-point Likert scale, "Never" to "Always"]
  3. Does your workplace provide earplugs? [Yes / No]
  4. List the main noise sources present in your workplace: [free response]

A typical trial-based task would present the same type of trial repeatedly to assess a distribution of responses:

  1. Select the word "sidewalk"
  2. Select the word "baseball"
  3. Select the word "workbench"


Note that either type of task could be administered in the lab or through remote testing. Many web-based platforms, however, are oriented primarily toward survey-based tasks. Trial-based tasks are more often implemented as standalone programs (e.g. MATLAB, PC, or tablet apps). Luckily this is not likely to be an issue for remote testing; most trial-based tasks can be easily reframed as survey-based tasks (by treating each "trial" as a survey "question"), although platforms vary in their support for common trial-based approaches such as randomized presentation order, adaptive presentation (using performance to select the next trial), etc.
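To make the distinction concrete, the sketch below shows the kind of adaptive, trial-based logic (a simple 2-down/1-up staircase) that survey-oriented platforms may not support natively. The function presentTrial is hypothetical and stands in for whatever stimulus playback and response collection a given platform provides; all parameter values are illustrative.

    % Minimal sketch of a 2-down/1-up adaptive track (illustrative parameters).
    level = 60;            % starting stimulus level (arbitrary units)
    step  = 4;             % step size
    nTrials = 30;
    nCorrect = 0;
    for k = 1:nTrials
        isCorrect = presentTrial(level);   % HYPOTHETICAL: play a trial, score the response
        if isCorrect
            nCorrect = nCorrect + 1;
            if nCorrect == 2               % two correct in a row: make the task harder
                level = level - step;
                nCorrect = 0;
            end
        else                               % any error: make the task easier
            level = level + step;
            nCorrect = 0;
        end
    end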

Another difference between survey- and trial-based tasks has to do with whether individual participants complete the task once (as typical for a survey) or many times (as typical for trial-based tasks). Different considerations may apply to data handling (managing one vs. many data files per participant), counter-balancing conditions across repeated trial-based runs, randomizing question order across survey versions assigned to different participants, etc. Platforms may vary in their suitability for administering a survey task once to each of many (possibly anonymous) participants versus tracking a smaller number of participants across multiple sessions of trial-based tasks.

Types of response data that may be collected during remote testing

Most relevant to the purpose of this article are the types of response data collected during survey- vs. trial-based tasks. There is no hard distinction between these, and most (all?) response types (multiple choice, rating scale, head pointing, pupil dilation) could, in principle, be used in either survey- or trial-based tasks. However, certain types of responses are commonly encountered in survey-based tasks, and these are the ones most commonly supported by remote-testing platforms.

Note that the availability of specific response devices (buttons, sliders, etc.) and response data types may be limited by platform. For example, Gorilla (see Platforms) supports the following response types:

  • Continue Button
  • Response Button (optionally featuring Text, Paragraph, or Image content)
  • Rating Scale/Likert
  • Response Slider
  • Dropdown
  • Keyboard Response (Single or Multi key press)
  • Keyboard Space to Continue
  • Response Text Entry (Single/Multi line / Area)

Some of these options can provide immediate response processing (e.g. response recorded when button clicked), which may support some degree of timing data or even conditional/adaptive processing. Other response types (e.g. text entry) require a second step, such as "click to continue" to record the response.

Other platforms, particularly non-browser platforms such as PC or tablet apps, may offer a wider range of response types, including:

  • Touch Response (one or two dimensions)
  • Multi-touch / Gesture Response (e.g. swipe left or right)
  • Tilt / Acceleration Measures
  • Special Hardware Support
    • Gamepad/Joystick
    • Wiimote
    • Camera or Depth Camera
    • Tracked Controllers (head-mounted display / VR touch controllers)
    • Physiological Sensors (heart rate, GSR, EKG, EEG, pupilometry, eye-tracking etc)


Platforms that provide additional support for accessing on-device sensors or hardware controllers include:

  • Most PC or custom app frameworks (MATLAB, Windows, Android Studio, xCode, Unity, etc.)
  • Some commercial platforms for physiological measurement, audiology, telehealth

Speech / Audio response collection

In-lab testing of speech perception (for example) often combines open-set responding ("Repeat the word BLUE") with in-person scoring by a human observer. Some platforms allow synchronous interaction between experimenter and remote participant which can support a similar approach. However, low quality audio/video streaming, dropouts, or distortion might disrupt accurate scoring. A few approaches may be used to support open-set data collection:

  • Audio or AV recording and storage of responses for later verification
    • Potential challenges: defining the response window, storing/transmitting audio/AV data files, ensuring participant privacy.
  • Redundant response collection or self-scoring after feedback (e.g. "Say the word, then type it", or "Check box if you were correct")
    • Potential challenges: reconciling mismatched responses

Response Calibration (see also Calibration)

Although most survey-type responses should be interpretable in an absolute sense and thus require no calibration to determine their value, some continuously variable response data (for example, from touch displays or tilt/force sensors) may require psychophysical or hardware calibration. See Calibration for more details.




Stimulus Calibration

Audio Stimuli Calibration

Within the scope of psychological and physiological acoustics, the goal of many experiments using remote testing is to collect responses to a set of acoustic stimuli from test subjects at remote sites. It is crucial to calibrate the stimuli so that the obtained responses are not dominated by variations in presentation levels and specific transducers (e.g., earphones, loudspeakers, etc.). Appropriate calibration of the sound pressure level is especially important because (1) the experimenter has to make sure that the stimuli are audible to the subjects; (2) the sound level needs to be kept within a safety limit (see Compliance); and (3) many auditory and speech perception phenomena are level dependent.

Remote testing poses many challenges to the calibration of acoustic stimuli, mostly because the experimenter may not have full knowledge of the sound delivery system and environment at the subject's end. Therefore, when selecting an appropriate platform for a remote experiment, one of the first considerations should be how critical calibration is to the research question under investigation (see PlatformConsiderations). Some platform-specific information with regard to calibration can be found under HardwareAndCalibration. General approaches for addressing this problem include:

Approach 0, No formal calibration: Many experiments involving supra-threshold and/or high-level processes may be conducted with limited influence from inexact audio calibration. In such cases, a browser-based experimentation platform (e.g., Gorilla, jsPsych, PsychoPy; see PlatformDescriptions for descriptions of these platforms) may be preferable, because these platforms, although they do not provide precise calibration of audio presentation, can be cost effective and hence enable a relatively large sample size. It should be noted that even for experiments without formal calibration, simple verifications are recommended to ensure the audibility of the stimuli and listening safety.

Approach 1, Electroacoustic calibration: Experiments that involve subjects with hearing impairment or address research questions that are expected to depend on stimulus characteristics such as level and spectrum may consider an experimentation platform involving sending pre-calibrated systems (e.g., tablets with headphones) to the subjects. Some examples of research platforms in this category include hearX, PART, etc. (see PlatformDescriptions for descriptions of these platforms). Although Approach 1 enables precise control of the test stimuli, the logistics (i.e., teaching subjects to use the equipment, shipping and receiving the equipment, troubleshooting remotely, and answering ongoing questions) may be time-consuming and costly.

Approach 2, Limited calibration involving reports from the subject: Besides the two scenarios described above, there may be situations in which some degree of control over the stimulus characteristics is preferred but not critical. In these cases, browser-based platforms may be used, and moderate degrees of stimulus control may be achieved with the involvement of the subject. For example, the subject may be instructed to report the manufacturers and models of the devices used in the experiment. Calibration can then be completed using the known specifications for the devices.

Approach 3, Limited calibration using psychophysical techniques: One specific technique for calibrating the acoustic stimuli with the participation of the subjects is psychophysical, or perceptual, calibration. This calibration method is especially useful for experiments aimed at healthy, normal-hearing adults as subjects. For this population, normative performance ranges (and psychophysical models) are well established for many basic auditory tasks. Incorporating some of these basic auditory tasks into the experimental protocol provides a way to probe the fidelity of the stimulus presentation system and the background noise level at the subject's end. Additionally, there are many binaural phenomena that require the appropriate placement of headphones (binaural beating, binaural masking level difference, binaural pitch, etc.). These tasks may be implemented to check whether headphones are being used and correctly placed during the experiment.

Approach 1, Electroacoustic Calibration

For this approach, the experimenter will be responsible for calibrating the hardware systems before sending them to the subjects. In this case, the calibration procedure should follow the relevant ISO/ANSI standard (e.g., ANSI S3.7) and/or the best practice of the field. Some commercially available platforms include hardware calibration service in their annual subscriptions (e.g., hearX and SHOEBOX).

* Calibration setup: A typical setup for level calibration of headphones consists of (1) a coupler (or artificial ear) for the type of transducer used and (2) a sound level meter with its measurement microphone attached to the coupler. The purpose of the coupler is to simulate the impedance of the ear canal, so that the readings from the sound level meter approximate the sound pressure level that would be measured at the eardrum of a typical subject. Depending on the type of transducer (e.g., supra-aural earphones, circum-aural earphones, insert earphones, or hearing aids), a corresponding type of coupler should be used. As a rule of thumb, supra- and circum-aural earphones require couplers with a larger volume (~6 cc), while insert earphones and insert-type hearing aid receivers require couplers with a smaller volume (~2 cc).

* Calibration stimuli: Most stimulus presentation systems (e.g., headphones connected to sound cards) are designed to be linear. Pure tones at various frequencies are typically used as the calibration stimuli. With the calibration tone presented above the noise floor, the correspondence between the rms amplitude of the digital signal and the measured sound pressure level can be found (a minimal sketch appears after this list). In some special circumstances, the stimuli are presented via a device with nonlinear dynamic processing (e.g., digital hearing aids). Such systems provide different amounts of amplification in a frequency- and intensity-dependent fashion. For such applications, the stimuli used for calibration should resemble the level and spectrotemporal characteristics of the test stimuli. For example, for a speech-recognition experiment, the International Speech Test Signal (ISTS), presented at the same level as the test speech in the experiment, may be used as the calibration stimulus.

* Device verification by the subject: For an experiment that utilizes multiple experimental devices (e.g., tablet-headphone pairs), each device should be properly identified (e.g., by a device ID) and calibrated separately. During the experiment, it is useful to have the subject report the device ID to ensure accurate pairing of the calibration data and the device. Even when the testing systems are appropriately calibrated at the site of the experimenter, the stimuli may still be off-calibration due to damage to the device during shipment, improper connection between components of the device (e.g., the headphones are disconnected from the tablet), the wrong experimental software being run, or improper placement of the headphones. To prevent such incidents, a simple psychophysical verification procedure may be used, which may include verifying whether stimuli presented at suprathreshold levels are indeed audible to the subject and whether the stimuli delivered through the two sides of the headphones are properly balanced and synchronized for binaural experiments.

* Beyond level calibration: Besides calibrating the presentation level, other electroacoustic measurements may also be useful and informative to the experimenter. Some common measurements include the dynamic range, frequency response, and crosstalk between channels.

* Dynamic range: The dynamic range is the range of levels within which the test stimuli can be presented without significant distortion. The lower limit of the dynamic range is typically the noise floor, i.e., the output level when no stimulus is presented. It is worth pointing out that the noise floor is not the reading from the sound level meter when the system is powered off; rather, it is the noise level expected during stimulus presentation. It is recommended that an all-zero array be used as the "calibration stimulus" during the measurement of the noise floor. The upper limit of the dynamic range is the maximum output level without distortion (e.g., clipping). The experimenter should choose devices and transducers so that the test stimuli are not too close to either the lower or upper limit of the dynamic range. In other words, an experimental system with a larger dynamic range will accommodate a greater variety of experiments.

* Frequency response: The frequency response is the response of the testing system to unit input as a function of frequency. An ideal sound delivery system would have a flat frequency response, so that it does not introduce additional coloration to the experimental stimuli. The frequency response is typically measured by analyzing the response to a broadband stimulus. Depending on the specific methodology used, the test stimuli may be a sequence of pure tones, a frequency-varying tone sweep, a Gaussian noise, a Maximum-Length Sequence, or Golay codes.

* Crosstalk between channels: Crosstalk refers to the signal leakage from one channel to another. Crosstalk between channels can be problematic for auditory experiments when the stimulus meant for one test ear is audible from the headphone on the opposite ear. Portable testing platforms may be more subject to crosstalk because (1) the left and right channels of common portable computers and mobile phones usually share a ground return, and (2) low-impedance earphones are typically used with portable devices to ensure sufficiently high output levels. Crosstalk is measured by applying a test signal to one channel, measuring that signal's level in the other channel, and then expressing the measured level as a ratio (in dB) relative to the source signal. Since crosstalk is often frequency dependent, the measurement is typically conducted as a function of frequency.

In-situ or self calibration: For send-home testing systems that also carry built-in microphones, the microphone on the device can be calibrated so that it can be used for in-situ or self calibration. In-situ calibration is required for most experiments that involve free-field stimulus presentation (using loudspeakers). In such cases, the microphone should be positioned near where the subject's head will be during the experiment. A calibration stimulus is then presented via the loudspeaker(s), and the sound level or frequency response can be derived from the recording of the stimulus made with the microphone. Another application of built-in microphones is to measure the noise level (or the power spectrum of the ambient noise) in the subject's test environment (see the sketch below). It should be noted that the raw audio recordings made during in-situ measurements should not be stored without an appropriate IRB approval and the subject's consent (see Compliance).
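As a concrete illustration of relating digital rms amplitude to measured sound pressure level (referenced in the Calibration stimuli item above), the following MATLAB sketch writes a 1-kHz calibration tone at a known dBFS value; the meter reading entered afterwards is a made-up example value, and the resulting offset applies only to one specific device/transducer combination.

    % Sketch: derive a dBFS-to-dB SPL conversion from one coupler measurement.
    fs = 48000;
    t  = (0:10*fs-1)'/fs;                                 % 10-s calibration tone
    calTone = 10^(-20/20) * sqrt(2) * sin(2*pi*1000*t);   % 1 kHz at -20 dBFS rms
    audiowrite('cal_tone_m20dBFS.wav', calTone, fs);      % play this during the measurement
    meterReading_dB = 85;                 % EXAMPLE sound level meter reading, dB SPL
    calOffset_dB = meterReading_dB - (-20);   % dB SPL produced by a 0 dBFS rms signal

A built-in microphone can likewise be used for a rough ambient-noise check, sketched below. The calibration offset is a hypothetical, device-specific constant obtained beforehand; without it, only levels relative to digital full scale are meaningful. Any such recording is subject to the IRB/consent caveat noted above.

    % Sketch: estimate the ambient noise level with a device's built-in microphone.
    fs  = 44100;
    rec = audiorecorder(fs, 16, 1);
    recordblocking(rec, 5);                       % record 5 s of room noise
    y = getaudiodata(rec);
    L_dBFS = 20*log10(sqrt(mean(y.^2)));          % rms level re: digital full scale
    micOffset_dB = 100;                           % HYPOTHETICAL mic calibration constant
    L_SPL = L_dBFS + micOffset_dB;                % approximate ambient level, dB SPL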

Approach 2, Limited calibration involving reports from the subject

For this approach, the subject reports the manufacturer and model of the devices used in the experiment, and the experimenter configures the stimuli based on the available specifications for these devices. For examples of databases of headphone specifications, see Earphones. It should be noted that some of the available headphone specifications are measured without the use of a coupler.

Level calibration based on headphone sensitivity: The most relevant specification for level calibration is the headphones' sensitivity, i.e., the sound pressure level produced by the headphones for a unit input voltage (1 volt) at 1 kHz. Therefore, if the maximum output level of the sound card (the output voltage in dBu at 0 dBFS; 0 dBu = sqrt(0.6) V) is also available, the maximum output sound pressure level can be derived and the conversion from dBFS to dB SPL can be achieved. Sometimes the sensitivity of the headphones is given for a unit input power (1 milliwatt). In this case, the sound pressure level needs to be calculated based on both the voltage that the headphones receive and the headphones' impedance. For example, if the sound card has a maximum output level of 4 dBu at 0 dBFS, the sensitivity of the headphones is 102 dB SPL/mW, and the impedance of the headphones is 64 Ohm, then the maximum output voltage is sqrt(0.6)*10^(4/20) = 1.23 V, the maximum output power is (1.23)^2/64 = 0.0236 W = 23.6 mW, and the maximum output sound pressure level (corresponding to 0 dBFS) is 102 + 10*log10(23.6/1) = 115.7 dB SPL.
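The worked example above can be expressed as a short MATLAB sketch; the sound card and headphone values are the same illustrative numbers used in the text, not measurements of any particular device.

    % Sketch: maximum output SPL from sound card output level and headphone specs.
    dBu_at_0dBFS  = 4;      % sound card maximum output, dBu
    sens_dBSPL_mW = 102;    % headphone sensitivity, dB SPL for 1 mW input
    Z_ohm         = 64;     % headphone impedance, Ohms
    Vmax    = sqrt(0.6) * 10^(dBu_at_0dBFS/20);     % max output voltage (~1.23 V)
    Pmax_mW = 1000 * Vmax^2 / Z_ohm;                % max output power (~23.6 mW)
    maxSPL  = sens_dBSPL_mW + 10*log10(Pmax_mW);    % ~115.7 dB SPL at 0 dBFS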

Approach 3, Limited calibration using psychophysical techniques:

For this approach, a psychophysical procedure is conducted for the purpose of calibration and system verification.

Loudness-based level adjustment: One of the simplest ways to perform perceptual calibration is to instruct the subject to adjust the volume control of the sound delivery system so that the test stimuli are presented at a most comfortable level. Alternatively, the subject may adjust the level of an anchor stimulus to the most comfortable level, and the presentation levels of the test stimuli are then set relative to that of the anchor stimulus. This is a relatively quick procedure, typically taking 2-3 minutes. For pure tones presented in quiet, the expected standard deviation in listeners’ most comfortable loudness (MCL) levels is typically greater than 10 dB (see Punch et al., 2004, for detailed discussion of the various measurement considerations). Besides adjusting a stimulus to the most comfortable loudness level, a procedure measuring loudness growth at 1 kHz may be conducted for calibration purposes. The loudness growth function at 1 kHz, i.e., how loudness rating grows with stimulus level, is well established for normal-hearing adults, and the procedure for measuring it has been standardized (ISO 16832). It is therefore possible to first measure the loudness growth function using uncalibrated levels and then compare the obtained data with published normative results to estimate the conversion factor for calibration.

Threshold-based calibration (sensation level): For an experimental system with unknown calibration, absolute thresholds may first be measured for the test stimuli in an uncalibrated, arbitrary unit. The test stimuli can then be presented at a desired sensation level (in dB SL); 10 dB SL indicates a level 10 dB above the absolute threshold for the stimulus. The advantage of configuring stimulus levels in dB SL is that audibility of the stimuli is ensured for each individual subject and for their specific sound delivery system. However, there are a few disadvantages associated with this approach that researchers need to consider. First, threshold measurements add testing time. Second, the subject’s testing system and environment may be sub-optimal for threshold measurements, and the measured thresholds may be dominated by masking from electronic or ambient noise. Third, for a sound delivery system with a limited dynamic range, it may be difficult to both conduct threshold measurements and present stimuli at high sensation levels. For example, consider an experiment with a pure-tone stimulus at 1 kHz and 50 dB SL. Measuring the absolute threshold for the tone requires adjusting the volume control of the system so that the noise floor is reasonably lower than the threshold (the hardware noise should not be audible). If the noise floor is 10 dB below the absolute threshold, then presenting the tone at 50 dB SL requires at least 60 dB of dynamic range. Last, for subjects with more than moderate degrees of hearing loss, setting the stimulus level at a fixed sensation level may lead to very high, unsafe sound pressure levels (if still within the dynamic range of the system) or severe distortion (if the level exceeds the dynamic range).

Headphone verification using binaural phenomena: In many experiments, it is crucial to verify that the subject is wearing headphones with the correct orientation during the experiment. This can be achieved by conducting a psychophysical procedure involving an auditory percept (such as binaural pitch) that is strongly dependent on specific interaural phase relationships. Examples of using binaural phenomena to verify headphone connection/placement are Woods et al. (2017) and Milne et al. (2020).
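As a rough illustration of the kind of stimuli such screens rely on (not a reproduction of the published procedures), the following sketch generates a diotic low-frequency tone and a corresponding antiphasic tone. Over headphones the two are similar in loudness, whereas over loudspeakers the antiphasic version is partially cancelled at the listener’s ears, so a simple loudness judgment can flag participants who are not actually using headphones. All parameter values are illustrative.

```python
import numpy as np

def tone_pair(f0=200.0, dur=1.0, fs=44100, antiphase=False):
    """Return a stereo tone (N x 2 array): identical in both ears, or with the
    right channel inverted (180 degrees out of phase) for a headphone screen."""
    t = np.arange(int(dur * fs)) / fs
    left = 0.1 * np.sin(2 * np.pi * f0 * t)
    right = -left if antiphase else left
    return np.column_stack([left, right])

diotic = tone_pair(antiphase=False)      # similar loudness over speakers and headphones
antiphasic = tone_pair(antiphase=True)   # partially cancels over speakers, not over headphones
```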

Visual Stimuli Calibration

Dimensions of calibration

* Color

* Luminance

* Contrast

* Spatial variation

Relevant literature on calibrating and validating consumer displays for vision testing:

Kollbaum, P. S., Jansen, M. E., Kollbaum, E. J., & Bullimore, M. A. (2014). Validation of an iPad test of letter contrast sensitivity. Optometry and Vision Science, 91(3), 291-296.

Dorr, M., Lesmes, L. A., Lu, Z. L., & Bex, P. J. (2013). Rapid and reliable assessment of the contrast sensitivity function on an iPad. Investigative Ophthalmology & Visual Science, 54(12), 7266-7273.

de Fez, D., Luque, M. J., García-Domene, M. C., Camps, V., & Piñero, D. (2016). Colorimetric characterization of mobile devices for vision applications. Optometry and Vision Science, 93(1), 85-93.

Dorr, M., Lesmes, L. A., Elze, T., Wang, H., Lu, Z. L., & Bex, P. J. (2017). Evaluation of the precision of contrast sensitivity function assessment on a tablet device. Scientific Reports, 7, 46706.

Ozgur, O. K., Emborgo, T. S., Vieyra, M. B., Huselid, R. F., & Banik, R. (2018). Validity and acceptance of color vision testing on smartphones. Journal of Neuro-ophthalmology, 38(1), 13-16.

Audiovisual Stimuli Calibration

In many situations, speech stimuli are presented in both the auditory and visual modalities and the subjects’ abilities to recognize speech are assessed. The synchronization between auditory and visual speech cues can influence speech understanding; it is therefore crucial to ensure synchronization between the audio and video displays. In most cases, AV speech is stored in a compressed file in order to constrain the file size. A compressed video file consists of both audio and video signals compressed using separate codecs. During an experiment, when a compressed video file is presented, the hardware on the subject’s end needs to decode both the audio and video portions of the file, which may cause unintended asynchronies between the audio and video displays. Because the amount of AV asynchrony on the subject’s end is often unpredictable, remote experiments involving AV speech stimuli generally require shipping pre-calibrated systems (e.g., tablet + headphones) to the subjects. The calibration procedure consists of three steps: (1) audio stimuli calibration, (2) visual stimuli calibration, and (3) a test of AV synchronization. The first two steps can be conducted following the procedures described in the preceding sections. Here, an example procedure for measuring AV synchronization is described.

Measuring AV synchronization: If possible, the stimulus files should be stored locally on the mobile testing system, and the applications running in the background should be kept to a minimum. Then, a calibration stimulus is presented via the same software environment as in the experiment. The calibration stimulus is an AV file with the same specifications as the experimental stimuli (i.e., the same audio and video codecs, the same screen size, etc.). The calibration stimulus is generated so that the audio and video components share common onsets; for example, the stimulus may be periodic audio clicks and video flashes with the same onset times. During presentation of the test stimulus, a passive photo sensor is placed on the screen, and the outputs from the photo sensor and from the sound card are fed into the two channels of an oscilloscope. The asynchrony between the audio and video outputs, in milliseconds, can then be measured. This procedure should be repeated a few times to check the consistency of the AV asynchrony. As an alternative to clicks and flashes, the amplitudes of the audio and video signals can be defined analytically by the same sine function. The AV synchrony can then be verified by viewing the audio and video outputs for this calibration stimulus using the XY display mode of the oscilloscope.

Synchronizing AV stimuli: Once the average asynchrony (in milliseconds) is measured, the stimulus files can be modified to compensate for it. This means delaying the audio component if the audio output leads the video, or delaying the video component if the video output leads the audio. This can be done in video editing software (e.g., Final Cut or Adobe Premiere). First, import the original stimulus file into the video editing software. Then, apply a delay corresponding to the measured asynchrony to the appropriate stream. Finally, export a new stimulus file using the original codecs. This compensation procedure can also be applied to the calibration stimulus. When the synchrony measurement is repeated using the modified calibration stimulus, the average asynchrony should be near 0 ms. When the sine function is used as the calibration stimulus, then after compensation the audio and video outputs, viewed in the XY mode of the oscilloscope, should form a diagonal line rather than an ellipse or circle, indicating that the two outputs are in phase.

Response Calibration (see also Response)

An advantage of most survey-type responses used in remote testing (buttons, multiple choice, rating scale) is that each response should be interpretable in an absolute sense, requiring no calibration to determine its value. That is, clicking "Yes" has the same meaning in every session and for every participant. Some response data, however, require additional calibration.

Calibration of participant response scale refers to the calibration of response scales or ranges across sessions and/or participants. This can be accomplished by instruction (e.g., using a labeled Likert scale, identifying endpoints such as "Inaudible" to "Painfully Loud", etc.). Similar considerations would seem to apply to both in-lab and remote testing, although the need for clear instruction may be more acute in remote testing.

Calibration of response hardware may be desired or necessary for some types of on-device sensors (touch displays, tilt/force sensors) and external hardware (pointing devices, physiological measurements).

  • Psychophysical calibration optional: In some cases, a rough calibration can be assumed with confidence; for example, the X and Y coordinates of a tablet-based touch response should be reported in standard units (pixels, perhaps) with only a small offset away from the actual finger location. In such cases, a psychophysical procedure may be included to verify the calibration. For example, at the start of testing, the participant could be asked to touch a series of target locations indicated visually on the display. Offsets from the expected values can be measured to confirm or adjust the calibration, although any measured offset will incorporate contributions of both the response hardware and the response biases of the participant (e.g., reaching with the right hand introducing a rightward touch bias); a minimal sketch of such an offset check appears after this list.
  • Psychophysical calibration required: In other cases, a calibration step will be necessary to interpret the response values at all. For example, a head-mounted display could be used to measure head orientation in a pointing task. The angle reported by the device is relative to its position when initially set up, and may vary considerably across sessions and participants. A calibration task can be used to determine and correct for the values reported for known and repeatable target directions (e.g. straight ahead). As noted above, measured offsets will include both hardware and participant contributions. Access to hardware-based calibration (e.g. a physical target) may help to isolate these contributions but may be difficult to implement in remote-testing scenarios.
  • Hardware calibration required: Note that some devices may require more detailed calibration to a physical standard in order to provide reliable data (generally, absolute measures). Such cases are likely to be highly dependent on the specific device and calibrator, and may be difficult to implement in remote-testing scenarios.
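For the touch-response example above, a minimal offset check might look like the following sketch (numpy); array shapes and names are illustrative.

```python
import numpy as np

def touch_offset(targets, touches):
    """Estimate the mean X/Y offset (in pixels) between target locations and
    recorded touch responses from a brief calibration task.

    targets, touches : arrays of shape (n_trials, 2)
    """
    offsets = np.asarray(touches, dtype=float) - np.asarray(targets, dtype=float)
    mean_offset = offsets.mean(axis=0)   # systematic bias to subtract from later touches
    spread = offsets.std(axis=0)         # trial-to-trial variability of the responses
    return mean_offset, spread
```

Subsequent touch coordinates can then be corrected by subtracting the mean offset, keeping in mind that the offset mixes hardware error with the participant's response bias.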



Issues related to participants’ performance of the required tasks

Potential effects of the testing context

When testing outside a sound booth, it is important to consider both the cognitive and acoustic effects that the testing environment may have on task performance. While it is possible for remote testing to be conducted in environments with limited distractions, it is also possible that participants may not be alone in the testing environment or that the environment has distracting elements. Further, the participant could attempt to multitask during testing. For headphone-based studies, passive noise-attenuating headphones may be advantageous, as may the use of a moderate-level masking noise. For remote testing with loudspeakers, background noise, room acoustics, and the positioning of the speakers can influence performance. The use of a moderate-level masking noise to overcome background noise may inconvenience individuals near the participant. Brungart et al. (in press) measured speech perception in crowded public spaces while simultaneously measuring the background noise level on every trial; they were then able to compare performance as a function of the background noise level.

Communicating instructions

In conventional testing environments, after explaining the task, participants often have the opportunity to ask questions. Further, it is often possible to observe the data in near real time, allowing the experimenter to correct obviously incorrect behavior. During remote testing, this is often not the case. Subjects may ignore simple instructions, such as ensuring that headphones are placed on the correct ears, and may not fully comprehend more complicated instructions. Multiple versions of the instructions may be required depending on the number of platforms the experiment is compatible with and whether all subjects are fluent in the same language.

Age considerations

Apart from standard considerations related to the relationship between age, hearing impairment, and cognitive decline, remote testing performance may depend upon comfort and skill in using a computer or tablet. Auditory remote testing presents a unique challenge, since there is a complex relationship between computer skills, hearing impairment, and age (see Henshaw et al., 2012).

Linguistic considerations (translation, etc)

Remote testing provides access to participant populations who speak a much wider variety of languages than may be available in a traditional single-site experimental design. While this can provide benefits, it can also affect performance on the task if the testing material is not appropriately translated and adapted for the range of languages to be tested.

Technological literacy of participants

Remote testing may involve diverse levels of participants’ familiarity, experience, and facility with the technologies employed. Careful consideration should be given to maximizing accessibility across the targeted population when selecting a platform and designing a study. Consider, for example, a tablet preconfigured to run a single app with settings for that participant versus a laptop that requires signing in to wi-fi, downloading an update, and saving/uploading a data file. A typically developing college-age cohort might reasonably be expected to complete either study with minimal extra intervention (see participant administration), but the first option might be appropriate for a broader cohort. The latter approach also risks confounding results by reducing the likelihood that some subject groups complete the full task and/or by introducing additional cognitive load unrelated to the research question.

Supervision of performance

Remote testing presents challenges for the experimenter to supervise the participant and their performance. Adequate supervision of the participants can be critical for keeping the participant motivated and engaged with the task. Supervision can also be critical for identifying situations where the participant may be confused with the instructions. Finally, while some experimental procedures can be fully automated, in some cases experimenter intervention is critical. For example, open-set speech-in-noise testing either requires the participants to self-score their performance or for the experimenter to observe the responses in near real time. Supervision of remote testing can range from the experimenter being at the remote testing site, either physically or virtually, to automated help systems being built into the testing platforms, to asynchronous supervision and help via phone, etc.

Evaluation of participant experience

Remote testing participant populations potentially have a wider range of experience with auditory and/or behavioral testing. The experience of the participant with standard behavioral paradigms may impact the performance on the task.




Data Handling

Kinds of data

Remote testing is associated with various kinds of data that need to be exchanged between participants and researchers. These typically include

  • Participant information, including personally identifiable information (PII) and protected health information (PHI) that may be subject to regulatory compliance constraints
  • Stimuli (e.g., audio, image and video data) and experiment parameters. Although often fixed across participants, these may be individualized, for instance when participants are randomly assigned to different "conditions", or if the measurements are adaptive.
  • Response data, which will likely be the bulk of the data that is of interest to the experimenters. Access to single-trial response data may be needed during testing for progress monitoring and verification of task compliance, to provide feedback to participants, and/or to calculate summary performance metrics to make decisions about task flow. The full set of final responses will then need to be assembled for detailed analyses. Long-term archival and sharing with collaborators or the broader research community are also considerations that may apply.

In some cases PII may incidentally be linked to the response data, thus requiring special considerations. For instance, when verbal responses are recorded, or if there is live video interaction between the participant and experimenter, raw audio/video data will contain identifying information.

Server-side versus client-side data handling

There is considerable technical flexibility in whether the data handling is done on the server side (i.e., online in the "cloud") or on the client side (i.e., on the participant device). Different platforms used for remote testing fall somewhere along the wide spectrum between fully server-side and fully client-side data handling. More information about data handling in particular remote testing platforms may be found in the platform descriptions pages. The examples section features some solutions adopted by different labs for various data handling needs. The balance typically differs depending on whether the platform used for remote testing runs as a standalone application on the participant device (e.g., tablet/smartphone apps, or installable desktop/laptop applications) or is served to a browser.

For standalone apps, after initial installation and a one-time download of material from the server, the stimuli and experiment parameters are typically stored on the participant device. Although apps can communicate with the server and handle data on the server side if desired, it is more common for most computations (e.g., to monitor progress, provide feedback, control task flow, etc.) to be done on the participant device within the installed app without the need to communicate with the server. The upload of the full final response dataset may then be done once at the end of the task/study, either automatically within the app or manually by the participant. For instance, data uploads could be scheduled to occur after each session when the app is closed. Transfer of response data files may also be done manually outside of the app (e.g., participant upload via e-mail or web browser). While manual uploads allow the app to be more minimal in its feature set, this comes with some reduction in standardization of data handling and challenges to logging of data uploads.

For browser-based platforms, the data handling primarily tends to be done on the server side. It is typical for stimuli and experiment parameters to be stored on the server and loaded to the participant’s browser "just in time" for administering the task. When using JavaScript to control the task, the stimuli may be pre-loaded when the task pages are first requested by the browser to build the so-called document object model. The response data may also be aggregated by the JavaScript code in browser memory and transferred to the server once at the end, or on a trial-by-trial basis in real time using asynchronous requests. When only HTML/CSS is used to control the task flow, conventional synchronous HTTP requests are needed to load the stimuli for each trial and write the data to the server; the computations for controlling the flow logic are in that case done on the server. Finally, unlike app-based platforms where the device tends to be associated with a unique participant, browser-based remote testing will need to employ either anonymous sessions using browser cookies or secure authentication/login information to track an individual participant. Although long-term storage on the participant device is also possible for browser-based testing through the use of cookies with long lifetimes, there is some risk that cookies may be cleared by the participant or other individuals using the browser (without this state change being known to the server).
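As a concrete, deliberately minimal illustration of trial-by-trial, server-side data handling for a custom browser-based study, the sketch below uses Flask purely as an example web framework; the endpoint name, payload fields, and flat-file storage are illustrative assumptions rather than features of any particular remote-testing platform.

```python
import json
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/trial", methods=["POST"])        # hypothetical endpoint name
def save_trial():
    """Receive one trial's response data (sent asynchronously from the
    participant's browser) and append it to server-side storage."""
    trial = request.get_json()                 # e.g., {"subject": "S01", "trial": 12, "resp": "yes", "rt_ms": 840}
    with open("responses.jsonl", "a") as f:    # a real deployment would typically write to a database
        f.write(json.dumps(trial) + "\n")
    return jsonify(status="ok")
```

The browser-side JavaScript would send one such request per trial, so partial data survive even if the participant abandons the session.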

One advantage of handling data primarily on the client side (i.e., on the participant device) is that internet access is not necessary except for the initial download of the app and task material and the upload of data at the end. Furthermore, when computations are done on the client side with pre-loaded stimuli, better timing control can be achieved than when loading stimuli from the server on a trial-by-trial basis. Another advantage of client-side data handling is that some privacy/security issues may be circumvented, as described in the next section. On the flip side, server-side handling of data typically allows for greater standardization, near-real-time logging of progress and aggregation of data, and, perhaps most importantly, a simpler experience for the participants, because their involvement beyond completing the task itself is minimal (e.g., no need for participant involvement in installing the app or uploading the data).

Privacy and Security

Issues of privacy and security are key considerations for data handling. For remote applications, PII and PHI data are subject to government regulations.

In the broader IT world, it is considered good practice to decouple identifying information from all other application data, which may in turn be stored in a de-identified form. Accordingly, many cloud computing services (e.g., Amazon Web Services/AWS, Microsoft Azure, Google Cloud Platform/GCP) provide separate platforms for managing individual identities and individual data (e.g., AWS Cognito vs. AWS Aurora, or GCP Identity Platform vs. GCP Cloud SQL). This separation provides an extra layer of privacy protection because, unless both resources are simultaneously compromised and linked, the leaked data cannot be associated with an individual’s identity. One way to achieve this decoupling is to manually manage participant identities as would be done in in-person studies, and securely (e.g., via phone/e-mail) provide the participant an anonymous unique subject ID that they use in all remote testing. This separation between identities and data is also built in when recruiting subjects from online participant marketplaces such as Prolific or Mechanical Turk.

Client-side data handling is particularly advantageous when it comes to de-identifying data. For instance, when using standalone apps, because the client device is typically associated with a unique participant, the device can generate an anonymous subject ID without the experimenter having to securely transmit this ID to the participant. Furthermore, if verbal/audio or video responses are being recorded, client-side processing can perform the data analysis on the participant device (including for browser-based testing using JavaScript) and transmit only the results to the server, thus effectively de-identifying otherwise hard-to-de-identify data (e.g., if measuring environmental noise using a microphone on the participant device, noise levels rather than raw audio may be transmitted to the server).
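The following minimal sketch illustrates these two client-side ideas, a device-generated anonymous ID and reduction of a raw microphone buffer to a single noise level before anything leaves the device; the variable names and payload structure are illustrative.

```python
import uuid
import numpy as np

# A device-generated anonymous ID avoids transmitting identifying information.
subject_id = str(uuid.uuid4())

def noise_level_dbfs(audio_buffer):
    """Reduce a raw microphone buffer to a single ambient-noise level
    (dB re full scale) on the participant device, so that only the level,
    never the audio itself, is transmitted to the server."""
    samples = np.asarray(audio_buffer, dtype=float)
    rms = np.sqrt(np.mean(samples ** 2))
    return 20 * np.log10(rms)

# Example payload sent to the server: no raw audio, no identifying information.
# payload = {"subject": subject_id, "ambient_level_dbfs": noise_level_dbfs(buffer)}
```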

Another layer of security may be achieved by encrypting all stored data (i.e., encryption at rest). All major databases (e.g., PostgreSQL, MySQL, SQLite) and cloud computing service providers (e.g., AWS, GCP, Azure) provide multiple options for encryption at rest. However, it may be desirable to have public "clear-text" copies of the de-identified research data in the interest of open science. When sharing data to public repositories, it is good practice to use different anonymous participant IDs than those used during data collection. Finally, it is important that all communication between participant devices and servers is encrypted; this is especially the case for browser-based communications with form fields where the participant can type in information. Encryption in transit can be achieved using SSL/TLS, and keys for TLS/SSL may be obtained from certificate authorities. A popular free option for TLS/SSL certification is Let’s Encrypt.

Backup

Remote testing platforms vary in their support for automatic data backup. If setting up a custom platform, major databases (e.g., PostgreSQL, MySQL, SQLite) also come with support for manual backup snapshots that may be executed by scripts that are scheduled to run at specific times (e.g., cron jobs on Linux). Multiple clones of the database may be used to reduce server downtime in the event of a database crash. Otherwise, all data backup considerations as with in-person studies apply.




Issues in Data Analysis

Elevated random error

In the lab, we typically have a lot of control over the testing environment. Subjects are usually seated comfortably in a sound booth with minimal auditory and visual distractions. In the same space, the hardware is usually limited to only what is necessary for the subjects to respond to the tasks and is positioned in a way that enhances the overall experience during testing. The experimenter is typically present, with many opportunities to provide task-related instructions, and checks in frequently to make sure that subjects are engaged in the task. This is sometimes critical for testing with special populations such as young children and older adults (see Special Populations). Any remote testing platform outside of the traditional lab setting is unlikely to provide such control, inevitably introducing greater random error or “noise” in the behavioral data. Below is a list of factors to consider that may lead to more variability in behavioral data.

* Environmental factors: Subjects are likely to be in an environment with more auditory/visual distractions and higher ambient noise than the lab sound booth. Ambient noise may also fluctuate over the course of the experiment.

* Device factors: Consumer electronics (e.g., computers, headphones) that subjects have access to may introduce variability in stimulus quality, such as emphasis/de-emphasis of spectral regions of the output frequency response. Noise-canceling headphones may in fact introduce an elevated noise floor when worn improperly.

* Subject factors: For remote testing, subjects are more likely to run the experiment at a time of their choosing, which may mean late evenings or other times outside of typical business hours. Their attention state during these times may also influence how they respond to psychoacoustic tasks in ways that differ from typical in-lab testing. In the absence of a task proctor, participants may have limited opportunity to ask questions about the task instructions; for child and elderly participants, this may be an important part of behavioral testing.

Note that even though these factors may seem to increase variability between subjects, it is highly likely that they also affect within-subject performance because of the highly dynamic environments in which remote testing occurs. What we do not know is whether and to what degree these factors contribute to the difference (random error) between data collected in remote testing environments and data collected in the lab. The elevated error may influence behavioral outcomes in the following categories:

* Test-retest reliability, particularly important for experimental protocols aimed at indicating diagnostic and training outcomes

* Baseline performance (e.g., threshold for speech in noise recognition)

* Effect size of experimental manipulation (e.g., amount of speech-in-noise masking due to different types of noise masker)

The following illustration provides a simple example of how remote data collection may influence behavioral outcomes as compared to data collected in the lab environment.

The following outlines general types of studies that are progressively more susceptible to the variability introduced by remote platforms, the factors likely to influence behavioral outcomes, and potential solutions.

Non-threshold studies (e.g., talker discrimination)

Factors influencing outcomes:
  • More distractions (e.g., visual, auditory)
  • Potential disconnection of the audio device during the task

Potential solutions:
  • Provide examples of a good home environment (quiet, few distractions) to participants in the instructions
  • Recommend participants use audio devices with consistent connectivity (wired headphones/earphones and loudspeakers)
  • Implement a perceptual screening (e.g., proper headphone/earphone wear, stimulus presentation at a comfortable level)

Studies measuring pitch-based absolute thresholds (e.g., pitch discrimination)

Factors influencing outcomes:
  • All of the above, plus:
  • Audio device characteristics (e.g., narrower output frequency range, fluctuating frequency responses)

Potential solutions:
  • All of the above, plus:
  • For calibrated devices: consider compensating for the audio device’s frequency response (including all hardware in the output chain)
  • For browser-based implementations: consider the upper bound of the frequency responses of most consumer electronics

Studies measuring loudness-based absolute thresholds (e.g., loudness threshold)

Factors influencing outcomes:
  • All of the above, plus:
  • Elevated ambient noise (very likely to lead to generally elevated thresholds)
  • Lack of audio output level measures (e.g., SPL) unless the device is calibrated

Potential solutions:
  • All of the above, plus:
  • When the absolute SPL cannot be obtained, consider whether sensation level is acceptable for the study design

Studies measuring relative thresholds (e.g., speech in noise)

Factors influencing outcomes:
  • All of the above, plus:
  • Absolute thresholds may have a smaller but substantial impact on relative thresholds
  • Smaller dynamic range of the audio device, which may lead to clipping more easily

Potential solutions:
  • All of the above, plus:
  • On the software side, consider implementing additional safety measures to ensure sounds do not exceed a maximum output level
  • Consider adding steps to a perceptual screening to check for the “most comfortable level” as another upper bound to guide the definition of the maximum output level

Studies involving binaural hearing (e.g., interaural level difference discrimination)

Factors influencing outcomes:
  • All of the above, plus:
  • Additional audio device characteristics:
    • Accurate delivery of very short delays may require specialized synthesis, which may not be verifiable on all platforms. Inserting blank samples, for example, can only generate delays corresponding to multiples of the audio sampling period (e.g., delays of ~23, 45, 68, … µs at 44.1 kHz) and should be avoided.
    • Left/right channel imbalance (e.g., level, spectral, or even user error)

Potential solutions:
  • All of the above, plus:
  • Additional perceptual screening to ensure:
    • Proper headphone/earphone wear for the left/right channels
    • A left/right balance check against a maximum allowed output discrepancy; use this information either to reject participation or to compensate for left/right imbalance

Validation studies

Ideally, the effects of remote testing on baseline performance, effect size, and random error would be known (or at least measurable) for every study question. In general, however, that information is not available. Alternative approaches can be used to estimate these effects, and inclusion of one or more validation conditions is one of the few easily identifiable best practices for remote testing.

Replication using in-lab methods. One option is the inclusion of in-lab testing in a subset of conditions or participants. Where appropriate, for example, lab personnel might collect data on themselves in both remote and in-lab settings, or in-lab data might already exist from pilot stages of the project. Directly comparing performance across settings can provide some reassurance of the validity of remotely collected data.

Remote replication of in-lab tests. If an existing in-lab dataset closely related to the study procedures can be obtained, replication via remote testing can provide additional assurance that remote methods are capable and comparable to in-lab approaches. For example, a condition from a prior in-lab study can be included to estimate baseline performance across test settings. While this approach might not be capable of detecting changes in effect size, it can be used to verify baseline performance level and variation.
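Where such a comparison is made, a simple first-pass analysis is to compare baseline performance and its variability across settings. The sketch below assumes per-participant baseline measures (e.g., thresholds in dB) from an in-lab and a remote sample and uses SciPy’s Welch t-test; note that a non-significant difference is not, by itself, evidence of equivalence, so the descriptive statistics matter at least as much as the p-value.

```python
import numpy as np
from scipy import stats

def compare_baselines(in_lab, remote):
    """Compare baseline performance (e.g., thresholds in dB) collected in-lab
    and remotely: report means, SDs, and a Welch two-sample t-test as a
    first-pass check that remote data fall in the expected range."""
    in_lab = np.asarray(in_lab, dtype=float)
    remote = np.asarray(remote, dtype=float)
    res = stats.ttest_ind(in_lab, remote, equal_var=False)  # Welch's t-test
    return {
        "in_lab_mean": in_lab.mean(), "in_lab_sd": in_lab.std(ddof=1),
        "remote_mean": remote.mean(), "remote_sd": remote.std(ddof=1),
        "t": res.statistic, "p": res.pvalue,
    }
```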

Analytical approaches

The random error is expected to be higher in datasets collected on remote testing platforms because of the highly dynamic testing environments (see the contributing factors listed above). While validation studies may provide insight into the types of studies that are generally more resistant to remote testing environments, the following considerations can help ensure robustness through data analysis for individual datasets collected remotely.

Removal of outlier participants from data analysis: The potentially large degree of random error in remote testing suggests high variance across participants, which will have negative impacts on study power. Some variation might be attributed to participant factors related to the task. Eliminating non-compliant participants from data analysis is often necessary to preserve study power, but such participants often cannot be easily or cleanly identified from the task data alone. Investigators are encouraged to carefully consider potential sources of inter-participant variation and include additional tests, such as perceptual/cognitive screening, attention checks, and catch trials, within the remote testing protocol. Predefined levels of performance on these additional measures can be used to censor participant data independently of the primary data, improving the power and sensitivity of the study in a balanced and rigorous way.

Bootstrapping and reporting: The (unknown) underlying effect of interest is likely more susceptible to sampling error because of the elevated random error in datasets collected in highly dynamic environments. When modeling psychometric functions, Wichmann & Hill (2001) provide insights on estimating variability in fitted parameters through bootstrapping in the psignifit toolbox. In essence, bootstrapping provides a range of values for the parameter estimate(s) of any statistical model fitted to a dataset, both for descriptive statistics (e.g., mean, standard deviation) and inferential statistics (e.g., regression estimates, effect size). It uses a Monte Carlo approach to simulate resampling from the dataset with replacement, under the assumption that the original sample represents the population (random sample).

Bootstrapping can also provide a sanity check for sample size (Chihara & Hesterberg, 2011). Under the Central Limit Theorem, the mean of a sufficiently large random sample will have a normal distribution regardless of the distribution of the population. The distribution obtained from the bootstrapped (re)samples has the same spread and skew as that of the original random sample. Thus, if the distribution of the bootstrapped parameter estimate(s) is not normal, it is highly likely that the original dataset is under-sampled.
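A minimal, general-purpose percentile-bootstrap sketch (numpy) is shown below; it is not tied to psignifit or any particular model, and the choice of statistic, number of resamples, and confidence level are all illustrative.

```python
import numpy as np

def bootstrap_ci(data, statistic=np.mean, n_boot=10000, alpha=0.05, seed=None):
    """Percentile-bootstrap confidence interval for any statistic of a sample:
    resample the data with replacement many times and take the central
    (1 - alpha) proportion of the resulting statistic values."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    boots = np.array([statistic(rng.choice(data, size=len(data), replace=True))
                      for _ in range(n_boot)])
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi, boots   # inspect the distribution of `boots` for strong skew

# Example: 95% CI for a median threshold estimate from one remotely tested group
# lo, hi, boots = bootstrap_ci(thresholds, statistic=np.median, seed=0)
```

Inspecting the histogram of the bootstrapped estimates directly supports the sample-size check described above: a strongly skewed or otherwise non-normal distribution suggests the dataset may be under-sampled.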




Platform considerations

The primary dimension along which approaches to remote testing vary is the trade-off between experimental control and convenience/accessibility. This trade-off impacts all aspects of the study design, although different balances may be more appropriate for different aspects (e.g. combining careful stimulus control with convenient sampling of participants). There are many different research Platforms available to support remote testing; please refer to Platform Descriptions for detailed information about specific platforms.

What approach should I use for remote testing?

There are really three big questions you need to answer when deciding how to set up an auditory experiment for remote data collection: What hardware will I use? What software will I use? Who are the subjects I want to test? In all three cases, the alternatives range from convenient but less controlled to more time-consuming but well specified.

hardware: calibration & interfaces

  • If loose control of auditory stimuli (calibration & frequency response) is acceptable, graphic and temporal specs can be flexible, and no “special” data (e.g., ambient noise levels) are collected: any user hardware (e.g., PC & headphones), or a more controlled solution below.
  • If the stimulus level & frequency response must be defined within ~5 dB, graphic and temporal specs must be controlled, and some specialized data collection capability (e.g., touch screen input) is needed: specified hardware (e.g., iPad), or a more controlled solution below.
  • If strict control of level and frequency response, or other specialized measurements or interfaces (e.g., calibrated ambient noise recordings, software permitting), are required: lab hardware (e.g., deliver a tablet & sound level monitor).

software: data handling and experimental control

  • If standard procedures without modification are sufficient: preconfigured hearing research packages (e.g., PART), or a more controlled solution below.
  • If custom stimuli are needed but standard interfaces and responses are sufficient: build-your-own systems (e.g., Gorilla), or a more controlled solution below.
  • If custom procedures, real-time processing, or non-standard interfaces or data collection (e.g., voice recordings, hardware permitting) are required: fully custom scripts (e.g., MATLAB or Python).

subjects: demographics & instruction

  • If anyone can participate and no one-on-one instruction is required: anonymous & unsupervised recruitment (e.g., MTurk), or a more controlled solution below.
  • If a targeted population or specialized instruction (e.g., via Zoom) is needed: “by invitation” access; most hardware/software will work.
  • If the population requires real-time supervision (e.g., children) or the protocol requires rigorous training: supervision by proxy (e.g., a parent), an experimenter-administered protocol, or a hybrid model (e.g., training in person, data collection at home); most hardware/software will work.

Some other dimensions along which platforms vary:

Settings (In-lab, kiosk, at-home, in-the-wild)
At this time, most of the platforms identified with remote-testing appear optimized for testing in remote but isolated settings, such as in a participant’s home. Most are equally deployable to in-lab settings, possibly with greater control over computing and stimulus hardware. Depending on the study, it might be feasible to utilize a single design for both remote and in-lab validation studies. Keep in mind, however, that for commercial platforms the pricing structure may not be ideal for in-lab deployment.

Portable systems configured for standalone use with minimal experimenter intervention could also be deployed in a kiosk setting, i.e., semi-permanently installed for unsupervised walk-up use. Depending on the motivation for remote testing, kiosk deployment could provide numerous advantages, such as sampling of geographically targeted populations in health-care offices, at music events, etc. Study design for kiosk-based testing is likely to share many elements with take-home / tablet-based studies, where simplicity and clear instruction are prioritized over controlled sampling and experimenter supervision.

Finally, some of the platforms identified for remote testing may be suitable for use in everyday / "real-world" settings such as bars, cafes, classrooms, and outdoor settings. Again, depending on the motivation for remote testing, this use could enhance the ecological validity of a study by measuring performance in behaviorally relevant backgrounds rather than in controlled lab settings.

Supervision (experimenter present in person or remote, vs standalone task)
Instruction about the task, clarification when questions or malfunctions occur, and debriefing are all common interactions between experimenters and participants in lab-based testing. Shifting to remote testing requires careful consideration of how (or if) such communication must be facilitated, and identification of a research platform capable of supporting it. At one extreme of this dimension lies in-lab testing, with constant in-person interaction available as needed. At the other lie completely standalone tasks. Detailed and effective instruction can take the place of many interactions, but is less helpful for unexpected errors in the task or in the research hardware/software itself, or for special populations which experience specific challenges. An intermediate solution is for an experimenter to provide direct real-time supervision remotely. Some platforms support this feature directly. Others may require use of a secondary service (telephone / video calling, screen sharing, etc.) running alongside and independently of the research platform itself.
Whose device? (experimenter provided / take-home vs participant BYO)
Another important dimension is that of the hardware selection and control. Laboratory-owned equipment (e.g. a lab-configured PC or tablet) can be used for in-lab testing, and for remote testing, by delivering or shipping the equipment to the participant’s location. Greater control is obviously achievable with experimenter-provided equipment, which could be delivered with earphones, displays, and response devices. In that case, device calibration can be done prior to delivery, and verified after use and return.

Participant-controlled equipment offers less control, but may allow greater flexibility in accessing the research paradigm online or by direct download. In that case, other procedures will be necessary to verify the correct operation and stimulus delivery (e.g. psychophysical calibration), and preparations should be made to provide technical support to participants attempting to download and install research software for participation.

Platform device type (browser, tablet, headset, PC, custom hardware)
Similarly, device types vary significantly across research platforms, from entirely software-based platforms accessed online (via a web browser) or by download, to standalone devices such as tablets and VR headsets, to general purpose or custom computing hardware. Some approaches (e.g. physiological data collection, custom earphones) may require additional custom hardware for control or calibration.
Special hardware support (headphones, soundcard, UI response, etc)
Many online platforms are capable of presenting auditory stimuli via the interface or sound card built into the participant’s hardware, using whatever earphones the participant has on hand. Few of these offer the level of stimulus control and calibration that experimenters may be used to working with in the lab. For this reason, it may be worthwhile to consider platforms capable of working with off-board audio interfaces (e.g., USB sound cards), which can be delivered, along with a standard model of earphone, to the participant’s location even if the research platform itself is fully online. Other types of specialized hardware may be required for certain types of response data, such as physiological data (heart rate, GSR, EKG / EEG), head and hand tracking (possible using VR headsets), etc.
Platform OS environment (e.g. Windows, iOS, Android)
Platforms also vary according to the operating system environment in which they run. Tablet-based systems may be compatible with Apple’s iOS (iPhone / iPad) or Google’s Android, which also supports a range of other devices including VR headsets; PC-based systems may run on Microsoft Windows or Apple MacOS. Few platforms run on multiple OSes, aside from fully online platforms, which may be compatible with any OS and a wide range of web browsers.
Software environment type (bespoke app, customizable app, MATLAB, js, Python, etc)
An important consideration for study design and implementation is the software development environment of the selected research platform. Some online platforms offer simple web-based tools for defining sequences of trials (instructions, stimuli, responses, feedback, …) with no direct coding required. Other approaches support standard scripting languages such as JavaScript and Python. PC- and tablet-based approaches can make use of a wider array of software development environments, including MATLAB, Unity, Xcode, etc. The time, effort, and expertise required to develop for each platform can thus vary significantly. In some cases, there may be a tradeoff between platform complexity as experienced by developers and by participants (e.g., a standalone and easy-to-use iOS app versus a MATLAB or Python script running in a separate interpreter).
Costs involved (software costs, subscriptions, required services)
Finally, the costs associated with various remote-testing platforms vary significantly and follow a number of different models. On the one hand are in-house and open-source programs with minimal or no acquisition costs. On the other are online platforms with subscription models that charge by the year, study, or participant. Some research platforms also require specialized hardware, which may be available from the platform vendor or third parties.



Special Populations

Special considerations may apply to specific populations of research participants, including children, the elderly, people with sensory impairment (e.g., hearing loss or low vision), users of hearing aids and cochlear implants, and patients experiencing cognitive and/or neurological challenges. Limited literacy or limited proficiency in the primary language of the test can also pose a challenge. Although each individual or group of participants will experience their own pattern of challenges for in-lab and remote testing, some consideration can be given to common features of remote testing that might particularly affect special populations.

Remote testing imposes additional challenges compared to in-person testing. These additional demands include:

  • the ability to communicate using remote technologies, such as written instructions or information conveyed via video conferencing
  • practical knowledge required to set up the test environment, such as turning off devices that generate background noise or asking family members not to interrupt testing
  • technical knowledge required to set up hardware or software
  • the ability to maintain attention without direct supervision

Special populations may require additional accommodations to ensure consistency and quality of data collected remotely. These accommodations might include:

video-chat for obtaining consent or assent

Video may be particularly beneficial for use with special populations, because it provides a rich set of interpersonal cues to ensure understanding and guard against coercion. Closed captions may also be appropriate for some subjects.

a social story or procedural video

While verbal instructions may be sufficient, some participants benefit from additional materials showing concrete examples of the task and what they will be asked to do.

a progress chart or visual schedule

Like a progress bar, these tools help the subject track their progress through a task or set of tasks.

an experimenter available when testing occurs

Having someone on call during data collection increases the chances that data will be collected following the protocol.

recruiting a parent or other helper to provide in-person support

A “wingman” can be trained to fulfill some of the same functions as an in-person experimenter.

blocking data collection into short segments

Providing frequent opportunities for feedback and breaks is common practice when working with special populations, but it could be particularly important for remote testing because the experimenter cannot monitor progress for signs of fatigue or flagging motivation.

including task training and probes

Training and probes may be even more critical for remote testing than in-person testing due to the reduced supervision and opportunities for the experimenter to notice confusion or flagging attention on the part of the subject.

user friendly response interface

Subjects with limited motor dexterity may benefit from the use of a touch screen or a custom interface (e.g., oversized buttons).

multiple methods for delivering feedback and reinforcement

For special populations who may find prizes important to sustain motivation during the task, the experimenter may want to design various methods to effectively deliver incentives that meet compliance requirements.

interpretation of standardized tests

Administering standardized tests, such as batteries assessing IQ and cognitive and language abilities, is usually part of the protocol with special populations. Some implementations exist online (Gorilla Sample Tests). For most standardized batteries, however, normative data were collected through in-person interactions and may not be valid for remote implementation. The experimenter should be cautious when interpreting individual data that will be transformed based on normative data.




Issues related to peer review

What should the standard be for publishing remote research?

There aren’t any hard and fast rules or boxes to check, just as there are no universal standards for in-person research. Experimental methods should be considered in the context of the protocol and the research question. Given the new pressures to adopt remote testing, reviewers will need to think critically and avoid rejecting a new methodology simply because it deviates from previous conventions. The focus should be on whether the hardware and test protocol are sufficient to support reliable and valid data that inform the specific question being asked.

As an author, what steps should I take to demonstrate rigor of my remote research methods?

Steps for demonstrating rigor of remote research are the same as those for in-person research, with the caveat that novel methods require additional explanation and explicit justification. Some specific considerations appear in the section describing Best Practices.

My remote methodology offers less stimulus control than in-person testing. Is that a fatal flaw?

Not necessarily. If you can make a case that stimulus control is good enough to observe the effects being evaluated, then it may be sufficient to describe the methods and note relevant limitations of the methods.




Identifying best practices

A long-term goal of the task force is to identify best practices or guidelines for remote testing. For the most part, however, the necessary evidence base (in terms of what works well and what does not) does not yet exist. As such evidence becomes available (see Examples), we expect to add to the list of "best practices" which can be identified from investigators’ experiences. Here, we consider the motivations and challenges to identifying best practices in the first place.

Why should we attempt to identify best practices?

Best practices can form the basis of formal or informal guidelines for research practice with remote testing. The many tradeoffs evident in remote testing demonstrate the potential for fundamental weaknesses if the remote-testing approach is not designed appropriately for the study. Identifying best practices can help investigators select the features which will best ensure rigor and reproducibility of the research.

What issues stand in the way of establishing a single set of best approaches?

  1. Outside of specific examples where remote methodology has been a key focus (see Resources), very few remote-testing studies have yet been completed. There are many research questions that ought to be addressable via remote testing, but the scarcity of available results means that critical challenges and confounds remain unidentified. This barrier is likely to be overcome as investigators complete studies and gain experience with the relevant approaches.
  2. More fundamentally, because experimental questions differ widely in the level of control required, no single approach is likely to be optimal for all studies. The specific hardware, software, and procedures used in any remote-testing study will impact the degree of experimental control and the information that can be collected from a test session. For example, accurate calibration is critical when evaluating detection in quiet, and less so when evaluating memory for a melodic tone sequence. Investigators are encouraged to carefully consider the methodological strengths and weaknesses as they pertain to the specific goals of their own research. The best approach depends on the phenomena being evaluated.

Candidate Best Practices that can be identified at this time:

Align strengths to research goals: Prior to conducting a remote-testing study, enumerate the specific tradeoffs associated with each identified approach. Be certain to align the strengths to the goals of the specific research question. Familiarity with the questions raised on this Wiki (see Issues) and with feature comparisons across remote-testing Platforms could help.

Measure and document calibration: Incorporate the most accurate form of stimulus calibration that is achievable within the selected approach. In some cases (e.g., browser-based testing with participants’ own computers and headphones) this may be very limited, but even a simple psychophysical validation using tone detection or binaural comparison could provide important verification of the stimulus setup, such as whether earphones were worn correctly or whether stimulus levels were appropriate for the test setting. More elaborate procedures involving acoustical measurement before, after, or during the tests might alleviate many performance concerns about testing outside of a controlled sound booth.

Validation: If possible, include a replication or validation condition that matches, as closely as possible, an approach for which standard in-lab data exist or may easily be obtained. Close replication across in-lab and remote-testing procedures is one of the strongest approaches available to ensure the reliability and validity of new data (see Data Analysis). Unexpected results could indicate an unacceptable deviation from ideal conditions and could help to identify previously unanticipated limitations of the selected approach.

Inclusion of independent measures and predetermined criteria for outlier removal: Incorporating additional measures, such as cognitive screens, attention checks, and catch trials, into the study procedures can provide important independent data for identifying non-compliant or poorly performing participants who contribute excessively to random error and thus should be removed from data analysis to preserve statistical power (see, e.g., McPherson & McDermott, 2020). A set of independent, predetermined criteria for data removal is required to avoid introducing experimental bias that could result from identifying "outliers" based on the study data themselves. Alternatively, screening measures can provide covariate measures that aid the interpretation of study data when all participants are retained in the final analysis. See Data Analysis.
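As an illustration of applying such predetermined criteria independently of the primary outcome, the following sketch (pandas) filters a hypothetical per-participant summary on catch-trial accuracy and an attention check; all column names, threshold values, and data are invented for the example.

```python
import pandas as pd

# Hypothetical per-participant summary containing independent compliance measures
data = pd.DataFrame({
    "subject":        ["S01", "S02", "S03", "S04"],
    "catch_accuracy": [0.95, 0.40, 0.90, 1.00],   # accuracy on catch trials
    "attention_pass": [True, True, False, True],   # outcome of an attention check
    "threshold_db":   [62.0, 71.5, 65.0, 60.5],    # primary outcome (never used for exclusion)
})

# Criteria fixed before data collection; exclusion never looks at the primary outcome.
MIN_CATCH_ACCURACY = 0.8
included = data[(data["catch_accuracy"] >= MIN_CATCH_ACCURACY) & data["attention_pass"]]
excluded = data.loc[~data.index.isin(included.index), "subject"].tolist()
```

Reporting both the predefined criteria and the number of excluded participants keeps the censoring transparent and reproducible.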