Other example reports (tutorials / technical issues / workarounds)

Testing Music Perception Through Microsoft Teams

Question of Interest

What is the feasibility and accuracy of adapting an in-person protocol, which uses music perception materials developed outside our lab, to a remote study?

Our study evaluates music perception via four music-based subtests: the three subtests of the University of Washington Clinical Assessment of Music Perception (CAMP; pitch discrimination, isochronous familiar melody recognition, and instrument recognition) and one subtest of the University of Innsbruck Profile of Music Perception Skills (PROMS; melody subtest). The study was originally conducted in person, routing the audio signal through an audiometer and presenting stimuli at a calibrated 65 dB SPL.

To transition this study to a remote environment, we use the videoconferencing software Microsoft Teams on a Windows 10 computer connected to the university’s internet via Ethernet cable. This software enables the researcher to share the screen and the computer’s audio signal directly with the participant’s computer. Because our university holds a Business Associate Agreement, we consider this software secure to HIPAA standards, and thus appropriate for maintaining privacy in human subjects research.

Prior to confirming the appropriateness of the remote testing protocol, we completed two validation studies. To detect whether the Teams software introduced distortions that would impact the integrity of the pitches at the listeners’ computer, we tested pitch equivalence. To detect whether the results of the study in the remote environment significantly differed from the in-person environment, we validated using test-retest principles for five typical hearing (TH) participants.

Pitch Equivalence


Five participants provided consent for the procedure, which included downloading the free sound analysis tool Audacity. The pitches were shared from the researcher’s computer (on campus) using Microsoft Teams and the “Share System Audio” feature while the participants recorded the sounds. Represented in this analysis are two MacBook Pros, two MacBook Airs, and one HP Envy X360. Because Mac computers do not have internal sound cards capable of recording the computer’s internal audio, the add-on Soundflower was used for the participants with MacBooks, as instructed in the Audacity Wiki. Four of the five participants recorded the sounds directly into Audacity, while one participant (whose computer was incompatible with Soundflower) recorded the sounds in QuickTime and converted the files to WAV for upload into Audacity.

Participants were instructed to use the Plot Spectrum tool on each pitch recorded into Audacity to detect the fundamental frequency represented. Participants were guided through this process with the lead researcher on a video call, and the participant shared their screen with the researcher to confirm adherence to the protocol. First, participants highlighted the entire selected pitch, then chose Analyze -> Plot Spectrum from the top toolbar. Second, participants placed their cursor near the first peak in the image until the vertical guide bar “snapped” to the first peak. Then, the participant read the researcher the value in the box labeled “Peak.” The participant completed this process for all 26 pitches presented.
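Labs that prefer to automate the spectrum-reading step could script it instead of reading peaks manually. The sketch below (Python with NumPy; not part of the original protocol) estimates the dominant frequency of a recorded tone in the same spirit as Plot Spectrum’s “Peak” readout. The 440 Hz tone is a synthetic stand-in for a participant recording.

```python
import numpy as np

def estimate_peak_frequency(samples, rate):
    """Estimate the dominant frequency of a recorded tone via FFT,
    analogous to reading the first peak in Audacity's Plot Spectrum."""
    windowed = samples * np.hanning(len(samples))   # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)
    return freqs[np.argmax(spectrum)]

# Hypothetical example: a 1-second 440 Hz tone at a 44.1 kHz sample rate
rate = 44100
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 440.0 * t)
print(round(estimate_peak_frequency(tone, rate), 1))  # ≈ 440.0
```

In practice, a participant’s WAV recording could be loaded (e.g., with the standard-library `wave` module) and passed to the same function, removing the need to read peaks aloud over the call.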


Root mean square errors (RMSEs) were calculated for each of the 26 pitches included in the music test software to determine how much the pitches transmitted via Microsoft Teams deviated from the intended pitches as represented on the participants’ computers. Across pitch stimuli ranging from 184 Hz to 783 Hz, the mean RMSE was 2.8 Hz. With the exception of one pitch (659 Hz, which had an RMSE of 6.6 Hz), all RMSEs were less than 5 Hz.
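The RMSE for a single pitch can be computed as follows; the frequency readings shown are hypothetical values for illustration, not data from the study.

```python
import math

def rmse(target_hz, measured_hz):
    """Root mean square error of measured frequencies against the intended pitch."""
    return math.sqrt(sum((m - target_hz) ** 2 for m in measured_hz) / len(measured_hz))

# Hypothetical readings for a 262 Hz stimulus across five participants' computers
print(round(rmse(262.0, [261.5, 262.3, 260.9, 263.1, 262.0]), 2))  # → 0.74
```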


In classic psychoacoustic studies, frequency difference limens (DLs) were tested as a function of intensity and frequency. Consistently, these studies found that DLs grew larger (poorer) as frequency increased and smaller (better) as intensity increased (Moore, 1973; Sek & Moore, 1995; Wier et al., 1977). In the range of frequencies presented here, participants in those studies typically had DLs of around 2 Hz or less. While some of the RMSEs in the present study exceed a DL of 2 Hz, the participants in those classic studies were highly trained (in some cases, PIs and lab members with 20-30 hours of pitch discrimination training). Those average DLs likely do not represent the untrained population; DLs in untrained normal-hearing listeners are likely to be somewhat larger. Thus, the deviations reported here should not greatly affect the validity of the stimuli presented remotely.

Test-Retest Procedure


Five participants with normal hearing who completed testing in person were contacted to see if they were interested in completing the objective music measures of pitch discrimination, familiar melody recognition, instrument identification, and unfamiliar melody discrimination remotely. Five participants represent 25% of our total sample size for this group (i.e., the full study includes 20 participants). These tests were completed using the “Share System Audio” feature of Microsoft Teams, a videoconferencing software that allowed for secure remote data collection. In the remote modality, participants listened to the stimuli in soundfield (mirroring the presentation of their in-person data collection). A threshold of r = .70 was considered acceptable test-retest reliability, based upon the test-retest values found in the original development and validation of these measures, when both the test and retest were completed using the same in-person protocols.


Pearson’s correlations were calculated for each of the five objective data points gathered in this study:

  • mean pitch discrimination, r = -0.532
  • percent correct on familiar melody recognition, r = 0.912
  • percent correct on instrument identification, r = 0.783
  • unfamiliar melody discrimination score, r = 0.858
  • unfamiliar melody discrimination percent correct, r = 0.711
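A test-retest correlation of this kind can be computed directly from paired time-1/time-2 scores; the sketch below uses `numpy.corrcoef` with hypothetical percent-correct values for five participants (the actual study data are not reproduced here).

```python
import numpy as np

# Hypothetical time-1 and time-2 percent-correct scores for five participants
time1 = np.array([80.0, 65.0, 90.0, 72.0, 85.0])
time2 = np.array([82.0, 63.0, 88.0, 75.0, 84.0])

# Pearson's r is the off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(time1, time2)[0, 1]
print(round(r, 3))  # a high positive correlation, well above the .70 threshold
```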


The correlations between the objective measures at time 1 and time 2 suggest that all measures meet a test-retest value of .70 or greater, except that of the pitch discrimination task. Three of five participants had mean pitch discrimination scores change by .02, one participant had a mean score change of .04, and one participant had a mean score change of .10. The full range of possible scores is .50 to 12.00 semitones, with .50 being the best and 12.00 being the worst. Most typical hearing listeners can discriminate pitches that differ by 1.00 semitone, or one half-step. Therefore, a change of .10 is likely not a clinically significant difference. Furthermore, because this change occurred for only one individual, it more than likely reflects a change in participant-related characteristics, such as attention.
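To put these semitone differences in perspective, a difference of s semitones corresponds to a frequency ratio of 2^(s/12), so a .10-semitone change is a very small shift in frequency compared with the one-semitone step most typical hearing listeners can discriminate. A quick illustration:

```python
def semitone_ratio(s):
    """Frequency ratio corresponding to a difference of s semitones."""
    return 2 ** (s / 12)

print(round(semitone_ratio(1.0), 4))   # one half-step ≈ 1.0595 (≈5.9% frequency change)
print(round(semitone_ratio(0.10), 4))  # a 0.10-semitone change ≈ 1.0058 (≈0.6%)
```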

Although the test-retest value of the mean pitch discrimination (-.532) is less than the predetermined .70 value, the aforementioned calibration study suggests that this is likely not due to remote administration of the objective measures. Because this calibration study suggested robust transmission of the pitches via the remote platform, and all other measures were above .70 test-retest, this validation study concludes that remote administration of the objective music tests mirrors that of the in-person data collection.

Resolving “Sharing System Audio Paused” Error in Microsoft Teams

Upon beginning the remote data collection using Microsoft Teams, it was noted that an error message (“Sharing System Audio Paused”) would occur randomly throughout testing. During the first three tests in the remote environment, presentation of the stimuli was interrupted due to this message, and these participants’ results were excluded as invalid. The CAMP test software does not allow for the researcher to repeat stimuli presentation to the participant, although the PROMS test software does (by refreshing the browser). Speed tests conducted by our IT department detected no problems with the internet connection. No pattern could be detected by the researchers for when or why this happened. Therefore, a reliability procedure based on manual tallying of stimuli was developed. We call this “Tallying for Reliability.”

On paper, the researcher tallies each presentation of a test item. After every five item presentations, the researcher stops and restarts audio sharing. The screen continues to be shared throughout; only the audio sharing is paused and restarted.

In the event that the “Sharing System Audio Paused” error occurs even after completing this procedure, the researcher circles the tally that represents the item that was in error and discards it from the calculation of the final percent correct score. In our experience, “Tallying for Reliability” has dramatically reduced the number of “Sharing System Audio Paused” errors during testing, such that only 4 of approximately 1,362 total stimulus presentations (~0.29%) have been invalid. Additionally, it has improved the validity of the participants’ results, as invalid presentations are removed from the calculation of the participants’ scores.
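The scoring side of “Tallying for Reliability” amounts to discarding the circled (invalid) items before computing percent correct. A minimal sketch, with hypothetical tallies rather than study data:

```python
def percent_correct(tallies):
    """tallies: list of (correct: bool, invalid: bool) pairs, one per presented item.
    Invalid presentations (audio-sharing errors) are discarded before scoring."""
    valid = [correct for correct, invalid in tallies if not invalid]
    return 100.0 * sum(valid) / len(valid)

# Hypothetical session: ten presentations, one invalidated by the
# "Sharing System Audio Paused" error
tallies = [(True, False)] * 7 + [(False, False)] * 2 + [(True, True)]
print(round(percent_correct(tallies), 1))  # 7 correct of 9 valid items → 77.8
```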

Contact Stephanie L. Fowler, AuD (stephanie.fowler@utdallas.edu) for more information.


Andrea D. Warner-Czyz, PhD (PI)

Peter Assmann, PhD (Pitch Equivalency)

Colleen Le Prell, PhD (Test-Retest)

PREAMP Team (Analysis)