Data Handling

Kinds of data

Remote testing is associated with various kinds of data that need to be exchanged between participants and researchers. These typically include experiment parameters and stimuli, participant response data, and in some cases personally identifiable information (PII) or protected health information (PHI).
In some cases PII may incidentally be linked to the response data, thus requiring special considerations. For instance, when verbal responses are recorded, or if there is live video interaction between the participant and experimenter, raw audio/video data will contain identifying information.
Server-side versus client-side data handling

There is considerable technical flexibility in whether data handling is done on the server side (i.e., online in the "cloud") or on the client side (i.e., on the participant device). Platforms used for remote testing fall along a wide spectrum between fully server-side and fully client-side data handling. More information about data handling in particular remote testing platforms may be found in the platform description pages, and the examples section features some solutions adopted by different labs for various data handling needs. The balance typically differs depending on whether the platform runs as a standalone application on the participant device (e.g., a tablet/smartphone app or an installable desktop/laptop application) or is served to a browser.
For standalone apps, after initial installation and a one-time download of material from the server, the stimuli and experiment parameters are typically stored on the participant device. Although apps can communicate with the server and handle data on the server side if desired, it is more common for most computations (e.g., monitoring progress, providing feedback, controlling task flow) to be done on the participant device within the installed app, without the need to communicate with the server. The upload of the full response dataset may then be done once at the end of the task/study, either automatically within the app or manually by the participant; for instance, data uploads could be scheduled to occur after each session when the app is closed. Transfer of response data files may also be done manually outside of the app (e.g., participant upload via e-mail or web browser). While manual uploads allow the app to be more minimal in its feature set, this comes with some reduction in standardization of data handling and makes it harder to log data uploads.
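As an illustration of this pattern, the sketch below assumes a JavaScript-based standalone app (e.g., an Electron or Node.js wrapper); the data file path and upload endpoint are hypothetical placeholders, and a real app would add error handling and retry logic.

```js
// Minimal sketch, assuming a JavaScript-based standalone app (e.g., Electron or
// a Node.js wrapper). The data file path and upload URL are hypothetical.
const fs = require('fs');
const path = require('path');

const DATA_FILE = path.join(__dirname, 'responses.json');
const UPLOAD_URL = 'https://lab.example.org/api/upload'; // hypothetical endpoint

// Called after each trial: persist the response locally; no network needed.
function saveTrial(trial) {
  const data = fs.existsSync(DATA_FILE)
    ? JSON.parse(fs.readFileSync(DATA_FILE, 'utf8'))
    : [];
  data.push(trial);
  fs.writeFileSync(DATA_FILE, JSON.stringify(data));
}

// Called once when the session ends (e.g., on app close): upload the dataset.
// Requires fetch(), available in Node 18+ and in Electron's renderer process.
async function uploadSession(subjectId) {
  const responses = JSON.parse(fs.readFileSync(DATA_FILE, 'utf8'));
  const reply = await fetch(UPLOAD_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ subjectId, responses }),
  });
  if (!reply.ok) throw new Error(`Upload failed: ${reply.status}`);
}
```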
For browser-based platforms, data handling tends to be done primarily on the server side. Stimuli and experiment parameters are typically stored on the server and loaded into the participant's browser "just in time" for administering the task. When JavaScript is used to control the task, the stimuli may be pre-loaded when the task pages are first requested by the browser to build the document object model (DOM). The response data may also be aggregated by the JavaScript code in browser memory and transferred to the server once at the end, or on a trial-by-trial basis in real time using asynchronous requests. When only HTML/CSS is used without JavaScript to control the task flow, conventional synchronous HTTP requests are needed to load the stimuli for each trial and write the data to the server; in this case, the computations controlling the flow logic are done on the server. Finally, unlike app-based platforms where the device tends to be associated with a unique participant, browser-based remote testing needs to employ either anonymous sessions using browser cookies or secure authentication/login information to track an individual participant. Although long-term storage on the participant device is also possible for browser-based testing through cookies with long lifetimes, there is some risk that cookies may be cleared by the participant or other individuals using the browser (without this state change being known to the server).
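As one illustration of the asynchronous approach mentioned above, the browser-side sketch below posts each trial's response to the server as it is collected; the endpoint URLs and payload fields are hypothetical.

```js
// Minimal browser-side sketch of trial-by-trial asynchronous data transfer.
// The endpoint URLs and payload fields are hypothetical placeholders.
async function sendTrial(subjectId, trialNumber, response) {
  const reply = await fetch('https://lab.example.org/api/trial', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ subjectId, trialNumber, response, timestamp: Date.now() }),
  });
  if (!reply.ok) throw new Error(`Trial upload failed: ${reply.status}`);
}

// Alternatively, aggregate responses in browser memory and send once at the end.
const allResponses = [];
function recordTrial(response) {
  allResponses.push(response);
}
async function sendAll(subjectId) {
  await fetch('https://lab.example.org/api/upload', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ subjectId, responses: allResponses }),
  });
}
```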
One advantage of handling data primarily on the client side (i.e., on the participant device) is that internet access is not necessary except for the initial download of the app and task material and the upload of data at the end. Furthermore, when computations are done on the client side with pre-loaded stimuli, better timing control can be achieved compared to loading stimuli from the server on a trial-by-trial basis. Another advantage of client-side data handling is that some privacy/security issues may be circumvented, as described in the next section. On the flip side, server-side handling of data typically allows for greater standardization, near real-time logging of progress and aggregation of data, and, perhaps most importantly, a simpler experience for participants because their involvement beyond completing the task itself is minimal (e.g., no need for the participant to install an app or upload data).
Privacy and Security

Issues of privacy and security are key considerations for data handling. For remote applications, PII and PHI are subject to government regulations.
In the broader IT world, it is considered good practice to decouple identifying information from all other application data, which may in turn be stored in a de-identified form. Accordingly, many cloud computing services (e.g., Amazon Web Services/AWS, Microsoft Azure, Google Cloud Platform/GCP) provide separate platforms for managing individual identities and individual data (e.g., AWS Cognito vs. AWS Aurora, or GCP Identity Platform vs. GCP Cloud SQL). This separation provides an extra layer of privacy protection because, unless both resources are simultaneously compromised and linked, leaked data cannot be associated with an individual's identity. One way to achieve this decoupling is to manually manage participant identities as would be done in in-person studies, and to securely (e.g., via phone or e-mail) provide the participant with an anonymous unique subject ID that they use in all remote testing. This separation between identities and data is also built in when recruiting subjects from online participant marketplaces such as Prolific or Mechanical Turk.
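A minimal sketch of this decoupling is shown below, assuming a JavaScript environment; the in-memory identity store is an illustrative stand-in for whatever separate, access-controlled system (e.g., a lab database kept off the data server) actually holds the identity-to-ID mapping.

```js
// Minimal sketch of decoupling identities from response data. The identity
// store below is an in-memory stand-in for a separate, access-controlled
// system kept apart from the de-identified response data.
const crypto = require('crypto');

const identityStore = new Map(); // anonId -> identifying info (kept separately)

function enrollParticipant(identifyingInfo) {
  const anonId = crypto.randomUUID(); // anonymous unique subject ID
  identityStore.set(anonId, identifyingInfo);
  return anonId; // communicated securely to the participant (e.g., by phone/e-mail)
}

// Response data are stored and shared using only the anonymous ID.
function recordResponse(anonId, response) {
  return { subjectId: anonId, response, timestamp: Date.now() };
}
```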
Client-side data handling is particularly advantageous when it comes to de-identifying data. For instance, when using standalone apps, because the client device is typically associated with a unique participant, the device can generate an anonymous subject ID without the experimenter having to securely transmit this ID to the participant. Furthermore, if verbal/audio or video responses are being recorded, client-side processing can perform the data analysis on the participant device (including for browser-based testing using JavaScript) and transmit only the results to the server, thus effectively de-identifying otherwise hard-to-de-identify data. For example, if environmental noise is being measured using a microphone on the participant device, noise levels rather than raw audio may be transmitted to the server.
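The browser-side sketch below illustrates the noise-level example using the Web Audio API: the microphone signal is analyzed locally and only a summary level is sent to the server. The endpoint URL is hypothetical, and a real implementation would average over a longer window and calibrate the level.

```js
// Browser sketch: estimate environmental noise level on the participant device
// and transmit only the level (not raw audio). The endpoint URL is hypothetical.
async function reportNoiseLevel() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext();
  const analyser = ctx.createAnalyser();
  ctx.createMediaStreamSource(stream).connect(analyser);

  // Let the analyser fill with samples before reading.
  await new Promise((resolve) => setTimeout(resolve, 500));
  const samples = new Float32Array(analyser.fftSize);
  analyser.getFloatTimeDomainData(samples);
  stream.getTracks().forEach((track) => track.stop());

  // Root-mean-square level in dB relative to digital full scale (uncalibrated).
  const rms = Math.sqrt(samples.reduce((sum, x) => sum + x * x, 0) / samples.length);
  const levelDb = 20 * Math.log10(Math.max(rms, 1e-8));

  await fetch('https://lab.example.org/api/noise', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ noiseLevelDb: levelDb }),
  });
}
```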
Another layer of security may be achieved by encrypting all stored data (i.e., encryption at rest). All major databases (e.g., PostgreSQL, MySQL, SQLite) and cloud computing service providers (e.g., AWS, GCP, Azure) provide multiple options for encryption at rest. However, it may be desirable to have public "clear-text" copies of the de-identified research data in the interest of open science; when sharing data to public repositories, it is good practice to use different anonymous participant IDs than those used during data collection. Finally, it is important that all communication between participant devices and servers is encrypted, especially for browser-based communications with form fields where the participant can type in information. This can be achieved using SSL/TLS. Keys and certificates for SSL/TLS may be obtained from certificate authorities; a popular free option is Let's Encrypt.
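As an illustration, the sketch below serves a task over HTTPS from Node.js using certificate files at the default paths written by Let's Encrypt's certbot client; the domain name is a hypothetical placeholder.

```js
// Minimal sketch of serving over HTTPS in Node.js with a Let's Encrypt
// certificate (default certbot paths; the domain is a hypothetical placeholder).
const https = require('https');
const fs = require('fs');

const options = {
  key: fs.readFileSync('/etc/letsencrypt/live/lab.example.org/privkey.pem'),
  cert: fs.readFileSync('/etc/letsencrypt/live/lab.example.org/fullchain.pem'),
};

https.createServer(options, (req, res) => {
  // All traffic on this server, including submitted form fields, is TLS-encrypted.
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('ok');
}).listen(443);
```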
Backup

Remote testing platforms vary in their support for automatic data backup. If setting up a custom platform, major databases (e.g., PostgreSQL, MySQL, SQLite) come with support for manual backup snapshots that may be executed by scripts scheduled to run at specific times (e.g., cron jobs on Linux). Multiple clones of the database may be used to reduce server downtime in the event of a database crash. Otherwise, all data backup considerations that apply to in-person studies apply here as well.
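As a sketch of a scheduled snapshot, the script below (assuming a PostgreSQL database and Node.js, with a hypothetical database name and output directory) invokes the pg_dump utility and could be run nightly via cron.

```js
// Minimal sketch of a scheduled backup script, assuming a PostgreSQL database
// and the pg_dump utility. The database name and output directory are
// hypothetical placeholders. Example nightly cron entry:
//   0 3 * * * /usr/bin/node /opt/backup/backup.js
const { execFile } = require('child_process');
const path = require('path');

const dbName = 'experiment_db';                       // hypothetical database
const outFile = path.join('/var/backups/experiment',  // hypothetical directory
  `snapshot-${new Date().toISOString().slice(0, 10)}.dump`);

// -F c writes pg_dump's compressed custom-format archive, restorable with pg_restore.
execFile('pg_dump', ['-F', 'c', '-f', outFile, dbName], (err) => {
  if (err) {
    console.error('Backup failed:', err);
    process.exit(1);
  }
  console.log('Backup written to', outFile);
});
```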