Capabilities for Large-Scale Data Collection and Management

Dear Epicollect5 Team,

I hope this message finds you well. I work as an M&E Manager with Interactive Research and Development (IRD), a global public health research and service delivery organization running a decentralized, community-based program focused on TB, HCV, and Mental Health in rural Sindh, Pakistan. Over the past year, we have screened more than 30,000 individuals through this initiative, supported by TB Reach under the Stop TB Partnership.

I recently explored Epicollect5 while reviewing workshops for the 2024 UNION Annual Lung Health Conference. The tool, highlighted by key institutions and policymakers in TB and lung health, caught our attention for its relevance to our data collection needs. As an organization collaborating with several global health initiatives, we are keen to adopt platforms recognized and supported by leaders in the field.

As we plan to further expand our outreach in the coming year, we are carefully evaluating options for a reliable, scalable, and secure data collection solution. While Epicollect5 appears to align with many of our requirements, we have a few concerns we’d like to address:

Form Constraints: We noted that Epicollect5 supports single and multi-form surveys (up to five hierarchical forms). Since our data requirements may involve more complex structures, are there options for expanding form capacity or introducing custom solutions?

User Management: A critical requirement is controlling user access and data modification. Is there a way to restrict users from modifying or deleting data entries after they have been uploaded to ensure data integrity?

Scalability: Our upcoming initiatives will involve screening and managing a significantly larger number of individuals, and we need to retain all data securely over a longer period. Could you provide details on how Epicollect5 can support large-scale operations in terms of data handling and long-term scalability?

Data Storage & Retention: Given the scale of our project and the comprehensive tracking required for patient journeys (from screening to treatment completion), we are concerned about potential limitations on data storage and retention. While we understand that Epicollect5 does not impose strict data limits, we also understand that projects may be asked to delete large datasets. Could you clarify how this policy applies to long-term, large-scale projects like ours?

We see great potential in using Epicollect5 for our upcoming initiatives and would appreciate any recommendations you have on addressing these concerns. Your feedback will be crucial in helping us determine the suitability of the platform for our needs.

Thank you for your time and consideration.

Thank you for reaching out and for your interest in Epicollect5. To answer your questions:

1. Form Constraints: At present, there are no plans to alter Epicollect5 form design logic, as it sufficiently covers the needs of most users. However, since Epicollect5 is an open-source platform, you have the flexibility to fork the source code and develop custom solutions to better fit your specific requirements. Please note, though, that our team is unable to provide support for custom installations due to resource constraints.

2. User Management: Regarding user access and data modification, Epicollect5 does not currently offer features to restrict users from modifying or deleting data entries post-upload.

3. Scalability: Epicollect5 has demonstrated its capacity to handle large-scale operations effectively. We currently manage over 350,000 users across 137,000 projects, with more than 52 million entries collected. To assess how Epicollect5 can support your specific needs and provide more tailored guidance, could you provide more details about the scale of your upcoming initiatives?

4. Data Storage & Retention: Our fair usage policy for data storage and retention is assessed on a case-by-case basis. To offer a more precise response, we would need to understand the scale of your project in terms of the number of entries and media files you anticipate uploading. Once we have these details, we can provide a more customized assessment of how Epicollect5 can accommodate your long-term data storage and retention needs.

Epicollect5 source code →

We hope this helps.

Thank you for your prompt and detailed response. I appreciate the clarity provided on several aspects of Epicollect5’s functionalities. Given that Epicollect5 is open-source, we also appreciate your suggestion to download and run our own instance. We believe this could help address some of our specific needs, but we would like to clarify the implications this would have on support and functionality. Based on our operational needs and the context of our project, I would like to further elaborate and seek specific clarifications:

Form Constraints and Workflow: Our workflow involves multiple sequential forms, including initial screening (patient info, baseline screening, X-Ray results, RDT test results, sample collections, GeneXpert results, PCR results) and post-diagnosis (treatment referral, initiation, adherence follow-ups, outcome tracking).

Given the complexity of this workflow, the current limit of five hierarchical forms could restrict our ability to capture patient journeys and manage data without gaps. We previously considered using sub-forms and branching features, but we are concerned that their limitations may impede operational efficiency. Could you advise how best to structure our forms within these limits, if that is feasible at all?

Skip Logic Limitations: We have encountered challenges with implementing complex skip logic in our forms. For instance, consider a scenario where a respondent answers a question about their last visit to a health facility. If they select “Other,” they should be directed to a follow-up question asking for specifics. If they choose “Never,” the form should skip directly to the next unrelated question. However, if they select “Last year,” “Last month,” or “Last week,” it should proceed to another question about the reason for their visit. Managing these multiple conditions can lead to confusion and improper routing, affecting the data collection process. How would you recommend structuring the forms to effectively handle such complex skip logic scenarios within the current system limitations?
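To make the routing in this example concrete, here is a minimal Python sketch of the intended skip logic written as a lookup table. It is not Epicollect5 configuration and does not use any Epicollect5 API; the question identifiers (`specify_other`, `visit_reason`, `next_question`) are hypothetical names chosen for illustration only.

```python
# Hypothetical question identifiers, for illustration only.
SPECIFY_OTHER = "specify_other"    # follow-up asking for specifics
VISIT_REASON = "visit_reason"      # follow-up asking the reason for the visit
NEXT_UNRELATED = "next_question"   # next unrelated question

# One entry per possible answer to "last visit to a health facility".
ROUTES = {
    "Other": SPECIFY_OTHER,
    "Never": NEXT_UNRELATED,
    "Last year": VISIT_REASON,
    "Last month": VISIT_REASON,
    "Last week": VISIT_REASON,
}


def route(answer: str) -> str:
    """Return the identifier of the question to show after `answer`."""
    if answer not in ROUTES:
        # Fail loudly so an unmapped option cannot be routed silently.
        raise ValueError(f"No route defined for answer: {answer!r}")
    return ROUTES[answer]
```

Writing every answer option out as one table row makes it easier to check that each option has exactly one destination before the logic is rebuilt in a form designer.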

Patient Record Management: We currently use a patient-centric system where all forms are linked to a patient’s assigned unique ID/QR Code. In Epicollect5, can different health workers access and update the same patient’s records by searching for unique IDs/QR codes to complete additional forms? We tested the barcode/QR code reader functionality to read patient IDs and ensured the “make IDs unique” checkbox was selected to disallow duplicates. Are these IDs/QR codes searchable in the system, allowing other users to look them up and collect more data against these records?

User Management: Maintaining data integrity is crucial for us. While restrictions on modifying or deleting data entries post-upload aren’t currently available, would running our own instance of Epicollect5 allow us to implement such restrictions through customizations?

Offline Use and Workflow Continuity: We understand that Epicollect5 supports offline data entry. However, without server synchronization, newly created patient records are inaccessible to other users during offline use. To work around this in our current system, we ensure the same device that created the patient record follows the patient through all stations, capturing all necessary forms before uploading to the server once connectivity is restored. This method works, but it is inefficient. Does Epicollect5 offer any alternatives to streamline the process in offline environments?

Scalability and Data Storage: We anticipate significant data volume from screening approximately 90,000 individuals over the course of a year. Although the data will grow, volumes will narrow at each stage of the cascade: only about 10% of individuals will need further testing, and fewer than 1% will proceed to treatment. Nonetheless, given the nature of our work, we will require all patient data to be stored and accessible throughout the project.
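As a rough illustration of the cascade, the stage counts can be worked out as below. This is a sketch under one stated assumption: both percentages are taken as shares of the total number screened, which the figures above do not specify. The variable names are illustrative.

```python
# Figures quoted above; the base for each percentage is an assumption.
screened = 90_000

# Assumption: 10% of all screened individuals need further testing.
further_testing = screened * 10 // 100

# Assumption: "under 1%" is also relative to all screened individuals,
# so this is an upper bound rather than an exact count.
treatment_upper_bound = screened * 1 // 100

print(f"Screened: {screened}")
print(f"Further testing: about {further_testing}")
print(f"Treatment: fewer than {treatment_upper_bound}")
```

Under that assumption, roughly 9,000 individuals would reach further testing and fewer than 900 would proceed to treatment per year, while all 90,000 screening records would still need to be retained.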

Additionally, we are aware that Epicollect5 utilizes Blue Ocean for data hosting, which is UK-based. Recent legislation in Pakistan, specifically the Data Protection Bill, prohibits the storage of medical data outside the country. This is a major concern for us, as it affects our compliance with local laws governing medical data. Running our own instance would enable us to keep data local, but as you noted, this would mean we would have to forfeit the support that comes with using the hosted service.

Your insights on these matters will be immensely helpful as we assess the suitability of Epicollect5 for our needs.

Thank you for your time and consideration.

Given the complexity of your requirements and the lack of crucial features in Epicollect5, you might want to consider alternatives like:

Kobo Toolbox →

ODK →

While Epicollect5 is a powerful tool, the specific needs and challenges of your project might be better addressed by platforms like Kobo Toolbox or ODK. These alternatives offer greater flexibility, robust support for complex workflows, and better compliance with local data storage regulations.