In our previous blog post, we highlighted the clinically rich nature of the PHARMO Data Network, which can be leveraged to generate real-world evidence (RWE) in rare diseases. We know this can be challenging for several reasons: diagnosis often requires multiple diagnostic criteria to be met, longitudinal data are needed across multiple settings of care and, most significantly, the number of patients within any single dataset is limited. Pooling datasets is one possible solution to the sample size challenge. However, pooling can be undertaken in multiple forms and at varying levels of complexity. The three common approaches we explore in this article are 1. Harmonization of scientific methods, 2. Pooling of aggregate results, and 3. Pooling of individual patient-level data.

Tiered funnel diagram with 3 segments, largest to smallest. They are labelled: 
1. Low complexity: Harmonization of Scientific Methods
2. Medium complexity: Pooling of Aggregate Data
3. High complexity: Pooling of Patient Level Data

A low-complexity approach to pooling data is the harmonization of scientific methods. This method involves local partners adapting a common protocol/statistical analysis plan (SAP) to fit the local data context, such as the underlying data structure/content and health system variations. Local partners then develop their own programming code to run the analysis on-site and share aggregate tables and figures with the coordinating study center for central reporting. If required, a meta-analysis can be conducted across the data sources. This approach is useful when local data cannot readily be sent off-site, when differences in health system care management are of interest, when the underlying data structure/content varies by source, when speed/time-to-output is critical, and when the budget is limited.
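To make the meta-analysis step concrete, the sketch below pools aggregate results from several sites using fixed-effect inverse-variance weighting. The site names and effect estimates are entirely hypothetical, used only to illustrate how centrally shared aggregates can be combined without any patient-level data leaving a site.

```python
import math

# Hypothetical aggregate results shared by each local partner:
# log hazard ratio and its standard error (no patient-level data).
site_results = {
    "site_A": (0.25, 0.12),
    "site_B": (0.31, 0.20),
    "site_C": (0.18, 0.15),
}

# Fixed-effect inverse-variance weights: w_i = 1 / SE_i^2
weights = {site: 1 / se**2 for site, (_, se) in site_results.items()}
total_w = sum(weights.values())

# Pooled estimate is the weighted average of the site estimates.
pooled = sum(weights[s] * est for s, (est, _) in site_results.items()) / total_w
pooled_se = math.sqrt(1 / total_w)

print(f"Pooled log-HR: {pooled:.4f} (SE {pooled_se:.4f})")
```

In practice a random-effects model may be preferred when health systems differ substantially, since it allows for between-site heterogeneity rather than assuming one common true effect.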

A medium-complexity approach to pooling data is the pooling of aggregate results. This method is more complex, as it involves designing a bespoke common data model (CDM). Data are harmonized locally by data source holders to the CDM, and the coordinating study center writes common programming code, distributing it to local data holders who run the analysis locally. As in the low-complexity approach, local data holders then share aggregate tables and figures with the coordinating study center for central reporting. This approach is useful when the underlying data being pooled are of similar type, granularity, and depth, and when it is important to remove ‘noise’ related to health system differences to reflect the true epidemiology and outcomes of disease and treatment. An example of a medium-complexity approach is a federated model: under this framework, data remain with local partners, but analytics are performed in a coordinated fashion across multiple sites. This ensures data privacy and security while enabling robust, multi-center analyses, and it strengthens the external validity of findings across countries.
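The flow at a single site can be sketched as follows: local records are mapped to the agreed CDM fields, centrally authored analysis code is run on-site, and only aggregates are returned. All field names, codes, and the two-column CDM here are simplified assumptions for illustration, not a real CDM specification.

```python
from collections import Counter

# Hypothetical local source data, in the partner's native format.
local_rows = [
    {"pat_id": 1, "dx_code": "E75.2", "sys": "ICD-10"},
    {"pat_id": 2, "dx_code": "E75.2", "sys": "ICD-10"},
    {"pat_id": 3, "dx_code": "G11.1", "sys": "ICD-10"},
]

# Step 1: harmonize locally to the agreed CDM structure.
cdm_rows = [
    {"person_id": r["pat_id"], "condition_code": r["dx_code"]}
    for r in local_rows
]

# Step 2: run the centrally distributed analysis code on-site,
# producing aggregates only; no patient-level rows leave the site.
aggregate = dict(Counter(r["condition_code"] for r in cdm_rows))

print(aggregate)  # only these counts are shared with the study center
```

Because every site exposes the same CDM structure, the coordinating center can write the analysis code once and trust that it behaves identically everywhere.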

A high-complexity approach to pooling data is the pooling of individual patient-level data. This method follows the same steps as the pooling of aggregate results in the sense that a CDM is used to harmonize data across sources. However, in this scenario, local data partners share the CDM/patient-level data with the coordinating study center to create a combined observational research file. After a series of quality control checks, the coordinating study center takes responsibility for analyzing the observational research file. This approach is useful when certain statistical methods require patient-level comparisons, such as propensity score methods, and when the underlying data being pooled are of similar type, granularity, and depth. However, this method demands significant time to secure agreements from data source holders and ethics committees for transferring individual patient-level data. Although patient-level data are often preferred for the detailed insights they provide, a major challenge is navigating the EU General Data Protection Regulation (GDPR), IT infrastructure, and compliance hurdles to enable the secure transfer of data off-site.
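Propensity score methods illustrate why patient-level data are needed: matching compares individual patients across the combined file. The sketch below performs greedy 1:1 nearest-neighbor matching on pre-computed propensity scores. The patient IDs and scores are hypothetical; in a real study the scores would first be estimated (e.g., by logistic regression) from covariates in the pooled observational research file.

```python
# Hypothetical propensity scores from the combined patient-level file.
treated = {"T1": 0.62, "T2": 0.48, "T3": 0.71}   # patient_id -> score
controls = {"C1": 0.60, "C2": 0.45, "C3": 0.80, "C4": 0.50}

matches = {}
available = dict(controls)
# Match treated patients in ascending score order; each control
# is used at most once (matching without replacement).
for pid, score in sorted(treated.items(), key=lambda kv: kv[1]):
    best = min(available, key=lambda c: abs(available[c] - score))
    matches[pid] = best
    del available[best]

print(matches)
```

This kind of individual-level linkage is impossible when only aggregate tables are shared, which is what justifies the extra governance burden of the high-complexity approach.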

In summary, there are multiple approaches available to increase sample sizes in rare disease research. The most appropriate approach will depend on local-level data access models, similarity of health systems, similarity of data structure across datasets, timelines, and scope. By carefully selecting the pooling method that best fits the study requirements, you can overcome the challenges associated with limited sample sizes and generate robust and meaningful RWE in rare diseases.

To navigate these study design choices and trade-offs effectively, reach out to the Lumanity RWE Consulting team to guide you through the complexities of data pooling and ensure the selection of the optimal approach for your research needs.