Spatio-temporal methods to handle missing data in syndromic surveillance with applications to health management information system data

Syndromic surveillance, an approach monitoring disease symptoms as proxies for cases, serves as a valuable tool for early outbreak detection and the characterization of existing outbreaks (Katz et al., 2011). This approach requires data to be collected on disease symptoms at regular frequency with high accuracy. In low- and middle-income countries, health management information systems (HMISs), which collect data on health-related indicators aggregated by month and health facility, are often utilized for syndromic surveillance (Abat et al., 2016, Fulcher et al., 2021, Ahmad et al., 2015). Importantly, the use of existing HMISs for syndromic surveillance activities requires limited additional resources, can cover large target populations, and can be implemented and scaled quickly.

Existing methods for syndromic surveillance often model regularly reported counts to predict future behavior in a specific geographic region (Chen et al., 2011, Mathes et al., 2017, Chan et al., 2010). However, methods applied to HMIS data often do not appropriately account for incomplete data patterns (Feng et al., 2021) or capture important spatio-temporal features present in HMIS data, potentially limiting the precision and thus the utility of surveillance. Incomplete data causes two problems. First, it can introduce bias into model fitting, particularly if the data is not missing completely at random (MCAR). Second, it makes it difficult to aggregate data from smaller areal units to higher-level geographical units.
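The bias concern can be illustrated with a minimal simulation sketch. It is not the simulation design used in this study: the Poisson counts, the 30% MCAR deletion rate, the count-dependent deletion rule, and all function names are assumptions chosen only for illustration. It compares the mean of the complete counts with the mean of the counts that remain under each missingness mechanism.

```python
import math
import random


def draw_poisson(mean, rng):
    """Draw one Poisson count with Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def observed_mean(counts, drop_probability, rng):
    """Mean of the counts that survive deletion.

    drop_probability maps a count to its chance of going unreported.
    """
    kept = [c for c in counts if rng.random() >= drop_probability(c)]
    return sum(kept) / len(kept)


def compare_missingness(n=5000, true_mean=20.0, seed=0):
    """Return the mean count under complete, MCAR, and non-MCAR reporting."""
    rng = random.Random(seed)
    counts = [draw_poisson(true_mean, rng) for _ in range(n)]
    return {
        "complete": sum(counts) / n,
        # MCAR (assumed rate): every report has the same 30% chance of being missing.
        "mcar": observed_mean(counts, lambda c: 0.3, rng),
        # Not MCAR (assumed rule): high counts are more likely to go unreported.
        "not_mcar": observed_mean(
            counts, lambda c: 0.5 if c > true_mean else 0.1, rng
        ),
    }


if __name__ == "__main__":
    for label, value in compare_missingness().items():
        print(f"{label}: {value:.2f}")
```

Under these assumptions, the MCAR mean stays close to the complete-data mean, while the count-dependent mechanism pulls the observed mean downward, which is the kind of bias a model fitted only to observed counts would inherit.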

There has been significant research on methods to impute missing data (Lin and Tsai, 2020). Some of the most commonly used methods are MICE (Van Buuren and Groothuis-Oudshoorn, 2011), MissForest (Stekhoven and Bühlmann, 2012), and matrix-completion methods (Fuentes et al., 2006). However, these methods are not specifically designed to account for the spatial and temporal relationships that are present in our data. There has been some work developing spatio-temporal imputation methods, usually for geological data, such as CUTOFF (Feng et al., 2014). Other researchers have used spatio-temporal models designed for other purposes to impute missing data (Kim and Pachepsky, 2010, Rouamba et al., 2020, Yang et al., 2018). In our study, we are not primarily concerned with imputing missing data points in a baseline period, but rather with the out-of-sample predictive ability (and uncertainty estimation) of models when data is missing in the baseline. Hence, we do not directly study the popular imputation methods in our simulations, even though many of these methods could be used to impute missing HMIS data. We have identified one study that analyzed the impact of missing data in HMIS in terms of parameter estimation and imputation prediction accuracy (Feng et al., 2021); however, our study focuses on prediction sensitivity and specificity, which are more relevant to outbreak detection in practice.

Syndromic surveillance data often includes spatial and temporal location information, and models that use this information could be advantageous. Spatio-temporal models account for correlation of data over space, such as neighboring healthcare facilities having similar rates of a disease, and over time, such as a facility having similar rates of a disease from month to month. Many spatio-temporal models have been developed – kriging models, regional-effects models, vector autoregressive models, models with separable spatio-temporal components, and hierarchical dynamical spatio-temporal models – that include spatial and temporal correlation in different ways (Cressie and Wikle, 2015). One important distinction in spatio-temporal regression models is whether the spatial and temporal effects are captured from residuals or from outcomes (deemed “error models” and “lag models”, respectively (Tabb et al., 2018)). In this study we examine a regression model containing spatial and temporal lag terms that has been used for infectious disease modeling (Paul et al., 2008, Paul and Meyer, 2016), as well as a model using a conditional auto-regressive (CAR) procedure, a Bayesian spatio-temporal error model that has shown promise in imputing and predicting outcomes in HMIS data (Rouamba et al., 2020).
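To make the idea of a lag model concrete, the following is a minimal sketch of an expected count built from a facility's own previous count (temporal lag) and the average previous count of its neighbors (spatial lag). It is loosely in the spirit of the lag-based models cited above and is not the specification used in this study: the additive form, the equal neighbor weights, the function and parameter names, and the numbers in the example are all assumptions for illustration.

```python
def lag_model_mean(facility, previous_counts, neighbors,
                   endemic_mean, temporal_coef, spatial_coef):
    """Expected count at time t for one facility, given counts at time t-1.

    Hypothetical additive form:
        endemic_mean
        + temporal_coef * (facility's own previous count)
        + spatial_coef  * (average previous count of its neighbors)
    """
    own_lag = previous_counts[facility]
    nearby = neighbors[facility]
    # Equal weights over neighbors (an assumption); 0.0 if there are none.
    neighbor_lag = (
        sum(previous_counts[j] for j in nearby) / len(nearby) if nearby else 0.0
    )
    return endemic_mean + temporal_coef * own_lag + spatial_coef * neighbor_lag


# Illustrative inputs only: three facilities, with A adjacent to B and C.
previous_counts = {"A": 10, "B": 20, "C": 30}
neighbors = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}

expected_a = lag_model_mean(
    "A", previous_counts, neighbors,
    endemic_mean=5.0, temporal_coef=0.5, spatial_coef=0.2,
)
```

Because the lagged outcomes enter the mean directly, a missing count at a facility or at one of its neighbors leaves the next month's expected count undefined unless the gap is handled. An error model instead places the spatial and temporal dependence in the residual structure.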

Although such spatio-temporal methods appropriately capture spatial and temporal relationships, it is still unknown how they will perform when used for syndromic surveillance in the presence of missing indicator counts. Chen et al. compared spatio-temporal models used for syndromic surveillance and found that there has been more focus on temporal models than on spatial or spatio-temporal models (Chen et al., 2011). While many of the models in that paper could be useful in certain situations, they relied on individual-level geo-located data and focused on detecting clusters of outbreaks. Cluster detection is not relevant in the context of our study, where data is grouped into pre-defined jurisdictional units. Mathes et al. compared spatio-temporal models for detecting outbreaks in daily emergency department data (Mathes et al., 2017). They found that a spatial scan statistic performed best at detecting small outbreaks; however, this method relies on geo-located data that is not grouped into areal units. In our study, we focus on areal data, the typical form of HMIS data, which is an important addition to existing research.

The goal of this study is to explore the effect of missing data on syndromic surveillance as applied to HMIS-like data by comparing several statistical modeling approaches. Specifically, this study builds on our team’s prior syndromic surveillance work with HMIS data in Liberia during the COVID-19 pandemic (Fulcher et al., 2021). Notably, our prior work assumed data was missing completely at random and did not leverage the spatio-temporal characteristics of the data. To assess the performance of this prior approach, we incorporate two common spatio-temporal approaches – freqGLM (Paul et al., 2008, Paul and Meyer, 2016) and CAR (Rushworth et al., 2014) – in extensive simulation studies that vary the data generation model, parameters, missingness mechanism, and proportion of missing data.

This study offers insights into the performance of existing methods in realistic situations that can be valuable for those in the field, particularly those in low- and middle-income countries relying on HMIS data. This is particularly important in syndromic surveillance, where many methods already exist but it is unclear which of them are “best” in this context. Here, our scientific contribution is to investigate existing methods and to identify when the use of each is appropriate. Based on our simulation results, we provide recommendations for the practical use of syndromic surveillance with HMIS or similarly structured data.
