MixingInsights: A Framework for Causal Inference with Confounder Representation Learning from Mixed Structured and Textual Data

This section evaluates the performance of MixingInsights using the validation framework from Section 2.4. Since ground-truth causal effects are unavailable in observational studies, we first construct semi-synthetic datasets from real features and text to benchmark the estimators. We then compare the proposed confounder-adjusted estimates against biased baselines that ignore confounding or textual data. Finally, we assess the plausibility of our real-world causal conclusions by consulting existing domain expertise. This validation approach is generalizable to other observational datasets. The following subsections detail the real-world data, semi-synthetic generation, experimental setup, baselines, and results for both synthetic and real data.

Airline Review

 

Table 1 Data description of the airline review data

The real-world airline review data is tabular: each row is an observation and each column is a feature. It includes several structured numerical/categorical columns and one column of free-text comments. For this analysis, we select the data for the economy cabin class. The final dataset consists of 12,326 observations. The columns used are the overall score Y, whether the user agrees that the ticket is cost-effective T, seat comfort score \(V_1\), food & beverage score \(V_2\), cabin staff service score \(V_3\), inflight entertainment score \(V_4\), ground service score \(V_5\), type of traveler \(V_6\), whether the flight has a layover \(V_7\), and the free-text comments \(\textbf\). Table 1 gives a brief description of the data. The causal question is “how much does the perception of the ticket as cost-effective affect customer satisfaction with the flight?”.

Specifically, the treatment T is whether the user agrees that the ticket is good value for money; the outcome Y is the overall score; the structured numerical/categorical confounders V are the seat comfort, food & beverage, cabin staff service, inflight entertainment, and ground service scores, the type of traveler, and whether the flight has a layover; and the text field \(\textbf\) contains the free-text comments.

Semi-synthetic Data

The ground truth of the causal effect is unknown in the real-world airline review dataset. To address this, we construct semi-synthetic data to evaluate whether causal estimators that adjust for the constructed confounders can recover the true causal effect. Real numerical/categorical variables are used as metadata to simulate outcomes. The variables that are withheld from the semi-synthetic data as unobserved confounders are used to identify the related sentences in the raw text reviews, and these sentences form the text field of the constructed data.

We generated four semi-synthetic datasets using the real-world airline review dataset. They are as follows:

Seat_bin: In this dataset, seat comfort (\(V_1\)) serves as the unobserved confounder, and the simulated outcome Y is binary.

Seat_cont: Here, seat comfort (\(V_1\)) is the unobserved confounder, with the simulated outcome Y being continuous.

Seat_Food_bin: This dataset includes seat comfort (\(V_1\)) and food and beverage quality (\(V_2\)) as unobserved confounders, with the simulated outcome Y being binary.

Seat_Food_cont: In this dataset, both seat comfort (\(V_1\)) and food and beverage quality (\(V_2\)) are unobserved confounders, and the simulated outcome Y is continuous.

For each of these semi-synthetic datasets, ten simulations were generated. Specifically, we first use the numerical/categorical variables V to simulate the outcome Y. The continuous outcome of Seat_cont is generated by Eq. (7):

$$\begin{aligned} Y = {}& 2.0 * T + 0.5 * V_1 + \sum_{i=2}^{5} 0.2 * V_i + 0.5 * \mathbb{1}(V_6 = \text{\ldots})\\ & + 0.5 * \mathbb{1}(V_7 = 1) + 0.5 + \varepsilon, \end{aligned}$$

(7)

where the noise \(\varepsilon \sim \mathcal{N}(0,1)\) follows a standard normal distribution. The continuous outcome of Seat_Food_cont is generated by Eq. (8):

$$\begin{aligned} Y = {}& 2.0 * T + \sum_{i=1}^{2} 0.5 * V_i + \sum_{i=3}^{5} 0.2 * V_i + 0.5 * \mathbb{1}(V_6 = \text{\ldots})\\ & + 0.5 * \mathbb{1}(V_7 = 1) + 0.5 + \varepsilon. \end{aligned}$$

(8)

The binary outcome of Seat_bin is obtained by thresholding the continuous outcome simulated by Eq. (7): it is set to 1 if the continuous outcome is greater than 5 and to 0 otherwise. The binary outcome of Seat_Food_bin is generated in the same way from Eq. (8).

The datasets Seat_bin and Seat_cont each have one unobserved confounder \(V_1\). The datasets Seat_Food_bin and Seat_Food_cont have two unobserved confounders \(V_1\) and \(V_2\). The text field \(\textbf\) will only include the sentences related to the aspect of the unobserved confounders. Ideally, we should be able to measure the unobserved confounders from the constructed text field. The datasets Seat_bin and Seat_cont consist of approximately 3,780 observations. The true causal effect of Seat_bin is 0.3523, and the true causal effect of Seat_cont is 2.0. There are 1,904 observations in Seat_Food_bin and Seat_Food_cont. The true causal effect of Seat_Food_bin is 0.3030, and the true causal effect of Seat_Food_cont is 2.0.
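The simulation of the Seat_* outcomes can be sketched as follows. This is a minimal illustration rather than the code used in our experiments: the function name `outcome_eq7`, the toy covariates, and the assumed 1 to 5 score range are hypothetical, the traveler-type category in the indicator is passed in as a precomputed 0/1 flag, and the sum with coefficient 0.2 is taken over \(V_2, \dots, V_5\). The toy "true" effects are computed by contrasting the two potential outcomes under shared noise, which is one possible way to obtain a ground truth and is an assumption here.

```python
import numpy as np

def outcome_eq7(T, V_scores, v6_flag, v7_flag, eps):
    """Continuous outcome in the form of Eq. (7).

    T        : (n,) binary treatment
    V_scores : (n, 5) columns hold V_1..V_5
    v6_flag  : (n,) 0/1 indicator for the traveler-type category (assumed precomputed)
    v7_flag  : (n,) 0/1 layover indicator
    eps      : (n,) noise drawn from N(0, 1)
    """
    return (2.0 * T
            + 0.5 * V_scores[:, 0]                 # unobserved confounder V_1
            + 0.2 * V_scores[:, 1:5].sum(axis=1)   # V_2..V_5
            + 0.5 * v6_flag
            + 0.5 * v7_flag
            + 0.5
            + eps)

# Toy covariates standing in for the real review metadata (hypothetical values).
rng = np.random.default_rng(0)
n = 1000
V_scores = rng.integers(1, 6, size=(n, 5)).astype(float)
v6_flag = rng.integers(0, 2, size=n)
v7_flag = rng.integers(0, 2, size=n)
T = rng.integers(0, 2, size=n)
eps = rng.normal(0.0, 1.0, size=n)

y_cont = outcome_eq7(T, V_scores, v6_flag, v7_flag, eps)  # Seat_cont-style outcome
y_bin = (y_cont > 5.0).astype(int)                        # Seat_bin-style outcome

# Potential outcomes under T=1 and T=0 with the same noise.
y1 = outcome_eq7(np.ones(n), V_scores, v6_flag, v7_flag, eps)
y0 = outcome_eq7(np.zeros(n), V_scores, v6_flag, v7_flag, eps)
true_ace_cont = np.mean(y1 - y0)
true_ace_bin = np.mean((y1 > 5.0).astype(int) - (y0 > 5.0).astype(int))
```

In this sketch the continuous effect equals the coefficient of T, which matches the value of 2.0 reported for Seat_cont. The binary effect depends on the covariate distribution, so the toy value will not reproduce the reported 0.3523.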

Basic Experimental Setting

For the multi-modal confounder representation, we use an MLP with two hidden layers of 32 units each to represent the numerical/categorical features. For the text data, we append a linear layer with 32 units to the language embedding. We use a linear layer to predict the continuous outcome and a linear layer followed by a softmax activation function to predict the binary outcome. The networks use the ReLU activation function and dropout regularization with a dropout rate of 0.5. We optimize the models using AdamW [38] with a learning rate of \(\gamma = 2e^\), \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), and \(\epsilon = 1e-8\). We use the distilbert-base-uncased model from the Transformers library [39] and truncate review texts to 512 tokens. We train on the objective function until convergence.

The anchor strength of Anchored CorEx is set to 6 for all experiments. The number of topics is set to the number of topics anchored by the anchor word lists plus two additional topics. The binary latent topic variables of the anchored topics are used as measured confounders. We use the corextopic package [35] for Anchored CorEx. To identify keywords for the anchored topics, we use the unsupervised method YAKE [40] to extract keyterms from each review text. We consider unigrams as keyterm candidates and compute the frequency of each keyterm. We then select the anchor words for each anchored topic from the two hundred most frequent keyterms.
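The frequency ranking of keyterm candidates can be illustrated with a short sketch. It assumes that the keyterms have already been extracted per review (for example with YAKE) and only shows the counting and top-k selection. The function name `top_keyterms`, the toy reviews, and the choice to count a keyterm at most once per review are assumptions made for illustration.

```python
from collections import Counter

def top_keyterms(keyterms_per_review, k=200):
    """Return the k most frequent unigram keyterms across all reviews."""
    counts = Counter()
    for terms in keyterms_per_review:
        # Keep unigrams only, lowercase them, and drop duplicates within
        # a review while preserving their order (assumed counting rule).
        unigrams = dict.fromkeys(t.lower() for t in terms if " " not in t)
        counts.update(list(unigrams))
    return [term for term, _ in counts.most_common(k)]

# Hypothetical keyterms for four reviews.
toy_keyterms = [
    ["seat", "legroom", "seat pitch"],
    ["seat", "food"],
    ["food", "Seat"],
    ["delay"],
]
candidates = top_keyterms(toy_keyterms, k=2)
```

Anchor words for each topic are then picked by hand from the candidate list, so the ranking only narrows the vocabulary that has to be inspected.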

Except for causal estimators with the multi-modal confounder representation, the propensity score e(z, v) is fit with a logistic regression using T, V, and \(\hat{Z}\). For a continuous outcome, Q(t, z, v) is fit with a linear regression; for a binary outcome, it is fit with a logistic regression. We also report the biased estimate of the “naive” estimator \(\mathbb{E}[Y|T=1] - \mathbb{E}[Y|T=0]\), which is the difference in expected outcomes conditional on T. The “naive” estimator does not adjust for confounders, so its estimate typically deviates significantly from the ground truth.
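To make the bias of the “naive” estimator concrete, the following sketch computes the difference in group means on toy data with a single binary confounder. The data-generating process, the coefficients, and the name `naive_estimate` are hypothetical and unrelated to the airline data; the only purpose is to show that the unadjusted contrast absorbs the confounder's effect.

```python
import numpy as np

def naive_estimate(y, t):
    """Difference in mean outcomes between treated and untreated units."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t)
    return y[t == 1].mean() - y[t == 0].mean()

# Toy confounded data: v raises both the treatment probability and the outcome.
rng = np.random.default_rng(0)
n = 20_000
v = rng.integers(0, 2, size=n)
t = rng.binomial(1, 0.2 + 0.6 * v)                    # P(T=1 | v) is 0.2 or 0.8
y = 2.0 * t + 3.0 * v + rng.normal(0.0, 1.0, size=n)  # true effect of T is 2.0

naive = naive_estimate(y, t)
```

In this toy setting the naive contrast is centered near 2.0 + 3.0 * (0.8 - 0.2) = 3.8 rather than the true effect of 2.0, because treated units are more likely to have v = 1.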

Baselines

To establish baselines for confounder representation, we implemented a suite of methods that derive representations from text and combine them with the structured covariates. These included a Bag-of-Words (BoW) model, a Latent Dirichlet Allocation (LDA) topic model [41] (with the number of topics matched to the KATM configuration), and two variants in which LDA was augmented with the two different sentiment representations from Section 2.2 (referred to as LDA+LIWC and LDA+LM). Additionally, KATM, LIWC, and LM were applied to consider topic and sentiment representations separately. The resulting text-based representations were then concatenated with the structured variables V to form the final adjustment set. The performance of causal estimators employing these composite confounder sets is reported in the following sections.

To demonstrate that MixingInsights provides a general framework robust to the choice of causal estimator, we also implement two alternative estimation approaches, Augmented Inverse Probability of Treatment Weighting (AIPTW) [42] and the Q-only estimator. The AIPTW estimator for the ACE is given by

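$$\begin{aligned} \hat{\text{ACE}}_{\text{AIPTW}} &= \frac{1}{n} \sum_{i=1}^{n} \Big[ \frac{t_i y_i}{\hat{e}_i} + \Big(1 - \frac{t_i}{\hat{e}_i}\Big)\hat{Q}(1,z_i,v_i) \\ &\quad - \Big(\frac{(1-t_i) y_i}{1-\hat{e}_i} + \Big(1 - \frac{1-t_i}{1-\hat{e}_i}\Big)\hat{Q}(0,z_i,v_i) \Big) \Big], \end{aligned}$$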

(9)

where n is the sample size, \(t_i\) is the treatment indicator, \(y_i\) is the observed outcome, \(\hat{Q}(t,z_i,v_i)\) is the estimated conditional outcome expectation under treatment t, and \(\hat{e}_i\) is the estimated propensity score. The Q-only estimator, which relies solely on the outcome model, is defined as

$$\hat{\text{ACE}}_{\text{Q}} = \frac{1}{n} \sum_{i=1}^{n} \big[\hat{Q}(1,z_i,v_i) - \hat{Q}(0,z_i,v_i)\big].$$

(10)

Here, \(\hat{Q}(t,z_i,v_i)\) is again the estimated potential outcome for individual i under treatment t. Comparing results across these distinct estimators allows us to assess the robustness of the causal conclusions derived from our framework.
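Both estimators reduce to a few array operations once the nuisance estimates are available. The sketch below uses the standard AIPTW form and takes the fitted propensity scores and outcome predictions as inputs; the function names, the toy data, and the use of oracle nuisance values in place of fitted models are assumptions made for illustration, not the implementation used in our experiments.

```python
import numpy as np

def aiptw_estimate(y, t, e_hat, q1_hat, q0_hat):
    """AIPTW estimate of the ACE from outcomes, treatments, and nuisance estimates."""
    treated = t * y / e_hat + (1.0 - t / e_hat) * q1_hat
    control = ((1.0 - t) * y / (1.0 - e_hat)
               + (1.0 - (1.0 - t) / (1.0 - e_hat)) * q0_hat)
    return np.mean(treated - control)

def q_only_estimate(q1_hat, q0_hat):
    """Q-only estimate: mean difference of the predicted potential outcomes."""
    return np.mean(q1_hat - q0_hat)

# Toy data with one binary confounder v (hypothetical, not the airline data).
rng = np.random.default_rng(0)
n = 20_000
v = rng.integers(0, 2, size=n)
e_true = 0.2 + 0.6 * v
t = rng.binomial(1, e_true)
y = 2.0 * t + 3.0 * v + rng.normal(0.0, 1.0, size=n)

# Oracle nuisance values stand in for the fitted e(z, v) and Q(t, z, v).
q1_hat = 2.0 + 3.0 * v
q0_hat = 3.0 * v

aiptw = aiptw_estimate(y, t, e_true, q1_hat, q0_hat)
q_only = q_only_estimate(q1_hat, q0_hat)
```

With correctly specified nuisance values both estimates are close to the toy effect of 2.0. In practice the two can differ when the outcome model or the propensity model is misspecified, which is why we compare estimators.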

Table 2 Performance of confounder representation methods on semi-synthetic datasets

Results on Synthetic Data

To quantitatively evaluate the performance of various confounder representation methods, we conducted experiments on four semi-synthetic datasets derived from the real-world airline reviews. Table 2 displays the Mean Squared Error (MSE) and standard deviation (STD) for three causal estimators: Q-only, IPTW, and AIPTW, across different methods. Each cell in the table is formatted as MSE / STD, where a lower MSE means greater accuracy in estimating the true Average Causal Effect (ACE), and a lower standard deviation indicates more robustness across various estimators.
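One way to read a cell of Table 2 is sketched below. The aggregation is an assumption: the squared error is averaged over all available estimates for a representation method (here one estimate per estimator), and the spread is the population standard deviation of those same estimates. The function name `summarize_estimates` and the numbers are placeholders, not results from our experiments.

```python
import numpy as np

def summarize_estimates(estimates, true_ace):
    """Return (MSE against the true ACE, STD across the supplied estimates)."""
    est = np.asarray(list(estimates.values()), dtype=float)
    mse = float(np.mean((est - true_ace) ** 2))
    std = float(np.std(est))  # population STD (ddof=0), an assumed convention
    return mse, std

# Placeholder estimates for one representation method on one dataset.
toy_estimates = {"Q-only": 1.9, "IPTW": 2.1, "AIPTW": 2.0}
mse, std = summarize_estimates(toy_estimates, true_ace=2.0)
```

Under this reading, a method whose three estimators agree with each other but share a bias would have a low STD and a high MSE, so the two numbers should be interpreted together.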

A consistent trend across all datasets is that incorporating text information, regardless of the confounder representation method used, generally leads to better performance than using the structured variables (V) alone. Among the baseline text representation methods, V+LDA+LIWC and V+LIWC deliver competitive results, notably on the Seat_bin dataset, where they achieve the lowest MSE (0.181). This indicates that sentiment information (LIWC) provides a valuable contribution. However, neither method consistently outperforms the others across all settings.

Our proposed method, V+KATM+LIWC, exhibits the most balanced and robust performance. It achieves the best MSE in all four datasets and outperforms all the other representation methods in the most challenging setting, Seat_Food_cont, which includes two unobserved confounders and a continuous outcome. Moreover, its STD across estimators remains consistently moderate, indicating reliable estimates irrespective of the causal estimator employed. This combination of high accuracy and robustness underscores the effectiveness of its domain-guided and interpretable proxy construction.

In contrast, the multi-modal method displays considerable performance fluctuation. Although it reaches a moderate STD (0.150) in Seat_Food_cont, its MSE values are substantially higher than those of other methods. When examining the IPTW estimator alone (Fig. 5), however, the multi-modal method yields the best MSE in three of the four datasets. The instability of the multi-modal method, particularly under continuous-outcome conditions, aligns with our earlier discussion on the sensitivity of deep representations to model specification and sample size. It further supports the view that purely performance-oriented approaches may compromise reliability when data conditions are suboptimal.

Fig. 5 Experimental results for synthetic datasets. The panels show the estimated causal effects of the treatment on the outcome for the (a) Seat_bin, (b) Seat_cont, (c) Seat_Food_bin, and (d) Seat_Food_cont datasets, using various confounder representation methods: V (structured confounders only), V+BoW, V+LDA, V+LDA+LIWC, V+LDA+LM, V+KATM, V+LIWC, V+LM, V+KATM+LIWC, V+KATM+LM, and the multi-modal approach. The ground-truth causal effect is shown as a green line for reference, and the estimate of the “naive” estimator, which does not adjust for confounders, is shown as a red line

Fig. 6 Convergence plots of causal effect estimators on semi-synthetic datasets. The 2\(\times \)2 panel presents results across four datasets. Each subplot shows estimated causal effects (y-axis) for different confounder representation methods (x-axis). Three lines represent the Q-only, IPTW, and AIPTW estimators, with a horizontal dashed line indicating the ground-truth ACE. The plots visually demonstrate how representations from MixingInsights, particularly V+KATM+LIWC and multi-modal variants, cause the three lines to converge closely around the true effect, thereby mitigating estimator selection sensitivity and providing strong evidence for the framework’s effectiveness

Experimental results shown in Fig. 5 demonstrate that confounder-adjusted estimators not only significantly outperform the unadjusted naive estimator but, when they incorporate textual data, also surpass methods using only the structured confounders V. The multi-modal confounder representation obtained the closest estimates to the true causal effect on three of the four datasets: Seat_bin, Seat_cont, and Seat_Food_bin. The V+KATM+LIWC estimator demonstrated remarkable robustness and a superior balance between precision and interpretability, securing the best results on Seat_Food_cont and the second-best on Seat_bin, Seat_cont, and Seat_Food_bin. Notably, the multi-modal method performed worst on Seat_Food_cont, a fluctuation we attribute to two factors: (i) the dataset involves two unobserved confounders, increasing the complexity of representation learning, and (ii) it contains only half the number of observations compared to the Seat_cont dataset, which has a single unobserved confounder. These conditions make it particularly difficult to estimate the outcome model Q(t, z, v) accurately.

Figure 6 presents convergence plots of causal effect estimators on semi-synthetic datasets, offering direct visual evidence that the representations learned by MixingInsights enable diverse estimators to converge toward the ground-truth treatment effect. The 2\(\times \)2 panel displays results across four semi-synthetic datasets, each constructed with a known true ACE. Within each subplot, the x-axis lists different confounder representation methods (ordered consistently with Table 2), while the y-axis shows the estimated causal effect. Three lines trace the point estimates from the Q-only, IPTW, and AIPTW estimators; a horizontal dashed line marks the dataset’s ground-truth ACE. Light shaded areas optionally indicate 95% confidence intervals. The plots illustrate that baseline representations often produce dispersed estimates across the three estimators, reflecting estimator-selection sensitivity. In contrast, at the points corresponding to MixingInsights representations, specifically V+KATM+LIWC and the multi-modal variant, the three estimator lines converge to the true-effect line. This pattern holds consistently across all four datasets, demonstrating that the framework’s learned representations robustly align different estimation strategies with the true causal effect. Notably, compared to using only the structured variables V, the estimates under MixingInsights show a clear reduction in bias and variance, regardless of the causal estimator employed. Thus, Fig. 6 provides strong support that MixingInsights mitigates estimator sensitivity and enhances the reliability of causal inference in the presence of multi-modal confounding.

Table 3 The anchored topic and topic aspects discovered by Anchored CorEx for the Seat_bin and Seat_cont datasets

Table 4 The anchored topics and topic aspects discovered by Anchored CorEx for the Seat_Food_bin and Seat_Food_cont datasets

Finally, we detail the anchored topics used by KATM in our experiments. For the Seat_* datasets, the anchor word list was [“seat”, “comfortable”, “uncomfortable”], guiding a topic that captures descriptions of seat comfort during flights. For the Seat_Food_* datasets, two anchor lists were used: [“seat”, “comfortable”, “uncomfortable”] and [“food”, “meal”, “drink”], steering topics toward seat comfort and food/beverage service, respectively. The resulting topic aspects in Tables 3 and 4 are interpretable and domain-relevant, ensuring semantic grounding of the proxies.

Results on Real-world Data

We apply causal estimators that adjust for confounding bias with our proposed methods to the real-world airline review dataset. The treatment is whether the customer agrees that the ticket price is cost-effective. The outcome is the customer satisfaction rating. For KATM, the anchor words are [“punctual”, “delay”] and [“bag”, “luggage”, “baggage”]. These two topics capture whether flights are delayed and how customers’ luggage is handled. The anchor words were selected in a two-step process. First, we extracted keywords from the text reviews (e.g., “tv”, “tv show”, “luggage”, “bag”, “delay”, “departure time”) and clustered them into semantic topics. Second, we selected anchor words based on domain knowledge, prioritizing aspects not already captured in the structured ratings. For example, while entertainment-related keywords (e.g., “tv”) were already covered by the existing “entertainment” rating in Table 1, operational factors such as flight delays and baggage service were absent from the structured variables. The resulting topics and their associated aspects for the airline review dataset are summarized in Table 5.

Table 5 The anchored topics and topic aspects discovered by Anchored CorEx for the airline review data

Fig. 7 Comparison of causal effect estimates for the impact of perceived price value on overall satisfaction using selected confounder representation methods. Point markers represent the mean estimate from each causal estimator (Q-only, IPTW, AIPTW), and vertical error bars denote the standard deviation

The direct comparison of customer satisfaction ratings using the “naive” estimator yields a difference of 5.4303. However, this estimate does not adjust for confounders, which often introduces significant bias (see previous research [43] for a further discussion of bias). After adjusting for the numerical and categorical confounders, the estimated effect drops to approximately 2.0. Furthermore, by incorporating textual information through our proposed framework, as illustrated in Fig. 7, we find that perceiving the price as cost-effective leads to an increase in customer satisfaction of less than two stars on the rating scale. The convergence of results across three distinct causal estimators and various confounder representation methods highlights the robustness and consistency of the findings, demonstrating the effectiveness of the MixingInsights framework.

Since the ground truth of the causal effect is unknown, we check our causal conclusion against a previous investigation [5] conducted by domain experts. That investigation assessed the attributes that lead to customer satisfaction and dissatisfaction with economy cabins. Its findings show that friendliness and helpfulness of staff, seat comfort and legroom, luggage, and flight disruptions are key factors in passenger satisfaction. The investigation does not examine the effect of price, so it does not conclude that price is a key factor in customer satisfaction. Our causal conclusion presents a new finding that the price has a significant influence on customer satisfaction.
