Integrated artificial intelligence in healthcare and the patient’s experience of care

To develop a computational model to reason over general preferences and attitudes around the integration of AI into healthcare, we introduce the AI affinity coefficient $\alpha \in (-1, 1) \subset \mathbb{R}$ as a measure of the deviation of an answer to a survey question, $Q_{i} \in Q$, from neutrality, which is realized as $\alpha = 0$. When a response is in favor of AI, $\alpha \rightarrow 1$; when a response is not in favor of AI, $\alpha \rightarrow -1$. The realized value of $\alpha$ depends on the strength of the sentiment expressed. For each study participant, we calculate an AI affinity score such that for the $k$th respondent the following holds:

$$\begin{aligned} \textit{AI affinity score}(A_{k}) = \prod _{i=1}^{n}\alpha _{i}^{k}W(Q_{i}) \end{aligned} \tag{1}$$

We choose $W(Q_{i}) \in (0, 1)$ such that

$$\begin{aligned} \sum _{i=1}^{n} W(Q_{i}) = 1 \end{aligned} \tag{2}$$

where $\alpha _{i}^{k}$ is the AI affinity coefficient of the $k$th respondent's response to the $i$th question, $W(Q_{i})$ is the weight assigned to the $i$th question, and $n$ is the total number of related questions in the survey.

In this study, the weights of the questions in the survey were selected based on expert opinion on their perceived importance or influence (implicit or otherwise) on AI affinity. Subsequently, we present a deep learning model that predicts AI Affinity Scores towards the determination of the degree of AI integration into care that will impact a patient’s experience of care.

Supervised learning for predicting AI affinity scores

The dataset used for model prediction included 24 predictors derived from 320 patient survey responses. The patient cohort ranged in age from 18 to over 46 years, grouped into the categories 18–25, 26–35, 36–45, and 46+. The predictors covered demographics such as Gender, Education, Region, and Occupation, as well as familiarity with and attitudes toward AI and robotics in healthcare. First, we performed principal component analysis (PCA) to handle the high-dimensional data, minimize redundancy, and determine the top five relevant features for the prediction. Specifically, PCA revealed that the most influential predictors included patients’ attitudes toward AI integration, their concerns about the use of AI and robot assistants in healthcare and the service industry, digital health usage behaviors and familiarity, and their level of trust in AI tools. Subsequently, we partitioned the dataset using a 60/20/20 train/test/validation split: 60% of the data is used to fit the model, 20% to evaluate the trained model, and 20% to validate it.
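The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the 24 predictors have already been numerically encoded, uses NumPy's SVD in place of whichever PCA implementation was actually used, and substitutes random placeholder data for the survey responses. The function names `pca_transform` and `split_60_20_20` are hypothetical.

```python
import numpy as np

def pca_transform(X, n_components):
    """Project the rows of X onto its leading principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # already ordered by decreasing explained variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    scores = X_centered @ Vt[:n_components].T
    return scores, explained_ratio[:n_components]

def split_60_20_20(n_samples, rng):
    """Return shuffled index arrays for the train, test, and validation sets."""
    idx = rng.permutation(n_samples)
    n_train = (n_samples * 60) // 100
    n_test = (n_samples * 20) // 100
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]

rng = np.random.default_rng(0)
X = rng.normal(size=(320, 24))  # placeholder for the encoded survey predictors
components, explained_ratio = pca_transform(X, n_components=5)
train_idx, test_idx, val_idx = split_60_20_20(len(X), rng)
print(components.shape, len(train_idx), len(test_idx), len(val_idx))
```

The order here follows the text (PCA first, then the split). Fitting the PCA on the training portion only and applying it to the other two portions is a common alternative when leakage between splits is a concern.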

Model training

The models were trained to support continuous and categorical prediction of AI Affinity Scores. They include a deep learning regression model, a classification model using the same deep learning architecture, a baseline linear regression model, and a Random Forest classifier for interpretability.

We implemented the regression model as a feedforward neural network (FNN) organized in three parts: an input layer that accepts the top 10 PCA-transformed components, two hidden layers with 64 and 32 neurons, each using ReLU activation, and an output layer containing a single neuron with a linear activation function for affinity score prediction. We use dropout layers with a 20% rate after each hidden layer and an Adam optimizer with a learning rate of 0.001. The loss function is the Mean Squared Error (MSE), i.e., the average squared difference between observed and predicted affinity scores. Training with the 60/20/20 train/test/validation split was configured for up to 2000 epochs with a batch size of 32. Because the dataset is small (approximately 300 samples), we included an EarlyStopping callback with a patience of 50 epochs to prevent overfitting: training stopped automatically once the model’s performance on the validation set plateaued, rather than running for the full 2000 epochs. The outcome of the trained model is the Affinity Score, a metric that captures participants’ degree of receptiveness to the use of AI and robots as assistants in healthcare.

The classification model shares the same architecture but uses a softmax-activated output layer with three units for the affinity categories (Low = 0, Medium = 1, High = 2). Affinity scores were binned into these three ordinal classes using quantile-based cutoffs. Since the resulting distribution of affinity labels is imbalanced, with the “Medium” class dominating, we incorporated class weights during training; the weights were computed from the inverse class frequencies and used to balance the loss contribution across categories.

The third model was a linear regression model trained on the same PCA features to compare performance. This basic model provides a baseline for continuous prediction that is interpretable and easy to implement. Additionally, we trained a Random Forest classifier using the binned affinity classes. This model is useful for evaluating robustness across categorical prediction tasks and enables further comparison with the neural classification model. All deep learning models were built and trained using the TensorFlow library, with the Keras API used to define the neural network architectures. The classical models (Random Forest and Linear Regression) were implemented using scikit-learn.
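The binning and class-weighting steps can be illustrated with a short sketch. The text does not state which quantiles were used as cutoffs, so the 25th and 75th percentiles below are an assumption chosen only because they produce a dominant Medium class; the function names and placeholder scores are likewise hypothetical. The weight formula `n_samples / (n_classes * count)` is one common inverse-frequency normalization.

```python
import numpy as np

def bin_by_quantiles(scores, lower_q=0.25, upper_q=0.75):
    """Map continuous scores to 0 (Low), 1 (Medium), 2 (High)."""
    low_cut, high_cut = np.quantile(scores, [lower_q, upper_q])
    # digitize returns 0 below low_cut, 1 in [low_cut, high_cut), 2 otherwise.
    labels = np.digitize(scores, [low_cut, high_cut])
    return labels, (low_cut, high_cut)

def inverse_frequency_weights(labels, n_classes=3):
    """Weight each class by the inverse of its relative frequency."""
    counts = np.bincount(labels, minlength=n_classes)
    n_samples = len(labels)
    return {
        c: float(n_samples / (n_classes * counts[c]))
        for c in range(n_classes)
        if counts[c] > 0
    }

rng = np.random.default_rng(1)
scores = rng.normal(0.0, 0.3, size=320)  # placeholder affinity scores
labels, cutoffs = bin_by_quantiles(scores)
weights = inverse_frequency_weights(labels)
print(np.bincount(labels, minlength=3), weights)
```

A dictionary of this shape, mapping class index to weight, is the form accepted by the `class_weight` argument of Keras's `Model.fit`.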

Model evaluation

For the deep learning regression model, evaluation on the test set yielded a low MSE of 0.0020. The $\text {R}^{2}$ score was 0.9339, indicating strong agreement between predicted and observed values. Figure 5 shows the squared errors sorted in ascending order (blue line); the rising tail indicates a subset of samples for which the model’s predictions are less accurate. The red line, which represents the MSE, serves as a reference for the model’s overall performance: samples above this line are those where the model struggles more with prediction. To further evaluate model performance, we applied a paired t-test to determine whether the differences between the actual and predicted affinity scores are statistically significant. The test produced a t-statistic of 0.5043 and a p-value of 0.6158. Since the p-value is well above the commonly used significance level of 0.05, we fail to reject the null hypothesis. This indicates no significant difference between the mean predicted and actual scores, suggesting that the predictions are not systematically biased relative to the ground truth. The Mean Absolute Error (MAE), which quantifies the average magnitude of prediction error, was 0.0356, meaning that on average the predicted scores deviate from the true values by just 0.0356 units. We also observed that the deep learning model tends to regress toward the mean, producing predictions close to the average and with reduced variance, which is common among models trained on small datasets. For comparison, we trained a linear regression model. Surprisingly, the linear model outperformed the deep learning model, achieving an $\text {R}^{2}$ of 0.91, a lower MSE of 0.0016, and an MAE of 0.0327.

The second model, the deep learning classifier, achieved a test accuracy of 90%, but the confusion matrix revealed that most predictions fall into the “Medium” category. This is likely due to class imbalance in the data. To reduce this bias, we applied class weighting during training, which improved sensitivity for the “Low” and “High” classes. However, class imbalance remains a limitation in the overall classification performance.

The third model, the linear regression baseline, performed well, achieving a low MSE of 0.0017, an MAE of 0.0337, and a high $\text {R}^{2}$ score of 0.9388. This suggests that the model was able to closely approximate the continuous Affinity Scores. These results demonstrate that, for small datasets, linear models can perform well with minimal overfitting.

The last model, the Random Forest classifier, achieved a test accuracy of 82%, with the highest F1 scores in the Medium and High categories. It attained a precision of 0.89 and a recall of 0.80 for the High category, as well as a precision of 0.79 and a recall of 0.98 for the Medium category. However, it struggled with the Low category, yielding a much lower recall of 0.31, which indicates frequent misclassifications. The confusion matrix confirms that while the model correctly identified most Medium scores, many Low scores were misclassified as a result of the underlying distribution of the data.
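For reference, the regression metrics reported above can be computed as in the following sketch. It uses the standard definitions of MSE, MAE, and $\text {R}^{2}$, plus the paired t-statistic (mean of the paired differences divided by its standard error). The function name `regression_report` and the four-sample example are illustrative assumptions, not the study's data.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MSE, MAE, R^2, and the paired t-statistic for two score arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals**2)
    mae = np.mean(np.abs(residuals))
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Paired t-statistic: mean difference over its standard error.
    n = residuals.size
    t_stat = residuals.mean() / (residuals.std(ddof=1) / np.sqrt(n))
    return {"mse": mse, "mae": mae, "r2": r2, "t_stat": t_stat}

# Toy values chosen so the metrics are easy to verify by hand.
report = regression_report([0.1, 0.2, 0.3, 0.4], [0.1, 0.25, 0.25, 0.4])
print(report)
```

Note that the paired t-statistic depends only on the mean and spread of the differences, so it detects systematic over- or under-prediction, while MSE, MAE, and $\text {R}^{2}$ describe the size of the individual errors.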

The confusion matrix in Fig. 6 helps evaluate the performance of the deep learning regression model in classifying AI Affinity Scores. For visualization, the model’s continuous predictions and the corresponding ground truth values were discretized post hoc into three ordinal categories: “0” for Low, “1” for Medium, and “2” for High. This binning was applied only after model training and did not affect the regression model itself. The diagonal elements represent correct predictions and show strong performance for the Medium category, with 43 correct predictions. Additionally, 8 correct predictions were made for the Low category. The High category performed the worst, with only 3 correct predictions, indicating that the model struggles to accurately classify high-affinity scores. Misclassifications appear in the off-diagonal elements. Specifically, Medium scores were often misclassified as both Low and High, and High scores were misclassified as Medium in five instances. This pattern suggests that the model has difficulty distinguishing between Medium and High scores, possibly due to overlapping feature distributions.
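The post-hoc discretization described above can be sketched as follows. The cutoff values and the four example scores are hypothetical, since the actual bin boundaries are not stated in the text; rows index the true class and columns the predicted class.

```python
import numpy as np

def confusion_from_continuous(y_true, y_pred, cutoffs, n_classes=3):
    """Bin continuous scores with shared cutoffs and count (true, predicted) pairs."""
    true_cls = np.digitize(y_true, cutoffs)
    pred_cls = np.digitize(y_pred, cutoffs)
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # add.at accumulates correctly even when the same cell is hit repeatedly.
    np.add.at(matrix, (true_cls, pred_cls), 1)
    return matrix

# Hypothetical cutoffs: Low < -0.2 <= Medium < 0.3 <= High
cutoffs = [-0.2, 0.3]
y_true = np.array([-0.5, 0.0, 0.0, 0.6])
y_pred = np.array([-0.4, 0.1, 0.5, 0.2])
matrix = confusion_from_continuous(y_true, y_pred, cutoffs)

# Per-class recall: correct predictions divided by the number of true members.
recall = np.diag(matrix) / matrix.sum(axis=1)
print(matrix)
print(recall)
```

Using the same cutoffs for the observed and predicted scores keeps the two sets of labels comparable, and the regression model itself is untouched by the binning.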

Comparison between the predicted and observed affinity scores

The plot in Fig. 8 shows the relationship between the observed (true) AI Affinity Scores and the model’s predicted AI Affinity Scores. The red dashed line represents the ideal case in which the predicted values match the observed values exactly. A key observation from Fig. 8 is that most points cluster around the red dashed line, indicating that the model’s predictions are reasonable when compared to the actual values.

Paired t-test for synthetic data evaluation

The total sample size is 320. We generated 80 additional samples synthetically by bootstrapping, i.e., resampling the original dataset with replacement. Each sample in this synthetic dataset retained the same feature distribution as the original data. The AI Affinity Scores of the 80 synthetic samples were treated as the observed scores for evaluation. To simulate the deep learning model’s predictions, we added random Gaussian noise $\sim N(\mu =0, \sigma =0.05)$ to the observed scores; these noisy scores represent the predicted scores for the synthetic dataset. We performed a paired t-test comparing the observed and predicted affinity scores for the 80 synthetic rows and obtained a t-statistic of 0.496 and a p-value of 0.621. This p-value indicates no significant difference between the observed and predicted scores, showing stable model predictions. These results from the synthetic data evaluation show a strong alignment between the model’s predictions and the observed values and support the robustness of the trained model, as seen in Fig. 7. Overall, the model’s predictions align closely with the observed values (low MAE and strong alignment in the scatter plot), and the high p-value from the t-test is consistent with this agreement. This shows that the model effectively captures the relationship between features and the target variable.
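The synthetic evaluation procedure can be reproduced in outline with the sketch below. It assumes NumPy and SciPy are available, uses evenly spaced placeholder scores instead of the survey-derived ones, and the function name `simulate_paired_test` and the seed are hypothetical. `scipy.stats.ttest_rel` performs the paired t-test.

```python
import numpy as np
from scipy import stats

def simulate_paired_test(scores, n_synthetic=80, noise_sd=0.05, seed=0):
    """Bootstrap observed scores, add Gaussian noise, and run a paired t-test."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Bootstrap: draw row indices with replacement from the original scores.
    idx = rng.integers(0, scores.size, size=n_synthetic)
    observed = scores[idx]
    # Simulated predictions: observed scores plus N(0, noise_sd) noise.
    predicted = observed + rng.normal(0.0, noise_sd, size=n_synthetic)
    result = stats.ttest_rel(observed, predicted)
    return observed, predicted, result.statistic, result.pvalue

placeholder_scores = np.linspace(-1.0, 1.0, 320)  # stand-in for real scores
observed, predicted, t_stat, p_value = simulate_paired_test(placeholder_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

Because the simulated predictions differ from the observed scores only by zero-mean noise, the mean paired difference is expected to be close to zero, so a large p-value is the typical outcome of this setup.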