Intelligent IoT-Based Mental Health Prediction Framework for Smart Cities Using Deep Learning: Integrating Facial Emotions and Questionnaire

For various reasons, individuals experience worry or distress without fully recognizing the state of their mental health. This research proposes a novel approach that predicts an individual's potential mental health condition and offers the necessary support using both IoT and computational intelligence, as shown in Fig. 2. Table 2 presents the pseudocode of the proposed strategy, and Table 3 depicts its workflow. The proposed system uses a hybrid of feature-level and decision-level fusion to incorporate the textual and facial modalities. First, the textual features obtained from the RoBERTa-based stress assessment and the visual features from the facial emotion recognition module are processed and normalized to obtain representations of similar dimensionality. These representations are merged at the feature level and then passed through fully connected layers to learn inter-modal correlations and complementary patterns. This enables the model to jointly interpret indicators of linguistic stress and facial emotional expressions. A decision-level refinement step, which aggregates modality-specific confidence scores with adaptive weighting, is added to further increase robustness. The weighting gives more significance to the modality with the higher predictive confidence, so classification remains reliable even when one of the modalities is noisy or partially unavailable. The multimodal fusion approach improves prediction precision, stability, and generalization because it combines the strengths of text and images to provide an overall evaluation of mental health.
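The decision-level refinement can be illustrated with a minimal sketch. It assumes that each modality outputs a probability vector over the three stress classes and that a modality's confidence is its largest class probability; the function name `fuse_decisions` and this confidence measure are illustrative assumptions, not details reported for the proposed system.

```python
from typing import List, Optional, Sequence

CLASSES = ["normal", "moderately stressed", "highly stressed"]


def fuse_decisions(text_probs: Optional[Sequence[float]],
                   face_probs: Optional[Sequence[float]]) -> List[float]:
    """Confidence-weighted average of per-modality class probabilities.

    Assumption: confidence = highest class probability of the modality.
    A modality passed as None (unavailable) is skipped.
    """
    available = [p for p in (text_probs, face_probs) if p is not None]
    if not available:
        raise ValueError("at least one modality is required")
    weights = [max(p) for p in available]
    total = sum(weights)
    return [sum(w * p[c] for w, p in zip(weights, available)) / total
            for c in range(len(CLASSES))]


# Hypothetical outputs: the text model is confident, the face model is not.
text = [0.10, 0.20, 0.70]
face = [0.40, 0.35, 0.25]
fused = fuse_decisions(text, face)
print(CLASSES[fused.index(max(fused))])  # highly stressed
```

Because the weights are renormalized over the modalities that are present, the same function covers the case in which only one modality is available.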

Fig. 2


(a) Schematic representation of IFG-CNN for mental health prediction (b) unified algorithmic workflow diagram

Table 2 Pseudocode of the proposed strategy

Because questionnaire responses and face images are sensitive, the proposed IoT-based mental health prediction framework places a strong focus on privacy, data security, and fairness. Communication between the devices and cloud servers is encrypted and protected through secure key exchange in order to safeguard user data. Textual inputs (processed by RoBERTa) and facial images are encrypted prior to storage so that they cannot be accessed or altered by unauthorized users. Role-based authentication restricts access to the system, whereas anonymization methods strip personally identifiable information before the data are processed in the IFG-CNN architecture. From a methodological perspective, the Interval Type-2 Fuzzy component handles uncertainty in interpreting stress without requiring extensive personal profiling, which would expose sensitive metadata.

The multimodal fusion approach combines normalized feature representations derived with RoBERTa and SIFT and ensures that raw, identity-specific information is not used directly in classification. In addition, the White-Faced Capuchin Optimizer (WCO) optimizes the RISAR-Net parameters without storing raw input samples, which supports safe, privacy-conscious learning. Bias is mitigated through balanced training datasets and subgroup-based validation across age and gender. Unintended bias is reduced because the attention mechanism of RISAR-Net focuses on emotion-relevant face regions rather than on demographic features. Continuous performance monitoring supports fair stress classification and responsible system deployment.

Rationale for Technique Selection

Each component of the proposed framework is chosen for its suitability for multimodal, uncertain, and nonlinear mental health data. The Interval Type-2 fuzzy component is integrated to deal with the uncertainty and ambiguity in psychological and behavioral inputs, where stress indicators are highly imprecise. Unlike a traditional fuzzy system, it is more responsive to human variability. The gradient recurrent structure is employed to capture the temporal correlations of stress-related patterns, since mental health indicators in many cases build up over time and do not manifest as a single event. The CNN component extracts hierarchical spatial features from facial images, which makes it suitable for emotion recognition. To evaluate stress from text, RoBERTa, the robustly optimized version of BERT, is used because it has an enhanced ability to represent context and excels at classifying psychological and stress-related language. Its contextual language understanding allows question-and-answer responses to be interpreted correctly in cases where keyword analysis fails. The SIFT feature extractor is used because it is robust to scale, rotation, and illumination changes in facial images. The WCO-optimized RISAR-Net classifier also improves on the non-optimized version in parameter tuning, local minima avoidance, and generalization across a wide range of facial variations. Overall, these techniques form a complementary and robust architecture that enhances the prediction reliability, flexibility, and computational efficiency of IoT-enabled mental health monitoring systems.

User Stress Detection

The stress detection process begins with an assessment based on a set of ten questions designed to evaluate an individual's stress level. These questions are derived from the Perceived Stress Scale (PSS), a widely used tool for measuring the perception of stress in various contexts. The 14 items of the PSS are presented in Table 4. Let the responses to these questions be denoted as a vector $Q=[q_1,q_2,\ldots,q_n]$, where each $q_i$ represents the answer to the $i$-th question. The user responds to each question, and the responses are then analysed using RoBERTa. This approach processes the textual responses and captures the context and semantic meaning of each answer to accurately assess the individual's stress level [20]. Given a response sequence $R$, RoBERTa transforms it into a context-aware embedding $E=\mathrm{RoBERTa}(R)$, which encodes the semantic information needed for further analysis. The questionnaire-based stress data in this research were collected through a structured online survey. One hundred and twenty respondents volunteered to take part in the data collection. The respondents span several demographic groups, are aged between 18 and 45 years, and include both male and female participants. A standardized stress assessment questionnaire, delivered as an online form, is used to collect the data. Based on the responses, the stress levels of the study participants are grouped into three categories: high stress, moderate stress, and normal stress. The samples are distributed carefully to ensure a balanced representation of the classes, so that the model can be trained and evaluated reliably.
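As a point of reference for how a ten-item questionnaire can be turned into a stress category, the following sketch scores the ten-item form of the PSS. It assumes 0 to 4 ratings and the standard reverse scoring of items 4, 5, 7, and 8; the band boundaries (14 and 27) are commonly quoted values used here purely as an assumption, not thresholds reported by this study, and the function names are hypothetical.

```python
REVERSED_ITEMS = {4, 5, 7, 8}  # positively worded items of the ten-item PSS


def pss10_score(answers):
    """Total score (0-40) from ten ratings on a 0-4 scale."""
    if len(answers) != 10 or any(a not in range(5) for a in answers):
        raise ValueError("expected ten integer ratings between 0 and 4")
    return sum(4 - a if item in REVERSED_ITEMS else a
               for item, a in enumerate(answers, start=1))


def stress_band(score, moderate_from=14, high_from=27):
    """Map a total score to one of the three classes (assumed cut-offs)."""
    if score >= high_from:
        return "highly stressed"
    if score >= moderate_from:
        return "moderately stressed"
    return "normal"


# Hypothetical respondent.
answers = [3, 2, 3, 1, 1, 3, 2, 1, 3, 2]
score = pss10_score(answers)
print(score, stress_band(score))
```

In the proposed framework the free-text responses are analysed with RoBERTa; a numeric score of this kind is only relevant where questionnaire totals are thresholded to assign labels.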

Table 4 Guidelines and questions for the perceived stress scale

Following the analysis of the responses, a classification algorithm is employed to classify the stress levels based on the data. Let $X$ represent the feature vector derived from the RoBERTa embeddings, $X=[x_1,x_2,\ldots,x_d]$, where each $x_j$ corresponds to a specific feature extracted from the embedding. The classifier is trained to categorize stress levels into three distinct groups: highly stressed, moderately stressed, and normal. The classification results are presented as the distribution of stress levels across users, which gives a clear picture of stress patterns.

The stress features derived from the BERT-based textual analysis and the facial emotion features derived from RISAR-Net are combined using a multimodal feature integration strategy. The textual stress output of BERT is first encoded into a dense semantic feature vector, and the emotional visual features of RISAR-Net are encoded into a high-level spatial feature vector. To combine these heterogeneous feature spaces, both feature vectors are normalized and mapped to a shared latent representation space with fully connected transformation layers. The aligned feature vectors are then concatenated to create a single multimodal feature representation. This combined representation is passed to Interval Type-2 fuzzy logic, which adaptively captures uncertainty and situation-dependent correlations among emotional and stress indicators. Finally, the integrated features are passed to the IFG-CNN classifier for the final stress state prediction.
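The normalize, project, and concatenate steps can be sketched as follows. The dimensions (768 for the text embedding, 128 for the facial feature vector, 64 for the shared space), the ReLU activation, and the randomly initialized weights are assumptions that stand in for trained fully connected layers.

```python
import numpy as np

rng = np.random.default_rng(0)
TEXT_DIM, FACE_DIM, SHARED_DIM = 768, 128, 64  # assumed sizes

# Random stand-ins for the learned fully connected transformation layers.
W_text = rng.normal(scale=0.05, size=(SHARED_DIM, TEXT_DIM))
W_face = rng.normal(scale=0.05, size=(SHARED_DIM, FACE_DIM))
b_text = np.zeros(SHARED_DIM)
b_face = np.zeros(SHARED_DIM)


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length so both modalities share a range."""
    return v / (np.linalg.norm(v) + eps)


def project(v, weight, bias):
    """One fully connected layer with ReLU, mapping into the shared space."""
    return np.maximum(weight @ v + bias, 0.0)


def fuse_features(text_vec, face_vec):
    """Normalize each modality, project it, then concatenate."""
    z_text = project(l2_normalize(text_vec), W_text, b_text)
    z_face = project(l2_normalize(face_vec), W_face, b_face)
    return np.concatenate([z_text, z_face])


fused = fuse_features(rng.normal(size=TEXT_DIM), rng.normal(size=FACE_DIM))
print(fused.shape)  # (128,)
```

Projecting both modalities to the same width before concatenation keeps either one from dominating the fused vector simply because its raw embedding is larger.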

Facial Emotion Detection

An individual's emotional state is assessed through their facial expressions, which serve as a direct indicator of their emotions. The process of analyzing facial emotions is divided into three main stages: (1) collecting facial emotion data, (2) extracting facial features, and (3) classifying the features to detect emotions, as shown in Fig. 3.

Fig. 3


Detection of different emotions from facial images

Collecting Facial Emotion Data

The facial emotion data, obtained from Kaggle's Face Expression Recognition dataset (https://www.kaggle.com/datasets/jonathanoheix/face-expression-recognition-dataset), consist of more than 35,000 labelled images depicting human faces expressing one of seven fundamental emotions, namely happiness, anger, disgust, fear, surprise, sadness, and neutral. These grayscale images are aligned to support accurate recognition of facial expressions. The dataset provides emotion labels for the faces (e.g., happy, sad, neutral), which are mapped to stress levels through an emotion-to-stress association schema. In particular, expressions that signify distress or negative affect (e.g., angry, sad, fearful) are treated as indicators of higher stress, whereas neutral and positive expressions contribute to moderate or normal stress levels when combined with the other modalities. Ground-truth labels for the highly stressed, moderately stressed, and normal conditions are obtained by combining questionnaire-based stress scores with emotion-derived indicators. PSS responses and the validated indices PHQ-9, GAD-7, DASS-21, and SF-36 are normalized and thresholded to assign stress levels. Clinically validated cutoff ranges from the literature (e.g., PSS ≥ 20, indicating increased perceived stress) are used to ensure label validity, in place of heuristically assigned labels. For training, validation, and testing, the multimodal dataset is divided into 70% training, 15% validation, and 15% testing. Care is taken that data from the same subject do not appear in both the training and testing splits. The facial and questionnaire records of a subject are kept together in the same split to retain modality consistency. Multimodal synchronization is achieved by time-aligning the face frames collected by the IoMT devices with the questionnaire responses. Each facial image or brief video clip is associated with a temporally matching questionnaire session; therefore, emotion recognition and self-reported stress both relate to the same period of mental state.
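A subject-wise 70/15/15 split of the kind described here can be sketched as follows. The record layout `(subject_id, modality, payload)` and the function names are hypothetical; the point is that the split is drawn over subjects, so every record of a subject, facial or questionnaire, lands in the same partition.

```python
import random


def split_subjects(subject_ids, train=0.70, val=0.15, seed=42):
    """Partition unique subjects into train/val/test sets (70/15/15)."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    n_train = round(train * len(subjects))
    n_val = round(val * len(subjects))
    return {
        "train": set(subjects[:n_train]),
        "val": set(subjects[n_train:n_train + n_val]),
        "test": set(subjects[n_train + n_val:]),
    }


def assign_records(records, splits):
    """Send each (subject_id, modality, payload) record to its subject's split."""
    assigned = {name: [] for name in splits}
    for record in records:
        for name, members in splits.items():
            if record[0] in members:
                assigned[name].append(record)
                break
    return assigned


# Hypothetical data: 120 subjects, one face record and one questionnaire each.
ids = [f"S{i:03d}" for i in range(120)]
records = [(s, m, None) for s in ids for m in ("face", "questionnaire")]
splits = split_subjects(ids)
partitions = assign_records(records, splits)
print({name: len(members) for name, members in splits.items()})
```

Splitting over subject identifiers, and not over individual images or responses, is what prevents the same person from appearing on both sides of the train/test boundary.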

Extracting Facial Features

After the data are collected, they enter the feature extraction phase. In this phase, the Scale-Invariant Feature Transform (SIFT) is used to extract local features from the images. The key benefits of SIFT include its invariance to rotation, scale, and changes in illumination, as well as its robustness to noise, affine transformations, and perspective changes. Moreover, it works well in analysing local illumination differences and is insensitive to small variations, which makes it well suited to depicting finer details of the face, including depth discontinuities around main facial features such as the nose, cheeks, and lips. The Difference of Gaussian (DoG) $D(m,n,\sigma )$ (Eq. 1) helps to find significant areas of the face and to distinguish edges in the image. This ensures that the system responds to significant changes on the face and not to background noise.

$$D(m,n,\sigma )=(G(m,n,k\sigma )-G(m,n,\sigma ))*I(m,n)=L(m,n,k\sigma )-L(m,n,\sigma )$$

(1)

where $G(m,n,\sigma )$ represents the Gaussian function,$I(m,n)$ is the original image, and $L(m,n,\sigma )$ denotes the convolution of the image with the Gaussian function. $k$ signifies a scaling factor used to increase the scale ($\sigma$) in the second Gaussian function, typically used to detect features at different scales in the image [21]. The DOG is then used to detect keypoints by identifying extrema in the scale space, where significant features in the image are located. Once keypoints are detected, the next step is to assign an orientation and gradient modulus to each keypoint. The gradient and orientation calculations (Eqs. 2, 3) determine the direction and intensity of facial movements, such as raised eyebrows or tightened lips, which are strong indicators of emotional states.

$$k(m,n)=\sqrt{(L(m+1,n)-L(m-1,n))^{2}+(L(m,n+1)-L(m,n-1))^{2}}$$

(2)

and the orientation $\theta (m,n)$ is given by Eq. (3).

$$\theta (m,n)=\tan^{-1}\left(\frac{L(m,n+1)-L(m,n-1)}{L(m+1,n)-L(m-1,n)}\right)$$

(3)

The last step is building the feature descriptor, where the gradient orientation histogram is calculated on an 8×8 neighbourhood window around each keypoint. This window is subdivided into 4×4 child windows, and the gradient orientation histograms of these windows are combined into a 128-dimensional descriptor. These descriptors describe the local characteristics of the face and are passed on to emotion classification.
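The pieces of Eqs. (1) to (3) and the descriptor layout can be sketched with NumPy alone. This is not a full SIFT implementation: keypoint detection over scale space, orientation normalization, and Gaussian weighting are omitted. The values $\sigma = 1.6$ and $k=\sqrt{2}$, the 8 orientation bins, and the reading of the 8×8 window as a 4×4 grid of cells are assumptions made so that the descriptor has 128 entries.

```python
import numpy as np


def gaussian_blur(image, sigma):
    """L = G * I, computed as a separable Gaussian convolution."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(image, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def difference_of_gaussians(image, sigma=1.6, k=2 ** 0.5):
    """Eq. (1): D = L(k*sigma) - L(sigma)."""
    return gaussian_blur(image, k * sigma) - gaussian_blur(image, sigma)


def gradient_magnitude_orientation(L):
    """Eqs. (2) and (3) with central differences on the smoothed image L."""
    dm = np.zeros_like(L)
    dn = np.zeros_like(L)
    dm[1:-1, :] = L[2:, :] - L[:-2, :]   # L(m+1,n) - L(m-1,n)
    dn[:, 1:-1] = L[:, 2:] - L[:, :-2]   # L(m,n+1) - L(m,n-1)
    return np.sqrt(dm ** 2 + dn ** 2), np.arctan2(dn, dm)


def descriptor(magnitude, orientation, row, col, window=8, grid=4, bins=8):
    """Concatenate per-cell orientation histograms: 4 * 4 * 8 = 128 values."""
    half, cell = window // 2, window // grid
    mag = magnitude[row - half:row + half, col - half:col + half]
    ori = orientation[row - half:row + half, col - half:col + half]
    parts = []
    for i in range(grid):
        for j in range(grid):
            block = (slice(i * cell, (i + 1) * cell),
                     slice(j * cell, (j + 1) * cell))
            hist, _ = np.histogram(ori[block], bins=bins,
                                   range=(-np.pi, np.pi), weights=mag[block])
            parts.append(hist)
    vec = np.concatenate(parts)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


rng = np.random.default_rng(0)
image = rng.random((32, 32))  # stand-in for a grayscale face crop
dog = difference_of_gaussians(image)
mag, ori = gradient_magnitude_orientation(gaussian_blur(image, 1.6))
vec = descriptor(mag, ori, row=16, col=16)
print(dog.shape, vec.shape)  # (32, 32) (128,)
```

`np.arctan2` is used as the quadrant-aware form of the inverse tangent in Eq. (3), which also avoids division by zero when the horizontal difference vanishes.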

Classifying the Features

Once the facial features have been extracted, they are combined into a feature vector, which is fed to the classifier as input. Facial expression recognition is one of the most important components of mental health prediction. Conventional techniques handle pose variations, illumination changes, and occlusions poorly. To resolve these problems, RISAR-Net, which combines rotation-invariant surface properties with an attention mechanism, is used to improve classification accuracy. This approach is robust to rotational transformations while preserving the intricate local geometry of facial expressions.

First, in this module, rotation-invariant surface features are extracted from facial images in a structured manner. Given a reference point $p$, its $K$ nearest neighbors are identified to form a local point set [22,23,24]. Each neighbor $p_j$ contributes to constructing two triangular local surfaces with its adjacent points, and the resulting representation is defined by Eq. (4). The local surface and geometric modeling (Eqs. 4, 5) capture the shape and structure of facial regions in a way that remains stable even if the face rotates or slightly changes position, which improves robustness.

$$R(p_j)=(d_j,\theta_1,\theta_2,\theta_3,\theta_4,\theta_5)$$

(4)

where $d_j$ signifies the Euclidean distance from $p$ to $p_j$, and $\theta_i$ (for $i=1,\ldots,5$) characterize the angular relationships between the formed triangles.

$$\begin _=\angle (p,_),_=\\\angle (_,p), _=(_-_),_=\\ (_-_),_=(_\times p-_\times p) \end$$

(5)

Furthermore, Eq. (5) presents a comprehensive representation of the local geometric structure, ensuring rotation invariance during feature extraction. To further enhance feature learning, an attention mechanism ($A$) that dynamically weighs the importance of different surface properties is integrated, as formulated in Eq. (6). The attention mechanism (Eqs. 6, 7) allows the model to concentrate on emotionally significant facial areas (e.g., eyes, mouth) while ignoring irrelevant details, enhancing expression recognition accuracy.

$$A(q,k,v)=\mathrm{Softmax}\left(\frac{qk^{T}}{\sqrt{d_k}}\right)v$$

(6)

where $q$, $k$, and $v$ represent the query, key, and value feature matrices, respectively, and $d_k$ signifies the dimensionality of the key vectors. Multi-head attention is used to identify complex interdependencies using Eq. (7).

$$MHA(q,k,v)=\mathrm{Concat}(head_1,\ldots,head_h)W_O$$

(7)

where $W_O$ is a learnable parameter matrix. The mechanism increases the capacity of the model to highlight important expression-related features and to suppress unnecessary variations. The resulting features are then classified with a Radial Basis Function (RBF). This operation maps the features onto a high-dimensional space by using a kernel function, after which the facial expressions can be separated linearly. The radial basis function classifier (Eqs. 8, 9) gives a clearer separation between the various emotional patterns, which helps in distinguishing between levels of stress.
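The scaled dot-product attention of Eq. (6) and the multi-head form of Eq. (7) can be sketched in NumPy as follows. The token count, model width, number of heads, and random projection matrices are illustrative assumptions, not RISAR-Net's actual configuration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Eq. (6): Softmax(q k^T / sqrt(d_k)) v."""
    d_k = k.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    return softmax(scores) @ v


def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Eq. (7): concatenate the per-head outputs and project with W_o."""
    n, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):  # (n, d_model) -> (heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    heads = attention(split(x @ W_q), split(x @ W_k), split(x @ W_v))
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o


rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))  # six hypothetical surface-feature vectors
W_q, W_k, W_v, W_o = (rng.normal(scale=0.1, size=(16, 16)) for _ in range(4))
out = multi_head_attention(tokens, W_q, W_k, W_v, W_o, num_heads=4)
print(out.shape)  # (6, 16)
```

Dividing the scores by $\sqrt{d_k}$ keeps their scale independent of the key dimensionality, so the softmax does not saturate as the feature width grows.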

In the RBF of Eq. (8), $c$ represents the center of the function and a width parameter controls its spread. The classification result is obtained by Eq. (9).

$$y=\sum_{i=1}^{N}w_i\varphi (x_i)+b$$

(9)

where $w_i$ signifies the learned weights, and $b$ is the bias term. In addition, a Categorical Cross-Entropy (CCE) loss function is used to improve the classification performance of the module. Since facial expression recognition involves several mutually exclusive classes, CCE is a suitable loss for training models in a multi-class classification setting. The cross-entropy loss (Eqs. 10, 11) helps the system minimize prediction error during training, yielding more reliable classifications.

$$L_{CCE}=-\sum_{i=1}^{C}y_i\log (\hat{y}_i)$$

(10)

where $C$ signifies the total number of facial expression classes, $y_i$ denotes the ground truth label (one-hot encoded), and $\hat{y}_i$ is the predicted probability for class $i$. The method maps features into a linearly separable space, which makes classification robust to pose and local geometric distortions and improves generalization across different expression types. Moreover, the model uses WCO to optimize the loss function, which speeds up learning and substantially reduces classification errors.
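The RBF output layer of Eq. (9) and the loss of Eq. (10) can be sketched as follows. A Gaussian kernel $\exp(-\Vert x-c\Vert^2/(2\sigma^2))$ is assumed for the basis function, since that is the usual choice when a center and a spread are the only parameters; the centers, weights, softmax normalization, and seven-class setup are illustrative assumptions.

```python
import numpy as np


def rbf_features(x, centers, sigma):
    """Gaussian RBF activation of every sample for every center (assumed form)."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def rbf_scores(x, centers, sigma, weights, bias):
    """Eq. (9): weighted sum of basis activations plus bias, one score per class."""
    return rbf_features(x, centers, sigma) @ weights + bias


def softmax(z):
    """Row-wise softmax turning class scores into probabilities."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Eq. (10) per sample, averaged over the batch."""
    return float(-(y_true * np.log(y_pred + eps)).sum(axis=1).mean())


rng = np.random.default_rng(0)
features = rng.normal(size=(5, 16))   # five hypothetical attention outputs
centers = features[:3]                # assumed: centers taken from the data
weights = rng.normal(size=(3, 7))     # seven emotion classes
bias = np.zeros(7)
probs = softmax(rbf_scores(features, centers, 1.0, weights, bias))
labels = np.eye(7)[[0, 3, 6, 1, 2]]   # one-hot ground truth
print(categorical_cross_entropy(labels, probs))
```

For a one-hot target the sum in Eq. (10) reduces to the negative log-probability assigned to the true class, which is why confident wrong predictions are penalized most heavily.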

To increase the accuracy and robustness of the classification, the loss is optimized with WCO. Inspired by the foraging behaviour and intelligent decision-making of white-faced capuchin monkeys, WCO balances exploration and exploitation effectively, so that it yields the best feature representation for facial expression classification. The optimization process minimizes the loss function using Eq. (11).

$$Fitness=Optimize\{L_{CCE}\}$$

(11)

where $Fitness$ signifies the objective function of the problem and $L_{CCE}$ denotes the categorical cross-entropy loss function. Additionally, WCO involves dynamic adjustments to avoid local minima and achieve better generalization. Each candidate solution in the population represents a possible parameter set for the model, updated based on position, velocity, and a self-adaptive learning strategy [25]. The exploration phase ensures diverse solutions, while exploitation refines the best candidates, as formulated in Eq. (12). The white-faced capuchin optimizer (Eq. 12) fine-tunes model parameters to avoid poor solutions and improve overall accuracy.

$$x_i^{t+1}=x_i^{t}+\alpha \cdot (g^{t}-x_i^{t})+\beta \cdot (p_i^{t}-x_i^{t})$$

(12)

where $x_i^{t}$ is the location of the $i^{th}$ capuchin at iteration $t$, $g^{t}$ represents the global best location, and $p_i^{t}$ is the local best solution. The parameters $\alpha$ and $\beta$ control the adaptive movement towards optimal solutions. Table 5 provides the step-by-step process of the WCO algorithm.

Table 5 Pseudocode of WCO

Therefore, WCO improves the classification of facial expressions by carefully adjusting the parameters of the network to obtain the best features. It enhances convergence performance, avoids local minima, and strengthens the discriminative ability of the learned features. By making training more stable, WCO reduces classification errors and makes the model more robust to differences in pose, illumination, and facial expression. As a result, the model is more accurate and more widely applicable across different face datasets.
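The position update of Eq. (12) can be illustrated with a small population-based search. This sketch implements only that update rule, not the full procedure of Table 5; the random scaling of $\alpha$ and $\beta$ at each step, the population size, and the toy quadratic loss standing in for the network's cross-entropy are all assumptions.

```python
import numpy as np


def wco_style_search(loss, dim, pop_size=20, iterations=50,
                     alpha=0.6, beta=0.4, seed=0):
    """Move each candidate towards the global and its own best position."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(pop_size, dim))
    local_best = x.copy()
    local_loss = np.array([loss(row) for row in x])
    idx = int(local_loss.argmin())
    global_best, global_loss = local_best[idx].copy(), float(local_loss[idx])
    history = [global_loss]
    for _ in range(iterations):
        # Assumed: random factors in [0, 1) scale alpha and beta per candidate.
        r1 = rng.random((pop_size, 1))
        r2 = rng.random((pop_size, 1))
        x = x + alpha * r1 * (global_best - x) + beta * r2 * (local_best - x)
        current = np.array([loss(row) for row in x])
        improved = current < local_loss
        local_best[improved] = x[improved]
        local_loss[improved] = current[improved]
        idx = int(local_loss.argmin())
        if local_loss[idx] < global_loss:
            global_best = local_best[idx].copy()
            global_loss = float(local_loss[idx])
        history.append(global_loss)
    return global_best, history


def toy_loss(w):
    """Stand-in objective; in the framework this would be the CCE loss."""
    return float(np.sum((w - 0.25) ** 2))


best, history = wco_style_search(toy_loss, dim=3)
print(history[0], history[-1])
```

Tracking the best position separately from the current positions guarantees that the reported loss never increases, even when an individual update moves a candidate to a worse point.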

Mental Health Prediction

The prediction of mental health status in individuals is enhanced through the fusion of stress detection and emotion analysis, employing the proposed IFG-CNN. Conventional Type-1 Fuzzy Logic Systems (FLSs) exhibit limitations in handling rule uncertainty, necessitating the adoption of Type-2 FLSs. These systems extend fuzzy logic by incorporating an additional degree of uncertainty through secondary memberships, enabling improved decision-making under ambiguous conditions. In the Interval Type-2 Fuzzy system, Gaussian membership functions are employed, whereby the mean is initialized from the normalized feature inputs and the standard deviation is set in the range 0.15–0.30. The Footprint of Uncertainty (FOU) is defined by varying the primary membership function by ±10%, which is enough to capture the ambiguity of the stress level without excessive spread. The Interval Type-2 Fuzzy system (Eqs. 13–16) handles uncertainty in interpreting stress, as emotional states are not rigidly defined.
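The Gaussian membership with an uncertain standard deviation, formalized in Eqs. (15) and (16), can be sketched as a pair of bounding curves. Using 0.15 and 0.30 directly as the lower and upper standard deviations is an assumption about how the stated range enters the model, and the function names are hypothetical.

```python
import math


def gaussian(x, mean, sigma):
    """Primary Gaussian membership grade."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def it2_membership(x, mean, sigma_lower=0.15, sigma_upper=0.30):
    """Return the (lower, upper) membership grades for input x.

    With a fixed mean, the narrower Gaussian bounds the footprint of
    uncertainty from below and the wider one bounds it from above.
    """
    return gaussian(x, mean, sigma_lower), gaussian(x, mean, sigma_upper)


# Hypothetical normalized stress feature with a fuzzy set centred at 0.7.
for value in (0.7, 0.85, 1.0):
    lower, upper = it2_membership(value, mean=0.7)
    print(f"x={value:.2f}  lower={lower:.4f}  upper={upper:.4f}")
```

The gap between the two grades is the footprint of uncertainty at that input: it is zero at the mean and widens as the input moves away from it.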

$$\overline{A}=\{((x,n),\mu_{\overline{A}}(x,n))\mid \forall x\in X,\ \forall n\in J_x\subseteq [0,1]\}$$

(13)

where $\mu_{\overline{A}}(x,n)$ represents the secondary membership function, the interval type-2 fuzzy set is denoted as $\overline{A}$, the input variable is denoted as $x$, the secondary membership grade is expressed as $n$, the universe of discourse (range of input values) is defined as $X$, and the interval of secondary memberships for the input is denoted as $J_x$. The membership is defined by Eq. (14).

$$\mu_{\overline{A}}(x)=\int_{n\in J_x}f_x(n)/n$$

(14)

For mental stress detection, Gaussian primary membership functions with uncertain standard deviations are used, which is expressed by Eq. (15).

$$\mu_{A}(x_k)=\exp\left(-\frac{1}{2}\left(\frac{x_k-m_k^{l}}{\sigma}\right)^{2}\right),\quad \sigma \in [\sigma_1,\sigma_2]$$

(15)

where $\sigma_1,\sigma_2$ denote the lower and upper bounds of the uncertainty range, the standard deviation is denoted as $\sigma$, and the input variable and the mean (center) of the Gaussian function are denoted as $x_k$ and $m_k^{l}$, respectively. Furthermore, this allows the system to dynamically adjust the stress condition assessment over time. The upper and lower membership functions of the interval Type-2 fuzzy set are characterized by Eq. (16).

$$\overline{\mu}_{\overline{A}}(x)=N(\alpha ,\sigma_2;x),\quad \underline{\mu}_{\overline{A}}(x)=N(\alpha ,\sigma_1;x)$$

(16)

where $\alpha$ represents uncertain variance. Once the fuzzy rules are defined, the fuzzy inference mechanism evaluates the mental health condition by processing input features such as stress indicators, emotional state, and behavioral patterns. The output classification assigns individuals to one of three categories: highly stressed, moderately stressed, and normal. Additionally, Recurrent Neural Networks (RNNs) introduce temporal dependencies through recurrent connections, ensuring a robust analysis of evolving mental health conditions [26, 27]. The RNN structure has two hidden layers with 64 and 32 neurons, respectively, and uses the tanh activation function. To avoid overfitting, a dropout rate of 0.3 is used. The learning rate is 0.001, the batch size is 32, and convergence is observed by the end of 50 epochs without a premature performance plateau. The RNN (Eqs. 17, 18) traces stress dynamics over time, thus allowing ongoing and dynamic mental health evaluation.

$$\begin{aligned} h_p&=g(h_{p-1},x_p)\\ h_p&=g(W_{xh}\cdot x_p+W_{hh}\cdot h_{p-1}+b_h)\\ \hat{y}_p&=g(W_{hy}\cdot h_p+b_y)\end{aligned}$$

(17)

where $h_p$ represents the hidden state at interval $p$, the hidden state at the previous interval $p-1$ is denoted as $h_{p-1}$, $x_p$ is the input at interval $p$, the element-wise nonlinear activation function is denoted as $g(.)$, the learnable weight matrix for the input-to-hidden transformation is denoted as $W_{xh}$, the learnable recurrent weight matrix for the hidden-to-hidden transformation is denoted as $W_{hh}$, the bias vector added to the hidden-state computation is expressed as $b_h$, the learnable weight matrix mapping the hidden state to the output is denoted as $W_{hy}$, $b_y$ specifies the bias vector for the output layer, and $\hat{y}_p$ is the predicted mental health classification. The model is trained using forward–backward propagation to minimize the loss function given by Eq. (18).

$$L(\hat{y}_p,y_p)=\frac{1}{N}\sum_{p=1}^{N}(\hat{y}_p-y_p)^{2}$$

(18)

where the loss value for the current prediction is expressed as $L(\hat{y}_p,y_p)$ and the number of samples is denoted as $N$. This enables the proposed framework to keep improving its predictions by incorporating stress-related variables and emotional reactions. The hybrid model takes advantage of the strengths of Type-2 FLSs and RNNs by providing an interpretable and adaptive mental health prediction model, which contributes to more accurate diagnosis and intervention plans.
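The forward pass of Eq. (17) with the stated layer sizes (64 and 32 tanh units) and the loss of Eq. (18) can be sketched in NumPy. The input width, sequence length, softmax output, and random initialization are assumptions; dropout and the backward pass are training-time details that this sketch omits.

```python
import numpy as np


class RecurrentLayer:
    """h_p = tanh(W_xh x_p + W_hh h_{p-1} + b_h), applied along a sequence."""

    def __init__(self, in_dim, hidden_dim, rng):
        scale = 1.0 / np.sqrt(hidden_dim)
        self.W_xh = rng.uniform(-scale, scale, (hidden_dim, in_dim))
        self.W_hh = rng.uniform(-scale, scale, (hidden_dim, hidden_dim))
        self.b_h = np.zeros(hidden_dim)

    def forward(self, inputs):
        h = np.zeros_like(self.b_h)
        states = []
        for x_p in inputs:
            h = np.tanh(self.W_xh @ x_p + self.W_hh @ h + self.b_h)
            states.append(h)
        return np.stack(states)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


class StressRNN:
    """Two stacked recurrent layers (64, 32) followed by a 3-class output."""

    def __init__(self, in_dim, n_classes=3, seed=0):
        rng = np.random.default_rng(seed)
        self.layer1 = RecurrentLayer(in_dim, 64, rng)
        self.layer2 = RecurrentLayer(64, 32, rng)
        self.W_hy = rng.uniform(-0.1, 0.1, (n_classes, 32))
        self.b_y = np.zeros(n_classes)

    def predict(self, sequence):
        states = self.layer2.forward(self.layer1.forward(sequence))
        # Assumed: softmax plays the role of g(.) at the output layer.
        return softmax(self.W_hy @ states[-1] + self.b_y)


def mse_loss(y_hat, y):
    """Eq. (18): squared error per sample, averaged over N samples."""
    y_hat, y = np.atleast_2d(y_hat), np.atleast_2d(y)
    return float(np.mean(np.sum((y_hat - y) ** 2, axis=1)))


rng = np.random.default_rng(1)
sequence = rng.normal(size=(5, 128))   # five time steps of fused features
model = StressRNN(in_dim=128)
y_hat = model.predict(sequence)
target = np.array([0.0, 1.0, 0.0])     # hypothetical "moderately stressed"
print(y_hat.shape, mse_loss(y_hat, target))
```

Only the final hidden state is used for the prediction here, which matches the idea that the accumulated history, and not any single time step, determines the stress class.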

Preventive Action Recommendations Via a Web Portal Design

Following mental health status prediction, a web-based application, Mind Harmonizer, was developed to determine and forecast the mental health status of an individual as highly stressed, moderately stressed, or normal. The platform also collects and processes user responses to structured tests using advanced computational tools. Based on the assessment, Mind Harmonizer provides individual recommendations and, where needed, professional consultation, which supports timely mental health intervention. A key aspect of Mind Harmonizer is that it offers a confidential and convenient platform that reduces the embarrassment of talking about mental health [28]. The platform supports mental health practitioners by providing predictive information based on user responses to enhance diagnostic accuracy. Moreover, users have access to interactive capabilities, including mood tracking, informative blogs, relaxation techniques, and educational information on mental well-being. Users with moderate to severe mental distress can communicate directly with providers through Mind Harmonizer. The system securely sends the user's personal information, mental health status, and GPS position to the closest health center so that support can be provided. Moreover, the site enables users to invite friends or family members who require such services, which encourages a proactive attitude toward mental health treatment and early intervention.


COGNITIVE COMPUTATION
