A New Approach to Children's Education


A study on evaluating the effect of voice activity detection (VAD) approach on speech emotion recognition of autistic children

Article type: Research article

Author
Seyed Mehdi Hoseini
M.Sc. in Computer Science (Scientific Computing), Department of Computer Science, University of Mazandaran, Babolsar, Mazandaran, Iran
Abstract
Background and Aim: Autism spectrum disorder is a neurological disorder that manifests itself in the early years of a child's development. People with autism face challenges in regulating their emotions and express their emotional states in different ways. The present study presents a voice activity detection (VAD) system adapted to the voices of autistic children.

Methods: The proposed VAD system is a recurrent neural network (RNN) with long short-term memory (LSTM) cells. The data comprise 25 English-speaking autistic children performing a structured learning activity.

Results: Our experiments show that the child VAD system performs worse than our generic VAD system trained under the same conditions, obtaining areas under the receiver operating characteristic curve (ROC-AUC) of 0.662 and 0.850, respectively. The SER results show varying performance between valence and arousal, depending on the VAD system used, with a maximum concordance correlation coefficient (CCC) of 0.263 and a minimum root mean square error (RMSE) of 0.107.

Conclusion: Although the performance of the SER models is generally low, the child VAD system can lead to slightly improved results compared with the other VAD systems, and in particular the VAD-less baseline, which supports the hypothesized importance of child-adapted VAD systems in this context.

Article title (English)

A study on evaluating the effect of voice activity detection (VAD) approach on speech emotion recognition of autistic children

Author (English)

Seyed Mehdi Hoseini
M.Sc. in Computer Science (Scientific Computing), Department of Computer Science, University of Mazandaran, Babolsar, Mazandaran, Iran
Abstract (English)

Background and Aim: Autism spectrum disorder is a neurological disorder that manifests itself in the early years of a child's development. People with autism face challenges in regulating emotions and express their emotional states in different ways. The current research presents a voice activity detection (VAD) system adapted to the voices of autistic children.

Methods: The proposed VAD system is a recurrent neural network (RNN) with long short-term memory (LSTM) cells. The data include 25 English-speaking autistic children performing a structured learning activity and were collected as part of the DE-ENIGMA project.
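For readers unfamiliar with LSTM cells, the following is a rough, self-contained sketch of what a single LSTM step computes. This is illustrative only: the weight shapes, gate ordering, and function names are assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step over a single frame of acoustic features.

    x:      input feature vector, shape (D,)
    h_prev: previous hidden state, shape (H,)
    c_prev: previous cell state, shape (H,)
    W, U, b: stacked gate parameters, shapes (4H, D), (4H, H), (4H,)
    Gate order assumed: input, forget, cell candidate, output.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    g = np.tanh(z[2 * H:3 * H])  # candidate cell update
    o = sigmoid(z[3 * H:4 * H])  # output gate
    c = f * c_prev + i * g       # new cell ("memory") state
    h = o * np.tanh(c)           # new hidden state
    return h, c
```

Running such a step over every feature frame and mapping each hidden state through a sigmoid output unit yields a per-frame speech/non-speech probability, which is the usual shape of an LSTM-based VAD.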

Results: Our experiments show that the child VAD system performs worse than our generic VAD system trained under the same conditions, as we obtain areas under the receiver operating characteristic curve (ROC-AUC) of 0.662 and 0.850, respectively. The SER results show varying performance between valence and arousal, depending on the VAD system used, with a maximum concordance correlation coefficient (CCC) of 0.263 and a minimum root mean square error (RMSE) of 0.107.
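The metrics quoted in the results (ROC-AUC, CCC, RMSE) have simple closed forms. A minimal sketch follows, with hypothetical array names (`y_true` for annotations, `y_pred`/`scores` for model outputs); this is not the authors' evaluation code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between annotations and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient: agreement in both
    correlation and scale/location between two continuous signals."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return float(2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2))

def roc_auc(y_true, scores):
    """Area under the ROC curve: the probability that a random positive
    frame is scored above a random negative frame (ties count half)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((greater + 0.5 * ties) / (pos.size * neg.size))
```

Unlike plain Pearson correlation, CCC also penalizes mean and scale differences between prediction and annotation, which is why it is the standard measure for continuous valence/arousal prediction.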

Conclusion: Although the performance of the SER models is generally low, the child VAD system can lead to slightly improved results compared with other VAD systems, and especially the VAD-less baseline, which supports the hypothesized importance of child-adapted VAD systems in the context under discussion.

Keywords (English)

Voice Activity Detection
Speech Emotion Recognition
Recurrent Neural Network
Long Short-Term Memory Cells
Autism
 

  • Received: 18 Farvardin 1403
  • Revised: 28 Farvardin 1403
  • Accepted: 28 Ordibehesht 1403
  • First published: 28 Ordibehesht 1403
  • Issue date: 01 Farvardin 1403