Stuttering is a speech impediment that affects around 1% of the world’s population. The funda- mental cause of the stuttering is still unknown, although multiple hypotheses exist, such as genetic problems, congenital factors, and structural and functional brain differences.
The impact of the stuttering on people who have such a problem could be on both social functioning and mental aspects. Stuttering hinders daily oral communication with speech repetition, prolongation, blocks, and secondary behaviors, such as body movements and facial grimaces. People who stutter (PWS) quite often feel stress, shame, and low esteem when their stuttering behaviors are exposed, which can cause them to fear speaking. To avoid those negative feelings, many PWS hold back from communication and even choose to self-isolate.
Stuttering is also a developmental speech disorder. Most PWS begin their stuttering in childhood, with an average onset of 30 months. Around 20% of them persist stuttering to adulthood. Therapeutic intervention for children who stutter is statistically more effective than for adults, which can increase the chance of full recovery. The importance of early diagnosis is self-evident.
However, compared with North American or Western European countries where the practice of speech-language therapists (SLP) has been evolving for quite some time, the related practices in Mainland China are still in their infancy, as there is no separate certification for speech therapists. Due to the lack of professionals, many families of children who stutter cannot receive timely diagnosis, thus delaying the recovery of their children from stuttering. While adding more professionals, automatic stuttering diagnosis can also meet the needs of some families.
When a person’s stuttering continues into adulthood, the chance of full recovery becomes very low. There is a high probability that stuttering will accompany them throughout their lives. To alleviate their communication barriers or, in a broad sense, eliminate social discrimination, relevant products need to be designed to be inclusive to meet their needs.
With the widespread popularity of smart home devices, and the rapid development of chatbot technology, e.g., Alexa and Chatgpt, voice-user interfaces have become indispensable tools in many of our lives. Although current Automatic Speech Recognition (ASR) systems can handle fluent speech well, they still pose a challenge for stuttering speech recognition. The reasons could be multifaceted, such as a lack of data or awareness of the need to develop systems for these people.
Motivated by the above points, we propose to organize the challenge – “Mandarin stuttering event detection and automatic speech recognition” (StutteringSpeech Challenge for short) in SLT2024. This will be the first challenge on Mandarin stuttering event detection and ASR.
The challenge comprises two tasks and open submission of papers on related topics:
For Task I, the submitted systems will be measured by stuttering event detection accuracy-recall, precision, and F1 on audio snippets in the evaluation set.
For Task II, the accuracy of the ASR system is measured by Character Error Rate (CER). The CER indicates the percentage of characters that are incorrectly predicted. Given a hypothesis output, it calculates the minimum number of insertions (Ins), substitutions (Subs), and deletions (Del) of characters that are required to obtain the reference transcript. Specifically, CER is calculated by
Where NIns , NSubs , NDel are the character number of the three errors, and NTotal is the total number of characters. As standard, insertion, deletion, and substitution all account for the errors.
For Task III, there are no specific evaluation metrics, and we welcome paper submissions for research using our stutter data.
The data used in this challenge is meticulously gathered during 70 online voice chat sessions. The dataset captures two scenarios-dialogue and command dictation. Each recording session begins with around half an hour of interview by two native Mandarin speakers who are both PWS. Then, the interviewee was asked to read a prepared list of commands. The effective duration of the dataset is around 50 hours, encompassing a comprehensive collection of recordings from 72 distinct PWS-2 interviewers and 70 interviewees. The speaker numbers of train/development/test partitions can be found in Table 1.
Mild | Moderate | Severe | Sum | |
---|---|---|---|---|
Train | 25 | 12 | 6 | 43 |
Dev | 4 | 2 | 1 | 7 |
Test | 14 | 3 | 3 | 20 |
Sum | 43 | 17 | 10 | 70 |
All participants should adhere to the following rules to be eligible for the challenge.
Participants are encouraged to prioritize technological innovation, particularly the exploration of novel model architectures, rather than relying solely on increased data usage.
Participants must sign up for an evaluation account where they can perform various activities such as registering for the evaluation, as well as uploading the submission and system description.
The registration email must be sent from an official institutional or company email address e.g., edu.cn; a public email address (e.g., 163.com, qq.com, or gmail.com) is not accepted.
Once the account has been created, the registration can be performed online. The registration is free to all individuals and institutes. The regular case is that the registration takes effect immediately, but the organizers may check the registration information and ask the participants to provide additional information to validate the registration.
To sign up for an evaluation account, please click Quick Registration.
Participants should submit their results via the submission system. Once the submission is completed, it will be shown in the Leaderboard, and all participants can check their positions.
For each task, participants can submit their results no more than 3 times a day.
Please contact e-mail stutteringspeech@aishelldata.com if you have any queries.