Arabic Speech-to-Text: Comparing Results of Top STT Providers

The results are in from our latest benchmarking of Arabic speech-to-text (STT) services. NeuralSpace emerged as the leader in accuracy, achieving an average accuracy of about 91% across various dialects. Our performance surpassed Google, AWS, Azure, Intella, OpenAI and Symbl AI, and on the MASC dataset our accuracy was 59 percentage points higher than IBM's.

The Value of Accurate STT

STT has a diverse range of applications, from transcribing calls and meetings to automated customer service systems, subtitle creation, speech analytics and more. However, for these use cases to be truly effective, STT accuracy is paramount.

Inaccurate transcriptions can result in misunderstandings and misinterpretations, leading to serious consequences, particularly in fields such as healthcare and legal proceedings. Additionally, inaccurate STT transcriptions can negatively impact user satisfaction, eroding trust and confidence in the product or service it powers, which ultimately hinders user adoption. As such, it’s critical to assess the performance of STT services in the language and dialect of the end users.

The Challenge of Dialects for STT

Developing an accurate STT system requires advanced algorithms and models. The process involves converting complex audio data into text, which requires the system to build a deep understanding of the nuances of language, accents, and dialects it supports. One of the major challenges for STT systems is dealing with regional dialects. STT models trained on standardized language data may struggle to accurately transcribe spoken language that deviates from the standard.

Although Modern Standard Arabic (MSA) is the formal written language used in most official contexts, it is not the language spoken in daily life by the majority of people living in the Arabic-speaking countries of the Middle East and Northern Africa (MENA). Instead, people speak various regional dialects that can differ significantly in pronunciation, grammar, and vocabulary. To address this challenge, STT models need to be trained on a vast amount of diverse language data that includes regional dialects to improve their accuracy and performance.

Additionally, the accuracy of an STT service is heavily dependent on the quality and clarity of the audio input, with performance decreasing in noisy environments or with low-quality recordings. Integrating linguistic knowledge of regional dialects and adapting acoustic models to specific dialects can help improve the accuracy of STT systems for non-standardized languages, ensuring that they perform reliably in real-world situations.

To provide a comprehensive evaluation of our Arabic STT systems, we conducted accuracy tests that compared NeuralSpace’s transcriptions to those of eight other service providers, namely Intella, Speechmatics, OpenAI’s Whisper, Google, Azure, AWS, IBM, and Symbl AI. The testing was conducted on five publicly available datasets that included diverse voices of native Arabic speakers speaking a variety of dialects and regional accents.

We used the most common measure of STT accuracy, the Word Error Rate (WER). This metric quantifies how far the STT output deviates from the fully accurate reference transcription, the so-called “ground truth”. The WER is calculated by dividing the total number of errors (substitutions, deletions, and insertions) by the total number of words in the ground truth transcription.
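As a rough illustration of the calculation described above, the sketch below computes WER with a word-level edit distance. It is not the benchmark's actual evaluation code: the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example, and the sample sentences are in English purely for readability.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()  # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.1%}")  # WER: 33.3%
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the output contains many extra words.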

💡 Ground truth is typically generated by human transcribers who listen to the audio and manually transcribe it into text.

A lower WER indicates higher accuracy of the STT system.

However, WER is sensitive to variations in spelling, punctuation, and capitalization, which can lead to higher error rates even for correct transcriptions. To address this issue, we use a language-specific normalizer to standardize the text and make it less sensitive to such variations, resulting in a more accurate assessment of the STT system’s performance. This benchmark compares the accuracies of various STT service providers. Each accuracy is calculated by subtracting the WER, expressed as a percentage, from 100.
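To make the idea of normalization concrete, here is a minimal sketch of what an Arabic text normalizer might do before scoring. The specific rules (stripping diacritics and tatweel, unifying alef variants, dropping punctuation) are common choices assumed for illustration; the article does not specify which rules the benchmark's normalizer applies. The names `normalize_arabic` and `accuracy_from_wer` are hypothetical.

```python
import re

# Assumed rules, not the benchmark's actual normalizer:
# Arabic diacritics (U+064B-U+0652), superscript alef (U+0670), tatweel (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Alef with madda, hamza above, or hamza below; all mapped to bare alef.
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
# Anything that is neither a word character nor whitespace counts as punctuation.
_PUNCTUATION = re.compile(r"[^\w\s]")


def normalize_arabic(text: str) -> str:
    """Apply the assumed normalization rules and collapse whitespace."""
    text = _DIACRITICS.sub("", text)           # drop diacritics and tatweel
    text = _ALEF_VARIANTS.sub("\u0627", text)  # unify alef spellings
    text = _PUNCTUATION.sub(" ", text)         # punctuation becomes a space
    return " ".join(text.split())              # collapse runs of whitespace


def accuracy_from_wer(wer_percent: float) -> float:
    """Accuracy as defined in the article: 100 minus WER (in percent)."""
    return 100.0 - wer_percent


# Illustrative value only, not a benchmark result.
print(accuracy_from_wer(12.5))  # 87.5
```

Both the ground truth and the STT output would be passed through the same normalizer before computing WER, so that differences in diacritics or punctuation alone are not counted as errors.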

Test Datasets

The benchmark was conducted using the following datasets:

Results

NeuralSpace achieved the highest accuracy among the Arabic speech-to-text (STT) providers on all evaluated datasets, with an average accuracy of 90.75% and a peak accuracy of 95%. Notably, on the MASC dataset, NeuralSpace's accuracy was 59 percentage points higher than that of the lowest-performing system (IBM), illustrating a significant disparity in the performance of STT systems across providers.

The table below shows the accuracies of Intella, Speechmatics, OpenAI Whisper, Google, Azure, AWS, IBM, Symbl and NeuralSpace on all of the datasets we benchmarked against.

NeuralSpace has achieved exceptional performance in Arabic dialects by training our speech-to-text (STT) model with carefully sourced and curated data, utilizing the expertise of our team of linguists who are proficient in all dialects. The model is an encoder-decoder transformer-based system that can accurately transcribe speech recordings of varying lengths, even in the presence of background noise or music, multiple speakers, or heavy compression.

Conclusion

NeuralSpace’s emphasis on creating accurate Arabic language models has resulted in exceptional STT transcription performance, surpassing STT providers that lead the industry in other languages. With a continued commitment to developing advanced algorithms and models that capture the nuances of regional Arabic dialects, NeuralSpace aims to provide seamless, reliable AI-powered experiences for customers in the Arabic-speaking world.

Get in touch to learn more about NeuralSpace or visit our website. Head to the NeuralSpace VoiceAI Platform to try out our STT service for free!
