
VoiceAI: Record Accuracy in Hindi Speech-to-Text

Juhi Jain

We're excited to announce that VoiceAI has set new standards not only in Arabic speech-to-text (STT) accuracy but also in Hindi. In these tests, NeuralSpace's STT model surpassed seven other providers, performing 138% better than OpenAI and 77% better than Google. In other words, NeuralSpace transcriptions contain, on average, 1.8 times fewer errors than Google's and 2.4 times fewer than OpenAI's.

Setting a new standard in Hindi speech recognition

The recent benchmarking of our Hindi model illustrates not just the establishment of a new industry benchmark in language AI, but also our relentless pursuit of advancing our own innovations. What truly differentiates our model is its unparalleled accuracy: a remarkable 15% relative improvement over our preceding model (as detailed in Table 1).

Our STT model has been meticulously trained on over 25,000 hours of audio data covering diverse voices: people of different ages, genders, accents, and dialects, recorded at varying sound quality. This robust training, complemented by human validation of AI-generated transcriptions, has resulted in a model that outperformed all other vendors.

In our analysis, we found that our STT technology excels even in the most challenging scenarios, such as speech muffled by intense background noise or compromised by microphone disturbances, a common hurdle in call-center environments.

With this milestone achievement in Hindi STT accuracy, we’re excited to announce the launch of new speech analytics features in VoiceAI, built on the foundation of our trailblazing speech recognition technology. Meaningful insights begin with precise transcription. Together, these technologies are helping us make significant progress towards our mission of dismantling language barriers in technology.

Table 1: NeuralSpace Hindi speech-to-text average word error rate (WER). Calculation: Change in WER / Previous WER

Benchmarking Methodology

We used the most common method to test the accuracy of speech-to-text (STT) systems, which is Word Error Rate (WER)*. This metric determines the percentage of words in the STT output that differ from the actual, 100% accurate “ground truth” transcription. The WER is calculated by dividing the total number of errors, which includes substitutions, deletions, and insertions, by the total number of words in the ground truth transcription.

WER = (Substitutions + Deletions + Insertions) / Number of words in the ground truth transcription

A lower WER indicates a higher accuracy of the STT system. See how even a marginal difference in WER can impact the quality of your transcription.
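To make the calculation concrete, here is a minimal sketch of a word-level WER computation in Python, using the standard Levenshtein (edit distance) dynamic program. This is an illustrative implementation for readers, not NeuralSpace's evaluation code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (Levenshtein distance).
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[-1][-1] / len(ref)

# One substitution in a four-word reference gives a WER of 25%.
print(word_error_rate("speech to text works", "speech two text works"))  # 0.25
```

Keep in mind that tokenization and text normalization (punctuation, casing, how numerals are written) materially affect the score, which is one reason WER figures vary across testing methodologies.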

Table 2: The comparison text for STT providers shows how the recognized output compares to the reference. Words in red indicate the errors with substitutions in italics, deletions being crossed out, and insertions indicated by an underscore. (Transcription 1 Audio; Transcription 2 Audio)

Test dataset

To ensure a comprehensive evaluation, we calculated the word error rate (WER) across 5 diverse datasets using 2,000 randomly selected audio samples.

Table 3: Test dataset descriptions

Results

For the following Hindi STT benchmark, we used WER as the metric across the selected datasets. NeuralSpace's model achieves the lowest WER (highest accuracy), outperforming seven vendors and performing 138% better than OpenAI and 77% better than Google.

Figure: Hindi STT benchmark results by provider (lower WER is better)

Dataset Benchmarking Results

Across the datasets, NeuralSpace consistently ranks among the top performers. Our Hindi STT model achieves an average WER of 18.05, showing its general effectiveness across multiple audio contexts.

Table 4: WER for different providers and datasets (lower WER is better)
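As a rough illustration of how that average can be computed, the sketch below macro-averages the per-dataset WERs. This assumes a simple unweighted mean (the post does not say whether the 18.05 figure is sample-weighted), and the Common Voice value is a placeholder since it is not quoted in the text:

```python
# NeuralSpace's per-dataset WERs as quoted in this post, except
# Common Voice (CV11), whose value here is a placeholder.
per_dataset_wer = {
    "Common Voice (CV11)": 16.5,  # placeholder, not quoted in the post
    "MUCS": 24.07,
    "Shrutilipi": 10.47,
    "Gram Vaani": 26.77,
    "ULCA": 12.48,
}

# Simple (macro) average across the five datasets.
average = sum(per_dataset_wer.values()) / len(per_dataset_wer)
print(f"Average WER: {average:.2f}")  # ~18.06 with this placeholder
```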

Dataset Diversity: From open-source Common Voice (CV11) entries to technical MUCS lectures and Shrutilipi news bulletins, our datasets span a wide audio spectrum. This diversity ensures robust testing of speech-to-text models across various audio qualities and dialects.

Superiority on Shrutilipi: NeuralSpace's performance on the Shrutilipi dataset stands out with a WER of 10.47. This dataset, derived from All India Radio news bulletins, emphasizes the model's capability in understanding and transcribing formal Hindi speech, a crucial feature for professional applications.

Competitive Edge on MUCS: On the MUCS dataset, which comprises technical lectures, NeuralSpace scores a WER of 24.07. This is noteworthy since technical lectures often contain domain-specific terminology, which can be challenging to transcribe.

Robustness on Gram Vaani: The Gram Vaani dataset, containing telephone-quality speech, presents unique challenges due to its audio quality and diverse regional accents. NeuralSpace's WER of 26.77 is commendable given the inherent difficulties of this dataset. Google, one of its major competitors, records a WER of 56.77 on the same dataset, a difference of 30.00 points in favor of NeuralSpace.

Universal Language Contribution (ULCA): On the ULCA dataset, which contains audio from varied sources like government TV and radio channels, NeuralSpace achieves a WER of 12.48. Azure, a close competitor, scores 27.75, a difference of 15.27 points.
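For reference, the headline vendor comparisons can be reproduced from per-provider WERs. Note this uses a different denominator than Table 1's model-upgrade calculation: "X% better" here is consistent with (their WER - our WER) / our WER, which also matches the "1.8x / 2.4x fewer errors" framing in the introduction. A minimal sketch under that assumption:

```python
def percent_better(wer_ours: float, wer_theirs: float) -> float:
    """Relative WER advantage: (their WER - our WER) / our WER, in percent.

    A result of 138 means the other provider's transcripts contain
    2.38x as many errors as ours (138% better = 2.38x fewer errors).
    """
    return (wer_theirs - wer_ours) / wer_ours * 100

# Using the Gram Vaani numbers quoted above:
print(f"{percent_better(26.77, 56.77):.0f}% better than Google")  # ~112%
```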

Leverage accuracy for advanced audio insights 

High-quality data insights start with accurate transcriptions. With the VoiceAI platform, you can generate audio insights to analyze post-call data, spot trends in your business, and track agent performance. With accurate STT combined with advanced analytics and translation capabilities, your business and customers get the best solution on the market.

Sign up to VoiceAI to try it for free.

Contact our sales team with any questions about our enterprise pricing and bespoke solutions. We’re here to help.

Footnotes

*Even though WER is the most common metric for evaluating STT vendors, it can vary significantly depending on testing methodologies and test sets. Hence, WER should be interpreted in relative terms. Read more

**Tests conducted in October 2023 comparing NeuralSpace VoiceAI's upgraded STT model against Google, Azure, AWS, OpenAI, Deepgram, Speechmatics, and SymblAI.

