Speech-to-Text Conversion Using OpenAI Whisper

The service can be used, for example, to convert podcasts or audio recordings of lectures to text, including interpreted recordings. Transcripts of interpreted recordings can then be translated from the original language.

A single batch can process a recording of up to 90 minutes, or more generally any recording that can be compressed to 25 MB in good quality.
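The relationship between the 25 MB limit and the recording length can be checked with simple arithmetic. The sketch below is only an illustration: the function name is hypothetical, it assumes that 25 MB means 25 × 1024 × 1024 bytes, and it ignores container overhead.

```python
def max_bitrate_kbps(duration_minutes: float,
                     size_limit_bytes: int = 25 * 1024 * 1024) -> float:
    """Return the highest average bitrate (kbit/s) that keeps a recording
    of the given length under the size limit.

    Assumption: the limit is 25 MiB and container overhead is ignored.
    """
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")
    total_bits = size_limit_bytes * 8
    seconds = duration_minutes * 60
    return total_bits / seconds / 1000


# A 90-minute recording has to average under about 38.8 kbit/s.
print(round(max_bitrate_kbps(90), 1))  # 38.8
```

Under these assumptions a 90-minute recording leaves roughly 39 kbit/s for the audio, so longer recordings need a correspondingly lower bitrate to fit.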

The service includes:

Audio enhancement of voice recordings
Audio compression to 25 MB
GPT-4 transcription with language selection
Optional: GPT-4 translation during transcription (limited context)
Optional: Command tuning for transcription for more accurate processing
Saving the transcription to a TXT file (without paragraphs)

Price: 500 CZK per file, or 600 CZK per file with translation during transcription.

Optionally, command tuning can be performed to correct transcription of names or unusual terms. This part is billed hourly and typically takes 1 hour.
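As an illustration of how language selection and term hints might be passed to the model, here is a minimal sketch. It assumes the hosted OpenAI transcription endpoint (model `whisper-1`) called through the official `openai` Python package, and it assumes that the tuning described above corresponds to that endpoint's `prompt` parameter; the service's actual pipeline is not documented here. The helper names, file paths, and glossary are hypothetical.

```python
from pathlib import Path


def build_transcription_params(language: str, glossary: list[str]) -> dict:
    """Assemble keyword arguments for a transcription request.

    Assumption: names and unusual terms are passed as a comma-separated
    hint through the `prompt` parameter.
    """
    params = {"model": "whisper-1", "language": language}
    terms = [term.strip() for term in glossary if term.strip()]
    if terms:
        params["prompt"] = ", ".join(terms)
    return params


def transcribe_to_txt(audio_path: str, txt_path: str,
                      language: str, glossary: list[str]) -> None:
    """Send one audio file for transcription and save the text as TXT."""
    # Imported lazily so the helper above works without the package installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(audio_path, "rb") as audio:
        result = client.audio.transcriptions.create(
            file=audio, **build_transcription_params(language, glossary)
        )
    Path(txt_path).write_text(result.text, encoding="utf-8")
```

A call such as `transcribe_to_txt("lecture.mp3", "lecture.txt", "cs", ["Whisper", "Karel Čapek"])` would request a Czech transcription with two term hints. The hint only biases the model toward the listed spellings and does not guarantee them.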

When transcribing interpreted recordings, the system attempts to recognize only the selected language; other languages are not transcribed. The transcription may still capture side conversation between the interpreter and the speaker, or contain incorrect fragments caused by wrong language detection. The transcription can, of course, be repeated for the second language to obtain both the original and the interpreted version.

Additional Processing

Optionally, it is possible to format the transcriptions into logical paragraphs and remove any repeated sentences using GPT-4. The processing fee is 200 CZK per file.

For higher translation accuracy and the option to tune specialized terminology, I recommend translating the finished transcription in a separate step.

Whisper is an advanced automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask data collected from the web. This large and diverse dataset improves its robustness to accents, background noise, and technical language. Whisper supports transcription in multiple languages and also offers translation.

Compared with similar services offered by Office 365 and Google Workspace, Whisper stands out mainly for its ability to process large volumes of data and to deliver more accurate results. The tools integrated into Office 365 and Google Workspace are efficient for speech transcription, but they may struggle with translation or with difficult recordings that contain technical jargon or strong accents. Whisper, on the other hand, has proven more robust across languages and dialects.

Services