Whisper-large-v3

Speech To Text

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.

About Whisper-large-v3 model

Published on Hugging Face

01/11/2023

Licence: Apache 2.0

Audio price

0.00004083 € /second

Output Formats

json, verbose_json, text

Context Sizes

Parameters

1.54B

Audio (Transcription) API - whisper-large-v3

The OpenAI-compatible Audio transcription API endpoint allows you to recognize and transcribe audio, especially human speech, into text.

Introduction

Audio transcription converts spoken language into written text. It is a complex process that involves several stages, including speech signal preprocessing, feature extraction, acoustic modeling, language modeling, and decoding by a speech recognition engine.

AI Endpoints makes it easy, with ready-to-use inference APIs. Discover how to use them:

Supported Audio Formats

The API accepts multiple audio formats, including:

mp3
mp4
aac
m4a
wav
flac
ogg
opus
webm
mpeg
mpga

Make sure your file is in one of these formats before sending it for transcription.
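
As a minimal sketch of that check, the helper below compares a file's extension against the list above before you send anything. The function name `is_supported_audio` is hypothetical, and the check looks only at the file name, not at the actual audio encoding, so the API may still reject a mislabeled file.

```python
from pathlib import Path

# Extensions copied from the "Supported Audio Formats" list above.
SUPPORTED_EXTENSIONS = {
    "mp3", "mp4", "aac", "m4a", "wav", "flac",
    "ogg", "opus", "webm", "mpeg", "mpga",
}


def is_supported_audio(path: str) -> bool:
    """Return True if the file extension is in the supported list (case-insensitive)."""
    extension = Path(path).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("my_audio.mp3", "TALK.WAV", "notes.txt"):
        print(name, is_supported_audio(name))
```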

How to?

The Audio transcription endpoint offers a wide range of options for transcribing audio files. Learn how to use them with the following examples:

Enable authorization

First, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

If you do not have an access token yet, follow the instructions in the AI Endpoints – Getting Started guide.
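
In Python, it can help to fail early when the variable is missing instead of sending a request with an empty token. The sketch below assumes the same header layout as the requests example further down this page; the helper name `build_auth_headers` is hypothetical.

```python
import os


def build_auth_headers(env_var: str = "OVH_AI_ENDPOINTS_ACCESS_TOKEN") -> dict:
    """Read the access token from the environment and build the request headers."""
    token = os.getenv(env_var)
    if not token:
        # An unset or empty variable would otherwise produce "Bearer None" or "Bearer ".
        raise RuntimeError(f"{env_var} is not set; export it before calling the API")
    return {
        "accept": "application/json",
        "Authorization": f"Bearer {token}",
    }
```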

With the Python OpenAI library

The whisper-large-v3 API is compatible with the OpenAI specification.

First install the openai library:

pip install openai

Next, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

Finally, run the following Python code:

import os
from openai import OpenAI

url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/"

audio_file_path = "my_audio.mp3"

client = OpenAI(
    base_url=url,
    api_key=os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')
)

# Open the file in binary mode; the context manager closes it after the request
with open(audio_file_path, "rb") as audio_file:
    transcript_answer = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        temperature=0.0,
    )

print(transcript_answer)

This returns the following:

Transcription(
    text=" It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.", 
    logprobs=None, 
    task='transcribe', 
    success=True, 
    language='en', 
    duration=60.936, 
    words=[], 
    segments=[
        {
            'id': 1, 
            'seek': 0, 
            'start': 0.0, 
            'end': 8.32, 
            'text': ' It is very exciting that the Olympic Games will be held in Paris in 2024 for France.', 
            'tokens': [50365, 467, 307, 588, 4670, 300, 264, 19169, 12761, 486, 312, 5167, 294, 8380, 294, 45237, 337, 6190, 13, 50793], 
            'temperature': 0.0, 
            'avg_logprob': -0.10994318, 
            'compression_ratio': 1.4497042, 
            'no_speech_prob': 0.0
        }, 
        ...,
        {
            'id': 8, 
            'seek': 5592, 
            'start': 56.48, 
            'end': 60.16, 
            'text': ' cheer, cheer on their home country.', 
            'tokens': [50365, 12581, 11, 12581, 322, 641, 1280, 1941, 13, 50584], 
            'temperature': 0.0, 
            'avg_logprob': -0.56960225, 
            'compression_ratio': 0.875, 
            'no_speech_prob': 0.00919342
        }
    ], 
    diarization=[], 
    usage={
        'type': 'duration', 
        'duration': 61.0
    }
)
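
The `usage` field at the end of the response can be combined with the per-second audio price listed at the top of this page to get a rough cost estimate. This is only a sketch: it assumes that the billed duration equals `usage['duration']`, which this page does not state explicitly, and the function name `estimate_cost_eur` is hypothetical.

```python
# Per-second price taken from the "Audio price" field of this page.
PRICE_PER_SECOND_EUR = 0.00004083


def estimate_cost_eur(usage: dict) -> float:
    """Estimate the request cost from the 'usage' field of a transcription response."""
    if usage.get("type") != "duration":
        raise ValueError(f"unexpected usage type: {usage.get('type')!r}")
    # Assumption: the billed seconds are the ones reported in usage['duration'].
    return usage["duration"] * PRICE_PER_SECOND_EUR


if __name__ == "__main__":
    usage = {"type": "duration", "duration": 61.0}  # values from the response above
    print(f"{estimate_cost_eur(usage):.6f} EUR")  # prints "0.002491 EUR"
```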

With curl

To process an audio file with curl, run the following command in your terminal:

curl -X POST "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions" \
  -H "accept: application/json" \
  -H "Authorization: Bearer $OVH_AI_ENDPOINTS_ACCESS_TOKEN" \
  -F "model=whisper-large-v3" \
  -F "file=@my_audio.mp3" \
  -F "temperature=0.0" \
  -H 'Content-Type: multipart/form-data'

This returns the following:

{
  "task": "transcribe",
  "success": true,
  "language": "en",
  "duration": 60.936,
  "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.",
  "words": [],
  "segments": [
    {
      "id": 1,
      "seek": 0,
      "start": 0,
      "end": 8.48,
      "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France.",
      "tokens": [50365,467,307,588,4670,300,264,19169,12761,486,312,5167,294,8380,294,45237,337,6190,13,50865],
      "temperature": 0,
      "avg_logprob": -0.31054688,
      "compression_ratio": 0.9767442,
      "no_speech_prob": 0
    },
    ...,
    {
      "id": 5,
      "seek": 3484,
      "start": 46.12,
      "end": 60.24,
      "text": " It is a chance for everyone to get involved and become interested about different types of sports and to cheer on their home country.",
      "tokens": [50885,467,307,257,2931,337,1518,281,483,3288,293,1813,3102,466,819,3467,295,6573,293,281,12581,322,641,1280,1941,13,51640],
      "temperature": 0,
      "avg_logprob": -0.11174939,
      "compression_ratio": 1.4268292,
      "no_speech_prob": 0
    }
  ],
  "diarization": [],
  "usage": {
    "type": "duration",
    "duration": 61
  }
}
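
Each segment carries `avg_logprob` and `no_speech_prob`, which can serve as a rough quality signal. The sketch below lists the ids of segments that may deserve a manual review. The threshold values are assumptions borrowed from the defaults of the open-source Whisper reference implementation, the either/or rule is a simplification, and the function name `flag_uncertain_segments` is hypothetical.

```python
def flag_uncertain_segments(segments, min_avg_logprob=-1.0, max_no_speech_prob=0.6):
    """Return the ids of segments whose scores fall outside the given thresholds."""
    flagged = []
    for segment in segments:
        low_confidence = segment["avg_logprob"] < min_avg_logprob
        probably_silence = segment["no_speech_prob"] > max_no_speech_prob
        if low_confidence or probably_silence:
            flagged.append(segment["id"])
    return flagged


if __name__ == "__main__":
    # Scores copied from the two segments shown in the curl response above.
    segments = [
        {"id": 1, "avg_logprob": -0.31054688, "no_speech_prob": 0},
        {"id": 5, "avg_logprob": -0.11174939, "no_speech_prob": 0},
    ]
    print(flag_uncertain_segments(segments))                        # []
    print(flag_uncertain_segments(segments, min_avg_logprob=-0.2))  # [1]
```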

With a simple Python HTTP client (requests)

First install the requests library:

pip install requests

Next, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

Finally, run the following Python code:

import os
import requests
 
url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions"

audio_file_path = "my_audio.mp3"

headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",
}

data = {
    "model": "whisper-large-v3",
    "temperature": 0.0,
    "timestamp_granularities": "segment, word",
    "response_format": "verbose_json"
}

# Open the file in binary mode; the context manager closes it after the upload
with open(audio_file_path, "rb") as audio_file:
    response = requests.post(url, headers=headers, files={"file": audio_file}, data=data)

# Handle response
if response.status_code == 200:
    print(response.json())
else:
    print("Error:", response.status_code, response.text)

This returns the following result, which includes word-level timestamps in the words field:

{
    'task': 'transcribe', 
    'success': True, 
    'language': 'en', 
    'duration': 60.936, 
    'text': " It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.", 
    'words': [
        {'word': 'It', 'start': 0.0, 'end': 0.7}, 
        {'word': 'is', 'start': 0.7, 'end': 0.92}, 
        {'word': 'very', 'start': 0.92, 'end': 1.18}, 
        ...,
        {'word': 'their', 'start': 58.66, 'end': 59.24}, 
        {'word': 'home', 'start': 59.24, 'end': 59.62}, 
        {'word': 'country.', 'start': 59.62, 'end': 60.16}
    ], 
    'segments': [
        {
            'id': 1, 
            'seek': 0, 
            'start': 0.0, 
            'end': 8.32, 
            'text': ' It is very exciting that the Olympic Games will be held in Paris in 2024 for France.', 
            'tokens': [50365, 467, 307, 588, 4670, 300, 264, 19169, 12761, 486, 312, 5167, 294, 8380, 294, 45237, 337, 6190, 13, 50793], 
            'temperature': 0.0, 
            'avg_logprob': -0.10994318, 
            'compression_ratio': 1.4497042, 
            'no_speech_prob': 0.0
        }, 
        ...,
        {
            'id': 8, 
            'seek': 5592, 
            'start': 56.48, 
            'end': 60.16, 
            'text': ' cheer, cheer on their home country.', 
            'tokens': [50365, 12581, 11, 12581, 322, 641, 1280, 1941, 13, 50584], 
            'temperature': 0.0, 
            'avg_logprob': -0.56960225, 
            'compression_ratio': 0.875, 
            'no_speech_prob': 0.00919342
        }
    ], 
    'diarization': [], 
    'usage': {
        'type': 'duration', 
        'duration': 61.0
    }
}
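
The `start`, `end`, and `text` fields of each segment are enough to build subtitles. As an illustration, the sketch below turns the `segments` list of a response like the one above into SubRip (SRT) text. The helper names `format_srt_timestamp` and `segments_to_srt` are hypothetical and not part of the API.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT subtitle text from a list of segments with start, end and text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_srt_timestamp(segment["start"])
        end = format_srt_timestamp(segment["end"])
        # Segment text starts with a space in the API response, so strip it.
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    segments = [
        {"start": 0.0, "end": 8.32, "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France."},
        {"start": 56.48, "end": 60.16, "text": " cheer, cheer on their home country."},
    ]
    print(segments_to_srt(segments))
```

Numbering the subtitles with `enumerate` rather than the segment `id` keeps the SRT indices consecutive even if you filter some segments out first.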

Model rate limit

When using AI Endpoints, the following rate limits apply:

Anonymous: 2 requests per minute, per IP and per model.
Authenticated with an API access key: 400 requests per minute, per Public Cloud project and per model.

On top of the request rate limitations, AI Endpoints also applies the following limitations to the audio data:

Anonymous: 10 MB file size or 60 s audio duration per request.
Authenticated with an API access key: 2048 MB file size or 10800 s audio duration per request.
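
A client-side check against these limits can avoid a wasted upload. The sketch below makes two assumptions that this page does not confirm: that 1 MB means 1024 × 1024 bytes, and that exceeding either the size or the duration limit is enough to be rejected. The function name `check_audio_limits` is hypothetical, and the caller is expected to supply the audio duration.

```python
# Limits copied from the list above, keyed by access tier.
MAX_FILE_SIZE_MB = {"anonymous": 10, "authenticated": 2048}
MAX_DURATION_S = {"anonymous": 60, "authenticated": 10800}


def check_audio_limits(size_bytes: int, duration_s: float, tier: str = "authenticated") -> list:
    """Return a list of human-readable problems; an empty list means within limits."""
    problems = []
    # Assumption: 1 MB = 1024 * 1024 bytes.
    max_bytes = MAX_FILE_SIZE_MB[tier] * 1024 * 1024
    if size_bytes > max_bytes:
        problems.append(f"file is {size_bytes} bytes, limit is {max_bytes} bytes")
    if duration_s > MAX_DURATION_S[tier]:
        problems.append(f"audio lasts {duration_s} s, limit is {MAX_DURATION_S[tier]} s")
    return problems


if __name__ == "__main__":
    # os.path.getsize("my_audio.mp3") would give size_bytes for a real file.
    print(check_audio_limits(11 * 1024 * 1024, 60.936, tier="anonymous"))
```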

If you exceed these limits, different HTTP error codes will be returned depending on the type of limitation exceeded:

429 (Too Many Requests): When rate limits are exceeded
413 (Payload Too Large): When audio file size or duration exceeds the allowed limits

Note that exceeding either the rate limit or audio size/duration limits will result in an appropriate error response to help you understand and address the issue.
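
One common way to handle a 429 response is to retry with exponential backoff, while treating 413 as final since resending the same file cannot succeed. The sketch below is a generic pattern rather than an official client: `call_with_retry` and its parameters are hypothetical, and `send_request` stands for any zero-argument callable that returns an object with a `status_code` attribute, such as a `requests` response.

```python
import time


def call_with_retry(send_request, max_attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Call send_request(), retrying with exponential backoff while it returns HTTP 429."""
    response = None
    for attempt in range(max_attempts):
        response = send_request()
        if response.status_code != 429:
            # Success or a non-retryable error such as 413: hand it back to the caller.
            return response
        if attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x... the base delay before the next attempt.
            sleep(base_delay_s * 2 ** attempt)
    # Still rate limited after the last attempt: return the final 429 response.
    return response
```

When wrapping a file upload, open the audio file inside the callable passed as `send_request`, so that each retry sends the file from the beginning.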

If you require higher usage, please get in touch with us to discuss increasing your rate limits.

Going Further

Want to explore the full capabilities of the ASR API? Dive into our dedicated Speech to Text guide to discover advanced parameters (such as diarization, timestamp granularities, chunking strategy, and prompts) and ready-to-run code examples in Python, cURL, and JavaScript.

For a broader overview of AI Endpoints, explore the full AI Endpoints Documentation.

Reach out to our support team or join the #ai-endpoints channel on the OVHcloud Discord to share your questions, feedback, and suggestions for improving the service with the team and the community.

For more information about the Whisper model features, please refer to the Hugging Face documentation:

Whisper-large-v3

Get started with Whisper-large-v3

AI Endpoints - Getting Started

AI Endpoints - Features, Capabilities and Limitations

AI Endpoints - Troubleshooting

AI Endpoints - Using Virtual Models

AI Endpoints - Speech to Text

Create your own Audio Summarizer Assistant

Build a Powerful Voice Assistant

Speaker Diarization with ASR models