Whisper-large-v3-turbo

Speech To Text

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on more than 5 million hours of labeled data, Whisper generalizes well to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a fine-tuned version of a pruned Whisper large-v3. In other words, it is the same model, except that the number of decoder layers has been reduced from 32 to 4. As a result, the model is much faster, at the cost of a minor quality degradation.

About the Whisper-large-v3-turbo model

Published on Hugging Face

01/09/2024

License: Apache 2.0

Audio price

0.00001278 € / second

Output formats

json, verbose_json, text

Context sizes

Parameters

0.81B
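
As a rough illustration of the per-second audio price listed above, the sketch below multiplies a duration by that price. It assumes billing is strictly linear in the audio duration, with no rounding or minimum charge; the function name `estimate_cost_eur` is hypothetical and not part of any API.

```python
# Audio price listed on this page, in euros per second of audio.
PRICE_PER_SECOND_EUR = 0.00001278


def estimate_cost_eur(duration_seconds: float) -> float:
    """Estimate the transcription cost for a given audio duration.

    Assumption: cost = duration x price, with no rounding or minimum charge.
    """
    return duration_seconds * PRICE_PER_SECOND_EUR


if __name__ == "__main__":
    # One hour of audio is 3600 seconds.
    print(f"1 hour of audio: ~{estimate_cost_eur(3600):.4f} EUR")
```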

Audio (Transcription) API - whisper-large-v3-turbo

The OpenAI-compatible Audio transcription API endpoint allows you to recognize and transcribe audio, especially human speech, into text.

Introduction

Audio transcription is a technology that converts spoken language into written text. It is a complex process that involves several stages, including speech signal preprocessing, feature extraction, acoustic modeling, language modeling, and decoding by the speech recognition engine.

AI Endpoints makes it easy, with ready-to-use inference APIs. Discover how to use them:

Supported Audio Formats

The API accepts multiple audio formats, including:

mp3
mp4
aac
m4a
wav
flac
ogg
opus
webm
mpeg
mpga

Make sure your file is in one of these formats before sending it for transcription.
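
A quick client-side check can catch unsupported files before they are uploaded. The sketch below compares the file extension against the list above. The helper name `has_supported_extension` is hypothetical, and the check only looks at the file name, not at the actual audio encoding.

```python
from pathlib import Path

# Extensions taken from the supported-format list above.
SUPPORTED_EXTENSIONS = {
    "mp3", "mp4", "aac", "m4a", "wav", "flac",
    "ogg", "opus", "webm", "mpeg", "mpga",
}


def has_supported_extension(path: str) -> bool:
    """Return True if the file extension appears in the supported list.

    Hypothetical helper: it inspects the file name only and does not
    verify the audio content itself.
    """
    extension = Path(path).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("my_audio.mp3", "interview.WAV", "notes.txt"):
        print(name, has_supported_extension(name))
```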

How to?

The Audio transcription endpoint offers a wide range of options for transcribing audio files. Learn how to use them with the following examples:

Enable authorization

First, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

If you do not have an access token yet, follow the instructions in the AI Endpoints – Getting Started guide.

With the Python OpenAI library

The whisper-large-v3-turbo API is compatible with the OpenAI specification.

First install the openai library:

pip install openai

Next, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

Finally, run the following Python code:

import os
from openai import OpenAI

# Base URL of the OpenAI-compatible AI Endpoints API
url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/"

audio_file_path = "my_audio.mp3"

client = OpenAI(
    base_url=url,
    api_key=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN"),
)

# Open the audio file in binary mode; the context manager closes it after the request
with open(audio_file_path, "rb") as audio_file:
    transcript_answer = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=audio_file,
        temperature=0.0,
    )

print(transcript_answer)

Which returns the following:

Transcription(
    text=" It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.", 
    logprobs=None, 
    task='transcribe', 
    success=True, 
    language='en', 
    duration=60.936, 
    words=[], 
    segments=[
        {
            'id': 1, 
            'seek': 0, 
            'start': 0.0, 
            'end': 8.32, 
            'text': ' It is very exciting that the Olympic Games will be held in Paris in 2024 for France.', 
            'tokens': [50365, 467, 307, 588, 4670, 300, 264, 19169, 12761, 486, 312, 5167, 294, 8380, 294, 45237, 337, 6190, 13, 50793], 
            'temperature': 0.0, 
            'avg_logprob': -0.10994318, 
            'compression_ratio': 1.4497042, 
            'no_speech_prob': 0.0
        }, 
        ...,
        {
            'id': 8, 
            'seek': 5592, 
            'start': 56.48, 
            'end': 60.16, 
            'text': ' cheer, cheer on their home country.', 
            'tokens': [50365, 12581, 11, 12581, 322, 641, 1280, 1941, 13, 50584], 
            'temperature': 0.0, 
            'avg_logprob': -0.56960225, 
            'compression_ratio': 0.875, 
            'no_speech_prob': 0.00919342
        }
    ], 
    diarization=[], 
    usage={
        'type': 'duration', 
        'duration': 61.0
    }
)
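
The segment entries in this response carry start and end times, which makes it easy to print a timestamped transcript. The following is a minimal sketch, assuming the segments are available as a list of dictionaries shaped like the ones printed above. The function name `format_segments` is hypothetical, and the sample data is copied from the response with most fields trimmed.

```python
def format_segments(segments):
    """Build one '[start -> end] text' line per segment dictionary."""
    lines = []
    for segment in segments:
        text = segment["text"].strip()  # segment text starts with a space
        lines.append(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {text}")
    return lines


# Two segments copied from the sample response above (fields trimmed).
sample_segments = [
    {"id": 1, "start": 0.0, "end": 8.32,
     "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France."},
    {"id": 8, "start": 56.48, "end": 60.16,
     "text": " cheer, cheer on their home country."},
]

if __name__ == "__main__":
    for line in format_segments(sample_segments):
        print(line)
```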

With curl

To process an audio file with curl, run the following command in your terminal:

curl -X POST "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions" \
  -H "accept: application/json" \
  -H "Authorization: Bearer $OVH_AI_ENDPOINTS_ACCESS_TOKEN" \
  -F "model=whisper-large-v3-turbo" \
  -F "file=@my_audio.mp3" \
  -F "temperature=0.0"

Which returns the following:

{
  "task": "transcribe",
  "success": true,
  "language": "en",
  "duration": 60.936,
  "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.",
  "words": [],
  "segments": [
    {
      "id": 1,
      "seek": 0,
      "start": 0,
      "end": 8.48,
      "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France.",
      "tokens": [50365,467,307,588,4670,300,264,19169,12761,486,312,5167,294,8380,294,45237,337,6190,13,50865],
      "temperature": 0,
      "avg_logprob": -0.31054688,
      "compression_ratio": 0.9767442,
      "no_speech_prob": 0
    },
    ...,
    {
      "id": 5,
      "seek": 3484,
      "start": 46.12,
      "end": 60.24,
      "text": " It is a chance for everyone to get involved and become interested about different types of sports and to cheer on their home country.",
      "tokens": [50885,467,307,257,2931,337,1518,281,483,3288,293,1813,3102,466,819,3467,295,6573,293,281,12581,322,641,1280,1941,13,51640],
      "temperature": 0,
      "avg_logprob": -0.11174939,
      "compression_ratio": 1.4268292,
      "no_speech_prob": 0
    }
  ],
  "diarization": [],
  "usage": {
    "type": "duration",
    "duration": 61
  }
}
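
One common use of the segment timings in this JSON is generating subtitles. The sketch below converts the `segments` array into SubRip (SRT) text. It assumes the response has been saved or captured as a JSON string; the function names `srt_timestamp` and `segments_to_srt` are hypothetical, and the sample input keeps only the fields the conversion needs.

```python
import json


def srt_timestamp(seconds: float) -> str:
    """Convert a time in seconds to the SRT 'HH:MM:SS,mmm' format."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(response: dict) -> str:
    """Render each segment as a numbered SRT cue."""
    blocks = []
    # Cues are numbered by position, not by the segment 'id' field.
    for index, segment in enumerate(response["segments"], start=1):
        start = srt_timestamp(segment["start"])
        end = srt_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


# Trimmed copy of the response above: first and last segments only.
raw = """
{
  "segments": [
    {"id": 1, "start": 0, "end": 8.48,
     "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France."},
    {"id": 5, "start": 46.12, "end": 60.24,
     "text": " It is a chance for everyone to get involved and become interested about different types of sports and to cheer on their home country."}
  ]
}
"""

if __name__ == "__main__":
    print(segments_to_srt(json.loads(raw)))
```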

With a simple Python HTTP client (requests)

First install the requests library:

pip install requests

Next, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

Finally, run the following Python code:

import os
import requests

url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions"

audio_file_path = "my_audio.mp3"

headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",
}

# Form fields sent alongside the audio file
data = {
    "model": "whisper-large-v3-turbo",
    "temperature": 0.0,
    "timestamp_granularities": "segment, word",
    "response_format": "verbose_json"
}

# Open the file in binary mode; the context manager closes it after the upload
with open(audio_file_path, "rb") as audio_file:
    response = requests.post(url, headers=headers, files={"file": audio_file}, data=data)

# Handle response
if response.status_code == 200:
    print(response.json())
else:
    print("Error:", response.status_code, response.text)

This returns the following audio transcription response, which now includes word-level timestamps:

{
    'task': 'transcribe', 
    'success': True, 
    'language': 'en', 
    'duration': 60.936, 
    'text': " It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.", 
    'words': [
        {'word': 'It', 'start': 0.0, 'end': 0.7}, 
        {'word': 'is', 'start': 0.7, 'end': 0.92}, 
        {'word': 'very', 'start': 0.92, 'end': 1.18}, 
        ...,
        {'word': 'their', 'start': 58.66, 'end': 59.24}, 
        {'word': 'home', 'start': 59.24, 'end': 59.62}, 
        {'word': 'country.', 'start': 59.62, 'end': 60.16}
    ], 
    'segments': [
        {
            'id': 1, 
            'seek': 0, 
            'start': 0.0, 
            'end': 8.32, 
            'text': ' It is very exciting that the Olympic Games will be held in Paris in 2024 for France.', 
            'tokens': [50365, 467, 307, 588, 4670, 300, 264, 19169, 12761, 486, 312, 5167, 294, 8380, 294, 45237, 337, 6190, 13, 50793], 
            'temperature': 0.0, 
            'avg_logprob': -0.10994318, 
            'compression_ratio': 1.4497042, 
            'no_speech_prob': 0.0
        }, 
        ...,
        {
            'id': 8, 
            'seek': 5592, 
            'start': 56.48, 
            'end': 60.16, 
            'text': ' cheer, cheer on their home country.', 
            'tokens': [50365, 12581, 11, 12581, 322, 641, 1280, 1941, 13, 50584], 
            'temperature': 0.0, 
            'avg_logprob': -0.56960225, 
            'compression_ratio': 0.875, 
            'no_speech_prob': 0.00919342
        }
    ], 
    'diarization': [], 
    'usage': {
        'type': 'duration', 
        'duration': 61.0
    }
}
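
With word-level timestamps, you can look up what was said during a given time window. The following is a minimal sketch, assuming the `words` list has the shape shown in the response above. The function name `words_between` is hypothetical, and the sample data contains only the six words visible in that response.

```python
def words_between(words, start, end):
    """Return the words whose time span lies entirely within [start, end]."""
    return [w["word"] for w in words if w["start"] >= start and w["end"] <= end]


# The six word entries visible in the sample response above.
sample_words = [
    {"word": "It", "start": 0.0, "end": 0.7},
    {"word": "is", "start": 0.7, "end": 0.92},
    {"word": "very", "start": 0.92, "end": 1.18},
    {"word": "their", "start": 58.66, "end": 59.24},
    {"word": "home", "start": 59.24, "end": 59.62},
    {"word": "country.", "start": 59.62, "end": 60.16},
]

if __name__ == "__main__":
    print(" ".join(words_between(sample_words, 0.0, 1.0)))    # It is
    print(" ".join(words_between(sample_words, 58.0, 61.0)))  # their home country.
```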

Model rate limit

When using AI Endpoints, the following rate limits apply:

Anonymous: 2 requests per minute, per IP and per model.
Authenticated with an API access key: 400 requests per minute, per Public Cloud project and per model.

On top of the request rate limitations, AI Endpoints also applies the following limitations to the audio data:

Anonymous: 10MB file size or 60s audio duration per request.
Authenticated with an API access key: 2048MB file size or 10800s audio duration per request.
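
These limits can be checked locally before uploading. The sketch below encodes the two tiers above and tests a file against them. It rests on two assumptions: that "MB" means 1024 x 1024 bytes, and that a request is rejected as soon as either the size or the duration limit is exceeded. The names `AudioLimits` and `within_limits` are hypothetical.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: the documented limits use binary megabytes


@dataclass(frozen=True)
class AudioLimits:
    max_size_bytes: int
    max_duration_s: float


# Values taken from the limits listed above.
ANONYMOUS = AudioLimits(max_size_bytes=10 * MB, max_duration_s=60)
AUTHENTICATED = AudioLimits(max_size_bytes=2048 * MB, max_duration_s=10800)


def within_limits(size_bytes: int, duration_s: float, limits: AudioLimits) -> bool:
    """Return True if both the file size and the duration fit the given tier."""
    return size_bytes <= limits.max_size_bytes and duration_s <= limits.max_duration_s


if __name__ == "__main__":
    # The 60.936 s sample used on this page, with an illustrative 5 MB size.
    print(within_limits(5 * MB, 60.936, ANONYMOUS))
    print(within_limits(5 * MB, 60.936, AUTHENTICATED))
```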

If you exceed these limits, different HTTP error codes will be returned depending on the type of limitation exceeded:

429 (Too Many Requests): When rate limits are exceeded
413 (Payload Too Large): When audio file size or duration exceeds the allowed limits
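
A client can react differently to each of these codes: retrying after a pause makes sense for 429, while 413 will not succeed until the file is made smaller or shorter. The sketch below shows one possible approach using exponential backoff. The `send` argument is a hypothetical zero-argument callable that performs the request and returns a `(status_code, body)` tuple, and the delay values are illustrative, not recommendations from the service.

```python
import time


def send_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops returning 429 or attempts run out.

    `send` is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status == 429 and attempt < max_attempts - 1:
            # Rate limited: wait 1x, 2x, 4x, ... the base delay, then retry.
            sleep(base_delay * 2 ** attempt)
            continue
        if status == 413:
            # Retrying cannot help; the file itself must change.
            raise ValueError("Audio file size or duration exceeds the allowed limits")
        return status, body


if __name__ == "__main__":
    # Simulated responses: two rate-limit errors, then success.
    fake_responses = iter([(429, ""), (429, ""), (200, "ok")])
    result = send_with_retry(lambda: next(fake_responses), sleep=lambda s: None)
    print(result)
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.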

Note that exceeding either the rate limit or audio size/duration limits will result in an appropriate error response to help you understand and address the issue.

If you require higher usage, please get in touch with us to discuss increasing your rate limits.

Going Further

Want to explore the full capabilities of the ASR API? Dive into our dedicated Speech to Text guide to discover advanced parameters (such as diarization, timestamp granularities, chunking strategy, and prompts) and ready-to-run code examples in Python, cURL, and JavaScript.

For a broader overview of AI Endpoints, explore the full AI Endpoints Documentation.

Reach out to our support team or join the OVHcloud Discord #ai-endpoints channel to share your questions, feedback, and suggestions for improving the service with the team and the community.

For more information about the Whisper model features, please refer to the Hugging Face documentation:

Whisper-large-v3-turbo