Whisper-large-v3-turbo

Speech To Text

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on more than 5 million hours of labeled data, Whisper generalizes strongly to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a fine-tuned version of a pruned Whisper large-v3. In other words, it is the same model, except that the number of decoder layers has been reduced from 32 to 4. As a result, the model is much faster, at the cost of a slight reduction in quality.

About the Whisper-large-v3-turbo model

Published on Hugging Face

01/09/2024

License: Apache 2.0

Audio price

0.00001278 € / second

Output formats

json, verbose_json, text

Context sizes

Parameters

0.81B
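To get a feel for the per-second price, here is a minimal sketch that estimates the cost of one transcription. It assumes that the billed duration is the audio duration rounded up to the next whole second, which is only inferred from the sample responses on this page (a 60.936 s file reported with a usage duration of 61) and is not confirmed. The function name `estimate_cost_eur` is hypothetical.

```python
import math

# Price listed on this page, in euros per second of audio.
PRICE_PER_SECOND_EUR = 0.00001278


def estimate_cost_eur(duration_seconds: float, round_up: bool = True) -> float:
    """Estimate the cost in euros of transcribing one audio file.

    Assumption: billing uses the duration rounded up to a whole second.
    Pass round_up=False to price the exact duration instead.
    """
    billed_seconds = math.ceil(duration_seconds) if round_up else duration_seconds
    return billed_seconds * PRICE_PER_SECOND_EUR


if __name__ == "__main__":
    # The 60.936 s sample file and a one-hour recording.
    print(f"{estimate_cost_eur(60.936):.8f} EUR")
    print(f"{estimate_cost_eur(3600):.6f} EUR")
```

Under these assumptions, one hour of audio costs roughly 0.046 €.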

Audio (Transcription) API - whisper-large-v3-turbo

The OpenAI-compatible Audio transcription API endpoint allows you to recognize and transcribe audio, especially human speech, into text.

Introduction

Audio transcription is a technology that converts spoken language into written text. It is a complex process that involves several stages, including speech signal preprocessing, feature extraction, acoustic modeling, language modeling, and decoding by a speech recognition engine.

AI Endpoints makes it easy, with ready-to-use inference APIs. Discover how to use them:

Supported Audio Formats

The API accepts multiple audio formats, including:

mp3
mp4
aac
m4a
wav
flac
ogg
opus
webm
mpeg
mpga

Make sure your file is in one of these formats before sending it for transcription.
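As a minimal sketch, the format check can be done locally before uploading. This only looks at the file extension, so it assumes the extension matches the actual content. The helper name `is_supported_audio` is hypothetical.

```python
from pathlib import Path

# Extensions taken from the list of supported formats above.
SUPPORTED_EXTENSIONS = {
    "mp3", "mp4", "aac", "m4a", "wav", "flac",
    "ogg", "opus", "webm", "mpeg", "mpga",
}


def is_supported_audio(path: str) -> bool:
    """Return True if the file extension is one of the supported formats."""
    extension = Path(path).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("my_audio.mp3", "TALK.WAV", "notes.txt"):
        print(name, is_supported_audio(name))
```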

How to?

The Audio transcription endpoint offers a wide range of transcription options for audio files. Learn how to use them with the following examples:

Enable authorization

First, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

If you do not have an access token yet, follow the instructions in the AI Endpoints – Getting Started guide.
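In Python, `os.getenv` silently returns `None` when the variable is missing, which only shows up later as an authentication error. A minimal sketch of an early check follows. The helper name `get_access_token` is hypothetical, and the `env` parameter exists only to make the function easy to test.

```python
import os


def get_access_token(env=os.environ) -> str:
    """Read the access token from the environment, failing early if it is unset."""
    token = env.get("OVH_AI_ENDPOINTS_ACCESS_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "OVH_AI_ENDPOINTS_ACCESS_TOKEN is not set; export it before running."
        )
    return token
```

The returned value can then be passed as `api_key` to the OpenAI client or placed in the `Authorization: Bearer` header.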

With the Python OpenAI library

The whisper-large-v3-turbo API is compatible with the OpenAI specification.

First install the openai library:

pip install openai

Next, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

Finally, run the following Python code:

import os
from openai import OpenAI

url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/"

audio_file_path = "my_audio.mp3"

client = OpenAI(
    base_url=url,
    api_key=os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')
)

# Open the file in a context manager so the handle is closed after the request
with open(audio_file_path, "rb") as audio_file:
    transcript_answer = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=audio_file,
        temperature=0.0
    )

print(transcript_answer)

This returns the following:

Transcription(
    text=" It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.", 
    logprobs=None, 
    task='transcribe', 
    success=True, 
    language='en', 
    duration=60.936, 
    words=[], 
    segments=[
        {
            'id': 1, 
            'seek': 0, 
            'start': 0.0, 
            'end': 8.32, 
            'text': ' It is very exciting that the Olympic Games will be held in Paris in 2024 for France.', 
            'tokens': [50365, 467, 307, 588, 4670, 300, 264, 19169, 12761, 486, 312, 5167, 294, 8380, 294, 45237, 337, 6190, 13, 50793], 
            'temperature': 0.0, 
            'avg_logprob': -0.10994318, 
            'compression_ratio': 1.4497042, 
            'no_speech_prob': 0.0
        }, 
        ...,
        {
            'id': 8, 
            'seek': 5592, 
            'start': 56.48, 
            'end': 60.16, 
            'text': ' cheer, cheer on their home country.', 
            'tokens': [50365, 12581, 11, 12581, 322, 641, 1280, 1941, 13, 50584], 
            'temperature': 0.0, 
            'avg_logprob': -0.56960225, 
            'compression_ratio': 0.875, 
            'no_speech_prob': 0.00919342
        }
    ], 
    diarization=[], 
    usage={
        'type': 'duration', 
        'duration': 61.0
    }
)
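Each segment in the response above carries `start` and `end` times in seconds. As a minimal sketch, assuming the segments are dicts shaped like the sample output, they can be turned into timestamped lines. The helper names `format_timestamp` and `format_segments` are hypothetical and not part of the API.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    minutes, remainder_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(remainder_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments) -> list[str]:
    """Build one '[start -> end] text' line per segment."""
    lines = []
    for segment in segments:
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} -> {end}] {segment['text'].strip()}")
    return lines


if __name__ == "__main__":
    # Two segments copied from the sample response (other fields omitted).
    sample_segments = [
        {"start": 0.0, "end": 8.32,
         "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France."},
        {"start": 56.48, "end": 60.16,
         "text": " cheer, cheer on their home country."},
    ]
    for line in format_segments(sample_segments):
        print(line)
```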

With curl

To process an audio file with curl, run the following command in your terminal:

curl -X POST "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions" \
  -H "accept: application/json" \
  -H "Authorization: Bearer $OVH_AI_ENDPOINTS_ACCESS_TOKEN" \
  -F "model=whisper-large-v3-turbo" \
  -F "file=@my_audio.mp3" \
  -F "temperature=0.0" \
  -H 'Content-Type: multipart/form-data'

This returns the following:

{
  "task": "transcribe",
  "success": true,
  "language": "en",
  "duration": 60.936,
  "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.",
  "words": [],
  "segments": [
    {
      "id": 1,
      "seek": 0,
      "start": 0,
      "end": 8.48,
      "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France.",
      "tokens": [50365,467,307,588,4670,300,264,19169,12761,486,312,5167,294,8380,294,45237,337,6190,13,50865],
      "temperature": 0,
      "avg_logprob": -0.31054688,
      "compression_ratio": 0.9767442,
      "no_speech_prob": 0
    },
    ...,
    {
      "id": 5,
      "seek": 3484,
      "start": 46.12,
      "end": 60.24,
      "text": " It is a chance for everyone to get involved and become interested about different types of sports and to cheer on their home country.",
      "tokens": [50885,467,307,257,2931,337,1518,281,483,3288,293,1813,3102,466,819,3467,295,6573,293,281,12581,322,641,1280,1941,13,51640],
      "temperature": 0,
      "avg_logprob": -0.11174939,
      "compression_ratio": 1.4268292,
      "no_speech_prob": 0
    }
  ],
  "diarization": [],
  "usage": {
    "type": "duration",
    "duration": 61
  }
}
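The `avg_logprob` and `no_speech_prob` fields of each segment can help spot passages worth a manual check. The sketch below flags segments whose average log-probability is low or whose no-speech probability is high. Both thresholds are illustrative assumptions rather than values recommended by the API, and the function name `flag_uncertain_segments` is hypothetical.

```python
def flag_uncertain_segments(segments, min_avg_logprob=-0.5, max_no_speech_prob=0.5):
    """Return the ids of segments that look unreliable.

    A segment is flagged when its avg_logprob is below min_avg_logprob
    or its no_speech_prob is above max_no_speech_prob.
    Both thresholds are illustrative and should be tuned on your own audio.
    """
    flagged = []
    for segment in segments:
        low_confidence = segment["avg_logprob"] < min_avg_logprob
        probably_silence = segment["no_speech_prob"] > max_no_speech_prob
        if low_confidence or probably_silence:
            flagged.append(segment["id"])
    return flagged


if __name__ == "__main__":
    # Values copied from the two segments shown in the JSON response above.
    sample_segments = [
        {"id": 1, "avg_logprob": -0.31054688, "no_speech_prob": 0},
        {"id": 5, "avg_logprob": -0.11174939, "no_speech_prob": 0},
    ]
    print(flag_uncertain_segments(sample_segments))
```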

With a simple Python HTTP client (requests)

First install the requests library:

pip install requests

Next, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

Finally, run the following Python code:

import os
import requests
 
url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions"

audio_file_path = "my_audio.mp3"

headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",
}

data = {
    "model": "whisper-large-v3-turbo",
    "temperature": 0.0,
    "timestamp_granularities": "segment, word",
    "response_format": "verbose_json"
}

# Open the file in a context manager so the handle is closed after the upload
with open(audio_file_path, "rb") as audio_file:
    response = requests.post(url, headers=headers, files={"file": audio_file}, data=data)

# Handle response
if response.status_code == 200:
    print(response.json())
else:
    print("Error:", response.status_code, response.text)

This returns the following result, which contains the audio transcription API response:

{
    'task': 'transcribe', 
    'success': True, 
    'language': 'en', 
    'duration': 60.936, 
    'text': " It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.", 
    'words': [
        {'word': 'It', 'start': 0.0, 'end': 0.7}, 
        {'word': 'is', 'start': 0.7, 'end': 0.92}, 
        {'word': 'very', 'start': 0.92, 'end': 1.18}, 
        ...,
        {'word': 'their', 'start': 58.66, 'end': 59.24}, 
        {'word': 'home', 'start': 59.24, 'end': 59.62}, 
        {'word': 'country.', 'start': 59.62, 'end': 60.16}
    ], 
    'segments': [
        {
            'id': 1, 
            'seek': 0, 
            'start': 0.0, 
            'end': 8.32, 
            'text': ' It is very exciting that the Olympic Games will be held in Paris in 2024 for France.', 
            'tokens': [50365, 467, 307, 588, 4670, 300, 264, 19169, 12761, 486, 312, 5167, 294, 8380, 294, 45237, 337, 6190, 13, 50793], 
            'temperature': 0.0, 
            'avg_logprob': -0.10994318, 
            'compression_ratio': 1.4497042, 
            'no_speech_prob': 0.0
        }, 
        ...,
        {
            'id': 8, 
            'seek': 5592, 
            'start': 56.48, 
            'end': 60.16, 
            'text': ' cheer, cheer on their home country.', 
            'tokens': [50365, 12581, 11, 12581, 322, 641, 1280, 1941, 13, 50584], 
            'temperature': 0.0, 
            'avg_logprob': -0.56960225, 
            'compression_ratio': 0.875, 
            'no_speech_prob': 0.00919342
        }
    ], 
    'diarization': [], 
    'usage': {
        'type': 'duration', 
        'duration': 61.0
    }
}
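The `words` list makes it possible to look up what was said in a given time window. Here is a minimal sketch, assuming each entry is a dict with `word`, `start` and `end` keys as in the response above. The function name `text_between` is hypothetical.

```python
def text_between(words, start: float, end: float) -> str:
    """Return the words fully contained in the [start, end] window, in seconds."""
    selected = [
        entry["word"]
        for entry in words
        if entry["start"] >= start and entry["end"] <= end
    ]
    return " ".join(selected)


if __name__ == "__main__":
    # First three words copied from the sample response.
    sample_words = [
        {"word": "It", "start": 0.0, "end": 0.7},
        {"word": "is", "start": 0.7, "end": 0.92},
        {"word": "very", "start": 0.92, "end": 1.18},
    ]
    print(text_between(sample_words, 0.0, 1.0))
```

Words that straddle a boundary of the window are left out. Relax the comparison if partial overlaps should count.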

Model rate limit

When using AI Endpoints, the following rate limits apply:

Anonymous: 2 requests per minute, per IP and per model.
Authenticated with an API access key: 400 requests per minute, per Public Cloud project and per model.

On top of the request rate limitations, AI Endpoints also applies the following limitations to the audio data:

Anonymous: 10MB file size or 60s audio duration per request.
Authenticated with an API access key: 2048MB file size or 10800s audio duration per request.
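These limits can be checked locally before uploading. The sketch below makes two assumptions that this page does not spell out: 1 MB is counted as 1024 × 1024 bytes, and a request must respect both the size limit and the duration limit. The tier names and the function name `within_limits` are hypothetical.

```python
# Limits copied from the list above; the byte conversion is an assumption.
MAX_BYTES = {
    "anonymous": 10 * 1024 * 1024,
    "authenticated": 2048 * 1024 * 1024,
}
MAX_SECONDS = {
    "anonymous": 60,
    "authenticated": 10800,
}


def within_limits(size_bytes: int, duration_seconds: float, tier: str = "authenticated") -> bool:
    """Return True if a file respects both the size and duration limits of a tier."""
    return size_bytes <= MAX_BYTES[tier] and duration_seconds <= MAX_SECONDS[tier]


if __name__ == "__main__":
    # A 5 MB file lasting 60.936 s, like the sample audio used on this page.
    print(within_limits(5_000_000, 60.936, tier="anonymous"))
    print(within_limits(5_000_000, 60.936, tier="authenticated"))
```

The file size can be read with `os.path.getsize`. Obtaining the duration requires an audio library and is left out of this sketch.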

If you exceed these limits, different HTTP error codes will be returned depending on the type of limitation exceeded:

429 (Too Many Requests): When rate limits are exceeded
413 (Payload Too Large): When audio file size or duration exceeds the allowed limits

Exceeding either the rate limit or the audio size/duration limits returns the corresponding error response, which helps you identify and address the issue.
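A client can react differently to these two codes: a 429 is worth retrying after a pause, while a 413 will fail again until the file is shortened or split. Below is a minimal sketch of a retry loop with exponential backoff. The function name `post_with_retry`, the number of attempts and the delays are all assumptions, and `send` stands for any zero-argument callable that performs the request and returns an object with a `status_code` attribute.

```python
import time


def post_with_retry(send, max_attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call send() and retry with exponential backoff while it returns HTTP 429.

    Any other status code, including 413, is returned immediately.
    After max_attempts calls, the last response is returned as-is.
    """
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response.status_code != 429 or attempt == max_attempts:
            return response
        # Wait 1x, 2x, 4x, ... base_delay seconds between attempts.
        sleep(base_delay * 2 ** (attempt - 1))
```

With requests, `send` could be a small function that reopens the audio file and calls `requests.post`, so that every attempt uploads the file from the beginning.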

If you require higher usage, please get in touch with us to discuss increasing your rate limits.

Going Further

Want to explore the full capabilities of the ASR API? Dive into our dedicated Speech to Text guide to discover advanced parameters (such as diarization, timestamp granularities, chunking strategy, and prompts) and ready-to-run code examples in Python, cURL, and JavaScript.

For a broader overview of AI Endpoints, explore the full AI Endpoints Documentation.

Reach out to our support team or join the OVHcloud Discord #ai-endpoints channel to share your questions, feedback, and suggestions for improving the service with the team and the community.

For more information about the Whisper model features, please refer to the Hugging Face documentation:

Whisper-large-v3-turbo