Whisper-large-v3

Speech To Text

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on more than 5 million hours of labeled data, Whisper generalizes well to many datasets and domains in a zero-shot setting.


About the Whisper-large-v3 model

Published on Hugging Face

01/11/2023

License: Apache 2.0

Audio price

0.00004083 € / second

Output formats

json, verbose_json, text

Context sizes

Parameters

1.54B
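As a rough illustration of the per-second audio price listed above, the sketch below estimates the cost of a transcription. It assumes that billing is linear in the audio duration; the function name estimate_cost_eur is hypothetical and not part of any OVHcloud API.

```python
from decimal import Decimal

# Price per second of audio in euros, copied from the figure listed above.
PRICE_PER_SECOND_EUR = Decimal("0.00004083")


def estimate_cost_eur(duration_seconds):
    """Return the estimated price in euros for the given audio duration.

    Assumption: the price scales linearly with the duration in seconds.
    """
    # Go through str() so floats such as 60.936 do not carry binary rounding noise.
    return PRICE_PER_SECOND_EUR * Decimal(str(duration_seconds))


print(estimate_cost_eur(61))    # a clip reported as 61 seconds
print(estimate_cost_eur(3600))  # one hour of audio
```

Under that assumption, a 61-second clip comes to about 0.0025 € and one hour of audio to about 0.147 €.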


Audio (Transcription) API - whisper-large-v3

The OpenAI-compatible Audio transcription API endpoint allows you to recognize and transcribe audio, especially human speech, into text.

Introduction

Audio transcription is a technology that converts spoken language into written text. It is a complex process that involves several stages, including speech signal preprocessing, feature extraction, acoustic modeling, language modeling, and decoding by a speech recognition engine.

AI Endpoints makes it easy, with ready-to-use inference APIs. Discover how to use them:

Supported Audio Formats

The API accepts multiple audio formats, including:

mp3
mp4
aac
m4a
wav
flac
ogg
opus
webm
mpeg
mpga

Make sure your file is in one of these formats before sending it for transcription.
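A quick local check can catch an unsupported file before it is uploaded. The sketch below only compares the file extension against the list above; it does not inspect the actual container or codec, and the helper name has_supported_extension is hypothetical.

```python
from pathlib import Path

# Extensions taken from the list of supported formats above.
SUPPORTED_EXTENSIONS = {
    "mp3", "mp4", "aac", "m4a", "wav", "flac",
    "ogg", "opus", "webm", "mpeg", "mpga",
}


def has_supported_extension(path):
    """Return True when the file extension is one of the supported formats."""
    extension = Path(path).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS


print(has_supported_extension("my_audio.mp3"))
print(has_supported_extension("notes.txt"))
```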

How to?

The Audio transcription endpoint offers a wide range of options for transcribing audio files. The following examples show how to use them:

Enable authorization

First, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

If you do not have an access token yet, follow the instructions in the AI Endpoints – Getting Started guide.

With the Python OpenAI library

The whisper-large-v3 API is compatible with the OpenAI specification.

First install the openai library:

pip install openai

Next, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

Finally, run the following Python code:

import os
from openai import OpenAI

url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/"

audio_file_path = "my_audio.mp3"

client = OpenAI(
    base_url=url,
    api_key=os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')
)

# Open the file in a context manager so it is closed once the request completes
with open(audio_file_path, "rb") as audio_file:
    transcript_answer = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        temperature=0.0
    )

print(transcript_answer)

This returns the following:

Transcription(
    text=" It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.", 
    logprobs=None, 
    task='transcribe', 
    success=True, 
    language='en', 
    duration=60.936, 
    words=[], 
    segments=[
        {
            'id': 1, 
            'seek': 0, 
            'start': 0.0, 
            'end': 8.32, 
            'text': ' It is very exciting that the Olympic Games will be held in Paris in 2024 for France.', 
            'tokens': [50365, 467, 307, 588, 4670, 300, 264, 19169, 12761, 486, 312, 5167, 294, 8380, 294, 45237, 337, 6190, 13, 50793], 
            'temperature': 0.0, 
            'avg_logprob': -0.10994318, 
            'compression_ratio': 1.4497042, 
            'no_speech_prob': 0.0
        }, 
        ...,
        {
            'id': 8, 
            'seek': 5592, 
            'start': 56.48, 
            'end': 60.16, 
            'text': ' cheer, cheer on their home country.', 
            'tokens': [50365, 12581, 11, 12581, 322, 641, 1280, 1941, 13, 50584], 
            'temperature': 0.0, 
            'avg_logprob': -0.56960225, 
            'compression_ratio': 0.875, 
            'no_speech_prob': 0.00919342
        }
    ], 
    diarization=[], 
    usage={
        'type': 'duration', 
        'duration': 61.0
    }
)
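The segments in this response carry start and end times in seconds, which is enough to build subtitles. The following is a minimal sketch that turns such segments into SRT text, assuming each segment is a dict with 'start', 'end' and 'text' keys as shown above. The helper names format_timestamp and segments_to_srt are hypothetical.

```python
def format_timestamp(seconds):
    """Format a number of seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Build SRT subtitle text from dicts with 'start', 'end' and 'text' keys."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


# Two segments copied from the sample response above (other fields omitted).
sample_segments = [
    {"start": 0.0, "end": 8.32,
     "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France."},
    {"start": 56.48, "end": 60.16, "text": " cheer, cheer on their home country."},
]

print(segments_to_srt(sample_segments))
```

With the OpenAI client, the same function could be called as segments_to_srt(transcript_answer.segments), provided the segments are returned as dicts as in the output above.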

With curl

To process an audio file with curl, run the following command in your terminal:

curl -X POST "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions" \
  -H "accept: application/json" \
  -H "Authorization: Bearer $OVH_AI_ENDPOINTS_ACCESS_TOKEN" \
  -F "model=whisper-large-v3" \
  -F "file=@my_audio.mp3" \
  -F "temperature=0.0" \
  -H 'Content-Type: multipart/form-data'

This returns the following:

{
  "task": "transcribe",
  "success": true,
  "language": "en",
  "duration": 60.936,
  "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.",
  "words": [],
  "segments": [
    {
      "id": 1,
      "seek": 0,
      "start": 0,
      "end": 8.48,
      "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France.",
      "tokens": [50365,467,307,588,4670,300,264,19169,12761,486,312,5167,294,8380,294,45237,337,6190,13,50865],
      "temperature": 0,
      "avg_logprob": -0.31054688,
      "compression_ratio": 0.9767442,
      "no_speech_prob": 0
    },
    ...,
    {
      "id": 5,
      "seek": 3484,
      "start": 46.12,
      "end": 60.24,
      "text": " It is a chance for everyone to get involved and become interested about different types of sports and to cheer on their home country.",
      "tokens": [50885,467,307,257,2931,337,1518,281,483,3288,293,1813,3102,466,819,3467,295,6573,293,281,12581,322,641,1280,1941,13,51640],
      "temperature": 0,
      "avg_logprob": -0.11174939,
      "compression_ratio": 1.4268292,
      "no_speech_prob": 0
    }
  ],
  "diarization": [],
  "usage": {
    "type": "duration",
    "duration": 61
  }
}

With a simple Python HTTP client (requests)

First install the requests library:

pip install requests

Next, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

Finally, run the following Python code:

import os
import requests
 
url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions"

audio_file_path = "my_audio.mp3"

headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",
}

data = {
    "model": "whisper-large-v3",
    "temperature": 0.0,
    "timestamp_granularities": "segment, word",
    "response_format": "verbose_json"
}

# Open the file in a context manager so it is closed after the upload
with open(audio_file_path, "rb") as audio_file:
    files = {"file": audio_file}
    response = requests.post(url, headers=headers, files=files, data=data)

# Handle response
if response.status_code == 200:
    print(response.json())
else:
    print("Error:", response.status_code, response.text)

This returns the following result, which now includes word-level timestamps:

{
    'task': 'transcribe', 
    'success': True, 
    'language': 'en', 
    'duration': 60.936, 
    'text': " It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.", 
    'words': [
        {'word': 'It', 'start': 0.0, 'end': 0.7}, 
        {'word': 'is', 'start': 0.7, 'end': 0.92}, 
        {'word': 'very', 'start': 0.92, 'end': 1.18}, 
        ...,
        {'word': 'their', 'start': 58.66, 'end': 59.24}, 
        {'word': 'home', 'start': 59.24, 'end': 59.62}, 
        {'word': 'country.', 'start': 59.62, 'end': 60.16}
    ], 
    'segments': [
        {
            'id': 1, 
            'seek': 0, 
            'start': 0.0, 
            'end': 8.32, 
            'text': ' It is very exciting that the Olympic Games will be held in Paris in 2024 for France.', 
            'tokens': [50365, 467, 307, 588, 4670, 300, 264, 19169, 12761, 486, 312, 5167, 294, 8380, 294, 45237, 337, 6190, 13, 50793], 
            'temperature': 0.0, 
            'avg_logprob': -0.10994318, 
            'compression_ratio': 1.4497042, 
            'no_speech_prob': 0.0
        }, 
        ...,
        {
            'id': 8, 
            'seek': 5592, 
            'start': 56.48, 
            'end': 60.16, 
            'text': ' cheer, cheer on their home country.', 
            'tokens': [50365, 12581, 11, 12581, 322, 641, 1280, 1941, 13, 50584], 
            'temperature': 0.0, 
            'avg_logprob': -0.56960225, 
            'compression_ratio': 0.875, 
            'no_speech_prob': 0.00919342
        }
    ], 
    'diarization': [], 
    'usage': {
        'type': 'duration', 
        'duration': 61.0
    }
}
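Word-level timestamps make it possible to recover what was said in a given time window. The sketch below assumes the 'words' entries have the shape shown in the response above; the helper name words_between is hypothetical.

```python
def words_between(words, start, end):
    """Return the words that lie entirely within [start, end] seconds, joined by spaces."""
    selected = [w["word"] for w in words if w["start"] >= start and w["end"] <= end]
    return " ".join(selected)


# Word entries copied from the sample response above.
sample_words = [
    {"word": "It", "start": 0.0, "end": 0.7},
    {"word": "is", "start": 0.7, "end": 0.92},
    {"word": "very", "start": 0.92, "end": 1.18},
]

print(words_between(sample_words, 0.0, 1.0))  # prints "It is"
```

A word that straddles the window boundary is left out here; whether to include such words is a design choice that depends on the use case.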

Model rate limit

When using AI Endpoints, the following rate limits apply:

Anonymous: 2 requests per minute, per IP and per model.
Authenticated with an API access key: 400 requests per minute, per Public Cloud project and per model.

On top of the request rate limitations, AI Endpoints also applies the following limitations to the audio data:

Anonymous: 10MB file size or 60s audio duration per request.
Authenticated with an API access key: 2048MB file size or 10800s audio duration per request.
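These limits can be checked locally before sending a request. The sketch below uses the values listed above and assumes that 1 MB means 1024 * 1024 bytes, which the text does not specify. The function name within_audio_limits is hypothetical, and the audio duration must be supplied by the caller.

```python
# Limits copied from the list above.
# Assumption: 1 MB = 1024 * 1024 bytes (the text does not say which unit is meant).
MAX_BYTES = {
    "anonymous": 10 * 1024 * 1024,
    "authenticated": 2048 * 1024 * 1024,
}
MAX_SECONDS = {"anonymous": 60, "authenticated": 10800}


def within_audio_limits(size_bytes, duration_seconds, access="authenticated"):
    """Return True when both the file size and the duration fit the limits."""
    return (
        size_bytes <= MAX_BYTES[access]
        and duration_seconds <= MAX_SECONDS[access]
    )


print(within_audio_limits(5 * 1024 * 1024, 30, access="anonymous"))
print(within_audio_limits(11 * 1024 * 1024, 30, access="anonymous"))
```

The file size can be read with os.path.getsize(path) from the standard library; the duration has to come from an audio library of your choice.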

If you exceed these limits, different HTTP error codes will be returned depending on the type of limitation exceeded:

429 (Too Many Requests): When rate limits are exceeded
413 (Payload Too Large): When audio file size or duration exceeds the allowed limits

Note that exceeding either the rate limit or audio size/duration limits will result in an appropriate error response to help you understand and address the issue.
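One common way to handle a 429 response is to retry with exponential backoff. The following is a minimal sketch under stated assumptions: send is any callable that returns an object with a status_code attribute (such as a requests response), and the attempt count and delays are illustrative values, not recommendations from AI Endpoints.

```python
import time


def send_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until the response is not HTTP 429 or the attempts run out.

    The delay doubles after each rate-limited attempt: base_delay, 2x, 4x, ...
    The last response is returned even if it is still a 429.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return response
```

When the request uploads a file, the callable should reopen the file (or rewind it) on every attempt, since the file object has already been read after the first try. A 413 response is not retried here because resending the same file would fail again.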

If you require higher usage, please get in touch with us to discuss increasing your rate limits.

Going Further

Want to explore the full capabilities of the ASR API? Dive into our dedicated Speech to Text guide to discover advanced parameters (such as diarization, timestamp granularities, chunking strategy, and prompts) and ready-to-run code examples in Python, cURL, and JavaScript.

For a broader overview of AI Endpoints, explore the full AI Endpoints Documentation.

Reach out to our support team or join the #ai-endpoints channel on the OVHcloud Discord to share your questions, feedback, and suggestions for improving the service with the team and the community.

For more information about the Whisper model features, please refer to the Hugging Face documentation:

Whisper-large-v3

Get started with Whisper-large-v3

AI Endpoints - Getting Started

AI Endpoints - Features, Capabilities and Limitations

AI Endpoints - Troubleshooting

AI Endpoints - Using Virtual Models

AI Endpoints - Speech to Text

Create your own Audio Summarizer Assistant

Build a Powerful Voice Assistant

Speaker Diarization with ASR models