Whisper-large-v3-turbo

Speech To Text

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on more than 5 million hours of labeled data, Whisper generalizes well to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is an optimized version of a pruned Whisper large-v3. In other words, it is exactly the same model, except that the number of decoding layers has been reduced from 32 to 4. As a result, the model is much faster, at the cost of a small degradation in quality.


About the Whisper-large-v3-turbo model

Published on Hugging Face

01/09/2024

License: Apache 2.0

Audio price

0.00001278 € / second

Output formats

json, verbose_json, text

Context sizes

Parameters

0.81B


Audio (Transcription) API - whisper-large-v3-turbo

The OpenAI-compatible Audio transcription API endpoint allows you to recognize and transcribe audio, especially human speech, into text.

Introduction

Audio transcription is a technology that converts spoken language into written text. It is a complex process that involves several stages, including speech signal preprocessing, feature extraction, acoustic modeling, language modeling, and decoding by a speech recognition engine.

AI Endpoints makes it easy, with ready-to-use inference APIs. Discover how to use them:

Supported Audio Formats

The API accepts multiple audio formats, including:

mp3
mp4
aac
m4a
wav
flac
ogg
opus
webm
mpeg
mpga

Make sure your file is in one of these formats before sending it for transcription.
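The format check above can be sketched as a small local helper. This is a minimal sketch, not part of the API: the names SUPPORTED_EXTENSIONS and has_supported_extension are hypothetical, and the check only looks at the file name, not at the audio encoding inside the file.

```python
from pathlib import Path

# Extensions copied from the supported formats list above.
SUPPORTED_EXTENSIONS = {
    "mp3", "mp4", "aac", "m4a", "wav", "flac",
    "ogg", "opus", "webm", "mpeg", "mpga",
}


def has_supported_extension(path: str) -> bool:
    """Return True if the file name ends with one of the listed extensions.

    Hypothetical helper: it inspects the name only and does not verify
    the actual audio encoding.
    """
    extension = Path(path).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS


print(has_supported_extension("my_audio.mp3"))
print(has_supported_extension("notes.txt"))
```

A file with a supported extension can still be rejected by the service if its content does not match, so treat this as a quick sanity check before uploading.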

How to?

The Audio transcription endpoint offers a wide range of transcription options for audio files. Learn how to use them with the following examples:

Enable authorization

First, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

If you do not have an access token yet, follow the instructions in the AI Endpoints – Getting Started guide.
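In Python scripts, it can help to fail early when the variable is missing rather than sending a request without a token. The following is a minimal sketch; get_access_token is a hypothetical helper name, not part of any library.

```python
import os


def get_access_token(env_var: str = "OVH_AI_ENDPOINTS_ACCESS_TOKEN") -> str:
    """Read the access token from the environment, or raise if it is unset.

    Hypothetical helper: only the variable name comes from this page.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set; export it before calling the API.")
    return token
```

The returned value can then be passed as api_key to the OpenAI client or placed in the Authorization header.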

With the Python OpenAI library

The whisper-large-v3-turbo API is compatible with the OpenAI specification.

First install the openai library:

pip install openai

Next, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

Finally, run the following Python code:

import os
from openai import OpenAI

url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/"

audio_file_path = "my_audio.mp3"

# The client authenticates with the access token exported in the previous step
client = OpenAI(
    base_url=url,
    api_key=os.getenv("OVH_AI_ENDPOINTS_ACCESS_TOKEN")
)

# Open the audio in binary mode; the with block closes it after the request
with open(audio_file_path, "rb") as audio_file:
    transcript_answer = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=audio_file,
        temperature=0.0
    )

print(transcript_answer)

Which returns the following:

Transcription(
    text=" It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.", 
    logprobs=None, 
    task='transcribe', 
    success=True, 
    language='en', 
    duration=60.936, 
    words=[], 
    segments=[
        {
            'id': 1, 
            'seek': 0, 
            'start': 0.0, 
            'end': 8.32, 
            'text': ' It is very exciting that the Olympic Games will be held in Paris in 2024 for France.', 
            'tokens': [50365, 467, 307, 588, 4670, 300, 264, 19169, 12761, 486, 312, 5167, 294, 8380, 294, 45237, 337, 6190, 13, 50793], 
            'temperature': 0.0, 
            'avg_logprob': -0.10994318, 
            'compression_ratio': 1.4497042, 
            'no_speech_prob': 0.0
        }, 
        ...,
        {
            'id': 8, 
            'seek': 5592, 
            'start': 56.48, 
            'end': 60.16, 
            'text': ' cheer, cheer on their home country.', 
            'tokens': [50365, 12581, 11, 12581, 322, 641, 1280, 1941, 13, 50584], 
            'temperature': 0.0, 
            'avg_logprob': -0.56960225, 
            'compression_ratio': 0.875, 
            'no_speech_prob': 0.00919342
        }
    ], 
    diarization=[], 
    usage={
        'type': 'duration', 
        'duration': 61.0
    }
)
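The segments in this response can be turned into timestamped lines, for example as a first step toward subtitles. The sketch below assumes each segment is a plain dict with start, end, and text keys, as printed above; format_timestamp and segments_to_lines are hypothetical helper names, and the sample data is copied from the response with the other fields omitted.

```python
def format_timestamp(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_lines(segments):
    """Build one '[start -> end] text' line per segment."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return lines


# Two segments copied from the sample response (other fields omitted).
sample_segments = [
    {"id": 1, "start": 0.0, "end": 8.32,
     "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France."},
    {"id": 8, "start": 56.48, "end": 60.16,
     "text": " cheer, cheer on their home country."},
]

for line in segments_to_lines(sample_segments):
    print(line)
```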

With curl

To process an audio file with curl, run the following command in your terminal:

curl -X POST "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions" \
  -H "accept: application/json" \
  -H "Authorization: Bearer $OVH_AI_ENDPOINTS_ACCESS_TOKEN" \
  -F "model=whisper-large-v3-turbo" \
  -F "file=@my_audio.mp3" \
  -F "temperature=0.0" \
  -H 'Content-Type: multipart/form-data'

Which returns the following:

{
  "task": "transcribe",
  "success": true,
  "language": "en",
  "duration": 60.936,
  "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.",
  "words": [],
  "segments": [
    {
      "id": 1,
      "seek": 0,
      "start": 0,
      "end": 8.48,
      "text": " It is very exciting that the Olympic Games will be held in Paris in 2024 for France.",
      "tokens": [50365,467,307,588,4670,300,264,19169,12761,486,312,5167,294,8380,294,45237,337,6190,13,50865],
      "temperature": 0,
      "avg_logprob": -0.31054688,
      "compression_ratio": 0.9767442,
      "no_speech_prob": 0
    },
    ...,
    {
      "id": 5,
      "seek": 3484,
      "start": 46.12,
      "end": 60.24,
      "text": " It is a chance for everyone to get involved and become interested about different types of sports and to cheer on their home country.",
      "tokens": [50885,467,307,257,2931,337,1518,281,483,3288,293,1813,3102,466,819,3467,295,6573,293,281,12581,322,641,1280,1941,13,51640],
      "temperature": 0,
      "avg_logprob": -0.11174939,
      "compression_ratio": 1.4268292,
      "no_speech_prob": 0
    }
  ],
  "diarization": [],
  "usage": {
    "type": "duration",
    "duration": 61
  }
}
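The usage block reports a duration in seconds, which can be combined with the per-second audio price listed on this page for a rough cost estimate. This is a sketch under an assumption the page does not confirm: that the billed duration equals usage.duration. The name estimate_cost_eur is hypothetical.

```python
from decimal import Decimal

# Audio price listed on this page, in euros per second.
PRICE_PER_SECOND_EUR = Decimal("0.00001278")


def estimate_cost_eur(response: dict) -> Decimal:
    """Estimate the cost of one request from its usage block.

    Assumption: billing is based on usage["duration"] (in seconds).
    """
    usage = response["usage"]
    if usage["type"] != "duration":
        raise ValueError(f"Unexpected usage type: {usage['type']}")
    return PRICE_PER_SECOND_EUR * Decimal(str(usage["duration"]))


# Usage block copied from the sample response above.
sample_response = {"usage": {"type": "duration", "duration": 61}}
print(estimate_cost_eur(sample_response))  # 0.00077958
```

Decimal is used instead of float so that the very small per-second price is multiplied without binary rounding error.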

With a simple Python HTTP client (requests)

First install the requests library:

pip install requests

Next, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:

export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>

Finally, run the following Python code:

import os
import requests

url = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions"

audio_file_path = "my_audio.mp3"

headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",
}

# Form fields sent alongside the audio file
data = {
    "model": "whisper-large-v3-turbo",
    "temperature": 0.0,
    "timestamp_granularities": "segment, word",
    "response_format": "verbose_json"
}

# Open the audio in binary mode; the with block closes it after the upload
with open(audio_file_path, "rb") as audio_file:
    files = {"file": audio_file}
    response = requests.post(url, headers=headers, files=files, data=data)

# Handle response
if response.status_code == 200:
    print(response.json())
else:
    print("Error:", response.status_code, response.text)

Which returns the following result, containing the audio transcription API response:

{
    'task': 'transcribe', 
    'success': True, 
    'language': 'en', 
    'duration': 60.936, 
    'text': " It is very exciting that the Olympic Games will be held in Paris in 2024 for France [...] and to cheer on their home country.", 
    'words': [
        {'word': 'It', 'start': 0.0, 'end': 0.7}, 
        {'word': 'is', 'start': 0.7, 'end': 0.92}, 
        {'word': 'very', 'start': 0.92, 'end': 1.18}, 
        ...,
        {'word': 'their', 'start': 58.66, 'end': 59.24}, 
        {'word': 'home', 'start': 59.24, 'end': 59.62}, 
        {'word': 'country.', 'start': 59.62, 'end': 60.16}
    ], 
    'segments': [
        {
            'id': 1, 
            'seek': 0, 
            'start': 0.0, 
            'end': 8.32, 
            'text': ' It is very exciting that the Olympic Games will be held in Paris in 2024 for France.', 
            'tokens': [50365, 467, 307, 588, 4670, 300, 264, 19169, 12761, 486, 312, 5167, 294, 8380, 294, 45237, 337, 6190, 13, 50793], 
            'temperature': 0.0, 
            'avg_logprob': -0.10994318, 
            'compression_ratio': 1.4497042, 
            'no_speech_prob': 0.0
        }, 
        ...,
        {
            'id': 8, 
            'seek': 5592, 
            'start': 56.48, 
            'end': 60.16, 
            'text': ' cheer, cheer on their home country.', 
            'tokens': [50365, 12581, 11, 12581, 322, 641, 1280, 1941, 13, 50584], 
            'temperature': 0.0, 
            'avg_logprob': -0.56960225, 
            'compression_ratio': 0.875, 
            'no_speech_prob': 0.00919342
        }
    ], 
    'diarization': [], 
    'usage': {
        'type': 'duration', 
        'duration': 61.0
    }
}
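Word-level timestamps make it possible to look up what was said in a given time window. The following is a minimal sketch, assuming the words list has the shape shown above; words_between is a hypothetical helper, and the sample data contains only the six words visible in the response.

```python
def words_between(words, start, end):
    """Return the words whose time span lies entirely within [start, end]."""
    return [w["word"] for w in words if w["start"] >= start and w["end"] <= end]


# The six words visible in the sample response (the elided ones are omitted).
sample_words = [
    {"word": "It", "start": 0.0, "end": 0.7},
    {"word": "is", "start": 0.7, "end": 0.92},
    {"word": "very", "start": 0.92, "end": 1.18},
    {"word": "their", "start": 58.66, "end": 59.24},
    {"word": "home", "start": 59.24, "end": 59.62},
    {"word": "country.", "start": 59.62, "end": 60.16},
]

print(" ".join(words_between(sample_words, 0.0, 1.0)))    # It is
print(" ".join(words_between(sample_words, 58.0, 61.0)))  # their home country.
```

A word that straddles the window boundary, such as "very" in the first call, is excluded; loosen the comparison if partial overlaps should count.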

Model rate limit

When using AI Endpoints, the following rate limits apply:

Anonymous: 2 requests per minute, per IP and per model.
Authenticated with an API access key: 400 requests per minute, per Public Cloud project and per model.

On top of the request rate limitations, AI Endpoints also applies the following limitations to the audio data:

Anonymous: 10MB file size or 60s audio duration per request.
Authenticated with an API access key: 2048MB file size or 10800s audio duration per request.
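The file size limits above can be checked locally before uploading. This sketch assumes that 1 MB means 1024 * 1024 bytes, which this page does not specify; the helper names are hypothetical. The duration limit is not checked here, since that would require decoding the audio.

```python
import os

# Limits from the list above; 1 MB = 1024 * 1024 bytes is an assumption.
MAX_BYTES_ANONYMOUS = 10 * 1024 * 1024
MAX_BYTES_AUTHENTICATED = 2048 * 1024 * 1024


def size_within_limit(size_bytes: int, authenticated: bool = True) -> bool:
    """Return True if a file of size_bytes fits the documented size limit."""
    limit = MAX_BYTES_AUTHENTICATED if authenticated else MAX_BYTES_ANONYMOUS
    return size_bytes <= limit


def file_within_limit(path: str, authenticated: bool = True) -> bool:
    """Check the size of the file at path against the documented limit."""
    return size_within_limit(os.path.getsize(path), authenticated)
```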

If you exceed these limits, different HTTP error codes will be returned depending on the type of limitation exceeded:

429 (Too Many Requests): When rate limits are exceeded
413 (Payload Too Large): When audio file size or duration exceeds the allowed limits

Note that exceeding either the rate limit or audio size/duration limits will result in an appropriate error response to help you understand and address the issue.
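One common way to handle a 429 response is to retry with exponential backoff. The sketch below is an assumption about client-side handling, not a documented AI Endpoints recommendation: call_with_retry is a hypothetical helper, send is any zero-argument callable returning an object with a status_code attribute, and the delays are arbitrary. A 413 response is returned immediately, since retrying the same file would not help.

```python
import time

RETRYABLE_STATUS = 429  # Too Many Requests


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    The wait doubles after each 429: base_delay, 2 * base_delay, and so on.
    The last response is returned even if it is still a 429.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != RETRYABLE_STATUS:
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return response
```

With requests, send could be a small function that reopens the audio file and posts it, so that each attempt uploads the file from the beginning.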

If you require higher usage, please get in touch with us to discuss increasing your rate limits.

Going Further

Want to explore the full capabilities of the ASR API? Dive into our dedicated Speech to Text guide to discover advanced parameters (such as diarization, timestamp granularities, chunking strategy, and prompts) and ready-to-run code examples in Python, cURL, and JavaScript.

For a broader overview of AI Endpoints, explore the full AI Endpoints Documentation.

Reach out to our support team or join the OVHcloud Discord #ai-endpoints channel to share your questions, feedback, and suggestions for improving the service with the team and the community.

For more information about the Whisper model features, please refer to the Hugging Face documentation:

Whisper-large-v3-turbo