This NVIDIA TTS model generates natural-sounding Italian speech from raw text, without requiring any additional information.
The Text-to-Speech (TTS) API endpoint allows you to synthesize speech from raw text.
Text-to-Speech (TTS) is a subfield of Artificial Intelligence (AI) that converts written text into spoken words. This TTS API operates as a two-stage pipeline: a first model generates a mel spectrogram from the text, and a second model converts that mel spectrogram into speech.
AI Endpoints makes this easy with ready-to-use inference APIs. Discover how to use them below.
These TTS models were developed by NVIDIA. The TTS AI Endpoint takes text as input and returns an audio stream or audio buffer, along with optional additional metadata.
The TTS endpoint offers you a wide range of synthesis options. Learn how to use them with the following example:
First install the requests library:
pip install requests
Next, export your access token to the OVH_AI_ENDPOINTS_ACCESS_TOKEN environment variable:
export OVH_AI_ENDPOINTS_ACCESS_TOKEN=<your-access-token>
If you do not have an access token yet, follow the instructions in the AI Endpoints – Getting Started guide.
Finally, run the following Python code:
import os

import requests

url = "https://nvr-tts-it-it.endpoints.kepler.ai.cloud.ovh.net/api/v1/tts/text_to_audio"

headers = {
    "accept": "application/octet-stream",
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",
}

data = {
    "encoding": 1,
    "language_code": "it-IT",
    "sample_rate_hz": 16000,
    "text": "We provide a set of managed tools designed for building your Machine Learning projects: AI Notebooks, AI Training, AI Deploy and AI Endpoints.",
    "voice_name": "Italian-IT-Female-1"
}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    # Save the audio content to a file
    with open("output_audio.wav", "wb") as audio_file:
        audio_file.write(response.content)
    print("Audio file saved as output_audio.wav")
else:
    print("Error:", response.status_code, response.text)
This returns the following result:
Audio file saved as output_audio.wav
You are now able to play and use your generated audio file.
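To quickly sanity-check the result without an audio player, you can inspect the saved file with Python's standard `wave` module. The helper below is a minimal sketch (`wav_info` is a hypothetical name, not part of the API):

```python
import wave


def wav_info(path):
    """Return (channels, sample_rate_hz, duration_s) for a WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration_s = wav.getnframes() / rate
    return channels, rate, duration_s
```

For example, `wav_info("output_audio.wav")` should report the sample rate requested in the call above.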
Install the Riva client and audio libraries:
pip install nvidia-riva-client numpy
This basic example returns the speech audio generated by the model:
import os

import numpy as np
import IPython.display as ipd
import riva.client

# Connect to the Riva TTS server, authenticating with your access token
tts_service = riva.client.SpeechSynthesisService(
    riva.client.Auth(
        uri="nvr-tts-it-it.endpoints-grpc.kepler.ai.cloud.ovh.net:443",
        use_ssl=True,
        metadata_args=[["authorization", f"bearer {os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}"]],
    )
)

# Set up the synthesis configuration
sample_rate_hz = 44100
req = {
    "language_code": "it-IT",  # choose the corresponding language in the list: en-US / es-ES / de-DE / it-IT
    "encoding": riva.client.AudioEncoding.LINEAR_PCM,
    "sample_rate_hz": sample_rate_hz,  # sample rate: 44.1 kHz audio
    "voice_name": "Italian-IT-Female-1"  # voices: `English-US.Female-1`, `English-US.Male-1`,
    # `English-US.Female-Calm`, `English-US.Female-Neutral`,
    # `English-US.Female-Happy`, `English-US.Female-Angry`,
    # `English-US.Female-Fearful`, `English-US.Female-Sad`,
    # `English-US.Male-Calm`, `English-US.Male-Neutral`,
    # `English-US.Male-Happy`, `English-US.Male-Angry`,
    # `Spanish-ES-Female-1`, `Spanish-ES-Male-1`,
    # `German-DE-Male-1`, `Italian-IT-Female-1`,
    # `Italian-IT-Male-1`, `Mandarin-CN.Female-1`,
    # `Mandarin-CN.Male-1`, `Mandarin-CN.Female-Calm`,
    # `Mandarin-CN.Female-Neutral`, `Mandarin-CN.Male-Happy`,
    # `Mandarin-CN.Male-Fearful`, `Mandarin-CN.Male-Sad`,
    # `Mandarin-CN.Male-Calm`, `Mandarin-CN.Male-Neutral`,
    # `Mandarin-CN.Male-Angry`
}

# Input text
req["text"] = "We provide a set of managed tools designed for building your Machine Learning projects: AI Notebooks, AI Training, AI Deploy and AI Endpoints."

# Request speech synthesis and decode the raw 16-bit PCM samples
response = tts_service.synthesize(**req)
audio_samples = np.frombuffer(response.audio, dtype=np.int16)

# Play the output audio (in a Jupyter notebook)
ipd.Audio(audio_samples, rate=sample_rate_hz)
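Outside a notebook, `ipd.Audio` is not available; the raw LINEAR_PCM samples can instead be written to a WAV file with Python's standard `wave` module. This is a sketch assuming 16-bit mono samples, as returned above (`save_pcm_to_wav` is a hypothetical helper name):

```python
import wave

import numpy as np


def save_pcm_to_wav(samples: np.ndarray, sample_rate_hz: int, path: str) -> None:
    """Write 16-bit mono PCM samples (an int16 NumPy array) to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate_hz)
        wav.writeframes(samples.astype(np.int16).tobytes())
```

For example, `save_pcm_to_wav(audio_samples, sample_rate_hz, "output_audio.wav")` produces a file playable by any standard audio player.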
When using AI Endpoints, rate limits apply. If you exceed these limits, a 429 error code will be returned.
If you require higher usage, please get in touch with us to discuss increasing your rate limits.
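A common way to handle 429 responses is to retry with exponential backoff. The sketch below wraps a POST call in a hypothetical helper (`post_with_backoff` is not part of the API; the `session` parameter is only there to make the helper easy to test):

```python
import time

import requests


def post_with_backoff(url, *, headers=None, json=None,
                      max_retries=5, base_delay=1.0, session=None):
    """POST a request, retrying with exponential backoff on HTTP 429."""
    http = session or requests
    for attempt in range(max_retries):
        response = http.post(url, headers=headers, json=json)
        if response.status_code != 429:
            return response
        # Wait 1s, 2s, 4s, ... before retrying
        time.sleep(base_delay * (2 ** attempt))
    return response
```

You could then replace the earlier `requests.post(url, headers=headers, json=data)` call with `post_with_backoff(url, headers=headers, json=data)`.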
For more information about the TTS model features, please refer to the NVIDIA Riva TTS documentation.
For a broader overview of AI Endpoints, explore the full AI Endpoints Documentation.
Reach out to our support team or join the OVHcloud Discord #ai-endpoints channel to share your questions, feedback, and suggestions for improving the service with the team and the community.
New to AI Endpoints? This guide walks you through everything you need to get an access token, call AI models, and integrate AI APIs into your apps with ease.
Explore what AI Endpoints can do. This guide breaks down current features, future roadmap items, and the platform's core capabilities so you know exactly what to expect.
Running into issues? This guide helps you solve common problems on AI Endpoints, from error codes to unexpected responses. Get quick answers, clear fixes, and helpful tips to keep your projects running smoothly.
Learn how to use OVHcloud AI Endpoints Virtual Models.
Explore the full potential of the ASR API through advanced parameters (diarization, timestamp granularities, chunking strategies, prompts ...) and ready-to-run code examples in Python, cURL, and JavaScript.
Turn hours of audio into sharp, readable summaries! This guide walks you through building an AI-powered audio assistant using ASR and LLMs, perfect for meetings, podcasts, or any voice recordings.
Build a voice-enabled assistant in under 100 lines of code! Learn how to combine ASR, LLM, and TTS endpoints to create an AI that listens, understands, and responds.
Discover how to synthesize realistic dialogues and automatically identify who’s speaking. This notebook shows you how to generate speech and label each speaker, perfect for building smart transcripts.
Bring your synthetic voice to life! Discover how to use Text-To-Speech with emotion mixing: adjusting pitch, tone, pace, and style to express emotions like joy, sadness, or surprise.