
Text-to-Speech (TTS)

Technical documentation for dryAPI APIs, integration guides, and operational references.

Endpoint for requesting text2audio inference

Text-to-Speech converts text into natural-sounding audio. The endpoint supports three TTS modes via the mode parameter:

  • custom_voice (default) — Use a preset voice from the model's voice library. Requires the voice parameter.
  • voice_clone — Clone a voice from a short reference audio clip. Requires the ref_audio parameter (3–10 seconds, max 10 MB). Optionally provide ref_text with a transcript of the reference audio for improved accuracy.
  • voice_design — Create a new voice from a natural language description. Requires the instruct parameter (e.g. "A warm female voice with a British accent").

NOTE

Prerequisite: Before sending a request, consult the Model Selection endpoint to find a valid model slug, check that model's limits and features, and confirm which languages and voices it supports.

WARNING

Mode-specific required fields:

  • custom_voice — voice is required.
  • voice_clone — ref_audio is required. ref_text is optional but recommended.
  • voice_design — instruct is required.

If mode is omitted, the API defaults to custom_voice.
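The mode rules above can be checked on the client before a request is sent. The following is a minimal sketch, assuming the form fields are held in a plain dict; the helper name `missing_mode_field` is hypothetical and not part of any dryAPI SDK.

```python
# Required field per TTS mode, as listed in the warning above.
REQUIRED_BY_MODE = {
    "custom_voice": "voice",
    "voice_clone": "ref_audio",
    "voice_design": "instruct",
}


def missing_mode_field(fields):
    """Return the name of the missing mode-specific field, or None.

    `fields` is a dict of form fields. Hypothetical helper for illustration.
    """
    # The API defaults to custom_voice when mode is omitted.
    mode = fields.get("mode") or "custom_voice"
    if mode not in REQUIRED_BY_MODE:
        raise ValueError(f"unknown mode: {mode!r}")
    required = REQUIRED_BY_MODE[mode]
    return None if fields.get(required) else required


print(missing_mode_field({"text": "Hello", "mode": "voice_design"}))  # instruct
```

A check like this only mirrors the documented rules; the server remains the authority on what a given model accepts.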


Request Txt2 Audio

POST
/api/v1/client/txt2audio

Authorization

bearerAuth
Authorization: Bearer <token>

In: header

Header Parameters

Accept (string, required)
Default: "application/json"
Value in: "application/json"

Request Body

multipart/form-data

Audio generation parameters. Supports three TTS modes: custom_voice (default, preset speakers), voice_clone (clone from reference audio), voice_design (create voice from description).
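Because the body is multipart/form-data rather than JSON, a client without a multipart helper has to assemble the parts itself. The following is a rough sketch using only the Python standard library; `encode_multipart` is a hypothetical helper, and the field values are copied from the curl sample on this page.

```python
import uuid


def encode_multipart(fields, files=None):
    """Encode text fields and optional files as multipart/form-data.

    fields: dict of name -> value (converted with str())
    files:  dict of name -> (filename, data_bytes, content_type)
    Returns (content_type_header_value, body_bytes).
    """
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}".encode())
        lines.append(f'Content-Disposition: form-data; name="{name}"'.encode())
        lines.append(b"")
        lines.append(str(value).encode("utf-8"))
    for name, (filename, data, content_type) in (files or {}).items():
        lines.append(f"--{boundary}".encode())
        lines.append(
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"'.encode()
        )
        lines.append(f"Content-Type: {content_type}".encode())
        lines.append(b"")
        lines.append(data)
    # Closing boundary, followed by a final CRLF.
    lines.append(f"--{boundary}--".encode())
    lines.append(b"")
    return f"multipart/form-data; boundary={boundary}", b"\r\n".join(lines)


content_type, body = encode_multipart({
    "text": "A beautiful sunset over mountains",
    "model": "Kokoro",
    "lang": "en-us",
    "speed": 1,
    "format": "flac",
    "sample_rate": 24000,
})
```

The returned `content_type` would go in the request's Content-Type header. For voice_clone mode, the reference clip would be passed through `files`, for example `{"ref_audio": ("ref.wav", data, "audio/wav")}`.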


Response Body

application/json


curl -X POST "https://api.dryapi.dev/api/v1/client/txt2audio" \
  -H "Authorization: Bearer <token>" \
  -H "Accept: application/json" \
  -F text="A beautiful sunset over mountains" \
  -F model="Kokoro" \
  -F lang="en-us" \
  -F speed="1" \
  -F format="flac" \
  -F sample_rate="24000"
{
  "data": {
    "request_id": "c08a339c-73e5-4d67-a4d5-231302fbff9a"
  }
}
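The success body above carries only a request ID, not the audio itself. The following is a small sketch of pulling that ID out of the response text; `extract_request_id` is a hypothetical helper, and how the ID is later exchanged for the generated audio is not covered on this page.

```python
import json


def extract_request_id(response_text):
    """Return data.request_id from a success body, or raise ValueError."""
    payload = json.loads(response_text)
    data = payload.get("data")
    if not isinstance(data, dict) or "request_id" not in data:
        raise ValueError("response has no data.request_id")
    return data["request_id"]


sample = '{"data": {"request_id": "c08a339c-73e5-4d67-a4d5-231302fbff9a"}}'
print(extract_request_id(sample))  # c08a339c-73e5-4d67-a4d5-231302fbff9a
```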
{
  "data": {},
  "message": "string",
  "errors": [
    null
  ],
  "statusCode": 0
}
{
  "data": {},
  "message": "string",
  "errors": [
    null
  ],
  "statusCode": 0
}
{
  "message": "The selected model does not support Text To Image.",
  "errors": {
    "model": [
      "The selected model does not support Text To Image."
    ]
  }
}
{
  "message": "Too Many Attempts."
}
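The error samples above come in more than one shape: `errors` may be a list, an object keyed by field name, or absent. The following is a sketch of normalizing them into one form. `summarize_error` and the `_general` key are hypothetical names chosen for illustration, and the HTTP status codes are left out because this page does not list them.

```python
import json


def summarize_error(response_text):
    """Return (message, field_errors) from an error body.

    field_errors maps a field name to a list of messages. Errors that
    arrive as a plain list are stored under the made-up key "_general".
    """
    payload = json.loads(response_text)
    message = payload.get("message", "")
    errors = payload.get("errors")
    field_errors = {}
    if isinstance(errors, dict):
        for field, messages in errors.items():
            if isinstance(messages, str):
                messages = [messages]
            field_errors[field] = [str(m) for m in messages if m is not None]
    elif isinstance(errors, list):
        general = [str(m) for m in errors if m is not None]
        if general:
            field_errors["_general"] = general
    return message, field_errors


validation_body = json.dumps({
    "message": "The selected model does not support Text To Image.",
    "errors": {"model": ["The selected model does not support Text To Image."]},
})
message, field_errors = summarize_error(validation_body)
```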
Last updated on 21 March 2026


text (string, required)

Text to be converted to speech

model (string, required)

The model to use for speech generation. Available models can be retrieved via the GET /api/v1/client/models endpoint.
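Looking up a model slug means one GET request before the first TTS call. The following sketch only builds that request and does not send it. It assumes the models endpoint lives on the same host as the curl sample and accepts the same bearer token; neither assumption is confirmed on this page, and the response shape is not documented here.

```python
import urllib.request

# Host taken from the curl sample on this page.
BASE_URL = "https://api.dryapi.dev"


def build_models_request(token):
    """Build (but do not send) a GET request for the model list.

    Assumes bearer authentication applies to this endpoint as well.
    """
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/client/models",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
        method="GET",
    )


request = build_models_request("<token>")
```

Passing `request` to `urllib.request.urlopen` would perform the call.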

mode (optional)

TTS mode: custom_voice (default), voice_clone, or voice_design. Determines which fields are required.

Value in: "custom_voice" | "voice_clone" | "voice_design"

voice (optional)

Name of the voice to be used. Required for custom_voice mode.

lang (string, required)

Language to be used during audio generation

speed (number, required)

Speech speed of the generated audio

format (string, required)

Audio output format

sample_rate (number, required)

Sample rate of generated audio

ref_audio (optional)

Reference audio file for voice cloning. Supported formats: mp3, wav, flac, ogg, m4a. Max 10 MB. Duration must be between 3 and 10 seconds (model-specific limits may apply). Required for voice_clone mode.
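The documented limits on the reference clip can be checked locally before uploading. In the following sketch, `check_ref_audio` is a hypothetical helper. It treats "10 MB" as 10 × 1024 × 1024 bytes, which is an assumption, and it expects the caller to supply the clip duration, since measuring it depends on the audio format.

```python
import os

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg", ".m4a"}
# Assumption: the page says "10 MB" without defining it; 10 MiB is used here.
MAX_BYTES = 10 * 1024 * 1024


def check_ref_audio(filename, size_bytes, duration_seconds=None):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 10 MB")
    if duration_seconds is not None and not (3 <= duration_seconds <= 10):
        problems.append("duration outside 3-10 seconds")
    return problems


print(check_ref_audio("voice.wav", 500_000, duration_seconds=5.0))  # []
```

Model-specific limits may be stricter than these general ones, so a clip that passes this check can still be rejected.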

Format: binary

ref_text (optional)

Optional transcript of the reference audio for improved voice cloning accuracy.

instruct (optional)

Natural language voice description for voice_design mode (e.g. "A warm female voice with a British accent"), or style/emotion control in custom_voice mode.

webhook_url (optional)

Optional HTTPS URL to receive webhook notifications for job status changes (processing, completed, failed). Must be HTTPS. Max 2048 characters. See Webhook Documentation for payload structure and authentication details.

Format: uri
Length: <= 2048 characters
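The two webhook_url constraints, HTTPS only and at most 2048 characters, are simple to check before submitting. In the following sketch, `is_valid_webhook_url` is a hypothetical helper; it mirrors only the rules stated here and says nothing about whether the URL is reachable.

```python
from urllib.parse import urlparse


def is_valid_webhook_url(url):
    """Check the documented constraints: HTTPS scheme, a host, <= 2048 chars."""
    if len(url) > 2048:
        return False
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(parsed.netloc)


print(is_valid_webhook_url("https://example.com/hooks/tts"))  # True
print(is_valid_webhook_url("http://example.com/hooks/tts"))   # False
```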