whisper

Automatic Speech Recognition • OpenAI
@cf/openai/whisper

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.

    Usage

    Workers - TypeScript

    export interface Env {
      AI: Ai;
    }

    export default {
      async fetch(request, env): Promise<Response> {
        // Download a sample WAV file to transcribe.
        const res = await fetch(
          "https://github.com/Azure-Samples/cognitive-services-speech-sdk/raw/master/samples/cpp/windows/console/samples/enrollment_audio_katie.wav"
        );
        const blob = await res.arrayBuffer();

        // The model takes the audio as an array of 8-bit unsigned integers.
        const input = {
          audio: [...new Uint8Array(blob)],
        };

        const response = await env.AI.run(
          "@cf/openai/whisper",
          input
        );

        // Return the model output; the audio bytes are omitted from the echoed input.
        return Response.json({ input: { audio: [] }, response });
      },
    } satisfies ExportedHandler<Env>;

    curl

    curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/@cf/openai/whisper \
    -X POST \
    -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
    --data-binary "@talking-llama.mp3"
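
The same REST call can be assembled in TypeScript. The sketch below only mirrors the curl command: the endpoint, the POST method, and the Authorization header are taken from it, while `buildWhisperRequest`, `WhisperRestRequest`, and the parameter names are hypothetical. It returns the URL and fetch options instead of sending anything, so it can be inspected without network access or credentials.

```typescript
// Hypothetical helper: builds the URL and fetch options that mirror the
// curl command (POST, bearer token, raw audio bytes as the request body).
interface WhisperRestRequest {
  url: string;
  init: {
    method: "POST";
    headers: { Authorization: string };
    body: ArrayBuffer;
  };
}

function buildWhisperRequest(
  accountId: string,
  apiToken: string,
  audio: ArrayBuffer
): WhisperRestRequest {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/@cf/openai/whisper`,
    init: {
      method: "POST",
      headers: { Authorization: `Bearer ${apiToken}` },
      body: audio,
    },
  };
}

// Placeholder credentials and an empty buffer, for illustration only.
const demoRequest = buildWhisperRequest("ACCOUNT_ID", "API_TOKEN", new ArrayBuffer(0));
console.log(demoRequest.init.method, demoRequest.url);
```

With a real account ID, API token, and audio buffer, the result could be passed to `fetch(demoRequest.url, demoRequest.init)`.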

    Parameters

    Input

    One of the following:

    • 0 string (format: binary)

    • 1 object

      • audio array (required)

        An array of integers that represent the audio data constrained to 8-bit unsigned integer values

        • items number

          A value between 0 and 255

      • source_lang string

        The language of the recorded audio

      • target_lang string

        The language to translate the transcription into. Currently only English is supported.
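
As a minimal sketch of the object form of the input, the helper below turns raw bytes into the `audio` array described above. The field names `audio`, `source_lang`, and `target_lang` come from the parameter list; `toWhisperInput` and `WhisperObjectInput` are hypothetical names, and the `"en"` value assumes that languages are given as short codes, which this page does not specify.

```typescript
// Shape of the object input, following the parameter list above.
interface WhisperObjectInput {
  audio: number[];
  source_lang?: string;
  target_lang?: string;
}

// Hypothetical helper: converts raw bytes into the array-of-integers form.
function toWhisperInput(
  bytes: ArrayBuffer | Uint8Array,
  options: { source_lang?: string; target_lang?: string } = {}
): WhisperObjectInput {
  const view = bytes instanceof Uint8Array ? bytes : new Uint8Array(bytes);
  // Array.from on a Uint8Array yields one plain number in 0..255 per byte.
  return { audio: Array.from(view), ...options };
}

// "en" is an assumed language-code format, used only for illustration.
const exampleInput = toWhisperInput(new Uint8Array([0, 127, 255]), {
  source_lang: "en",
});
console.log(exampleInput);
```

Because a `Uint8Array` can only hold values from 0 to 255, building the array this way keeps every item inside the documented range.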

    Output

    • text string

      The transcription

    • word_count number

    • words array

      • items object

        • word string

        • start number

          The second this word begins in the recording

        • end number

          The ending second when the word completes

    • vtt string
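
The output fields above can be described with a TypeScript type, which makes the word timings easier to work with. This is a sketch under assumptions: the page does not say which output fields are always present, so everything except `text` is marked optional here, and `WhisperOutput`, `WhisperWord`, and `formatTimedWords` are hypothetical names. The sample object is hand-written for illustration and is not real model output.

```typescript
// Types following the output field list; optionality is an assumption.
interface WhisperWord {
  word: string;
  start: number; // second at which the word begins
  end: number; // second at which the word ends
}

interface WhisperOutput {
  text: string;
  word_count?: number;
  words?: WhisperWord[];
  vtt?: string;
}

// Hypothetical helper: renders each word with its start and end time.
function formatTimedWords(output: WhisperOutput): string[] {
  return (output.words ?? []).map(
    (w) => `[${w.start.toFixed(2)}s - ${w.end.toFixed(2)}s] ${w.word}`
  );
}

// Hand-written sample, not actual model output.
const sampleOutput: WhisperOutput = {
  text: "hello world",
  word_count: 2,
  words: [
    { word: "hello", start: 0, end: 0.5 },
    { word: "world", start: 0.5, end: 1 },
  ],
};

console.log(formatTimedWords(sampleOutput).join("\n"));
// [0.00s - 0.50s] hello
// [0.50s - 1.00s] world
```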

    API Schemas

    The following schemas are based on JSON Schema

    {
      "oneOf": [
        {
          "type": "string",
          "format": "binary"
        },
        {
          "type": "object",
          "properties": {
            "audio": {
              "type": "array",
              "description": "An array of integers that represent the audio data constrained to 8-bit unsigned integer values",
              "items": {
                "type": "number",
                "description": "A value between 0 and 255"
              }
            },
            "source_lang": {
              "type": "string",
              "description": "The language of the recorded audio"
            },
            "target_lang": {
              "type": "string",
              "description": "The language to translate the transcription into. Currently only English is supported."
            }
          },
          "required": [
            "audio"
          ]
        }
      ]
    }
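
A hand-written runtime check for the object variant of this schema might look like the sketch below. It is not a full JSON Schema validator, and `isWhisperObjectInput` is a hypothetical name. Note one deliberate difference: the schema only says `"type": "number"` for the items, whereas this check also enforces the integer 0 to 255 range given in the descriptions.

```typescript
interface WhisperObjectInput {
  audio: number[];
  source_lang?: string;
  target_lang?: string;
}

// Hypothetical type guard for the object variant of the input schema.
function isWhisperObjectInput(value: unknown): value is WhisperObjectInput {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const candidate = value as Record<string, unknown>;

  // "audio" is the only required property.
  const audio = candidate["audio"];
  if (!Array.isArray(audio)) return false;

  // Stricter than the schema's "type": "number": require integers in 0..255.
  const bytesOk = audio.every(
    (n) => typeof n === "number" && Number.isInteger(n) && n >= 0 && n <= 255
  );

  // The optional properties must be strings when present.
  const isOptionalString = (v: unknown): boolean =>
    v === undefined || typeof v === "string";

  return (
    bytesOk &&
    isOptionalString(candidate["source_lang"]) &&
    isOptionalString(candidate["target_lang"])
  );
}

console.log(isWhisperObjectInput({ audio: [0, 255] })); // true
console.log(isWhisperObjectInput({ audio: [256] })); // false
console.log(isWhisperObjectInput({ source_lang: "en" })); // false
```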