Audio to Text — Transcribe Audio & Video Files in Your Br...

An audio-to-text transcriber turns speech in audio or video files into written text. By default, this one runs OpenAI’s open-source Whisper model entirely in your browser: no upload, no account, no per-minute pricing. It also bundles a subtitle converter that turns existing SRT/VTT caption files into clean text.

What it does — and deliberately doesn’t

Does: transcribe files you have — voice memos, meeting recordings, podcasts, lectures, interviews, your own videos. Outputs plain text, timestamped text, SRT, or WebVTT.
Does: convert subtitle files (SRT ↔ VTT ↔ clean text) you exported from YouTube Studio, a podcast host, or an editor.
Doesn’t: fetch transcripts from a YouTube / X / Facebook URL. Scraping those platforms breaks their terms of service, and browsers block the cross-origin requests anyway because the platforms send no CORS headers. Services that offer it proxy your request through their servers, which is the opposite of this site’s no-upload promise.

On-device vs Cloud AI

The tool offers two engines, chosen with the toggle above the controls:

On-device (default): Whisper runs in your browser. Nothing is uploaded; your audio never leaves the device. The trade-offs are a one-time model download, speed that depends on your hardware, and a model ceiling (the largest option is “Small”). This is the private default and what the rest of this page describes.
Cloud AI (opt-in): your audio is uploaded to Cloudflare Workers AI (whisper-large-v3-turbo), transcribed on the edge, and the text is returned. There is no model download, and it is faster and more accurate than the on-device models, which is handy on a phone or an old laptop. It’s free and needs no signup, but it does send your file to a server, so it’s clearly labeled with a notice. The audio is used to answer the request and is not stored by this tool (Cloudflare, like any host, may log standard request metadata). Capped at 25 MB per file; use On-device for longer recordings or anything confidential.

This is the rare opt-in exception to the site’s private-by-default rule — see the privacy page. Everything below applies to the On-device engine.

Choosing a model

Model	Download	Speed	Quality
Tiny	~55 MB	Fastest — real time or better on CPU	Good for clear speech, stumbles on accents and noise
Base	~85 MB	Fast on GPU, near real time on CPU	Solid default for podcasts and meetings
Small	~250 MB	Needs WebGPU to feel quick	Close to commercial services on clean audio

The download happens once per model and is cached by your browser. Switching models later only downloads the new one.

GPU vs CPU

The tool tries WebGPU first (Chrome and Edge support it broadly; Safari and Firefox are rolling it out). On WebGPU, transcription typically runs several times faster than real time. Without it, the tool falls back to multithreaded WebAssembly on the CPU — slower but works everywhere. The result is identical either way; the meta line above the transcript tells you which path ran.
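
The "try WebGPU, fall back to WebAssembly" decision can be sketched roughly as follows. This is an illustration, not the tool's actual source: `pickBackend`, `Backend`, and `GpuLike` are hypothetical names, and the only real browser API assumed is `navigator.gpu.requestAdapter()`, which resolves to `null` when no suitable GPU adapter exists.

```typescript
// Hypothetical names for illustration; not taken from the tool's code.
type Backend = "webgpu" | "wasm";

// Minimal structural type so the sketch compiles without WebGPU typings.
// In a browser, `navigator.gpu` has this shape when WebGPU is available.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}

// Returns "webgpu" only when an adapter can actually be obtained.
// A missing `navigator.gpu`, a null adapter, or a thrown error all
// lead to the CPU (WebAssembly) path.
async function pickBackend(gpu: GpuLike | undefined): Promise<Backend> {
  if (!gpu) {
    return "wasm";
  }
  try {
    const adapter = await gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch (err) {
    return "wasm";
  }
}

// Browser usage (assumption): const backend = await pickBackend(navigator.gpu);
```

Checking for an adapter rather than only for the presence of `navigator.gpu` matters because a browser can expose the API on a machine whose GPU is unsupported or blocklisted.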

Output formats

Plain text — paragraphs of clean prose, duplicate caption lines removed. Best for notes, summaries, search.
Text + timestamps — one [mm:ss] line per segment. Best for show notes and quote hunting.
SRT — numbered cues with 00:00:01,000 --> 00:00:04,000 timing. Drop it into any video editor or player.
WebVTT — the web-native caption format for HTML5 <track> elements.

The same transcript renders into all four — switch the selector after transcribing, no re-run needed.
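
Switching formats needs no re-run because all four outputs can be derived from one list of timed segments. The sketch below shows one way that could work; the `Segment` shape and the function names are assumptions made for illustration, and the plain-text branch simply joins segments with spaces rather than building paragraphs.

```typescript
// Hypothetical segment shape; the tool's internal types are not documented here.
interface Segment {
  start: number; // seconds
  end: number; // seconds
  text: string;
}

type OutputFormat = "text" | "timestamped" | "srt" | "vtt";

function zeroPad(value: number, width: number): string {
  let out = String(value);
  while (out.length < width) out = "0" + out;
  return out;
}

// HH:MM:SS<sep>mmm. SRT separates milliseconds with a comma, WebVTT with a period.
function formatCueTime(seconds: number, sep: "," | "."): string {
  const totalMs = Math.round(seconds * 1000);
  const totalSec = Math.floor(totalMs / 1000);
  const h = Math.floor(totalSec / 3600);
  const m = Math.floor(totalSec / 60) % 60;
  const s = totalSec % 60;
  return `${zeroPad(h, 2)}:${zeroPad(m, 2)}:${zeroPad(s, 2)}${sep}${zeroPad(totalMs % 1000, 3)}`;
}

// mm:ss for the "Text + timestamps" view; minutes keep counting past 59.
function formatShortTime(seconds: number): string {
  const totalSec = Math.floor(seconds);
  return `${zeroPad(Math.floor(totalSec / 60), 2)}:${zeroPad(totalSec % 60, 2)}`;
}

function renderCues(segments: Segment[], sep: "," | ".", numbered: boolean): string {
  return segments
    .map((seg, i) => {
      const timing = `${formatCueTime(seg.start, sep)} --> ${formatCueTime(seg.end, sep)}`;
      const head = numbered ? `${i + 1}\n${timing}` : timing;
      return `${head}\n${seg.text.trim()}`;
    })
    .join("\n\n");
}

function renderTranscript(segments: Segment[], format: OutputFormat): string {
  switch (format) {
    case "text":
      return segments.map((seg) => seg.text.trim()).join(" ");
    case "timestamped":
      return segments
        .map((seg) => `[${formatShortTime(seg.start)}] ${seg.text.trim()}`)
        .join("\n");
    case "srt":
      return renderCues(segments, ",", true) + "\n";
    case "vtt":
      // A WebVTT file starts with the "WEBVTT" signature line.
      return "WEBVTT\n\n" + renderCues(segments, ".", false) + "\n";
  }
}
```

With this structure, the format selector only has to call `renderTranscript` again on the stored segments; the model never runs a second time.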

The subtitle converter tab

Already have a caption file? The second tab parses SRT and WebVTT (including auto-generated YouTube captions with their rolling duplicate lines), strips formatting tags, de-duplicates, and outputs any of the four formats above. It’s instant and pure text processing — nothing downloads, nothing uploads.
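
The parse, strip, and de-duplicate steps described above can be approximated in a few lines. This is a simplified sketch under stated assumptions, not the converter's real code: `stripCueTags` and `captionsToText` are hypothetical names, a cue is any block containing a `-->` timing line, and duplicates are only removed when a line repeats the one immediately before it (the pattern rolling auto-captions produce).

```typescript
// Remove inline markup such as <i>, <c.yellow>, or <00:00:01.000> from a cue line.
function stripCueTags(line: string): string {
  return line.replace(/<[^>]*>/g, "").trim();
}

// Convert SRT or WebVTT source into one run of clean text.
function captionsToText(source: string): string {
  const kept: string[] = [];
  // Normalize line endings, then split into blank-line-separated blocks.
  const blocks = source.replace(/\r\n?/g, "\n").split(/\n{2,}/);

  for (const block of blocks) {
    const lines = block.split("\n");

    // Find the timing line; blocks without one (the WEBVTT header,
    // NOTE or STYLE blocks) are skipped entirely.
    let timingIndex = -1;
    for (let i = 0; i < lines.length; i++) {
      if (lines[i].indexOf("-->") !== -1) {
        timingIndex = i;
        break;
      }
    }
    if (timingIndex === -1) continue;

    // Everything after the timing line is cue text.
    for (const raw of lines.slice(timingIndex + 1)) {
      const line = stripCueTags(raw);
      if (line === "") continue;
      // Rolling captions repeat the previous line in the next cue.
      if (kept.length > 0 && kept[kept.length - 1] === line) continue;
      kept.push(line);
    }
  }
  return kept.join(" ");
}
```

Because SRT cue numbers sit above the timing line and WebVTT cue text sits below it, the same loop handles both formats without a separate parser for each.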

Privacy

Static page → Preact island → Web Worker running Whisper via transformers.js. Network traffic consists of the page assets, the ONNX runtime, and the one-time model weights download from the Hugging Face CDN. Your audio is decoded with the browser’s built-in decodeAudioData, handed to the worker in memory, and never serialized to any request. Verify it: DevTools → Network → transcribe a file → filter by your filename — nothing.

How it compares

	bytefork.tools	otter.ai	restream / veed
Runs in browser	✓	✗ (uploads)	✗ (uploads)
Free limit	unlimited	300 min/mo	trial watermarks
Signup	not required	required	required
SRT / VTT export	✓	paid tier	paid tier
Works offline after model cache	✓	✗	✗

Related tools

Video Compressor — convert MKV/AVI to MP4 so the audio track decodes.
Video Splitter & Joiner — split multi-hour recordings before transcribing.
Image to Text (OCR) — the same local-only idea, for text in images.

Frequently asked questions

Is my audio uploaded to a server?

Not in the default On-device mode. There, the Whisper model runs inside your browser tab — on the GPU via WebGPU when available, otherwise on the CPU via WebAssembly — and the only network traffic is the one-time model download (~55–250 MB of neural network weights from the Hugging Face CDN), which contains no user data. Open DevTools → Network while transcribing on-device: zero requests carry your audio. There is also an optional Cloud AI mode you can switch to, which does upload your file (see the next question); it is clearly labeled and off by default.

What is the difference between On-device and Cloud AI modes?

On-device (the default) runs Whisper entirely in your browser: fully private, with nothing uploaded. The costs are a one-time model download, speed that depends on your device, and a ceiling at the Small model. Cloud AI is opt-in: it uploads your audio to Cloudflare Workers AI (whisper-large-v3-turbo) for a faster, more accurate transcription with no model download, which is useful on phones or weak laptops. The audio is processed to produce the text and is not stored by this tool; like any web host, Cloudflare may log standard request metadata. Cloud AI is free, needs no signup, and is capped at 25 MB per file (use On-device for longer recordings). For anything confidential, stay on On-device.

Can it transcribe a YouTube / X / Facebook video from a URL?

Deliberately not. Fetching media or captions from those platforms is prohibited by their terms of service (YouTube's API policies forbid scraping, and the caption API only covers videos you own), and browsers block the requests anyway (no CORS headers). Tools that do it run server-side proxies — which also means your URL history goes to their server. This tool only accepts files you already have on your device.

How do I get a transcript of my own YouTube video then?

Two legitimate paths: 1) YouTube Studio → Subtitles → ⋮ → Download lets creators export their caption file — paste it into the subtitle converter tab for clean text. 2) Download your own video file (yours to keep via YouTube Studio or Google Takeout), then drop it into the transcribe tab.

How accurate is the transcription?

Whisper Small approaches commercial-service quality on clear speech; Base is a good balance; Tiny is fastest but makes noticeably more mistakes, especially on accented speech and noisy recordings. All models handle punctuation and casing. Expect errors on heavy background noise, crosstalk, and domain jargon — proofread before publishing.

Which audio and video formats work?

Anything your browser can decode: MP3, WAV, M4A/AAC, OGG, FLAC (most browsers), and the audio track of MP4, WebM, and MOV videos. MKV and AVI usually fail — convert them to MP4 with the Video Compressor tool first.

How long does it take?

With WebGPU (Chrome/Edge on a machine with a GPU), Tiny and Base transcribe several times faster than real time — a 10-minute recording typically takes 1–3 minutes. On the CPU/WASM fallback, expect roughly real time for Tiny and slower for Base/Small. The first run adds the one-time model download.

Which languages are supported?

The multilingual Whisper models cover ~100 languages; the language menu surfaces the 15 most requested plus auto-detect. Auto-detect works well when the first 30 seconds contain clear speech in one language. Picking the language explicitly is faster and more accurate.

Is there a file length or size limit?

No hard limit, but practical ones: the decoded audio lives in memory (~230 MB per hour of audio), and CPU transcription of very long files takes a while. Hour-long recordings work on a typical laptop; multi-hour files are better split with the Video Splitter tool first.
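
The ~230 MB per hour figure is consistent with simple arithmetic. Whisper models take 16 kHz mono audio; the sketch below additionally assumes the decoded samples are held as 32-bit floats (4 bytes each), which is an assumption about this tool's internals rather than something stated on this page. The function name is hypothetical.

```typescript
const WHISPER_SAMPLE_RATE_HZ = 16000; // Whisper's expected input sample rate
const BYTES_PER_FLOAT32_SAMPLE = 4; // assumption: samples stored as Float32

// Bytes needed to hold `durationSeconds` of decoded mono audio in memory.
function decodedAudioBytes(durationSeconds: number): number {
  return durationSeconds * WHISPER_SAMPLE_RATE_HZ * BYTES_PER_FLOAT32_SAMPLE;
}

const bytesPerHour = decodedAudioBytes(3600); // 3600 * 16000 * 4 = 230,400,000
const megabytesPerHour = bytesPerHour / 1000000; // 230.4
```

By the same estimate, a three-hour recording would need roughly 690 MB for the decoded samples alone, which is why splitting multi-hour files first is the safer route.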

What's the subtitle converter tab for?

It converts existing SRT or WebVTT subtitle files into clean plain text (de-duplicated, tags stripped), timestamped text, or between SRT and VTT. Useful for caption files you exported from YouTube Studio, a podcast host, or a video editor.

Why does the model download come from huggingface.co?

The Whisper ONNX weights are hosted on the Hugging Face model hub, the standard distribution channel for open-source models. It is the same idea as the Tesseract models our OCR tool pulls from jsDelivr. The download is cached by your browser, so it happens once per model. It is a one-way download of public model weights; nothing about you or your audio is sent.