Generative AI for video recognition and voice narration
Pyry-Samuli Lahti
Quick blog post to share a fun project I’ve been working on for the past few days. I found the very cool AI Voice Generator and Text to Speech API by ElevenLabs, and played with them to generate a sportscaster narration for a video.
- First, I used their Instant Voice Cloning feature to generate a Finnish sportscaster voice model from ten short audio samples of Finnish sports broadcasts.
- Then, I used their Text to Speech feature to generate a voice narration for a short video clip from my daughter’s soccer practice.
- Finally, I used Apple iMovie to merge the video and the voice narration, combined with some soccer stadium background sound.
Lo and behold, let there be sportscasting
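For the curious, the Text to Speech step above boils down to a single HTTP call. Below is a minimal TypeScript sketch of that call, not the exact code used for the clip. The endpoint and the xi-api-key header follow the public ElevenLabs API; the helper names buildTtsRequest and narrate are made up for illustration, and the model_id value is an assumption (a multilingual model seems like the natural choice for Finnish). The voice ID would be the one ElevenLabs assigns to the cloned voice.

```typescript
// Hypothetical helper type: describes the HTTP request without sending it.
interface TtsRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build a Text to Speech request for a given (cloned) voice.
function buildTtsRequest(apiKey: string, voiceId: string, text: string): TtsRequest {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
    init: {
      method: "POST",
      headers: {
        "xi-api-key": apiKey,
        "Content-Type": "application/json",
        Accept: "audio/mpeg",
      },
      body: JSON.stringify({
        text,
        // Assumption: a multilingual model, since the narration is in Finnish.
        model_id: "eleven_multilingual_v2",
      }),
    },
  };
}

// Send the request and return the generated audio (MP3 bytes).
async function narrate(apiKey: string, voiceId: string, text: string): Promise<ArrayBuffer> {
  const { url, init } = buildTtsRequest(apiKey, voiceId, text);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`ElevenLabs request failed with status ${response.status}`);
  }
  return response.arrayBuffer();
}
```

The returned bytes can be saved as an MP3 file and dropped into a video editor next to the clip. Keeping the request builder separate from the network call makes it easy to inspect or tweak the payload without spending API credits.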
Real-time video narration
I also wrote a quick proof-of-concept web app that does the same thing in real time:
- Capture video from the webcam
- Extract the frames from the video, and feed them to the OpenAI Chat Completions API using the new gpt-4-vision-preview model
- Use the ElevenLabs Text to Speech Streaming API to generate the voice narration for the video
- Utilize the MediaSource API to stream the narration audio to the browser
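The steps above can be sketched in TypeScript roughly as follows. This is a simplified illustration under a few assumptions, not the code from the repository: the function names (captureFrame, buildVisionRequest, describeFrames, playNarration, once), the prompt texts, the frame size, max_tokens and the ElevenLabs model_id are all invented for the example. The model name is the one mentioned above and may have been retired since. Also note that not every browser supports audio/mpeg in MediaSource.

```typescript
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface VisionRequestBody {
  model: string;
  max_tokens: number;
  messages: { role: "system" | "user"; content: string | ContentPart[] }[];
}

// Browser-only: draw the current webcam frame onto a canvas, return a JPEG data URL.
function captureFrame(video: HTMLVideoElement, width = 512): string {
  if (video.videoWidth === 0) throw new Error("Video has no frames yet");
  const canvas = document.createElement("canvas");
  canvas.width = width;
  canvas.height = Math.round((video.videoHeight / video.videoWidth) * width);
  const ctx = canvas.getContext("2d");
  if (!ctx) throw new Error("2D canvas context is not available");
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  return canvas.toDataURL("image/jpeg", 0.7);
}

// Pure function: build the Chat Completions payload from frame data URLs.
function buildVisionRequest(frames: string[]): VisionRequestBody {
  return {
    model: "gpt-4-vision-preview",
    max_tokens: 150, // assumption: keep the narration short
    messages: [
      { role: "system", content: "You are an excited sportscaster. Narrate briefly." },
      {
        role: "user",
        content: [
          { type: "text", text: "These are consecutive frames from a live video." },
          ...frames.map((url): ContentPart => ({ type: "image_url", image_url: { url } })),
        ],
      },
    ],
  };
}

// Ask the vision model for a narration of the given frames.
async function describeFrames(openAiKey: string, frames: string[]): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: { Authorization: `Bearer ${openAiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify(buildVisionRequest(frames)),
  });
  if (!response.ok) throw new Error(`OpenAI request failed with status ${response.status}`);
  const data = (await response.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content;
}

// Resolve once the given event fires on the target.
function once(target: EventTarget, event: string): Promise<Event> {
  return new Promise((resolve) => target.addEventListener(event, resolve, { once: true }));
}

// Browser-only: stream ElevenLabs audio into an <audio> element through MediaSource.
async function playNarration(
  elevenLabsKey: string, voiceId: string, text: string, audio: HTMLAudioElement,
): Promise<void> {
  const mediaSource = new MediaSource();
  audio.src = URL.createObjectURL(mediaSource);
  await once(mediaSource, "sourceopen");
  const sourceBuffer = mediaSource.addSourceBuffer("audio/mpeg");
  const url = `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}/stream`;
  const response = await fetch(url, {
    method: "POST",
    headers: { "xi-api-key": elevenLabsKey, "Content-Type": "application/json" },
    body: JSON.stringify({ text, model_id: "eleven_multilingual_v2" }), // model is an assumption
  });
  if (!response.ok || !response.body) throw new Error(`ElevenLabs request failed: ${response.status}`);
  // Autoplay may be blocked until the user has interacted with the page.
  audio.play().catch((err) => console.warn("Playback blocked:", err));
  const reader = response.body.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    const updated = once(sourceBuffer, "updateend");
    sourceBuffer.appendBuffer(value as BufferSource);
    await updated; // only one append may be in flight at a time
  }
  mediaSource.endOfStream();
}
```

A loop that calls captureFrame every second or so, passes the latest few frames to describeFrames, and hands the resulting text to playNarration would tie the pieces together. Appending audio chunks as they arrive is what lets playback start before the whole narration has been generated.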
Source code
You can view the source code at github.com/Pyppe/live-sportscaster, and even try it out yourself at pyppe.github.io/live-sportscaster if you have OpenAI and ElevenLabs API keys at hand. But don’t expect too much; it’s a quick hack. 😅