Speech-to-Text API Loses the Coding Noise, Delivers Full Function

(Full disclosure: Revai engaged me to prepare a competitive internal review to see how their API stacked up against many other [much larger] vendors.  The results surprised me as a coder, so much so that I asked for, and was granted, their kind permission to publish this for my readers.)

So you need to convert some speech to text and you don't have time to try coding against every API out there.  Where should you start?  Well, let's start with the impressions I got when I surveyed the major API publishers in speech-to-text (AWS, gcloud, IBM, Microsoft).

Broad Market Overview
The first impression one gets when examining speech-to-text APIs on the big clouds is that the big vendors want to lock you in to their cloud/platform in order to use their speech-to-text API.  AWS is particularly 'bad' in this sense, as endpoints and API keys that actually work are very hard to find.  Microsoft's endpoints and API key were easy enough to find, but they only allow 10 seconds of audio via the API (you have to use their platform tools for longer sound files).  If your project is already committed to a given cloud platform, by all means check out that platform's API first (but a few are poorly documented, so you may have trouble anyway).


Fast Coding/Fast Deployment
What about solutions that are not, and are unlikely ever to be, hosted on one of the big clouds?  Say, a WordPress site hosted on DigitalOcean or another host, where you are not going to ask the client to move their hosting?  What's fast to code, and fast to deploy reliably?

The first thing I usually look for in API documentation is a cURL example.  With that, a quick study will reveal how I'll authenticate a request, and whether the actual request goes in a header or in a GET or POST body.  Revai has a Docs link on their opening page and a cURL example halfway down the resulting page.  Only IBM among the big vendors rendered up a cURL example quickly.

Next, the API key.  Of course, for that, you have to enroll for the service, but Revai's is trivially easy to find after a fast signup.  Getting the IBM Watson and Microsoft API keys took a 'teeny' bit of digging, but no real problem.  After reading docs, surfing, reading more docs and testing the many API keys they offered, I never did find the 'right' API keys to test the AWS or gcloud APIs.

Following the API key, I want the endpoint web address to bang the API against - again, fairly easy to find with IBM and Microsoft, much harder with AWS and gcloud, who seemed to want you to give up on self-documenting API code and resort to in-environment CLI tools (or similar) to use their API.

Finally, the stuff you don't think about until you've got the API key and endpoints and you begin to code in earnest:  all the big services want you to characterize the sound file you are uploading in one or more ways, such as the file type (easy enough from the file name extension), the codec, and the bit rate of the file.

If I'm writing a WP plugin in PHP to convert files uploaded to my site that's on a small host, then I'm in coding hell now.  How do I code in PHP to see if the files just uploaded use the right codec??  How do I code in PHP a function that detects the sound file bit rate so I can put that as a separate element in the API request??  And I have to turn away files after they are uploaded because they were produced with a codec that my cloud API doesn't like??  Can I just copy someone's function or do I need a new PHP library?  Ouch!  Don't hurt me, dude!
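To give a feel for the bookkeeping involved, here is a minimal sketch of reading that kind of metadata yourself. It is written in Python rather than PHP, and it only handles the easiest case: uncompressed PCM WAV files, via the standard library's wave module. Compressed formats such as MP3 or AAC would need a third-party library, which is exactly the headache described above. The function names describe_wav and make_silent_wav are made up for this illustration.

```python
import io
import wave


def describe_wav(source):
    """Return basic format metadata for an uncompressed PCM WAV file.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()   # bytes per sample
        sample_rate = wav.getframerate()    # frames per second
        frames = wav.getnframes()
    return {
        "channels": channels,
        "bits_per_sample": sample_width * 8,
        "sample_rate_hz": sample_rate,
        # For PCM audio the bit rate follows directly from the header fields.
        "bit_rate_bps": sample_rate * channels * sample_width * 8,
        "duration_s": frames / sample_rate if sample_rate else 0.0,
    }


def make_silent_wav(seconds=1, sample_rate=16000):
    """Build a small mono 16-bit WAV in memory so the example needs no file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * sample_rate * seconds)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(describe_wav(make_silent_wav()))
```

And that is one container format out of many, before you have even written the request itself.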

With Revai, you don't have to worry about carefully examining a file in your code so you can characterize metadata separately in your request.  Revai is not curious about that stuff at all.  Just punch the web address of the file into your request and Revai does the rest - they have the file now, so they can run the code that examines the file and characterizes the metadata!
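A request of that shape can be sketched roughly as follows, in Python for brevity. Treat every specific here as an assumption for illustration rather than the vendor's documented API: the endpoint is a placeholder on example.com, the bearer-token header is a guess at the authentication style, and the "media_url" field name is hypothetical. Check the vendor's own cURL example for the real values. The sketch only builds the request; it does not send it.

```python
import json
import urllib.request

# Assumption: placeholder endpoint, not taken from any vendor's documentation.
JOBS_ENDPOINT = "https://api.example.com/speechtotext/v1/jobs"


def build_job_request(api_key, media_url, endpoint=JOBS_ENDPOINT):
    """Build (but do not send) a POST request that submits a file by URL."""
    # Assumption: the service accepts a JSON body with a "media_url" field.
    body = json.dumps({"media_url": media_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            # Assumption: bearer-token authentication.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_job_request(
        "YOUR_API_KEY", "https://example.com/audio/interview.mp3"
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually submit the job: urllib.request.urlopen(request)
```

Note what is absent: no codec, no bit rate, no file type. The only thing the caller describes is where the file lives.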

Video files, too
So one day while working with them, I remembered a past project and asked the guys, 'Hey, will this do video files too?'  Before I got an answer, I decided to just punch one over to see for myself.  Worked.  With timestamps.  (They do have an API request parameter that will turn off timestamps.)  But if you're a Netflix/Hulu/Prime Video producer who needs your show subtitled with timestamps as required by your network, Revai's simple API will get the job done for you quickly, simply and easily.
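Once you have timed text back, turning it into subtitles is mostly formatting. Here is a minimal Python sketch that renders timed segments as SubRip (SRT) text. The input shape, a list of (start seconds, end seconds, text) tuples, is an assumption made for this example and not the API's actual response format; you would first map the real transcript into something like it.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render (start_s, end_s, text) tuples as SubRip (SRT) subtitle text."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each cue: sequence number, time range, then the caption text.
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.25, "Hello"), (1.25, 2.0, "world")]))
```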

One other note: unlike many APIs, Revai does not need you to register in advance the web address or domain where you are going to be invoking their API.  So theoretically, you could write your code once, then deploy it on many domains (such as furnishing a WP plugin for use on many sites) and change only the API key from instance to instance.  Freedom!
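In practice that means the key is the only per-site setting. A minimal sketch of the idea in Python, assuming the key lives in an environment variable (the name STT_API_KEY is made up for this example; a WP plugin would more likely read it from a settings page):

```python
import os


def load_api_key(env=None, variable="STT_API_KEY"):
    """Return the API key configured for this deployment.

    `STT_API_KEY` is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(variable, "").strip()
    if not key:
        # Fail loudly at startup rather than on the first API call.
        raise RuntimeError(f"Set {variable} to this site's API key")
    return key
```

The rest of the code ships unchanged from site to site.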

I really like the Revai speech-to-text API as opposed to each one of its hard-to-use brethren.  Simple.  Well-documented.  Available quickly and easily anywhere on the net, and they don't have a hosting cloud, so they're certainly not going to try to lock you in to their cloud platform.

This is just a simple, straightforward API that does all the heavy lifting that most of the other speech-to-text services ask you to do yourself.
