Chapter 16. Cloud Speech

This chapter covers


  • An overview of speech recognition

  • How the Cloud Speech API works

  • How Cloud Speech pricing is calculated

  • An example of generating automated captions from audio content

By speech recognition, we mean taking a stream of audio and turning it into text.


Language is a particularly tricky human construct.


What we hear is often influenced by what we see, a phenomenon known as the McGurk effect.


There is a difference between hearing and listening.


Hearing means taking sounds and turning them into words.


Listening means taking those sounds and combining them with your context and understanding.


You should treat the results from a given audio file as helpful suggestions that are usually right but not guaranteed.


For example, you wouldn't want to rely solely on a machine-learning algorithm to produce court transcripts.


It may, however, help stenographers improve their efficiency by giving them the output as a baseline to correct.





16.1. Simple speech recognition

The Cloud Speech API produces textual content as output but requires a more complex input: an audio stream.


To get a transcript, you send the audio file (for example, a .wav file) to the Cloud Speech API for processing.


You also have to tell the Cloud Speech API the format of the audio.


In particular, the API needs to know the sample rate of the file, which tells the audio processor how much clock time is covered by each data point (for example, at a 16,000 Hz sample rate, each data point covers 1/16,000th of a second of audio).


Finally, the API needs to know the language spoken in the audio.
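Putting those pieces together, the configuration you'll build in the listings below ends up looking roughly like this. Treat it as a sketch: the encoding and sample rate must match your actual file, and the languageCode field is an assumption here (the library defaults to US English if you leave it out).

const config = {
  encoding: 'FLAC',       // the audio format of the file you're sending
  sampleRate: 16000,      // samples per second in the recording
  languageCode: 'en-US'   // assumed optional field; defaults to US English
};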



Before making any requests, you need to enable the Cloud Speech API in the Cloud Console.




Once the API is enabled, you'll install the client library:

npm install @google-cloud/speech@0.8.0


const speech = require('@google-cloud/speech')({
  projectId: 'your-project-id',
  keyFilename: 'key.json'
});

// A publicly readable sample audio clip stored in Google Cloud Storage.
const audioFilePath = 'gs://cloud-samples-tests/speech/brooklyn.flac';
const config = {
  encoding: 'FLAC',
  sampleRate: 16000
};

speech.recognize(audioFilePath, config).then((response) => {
  const result = response[0];
  console.log('This audio file says: "' + result + '"');
});



You should see some interesting output:


> This audio file says: "how old is the Brooklyn Bridge"


You may also notice how long the recognition took. The Cloud Speech API needs to “listen” to the entire audio file, so the time the recognition process takes is directly correlated to the length of the audio.
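If you want to see this for yourself, you can time the call with Node's built-in console timers; this is just a quick measurement sketch wrapped around the same recognize call as before:

console.time('recognition');  // start a timer labeled "recognition"
speech.recognize(audioFilePath, config).then((response) => {
  console.timeEnd('recognition');  // prints something like "recognition: 1234ms"
  console.log('This audio file says: "' + response[0] + '"');
});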


Longer audio files (for example, anything much more than a few seconds) shouldn't be processed this way.


Notice also that there's no concept of confidence in this result.


To get a confidence score along with the transcript, use the verbose flag.


const speech = require('@google-cloud/speech')({
  projectId: 'your-project-id',
  keyFilename: 'key.json'
});

const audioFilePath = 'gs://cloud-samples-tests/speech/brooklyn.flac';
const config = {
  encoding: 'FLAC',
  sampleRate: 16000,
  verbose: true
};

speech.recognize(audioFilePath, config).then((response) => {
  const result = response[0][0];
  console.log('This audio file says: "' + result.transcript + '"',
              '(with ' + Math.round(result.confidence) + '% confidence)');
});



You'll see output that looks something like the following:


> This audio file says: "how old is the Brooklyn Bridge" (with 98% confidence)


16.2. Continuous speech recognition


Sometimes you can't take an entire audio file and send it as one chunk to the API for recognition.


You might have a large audio file that's too big to treat as one big blob, in which case you need to break it up into smaller chunks.


Or you might be trying to recognize streams that are live, which keep going until you decide to turn them off.


For these situations, the Cloud Speech API allows asynchronous recognition: it can accept chunks of data, recognize them along the way, and return a final result after the audio stream is completed.



const speech = require('@google-cloud/speech')({
  projectId: 'your-project-id',
  keyFilename: 'key.json'
});

const audioFilePath = 'gs://cloud-samples-tests/speech/brooklyn.flac';
const config = {
  encoding: 'FLAC',
  sampleRate: 16000,
  verbose: true
};

// startRecognition returns a long-running operation that emits an event
// once the entire audio stream has been processed.
speech.startRecognition(audioFilePath, config).then((result) => {
  const operation = result[0];
  operation.on('complete', (results) => {
    console.log('This audio file says: "' + results[0].transcript + '"',
                '(with ' + Math.round(results[0].confidence) + '% confidence)');
  });
});


You should see the exact same result as before, shown in the next listing.


This audio file says: "how old is the Brooklyn Bridge" (with 98% confidence).
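For audio that's truly live (for example, microphone input), this version of the client library also exposes a streaming interface. The following is only a rough sketch: the createRecognizeStream method and the exact request shape are assumptions based on the 0.8.0-era library, and audio.raw is a placeholder for a raw 16 kHz LINEAR16 recording.

const fs = require('fs');

// Assumed streaming API for @google-cloud/speech@0.8.0; verify against your version.
const recognizeStream = speech.createRecognizeStream({
  config: {
    encoding: 'LINEAR16',  // raw 16-bit PCM samples
    sampleRate: 16000
  },
  interimResults: false    // only emit final results
});

recognizeStream.on('data', (data) => {
  console.log('Recognition result:', data);
});
recognizeStream.on('error', (err) => {
  console.error('Recognition error:', err);
});

// Pipe any readable audio source into the stream (here, a placeholder file).
fs.createReadStream('audio.raw').pipe(recognizeStream);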


16.3. Hinting with custom words and phrases

It's important to recognize that new words will be invented all the time.


Sometimes the Cloud Speech API might not be “in the know” about all the cool new

words or slang phrases, and may end up guessing wrong.


We invent new, interesting names for companies (for example, Google was a misspelling

of “Googol”).


To help with this, you're able to pass along suggestions of valid phrases that are added to the API's ranking system for each request.



const config = {
  encoding: 'FLAC',
  sampleRate: 16000,
  verbose: true,
  // Hint that this unusual spelling is a phrase the audio might contain.
  speechContext: { phrases: [
    "the Brooklynne Bridge"
  ]}
};



If you run the recognition again with this config, the Speech API does indeed use the alternate spelling provided:


> This audio file says: "how old is the brooklynne bridge" (with 90% confidence)


16.5. Case study: Generating automated captions from audio content

The idea in this case study is to suggest tags based on what's being said in the video.


You'll also use the Cloud Natural Language API to recognize any entities being discussed.


The plan is to pull out the audio portion of the video, figure out what's being said, and come back with suggested tags. The flow looks like the following:





1. First, the user records and uploads a video.


2. You separate the audio track from the video track.


3. You send the audio content to the Cloud Speech API for recognition.


4. The Speech API returns a transcript as a response.


5. You then send all of the text (caption and video transcript) to the Cloud Natural Language API.


6. The Cloud NL API recognizes entities and detects sentiment in the text.


7. Finally, you send the suggested tags back to the user.


Start by writing a function that takes a video buffer as input and returns a JavaScript promise for the transcript of the video, shown in the next listing. Call this function getTranscript.


const Q = require('q');
const speech = require('@google-cloud/speech')({
  projectId: 'your-project-id',
  keyFilename: 'key.json'
});

const getTranscript = (videoBuffer) => {
  const deferred = Q.defer();

  // extractAudio isn't defined here; it's assumed to resolve with the extracted
  // audio buffer and its configuration (a sketch of it appears below).
  extractAudio(videoBuffer).then(([audioBuffer, audioConfig]) => {
    const config = {
      encoding: audioConfig.encoding, // for example, 'FLAC'
      sampleRate: audioConfig.sampleRate, // for example, 16000
      verbose: true
    };
    return speech.startRecognition(audioBuffer, config);
  }).then((result) => {
    const operation = result[0];
    operation.on('complete', (results) => {
      const result = results[0];
      // Only trust the transcript if the API is reasonably confident.
      const transcript = result.confidence > 50 ? result.transcript : null;
      deferred.resolve(transcript);
    });

    operation.on('error', (err) => {
      deferred.reject(err);
    });
  }).catch((err) => {
    deferred.reject(err);
  });

  return deferred.promise;
};


This function grabs the audio from the video and recognizes it as text.
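The chapter never shows extractAudio itself. A minimal sketch of what it might look like follows, assuming the fluent-ffmpeg package (with the ffmpeg binary installed) and temporary files for the intermediate data; all of the file names and settings here are placeholders:

const fs = require('fs');
const os = require('os');
const path = require('path');
const Q = require('q');
const ffmpeg = require('fluent-ffmpeg'); // assumed audio-extraction dependency

const extractAudio = (videoBuffer) => {
  const deferred = Q.defer();
  const videoPath = path.join(os.tmpdir(), 'upload-video.tmp');
  const audioPath = path.join(os.tmpdir(), 'upload-audio.flac');
  fs.writeFileSync(videoPath, videoBuffer);

  ffmpeg(videoPath)
    .noVideo()              // drop the video track
    .audioChannels(1)       // Cloud Speech expects mono audio
    .audioFrequency(16000)  // resample so the sampleRate below is accurate
    .format('flac')
    .on('end', () => {
      const audioConfig = { encoding: 'FLAC', sampleRate: 16000 };
      deferred.resolve([fs.readFileSync(audioPath), audioConfig]);
    })
    .on('error', (err) => deferred.reject(err))
    .save(audioPath);

  return deferred.promise;
};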


Next, write a function that will take any given content and return a JavaScript promise for the sentiment and entities of that content.


const Q = require('q');
const language = require('@google-cloud/language')({
  projectId: 'your-project-id',
  keyFilename: 'key.json'
});

const getSentimentAndEntities = (content) => {
  const document = language.document(content);
  const config = {entities: true, sentiment: true, verbose: true};
  return document.annotate(config).then((data) => {
    return Q(data[0]);
    // { sentiment: {...}, entities: [...] }
  });
};
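Before wiring this into the pipeline, you can sanity-check it on a plain string. The exact shape of the annotation object depends on the client library version, so treat the logged fields below as illustrative:

getSentimentAndEntities('I love walking across the Brooklyn Bridge')
  .then((annotation) => {
    // The annotation bundles the sentiment and the entities found in the text.
    console.log('Sentiment:', annotation.sentiment);
    console.log('Entities:', annotation.entities);
  })
  .catch((err) => console.error('Annotation failed:', err));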



You have all the tools you need to put your code together.


Finally, build the handler function that accepts a video with its buffer and caption properties.


const Q = require('q');
const authConfig = {
  projectId: 'your-project-id',
  keyFilename: 'key.json'
};
const language = require('@google-cloud/language')(authConfig);
const speech = require('@google-cloud/speech')(authConfig);

const handleVideo = (video) => {
  Q.allSettled([
    // Annotate both the transcript of the video's audio and the user's caption.
    getTranscript(video.buffer).then((transcript) => {
      return getSentimentAndEntities(transcript);
    }),
    getSentimentAndEntities(video.caption)
  ]).then((results) => {
    let suggestedTags = [];
    results.forEach((result) => {
      if (result.state === 'fulfilled') {
        const sentiment = result.value.sentiment;
        const entities = result.value.entities;
        // getSuggestedTags turns the annotations into tag suggestions
        // (a sketch of it appears below).
        const tags = getSuggestedTags(sentiment, entities);
        suggestedTags = suggestedTags.concat(tags);
      }
    });
    console.log('The suggested tags are', suggestedTags);
    console.log('The suggested caption is',
                '"' + video.caption + ' ' + suggestedTags.join(' ') + '"');
  });
};
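Like extractAudio, getSuggestedTags is left for you to define. A minimal sketch is shown here; the property names (entity.name, sentiment.score) and the negativity threshold are assumptions that will vary with the Natural Language client library version you use:

const getSuggestedTags = (sentiment, entities) => {
  // Skip tag suggestions entirely for strongly negative content (assumed threshold).
  if (sentiment && sentiment.score < -0.25) {
    return [];
  }
  return (entities || [])
    .map((entity) => '#' + entity.name.toLowerCase().replace(/\s+/g, ''))
    .filter((tag, index, tags) => tags.indexOf(tag) === index); // drop duplicates
};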




Summary

  • Speech recognition takes a stream of audio and converts it into text, which can be deceptively complicated due to things like the McGurk effect.

  • Cloud Speech is a hosted API that can perform speech recognition on audio files or streams.


