Arrange your transcripts into paragraphs with Amazon Transcribe


Amazon Transcribe is a speech recognition service that generates transcripts from video and audio files in multiple supported languages and accents. It comes with a rich set of features, including automatic language identification, multi-channel and multi-speaker support, custom vocabularies, and transcript redaction.

Amazon Transcribe supports two modes of operation: batch and streaming. In batch mode, a transcription job is created to process files residing in an Amazon Simple Storage Service (Amazon S3) bucket; in streaming mode, the audio source is integrated in real time with Amazon Transcribe through HTTP/2 calls or WebSockets.


In this post, we explore how to automatically arrange the transcript generated in batch mode into paragraphs, making it easier to read.

Transcription output

Amazon Transcribe uses a JSON representation for its output. It provides the transcription result in two different formats: text format and itemized format.

Text format provides the whole transcript as a single block of text, whereas itemized format provides the transcript as a time-ordered list of transcribed items, along with additional metadata per item. Both formats exist in parallel in the output file.

Depending on the features selected during transcription job creation, Amazon Transcribe creates additional and enriched views of the transcription result. See the following example code:

{
  "jobName": "2x-speakers_2x-channels",
  "accountId": "************",
  "results": {
    "transcripts": [
      { "transcript": "Hi, welcome." }
    ],
    "speaker_labels": [
      { "channel_label": "ch_0", "speakers": 2, "segments": [ ] },
      { "channel_label": "ch_1", "speakers": 2, "segments": [ ] }
    ],
    "channel_labels": {
      "channels": [ ],
      "number_of_channels": 2
    },
    "items": [ ],
    "segments": [ ]
  },
  "status": "COMPLETED"
}

The views are as follows:

  • Transcripts – Represented by the transcripts element, it contains only the text format of the transcript. In multi-speaker, multi-channel scenarios, the concatenation of all transcripts is provided as a single block.
  • Speakers – Represented by the speaker_labels element, it contains both the text and itemized formats of the transcript, grouped by speaker. It’s available only when the multi-speakers feature is enabled.
  • Channels – Represented by the channel_labels element, it contains both the text and itemized formats of the transcript, grouped by channel. It’s available only when the multi-channels feature is enabled.
  • Items – Represented by the items element, it contains only the itemized format of the transcript. In multi-speaker, multi-channel scenarios, items are enriched with additional properties, indicating speaker and channel.
  • Segments – Represented by the segments element, it contains both the text and itemized formats of the transcript, grouped by alternative transcription. It’s available only when the alternative results feature is enabled.
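
As a rough illustration of how these views might be inspected, the following Python sketch checks which elements are present under results and joins the text-format transcripts. The helper names available_views and full_text are hypothetical, and the sample is a trimmed copy of the example output above rather than a real job result.

```python
import json

# Trimmed sample mirroring the example output; a real file would be read with json.load().
sample = json.loads("""
{"jobName": "2x-speakers_2x-channels",
 "results": {"transcripts": [{"transcript": "Hi, welcome."}],
             "speaker_labels": [],
             "channel_labels": {"channels": [], "number_of_channels": 2},
             "items": []},
 "status": "COMPLETED"}
""")

# Element names taken from the list of views above.
VIEW_KEYS = ("transcripts", "speaker_labels", "channel_labels", "items", "segments")


def available_views(output):
    """Return the names of the views present under the 'results' element."""
    results = output.get("results", {})
    return [key for key in VIEW_KEYS if key in results]


def full_text(output):
    """Join the text-format transcripts into one string."""
    transcripts = output.get("results", {}).get("transcripts", [])
    return " ".join(t["transcript"] for t in transcripts)


print(available_views(sample))  # ['transcripts', 'speaker_labels', 'channel_labels', 'items']
print(full_text(sample))        # Hi, welcome.
```

The segments view is absent from this sample, which is what you would expect for a job created without the alternative results feature.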

Transcription metadata in the items view

In the items view, items are provided as a time-ordered list, with each item containing additional metadata:

{
  "results": {
    "items": [
      {
        "channel_label": "ch_0",
        "start_time": "1.509",
        "speaker_label": "spk_0",
        "end_time": "2.21",
        "alternatives": [
          { "confidence": "0.999", "content": "Hi" }
        ],
        "type": "pronunciation"
      },
      {
        "channel_label": "ch_0",
        "speaker_label": "spk_0",
        "alternatives": [
          { "confidence": "0.0", "content": "," }
        ],
        "type": "punctuation"
      },
      {
        "channel_label": "ch_0",
        "start_time": "2.22",
        "speaker_label": "spk_0",
        "end_time": "2.9",
        "alternatives": [
          { "confidence": "0.999", "content": "welcome" }
        ],
        "type": "pronunciation"
      },
      {
        "channel_label": "ch_0",
        "speaker_label": "spk_0",
        "alternatives": [
          { "confidence": "0.0", "content": "." }
        ],
        "type": "punctuation"
      }
    ]
  }
}

The metadata is as follows:

  • Type – The type value indicates whether the item is a punctuation or a pronunciation. Examples of supported punctuation are comma, full stop, and question mark.
  • Alternatives – An array of objects containing the actual transcription, along with its confidence level, ordered by confidence level. When the alternative results feature is not enabled, this list always has exactly one item.
    • Confidence – An indication of how confident Amazon Transcribe is about the correctness of the transcription. It uses values from 0–1, with 1 indicating 100% confidence.
    • Content – The transcribed word.
  • Start time – A time pointer into the audio or video file indicating the start of the item in ss.SSS format.
  • End time – A time pointer into the audio or video file indicating the end of the item in ss.SSS format.
  • Channel label – The channel identifier, which is present in the item only when the channel identification feature was enabled in the job configuration.
  • Speaker label – The speaker identifier, which is present in the item only when the speaker partitioning feature was enabled in the job configuration.
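
To make the metadata easier to work with, one option is to flatten each item into a plain record. The sketch below is only an illustration: parse_item and its output keys are hypothetical names, and the sample data is copied from the first two items of the example above. It treats timing and label fields as optional, because the sample shows punctuation items without start and end times.

```python
# First two items of the example above.
items = [
    {"channel_label": "ch_0", "start_time": "1.509", "speaker_label": "spk_0",
     "end_time": "2.21",
     "alternatives": [{"confidence": "0.999", "content": "Hi"}],
     "type": "pronunciation"},
    {"channel_label": "ch_0", "speaker_label": "spk_0",
     "alternatives": [{"confidence": "0.0", "content": ","}],
     "type": "punctuation"},
]


def parse_item(item):
    """Flatten one entry of the items view into a plain dict."""
    best = item["alternatives"][0]  # alternatives are ordered by confidence
    return {
        "type": item["type"],
        "content": best["content"],
        "confidence": float(best["confidence"]),
        # Times arrive as strings; they are missing on the punctuation items shown above.
        "start": float(item["start_time"]) if "start_time" in item else None,
        "end": float(item["end_time"]) if "end_time" in item else None,
        # Labels exist only when the respective feature was enabled for the job.
        "channel": item.get("channel_label"),
        "speaker": item.get("speaker_label"),
    }


parsed = [parse_item(i) for i in items]
print(parsed[0]["content"], parsed[0]["start"], parsed[0]["end"])  # Hi 1.509 2.21
print(parsed[1]["content"], parsed[1]["start"])                    # , None
```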

Identifying paragraphs

Identification of paragraphs relies on metadata in the items view. In particular, we use start and end time information, along with transcription type and content, to identify sentences and then decide which sentences are the best candidates for paragraph entry points.

A sentence is considered to be a list of transcription items that exists between punctuation items that indicate a full stop. Exceptions to this are the start and end of the transcript, which are sentence boundaries by default. The following figure shows an example of these items.


Sentence identification is straightforward with Amazon Transcribe because punctuation is an out-of-the-box feature, covering the punctuation types comma, full stop, and question mark. In this concept, we use a full stop as the sentence boundary.
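
The sentence rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: split_sentences, word, and punct are hypothetical helpers, and the sample items are invented to match the shape of the items view.

```python
def word(content, start, end):
    """Build a pronunciation item shaped like the items view (sample data only)."""
    return {"type": "pronunciation", "start_time": str(start), "end_time": str(end),
            "alternatives": [{"confidence": "0.99", "content": content}]}


def punct(content):
    """Build a punctuation item shaped like the items view (sample data only)."""
    return {"type": "punctuation",
            "alternatives": [{"confidence": "0.0", "content": content}]}


def split_sentences(items):
    """Group items into sentences; a full stop closes the current sentence."""
    sentences, current = [], []
    for item in items:
        current.append(item)
        content = item["alternatives"][0]["content"]
        if item["type"] == "punctuation" and content == ".":
            sentences.append(current)
            current = []
    if current:  # the end of the transcript is a sentence boundary by default
        sentences.append(current)
    return sentences


items = [word("Hi", 1.509, 2.21), punct(","), word("welcome", 2.22, 2.9), punct("."),
         word("Thanks", 4.1, 4.6), punct("."), word("Bye", 5.0, 5.3)]

sentences = split_sentences(items)
for s in sentences:
    print([i["alternatives"][0]["content"] for i in s])
# ['Hi', ',', 'welcome', '.']
# ['Thanks', '.']
# ['Bye']
```

The comma does not close a sentence, and the trailing "Bye" still forms one because the transcript ends there.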

Not every sentence should be a paragraph point. To identify paragraphs, we introduce a new insight at the sentence level called a start delay, as illustrated in the following figure. The start delay is the pause the speaker leaves between the end of the previous sentence and the start of the current one.

Start Delay

Calculation of the start delay requires the start time of the current sentence and the end time of the previous one, per speaker. Because Amazon Transcribe provides start and end times per item, the calculation uses the first item of the current sentence and the last item of the previous sentence.
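
A possible way to compute start delays is sketched below. It assumes sentences are lists of items from the items view and that each sentence belongs to one speaker; sentence_bounds, start_delays, w, and p are hypothetical names. Assigning a delay of 0.0 to a speaker's first sentence is an assumption of this sketch, since there is no previous sentence to compare against.

```python
def w(speaker, start, end):
    """Sample pronunciation item with timing and a speaker label."""
    return {"type": "pronunciation", "speaker_label": speaker,
            "start_time": str(start), "end_time": str(end)}


def p(speaker):
    """Sample full-stop item; punctuation carries no timing in the sample output."""
    return {"type": "punctuation", "speaker_label": speaker}


def sentence_bounds(sentence):
    """Start of the first timed item and end of the last timed item."""
    timed = [i for i in sentence if "start_time" in i]
    return float(timed[0]["start_time"]), float(timed[-1]["end_time"])


def start_delays(sentences):
    """Delay between each sentence and the same speaker's previous sentence."""
    last_end = {}  # speaker -> end time of that speaker's previous sentence
    delays = []
    for sentence in sentences:
        speaker = sentence[0].get("speaker_label")
        start, end = sentence_bounds(sentence)
        previous_end = last_end.get(speaker)
        # Assumption: a speaker's first sentence gets a delay of 0.0.
        delays.append(0.0 if previous_end is None else round(start - previous_end, 3))
        last_end[speaker] = end
    return delays


sentences = [
    [w("spk_0", 0.0, 0.5), w("spk_0", 0.6, 1.0), p("spk_0")],
    [w("spk_0", 1.2, 1.8), p("spk_0")],
    [w("spk_1", 2.0, 2.6), p("spk_1")],
    [w("spk_0", 4.0, 4.5), p("spk_0")],
]
print(start_delays(sentences))  # [0.0, 0.2, 0.0, 2.2]
```

The last sentence is measured against the previous spk_0 sentence (ending at 1.8), not against the spk_1 sentence in between.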

Knowing the start delay of each sentence, we can apply statistical analysis and figure out the significance of each delay in comparison to the total population of delays. In our context, significant delays are those that exceed the population’s typical duration. The following graph shows an example.

Start Delay Box Plot

For this concept, we treat sentences with start delays greater than the mean value as significant, and introduce a paragraph point at the beginning of each such sentence. Apart from the mean value, there are other options, such as accepting all start delays greater than the median, the third quartile, or the upper fence value of the population.
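
The thresholding step might look like the following sketch, which uses Python's standard statistics module. The function name paragraph_starts and the sample delays are invented for illustration; always treating the first sentence as a paragraph start is an assumption of the sketch.

```python
from statistics import mean, median


def paragraph_starts(delays, threshold=mean):
    """Return indices of sentences that open a paragraph.

    `threshold` is any function that reduces the delay population to a
    single cut-off value, for example statistics.mean or statistics.median.
    """
    limit = threshold(delays)
    starts = {0}  # assumption: the first sentence always opens a paragraph
    starts.update(i for i, delay in enumerate(delays) if delay > limit)
    return sorted(starts)


# Hypothetical start delays in seconds, one per sentence.
delays = [0.0, 0.2, 0.3, 2.5, 0.1, 0.4, 3.0, 0.2]

print(paragraph_starts(delays))          # [0, 3, 6]
print(paragraph_starts(delays, median))  # [0, 2, 3, 5, 6]
```

With these numbers the mean (0.8375) yields three paragraphs, while the median (0.25) is less selective and yields five, which illustrates how the choice of statistic changes the granularity of the result.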

We add one more step to the paragraph identification process, taking into consideration the number of words contained in each paragraph. When a paragraph contains a significant number of words, we run a split operation, thereby adding one more paragraph to the final result.

In the context of word counts, we define as significant the word counts that exceed the upper fence value. We make this decision deliberately, so that we restrict split operations to the paragraphs that truly behave as outliers in our results. The following graph shows an example.

Word Count Box Plot
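
As a sketch of how the upper fence could be computed, the snippet below uses the common box plot definition, Q3 + 1.5 × IQR, with quartiles taken from statistics.quantiles. The post does not state which quartile method it uses, so the "inclusive" method here is an assumption, as are the name upper_fence and the sample word counts.

```python
from statistics import quantiles


def upper_fence(values):
    """Upper fence of a box plot: Q3 + 1.5 * (Q3 - Q1)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


# Hypothetical word counts, one per paragraph.
word_counts = [40, 55, 60, 48, 52, 45, 210]

fence = upper_fence(word_counts)
oversized = [i for i, count in enumerate(word_counts) if count > fence]
print(fence, oversized)  # 74.0 [6]
```

Only the 210-word paragraph exceeds the fence, so it would be the only candidate for a split.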

The split operation selects the new paragraph entry point by considering the maximum sentence start delay. This way, the new paragraph is introduced at the sentence that exhibits the maximum start delay inside the current paragraph. Splits can be repeated until no word count exceeds the selected boundary, in our case the upper fence value. The following figure shows an example.

Paragraph Split
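
One way to express the repeated split is sketched below. Paragraphs are represented as lists of sentence indices, and the word limit is passed in as a fixed number; whether the boundary should be recomputed after each split is not specified in the post, so keeping it fixed is an assumption. The name split_paragraphs and the sample values are hypothetical.

```python
def split_paragraphs(paragraphs, words, delays, limit):
    """Split oversized paragraphs at the sentence with the largest start delay.

    paragraphs: lists of sentence indices, in document order
    words:      word count per sentence
    delays:     start delay per sentence
    limit:      maximum word count allowed per paragraph
    """
    result = []
    queue = list(paragraphs)
    while queue:
        para = queue.pop(0)
        total = sum(words[i] for i in para)
        if total <= limit or len(para) < 2:
            result.append(para)  # fits, or a single sentence that cannot be split
            continue
        # The paragraph's first sentence is already an entry point, so skip it.
        cut = max(para[1:], key=lambda i: delays[i])
        pos = para.index(cut)
        # Re-examine both halves before moving on, preserving document order.
        queue[:0] = [para[:pos], para[pos:]]
    return result


words = [10, 12, 30, 25, 28, 9]
delays = [0.0, 0.3, 0.2, 0.9, 0.4, 0.1]
paragraphs = [[0, 1], [2, 3, 4, 5]]

print(split_paragraphs(paragraphs, words, delays, limit=60))
# [[0, 1], [2], [3], [4, 5]]
```

The second paragraph (92 words) is first split at sentence 3, which has the largest delay; the remainder (62 words) still exceeds the limit and is split again at sentence 4.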


In this post, we presented a concept to automatically introduce paragraphs to your transcripts, without manual intervention, based on the metadata Amazon Transcribe provides along with the actual transcript.


This concept is not language or accent specific, because it relies on non-linguistic metadata to suggest paragraph entry points. Future variations can include grammatical or semantic information on a per-language basis, further enhancing the paragraph identification logic.

If you have feedback about this post, submit your comments in the comments section. We look forward to hearing from you. Check out Amazon Transcribe Features for additional features that will help you get the most value out of your transcripts.

About the Authors

Kostas Tzouvanas is an Enterprise Solution Architect at Amazon Web Services. He helps customers architect cloud-based solutions to achieve their business potential. His main focus is trading platforms and high performance computing systems. He is also passionate about genomics and bioinformatics.

Pavlos Kaimakis is an Enterprise Solutions Architect looking after Enterprise customers in GR/CY/MT, supporting them with his experience to design and implement solutions that drive value for them. Pavlos has spent most of his career in the product and customer support sector, from both an engineering and a management perspective. Pavlos loves travelling and he’s always up for exploring new places in the world.
