Why human transcription won't go away for a while

Team Reduct Video Nepal
October 2022

Computer-generated (AI) transcription has come a long way in recent years, becoming better, cheaper, and faster. Depending on the clarity and type of media, computer transcription can now do as good a job as humans, and sometimes an even better one. However, audio and video input isn’t always crystal clear. In such circumstances, humans can draw on their understanding of the context and on proper research to give you a transcript that is accurate, readable, and correctly labeled by speaker. In this article, we’ll discuss some specific areas where we found computer transcription falling short and where human transcription can do a better job.

1. Speakers talking over each other

When speakers talk over each other, the AI struggles to separate what each speaker said. For instance, if you’re transcribing a Zoom call or an outdoor recording with people talking over each other and constant background noise, the AI finds it hard to produce a useful transcript. It also doesn’t label the speakers accurately.

In the example below, the AI incorrectly labeled two separate speakers as one.

AI:

Speaker 1: For me the best method of yeah but what about

Human:

Speaker 1: For me the best method [crosstalk]-

Speaker 2: Yeah, but what about-

2. Use of punctuation

AI can have a hard time deciding where punctuation belongs based on the audio alone. In the example below, the speaker has an accent and the AI transcript is not punctuated accurately. Even though the words are largely captured verbatim, the sentence loses its readability.

AI: And so we asked me if doing prison was the hard part and I spent over 12 years in prison in 18 years I did 12 total And it wasn’t the hard part wasn’t going to prison.

Human: And so you asked me if doing prison was the hard part, and I spent over 12 years in prison. In 18 years I did 12 total. And it wasn’t- the hard part wasn’t going to prison.

3. Tone of speech

The tone of speech varies from media to media. In some cases, the same speaker’s tone varies within a single sentence. AI can sometimes fail to pick up words said in a lower voice. A human transcriber, on the other hand, can understand and fill in the details by drawing on the context of the media.

AI:

Speaker 1: You’re what? You’re a fatty? No

Human:

Speaker 1: You’re what?

Speaker 2: Fatty.

Speaker 1: You’re a fatty? No.

4. Inaccurate guessing

In some instances when a word is inaudible, AI transcription tends to fill the gap with an inaccurate word. A human transcriber, however, can mark that portion of the transcript as inaudible instead of guessing a word or phrase that might change the meaning of the sentence.

AI: But I’m just worried about any back then.

Human: But I’m just worried about any [inaudible].
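
The same idea can be approximated in software. The sketch below is a minimal illustration, not a description of how any particular transcription engine works: it assumes the engine returns a confidence score for each word (the `(text, confidence)` pairs, the scores themselves, and the `0.5` threshold are all made up for this example) and replaces runs of low-confidence words with a single [inaudible] marker.

```python
def mark_inaudible(words, threshold=0.5):
    """Replace runs of low-confidence words with one [inaudible] marker.

    `words` is a list of (text, confidence) pairs. Both this structure
    and the default threshold are assumptions made for illustration.
    """
    out = []
    for text, confidence in words:
        if confidence >= threshold:
            out.append(text)
        elif not out or out[-1] != "[inaudible]":
            # Collapse consecutive low-confidence words into one marker.
            out.append("[inaudible]")
    return " ".join(out)


# Hypothetical scores for the AI transcript above (invented values).
words = [
    ("But", 0.98), ("I'm", 0.95), ("just", 0.97), ("worried", 0.93),
    ("about", 0.96), ("any", 0.90), ("back", 0.31), ("then.", 0.28),
]
print(mark_inaudible(words))
```

A fixed threshold is a blunt tool: set too high, it hides words that were transcribed correctly; set too low, it lets wrong guesses through. A human transcriber makes that call word by word.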

5. Mid-sentence drifts

In some audio and video, speakers lose their train of thought mid-sentence and drift into talking about something else. In such instances, computer transcription does not mark the drift with any visible indicator, which hurts readability. Humans, on the other hand, can identify such drifts and mark them with hyphens, as in the example below.

AI: Injured passengers, okay, and are you, is your airplane physically on fire?

Human: Injured passengers, okay. And are you- is your airplane physically on fire?

6. Understanding the context

In some cases, background knowledge of what the audio or video is about is needed to make the right call about what was said. In the example below, the human transcriber transcribes the audio correctly because they have a better understanding of the context.

AI: And then there’s this deal with the information.

Human: And then there’s this new claim for defamation.
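
To put a number on how far apart two transcripts like these are, one common measure is word error rate (WER): the word-level edit distance between a reference transcript and a candidate, divided by the number of words in the reference. The sketch below is a minimal version that treats the human transcript as the reference. The article itself does not say how accuracy is measured, so this is an illustration only, and the function names `tokenize` and `word_error_rate` are our own.

```python
import re


def tokenize(text):
    """Lowercase, normalize curly apostrophes, and split into words."""
    text = text.lower().replace("\u2019", "'")
    return re.findall(r"[a-z0-9']+", text)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript has no words")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


human = "And then there's this new claim for defamation."
ai = "And then there's this deal with the information."
print(word_error_rate(human, ai))  # 0.5: four of the eight reference words differ
```

Note that WER counts every wrong word equally. Here half the words are wrong, but it is the specific words "claim" and "defamation" that carry the meaning, which is why context matters more than the raw score suggests.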

7. Fast speakers

The pace of speech also comes into play when transcribing audio or video. AI can have trouble with speakers who talk fast: it can miss words or transcribe them incorrectly. Human transcribers, however, can pause the audio and relisten at different speeds, which gives a more accurate transcript. In the example below, the AI wasn’t able to identify a phrase and transcribed it as a single word instead.

AI: SOAR process, I thought it could be dealt with very, very, quickly and very, very, smoothly.

Human: So at the start of the process, I thought it could be dealt with very, very, quickly and very, very, smoothly.

8. Speaker labeling

Speaker labeling can be troublesome for AI when the speakers have similar voices, which usually happens when they are of the same gender. It can also be an issue when a single person uses different tones of voice. Depending on the voices, AI can end up with more or fewer speaker labels than there are actual speakers. Human transcribers, in contrast, can better tell the voices and tones apart and provide accurate labels. The example below shows how the AI used six speaker labels for audio with only two speakers.

AI:

Speaker 1: I mean, strictly because it’s a lot of confidential and personal stuff.

Speaker 6: And the effect of my team is also something that’s sort of on my mind.

Human:

Speaker 1: I mean, strictly because it’s a lot of confidential and personal stuff.

Speaker 2: And the effect of my team is also something that’s sort of on my mind.

9. Use of technical words and abbreviations

Computer transcription can struggle when the audio or video contains niche technical content with a lot of terminology and abbreviations. The example below shows how a human transcriber can research the context of the media to find the relevant technical terms.

AI: Advise you to maintain at or above 2200 for CMVA

Human: Advise you to maintain at or above 2200 for the MVA (Minimum Vectoring Altitude).

Based on our internal testing and experience, the choice between computer and human transcription depends on the quality of the audio you’re working with and the end use of the transcript. If you’re working with studio-quality audio and can tolerate a few mistakes, then computer transcription gives you a faster and cheaper transcript. However, if you’re working with audio that isn’t very clear, needs contextual understanding, and leaves no margin for error, then human transcription is the way to go.

At Reduct.Video, we provide both computer and human transcription services. Our computer transcription engine uses Whisper, the best-in-class computer transcription available. It is delivered within minutes of uploading your media. Our human transcription service is powered by highly skilled transcribers, a multi-stage workflow, and a competent core team, providing you with a 99% accurate transcript within 6-8 hours.
