Why Human Transcription Won't Go Away for a While

October 2022



Computer-generated (AI) transcription has come a long way with recent developments. It has become a more accurate, cheaper, and faster way to transcribe.

Depending on the clarity and type of media, computer transcription can be as good as, or sometimes even better than, human transcription.

The audio/video input is not always crystal clear. Contextual knowledge and proper research help produce accurate and readable transcription. And this is where humans can perform better than AI.

We have compiled nine areas where computer transcription may fall short and humans can do a better job.


1. Speakers talking over each other

When speakers talk over each other, AI struggles to tell their dialogue apart.

For instance, if you are transcribing a Zoom call where people talk over each other, the AI finds it hard to transcribe accurately. It also fails to differentiate the speakers.

In the example below, the AI labeled two separate speakers as one.

AI:

Speaker 1: For me the best method of yeah but what about

Human:

Speaker 1: For me the best method [crosstalk]-

Speaker 2: Yeah, but what about-
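Crosstalk like this can be flagged for human review when a transcript comes with timestamped speaker segments. The sketch below is illustrative only: the `Segment` type, the `find_crosstalk` function, and the timestamps are all hypothetical and do not describe any particular transcription engine.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    """A hypothetical timestamped chunk of speech (times in seconds)."""
    speaker: str
    start: float
    end: float


def find_crosstalk(segments: List[Segment]) -> List[Tuple[Segment, Segment]]:
    """Return pairs of segments from different speakers that overlap in time."""
    ordered = sorted(segments, key=lambda s: s.start)
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            # Segments are sorted by start time, so once one starts after
            # `first` has ended, no later segment can overlap `first`.
            if second.start >= first.end:
                break
            if second.speaker != first.speaker:
                overlaps.append((first, second))
    return overlaps


# Made-up timings for the exchange above.
segments = [
    Segment("Speaker 1", 0.0, 3.2),  # "For me the best method"
    Segment("Speaker 2", 2.5, 4.0),  # "Yeah, but what about"
]
for first, second in find_crosstalk(segments):
    print(f"Crosstalk between {first.speaker} and {second.speaker}")
```

A flagged overlap is only a hint that a passage may need a human pass; it does not by itself fix the transcript.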

2. Use of punctuation

AI can have difficulty determining where punctuation belongs based on the audio or the dialogue.

In the example below, the speaker has an accent, and the AI does not punctuate accurately. Even though the transcript is mostly correct in a verbatim sense, the sentence loses its readability.

AI:

And so we asked me if doing prison was the hard part and I spent over 12 years in prison in 18 years I did 12 total And it wasn’t the hard part wasn’t going to prison.

Human:

And so you asked me if doing prison was the hard part, and I spent over 12 years in prison. In 18 years I did 12 total. And it wasn’t- the hard part wasn’t going to prison.

3. Tone of speech

The tone of speech varies from media to media. In a few cases, a single speaker might use multiple tones of voice during the same interview. AI can sometimes fail to pick up words spoken in a lower tone of voice. A human can understand the context of the media and fill in the details.

AI:

Speaker 1: You’re what? You’re a fatty? No

Human:

Speaker 1: You’re what?

Speaker 2: Fatty.

Speaker 1: You’re a fatty? No.

4. Inaccurate guessing

AI sometimes fills the gaps with inaccurate words when a word is not audible. A human transcriber, however, can opt to mark a part of the transcript as inaudible. This ensures the meaning of the sentence does not change because of incorrect guessing.

AI:

But I’m just worried about any back then.

Human:

But I’m just worried about any [inaudible].
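The same idea can be approximated in software when a transcription engine reports a confidence score for each word. The sketch below is a minimal illustration under that assumption: the `(word, confidence)` input format, the `mark_inaudible` function, the 0.5 threshold, and the scores are all hypothetical.

```python
from typing import List, Tuple

PLACEHOLDER = "[inaudible]"


def mark_inaudible(words: List[Tuple[str, float]], threshold: float = 0.5) -> str:
    """Replace runs of low-confidence words with a single [inaudible] tag."""
    out: List[str] = []
    for word, confidence in words:
        if confidence >= threshold:
            out.append(word)
            continue
        # Keep trailing punctuation so the sentence still ends cleanly.
        trailing = word[-1] if word and word[-1] in ".,?!" else ""
        if out and out[-1].startswith(PLACEHOLDER):
            # Merge consecutive low-confidence words into one tag,
            # keeping only the punctuation of the last word in the run.
            out[-1] = PLACEHOLDER + trailing
        else:
            out.append(PLACEHOLDER + trailing)
    return " ".join(out)


# Made-up confidence scores for the AI transcript above.
words = [
    ("But", 0.98), ("I'm", 0.97), ("just", 0.95), ("worried", 0.96),
    ("about", 0.94), ("any", 0.90), ("back", 0.31), ("then.", 0.28),
]
print(mark_inaudible(words))
```

A threshold like this only decides where to stop guessing. Recovering what was actually said still takes a human listener.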

5. Mid-sentence drifts

In some audio or video recordings, speakers lose their train of thought mid-sentence and drift into talking about something else.

In such instances, the AI does not mark the drift with a visible indicator, impacting readability. Humans can identify and indicate the drift using hyphens, as in the example below.

AI:

Injured passengers, okay, and are you, is your airplane physically on fire?

Human:

Injured passengers, okay. And are you- is your airplane physically on fire?

6. Understanding the context

In some cases, prior knowledge about the content of the audio/video is crucial to make correct calls about unclear spoken words. A human who understands the context can transcribe such audio more accurately.

AI:

And then there’s this deal with the information.

Human:

And then there’s this new claim for defamation.

7. Fast speakers

The pace of speech also impacts the accuracy of a transcript. AI can have trouble transcribing audio with fast-paced speakers. In such instances, AI can miss words or transcribe them incorrectly. Human transcribers can pause the audio and listen at different speeds to improve accuracy.

In the example below, the AI failed to identify a phrase and instead transcribed it as a single word.

AI:

SOAR process, I thought it could be dealt with very, very, quickly and very, very, smoothly.

Human:

So at the start of the process, I thought it could be dealt with very, very, quickly and very, very, smoothly.

8. Speaker labeling

Speaker labeling can be troublesome for AI if the speakers have similar voices, especially when the speakers are of the same gender. AI also struggles to identify a speaker when a single person uses different tones of voice. Depending on the type of audio, AI can label more or fewer speakers than are actually present.

In contrast, human transcribers can differentiate the voices and tones better. Accurate speaker labeling improves readability considerably.

The example below shows how the AI added six speaker labels to audio with only two speakers.

AI:

Speaker 1: I mean, strictly because it’s a lot of confidential and personal stuff.

Speaker 6: And the effect of my team is also something that’s sort of on my mind.

Human:

Speaker 1: I mean, strictly because it’s a lot of confidential and personal stuff.

Speaker 2: And the effect of my team is also something that’s sort of on my mind.
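When the number of speakers is known in advance, a mismatch like this is easy to detect automatically, even though fixing it still needs a human ear. The sketch below assumes labels in the `Speaker N:` form used in these examples; the function names are hypothetical.

```python
import re

# Matches labels such as "Speaker 6:" at the start of a line.
LABEL_RE = re.compile(r"^Speaker (\d+):")


def highest_speaker_label(transcript: str) -> int:
    """Return the largest speaker number used in the transcript (0 if none)."""
    highest = 0
    for line in transcript.splitlines():
        match = LABEL_RE.match(line.strip())
        if match:
            highest = max(highest, int(match.group(1)))
    return highest


def needs_label_review(transcript: str, expected_speakers: int) -> bool:
    """Flag transcripts whose labels go beyond the known number of speakers."""
    return highest_speaker_label(transcript) > expected_speakers


ai_transcript = """Speaker 1: I mean, strictly because it's a lot of confidential and personal stuff.
Speaker 6: And the effect of my team is also something that's sort of on my mind."""

print(needs_label_review(ai_transcript, expected_speakers=2))
```

This check only catches extra labels. It cannot tell when two different people were merged under one label, which is the harder case described above.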

9. Use of technical words and abbreviations

AI can struggle when the audio/video contains a lot of technical terms and abbreviations.

The example below shows how simple research can help humans identify the relevant technical terms.

AI:

Advise you to maintain at or above 2200 for CMVA

Human:

Advise you to maintain at or above 2200 for the MVA (Minimum Vectoring Altitude).

What should you choose?

Choosing between computer and human transcription depends on your needs. If you’re working with studio-quality audio and can tolerate a few mistakes, you can opt for computer transcription. It is the faster and cheaper option.

However, if you cannot afford any margin for error, then human transcription is the way to go. Humans also work better with audio that is unclear, features strong accents, or requires contextual understanding.

At Reduct.Video, we provide you with both computer and human transcription services. Our computer transcription engine uses Whisper, the best-in-class computer transcription available. We provide instant AI transcription, within a minute of uploading your media.

Our human transcription goes through a multi-stage workflow and is completed and reviewed by multiple skilled transcribers. We provide you with a 99% accurate transcript, often within 6-8 hours.