Determining Best Practice for Filler Words in Captions and Transcripts

January 9, 2020 by birdlab

When I routinely did audio editing early in the journal’s history, I was fairly aggressive in my edits; my goal would be to remove as many filler words as possible, not just “um” and “er” but words and phrases such as “like” or “you know” or a bridging “so . . . . ” or “and . . . .” where a pause to gather a thought followed.

Sometimes a speaker would talk so rapidly and use so many fillers that removing them became impossible, because doing so noticeably “sped up” the speech in the edited audio file.

This has come up again in transcripts where filler words have not been (or cannot be) edited from the audio. Contributors (and copyeditors/production editors) are asking if we can eliminate filler words from captions and transcripts. That would have been my sense of our ideal editorial practice, but it is not overtly covered in our style manual and, if ever present elsewhere, seems lost in training materials. As a result, we have wobbled in practice. We mean to get this into our style manual as soon as possible and to make some mending changes to texts in v18n2.

Likewise for questions of transcribing dialect or informalities of speech; we would NOT traditionally write “wanna” or “gonna” but would transcribe them as “want to” and “going to.” We likewise would not attempt to represent regional dialects, though this is actually called for by the Described and Captioned Media Program (DCMP) Captioning Key. I will post separately about this question in the future.

I have looked at the following resources as a step toward recommending our best practices. Under each link, I’ve quoted selected text from the destination site to give some sense of its take on the issue, but I recommend that you visit each site to see context and fuller descriptions. These texts are not in universal agreement, as the question of whether or not to edit filler words may depend (as in legal uses) on the circumstance, audience, or preference of a given company or publication.

I’ve opened with GoTranscript’s distinction between “full verbatim” and “clean verbatim” as I think it is helpful and orienting; the other resources, at this writing, are in no particular order, though I do strongly suggest that all those interested track down a copy of Sean Zdenek’s Reading Sounds, where chapter two takes up a discussion of style guides and addresses verbatim captioning. Zdenek helpfully confirms that “Style guides are light on theory; individual guidelines are typically offered up as truths in no need of justification.” This is important to remember in any search for “authoritative” answers.

(This research is in progress and I may add to our list of resources in the coming weeks.)

GoTranscript Transcription Guidelines
GoTranscript [transcription service]

TEXT FORMAT DESCRIPTIONS

FULL VERBATIM

The text is transcribed exactly as it sounds and includes all the utterances of the speakers.

Those are:

Speech errors: “I went to the bank on Thursday– no, Friday.”
False starts: I, um, wanted– I have dreamed of becoming a musician.
Filler words: um, uh, kind of, sort of, I mean, you know…
Slang words: Kinda, gotta, gotcha, betcha, wanna, dunno…
Stutters: I-I went to the bank last Tu-Thursday.
Repetitions: I went- I went to the bank last Friday.

Only use these forms for the affirmative/negative:
Mm-hmm, Mm (affirmative) or Mm-mm (negative)
Uh-huh (affirmative) or Uh-uh (negative)

CLEAN VERBATIM

The transcribed text does not include:

Speech errors
False starts (unless they add information)
Stutters
Repetitions. Note: Keep repetitions of words that express emphasis: No, no, no. I am very, very happy.
Filler words: Words often excessively used by the speaker but when you take them out, you’re left with perfectly understandable sentences. uh, um, *you know, *like, *I think, *I mean, *so, *kind of, well, sort of… Be mindful of the context. Some of these filler words do not always function as filler words.
Expressions should be kept regardless of verbatim type: Oh my God, Oh dear, Oh my, Oh boy, et cetera.
Slang words must be written as “got you” instead of “gotcha”, “going to” instead of “gonna”, “want to” instead of “wanna”, “because” instead of “’cause” et cetera.
“Yeah”, “yep”, “yap”, “yup”, “mm-hmm” must be written as “yes”; “alright” must be written as “all right.”
Never spell “Ok” or “OK.” It must always be spelled as “Okay.”
Avoid starting phrases with conjunctions in clean verbatim. If you really need to add the conjunction, just expand the phrase. For example: I went outside, but forgot to bring my umbrella.

Note: For CV: Omit all the “yeah”, “yes” reactions to retain a fluent text, unless they are answers to given questions.
DO NOT remove filler words if they change the meaning of the phrase.
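
To make the clean verbatim rules above more concrete, here is a minimal Python sketch of the purely mechanical part of them. It is my own illustration, not GoTranscript's tooling: the names FILLER_SOUNDS, REPLACEMENTS, and clean_verbatim are hypothetical, and the word lists hold only items quoted in the guidelines. Context-dependent fillers ("like," "so," "you know," "I think") are deliberately left out, since the guidelines warn that these do not always function as fillers and need a human decision.

```python
import re

# Hypothetical rule tables built from the examples quoted above.
FILLER_SOUNDS = ["um", "uh"]  # unambiguous sounds only
REPLACEMENTS = {
    "gotcha": "got you",
    "gonna": "going to",
    "wanna": "want to",
    "alright": "all right",
    "ok": "okay",
    "yeah": "yes",
    "yep": "yes",
    "yap": "yes",
    "yup": "yes",
    "mm-hmm": "yes",
}


def _token_pattern(words):
    # Match whole tokens only, so "uh" is never matched inside "uh-huh".
    alternatives = "|".join(
        re.escape(w) for w in sorted(words, key=len, reverse=True)
    )
    return rf"(?<![\w-])(?:{alternatives})(?![\w-])"


# A filler sound plus one trailing comma and the spacing after it.
FILLER_RE = re.compile(_token_pattern(FILLER_SOUNDS) + r",?\s*", re.IGNORECASE)
REPLACE_RE = re.compile(_token_pattern(REPLACEMENTS), re.IGNORECASE)


def _expand(match):
    original = match.group(0)
    expanded = REPLACEMENTS[original.lower()]
    # Keep an initial capital if the spoken form had one ("OK" -> "Okay").
    if original[0].isupper():
        expanded = expanded[0].upper() + expanded[1:]
    return expanded


def clean_verbatim(text):
    cleaned = FILLER_RE.sub("", text)
    cleaned = REPLACE_RE.sub(_expand, cleaned).strip()
    # Re-capitalize if removing a leading filler exposed a lowercase word.
    if text[:1].isupper():
        cleaned = cleaned[:1].upper() + cleaned[1:]
    return cleaned
```

For example, clean_verbatim("Okay, uh, is it true?") returns "Okay, is it true?". Punctuation left behind by a removed filler, the merging of turns, and the "omit reaction yeahs" rule would all still need an editor's eye.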

FV EXAMPLE:
Speaker 1: Hey, Maya, I’d like to ask you something.
Speaker 2: Okay.
Speaker 1: Someone told me, applicants must now present an ID before they can sign up.
Speaker 2: Yeah.
Speaker 1: But I’m not sure if that is true.
Speaker 2: Yeah.
Speaker 1: Okay, uh, is it true?
Speaker 2: Yep.

CV EXAMPLE:
Speaker 1: Hey, Maya, I’d like to ask you something.
Speaker 2: Okay.
Speaker 1: Someone told me, applicants must now present an ID before they can sign up. I’m not sure if that is true. Is it true?
Speaker 2: Yes.

Verbatim or Not Verbatim, That is the Question
CastingWords [transcription service]

CastingWords’ standard transcription style is therefore true to the audio, but non-verbatim. We edit the text lightly for smoother reading.

We don’t correct grammar, paraphrase, summarize, rearrange words, or include words that were not spoken.
We do leave out the stutters, mis-steps, and filler words that tend to pepper spoken communication.

The result is a transcript that conveys the full meaning and tone of the speaker’s words.

Filler Words in Legal Transcription: Why They Should be Included
GMR Transcription [transcription service]

Verbatim transcription is the most detailed type of transcription service. A verbatim transcription includes everything that is said on the recording as well as grunts, sniffles, coughs, and utterances such as “uh huh”. This form of transcription also includes audible noises outside of the people being recorded such as a knock on the door, a honking horn, or the sound of a pencil being dropped.

Verbatim transcriptions are used in court transcripts as well as for depositions and interviews for the purpose of qualitative analysis. Regardless of the venue, these transcriptions demand comprehensive attention to detail and a high level of experience to determine sounds coming from the people being recorded as well as to decipher ambient noises. Verbatim transcription is, in itself, a daunting task but the requirement for flawless transcription in legal proceedings sets a standard that many transcription services simply cannot meet.

Reporters who quote ums and ahs only make themselves look bad
Rob Beschizza for boingboing

. . . in print, reporters usually remove speech disfluency when they quote subjects. In fact, it is generally considered unethical and unprofessional for editors not to remove the ums and ahs and filler terms, though there’s usually a hard line against changing words or paraphrasing within quotes.

Here’s Terry Gross, the NPR host, explaining her interview policy:

With the exception of the occasional John Updike, no one speaks readable, perfectly grammatical sentences. So we’ve edited the answers my questions elicited for clarity and concision, while sticking as closely as possible to each interviewee’s actual speaking style.

The 2015 edition of The New York Times Manual of Style and Usage is similarly clear:

The writer should, of course, omit extraneous syllables like “um” and may judiciously delete false starts. If any further omission is necessary, close the quotation, insert new attribution and begin another quotation. (The Times does adjust spelling, punctuation, capitalization and abbreviations within a quotation for consistent style.) In every case, writer and editor must both be satisfied that the intent of the speaker has been preserved.

The Associated Press Stylebook is rather vague: it says not to “alter” quotes to correct word usage or grammar, but has nothing to say on filler talk specifically.

If a quotation is flawed because of grammar or lack of clarity, the writer must be able to paraphrase in a way that is completely true to the original quote. If a quote’s meaning is too murky to be paraphrased accurately, it should not be used.

In practice, though, the AP removes it. This is a fact easy to demonstrate by comparing its quotes of Olympic-class filler-talkers Barack Obama and Donald Trump to the transcripts.

Sometimes, Reporters Should Clean Up Ungrammatical Quotes
Katy Waldman for Slate

A few weeks ago, sports writer Brian T. Smith wrote a column for the Houston Chronicle about an outfielder for the Astros, Carlos Gómez, who has gotten off to a slow start this season. Smith interviewed the Dominican-born Gómez and quoted him exactly, relaying his words as follows: “For the last year and this year, I not really do much for this team. The fans be angry. They be disappointed.”

The quote stood out, because sports writers don’t usually transcribe so precisely the words of players for whom English is their second language. Usually, sports writers clean those quotes up. (Even Breitbart has rendered Go-Go’s speech with correct, if informal, grammar.) Critics, including Gómez himself, took Smith to task for seeming to mock the athlete’s incorrect English. Chronicle editor Nancy Barnes apologized, citing “less than adequate” AP guidelines on quoting news sources who did not grow up speaking George Washington’s tongue. On Deadspin, Tom Ley suggested that Gomez “has a right to be annoyed” that a reporter “went off and made him look dumb by not extending him a courtesy that most people quoted by reporters get”: that of subtly tweaked sentences.

Not everyone agrees. Over at ESPN’s brand-new site the Undefeated, J.A. Adande used the incident to inveigh against the cleaning up of quotes. “Since when should journalists apologize for being accurate?” Adande asked. Doesn’t objectivity demand absolute faithfulness to what a person says, not what he means to say?

DCMP Captioning Key
The Described and Captioned Media Program is funded by the U.S. Department of Education and administered by the National Association of the Deaf.

A reoccurring question about captioning is whether captions should be verbatim or edited. Among the advocates for verbatim are organizations of deaf and hard of hearing persons who do not believe that their right for equal access to information and dialogue is served by any deletion or change of words. Supporters of edited captions include parents and teachers who call for the editing of captions on the grounds that the reading rates necessitated by verbatim captions can be so high that captions are almost impossible to follow.

As the debate has continued, researchers have tackled the question. A bibliography of research on reading rates is provided in the Captioning Presentation Rate Research document on the Captioning Key Appendices page. DCMP supports editing based on research results and the DCMP’s half-century of captioning experience. Editing is often essential to ensure that students have time to read the captions, integrate the captions and picture, and internalize and comprehend the message.

When editing occurs, each caption should maintain the meaning, content, and essential vocabulary of the original narration. DCMP media users, who are the families and teachers of students who are deaf and hard of hearing, have enthusiastically praised the quality of the DCMP educational media and the captioning that provides equal access.

Standard Closed Captioning Guidelines
Capital Captions [transcription service]

Content Accuracy and Inclusions

Captions should be as close to original content as possible and written verbatim.
Dialogue must not be censored.
Dialogue should not be simplified.
Occasional truncation or editing of speech is acceptable where there is a significant conflict with reading speed and/or synchronisation.
Where sentence shortening is absolutely necessary, truncations should be prioritised and limited to ‘filler’ words.

[In addition, I am taking note of Capital Captions’ clear tech specs, though this is not directly relevant to our question here. —mak]

Closed Caption Technical Specifications

Font for captions to be Arial, white, with size relative to resolution to fit maximum 40 characters.
Maximum two lines.
Adult’s closed caption reading speed set to maximum 250 words per minute/20 characters per second.
Children’s closed caption reading speed set to maximum 200 words per minute/17 characters per second.
Minimum caption display time 1 second.
Maximum caption display time 8 seconds.
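
The numeric limits above lend themselves to an automated check. The following Python sketch is only an illustration under stated assumptions: the Caption class and check_caption function are hypothetical names, reading speed is measured in characters per second with spaces counted, and the font requirement is not checked at all. It is not Capital Captions' software.

```python
from dataclasses import dataclass
from typing import List

# Limits taken from the technical specifications quoted above.
MAX_CHARS_PER_LINE = 40
MAX_LINES = 2
MIN_SECONDS = 1.0
MAX_SECONDS = 8.0
MAX_CHARS_PER_SECOND = {"adult": 20, "child": 17}


@dataclass
class Caption:
    lines: List[str]  # text of each displayed line
    start: float      # display start time, in seconds
    end: float        # display end time, in seconds


def check_caption(caption, audience="adult"):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if len(caption.lines) > MAX_LINES:
        problems.append("too many lines")
    for line in caption.lines:
        if len(line) > MAX_CHARS_PER_LINE:
            problems.append("line longer than 40 characters")
    duration = caption.end - caption.start
    if duration < MIN_SECONDS:
        problems.append("display time under 1 second")
    elif duration > MAX_SECONDS:
        problems.append("display time over 8 seconds")
    if duration > 0:
        # Assumption: spaces count toward the character total.
        chars_per_second = sum(len(line) for line in caption.lines) / duration
        if chars_per_second > MAX_CHARS_PER_SECOND[audience]:
            problems.append("reading speed too high")
    return problems
```

A caption flagged for reading speed is exactly the case where the content guidelines above allow truncation, limited to filler words.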

How to Caption Videos
Berkeley Web Access

It’s okay to clean up words like, “um,” “you know,” and other filler words.

Yagoda’s Rules for Quotes
Ben Yagoda Blog

Accuracy of Quotes
The short answer is that if you’re using quotation marks, it’s not permissible to change anything the speaker said. However, it’s okay not to include meaningless filler words and sounds like “um” and “you know.” Beyond that, different organizations have different rules and policies, so consult with your editor.

Rev Captioning Style Guide 3.3
Rev

[see the guide for examples not quoted in the two sections below —mak]

Accurately Type Out the Words
Rule of thumb: Listen carefully to the dialogue and accurately type out the words with minimal errors and guesses. Never correct the speaker’s grammar or add words that aren’t spoken. Be consistent with punctuation and symbols.

Type what the speaker says. You must ALWAYS caption what is heard and always use American English spelling.

● Never correct (edit) the speaker’s grammar (morphology, syntax, and semantics).
● Never paraphrase.
● Never substitute words.
● Never add words that are not spoken.
● Never rearrange the order of speech.
● Don’t correct phonetics unless it distracts from readability. See the next slide, editing for readability.
● Do remove speech disfluency that distracts from readability. See the next slide, editing for readability.

Exception: Lightly Edit for Readability
Our goal is readability, so it’s preferable for you to remove extraneous text that will likely distract a viewer from the core message:

● Omit speech disfluencies*: unnecessary filler words, false starts, stutters, repetitions, etc.
● Omit quick interjections, such as an interviewer saying “mm-hmm”, unless a direct response to a question.
● Correct egregious phonetic and pronunciation errors that inhibit readability.
However, never change the story being told:
● Don’t correct a speaker’s grammar or pronunciation that is easily understood. E.g., “gonna” must stay as “gonna”.
● NEVER censor or edit expletives. If the word is censored with a beep sound, use (beep) or (bleep) in-line where the sound occurs.
● Never omit special words, entire sentences, or expletives.

For more information on speech disfluency, read http://en.wikipedia.org/wiki/Speech_disfluency

Making Media Accessible: Humber College Captioning Style Guide
Humber College Accessible Media Department

[This looks like a well-done guide, including its impressive opening statement on “Fostering Inclusivity.” —mak]

Filler, False Starts, and Discourse Markers
Much of our spoken language – more than most of us would care to admit – is taken up by filler. “Um,” “uh,” “er,” and other sounds that pepper the edges of words are generally meaningless. When we speak and listen to others, our brains rarely process filler, ignoring it in favour of the “real” words that convey meaning.
Discourse markers – words and phrases like “so,” “well,” “I mean,” “you know,” “okay,” and others – are, in an academic sense, used to manage flow and structure in speech. In spoken language, they’re often used so much that they become filler – “like” is a prominent example of this. If discourse markers cause confusion and removing them will not change the meaning of the text, they can be omitted.

“So, you know, if we take a look at this example…” → “If we take a look at this example…”

False starts are common in speech, and present a challenge when transcribing. As a rule of thumb, single-word false starts or stutters can be left out.

“When, when you consider…” → “When you consider…”

“I, I, I think…” → “I think…”

Longer false starts, when a speaker says a few words, stops, then continues speaking, sometimes on a completely different thought, are more challenging. Whether what follows the false start is capitalized or not depends on whether it can be considered the beginning or the continuation of a sentence.

“So when we look at… let’s take a look at this.”

If “So when we look at…” was to be removed, “Let’s take a look at this,” becomes the beginning of the sentence; “Let’s” is capitalized.

“This is what I mean when I say… when we say this is a theory.”

“when we say this is a theory,” can’t stand on its own as a sentence without, “This is what I mean”; “when” is not capitalized.

When transcribed, filler, false starts and overused discourse markers can make captions difficult to understand and synchronize. Unless these words seem deliberate or convey aspects of the character speaking, they are removed if their removal doesn’t alter the speaker’s meaning.
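
The single-word false starts and stutters in the Humber examples above are regular enough to sketch in code. This Python fragment is my own rough illustration, not part of the Humber guide: collapse_repeats and the EMPHATIC allow-list are hypothetical names, and the allow-list holds only the emphasis words from the GoTranscript note ("No, no, no"; "very, very happy"). It cannot tell a stutter from a legitimate doubled word such as "that that," so its output would still need review.

```python
import re

# Hypothetical allow-list: repetitions of these words are treated as emphasis.
EMPHATIC = {"no", "very"}

# A word followed by one or more repeats of itself, optionally comma-separated.
REPEAT_RE = re.compile(r"\b(\w+)(?:,?\s+\1\b)+", re.IGNORECASE)


def collapse_repeats(text, keep=EMPHATIC):
    def _first_only(match):
        word = match.group(1)
        # Leave deliberate, emphatic repetition untouched.
        if word.lower() in keep:
            return match.group(0)
        # Otherwise keep only the first occurrence, with its capitalization.
        return word

    return REPEAT_RE.sub(_first_only, text)
```

Under these assumptions, collapse_repeats("I, I, I think…") returns "I think…", while "No, no, no." is left as spoken. Longer false starts, which depend on whether the remainder can stand as a sentence, are not attempted here.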

Closed Captioning & Subtitling Standards in IP Video Programming
June 16, 2016 by Emily Griffin for 3Play Media
Updated: June 3, 2019

Verbatim
For broadcast media, you should transcribe content as close to verbatim as possible.
For a scripted show, you would include every “um,” every stutter, and every stammer because they are intentionally included in the movie.

There is more leeway for unscripted reality shows, documentaries, and news broadcasts, because the filler words are usually unintentional and irrelevant.
It becomes very hard to digest captions that denote every “um” or stutter; in this case, you should get as close as possible to verbatim without making the captions difficult to read.

Similarly, if someone puts on a fake accent for a couple of lines, you want to transcribe it using proper English and denote in parentheses that they’re speaking with an accent.

[quote] Er, um, like, you know, eh? [end quote] said Kuppajava.
May 28, 2013 12:10 PM

How do I explain to a student journalist that when quoting someone with whom they have recorded an interview they should refrain from keeping in the verbal filler words and tics we all use in casual conversation unless absolutely necessary?

Multimedia, Animations, Motion
A11portal.com

What to Include in Captions and Transcripts

Captions MUST be verbatim for scripted content (except when intentionally creating simplified captioning for a relevant target audience, e.g. people with cognitive disabilities).
Transcripts MUST be verbatim for scripted content.
Captions and transcripts SHOULD be verbatim for unscripted or live content (with the optional exception of stuttering or filler words — like “um” — when captioning the filler words reduces reading comprehension of the captions or transcript).

Recorded Captioning Style Guide
Ai-Media
August 2018

Commas used to separate clauses and after filler words (“So”)
“So, select your data.”

from Reading Sounds: Closed-Captioned Media and Popular Culture
Sean Zdenek

Sample page from Chapter 2 of Sean Zdenek’s Reading Sounds: Closed-Captioned Media and Popular Culture

[I have noticed, in at least two instances, that commercial service style guides refer to what should be em dashes but are using single hyphens. Is this a limitation of caption display capabilities? Or a throwback to previously limited capabilities in caption display? —mak]
