
WikiDienstag.ch | ProdUsing #DataLiteracy | Collaborative Transcription @FutureHPodcast


The Zone -- Experimental Area

First experiments and experiences with Wikimedia-based transcription:

Transcriptions in Progress

Episode                                     Audio  ED  QA  Status               Assigned
FutureHistories_S01E04_FelixStalder         Audio  dk  dk  done                 dk
FutureHistories_S01E12_DanielLoick          Audio  dk  -   looking for QA       dk
FutureHistories S01E38 UlrikeHerrmann       Audio  cz  -   looking for QA       NN
FutureHistories_S02E08_ThomasBiebricher     Audio  dk  dk  done                 dk
FutureHistories_S02E09_IsabellaWeber        Audio  dk  dk  done                 dk
FutureHistories_S02E12_FriederikeHabermann  Audio  az  dk  done                 dk
FutureHistories_S02E14_JakobHeyer           Audio  dk  -   looking for QA       NN
FutureHistories_S02E17_RobertSeyfert        Audio  dk  dk  under QA             dk
FH_Transkription_Template                   -      -   -   Stub                 -
ZettelFuture                                -      -   -   README_ZettelFuture  -


  • In technical terms, you are in a Wiki environment. Please make sure you know what you're doing.
  • In social terms, you are in a Wikimedia environment. Conduct yourself accordingly and abstain from spamming.
  • In (inter-)personal terms, you are in an environment created by the experimental Future Histories Collaborative Transcription Project. The editors can be contacted via email: (transkription -at - futurehistories.today). Be polite. Do not troll. Do not scam.


Je pense à un monde où chaque mémoire pourrait créer sa propre légende.
(I dream of a world where each memory might create its proper caption.)
Chris Marker, Sans Soleil

Why would an image need a caption, and indeed a proper one? We all know that images can be falsified by improper sub-texts. But what kinds of images would be incomplete, if not incomprehensible, without their "proper caption"? The late Chris Marker once noted that, while taking pictures, filming or observing, "one never knows what one is filming" (or seeing, for that matter) until later. Hence, captions are afterthoughts, benefiting from hindsight. According to Marker, this applies not only to images, but to memories as well.

Now: what about aurality and, to be more specific, about orality? Consider recordings of the spoken word, say, from interviews or presentations: is there a similar gap/connection between the recounted, the recorded, the transcribed, and the perceived? Does the spoken text apply? Or is it the written one? What is lost in recording, what in transcription? And might there also be something that was added in the process of recording and transcription?

This collection of thoughts and information is dedicated to the challenges of subtitling podcasts properly. Speaking more generally, it is about the (con-)textualization of the spoken word. It is intended to help explore the use of Wikimedia for experimental, collaborative transcriptions of audio resources on the Web. And it is geared toward collecting some armory against the next onslaught on our senses and our sense of certainty: deep-fake audio, which will exploit one of the most personal features of us human beings -- the acoustic characteristics of our voices. An avalanche of acoustic counterfeits will be launched against us, yet another onslaught in the undeclared war against our senses and sensibilities, yet another attempt to undermine one of the few roots of interpersonal trust that has so far remained with us: finding our voice, making ourselves heard, granting an open ear.

In a not too distant future, dispossessed of our personal pitch, timbre, and way of speaking, we will be listening, in disbelief, to statements and refutations, declarations and rejections, expressions of faith and of doubt uttered in what once was our own voice, which never crossed our lips but were put into our mouths by technical wizardry. We will hear oaths, vows and promises we never expressed ourselves, but which were produced by machines, possibly even by ourselves, so that we do not have to honour them. Will these be anti-promises, then, or their contrary? What would be the opposite of a promise? A false one, or one that was never expressed in the first place? But then, what about silent promises?

So we may have to seriously talk about talking, express ourselves about expressing ourselves, become verbal about our verbalities, maybe stumm about remaining silent. Let us direct our attention to elements of our speech that lie beyond the words, phrases and texts of our language. Let's concern ourselves with those elements of our orality which are rooted in sound, music, and song, which cannot but carry some grain of verity, as they grow out of existential and poetic truth.

And now for something completely different. Or so it seems.

Transcribing Audio - The Context

In fall 2021, we started a small initiative with the aim of supplying VTT subtitles for the German-language podcast Future Histories. First and foremost, this should be regarded as an attempt to improve the accessibility of its content for people with impaired hearing. As you can see, e.g., in this example, podcast players can display the highlighted text sections in parallel to what is rendered acoustically. By "textualizing" the episodes, their full content becomes digestible for search engines, thereby improving lookup and retrieval, which is a desirable side effect.

A more ambitious, if mid-term, goal is to compile HTML-formatted and possibly annotated transcripts of "core episodes" selected by listeners or the producer. In the long run, we can imagine a discursive, text-based environment that is situated around the podcast and can be interlinked with external sources and initiatives. How far we will actually drive this idea will depend on the level of public interest in such a service. However, as a preliminary step in this direction, we have started an experiment to employ Wikimedia as backend infrastructure for our collective transcription effort. The text you are reading right now was created in the very early stages of this experiment.

There is much more to say about the process of audio transcription -- e.g., concerning the use of speech-to-text services and software for early raw transcripts, preparatory steps to be carried out before a machine-generated raw version can be handed to a human transcriber, formatting considerations, the review process, and automated post-processing. We will deal with all these topics later. For now, our priority is to set up a prototype of a Wikimedia-based production environment for collaborative transcription.

Production Environment

For the time being, our environment is located in a corner of Stefan M. Seydel's collaborative WikiDienstag space (thanks for the invite, -sms- !). You are currently at the Collaboration Top Page. Stubs for the sub-branches Dienstag/Collab/Transkription and Dienstag/Collab/Transkription/FutureHistories have been created for structuring, coordinating and (if so desired) actually hosting the production environment for collaborative transcription of the episodes. Alternatively, part of the coordinative effort could happen via the 'Discussion' section of the corresponding pages.

As Wikimedia lacks a modestly sophisticated Audio Player, the chances of fully streamlining the production of podcast subtitles in the Wikimedia environment are limited. The preliminary list below refers to a handful of Wikimedia pages relevant for audio, video and subtitling:

Audio Subtitling in Practice

For the time being, we maintain a Wikimedia test page for experimenting with the transcription of a single Future Histories episode, namely S01E12 with Daniel Loick. The corresponding test page is Dienstag:Collab:Transkription Transkription, and this is where you should go to follow a specific practical effort.

The remaining sections on this page deal with the steps required for producing a machine-generated, raw transcript that is ready to be handed over for further human processing.

Obtaining Closed Captions for an Episode

  1. Future Histories is published not only on a variety of podcast hubs, but also on YouTube in video format (without any moving images). This plays in our favour, since YouTube automatically generates video subtitles (closed captions) for uploaded material. The usefulness of this service is widely acknowledged, even by Wikimedia. The common SubRip (SRT) data format for video subtitles provided by YouTube is not strictly identical, but reasonably close to the WebVTT format required for audio subtitles, and provides a good starting point for our effort. Better still, the service comes for free.
  2. Now, how do we strip the subtitles from the YouTube video of a Future Histories episode? Conveniently, there is a web service out there that has specialized in this task. Downsub takes the URL of a YouTube video as input and returns files with its subtitles within seconds. (You may have to put up with lots of annoying popup advertisements, but hey, that's the common price we all have to pay for "free" services.) The subtitles come in two flavours ("SRT" and "TXT"). We need the SRT one, which we download by clicking on the corresponding icon. The names of the downloaded files are informative, but in an awkward format, so you may want to give the file a more convenient name first. When you open the file (using an ordinary text editor), this is what you should see:
          1
          00:00:00,000 --> 00:00:01,589
          herzlich willkommen bei future history

          2
          00:00:01,589 --> 00:00:03,720
          is dem podcast zur erweiterung unserer

          3
          00:00:03,720 --> 00:00:06,150
          vorstellung von zukunft meines tieren

          4
          00:00:06,150 --> 00:00:07,890
          groß und ich spreche heute mit daniel

Even from these few lines, we can see that (1) the speech recognition software doesn't capitalize German nouns, and (2) it makes mistakes, such as transcribing the spoken "future histories" as "future history is". These glitches have to be corrected in the manual editing process that comes later. Prior to this, we have to massage the content a little further.

Pre-Processing the SRT Data

It turns out that we have to change just a couple of minor things in the SRT data to turn it into a "pre-VTT" format that is easily editable by humans:

  1. The decimal commas separating the millisecond part in the timestamps must be changed to dots.
  2. The section numbers 1, 2, 3 ... have to be removed.
  3. All content of a section should be displayed in a single line, without empty lines between subsequent sections, to save the editors a lot of navigating inside the text.
  4. The '>' symbol in the timestamp is temporarily replaced with another symbol, avoiding potential problems with automated processing of HTML-like tags.
  5. To simplify automated search-replace, all words in the text, including those at the end of the line, should be surrounded by at least one whitespace character.

For reformatting, we can use a text editor or word processor with a reasonably powerful search/replace function (multi-line search, regular expressions). Alternatively, the desired result can be achieved with a few lines of shell script. Assuming "infile.srt" is the SRT file obtained from Downsub, something along the following lines will do the job:


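          # Drop section numbers and empty lines, then join each timestamp
          # line with the text line following it (one text line per section).
          # tr strips carriage returns in case of Windows line endings.
          tr -d '\r' < infile.srt | \
            grep -v               \
              -e '^[0-9][0-9]*$'  \
              -e '^$' |           \
            sed                   \
               -e 's/>/x/'        \
               -e 's/,/./g'       \
               -e 'N;s/\n/ /'     \
               -e 's/$/ /'        \
            > preVTT_onliner_file.txt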

The output of this script is written to a file named "preVTT_onliner_file.txt". If you open this file in a text editor, its content looks like this:

          00:00:00.000 --x 00:00:01.589 herzlich willkommen bei future history
          00:00:01.589 --x 00:00:03.720 is dem podcast zur erweiterung unserer
          00:00:03.720 --x 00:00:06.150 vorstellung von zukunft meines tieren
          00:00:06.150 --x 00:00:07.890 groß und ich spreche heute mit daniel
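The script above relies on every section having exactly one line of text, and it would also drop a text line that consists of digits only. As a hedged alternative, here is a minimal sketch that treats each blank-line-separated section as one record, so sections with several text lines are joined as well. The function name `srt_to_prevtt` is hypothetical and not part of the original workflow. The sketch assumes the file has no byte-order mark.

```shell
# srt_to_prevtt: hypothetical helper, reads SRT on stdin and writes
# one pre-VTT line per section on stdout.
srt_to_prevtt() {
  tr -d '\r' |
  awk '
    BEGIN { RS = ""; FS = "\n" }        # one record per section, one field per line
    {
      start = 1
      if ($1 ~ /^[0-9]+$/) start = 2    # skip the section number, if present
      ts = $start
      gsub(/,/, ".", ts)                # decimal comma becomes a dot
      sub(/>/, "x", ts)                 # temporary replacement of the arrow head
      text = ""
      for (i = start + 1; i <= NF; i++) # join all text lines, each followed by a space
        text = text $i " "
      print ts " " text
    }'
}

# Usage: srt_to_prevtt < infile.srt > preVTT_onliner_file.txt
```

Because only the timestamp line is rewritten, commas inside the spoken text remain untouched.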

This simple re-formatting exercise already saves thousands of keystrokes, but we can still do one better.

Capitalizing German Nouns

The major drawback of the intermediate result achieved so far is that all nouns start with a lower-case letter. As an episode may comprise thousands of nouns, correcting them by hand is a tedious affair for the human editor. So, what can we do to fix this?

The method is too involved to be elaborated here, so we just outline the process. Again, scripting comes to the rescue. We automatically create a list of all the words occurring in the episode, pipe it through the UNIX spell checker aspell to get a shortlist of candidates for nouns, and create a first, "safe" set of words to be capitalized (those that differ from a dictionary entry only in having a lower-case rather than an upper-case first character). Not all nouns will be found this way, due to ambiguities and the incompleteness of the spell checker's dictionary. Hence, two additional, semi-manual steps are needed to obtain a reasonably complete list capable of correcting well over 90% of all nouns. The result now looks like this:

          00:00:00.000 --x 00:00:01.589 herzlich willkommen bei Future history
          00:00:01.589 --x 00:00:03.720 is dem Podcast zur Erweiterung unserer
          00:00:03.720 --x 00:00:06.150 Vorstellung von Zukunft meines tieren
          00:00:06.150 --x 00:00:07.890 groß und ich spreche heute mit Daniel

Not perfect, but good enough to be handed to a human editor to fix the remaining spelling mistakes, add punctuation, and correct ungrammatical sections (occasionally, speakers lose track of how they started their sentence).
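To make the outline above a little more tangible, here is a rough sketch of how the "safe" first pass could look. It is one possible reading of the description, not the scripts actually used in the project. The function names, the two-column mapping format ("podcast Podcast") and the use of temporary files are assumptions. It requires aspell with a German dictionary installed. Words beginning with an umlaut may need a locale-aware awk to be capitalized correctly.

```shell
# safe_noun_candidates: hypothetical helper. Reads pre-VTT lines on stdin and
# prints "lowercase Capitalized" pairs for words that aspell rejects as
# written but accepts once the first letter is capitalized.
safe_noun_candidates() {
  tmp=$(mktemp -d) || return 1
  # Distinct words of the text part (fields 4 onward of each line).
  cut -d' ' -f4- | tr -s ' ' '\n' | grep -v '^$' | sort -u > "$tmp/words"
  # Words unknown to the German dictionary in their current spelling.
  aspell --lang=de --encoding=utf-8 list < "$tmp/words" | sort -u > "$tmp/unknown"
  # Pair every unknown word with its capitalized form.
  awk '{ print $0, toupper(substr($0, 1, 1)) substr($0, 2) }' \
    "$tmp/unknown" > "$tmp/pairs"
  # Capitalized forms that aspell still rejects are not safe.
  cut -d' ' -f2 "$tmp/pairs" |
    aspell --lang=de --encoding=utf-8 list | sort -u > "$tmp/rejected"
  awk -v rej="$tmp/rejected" '
    BEGIN { while ((getline w < rej) > 0) bad[w] = 1 }
    !($2 in bad)' "$tmp/pairs"
  rm -rf "$tmp"
}

# apply_capitalization: hypothetical helper. $1 is a mapping file as printed
# by safe_noun_candidates; pre-VTT lines are read on stdin.
apply_capitalization() {
  awk -v map="$1" '
    BEGIN {
      while ((getline line < map) > 0) { split(line, p, " "); cap[p[1]] = p[2] }
    }
    {
      # Fields 1 to 3 hold the timestamps and are left alone.
      for (i = 4; i <= NF; i++) if ($i in cap) $i = cap[$i]
      $1 = $1            # normalize spacing on every line
      print $0 " "       # keep the trailing space used for search-replace
    }'
}

# Usage:
#   safe_noun_candidates < preVTT_onliner_file.txt > nouns.map
#   apply_capitalization nouns.map < preVTT_onliner_file.txt > capitalized.txt
```

Keeping the mapping in a plain file means the semi-manual steps can simply consist of editing that file before it is applied.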

Editing machine-generated Subtitles

We do not strive for verbatim transcriptions that include every uttered 'um', 'ehm' or oral glitch. The result should be "easy on the eye" when following the written text in parallel to the audio. That being said: wherever possible, the text should be comprehensible without going back to the audio, while still conveying some impression of the "speaking situation".

E.g., if a speaker pauses to think about a question, or to collect his or her thoughts, a '...' symbol should be inserted. If there is a jump from one thought to the next, insert a double hyphen -- . If an explanatory thought is inserted in the middle of another thought, enclose it in double hyphens -- like this -- or in brackets (like this). Many speakers tend to start their sentences with a conjunction ("and", "so", "but", "because", ... resp. "und", "also", "aber", "weil" ...). Whether or not these are to be removed in the transcription is left to your good judgement. As a rule of thumb, most occurrences of "und" at the start of a sentence can be removed without any semantic impact. For "also" or "weil", this is not so clear-cut. Also: if you decide to keep them as the first word in a sentence, consider adding a colon.

Ground Rules:

  1. Do not try to 'wing it', that is, do not edit the text without listening, in parallel, to the audio! The speech-to-text algorithms are still far from perfect. Occasionally, they come up with truly ludicrous suggestions that can easily be recognised. Other imperfections are more subtle; they simply can't be spotted by looking at the text and making an educated guess (e.g. based on the textual context, which might equally have been mis-transcribed by the algorithms!). Hence: first listen, then write.
  2. Please do not change anything in the time-stamp information on the left-hand side of the text. Further processing relies on its consistency (continuity, i.e., no undefined intervals, no additional characters, no empty lines).
  3. Please do not insert or remove any lines by pressing <Enter> or <Backspace>. All changes should be carried out within lines that already exist and have proper timestamps.
  4. You can transcribe or edit an episode using the web-based audio player embedded in the episode page. However, it can be advantageous to first download the audio file for the episode and use a local audio player of your choice. E.g., I am using VLC and Audacity, since these players allow for finer intervals (1, 5, 10 sec) when stepping forwards and backwards. In contrast, the player embedded in the Future Histories website has fixed stepping intervals of 30 sec forwards and 15 sec backwards. When transcribing, you may also want to lower the playback speed to 50%. When doing the final editing or QA, you may prefer to increase it to 150%.
  5. At particular positions in the text, the transcriber may feel that, in a text-only version, this would be the right place for starting a new paragraph. Whenever this happens, please insert the symbol '-p-'. It will be interpreted accordingly when producing an HTML full-text transcription from the subtitles. As for when to start a new paragraph: just follow your instinct. In general, a sufficient number of paragraphs is a good thing, as the text becomes more structured and readability is improved.
  6. There may be expressions, sentences or sections that appear to be particularly relevant for the podcast, e.g. because they express a specific idea, present a concise problem statement, a good illustration, or simply "the beef" of a longish statement. These sections may be surrounded by the editorial markers '@- ' and ' -@', or a "personalized" variant thereof. Example: @dk- This is a short section enclosed in "personalized" markers of editor 'dk'. -dk@ Annotating the podcast transcript while editing is a welcome help for those wanting to produce a summary or a table of contents for the episode at a later stage.
  7. Note: The timing information tends to be correct at and near the beginning of the episode, but may drift progressively towards the end (from some 100 ms up to 5 seconds or more, depending on the length of the audio). Only earlier episodes from the first series are affected; in more recent episodes, the time-stamps are correct.
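Rules 2 and 3 lend themselves to an automated check before a file goes to review. The following is a minimal sketch, assuming the one-line pre-VTT format shown earlier and assuming that "continuity" means each line starts exactly where the previous one ended. The function name `check_prevtt` is hypothetical.

```shell
# check_prevtt: hypothetical helper. Reads pre-VTT lines on stdin, reports
# damaged timestamps, empty lines and gaps, and returns non-zero if any
# problem was found.
check_prevtt() {
  awk '
    BEGIN {
      bad = 0
      ts = "^[0-9][0-9]:[0-9][0-9]:[0-9][0-9][.][0-9][0-9][0-9]$"
    }
    {
      ok = ($1 ~ ts) && ($2 == "--x") && ($3 ~ ts)
      if (!ok) {
        # Empty lines end up here too, since they have no fields at all.
        printf "line %d: missing or malformed timestamp\n", NR
        bad = 1
      } else if (NR > 1 && prev_ok && $1 != prev_end) {
        printf "line %d: starts at %s, previous line ended at %s\n", NR, $1, prev_end
        bad = 1
      }
      prev_end = $3
      prev_ok = ok
    }
    END { exit bad }'
}

# Usage: check_prevtt < preVTT_onliner_file.txt && echo "timestamps look consistent"
```

The check looks only at the first three fields of each line, so it does not interfere with editorial markers such as '-p-' in the text.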

Review and Quality Assurance

Even with the best will in the world, the result of the first human pass over the machine-generated transcript is unlikely to be perfect. This is where editorial review and QA come into play. Their main purpose is to catch mistakes that escaped the human transcribers and their spell-checking equipment. They also ensure that the subtitles actually match what was said in the audio. If you are the review editor or QA controller, your OK means that audio and text correspond and that the editing rules (spelled out above) have been honoured.

Generating and Deploying the Subtitles

Review and quality assurance are performed on the "one-liner" pre-VTT data format. This is then translated into proper WebVTT format with the required prefix, the expected line discipline and UTF-8 encoding. The result of this process is checked for syntactic correctness in a dedicated testing environment prior to shipping it to Jan, the Future Histories producer. He is the one who will finally link the subtitles with the corresponding episode. Feel free to examine this example of the end result:

  • Click on the second icon on the left near the lower edge of the player to enable the subtitles.
  • See what happens if you toggle the Stop/Follow Text button.
  • See what happens if you left-click on any section in the subtitles.
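As for the translation from the one-line pre-VTT format into WebVTT mentioned above, the core of it can be sketched as follows. This is an illustration only, not the project's actual post-processing. The function name `prevtt_to_vtt` is hypothetical. Handling of the editorial markers ('-p-', '@- ... -@') and escaping of special characters are deliberately left out. The input is assumed to be UTF-8 already.

```shell
# prevtt_to_vtt: hypothetical helper. Reads pre-VTT lines on stdin and
# writes a WebVTT document on stdout.
prevtt_to_vtt() {
  printf 'WEBVTT\n'                  # required file prefix
  awk '
    {
      text = ""
      for (i = 4; i <= NF; i++)      # re-join the words with single spaces
        text = text (i > 4 ? " " : "") $i
      # Blank line, timing line with the arrow restored, then the text.
      printf "\n%s --> %s\n%s\n", $1, $3, text
    }'
}

# Usage: prevtt_to_vtt < reviewed_preVTT.txt > episode.vtt
```

The file names in the usage line are placeholders. The dots in the timestamps were already put in place during pre-processing, so only the arrow needs to be restored here.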