
African Next Voices Transcription Training Workshop for Dholuo and Kalenjin Communities Held at Kisumu Hotel

Participants at the Common Voice Transcription Training Workshop at Kisumu Hotel.

The African Next Voices Project: Pilot Data Collection in Kenya, funded by the Bill and Melinda Gates Foundation, is making significant strides in collecting high-quality linguistic datasets in five African languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The initiative helps bridge the language gap in AI and speech technology by compiling both scripted and unscripted audio data. Having completed the scripted phase, the project is now conducting the unscripted phase, in which participants respond to textual, image, audio, and video prompts with natural speech in their native languages. This phase ensures the collection of diverse and representative speech patterns that are essential for developing inclusive voice-enabled technologies.

On March 27, 2024, a transcription workshop was held at Kisumu Hotel to train transcribers on the techniques required to process unscripted audio data with accuracy and consistency. The workshop focused on two languages, Dholuo and Kalenjin, and was guided by linguists and language leads. The transcribers were introduced to the African Next Voices (ANV) transcription methodology, which follows rigorous quality assurance loops to ensure that the final datasets meet high linguistic and technical standards.

 

Throughout the workshop, transcribers were trained on key transcription principles, such as maintaining verbatim accuracy by capturing speech exactly as spoken, including pauses, hesitations, and speech errors.

Dr. Lilian Wanzare, the Principal Investigator (PI) of the project, taking the participants through the training.

The Kalenjin transcription team consisted of native speakers from Kericho, Nandi, Bomet, Uasin Gishu, Nakuru, and Narok. These transcribers underwent intensive training on language-specific guidelines to ensure consistency in transcribing Kalenjin speech data. During the practical sessions, they engaged in real-time transcription exercises, applying the guidelines under the supervision of experienced language leads. Their participation in the workshop highlights the project’s commitment to leveraging regional linguistic expertise to develop high-quality AI training data.