Events

Upcoming events

Event Information:

  • Thu 13 Jun 2024

    Giuseppe Magistro (UGent) - "Creating a corpus of web-data with Pyrlato. A demonstration"

    2:00 pm, Lokaal 3.30 - Camelot, Blandijn, Campus Boekentoren

    The use of corpora in acoustic analyses has become standard practice in phonetic and phonological research, offering high ecological validity (see e.g. Beckman, 1997; Warner, 2012; Tucker & Mukai, 2023 for a discussion on validity). However, compiling corpora and searching them for specific phenomena can be time- and resource-consuming. In response to this challenge, we developed a program named Pyrlato, which we aim to demonstrate. Pyrlato is a novel tool designed for creating corpora of real-world spoken data from the web. The tool extracts audio from YouTube videos and cuts out the desired segments, such as specific phonemes, syllables, or words. This enables the creation of corpora with tens of thousands of tokens within a few hours of computation. Pyrlato works across Dutch, English, French, German, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Turkish, Ukrainian, and Vietnamese, i.e. those languages for which YouTube provides automatic subtitles. The software searches the subtitles for the desired string and, upon finding a match, extracts the audio segment containing the string in .mp3 format (other formats are also possible).

    The demonstration will showcase Pyrlato's online version and its application in some case studies.
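
    The subtitle-search step described above can be illustrated with a minimal sketch. This is not Pyrlato's actual code: the function names, the WebVTT-style timestamp format, the sample subtitles, and the padding value are all assumptions made for illustration. The sketch only locates the time spans of subtitle cues that contain a target word; cutting the audio itself is left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # cue start time in seconds
    end: float    # cue end time in seconds
    text: str     # subtitle text shown during the cue


# Assumed timestamp shape: HH:MM:SS.mmm (a comma before the milliseconds is also accepted).
_TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[.,](\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert a timestamp such as '00:01:02.500' to seconds."""
    match = _TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognised timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_cues(subtitles: str) -> list[Cue]:
    """Split subtitle text into cues; blocks without a timing line are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", subtitles.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        for i, line in enumerate(lines):
            if "-->" in line:
                start, end = line.split("-->")
                end = end.split()[0]  # drop any settings after the end time
                text = " ".join(lines[i + 1:])
                cues.append(Cue(to_seconds(start), to_seconds(end), text))
                break
    return cues


def find_segments(cues: list[Cue], target: str, padding: float = 0.25):
    """Return (start, end) spans, in seconds, of cues containing the whole word."""
    pattern = re.compile(rf"\b{re.escape(target)}\b", re.IGNORECASE)
    return [
        (max(0.0, cue.start - padding), cue.end + padding)
        for cue in cues
        if pattern.search(cue.text)
    ]


# Hypothetical subtitle excerpt, invented for this example.
SAMPLE = """WEBVTT

00:00:01.000 --> 00:00:04.000
welcome back to the channel

00:00:04.000 --> 00:00:07.500
today we talk about water

00:00:07.500 --> 00:00:10.000
Water is everywhere
"""

if __name__ == "__main__":
    for start, end in find_segments(parse_cues(SAMPLE), "water"):
        print(f"cut audio from {start:.2f}s to {end:.2f}s")
```

    Each returned span could then be handed to an audio-cutting tool to produce one clip per token. Cue-level timing is coarse, so isolating a single word, syllable, or phoneme would need a finer alignment step than this sketch provides.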

    • Beckman, M.E. (1997). A typology of spontaneous speech. In Y. Sagisaka, N. Campbell, & N. Higuchi (Eds.), Computing Prosody: Computational Models for Processing Spontaneous Speech (pp. 7–26). Springer. http://dx.doi.org/10.1007/978-1-4612-2258-3_2.
    • Tucker, B.V., & Mukai, Y. (2023). Spontaneous speech. Cambridge University Press. http://doi.org/10.1017/9781108943024.
    • Warner, N. (2012). Methods for studying spontaneous speech. In A. Cohn, C. Fougeron, & M. Huffman (Eds.), The Oxford Handbook of Laboratory Phonology (pp. 621–633). Oxford University Press.

     


 

Past events

Event Information:

  • Thu 27 Oct 2016

    Parsed Historical Corpora Fest (2/2)

    11:00 am, Hiko b. 001

    Lecturer: Dr. Joel Wallenberg (Newcastle University)

    In the past, the fields of historical linguistics and synchronic syntax have both largely relied on qualitative data, e.g. the analysis of isolated examples, qualitative judgment data, etc. In the last few years, however, successes in variationist sociolinguistics, quantitative biology, and computer science have begun a revolution in the way both syntax and language change are studied: both fields have begun to use more quantitative data, especially in finding theoretically important statistical patterns in naturalistic production data. These fields have also combined with each other and with quantitative methods to give rise to a new field of quantitative diachronic comparative syntax. However, studying syntactic change in this mathematical way, particularly in a cross-linguistic, comparative approach, presents a number of interesting technical challenges. It requires measuring the frequencies of very abstract objects over very long periods of time, and in order to do this, we need a research infrastructure of diachronic parsed corpora (i.e. treebanks) drawn from a number of language histories. Building and analyzing these treebanks requires considerable technical skill, and a fair amount of collaboration between linguists with various computational, theoretical, and philological skills. Our workshops this week will help students with some background in syntax begin to search parsed corpora of this kind and interpret the results, and, if they'd like, help them contribute to the process of building more diachronic corpora of more languages.

    Tuesday, 25 October 2016, 14:00 - 16:30, PC-lokaal D (PlaRoz): working with parsed corpora, e.g. PPCHE and IcePaHC

    Thursday, 27 October 2016, 11:00 - 13:00, Hiko b. 001: building your own corpus and parsing your own data

    These workshops are an initiative of Prof. Dr. Miriam Bouzouita and Prof. Dr. Anne Breitbarth.
