Podcasts are a rapidly growing audio-only medium that involves new patterns of usage and new communicative conventions, and that motivates research in many new directions. To facilitate such research, we present the Spotify Podcast Dataset, with over 200,000 podcast episodes, more than 100,000 hours of speech, and more than 1 billion transcribed words in English and Portuguese.
The English-language dataset of about 100,000 episodes was created in 2020 for use in the TREC Podcasts Track shared tasks. Participants were asked to work on two tasks focusing on understanding podcast content and enhancing the search functionality for podcast data. In 2021 we released this dataset more widely to facilitate research on podcasts through the lens of speech and audio technology, natural language processing, information retrieval, and linguistics. In 2022, we added a Portuguese section of approximately equal size.
The episodes span a variety of lengths, topics, styles, and qualities. Episodes were sampled from both professional and amateur podcasts, ranging from material produced in a studio with dedicated equipment by trained professionals to material self-published from a phone app; quality varies with the professionalism and equipment of the creator. Audio quality, topical content, and conversational format all vary over a wide range. The episodes include scripted and unscripted monologues, interviews, conversations, and debates, as well as clips of other non-speech audio material, some familiar from other published material and some novel, with new emerging conventions.
We are happy to share the dataset, for non-commercial use only, with any research group that has an academic affiliation and academic address, or a track record of relevant research publications. Student class projects are also welcome if the results are reported publicly.
Use this form to request the dataset.
After approval, we will send you a data sharing agreement to sign; once it is signed, we will send you an access link for downloading the data. Approval of requests is a manual process, so please allow at least two weeks to hear back from us.
The English-language dataset consists of 105,360 episodes from different podcast shows published on the Spotify platform between January 1, 2019 and March 1, 2020, which works out to about 50,000 hours of recorded speech and over 600 million transcribed words. In 2022 we added a Portuguese section of 123,054 episodes published between September 9, 2019 and March 31, 2022, with more than 76,000 hours of recorded speech. The dataset now consists of over 200,000 episodes. The episodes were randomly selected from our catalog and were constrained to be mostly speech (as opposed to music or ambient sound).
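As a rough sanity check, the totals above imply average episode lengths. The sketch below uses only the rounded figures quoted in this section ("about 50,000 hours", "over 600 million words", "more than 76,000 hours"), so the derived averages are estimates rather than official statistics; the variable and function names are illustrative.

```python
# Back-of-the-envelope averages from the rounded totals quoted above.
# The hour and word counts are approximations ("about", "over", "more than"),
# so the derived averages are rough estimates, not official statistics.

english = {"episodes": 105_360, "hours": 50_000, "words": 600_000_000}
portuguese = {"episodes": 123_054, "hours": 76_000}


def avg_minutes_per_episode(section):
    """Average episode length in minutes for one language section."""
    return section["hours"] * 60 / section["episodes"]


total_episodes = english["episodes"] + portuguese["episodes"]
en_minutes = avg_minutes_per_episode(english)
pt_minutes = avg_minutes_per_episode(portuguese)
en_words_per_episode = english["words"] / english["episodes"]

print(f"Total episodes: {total_episodes:,}")
print(f"English: ~{en_minutes:.1f} min/episode, ~{en_words_per_episode:,.0f} words/episode")
print(f"Portuguese: ~{pt_minutes:.1f} min/episode")
```

Under these rounded inputs, the average episode comes out at a little under half an hour for English and somewhat over half an hour for Portuguese.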
Each of the episodes in the dataset includes an audio file, a text transcript, and some associated metadata. Note that the data does not include listening data, streaming statistics, or other user or usage-related data.
Component | English | Portuguese |
---|---|---|
Metadata | 150 MB | 26 MB |
Audio | 2 TB | 2 TB |
Transcripts | 13 GB | 75 GB |
Audio features | | – |
- OpenSmile | 75 GB | – |
- Yamnet vectors | 400 GB | – |
- Yamnet events | 60 GB | – |
Pyserini index | 4 GB | – |
The dataset is in English and in Portuguese, with some items in other languages that have leaked into it. We hope to release successive multilingual versions of the dataset in the future.
The dataset contains no information on searching, listening, recommendation, or other data based on audience behaviour.
We defined two tasks for participants in the 2020 and 2021 TREC Podcasts Track.
Given an arbitrary keyword query, retrieve the jump-in point for relevant segments of podcast episodes. The best result would be a segment with very relevant content, which is also a good jump-in point for the user to start listening. We added some non-topical ranking criteria for the 2021 edition.
Given a podcast episode with its audio and transcription, return a short snippet capturing the most important information in the content.
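The first task above can be illustrated with a toy sketch. It assumes a transcript is available as a list of (start time in seconds, word) pairs and that candidate segments are fixed-length windows; both are assumptions made for illustration, not a description of the dataset's transcript format or of the official track setup. Scoring by raw query-term counts is a stand-in for a real retrieval model, and the names `segment_transcript` and `rank_jump_in_points` are hypothetical.

```python
from collections import Counter


def segment_transcript(timed_words, window_seconds=120.0):
    """Group (start_time, word) pairs into fixed-length windows.

    Returns a dict mapping each window's start time to its list of words.
    The 120-second default is an arbitrary choice for this sketch.
    """
    segments = {}
    for start_time, word in timed_words:
        window_start = (start_time // window_seconds) * window_seconds
        segments.setdefault(window_start, []).append(word.lower())
    return segments


def rank_jump_in_points(timed_words, query, window_seconds=120.0):
    """Rank window start times by how often query terms occur in the window."""
    query_terms = set(query.lower().split())
    scored = []
    for window_start, words in segment_transcript(timed_words, window_seconds).items():
        counts = Counter(words)
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, window_start))
    # Highest score first; earlier windows win ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [window_start for _, window_start in scored]


# Toy transcript (hypothetical data): (start time in seconds, word).
toy_transcript = [
    (0.0, "welcome"), (1.0, "to"), (2.0, "the"), (3.0, "show"),
    (130.0, "today"), (131.0, "we"), (132.0, "discuss"), (133.0, "coffee"),
    (134.0, "roasting"), (250.0, "coffee"), (251.0, "prices"),
]

print(rank_jump_in_points(toy_transcript, "coffee roasting"))
```

Each returned value is a candidate jump-in time in seconds, best match first. A realistic system would also need to judge whether a segment start is a natural place to begin listening, which this sketch ignores.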
More information about the TREC Podcasts Track
All RSS headers and audio are supplied by creators, and Spotify does not claim responsibility for the correctness of those data fields. All transcripts are generated using state-of-the-art commercial automatic speech recognition and represent the audio content with a somewhat predictable level of errors. The language identification is based on creator-supplied data as well as automatic identification of the language in the episode description; in spite of this, we have found a number of non-English and non-Portuguese podcasts in the dataset. This level of inaccuracy reflects the linguistic richness of the data and the multilingual reality of podcasts in general, and is to some extent a feature, not a bug, of the dataset.
The dataset is intended for speech, NLP, and information retrieval researchers who want to develop novel models on previously inaccessible streams of data. Also, any researchers interested in podcasts!
The TREC track was a collaboration between Spotify, NIST (the National Institute of Standards and Technology), CLARIN, Dublin City University, and TREC (the Text Retrieval Conference). All organising parties contributed to the task definition, the annotation standards, and the evaluation metrics. Spotify supplied the data. TREC supplied the infrastructure for participants to join the competition, submit their entries, and publish their system descriptions, and organizes a conference in November where participants share their results. NIST supplied the expert human annotators who judged the participants’ entries according to Spotify’s annotation guidelines and metrics.
The previous Spoken Document Retrieval task at TREC
Contact the organizers: podcasts-challenge-organizers@spotify.com
When referring to the data, please cite the following papers:
Ann Clifton, Sravana Reddy, Yongze Yu, Aasish Pappu, Rezvaneh Rezapour, Hamed Bonab, Maria Eskevich, Gareth Jones, Jussi Karlgren, Ben Carterette, and Rosie Jones. 2020. “100,000 Podcasts: A Spoken English Document Corpus”. In Proceedings of the 28th International Conference on Computational Linguistics (COLING)
@inproceedings{clifton2020100,
title={100,000 podcasts: A spoken {E}nglish document corpus},
author={Clifton, Ann and Reddy, Sravana and Yu, Yongze and Pappu, Aasish and Rezapour, Rezvaneh and Bonab, Hamed and Eskevich, Maria and Jones, Gareth and Karlgren, Jussi and Carterette, Ben and Jones, Rosie},
booktitle={Proceedings of the 28th International Conference on Computational Linguistics (COLING)},
year={2020}
}
Edgar Tanaka, Ann Clifton, Joana Correia, Sharmista Jat, Rosie Jones, Jussi Karlgren, Winstead Zhu. 2022. “Cem Mil Podcasts: A Spoken Portuguese Document Corpus”. arXiv preprint 2209.11871. arXiv
@article{tanaka2022cemmil,
title={Cem {M}il {P}odcasts: {A} spoken {P}ortuguese document corpus},
author={Edgar Tanaka and Ann Clifton and Joana Correia and Sharmista Jat and Rosie Jones and Jussi Karlgren and Winstead Zhu},
journal={arXiv preprint 2209.11871},
year={2022}
}
If you have published material or analyses on the Podcast Dataset, get in touch to have it included in this bibliography!
TREC 2020: Rosie Jones, Ben Carterette, Ann Clifton, Maria Eskevich, Gareth JF Jones, Jussi Karlgren, Aasish Pappu, Sravana Reddy, and Yongze Yu. “TREC 2020 Podcasts Track Overview.” In the Twenty-Ninth Text REtrieval Conference Proceedings (TREC 2020). NIST Special Publication 1266. Ellen M. Voorhees and Angela Ellis (editors). 2021. arXiv
TREC 2021: Jussi Karlgren, Rosie Jones, Ben Carterette, Ann Clifton, Edgar Tanaka, Maria Eskevich, Gareth J. F. Jones, and Sravana Reddy. “TREC 2021 Podcasts Track Overview.” In the Thirtieth Text REtrieval Conference Proceedings (TREC 2021). NIST Special Publication 500-335. Ian Soboroff and Angela Ellis (editors). 2022. NIST
Workshop Report on Use Cases for Human Interaction with Audio Collections: Gareth J. F. Jones, Maria Eskevich, Ben Carterette, Joana Correia, Rosie Jones, Jussi Karlgren, Ian Soboroff. “Report on the 1st Workshop on Audio Collection Human Interaction (AudioCHI 2022) at CHIIR 2022”. SIGIR Forum 56:1. 2022. SIGIR Forum
Information Retrieval Challenges for Podcasts: Rosie Jones, Hamed Zamani, Markus Schedl, Ching-Wei Chen, Sravana Reddy, Ann Clifton, Jussi Karlgren, Helia Hashemi, Aasish Pappu, Zahra Nazari, Longqi Yang, Oguz Semerci, Hugues Bouchard, and Ben Carterette. “Current challenges and future directions in podcast information access.” In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2021. ACM DL
Podcast episode relevance in a search setting: Ben Carterette, Rosie Jones, Gareth F. Jones, Maria Eskevich, Sravana Reddy, Ann Clifton, Yongze Yu, Jussi Karlgren, and Ian Soboroff. “Podcast metadata and content: Episode relevance and attractiveness in ad hoc search.” In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2021. ACM DL
Listener engagement: Sravana Reddy, Mariya Lazarova, Yongze Yu, and Rosie Jones. “Modeling Language Usage and Listener Engagement in Podcasts”. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL & IJCNLP). 2021. ACL Anthology
Quality of summaries: Rezvaneh Rezapour, Sravana Reddy, Rosie Jones, and Ian Soboroff. “What Makes a Good Podcast Summary?” In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2022. ACM DL
Identifying structure in podcast episodes: Sravana Reddy, Yongze Yu, Aasish Pappu, Aswin Sivaraman, Rezvaneh Rezapour, and Rosie Jones. “Detecting Extraneous Content in Podcasts.” In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL). 2021. ACL Anthology
How podcast language is different from that in other collections: Karlgren, Jussi. “Lexical variation in English language podcasts, editorial media, and social media.” Northern European Journal of Language Technology 8, no. 1 (2022). NEJLT @ LiU