Spotify Podcasts Dataset

Podcasts are a rapidly growing audio-only medium that involves new patterns of usage and new communicative conventions, and that motivates research in many new directions. To facilitate such research, we present the Spotify Podcast Dataset, with over 200,000 podcast episodes, more than 100,000 hours of speech, and more than 1 billion transcribed words in English and Portuguese.

The English-language dataset of about 100,000 episodes was created in 2020 for use in the TREC Podcasts Track shared tasks. Participants were asked to work on two tasks focused on understanding podcast content and on enhancing search functionality for podcast data. In 2021 we released this dataset more widely to facilitate research on podcasts through the lens of speech and audio technology, natural language processing, information retrieval, and linguistics. In 2022, we added a Portuguese section of approximately equal size.

The episodes span a variety of lengths, topics, styles, and qualities. Episodes were sampled from both professional and amateur podcasts, ranging from material produced in a studio with dedicated equipment by trained professionals to material self-published from a phone app; quality varies with the professionalism and equipment of the creator. Audio quality, topical content, and conversational format all vary over a wide range. The episodes include scripted and unscripted monologues, interviews, conversations, and debates, as well as clips of non-speech audio material. Some of this material is familiar from other published media; some is novel, with newly emerging conventions.

GETTING ACCESS TO THE DATASET

We are happy to share the dataset, for non-commercial use only, with any research group that has an academic affiliation and academic address, or a track record of relevant research publications. Student class projects are also welcome if the results are reported publicly.

Use this form to request the dataset.

After approval, we will send you a data sharing agreement to sign; after signature we will send you an access link for downloading the data. Approval of requests is a manual process so please allow at least two weeks to hear back from us.

CONTENTS OF THE DATASET

The English-language dataset consists of 105,360 episodes from different podcast shows published on the Spotify platform between January 1, 2019 and March 1, 2020, which works out to about 50,000 hours of recorded speech and over 600 million transcribed words. In 2022 we added a Portuguese section of 123,054 episodes published between September 9, 2019 and March 31, 2022, comprising more than 76,000 hours of recorded speech. The dataset now consists of over 200,000 episodes. The episodes were randomly selected from our catalog and were constrained to be mostly speech (as opposed to music or ambient sound).

Each of the episodes in the dataset includes an audio file, a text transcript, and some associated metadata. Note that the data does not include listening data, streaming statistics, or other user or usage-related data.

Audio data

Episode RSS header files

Transcripts

Audio features

Estimated size of downloads

                    English    Portuguese
Metadata            150 MB     26 MB
Audio               2 TB       2 TB
Transcripts         13 GB      75 GB
Audio features
- OpenSmile         75 GB
- Yamnet vectors    400 GB
- Yamnet events     60 GB
Pyserini index      4 GB
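
For planning disk space, the figures in the table above can be added up. The following is a minimal sketch, not part of any dataset tooling: the dictionary keys and helper names are hypothetical, decimal units (1 TB = 1000 GB) are an assumption, and the audio-feature and index sizes are not assigned to a language because the table gives only a single figure for each.

```python
# Sizes copied from the table above. Decimal units are an assumption.
UNIT_GB = {"MB": 0.001, "GB": 1.0, "TB": 1000.0}

# Keys are hypothetical labels chosen for this sketch.
SIZES = {
    "metadata (English)": "150MB",
    "metadata (Portuguese)": "26MB",
    "audio (English)": "2TB",
    "audio (Portuguese)": "2TB",
    "transcripts (English)": "13GB",
    "transcripts (Portuguese)": "75GB",
    "audio features: OpenSmile": "75GB",
    "audio features: Yamnet vectors": "400GB",
    "audio features: Yamnet events": "60GB",
    "Pyserini index": "4 GB",
}


def to_gb(size: str) -> float:
    """Convert a size string such as '150MB' or '4 GB' to gigabytes."""
    text = size.replace(" ", "")
    value, unit = text[:-2], text[-2:]
    return float(value) * UNIT_GB[unit]


def total_gb(components) -> float:
    """Sum the sizes of the named components, in gigabytes."""
    return sum(to_gb(SIZES[name]) for name in components)


if __name__ == "__main__":
    text_only = ["transcripts (English)", "transcripts (Portuguese)"]
    print(f"Transcripts only: {total_gb(text_only):.0f} GB")
    print(f"Everything:       {total_gb(SIZES) / 1000:.1f} TB")
```

Under these assumptions the full download comes to roughly 4.6 TB, almost all of it audio, so groups working only with text can get by with a small fraction of that.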

FAQ

Is the dataset multilingual?

The dataset is in English and Portuguese, although some items in other languages have leaked into it. We hope to release successive multilingual versions of the dataset in the future.

Does the dataset contain user data?

No, the data set contains no information on searching, listening, recommendation or other data based on audience behaviour.

What were the TREC Podcasts Track Tasks?

We defined two tasks for participants in the 2020 and 2021 TREC Podcasts Track.

Task 1: Ad-hoc Segment Retrieval

Given an arbitrary keyword query, retrieve the jump-in point for relevant segments of podcast episodes. The best result would be a segment with very relevant content, which is also a good jump-in point for the user to start listening. We added some non-topical ranking criteria for the 2021 edition.
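
As an illustration of what a jump-in point is, the task can be sketched as ranking fixed-length windows of a time-stamped transcript. This is only a toy baseline and not the track's reference system: the input format (a list of `(start_time_seconds, word)` pairs), the 120-second window with a 60-second step, the term-count scoring, and all function names are assumptions made for this example.

```python
from collections import Counter


def make_segments(words, length=120.0, step=60.0):
    """Cut (start_time, word) pairs into overlapping fixed-length windows.

    Returns a list of (segment_start_seconds, lowercased_tokens).
    """
    if not words:
        return []
    last_time = words[-1][0]
    segments = []
    start = 0.0
    while start <= last_time:
        tokens = [w.lower() for t, w in words if start <= t < start + length]
        segments.append((start, tokens))
        start += step
    return segments


def rank_segments(words, query, top_k=3):
    """Return the start times of the top_k windows with the most query-term hits."""
    terms = query.lower().split()
    scored = []
    for start, tokens in make_segments(words):
        counts = Counter(tokens)
        score = sum(counts[term] for term in terms)
        if score > 0:
            scored.append((score, start))
    # Highest score first; earlier segments win ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [start for _, start in scored[:top_k]]


if __name__ == "__main__":
    transcript = [
        (0.0, "welcome"), (5.0, "to"), (10.0, "the"), (70.0, "coffee"),
        (75.0, "roasting"), (130.0, "coffee"), (200.0, "goodbye"),
    ]
    print(rank_segments(transcript, "coffee roasting"))
```

A real system would replace the term-count score with a proper retrieval model and would also need to judge whether a segment start is a natural place to begin listening, which this sketch ignores.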

Task 2: Summarization

Given a podcast episode with its audio and transcription, return a short snippet capturing the most important information in the content.
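
To make the task concrete, here is a minimal sketch of a frequency-based extractive baseline that works on the transcript alone. It is an illustrative assumption, not a method used or endorsed by the track: the stopword list, the sentence splitter, the scoring function, and the name `summarize` are all invented for this example, and the audio is not used at all.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list for the example.
STOPWORDS = {
    "a", "an", "and", "the", "is", "are", "of", "to", "in",
    "we", "my", "about", "it", "this", "that",
}


def content_words(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(transcript, max_sentences=2):
    """Pick the sentences whose content words are most frequent in the transcript."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", transcript) if s.strip()]
    freq = Counter(content_words(transcript))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    keep = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)


if __name__ == "__main__":
    text = (
        "Today we talk about sourdough bread. My cat is asleep. "
        "Sourdough bread needs a starter and time."
    )
    print(summarize(text, max_sentences=2))
```

Automatic transcripts often lack reliable punctuation, so sentence splitting of this kind is itself an assumption that may not hold on the actual data.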

More information about the TREC Podcasts Track

What if there are inaccuracies in the data?

All RSS headers and audio are supplied by creators, and Spotify does not claim responsibility for the correctness of those data fields. All transcripts are generated using state-of-the-art commercial automatic speech recognition and represent the audio content with a somewhat predictable level of errors. The language identification is based on creator-supplied data as well as automatic identification of language in the episode description; in spite of this, we have found a number of non-English and non-Portuguese podcasts in the dataset. This level of inaccuracy reflects the linguistic richness of the data and the multilingual reality of podcasts in general, and is to some extent a feature, not a bug, of the dataset.

Who should be excited by this dataset?

Speech, NLP and Information Retrieval researchers who want to develop novel models on previously inaccessible streams of data. Also, any researchers interested in podcasts!

Who organised the TREC shared task?

The TREC track was a collaboration between Spotify, NIST (the National Institute of Standards and Technology), CLARIN, Dublin City University, and TREC (the Text Retrieval Conference). All organising parties contributed to the task definition, the annotation standards, and the evaluation metrics. Spotify supplied the data. TREC supplied the infrastructure for participants to join the competition, submit their entries, and publish their system descriptions, and organises a conference in November where participants share their results. NIST supplied the expert human annotators who judge the participants’ entries according to Spotify’s annotation guidelines and metrics.

What are some helpful resources we can look at if we want to learn more?

The previous Spoken Document Retrieval task at TREC

Who can I reach out to if I have a question?

Contact the organizers: podcasts-challenge-organizers@spotify.com

Citing the dataset

When referring to the data, please cite the following papers:

English Language Dataset

Ann Clifton, Sravana Reddy, Yongze Yu, Aasish Pappu, Rezvaneh Rezapour, Hamed Bonab, Maria Eskevich, Gareth Jones, Jussi Karlgren, Ben Carterette, and Rosie Jones. 2020. “100,000 Podcasts: A Spoken English Document Corpus”. In Proceedings of the 28th International Conference on Computational Linguistics (COLING)

ACL Anthology

@inproceedings{clifton2020100,
  title={100,000 podcasts: A spoken {E}nglish document corpus},
  author={Clifton, Ann and Reddy, Sravana and Yu, Yongze and Pappu, Aasish and Rezapour, Rezvaneh and Bonab, Hamed and Eskevich, Maria and Jones, Gareth and Karlgren, Jussi and Carterette, Ben and Jones, Rosie},
  booktitle={Proceedings of the 28th International Conference on Computational Linguistics (COLING)},
  year={2020}
}

Portuguese Language Dataset

Edgar Tanaka, Ann Clifton, Joana Correia, Sharmista Jat, Rosie Jones, Jussi Karlgren, Winstead Zhu. 2022. “Cem Mil Podcasts: A Spoken Portuguese Document Corpus”. arXiv preprint 2209.11871. arXiv

@article{tanaka2022cemmil,
  title={Cem {M}il {P}odcasts: {A} spoken {P}ortuguese document corpus},
  author={Edgar Tanaka and Ann Clifton and Joana Correia and Sharmista Jat and Rosie Jones and Jussi Karlgren and Winstead Zhu},
  journal={arXiv preprint 2209.11871},
  year={2022}
}

Published Research on the Spotify Podcast Dataset

If you have published material or analyses on the Podcast Dataset, get in touch to have it included in this bibliography!

TREC Shared Tasks on Segment Retrieval and Summarisation

Data set enrichments with additional features

Challenges specific to podcasts

Characteristics of podcasts