Create dataset spotify_podcast_dataset #102

@albertvillanova

  • uid: spotify_podcast_dataset
  • type: processed
  • description:
    • name: Spotify Podcast Dataset

    • description: Podcasts are a rapidly growing audio-only medium that involve new patterns of usage and new communicative conventions and motivate research in many new directions. To facilitate such research, we present the Spotify English-Language Podcast Dataset.

      This dataset consists of 100,000 episodes from different podcast shows on Spotify. The dataset is available for research purposes.

      The dataset was initially created for use in the TREC Podcasts Track shared tasks. Participants were asked to work on two tasks focusing on understanding podcast content, and enhancing the search functionality within podcasts.

      We are releasing this dataset more widely to facilitate research on podcasts through the lens of speech and audio technology, natural language processing, information retrieval, and linguistics. The dataset contains about 50,000 hours of audio, and over 600 million transcribed words. The episodes span a variety of lengths, topics, styles, and qualities.

    • homepage: https://podcastsdataset.byspotify.com/

    • validated: True

  • languages:
    • language_names:
      • English
    • language_comments:
    • language_locations:
      • Northern America
      • Europe
    • validated: False
  • custodian:
  • availability:
    • procurement:
    • licensing:
      • has_licenses: Yes
      • license_text:
      • license_properties:
        • research use
        • do not distribute
      • license_list:
    • pii:
      • has_pii: Yes
      • generic_pii_likely: very likely
      • generic_pii_list:
        • names
        • URLs
      • numeric_pii_likely: unlikely
      • numeric_pii_list:
        • telephone numbers
      • sensitive_pii_likely: very likely
      • sensitive_pii_list:
        • racial or ethnic origin
        • political opinions
        • religious or philosophical beliefs
      • no_pii_justification_class:
      • no_pii_justification_text:
    • validated: False
  • processed_from_primary:
    • from_primary: Taken from primary source
    • primary_availability: Yes - their documentation/homepage/description is available
    • primary_license: Yes - the dataset curators have obtained consent from the source material owners
    • primary_types:
      • podcasts
    • validated: False
    • from_primary_entries:
  • media:
    • category:
      • text
      • audiovisual
    • text_format:
      • .TXT
    • audiovisual_format:
      • .OGG
    • image_format:
    • database_format:
    • text_is_transcribed: Yes - audiovisual
    • instance_type: episode
    • instance_count: 10K<n<100K
    • instance_size: 100<n<10,000
    • validated: False
  • fname: spotify_podcast_dataset.json
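
As a rough sanity check on the figures quoted in the description (100,000 episodes, about 50,000 hours of audio, over 600 million transcribed words), the per-episode averages can be sketched as below. The figures are the approximate ones from the description, the function name is hypothetical, and reading `instance_size` as a word count per instance is an assumption, not something the entry states.

```python
# Approximate figures quoted in the dataset description above.
EPISODES = 100_000
AUDIO_HOURS = 50_000             # "about 50,000 hours of audio"
TRANSCRIBED_WORDS = 600_000_000  # "over 600 million transcribed words"


def per_episode_averages(episodes, audio_hours, words):
    """Return (average minutes of audio, average transcribed words) per episode."""
    minutes = audio_hours * 60 / episodes
    words_per_episode = words / episodes
    return minutes, words_per_episode


minutes, words_per_episode = per_episode_averages(
    EPISODES, AUDIO_HOURS, TRANSCRIBED_WORDS
)
print(f"~{minutes:.0f} minutes of audio per episode")
print(f"~{words_per_episode:,.0f} transcribed words per episode")

# Assumption: instance_size (100<n<10,000) counts words per episode.
in_instance_size_bin = 100 < words_per_episode < 10_000
print("average falls inside the instance_size bin:", in_instance_size_bin)
```

Under that assumption, the average episode sits inside the `instance_size` range recorded in the media section. Because the quoted totals are rounded, these averages are only indicative.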

Labels: data catalog (Gathering data from data sources)

Status: In Progress
