Open Source Audio DataSets

Code Wrestling
8 min read · Nov 19, 2021

Acted Emotional Speech Dynamic Database

Contains utterances of acted emotional speech in the Greek language.

It is divided into two main categories, one containing utterances of acted emotional speech and the other containing utterances of spontaneous emotional speech.

Contributed by: Abid Ali Awan
Original dataset

Arabic Speech Corpus

The corpus was recorded in south Levantine Arabic (Damascene accent) in a professional studio. Speech synthesized from this corpus produces a high-quality, natural voice.

Contributed by: Mert Bozkır
Original dataset

Att-hack: French Expressive Speech

This dataset contains acted expressive speech in French: 100 phrases, each with multiple versions/repetitions (3 to 5), spoken in four social attitudes: friendly, distant, dominant, and seductive.

Contributed by: Filipp Levikov
Original dataset

Audio MNIST

1. This repository contains code and data used in Interpreting and Explaining Deep Neural Networks for Classifying Audio Signals.
2. The dataset consists of 30,000 audio samples of spoken digits (0–9) from 60 different speakers.
3. Additionally, it includes audioMNIST_meta.txt, which provides meta information such as the gender and age of each speaker (a short loading sketch follows this list).
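
A minimal loading sketch, assuming a local clone of the repository, that filenames follow a "<digit>_<speaker>_<repetition>.wav" convention inside per-speaker folders, and that audioMNIST_meta.txt is a JSON dictionary keyed by speaker id; verify these details against the repository before relying on them.

```python
import json
from pathlib import Path

import soundfile as sf  # pip install soundfile

DATA_DIR = Path("AudioMNIST/data")  # assumed path to a local clone of the repository

# Assumption: audioMNIST_meta.txt is a JSON dictionary keyed by speaker id.
with open(DATA_DIR / "audioMNIST_meta.txt") as f:
    meta = json.load(f)

# Assumption: per-speaker subfolders with "<digit>_<speaker>_<repetition>.wav" filenames.
for wav_path in sorted(DATA_DIR.glob("*/*.wav"))[:5]:
    digit, speaker, _ = wav_path.stem.split("_")
    audio, sr = sf.read(wav_path)
    info = meta.get(speaker, {})
    print(f"digit={digit} speaker={speaker} sr={sr} Hz "
          f"gender={info.get('gender')} age={info.get('age')}")
```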

Contributed by: Mert Bozkır
Original dataset

BAVED: Basic Arabic Vocal Emotions

The Basic Arabic Vocal Emotions Dataset (BAVED) contains 7 Arabic words spoken at different levels of emotion and recorded in audio (.wav) format.

Each word is recorded in three levels of emotions, as follows:

Level 0 — The speaker is expressing a low level of emotion. This is similar to feeling tired or down.
Level 1 — The “standard” level where the speaker expresses neutral emotions.
Level 2 — The speaker is expressing a high level of positive or negative emotions.

Contributed by: Kinkusuma
Original dataset

Bird Audio Detection

1. It contains datasets collected in real live bio-acoustics monitoring projects and an objective, standardized evaluation framework.
2. Collection of over 7,000 excerpts from field recordings worldwide, gathered by the FreeSound project and then standardized for research.
3. This collection is very diverse in location and environment.

Contributed by: Abid Ali Awan
Original dataset

CHiME-Home

The CHiME-Home dataset is a collection of annotated domestic environment audio recordings. In the CHiME-Home dataset, 4-second audio chunks are each associated with multiple labels, based on a set of 7 labels associated with sound sources in the acoustic environment.

Contributed by: Abid Ali Awan
Original dataset

CMU-Multimodal SDK

1. CMU-MOSI is a standard benchmark for multimodal sentiment analysis. It is especially suited to train and test multimodal models since most of the newest works in multimodal temporal data use this dataset in their papers.
2. It holds 65 hours of annotated video from more than 1000 speakers, 250 topics, and 6 Emotions (happiness, sadness, anger, fear, disgust, surprise).

Contributed by: Michael Zhou
Original dataset

CREMA-D: Crowd-sourced Emotional Multimodal Actors

1. CREMA-D is a dataset of 7,442 original clips from 91 actors.
2. These clips were from 48 male and 43 female actors between the ages of 20 and 74, coming from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified).
3. Actors spoke from a selection of 12 sentences. The sentences were presented using six different emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) and four different emotion levels (Low, Medium, High and Unspecified).

Contributed by: Mert Bozkır
Original dataset

Children’s Song

1. Open-source dataset for singing voice research.
2. This dataset contains 50 Korean and 50 English songs sung by one Korean female professional pop singer. Each song is recorded in two separate keys resulting in a total of 200 audio recordings.

Contributed by: Kinkusuma
Original dataset

Device and Produced Speech

1. Collection of aligned versions of professionally produced studio speech recordings and recordings of the same speech on common consumer devices (tablet and smartphone) in real-world environments.
2. It has 15 versions of audio (3 professional versions and 12 consumer device/real-world environment combinations).
3. Each version consists of about 4 1/2 hours of data (about 14 minutes from each of 20 speakers).

Contributed by: Kinkusuma
Original dataset

Deeply Vocal Characterizer

The Deeply Vocal Characterizer dataset is a human nonverbal vocal sound dataset consisting of 56.7 hours of short clips from 1419 speakers.
Also, the dataset includes metadata such as age, sex, noise level, and quality of utterance.

Contributed by: Filipp Levikov
Original dataset

EMODB

1. EMODB is a database of German emotional speech.
2. Ten professional speakers (five males and five females) participated in data recording. The database contains a total of 535 utterances.
3. The EMODB database comprises seven emotions: anger, boredom, anxiety, happiness, sadness, disgust, and neutral. The data was recorded at a 48 kHz sampling rate and then down-sampled to 16 kHz.

Contributed by: Kinkusuma
Original dataset

EMOVO Corpus

The EMOVO Corpus is a database built from the voices of 6 actors who performed 14 sentences simulating six emotional states (disgust, fear, anger, joy, surprise, sadness) plus a neutral state.
These emotions are well known and appear in most of the literature on emotional speech.

Contributed by: Abid Ali Awan
Original dataset

ESC-50: Environmental Sound Classification

1. Labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.
2. The dataset consists of 5-second-long recordings organized into 50 semantic classes (with 40 examples per class), loosely arranged into 5 major categories (a short loading sketch follows the list):

Animals.
Natural soundscapes & water sounds.
Human, non-speech sounds.
Interior/domestic sounds.
Exterior/urban noises.
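
A short loading sketch, assuming a local copy of the ESC-50 repository; the meta/esc50.csv path and the filename, fold, target, and category columns referenced below reflect the repository layout at the time of writing.

```python
from pathlib import Path

import librosa  # pip install librosa
import pandas as pd

ESC50_ROOT = Path("ESC-50-master")  # assumed local copy of the repository

# Assumed metadata layout: meta/esc50.csv with filename, fold, target, category columns.
meta = pd.read_csv(ESC50_ROOT / "meta" / "esc50.csv")

# The 'fold' column (1-5) supports the 5-fold cross-validation split.
train, test = meta[meta["fold"] != 5], meta[meta["fold"] == 5]

# Load one clip and compute a log-mel spectrogram as a benchmark feature.
row = train.iloc[0]
y, sr = librosa.load(ESC50_ROOT / "audio" / row["filename"], sr=None)
log_mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
print(row["category"], row["target"], log_mel.shape)
```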

Contributed by: Kinkusuma
Original dataset

EmoSynth: Emotional Synthetic Audio

1. EmoSynth is a dataset of 144 audio files, approximately 5 seconds long and 430 KB in size, which 40 listeners have labeled for their perceived emotion regarding the dimensions of Valence and Arousal.
2. It has metadata about the classification of the audio based on the dimensions of Valence and Arousal.

Contributed by: Abid Ali Awan
Original dataset

Estonian Emotional Speech Corpus

The corpus contains 1,234 Estonian sentences that express anger, joy, and sadness or are neutral.

Contributed by: Abid Ali Awan
Original dataset

Flickr 8k Audio Caption Corpus

1. The Flickr 8k Audio Caption Corpus contains 40,000 spoken audio captions in .wav audio format, one for each caption included in the train, dev, and test splits in the original corpus.
2. The audio is sampled at 16000 Hz with 16-bit depth and stored in Microsoft WAVE audio format.
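
Those format claims are easy to verify with the standard library alone; the path below is a placeholder, not a real corpus entry.

```python
import wave

# Placeholder path; substitute any caption file from the corpus.
with wave.open("flickr_audio/wavs/example_caption.wav", "rb") as w:
    print("channels:", w.getnchannels())
    print("sample rate:", w.getframerate(), "Hz")       # expected: 16000
    print("bit depth:", 8 * w.getsampwidth(), "bits")   # expected: 16
    print("duration:", w.getnframes() / w.getframerate(), "s")
```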

Contributed by: Michael Zhou
Original dataset

Golos: Russian ASR

1. Golos is a Russian corpus suitable for speech research.
2. The dataset mainly consists of recorded audio files manually annotated on the crowd-sourcing platform.
3. The total duration of the audio is about 1240 hours.

Contributed by: Filipp Levikov
Original dataset

JL Corpus

1. Emotional speech in New Zealand English.
2. This corpus was constructed by maintaining an equal distribution of 4 long vowels.
3. The corpus has five secondary emotions along with five primary emotions.
4. Secondary emotions are important in Human-Robot Interaction (HRI), where the aim is to model natural conversations among humans and robots.

Contributed by: Hazalkl
Original dataset

LJ Speech

1. Consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books.
2. A transcription is provided for each clip.
3. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

Contributed by: Kinkusuma
Original dataset

MS SNSD

1. Large collection of clean speech files and various environmental noise files in .wav format sampled at 16 kHz.
2. It provides a recipe to mix clean speech and noise at various signal-to-noise ratio (SNR) conditions to generate a large noisy-speech dataset (a minimal mixing sketch follows this list).
3. The SNR conditions and the number of hours of data required can be configured depending on the application requirements.
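
The repository ships its own scripts for this; the sketch below is only a minimal illustration of the underlying SNR mixing, with placeholder file names and an assumption of mono recordings.

```python
import numpy as np
import soundfile as sf  # pip install soundfile

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then mix (mono signals)."""
    # Loop or truncate the noise to match the clean speech length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Placeholder file names standing in for entries from the clean-speech and noise folders.
clean, sr = sf.read("clean/speech_example.wav")
noise, _ = sf.read("noise/noise_example.wav")
for snr in (0, 5, 10, 20):  # SNR conditions are configurable, as the dataset intends
    sf.write(f"noisy_snr{snr}dB.wav", mix_at_snr(clean, noise, snr), sr)
```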

Contributed by: Hazalkl
Original dataset

Public Domain Sounds

A wide array of sounds that can be used for object detection research. The dataset is small (543 MB) and divided into subdirectories by format. The audio files vary in length from 5 seconds to 5 minutes.

Contributed by: Abid Ali Awan
Original dataset

RSC: sounds from RuneScape Classic

1. Extract RuneScape classic sounds from cache to wav (and vice versa).
2. Jagex used Sun's original .au sound format: headerless, 8-bit, u-law encoded PCM sampled at 8000 Hz.
3. This module can decompress original sounds from the sound archives into headered WAVs, and recompress (and resample) new WAVs back into archives (a decoding sketch follows this list).
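
The repository provides its own tooling; the sketch below only illustrates the format conversion described, turning headerless 8-bit u-law samples at 8000 Hz into a headered 16-bit WAV using the Python standard library. The input path is a placeholder.

```python
import audioop  # standard library (deprecated since Python 3.11, removed in 3.13)
import wave

# Placeholder path to a raw, headerless u-law sound extracted from the cache.
with open("sounds/example.au", "rb") as f:
    ulaw_bytes = f.read()

pcm16 = audioop.ulaw2lin(ulaw_bytes, 2)  # decode u-law to 16-bit linear PCM

with wave.open("example.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(8000)   # original 8000 Hz sample rate
    w.writeframes(pcm16)
```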

Contributed by: Hazalkl
Original dataset

Speech Accent Archive

This dataset contains 2140 speech samples, each from a different talker reading the same reading passage. Talkers come from 177 countries and have 214 different native languages. Each talker is speaking in English.

Contributed by: Kinkusuma
Original dataset

Speech Commands Dataset

The dataset (1.4 GB) has 65,000 one-second-long utterances of 30 short words by thousands of different people, contributed by members of the public through the AIY website. It is a set of one-second .wav audio files, each containing a single spoken English word (a loading sketch follows below).
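
One convenient way to load it is torchaudio's built-in dataset class; the snippet below assumes torchaudio is installed and that, per its documentation, url="speech_commands_v0.01" selects the 30-word release described here (v0.02 is a larger follow-up).

```python
import torchaudio

# Download and index the v0.01 (30-word) release of Speech Commands.
dataset = torchaudio.datasets.SPEECHCOMMANDS(
    root=".", url="speech_commands_v0.01", download=True
)

# Each item is (waveform, sample_rate, label, speaker_id, utterance_number).
waveform, sample_rate, label, speaker_id, utterance_number = dataset[0]
print(label, sample_rate, waveform.shape)  # one-second clips at 16 kHz
```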

Contributed by: Abid Ali Awan
Original dataset

TESS: Toronto Emotional Speech Set

Two actresses (aged 26 and 64 years) recited a set of 200 target words in the carrier phrase “Say the word _____,” and recordings were produced of the set depicting each of seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). There are a total of 2800 stimuli.

Contributed by: Hazalkl
Original dataset

URDU

1. The URDU dataset contains emotional utterances of Urdu speech gathered from Urdu talk shows.
2. It contains 400 utterances covering four basic emotions: Angry, Happy, Neutral, and Sad.
3. There are 38 speakers (27 male and 11 female). The data was collected from YouTube.

Contributed by: Abid Ali Awan
Original dataset

VIVAE: Variably Intense Vocalizations of Affect and Emotion

1. VIVAE consists of a set of human non-speech emotion vocalizations.
2. The full set, comprising 1085 audio files, features eleven speakers expressing three positive (achievement/triumph, sexual pleasure, and surprise) and three negative (anger, fear, and physical pain) affective states.
3. Each state is varied parametrically from low to peak emotion intensity.

Contributed by: Mert Bozkır
Original dataset

FSDD: Free Spoken Digit Dataset

A simple audio/speech dataset consisting of recordings of spoken digits in .wav files at 8 kHz. The recordings are trimmed so that they have near-minimal silence at the beginnings and ends.
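
A small loading sketch; the file name is a placeholder following the repository's assumed {digit}_{speaker}_{index}.wav naming convention.

```python
import librosa  # pip install librosa

# Placeholder file name in the assumed {digit}_{speaker}_{index}.wav convention.
y, sr = librosa.load("recordings/7_jackson_32.wav", sr=None)   # keep the native 8 kHz rate
y_trimmed, _ = librosa.effects.trim(y, top_db=30)              # strip any residual silence
print(sr, round(len(y) / sr, 2), round(len(y_trimmed) / sr, 2))
```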

Contributed by: Kinkusuma
Original dataset

LEGOv2 Corpus

1. This spoken dialogue corpus contains interactions captured from the CMU Let’s Go (LG) System.
2. It is based on raw log files from the LG system.
3. 347 dialogs with 9,083 system-user exchanges; emotions classified as garbage, non-angry, slightly angry, and very angry.

Contributed by: Kinkusuma
Original dataset

MUSDB18

Multi-track music dataset for music source separation. There are two versions of MUSDB18: the compressed version and the uncompressed version (HQ).

1. MUSDB18 — consists of a total of 150 full-track songs of different styles and includes both the stereo mixtures and the original sources, divided between a training subset and a test subset.

2. MUSDB18-HQ — the uncompressed version of MUSDB18. It contains the same 150 full-track songs, including the stereo mixtures and the original sources, divided between a training subset and a test subset.

Contributed by: Kinkusuma
Original dataset

Voice Gender

1. 7000+ unique speakers and utterances, 3683 males / 2312 females.
2. Consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.
3. Contains speech from speakers spanning a wide range of different ethnicities, accents, professions, and ages.

Contributed by: Abid Ali Awan
Original dataset
