Datasets

Training Data

NatureLM-audio-training is a large and diverse audio-language dataset for training bioacoustic models that answer natural language queries about a reference audio recording. For example, given an in-the-wild recording of a bird, a relevant query might be “What is the common name for the focal species in the audio?”, to which an audio-language model trained on this dataset may respond “Common yellowthroat”.

It consists of over 26 million audio-text pairs drawn from sources covering animal vocalizations, insect sounds, human speech, music, and environmental sounds.

The training dataset is publicly available on Hugging Face:

NatureLM-audio Training Dataset

Data Sources

Task	Dataset	Hours	Samples
CAP	WavCaps (Mei et al., 2023)	7,568	402k
CAP	AudioCaps (Kim et al., 2019)	145	52k
CLS	NSynth (Engel et al., 2017)	442	300k
CLS	LibriSpeechTTS (Zen et al., 2019), VCTK (Yamagishi et al., 2019)	689	337k
CAP	Clotho (Drossos et al., 2020)	25	4k
CLS, DET, CAP	Xeno-canto (Vellinga & Planque, 2015)	10,416	607k
CLS, DET, CAP	iNaturalist	1,539	320k
CLS, DET, CAP	Watkins (Sayigh et al., 2016)	27	15k
CLS, DET	Animal Sound Archive (Museum für Naturkunde Berlin)	78	16k
DET	Sapsucker Woods (Kahl et al., 2022)	285	342k
CLS, DET	Barkley Canyon (Kanes, 2021)	876	309k
CLS	UrbanSound (Salamon & Jacoby, 2014)	10	2k

CLS = classification · DET = detection · CAP = captioning
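
To get a feel for how the sources are weighted, the hours column can be tallied directly. The following is a minimal sketch that uses only the figures from the table; the dictionary keys are shortened labels chosen for illustration, not official identifiers.

```python
# Hours per source, copied from the Data Sources table.
HOURS = {
    "WavCaps": 7568,
    "AudioCaps": 145,
    "NSynth": 442,
    "LibriSpeechTTS + VCTK": 689,
    "Clotho": 25,
    "Xeno-canto": 10416,
    "iNaturalist": 1539,
    "Watkins": 27,
    "Animal Sound Archive": 78,
    "Sapsucker Woods": 285,
    "Barkley Canyon": 876,
    "UrbanSound": 10,
}

total_hours = sum(HOURS.values())


def share(source):
    """Fraction of the listed hours contributed by one source."""
    return HOURS[source] / total_hours


# Print sources from the largest to the smallest number of hours.
for name, hours in sorted(HOURS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24}{hours:>7,} h  {share(name):6.1%}")
```

By these figures the table lists 22,100 hours in total, of which Xeno-canto contributes roughly 47%.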

Task Categories

The dataset covers 44 task types, grouped into the following categories:

Species identification — common name, scientific name, and taxonomic classification at species, genus, family, and order levels, both open-ended and from a candidate list.

Detection — presence/absence detection with easy, hard, and random negative sampling strategies.

Call type & life stage — classifying vocalization type (song, call, alarm call) and the life stage of the vocalizing individual.

Audio captioning — generating natural language descriptions using common or scientific names, in simple and structured formats.

Question answering — general audio QA (via WavCaps), animal-specific instruction following, and speaker counting for speech samples.

Dataset Fields

Each example includes:

Field	Description
`audio`	Raw audio (FLAC/WAV, resampled to 16 kHz)
`instruction`	Natural language task prompt
`output`	Expected model response
`task`	Task category (one of 44 types)
`source_dataset`	Origin dataset
`metadata`	JSON with taxonomic info (class, order, family, genus, species), recordist, and source URL
`license`	Recording license (`CC BY-NC`, `free for personal/academic uses`, or `unknown`)
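
The fields above are enough to build simple selection rules. The sketch below shows one possible predicate that keeps examples of a single task and taxonomic class with a known license. The records are hypothetical stand-ins that mimic the documented fields, and the choice of which licenses to allow is an assumption made for illustration.

```python
import json

# License strings listed in the field table; this set is an example policy only.
ALLOWED_LICENSES = {"CC BY-NC", "free for personal/academic uses"}


def keep(example, task="taxonomic-classification", taxon_class="Aves"):
    """Return True for examples of one task, one taxonomic class, and a known license."""
    if example["task"] != task or example["license"] not in ALLOWED_LICENSES:
        return False
    # `metadata` is stored as a JSON string, so decode it before reading keys.
    meta = json.loads(example["metadata"])
    return meta.get("class") == taxon_class


# Hypothetical records that mimic the documented fields (audio omitted).
records = [
    {"task": "taxonomic-classification", "license": "CC BY-NC",
     "metadata": json.dumps({"class": "Aves", "genus": "Atlapetes"})},
    {"task": "taxonomic-classification", "license": "unknown",
     "metadata": json.dumps({"class": "Aves"})},
    {"task": "taxonomic-classification", "license": "CC BY-NC",
     "metadata": json.dumps({"class": "Mammalia"})},
]

selected = [r for r in records if keep(r)]
print(len(selected))
```

A predicate of this shape could also be passed to the `filter` method of a Hugging Face `Dataset`, for example `dataset.filter(keep)`.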

Dataset Composition

The dataset contains a total of 26,440,512 samples organized in shards of 2,500 examples each. An `annotations.jsonl` file is provided alongside the dataset; it contains taxonomic information for each sample, including family, genus, species, common name, and other metadata. This file can be used to query and filter the dataset to create custom data mixes. The `id` field in the dataset matches the `id` column in `annotations.jsonl`, and the `shard_id` column indicates which shard the sample belongs to.
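
As an illustration of building a custom mix, the sketch below groups the `id` values of one taxonomic family by `shard_id`, so that only the relevant shards need to be read. It assumes each line of `annotations.jsonl` is a JSON object with `id`, `shard_id`, and `family` keys; the sample rows are hypothetical, and the real id format and value types may differ. The shard count is derived from the figures above on the assumption that every shard except the last is full.

```python
import json
import math
from collections import defaultdict

TOTAL_SAMPLES = 26_440_512
SHARD_SIZE = 2_500
# Shards implied by the stated totals, assuming only the last shard is partial.
n_shards = math.ceil(TOTAL_SAMPLES / SHARD_SIZE)


def ids_by_shard(lines, family):
    """Group the ids of samples from one taxonomic family by their shard_id."""
    wanted = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        if row.get("family") == family:
            wanted[row["shard_id"]].append(row["id"])
    return dict(wanted)


# Hypothetical rows standing in for annotations.jsonl.
SAMPLE = """\
{"id": "a1", "shard_id": 0, "family": "Passerellidae"}
{"id": "b2", "shard_id": 0, "family": "Parulidae"}
{"id": "c3", "shard_id": 7, "family": "Passerellidae"}
"""

print(n_shards)
print(ids_by_shard(SAMPLE.splitlines(), "Passerellidae"))
```

With the real file, an open file handle can be passed in place of `SAMPLE.splitlines()`, since the function only iterates over lines.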

Although diverse, the dataset leans toward bird species in its taxonomic coverage.

Usage

from datasets import load_dataset

dataset = load_dataset("EarthSpeciesProject/NatureLM-audio-training", split="train")
print(dataset)
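
With more than 26 million samples, downloading the full split before looking at anything may be impractical. One option is the `streaming=True` argument of `load_dataset`, which returns an iterable dataset that yields examples on demand. The sketch below is one way to peek at a few examples; the helper names `open_stream` and `peek` are illustrative and not part of any library.

```python
from itertools import islice

REPO_ID = "EarthSpeciesProject/NatureLM-audio-training"


def open_stream(repo_id=REPO_ID, split="train"):
    """Open the split in streaming mode instead of downloading it up front."""
    # Imported lazily so `peek` can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def peek(examples, n=3, fields=("task", "instruction", "output")):
    """Return the selected text fields of the first n examples of any iterable."""
    return [{key: example[key] for key in fields} for example in islice(examples, n)]


# Example usage (requires network access and the `datasets` package):
# for row in peek(open_stream()):
#     print(row)
```

Because `peek` accepts any iterable of dictionaries, it works the same way on a streamed dataset and on a fully loaded one.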

Example Data

# Inspect the first example in the dataset
x = dataset[0]
audio = x["audio"]["array"]
print(audio.shape)
# (503808,)

print(x["instruction"])
# '<Audio><AudioHere></Audio> What is the taxonomic name of the focal species in the audio?'

print(x["output"])
# 'Chordata Aves Passeriformes Passerellidae Atlapetes fuscoolivaceus'

print(x["task"])
# 'taxonomic-classification'

import json
metadata = json.loads(x["metadata"])
print(metadata)
# {'recordist': 'Peter Boesman',
#  'url': 'https://xeno-canto.org/...',
#  'source': 'Xeno-canto',
#  'duration': 31.488,
#  'class': 'Aves',
#  'family': 'Passerellidae',
#  'genus': 'Atlapetes',
#  'species': 'Atlapetes fuscoolivaceus',
#  'phylum': 'Chordata',
#  'order': 'Passeriformes',
#  'subspecies': '',
#  'data_category': 'animal',
#  'text': None,
#  'sample_rate': 16000}
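
The values in this example can be cross-checked against each other: the array length divided by the sample rate should equal the `duration` stored in the metadata. The sketch below does that check and also separates the audio placeholder from the text query. It assumes the placeholder string is exactly the one shown in the example instruction, which may not hold for every sample.

```python
# Placeholder copied from the example instruction; assumed, not confirmed, to be uniform.
AUDIO_PLACEHOLDER = "<Audio><AudioHere></Audio>"


def duration_seconds(n_samples, sample_rate=16_000):
    """Clip length in seconds for a mono array with n_samples entries."""
    return n_samples / sample_rate


def split_instruction(instruction):
    """Return (has_audio_placeholder, text_query) for an instruction string."""
    has_audio = AUDIO_PLACEHOLDER in instruction
    query = instruction.replace(AUDIO_PLACEHOLDER, "").strip()
    return has_audio, query


# 503808 samples at 16 kHz matches the 'duration' value in the metadata.
print(duration_seconds(503_808))
# 31.488

print(split_instruction(
    "<Audio><AudioHere></Audio> What is the taxonomic name of the focal species in the audio?"
))
```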

Evaluation Data

NatureLM-audio is evaluated on BEANS-Zero, a novel benchmark designed to assess zero-shot generalization across bioacoustics tasks. It covers species classification, detection, life stage classification, call type classification, captioning, and counting — across a diverse set of taxa including birds, marine mammals, insects, and amphibians.

NatureLM-audio sets a new state of the art on several BEANS-Zero tasks, including zero-shot classification of unseen species.

Full benchmark data and evaluation code are available on Hugging Face:

BEANS-Zero Dataset · Evaluation Code