NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics

David Robinson, Marius Miron, Masato Hagiwara, Olivier Pietquin

Earth Species Project

Abstract

Large language models (LLMs) prompted with text and audio represent the state of the art in various auditory tasks, including speech, music, and general audio, showing emergent abilities on unseen tasks. However, these capabilities have yet to be fully demonstrated in bioacoustics tasks, such as detecting animal vocalizations in large recordings, classifying rare and endangered species, and labeling context and behavior—tasks that are crucial for conservation, biodiversity monitoring, and the study of animal behavior. In this work, we present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our carefully curated training dataset comprises text-audio pairs spanning a diverse range of bioacoustics, speech, and music data, designed to address the challenges posed by limited annotated datasets in the field. We demonstrate successful transfer of learned representations from music and speech to bioacoustics, and our model shows promising generalization to unseen taxa and tasks. Importantly, we test NatureLM-audio on a novel benchmark (BEANS-Zero) and it sets the new state of the art (SotA) on several bioacoustics tasks, including zero-shot classification of unseen species. To advance bioacoustics research, we also open-source the code for generating training and benchmark data, as well as for training the model.

Overview

Demo Video

Examples

Each example lists the prompt given with the input audio, then the system prediction followed by the gold label.

Dataset: esc50

Prompt: The objective is to classify the sound into one of the following categories: dog, rooster, pig, cow, frog, cat, hen, insects, sheep, crow, rain, sea_waves, crackling_fire, crickets, chirping_birds, water_drops, wind, pouring_water, toilet_flush, thunderstorm, crying_baby, sneezing, clapping, breathing, coughing, footsteps, laughing, brushing_teeth, snoring, drinking_sipping, door_wood_knock, mouse_click, keyboard_typing, door_wood_creaks, can_opening, washing_machine, vacuum_cleaner, clock_alarm, clock_tick, glass_breaking, helicopter, chainsaw, siren, car_horn, engine, train, church_bells, airplane, fireworks, hand_saw

Example 1
  System Prediction: dog
  Gold Label: dog

Example 2
  System Prediction: crying_baby
  Gold Label: cat
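
The closed-set prompt above is a fixed instruction followed by a comma-separated list of candidate labels. A minimal sketch of how such a prompt could be assembled and how a prediction could be compared with its gold label is shown below. The function names and the case-insensitive exact-match rule are assumptions made for illustration; this page does not state how the authors score predictions.

```python
def build_classification_prompt(labels):
    """Join candidate labels into a closed-set classification prompt."""
    return (
        "The objective is to classify the sound into one of the following "
        "categories: " + ", ".join(labels)
    )


def normalize(label):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(label.strip().lower().split())


def is_correct(prediction, gold):
    """Exact match after normalization (an assumed scoring rule)."""
    return normalize(prediction) == normalize(gold)


def accuracy(pairs):
    """Fraction of (prediction, gold) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("accuracy() needs at least one example")
    return sum(is_correct(p, g) for p, g in pairs) / len(pairs)


# The two esc50 examples shown on this page: one hit, one miss.
esc50_examples = [("dog", "dog"), ("crying_baby", "cat")]
print(build_classification_prompt(["dog", "rooster", "pig"]))
print(accuracy(esc50_examples))  # 0.5
```

Two examples are far too few to estimate accuracy; the call only illustrates the bookkeeping.
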

Dataset: watkins

Prompt: What is the common name for the focal species in the audio?

Example 1
  System Prediction: Humpback Whale
  Gold Label: Humpback Whale

Example 2
  System Prediction: Walrus
  Gold Label: Walrus

Example 3
  System Prediction: Spinner Dolphin
  Gold Label: Pantropical Spotted Dolphin

Dataset: cbi

Prompt: What is the common name for the focal species in the audio?

Example 1
  System Prediction: Greater Yellowlegs
  Gold Label: Greater Yellowlegs

Example 2
  System Prediction: Wood Duck
  Gold Label: Blue-winged Teal

Dataset: humbugdb

Prompt: What is the common name for the focal species in the audio?

Example 1
  System Prediction: culex pipiens complex
  Gold Label: culex pipiens complex

Example 2
  System Prediction: others
  Gold Label: non-mosquito

Example 3
  System Prediction: non-mosquito
  Gold Label: an dirus

Dataset: dcase

Prompt: What are the common names for the species in the audio, if any?

Example 1
  System Prediction: Gray-cheeked Thrush
  Gold Label: Gray-cheeked Thrush

Example 2
  System Prediction: None
  Gold Label: None

Example 3
  System Prediction: None
  Gold Label: Meerkat close call

Dataset: enabirds

Prompt: What are the common names for the species in the audio, if any?

Example 1
  System Prediction: Black-throated Green Warbler
  Gold Label: Black-throated Green Warbler, Eastern Towhee

Example 2
  System Prediction: None
  Gold Label: Kirtland's Warbler, American Crow

Example 3
  System Prediction: Black-and-white Warbler
  Gold Label: None
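
In the enabirds examples, both the prediction and the gold label can name several species or none at all, so comparing them as sets is more informative than a single string match. The sketch below shows one common way to do this, with per-example precision, recall, and F1. The helper names, the treatment of "None" as the empty set, and the choice of metric are assumptions for illustration, not the authors' stated evaluation protocol.

```python
def parse_species(answer):
    """Split a comma-separated answer into a set of lowercase names.

    The literal answer "None" is treated as the empty set (an assumption).
    """
    names = {part.strip().lower() for part in answer.split(",")}
    names.discard("")
    names.discard("none")
    return names


def precision_recall_f1(prediction, gold):
    """Set-based precision, recall, and F1 for one multi-label example."""
    pred, ref = parse_species(prediction), parse_species(gold)
    if not pred and not ref:
        # Both agree that nothing is present: count as a perfect match.
        return 1.0, 1.0, 1.0
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# First enabirds example: one of the two gold species is recovered.
p, r, f = precision_recall_f1(
    "Black-throated Green Warbler",
    "Black-throated Green Warbler, Eastern Towhee",
)
print(p, r, round(f, 3))  # 1.0 0.5 0.667
```

Under these conventions the other two enabirds examples (a missed detection and a false detection) both score an F1 of 0.
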

Dataset: hiceas

Prompt: Which of these, if any, are present in the audio recording? Minke whale, None.

Example 1
  System Prediction: Minke whale
  Gold Label: Minke whale

Example 2
  System Prediction: None
  Gold Label: None

Example 3
  System Prediction: Minke whale
  Gold Label: None
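
With a single target class plus "None", the hiceas task reduces to binary detection, and the three examples above correspond to a true positive, a true negative, and a false positive. The sketch below labels each (prediction, gold) pair with its outcome. It assumes exactly one target class and that "None" marks absence; the function name is hypothetical and the approach is an illustration rather than the authors' evaluation code.

```python
def detection_outcome(prediction, gold, negative="none"):
    """Classify one binary detection result.

    Assumes a single target class, so any answer other than `negative`
    counts as a positive.
    """
    predicted_present = prediction.strip().lower() != negative
    actually_present = gold.strip().lower() != negative
    if predicted_present and actually_present:
        return "true positive"
    if not predicted_present and not actually_present:
        return "true negative"
    if predicted_present:
        return "false positive"
    return "false negative"


hiceas_examples = [
    ("Minke whale", "Minke whale"),
    ("None", "None"),
    ("Minke whale", "None"),
]
for prediction, gold in hiceas_examples:
    print(prediction, "/", gold, "->", detection_outcome(prediction, gold))
```

For prompts with several target classes, such as hainan-gibbons, this helper would need to compare class names as well as presence.
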

Dataset: rfcx

Prompt: What are the common names for the species in the audio, if any?

Example 1
  System Prediction: Red-legged thrush
  Gold Label: Red-legged thrush

Example 2
  System Prediction: Puerto Rican bullfinch
  Gold Label: Puerto Rican bullfinch

Example 3
  System Prediction: Common coqui
  Gold Label: None

Dataset: hainan-gibbons

Prompt: Which of these, if any, are present in the audio recording? Single pulse gibbon call, Multiple pulse gibbon call, Gibbon duet, None.

Example 1
  System Prediction: Multiple pulse gibbon call
  Gold Label: Multiple pulse gibbon call

Example 2
  System Prediction: Single pulse gibbon call
  Gold Label: Multiple pulse gibbon call

Example 3
  System Prediction: Gibbon duet
  Gold Label: None

Dataset: unseen-cmn

Prompt: What is the common name for the focal species in the audio?

Example 1
  System Prediction: Spectacled Tetraka
  Gold Label: Spectacled Tetraka

Example 2
  System Prediction: Dusky White-eye
  Gold Label: Dusky White-eye

Example 3
  System Prediction: Pacific Robin
  Gold Label: Fire-tailed Sunbird

Dataset: unseen-sci

Prompt: What is the scientific name for the focal species in the audio?

Example 1
  System Prediction: tauraco fischeri
  Gold Label: tauraco fischeri

Example 2
  System Prediction: larvivora cyane
  Gold Label: larvivora cyane

Example 3
  System Prediction: Nisaetus kelaarti
  Gold Label: Nisaetus philippensis

Dataset: lifestage

Prompt: What is the life stage of the focal species in the audio?

Example 1
  System Prediction: juvenile
  Gold Label: juvenile

Example 2
  System Prediction: adult
  Gold Label: adult

Example 3
  System Prediction: juvenile
  Gold Label: nestling

Dataset: call-type

Prompt: What type of vocalization is heard from the focal species in the audio? Answer with either 'call' or 'song'.

Example 1
  System Prediction: call
  Gold Label: call

Example 2
  System Prediction: song
  Gold Label: song

Example 3
  System Prediction: call
  Gold Label: song

Dataset: captioning

Prompt: Caption the audio, using the common name for any animal species.

Example 1
  System Prediction: Call of a new zealand bellbird with background sounds from new zealand falcon.
  Gold Label: The common evening song of a Mainland New Zealand Bellbird.

Example 2
  System Prediction: Cajun Chorus Frog
  Gold Label: The sound of Squirrel Treefrog after a rain.

Dataset: zf-nbirds

Prompt: How many birds are in the audio? Choose between 1, 2, 3 or 4.

Example 1
  System Prediction: 1
  Gold Label: 1

Example 2
  System Prediction: 4
  Gold Label: 4

Example 3
  System Prediction: 2
  Gold Label: 3

BibTeX Citation
@misc{robinson2024naturelm-audio,
    title={NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics},
    author={David Robinson and Marius Miron and Masato Hagiwara and Olivier Pietquin},
    year={2024},
    eprint={2411.07186},
    archivePrefix={arXiv},
    primaryClass={cs.SD},
    url={https://arxiv.org/abs/2411.07186}
}