David Robinson, Marius Miron, Masato Hagiwara, Olivier Pietquin
Large language models (LLMs) prompted with text and audio have achieved state-of-the-art performance across various auditory tasks, including speech, music, and general audio, showing emergent abilities on unseen tasks. However, their potential has yet to be fully demonstrated in bioacoustics tasks, such as detecting animal vocalizations in large recordings, classifying rare and endangered species, and labeling context and behavior. These tasks are crucial for conservation, biodiversity monitoring, and animal behavior studies. In this work, we present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our training dataset consists of carefully curated text-audio pairs spanning bioacoustics, speech, and music, designed to address the field's limited availability of annotated data. We demonstrate successful transfer of learned representations from music and speech to bioacoustics, and our model shows promising generalization to unseen taxa and tasks. We evaluate NatureLM-audio on a novel benchmark (BEANS-Zero), where it sets a new state of the art on several bioacoustics tasks, including zero-shot classification of unseen species. To advance bioacoustics research, we release our model weights and benchmark data, and open-source the code for model training and benchmark data generation.
Input Audio & Prompt | System Prediction | Gold Label |
---|---|---|
Dataset: esc50 | | |
Prompt: The objective is to classify the sound into one of the following categories: dog, rooster, pig, cow, frog, cat, hen, insects, sheep, crow, rain, sea_waves, crackling_fire, crickets, chirping_birds, water_drops, wind, pouring_water, toilet_flush, thunderstorm, crying_baby, sneezing, clapping, breathing, coughing, footsteps, laughing, brushing_teeth, snoring, drinking_sipping, door_wood_knock, mouse_click, keyboard_typing, door_wood_creaks, can_opening, washing_machine, vacuum_cleaner, clock_alarm, clock_tick, glass_breaking, helicopter, chainsaw, siren, car_horn, engine, train, church_bells, airplane, fireworks, hand_saw | dog | dog |
Prompt: The objective is to classify the sound into one of the following categories: dog, rooster, pig, cow, frog, cat, hen, insects, sheep, crow, rain, sea_waves, crackling_fire, crickets, chirping_birds, water_drops, wind, pouring_water, toilet_flush, thunderstorm, crying_baby, sneezing, clapping, breathing, coughing, footsteps, laughing, brushing_teeth, snoring, drinking_sipping, door_wood_knock, mouse_click, keyboard_typing, door_wood_creaks, can_opening, washing_machine, vacuum_cleaner, clock_alarm, clock_tick, glass_breaking, helicopter, chainsaw, siren, car_horn, engine, train, church_bells, airplane, fireworks, hand_saw | crying_baby | cat |
Dataset: watkins | | |
Prompt: What is the common name for the focal species in the audio? | Humpback Whale | Humpback Whale |
Prompt: What is the common name for the focal species in the audio? | Walrus | Walrus |
Prompt: What is the common name for the focal species in the audio? | Spinner Dolphin | Pantropical Spotted Dolphin |
Dataset: cbi | | |
Prompt: What is the common name for the focal species in the audio? | Greater Yellowlegs | Greater Yellowlegs |
Prompt: What is the common name for the focal species in the audio? | Wood Duck | Blue-winged Teal |
Dataset: humbugdb | | |
Prompt: What is the common name for the focal species in the audio? | culex pipiens complex | culex pipiens complex |
Prompt: What is the common name for the focal species in the audio? | others | non-mosquito |
Prompt: What is the common name for the focal species in the audio? | non-mosquito | an dirus |
Dataset: dcase | | |
Prompt: What are the common names for the species in the audio, if any? | Gray-cheeked Thrush | Gray-cheeked Thrush |
Prompt: What are the common names for the species in the audio, if any? | None | None |
Prompt: What are the common names for the species in the audio, if any? | None | Meerkat close call |
Dataset: enabirds | | |
Prompt: What are the common names for the species in the audio, if any? | Black-throated Green Warbler | Black-throated Green Warbler, Eastern Towhee |
Prompt: What are the common names for the species in the audio, if any? | None | Kirtland's Warbler, American Crow |
Prompt: What are the common names for the species in the audio, if any? | Black-and-white Warbler | None |
Dataset: hiceas | | |
Prompt: Which of these, if any, are present in the audio recording? Minke whale, None. | Minke whale | Minke whale |
Prompt: Which of these, if any, are present in the audio recording? Minke whale, None. | None | None |
Prompt: Which of these, if any, are present in the audio recording? Minke whale, None. | Minke whale | None |
Dataset: rfcx | | |
Prompt: What are the common names for the species in the audio, if any? | Red-legged thrush | Red-legged thrush |
Prompt: What are the common names for the species in the audio, if any? | Puerto Rican bullfinch | Puerto Rican bullfinch |
Prompt: What are the common names for the species in the audio, if any? | Common coqui | None |
Dataset: hainan-gibbons | | |
Prompt: Which of these, if any, are present in the audio recording? Single pulse gibbon call, Multiple pulse gibbon call, Gibbon duet, None. | Multiple pulse gibbon call | Multiple pulse gibbon call |
Prompt: Which of these, if any, are present in the audio recording? Single pulse gibbon call, Multiple pulse gibbon call, Gibbon duet, None. | Single pulse gibbon call | Multiple pulse gibbon call |
Prompt: Which of these, if any, are present in the audio recording? Single pulse gibbon call, Multiple pulse gibbon call, Gibbon duet, None. | Gibbon duet | None |
Dataset: unseen-cmn | | |
Prompt: What is the common name for the focal species in the audio? | Spectacled Tetraka | Spectacled Tetraka |
Prompt: What is the common name for the focal species in the audio? | Dusky White-eye | Dusky White-eye |
Prompt: What is the common name for the focal species in the audio? | Pacific Robin | Fire-tailed Sunbird |
Dataset: unseen-sci | | |
Prompt: What is the scientific name for the focal species in the audio? | tauraco fischeri | tauraco fischeri |
Prompt: What is the scientific name for the focal species in the audio? | larvivora cyane | larvivora cyane |
Prompt: What is the scientific name for the focal species in the audio? | Nisaetus kelaarti | Nisaetus philippensis |
Dataset: lifestage | | |
Prompt: What is the life stage of the focal species in the audio? | juvenile | juvenile |
Prompt: What is the life stage of the focal species in the audio? | adult | adult |
Prompt: What is the life stage of the focal species in the audio? | juvenile | nestling |
Dataset: call-type | | |
Prompt: What type of vocalization is heard from the focal species in the audio? Answer with either 'call' or 'song'. | call | call |
Prompt: What type of vocalization is heard from the focal species in the audio? Answer with either 'call' or 'song'. | song | song |
Prompt: What type of vocalization is heard from the focal species in the audio? Answer with either 'call' or 'song'. | call | song |
Dataset: captioning | | |
Prompt: Caption the audio, using the common name for any animal species. | Call of a new zealand bellbird with background sounds from new zealand falcon. | The common evening song of a Mainland New Zealand Bellbird. |
Prompt: Caption the audio, using the common name for any animal species. | Cajun Chorus Frog | The sound of Squirrel Treefrog after a rain. |
Dataset: zf-nbirds | | |
Prompt: How many birds are in the audio? Choose between 1, 2, 3 or 4. | 1 | 1 |
Prompt: How many birds are in the audio? Choose between 1, 2, 3 or 4. | 4 | 4 |
Prompt: How many birds are in the audio? Choose between 1, 2, 3 or 4. | 2 | 3 |
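
The prediction and gold-label columns above can be compared programmatically. The following is a minimal sketch of one way to do that, not the scoring used by BEANS-Zero: it assumes exact string match (after lowercasing and whitespace normalization) for single-label rows and a set-based F1 for rows whose cells hold comma-separated lists, with `None` treated as the empty set. The function names and the two small row lists are illustrative; the rows are copied from the watkins and enabirds examples in the table and are not benchmark results.

```python
def normalize(label):
    """Lowercase and collapse whitespace so minor formatting differences do not count as errors."""
    return " ".join(label.lower().split())


def parse_labels(cell):
    """Split a comma-separated cell into a set of labels; 'None' maps to the empty set."""
    labels = {normalize(part) for part in cell.split(",")}
    return {label for label in labels if label and label != "none"}


def exact_match_accuracy(pairs):
    """Fraction of (prediction, gold) pairs that are identical after normalization."""
    hits = sum(normalize(pred) == normalize(gold) for pred, gold in pairs)
    return hits / len(pairs)


def set_f1(pred_cell, gold_cell):
    """F1 between predicted and gold label sets (assumed metric, for illustration only)."""
    pred, gold = parse_labels(pred_cell), parse_labels(gold_cell)
    if not pred and not gold:
        return 1.0  # both say "None": treated here as a perfect match
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# (prediction, gold) pairs copied from the watkins rows of the table.
WATKINS_ROWS = [
    ("Humpback Whale", "Humpback Whale"),
    ("Walrus", "Walrus"),
    ("Spinner Dolphin", "Pantropical Spotted Dolphin"),
]

# (prediction, gold) pairs copied from the enabirds rows of the table.
ENABIRDS_ROWS = [
    ("Black-throated Green Warbler", "Black-throated Green Warbler, Eastern Towhee"),
    ("None", "Kirtland's Warbler, American Crow"),
    ("Black-and-white Warbler", "None"),
]

if __name__ == "__main__":
    print("watkins exact-match accuracy:", exact_match_accuracy(WATKINS_ROWS))
    for pred, gold in ENABIRDS_ROWS:
        print(f"enabirds F1 for {pred!r} vs {gold!r}:", set_f1(pred, gold))
```

Exact match is a strict choice: a prediction such as `others` against the gold label `non-mosquito` counts as wrong even when the two may be semantically close, and free-text captions would need a different metric altogether.
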
@inproceedings{robinson2025naturelm,
  title     = {NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics},
  author    = {David Robinson and Marius Miron and Masato Hagiwara and Olivier Pietquin},
  booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
  year      = {2025},
  url       = {https://openreview.net/forum?id=hJVdwBpWjt}
}