David Robinson, Marius Miron, Masato Hagiwara, Olivier Pietquin
Large language models (LLMs) prompted with text and audio represent the state of the art in various auditory tasks, including speech, music, and general audio, showing emergent abilities on unseen tasks. However, these capabilities have yet to be fully demonstrated in bioacoustics tasks, such as detecting animal vocalizations in large recordings, classifying rare and endangered species, and labeling context and behavior—tasks that are crucial for conservation, biodiversity monitoring, and the study of animal behavior. In this work, we present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our carefully curated training dataset comprises text-audio pairs spanning a diverse range of bioacoustics, speech, and music data, designed to address the challenges posed by limited annotated datasets in the field. We demonstrate successful transfer of learned representations from music and speech to bioacoustics, and our model shows promising generalization to unseen taxa and tasks. Importantly, we test NatureLM-audio on a novel benchmark (BEANS-Zero) and it sets the new state of the art (SotA) on several bioacoustics tasks, including zero-shot classification of unseen species. To advance bioacoustics research, we also open-source the code for generating training and benchmark data, as well as for training the model.
Input Audio & Prompt | System Prediction | Gold Label |
---|---|---|
Dataset: esc50 | ||
Prompt: The objective is to classify the sound into one of the following categories: dog, rooster, pig, cow, frog, cat, hen, insects, sheep, crow, rain, sea_waves, crackling_fire, crickets, chirping_birds, water_drops, wind, pouring_water, toilet_flush, thunderstorm, crying_baby, sneezing, clapping, breathing, coughing, footsteps, laughing, brushing_teeth, snoring, drinking_sipping, door_wood_knock, mouse_click, keyboard_typing, door_wood_creaks, can_opening, washing_machine, vacuum_cleaner, clock_alarm, clock_tick, glass_breaking, helicopter, chainsaw, siren, car_horn, engine, train, church_bells, airplane, fireworks, hand_saw | dog | dog |
Prompt: The objective is to classify the sound into one of the following categories: dog, rooster, pig, cow, frog, cat, hen, insects, sheep, crow, rain, sea_waves, crackling_fire, crickets, chirping_birds, water_drops, wind, pouring_water, toilet_flush, thunderstorm, crying_baby, sneezing, clapping, breathing, coughing, footsteps, laughing, brushing_teeth, snoring, drinking_sipping, door_wood_knock, mouse_click, keyboard_typing, door_wood_creaks, can_opening, washing_machine, vacuum_cleaner, clock_alarm, clock_tick, glass_breaking, helicopter, chainsaw, siren, car_horn, engine, train, church_bells, airplane, fireworks, hand_saw | crying_baby | cat |
Dataset: watkins | ||
Prompt: What is the common name for the focal species in the audio? | Humpback Whale | Humpback Whale |
Prompt: What is the common name for the focal species in the audio? | Walrus | Walrus |
Prompt: What is the common name for the focal species in the audio? | Spinner Dolphin | Pantropical Spotted Dolphin |
Dataset: cbi | ||
Prompt: What is the common name for the focal species in the audio? | Greater Yellowlegs | Greater Yellowlegs |
Prompt: What is the common name for the focal species in the audio? | Wood Duck | Blue-winged Teal |
Dataset: humbugdb | ||
Prompt: What is the common name for the focal species in the audio? | culex pipiens complex | culex pipiens complex |
Prompt: What is the common name for the focal species in the audio? | others | non-mosquito |
Prompt: What is the common name for the focal species in the audio? | non-mosquito | an dirus |
Dataset: dcase | ||
Prompt: What are the common names for the species in the audio, if any? | Gray-cheeked Thrush | Gray-cheeked Thrush |
Prompt: What are the common names for the species in the audio, if any? | None | None |
Prompt: What are the common names for the species in the audio, if any? | None | Meerkat close call |
Dataset: enabirds | ||
Prompt: What are the common names for the species in the audio, if any? | Black-throated Green Warbler | Black-throated Green Warbler, Eastern Towhee |
Prompt: What are the common names for the species in the audio, if any? | None | Kirtland's Warbler, American Crow |
Prompt: What are the common names for the species in the audio, if any? | Black-and-white Warbler | None |
Dataset: hiceas | ||
Prompt: Which of these, if any, are present in the audio recording? Minke whale, None. | Minke whale | Minke whale |
Prompt: Which of these, if any, are present in the audio recording? Minke whale, None. | None | None |
Prompt: Which of these, if any, are present in the audio recording? Minke whale, None. | Minke whale | None |
Dataset: rfcx | ||
Prompt: What are the common names for the species in the audio, if any? | Red-legged thrush | Red-legged thrush |
Prompt: What are the common names for the species in the audio, if any? | Puerto Rican bullfinch | Puerto Rican bullfinch |
Prompt: What are the common names for the species in the audio, if any? | Common coqui | None |
Dataset: hainan-gibbons | ||
Prompt: Which of these, if any, are present in the audio recording? Single pulse gibbon call, Multiple pulse gibbon call, Gibbon duet, None. | Multiple pulse gibbon call | Multiple pulse gibbon call |
Prompt: Which of these, if any, are present in the audio recording? Single pulse gibbon call, Multiple pulse gibbon call, Gibbon duet, None. | Single pulse gibbon call | Multiple pulse gibbon call |
Prompt: Which of these, if any, are present in the audio recording? Single pulse gibbon call, Multiple pulse gibbon call, Gibbon duet, None. | Gibbon duet | None |
Dataset: unseen-cmn | ||
Prompt: What is the common name for the focal species in the audio? | Spectacled Tetraka | Spectacled Tetraka |
Prompt: What is the common name for the focal species in the audio? | Dusky White-eye | Dusky White-eye |
Prompt: What is the common name for the focal species in the audio? | Pacific Robin | Fire-tailed Sunbird |
Dataset: unseen-sci | ||
Prompt: What is the scientific name for the focal species in the audio? | tauraco fischeri | tauraco fischeri |
Prompt: What is the scientific name for the focal species in the audio? | larvivora cyane | larvivora cyane |
Prompt: What is the scientific name for the focal species in the audio? | Nisaetus kelaarti | Nisaetus philippensis |
Dataset: lifestage | ||
Prompt: What is the life stage of the focal species in the audio? | juvenile | juvenile |
Prompt: What is the life stage of the focal species in the audio? | adult | adult |
Prompt: What is the life stage of the focal species in the audio? | juvenile | nestling |
Dataset: call-type | ||
Prompt: What type of vocalization is heard from the focal species in the audio? Answer with either 'call' or 'song'. | call | call |
Prompt: What type of vocalization is heard from the focal species in the audio? Answer with either 'call' or 'song'. | song | song |
Prompt: What type of vocalization is heard from the focal species in the audio? Answer with either 'call' or 'song'. | call | song |
Dataset: captioning | ||
Prompt: Caption the audio, using the common name for any animal species. | Call of a new zealand bellbird with background sounds from new zealand falcon. | The common evening song of a Mainland New Zealand Bellbird. |
Prompt: Caption the audio, using the common name for any animal species. | Cajun Chorus Frog | The sound of Squirrel Treefrog after a rain. |
Dataset: zf-nbirds | ||
Prompt: How many birds are in the audio? Choose between 1, 2, 3 or 4. | 1 | 1 |
Prompt: How many birds are in the audio? Choose between 1, 2, 3 or 4. | 4 | 4 |
Prompt: How many birds are in the audio? Choose between 1, 2, 3 or 4. | 2 | 3 |
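
The prediction and gold-label pairs in the table above fall into two shapes: a single label (for example watkins or cbi) and a comma-separated list that may be "None" (for example dcase or enabirds). The sketch below shows one way such pairs could be compared. It is an illustration under our own assumptions, not the scoring used by BEANS-Zero or the paper: the helper names `normalize`, `to_label_set`, `exact_match`, and `set_f1` are hypothetical, and case-insensitive exact match plus set-level F1 are choices made here for simplicity.

```python
from __future__ import annotations


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so casing differences do not matter."""
    return " ".join(label.lower().split())


def to_label_set(answer: str) -> frozenset[str]:
    """Split a comma-separated answer into labels; 'None' maps to the empty set."""
    labels = {normalize(part) for part in answer.split(",")}
    labels.discard("")
    labels.discard("none")
    return frozenset(labels)


def exact_match(prediction: str, gold: str) -> bool:
    """Single-label comparison (assumed rule: normalized string equality)."""
    return normalize(prediction) == normalize(gold)


def set_f1(prediction: str, gold: str) -> float:
    """Multi-label comparison (assumed rule: F1 over the two label sets)."""
    pred, ref = to_label_set(prediction), to_label_set(gold)
    if not pred and not ref:
        return 1.0  # both sides say "None"
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (prediction, gold) pairs copied from the watkins rows of the table.
    watkins = [
        ("Humpback Whale", "Humpback Whale"),
        ("Walrus", "Walrus"),
        ("Spinner Dolphin", "Pantropical Spotted Dolphin"),
    ]
    hits = sum(exact_match(p, g) for p, g in watkins)
    print(f"watkins exact match: {hits}/{len(watkins)}")

    # (prediction, gold) pairs copied from the enabirds rows of the table.
    enabirds = [
        ("Black-throated Green Warbler",
         "Black-throated Green Warbler, Eastern Towhee"),
        ("None", "Kirtland's Warbler, American Crow"),
        ("Black-and-white Warbler", "None"),
    ]
    for p, g in enabirds:
        print(f"{p!r} vs {g!r}: F1 = {set_f1(p, g):.2f}")
```

Plain string matching is strict: the humbugdb row where the model answers "others" and the gold label is "non-mosquito" counts as a miss even though the two may describe the same thing. Free-text captions are not covered by this sketch.
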
@misc{robinson2024naturelm-audio,
  title={NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics},
  author={David Robinson and Marius Miron and Masato Hagiwara and Olivier Pietquin},
  year={2024},
  eprint={2411.07186},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2411.07186}
}