Prompting Guide

This guide applies to NatureLM-audio v1.1, available through the Interactive Demo on Hugging Face Spaces.

This guide covers usage of NatureLM-audio for bioacoustic tasks, with a focus on how to prompt the model for best results.

Audio Format

The model operates on 16 kHz mono audio; input is resampled automatically. The model is trained to handle clips of up to 10 seconds in length. If you want the model to focus on only a certain section of audio, clip to that section in advance.
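
As a sketch of the expected input format, the helper below downmixes to mono, resamples to 16 kHz, and trims to a 10-second window starting at a chosen offset. The naive linear-interpolation resampling is for illustration only (use librosa or soxr in practice), and the function name and interface are ours, not part of NatureLM-audio:

```python
import numpy as np

TARGET_SR = 16_000   # sample rate the model operates on
MAX_SECONDS = 10     # clips longer than this should be pre-trimmed

def prepare_clip(samples: np.ndarray, sr: int, start_s: float = 0.0) -> np.ndarray:
    """Downmix to mono, resample to 16 kHz, and trim to at most 10 s.

    Uses naive linear-interpolation resampling for illustration;
    a real pipeline should use a proper resampler (librosa, soxr).
    """
    if samples.ndim == 2:  # (channels, n) or (n, channels)
        samples = samples.mean(axis=0 if samples.shape[0] <= 2 else 1)
    if sr != TARGET_SR:
        n_out = int(round(len(samples) * TARGET_SR / sr))
        x_old = np.linspace(0.0, 1.0, num=len(samples), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        samples = np.interp(x_new, x_old, samples)
    start = int(start_s * TARGET_SR)
    return samples[start : start + MAX_SECONDS * TARGET_SR]
```

Passing start_s lets you point the model at the section of interest, per the clipping advice above.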

Task Overview

Tasks are labeled either “Core” (producing consistent results in our evaluations) or “Experimental” (showing promise but meriting further evaluation). NatureLM-audio can also be tried on tasks beyond those covered in this guide, and on taxa beyond the training data; however, these should all be considered “Experimental” and not assumed to work out of the box.

| Task | Reliability | Sample Prompt |
| --- | --- | --- |
| Species Detection | Core | What are the common names for the species in the audio, if any? |
| Species Identification | Core | What species is vocalizing in this audio recording? Common name? |
| Species Identification with Context | Core | Given the context: '...', what is the common name for the focal species in the audio? |
| Species Identification from Option List | Core | Which of these is the focal species in the audio? Options: ... |
| Multiple Species Identification | Core | List the scientific names of all species vocalizing in this audio clip. |
| Taxonomy | Core | What is the genus of the focal species in the audio? |
| Call Type / Behavior | Core | What type of vocalization or call is this? |
| Life Stage | Core | Is the focal species an adult or juvenile? |
| Captioning | Core | Caption the audio, using common names for any animal species. |
| Combined Task / Multi-Turn | Core | What type of vocalization is it, and what is the life stage? |
| Environmental Sound Classification | Core | Which of these non-animal sounds are present in the recording? ... or None. |
| Taxon Presence | Core | Is there a bird vocalizing in this recording? Answer Yes or No. |
| Call Type Presence | Core | Is a [call type] present in this recording? Answer Yes or No. |
| Top-3 Species Identification | Experimental | What is the common name of the species vocalizing in this audio recording? Provide your top 3 predictions. |
| Habitat Inference | Experimental | Based on the sounds, what habitat or environment do you think this was recorded in? |
| Geographic Inference | Experimental | Based on the species you hear, what region of the world was this likely recorded in? |
| Structured JSON Output | Experimental | Identify this recording. Respond in JSON format: {"species": "...", "call_type": "..."} |
| Describe-Then-Identify | Experimental | First describe what you hear, then identify the species. |
| Frequency Range | Experimental | What is the overall frequency range of the vocalizations in this audio? |
| Species Count | Experimental | How many different species are vocalizing, and what are they? |
| Call Count | Experimental | How many individual vocalizations can you detect in this audio? |
| Individual Count | Experimental | How many individuals are vocalizing in the audio? Answer "one" or "more than one". |
| Temporal Order | Experimental | List the species in the order they first vocalize, using scientific names. |

Core Tasks

These tasks produce consistent results across our evaluations and cover the most common bioacoustic research use cases.

Species Detection

Detect which species are vocalizing without providing a list to choose from. The model can also answer "None".

Common name

What are the common names for the species in the audio, if any?

Scientific name

What are the scientific names for the species in the audio, if any?

Species Identification

Identify the single focal species in a recording.

Common name (recommended)

What species is vocalizing in this audio recording? Common name?

Scientific name

What is the scientific name of the focal species in the audio?

The model is very prompt-robust for species ID — terse prompts like Species? and verbose prompts all perform within ~1% of each other. Scientific-name prompts are slightly more accurate than common-name prompts because scientific names are more standardized.

Tips

  • With this prompt, the model defaults to returning a single species. For soundscapes with multiple species, use the multilabel prompts below.
  • The model is strongest on Western European and North American birds. Accuracy drops in species-rich tropical regions (Neotropics, Southeast Asia).
  • Weak taxonomic groups: hummingbirds/swifts, grouse/pheasants, and kingfishers are notably harder.

Species Identification with Context

Provide geographic or temporal metadata or free-text observations to help narrow identification.

Common name (recommended)

Given the context: '<context>', what is the common name for the focal species in the audio?

Scientific name

Given the context: '<context>', what is the scientific name for the focal species in the audio?

Replace <context> with whatever metadata you have, e.g. country: BR, coordinates: -23.5, -46.6 or recorded in temperate forest, June.
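
A small helper can assemble this prompt from whatever metadata fields you have. The function below is a hypothetical convenience of ours, not part of the model's API:

```python
def context_prompt(metadata: dict, name_type: str = "common") -> str:
    """Build a context-conditioned species-ID prompt from recording metadata.

    Metadata keys and values are free-form; they are joined into the
    quoted context string the prompt template expects.
    """
    context = ", ".join(f"{k}: {v}" for k, v in metadata.items())
    return (
        f"Given the context: '{context}', what is the {name_type} name "
        "for the focal species in the audio?"
    )
```

For example, context_prompt({"country": "BR", "coordinates": "-23.5, -46.6"}) reproduces the sample prompt above.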


System prompt

You can also place context in the system prompt:

System You are a bioacoustics expert. Recording context: country: BR, coordinates: -23.5, -46.6
User What is the common name of the species in this recording?

Tips

  • Context provides a small accuracy boost. The model already performs well without it, but it can be especially helpful for ambiguous or acoustically similar calls.
  • System-prompt context and in-prompt context perform similarly.

Species Identification from Option List

Classify the focal species from a provided option list.

Common name

Which of these is the focal species in the audio? Options: <species_choices>

Scientific name

Which of these species (scientific name) is in the audio? Options: <species_choices>

Replace <species_choices> with a comma-separated list, e.g. Turdus merula, Erithacus rubecula, Fringilla coelebs, Parus major, Phylloscopus collybita.

Tips

  • Multiple-choice accuracy is substantially higher than open-ended ID (~91% vs ~77%).
  • Up to 15 options work well; more are possible but need to be validated.
  • Genus-level and family-level option variants also exist and are even easier.
  • You can add context: Given the context '<context>', which of these is the focal species? <species_choices>

Multiple Species Identification

Classify one or more species in a recording. Unlike single-species classification, these prompts can return multiple species and can answer "None" when the correct answer is not present.

Listing prompt (recommended)

List the scientific names of all species vocalizing in this audio clip.
List the common names of all species vocalizing in this audio clip.

Option-list prompt

Which of these species, if any, are present in the recording? <species_choices>
Which of these species (scientific name), if any, are present? <species_choices>

Replace <species_choices> with a comma-separated list, e.g. Turdus merula, Erithacus rubecula, Fringilla coelebs, Parus major, Phylloscopus collybita.

Tips

  • Both prompt styles can return multiple species and can answer "None". The listing prompt tends to work better on multi-species soundscapes.
  • The "if any" phrasing in the option-list variant encourages the model to answer "None" when appropriate.
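
Since both prompt styles return a comma-separated list or "None", downstream code needs a small normalization step. A sketch, assuming the output formats described above:

```python
def parse_species_list(answer: str) -> list[str]:
    """Split a multilabel answer into species names; 'None' yields an empty list."""
    answer = answer.strip().rstrip(".")
    if answer.lower() == "none":
        return []
    return [s.strip() for s in answer.split(",") if s.strip()]
```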

Taxonomy

Classify at coarser taxonomic levels.

Genus

What is the genus of the focal species in the audio?

Family

What is the family of the focal species in the audio?

Order

What is the order of the focal species in the audio?

Full taxonomic name

What is the taxonomic name of the focal species in the audio?

Tips

  • Coarser levels (order, family) are more accurate than finer ones (genus, species), as expected.
  • Full taxonomic name returns a semicolon-separated string: Chordata; Aves; Passeriformes; Fringillidae; Fringilla coelebs. This can serve as a more informative alternative to scientific- or common-name classification, since the coarser levels may still be correct even when the species-level prediction is not.
  • All taxonomic prompts are highly prompt-robust.
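
The semicolon-separated taxonomy string can be mapped onto named ranks for downstream filtering. A sketch, assuming the five-field format shown above:

```python
RANKS = ("phylum", "class", "order", "family", "species")

def parse_taxonomy(answer: str) -> dict:
    """Map a semicolon-separated taxonomy string onto named ranks.

    Expects five fields, e.g.
    'Chordata; Aves; Passeriformes; Fringillidae; Fringilla coelebs'.
    """
    parts = [p.strip() for p in answer.split(";")]
    return dict(zip(RANKS, parts))
```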

Call Type / Behavior

Classify the vocalization type. For exploratory purposes, use an open-ended prompt. To classify into a specific set of calls for analysis, binary prompts may be most reliable — e.g. call vs. song, alarm call present or not present.

Note that this task currently focuses on call types which are established across species — for instance, “call”, “song”, or “alarm call”, but likely not specifically a named call for a specific species.

Open-ended call type

What type of vocalization or call is this?

Binary call vs. song

Is this a call or a song?

Species-conditioned

What type of call is the <species> making in this recording?

Tips

  • The binary prompt works well when the true label is strictly "call" or "song". On recordings with labels like "alarm call" or "flight call", the binary prompt is less reliable.
  • Providing a species hint improves call-type accuracy slightly.
  • The trained vocabulary includes: call, song, alarm call, flight call, begging call.

Life Stage

Determine whether the vocalizing animal is an adult or juvenile.

Open-ended

What life stage is the animal in this recording?

Binary (recommended)

Is the focal species an adult or juvenile?

Tips

  • The model has a strong adult bias: it identifies adults reliably but has low recall of juvenile vocalizations. The binary prompt is slightly more reliable than the open-ended one. Both have high precision for juveniles.
  • Use this as a soft filter rather than a definitive classifier for juveniles.

Captioning

Generate a natural-language description of the audio.

Bioacoustic caption (recommended)

Caption the audio, using common names for any animal species.

General audio caption

Caption this audio with a rich, detailed description. Avoid specific species names.

Tips

  • Captions are typically 1–2 sentences at the default merging_alpha. For richer descriptions, lower alpha toward 0.7 (see Model Behavior below).
  • The general caption deliberately avoids species names — use it when you want habitat/acoustic descriptions without identification.

Combined Task / Multi-Turn

Ask the model to identify species first, then follow up with behavior or life-stage questions. The model retains audio context across turns.

Species then behavior

User What species is vocalizing in this recording?
Model [species name]
User What type of vocalization is it producing?

Species then life stage

User What species is vocalizing in this recording?
Model [species name]
User What is the life stage of this individual?

Species then call type and life stage

User What species is vocalizing in this recording?
Model [species name]
User What type of vocalization is it, and what is the life stage?

Tips

  • Species accuracy in multi-turn is identical to single-turn prompts.
  • Behavior and life-stage follow-ups carry the same caveats as their standalone counterparts (call/song confusion, adult bias).
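
Multi-turn prompting maps naturally onto a chat-style message list. The role/content format below mirrors common chat APIs and is illustrative only; the actual NatureLM-audio interface may differ, and the assistant reply shown is a placeholder:

```python
def follow_up(history: list[dict], question: str) -> list[dict]:
    """Append a follow-up user turn to an existing conversation history."""
    return history + [{"role": "user", "content": question}]

# First turn: species ID; then a call-type follow-up on the same audio.
turns = [
    {"role": "user", "content": "What species is vocalizing in this recording?"},
    {"role": "assistant", "content": "Common Chaffinch"},  # placeholder model reply
]
turns = follow_up(turns, "What type of vocalization is it producing?")
```

Because the model retains audio context across turns, the follow-up question does not need to restate the species.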

Environmental Sound Classification

Classify non-animal environmental sounds from a provided option list.

Which of these non-animal sounds are present in the recording? <option_choices>. Answer with a comma-separated list using only the provided options, or None.

Tips

  • Trained on categories including alarms, engines, weather, household sounds, domestic animals, music, and signals/phones.
  • The model returns a comma-separated list of matches, or "None" if nothing matches.

Taxon Presence

Determine whether a broad taxonomic group is present in the recording.

Is there a bird vocalizing in this recording? Answer Yes or No.
Does this recording contain mammal vocalizations? Answer Yes or No.
Are there whale or dolphin sounds in this recording? Answer Yes or No.
Does this recording contain insect sounds? Answer Yes or No.
Is there a frog or amphibian vocalizing in this recording? Answer Yes or No.
Are there any animal vocalizations in this recording? Answer Yes or No.

Tips

  • Always include "Answer Yes or No" — without it, the model may respond with species names instead of a yes/no answer.
  • Reliable for bird and mammal presence. Less training data for insect and amphibian, so expect lower reliability.
  • The model correctly rejects wrong taxa (e.g. answers "No" to mammal presence on bird-only recordings).

Call Type Presence

Determine whether a specific call type is present.

Generic (any call type)

Is a <target_call_type> present in this recording? Answer Yes or No.

Replace <target_call_type> with e.g. alarm call, flight call, begging call.

Alarm call

Is an alarm call present in this recording? Answer Yes or No.
Is the <species> making an alarm call in this recording? Answer Yes or No.

Flight call

Is a flight call present in this recording? Answer Yes or No.
Is the <species> making a flight call in this recording? Answer Yes or No.

Experimental Tasks

These tasks show promise but require further evaluation. Some are trained on limited data or specific taxa; others emerge from generalization. Treat results as exploratory.

Top-3 Species Identification

What is the common name of the species vocalizing in this audio recording? Provide your top 3 predictions.

The model returns its top 3 candidates for the focal species in ranked order.

Habitat Inference

Based on the sounds, what habitat or environment do you think this was recorded in?

Produces plausible biome labels (e.g. “Forest”, “Grassland”). For richer descriptions, lower merging_alpha to ~0.7.

Geographic Inference

Based on the species you hear, what region of the world was this likely recorded in?

The model appears to infer geography from species identity, background sounds, or other cues. Lower alpha (0.6–0.7) substantially improves geographic reasoning by leveraging the base LLM’s species-range knowledge.

Structured JSON Output

Identify this recording. Respond in JSON format: {"species": "...", "call_type": "..."}

At default alpha (1.0), JSON instructions are ignored and the model outputs a plain label. For valid JSON, you must lower alpha to ~0.7. Including detailed field names in the prompt helps the model honor the structure.
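
Because the model may still emit a bare label instead of JSON, pipeline code should parse defensively. A sketch that falls back to wrapping a plain label:

```python
import json

def parse_structured(answer: str) -> dict:
    """Parse a JSON reply, falling back to a plain-label dict.

    At alpha 1.0 the model tends to ignore JSON instructions and emit a
    bare label, so the fallback wraps that label under 'species'.
    """
    try:
        start, end = answer.index("{"), answer.rindex("}") + 1
        return json.loads(answer[start:end])
    except ValueError:  # no braces, or malformed JSON
        return {"species": answer.strip()}
```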

Describe-Then-Identify

First describe what you hear, then identify the species.

Produces coherent free-form descriptions. For the best chance that the species name appears in the text, use alpha ~0.9.

Frequency Range

What is the overall frequency range of the vocalizations in this audio?

Returns a range like 2000–8000 Hz.

Species Count

How many different species are vocalizing, and what are they? Give scientific names.

Expected output format: N: species1, species2. Species counting is harder than listing — the model sometimes outputs names instead of a count.
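
Given that the count is sometimes omitted, a defensive parser can fall back to counting the listed names. A sketch assuming the "N: species1, species2" format:

```python
import re

def parse_species_count(answer: str) -> tuple[int, list[str]]:
    """Parse 'N: species1, species2'; infer N from the list if absent."""
    m = re.match(r"\s*(\d+)\s*:\s*(.*)", answer)
    body = m.group(2) if m else answer
    names = [s.strip() for s in body.split(",") if s.strip()]
    count = int(m.group(1)) if m else len(names)
    return count, names
```

Comparing the stated count against the length of the parsed list is a cheap consistency check.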

Call Count

Per-species

How many calls from each species can you hear? Give scientific names.

Total

How many individual vocalizations can you detect in this audio?

Counting is approximate. On single-species clips the model often answers “1”. It rarely hallucinates large numbers.

Individual Count

How many individuals are vocalizing in the audio? Answer "one" or "more than one".

Temporal Order

List the species in the order they first vocalize, using scientific names.

Model Behavior

Prompt Robustness

The model is highly robust to prompt phrasing for classification tasks. Training prompts, terse prompts, and verbose prompts all perform within ~1–2% of each other. Rephrasing is fine. The only measured vulnerability is mild priming — mentioning a specific species in the prompt slightly biases the model toward that answer.

System Prompts

System prompts with expert personas (e.g. “You are a bioacoustics expert”) have no measurable effect on classification accuracy. They are supported but not necessary. The one exception is placing geographic context in the system prompt, which works slightly better than inline context. System prompts can also be used to change style and behavior — for instance to make the model more conversational or change the output format — particularly when combined with merging_alpha < 1.

merging_alpha Parameter

This parameter controls the blend between the fine-tuned bioacoustics adapter and the base LLM.

| Alpha | Behavior |
| --- | --- |
| 1.0 (default) | Best for classification and automated pipelines. Terse, label-style output. |
| 0.8 | Good for interactive/conversational use. Richer answers with minimal accuracy cost. |
| 0.7 | Required for JSON output and structured reports. Good for knowledge-dependent tasks. Good for multi-turn without a large drop in accuracy. |
| 0.6 | Most conversational. Best for geographic inference and habitat descriptions. Species accuracy drops meaningfully. |
| < 0.5 | Not recommended: bioacoustic capability degrades sharply below this threshold. |

Format Control

Simple format prefixes (e.g. Respond in the format ‘species: <name>’) work at alpha=1.0. Complex structured formats (JSON, multi-field reports) require alpha ~0.7.

Model Limitations

  • 16 kHz encoder: The model is constrained by an 8 kHz Nyquist frequency. This means it will not give meaningful responses for taxa such as bats when recordings fall primarily above this range.

  • Bird bias: Accuracy is notably higher for birds than other taxa such as anurans or cetaceans, due to training data. Marine PAM data in particular may be unreliable for multilabel classification without fine-tuning.

  • No refusal of wrong-taxon prompts: If you ask "What whale is this?" on a bird recording, the model will typically identify the bird anyway rather than refuse. With the exact classification prompt "What is the common name for the focal species in the audio?", the model always answers as if a species is present. To allow for an answer of "None", use a multilabel classification prompt instead.

  • Emotional valence and translation: The model cannot tell how an animal is feeling or what it is "saying", beyond the limited exception of call-type prediction. These uses should currently be considered out of scope.