Prompting Guide¶
This guide applies to NatureLM-audio v1.1, available through the Interactive Demo on Hugging Face Spaces.
This guide covers usage of NatureLM-audio for bioacoustic tasks, with a focus on how to prompt the model for the best results.
Audio Format¶
The model operates on 16 kHz mono audio; input is resampled automatically. The model is trained to handle clips of up to 10 seconds in length. If you want the model to focus on only a certain section of audio, clip to that section in advance.
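As a minimal sketch of the clipping advice above, the helper below trims a mono sample sequence to a window of interest, capped at the 10-second limit. The function name and signature are illustrative, not part of the NatureLM-audio API; real pipelines should use a proper audio library for loading and resampling.

```python
def prepare_clip(samples, sample_rate, start_s=0.0, max_seconds=10):
    """Clip `samples` (mono floats at `sample_rate` Hz) to the section of
    interest, capped at the model's 10-second limit.

    Hypothetical helper, not part of the NatureLM-audio API.
    """
    start = int(start_s * sample_rate)
    end = start + int(max_seconds * sample_rate)
    return samples[start:end]
```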
Task Overview¶
Tasks are labeled either “Core”, producing consistent results in our evaluations, or “Experimental”, showing promise but meriting further evaluation. NatureLM-audio can be tried on tasks beyond those covered in this guide, and on taxa beyond the training data — however, these should all be considered “Experimental” and not assumed to work out of the box.
| Task | Reliability | Sample Prompt |
|---|---|---|
| Species Detection | Core | What are the common names for the species in the audio, if any? |
| Species Identification | Core | What species is vocalizing in this audio recording? Common name? |
| Species Identification with Context | Core | Given the context: '...', what is the common name for the focal species in the audio? |
| Species Identification from Option List | Core | Which of these is the focal species in the audio? Options: ... |
| Multiple Species Identification | Core | List the scientific names of all species vocalizing in this audio clip. |
| Taxonomy | Core | What is the genus of the focal species in the audio? |
| Call Type / Behavior | Core | What type of vocalization or call is this? |
| Life Stage | Core | Is the focal species an adult or juvenile? |
| Captioning | Core | Caption the audio, using common names for any animal species. |
| Combined Task / Multi-Turn | Core | What type of vocalization is it, and what is the life stage? |
| Environmental Sound Classification | Core | Which of these non-animal sounds are present in the recording? ... or None. |
| Taxon Presence | Core | Is there a bird vocalizing in this recording? Answer Yes or No. |
| Call Type Presence | Core | Is a [call type] present in this recording? Answer Yes or No. |
| Top-3 Species Identification | Experimental | What is the common name of the species vocalizing in this audio recording? Provide your top 3 predictions. |
| Habitat Inference | Experimental | Based on the sounds, what habitat or environment do you think this was recorded in? |
| Geographic Inference | Experimental | Based on the species you hear, what region of the world was this likely recorded in? |
| Structured JSON Output | Experimental | Identify this recording. Respond in JSON format: {"species": "...", "call_type": "..."} |
| Describe-Then-Identify | Experimental | First describe what you hear, then identify the species. |
| Frequency Range | Experimental | What is the overall frequency range of the vocalizations in this audio? |
| Species Count | Experimental | How many different species are vocalizing, and what are they? |
| Call Count | Experimental | How many individual vocalizations can you detect in this audio? |
| Individual Count | Experimental | How many individuals are vocalizing in the audio? Answer "one" or "more than one". |
| Temporal Order | Experimental | List the species in the order they first vocalize, using scientific names. |
Core Tasks¶
These tasks produce consistent results across our evaluations and cover the most common bioacoustic research use cases.
Species Detection¶
Detect which species are vocalizing without providing a list to choose from. The model can also answer "None".
Common name
Scientific name
Species Identification¶
Identify the single focal species in a recording.
Common name (recommended)
Scientific name
The model is very prompt-robust for species ID — terse prompts like Species? and verbose prompts all perform within ~1% of each other. Scientific-name prompts are slightly more accurate than common-name prompts because scientific names are more standardized.
Tips
- With this prompt, the model defaults to returning a single species. For soundscapes with multiple species, use the multilabel prompts below.
- The model is strongest on Western European and North American birds. Accuracy drops in species-rich tropical regions (Neotropics, Southeast Asia).
- Weak taxonomic groups: hummingbirds/swifts, grouse/pheasants, and kingfishers are notably harder.
Species Identification with Context¶
Provide geographic or temporal metadata or free-text observations to help narrow identification.
Common name (recommended)
Scientific name
Replace <context> with whatever metadata you have, e.g. country: BR, coordinates: -23.5, -46.6 or recorded in temperate forest, June.
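A small sketch of filling in the context template, assuming the prompt wording from the task table above; the function name is hypothetical and the context string can be any free-text metadata you have.

```python
def context_prompt(context, name_type="common"):
    """Fill the context template from this guide.

    `context` is free-text metadata, e.g. "country: BR, coordinates: -23.5, -46.6".
    Hypothetical helper, not part of the NatureLM-audio API.
    """
    return (
        f"Given the context: '{context}', what is the {name_type} name "
        "for the focal species in the audio?"
    )
```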
System prompt
You can also place context in the system prompt:
Tips
- Context provides a small accuracy boost. The model already performs well without it, but it can be helpful especially for ambiguous or acoustically similar calls.
- System-prompt context and in-prompt context perform similarly.
Species Identification from Option List¶
Classify the focal species from a provided option list.
Common name
Scientific name
Replace <species_choices> with a comma-separated list, e.g. Turdus merula, Erithacus rubecula, Fringilla coelebs, Parus major, Phylloscopus collybita.
Tips
- Multiple-choice accuracy is substantially higher than open-ended ID (~91% vs ~77%).
- Up to 15 options work well; more are possible but need to be validated.
- Genus-level and family-level option variants also exist and are even easier.
- You can add context: Given the context '<context>', which of these is the focal species? <species_choices>
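The option-list prompt can be assembled from a Python list of candidates; this is a hedged sketch using the prompt wording from the task table above, with a hypothetical function name.

```python
def option_list_prompt(species_choices):
    """Build the multiple-choice prompt from a list of candidate species.

    Up to ~15 options are known to work well (see Tips above).
    Hypothetical helper, not part of the NatureLM-audio API.
    """
    choices = ", ".join(species_choices)
    return f"Which of these is the focal species in the audio? Options: {choices}."
```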
Multiple Species Identification¶
Classify one or more species in a recording. Unlike single-species classification, these prompts can return multiple species and can answer "None" when the correct answer is not present.
Listing prompt (recommended)
Option-list prompt
Replace <species_choices> with a comma-separated list, e.g. Turdus merula, Erithacus rubecula, Fringilla coelebs, Parus major, Phylloscopus collybita.
Tips
- Both prompt styles can return multiple species and can answer "None". The listing prompt tends to work better on multi-species soundscapes.
- The "if any" phrasing in the option-list variant encourages the model to answer "None" when appropriate.
Taxonomy¶
Classify at coarser taxonomic levels.
Genus
Family
Order
Full taxonomic name
Tips
- Coarser levels (order, family) are more accurate than finer ones (genus, species), as expected.
- Full taxonomic name returns a semicolon-separated string: Chordata; Aves; Passeriformes; Fringillidae; Fringilla coelebs. This can serve as a more informative alternative to scientific- or common-name classification when the precise species-level prediction is incorrect.
- All taxonomic prompts are highly prompt-robust.
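The semicolon-separated full taxonomic name is easy to post-process; the sketch below assumes the five-rank Chordata; Aves; …; species format shown above.

```python
def parse_taxonomy(answer):
    """Split the semicolon-separated full taxonomic name into ranks.

    Assumes the five-level format shown above (phylum through species).
    Hypothetical helper, not part of the NatureLM-audio API.
    """
    ranks = ["phylum", "class", "order", "family", "species"]
    parts = [p.strip() for p in answer.split(";")]
    return dict(zip(ranks, parts))
```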
Call Type / Behavior¶
Classify the vocalization type. For exploratory purposes, use an open-ended prompt. To classify into a specific set of calls for analysis, binary prompts may be most reliable — e.g. call vs. song, alarm call present or not present.
Note that this task currently focuses on call types that are established across species — for instance “call”, “song”, or “alarm call” — but likely not calls named for one particular species.
Open-ended call type
Binary call vs. song
Species-conditioned
Tips
- The binary prompt works well when the true label is strictly "call" or "song". On recordings with labels like "alarm call" or "flight call", the binary prompt is less reliable.
- Providing a species hint improves call-type accuracy slightly.
- The trained vocabulary includes: call, song, alarm call, flight call, begging call.
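Since the trained vocabulary is small, a prediction can be checked against it before downstream analysis; a minimal sketch using the vocabulary listed above.

```python
# The trained call-type vocabulary listed in the Tips above.
TRAINED_CALL_TYPES = {"call", "song", "alarm call", "flight call", "begging call"}

def is_known_call_type(answer):
    """Check a prediction against the trained call-type vocabulary
    (case-insensitive, trailing period stripped). Illustrative only."""
    return answer.strip().rstrip(".").lower() in TRAINED_CALL_TYPES
```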
Life Stage¶
Determine whether the vocalizing animal is an adult or juvenile.
Open-ended
Binary (recommended)
Tips
- The model has a strong adult bias: it identifies adults reliably but has low recall of juvenile vocalizations. The binary prompt is slightly more reliable than the open-ended one. Both have high precision for juveniles.
- Use this as a soft filter rather than a definitive classifier for juveniles.
Captioning¶
Generate a natural-language description of the audio.
Bioacoustic caption (recommended)
General audio caption
Tips
- Captions are typically 1–2 sentences at the default merging_alpha. For richer descriptions, lower alpha toward 0.7 (see Model Behavior below).
- The general caption deliberately avoids species names — use it when you want habitat/acoustic descriptions without identification.
Combined Task / Multi-Turn¶
Ask the model to identify species first, then follow up with behavior or life-stage questions. The model retains audio context across turns.
Species then behavior
Species then life stage
Species then call type and life stage
Tips
- Species accuracy in multi-turn is identical to single-turn prompts.
- Behavior and life-stage follow-ups carry the same caveats as their standalone counterparts (call/song confusion, adult bias).
Environmental Sound Classification¶
Classify non-animal environmental sounds from a provided option list.
Tips
- Trained on categories including alarms, engines, weather, household sounds, domestic animals, music, and signals/phones.
- The model returns a comma-separated list of matches, or "None" if nothing matches.
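The comma-separated-or-"None" output convention can be parsed with a few lines of post-processing; this is a light sketch, not an official parser.

```python
def parse_multilabel(answer):
    """Parse a comma-separated answer into a list of labels.

    "None" maps to an empty list, matching the output convention above.
    Hypothetical helper, not part of the NatureLM-audio API.
    """
    answer = answer.strip().rstrip(".")
    if answer.lower() == "none":
        return []
    return [label.strip() for label in answer.split(",")]
```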
Taxon Presence¶
Determine whether a broad taxonomic group is present in the recording.
Tips
- Always include "Answer Yes or No" — without it, the model may respond with species names instead of a yes/no answer.
- Reliable for bird and mammal presence. There is less training data for insects and amphibians, so expect lower reliability there.
- The model correctly rejects wrong taxa (e.g. answers "No" to mammal presence on bird-only recordings).
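Because the model may occasionally reply with a species name instead of Yes/No, a defensive parser is useful in pipelines; this sketch returns None when the instruction was ignored.

```python
def parse_yes_no(answer):
    """Map a model answer to True/False, or None when the model ignored
    the Yes/No instruction (e.g. replied with a species name).
    Hypothetical helper, not part of the NatureLM-audio API."""
    word = answer.strip().rstrip(".").lower()
    if word == "yes":
        return True
    if word == "no":
        return False
    return None
```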
Call Type Presence¶
Determine whether a specific call type is present.
Generic (any call type)
Replace <target_call_type> with e.g. alarm call, flight call, begging call.
Alarm call
Flight call
Experimental Tasks¶
These tasks show promise but require further evaluation. Some are trained on limited data or specific taxa; others emerge from generalization. Treat results as exploratory.
Top-3 Species Identification¶
The model returns the top 3 candidates for the focal species in ranked order.
Habitat Inference¶
Produces plausible biome labels (e.g. “Forest”, “Grassland”). For richer descriptions, lower merging_alpha to ~0.7.
Geographic Inference¶
The model appears to infer geography from species identity, background sounds, or other cues. Lower alpha (0.6–0.7) substantially improves geographic reasoning by leveraging the base LLM’s species-range knowledge.
Structured JSON Output¶
At default alpha (1.0), JSON instructions are ignored and the model outputs a plain label. For valid JSON, you must lower alpha to ~0.7. Including detailed field names in the prompt helps the model honor the structure.
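Since replies at higher alpha may be a bare label rather than JSON, it is worth extracting and validating defensively; a sketch using the standard library (the greedy regex is a simplification that assumes at most one JSON object per reply).

```python
import json
import re

def extract_json(answer):
    """Pull the first {...} object out of a model reply and parse it.

    Returns None when no valid JSON is found, which is expected at
    alpha 1.0 where the model tends to emit a bare label instead.
    Hypothetical helper, not part of the NatureLM-audio API.
    """
    match = re.search(r"\{.*\}", answer, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```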
Describe-Then-Identify¶
Produces coherent free-form descriptions. For the best chance that the species name appears in the text, use alpha ~0.9.
Frequency Range¶
Returns a range like 2000–8000 Hz.
Species Count¶
Expected output format: N: species1, species2. Species counting is harder than listing — the model sometimes outputs names instead of a count.
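Given that the model sometimes skips the count, the expected N: species1, species2 format can be parsed with a fallback; an illustrative sketch, not an official parser.

```python
import re

def parse_species_count(answer):
    """Parse the "N: species1, species2" format described above.

    Returns (count, names), or (None, []) when the model skipped the
    count. Hypothetical helper, not part of the NatureLM-audio API.
    """
    match = re.match(r"\s*(\d+)\s*:\s*(.*)", answer)
    if not match:
        return None, []
    count = int(match.group(1))
    names = [n.strip() for n in match.group(2).split(",") if n.strip()]
    return count, names
```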
Call Count¶
Per-species
Total
Counting is approximate. On single-species clips the model often answers “1”. It rarely hallucinates large numbers.
Individual Count¶
Temporal Order¶
Model Behavior¶
Prompt Robustness¶
The model is highly robust to prompt phrasing for classification tasks. Training prompts, terse prompts, and verbose prompts all perform within ~1–2% of each other. Rephrasing is fine. The only measured vulnerability is mild priming — mentioning a specific species in the prompt slightly biases the model toward that answer.
System Prompts¶
System prompts with expert personas (e.g. “You are a bioacoustics expert”) have no measurable effect on classification accuracy. They are supported but not necessary. The one exception is placing geographic context in the system prompt, which works slightly better than inline context. System prompts can also be used to change style and behavior — for instance to make the model more conversational or change the output format — particularly when combined with merging_alpha < 1.
merging_alpha Parameter¶
This parameter controls the blend between the fine-tuned bioacoustics adapter and the base LLM.
| Alpha | Behavior |
|---|---|
| 1.0 (default) | Best for classification and automated pipelines. Terse, label-style output. |
| 0.8 | Good for interactive/conversational use. Richer answers with minimal accuracy cost. |
| 0.7 | Required for JSON output and structured reports. Good for knowledge-dependent tasks. Good for multi-turn without a large drop in accuracy. |
| 0.6 | Most conversational. Best for geographic inference and habitat descriptions. Species accuracy drops meaningfully. |
| < 0.5 | Not recommended — bioacoustic capability degrades sharply below this threshold. |
Format Control¶
Simple format prefixes (e.g. Respond in the format ‘species: <name>’) work at alpha=1.0. Complex structured formats (JSON, multi-field reports) require alpha ~0.7.
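A reply in the simple prefix format can be unwrapped with a fallback to the raw answer when the model omits the prefix; the field name and helper below are illustrative.

```python
def parse_prefixed(answer, field="species"):
    """Extract <name> from a "species: <name>" style reply, falling back
    to the raw answer when the prefix is missing.
    Hypothetical helper, not part of the NatureLM-audio API."""
    prefix = field.lower() + ":"
    text = answer.strip()
    if text.lower().startswith(prefix):
        return text[len(prefix):].strip()
    return text
```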
Model Limitations¶
16 kHz encoder: The model is constrained by an 8 kHz Nyquist frequency. This means it will not give meaningful responses for taxa such as bats when recordings fall primarily above this range.
Bird bias: Accuracy is notably higher for birds than other taxa such as anurans or cetaceans, due to training data. Marine PAM data in particular may be unreliable for multilabel classification without fine-tuning.
Refusing wrong-taxon prompts: If you ask “What whale is this?” on a bird recording, the model will typically identify the bird anyway rather than refuse. If you use the exact classification prompt “What is the common name for the focal species in the audio?” the model will always answer as if a species is present. To allow for an answer of “None”, use a multilabel classification prompt instead.
Emotional valence and translation: The model does not have the ability to tell how an animal is feeling or what it is saying, with the limited exception of call type prediction — this use should currently be considered out of scope.