Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu*, Jin Xu*, Xiaohuan Zhou*, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou
Alibaba Group
*Equal Contribution Corresponding Author

Abstract

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field, and most existing works support only a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover xx tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding. However, directly co-training all tasks and datasets can lead to interference, because the textual labels associated with different datasets vary considerably in task focus, language, annotation granularity, and text structure. To overcome this one-to-many interference, we design a multi-task training framework that conditions the decoder on a sequence of hierarchical tags: shared tags encourage knowledge sharing, while task-specific tags avoid interference. Qwen-Audio achieves strong performance across diverse benchmark tasks without any task-specific fine-tuning, surpassing its counterparts. Building on Qwen-Audio, we further develop Qwen-Audio-Chat, which accepts multiple audio and text inputs, enabling multi-turn dialogues and supporting various audio-centric scenarios.

Framework

Overview of the Qwen-Audio architecture and multi-task pre-training.
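
The hierarchical tag conditioning described in the abstract can be illustrated with a minimal sketch. The tag names, their order, and the `TaskSpec` fields below are hypothetical and chosen only for illustration; this page does not list the actual tag vocabulary. The idea shown is that tasks on the same kind of audio share the leading tags, while later tags specify the task and output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskSpec:
    """Hypothetical description of one training example's task."""
    task: str                 # e.g. "transcribe", "translate", "caption"
    audio_language: str       # language spoken in the audio, or "unknown"
    text_language: str        # language of the target text
    timestamps: bool = False  # whether word-level timestamps are predicted


def build_tag_prefix(spec: TaskSpec) -> List[str]:
    """Build the tag sequence the decoder is conditioned on.

    Tags go from general to specific, so related tasks share a prefix
    (assumed ordering, not taken from the paper).
    """
    return [
        "<|audio_task_start|>",                  # shared by every task
        f"<|audio_lang:{spec.audio_language}|>", # shared across tasks on the same audio
        f"<|task:{spec.task}|>",                 # task-specific
        f"<|text_lang:{spec.text_language}|>",   # output-specific
        "<|timestamps|>" if spec.timestamps else "<|notimestamps|>",
    ]


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Count how many leading tags two tasks have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


asr = build_tag_prefix(TaskSpec("transcribe", "zh", "zh"))
s2t = build_tag_prefix(TaskSpec("translate", "zh", "en"))
print(shared_prefix_len(asr, s2t))  # number of shared leading tags
```

In this sketch, Chinese transcription and Chinese-to-English translation share the start tag and the audio-language tag, then diverge at the task tag.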

Performance

Multilingual ASR and Translation

Team Building Group (5)
recognize the speech, and translate it into English
OK. The sentence “大家好我们来讨论一下公司周末团建的地点吧大家有什么意见呢” translated into English is “Hello, let’s discuss the location for our company’s weekend team building. What are your opinions?”.
recognize the speech, and translate it into English
OK. The sentence “I think we can go to the countryside and find a beautiful spot.” translated into English is “I think we can go to the countryside and find a beautiful spot.”
recognize the speech, and translate it into English
OK. The sentence “わかりました。田舎に行くことに同意します。田舎暮らしを体験してみるのもいいかもしれません。” translated into English is “I understand. I agree to go to the countryside. It’s also good to experience country life.”
recognize the speech, and translate it into English
OK. The sentence “Bueno, también podemos considerar algunas actividades divertidas como los deportes acuáticos.” translated into English is “Well, we can also consider some fun activities like water sports.”
recognize the speech, and translate it into English
OK. The sentence “对了我还想提议我们可以租一些自行车骑行一下既锻炼身体又心情愉悦” translated into English is “By the way, I also suggest that we can rent some bicycles to ride and exercise our bodies and improve our mood”.

Multiple Audio Analysis

Recognize the gender, age, accent, emotion, and speaking content of the person in the audio, and combine these to answer his/her questions while explaining the reasons for these answers.
The person in the audio is potentially a male, approximately in his thirties, from Sichuan, China. The person's emotion is calm, and they say, "What are you up to? Have you eaten yet?" To keep the conversation engaging without going off-topic, we can respond with, "Oh, I was just having a meal. How about you?" The reasons behind this response are as follows: Firstly, we express our concern for the other person, avoiding any awkward silence. Secondly, we answer their question, indicating that we are currently occupied and not idle. Lastly, we invite the person to join us for a meal, which can further expand the conversation and facilitate communication.
What kind of food might he enjoy if we were to have a meal together?
Based on the gender and age of men, as well as the regional characteristics of Sichuan, he may like spicy foods, such as hot pot, spicy stew, boiled fish in soup, etc.
"What are the emotions in audio 1 and audio 2, respectively?"
"The tone of the woman in audio 1 is cheerful, possibly talking about something pleasant. The tone of the woman in audio 2 is fearful, possibly discussing something terrifying."
How to respond to the questions in audio 1 and audio 2?
In audio 1, when the woman asks if you're okay, you can respond with, "Thank you for your concern, I'm fine now, just felt a bit uncomfortable all of a sudden."
In audio 2, considering the woman's fearful tone and discussing something terrifying, you can respond to her with either, "I'm scared, I don't know what to do" or "I'm feeling really upset, I don't know how to face this situation."
The prompts and answers in this conversation are translated from the Chinese demo.

Sound Understanding and Reasoning

Recognize the sound information and assume that the sound is happening around the user. Please provide the user with some handling suggestions.
The sound is of breaking glass. Here are some suggestions for handling the situation:
1. Turn off the source of the sound: If the sound is coming from a broken glass item, turn off the source of the sound as soon as possible to prevent further damage.
2. Clean up the broken glass: Use gloves and a broom or vacuum cleaner to clean up the broken glass. Be careful not to cut yourself.
3. Assess the damage: If the broken glass is in a location that is not easily accessible or if it poses a safety hazard, contact a professional for assistance.
4. Cover the area: Cover the area where the broken glass is located to prevent injury or further damage.
5. Avoid walking on broken glass: If the broken glass is on a floor, do not walk on it. Instead, use a broom or vacuum cleaner to clean it up or cover it with a piece of cardboard or tape.
6. Seek medical attention: If you or someone else is injured by broken glass, seek medical attention immediately.
Recognize the sound information and assume that the sound is happening around the user. Please provide the user with some handling suggestions.
The sound is of a baby crying. Here are some suggestions for handling the situation:
1. Calmly speak to the baby: Try speaking to the baby in a soothing voice, gently rocking them or singing to them. This can help calm them down and comfort them.
2. Check if the baby needs something: Make sure the baby is hungry, wet, or needs a diaper change. Addressing these needs can help soothe the baby and stop them from crying.
3. Create a calm environment: Make sure the baby’s environment is quiet, dark, and comfortable. White noise or soothing music can also help calm the baby.
4. Take a break: If you’re feeling overwhelmed or stressed, take a break and let someone else take over for a few minutes. This can give you a chance to regroup and come back to the situation with a clearer mind.
5. Seek help if needed: If the baby’s crying persists despite your best efforts, or if you’re feeling unsure about how to handle the situation, don’t hesitate to seek help from a healthcare professional. They can provide additional guidance and support.
Recognize the sound information and assume that the sound is happening around the user. Please provide the user with some handling suggestions.
The sound is of a car horn honking. Here are some suggestions for handling the situation:
1. Wait for the car to pass: If the car horn is honking repeatedly, wait for it to pass so you can continue with your activity.
2. Move to a different location: If the car horn honking is causing you discomfort or distraction, move to a different location to avoid the noise.
3. Contact the police: If the car horn honking is excessive or persistent, and you feel threatened or intimidated, contact the police for assistance.
4. Stay calm: It’s important to stay calm and not respond aggressively to the car horn honking. This can escalate the situation and potentially lead to a dangerous situation.
5. Identify the source of the horn: Try to identify the make and model of the car honking to determine if it’s a legitimate warning or if someone is being reckless.

Music Appreciation

what is it
A solo piano piece.
Describe this music in detail
This is a solo piano piece. It is fast tempo with a playful and humorous mood. The piece is in the key of C major and is in 4/4 time signature. The melody is played on the right hand with the left hand playing chords. The piece is a classic ragtime style.
Give me some recommendations for similar music.
Some similar music to “Solo Piano Ragtime” could be:
1. “The Entertainer” by Scott Joplin
2. “Maple Leaf Rag” by Scott Joplin
3. “Ave Maria” by Franz Schubert
4. “Für Elise” by Ludwig van Beethoven
5. “The Blue Danube” by Johann Strauss II
All of these pieces are solo piano compositions and have a similar ragtime style to “Solo Piano Ragtime”. They are all well-known and beloved pieces that are sure to please anyone who enjoys classic piano music.

Speech Editing with SRWT

Find the word "what", so I can turn it into a meme
The word “what” starts at 1.88 seconds and ends at 2.04 seconds.
change “joyful” to “interesting”
ok, here is the edited audio
change the person name to “John”
ok, here is the edited audio
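
As a rough illustration of how word-level timestamps like those in the first answer above could drive an edit, the sketch below parses the start and end times from the answer text and cuts the matching samples out of a waveform. The function names, the regular expression, and the 16 kHz sample rate are assumptions made for this example; the page does not describe how the demo performs the edit itself.

```python
import re
from typing import List, Sequence, Tuple


def parse_word_span(answer: str) -> Tuple[float, float]:
    """Extract (start, end) in seconds from an answer of the form
    '... starts at X seconds and ends at Y seconds.'"""
    match = re.search(
        r"starts at ([\d.]+) seconds and ends at ([\d.]+) seconds", answer
    )
    if match is None:
        raise ValueError("no timestamp span found in answer")
    return float(match.group(1)), float(match.group(2))


def cut_span(
    samples: Sequence[float], sample_rate: int, start_s: float, end_s: float
) -> List[float]:
    """Return the samples between start_s and end_s (end exclusive)."""
    if end_s < start_s:
        raise ValueError("end time precedes start time")
    start = round(start_s * sample_rate)
    end = round(end_s * sample_rate)
    return list(samples[start:end])


answer = "The word “what” starts at 1.88 seconds and ends at 2.04 seconds."
start_s, end_s = parse_word_span(answer)

sample_rate = 16000                  # assumed sample rate
waveform = [0.0] * (3 * sample_rate) # placeholder: 3 seconds of silence
clip = cut_span(waveform, sample_rate, start_s, end_s)
print(len(clip) / sample_rate)       # clip duration in seconds
```

Replacing a word, as in the later turns, would additionally require synthesizing new audio for the span, which this sketch does not attempt.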

Audio-Motivated Creative Writing

Write a poem about the above audio
Birds are singing in the trees,
A gentle breeze is blowing free,
The rustling leaves are music,
As nature’s symphony goes on.
The sun is shining bright and warm,
A peaceful atmosphere is sown,
In the midst of nature’s beauty,
We can find serenity and duty.
The birds take flight with ease,
Their chirping voices fill the trees,
A symphony of nature’s grace,
A moment to pause and embrace.
The wind whispers through the branches,
As the leaves dance in the branches,
A peaceful moment in time,
A chance to let go and unwind.
So let us take a moment here,
To appreciate the beauty so dear,
In the midst of nature’s wonder,
We can find a sense of wonder.