Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems

Zheng Fang Wuhan UniversityWuhanChina [email protected] , Tao Wang Wuhan UniversityWuhanChina [email protected] , Lingchen Zhao Wuhan UniversityWuhanChina [email protected] , Shenyi Zhang Wuhan UniversityWuhanChina [email protected] , Bowen Li Wuhan UniversityWuhanChina [email protected] , Yunjie Ge Wuhan UniversityWuhanChina [email protected] , Qi Li Tsinghua UniversityBei**gChina [email protected] , Chao Shen Xi’an Jiaotong UniversityXi’anChina [email protected] and Qian Wang Wuhan UniversityWuhanChina [email protected]

Abstract.

In recent years, extensive research has been conducted on the vulnerability of ASR systems, revealing that black-box adversarial example attacks pose significant threats to real-world ASR systems. However, most existing black-box attacks rely on queries to the target ASRs, which is impractical when queries are not permitted. In this paper, we propose ZQ-Attack, a transfer-based adversarial attack on ASR systems in the zero-query black-box setting. Through a comprehensive review and categorization of modern ASR technologies, we first meticulously select surrogate ASRs of diverse types to generate adversarial examples. Following this, ZQ-Attack initializes the adversarial perturbation with a scaled target command audio, rendering it relatively imperceptible while maintaining effectiveness. Subsequently, to achieve high transferability of adversarial perturbations, we propose a sequential ensemble optimization algorithm, which iteratively optimizes the adversarial perturbation on each surrogate model, leveraging collaborative information from other models. We conduct extensive experiments to evaluate ZQ-Attack. In the over-the-line setting, ZQ-Attack achieves a 100% success rate of attack (SRoA) with an average signal-to-noise ratio (SNR) of 21.91dB on 4 online speech recognition services, and attains an average SRoA of 100% and SNR of 19.67dB on 16 open-source ASRs. For commercial intelligent voice control devices, ZQ-Attack also achieves a 100% SRoA with an average SNR of 15.77dB in the over-the-air setting.

Speech recognition; adversarial attacks; zero-query; transferability

To appear in the Proceedings of The ACM Conference on Computer and Communications Security (CCS), 2024

1. Introduction

Table 1. Summary of existing audio adversarial attacks.

Method	Setting	Knowledge	Target ASR ¹¹footnotemark: 1	Over-the-line ²²footnotemark: 2	Over-the-air	Queries ³³footnotemark: 3
Carlini et al. (Carlini and Wagner, 2018)	White-box	Gradient	$\square$	100%	-	$\sim$ 1000
Commandersong (Yuan et al., 2018)	White-box	Gradient	$\square$	100%	$\sim$ 80%	$\sim$ 1000
Taori et al. (Taori et al., 2019)	Black-box	Prediction score	$\square$	$\sim$ 40%	-	$\sim$ 300,000
SGEA (Wang et al., 2020)	Black-box	Prediction score	$\square$	100%	-	$\sim$ 100,000
Devil’s Whisper (Chen et al., 2020)	Black-box	Confidence score	$\triangle\star$	$\sim$ 60%	$\sim$ 50%	$\sim$ 1500
Occam (Zheng et al., 2021)	Black-box	Final decision	$\square\triangle$	100%	-	$\sim$ 10,000
KENKU (Wu et al., 2023)	Black-box	None	$\triangle\star$	$\sim$ 70%	$\sim$ 80%	$>$ 0
NI-Occam (Zheng et al., 2021)	Zero-query black-box ⁴⁴footnotemark: 4	None	$\star$	-	$\sim$ 50%	0
TransAudio (Qi et al., 2023)	Zero-query black-box	None	$\square\triangle$	$\sim$ 30%	-	0
ZQ-Attack	Zero-query black-box	None	$\square\triangle\star$	100%	100%	0

Note that, (i) ¹¹footnotemark: 1: We use $\square$ , $\triangle$ , $\star$ to represent open-source ASRs, online speech recognition services, and commercial IVC devices, respectively. (ii) ²²footnotemark: 2: In the over-the-line and over-the-air settings, we employ ‘-’ to represent unsuccessful attacks. In cases of successful attacks, the success rate of attack is presented. (iii) ³³footnotemark: 3: “Queries” refer to the number of queries made to the target ASR system during the generation process. (iv) ⁴⁴footnotemark: 4: In contrast to the black-box setting, the zero-query black-box setting prohibits any queries to the target ASR system during the generation process.

Automatic speech recognition (ASR) techniques, which convert spoken language into text, play a crucial role in modern human-computer interactions. These techniques are now widely employed in online speech recognition services (Microsoft, 2023; OpenAI, 2023a) provided by major companies, including Microsoft, OpenAI, etc. Online repositories also offer a plethora of open-source ASRs for public utilization. Furthermore, ASRs have also been integrated into commercial intelligent voice control (IVC) devices, such as Apple Siri (Apple, 2023) and Amazon Alexa (Amazon, 2023), enabling users to perform various tasks via voice commands. Unfortunately, similar to other deep neural networks (DNNs) based systems, modern ASRs are vulnerable to adversarial attacks (Szegedy et al., 2014; Carlini and Wagner, 2018; Gao et al., 2018). Attackers can introduce small perturbations into audio samples, causing the target ASR system to produce incorrect results.

Audio Adversarial Attacks. Recently, numerous studies have investigated the practicality and effectiveness of audio adversarial attacks on ASR systems, as summarized in Table 1. Depending on the accessibility of the attacker to the target ASR systems, audio adversarial attacks can be categorized into white-box and black-box attacks. The initial works (Carlini and Wagner, 2018; Yuan et al., 2018) employ gradient descent algorithms to generate adversarial audios in the white-box setting, where attackers have full access to the internal information of the target ASR systems. However, in real-world scenarios, attackers typically lack access to the internal information of the target system, making these white-box attacks often impractical. To achieve attacks in black-box settings, several methods have been proposed to generate adversarial examples based on the limited information acquired through queries to the target ASR system (Taori et al., 2019; Chen et al., 2020; Zheng et al., 2021; Wu et al., 2023). However, these methods require huge financial costs and time investments for the queries to generate a single adversarial example, and the large number of highly similar queries makes these attacks easily detectable by the target ASR system, rendering them impractical.

Therefore, to further enhance the practicality of adversarial attacks, researchers have increasingly turned their attention to transfer-based attacks, which can generate adversarial examples effectively across different target systems, thereby eliminating the need for queries. However, the performance of existing transfer-based audio adversarial example attacks also exhibits significant limitations. For instance, NI-Occam (Zheng et al., 2021) generates audio adversarial examples on fine-tuned Kaldi models to attack IVC devices, but its attack success rate is quite limited. TransAudio (Qi et al., 2023) optimizes adversarial examples on a surrogate ASR model and can successfully attack black-box ASR systems with similar architectures to the surrogate model. However, it only achieves word-level modifications to the original transcription and has a low success rate on online speech recognition services. These results indicate that current transfer-based attacks still possess limited transferability.

Consequently, we propose the following question: How to generate audio adversarial examples with high transferability in the challenging zero-query black-box setting?

ZQ-Attack. Our answer to this question is ZQ-Attack, a transfer-based adversarial attack on black-box ASR systems without the need for queries. Inspired by the ensemble method (Opitz and Maclin, 1999; Liu et al., 2017; Brown et al., 2017), our core idea is to optimize the adversarial perturbation on diverse surrogate ASRs. Ideally, adversarial perturbations optimized concurrently on multiple different surrogate models should contain features that can be captured by these models, making them effective against various ASR systems.

Specifically, ZQ-Attack consists of three stages: surrogate ASRs selection, perturbation initialization, and sequential ensemble optimization. For the surrogate ASRs selection stage, we conduct an extensive survey of modern ASR systems and observe that different types of ASR systems utilize distinct acoustic models. Therefore, we first need to select multiple different types of surrogate ASRs. Then, to ensure that the generated adversarial perturbations are effective across these surrogate ASRs while maintaining high stealthiness, we propose an adaptive search algorithm that uses scaled target commands to initialize the adversarial perturbation, instead of using zeros or Gaussian noise as in existing methods. Following the initialization, ZQ-Attack employs a sequential ensemble optimization algorithm to optimize the adversarial perturbations on the sequence of diverse surrogate ASRs collaboratively. This sequential ensemble algorithm allows for the optimization of adversarial perturbations on each surrogate ASR while concurrently leveraging information from prior surrogate ASRs. Consequently, the generated adversarial perturbations are effective not only for the current surrogate ASR but also for the previous ASRs in the sequence.

We conduct extensive experiments in both over-the-line and over-the-air settings to validate the effectiveness and imperceptibility of our ZQ-Attack. In the over-the-line setting, ZQ-Attack achieves an average success rate of attack (SRoA) of 100% and signal-to-noise ratio (SNR) of 21.91dB on four online speech recognition services. Additionally, ZQ-Attack attains an average SRoA of 100% and SNR of 19.67dB on 16 open-source ASRs. In the over-the-air setting, ZQ-Attack achieves an average SRoA of 100% and an SNR of 15.77dB on two commercial IVC devices. These results demonstrate that ZQ-Attack can successfully generate audio adversarial examples with high transferability, effectively targeting various ASR systems without requiring any queries.

Contributions. Our contributions are summarized as follows:

•

To the best of our knowledge, we are the first to generate audio adversarial examples with high transferability on ASR systems in the most challenging zero-query black-box setting. Our method is effective in both over-the-line and over-the-air settings, with the target ASR systems encompassing online speech recognition services, open-source ASRs, and commercial IVC devices, showcasing remarkable practicality.
•

We introduce ZQ-Attack, a zero-query, transfer-based adversarial attack on black-box ASR systems. This approach optimizes adversarial perturbations using a set of diverse surrogate ASRs simultaneously, thereby enhancing their transferability. Furthermore, we develop an adaptive adversarial perturbation initialization method based on the target command audio to improve the imperceptibility.
•

We conduct comprehensive experiments to evaluate the performance of ZQ-Attack on online speech recognition services, open-source ASRs, and commercial IVC devices. The experimental results demonstrate the superior performance of our method. ZQ-Attack can successfully attack all the target ASR systems without queries while achieving an average SRoA of 100%.

2. Background

2.1. Automatic Speech Recognition

ASR systems can automatically transcribe input audio into the corresponding transcriptions. As shown in Figure 1, a typical ASR system comprises three components: pre-processing, acoustic model, and decoder.

•

Pre-processing. Given an input audio with $n$ sample points, denoted as $x\in\mathbb{Z}^{n}$ , an ASR system first normalizes it to the range of ${[-1,1]}^{n}$ and applies low/high-pass filters to remove frequencies and segments beyond the range of human hearing. Then, the ASR system employs time-frequency transformation techniques, such as the short-time Fourier transform (STFT), to convert the time-domain signal $x$ into the frequency domain spectrogram $S\in\mathbb{R}^{T\times F}$ , where $T$ and $F$ denote the number of frames and frequency bins, respectively. In the subsequent steps, traditional ASR systems utilize acoustic feature extraction algorithms, such as mel-frequency cepstral coefficients (MFCC) (Sigurdsson et al., 2006), linear predictive coefficient (LPC) (Itakura, 1975), or perceptual linear predictive (PLP) (Hermansky, 1990), to further transform $S$ into meticulously designed acoustic features. In contrast, modern DNN-based ASR systems directly utilize $S$ .
•

Acoustic model. The acoustic model is typically a machine learning model that maps the spectrogram or extracted acoustic features to an intermediate representation. Traditional ASR systems use hidden Markov models (HMMs) and Gaussian mixture models (GMMs) (Gales et al., 2008; Rabiner, 1989), while modern ASR systems employ DNNs, such as convolutional neural networks (CNNs) and Transformers (Gulati et al., 2020; Li et al., 2019).
•

Decoder. The decoder uses the intermediate representation to predict tokens and generate corresponding transcriptions. A token is the smallest unit of the transcription, and the set of all possible tokens constitutes a vocabulary $V$ . The vocabulary $V$ varies across different ASR systems. For example, $V=\{a,b,\cdots,z,space\}$ is a simple vocabulary for an ASR system recognizing English.

Refer to caption — Figure 1. The architecture of a typical ASR system.

2.2. Audio Adversarial Attacks

Audio adversarial attacks aim to manipulate the output of ASR systems using audio adversarial examples constructed by adding imperceptible perturbations to benign carrier audios (Carlini and Wagner, 2018; Yuan et al., 2018; Zheng et al., 2021; Chen et al., 2020; Wu et al., 2023). Depending on the objective of the attack, audio adversarial attacks can be categorized into targeted and untargeted attacks.

Formally, let $f(x):x\to y$ denote an ASR system that transcribes an audio $x$ into the transcription $y=f(x)$ , and let $x^{\prime}=x+\delta$ denote the adversarial example constructed by adding the adversarial perturbation $\delta$ to the carrier audio $x$ . An untargeted adversarial attack aims to mislead the target ASR system, causing it to produce any result other than the ground truth, represented as $f(x^{\prime})\neq y$ . In contrast, a targeted adversarial attack aims to induce the output of the target ASR system into a specific target transcription $t\neq f(x)$ , which is formulated as:

(1)

f(x^{\prime})=t,\quad s.t.~{}Dis(x,x^{\prime})<\epsilon,

where $Dis(x,x^{\prime})$ represents the distance between $x$ and $x^{\prime}$ , commonly calculated using the $L_{p}$ norm, with $p$ usually being 0, 2, or $\infty$ . $\epsilon$ is a hyper-parameter that constrains this distance. As targeted adversarial attacks can naturally extend to untargeted attacks, the adversarial attacks discussed in the rest of this paper will specifically refer to the targeted ones unless explicitly specified otherwise.

Typically, the generation process of adversarial examples can be formulated as an optimization problem:

(2)

\min\limits_{\delta}{\mathcal{L}(x,\delta,t,f)=\mathcal{L}_{a}(x,\delta,t,f)+c% \cdot\mathcal{L}_{p}(\delta)},

where the adversarial loss $\mathcal{L}_{a}$ measures the effectiveness of $\delta$ on the target ASR system $f$ , and the imperceptibility loss $\mathcal{L}_{p}$ quantifies the imperceptibility of $\delta$ . The parameter $c$ acts as a weighting factor, balancing the effectiveness and imperceptibility of the attack. Gradient descent is a common method for solving this optimization problem. It can be formulated as:

(3)

\delta\leftarrow clip_{\epsilon}\left(\delta-\alpha\cdot\nabla_{\delta}% \mathcal{L}(x,\delta,t,f)\right),

where $\alpha$ represents the learning rate and $\nabla_{\delta}\mathcal{L}(x,\delta,t,f)$ is the gradient of $\mathcal{L}(x,\delta,t,f)$ with respect to $\delta$ . The function $clip_{\epsilon}$ limits $Dis(x,x^{\prime})$ to a relatively small range controlled by $\epsilon$ .

3. Threat Model & Challenges

3.1. Threat Model

Goals. The attacker aims to generate audio adversarial examples that can be recognized as the target transcription by the target ASR systems without any queries. In the over-the-line setting (i.e., digital attacks), the target systems are online speech recognition services or open-source ASRs. The waveform files of the audio adversarial examples are directly used as inputs, inducing the target ASR systems to produce the specified target transcriptions. In the over-the-air setting (i.e., physical attacks), the target ASR systems are commercial IVC devices. The adversarial examples should be misrecognized as the target transcription by these devices after being transmitted through the air. Additionally, the audio adversarial examples should be imperceptible, making them difficult for human ears to detect.

Knowledge & Capabilities. Prior works on black-box adversarial attacks on ASR systems do not require the attacker to know internal information about the target ASR system, but they still assume that the attacker can interact with the target system. In this paper, we consider a more realistic and challenging scenario where the attacker also cannot query the target ASR system during the generation of adversarial examples. After generating the adversarial examples, the attacker can execute the attacks by uploading audio files to target ASR systems in the over-the-line setting and positioning a speaker near the target commercial IVC devices in the over-the-air setting. Additionally, we assume the attacker has the capability to train surrogate ASRs or obtain pre-trained surrogate ASRs from open-source repositories.

3.2. Challenges

To launch audio adversarial attacks in the challenging zero-query black-box setting, an alternative approach is the transfer-based attack. The basic idea of transfer-based attacks is to use a local surrogate model to generate adversarial examples that are then used to attack the target black-box model. Despite the demonstrated effectiveness of transfer-based attacks in the image domain (Papernot et al., 2016; Demontis et al., 2019; Wu et al., 2018; Liu et al., 2017), achieving similar success in the audio domain remains an unresolved issue. Abdullah et al. (Abdullah et al., 2021) reveal that the transferability of audio adversarial examples among different ASR systems is exceedingly limited, even when these ASRs share the same architecture. Existing attempts to achieve this goal have only demonstrated limited transferability (Zheng et al., 2021; Qi et al., 2023). As demonstrated in prior work (Huang et al., 2019; Wu et al., 2018; Ravikumar et al., 2022), the core reason for this limitation might be that adversarial examples tend to overfit the architecture and feature representations of the specific surrogate model, resulting in limited transferability to the target models with different architecture. Unlike image recognition models, ASR systems exhibit increased complexity and greater architectural diversity, leading to greater differences between surrogate ASRs and target ASRs. Hence, generating highly transferable adversarial examples in the audio domain is more challenging.

4. ZQ-Attack

4.1. Problem Formulation

ZQ-Attack aims to generate transferable audio adversarial examples in the zero-query black-box scenario. Formally, given a carrier audio $x$ and a target command $t$ , ZQ-Attack optimizes the adversarial perturbation $\delta$ to enhance its effectiveness on various black-box target ASR systems. This optimization problem can be formulated as follows:

(4)

\max_{\delta}\mathop{\mathbb{P}}_{f\in\mathcal{F}}~{}\left(f(x+\delta)=t\right),

where $\mathbb{P}$ represents the probability, and $\mathcal{F}$ denotes the set of all black-box target ASR systems. However, since the attacker lacks internal information about the target ASR system and cannot query it, directly solving this optimization problem is challenging.

An intuitive way to solve this problem is leveraging surrogate ASRs. However, optimizing the adversarial perturbation on a single surrogate ASR may result in overfitting to that specific model. Therefore, ZQ-Attack optimizes adversarial perturbations on multiple surrogate ASRs. We denote the set of surrogate ASRs as $\mathbb{F}$ . Then, the optimization problem becomes as follows:

(5)

\min_{\delta}\mathcal{L}_{all}(x,\delta,t,\mathbb{F})\quad\text{s.t.}~{}Dis(x,% x^{\prime})<\epsilon,

where $\mathcal{L}_{all}$ denotes the loss on all surrogate ASRs, and the imposed constraint ensures that the optimized adversarial perturbation attains a specified level of imperceptibility.

4.2. Attack Overview

The core idea of ZQ-Attack is to collaboratively optimize the adversarial perturbation on diverse types of surrogate ASRs. Specifically, ZQ-Attack consists of three stages: surrogate ASRs selection, perturbation initialization, and sequential ensemble optimization. The workflow is illustrated in Figure 2.

Surrogate ASRs Selection. We investigate modern ASR systems and categorize them into two main types: CNN-based and Transformer-based. While CNNs are more adept at capturing local features, Transformers excel at capturing global contexts. Therefore, an intuitive approach is to select surrogate ASRs that include both CNN-based and Transformer-based architectures, ensuring that the adversarial perturbations optimized on these surrogate ASRs concurrently possess both local and global features of the target command.

Perturbation Initialization. Given a target command $t$ , we employ Text-to-Speech (TTS) techniques to generate a corresponding target command audio $x_{t}$ . Subsequently, we initialize the adversarial perturbation with a scaled $x_{t}$ and superimpose it onto the carrier audio, ensuring that the constructed adversarial example is effective on all surrogate ASRs. To further enhance the imperceptibility of the initial adversarial perturbation, we employ an adaptive search algorithm to minimize the scaling factor.

Sequential Ensemble Optimization. Following the perturbation initialization stage, ZQ-Attack employs a sequential ensemble optimization algorithm to collaboratively optimize the adversarial perturbation on the ordered set of surrogate ASRs. This algorithm consists of an inner loop and an outer loop. In each iteration, the ordered set of surrogate ASRs is randomly shuffled in the outer loop. Then, the sequential ensemble optimization takes place within the inner loop. For each surrogate ASR, this algorithm integrates collaborative information from all preceding surrogate ASRs in the ordered set, facilitating collaborative optimization. Additionally, we design a novel loss function for the optimization process to enhance the transferability and imperceptibility of the perturbation. Following the inner loop, the algorithm updates the perturbation and validates its effectiveness on all surrogate ASRs.

4.3. Surrogate ASRs Selection

The selection of surrogate ASRs is the foundation that ensures the effectiveness of ZQ-Attack. As described in Section 3.2, the architecture of ASR systems exhibits considerable diversity. Therefore, randomly selecting surrogate ASRs may fail to cover the mainstream modern ASR systems. Hence, in this subsection, we initially provide a summary and categorization of modern ASR systems, followed by the details of surrogate ASRs selection. The illustration of this stage is shown in Figure 3.

Summary and Categorization of Modern ASR Systems. Modern ASR systems typically convert audio into the spectrogram without employing additional acoustic feature extraction algorithms (Kriman et al., 2020; Li et al., 2019; Han et al., 2020; Gulati et al., 2020; OpenAI, 2023a; Majumdar et al., 2021; Amodei et al., 2016; Hannun et al., 2014; Zhang et al., 2020). Hence, the primary differences among these ASR systems reside in the internal acoustic model. We summarize and categorize modern ASR systems into the following two categories from the perspective of the acoustic model:

•

CNN-based. CNNs have found widespread application in the field of computer vision and have recently exhibited notable progress in ASR as well (Li et al., 2019; Kriman et al., 2020; Han et al., 2020; Majumdar et al., 2021; Abdel-Hamid et al., 2014; Abdeljaber et al., 2017; Schneider et al., 2019). The key advantages of CNNs include their model’s low complexity and high computational efficiency. Additionally, CNNS are adept at extracting local features within the spectrogram.
•

Transformer-based. These ASR systems employ Transformers as their acoustic models. Given that audio is temporal data, using recurrent neural networks (RNNs) as acoustic models is an intuitive choice (Amodei et al., 2016; Hannun et al., 2014). While RNNs can capture short-term dependencies, attention mechanisms (Vaswani et al., 2017) enable non-contiguous frames to attend each other, allowing Transformers to capture long-term dependencies. This results in better speech recognition performances than RNNs (Zeyer et al., 2019; Karita et al., 2019a; Li et al., 2020). Consequently, Transformers are progressively supplanting RNNs in modern ASRs (Dong et al., 2018b; Zhang et al., 2020; Karita et al., 2019b; Higuchi et al., 2020; Gulati et al., 2020; Baevski et al., 2020; Hsu et al., 2021; Radford et al., 2023).

Selection of Surrogate ASRs. The local ASRs for selection can be obtained from online sources or trained by the attackers themselves. While utilizing a single surrogate ASR can lead to the overfitting of $\delta$ to that surrogate ASR, rendering $\delta$ ineffective on target ASR systems, using too many surrogate ASRs can also lead to high computation costs. The diversity in architectures of ASRs may result in significant disparities between the surrogate ASRs and the target ASR systems, making it challenging to generate transferable adversarial examples. Therefore, we need to select surrogate ASRs encompassing modern ASR systems of different types.

According to our categorization, modern ASR systems can be mainly categorized into CNN-based and Transformer-based ASR systems. The CNNs excel in extracting local features but exhibit a comparatively weaker capability in capturing dynamic global contexts. Conversely, Transformers are proficient in effectively capturing global information but demonstrate a diminished ability to extract local features. To integrate the advantages of both CNN-based and Transformer-based ASR systems, the selected surrogate ASRs should encompass representatives from both categories. This ensures that the optimized adversarial perturbations can possess both locally and globally salient features, thereby enhancing their transferability to the target ASR systems. Furthermore, CNNs and Transformers represent prevalent architectures of acoustic models adopted in modern ASR systems. Therefore, we incorporate both CNN-based and Transformer-based ASRs into the surrogate ASRs to effectively cover a broad range of real-world ASR systems. Upon selecting $K$ surrogate ASRs from the local ASRs, we construct the set of these surrogate ASRs, denoted as $\mathbb{F}=[f_{j}]_{j=1}^{K}$ , where $f_{j}$ represents the $j$ -th surrogate ASR in $\mathbb{F}$ . Since the subsequent sequential ensemble optimization algorithm optimizes the adversarial perturbation on these surrogate ASRs in a sequential manner, $\mathbb{F}$ is an ordered set. It is worth noting that the surrogate ASRs are scalable. The quantity of surrogate ASRs can be adjusted flexibly depending on the computational resources of the attackers.

Algorithm 1 Adaptive Search (AdaSearch)

0: Carrier audio

x

, Target command

t

, Target command audio

x_{t}

, Ordered set of

K

surrogate ASRs

\mathbb{F}

, Search stride

s

0: Initial adversarial perturbation

\delta

l_{x}\leftarrow len(x),~{}l_{t}\leftarrow len(x_{t})

\mu=+\infty,~{}\delta\leftarrow 0

3: for

i\leftarrow 0

l_{x}-l_{t}

\mu_{i}\leftarrow 0

\delta_{i}\leftarrow[\underbrace{0,\cdots,0}_{i},~{}x_{t}~{},\underbrace{0,% \cdots,0}_{l_{x}-l_{t}-i}]

6: while

\mu_{i}\leq\mu

7: if

\forall f\in\mathbb{F},~{}f(x+\mu_{i}\cdot\delta_{i})=t

then

\mu=\mu_{i},~{}\delta=\mu\cdot\delta_{i}

9: break

10: end if

11:

\mu_{i}\leftarrow\mu_{i}+s

12: end while

13: end for

14: return

\delta

4.4. Perturbation Initialization

The initialization of the adversarial perturbation $\delta$ may significantly impact the performance of the attack. Initializing $\delta$ from a point far from the region of the target command in the feature space can lead to a time-consuming and uncertain optimization process, and the high dimensionality of audio data further complicates the optimization. In contrast, using the target command audio directly as the initial $\delta$ may result in poor imperceptibility. To obtain an effective and relatively imperceptible initialized adversarial perturbation, we propose an adaptive search algorithm to initialize $\delta$ with a scaled target command audio, as presented in Alg. 1. This algorithm aims to minimize the scaling factor while maintaining the effectiveness of the initialized $\delta$ on all surrogate ASRs.

Specifically, we first choose an audio $x$ as the carrier audio to construct the adversarial example $x^{\prime}=x+\delta$ . Following previous works (Zheng et al., 2021; Yuan et al., 2018), we opt for songs as the carrier audio. For a given target command $t$ , we utilize TTS techniques to generate a corresponding target command audio $x_{t}=\mathcal{T}(t)$ , where $\mathcal{T}(\cdot)$ denotes the TTS process. To alleviate the impact of varying volume levels during the perturbation initialization stage, we normalize the values of sample points in both $x$ and $x_{t}$ to the range of $[-0.5,0.5]$ . Subsequently, the adaptive search algorithm searches for the smallest value of scaling factor $\mu$ , and $\delta$ is initialized using the scaled target command audio $\mu\cdot x_{t}$ , ensuring that the corresponding initial adversarial example is recognized by all surrogate ASRs as the target command.

Since the length of $x_{t}$ is typically shorter than that of $x$ , it is necessary to pad both sides of the scaled $x_{t}$ with zeros to initialize the perturbation. We use $l_{x}$ and $l_{t}$ to denote the length of $x$ and $x_{t}$ , respectively. The lengths of the padding on each side are indeterminate, provided their sum equals $l_{x}-l_{t}$ . The adaptive search algorithm searches for the optimal padding lengths on each side to minimize the scaling factor as much as possible. An example of this initialization method is depicted in Figure 4. It can be seen that the adaptive search algorithm finds the padding lengths and a relatively small scaling factor.

In summary, the adaptive search algorithm initializes $\delta$ by pushing the corresponding adversarial example toward the decision boundary of all surrogate ASRs, thereby circumventing the time-consuming and uncertain initial search process. The initialized adversarial perturbation can be regarded as a coarse-grained optimized perturbation, serving as a basis for subsequent fine-grained optimization in the sequential ensemble optimization stage.

4.5. Sequential Ensemble Optimization

After initializing the adversarial perturbation, ZQ-Attack performs fine-grained optimization of the adversarial perturbation on surrogate ASRs. Unlike the target black-box ASR systems, where the attacker lacks knowledge of their internal architectures and parameters, the white-box surrogate ASRs provide full control to the attacker. Hence, the attacker can optimize $\delta$ using any information acquired through these surrogate ASRs.

A straightforward approach to optimizing $\delta$ on diverse surrogate ASRs is to use the weighted average of the gradients from each one. For the $j$ -th surrogate ASR $f_{j}\in\mathbb{F}$ , the gradient is calculated as $\nabla_{\delta}\mathcal{L}(x,\delta,t,f_{j})$ , where $\mathcal{L}$ represents the loss of $\delta$ on a single surrogate ASR $f_{j}$ , typically similar to Eq. (2). However, this method essentially treats each surrogate ASR independently when optimizing $\delta$ . Each surrogate ASR does not interact with the others and, therefore, cannot leverage the optimization information provided by others. Moreover, this method focuses on optimizing $\delta$ in the direction most effective for a single surrogate ASR, without considering the effectiveness of $\delta$ on other surrogate ASRs.

To facilitate collaboration among these surrogate ASRs, we propose a sequential ensemble optimization algorithm, as presented in Figure 5. This algorithm iteratively optimizes the adversarial perturbation on the ordered set of surrogate ASRs $\mathbb{F}$ . For each surrogate ASR, this algorithm leverages the collaborative information from the preceding surrogate ASRs in the ordered set to optimize $\delta$ . In other words, the optimization process considers not only the efficacy of $\delta$ on the current surrogate ASR but also its efficacy on the preceding surrogate ASRs. Additionally, instead of directly using $\mathcal{L}$ in Eq. (2), we carefully design a novel loss function comprising three loss terms. The detailed design of the loss function is presented in Section 4.6.

Specifically, the sequential ensemble optimization algorithm comprises an outer loop and an inner loop. At each step, the algorithm first randomly shuffles $\mathbb{F}$ in the outer loop. Then, the optimization of $\delta$ on $\mathbb{F}$ takes place in the inner loop. Following the completion of the inner loop, $\delta$ is updated in the outer loop. As ZQ-Attack abstains from interacting with the target ASR systems via queries, it relies on surrogate ASRs to validate the efficacy of the generated adversarial example $x^{\prime}$ . The set of valid adversarial examples, denoted as $X^{\prime}$ , is initialized as $\varnothing$ at the beginning of the algorithm. At the end of each step, $x^{\prime}$ will be added to $X^{\prime}$ if it can successfully attack all surrogate ASRs.

Inner Loop. For clarity, we use $\delta_{j}$ to denote the adversarial perturbation optimized on the previous $j$ surrogate ASR(s) in $\mathbb{F}$ , and $\delta_{0}$ is equivalent to $\delta$ . For the $j$ -th surrogate ASR $f_{j}$ in $\mathbb{F}$ , the input perturbations include $\delta_{0}$ and the optimized adversarial perturbations obtained from the preceding $j-1$ ASR(s). Formally, the input adversarial perturbations for $f_{j}$ , denoted as $\Delta_{j}$ , can be represented as $\Delta_{j}=[\delta_{0},\delta_{1},\cdots,\delta_{j-1}]$ . We first add the adversarial perturbations $\Delta_{j}$ to the carrier audio $x$ to construct the adversarial examples. To ensure that the adversarial examples have been pushed towards the decision boundary, we additionally add randomly sampled Gaussian noise $\sigma$ on each adversarial example. Following the forwarding of these adversarial examples to $f_{j}$ and the subsequent gradient calculation, the perturbation is updated as:

(6)

\delta_{j}=\delta_{0}-\alpha\cdot\frac{1}{j}\sum_{\delta^{\prime}\in\Delta_{j}% }\nabla_{\delta^{\prime}}\mathcal{L}(x,{\delta^{\prime}}+\sigma,t,f_{j}),

where $\alpha$ represents the learning rate. The loss $\mathcal{L}$ is detailed in Section 4.6.

To avoid the updated $\delta_{j}$ being perceptible enough to the human ear, we use a clip** algorithm to restrict the updated $\delta_{j}$ within a limited range. Instead of using the $L_{p}$ norm-based clip** algorithm employed in previous work (Zheng et al., 2021; Chen et al., 2020), we utilize an adaptive clip** algorithm based on psychoacoustics (Gelfand, 2017). To be specific, this algorithm is grounded in temporal masking, a phenomenon where the presence of louder components can influence the perception of quieter components when two sounds with different loudness levels are heard by the human ear. This algorithm constrains $\delta$ within a range proportional to the carrier audio, permitting larger perturbations in louder segments of the carrier while maintaining smaller perturbations in quieter areas, thereby reducing the perceptibility of the perturbation. This clip** algorithm can be represented as follows:

(7)

clip_{\epsilon}(\delta,x)=\text{max}(\text{min}(\delta,\epsilon\cdot|x|),\ -% \epsilon\cdot|x|).

It is noteworthy that the clip** is performed element-wise in the perturbation. After applying the adaptive clip** algorithm, the adversarial perturbation will be bounded as $\delta_{j}=clip_{\epsilon}(\delta_{j},x)$ .

Outer Loop. Each step of the sequential ensemble optimization algorithm begins by randomly shuffling $\mathbb{F}$ . Subsequently, the optimization is performed in the inner loop. Following the inner loop, we obtain $K$ perturbations $[\delta_{1},\delta_{2},\cdots,\delta_{K}]$ optimized on the surrogate ASRs, and $\delta$ is updated as follows:

(8)

\delta=\eta\cdot\frac{1}{K}\sum_{j=1}^{K}\delta_{j}+(1-\eta)\cdot\delta_{0},

where $\eta$ denotes a balancing factor determining the extent of the update. $\eta=0$ implies no update, while $\eta=1$ signifies a complete update to the optimized adversarial perturbation, disregarding the former one. This update method uses the adversarial perturbation $\delta_{0}$ as historical information, while the momentum method (Dong et al., 2018a) uses the gradient as historical information. At the end of each step, $x^{\prime}=x+\delta$ is added to $X^{\prime}$ if all surrogate ASRs transcribe it into the target command.

The comprehensive details of the sequential ensemble optimization algorithm are presented in Alg. 2.

Algorithm 2 ZQ-Attack

0: Carrier audio

x

, Target command

t

, Ordered set of

K

surrogate ASRs

\mathbb{F}

, Max step

N

, Search stride

s

0: The set of valid adversarial examples

X^{\prime}

x_{t}\leftarrow\mathcal{T}(t)

\delta\leftarrow\text{AdaSearch}(x,t,x_{t},\mathbb{F},s)

X^{\prime}\leftarrow\varnothing

4: # Outer Loop

5: for

i\leftarrow 1

N

6: Randomly shuffle

\mathbb{F}

\delta_{0}\leftarrow\delta

8: # Inner Loop

9: for

j\leftarrow 1

K

10:

f_{j}\leftarrow

the

j

-th ASR in

\mathbb{F}

11:

\Delta_{j}\leftarrow[\delta_{0},\delta_{1},\delta_{2},\cdots,\delta_{j-1}]

12: Sample a Gaussian noise

\sigma

13: Compute

\nabla_{\delta^{\prime}}\mathcal{L}(x,{\delta^{\prime}}+\sigma,t,f_{j})

for each

\delta^{\prime}

\Delta_{j}

14: Update

\delta_{j}

using Eq. (6)

15: Clip

\delta_{j}

using Eq. (7)

16: end for

17: Update

\delta

using Eq. (8)

18:

x^{\prime}=x+\delta

19: if

\forall f\in\mathbb{F},~{}f(x^{\prime})=t

then

20:

X^{\prime}\leftarrow X^{\prime}\cup x^{\prime}

21: end if

22: end for

23: return

X^{\prime}

4.6. Loss Design

We design a novel loss $\mathcal{L}$ for optimizing the adversarial perturbations. It comprises three terms: the adversarial loss $\mathcal{L}_{a}$ , the imperceptibility loss $\mathcal{L}_{p}$ , and the acoustic feature loss $\mathcal{L}_{f}$ . The loss $\mathcal{L}$ can be written as follows:

(9)

\mathcal{L}(x,\delta,t,f)=\mathcal{L}_{a}(x,\delta,t,f)+c_{1}\cdot\mathcal{L}_% {p}(x,\delta)+c_{2}\cdot\mathcal{L}_{f}(x,\delta,t),

where $c_{1}$ and $c_{2}$ serve as weighting factors to balance the relative importance of different loss terms, ensuring a trade-off between the effectiveness and imperceptibility of $\delta$ .

Adversarial Loss. The adversarial loss $\mathcal{L}_{a}$ measures the effectiveness of $\delta$ on a surrogate ASR, evaluating how accurately a specific surrogate ASR transcribes the adversarial example to match the target command. The calculation process of $\mathcal{L}_{a}$ begins by inputting the constructed adversarial example into the surrogate ASR to obtain the output probability. Then, $\mathcal{L}_{a}$ is calculated according to the output probability and $t$ . As different ASR systems may utilize different loss functions for training, such as connectionist temporal classification (CTC) (Graves et al., 2006) and Transducer (Graves, 2012; Graves et al., 2013), the calculation method of $\mathcal{L}_{a}$ varies depending on the specific surrogate ASR. For instance, we utilize CTC loss to compute $\mathcal{L}_{a}$ for a surrogate ASR with a CTC architecture (e.g., Citrinet (Majumdar et al., 2021)).

Imperceptibility Loss. The imperceptibility loss $\mathcal{L}_{p}$ aims to minimize the detectability of the adversarial perturbation by human ears. Previous works (Chen et al., 2020; Zheng et al., 2021; Wu et al., 2023) commonly utilize the $L_{p}$ norm of the adversarial perturbation as the imperceptibility loss. However, Duan et al. (Duan et al., 2022) demonstrated that the $L_{p}$ norm shows a limited correlation with human perception, and the $L_{2}$ norm exhibits a relatively high correlation with human perception among the $L_{p}$ norm. Similar to the adaptive clip** algorithm employed when clip** the adversarial perturbation, as shown in Eq. (7), we design a new imperceptibility loss function $\mathcal{L}_{p}$ to calculate the $L_{2}$ norm of the ratio of the adversarial perturbation to the carrier audio. Formally, it can be represented as:

(10)

\mathcal{L}_{p}(x,\delta)={\left\|\frac{\delta}{x}\right\|}_{2}.

Acoustic Feature Loss. As mentioned in Section 2.1, traditional ASR systems employ feature extraction algorithms to obtain the acoustic features of spectrograms as a preprocessing procedure, while modern ASR systems typically utilize the spectrograms directly. Despite the intricacy of acoustic feature extraction algorithms, which require specialized knowledge, the performance of traditional ASR systems demonstrates that these algorithms can extract high-quality features. Prior work has also demonstrated the effectiveness of incorporating acoustic features into the optimization process of audio adversarial examples (Wu et al., 2023). Hence, we utilize the acoustic feature loss based on the widely adopted acoustic feature extraction algorithm, MFCC (Sigurdsson et al., 2006), to further enhance the effectiveness of the adversarial perturbation. Specifically, we first extract the acoustic features of the target command audio and the constructed adversarial example. We denote the acoustic feature of the target command audio and the constructed adversarial example as $M_{t}$ and $M_{x^{\prime}}$ , respectively. Then, the acoustic feature loss $\mathcal{L}_{f}$ can be calculated as:

(11)

\mathcal{L}_{f}(x,\delta,t)={\|M_{x^{\prime}}-M_{t}\|}_{2}.

Table 2. Summary of target ASR systems in the experiments.

ASR	Type	Acoustic Model	Word Error Rate on LibriSpeech test-clean/test-other (%)
Jasper (Li et al., 2019)	Open-source ASR	CNN	3.9/12.0
QuartzNet (Kriman et al., 2020)	Open-source ASR	CNN	3.8/10.4
Citrinet (Majumdar et al., 2021)	Open-source ASR	CNN	4.4/10.7 (S) 3.7/8.9 (M) 3.6/7.9 (L)
ContextNet (Han et al., 2020)	Open-source ASR	CNN	3.3/8.0 (S) 2.2/5.0 (M) 1.9/3.9 (L)
Conformer-CTC (Gulati et al., 2020)	Open-source ASR	Transformer	3.7/8.1 (S) 2.6/5.9 (M) 2.1/4.5 (L) 2.0/3.7 (XL)
Conformer-Transducer (Gulati et al., 2020)	Open-source ASR	Transformer	2.9/6.6 (S) 2.1/4.7 (M) 1.7/3.7 (L) 1.6/3.0 (XL)
Whisper (Radford et al., 2023)	Open-source ASR	Transformer	5.0/12.4 (base) 3.4/7.6 (small) 2.9/5.9 (medium) 2.7/5.6 (large)
Microsoft Azure (Microsoft, 2023)	Online speech recognition service	-⁷⁷footnotemark: 7	-
Tencent Cloud (Tencent, 2023)	Online speech recognition service	-	-
Alibaba Cloud (Alibaba, 2023)	Online speech recognition service	-	-
OpenAI Whisper (OpenAI, 2023a) ⁶⁶footnotemark: 6	Online speech recognition service	Transformer	2.7/5.2
Apple Siri (Apple, 2023)	Commercial IVC device	-	-
Amazon Alexa (Amazon, 2023)	Commercial IVC device	-	-

Note that, (i) ⁷⁷footnotemark: 7: Most online speech recognition services and commercial IVC devices do not reveal their implementation of the underlying ASR systems. Hence, we lack knowledge about their acoustic models and related information. We use “-” to represent unknown information. (ii) ⁶⁶footnotemark: 6: The ASR system employed by the OpenAI API is the open-source Whisper large-v2. Therefore, we have access to information regarding the acoustic model and its recognition performance on LibriSpeech. Despite the open-source nature of Whisper large-v2, we treat it as a black-box ASR system during attacks.

5. Experiments

5.1. Experiment Setup

In this section, we evaluate the performance of ZQ-Attack in both over-the-line and over-the-air settings. In the over-the-line setting, the audio adversarial examples are transmitted directly to the target ASR systems as waveform audio files. In the over-the-air setting, we utilize a speaker positioned near the target devices (e.g., 10 cm) to play the audio adversarial examples in a quiet office environment (e.g., 35dB).

Target ASR Systems. To fully demonstrate the effectiveness of ZQ-Attack, we conduct extensive experiments on various online speech recognition services, commercial IVC devices, and open-source ASRs. The details are presented as follows.

•

Online speech recognition services. In the over-the-line setting, the target online speech recognition services include Microsoft Azure Speech Service (Microsoft, 2023), Tencent Cloud Automatic Speech Recognition (Tencent, 2023), Alibaba Cloud intelligent speech interaction (Alibaba, 2023), and OpenAI Whisper (OpenAI, 2023a).
•

Commercial IVC devices. In the over-the-air setting, we select Apple Siri (Apple, 2023), and Amazon Alexa (Amazon, 2023) as the target IVC devices.
•

Open-source ASRs. Among the numerous open-source ASRs, we select some of the most representative and advanced ASRs as our targets in the over-the-line setting. As prior research indicates that the transferability of audio adversarial examples can be limited even among instances of the same ASR (Abdullah et al., 2021), we select open-source ASRs with the same architectures but varying scales, as well as those with distinct architectures. Specifically, the target ASRs include: Jasper (Li et al., 2019), QuartzNet (Kriman et al., 2020), ContextNet (M/L) (Han et al., 2020), Citrinet (M/L) (Majumdar et al., 2021), Conformer-CTC (M/L/XL) (Gulati et al., 2020), Conformer-Transducer (M/L/XL) (Gulati et al., 2020), and Whisper (base, small, medium, large) (Radford et al., 2023).

We summarize and categorize these target systems in Table 2.

Surrogate ASRs. The surrogate ASRs include Conformer-CTC (S) (Gulati et al., 2020), Conformer-Transducer (S) (Gulati et al., 2020), ContextNet (S) (Han et al., 2020), and Citrinet (S) (Majumdar et al., 2021). ContextNet and Citrinet employ CNNs as the acoustic model, while the acoustic models of Conformer-CTC and Conformer-Transducer are based on Transformers. The checkpoints for these surrogate ASRs are obtained from Nvidia NeMo (Nvidia, 2023).

Target Commands and Carrier Audios. We choose 10 commonly used instructions as the target commands in the experiments: call my wife, make it warmer, navigate to my home, open the door, open the website, play music, send a text, take a picture, turn off the light, and turn on airplane mode. We select five songs used in Commandersong (Yuan et al., 2018) as the carrier audio.

Software and Hardware. We implement ZQ-Attack using the PyTorch framework (Paszke et al., 2019). The experiments are conducted on a server equipped with 8 NVIDIA 3080Ti GPUs, 2 Intel Xeon Gold 5117 CPUs, and 128 GB RAM, running a 64-bit Ubuntu 18.04 operating system. In the over-the-air setting, we use a JBL Clip3 speaker to play the audio adversarial examples. Apple Siri on an iPhone 13 and Amazon Alexa on a second-generation Amazon Echo Dot are used as the target commercial IVC devices.

Baselines. To demonstrate the superior performance of ZQ-Attack, we compare it with several previous works. In the over-the-line setting, we compare ZQ-Attack with Carlini et al. (Carlini and Wagner, 2018), Occam (Zheng et al., 2021) and KENKU (Wu et al., 2023). In the over-the-air setting, we compare ZQ-Attack with NI-Occam (Zheng et al., 2021) and KENKU. For these baselines, we either utilize their open-source code or re-implement them. It is noteworthy that the evaluation results of these methods might be inconsistent with the original paper due to periodic updates by manufacturers to their ASR systems.

Experiment Design. We evaluate ZQ-Attack on online speech recognition services, commercial IVC devices, and open-source ASRs in Section 5.3, Section 5.4, and Section 5.5, respectively. In Section 5.6, we explore the impact of surrogate ASRs on ZQ-Attack. To evaluate the imperceptibility of ZQ-Attack, we conduct a user study in Section 5.7. In Section 5.8 and Section 5.9, we evaluate ZQ-Attack on a large command set and the latest Whisper large-v3 (OpenAI, 2023b), respectively.

Ethical Considerations. We have informed the relevant companies about the potential vulnerability of their ASR systems to our attacks via email.

Table 3. Comparison on online speech recognition services.

Method	SRoA $\uparrow$					SNR (dB) $\uparrow$	Query $\downarrow$
Method	Azure	Tencent	Alibaba	OpenAI	Average	SNR (dB) $\uparrow$	Query $\downarrow$
Carlini et al. (Carlini and Wagner, 2018)	0/10	0/10	0/10	0/10	0/10	/	0
Occam (Zheng et al., 2021)	10/10	10/10	10/10	10/10	10/10	12.54	30000
KENKU (Wu et al., 2023)	10/10	8/10	0/10	9/10	6.75/10	12.72	$>$ 0
ZQ-Attack	10/10	10/10	10/10	10/10	10/10	21.91	0

5.2. Evaluation Metrics

We use the success rate of attack (SRoA) as the metric of attack effectiveness. SRoA is calculated by dividing the number of successfully attacked commands by the total number of commands (i.e., 10). For each target command, if we can generate at least one adversarial example that effectively attacks the target ASR system, we consider the attack on that command as successful. Note that the adversarial example is considered effective only when its transcription matches exactly the target command. Any word errors are regarded as a failure, with case sensitivity being disregarded.

To evaluate the imperceptibility of the adversarial examples, we choose signal-to-noise ratio (SNR) as the metric. SNR is defined as the ratio of the power of a signal (i.e., the carrier audio $x$ ) to the power of a noise (i.e., the adversarial perturbation $\delta$ ) in the logarithm scale, and a higher SNR indicates better imperceptibility. The specific calculation method is shown as follows:

(12)

SNR(\text{dB})=10\cdot\log_{10}\left(\frac{{\|x\|}_{2}^{2}}{{\|\delta\|}_{2}^{% 2}}\right).

5.3. Evaluation on Online Speech Recognition Services

In the over-the-line setting, the results of ZQ-Attack and baseline methods on online speech recognition services are shown in Table 3. ZQ-Attack successfully generates audio adversarial examples for all target commands on four online speech recognition services, achieving an average SRoA of 100% and an average SNR of 21.91dB.

For baseline methods, the adversarial examples generated by Carlini et al. (Carlini and Wagner, 2018) fail to successfully attack any online speech recognition services, as this method is tailored for the white-box setting. Compared to Occam, ZQ-Attack achieves comparable effectiveness and better imperceptibility without any queries. While KENKU still necessitates a small number of queries to search for appropriate hyperparameters tailored to a specific target ASR system, ZQ-Attack attains superior effectiveness and imperceptibility without any queries. Note that KENKU fails to successfully attack Alibaba in our evaluation. We speculate that this could be attributed to updates made by Alibaba to its ASR system.

5.4. Evaluation on Commercial IVC Devices

Table 4. Comparison on commercial IVC devices.

Method	SRoA $\uparrow$			SNR (dB) $\uparrow$
Method	Siri	Alexa	Average	SNR (dB) $\uparrow$
NI-Occam (Zheng et al., 2021)	4/10	5/10	4.5/10	8.38
KENKU (Wu et al., 2023)	7/10	9/10	8/10	12.72
ZQ-Attack	10/10	10/10	10/10	15.77

Table 5. Evaluation on open-source ASRs.

Target ASR	SRoA	SNR (dB)	Target ASR	SRoA	SNR (dB)
Jasper	10/10	13.59	Conformer-CTC (XL)	10/10	23.59
QuartzNet	10/10	12.96	Conformer-Transducer (M)	10/10	25.34
Citrinet (M)	10/10	14.67	Conformer-Transducer (L)	10/10	20.63
Citrinet (L)	10/10	15.89	Conformer-Transducer (XL)	10/10	21.08
ContextNet (M)	10/10	15.73	Whisper (base)	10/10	17.88
ContextNet (L)	10/10	16.87	Whisper (small)	10/10	20.76
Conformer-CTC (M)	10/10	25.13	Whisper (medium)	10/10	23.39
Conformer-CTC (L)	10/10	23.51	Whisper (large)	10/10	23.74

In the over-the-air setting, we generate 10 audio adversarial examples for each target command. Each adversarial example is played up to three times. We consider that the attack on a command is successful if at least one adversarial example is effective.

The results of ZQ-Attack and baseline methods on commercial IVC devices are presented in Table 4. ZQ-Attack successfully generates audio adversarial examples for all target commands on two commercial IVC devices, achieving an SRoA of 100 $\%$ with an average SNR of 15.77dB. Additionally, an average of 6.6 adversarial examples for each command are effective.

In comparison, the SRoA of ZQ-Attack surpasses KENKU and NI-Occam by 20% and 55%. Concurrently, adversarial examples generated by ZQ-Attack exhibit better imperceptibility, with SNR values exceeding those of KENKU and NI-Occam by 3.05dB and 7.39dB, respectively.

5.5. Evaluation on Open-source ASRs

To demonstrate the high transferability of audio adversarial examples generated by ZQ-Attack, we additionally evaluate ZQ-Attack on 16 open-source ASR systems. The evaluation results are shown in Table 5. ZQ-Attack successfully generates audio adversarial examples on all 16 open-source ASRs, achieving an average SRoA of 100% and an average SNR of 19.67dB. These results show that the audio adversarial examples generated by ZQ-Attack exhibit high transferability, successfully attacking the target open-source ASRs.

Additionally, we observe a potential positive correlation between the recognition performance of open-source ASRs and the success rate of transfer attacks. We speculate that this phenomenon arises due to the better feature extraction capabilities of a more powerful ASR system, facilitating the capture of subtle adversarial perturbations. The detailed evaluation results and analysis of the correlation can be found in Appendix C

5.6. Impact of Surrogate ASRs

ZQ-Attack optimizes the adversarial perturbation on an ordered set of $K$ surrogate ASRs. To investigate the impact of surrogate ASRs to the performance of ZQ-Attack, we conduct experiments with various values of $K$ . Specifically, we set $K$ to 1, 2, and 4, respectively. In the cases with only one surrogate ASR, we conduct experiments with four configurations: ContextNet, Citrinet, Conformer-CTC, and Conformer-Transducer (i.e., selecting one from the four available ASRs). In the cases with two surrogate ASRs, six configurations (i.e., combination of two from the four available ASRs) are explored. Subsequently, we compute the average SRoA and SNR for the three groups of experiments. In these experiments, we choose one carrier audio and generate audio adversarial examples for 10 target commands.

The results are presented in Figure 6. It can be observed that with an increase in $K$ , both SRoA and SNR exhibit an increasing trend. For instance, when $K$ is 1, 2, and 4, the average SRoA is 81.25%, 98.75%, and 100%, respectively. The average SNR for the case involving 4 surrogate ASRs surpasses that of the case with 2 surrogate ASRs by 2.98dB and exceeds the SNR of the case with only 1 surrogate ASR by 5.35dB. Additionally, it can also be observed that ZQ-Attack successfully performs attacks for most target commands when using 2 surrogate ASRs. Thus, when computational resources are constrained, and there is a more relaxed requirement for the imperceptibility and effectiveness of the attack, a smaller $K$ can be chosen. Conversely, a larger $K$ can be used. In the cases with four surrogate ASRs, ZQ-Attack can complete a generation process in approximately 10 minutes using only one NVIDIA 3080Ti GPU.

Table 6. User study on human perception.

Audio Device	Method	Normal (%) $\uparrow$	Noise (%)	Talking (%) $\downarrow$	Recognize (%) $\downarrow$		WER (%) $\uparrow$
Audio Device	Method	Normal (%) $\uparrow$	Noise (%)	Talking (%) $\downarrow$	Once	Twice	Once	Twice
Speaker	Songs	92.63	7.37	0.00	0.00	0.00	0.00	0.00
	Occam (Zheng et al., 2021)	5.26	15.79	78.95	29.47	35.79	57.37	51.24
	KENKU (Wu et al., 2023)	0.00	6.32	93.68	56.84	63.68	30.23	25.96
	ZQ-Attack	13.16	71.58	15.26	3.16	3.68	94.95	94.47
Headphone	Songs	92.11	7.89	0.00	0.00	0.00	0.00	0.00
	Occam (Zheng et al., 2021)	2.11	3.16	94.74	49.47	54.74	38.25	31.32
	KENKU (Wu et al., 2023)	0.00	1.58	98.42	67.37	70.00	19.34	17.77
	ZQ-Attack	5.79	47.89	46.32	15.26	22.63	77.98	72.41

5.7. User Study

In this section, we recruit human participants to analyze the imperceptibility of the audio adversarial examples and conduct a comparative analysis between ZQ-Attack and the baseline methods (i.e., Occam, and KENKU). This study is carefully designed to mitigate any conceivable risks (psychological, legal, etc.) to the participants, and it is approved by the institutional review board (IRB). The target commands are common household phrases (e.g., “call my wife”) to minimize discomfort and the audio volume is normalized to maintain it below a safe threshold, preventing any risk of hearing damage. Besides, our study does not collect any private information from the participants, and all data will be deleted upon the completion of the study.

Our test audios comprise both normal audios and audio adversarial examples. The normal audios consist of 10 songs, and the audio adversarial examples include 10 instances from each method. In the user study, each participant listens to all test audios. Following the first listening, participants are provided with three choices: “normal audio”, “noisy audio”, and “audio with background speech”. In cases where participants select the “audio with background speech” option, they are required to provide the content of the perceived speech, followed by a second round of listening and transcription of the same test audio. Specifically, participants first listen to the test audios through speakers (e.g., MacBook Pro Speaker). Then, to eliminate environmental interference, participants listen to the audios again through noise-canceling headphones (e.g., Sony WH-1000XM5).

We gather data from a total of 38 participants, consisting of 20 males and 18 females. Among this group, 10 participants are below the age of 22, 18 participants are aged 22 to 24, and 10 participants are 25 or older. All participants are proficient in both spoken and written English, hold at least a bachelor’s degree, and have normal hearing. The results are presented in Table 6. When using the speaker as the audio device, 13.16% of participants select our adversarial examples as “normal audio”, while 71.58% of participants select them as “noisy audio”. Although 15.26% of participants select our adversarial examples as “audio with background speech,” only 3.68% of commands are recognized even after a second round of listening and transcription. In a more challenging configuration that uses headphones to play the audios, only 22.63% of the commands are recognized even after a second round of listening and transcription. Additionally, for the audios played using speakers and headphones, the average word error rate (WER) between user recognition results and the actual target commands is 94.47% and 72.41%, respectively. Compared with other methods, ZQ-Attack attains superior imperceptibility.

Furthermore, we observed that the imperceptibility of different target commands varies. The command open the door is the most easily perceived, while send a text is the hardest to perceive. This disparity may be attributed to the varying phonemes in different command audios, with vowel phonemes containing more energy and thus being more easily audible (Guo et al., 2022; Yu et al., 2023).

Table 7. Evaluation on a large command set.

Command	Azure		Tencent		Alibaba		OpenAI
Command	Attack	SNR (dB)	Attack	SNR (dB)	Attack	SNR (dB)	Attack	SNR (dB)
ask me a question	✓	23.51	✓	28.66	✓	26.31	✓	28.73
clear notification	✓	26.52	✓	21.28	✓	20.05	✓	26.52
close the shades	✓	26.54	✓	26.24	✓	26.54	✓	26.86
find a hotel	✓	25.78	✓	24.63	✓	20.26	✓	28.21
good morning	✓	19.65	✓	18.96	✓	27.90	✓	25.76
I have a secret to tell you	✓	27.21	✓	27.21	✓	27.21	✓	27.21
I need help	✓	27.23	✓	26.64	✓	20.17	✓	28.54
open the box	✓	18.48	✓	26.34	✓	18.48	✓	27.54
read a book	✓	20.77	✓	25.81	✓	17.27	✓	27.71
record a video	✓	21.38	✓	20.51	✓	21.38	✓	21.38
reset password	✓	22.70	✓	21.69	✓	19.91	✓	22.70
show me my message	✓	27.12	✓	27.54	✓	23.98	✓	27.54
show me the money	✓	27.94	✓	27.96	✓	25.75	✓	27.96
start recording	✓	27.03	✓	20.42	✓	19.87	✓	27.19
tell me a story	✓	27.89	✓	27.89	✓	24.34	✓	27.94
turn off the fan	✓	19.98	✓	15.25	✓	19.98	✓	19.98
turn on the TV	✓	25.55	✓	17.71	✓	17.18	✓	25.55
watch TV	✓	26.32	✓	26.32	✓	19.20	✓	26.32
what time is it	✓	26.22	✓	25.39	✓	25.39	✓	28.20
where is my car	✓	27.24	✓	23.28	✓	25.76	✓	27.24
Average	20/20	24.75	20/20	23.99	20/20	22.35	20/20	26.45

5.8. Evaluation on a Larger Command Set

To thoroughly evaluate the effectiveness of ZQ-Attack, we conduct experiments using a larger command set. This set includes a total of 20 commands: ask me a question, clear notification, close the shades find a hotel, good morning, I have a secret to tell you, I need help, open the box, read a book, record a video, reset password, show me my message, show me the money, start recording, tell me a story, turn off the fan, turn on the TV, watch TV, what time is it where is my car.

We choose a carrier audio and generate audio adversarial examples for each of these target commands. As illustrated in Table 7, ZQ-Attack can generate effective audio adversarial examples for each target command in this set, achieving an average SRoA of 100%. On Azure, Tencent, Alibaba, and OpenAI, ZQ-Attack attains average SNR values of 24.75dB, 23.99dB, 22.35dB, and 26.45dB, respectively.

5.9. Evaluation on Whisper large-v3

Whisper (OpenAI, 2023a) is a popular ASR system trained on a diverse audio dataset comprising 680,000 hours of audio, achieving state-of-the-art multilingual recognition performance. Evaluating ZQ-Attack on Whisper can more comprehensively demonstrate its effectiveness on state-of-the-art ASR systems. Since the models of Whisper come in diverse sizes, we also conduct experiments on these various models. In Section 5.5, we evaluate ZQ-Attack utilizing the base, small, medium, and large models as open-source ASRs. As the OpenAI API employs the large-v2 model, we have evaluated the performance of ZQ-Attack on the large-v2 model in Section 5.3.

Recently, OpenAI introduced the latest large-v3 model (OpenAI, 2023b), which surpasses the performance of the large-v2 model. We have also evaluated ZQ-Attack on the latest model. As indicated in Table 8, ZQ-Attack can successfully generate effective adversarial examples for all 10 target commands, achieving an average SNR of 23.49dB. These results demonstrate the capability of ZQ-Attack to generate effective and imperceptible audio adversarial examples on the most advanced ASR systems.

Table 8. Evaluation on the latest Whisper large-v3.

Command	Whisper large-v3
Command	Attack	SNR (dB)
call my wife	✓	21.28
make it warmer	✓	17.31
navigate to my home	✓	24.35
open the door	✓	24.46
open the website	✓	26.75
play music	✓	24.76
send a text	✓	24.54
take a picture	✓	27.23
turn off the light	✓	26.29
turn on airplane mode	✓	17.95
Average	10/10	23.49

6. Related Work

Table 9. SRoA of ZQ-Attack on online speech recognition services with defenses.

Defenses	Setting	Azure	Tencent	Alibaba	OpenAI	Average
Local Smoothing	$h=1$	10/10	10/10	10/10	10/10	10/10
	$h=2$	10/10	10/10	10/10	10/10	10/10
	$h=3$	10/10	10/10	9/10	10/10	9.75/10
Downsampling	$f_{low}=$ 14kHz	10/10	10/10	10/10	10/10	10/10
	$f_{low}=$ 12kHz	10/10	10/10	9/10	9/10	9.5/10
	$f_{low}=$ 10kHz	10/10	10/10	9/10	9/10	9.5/10
Temporal Dependency	$k=0.2$	10/10	10/10	10/10	10/10	10/10
	$k=0.5$	10/10	10/10	10/10	10/10	10/10
	$k=0.8$	10/10	10/10	10/10	10/10	10/10
MVP-EARS	$m=2$	10/10	10/10	10/10	10/10	10/10
	$m=3$	10/10	10/10	10/10	10/10	10/10
	$m=4$	10/10	10/10	10/10	10/10	10/10

Audio Adversarial Attacks on White-box ASR Systems. In recent years, extensive studies have focused on audio adversarial attacks on ASR systems. Carlini et al. (Carlini and Wagner, 2018) were the first to generate targeted audio adversarial examples on the white-box DeepSpeech model. Concurrently, Commandersong (Yuan et al., 2018) demonstrated successful attacks on the Kaldi model by embedding malicious commands into songs. This approach also enabled physical attacks but had stringent limitations, such as requiring specific speakers and recording devices. Qin et al. (Qin et al., 2019) and Schönherr et al. (Schönherr et al., 2019) contributed to improving the imperceptibility of generated adversarial examples by incorporating the psychoacoustic model into the audio adversarial example generation process.

Audio Adversarial Attacks on Black-box ASR Systems. Despite the success of the aforementioned methods in generating audio adversarial examples on white-box ASR systems, their reliance on the gradient information of the target model limits their applicability for attacking black-box ASR systems. To address this challenge, Taori et al. (Taori et al., 2019) proposed employing gradient estimation and genetic algorithms to achieve black-box attacks, but their method had a relatively low attack success rate. Following this work, SGEA (Wang et al., 2020) improved the attack success rate and reduced the number of queries by employing selective gradient estimation techniques. Nevertheless, generating a single audio adversarial example still required approximately 100,000 queries. Devil’s Whisper (Chen et al., 2020) significantly improved the attack success rate by using a surrogate model. However, it relied on confidence scores returned by the target ASR system, which often be unavailable in real-world scenarios. Occam (Zheng et al., 2021) employed cooperative co-evolution and the CMA evolution strategy (Hansen, 2016), eliminating the need for confidence scores. Recently, KENKU (Wu et al., 2023) optimized the acoustic feature loss based on MFCC and imperceptibility loss simultaneously to generate relatively stealthy audio adversarial examples on black-box ASR systems. Although these methods have achieved improved performance, their dependence on querying the target system continues to limit their practicality.

To generate audio adversarial examples without the need for queries, researchers have proposed transfer-based attacks that can generate adversarial examples capable of attacking different models. NI-Occam (Zheng et al., 2021) utilized fine-tuned Kaldi models to exclusively launch attacks on IVC devices that are more sensitive to commands, but its attack success rate remains relatively low. TransAudio (Qi et al., 2023) presented a two-stage framework and a score-matching-based optimization strategy to achieve word-level adversarial attack, but its target transcription is constrained by the carrier audio. In the field of image adversarial attacks, previous work (Brown et al., 2017; Liu et al., 2017) proposed utilizing the ensemble method to improve the transferability of adversarial examples. Inspired by this, we design a sequential ensemble optimization algorithm to generate adversarial examples using diverse surrogate ASRs.

7. Discussion

7.1. Defenses against ZQ-Attack

We evaluate ZQ-Attack with several state-of-the-art defense methods against audio adversarial attacks, including local smoothing, downsampling, temporal dependency (Yang et al., 2019), and MVP-EARS (Zeng et al., 2019). The results are presented in Table 9.

Local Smoothing. Local smoothing renders audio adversarial attacks ineffective by applying a sliding window with a median filter to the adversarial examples. Given the length of the sliding window, denoted as $h$ , the value of an audio sample point is replaced by the average values of itself and the $h$ sample points before and after it. To evaluate the robustness of ZQ-Attack against local smoothing, we set $h$ to 1, 2, and 3, respectively. The results in Table 9 show that local smoothing has minimal impact on ZQ-Attack. Across different settings, ZQ-Attack consistently achieves an SRoA higher than 97.5%.

Downsampling. This method involves downsampling the original audio to a lower sampling rate $f_{low}$ and subsequently upsampling it to the original sampling rate, which causes a loss of high-frequency information from the original audio. Therefore, if the high-frequency information in the adversarial perturbation is lost, the attack might fail. To assess the robustness of ZQ-Attack against downsampling, we set $f_{low}$ to 14kHz, 12kHz, and 10kHz. The results in Table 9 show that ZQ-Attack has great robustness to downsampling, achieving an SRoA higher than 95% under different settings.

Temporal Dependency. The inherent temporal dependency in audio data can be leveraged to detect audio adversarial examples (Yang et al., 2019). Specifically, audio adversarial examples can be identified by comparing the transcription of the first $k$ part of the audio with the first $k$ part of the transcription of the entire audio, where $k$ is a ratio between 0 and 1. If the consistency between them is low, the audio can be considered an adversarial example; otherwise, it is accepted as normal. To evaluate the robustness of ZQ-Attack against the temporal dependency-based defense, we set $k$ to 0.2, 0.5, and 0.8. The results in Table 9 demonstrate that the audio adversarial examples generated by ZQ-Attack exhibit strong resilience to temporal dependency, as all attacks are successful under different settings.

MVP-EARS (Zeng et al., 2019). MVP-EARS utilizes multiple ASR systems to detect audio adversarial examples. Due to the limited transferability of most prior audio adversarial example generation methods, different ASR systems may produce significantly different transcriptions for the same audio adversarial example. Therefore, multiple ASR systems can be used to transcribe the same audio. If their transcripts differ, the audio can be considered an adversarial example.

To evaluate the robustness of ZQ-Attack against MVP-EARS, we set $m$ to 2, 3, and 4, where $m$ is the number of ASR systems used. The results in Table 9 show that the audio adversarial examples generated by ZQ-Attack exhibit good robustness to MVP-EARS, with all target commands successfully attacked under different settings. This is attributed to the fact that ZQ-Attack does not generate audio adversarial examples customized for a specific ASR but crafts transferable adversarial examples on diverse surrogate ASRs.

7.2. Limitations and Future Work

While ZQ-Attack leverages diverse surrogate ASRs to achieve transferable audio adversarial attacks in the zero-query black-box setting, it incurs a higher computation cost (e.g., GPU memory). However, considering that the audio adversarial examples generated by ZQ-Attack are effective on multiple ASR systems and save the cost of queries, we consider that the additional cost is acceptable. Another limitation is that although the audio adversarial examples generated by ZQ-Attack exhibit better imperceptibility compared to prior work in the over-the-air setting, they may still be detected by humans. We leave enhancing the imperceptibility of audio adversarial examples in the over-the-air setting for future work.

8. Conclusion

We proposed ZQ-Attack, a transfer-based adversarial attack on ASR systems in the zero-query black-box scenario. By summarizing and categorizing the modern ASR systems, we first selected a diverse set of surrogate ASRs for generating adversarial examples. Then, we employed an adaptive search algorithm to initialize the adversarial perturbations with a scaled target command audio, ensuring its effectiveness and imperceptibility. Subsequently, we designed a novel sequential ensemble optimization algorithm to optimize the adversarial perturbations using the selected surrogate ASRs. Our experimental results indicate that ZQ-attack achieves successful attacks on 4 online speech recognition services and 16 open-source ASRs in the over-the-line setting and attacks 2 commercial IVC devices in the over-the-air setting. This demonstrates a significant improvement in the practicality of audio adversarial attacks compared to prior methods.

Acknowledgements.

We thank the anonymous reviewers for their helpful and valuable feedback. This work was partially supported by the National Key R&D Program of China 2023YFE0209800, NSFC under Grants U20B2049, U21B2018, 62302344, 62132011, 62161160337, and Shaanxi Province Key Industry Innovation Program 2021ZDLGY01-02.

References

(1)
Abdel-Hamid et al. (2014) Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. 2014. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on audio, speech, and language processing 22 (2014), 1533–1545.
Abdeljaber et al. (2017) Osama Abdeljaber, Onur Avci, Serkan Kiranyaz, Moncef Gabbouj, and Daniel J Inman. 2017. Real-time vibration-based structural damage detection using one-dimensional convolutional neural networks. Journal of Sound and Vibration 388 (2017), 154–170.
Abdullah et al. (2021) Hadi Abdullah, Kevin Warren, Vincent Bindschaedler, Nicolas Papernot, and Patrick Traynor. 2021. Sok: The faults in our asrs: An overview of attacks against automatic speech recognition and speaker identification systems. In Proc. of IEEE S&P. 730–747.
Alibaba (2023) Alibaba. 2023. Alibaba Cloud Intelligent Speech Interaction. https://www.alibabacloud.com/product/intelligent-speech-interaction.
Amazon (2023) Amazon. 2023. Alexa. https://www.alexa.com/.
Amodei et al. (2016) Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, **gliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. 2016. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proc. of ICML. 173–182.
Apple (2023) Apple. 2023. Siri. https://www.apple.com/siri/.
Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33 (2020), 12449–12460.
Brown et al. (2017) Tom B Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. 2017. Adversarial patch. CoRR abs/1712.09665 (2017).
Carlini and Wagner (2018) Nicholas Carlini and David Wagner. 2018. Audio adversarial examples: Targeted attacks on speech-to-text. In Proc. of IEEE SPW. 1–7.
Chen et al. (2020) Yuxuan Chen, Xue**g Yuan, Jiangshan Zhang, Yue Zhao, Shengzhi Zhang, Kai Chen, and XiaoFeng Wang. 2020. Devil’s Whisper: A general approach for physical adversarial attacks against commercial black-box speech recognition devices.. In Proc. of USENIX Security. 2667–2684.
Cohen et al. (2009) Israel Cohen, Yiteng Huang, **gdong Chen, Jacob Benesty, Jacob Benesty, **gdong Chen, Yiteng Huang, and Israel Cohen. 2009. Pearson correlation coefficient. Noise reduction in speech processing (2009), 1–4.
Demontis et al. (2019) Ambra Demontis, Marco Melis, Maura Pintor, Matthew Jagielski, Battista Biggio, Alina Oprea, Cristina Nita-Rotaru, and Fabio Roli. 2019. Why do adversarial attacks transfer? explaining transferability of evasion and poisoning attacks. In Proc. of USENIX Security. 321–338.
Dong et al. (2018b) Linhao Dong, Shuang Xu, and Bo Xu. 2018b. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In Proc. of IEEE ICASSP. 5884–5888.
Dong et al. (2018a) Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. 2018a. Boosting adversarial attacks with momentum. In Proc. of IEEE/CVF CVPR. 9185–9193.
Duan et al. (2022) Rui Duan, Zhe Qu, Shangqing Zhao, Leah Ding, Yao Liu, and Zhuo Lu. 2022. Perception-aware attack: Creating adversarial music via reverse-engineering human perception. In Proc. of ACM CCS. 905–919.
Gales et al. (2008) Mark Gales, Steve Young, et al. 2008. The application of hidden Markov models in speech recognition. Foundations and Trends® in Signal Processing 1 (2008), 195–304.
Gao et al. (2018) Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. In Proc. of IEEE SPW. 50–56.
Gelfand (2017) S.A. Gelfand. 2017. Hearing: An introduction to psychological and physiological acoustics. CRC Press.
Graves (2012) Alex Graves. 2012. Sequence transduction with recurrent neural networks. CoRR abs/1211.3711 (2012).
Graves et al. (2006) Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. of ICML. 369–376.
Graves et al. (2013) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In Proc. of IEEE ICASSP. 6645–6649.
Gulati et al. (2020) Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented transformer for speech recognition. In Proc. of INTERSPEECH. 5036–5040.
Guo et al. (2022) Hanqing Guo, Yuanda Wang, Nikolay Ivanov, Li Xiao, and Qiben Yan. 2022. SPECPATCH: Human-In-The-Loop Adversarial Audio Spectrogram Patch Attack on Speech Recognition. In Proc. of ACM CCS. 1353–1366.
Han et al. (2020) Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, and Yonghui Wu. 2020. ContextNet: Improving convolutional neural networks for automatic speech recognition with global context. In Proc. of INTERSPEECH. 3610–3614.
Hannun et al. (2014) Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep speech: Scaling up end-to-end speech recognition. CoRR abs/1412.5567 (2014).
Hansen (2016) Nikolaus Hansen. 2016. The CMA evolution strategy: A tutorial. CoRR abs/1604.00772 (2016).
Hermansky (1990) Hynek Hermansky. 1990. Perceptual linear predictive (PLP) analysis of speech. the Journal of the Acoustical Society of America 87 (1990), 1738–1752.
Higuchi et al. (2020) Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, and Tetsunori Kobayashi. 2020. Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict. In Proc. of INTERSPEECH. 3655–3659.
Hsu et al. (2021) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 3451–3460.
Huang et al. (2019) Qian Huang, Isay Katsman, Horace He, Zeqi Gu, Serge Belongie, and Ser-Nam Lim. 2019. Enhancing adversarial example transferability with an intermediate level attack. In Proc. of IEEE/CVF ICCV. 4733–4742.
Itakura (1975) Fumitada Itakura. 1975. Line spectrum representation of linear predictor coefficients of speech signals. The Journal of the Acoustical Society of America 57 (1975), S35–S35.
Karita et al. (2019a) Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al. 2019a. A comparative study on transformer vs rnn in speech applications. In Proc. of IEEE ASRU. 449–456.
Karita et al. (2019b) Shigeki Karita, Nelson Enrique Yalta Soplin, Shinji Watanabe, Marc Delcroix, Atsunori Ogawa, and Tomohiro Nakatani. 2019b. Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. In Proc. of INTERSPEECH. 1408–1412.
Kriman et al. (2020) Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, and Yang Zhang. 2020. Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions. In Proc. of IEEE ICASSP. 6124–6128.
Li et al. (2019) Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, and Ravi Teja Gadde. 2019. Jasper: An end-to-end convolutional neural acoustic model. In Proc. of INTERSPEECH. 71–75.
Li et al. (2020) **yu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, and Shujie Liu. 2020. On the comparison of popular end-to-end models for large scale speech recognition. In Proc. of INTERSPEECH. 1–5.
Liu et al. (2017) Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. 2017. Delving into transferable adversarial examples and black-box attacks. In Proc. of ICLR.
Majumdar et al. (2021) Somshubra Majumdar, Jagadeesh Balam, Oleksii Hrinchuk, Vitaly Lavrukhin, Vahid Noroozi, and Boris Ginsburg. 2021. Citrinet: Closing the gap between non-autoregressive and autoregressive end-to-end models for automatic speech recognition. CoRR abs/2104.01721 (2021).
Microsoft (2023) Microsoft. 2023. Microsoft Azure Speech Service. https://azure.microsoft.com/en-us/products/cognitive-services/speech-services/.
Nvidia (2023) Nvidia. 2023. NeMo. https://developer.nvidia.com/nemo/.
OpenAI (2023a) OpenAI. 2023a. Whisper. https://openai.com/research/whisper.
OpenAI (2023b) OpenAI. 2023b. Whisper large-v3. https://github.com/openai/whisper/.
Opitz and Maclin (1999) David Opitz and Richard Maclin. 1999. Popular ensemble methods: An empirical study. Journal of artificial intelligence research 11 (1999), 169–198.
Papernot et al. (2016) Nicolas Papernot, Patrick D. McDaniel, and Ian J. Goodfellow. 2016. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR abs/1605.07277 (2016).
Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proc. of NeruIPS. 8024–8035.
Qi et al. (2023) Gege Qi, Yuefeng Chen, Yao Zhu, Binyuan Hui, Xiaodan Li, Xiaofeng Mao, Rong Zhang, and Hui Xue. 2023. Transaudio: Towards the transferable adversarial audio attack via learning contextualized perturbations. In Proc. of IEEE ICASSP. 1–5.
Qin et al. (2019) Yao Qin, Nicholas Carlini, Garrison Cottrell, Ian Goodfellow, and Colin Raffel. 2019. Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. In Proc. of ICML. 5231–5240.
Rabiner (1989) Lawrence R Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (1989), 257–286.
Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In Proc. of ICML. 28492–28518.
Ravikumar et al. (2022) Deepak Ravikumar, Sangamesh D Kodge, Isha Garg, and Kaushik Roy. 2022. TREND: Transferability based robust ensemble design. IEEE Transactions on Artificial Intelligence (2022).
Schneider et al. (2019) Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. In Proc. of INTERSPEECH. 3465–3469.
Schönherr et al. (2019) Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa. 2019. Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. In Proc. of NDSS.
Sigurdsson et al. (2006) Sigurdur Sigurdsson, Kaare Brandt Petersen, and Tue Lehn-Schiøler. 2006. Mel Frequency Cepstral Coefficients: An Evaluation of Robustness of MP3 Encoded Music.. In ISMIR. 286–289.
Spearman (1987) Charles Spearman. 1987. The proof and measurement of association between two things. The American journal of psychology 100 (1987), 441–471.
Szegedy et al. (2014) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In Proc. of ICLR.
Taori et al. (2019) Rohan Taori, Amog Kamsetty, Brenton Chu, and Nikita Vemuri. 2019. Targeted adversarial examples for black box audio systems. In Proc. of IEEE SPW. 15–20.
Tencent (2023) Tencent. 2023. Tencent Cloud Automatic Speech Recognition. https://cloud.tencent.com/document/product/1093/.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proc. of NeruIPS. 5998–6008.
Wang et al. (2020) Qian Wang, Baolin Zheng, Qi Li, Chao Shen, and Zhongjie Ba. 2020. Towards query-efficient adversarial attacks against automatic speech recognition systems. IEEE Transactions on Information Forensics and Security 16 (2020), 896–908.
Wu et al. (2018) Lei Wu, Zhanxing Zhu, Cheng Tai, and Weinan E. 2018. Understanding and enhancing the transferability of adversarial examples. CoRR abs/1802.09707 (2018).
Wu et al. (2023) Xinghui Wu, Shiqing Ma, Chao Shen, Chenhao Lin, Qian Wang, Qi Li, and Yuan Rao. 2023. KENKU: Towards efficient and stealthy black-box adversarial attacks against ASR systems. In Proc. of USENIX Security. 247–264.
Yang et al. (2019) Zhuolin Yang, Bo Li, Pin-Yu Chen, and Dawn Song. 2019. Characterizing audio adversarial examples using temporal dependency. In Proc. of ICLR.
Yu et al. (2023) Zhiyuan Yu, Yuanhaur Chang, Ning Zhang, and Chaowei Xiao. 2023. $\{$ SMACK $\}$ : Semantically meaningful adversarial audio attack. In Proc. of USENIX security. 3799–3816.
Yuan et al. (2018) Xue**g Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, XiaoFeng Wang, and Carl A. Gunter. 2018. CommanderSong: A systematic approach for practical adversarial voice recognition. In Proc. of USENIX Security. 49–64.
Zeng et al. (2019) Qiang Zeng, Jianhai Su, Chenglong Fu, Golam Kayas, Lannan Luo, Xiaojiang Du, Chiu C Tan, and Jie Wu. 2019. A multiversion programming inspired approach to detecting audio adversarial examples. In Proc. of IEEE/IFIP DSN. 39–51.
Zeyer et al. (2019) Albert Zeyer, Parnia Bahar, Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2019. A comparison of transformer and lstm encoder decoder models for asr. In Proc. of IEEE ASRU. 8–15.
Zhang et al. (2020) Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar. 2020. Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss. In Proc. of IEEE ICASSP. 7829–7833.
Zheng et al. (2021) Baolin Zheng, Peipei Jiang, Qian Wang, Qi Li, Chao Shen, Cong Wang, Yunjie Ge, Qingyang Teng, and Shenyi Zhang. 2021. Black-box adversarial attacks on commercial speech platforms with minimal information. In Proc. of ACM CCS. 86–107.

Table 10. Detailed results of evaluation on online speech recognition services.

Command	Azure		Tencent		Alibaba		OpenAI
Command	Attack	SNR (dB)	Attack	SNR (dB)	Attack	SNR (dB)	Attack	SNR (dB)
call my wife	✓	18.41	✓	19.88	✓	16.89	✓	17.44
make it warmer	✓	20.35	✓	18.18	✓	16.50	✓	15.41
navigate to my home	✓	20.85	✓	23.42	✓	20.65	✓	20.35
open the door	✓	21.63	✓	23.09	✓	20.73	✓	17.67
open the website	✓	21.86	✓	26.67	✓	23.94	✓	23.55
play music	✓	25.47	✓	26.51	✓	20.26	✓	20.97
send a text	✓	25.60	✓	23.28	✓	25.79	✓	23.18
take a picture	✓	26.85	✓	27.77	✓	25.00	✓	26.48
turn off the light	✓	25.08	✓	27.94	✓	26.16	✓	24.28
turn on airplane mode	✓	17.44	✓	17.79	✓	16.53	✓	16.75
Average	✓	22.35	✓	23.45	✓	21.24	✓	20.61

Appendix A Evaluation on Online Speech Recognition Services

The detailed evaluation results of ZQ-Attack on online speech recognition services are presented in Table 10. For the 10 target commands, ZQ-Attack can generate effective adversarial examples on 4 online speech recognition services. The results demonstrate that ZQ-Attack can successfully generate effective and imperceptible audio adversarial examples on these online speech recognition services.

Table 11. Detailed results of evaluation on commercial IVC devices.

Command	NI-Occam		KENKU		ZQ-Attack
Command	Siri	Alexa	Siri	Alexa	Siri	Alexa
call my wife	✓	✓	✓	✓	✓	✓
make it warmer	✗	✗	✗	✓	✓	✓
navigate to my home	✓	✓	✓	✓	✓	✓
open the door	✗	✗	✓	✓	✓	✓
open the website	✗	✓	✓	✓	✓	✓
play music	✓	✓	✓	✓	✓	✓
send a text	✗	✗	✗	✗	✓	✓
take a picture	✗	✗	✗	✓	✓	✓
turn off the light	✓	✓	✓	✓	✓	✓
turn on airplane mode	✗	✗	✓	✓	✓	✓
Average	4/10	5/10	7/10	9/10	10/10	10/10

Appendix B Evaluation on Commercial IVC Devices

The detailed evaluation results on commercial IVC devices are presented in Table 11. The results demonstrate that ZQ-Attack successfully attacks all 10 target commands, outperforming both KENKU and NI-Occam. Moreover, we observe that the effectiveness of these methods can vary significantly for different commands. For example, for the command play music, all methods successfully execute the attack. However, for the command make it warmer, both KENKU and NI-Occam fail to achieve success.

Appendix C Evaluation on Open-source ASRs

We introduce the transfer rate (TR) to evaluate the transferability of audio adversarial examples generated by ZQ-Attack to open-source ASR systems. For a specified target command and target open-source ASR system, the TR represents the success rate of the transfer attack, i.e., the ratio of adversarial examples within $X^{\prime}$ that successfully attack the target open-source ASR system. A higher TR indicates better transferability of the audio adversarial examples generated by ZQ-Attack to the target open-source ASR system. The results are presented in Table 12. ZQ-Attacks obtains an average TR of 65.19% and an SNR of 19.67dB on 16 open-source ASRs. These results indicate that ZQ-Attack can generate audio adversarial examples that effectively attack a diverse range of open-source ASRs.

Table 12. Evaluation on open-source ASRs.

Command	Jasper			QuartzNet			Citrinet (M)			Citrinet (L)
Command	Attack	TR (%)	SNR (dB)	Attack	TR (%)	SNR (dB)	Attack	TR (%)	SNR (dB)	Attack	TR (%)	SNR (dB)
call my wife	✓	31.73	9.97	✓	33.35	9.90	✓	24.08	11.60	✓	50.28	13.46
make it warmer	✓	18.21	9.43	✓	27.84	9.94	✓	12.50	9.64	✓	38.20	13.00
navigate to my home	✓	27.35	12.28	✓	15.35	11.52	✓	20.73	13.88	✓	38.42	14.71
open the door	✓	28.60	13.19	✓	23.30	12.96	✓	11.15	13.33	✓	27.19	14.54
open the website	✓	19.06	15.13	✓	12.63	14.72	✓	13.08	17.98	✓	23.32	15.59
play music	✓	38.72	14.04	✓	32.41	13.41	✓	29.91	16.20	✓	55.16	17.54
send a text	✓	44.28	18.50	✓	22.50	15.94	✓	2.58	15.69	✓	61.94	19.70
take a picture	✓	20.42	19.98	✓	16.25	17.99	✓	21.92	22.79	✓	36.07	22.07
turn off the light	✓	14.78	17.55	✓	11.85	16.36	✓	13.03	18.73	✓	26.31	20.66
turn on airplane mode	✓	7.27	5.82	✓	24.33	6.86	✓	10.59	6.87	✓	21.75	7.59
Average	10/10	25.04	13.59	10/10	21.98	12.96	10/10	15.96	14.67	10/10	37.86	15.89
Command	ContextNet (M)			ContextNet (L)			Conformer-CTC (M)			Conformer-CTC (L)
Command	Attack	TR (%)	SNR (dB)	Attack	TR (%)	SNR (dB)	Attack	TR (%)	SNR (dB)	Attack	TR (%)	SNR (dB)
call my wife	✓	75.56	15.37	✓	69.31	14.18	✓	100.00	23.06	✓	93.02	20.66
make it warmer	✓	62.58	13.03	✓	65.07	13.01	✓	91.76	23.17	✓	85.89	17.71
navigate to my home	✓	53.39	14.95	✓	56.53	16.71	✓	96.57	25.52	✓	80.47	19.88
open the door	✓	39.96	14.02	✓	50.64	15.40	✓	99.71	26.84	✓	94.56	25.33
open the website	✓	31.78	15.56	✓	49.42	17.73	✓	99.34	26.75	✓	87.89	25.05
play music	✓	46.42	14.78	✓	61.80	17.81	✓	99.05	26.84	✓	99.72	26.85
send a text	✓	49.05	19.44	✓	56.27	18.73	✓	93.02	25.37	✓	98.58	25.91
take a picture	✓	27.93	19.76	✓	37.87	21.81	✓	99.91	27.78	✓	97.82	27.78
turn off the light	✓	42.07	18.46	✓	55.41	20.58	✓	98.01	27.89	✓	96.97	27.84
turn on airplane mode	✓	67.30	11.96	✓	52.77	12.73	✓	98.66	18.06	✓	83.09	18.06
Average	10/10	49.60	15.73	10/10	55.51	16.87	10/10	97.60	25.13	10/10	91.80	23.51
Command	Conformer-CTC (XL)			Conformer-Transducer (M)			Conformer-Transducer (L)			Conformer-Transducer (XL)
Command	Attack	TR (%)	SNR (dB)	Attack	TR (%)	SNR (dB)	Attack	TR (%)	SNR (dB)	Attack	TR (%)	SNR (dB)
call my wife	✓	96.61	22.02	✓	97.99	22.99	✓	96.19	20.58	✓	83.21	20.10
make it warmer	✓	84.56	18.87	✓	97.82	24.77	✓	85.14	18.37	✓	43.34	18.48
navigate to my home	✓	97.54	25.47	✓	99.76	25.52	✓	84.51	22.05	✓	93.03	23.30
open the door	✓	91.13	25.90	✓	100.00	26.84	✓	54.29	16.05	✓	60.62	18.73
open the website	✓	87.18	24.17	✓	100.00	26.76	✓	41.65	23.85	✓	36.62	21.09
play music	✓	97.26	26.85	✓	98.77	26.85	✓	90.61	26.30	✓	59.51	22.27
send a text	✓	92.70	24.39	✓	98.94	25.87	✓	46.71	17.69	✓	92.18	24.76
take a picture	✓	99.76	27.78	✓	99.84	27.78	✓	69.12	23.11	✓	60.07	23.96
turn off the light	✓	76.09	24.81	✓	99.72	27.95	✓	96.35	27.84	✓	68.76	24.02
turn on airplane mode	✓	75.66	15.61	✓	100.00	18.06	✓	20.77	10.48	✓	34.06	14.08
Average	10/10	89.85	23.59	10/10	99.28	25.34	10/10	68.53	20.63	10/10	63.14	21.08
Command	Whisper (base)			Whisper (small)			Whisper (medium)			Whisper (large)
Command	Attack	TR (%)	SNR (dB)	Attack	TR (%)	SNR (dB)	Attack	TR (%)	SNR (dB)	Attack	TR (%)	SNR (dB)
call my wife	✓	58.90	13.55	✓	87.73	17.94	✓	91.34	19.80	✓	89.68	18.73
make it warmer	✓	69.28	13.57	✓	78.24	15.00	✓	80.67	18.76	✓	81.56	17.89
navigate to my home	✓	53.57	15.14	✓	77.49	19.11	✓	85.16	22.64	✓	91.71	25.44
open the door	✓	72.91	18.62	✓	75.41	19.59	✓	88.02	24.02	✓	81.69	22.44
open the website	✓	63.85	20.54	✓	78.41	22.53	✓	89.61	25.15	✓	94.93	26.76
play music	✓	65.24	16.91	✓	81.73	23.04	✓	88.06	26.81	✓	91.75	26.85
send a text	✓	55.08	19.94	✓	66.03	19.53	✓	77.04	25.59	✓	89.01	25.83
take a picture	✓	84.77	26.71	✓	84.91	27.54	✓	97.66	27.78	✓	99.15	27.78
turn off the light	✓	64.04	22.09	✓	86.75	25.97	✓	95.12	26.45	✓	97.20	27.61
turn on airplane mode	✓	66.78	11.70	✓	92.72	17.32	✓	96.96	16.92	✓	99.18	18.06
Average	10/10	65.44	17.88	10/10	80.94	20.76	10/10	88.97	23.39	10/10	91.58	23.74

In the evaluation, we observe a potential correlation between the performance of the target ASR system (e.g., WER on LibriSpeech test-other) and the TR. To further analyze this correlation, we extend the set of 16 open-source ASRs to include Conformer-Transducer (XXL), Whisper (tiny), Whisper large-v2, and Whisper large-v3. Additionally, we categorize them into two groups: Whisper and Others. This categorization stems from the fact that Whisper is designed for multilingual recognition and has not undergone fine-tuning on the LibriSpeech dataset, unlike other open-source ASRs, which are tailored for English recognition tasks and include the LibriSpeech dataset as part of their training set.

The potential correlation is depicted in Figure 7, with Whisper and other open-source ASRs represented by different colors in the scatter plot. It is discernible that there exists a potential negative correlation between the performance of the target ASR system and the TR. For quantitative analysis, we employ the Pearson correlation coefficient (Cohen et al., 2009) and Spearman’s rank correlation coefficient (Spearman, 1987) to characterize this correlation. In statistics, the Pearson correlation coefficient measures the linear correlation between two sets of data, while Spearman’s rank correlation coefficient assesses the correlation of monotonic relationships. Their values range from -1 to 1, with closer proximity to 1 indicating a stronger positive correlation, closer to -1 indicating a stronger negative correlation, and closer to 0 suggesting a weaker correlation. The results are presented in Table 13. They reveal a significant negative correlation between the performance and the TR for both categories of open-source ASRs.

We also conduct an ablation study to evaluate the impact of acoustic feature loss. Evaluations are carried out on 10 target commands, and 16 open-source ASR systems. There are 4 instances of attack failure (4/160) when the acoustic feature loss is omitted from the loss function.

Table 13. Correlation coefficient between ASR performance and TR.

Coefficient	Whisper	Others
Pearson Correlation	-0.933	-0.758
Spearman’s Rank Correlation	-0.607	-0.580

Appendix D Adaptive search algorithm

For a specific target command audio and carrier audio, we illustrate the values of scaling factor $\mu$ corresponding to different padding lengths in Figure 8. The starting position represents the length of the padding on the left side. As the adaptive search algorithm is related to the carrier audio, we present the waveform of the initialized adversarial examples for the same target command audio under different carrier audios in Figure 9. We can see that, for different carrier audios, the adaptive initialization algorithm finds distinct padding lengths on each side and scaling factors to initialize the adversarial perturbation.

Additionally, in our adaptive search algorithm, the obtained $\mu$ is a scalar. We also investigate treating $\mu$ as an optimizable variable. Adversarial examples initialized using these two search methods and optimized through the sequential ensemble optimization algorithm exhibit comparable stealthiness and effectiveness. However, treating $\mu$ as an optimizable variable results in greater time overhead. Therefore, we choose to search for $\mu$ as a scalar.