Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems

Zheng Fang Wuhan UniversityWuhanChina [email protected] Tao Wang Wuhan UniversityWuhanChina [email protected] Lingchen Zhao Wuhan UniversityWuhanChina [email protected] Shenyi Zhang Wuhan UniversityWuhanChina [email protected] Bowen Li Wuhan UniversityWuhanChina [email protected] Yunjie Ge Wuhan UniversityWuhanChina [email protected] Qi Li Tsinghua UniversityBei**gChina [email protected] Chao Shen Xi’an Jiaotong UniversityXi’anChina [email protected]  and  Qian Wang Wuhan UniversityWuhanChina [email protected]
Abstract.

In recent years, extensive research has been conducted on the vulnerability of ASR systems, revealing that black-box adversarial example attacks pose significant threats to real-world ASR systems. However, most existing black-box attacks rely on queries to the target ASRs, which is impractical when queries are not permitted. In this paper, we propose ZQ-Attack, a transfer-based adversarial attack on ASR systems in the zero-query black-box setting. Through a comprehensive review and categorization of modern ASR technologies, we first meticulously select surrogate ASRs of diverse types to generate adversarial examples. Following this, ZQ-Attack initializes the adversarial perturbation with a scaled target command audio, rendering it relatively imperceptible while maintaining effectiveness. Subsequently, to achieve high transferability of adversarial perturbations, we propose a sequential ensemble optimization algorithm, which iteratively optimizes the adversarial perturbation on each surrogate model, leveraging collaborative information from other models. We conduct extensive experiments to evaluate ZQ-Attack. In the over-the-line setting, ZQ-Attack achieves a 100% success rate of attack (SRoA) with an average signal-to-noise ratio (SNR) of 21.91dB on 4 online speech recognition services, and attains an average SRoA of 100% and SNR of 19.67dB on 16 open-source ASRs. For commercial intelligent voice control devices, ZQ-Attack also achieves a 100% SRoA with an average SNR of 15.77dB in the over-the-air setting.

Speech recognition; adversarial attacks; zero-query; transferability
To appear in the Proceedings of The ACM Conference on Computer and Communications Security (CCS), 2024

1. Introduction

Table 1. Summary of existing audio adversarial attacks.
  Method Setting Knowledge Target ASR 11footnotemark: 1 Over-the-line 22footnotemark: 2 Over-the-air Queries 33footnotemark: 3
  Carlini et al. (Carlini and Wagner, 2018) White-box Gradient \square 100% - similar-to\sim1000
Commandersong (Yuan et al., 2018) White-box Gradient \square 100% similar-to\sim80% similar-to\sim1000
Taori et al. (Taori et al., 2019) Black-box Prediction score \square similar-to\sim40% - similar-to\sim300,000
SGEA (Wang et al., 2020) Black-box Prediction score \square 100% - similar-to\sim100,000
Devil’s Whisper (Chen et al., 2020) Black-box Confidence score \triangle\star△ ⋆ similar-to\sim60% similar-to\sim50% similar-to\sim1500
Occam (Zheng et al., 2021) Black-box Final decision \square\triangle□ △ 100% - similar-to\sim10,000
KENKU (Wu et al., 2023) Black-box None \triangle\star△ ⋆ similar-to\sim70% similar-to\sim80% >>>0
NI-Occam (Zheng et al., 2021) Zero-query black-box 44footnotemark: 4 None \star - similar-to\sim50% 0
TransAudio (Qi et al., 2023) Zero-query black-box None \square\triangle□ △ similar-to\sim30% - 0
ZQ-Attack Zero-query black-box None \square\triangle\star□ △ ⋆ 100% 100% 0
 

Note that, (i) 11footnotemark: 1: We use \square, \triangle, \star to represent open-source ASRs, online speech recognition services, and commercial IVC devices, respectively. (ii) 22footnotemark: 2: In the over-the-line and over-the-air settings, we employ ‘-’ to represent unsuccessful attacks. In cases of successful attacks, the success rate of attack is presented. (iii) 33footnotemark: 3: “Queries” refer to the number of queries made to the target ASR system during the generation process. (iv) 44footnotemark: 4: In contrast to the black-box setting, the zero-query black-box setting prohibits any queries to the target ASR system during the generation process.

Automatic speech recognition (ASR) techniques, which convert spoken language into text, play a crucial role in modern human-computer interactions. These techniques are now widely employed in online speech recognition services (Microsoft, 2023; OpenAI, 2023a) provided by major companies, including Microsoft, OpenAI, etc. Online repositories also offer a plethora of open-source ASRs for public utilization. Furthermore, ASRs have also been integrated into commercial intelligent voice control (IVC) devices, such as Apple Siri (Apple, 2023) and Amazon Alexa (Amazon, 2023), enabling users to perform various tasks via voice commands. Unfortunately, similar to other deep neural networks (DNNs) based systems, modern ASRs are vulnerable to adversarial attacks (Szegedy et al., 2014; Carlini and Wagner, 2018; Gao et al., 2018). Attackers can introduce small perturbations into audio samples, causing the target ASR system to produce incorrect results.

Audio Adversarial Attacks. Recently, numerous studies have investigated the practicality and effectiveness of audio adversarial attacks on ASR systems, as summarized in Table 1. Depending on the accessibility of the attacker to the target ASR systems, audio adversarial attacks can be categorized into white-box and black-box attacks. The initial works (Carlini and Wagner, 2018; Yuan et al., 2018) employ gradient descent algorithms to generate adversarial audios in the white-box setting, where attackers have full access to the internal information of the target ASR systems. However, in real-world scenarios, attackers typically lack access to the internal information of the target system, making these white-box attacks often impractical. To achieve attacks in black-box settings, several methods have been proposed to generate adversarial examples based on the limited information acquired through queries to the target ASR system (Taori et al., 2019; Chen et al., 2020; Zheng et al., 2021; Wu et al., 2023). However, these methods require huge financial costs and time investments for the queries to generate a single adversarial example, and the large number of highly similar queries makes these attacks easily detectable by the target ASR system, rendering them impractical.

Therefore, to further enhance the practicality of adversarial attacks, researchers have increasingly turned their attention to transfer-based attacks, which can generate adversarial examples effectively across different target systems, thereby eliminating the need for queries. However, the performance of existing transfer-based audio adversarial example attacks also exhibits significant limitations. For instance, NI-Occam (Zheng et al., 2021) generates audio adversarial examples on fine-tuned Kaldi models to attack IVC devices, but its attack success rate is quite limited. TransAudio (Qi et al., 2023) optimizes adversarial examples on a surrogate ASR model and can successfully attack black-box ASR systems with similar architectures to the surrogate model. However, it only achieves word-level modifications to the original transcription and has a low success rate on online speech recognition services. These results indicate that current transfer-based attacks still possess limited transferability.

Consequently, we propose the following question: How to generate audio adversarial examples with high transferability in the challenging zero-query black-box setting?

ZQ-Attack. Our answer to this question is ZQ-Attack, a transfer-based adversarial attack on black-box ASR systems without the need for queries. Inspired by the ensemble method (Opitz and Maclin, 1999; Liu et al., 2017; Brown et al., 2017), our core idea is to optimize the adversarial perturbation on diverse surrogate ASRs. Ideally, adversarial perturbations optimized concurrently on multiple different surrogate models should contain features that can be captured by these models, making them effective against various ASR systems.

Specifically, ZQ-Attack consists of three stages: surrogate ASRs selection, perturbation initialization, and sequential ensemble optimization. For the surrogate ASRs selection stage, we conduct an extensive survey of modern ASR systems and observe that different types of ASR systems utilize distinct acoustic models. Therefore, we first need to select multiple different types of surrogate ASRs. Then, to ensure that the generated adversarial perturbations are effective across these surrogate ASRs while maintaining high stealthiness, we propose an adaptive search algorithm that uses scaled target commands to initialize the adversarial perturbation, instead of using zeros or Gaussian noise as in existing methods. Following the initialization, ZQ-Attack employs a sequential ensemble optimization algorithm to optimize the adversarial perturbations on the sequence of diverse surrogate ASRs collaboratively. This sequential ensemble algorithm allows for the optimization of adversarial perturbations on each surrogate ASR while concurrently leveraging information from prior surrogate ASRs. Consequently, the generated adversarial perturbations are effective not only for the current surrogate ASR but also for the previous ASRs in the sequence.

We conduct extensive experiments in both over-the-line and over-the-air settings to validate the effectiveness and imperceptibility of our ZQ-Attack. In the over-the-line setting, ZQ-Attack achieves an average success rate of attack (SRoA) of 100% and signal-to-noise ratio (SNR) of 21.91dB on four online speech recognition services. Additionally, ZQ-Attack attains an average SRoA of 100% and SNR of 19.67dB on 16 open-source ASRs. In the over-the-air setting, ZQ-Attack achieves an average SRoA of 100% and an SNR of 15.77dB on two commercial IVC devices. These results demonstrate that ZQ-Attack can successfully generate audio adversarial examples with high transferability, effectively targeting various ASR systems without requiring any queries.

Contributions. Our contributions are summarized as follows:

  • To the best of our knowledge, we are the first to generate audio adversarial examples with high transferability on ASR systems in the most challenging zero-query black-box setting. Our method is effective in both over-the-line and over-the-air settings, with the target ASR systems encompassing online speech recognition services, open-source ASRs, and commercial IVC devices, showcasing remarkable practicality.

  • We introduce ZQ-Attack, a zero-query, transfer-based adversarial attack on black-box ASR systems. This approach optimizes adversarial perturbations using a set of diverse surrogate ASRs simultaneously, thereby enhancing their transferability. Furthermore, we develop an adaptive adversarial perturbation initialization method based on the target command audio to improve the imperceptibility.

  • We conduct comprehensive experiments to evaluate the performance of ZQ-Attack on online speech recognition services, open-source ASRs, and commercial IVC devices. The experimental results demonstrate the superior performance of our method. ZQ-Attack can successfully attack all the target ASR systems without queries while achieving an average SRoA of 100%.

2. Background

2.1. Automatic Speech Recognition

ASR systems can automatically transcribe input audio into the corresponding transcriptions. As shown in Figure 1, a typical ASR system comprises three components: pre-processing, acoustic model, and decoder.

  • Pre-processing.  Given an input audio with n𝑛nitalic_n sample points, denoted as xn𝑥superscript𝑛x\in\mathbb{Z}^{n}italic_x ∈ blackboard_Z start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, an ASR system first normalizes it to the range of [1,1]nsuperscript11𝑛{[-1,1]}^{n}[ - 1 , 1 ] start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and applies low/high-pass filters to remove frequencies and segments beyond the range of human hearing. Then, the ASR system employs time-frequency transformation techniques, such as the short-time Fourier transform (STFT), to convert the time-domain signal x𝑥xitalic_x into the frequency domain spectrogram ST×F𝑆superscript𝑇𝐹S\in\mathbb{R}^{T\times F}italic_S ∈ blackboard_R start_POSTSUPERSCRIPT italic_T × italic_F end_POSTSUPERSCRIPT, where T𝑇Titalic_T and F𝐹Fitalic_F denote the number of frames and frequency bins, respectively. In the subsequent steps, traditional ASR systems utilize acoustic feature extraction algorithms, such as mel-frequency cepstral coefficients (MFCC) (Sigurdsson et al., 2006), linear predictive coefficient (LPC) (Itakura, 1975), or perceptual linear predictive (PLP) (Hermansky, 1990), to further transform S𝑆Sitalic_S into meticulously designed acoustic features. In contrast, modern DNN-based ASR systems directly utilize S𝑆Sitalic_S.

  • Acoustic model.  The acoustic model is typically a machine learning model that maps the spectrogram or extracted acoustic features to an intermediate representation. Traditional ASR systems use hidden Markov models (HMMs) and Gaussian mixture models (GMMs) (Gales et al., 2008; Rabiner, 1989), while modern ASR systems employ DNNs, such as convolutional neural networks (CNNs) and Transformers (Gulati et al., 2020; Li et al., 2019).

  • Decoder.  The decoder uses the intermediate representation to predict tokens and generate corresponding transcriptions. A token is the smallest unit of the transcription, and the set of all possible tokens constitutes a vocabulary V𝑉Vitalic_V. The vocabulary V𝑉Vitalic_V varies across different ASR systems. For example, V={a,b,,z,space}𝑉𝑎𝑏𝑧𝑠𝑝𝑎𝑐𝑒V=\{a,b,\cdots,z,space\}italic_V = { italic_a , italic_b , ⋯ , italic_z , italic_s italic_p italic_a italic_c italic_e } is a simple vocabulary for an ASR system recognizing English.

Refer to caption
Figure 1. The architecture of a typical ASR system.

2.2. Audio Adversarial Attacks

Audio adversarial attacks aim to manipulate the output of ASR systems using audio adversarial examples constructed by adding imperceptible perturbations to benign carrier audios (Carlini and Wagner, 2018; Yuan et al., 2018; Zheng et al., 2021; Chen et al., 2020; Wu et al., 2023). Depending on the objective of the attack, audio adversarial attacks can be categorized into targeted and untargeted attacks.

Formally, let f(x):xy:𝑓𝑥𝑥𝑦f(x):x\to yitalic_f ( italic_x ) : italic_x → italic_y denote an ASR system that transcribes an audio x𝑥xitalic_x into the transcription y=f(x)𝑦𝑓𝑥y=f(x)italic_y = italic_f ( italic_x ), and let x=x+δsuperscript𝑥𝑥𝛿x^{\prime}=x+\deltaitalic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = italic_x + italic_δ denote the adversarial example constructed by adding the adversarial perturbation δ𝛿\deltaitalic_δ to the carrier audio x𝑥xitalic_x. An untargeted adversarial attack aims to mislead the target ASR system, causing it to produce any result other than the ground truth, represented as f(x)y𝑓superscript𝑥𝑦f(x^{\prime})\neq yitalic_f ( italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ≠ italic_y. In contrast, a targeted adversarial attack aims to induce the output of the target ASR system into a specific target transcription tf(x)𝑡𝑓𝑥t\neq f(x)italic_t ≠ italic_f ( italic_x ), which is formulated as:

(1) f(x)=t,s.t.Dis(x,x)<ϵ,formulae-sequence𝑓superscript𝑥𝑡𝑠𝑡𝐷𝑖𝑠𝑥superscript𝑥italic-ϵf(x^{\prime})=t,\quad s.t.~{}Dis(x,x^{\prime})<\epsilon,italic_f ( italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = italic_t , italic_s . italic_t . italic_D italic_i italic_s ( italic_x , italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) < italic_ϵ ,

where Dis(x,x)𝐷𝑖𝑠𝑥superscript𝑥Dis(x,x^{\prime})italic_D italic_i italic_s ( italic_x , italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) represents the distance between x𝑥xitalic_x and xsuperscript𝑥x^{\prime}italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, commonly calculated using the Lpsubscript𝐿𝑝L_{p}italic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT norm, with p𝑝pitalic_p usually being 0, 2, or \infty. ϵitalic-ϵ\epsilonitalic_ϵ is a hyper-parameter that constrains this distance. As targeted adversarial attacks can naturally extend to untargeted attacks, the adversarial attacks discussed in the rest of this paper will specifically refer to the targeted ones unless explicitly specified otherwise.

Typically, the generation process of adversarial examples can be formulated as an optimization problem:

(2) minδ(x,δ,t,f)=a(x,δ,t,f)+cp(δ),subscript𝛿𝑥𝛿𝑡𝑓subscript𝑎𝑥𝛿𝑡𝑓𝑐subscript𝑝𝛿\min\limits_{\delta}{\mathcal{L}(x,\delta,t,f)=\mathcal{L}_{a}(x,\delta,t,f)+c% \cdot\mathcal{L}_{p}(\delta)},roman_min start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT caligraphic_L ( italic_x , italic_δ , italic_t , italic_f ) = caligraphic_L start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( italic_x , italic_δ , italic_t , italic_f ) + italic_c ⋅ caligraphic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ( italic_δ ) ,

where the adversarial loss asubscript𝑎\mathcal{L}_{a}caligraphic_L start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT measures the effectiveness of δ𝛿\deltaitalic_δ on the target ASR system f𝑓fitalic_f, and the imperceptibility loss psubscript𝑝\mathcal{L}_{p}caligraphic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT quantifies the imperceptibility of δ𝛿\deltaitalic_δ. The parameter c𝑐citalic_c acts as a weighting factor, balancing the effectiveness and imperceptibility of the attack. Gradient descent is a common method for solving this optimization problem. It can be formulated as:

(3) δclipϵ(δαδ(x,δ,t,f)),𝛿𝑐𝑙𝑖subscript𝑝italic-ϵ𝛿𝛼subscript𝛿𝑥𝛿𝑡𝑓\delta\leftarrow clip_{\epsilon}\left(\delta-\alpha\cdot\nabla_{\delta}% \mathcal{L}(x,\delta,t,f)\right),italic_δ ← italic_c italic_l italic_i italic_p start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( italic_δ - italic_α ⋅ ∇ start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT caligraphic_L ( italic_x , italic_δ , italic_t , italic_f ) ) ,

where α𝛼\alphaitalic_α represents the learning rate and δ(x,δ,t,f)subscript𝛿𝑥𝛿𝑡𝑓\nabla_{\delta}\mathcal{L}(x,\delta,t,f)∇ start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT caligraphic_L ( italic_x , italic_δ , italic_t , italic_f ) is the gradient of (x,δ,t,f)𝑥𝛿𝑡𝑓\mathcal{L}(x,\delta,t,f)caligraphic_L ( italic_x , italic_δ , italic_t , italic_f ) with respect to δ𝛿\deltaitalic_δ. The function clipϵ𝑐𝑙𝑖subscript𝑝italic-ϵclip_{\epsilon}italic_c italic_l italic_i italic_p start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT limits Dis(x,x)𝐷𝑖𝑠𝑥superscript𝑥Dis(x,x^{\prime})italic_D italic_i italic_s ( italic_x , italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) to a relatively small range controlled by ϵitalic-ϵ\epsilonitalic_ϵ.

3. Threat Model & Challenges

3.1. Threat Model

Refer to caption
Figure 2. Workflow of ZQ-Attack. ZQ-Attack is mainly divided into three stages: surrogate ASRs selection, perturbation initialization, and sequential ensemble optimization.

Goals.  The attacker aims to generate audio adversarial examples that can be recognized as the target transcription by the target ASR systems without any queries. In the over-the-line setting (i.e., digital attacks), the target systems are online speech recognition services or open-source ASRs. The waveform files of the audio adversarial examples are directly used as inputs, inducing the target ASR systems to produce the specified target transcriptions. In the over-the-air setting (i.e., physical attacks), the target ASR systems are commercial IVC devices. The adversarial examples should be misrecognized as the target transcription by these devices after being transmitted through the air. Additionally, the audio adversarial examples should be imperceptible, making them difficult for human ears to detect.

Knowledge & Capabilities.  Prior works on black-box adversarial attacks on ASR systems do not require the attacker to know internal information about the target ASR system, but they still assume that the attacker can interact with the target system. In this paper, we consider a more realistic and challenging scenario where the attacker also cannot query the target ASR system during the generation of adversarial examples. After generating the adversarial examples, the attacker can execute the attacks by uploading audio files to target ASR systems in the over-the-line setting and positioning a speaker near the target commercial IVC devices in the over-the-air setting. Additionally, we assume the attacker has the capability to train surrogate ASRs or obtain pre-trained surrogate ASRs from open-source repositories.

3.2. Challenges

To launch audio adversarial attacks in the challenging zero-query black-box setting, an alternative approach is the transfer-based attack. The basic idea of transfer-based attacks is to use a local surrogate model to generate adversarial examples that are then used to attack the target black-box model. Despite the demonstrated effectiveness of transfer-based attacks in the image domain (Papernot et al., 2016; Demontis et al., 2019; Wu et al., 2018; Liu et al., 2017), achieving similar success in the audio domain remains an unresolved issue. Abdullah et al. (Abdullah et al., 2021) reveal that the transferability of audio adversarial examples among different ASR systems is exceedingly limited, even when these ASRs share the same architecture. Existing attempts to achieve this goal have only demonstrated limited transferability (Zheng et al., 2021; Qi et al., 2023). As demonstrated in prior work (Huang et al., 2019; Wu et al., 2018; Ravikumar et al., 2022), the core reason for this limitation might be that adversarial examples tend to overfit the architecture and feature representations of the specific surrogate model, resulting in limited transferability to the target models with different architecture. Unlike image recognition models, ASR systems exhibit increased complexity and greater architectural diversity, leading to greater differences between surrogate ASRs and target ASRs. Hence, generating highly transferable adversarial examples in the audio domain is more challenging.

4. ZQ-Attack

4.1. Problem Formulation

ZQ-Attack aims to generate transferable audio adversarial examples in the zero-query black-box scenario. Formally, given a carrier audio x𝑥xitalic_x and a target command t𝑡titalic_t, ZQ-Attack optimizes the adversarial perturbation δ𝛿\deltaitalic_δ to enhance its effectiveness on various black-box target ASR systems. This optimization problem can be formulated as follows:

(4) maxδf(f(x+δ)=t),subscript𝛿subscript𝑓𝑓𝑥𝛿𝑡\max_{\delta}\mathop{\mathbb{P}}_{f\in\mathcal{F}}~{}\left(f(x+\delta)=t\right),roman_max start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT blackboard_P start_POSTSUBSCRIPT italic_f ∈ caligraphic_F end_POSTSUBSCRIPT ( italic_f ( italic_x + italic_δ ) = italic_t ) ,

where \mathbb{P}blackboard_P represents the probability, and \mathcal{F}caligraphic_F denotes the set of all black-box target ASR systems. However, since the attacker lacks internal information about the target ASR system and cannot query it, directly solving this optimization problem is challenging.

An intuitive way to solve this problem is leveraging surrogate ASRs. However, optimizing the adversarial perturbation on a single surrogate ASR may result in overfitting to that specific model. Therefore, ZQ-Attack optimizes adversarial perturbations on multiple surrogate ASRs. We denote the set of surrogate ASRs as 𝔽𝔽\mathbb{F}blackboard_F. Then, the optimization problem becomes as follows:

(5) minδall(x,δ,t,𝔽)s.t.Dis(x,x)<ϵ,subscript𝛿subscript𝑎𝑙𝑙𝑥𝛿𝑡𝔽s.t.𝐷𝑖𝑠𝑥superscript𝑥italic-ϵ\min_{\delta}\mathcal{L}_{all}(x,\delta,t,\mathbb{F})\quad\text{s.t.}~{}Dis(x,% x^{\prime})<\epsilon,roman_min start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_a italic_l italic_l end_POSTSUBSCRIPT ( italic_x , italic_δ , italic_t , blackboard_F ) s.t. italic_D italic_i italic_s ( italic_x , italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) < italic_ϵ ,

where allsubscript𝑎𝑙𝑙\mathcal{L}_{all}caligraphic_L start_POSTSUBSCRIPT italic_a italic_l italic_l end_POSTSUBSCRIPT denotes the loss on all surrogate ASRs, and the imposed constraint ensures that the optimized adversarial perturbation attains a specified level of imperceptibility.

4.2. Attack Overview

Refer to caption
Figure 3. Illustration of surrogate ASRs selection.

The core idea of ZQ-Attack is to collaboratively optimize the adversarial perturbation on diverse types of surrogate ASRs. Specifically, ZQ-Attack consists of three stages: surrogate ASRs selection, perturbation initialization, and sequential ensemble optimization. The workflow is illustrated in Figure 2.

Surrogate ASRs Selection.  We investigate modern ASR systems and categorize them into two main types: CNN-based and Transformer-based. While CNNs are more adept at capturing local features, Transformers excel at capturing global contexts. Therefore, an intuitive approach is to select surrogate ASRs that include both CNN-based and Transformer-based architectures, ensuring that the adversarial perturbations optimized on these surrogate ASRs concurrently possess both local and global features of the target command.

Perturbation Initialization.  Given a target command t𝑡titalic_t, we employ Text-to-Speech (TTS) techniques to generate a corresponding target command audio xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Subsequently, we initialize the adversarial perturbation with a scaled xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and superimpose it onto the carrier audio, ensuring that the constructed adversarial example is effective on all surrogate ASRs. To further enhance the imperceptibility of the initial adversarial perturbation, we employ an adaptive search algorithm to minimize the scaling factor.

Sequential Ensemble Optimization.  Following the perturbation initialization stage, ZQ-Attack employs a sequential ensemble optimization algorithm to collaboratively optimize the adversarial perturbation on the ordered set of surrogate ASRs. This algorithm consists of an inner loop and an outer loop. In each iteration, the ordered set of surrogate ASRs is randomly shuffled in the outer loop. Then, the sequential ensemble optimization takes place within the inner loop. For each surrogate ASR, this algorithm integrates collaborative information from all preceding surrogate ASRs in the ordered set, facilitating collaborative optimization. Additionally, we design a novel loss function for the optimization process to enhance the transferability and imperceptibility of the perturbation. Following the inner loop, the algorithm updates the perturbation and validates its effectiveness on all surrogate ASRs.

4.3. Surrogate ASRs Selection

The selection of surrogate ASRs is the foundation that ensures the effectiveness of ZQ-Attack. As described in Section 3.2, the architecture of ASR systems exhibits considerable diversity. Therefore, randomly selecting surrogate ASRs may fail to cover the mainstream modern ASR systems. Hence, in this subsection, we initially provide a summary and categorization of modern ASR systems, followed by the details of surrogate ASRs selection. The illustration of this stage is shown in Figure 3.

Summary and Categorization of Modern ASR Systems.  Modern ASR systems typically convert audio into the spectrogram without employing additional acoustic feature extraction algorithms (Kriman et al., 2020; Li et al., 2019; Han et al., 2020; Gulati et al., 2020; OpenAI, 2023a; Majumdar et al., 2021; Amodei et al., 2016; Hannun et al., 2014; Zhang et al., 2020). Hence, the primary differences among these ASR systems reside in the internal acoustic model. We summarize and categorize modern ASR systems into the following two categories from the perspective of the acoustic model:

  • CNN-based.  CNNs have found widespread application in the field of computer vision and have recently exhibited notable progress in ASR as well (Li et al., 2019; Kriman et al., 2020; Han et al., 2020; Majumdar et al., 2021; Abdel-Hamid et al., 2014; Abdeljaber et al., 2017; Schneider et al., 2019). The key advantages of CNNs include their model’s low complexity and high computational efficiency. Additionally, CNNS are adept at extracting local features within the spectrogram.

  • Transformer-based.  These ASR systems employ Transformers as their acoustic models. Given that audio is temporal data, using recurrent neural networks (RNNs) as acoustic models is an intuitive choice (Amodei et al., 2016; Hannun et al., 2014). While RNNs can capture short-term dependencies, attention mechanisms (Vaswani et al., 2017) enable non-contiguous frames to attend each other, allowing Transformers to capture long-term dependencies. This results in better speech recognition performances than RNNs (Zeyer et al., 2019; Karita et al., 2019a; Li et al., 2020). Consequently, Transformers are progressively supplanting RNNs in modern ASRs (Dong et al., 2018b; Zhang et al., 2020; Karita et al., 2019b; Higuchi et al., 2020; Gulati et al., 2020; Baevski et al., 2020; Hsu et al., 2021; Radford et al., 2023).

Selection of Surrogate ASRs.  The local ASRs for selection can be obtained from online sources or trained by the attackers themselves. While utilizing a single surrogate ASR can lead to the overfitting of δ𝛿\deltaitalic_δ to that surrogate ASR, rendering δ𝛿\deltaitalic_δ ineffective on target ASR systems, using too many surrogate ASRs can also lead to high computation costs. The diversity in architectures of ASRs may result in significant disparities between the surrogate ASRs and the target ASR systems, making it challenging to generate transferable adversarial examples. Therefore, we need to select surrogate ASRs encompassing modern ASR systems of different types.

According to our categorization, modern ASR systems can be mainly categorized into CNN-based and Transformer-based ASR systems. The CNNs excel in extracting local features but exhibit a comparatively weaker capability in capturing dynamic global contexts. Conversely, Transformers are proficient in effectively capturing global information but demonstrate a diminished ability to extract local features. To integrate the advantages of both CNN-based and Transformer-based ASR systems, the selected surrogate ASRs should encompass representatives from both categories. This ensures that the optimized adversarial perturbations can possess both locally and globally salient features, thereby enhancing their transferability to the target ASR systems. Furthermore, CNNs and Transformers represent prevalent architectures of acoustic models adopted in modern ASR systems. Therefore, we incorporate both CNN-based and Transformer-based ASRs into the surrogate ASRs to effectively cover a broad range of real-world ASR systems. Upon selecting K𝐾Kitalic_K surrogate ASRs from the local ASRs, we construct the set of these surrogate ASRs, denoted as 𝔽=[fj]j=1K𝔽superscriptsubscriptdelimited-[]subscript𝑓𝑗𝑗1𝐾\mathbb{F}=[f_{j}]_{j=1}^{K}blackboard_F = [ italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT, where fjsubscript𝑓𝑗f_{j}italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT represents the j𝑗jitalic_j-th surrogate ASR in 𝔽𝔽\mathbb{F}blackboard_F. Since the subsequent sequential ensemble optimization algorithm optimizes the adversarial perturbation on these surrogate ASRs in a sequential manner, 𝔽𝔽\mathbb{F}blackboard_F is an ordered set. It is worth noting that the surrogate ASRs are scalable. The quantity of surrogate ASRs can be adjusted flexibly depending on the computational resources of the attackers.

Algorithm 1 Adaptive Search (AdaSearch)
0:  Carrier audio x𝑥xitalic_x, Target command t𝑡titalic_t, Target command audio xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, Ordered set of K𝐾Kitalic_K surrogate ASRs 𝔽𝔽\mathbb{F}blackboard_F, Search stride s𝑠sitalic_s
0:  Initial adversarial perturbation δ𝛿\deltaitalic_δ
1:  lxlen(x),ltlen(xt)formulae-sequencesubscript𝑙𝑥𝑙𝑒𝑛𝑥subscript𝑙𝑡𝑙𝑒𝑛subscript𝑥𝑡l_{x}\leftarrow len(x),~{}l_{t}\leftarrow len(x_{t})italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ← italic_l italic_e italic_n ( italic_x ) , italic_l start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_l italic_e italic_n ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
2:  μ=+,δ0formulae-sequence𝜇𝛿0\mu=+\infty,~{}\delta\leftarrow 0italic_μ = + ∞ , italic_δ ← 0
3:  for i0𝑖0i\leftarrow 0italic_i ← 0 to  lxltsubscript𝑙𝑥subscript𝑙𝑡l_{x}-l_{t}italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - italic_l start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT do
4:     μi0subscript𝜇𝑖0\mu_{i}\leftarrow 0italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ← 0
5:     δi[0,,0i,xt,0,,0lxlti]subscript𝛿𝑖subscript00𝑖subscript𝑥𝑡subscript00subscript𝑙𝑥subscript𝑙𝑡𝑖\delta_{i}\leftarrow[\underbrace{0,\cdots,0}_{i},~{}x_{t}~{},\underbrace{0,% \cdots,0}_{l_{x}-l_{t}-i}]italic_δ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ← [ under⏟ start_ARG 0 , ⋯ , 0 end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , under⏟ start_ARG 0 , ⋯ , 0 end_ARG start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - italic_l start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_i end_POSTSUBSCRIPT ]
6:     while μiμsubscript𝜇𝑖𝜇\mu_{i}\leq\muitalic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≤ italic_μ do
7:        if f𝔽,f(x+μiδi)=tformulae-sequencefor-all𝑓𝔽𝑓𝑥subscript𝜇𝑖subscript𝛿𝑖𝑡\forall f\in\mathbb{F},~{}f(x+\mu_{i}\cdot\delta_{i})=t∀ italic_f ∈ blackboard_F , italic_f ( italic_x + italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_δ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_t then
8:           μ=μi,δ=μδiformulae-sequence𝜇subscript𝜇𝑖𝛿𝜇subscript𝛿𝑖\mu=\mu_{i},~{}\delta=\mu\cdot\delta_{i}italic_μ = italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_δ = italic_μ ⋅ italic_δ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT
9:           break
10:        end if
11:        μiμi+ssubscript𝜇𝑖subscript𝜇𝑖𝑠\mu_{i}\leftarrow\mu_{i}+sitalic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ← italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_s
12:     end while
13:  end for
14:  return  δ𝛿\deltaitalic_δ

4.4. Perturbation Initialization

The initialization of the adversarial perturbation δ𝛿\deltaitalic_δ may significantly impact the performance of the attack. Initializing δ𝛿\deltaitalic_δ from a point far from the region of the target command in the feature space can lead to a time-consuming and uncertain optimization process, and the high dimensionality of audio data further complicates the optimization. In contrast, using the target command audio directly as the initial δ𝛿\deltaitalic_δ may result in poor imperceptibility. To obtain an effective and relatively imperceptible initialized adversarial perturbation, we propose an adaptive search algorithm to initialize δ𝛿\deltaitalic_δ with a scaled target command audio, as presented in Alg. 1. This algorithm aims to minimize the scaling factor while maintaining the effectiveness of the initialized δ𝛿\deltaitalic_δ on all surrogate ASRs.

Specifically, we first choose an audio x𝑥xitalic_x as the carrier audio to construct the adversarial example x=x+δsuperscript𝑥𝑥𝛿x^{\prime}=x+\deltaitalic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = italic_x + italic_δ. Following previous works (Zheng et al., 2021; Yuan et al., 2018), we opt for songs as the carrier audio. For a given target command t𝑡titalic_t, we utilize TTS techniques to generate a corresponding target command audio xt=𝒯(t)subscript𝑥𝑡𝒯𝑡x_{t}=\mathcal{T}(t)italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = caligraphic_T ( italic_t ), where 𝒯()𝒯\mathcal{T}(\cdot)caligraphic_T ( ⋅ ) denotes the TTS process. To alleviate the impact of varying volume levels during the perturbation initialization stage, we normalize the values of sample points in both x𝑥xitalic_x and xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to the range of [0.5,0.5]0.50.5[-0.5,0.5][ - 0.5 , 0.5 ]. Subsequently, the adaptive search algorithm searches for the smallest value of scaling factor μ𝜇\muitalic_μ, and δ𝛿\deltaitalic_δ is initialized using the scaled target command audio μxt𝜇subscript𝑥𝑡\mu\cdot x_{t}italic_μ ⋅ italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, ensuring that the corresponding initial adversarial example is recognized by all surrogate ASRs as the target command.

Since the length of xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is typically shorter than that of x𝑥xitalic_x, it is necessary to pad both sides of the scaled xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT with zeros to initialize the perturbation. We use lxsubscript𝑙𝑥l_{x}italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT and ltsubscript𝑙𝑡l_{t}italic_l start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to denote the length of x𝑥xitalic_x and xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, respectively. The lengths of the padding on each side are indeterminate, provided their sum equals lxltsubscript𝑙𝑥subscript𝑙𝑡l_{x}-l_{t}italic_l start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - italic_l start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. The adaptive search algorithm searches for the optimal padding lengths on each side to minimize the scaling factor as much as possible. An example of this initialization method is depicted in Figure 4. It can be seen that the adaptive search algorithm finds the padding lengths and a relatively small scaling factor.

In summary, the adaptive search algorithm initializes δ𝛿\deltaitalic_δ by pushing the corresponding adversarial example toward the decision boundary of all surrogate ASRs, thereby circumventing the time-consuming and uncertain initial search process. The initialized adversarial perturbation can be regarded as a coarse-grained optimized perturbation, serving as a basis for subsequent fine-grained optimization in the sequential ensemble optimization stage.

Refer to caption
Figure 4. An example of the perturbation initialization. The adversarial perturbation is initialized using a scaled target command audio. The region of the added initialized adversarial perturbation is highlighted in red.
Refer to caption
Figure 5. Illustration of sequential ensemble optimization.

4.5. Sequential Ensemble Optimization

After initializing the adversarial perturbation, ZQ-Attack performs fine-grained optimization of the adversarial perturbation on surrogate ASRs. Unlike the target black-box ASR systems, where the attacker lacks knowledge of their internal architectures and parameters, the white-box surrogate ASRs provide full control to the attacker. Hence, the attacker can optimize δ𝛿\deltaitalic_δ using any information acquired through these surrogate ASRs.

A straightforward approach to optimizing δ𝛿\deltaitalic_δ on diverse surrogate ASRs is to use the weighted average of the gradients from each one. For the j𝑗jitalic_j-th surrogate ASR fj𝔽subscript𝑓𝑗𝔽f_{j}\in\mathbb{F}italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ blackboard_F, the gradient is calculated as δ(x,δ,t,fj)subscript𝛿𝑥𝛿𝑡subscript𝑓𝑗\nabla_{\delta}\mathcal{L}(x,\delta,t,f_{j})∇ start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT caligraphic_L ( italic_x , italic_δ , italic_t , italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ), where \mathcal{L}caligraphic_L represents the loss of δ𝛿\deltaitalic_δ on a single surrogate ASR fjsubscript𝑓𝑗f_{j}italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, typically similar to Eq. (2). However, this method essentially treats each surrogate ASR independently when optimizing δ𝛿\deltaitalic_δ. Each surrogate ASR does not interact with the others and, therefore, cannot leverage the optimization information provided by others. Moreover, this method focuses on optimizing δ𝛿\deltaitalic_δ in the direction most effective for a single surrogate ASR, without considering the effectiveness of δ𝛿\deltaitalic_δ on other surrogate ASRs.

To facilitate collaboration among these surrogate ASRs, we propose a sequential ensemble optimization algorithm, as presented in Figure 5. This algorithm iteratively optimizes the adversarial perturbation on the ordered set of surrogate ASRs 𝔽𝔽\mathbb{F}blackboard_F. For each surrogate ASR, this algorithm leverages the collaborative information from the preceding surrogate ASRs in the ordered set to optimize δ𝛿\deltaitalic_δ. In other words, the optimization process considers not only the efficacy of δ𝛿\deltaitalic_δ on the current surrogate ASR but also its efficacy on the preceding surrogate ASRs. Additionally, instead of directly using \mathcal{L}caligraphic_L in Eq. (2), we carefully design a novel loss function comprising three loss terms. The detailed design of the loss function is presented in Section 4.6.

Specifically, the sequential ensemble optimization algorithm comprises an outer loop and an inner loop. At each step, the algorithm first randomly shuffles 𝔽𝔽\mathbb{F}blackboard_F in the outer loop. Then, the optimization of δ𝛿\deltaitalic_δ on 𝔽𝔽\mathbb{F}blackboard_F takes place in the inner loop. Following the completion of the inner loop, δ𝛿\deltaitalic_δ is updated in the outer loop. As ZQ-Attack abstains from interacting with the target ASR systems via queries, it relies on surrogate ASRs to validate the efficacy of the generated adversarial example xsuperscript𝑥x^{\prime}italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. The set of valid adversarial examples, denoted as Xsuperscript𝑋X^{\prime}italic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, is initialized as \varnothing at the beginning of the algorithm. At the end of each step, xsuperscript𝑥x^{\prime}italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT will be added to Xsuperscript𝑋X^{\prime}italic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT if it can successfully attack all surrogate ASRs.

Inner Loop. For clarity, we use δjsubscript𝛿𝑗\delta_{j}italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT to denote the adversarial perturbation optimized on the previous j𝑗jitalic_j surrogate ASR(s) in 𝔽𝔽\mathbb{F}blackboard_F, and δ0subscript𝛿0\delta_{0}italic_δ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is equivalent to δ𝛿\deltaitalic_δ. For the j𝑗jitalic_j-th surrogate ASR fjsubscript𝑓𝑗f_{j}italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT in 𝔽𝔽\mathbb{F}blackboard_F, the input perturbations include δ0subscript𝛿0\delta_{0}italic_δ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and the optimized adversarial perturbations obtained from the preceding j1𝑗1j-1italic_j - 1 ASR(s). Formally, the input adversarial perturbations for fjsubscript𝑓𝑗f_{j}italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, denoted as ΔjsubscriptΔ𝑗\Delta_{j}roman_Δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, can be represented as Δj=[δ0,δ1,,δj1]subscriptΔ𝑗subscript𝛿0subscript𝛿1subscript𝛿𝑗1\Delta_{j}=[\delta_{0},\delta_{1},\cdots,\delta_{j-1}]roman_Δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = [ italic_δ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_δ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_δ start_POSTSUBSCRIPT italic_j - 1 end_POSTSUBSCRIPT ]. We first add the adversarial perturbations ΔjsubscriptΔ𝑗\Delta_{j}roman_Δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT to the carrier audio x𝑥xitalic_x to construct the adversarial examples. To ensure that the adversarial examples have been pushed towards the decision boundary, we additionally add randomly sampled Gaussian noise σ𝜎\sigmaitalic_σ on each adversarial example. Following the forwarding of these adversarial examples to fjsubscript𝑓𝑗f_{j}italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT and the subsequent gradient calculation, the perturbation is updated as:

(6) δj=δ0α1jδΔjδ(x,δ+σ,t,fj),subscript𝛿𝑗subscript𝛿0𝛼1𝑗subscriptsuperscript𝛿subscriptΔ𝑗subscriptsuperscript𝛿𝑥superscript𝛿𝜎𝑡subscript𝑓𝑗\delta_{j}=\delta_{0}-\alpha\cdot\frac{1}{j}\sum_{\delta^{\prime}\in\Delta_{j}% }\nabla_{\delta^{\prime}}\mathcal{L}(x,{\delta^{\prime}}+\sigma,t,f_{j}),italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_δ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - italic_α ⋅ divide start_ARG 1 end_ARG start_ARG italic_j end_ARG ∑ start_POSTSUBSCRIPT italic_δ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ roman_Δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∇ start_POSTSUBSCRIPT italic_δ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT caligraphic_L ( italic_x , italic_δ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + italic_σ , italic_t , italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ,

where α𝛼\alphaitalic_α represents the learning rate. The loss \mathcal{L}caligraphic_L is detailed in Section 4.6.

To avoid the updated δjsubscript𝛿𝑗\delta_{j}italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT being perceptible enough to the human ear, we use a clip** algorithm to restrict the updated δjsubscript𝛿𝑗\delta_{j}italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT within a limited range. Instead of using the Lpsubscript𝐿𝑝L_{p}italic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT norm-based clip** algorithm employed in previous work (Zheng et al., 2021; Chen et al., 2020), we utilize an adaptive clip** algorithm based on psychoacoustics (Gelfand, 2017). To be specific, this algorithm is grounded in temporal masking, a phenomenon where the presence of louder components can influence the perception of quieter components when two sounds with different loudness levels are heard by the human ear. This algorithm constrains δ𝛿\deltaitalic_δ within a range proportional to the carrier audio, permitting larger perturbations in louder segments of the carrier while maintaining smaller perturbations in quieter areas, thereby reducing the perceptibility of the perturbation. This clip** algorithm can be represented as follows:

(7) clipϵ(δ,x)=max(min(δ,ϵ|x|),ϵ|x|).𝑐𝑙𝑖subscript𝑝italic-ϵ𝛿𝑥maxmin𝛿italic-ϵ𝑥italic-ϵ𝑥clip_{\epsilon}(\delta,x)=\text{max}(\text{min}(\delta,\epsilon\cdot|x|),\ -% \epsilon\cdot|x|).italic_c italic_l italic_i italic_p start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( italic_δ , italic_x ) = max ( min ( italic_δ , italic_ϵ ⋅ | italic_x | ) , - italic_ϵ ⋅ | italic_x | ) .

It is noteworthy that the clip** is performed element-wise in the perturbation. After applying the adaptive clip** algorithm, the adversarial perturbation will be bounded as δj=clipϵ(δj,x)subscript𝛿𝑗𝑐𝑙𝑖subscript𝑝italic-ϵsubscript𝛿𝑗𝑥\delta_{j}=clip_{\epsilon}(\delta_{j},x)italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_c italic_l italic_i italic_p start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_x ).

Outer Loop. Each step of the sequential ensemble optimization algorithm begins by randomly shuffling 𝔽𝔽\mathbb{F}blackboard_F. Subsequently, the optimization is performed in the inner loop. Following the inner loop, we obtain K𝐾Kitalic_K perturbations [δ1,δ2,,δK]subscript𝛿1subscript𝛿2subscript𝛿𝐾[\delta_{1},\delta_{2},\cdots,\delta_{K}][ italic_δ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_δ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , italic_δ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ] optimized on the surrogate ASRs, and δ𝛿\deltaitalic_δ is updated as follows:

(8) δ=η1Kj=1Kδj+(1η)δ0,𝛿𝜂1𝐾superscriptsubscript𝑗1𝐾subscript𝛿𝑗1𝜂subscript𝛿0\delta=\eta\cdot\frac{1}{K}\sum_{j=1}^{K}\delta_{j}+(1-\eta)\cdot\delta_{0},italic_δ = italic_η ⋅ divide start_ARG 1 end_ARG start_ARG italic_K end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT + ( 1 - italic_η ) ⋅ italic_δ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ,

where η𝜂\etaitalic_η denotes a balancing factor determining the extent of the update. η=0𝜂0\eta=0italic_η = 0 implies no update, while η=1𝜂1\eta=1italic_η = 1 signifies a complete update to the optimized adversarial perturbation, disregarding the former one. This update method uses the adversarial perturbation δ0subscript𝛿0\delta_{0}italic_δ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT as historical information, while the momentum method (Dong et al., 2018a) uses the gradient as historical information. At the end of each step, x=x+δsuperscript𝑥𝑥𝛿x^{\prime}=x+\deltaitalic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = italic_x + italic_δ is added to Xsuperscript𝑋X^{\prime}italic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT if all surrogate ASRs transcribe it into the target command.

The comprehensive details of the sequential ensemble optimization algorithm are presented in Alg. 2.

Algorithm 2 ZQ-Attack
0:  Carrier audio x𝑥xitalic_x, Target command t𝑡titalic_t, Ordered set of K𝐾Kitalic_K surrogate ASRs 𝔽𝔽\mathbb{F}blackboard_F, Max step N𝑁Nitalic_N, Search stride s𝑠sitalic_s
0:  The set of valid adversarial examples Xsuperscript𝑋X^{\prime}italic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT
1:  xt𝒯(t)subscript𝑥𝑡𝒯𝑡x_{t}\leftarrow\mathcal{T}(t)italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← caligraphic_T ( italic_t )
2:  δAdaSearch(x,t,xt,𝔽,s)𝛿AdaSearch𝑥𝑡subscript𝑥𝑡𝔽𝑠\delta\leftarrow\text{AdaSearch}(x,t,x_{t},\mathbb{F},s)italic_δ ← AdaSearch ( italic_x , italic_t , italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , blackboard_F , italic_s )
3:  Xsuperscript𝑋X^{\prime}\leftarrow\varnothingitalic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ← ∅
4:  # Outer Loop
5:  for i1𝑖1i\leftarrow 1italic_i ← 1 to  N𝑁Nitalic_N do
6:     Randomly shuffle 𝔽𝔽\mathbb{F}blackboard_F
7:     δ0δsubscript𝛿0𝛿\delta_{0}\leftarrow\deltaitalic_δ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ← italic_δ
8:     # Inner Loop
9:     for j1𝑗1j\leftarrow 1italic_j ← 1 to  K𝐾Kitalic_K do
10:        fjsubscript𝑓𝑗absentf_{j}\leftarrowitalic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ← the j𝑗jitalic_j-th ASR in 𝔽𝔽\mathbb{F}blackboard_F
11:        Δj[δ0,δ1,δ2,,δj1]subscriptΔ𝑗subscript𝛿0subscript𝛿1subscript𝛿2subscript𝛿𝑗1\Delta_{j}\leftarrow[\delta_{0},\delta_{1},\delta_{2},\cdots,\delta_{j-1}]roman_Δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ← [ italic_δ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_δ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_δ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , italic_δ start_POSTSUBSCRIPT italic_j - 1 end_POSTSUBSCRIPT ]
12:        Sample a Gaussian noise σ𝜎\sigmaitalic_σ
13:        Compute δ(x,δ+σ,t,fj)subscriptsuperscript𝛿𝑥superscript𝛿𝜎𝑡subscript𝑓𝑗\nabla_{\delta^{\prime}}\mathcal{L}(x,{\delta^{\prime}}+\sigma,t,f_{j})∇ start_POSTSUBSCRIPT italic_δ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT caligraphic_L ( italic_x , italic_δ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + italic_σ , italic_t , italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) for each δsuperscript𝛿\delta^{\prime}italic_δ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT in ΔjsubscriptΔ𝑗\Delta_{j}roman_Δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT
14:        Update δjsubscript𝛿𝑗\delta_{j}italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT using Eq. (6)
15:        Clip δjsubscript𝛿𝑗\delta_{j}italic_δ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT using Eq. (7)
16:     end for
17:     Update δ𝛿\deltaitalic_δ using Eq. (8)
18:     x=x+δsuperscript𝑥𝑥𝛿x^{\prime}=x+\deltaitalic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = italic_x + italic_δ
19:     if f𝔽,f(x)=tformulae-sequencefor-all𝑓𝔽𝑓superscript𝑥𝑡\forall f\in\mathbb{F},~{}f(x^{\prime})=t∀ italic_f ∈ blackboard_F , italic_f ( italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = italic_t then
20:        XXxsuperscript𝑋superscript𝑋superscript𝑥X^{\prime}\leftarrow X^{\prime}\cup x^{\prime}italic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ← italic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∪ italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT
21:     end if
22:  end for
23:  return  Xsuperscript𝑋X^{\prime}italic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT

4.6. Loss Design

We design a novel loss \mathcal{L}caligraphic_L for optimizing the adversarial perturbations. It comprises three terms: the adversarial loss asubscript𝑎\mathcal{L}_{a}caligraphic_L start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT, the imperceptibility loss psubscript𝑝\mathcal{L}_{p}caligraphic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT, and the acoustic feature loss fsubscript𝑓\mathcal{L}_{f}caligraphic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT. The loss \mathcal{L}caligraphic_L can be written as follows:

(9) (x,δ,t,f)=a(x,δ,t,f)+c1p(x,δ)+c2f(x,δ,t),𝑥𝛿𝑡𝑓subscript𝑎𝑥𝛿𝑡𝑓subscript𝑐1subscript𝑝𝑥𝛿subscript𝑐2subscript𝑓𝑥𝛿𝑡\mathcal{L}(x,\delta,t,f)=\mathcal{L}_{a}(x,\delta,t,f)+c_{1}\cdot\mathcal{L}_% {p}(x,\delta)+c_{2}\cdot\mathcal{L}_{f}(x,\delta,t),caligraphic_L ( italic_x , italic_δ , italic_t , italic_f ) = caligraphic_L start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( italic_x , italic_δ , italic_t , italic_f ) + italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋅ caligraphic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ( italic_x , italic_δ ) + italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⋅ caligraphic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_x , italic_δ , italic_t ) ,

where c1subscript𝑐1c_{1}italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and c2subscript𝑐2c_{2}italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT serve as weighting factors to balance the relative importance of different loss terms, ensuring a trade-off between the effectiveness and imperceptibility of δ𝛿\deltaitalic_δ.

Adversarial Loss.  The adversarial loss asubscript𝑎\mathcal{L}_{a}caligraphic_L start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT measures the effectiveness of δ𝛿\deltaitalic_δ on a surrogate ASR, evaluating how accurately a specific surrogate ASR transcribes the adversarial example to match the target command. The calculation process of asubscript𝑎\mathcal{L}_{a}caligraphic_L start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT begins by inputting the constructed adversarial example into the surrogate ASR to obtain the output probability. Then, asubscript𝑎\mathcal{L}_{a}caligraphic_L start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT is calculated according to the output probability and t𝑡titalic_t. As different ASR systems may utilize different loss functions for training, such as connectionist temporal classification (CTC) (Graves et al., 2006) and Transducer (Graves, 2012; Graves et al., 2013), the calculation method of asubscript𝑎\mathcal{L}_{a}caligraphic_L start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT varies depending on the specific surrogate ASR. For instance, we utilize CTC loss to compute asubscript𝑎\mathcal{L}_{a}caligraphic_L start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT for a surrogate ASR with a CTC architecture (e.g., Citrinet (Majumdar et al., 2021)).

Imperceptibility Loss.  The imperceptibility loss psubscript𝑝\mathcal{L}_{p}caligraphic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT aims to minimize the detectability of the adversarial perturbation by human ears. Previous works (Chen et al., 2020; Zheng et al., 2021; Wu et al., 2023) commonly utilize the Lpsubscript𝐿𝑝L_{p}italic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT norm of the adversarial perturbation as the imperceptibility loss. However, Duan et al. (Duan et al., 2022) demonstrated that the Lpsubscript𝐿𝑝L_{p}italic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT norm shows a limited correlation with human perception, and the L2subscript𝐿2L_{2}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT norm exhibits a relatively high correlation with human perception among the Lpsubscript𝐿𝑝L_{p}italic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT norm. Similar to the adaptive clip** algorithm employed when clip** the adversarial perturbation, as shown in Eq. (7), we design a new imperceptibility loss function psubscript𝑝\mathcal{L}_{p}caligraphic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT to calculate the L2subscript𝐿2L_{2}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT norm of the ratio of the adversarial perturbation to the carrier audio. Formally, it can be represented as:

(10) p(x,δ)=δx2.subscript𝑝𝑥𝛿subscriptnorm𝛿𝑥2\mathcal{L}_{p}(x,\delta)={\left\|\frac{\delta}{x}\right\|}_{2}.caligraphic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ( italic_x , italic_δ ) = ∥ divide start_ARG italic_δ end_ARG start_ARG italic_x end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT .

Acoustic Feature Loss.  As mentioned in Section 2.1, traditional ASR systems employ feature extraction algorithms to obtain the acoustic features of spectrograms as a preprocessing procedure, while modern ASR systems typically utilize the spectrograms directly. Despite the intricacy of acoustic feature extraction algorithms, which require specialized knowledge, the performance of traditional ASR systems demonstrates that these algorithms can extract high-quality features. Prior work has also demonstrated the effectiveness of incorporating acoustic features into the optimization process of audio adversarial examples (Wu et al., 2023). Hence, we utilize the acoustic feature loss based on the widely adopted acoustic feature extraction algorithm, MFCC (Sigurdsson et al., 2006), to further enhance the effectiveness of the adversarial perturbation. Specifically, we first extract the acoustic features of the target command audio and the constructed adversarial example. We denote the acoustic feature of the target command audio and the constructed adversarial example as Mtsubscript𝑀𝑡M_{t}italic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and Mxsubscript𝑀superscript𝑥M_{x^{\prime}}italic_M start_POSTSUBSCRIPT italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT, respectively. Then, the acoustic feature loss fsubscript𝑓\mathcal{L}_{f}caligraphic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT can be calculated as:

(11) f(x,δ,t)=MxMt2.subscript𝑓𝑥𝛿𝑡subscriptnormsubscript𝑀superscript𝑥subscript𝑀𝑡2\mathcal{L}_{f}(x,\delta,t)={\|M_{x^{\prime}}-M_{t}\|}_{2}.caligraphic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_x , italic_δ , italic_t ) = ∥ italic_M start_POSTSUBSCRIPT italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT - italic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT .
Table 2. Summary of target ASR systems in the experiments.
  ASR Type Acoustic Model Word Error Rate on LibriSpeech test-clean/test-other (%)
  Jasper (Li et al., 2019) Open-source ASR CNN 3.9/12.0
QuartzNet (Kriman et al., 2020) Open-source ASR CNN 3.8/10.4
Citrinet (Majumdar et al., 2021) Open-source ASR CNN 4.4/10.7 (S) 3.7/8.9 (M) 3.6/7.9 (L)
ContextNet (Han et al., 2020) Open-source ASR CNN 3.3/8.0 (S) 2.2/5.0 (M) 1.9/3.9 (L)
Conformer-CTC (Gulati et al., 2020) Open-source ASR Transformer 3.7/8.1 (S) 2.6/5.9 (M) 2.1/4.5 (L) 2.0/3.7 (XL)
Conformer-Transducer (Gulati et al., 2020) Open-source ASR Transformer 2.9/6.6 (S) 2.1/4.7 (M) 1.7/3.7 (L) 1.6/3.0 (XL)
Whisper (Radford et al., 2023) Open-source ASR Transformer 5.0/12.4 (base) 3.4/7.6 (small) 2.9/5.9 (medium) 2.7/5.6 (large)
Microsoft Azure (Microsoft, 2023) Online speech recognition service -77footnotemark: 7 -
Tencent Cloud (Tencent, 2023) Online speech recognition service - -
Alibaba Cloud (Alibaba, 2023) Online speech recognition service - -
OpenAI Whisper (OpenAI, 2023a) 66footnotemark: 6 Online speech recognition service Transformer 2.7/5.2
Apple Siri (Apple, 2023) Commercial IVC device - -
Amazon Alexa (Amazon, 2023) Commercial IVC device - -
 

Note that, (i) 77footnotemark: 7: Most online speech recognition services and commercial IVC devices do not reveal their implementation of the underlying ASR systems. Hence, we lack knowledge about their acoustic models and related information. We use “-” to represent unknown information. (ii) 66footnotemark: 6: The ASR system employed by the OpenAI API is the open-source Whisper large-v2. Therefore, we have access to information regarding the acoustic model and its recognition performance on LibriSpeech. Despite the open-source nature of Whisper large-v2, we treat it as a black-box ASR system during attacks.

5. Experiments

5.1. Experiment Setup

In this section, we evaluate the performance of ZQ-Attack in both over-the-line and over-the-air settings. In the over-the-line setting, the audio adversarial examples are transmitted directly to the target ASR systems as waveform audio files. In the over-the-air setting, we utilize a speaker positioned near the target devices (e.g., 10 cm) to play the audio adversarial examples in a quiet office environment (e.g., 35dB).

Target ASR Systems. To fully demonstrate the effectiveness of ZQ-Attack, we conduct extensive experiments on various online speech recognition services, commercial IVC devices, and open-source ASRs. The details are presented as follows.

  • Online speech recognition services. In the over-the-line setting, the target online speech recognition services include Microsoft Azure Speech Service (Microsoft, 2023), Tencent Cloud Automatic Speech Recognition (Tencent, 2023), Alibaba Cloud intelligent speech interaction (Alibaba, 2023), and OpenAI Whisper (OpenAI, 2023a).

  • Commercial IVC devices. In the over-the-air setting, we select Apple Siri (Apple, 2023), and Amazon Alexa (Amazon, 2023) as the target IVC devices.

  • Open-source ASRs. Among the numerous open-source ASRs, we select some of the most representative and advanced ASRs as our targets in the over-the-line setting. As prior research indicates that the transferability of audio adversarial examples can be limited even among instances of the same ASR (Abdullah et al., 2021), we select open-source ASRs with the same architectures but varying scales, as well as those with distinct architectures. Specifically, the target ASRs include: Jasper (Li et al., 2019), QuartzNet (Kriman et al., 2020), ContextNet (M/L) (Han et al., 2020), Citrinet (M/L) (Majumdar et al., 2021), Conformer-CTC (M/L/XL) (Gulati et al., 2020), Conformer-Transducer (M/L/XL) (Gulati et al., 2020), and Whisper (base, small, medium, large) (Radford et al., 2023).

We summarize and categorize these target systems in Table 2.

Surrogate ASRs. The surrogate ASRs include Conformer-CTC (S) (Gulati et al., 2020), Conformer-Transducer (S) (Gulati et al., 2020), ContextNet (S) (Han et al., 2020), and Citrinet (S) (Majumdar et al., 2021). ContextNet and Citrinet employ CNNs as the acoustic model, while the acoustic models of Conformer-CTC and Conformer-Transducer are based on Transformers. The checkpoints for these surrogate ASRs are obtained from Nvidia NeMo (Nvidia, 2023).

Target Commands and Carrier Audios. We choose 10 commonly used instructions as the target commands in the experiments: call my wife, make it warmer, navigate to my home, open the door, open the website, play music, send a text, take a picture, turn off the light, and turn on airplane mode. We select five songs used in Commandersong (Yuan et al., 2018) as the carrier audio.

Software and Hardware. We implement ZQ-Attack using the PyTorch framework (Paszke et al., 2019). The experiments are conducted on a server equipped with 8 NVIDIA 3080Ti GPUs, 2 Intel Xeon Gold 5117 CPUs, and 128 GB RAM, running a 64-bit Ubuntu 18.04 operating system. In the over-the-air setting, we use a JBL Clip3 speaker to play the audio adversarial examples. Apple Siri on an iPhone 13 and Amazon Alexa on a second-generation Amazon Echo Dot are used as the target commercial IVC devices.

Baselines. To demonstrate the superior performance of ZQ-Attack, we compare it with several previous works. In the over-the-line setting, we compare ZQ-Attack with Carlini et al. (Carlini and Wagner, 2018), Occam (Zheng et al., 2021) and KENKU (Wu et al., 2023). In the over-the-air setting, we compare ZQ-Attack with NI-Occam (Zheng et al., 2021) and KENKU. For these baselines, we either utilize their open-source code or re-implement them. It is noteworthy that the evaluation results of these methods might be inconsistent with the original paper due to periodic updates by manufacturers to their ASR systems.

Experiment Design. We evaluate ZQ-Attack on online speech recognition services, commercial IVC devices, and open-source ASRs in Section 5.3, Section 5.4, and Section 5.5, respectively. In Section 5.6, we explore the impact of surrogate ASRs on ZQ-Attack. To evaluate the imperceptibility of ZQ-Attack, we conduct a user study in Section 5.7. In Section  5.8 and Section  5.9, we evaluate ZQ-Attack on a large command set and the latest Whisper large-v3 (OpenAI, 2023b), respectively.

Ethical Considerations. We have informed the relevant companies about the potential vulnerability of their ASR systems to our attacks via email.

Table 3. Comparison on online speech recognition services.
  Method SRoA \uparrow SNR (dB) \uparrow Query \downarrow
Azure Tencent Alibaba OpenAI Average
  Carlini et al. (Carlini and Wagner, 2018) 0/10 0/10 0/10 0/10 0/10 / 0
Occam (Zheng et al., 2021) 10/10 10/10 10/10 10/10 10/10 12.54 30000
KENKU (Wu et al., 2023) 10/10 8/10 0/10 9/10 6.75/10 12.72 >>>0
ZQ-Attack 10/10 10/10 10/10 10/10 10/10 21.91 0
 

5.2. Evaluation Metrics

We use the success rate of attack (SRoA) as the metric of attack effectiveness. SRoA is calculated by dividing the number of successfully attacked commands by the total number of commands (i.e., 10). For each target command, if we can generate at least one adversarial example that effectively attacks the target ASR system, we consider the attack on that command as successful. Note that the adversarial example is considered effective only when its transcription matches exactly the target command. Any word errors are regarded as a failure, with case sensitivity being disregarded.

To evaluate the imperceptibility of the adversarial examples, we choose signal-to-noise ratio (SNR) as the metric. SNR is defined as the ratio of the power of a signal (i.e., the carrier audio x𝑥xitalic_x) to the power of a noise (i.e., the adversarial perturbation δ𝛿\deltaitalic_δ) in the logarithm scale, and a higher SNR indicates better imperceptibility. The specific calculation method is shown as follows:

(12) SNR(dB)=10log10(x22δ22).𝑆𝑁𝑅dB10subscript10superscriptsubscriptnorm𝑥22superscriptsubscriptnorm𝛿22SNR(\text{dB})=10\cdot\log_{10}\left(\frac{{\|x\|}_{2}^{2}}{{\|\delta\|}_{2}^{% 2}}\right).italic_S italic_N italic_R ( dB ) = 10 ⋅ roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT ( divide start_ARG ∥ italic_x ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∥ italic_δ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) .

5.3. Evaluation on Online Speech Recognition Services

In the over-the-line setting, the results of ZQ-Attack and baseline methods on online speech recognition services are shown in Table 3. ZQ-Attack successfully generates audio adversarial examples for all target commands on four online speech recognition services, achieving an average SRoA of 100% and an average SNR of 21.91dB.

For baseline methods, the adversarial examples generated by Carlini et al. (Carlini and Wagner, 2018) fail to successfully attack any online speech recognition services, as this method is tailored for the white-box setting. Compared to Occam, ZQ-Attack achieves comparable effectiveness and better imperceptibility without any queries. While KENKU still necessitates a small number of queries to search for appropriate hyperparameters tailored to a specific target ASR system, ZQ-Attack attains superior effectiveness and imperceptibility without any queries. Note that KENKU fails to successfully attack Alibaba in our evaluation. We speculate that this could be attributed to updates made by Alibaba to its ASR system.

5.4. Evaluation on Commercial IVC Devices

Table 4. Comparison on commercial IVC devices.
  Method SRoA \uparrow SNR (dB) \uparrow
Siri Alexa Average
  NI-Occam (Zheng et al., 2021) 4/10 5/10 4.5/10 8.38
KENKU (Wu et al., 2023) 7/10 9/10 8/10 12.72
ZQ-Attack 10/10 10/10 10/10 15.77
 
Table 5. Evaluation on open-source ASRs.
  Target ASR SRoA SNR (dB) Target ASR SRoA SNR (dB)
  Jasper 10/10 13.59 Conformer-CTC (XL) 10/10 23.59
QuartzNet 10/10 12.96 Conformer-Transducer (M) 10/10 25.34
Citrinet (M) 10/10 14.67 Conformer-Transducer (L) 10/10 20.63
Citrinet (L) 10/10 15.89 Conformer-Transducer (XL) 10/10 21.08
ContextNet (M) 10/10 15.73 Whisper (base) 10/10 17.88
ContextNet (L) 10/10 16.87 Whisper (small) 10/10 20.76
Conformer-CTC (M) 10/10 25.13 Whisper (medium) 10/10 23.39
Conformer-CTC (L) 10/10 23.51 Whisper (large) 10/10 23.74
 

In the over-the-air setting, we generate 10 audio adversarial examples for each target command. Each adversarial example is played up to three times. We consider that the attack on a command is successful if at least one adversarial example is effective.

The results of ZQ-Attack and baseline methods on commercial IVC devices are presented in Table 4. ZQ-Attack successfully generates audio adversarial examples for all target commands on two commercial IVC devices, achieving an SRoA of 100%percent\%% with an average SNR of 15.77dB. Additionally, an average of 6.6 adversarial examples for each command are effective.

In comparison, the SRoA of ZQ-Attack surpasses KENKU and NI-Occam by 20% and 55%. Concurrently, adversarial examples generated by ZQ-Attack exhibit better imperceptibility, with SNR values exceeding those of KENKU and NI-Occam by 3.05dB and 7.39dB, respectively.

5.5. Evaluation on Open-source ASRs

To demonstrate the high transferability of audio adversarial examples generated by ZQ-Attack, we additionally evaluate ZQ-Attack on 16 open-source ASR systems. The evaluation results are shown in Table 5. ZQ-Attack successfully generates audio adversarial examples on all 16 open-source ASRs, achieving an average SRoA of 100% and an average SNR of 19.67dB. These results show that the audio adversarial examples generated by ZQ-Attack exhibit high transferability, successfully attacking the target open-source ASRs.

Additionally, we observe a potential positive correlation between the recognition performance of open-source ASRs and the success rate of transfer attacks. We speculate that this phenomenon arises due to the better feature extraction capabilities of a more powerful ASR system, facilitating the capture of subtle adversarial perturbations. The detailed evaluation results and analysis of the correlation can be found in Appendix C

5.6. Impact of Surrogate ASRs

ZQ-Attack optimizes the adversarial perturbation on an ordered set of K𝐾Kitalic_K surrogate ASRs. To investigate the impact of surrogate ASRs to the performance of ZQ-Attack, we conduct experiments with various values of K𝐾Kitalic_K. Specifically, we set K𝐾Kitalic_K to 1, 2, and 4, respectively. In the cases with only one surrogate ASR, we conduct experiments with four configurations: ContextNet, Citrinet, Conformer-CTC, and Conformer-Transducer (i.e., selecting one from the four available ASRs). In the cases with two surrogate ASRs, six configurations (i.e., combination of two from the four available ASRs) are explored. Subsequently, we compute the average SRoA and SNR for the three groups of experiments. In these experiments, we choose one carrier audio and generate audio adversarial examples for 10 target commands.

The results are presented in Figure 6. It can be observed that with an increase in K𝐾Kitalic_K, both SRoA and SNR exhibit an increasing trend. For instance, when K𝐾Kitalic_K is 1, 2, and 4, the average SRoA is 81.25%, 98.75%, and 100%, respectively. The average SNR for the case involving 4 surrogate ASRs surpasses that of the case with 2 surrogate ASRs by 2.98dB and exceeds the SNR of the case with only 1 surrogate ASR by 5.35dB. Additionally, it can also be observed that ZQ-Attack successfully performs attacks for most target commands when using 2 surrogate ASRs. Thus, when computational resources are constrained, and there is a more relaxed requirement for the imperceptibility and effectiveness of the attack, a smaller K𝐾Kitalic_K can be chosen. Conversely, a larger K𝐾Kitalic_K can be used. In the cases with four surrogate ASRs, ZQ-Attack can complete a generation process in approximately 10 minutes using only one NVIDIA 3080Ti GPU.

Refer to caption
(a) SRoA
Refer to caption
(b) SNR
Figure 6. Impact of Surrogate ASRs.
Table 6. User study on human perception.
  Audio Device Method Normal (%) \uparrow Noise (%) Talking (%) \downarrow Recognize (%) \downarrow WER (%) \uparrow
Once Twice Once Twice
  Speaker Songs 92.63 7.37 0.00 0.00 0.00 0.00 0.00
Occam (Zheng et al., 2021) 5.26 15.79 78.95 29.47 35.79 57.37 51.24
KENKU (Wu et al., 2023) 0.00 6.32 93.68 56.84 63.68 30.23 25.96
ZQ-Attack 13.16 71.58 15.26 3.16 3.68 94.95 94.47
  Headphone Songs 92.11 7.89 0.00 0.00 0.00 0.00 0.00
Occam (Zheng et al., 2021) 2.11 3.16 94.74 49.47 54.74 38.25 31.32
KENKU (Wu et al., 2023) 0.00 1.58 98.42 67.37 70.00 19.34 17.77
ZQ-Attack 5.79 47.89 46.32 15.26 22.63 77.98 72.41
 

5.7. User Study

In this section, we recruit human participants to analyze the imperceptibility of the audio adversarial examples and conduct a comparative analysis between ZQ-Attack and the baseline methods (i.e., Occam, and KENKU). This study is carefully designed to mitigate any conceivable risks (psychological, legal, etc.) to the participants, and it is approved by the institutional review board (IRB). The target commands are common household phrases (e.g., “call my wife”) to minimize discomfort and the audio volume is normalized to maintain it below a safe threshold, preventing any risk of hearing damage. Besides, our study does not collect any private information from the participants, and all data will be deleted upon the completion of the study.

Our test audios comprise both normal audios and audio adversarial examples. The normal audios consist of 10 songs, and the audio adversarial examples include 10 instances from each method. In the user study, each participant listens to all test audios. Following the first listening, participants are provided with three choices: “normal audio”, “noisy audio”, and “audio with background speech”. In cases where participants select the “audio with background speech” option, they are required to provide the content of the perceived speech, followed by a second round of listening and transcription of the same test audio. Specifically, participants first listen to the test audios through speakers (e.g., MacBook Pro Speaker). Then, to eliminate environmental interference, participants listen to the audios again through noise-canceling headphones (e.g., Sony WH-1000XM5).

We gather data from a total of 38 participants, consisting of 20 males and 18 females. Among this group, 10 participants are below the age of 22, 18 participants are aged 22 to 24, and 10 participants are 25 or older. All participants are proficient in both spoken and written English, hold at least a bachelor’s degree, and have normal hearing. The results are presented in Table 6. When using the speaker as the audio device, 13.16% of participants select our adversarial examples as “normal audio”, while 71.58% of participants select them as “noisy audio”. Although 15.26% of participants select our adversarial examples as “audio with background speech,” only 3.68% of commands are recognized even after a second round of listening and transcription. In a more challenging configuration that uses headphones to play the audios, only 22.63% of the commands are recognized even after a second round of listening and transcription. Additionally, for the audios played using speakers and headphones, the average word error rate (WER) between user recognition results and the actual target commands is 94.47% and 72.41%, respectively. Compared with other methods, ZQ-Attack attains superior imperceptibility.

Furthermore, we observed that the imperceptibility of different target commands varies. The command open the door is the most easily perceived, while send a text is the hardest to perceive. This disparity may be attributed to the varying phonemes in different command audios, with vowel phonemes containing more energy and thus being more easily audible (Guo et al., 2022; Yu et al., 2023).

Table 7. Evaluation on a large command set.
  Command Azure Tencent Alibaba OpenAI
Attack SNR (dB) Attack SNR (dB) Attack SNR (dB) Attack SNR (dB)
  ask me a question 23.51 28.66 26.31 28.73
clear notification 26.52 21.28 20.05 26.52
close the shades 26.54 26.24 26.54 26.86
find a hotel 25.78 24.63 20.26 28.21
good morning 19.65 18.96 27.90 25.76
I have a secret to tell you 27.21 27.21 27.21 27.21
I need help 27.23 26.64 20.17 28.54
open the box 18.48 26.34 18.48 27.54
read a book 20.77 25.81 17.27 27.71
record a video 21.38 20.51 21.38 21.38
reset password 22.70 21.69 19.91 22.70
show me my message 27.12 27.54 23.98 27.54
show me the money 27.94 27.96 25.75 27.96
start recording 27.03 20.42 19.87 27.19
tell me a story 27.89 27.89 24.34 27.94
turn off the fan 19.98 15.25 19.98 19.98
turn on the TV 25.55 17.71 17.18 25.55
watch TV 26.32 26.32 19.20 26.32
what time is it 26.22 25.39 25.39 28.20
where is my car 27.24 23.28 25.76 27.24
  Average 20/20 24.75 20/20 23.99 20/20 22.35 20/20 26.45
 

5.8. Evaluation on a Larger Command Set

To thoroughly evaluate the effectiveness of ZQ-Attack, we conduct experiments using a larger command set. This set includes a total of 20 commands: ask me a question, clear notification, close the shades find a hotel, good morning, I have a secret to tell you, I need help, open the box, read a book, record a video, reset password, show me my message, show me the money, start recording, tell me a story, turn off the fan, turn on the TV, watch TV, what time is it where is my car.

We choose a carrier audio and generate audio adversarial examples for each of these target commands. As illustrated in Table 7, ZQ-Attack can generate effective audio adversarial examples for each target command in this set, achieving an average SRoA of 100%. On Azure, Tencent, Alibaba, and OpenAI, ZQ-Attack attains average SNR values of 24.75dB, 23.99dB, 22.35dB, and 26.45dB, respectively.

5.9. Evaluation on Whisper large-v3

Whisper (OpenAI, 2023a) is a popular ASR system trained on a diverse audio dataset comprising 680,000 hours of audio, achieving state-of-the-art multilingual recognition performance. Evaluating ZQ-Attack on Whisper can more comprehensively demonstrate its effectiveness on state-of-the-art ASR systems. Since the models of Whisper come in diverse sizes, we also conduct experiments on these various models. In Section 5.5, we evaluate ZQ-Attack utilizing the base, small, medium, and large models as open-source ASRs. As the OpenAI API employs the large-v2 model, we have evaluated the performance of ZQ-Attack on the large-v2 model in Section 5.3.

Recently, OpenAI introduced the latest large-v3 model (OpenAI, 2023b), which surpasses the performance of the large-v2 model. We have also evaluated ZQ-Attack on the latest model. As indicated in Table 8, ZQ-Attack can successfully generate effective adversarial examples for all 10 target commands, achieving an average SNR of 23.49dB. These results demonstrate the capability of ZQ-Attack to generate effective and imperceptible audio adversarial examples on the most advanced ASR systems.

Table 8. Evaluation on the latest Whisper large-v3.
  Command Whisper large-v3
Attack SNR (dB)
  call my wife 21.28
make it warmer 17.31
navigate to my home 24.35
open the door 24.46
open the website 26.75
play music 24.76
send a text 24.54
take a picture 27.23
turn off the light 26.29
turn on airplane mode 17.95
  Average 10/10 23.49
 

6. Related Work

Table 9. SRoA of ZQ-Attack on online speech recognition services with defenses.
  Defenses Setting Azure Tencent Alibaba OpenAI Average
  Local Smoothing h=11h=1italic_h = 1 10/10 10/10 10/10 10/10 10/10
h=22h=2italic_h = 2 10/10 10/10 10/10 10/10 10/10
h=33h=3italic_h = 3 10/10 10/10 9/10 10/10 9.75/10
  Downsampling flow=subscript𝑓𝑙𝑜𝑤absentf_{low}=italic_f start_POSTSUBSCRIPT italic_l italic_o italic_w end_POSTSUBSCRIPT = 14kHz 10/10 10/10 10/10 10/10 10/10
flow=subscript𝑓𝑙𝑜𝑤absentf_{low}=italic_f start_POSTSUBSCRIPT italic_l italic_o italic_w end_POSTSUBSCRIPT = 12kHz 10/10 10/10 9/10 9/10 9.5/10
flow=subscript𝑓𝑙𝑜𝑤absentf_{low}=italic_f start_POSTSUBSCRIPT italic_l italic_o italic_w end_POSTSUBSCRIPT = 10kHz 10/10 10/10 9/10 9/10 9.5/10
  Temporal Dependency k=0.2𝑘0.2k=0.2italic_k = 0.2 10/10 10/10 10/10 10/10 10/10
k=0.5𝑘0.5k=0.5italic_k = 0.5 10/10 10/10 10/10 10/10 10/10
k=0.8𝑘0.8k=0.8italic_k = 0.8 10/10 10/10 10/10 10/10 10/10
  MVP-EARS m=2𝑚2m=2italic_m = 2 10/10 10/10 10/10 10/10 10/10
m=3𝑚3m=3italic_m = 3 10/10 10/10 10/10 10/10 10/10
m=4𝑚4m=4italic_m = 4 10/10 10/10 10/10 10/10 10/10
 

Audio Adversarial Attacks on White-box ASR Systems. In recent years, extensive studies have focused on audio adversarial attacks on ASR systems. Carlini et al. (Carlini and Wagner, 2018) were the first to generate targeted audio adversarial examples on the white-box DeepSpeech model. Concurrently, Commandersong (Yuan et al., 2018) demonstrated successful attacks on the Kaldi model by embedding malicious commands into songs. This approach also enabled physical attacks but had stringent limitations, such as requiring specific speakers and recording devices. Qin et al. (Qin et al., 2019) and Schönherr et al. (Schönherr et al., 2019) contributed to improving the imperceptibility of generated adversarial examples by incorporating the psychoacoustic model into the audio adversarial example generation process.

Audio Adversarial Attacks on Black-box ASR Systems. Despite the success of the aforementioned methods in generating audio adversarial examples on white-box ASR systems, their reliance on the gradient information of the target model limits their applicability for attacking black-box ASR systems. To address this challenge, Taori et al. (Taori et al., 2019) proposed employing gradient estimation and genetic algorithms to achieve black-box attacks, but their method had a relatively low attack success rate. Following this work, SGEA (Wang et al., 2020) improved the attack success rate and reduced the number of queries by employing selective gradient estimation techniques. Nevertheless, generating a single audio adversarial example still required approximately 100,000 queries. Devil’s Whisper (Chen et al., 2020) significantly improved the attack success rate by using a surrogate model. However, it relied on confidence scores returned by the target ASR system, which often be unavailable in real-world scenarios. Occam (Zheng et al., 2021) employed cooperative co-evolution and the CMA evolution strategy (Hansen, 2016), eliminating the need for confidence scores. Recently, KENKU (Wu et al., 2023) optimized the acoustic feature loss based on MFCC and imperceptibility loss simultaneously to generate relatively stealthy audio adversarial examples on black-box ASR systems. Although these methods have achieved improved performance, their dependence on querying the target system continues to limit their practicality.

To generate audio adversarial examples without the need for queries, researchers have proposed transfer-based attacks that can generate adversarial examples capable of attacking different models. NI-Occam (Zheng et al., 2021) utilized fine-tuned Kaldi models to exclusively launch attacks on IVC devices that are more sensitive to commands, but its attack success rate remains relatively low. TransAudio (Qi et al., 2023) presented a two-stage framework and a score-matching-based optimization strategy to achieve word-level adversarial attack, but its target transcription is constrained by the carrier audio. In the field of image adversarial attacks, previous work (Brown et al., 2017; Liu et al., 2017) proposed utilizing the ensemble method to improve the transferability of adversarial examples. Inspired by this, we design a sequential ensemble optimization algorithm to generate adversarial examples using diverse surrogate ASRs.

7. Discussion

7.1. Defenses against ZQ-Attack

We evaluate ZQ-Attack with several state-of-the-art defense methods against audio adversarial attacks, including local smoothing, downsampling, temporal dependency (Yang et al., 2019), and MVP-EARS (Zeng et al., 2019). The results are presented in Table 9.

Local Smoothing. Local smoothing renders audio adversarial attacks ineffective by applying a sliding window with a median filter to the adversarial examples. Given the length of the sliding window, denoted as hhitalic_h, the value of an audio sample point is replaced by the average values of itself and the hhitalic_h sample points before and after it. To evaluate the robustness of ZQ-Attack against local smoothing, we set hhitalic_h to 1, 2, and 3, respectively. The results in Table 9 show that local smoothing has minimal impact on ZQ-Attack. Across different settings, ZQ-Attack consistently achieves an SRoA higher than 97.5%.

Downsampling. This method involves downsampling the original audio to a lower sampling rate flowsubscript𝑓𝑙𝑜𝑤f_{low}italic_f start_POSTSUBSCRIPT italic_l italic_o italic_w end_POSTSUBSCRIPT and subsequently upsampling it to the original sampling rate, which causes a loss of high-frequency information from the original audio. Therefore, if the high-frequency information in the adversarial perturbation is lost, the attack might fail. To assess the robustness of ZQ-Attack against downsampling, we set flowsubscript𝑓𝑙𝑜𝑤f_{low}italic_f start_POSTSUBSCRIPT italic_l italic_o italic_w end_POSTSUBSCRIPT to 14kHz, 12kHz, and 10kHz. The results in Table 9 show that ZQ-Attack has great robustness to downsampling, achieving an SRoA higher than 95% under different settings.

Temporal Dependency. The inherent temporal dependency in audio data can be leveraged to detect audio adversarial examples (Yang et al., 2019). Specifically, audio adversarial examples can be identified by comparing the transcription of the first k𝑘kitalic_k part of the audio with the first k𝑘kitalic_k part of the transcription of the entire audio, where k𝑘kitalic_k is a ratio between 0 and 1. If the consistency between them is low, the audio can be considered an adversarial example; otherwise, it is accepted as normal. To evaluate the robustness of ZQ-Attack against the temporal dependency-based defense, we set k𝑘kitalic_k to 0.2, 0.5, and 0.8. The results in Table 9 demonstrate that the audio adversarial examples generated by ZQ-Attack exhibit strong resilience to temporal dependency, as all attacks are successful under different settings.

MVP-EARS (Zeng et al., 2019). MVP-EARS utilizes multiple ASR systems to detect audio adversarial examples. Due to the limited transferability of most prior audio adversarial example generation methods, different ASR systems may produce significantly different transcriptions for the same audio adversarial example. Therefore, multiple ASR systems can be used to transcribe the same audio. If their transcripts differ, the audio can be considered an adversarial example.

To evaluate the robustness of ZQ-Attack against MVP-EARS, we set m𝑚mitalic_m to 2, 3, and 4, where m𝑚mitalic_m is the number of ASR systems used. The results in Table 9 show that the audio adversarial examples generated by ZQ-Attack exhibit good robustness to MVP-EARS, with all target commands successfully attacked under different settings. This is attributed to the fact that ZQ-Attack does not generate audio adversarial examples customized for a specific ASR but crafts transferable adversarial examples on diverse surrogate ASRs.

7.2. Limitations and Future Work

While ZQ-Attack leverages diverse surrogate ASRs to achieve transferable audio adversarial attacks in the zero-query black-box setting, it incurs a higher computation cost (e.g., GPU memory). However, considering that the audio adversarial examples generated by ZQ-Attack are effective on multiple ASR systems and save the cost of queries, we consider that the additional cost is acceptable. Another limitation is that although the audio adversarial examples generated by ZQ-Attack exhibit better imperceptibility compared to prior work in the over-the-air setting, they may still be detected by humans. We leave enhancing the imperceptibility of audio adversarial examples in the over-the-air setting for future work.

8. Conclusion

We proposed ZQ-Attack, a transfer-based adversarial attack on ASR systems in the zero-query black-box scenario. By summarizing and categorizing the modern ASR systems, we first selected a diverse set of surrogate ASRs for generating adversarial examples. Then, we employed an adaptive search algorithm to initialize the adversarial perturbations with a scaled target command audio, ensuring its effectiveness and imperceptibility. Subsequently, we designed a novel sequential ensemble optimization algorithm to optimize the adversarial perturbations using the selected surrogate ASRs. Our experimental results indicate that ZQ-attack achieves successful attacks on 4 online speech recognition services and 16 open-source ASRs in the over-the-line setting and attacks 2 commercial IVC devices in the over-the-air setting. This demonstrates a significant improvement in the practicality of audio adversarial attacks compared to prior methods.

Acknowledgements.
We thank the anonymous reviewers for their helpful and valuable feedback. This work was partially supported by the National Key R&D Program of China 2023YFE0209800, NSFC under Grants U20B2049, U21B2018, 62302344, 62132011, 62161160337, and Shaanxi Province Key Industry Innovation Program 2021ZDLGY01-02.

References

  • (1)
  • Abdel-Hamid et al. (2014) Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. 2014. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on audio, speech, and language processing 22 (2014), 1533–1545.
  • Abdeljaber et al. (2017) Osama Abdeljaber, Onur Avci, Serkan Kiranyaz, Moncef Gabbouj, and Daniel J Inman. 2017. Real-time vibration-based structural damage detection using one-dimensional convolutional neural networks. Journal of Sound and Vibration 388 (2017), 154–170.
  • Abdullah et al. (2021) Hadi Abdullah, Kevin Warren, Vincent Bindschaedler, Nicolas Papernot, and Patrick Traynor. 2021. Sok: The faults in our asrs: An overview of attacks against automatic speech recognition and speaker identification systems. In Proc. of IEEE S&P. 730–747.
  • Alibaba (2023) Alibaba. 2023. Alibaba Cloud Intelligent Speech Interaction. https://www.alibabacloud.com/product/intelligent-speech-interaction.
  • Amazon (2023) Amazon. 2023. Alexa. https://www.alexa.com/.
  • Amodei et al. (2016) Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, **gliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. 2016. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proc. of ICML. 173–182.
  • Apple (2023) Apple. 2023. Siri. https://www.apple.com/siri/.
  • Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33 (2020), 12449–12460.
  • Brown et al. (2017) Tom B Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. 2017. Adversarial patch. CoRR abs/1712.09665 (2017).
  • Carlini and Wagner (2018) Nicholas Carlini and David Wagner. 2018. Audio adversarial examples: Targeted attacks on speech-to-text. In Proc. of IEEE SPW. 1–7.
  • Chen et al. (2020) Yuxuan Chen, Xue**g Yuan, Jiangshan Zhang, Yue Zhao, Shengzhi Zhang, Kai Chen, and XiaoFeng Wang. 2020. Devil’s Whisper: A general approach for physical adversarial attacks against commercial black-box speech recognition devices.. In Proc. of USENIX Security. 2667–2684.
  • Cohen et al. (2009) Israel Cohen, Yiteng Huang, **gdong Chen, Jacob Benesty, Jacob Benesty, **gdong Chen, Yiteng Huang, and Israel Cohen. 2009. Pearson correlation coefficient. Noise reduction in speech processing (2009), 1–4.
  • Demontis et al. (2019) Ambra Demontis, Marco Melis, Maura Pintor, Matthew Jagielski, Battista Biggio, Alina Oprea, Cristina Nita-Rotaru, and Fabio Roli. 2019. Why do adversarial attacks transfer? explaining transferability of evasion and poisoning attacks. In Proc. of USENIX Security. 321–338.
  • Dong et al. (2018b) Linhao Dong, Shuang Xu, and Bo Xu. 2018b. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In Proc. of IEEE ICASSP. 5884–5888.
  • Dong et al. (2018a) Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. 2018a. Boosting adversarial attacks with momentum. In Proc. of IEEE/CVF CVPR. 9185–9193.
  • Duan et al. (2022) Rui Duan, Zhe Qu, Shangqing Zhao, Leah Ding, Yao Liu, and Zhuo Lu. 2022. Perception-aware attack: Creating adversarial music via reverse-engineering human perception. In Proc. of ACM CCS. 905–919.
  • Gales et al. (2008) Mark Gales, Steve Young, et al. 2008. The application of hidden Markov models in speech recognition. Foundations and Trends® in Signal Processing 1 (2008), 195–304.
  • Gao et al. (2018) Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. In Proc. of IEEE SPW. 50–56.
  • Gelfand (2017) S.A. Gelfand. 2017. Hearing: An introduction to psychological and physiological acoustics. CRC Press.
  • Graves (2012) Alex Graves. 2012. Sequence transduction with recurrent neural networks. CoRR abs/1211.3711 (2012).
  • Graves et al. (2006) Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. of ICML. 369–376.
  • Graves et al. (2013) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In Proc. of IEEE ICASSP. 6645–6649.
  • Gulati et al. (2020) Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented transformer for speech recognition. In Proc. of INTERSPEECH. 5036–5040.
  • Guo et al. (2022) Hanqing Guo, Yuanda Wang, Nikolay Ivanov, Li Xiao, and Qiben Yan. 2022. SPECPATCH: Human-In-The-Loop Adversarial Audio Spectrogram Patch Attack on Speech Recognition. In Proc. of ACM CCS. 1353–1366.
  • Han et al. (2020) Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, and Yonghui Wu. 2020. ContextNet: Improving convolutional neural networks for automatic speech recognition with global context. In Proc. of INTERSPEECH. 3610–3614.
  • Hannun et al. (2014) Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep speech: Scaling up end-to-end speech recognition. CoRR abs/1412.5567 (2014).
  • Hansen (2016) Nikolaus Hansen. 2016. The CMA evolution strategy: A tutorial. CoRR abs/1604.00772 (2016).
  • Hermansky (1990) Hynek Hermansky. 1990. Perceptual linear predictive (PLP) analysis of speech. the Journal of the Acoustical Society of America 87 (1990), 1738–1752.
  • Higuchi et al. (2020) Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, and Tetsunori Kobayashi. 2020. Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict. In Proc. of INTERSPEECH. 3655–3659.
  • Hsu et al. (2021) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 3451–3460.
  • Huang et al. (2019) Qian Huang, Isay Katsman, Horace He, Zeqi Gu, Serge Belongie, and Ser-Nam Lim. 2019. Enhancing adversarial example transferability with an intermediate level attack. In Proc. of IEEE/CVF ICCV. 4733–4742.
  • Itakura (1975) Fumitada Itakura. 1975. Line spectrum representation of linear predictor coefficients of speech signals. The Journal of the Acoustical Society of America 57 (1975), S35–S35.
  • Karita et al. (2019a) Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al. 2019a. A comparative study on transformer vs rnn in speech applications. In Proc. of IEEE ASRU. 449–456.
  • Karita et al. (2019b) Shigeki Karita, Nelson Enrique Yalta Soplin, Shinji Watanabe, Marc Delcroix, Atsunori Ogawa, and Tomohiro Nakatani. 2019b. Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. In Proc. of INTERSPEECH. 1408–1412.
  • Kriman et al. (2020) Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, and Yang Zhang. 2020. Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions. In Proc. of IEEE ICASSP. 6124–6128.
  • Li et al. (2019) Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, and Ravi Teja Gadde. 2019. Jasper: An end-to-end convolutional neural acoustic model. In Proc. of INTERSPEECH. 71–75.
  • Li et al. (2020) **yu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, and Shujie Liu. 2020. On the comparison of popular end-to-end models for large scale speech recognition. In Proc. of INTERSPEECH. 1–5.
  • Liu et al. (2017) Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. 2017. Delving into transferable adversarial examples and black-box attacks. In Proc. of ICLR.
  • Majumdar et al. (2021) Somshubra Majumdar, Jagadeesh Balam, Oleksii Hrinchuk, Vitaly Lavrukhin, Vahid Noroozi, and Boris Ginsburg. 2021. Citrinet: Closing the gap between non-autoregressive and autoregressive end-to-end models for automatic speech recognition. CoRR abs/2104.01721 (2021).
  • Microsoft (2023) Microsoft. 2023. Microsoft Azure Speech Service. https://azure.microsoft.com/en-us/products/cognitive-services/speech-services/.
  • Nvidia (2023) Nvidia. 2023. NeMo. https://developer.nvidia.com/nemo/.
  • OpenAI (2023a) OpenAI. 2023a. Whisper. https://openai.com/research/whisper.
  • OpenAI (2023b) OpenAI. 2023b. Whisper large-v3. https://github.com/openai/whisper/.
  • Opitz and Maclin (1999) David Opitz and Richard Maclin. 1999. Popular ensemble methods: An empirical study. Journal of artificial intelligence research 11 (1999), 169–198.
  • Papernot et al. (2016) Nicolas Papernot, Patrick D. McDaniel, and Ian J. Goodfellow. 2016. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR abs/1605.07277 (2016).
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proc. of NeruIPS. 8024–8035.
  • Qi et al. (2023) Gege Qi, Yuefeng Chen, Yao Zhu, Binyuan Hui, Xiaodan Li, Xiaofeng Mao, Rong Zhang, and Hui Xue. 2023. Transaudio: Towards the transferable adversarial audio attack via learning contextualized perturbations. In Proc. of IEEE ICASSP. 1–5.
  • Qin et al. (2019) Yao Qin, Nicholas Carlini, Garrison Cottrell, Ian Goodfellow, and Colin Raffel. 2019. Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. In Proc. of ICML. 5231–5240.
  • Rabiner (1989) Lawrence R Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (1989), 257–286.
  • Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In Proc. of ICML. 28492–28518.
  • Ravikumar et al. (2022) Deepak Ravikumar, Sangamesh D Kodge, Isha Garg, and Kaushik Roy. 2022. TREND: Transferability based robust ensemble design. IEEE Transactions on Artificial Intelligence (2022).
  • Schneider et al. (2019) Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. In Proc. of INTERSPEECH. 3465–3469.
  • Schönherr et al. (2019) Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa. 2019. Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. In Proc. of NDSS.
  • Sigurdsson et al. (2006) Sigurdur Sigurdsson, Kaare Brandt Petersen, and Tue Lehn-Schiøler. 2006. Mel Frequency Cepstral Coefficients: An Evaluation of Robustness of MP3 Encoded Music.. In ISMIR. 286–289.
  • Spearman (1987) Charles Spearman. 1987. The proof and measurement of association between two things. The American journal of psychology 100 (1987), 441–471.
  • Szegedy et al. (2014) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In Proc. of ICLR.
  • Taori et al. (2019) Rohan Taori, Amog Kamsetty, Brenton Chu, and Nikita Vemuri. 2019. Targeted adversarial examples for black box audio systems. In Proc. of IEEE SPW. 15–20.
  • Tencent (2023) Tencent. 2023. Tencent Cloud Automatic Speech Recognition. https://cloud.tencent.com/document/product/1093/.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proc. of NeruIPS. 5998–6008.
  • Wang et al. (2020) Qian Wang, Baolin Zheng, Qi Li, Chao Shen, and Zhongjie Ba. 2020. Towards query-efficient adversarial attacks against automatic speech recognition systems. IEEE Transactions on Information Forensics and Security 16 (2020), 896–908.
  • Wu et al. (2018) Lei Wu, Zhanxing Zhu, Cheng Tai, and Weinan E. 2018. Understanding and enhancing the transferability of adversarial examples. CoRR abs/1802.09707 (2018).
  • Wu et al. (2023) Xinghui Wu, Shiqing Ma, Chao Shen, Chenhao Lin, Qian Wang, Qi Li, and Yuan Rao. 2023. KENKU: Towards efficient and stealthy black-box adversarial attacks against ASR systems. In Proc. of USENIX Security. 247–264.
  • Yang et al. (2019) Zhuolin Yang, Bo Li, Pin-Yu Chen, and Dawn Song. 2019. Characterizing audio adversarial examples using temporal dependency. In Proc. of ICLR.
  • Yu et al. (2023) Zhiyuan Yu, Yuanhaur Chang, Ning Zhang, and Chaowei Xiao. 2023. {{\{{SMACK}}\}}: Semantically meaningful adversarial audio attack. In Proc. of USENIX security. 3799–3816.
  • Yuan et al. (2018) Xue**g Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, XiaoFeng Wang, and Carl A. Gunter. 2018. CommanderSong: A systematic approach for practical adversarial voice recognition. In Proc. of USENIX Security. 49–64.
  • Zeng et al. (2019) Qiang Zeng, Jianhai Su, Chenglong Fu, Golam Kayas, Lannan Luo, Xiaojiang Du, Chiu C Tan, and Jie Wu. 2019. A multiversion programming inspired approach to detecting audio adversarial examples. In Proc. of IEEE/IFIP DSN. 39–51.
  • Zeyer et al. (2019) Albert Zeyer, Parnia Bahar, Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2019. A comparison of transformer and lstm encoder decoder models for asr. In Proc. of IEEE ASRU. 8–15.
  • Zhang et al. (2020) Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar. 2020. Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss. In Proc. of IEEE ICASSP. 7829–7833.
  • Zheng et al. (2021) Baolin Zheng, Peipei Jiang, Qian Wang, Qi Li, Chao Shen, Cong Wang, Yunjie Ge, Qingyang Teng, and Shenyi Zhang. 2021. Black-box adversarial attacks on commercial speech platforms with minimal information. In Proc. of ACM CCS. 86–107.
Table 10. Detailed results of evaluation on online speech recognition services.
  Command Azure Tencent Alibaba OpenAI
Attack SNR (dB) Attack SNR (dB) Attack SNR (dB) Attack SNR (dB)
  call my wife 18.41 19.88 16.89 17.44
make it warmer 20.35 18.18 16.50 15.41
navigate to my home 20.85 23.42 20.65 20.35
open the door 21.63 23.09 20.73 17.67
open the website 21.86 26.67 23.94 23.55
play music 25.47 26.51 20.26 20.97
send a text 25.60 23.28 25.79 23.18
take a picture 26.85 27.77 25.00 26.48
turn off the light 25.08 27.94 26.16 24.28
turn on airplane mode 17.44 17.79 16.53 16.75
  Average 22.35 23.45 21.24 20.61
 

Appendix A Evaluation on Online Speech Recognition Services

The detailed evaluation results of ZQ-Attack on online speech recognition services are presented in Table 10. For the 10 target commands, ZQ-Attack can generate effective adversarial examples on 4 online speech recognition services. The results demonstrate that ZQ-Attack can successfully generate effective and imperceptible audio adversarial examples on these online speech recognition services.

Table 11. Detailed results of evaluation on commercial IVC devices.
  Command NI-Occam KENKU ZQ-Attack
Siri Alexa Siri Alexa Siri Alexa
  call my wife
make it warmer
navigate to my home
open the door
open the website
play music
send a text
take a picture
turn off the light
turn on airplane mode
Average 4/10 5/10 7/10 9/10 10/10 10/10
 

Appendix B Evaluation on Commercial IVC Devices

The detailed evaluation results on commercial IVC devices are presented in Table 11. The results demonstrate that ZQ-Attack successfully attacks all 10 target commands, outperforming both KENKU and NI-Occam. Moreover, we observe that the effectiveness of these methods can vary significantly for different commands. For example, for the command play music, all methods successfully execute the attack. However, for the command make it warmer, both KENKU and NI-Occam fail to achieve success.

Appendix C Evaluation on Open-source ASRs

We introduce the transfer rate (TR) to evaluate the transferability of audio adversarial examples generated by ZQ-Attack to open-source ASR systems. For a specified target command and target open-source ASR system, the TR represents the success rate of the transfer attack, i.e., the ratio of adversarial examples within Xsuperscript𝑋X^{\prime}italic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT that successfully attack the target open-source ASR system. A higher TR indicates better transferability of the audio adversarial examples generated by ZQ-Attack to the target open-source ASR system. The results are presented in Table 12. ZQ-Attacks obtains an average TR of 65.19% and an SNR of 19.67dB on 16 open-source ASRs. These results indicate that ZQ-Attack can generate audio adversarial examples that effectively attack a diverse range of open-source ASRs.

Table 12. Evaluation on open-source ASRs.
  Command Jasper QuartzNet Citrinet (M) Citrinet (L)
Attack TR (%) SNR (dB) Attack TR (%) SNR (dB) Attack TR (%) SNR (dB) Attack TR (%) SNR (dB)
  call my wife 31.73 9.97 33.35 9.90 24.08 11.60 50.28 13.46
make it warmer 18.21 9.43 27.84 9.94 12.50 9.64 38.20 13.00
navigate to my home 27.35 12.28 15.35 11.52 20.73 13.88 38.42 14.71
open the door 28.60 13.19 23.30 12.96 11.15 13.33 27.19 14.54
open the website 19.06 15.13 12.63 14.72 13.08 17.98 23.32 15.59
play music 38.72 14.04 32.41 13.41 29.91 16.20 55.16 17.54
send a text 44.28 18.50 22.50 15.94 2.58 15.69 61.94 19.70
take a picture 20.42 19.98 16.25 17.99 21.92 22.79 36.07 22.07
turn off the light 14.78 17.55 11.85 16.36 13.03 18.73 26.31 20.66
turn on airplane mode 7.27 5.82 24.33 6.86 10.59 6.87 21.75 7.59
  Average 10/10 25.04 13.59 10/10 21.98 12.96 10/10 15.96 14.67 10/10 37.86 15.89
  Command ContextNet (M) ContextNet (L) Conformer-CTC (M) Conformer-CTC (L)
Attack TR (%) SNR (dB) Attack TR (%) SNR (dB) Attack TR (%) SNR (dB) Attack TR (%) SNR (dB)
  call my wife 75.56 15.37 69.31 14.18 100.00 23.06 93.02 20.66
make it warmer 62.58 13.03 65.07 13.01 91.76 23.17 85.89 17.71
navigate to my home 53.39 14.95 56.53 16.71 96.57 25.52 80.47 19.88
open the door 39.96 14.02 50.64 15.40 99.71 26.84 94.56 25.33
open the website 31.78 15.56 49.42 17.73 99.34 26.75 87.89 25.05
play music 46.42 14.78 61.80 17.81 99.05 26.84 99.72 26.85
send a text 49.05 19.44 56.27 18.73 93.02 25.37 98.58 25.91
take a picture 27.93 19.76 37.87 21.81 99.91 27.78 97.82 27.78
turn off the light 42.07 18.46 55.41 20.58 98.01 27.89 96.97 27.84
turn on airplane mode 67.30 11.96 52.77 12.73 98.66 18.06 83.09 18.06
  Average 10/10 49.60 15.73 10/10 55.51 16.87 10/10 97.60 25.13 10/10 91.80 23.51
  Command Conformer-CTC (XL) Conformer-Transducer (M) Conformer-Transducer (L) Conformer-Transducer (XL)
Attack TR (%) SNR (dB) Attack TR (%) SNR (dB) Attack TR (%) SNR (dB) Attack TR (%) SNR (dB)
  call my wife 96.61 22.02 97.99 22.99 96.19 20.58 83.21 20.10
make it warmer 84.56 18.87 97.82 24.77 85.14 18.37 43.34 18.48
navigate to my home 97.54 25.47 99.76 25.52 84.51 22.05 93.03 23.30
open the door 91.13 25.90 100.00 26.84 54.29 16.05 60.62 18.73
open the website 87.18 24.17 100.00 26.76 41.65 23.85 36.62 21.09
play music 97.26 26.85 98.77 26.85 90.61 26.30 59.51 22.27
send a text 92.70 24.39 98.94 25.87 46.71 17.69 92.18 24.76
take a picture 99.76 27.78 99.84 27.78 69.12 23.11 60.07 23.96
turn off the light 76.09 24.81 99.72 27.95 96.35 27.84 68.76 24.02
turn on airplane mode 75.66 15.61 100.00 18.06 20.77 10.48 34.06 14.08
  Average 10/10 89.85 23.59 10/10 99.28 25.34 10/10 68.53 20.63 10/10 63.14 21.08
  Command Whisper (base) Whisper (small) Whisper (medium) Whisper (large)
Attack TR (%) SNR (dB) Attack TR (%) SNR (dB) Attack TR (%) SNR (dB) Attack TR (%) SNR (dB)
  call my wife 58.90 13.55 87.73 17.94 91.34 19.80 89.68 18.73
make it warmer 69.28 13.57 78.24 15.00 80.67 18.76 81.56 17.89
navigate to my home 53.57 15.14 77.49 19.11 85.16 22.64 91.71 25.44
open the door 72.91 18.62 75.41 19.59 88.02 24.02 81.69 22.44
open the website 63.85 20.54 78.41 22.53 89.61 25.15 94.93 26.76
play music 65.24 16.91 81.73 23.04 88.06 26.81 91.75 26.85
send a text 55.08 19.94 66.03 19.53 77.04 25.59 89.01 25.83
take a picture 84.77 26.71 84.91 27.54 97.66 27.78 99.15 27.78
turn off the light 64.04 22.09 86.75 25.97 95.12 26.45 97.20 27.61
turn on airplane mode 66.78 11.70 92.72 17.32 96.96 16.92 99.18 18.06
  Average 10/10 65.44 17.88 10/10 80.94 20.76 10/10 88.97 23.39 10/10 91.58 23.74
 

In the evaluation, we observe a potential correlation between the performance of the target ASR system (e.g., WER on LibriSpeech test-other) and the TR. To further analyze this correlation, we extend the set of 16 open-source ASRs to include Conformer-Transducer (XXL), Whisper (tiny), Whisper large-v2, and Whisper large-v3. Additionally, we categorize them into two groups: Whisper and Others. This categorization stems from the fact that Whisper is designed for multilingual recognition and has not undergone fine-tuning on the LibriSpeech dataset, unlike other open-source ASRs, which are tailored for English recognition tasks and include the LibriSpeech dataset as part of their training set.

The potential correlation is depicted in Figure 7, with Whisper and other open-source ASRs represented by different colors in the scatter plot. It is discernible that there exists a potential negative correlation between the performance of the target ASR system and the TR. For quantitative analysis, we employ the Pearson correlation coefficient (Cohen et al., 2009) and Spearman’s rank correlation coefficient (Spearman, 1987) to characterize this correlation. In statistics, the Pearson correlation coefficient measures the linear correlation between two sets of data, while Spearman’s rank correlation coefficient assesses the correlation of monotonic relationships. Their values range from -1 to 1, with closer proximity to 1 indicating a stronger positive correlation, closer to -1 indicating a stronger negative correlation, and closer to 0 suggesting a weaker correlation. The results are presented in Table 13. They reveal a significant negative correlation between the performance and the TR for both categories of open-source ASRs.

We also conduct an ablation study to evaluate the impact of acoustic feature loss. Evaluations are carried out on 10 target commands, and 16 open-source ASR systems. There are 4 instances of attack failure (4/160) when the acoustic feature loss is omitted from the loss function.

Refer to caption
Figure 7. Correlation between ASR performance and TR.
Table 13. Correlation coefficient between ASR performance and TR.
  Coefficient Whisper Others
  Pearson Correlation -0.933 -0.758
Spearman’s Rank Correlation -0.607 -0.580
 

Appendix D Adaptive search algorithm

For a specific target command audio and carrier audio, we illustrate the values of scaling factor μ𝜇\muitalic_μ corresponding to different padding lengths in Figure 8. The starting position represents the length of the padding on the left side. As the adaptive search algorithm is related to the carrier audio, we present the waveform of the initialized adversarial examples for the same target command audio under different carrier audios in Figure 9. We can see that, for different carrier audios, the adaptive initialization algorithm finds distinct padding lengths on each side and scaling factors to initialize the adversarial perturbation.

Additionally, in our adaptive search algorithm, the obtained μ𝜇\muitalic_μ is a scalar. We also investigate treating μ𝜇\muitalic_μ as an optimizable variable. Adversarial examples initialized using these two search methods and optimized through the sequential ensemble optimization algorithm exhibit comparable stealthiness and effectiveness. However, treating μ𝜇\muitalic_μ as an optimizable variable results in greater time overhead. Therefore, we choose to search for μ𝜇\muitalic_μ as a scalar.

Refer to caption
Figure 8. Scaling factor at different starting positions.
Refer to caption
Figure 9. Waveforms of the initialized adversarial examples for different carrier audios.