Entanglement engineering of optomechanical systems by reinforcement learning

Li-Li Ye School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, Arizona 85287, USA    Christian Arenz School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, Arizona 85287, USA    Joseph M. Lukens Research Technology Office and Quantum Collaborative, Arizona State University, Tempe, Arizona 85287, USA Quantum Information Science Section, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA    Ying-Cheng Lai [email protected] School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, Arizona 85287, USA Department of Physics, Arizona State University, Tempe, Arizona 85287, USA
(July 2, 2024)
Abstract

Entanglement is fundamental to quantum information science and technology, yet controlling and manipulating entanglement — so-called entanglement engineering — for arbitrary quantum systems remains a formidable challenge. There are two difficulties: the fragility of quantum entanglement and its experimental characterization. We develop a model-free deep reinforcement-learning (RL) approach to entanglement engineering, in which feedback control together with weak continuous measurement and partial state observation is exploited to generate and maintain desired entanglement. We employ quantum optomechanical systems with linear or nonlinear photon-phonon interactions to demonstrate the workings of our machine-learning-based entanglement engineering protocol. In particular, the RL agent sequentially interacts with one or multiple parallel quantum optomechanical environments, collects trajectories, and updates the policy to maximize the accumulated reward to create and stabilize quantum entanglement over an arbitrary amount of time. The machine-learning-based model-free control principle is applicable to the entanglement engineering of experimental quantum systems in general.

Introduction

Entanglement [1, 2, 3, 4, 5] is fundamental to all fields in quantum information science such as quantum sensing [6], quantum computation [7], and quantum networks [8, 9, 10, 11, 12, 13]. However, the inherent fragility of quantum entanglement and coherence [14] poses significant challenges for experimental applications. For example, in quantum computing, the application of quantum gates to quantum states needs to last for a finite amount of time [15, 16, 17, 18, 19, 20], making it critical to maintain the entanglement after its creation. Moreover, the transition from noisy intermediate-scale systems [21] to large-scale, fault-tolerant systems [16] requires sophisticated entanglement engineering strategies to establish and maintain entanglement through optimal control protocols in the presence of noise and decoherence.

At present, a major challenge in entanglement engineering is the design of experimental observations. Existing machine-learning-based works use the full fidelity, i.e., the overlap between the current and target quantum states, as the observation metric. Applications range from the generation of two-qubit [22] and multi-qubit entangled states [23, 24] to specific many-body states [25, 26, 27] and single-particle quantum state engineering via deep reinforcement learning (RL) [28, 29]. However, full fidelity observation is not universally applicable in experiments. Moreover, obtaining the relationship between the entanglement and experimental observables is difficult. So far there have been no systematic methods to extract quantitative entanglement from experimental observation for arbitrary quantum systems [30, 31, 32], in spite of some initial exploration for specific systems. For example, an entanglement criterion for non-Gaussian states in coupled harmonic oscillators was developed [30]. Under the strong laser approximation, a Bell inequality was tested with photon counting [31], and stationary entanglement for Gaussian states was inferred from the continuous measurement of light only [32].

In this paper, using quantum optomechanical systems with linear or nonlinear photon-phonon interactions as a paradigm, we develop a deep RL approach to entanglement engineering. For quantum control of optomechanical systems, most existing theoretical studies focused on Gaussian states or the linear interaction regime [33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 38, 44, 45], with the primary goal of generating entanglement as quickly as possible (entanglement enhancement) [43, 38, 44, 45]. Previous control methods are mostly model-based, i.e., prior information about the system model is needed. Examples include the pulse method [33, 34, 35, 36, 37], time-continuous laser-driven approaches [38, 39], periodic modulations [40, 41, 42], optimal pulse protocols [43], linear quadratic-Gaussian (LQG) methods [38], and coherent feedback methods using auxiliary optical components [44, 45]. We note that there were two previous works [46, 47] on model-free RL for controlling and stabilizing a quantum system with an inverted harmonic potential and a double-well nonlinear potential, respectively, to a target state using weak continuous measurements (WCMs) and partial state observation. However, these two works did not address entanglement control, whereas our work develops a model-free deep-RL method to realize non-Gaussian entanglement engineering using only photon number counting from WCMs. To our knowledge, prior to our work, model-free deep RL feedback control to create and stabilize entanglement with WCM observations had not been available.

The particular aspects of our work that go beyond the existing works are briefly described as follows. In our work, in the linear (nonlinear) interaction regime, the observation is the WCM photocurrent (the expectation value of the photon number). We note a previous work [29] that employed a proximal policy optimization (PPO) [48] RL agent to generate different Fock states and the superposition of a single cavity mode based on observing the density matrix and a fidelity-based reward function. In contrast, the observable in our work is the photocurrent, which is more experimentally accessible [49]. For quantum measurement, we use WCM in real-time feedback control, taking into consideration the resulting quantum stochastic process [50, 51, 47], and identify a numerical relationship between the entanglement and photocurrent. In both the linear and nonlinear regimes, we focus on non-Gaussian state control because, according to the nonlinear quantum master equation resulting from WCM, the time-evolving quantum states are intrinsically non-Gaussian. Our deep RL control scheme is model-free [52]: policies or value functions are learned directly from the interactions with the quantum environment without any explicit model of this environment. This should be contrasted with model-based deep RL methods [53], where a pre-built model of the environment is needed for policy decision-making. We demonstrate that, under the actions of the well-trained PPO or recurrent PPO RL agent, entanglement between the quantum optical and mechanical modes can be created and maintained near the target entanglement.

Our main results are as follows. First, under the strong laser approximation, the interaction resulting from the radiation pressure between the cavity and the mechanical oscillator modes can be linearized and described by the beam-splitter Hamiltonian. During the training phase, the PPO agent interacts with parallel quantum environments and collects the resulting data by episodic learning, with the observation being the WCM photocurrent. The deep-RL method can extract useful information from the measurement photocurrent, which is encoded in the Wiener process, and achieve the target entanglement engineering in a model-free manner for the quantum system that is dissipative due to coupling to the vacuum bath and is driven by a laser. In the testing phase, with the agent interacting with and observing a single quantum environment, we demonstrate that the entanglement-engineering performance of our deep-RL method with WCM observation greatly exceeds that of both state-based Bayesian methods [54, 47] and random control. Second, when the driving laser field is not strong, the quantum optomechanical interaction is nonlinear [55, 56]. In this case, we articulate two training phases for nonlinear entanglement engineering. The first phase, dubbed the target-generating phase, is used to infer the entanglement by the model-free deep RL: the observation of the PPO agent [with multilayer perceptrons (MLPs)] is the logarithmic negativity, and the reward function is constructed to limit high-level excitation and facilitate entanglement learning. (Direct experimental measurement of the logarithmic negativity is currently not available.) The time series of the expected photon number in the regime of converged training episodes is selected as the target for the next phase. The second phase is the target-utilization phase, where the recurrent PPO (with long short-term memory (LSTM) [57] added after the MLPs) observes the expected photon number and obtains the reward based only on the target expected photon number obtained from the first phase. In this framework, the recurrent PPO controls the quantum state in the low-energy regime with the desired entanglement created and stabilized.

Figure 1: Experimental proposal of measurement-based feedback control of deep RL to create and stabilize entanglement in an open quantum optomechanical system dissipatively coupled to the vacuum bath. Quantum optomechanics was experimentally realized in a microwave electromechanical system [58, 59, 60], where the multiplexing qubit was used to weakly couple to the microwave resonator for extracting the photon number statistics through weak measurements [49]. The RL agent acts in one or multiple parallel quantum optomechanical environments according to the parameterized policy and collects data in one episode consisting of $T$ time steps: observations $O_t$, reward $R_t$, and actions, after which the quantum optomechanical environment is reset. After one or several episodes, the policy is updated using minibatch data to maximize the accumulated reward. The aim is to achieve the desired entanglement $E_N \sim \log 2 \sim 0.7$ (in the natural logarithmic base) between the cavity-optical and mechanical modes. Entanglement engineering of this type can be achieved in both the linear and nonlinear interaction regimes. In the linear case, the task is similar to that of achieving an entangled Bell state of the beam-splitter Hamiltonian or “swap” Hamiltonian. In the nonlinear regime, the entangled states from entanglement engineering can be complicated. Illustrated are the resulting photon and phonon number distributions of the entangled states.

Results

Experimental proposal for entanglement engineering

Our goal is achieving entanglement engineering between the optical cavity and mechanical oscillator modes using deep RL. Based on the current experimental progress, we articulate an experimental proposal to achieve this goal, as shown in Fig. 1. Consider a Fabry-Perot cavity that consists of a single-mode cavity and a movable end mirror. The optical cavity has the frequency $\omega_c$ and the optical field exerts a radiation pressure on the mirror. The cavity mode is driven externally by a coherent laser field with frequency $\omega_L$. The mirror’s quantized center-of-mass motion is described by a harmonic oscillator of frequency $\omega_m$. In the rotating frame of the laser, the Hamiltonian describing the coupling between the optical cavity and mechanical oscillator modes is given by [55, 56]

$$\tilde{H}_{nl} = -\hbar\Delta\,\hat{a}^{\dagger}\hat{a} + \hbar\omega_m\,\hat{b}^{\dagger}\hat{b} + \hbar g_0(\hat{b}^{\dagger} + \hat{b})\,\hat{a}^{\dagger}\hat{a} + \hbar\alpha_L(\hat{a}^{\dagger} + \hat{a}),$$

where $\hat{a}$ and $\hat{b}$ are the annihilation operators of the cavity and mechanical mode, respectively, and $\hat{a}^{\dagger}$ and $\hat{b}^{\dagger}$ are the corresponding creation operators. The frequency detuning of the cavity is $\Delta \equiv \omega_L - \omega_c$. The nonlinear coupling $g_0$ arises from the radiation pressure force between the light and the movable mirror (details given in Supplementary Note 2), and $\alpha_L$ is the real amplitude of the driven electromagnetic field. We set $g_0 > \kappa$ so that the single-photon optomechanical coupling rate $g_0$ exceeds the coupling strength $\kappa$ between the cavity and the vacuum bath. This condition guarantees observable nonlinear quantum effects [61]. Under the strong laser approximation $|\bar{\alpha}_c| \gg 1$, where $|\bar{\alpha}_c|$ is the amplitude of the light field inside the cavity induced by the strong laser, we have $\hat{a} \approx \bar{\alpha}_c + \delta\hat{a}$, with $\delta\hat{a}$ denoting the excitation or the shifted oscillator on top of the large coherent state with the amplitude $\bar{\alpha}_c$. The resulting linearized beam-splitter or “swap” Hamiltonian [62, 55] is

$$\tilde{H}_{bs} \approx \hbar\omega_m\,\delta\hat{a}^{\dagger}\delta\hat{a} + \hbar\omega_m\,\hat{b}^{\dagger}\hat{b} + \hbar G(\delta\hat{a}^{\dagger}\hat{b} + \hat{b}^{\dagger}\delta\hat{a}),$$

which is obtained in the red-detuned regime $\Delta = -\omega_m$, where the coefficient $G \equiv g_0\bar{\alpha}_c$ can be tuned by the amplitude of the incoming laser (a time-dependent modulation) [63]. The interaction term describes the state transfer between photons and phonons in the strong coupling regime for $G > \kappa$, with $\kappa$ ($\gamma$) being the decay rate of the cavity (mechanical) mode to the vacuum bath at zero temperature.
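As an illustration of the two Hamiltonians above, the following is a minimal numerical sketch, assuming $\hbar = 1$, units of $\omega_m$, and a truncated Fock basis. The helper names (`destroy`, `mode_operators`, `evolve`) and the value $G = 1$ are hypothetical choices made for this example, and the evolution is closed-system only: dissipation and measurement back-action are deliberately left out.

```python
import numpy as np

def destroy(dim):
    """Annihilation operator truncated to `dim` Fock levels."""
    return np.diag(np.sqrt(np.arange(1, dim)), k=1)

def mode_operators(dim):
    """Cavity (a) and mechanical (b) operators on the joint space."""
    a = np.kron(destroy(dim), np.eye(dim))   # acts on the first factor
    b = np.kron(np.eye(dim), destroy(dim))   # acts on the second factor
    return a, b

def nonlinear_hamiltonian(delta, alpha_L, g0=0.2, omega_m=1.0, dim=10):
    """H_nl / hbar: detuning, mechanics, radiation pressure, laser drive."""
    a, b = mode_operators(dim)
    ad, bd = a.T, b.T                        # real matrices: transpose = dagger
    n_a = ad @ a
    return (-delta * n_a + omega_m * (bd @ b)
            + g0 * (bd + b) @ n_a + alpha_L * (ad + a))

def beam_splitter_hamiltonian(G, omega_m=1.0, dim=2):
    """H_bs / hbar at the red-detuned point Delta = -omega_m."""
    a, b = mode_operators(dim)
    ad, bd = a.T, b.T
    return omega_m * (ad @ a + bd @ b) + G * (ad @ b + bd @ a)

def evolve(H, psi0, t):
    """Closed-system evolution exp(-i H t) psi0 via eigendecomposition."""
    evals, evecs = np.linalg.eigh(H)
    return evecs @ (np.exp(-1j * evals * t) * (evecs.conj().T @ psi0))

G = 1.0                                      # illustrative coupling strength
H_bs = beam_splitter_hamiltonian(G)
psi0 = np.zeros(4, dtype=complex)
psi0[2] = 1.0                                # |10>: index = n_cavity * dim + n_mech
psi_swap = evolve(H_bs, psi0, np.pi / (2 * G))
p_phonon = abs(psi_swap[1]) ** 2             # population transferred to |01>
```

In this idealized setting the excitation oscillates between the cavity and the mechanical mode at a rate set by $G$, which is the "state transfer" role of the interaction term; the truncation sizes (2 and 10 levels per mode) follow the Fock dimensions quoted later in the text.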

Our control strategy was developed with the current experimental capability in mind. Previous works on the microwave regime of optomechanical systems [58, 59, 60] suggested the feasibility of the experimental implementation of our RL control scheme. In particular, a one-to-one correspondence between the Fabry-Perot cavity and the microwave electromechanical system was demonstrated [64, 59, 58]. As shown in Fig. 1, the microwave resonator of an LC circuit is equivalent to the Fabry-Perot optical cavity mode, with the movable capacitor [64] $C_m(x)$ corresponding to the flexible mirror in the optical cavity. The resistors $R_c$ and $R_m$ can be related to the decay rates $\kappa$ and $\gamma$ to the vacuum bath [64]. Based on the experimental results, we can compare the typical parameter configurations between the optomechanical and electromechanical systems. The decay rate of the optical cavity mode is $\kappa = 0.01\,\omega_m$ in the linear regime and $\kappa = 0.1\,\omega_m$ in the nonlinear regime, and the mechanical oscillator mode has the better quality, with $\gamma = 0.01\,\kappa$. Consequently, we have $\gamma \approx 10^{-3}\,\omega_m \sim 10^{-4}\,\omega_m$. The typical experimental decay rate of the microwave resonator is [59, 58, 65, 66, 67, 61] $\kappa \approx 0.01\,\omega_m \sim 0.1\,\omega_m$ with $\gamma \approx 10^{-3}\,\omega_m \sim 10^{-9}\,\omega_m$. In our work, the nonlinear coupling is set to be $g_0 = 0.2\,\omega_m$, whereas the typical coupling in the strong coupling regime in a previous work [58] was about $g_0 = 0.1\,\omega_m$. The strength of the laser in our work is $G \in [-5, 5]\,\omega_m$ for the linear system in the red-detuned regime $\Delta = -\omega_m$ and $\Delta, \alpha_L \in [-5, 5]\,\omega_m$ for the nonlinear system. In the microwave version, this range can be adjusted by the pump’s strength [59, 58, 65, 66, 67, 61].

In the microwave regime, it was demonstrated that the photon-number statistics of a microwave cavity mode can be detected using multiplexed photon number measurements [49, 68, 69]. In this method, the multiplexing qubit encodes multiple bits about the photon number distribution of a microwave resonator through dispersive interaction. A frequency comb drive, distributed at $f_{\textnormal{MP}} - k\chi$, reads out all the information about the photon number distribution at once [49], where $k$ denotes the number of photons and $\chi$ represents the dispersive qubit-resonator coupling, as shown in Fig. 1. The reduction in the reflection amplitude of the frequency comb, $1 - r_k$ with $k = 0, 1, \ldots$, is proportional to the photon-number distribution of the microwave cavity mode over the Fock basis, as detected by the weak measurement [29, 49]. In the circuit design of our experimental proposal, we add a capacitor $C_q$ to realize the weak coupling to the original electromechanical system. The coupling capacitance is small enough to be neglected in the total Hamiltonian, but it still allows the multiplexing qubit, denoted by the green cross in Fig. 1, to encode the photon number distribution of the microwave resonator through dispersive interaction.

Under weak measurement [29, 49], the PPO agent collects the sequence of the reduced reflection amplitude $1 - r_k$, which is proportional to the occupied photon number probability. Consequently, the expected photon number is calculated as

$$\langle \hat{n}_p \rangle = \sum_n n\,\langle \sqrt{\eta}\,\hat{P}_n \rangle / \sqrt{\eta} = \sum_n n\,\langle \hat{P}_n \rangle$$

and the WCM photocurrent is

$$\sqrt{\eta}\,\mathcal{I}(t) = \sum_n n \left[ \langle \sqrt{\eta}\,\hat{P}_n \rangle + \frac{dW_n(t)}{\sqrt{4\eta}\,dt} \right] \qquad (1)$$

with the measurement rate $\eta$, where $\hat{P}_n = |n\rangle\langle n|$ is the measurement projector on the Fock state $|n\rangle$, and $dW(t)$ is the Wiener increment with zero mean and variance $dt = 0.01\,\omega_m^{-1}$ (the time step size in our calculations). In the linear quantum optomechanical regime, the Fock space for each mode is limited to $n = 0, 1$. The action is the amplitude modulation of the laser, which is in the range $G \in [-5, 5]\,\omega_m$. In the nonlinear regime, the Fock dimension is $n = 0, 1, \ldots, 9$. The time-dependent control signal consists of the detuning $\Delta$ and the amplitude $\alpha_L$ of the driven laser within the fixed range $\Delta, \alpha_L \in [-5, 5]\,\omega_m$.
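The two observation formulas can be sketched numerically as follows. This is only an illustrative sampler under stated assumptions: the Fock populations $\langle\hat{P}_n\rangle$ are supplied by hand rather than computed from a simulated state, the increments $dW_n$ are taken to be independent for different $n$, and the value $\eta = 1$ is a placeholder (the text does not quote the measurement rate here). The function names are hypothetical.

```python
import numpy as np

def expected_photon_number(populations):
    """<n_p> = sum_n n <P_n> for Fock populations p_0, p_1, ..."""
    p = np.asarray(populations, dtype=float)
    return float(np.arange(p.size) @ p)

def wcm_photocurrent(populations, eta, dt, rng):
    """One sample of I(t) from Eq. (1), assuming independent dW_n ~ N(0, dt)."""
    p = np.asarray(populations, dtype=float)
    n = np.arange(p.size)
    dW = rng.normal(0.0, np.sqrt(dt), size=p.size)    # Wiener increments
    scaled = n @ (np.sqrt(eta) * p + dW / (np.sqrt(4.0 * eta) * dt))
    return float(scaled / np.sqrt(eta))               # divide out sqrt(eta)

rng = np.random.default_rng(0)
populations = [0.5, 0.5]          # illustrative two-level (n = 0, 1) populations
mean_signal = expected_photon_number(populations)
noisy_sample = wcm_photocurrent(populations, eta=1.0, dt=0.01, rng=rng)
```

With these placeholder values the noise term in a single sample is much larger than the signal, since its standard deviation scales as $1/(2\eta\sqrt{dt})$; this is the sense in which the photocurrent is a far noisier observation than the expectation value.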

The open, dissipative quantum optomechanical system under WCM obeys the stochastic master equation (SME) (see Methods). The number $n_{\textnormal{traj}}$ of trajectories simulated from the SME can be selected according to the following considerations. If the observable is some expected physical quantity, using one trajectory is sufficient to extract the information about the quantum state: $n_{\textnormal{traj}} = 1$. Experimentally, WCMs are performed, encoding the Wiener process in the observation and resulting in a large variance about the expectation value. To reduce the variance, more quantum trajectories should be used. To make computations feasible, we use five trajectories: $n_{\textnormal{traj}} = 5$.
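The variance argument can be illustrated with a toy calculation. The sketch below assumes that the noisy records from different trajectories are independent and are simply averaged at each time step; this averaging rule, the constant signal level, and $\eta = 1$ are assumptions made for the example, not details taken from the SME simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, eta, signal = 0.01, 1.0, 0.5                 # eta and signal are illustrative
noise_std = 1.0 / (2.0 * eta * np.sqrt(dt))      # std of the n = 1 noise term in Eq. (1)

def averaged_record(n_traj, n_steps=20000):
    """Average n_traj independent noisy records at each time step."""
    records = signal + noise_std * rng.standard_normal((n_traj, n_steps))
    return records.mean(axis=0)

std_1 = averaged_record(1).std()                 # single trajectory
std_5 = averaged_record(5).std()                 # five trajectories averaged
ratio = std_5 / std_1                            # expected to be close to 1 / sqrt(5)
```

Under these assumptions the standard deviation shrinks as $1/\sqrt{n_{\textnormal{traj}}}$, so five trajectories reduce the noise by a little more than a factor of two at five times the simulation cost.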

In the online training phase, for each episode with $T$ time steps, e.g., $T = 500$, the PPO agent (the combination of the actor and critic networks) collects the sequence of the observations $O(t) = \langle \hat{n}_p \rangle(t)$ or $\mathcal{I}(t)$, the reward value $R(t) = -|O(t) - \langle \hat{n}^{\textnormal{target}}_p \rangle(t)|$, and the resulting actions generated by its policy. After one or several episodes, the policy of the PPO agent is updated using minibatch data to maximize the accumulated reward. The RL agent is designed to interact with a single or multiple parallel quantum environments to make the time-evolving observation $O(t)$ align with the target one, $\langle \hat{n}^{\textnormal{target}}_p \rangle(t)$. In the online testing phase, the policy of the well-trained agent is no longer updated, and the agent interacts with only a single quantum environment to give the optimal control protocol for the corresponding observation. To realize entanglement engineering, i.e., achieving the desired entanglement between the cavity-optical and mechanical modes, finding the relation between the experimental observables and entanglement quantities is an unavoidable challenge. In our work, the model-free PPO agent finds the numerical relationship between them and realizes the entanglement engineering in both the linear and nonlinear regimes of quantum optomechanics, as shown in Fig. 1.
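The data-collection step described above can be sketched as a plain rollout loop. This is a schematic only: `toy_env_step` and `toy_policy` are hypothetical stand-ins for the SME-based environment and the trained PPO policy, the initial observation is a placeholder, and no policy update is performed.

```python
def reward(observation, target):
    """R(t) = -|O(t) - target(t)|, as in the training description."""
    return -abs(observation - target)

def clip_action(action, low=-5.0, high=5.0):
    """Keep the control inside the allowed range [-5, 5] (units of omega_m)."""
    return max(low, min(high, action))

def collect_episode(env_step, policy, targets):
    """Roll out one episode and return (observations, actions, rewards)."""
    observations, actions, rewards = [], [], []
    obs = 0.0                                   # placeholder initial observation
    for target in targets:
        action = clip_action(policy(obs))
        obs = env_step(action)                  # environment returns next observation
        observations.append(obs)
        actions.append(action)
        rewards.append(reward(obs, target))
    return observations, actions, rewards

# Hypothetical stand-ins, NOT the SME dynamics or a trained PPO policy.
def toy_env_step(action):
    return 0.1 * action

def toy_policy(obs):
    return 2.0

T = 500                                         # time steps per episode
targets = [0.5] * T                             # illustrative constant target
obs_seq, act_seq, rew_seq = collect_episode(toy_env_step, toy_policy, targets)
episode_return = sum(rew_seq)                   # quantity the policy update maximizes
```

Because the reward is the negative absolute deviation from the target, it is never positive, and an episode return of zero would correspond to an observation that tracks the target exactly at every step.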

A general quantity to measure the entanglement between arbitrary quantum bipartite systems for any mixed states is the logarithmic negativity [70, 71, 72], without the influence of the vacuum bath [73]. In contrast, the conventional pure-state entanglement measures, such as the von Neumann and Rényi entropy, capture both quantum and classical correlations. Since the goal of our study is harnessing the entanglement between the cavity and oscillator modes, we focus on the logarithmic negativity: $E_N(\rho) \equiv \log_2 \|\rho^{T_i}\|_1$, where $\|X\|_1 = \textnormal{Tr}\sqrt{X^{\dagger}X}$ is the trace norm and $\rho^{T_i}$ is the partial transpose with respect to one of the two subsystems, $i = 0$ (quantum-optical cavity mode) or $i = 1$ (mechanical oscillator mode). The logarithmic negativity measures the degree to which $\rho^{T_i}$ fails to be positive, i.e., the extent of inseparability or entanglement, and it is an upper bound of the distillable entanglement [70, 71]. The logarithmic negativity is a full entanglement monotone [71], which satisfies the following criteria [74, 72]: (1) $E_N$ is a non-negative functional, (2) $E_N$ vanishes if the state $\rho$ is separable, and (3) $E_N$ does not increase on average under Gaussian local operations and classical communication [75, 76] or positive partial transpose preserving operations [77]. Since $E_N$ quantifies the quantum correlation between the bipartite systems in spite of the coupling to the vacuum bath, the value of $E_N$ calculated from the cavity mode is equal to that of the oscillator mode: $E^0_N = E^1_N = E_N$, which can be verified numerically.
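A minimal sketch of how the logarithmic negativity can be evaluated for a small bipartite density matrix is given below. The function names are hypothetical, and the logarithm base is left as a parameter: the default natural base is chosen here only because the target value quoted in this paper, $\log 2 \sim 0.7$, is stated in the natural logarithmic base.

```python
import numpy as np

def partial_transpose(rho, dims, subsystem):
    """Transpose `rho` on one factor of a bipartite space with dims (d0, d1)."""
    d0, d1 = dims
    r = rho.reshape(d0, d1, d0, d1)              # indices: (i0, i1, j0, j1)
    if subsystem == 0:
        r = r.transpose(2, 1, 0, 3)              # swap i0 <-> j0
    else:
        r = r.transpose(0, 3, 2, 1)              # swap i1 <-> j1
    return r.reshape(d0 * d1, d0 * d1)

def log_negativity(rho, dims, subsystem=0, base=np.e):
    """E_N = log ||rho^{T_i}||_1; the trace norm is the sum of singular values."""
    rho_pt = partial_transpose(rho, dims, subsystem)
    trace_norm = np.linalg.svd(rho_pt, compute_uv=False).sum()
    return float(np.log(trace_norm) / np.log(base))

# Bell-type state (|10> + |01>) / sqrt(2) in the n = 0, 1 truncation.
psi = np.zeros(4, dtype=complex)
psi[2] = psi[1] = 1.0 / np.sqrt(2.0)             # index = n_cavity * 2 + n_mech
rho_bell = np.outer(psi, psi.conj())
E_cavity = log_negativity(rho_bell, (2, 2), subsystem=0)
E_mech = log_negativity(rho_bell, (2, 2), subsystem=1)

# Product state |10>, which is separable.
phi = np.zeros(4, dtype=complex)
phi[2] = 1.0
E_product = log_negativity(np.outer(phi, phi.conj()), (2, 2))
```

For the Bell-type state the partial transpose has trace norm 2, so both choices of subsystem give $\ln 2 \approx 0.693$, while the separable product state gives zero, in line with criteria (1) and (2) and with the equality $E^0_N = E^1_N$.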

To characterize the quantum-entanglement control performance, we use the following three quantities: $\langle E_{N}\rangle$, $\widetilde{E}_{N}$, and $\widetilde{R}$ in open quantum optomechanical systems with either linear or nonlinear interaction between the quantum cavity and oscillator modes. In particular, $\langle E_{N}\rangle$ is the logarithmic negativity averaged over ten successive episodes with a single environment, $\widetilde{E}_{N}$ is the corresponding average over one episode with $T$ time steps in a single quantum environment, and $\widetilde{R}$ denotes the ensemble-averaged value of the reward $R$ over a small number of parallel quantum environments for each episode. In our computations, all the control actions $G$, $\alpha_{L}$, or detuning $\Delta$, the nonlinear coupling $g_{0}$, and the dissipation coefficients $(\kappa,\gamma)$ are in units of $\omega_{m}$. The time unit is $\omega_{m}^{-1}$.

RL in linear quantum optomechanics

A quantum optomechanical system with linear photon-phonon interactions is governed by the beam-splitter Hamiltonian. In an optical experimental platform, a 50:50 beam splitter with the transformation angle $\pi/4$ can create an entangled Bell state between the two input optical modes [78, 79, 80]. Similarly, in a quantum optomechanical system, Bell states between photon and phonon modes can be realized by controlling the beam-splitter Hamiltonian. As a result, the maximum attainable value of the logarithmic negativity is $E_{N}\sim\log 2\sim 0.7$ (in the natural logarithmic base), corresponding to the maximally entangled Bell state, as shown in Fig. 1. This “best” entangled state can be realized by the model-free PPO agent, regardless of whether the observation is the expectation or the WCM photocurrent. To see this, we note that, in the beam-splitter model, the initial quantum state is set as a pure state [81, 82]: $|\psi\rangle=|10\rangle$, where the photon mode is in its first excited state and the phonon mode is in the vacuum state.
The partial observable of the quantum state for the PPO agent is set as the expectation of the photon number $\langle\hat{n}_{p}\rangle(t)=\langle\hat{P}_{1}\rangle(t)$ or the WCM photocurrent $\sqrt{\eta}\,\mathcal{I}(t)=\langle\sqrt{\eta}\hat{P}_{1}\rangle(t)+\frac{dW(t)}{\sqrt{4\eta}\;dt}$.

Table 1: Results of entanglement engineering from deep RL-based, Bayesian, and random control. The observations are the expectation of the photon number $\langle\hat{n}_{p}\rangle$ and the WCM photocurrent $\mathcal{I}(t)$ at the measurement rate $\eta=1$. The Bayesian hyperparameter is $\lambda_{opt}=10$ for the $\langle\hat{n}_{p}\rangle$ task and $\lambda_{opt}=2$ for the $\mathcal{I}(t)$ task. Displayed are the results of the average logarithmic negativity $\langle E_{N}\rangle/\log{2}$ with the standard deviation. For training and testing phases, $\langle E_{N}\rangle/\log{2}$ is averaged over ten end-training or testing episodes, each having $T=500$ time steps.
Each observation is obtained by averaging over $n_{\textnormal{traj}}=1$ trajectory for $\langle\hat{n}_{p}\rangle$ and $n_{\textnormal{traj}}=5$ trajectories for $\mathcal{I}(t)$, where $n_{\textnormal{traj}}$ denotes the number of independent trajectories from SME simulations.
| Controller | Condition | $\langle\hat{n}_{p}\rangle$, $n_{\textnormal{traj}}=1$ | $\mathcal{I}(t)$, $n_{\textnormal{traj}}=5$ |
|---|---|---|---|
| Deep RL (%) | Training | $83.81\pm 1.85$ | $\mathbf{64.81\pm 1.47}$ |
| | Testing | $84.95\pm 1.99$ | $\mathbf{65.01\pm 1.76}$ |
| Bayesian (%) | $\lambda=1$ | $56.89\pm 6.40$ | $35.48\pm 5.34$ |
| | $\lambda_{opt}$ | $\mathbf{93.21\pm 0.89}$ | $49.24\pm 0.44$ |
| Random (%) | | $38.15\pm 9.46$ | $33.46\pm 4.27$ |

Experimentally, directly measuring the entanglement, e.g., in terms of logarithmic negativity, for arbitrary entangled states is generally not viable. Identifying an experimentally feasible quantity to characterize the entanglement in arbitrary quantum systems remains challenging. We focus on the relationship between logarithmic negativity and the expected photon number, based on recent experiments on multiplexed photon number measurement [49, 83, 29, 68, 69, 84, 85]. To proceed, we note that the dynamics under the beam-splitter Hamiltonian are confined to a four-level basis, for the following reasons: (1) only one energy level in the cavity mode of the initial state has been excited from the vacuum state, i.e., $|\psi\rangle=|10\rangle$, (2) the linear interaction serves only to transfer the quantum states between the cavity and mechanical modes (i.e., it creates no additional excitation), and (3) the system couples to the vacuum bath only at absolute zero temperature (i.e., without any thermal excitation), thereby blocking any interactions between higher-level quantum states. In this case, the maximum logarithmic negativity $E_{N}\sim\log 2\sim 0.7$ implies that the attained quantum state is the following Bell state

$$|\Phi^{\varphi}\rangle=\frac{1}{\sqrt{2}}\left[|10\rangle+e^{i\varphi}|01\rangle\right],$$

with the associated expected photon number $\langle\hat{n}^{\textnormal{target}}_{p}\rangle=\langle\hat{P}_{1}\rangle=0.5$. Consequently, the reward function can be set as $R_{t}\equiv-|O_{t}-0.5|$, regardless of whether the observation $O_{t}$ is $\langle\hat{n}_{p}\rangle(t)$ or $\mathcal{I}(t)$. Because the target value of the expected photon number, $\langle\hat{P}_{1}\rangle=0.5$, is relatively small, the measurement noise is significant on this scale, and the variance of the WCM photocurrent $\mathcal{I}(t)$ is reduced by a Gaussian filter [86] with the weak measurement rate $\eta\leq 1$. The Gaussian kernel parameters of the filter, such as the filter interval and the variance, can be chosen numerically to bring the standard deviation of the measurement photocurrent into a certain range, e.g., about ten times the mean value (details in Supplementary Note 6).
The PPO agent applies an updated stochastic policy to the quantum optomechanical environment to maximize the accumulated reward, where the action $G(t)$ is proportional to the amplitude of the cavity mode: $G(t)=g_{0}\bar{\alpha}_{c}(t)$. The action can be controlled by an incident laser [63] and is continuous in a certain range, e.g., $G\in[-5,5]\,\omega_{m}$. The decay rates of the cavity and mechanical modes are set to $\kappa=0.01\,\omega_{m}$ and $\gamma=0.01\,\kappa$, respectively, because the quality of the mechanical oscillator mode is generally better than that of the optical cavity or microwave resonator mode [61, 59, 58, 65, 66, 67].
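The link between the Bell state, the target photon number $0.5$, and $E_{N}=\log 2$ can be illustrated with a short, idealized calculation. The sketch below assumes a lossless beam-splitter Hamiltonian of the standard form $H=G(\hat{a}^{\dagger}\hat{b}+\hat{a}\hat{b}^{\dagger})$ with constant $G$ and a two-level truncation of each mode. The explicit Hamiltonian form, the absence of dissipation and measurement back-action, and the names `beam_splitter_unitary`, `n_p`, and `reward` are assumptions made for illustration only, not the SME simulation used in this work.

```python
import numpy as np

# Two-level truncation of each mode (enough for the single-excitation sector).
a1 = np.array([[0, 1], [0, 0]], dtype=complex)   # annihilation: |1> -> |0>
I2 = np.eye(2, dtype=complex)
a = np.kron(a1, I2)    # cavity (photon) mode
b = np.kron(I2, a1)    # mechanical (phonon) mode

def beam_splitter_unitary(G, t):
    """exp(-i H t) for the assumed H = G (a^dag b + a b^dag)."""
    H = G * (a.conj().T @ b + a @ b.conj().T)
    w, v = np.linalg.eigh(H)
    return v @ np.diag(np.exp(-1j * w * t)) @ v.conj().T

G = 1.0                                            # coupling in units of omega_m
psi0 = np.kron([0, 1], [1, 0]).astype(complex)     # initial state |10>
psi = beam_splitter_unitary(G, np.pi / (4 * G)) @ psi0   # rotation angle pi/4

# Expected photon number and the reward R = -|O - 0.5| for this state.
n_p = np.real(psi.conj() @ (a.conj().T @ a) @ psi)
reward = -abs(n_p - 0.5)

# Logarithmic negativity (natural base) via the partial transpose on mode 0.
rho = np.outer(psi, psi.conj()).reshape(2, 2, 2, 2)
rho_pt = rho.transpose(2, 1, 0, 3).reshape(4, 4)
E_N = np.log(np.abs(np.linalg.eigvalsh(rho_pt)).sum())
```

Under these assumptions the state after a rotation by $\pi/4$ is $[|10\rangle-i|01\rangle]/\sqrt{2}$, i.e., the Bell state above with $\varphi=-\pi/2$, for which the photon number is $0.5$, the reward vanishes, and $E_{N}=\log 2$.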

Figure 2: Performance in terms of $\langle E_{N}\rangle/\log 2$ over a long time interval, compared for deep RL-based, Bayesian, and random control methods with respect to two observable options: the expected value $\langle\hat{n}_{p}\rangle$ and the WCM photocurrent $\mathcal{I}(t)$. The deep RL controller is trained with $T=500$ time steps. For all three control methods, displayed are results from the testing phase for the following set of time steps: $T=[500,1000,1500,2000,2500,3000,3500,4000]$ at the measurement rate $\eta=1$. The conventions, which apply to this and all subsequent figures, are as follows. If the vertical axis is labeled as $\langle E_{N}\rangle/E_{0}$, it represents the normalized logarithmic negativity, with $E_{0}=\log 2\sim 0.7$ (in the natural logarithmic base) as the target entanglement value. Otherwise, when the vertical axis is labeled as $\langle E_{N}\rangle$, $\widetilde{E}_{N}$, or $E_{N}(t)$, it represents the original value of the logarithmic negativity.
Figure 3: Performance of deep-RL agent in the online training and testing phase. The characterizing quantities are the logarithmic negativity $E_{N}$ and the reward function $R$ with the measurement rate $\eta=1$. (a,c) Performance measures in the online training phase, where the mean $\widetilde{E}_{N}$ is over one episode with $T=500$ time steps on the fifth quantum environment (only one environment) and the mean reward $\widetilde{R}$ is obtained from $\mathbb{N}=5$ parallel quantum environments. (b,d) Performance measures during the testing phase, where the logarithmic negativity $E_{N}(t)$ and WCM photocurrent $\mathcal{I}(t)$ are obtained with $T=4000$ time steps. The solid traces represent the moving-window average over $100$ episodes for (a,c) and $100$ time steps for (b,d).

Our deep RL, a model-free learning method, is implemented in the measurement-based feedback control framework for entanglement engineering in open quantum optomechanics. Details about the PPO algorithm applied in the linear quantum optomechanics are presented in Supplementary Note 4. To appreciate its performance, we employ two benchmark methods for comparison: Bayesian [54, 47] and random control. Bayesian control [54, 47] is a state-based feedback control of the stochastic process as governed by the SME. In our case, the control law is given by $G(t)=-\lambda|\langle\hat{n}_{p}\rangle(t)-0.5|\,\omega_{m}$ with $\langle\hat{n}_{p}\rangle(t)$ being the observation, where the hyperparameter $\lambda$ can be numerically optimized based on the performance. If the observation is $\mathcal{I}(t)$, the control law takes the form $G(t)=-\lambda|\mathcal{I}(t)-0.5|\,\omega_{m}$, in which the Wiener process degrades the performance to some degree. In Bayesian control, the smaller the variance in the measured photocurrent, the better the performance. For the random control method, the action is drawn from a uniform distribution over the action range $G\in[-5,5]\,\omega_{m}$.
Note that the actions $G$ of random control and deep RL are in the same range, while that of Bayesian control is determined by the hyperparameter $\lambda$ and the state-based observation value or the WCM photocurrent. To make a fair comparison, $\lambda_{opt}$ is optimized within the action range $G\in[-5,5]\,\omega_{m}$. Specifically, the optimized hyperparameter $\lambda_{opt}$ corresponds to the best performance of Bayesian control in the set $\lambda\in\{1,2,\ldots,\lambda_{max}\}$, where $\lambda_{max}$ is the largest integer value of $\lambda$ that keeps the action within the range $G\in[-5,5]\,\omega_{m}$.
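The two benchmark control laws and the bound $\lambda_{max}$ can be written down compactly. The following is a minimal sketch under stated assumptions: the function names are hypothetical, actions are in units of $\omega_{m}$, and the range of the observation passed to `lambda_max` must be supplied by the user (for the noisy photocurrent this range is not given in the text and would have to be estimated from the filtered signal).

```python
import numpy as np

G_MAX = 5.0     # action bound |G| <= 5 (units of omega_m), as stated in the text
TARGET = 0.5    # target expected photon number

def bayesian_action(obs, lam):
    """Benchmark law G = -lambda * |obs - 0.5| (units of omega_m)."""
    return -lam * abs(obs - TARGET)

def random_action(rng):
    """Benchmark law: G drawn uniformly from [-G_MAX, G_MAX]."""
    return rng.uniform(-G_MAX, G_MAX)

def lambda_max(obs_low, obs_high):
    """Largest integer lambda keeping |G| <= G_MAX for obs in [obs_low, obs_high]."""
    worst = max(abs(obs_low - TARGET), abs(obs_high - TARGET))
    return int(np.floor(G_MAX / worst))

rng = np.random.default_rng(seed=0)
# For an expected photon number confined to [0, 1], the worst-case deviation is 0.5.
lam_max_expect = lambda_max(0.0, 1.0)
g_bayes = bayesian_action(1.0, lam_max_expect)
g_rand = [random_action(rng) for _ in range(1000)]
```

With the observation confined to $[0,1]$, this sketch gives $\lambda_{max}=10$. Note also that the Bayesian law as written only produces non-positive actions, $G\in[-5,0]\,\omega_{m}$, whereas random control and deep RL use the full symmetric range.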

Table 1 displays the values of the averaged logarithmic negativity $\langle E_{N}\rangle/\log 2$ from the deep RL, Bayesian, and random control methods. From the SME simulations, when the observation is the expectation of the photon number, the Bayesian control with the optimized hyperparameter outperforms the deep RL method. However, when the observation is the WCM photocurrent, the deep RL control outperforms the Bayesian method. This is promising as the WCM photocurrent is directly experimentally accessible while the expected photon number is not. Regardless of the observation, random control is generally ineffective. For deep RL control, replacing the expected photon number by the WCM photocurrent as the observation reduces the performance by about 20 percentage points in $\langle E_{N}\rangle/\log 2$ (Table 1). For Bayesian control with the optimized hyperparameter, the reduction is about 40 percentage points. Moreover, Fig. 2 compares the long-time entanglement engineering for the three control methods. In particular, for deep RL control, the PPO agent is trained with $T=500$ time steps but tested with a longer time horizon, e.g., $T=4000$ steps, which includes a regime not explored by the PPO agent during training. It is worth noting from Fig. 2 that the performance of deep RL with the observation of WCM photocurrent is more stable and has a smaller variance compared to the case where the observation is the expected photon number, especially after $T=2000$. Overall, with the experimentally feasible observation of WCM, the deep RL controller stands out as the choice of entanglement control for quantum optomechanical systems.

Figure 4: Effects of decay and measurement rates on the control performance. Shown are the values of the average logarithmic negativity for (a) decay rates $\kappa=[0,0.01,0.03,0.05]\,\omega_{m}$ with $\eta=0.5$ and $\gamma=0.01\,\kappa$, and (b) measurement rates $\eta=[0.05,0.1,0.3,0.5,0.7,1]$ with $\kappa=0.01\,\omega_{m}$ and $\gamma=0.01\,\kappa$. The error bars represent the standard deviation of the data points. The average operation is over ten end-training or testing episodes. The training and testing time steps are the same: $T=500$.

We characterize the performance of our deep-RL-based control method in terms of the dissipation rate, measurement rate, and the randomness effect for the initial state. For the measurement rate $\eta=1$, the PPO agent is sequence-wise trained with the WCM photocurrent. Figures 3(a) and 3(c) show the average logarithmic negativity $\widetilde{E}_{N}$ and the mean reward $\widetilde{R}$, respectively, versus the episode during the training phase, in which $\widetilde{E}_{N}$ and $\widetilde{R}$ are averaged over one and five parallel quantum environments, respectively. Both quantities ultimately converge due to the properly designed reward function $R(t)=-|\mathcal{I}(t)-0.5|$. Note that the variance of $\widetilde{E}_{N}$ is suppressed as the episode number increases, implying the mixture-robust nature of entanglement in the quantum optomechanical system. The testing phase is longer ($T=4000$ time steps) than the training phase ($T=500$ time steps) and the corresponding performance measures are shown in Figs. 3(b) and 3(d). In addition to the variance in the learning of the deep RL agent with the stochastic policy, the Gaussian Wiener process in the WCM photocurrent and the stochastic collapse process stipulated by the SME also contribute to the variances of the performance measures.
However, the deep RL control still manages to maintain the solid traces of the testing $\mathcal{I}(t)$ around the target value $\langle\hat{n}^{\textnormal{target}}_{p}\rangle=0.5$ in Fig. 3(d), and the resulting entanglement quantity $E_{N}(t)$ is displayed in Fig. 3(b).
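The observation pipeline described above (noisy WCM photocurrent, Gaussian smoothing, reward) can be sketched as follows. This is an illustration under assumptions rather than the implementation of Supplementary Note 6: the photocurrent is taken as $\mathcal{I}(t)=\langle\hat{P}_{1}\rangle(t)+dW(t)/(2\eta\,dt)$, obtained by dividing the photocurrent expression given earlier by $\sqrt{\eta}$; the step size `dt`, the kernel width `sigma`, the window `half_width`, and all function names are hypothetical; and the symmetric kernel used here is non-causal, so a real-time controller would need a one-sided window instead.

```python
import numpy as np

def wcm_photocurrent(p1_expect, eta, dt, rng):
    """I(t) = <P1>(t) + dW / (2 * eta * dt), with Wiener increments dW ~ N(0, dt)."""
    p1_expect = np.asarray(p1_expect, dtype=float)
    dW = rng.normal(0.0, np.sqrt(dt), size=p1_expect.shape)
    return p1_expect + dW / (2.0 * eta * dt)

def gaussian_smooth(signal, sigma, half_width):
    """Weighted moving average with a normalized Gaussian kernel (in samples)."""
    k = np.arange(-half_width, half_width + 1)
    kernel = np.exp(-0.5 * (k / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(signal, half_width, mode="edge")   # keep the output length
    return np.convolve(padded, kernel, mode="valid")

def reward(obs, target=0.5):
    """R = -|O - target|, applied to either observation type."""
    return -np.abs(obs - target)

rng = np.random.default_rng(seed=0)
p1 = np.full(2000, 0.5)                 # idealized signal sitting at the target
raw = wcm_photocurrent(p1, eta=1.0, dt=0.01, rng=rng)
smooth = gaussian_smooth(raw, sigma=20.0, half_width=60)
r_raw, r_smooth = reward(raw), reward(smooth)
```

Even when the underlying expectation sits exactly at the target, the raw photocurrent yields a strongly negative reward on average because of the Wiener noise; smoothing brings the reward closer to zero, which is one way to see why the filter matters for learning from $\mathcal{I}(t)$.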

Since the quantum optomechanical system is coupled to the vacuum bath, the coupling strength or disturbance between the classical and quantum environments will affect the control performance, as exemplified in Fig. 4(a). Previous experiments [61, 64, 58, 59] demonstrated that the quality of the mechanical oscillator is generally better than that of the optical cavity or microwave resonator, i.e., $\gamma<\kappa$, so we set the decay rate of the oscillator to be two orders of magnitude smaller than that of the cavity [56]: $\gamma=0.01\,\kappa$. Figure 4(a) shows, for both the expectation and the measurement photocurrent observations, the performances of the training and testing processes, which are consistent with each other in the sense that their mean values decrease and the variances increase with the decay rate. The origin of the performance fluctuations is the classical dissipation to the vacuum bath, rendering the system less controllable by the laser.

The uncertainty in the classical information extracted from the quantum system depends on the discrete-time step size $dt$ and the measurement rate $\eta$, which directly determines the degree of the quantum-state stochastic collapse and quantum decoherence from the WCM term in the SME. If the expectation of the photon number is the observation, the larger the measurement rate (proportional to the measurement strength), the poorer the performance of deep-RL control, as characterized by a decrease in the mean values and an increase in the uncertainties of $E_{N}$ shown in Fig. 4(b), both of which originate from the intrinsic random process in the SME induced by the measurement process. However, if the observation is the WCM photocurrent, a weaker measurement rate will introduce larger variances in the observation signal and reduce the stochasticity of the process due to the incomplete/partial extracted information as described by the SME. In our case, the target mean value, $\langle\hat{n}^{\textnormal{target}}_{p}\rangle=0.5$, is on the order of $10^{-1}$, which makes it necessary to introduce a Gaussian filter to reduce the uncertainty. The resulting performance of deep-RL control is approximately the same for $\eta\in[0.05,1]$, as shown in Fig. 4(b).
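The dependence of the photocurrent noise on $dt$ and $\eta$ can be made explicit with a short worked estimate. It assumes only the photocurrent expression given earlier and standard Wiener increments with $\mathbb{E}[dW]=0$ and $\mathbb{E}[dW^{2}]=dt$, and it treats $\langle\hat{P}_{1}\rangle(t)$ as fixed at a given time step; it is not a result quoted from the simulations.

```latex
\begin{align}
\mathcal{I}(t) &= \langle\hat{P}_{1}\rangle(t) + \frac{dW(t)}{2\eta\,dt},
  && \text{(divide the photocurrent relation by } \sqrt{\eta}\text{)} \\
\mathrm{Var}\!\left[\mathcal{I}(t)\right]
  &= \frac{\mathbb{E}\!\left[dW^{2}\right]}{4\eta^{2}\,dt^{2}}
   = \frac{1}{4\eta^{2}\,dt}, \\
\sigma_{\mathcal{I}} &= \frac{1}{2\eta\sqrt{dt}} .
\end{align}
```

The unfiltered noise level therefore grows as either $\eta$ or $dt$ decreases, which is consistent with the statement that a weaker measurement rate introduces larger variances in the observation signal and with the need for filtering when the target mean is only $0.5$.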

Figure 5: Robustness of deep-RL method trained with pure or mixed states. (a) In the training and testing phase, performance of $\langle E_{N}\rangle/E_{0}$ for different initial mixed states (solid traces): $\rho=(1-p)|10\rangle\langle 10|+p|01\rangle\langle 01|$ with $p=[0,0.1,0.2,0.3,0.4,0.5]$. The dashed traces indicate the performances trained with random initial mixed states with the random variable $p\in[0,0.5]$. (b) Testing performance of two kinds of trained agents with $p\in[0,1]$: one trained with the pure initial state $|\psi\rangle=|10\rangle$ and another with random initial mixed states, which are distinguished by the color depth of the curve and the error bars. The blue and red curves denote the performances with the observation $\langle\hat{n}_{p}\rangle$ and $\mathcal{I}(t)$, respectively, with error bars. The measurement rate is $\eta=0.5$, and the training and testing time steps are $T=500$.
Figure 6: Generating target for deep-RL based creation and stabilization of entanglement in a nonlinear open quantum optomechanical system. (a,b) Trained quantities $\widetilde{R}$ and $\widetilde{E}_{N}$ converge to a certain value as the episode number increases, as illustrated by the light-color curves, where the dark blue and orange traces represent the data averaged over the 100 preceding consecutive episodes. (c,d) Time series of $E_{N}(t)$ and the driving laser signals $\Delta,\alpha_{L}$ at a certain episode selected from the training converged regime in (a,b). (e,f) The corresponding photon and phonon statistics on the Fock basis at the final time point of the selected training episode in (c,d). (g) The time evolution of the corresponding expected quantities, including the expected numbers $\langle\hat{n}_{p}\rangle$ and $\langle\hat{n}_{m}\rangle$ in the Fock basis, where the time series of $\langle\hat{n}_{p}\rangle(t)$ serves as the target to construct the reward function in the next phase, i.e., the experimental version shown in Fig. 7.

Experimentally, mixed quantum states are more readily realized than pure states due to the quantum decoherence with the classical environment, e.g., the vacuum bath. To address this issue, and referring to the previous work [87], we assume that the initial state is a mixed state in the form of $\rho=(1-p)|10\rangle\langle 10|+p|01\rangle\langle 01|$, where the parameter $p$ is fixed or a random variable $p\in[0,1]$ because of the coupling to the classical environment. The beam-splitter Hamiltonian stipulates that the photon and phonon modes are symmetric to each other, allowing $p$ to be rescaled to the interval $p\in[0,0.5]$. Figure 5(a) shows the performance with respect to the initial mixed quantum state with the same parameter $p$ for each training and testing episode (solid traces), where the completely mixed case with $p=0.5$ leads to the worst performance but still possesses entanglement to a significant extent. The reason lies in the inherent property of the beam-splitter Hamiltonian, which can create the maximally entangled states, $[|10\rangle+e^{i\varphi}|01\rangle]/\sqrt{2}$, from each component of the initial quantum state, such as $|10\rangle$ or $|01\rangle$, through the linear interactions, regardless of whether it acts on a pure or a mixed state. In Fig. 5(a), the dashed traces display the performance during the training phase with a random initial mixed quantum state, which is generated by the random variable $p$ with the uniform distribution in the range of $p\in[0,0.5]$.
The error bars characterize the uncertainty over ten end-training episodes.
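The construction of the initial mixed state and the photon-phonon symmetry that justifies restricting $p$ to $[0,0.5]$ can be checked directly. The sketch below uses a two-level truncation of each mode; the names `mixed_initial_state`, `sample_initial_state`, and `SWAP` are illustrative assumptions, and the check concerns only the initial state, not the controlled dynamics.

```python
import numpy as np

ket10 = np.kron([0, 1], [1, 0]).astype(complex)   # photon |1>, phonon |0>
ket01 = np.kron([1, 0], [0, 1]).astype(complex)   # photon |0>, phonon |1>

def mixed_initial_state(p):
    """rho = (1 - p) |10><10| + p |01><01|."""
    return ((1 - p) * np.outer(ket10, ket10.conj())
            + p * np.outer(ket01, ket01.conj()))

def sample_initial_state(rng):
    """Random initial mixed state with p uniform in [0, 0.5]."""
    p = rng.uniform(0.0, 0.5)
    return p, mixed_initial_state(p)

# SWAP exchanges the roles of the two modes: |ij> -> |ji>.
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)

p = 0.2
rho_p = mixed_initial_state(p)
rho_swapped = SWAP @ rho_p @ SWAP.conj().T        # equals rho(1 - p)
purity = np.real(np.trace(rho_p @ rho_p))         # (1 - p)^2 + p^2

rng = np.random.default_rng(seed=1)
p_rand, rho_rand = sample_initial_state(rng)
```

Exchanging the two modes maps $\rho(p)$ to $\rho(1-p)$, so for a Hamiltonian that treats the modes symmetrically the cases $p$ and $1-p$ are equivalent, and sampling $p$ uniformly from $[0,0.5]$ covers the whole family.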

Figure 5(b) shows the testing performance of two kinds of trained models, one trained by the initial state $|\psi\rangle=|10\rangle$ and another by the random initial mixed state $\rho=(1-p)|10\rangle\langle 10|+p|01\rangle\langle 01|$ (distinguished by dark and light colors, respectively). Note that the beam-splitter Hamiltonian transforms the initial state $|10\rangle$ or $|01\rangle$ to a Bell state with the corresponding expected photon number $\langle\hat{n}^{\textnormal{target}}_{p}\rangle=0.5$, where the dissipation to the vacuum bath is much weaker than the beam-splitter interaction. However, if the initial state is the mixed state, the $|10\rangle\langle 10|$ and $|01\rangle\langle 01|$ components will become independently entangled, resulting in the total quantum state being a mixture of two entangled Bell states. As a result, a nontrivial entanglement value is expected for the initial mixed state governed by the beam-splitter Hamiltonian. The mixing probability $p=0.5$ results in an equal mixture of the Bell states, as shown in Fig. 5. In the testing phase, the two trained models use the same initial state for a fixed value of $p$. The two models have a comparable performance, suggesting that the deep RL method is robust to the initial randomness in a mixed state. More specifically, during the testing phase, the observation is the expected photon number or WCM photocurrent.
The worst performance occurs for $p=0.5$, and for other values of $p$ the performance is symmetric about $p=0.5$ due to the symmetric role of the photon and phonon modes in the beam-splitter Hamiltonian. Note that the model trained with the observation being the measurement photocurrent displays a small difference in the performance measure ($\langle E_{N}\rangle/E_{0}$ over the whole probability interval $p\in[0,1]$) between the best and worst cases, with smaller uncertainties than the case where the observation is the expected photon number. Taken together, our deep-RL model trained by the weak measurement photocurrent has a lower mean performance but is more robust against mixed quantum states compared with the scenario based on observing the expected value of the photon number, due to the strong capability of RL in learning randomness and executing accurate high-dimensional data-fitting.

RL in nonlinear quantum optomechanics

Figure 7: Entanglement engineering by the recurrent PPO agent. The target generated as described in Fig. 6 is exploited to create entanglement with $E_{N}\sim\log 2$ from only the partial observation of the expected photon number $\langle\hat{n}_{p}\rangle(t)$. The reward function is $R(t)=-|\langle\hat{n}_{p}\rangle(t)-\langle\hat{n}_{p}^{\textnormal{target}}\rangle(t)|$, where the target time series $\langle\hat{n}_{p}^{\textnormal{target}}\rangle(t)$ is from the target-generating process in Fig. 6(g). In this training configuration, although only partial information is extracted from the system, the performance measures in (a-g) behave similarly to those in Fig. 6. Other aspects of the setting and parameters are the same as in Fig. 6.

In an open quantum optomechanical system under the strong laser-driven approximation, the radiation pressure on the movable mechanical mirror generates a linear interaction between the optical and mechanical modes. When this approximation does not hold, the interaction between the two modes becomes nonlinear. Entanglement can still be created despite the nonlinear interaction, but control becomes more challenging. In particular, in the standard quantum optomechanical system, the nonlinear coupling term $\hbar g_{0}\hat{a}^{\dagger}\hat{a}(\hat{b}^{\dagger}+\hat{b})$ can be used to create entanglement, but highly excited quantum states can also be populated during the process, making it difficult to stabilize the entanglement within a finite Fock basis. Realistically, the quantum dynamics are governed by the SME because of the WCM, which induces nonlinear stochastic evolution. The problem then becomes that of creating and stabilizing the entanglement of non-Gaussian states decaying to the vacuum bath. Despite the difficulties, model-free deep RL can still provide a general approach through some optimal combination of the neural network structure, observable, reward function, and action.
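The statement that the nonlinear coupling term can populate higher Fock levels can be checked with a small numerical sketch. The code below builds the term $g_{0}\hat{a}^{\dagger}\hat{a}(\hat{b}^{\dagger}+\hat{b})$ in a truncated $10\times 10$ Fock basis and tests which number operators it commutes with. The helper names (`destroy`, `commutator`) are illustrative, and the units ($\hbar=1$, frequencies in units of $\omega_{m}$) are assumptions of this sketch rather than details taken from the simulations reported here.

```python
import numpy as np

def destroy(dim):
    """Annihilation operator truncated to `dim` Fock levels (illustrative helper)."""
    return np.diag(np.sqrt(np.arange(1, dim)), k=1)

def commutator(x, y):
    return x @ y - y @ x

dim = 10   # Fock cutoff per mode, giving the 10 x 10 basis
g0 = 0.2   # coupling in units of omega_m, with hbar = 1 (assumed units)

a = np.kron(destroy(dim), np.eye(dim))  # photon (cavity) mode
b = np.kron(np.eye(dim), destroy(dim))  # phonon (mechanical) mode
n_p = a.conj().T @ a                    # photon number operator
n_m = b.conj().T @ b                    # phonon number operator

# Nonlinear radiation-pressure coupling term as written in the text.
H_int = g0 * n_p @ (b.conj().T + b)

# The term commutes with the photon number but not with the phonon number,
# so it can move population up the phonon ladder.
photon_conserved = np.allclose(commutator(H_int, n_p), 0.0)
phonon_conserved = np.allclose(commutator(H_int, n_m), 0.0)
print(photon_conserved, phonon_conserved)
```

Because the term alone does not conserve the phonon number, a finite cutoff can be reached during the dynamics, which is one reason why limiting the excitation through the reward is useful.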

We consider the nonlinear optomechanical system and exploit deep RL with the control goal of achieving entanglement near $E_{N}\sim\log 2$. This nonlinear entangled state has an entanglement value similar to that of the maximally entangled Bell state in the corresponding linear system. For entanglement engineering of a nonlinear optomechanical system, a key issue is selecting an effective and experimentally feasible observation quantity. Utilizing a general actor and critic neural network, the deep RL agent can learn the relationship between entanglement and the experimental observables of the optomechanical system in a model-free manner. To achieve control, we articulate a training process consisting of two phases: the target-generating phase and the target-utilization phase, facilitated by deep RL.

The first training step is the target-generating phase, in which numerical SME simulations are used to generate the observation and reward data. The PPO agent interacts with the quantum environment, observes the logarithmic negativity $E_{N}(t)$, and receives a reward that combines the negativity with the expected numbers of photons and phonons: $R(t)=-|E_{N}(t)-\log 2|-|\langle\hat{n}_{p}\rangle(t)+\langle\hat{n}_{m}\rangle(t)-a|/b$, with numerically optimized hyperparameters $a=1$ and $b=30$. (Note that direct experimental measurement of the logarithmic negativity is currently not available.) Figure 6 shows the control results, where the excitation of quantum states is limited by the total number $\langle\hat{n}_{p}\rangle+\langle\hat{n}_{m}\rangle$. The target time series of the expected photon number is obtained as $\langle\hat{n}_{p}^{\textnormal{target}}\rangle(t)$. 
The second step is the target-utilization phase, during which the reward function is $R(t)=-|\langle\hat{n}_{p}\rangle(t)-\langle\hat{n}_{p}^{\textnormal{target}}\rangle(t)|$.
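The two reward functions can be written as a minimal sketch. The function names are hypothetical, and the natural logarithm is assumed for $\log 2$; the hyperparameters $a=1$ and $b=30$ are those quoted above.

```python
import math

LOG2 = math.log(2.0)  # target logarithmic negativity (natural log is an assumption)

def reward_target_generating(e_n, n_p, n_m, a=1.0, b=30.0):
    """Phase 1: penalize deviation of E_N from log 2 and, more weakly,
    deviation of the total excitation <n_p> + <n_m> from a."""
    return -abs(e_n - LOG2) - abs(n_p + n_m - a) / b

def reward_target_utilization(n_p, n_p_target):
    """Phase 2: penalize deviation of <n_p>(t) from the target time series."""
    return -abs(n_p - n_p_target)

# The reward is zero at the target and negative otherwise.
print(reward_target_generating(LOG2, 0.5, 0.5))
print(reward_target_generating(LOG2, 2.0, 2.0))
print(reward_target_utilization(0.7, 0.5))
```

With $b=30$, the excitation penalty is a soft constraint: an excess of three quanta costs about as much as a deviation of 0.1 in $E_{N}$.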

Because the target is time-dependent, the recurrent neural network added after the MLPs in the PPO agent displays a strong and stable learning ability and outperforms the case with only MLPs. The expected photon number $\langle\hat{n}_{p}\rangle(t)$ is observed by the recurrent PPO agent as $\langle\hat{n}_{p}\rangle=\sum_{n}n\langle\hat{P}_{n}\rangle$, which is experimentally more feasible than the quantity $E_{N}$. While the recurrent neural network has considerable advantages, such as long-term memory [57], it still encounters the challenge of engineering optimization [88] in order to achieve a correct and efficient implementation. In our case, the main challenge is the time cost of optimizing the parameters in the search for a global optimum: the SME contains ten stochastic collapse operators, $\hat{P}_{n}=|n\rangle\langle n|$ with the respective Fock numbers $n=0,1,\ldots,9$, with the measurement rate $\eta=0.1$, requiring a long simulation time. 
Our solution is to consider only $\mathbb{N}=1$ quantum optomechanical environment, in which the agent collects data and updates the policy every $\mathbb{Z}=15$ and $\mathbb{Z}=5$ episodes in the two phases (target-generating and target-utilization), respectively, with the time horizon $T=500$. Note that using ten stochastic projectors $\hat{P}_{n}$ can result in a large variance in the WCM photocurrent:

$$\sqrt{\eta}\,\mathcal{I}(t)=\sum_{n}n\left[\langle\sqrt{\eta}\hat{P}_{n}\rangle+\frac{dW_{n}(t)}{\sqrt{4\eta}\,dt}\right],$$

where ten independent Wiener processes $dW_{n}(t)$ are used. In this case, observation of the measured random photocurrent is infeasible. Even if the deep RL agent is trained in two phases with the expected photon number, it can fail during the training process because of the numerical cutoff in the Hilbert space dimension and the strong randomness introduced by the SME. In the nonlinear quantum optomechanical system, the interaction strength is $g_{0}=0.2\,\omega_{m}$. The PPO agent creates entanglement characterized by $E_{N}\sim\log 2$ versus time, calculated through the SME with dissipation to the vacuum bath for $\kappa=0.1\,\omega_{m}$ and $\gamma=0.01\,\kappa$. The system is initialized in the vacuum state $|\psi\rangle=|00\rangle$, a pure state, with $10\times 10$ Fock bases. The time-dependent control signals are the detuning $\Delta$ and the amplitude $\alpha_{L}$ of the driving laser within the fixed range $\Delta,\alpha_{L}\in[-5,5]\,\omega_{m}$.
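To see why ten projectors inflate the photocurrent noise, the photocurrent expression above can be sampled directly. In this sketch each $dW_{n}$ is drawn as an independent Gaussian increment of variance $dt$, so the noise variance is $\sum_{n}n^{2}/(4\eta\,dt)$. The time step `dt = 0.01` and the function names are assumptions made for illustration; they are not values taken from the simulations.

```python
import numpy as np

def wcm_photocurrent(pop, eta, dt, rng):
    """One sample of sqrt(eta) * I(t) for Fock populations pop[n] = <P_n>."""
    pop = np.asarray(pop, dtype=float)
    n = np.arange(len(pop))
    dW = rng.normal(0.0, np.sqrt(dt), size=len(pop))  # independent Wiener increments
    return np.sum(n * (np.sqrt(eta) * pop + dW / (np.sqrt(4.0 * eta) * dt)))

def photocurrent_variance(n_levels, eta, dt):
    """Analytic noise variance: sum_n n^2 / (4 * eta * dt)."""
    n = np.arange(n_levels)
    return np.sum(n**2) / (4.0 * eta * dt)

eta, dt = 0.1, 0.01  # dt is a hypothetical step size
ratio = photocurrent_variance(10, eta, dt) / photocurrent_variance(2, eta, dt)
print(ratio)  # noise variance with projectors n = 0..9 relative to n = 0, 1
```

Under these assumptions the variance with projectors $n=0,\ldots,9$ is a factor $\sum_{n=0}^{9}n^{2}=285$ larger than with $n=0,1$, consistent with the photocurrent being a much noisier observation in the nonlinear case.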

Representative results are as follows. In the target-generating phase, despite the disturbance of the stochastic process from WCM, the training curves for both the reward $\widetilde{R}$ and the logarithmic negativity $\widetilde{E}_{N}$ converge with the episode number, as shown in Figs. 6(a,b). This indicates that entanglement has been created and stabilized by the well-trained PPO agent, as shown in Fig. 6(c), with the laser control signal displayed in Fig. 6(d). At the end of the time period, the photon and phonon statistics with respect to the Fock basis are shown in Figs. 6(e,f), where the reduced photon state exhibits an oscillating tail that resembles a displaced squeezed state and the reduced phonon state resembles a thermal state. Figure 6(g) shows the corresponding target pattern $\langle\hat{n}^{\textnormal{target}}_{p}\rangle(t)$. In the target-utilization phase, the recurrent PPO agent is able to steadily learn to create and stabilize the entanglement, as shown in Fig. 7, where only partial information is extracted from the quantum optomechanical environment. In particular, various entangled states have been created, such as a reduced photon state whose photon statistics exhibit an oscillating head in the Fock basis, entangled with a thermal-like reduced phonon state, as exemplified in Figs. 7(e,f). Because of the nonlinear and stochastic process in the SME, the entangled states created and controlled are not steady states, rendering Bayesian control infeasible. 
We thus employ random control as a benchmark, where actions are drawn from a uniform distribution in the range $\Delta,\alpha_{L}\in[-5,5]\,\omega_{m}$ and the tested values of the measurement rate are $\eta=[0.05,0.1,0.3,0.5,0.7,1]$. Figure 8 shows that, as the measurement rate increases, the random control is unable to harness the entanglement while our well-trained recurrent PPO agent can maintain the entanglement percentage at $50\%$ or higher.

Figure 8: Target-utilization phase of entanglement engineering of a nonlinear optomechanical system. Shown are the results of online training and testing of the entanglement measure $\langle E_{N}\rangle/E_{0}$ for measurement rates $\eta=[0.05,0.1,0.3,0.5,0.7,1]$, in comparison with the benchmark performance of random control. The error bars are the corresponding standard deviations. Other parameters are the same as those in Fig. 6.

Physical understanding of entanglement engineering through model-free deep RL

In an experiment, it is usually difficult to directly obtain information about the entanglement. For entanglement engineering of a quantum optomechanical system, one scenario is that the RL agent observes the photon number to steer the laser to create and stabilize entanglement, as illustrated in Fig. 9. Here we provide a physical interpretation of RL control for entanglement engineering in both the linear and the nonlinear interaction regimes. The key physical relationships involved are that between the entanglement and photon number, and that between the photon number and laser driving. We also describe the capability of the RL agent to train the laser driving to modulate the two-mode interaction to reduce quantum decoherence resulting from WCM and the quantum dissipation to the vacuum bath.

Linear interaction regime

For the linear quantum optomechanical system, the maximum entanglement corresponds to a Bell state, of which the expected photon number is $\langle\hat{n}_{p}\rangle=0.5$. Because the beam-splitter Hamiltonian is intrinsically capable of generating Bell states [78, 79, 80], a reasonable assumption is that, when the expected photon number reaches the value of 0.5, the maximum entanglement is achieved in a linearly interacting quantum optomechanical system. This assumption provides the basis for constructing the reward function $R(t)=-|\langle\hat{n}_{p}\rangle(t)-0.5|$, where a deviation of the expected photon number from 0.5 results in a decreased reward and therefore implies reduced entanglement. As illustrated in Fig. 9, the RL agent is designed to maximize the accumulated reward value, which is equivalent to stabilizing the expected photon number about the value of 0.5 for as long as possible. The testing results shown in Fig. 3 indicate that the maximum entanglement can indeed be created and stabilized by the RL control.
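The link between the Bell state, the photon number 0.5, and the entanglement value $\log 2$ can be verified numerically. The sketch below computes the logarithmic negativity from the partial transpose for the states $[|10\rangle+e^{i\varphi}|01\rangle]/\sqrt{2}$. The natural logarithm and the basis ordering $|\textrm{photon},\textrm{phonon}\rangle$ are assumptions of this sketch, and the function names are illustrative.

```python
import numpy as np

def log_negativity(rho, dims):
    """E_N = ln || rho^{T_B} ||_1 for a bipartite state with dimensions dims = (dA, dB)."""
    dA, dB = dims
    r = rho.reshape(dA, dB, dA, dB)
    rho_pt = r.transpose(0, 3, 2, 1).reshape(dA * dB, dA * dB)  # partial transpose on B
    return np.log(np.sum(np.abs(np.linalg.eigvalsh(rho_pt))))

def bell_state(phi):
    """(|10> + e^{i phi}|01>)/sqrt(2), with basis index = 2 * photon + phonon."""
    psi = np.zeros(4, dtype=complex)
    psi[2] = 1.0               # |10>
    psi[1] = np.exp(1j * phi)  # |01>
    return psi / np.sqrt(2.0)

def photon_number(rho):
    n_p = np.diag([0.0, 0.0, 1.0, 1.0])  # photon number in the two-level truncation
    return float(np.real(np.trace(n_p @ rho)))

psi = bell_state(0.7)
rho = np.outer(psi, psi.conj())
print(log_negativity(rho, (2, 2)), photon_number(rho))
```

For any phase $\varphi$ the state gives $E_{N}=\ln 2$ and $\langle\hat{n}_{p}\rangle=0.5$, whereas the product state $|10\rangle$ gives $E_{N}=0$. Note that $\langle\hat{n}_{p}\rangle=0.5$ alone does not guarantee entanglement in general; the reward relies on the dynamics being generated by the beam-splitter Hamiltonian.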

Figure 9: RL-based entanglement engineering of a quantum optomechanical system.

A central step in RL control is to modulate the laser input based on the measured photon number, which requires the relationship between the laser driving and the photon number. When the frequency of the laser is in the red-detuned regime, $\Delta=\omega_{L}-\omega_{c}=-\omega_{m}$, the quantum state switches between the two modes (the cavity optical mode and the mechanical oscillator mode), leading to a “swap” Hamiltonian. The coefficient $G$ is proportional to the cavity amplitude $\bar{\alpha}_{c}$, which is determined by the laser. In the linear interaction regime, RL control is achieved via two adjustments of the laser based on the measured photon number: (1) the laser frequency is tuned into the red-detuned regime and (2) the laser amplitude is perturbed to modulate the driving strength $G$ to control the switching between the two modes, which affects the expected photon number. Note that, during this process, there is no energy gain: there is energy loss due to the dissipation of the cavity and oscillator modes into the vacuum bath, with the dissipation rates related by $\gamma=0.01\,\kappa$. This relation means that the energy loss from the oscillator mode occurs more slowly than that from the cavity mode. In essence, the role of the laser is to transfer energy from the oscillator mode to the cavity mode to stabilize the photon number at a desired value. 
The underlying dissipation process is not beneficial to the entanglement, as it cannot be modulated by the “swap” term in the Hamiltonian, eliminating any possibility of entanglement enhancement in an optomechanical system in the linear interaction regime. It is worth noting that, in the nonlinear interaction regime, entanglement enhancement and dissipation reduction are possible, as will be described below.
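How modulating $G$ steers the photon number can be illustrated in the idealized, dissipation-free limit. The sketch assumes a swap Hamiltonian of the form $G(t)(\hat{a}^{\dagger}\hat{b}+\hat{a}\hat{b}^{\dagger})$ with $\hbar=1$, restricted to the single-excitation subspace $\{|10\rangle,|01\rangle\}$, where it acts as $G(t)\sigma_{x}$. Since this operator commutes with itself at different times, only the accumulated angle $\theta=\int G\,dt$ matters. The function name and the piecewise-constant control are assumptions for illustration, and WCM and the vacuum bath are ignored.

```python
import numpy as np

def photon_number_after(G_samples, dt):
    """<n_p> after applying piecewise-constant G(t), starting from |10>.

    Idealized sketch: no dissipation and no measurement back-action.
    """
    theta = np.sum(G_samples) * dt                  # accumulated swap angle
    sx = np.array([[0, 1], [1, 0]], dtype=complex)
    U = np.cos(theta) * np.eye(2) - 1j * np.sin(theta) * sx  # exp(-i * theta * sx)
    psi = U @ np.array([1, 0], dtype=complex)       # components (|10>, |01>)
    return float(abs(psi[0]) ** 2)

dt = 0.01
G_quarter = np.full(100, np.pi / 4.0)   # accumulated angle theta = pi / 4
print(photon_number_after(G_quarter, dt))
```

In this limit $\langle\hat{n}_{p}\rangle=\cos^{2}\theta$, so the target value 0.5 is reached at $\theta=\pi/4$. With dissipation, the agent must keep adjusting $G$ to hold the photon number near the target.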

Nonlinear interaction regime

When the interactions between the optical and mechanical modes are nonlinear, the relationship between entanglement and photon number can be sophisticated and is currently unknown. However, model-free deep RL can be used to find the relation numerically. To achieve this, we first assume that there exists a one-to-one correspondence between $E_{N}$ and $\langle\hat{n}_{p}\rangle$ in the time domain. The reward function is constructed according to the target entanglement $E_{N}=\log 2$ to train the RL agent to maximize the accumulated reward. In the testing phase, the time series of the expected photon number controlled by the well-trained PPO agent in Fig. 6(g) is regarded as the target time series of the expected photon number for the subsequent target-utilization phase. Note that the “best” photon number is no longer simply 0.5: it is now time-dependent. In the next training phase, the reset RL agent learns to control the system with the observation $\langle\hat{n}_{p}\rangle(t)$ based on the target expected photon number $\langle\hat{n}^{\textnormal{target}}_{p}\rangle(t)$. The performance of the new RL agent in the testing phase, as shown in Fig. 7, validates our initial assumption about the existence of the one-to-one correspondence between $E_{N}$ and $\langle\hat{n}_{p}\rangle$, even though it is time-dependent.

Figure 10: Physical insights in the nonlinear regime of cavity-mechanical interaction under the strong laser limit $|\bar{\alpha}_{c}|\gg 1$. When the strong laser is in the red-detuned regime with $\Delta=-\omega_{m}$, the laser controls the two-mode transferring process; in the blue-detuned regime with $\Delta=+\omega_{m}$, the laser controls the exponential growth of the energies of the two modes and creates the quantum correlation between them [61].

In the nonlinear interaction regime, the physical picture of how the laser leverages the radiation-pressure interaction to create and stabilize the photon number and even the entanglement is not straightforward. However, physical insights can be gained by examining the strong laser limit. When the amplitude of the laser is strong, $\bar{\alpha}_{c}\gg 1$, in the blue-detuned regime with $\Delta=+\omega_{m}$, the laser can modulate its frequency to create exponential growth of the energies of both the cavity and oscillator modes, accompanied by the generation of strong quantum correlation between the two modes. In the red-detuned regime with $\Delta=-\omega_{m}$, a switching process between the two modes occurs, which is the same as that in the linear interaction regime.
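The contrast between the two detuning regimes can be made concrete with the standard rotating-wave forms of the linearized interaction. These are an assumption of this sketch and are not written out above: a beam-splitter term $\hat{a}^{\dagger}\hat{b}+\hat{a}\hat{b}^{\dagger}$ for red detuning and a two-mode-squeezing term $\hat{a}^{\dagger}\hat{b}^{\dagger}+\hat{a}\hat{b}$ for blue detuning. The code checks whether each conserves the total excitation number in a truncated Fock space; the cutoff `dim = 6` is arbitrary.

```python
import numpy as np

def destroy(dim):
    """Annihilation operator truncated to `dim` Fock levels (illustrative helper)."""
    return np.diag(np.sqrt(np.arange(1, dim)), k=1)

dim = 6  # arbitrary small cutoff for illustration
a = np.kron(destroy(dim), np.eye(dim))  # cavity mode
b = np.kron(np.eye(dim), destroy(dim))  # mechanical mode
ad, bd = a.conj().T, b.conj().T

H_red = ad @ b + a @ bd     # beam-splitter ("swap") form, assumed for Delta = -omega_m
H_blue = ad @ bd + a @ b    # two-mode-squeezing form, assumed for Delta = +omega_m
n_tot = ad @ a + bd @ b     # total excitation number

red_conserves = np.allclose(H_red @ n_tot - n_tot @ H_red, 0.0)
blue_conserves = np.allclose(H_blue @ n_tot - n_tot @ H_blue, 0.0)

vac = np.zeros(dim * dim)
vac[0] = 1.0                # vacuum |00>
pair = H_blue @ vac         # blue-detuned term creates a photon-phonon pair |11>
print(red_conserves, blue_conserves, np.flatnonzero(pair))
```

The red-detuned form only exchanges quanta between the modes and leaves the vacuum unchanged, while the blue-detuned form creates photon-phonon pairs from the vacuum, which is the origin of both the energy growth and the correlations described above.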

The blue- and red-detuned regimes compete with each other in terms of both the photon number and entanglement. In particular, in the blue-detuned regime, photons are excited and the rate of excitation can be larger than that associated with quantum dissipation to the vacuum bath. Furthermore, quantum entanglement is enhanced, overcoming quantum decoherence from the classical environment and even from the SME. However, in the red-detuned regime, no photons are excited and there is only a two-mode energy-transferring process that does not completely suppress the process of quantum dissipation to the vacuum bath, resulting in photon loss and eventually reducing entanglement. Stabilizing the photon number and entanglement requires a balance between the operations in the blue- and red-detuned regimes. In general, the blue-detuned regime is prone to excessively high photon numbers with strong entanglement, which should be balanced by red-detuned operation to reduce the photon number and realize our target entanglement engineering, as shown schematically in Fig. 10. Overall, in the nonlinear interaction regime, laser driving of finite amplitude and frequency modulation can control the photon number and entanglement to a certain extent. An example is shown in Fig. 7(d), where the RL agent finds the optimal action flow with a finite laser amplitude. Note that the detuning $\Delta$ is modulated mainly in the range $\Delta\in[-2\omega_{m},2\omega_{m}]$, signifying a balance between the blue- and red-detuned operations.

Weak continuous measurement

In an open quantum system under WCM and quantum dissipation into the vacuum bath, a Wiener process appears in the observable. More specifically, the Wiener process arises from the Gaussian-weighted projection over the eigenstates, which weakly extracts partial information from the quantum system and induces stochastic disturbances in both the dynamical equation and the observation. Such weak measurements avoid a complete collapse of the quantum state and make it possible to extract quantum information continuously in the time domain. However, the nonlinear stochastic process occurs in both the quantum dynamical trajectories and the measurement photocurrent, making it challenging to control the quantum system continuously through WCM. (Background on WCM, deep RL, and quantum control is presented in Supplementary Note 1.)

For stochastic noise in the WCM photocurrent, current deep-RL techniques enable the RL agent to extract quantum information through a process resembling noise filtering. Specifically, when the observation entering the reward function is the WCM photocurrent, we can employ $n_{\textnormal{traj}}$ quantum ensembles to reduce the variance and use a Gaussian filter for data pre-processing. The RL agent is trained to maximize the accumulated reward, which serves to average the stochastic term in the measurement photocurrent over time. These noise-filtering processes help extract information about the expected photon number and thus the target quantum entanglement. For the nonlinear quantum stochastic process with quantum dissipation, the RL agent successfully trains the laser to leverage interactions between the optical and mechanical modes, linear or nonlinear, to mitigate quantum decoherence and dissipation to some extent, as exemplified in Figs. 3 and 7.
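The two pre-processing steps named above, ensemble averaging over $n_{\textnormal{traj}}$ trajectories followed by Gaussian filtering, can be sketched on synthetic data. The constant signal of 0.5, the noise level, the filter width `sigma`, and the value `n_traj = 50` are all hypothetical choices made for illustration, and the filter is a plain NumPy convolution rather than the routine used for the results reported here.

```python
import numpy as np

def gaussian_filter_1d(x, sigma):
    """Smooth a 1D signal with a normalized Gaussian kernel (reflected edges)."""
    radius = int(4 * sigma)
    k = np.exp(-0.5 * (np.arange(-radius, radius + 1) / sigma) ** 2)
    k /= k.sum()
    padded = np.pad(x, radius, mode="reflect")
    return np.convolve(padded, k, mode="valid")  # same length as x

rng = np.random.default_rng(0)
n_traj, n_steps = 50, 2000                 # hypothetical ensemble size and length
signal = 0.5 * np.ones(n_steps)            # hypothetical expected photon number
noise = rng.normal(0.0, 5.0, size=(n_traj, n_steps))  # synthetic photocurrent noise

single = signal + noise[0]                 # one noisy trajectory
ensemble = signal + noise.mean(axis=0)     # average over n_traj trajectories
filtered = gaussian_filter_1d(ensemble, sigma=10.0)

for name, y in [("single", single), ("ensemble", ensemble), ("filtered", filtered)]:
    print(name, np.std(y - signal))
```

Averaging reduces the noise standard deviation by roughly $\sqrt{n_{\textnormal{traj}}}$, and the Gaussian filter reduces it further at the cost of time resolution, which is the trade-off to weigh when choosing the filter width.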

Discussion

Exploiting machine learning for controlling quantum information systems is becoming a promising research realm and is attracting increasing attention. We have developed a model-free deep-RL method for entanglement engineering. We demonstrated its superiority over benchmark quantum control methods in quantum optomechanical systems under WCM. The model-free deep-RL agent sequentially interacts with one or multiple parallel quantum optomechanical environments, collects trajectories, and updates its policy to maximize the accumulated reward to create and stabilize the entanglement. Both linear and nonlinear interacting regimes between the photons in the optical cavity and the phonons associated with the mechanical oscillator in the cavity have been studied. In particular, for linear interactions, the PPO agent directly observes the WCM photocurrent and delivers better performance compared with the benchmark Bayesian and random control methods in the framework of measurement-based feedback control. The performance of deep RL control is tolerant to randomness when initially the system is in some mixed state. For nonlinear interactions, both the model-free PPO and recurrent PPO agents have been tested, where the first was utilized to generate the time series of the target of the expected photon number, and the second one was employed to control entanglement according to an objective. Because of the high degree of randomness in the SME originating from ten stochastic collapse operators, only the observation of the expected photon number is feasible in the nonlinear interaction regime.

More specifically, linear interactions can naturally limit the excitation in the energy levels, providing a mechanism to directly create entangled Bell states under the strong-laser approximation in the red-detuned regime. A disadvantage is that the performance is sensitive to the coupling to the vacuum or thermal bath, even when the decay rate is small (e.g., $\kappa=0.01\,\omega_{m}$ with $G\in[-5,5]\,\omega_{m}$). This phenomenon is in fact quite common in quantum systems. For instance, in systems with magnon-photon coupling [87], steady Bell states can ideally be generated in the PT-broken phase without dissipation, while the entanglement is reduced when the decay rate is not negligible. Another issue with linear interactions, as found in a system of coupled non-Hermitian qubits with energy loss [89], is that, although the time scale for generating entangled Bell states near the higher-order exceptional points tends to be much shorter than the inverse of the coupling strength, the maximum entanglement can last only for a short instant.

In contrast, nonlinear interactions can create and stabilize entanglement and are more robust to the disturbance from the vacuum bath even with a relatively large decay rate, e.g., $\kappa=0.1\,\omega_{m}$ with the strong coupling $g_{0}=0.2\,\omega_{m}$, so that $g_{0}/\kappa=2>1$, which ensures the nonlinear effect [61]. Potentially, systems with nonlinear coupling can thus outperform those with linear interactions. A caveat is that, in nonlinear optomechanical systems, experimentally accessible observables are limited. In fact, the relationship between experimental observables and entanglement in nonlinear quantum optomechanical systems is not well understood, making it challenging to choose a feasible observable to control entanglement. We have partially relied on numerical methods to create and stabilize entanglement, based on the numerical relation between entanglement and the expected photon number discovered by deep RL. Another difficulty is that the nonlinear interaction can readily excite the system to highly excited quantum states, which we have overcome by designing a proper reward function.

A previous work [90] studied the acceleration of entanglement generation through feedback weak measurement for two qubits in a four-dimensional Hilbert space, where coupling to a vacuum or thermal bath was not taken into account, nor the interactions between the two qubits, and the control protocol required prior knowledge about the system such as the decoherence-free subspace. In addition, complete observation was needed to design the local Hamiltonian feedback to speed up entanglement. This is in fact a model-based approach. In another study [91], steady-state entanglement between two qubits was achieved using a continuous feedback control method, where the feedback protocol design was informed by a detailed model of the system’s dynamics. In contrast, our work creates and stabilizes a two-mode entangled state about a predetermined level of entanglement for both linear and nonlinear interactions via model-free reinforcement learning, with the respective dimensions of the Hilbert space being four and one hundred.

Our work suggests the possibility of exploiting multi-agent RL through parallel computation to stabilize entanglement. The agents leverage the decentralized structure of the task and share information via communication. Notably, if several agents fail in a multi-agent system, the remaining agents can take over some of their tasks. In principle, our control framework can be extended to multi-agent RL for multi-mode entanglement engineering of a quantum black box.

Methods

Stochastic Master Equation

An experimental optomechanical system is effectively an open quantum system interacting with the vacuum bath under WCM with the operators [49, 29] $\hat{C}_{n}\equiv\sqrt{\eta}\,\hat{P}_{n}$, where $\hat{P}_{n}=|n\rangle\langle n|$ with $n=0,1$ (linear) or $n=0,1,\ldots,9$ (nonlinear) is the measurement operator on the Fock state and $\eta$ denotes the measurement rate. The quantum dynamics of this system are described by the stochastic master equation (SME) [92, 29, 93, 94, 51]:

$$
\begin{aligned}
d\rho &= \frac{1}{i\hbar}[\tilde{H},\rho]\,dt+\mathcal{L}_{env}\,\rho\,dt \\
&\quad +\sum_{n}\mathcal{D}(\hat{C}_{n})\rho\,dt+\sum_{n}\mathcal{H}(\hat{C}_{n})\rho\,dW_{n},
\end{aligned}
\tag{2}
$$

where the Hamiltonian is $\tilde{H}=\tilde{H}_{bs}$ or $\tilde{H}_{nl}$ and $\rho$ is a density operator in the Hilbert space. Under the Born-Markov approximation [95, 96], which requires the system-bath coupling to be weak and the correlation time of the bath to be much shorter than a characteristic timescale of system-bath interactions, the Markovian master equation, i.e., the first two terms on the right-hand side of Eq. (2), has the Lindblad form [95]. At absolute zero temperature, the following environmental operator $\mathcal{L}_{env}\,\rho$ can be introduced to describe the coupling between the system and vacuum bath: $\mathcal{L}_{env}\,\rho=\kappa\mathcal{D}(\hat{a})\rho+\gamma\mathcal{D}(\hat{b})\rho$, where the cavity and oscillator modes are coupled to the vacuum bath with the strengths $\kappa$ and $\gamma$, respectively [56]. The deep RL results for the Lindblad master equation with the nonlinear interaction are presented in Supplementary Note 5.
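As a rough numerical illustration of the zero-temperature bath term $\mathcal{L}_{env}\,\rho=\kappa\mathcal{D}(\hat{a})\rho+\gamma\mathcal{D}(\hat{b})\rho$, the following minimal sketch builds it with plain NumPy instead of the QuTiP solver used for the actual simulations. The truncation to two Fock levels per mode, the cavity ⊗ mechanics tensor ordering, the value of `gamma`, and all function names are assumptions made for illustration only.

```python
import numpy as np

def destroy(dim):
    """Annihilation operator truncated to `dim` Fock levels."""
    return np.diag(np.sqrt(np.arange(1, dim)), k=1)

def dissipator(A, rho):
    """Lindblad superoperator D(A) rho = A rho A^† - (A^†A rho + rho A^†A)/2."""
    Ad = A.conj().T
    return A @ rho @ Ad - 0.5 * (Ad @ A @ rho + rho @ Ad @ A)

def env_term(rho, a, b, kappa, gamma):
    """Zero-temperature bath term L_env rho = kappa D(a) rho + gamma D(b) rho."""
    return kappa * dissipator(a, rho) + gamma * dissipator(b, rho)

dim = 2                                  # assumed truncation (linear case)
a = np.kron(destroy(dim), np.eye(dim))   # cavity mode (assumed ordering: cavity ⊗ mechanics)
b = np.kron(np.eye(dim), destroy(dim))   # mechanical mode
kappa, gamma = 0.1, 0.01                 # units of omega_m; gamma is an illustrative value

# One photon in the cavity, mechanical mode in its ground state: |1,0><1,0|
psi = np.zeros(dim * dim)
psi[1 * dim + 0] = 1.0
rho = np.outer(psi, psi)

L_rho = env_term(rho, a, b, kappa, gamma)
# Bath contribution to d<a^†a>/dt for this state
photon_rate = np.trace(a.conj().T @ a @ L_rho)

# The joint vacuum |0,0><0,0| is left unchanged by the bath term
vac = np.zeros(dim * dim)
vac[0] = 1.0
L_vac = env_term(np.outer(vac, vac), a, b, kappa, gamma)
```

For the single-photon state above, the bath term alone gives a photon-number rate of $-\kappa$, and the term is traceless, so it does not change the normalization of $\rho$.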

The WCM process described by the last two terms on the right-hand side of Eq. (2) is nonlinear in $\rho$ and Markovian in the unconditional master equation [51]. Under WCM, a Wiener process $dW$ with a Gaussian distribution [51] arises from the Gaussian-weighted projection over the eigenstates that allows the quantum information to be extracted continuously in the time domain, subject to stochastic disturbances in the last term of Eq. (2) and quantum decoherence in the penultimate term of Eq. (2). (Supplementary Note 3 in SI provides a detailed derivation of the SME.) The Lindblad operator $\mathcal{D}$ and the measurement superoperator $\mathcal{H}$ in Eq. (2) are given by

$$
\begin{aligned}
\mathcal{D}(\hat{A})\rho &\equiv \hat{A}\rho\hat{A}^{\dagger}-\frac{1}{2}(\hat{A}^{\dagger}\hat{A}\rho+\rho\hat{A}^{\dagger}\hat{A}),\\
\mathcal{H}(\hat{A})\rho &\equiv \hat{A}\rho+\rho\hat{A}^{\dagger}-\langle\hat{A}+\hat{A}^{\dagger}\rangle\rho,
\end{aligned}
$$

with $\langle\hat{A}\rangle\equiv\textnormal{Tr}[\hat{A}\rho]$. The two operators serve to weakly drive the quantum state into the corresponding eigenstates to some degree.
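To make the roles of $\mathcal{D}$ and $\mathcal{H}$ concrete, the sketch below applies one stochastic increment of the two measurement terms of the SME, $\sum_{n}\mathcal{D}(\hat{C}_{n})\rho\,dt+\sum_{n}\mathcal{H}(\hat{C}_{n})\rho\,dW_{n}$, with $\hat{C}_{n}=\sqrt{\eta}\hat{P}_{n}$. It uses a plain Euler-Maruyama update in NumPy, which is a simpler scheme than the “taylor1.5” solver employed in the paper, and it omits the Hamiltonian and bath terms. The assumption that the projectors act on the cavity mode with the identity on the mechanical mode, the tensor ordering, the value of `eta`, the initial state, and all function names are illustrative choices, not taken from the paper.

```python
import numpy as np

def measurement_operators(eta, dim_cavity, dim_mech):
    """C_n = sqrt(eta) * (|n><n| ⊗ I) for each retained photon number n (assumed form)."""
    ops = []
    for n in range(dim_cavity):
        P = np.zeros((dim_cavity, dim_cavity), dtype=complex)
        P[n, n] = 1.0
        ops.append(np.sqrt(eta) * np.kron(P, np.eye(dim_mech)))
    return ops

def dissipator(A, rho):
    """Lindblad superoperator D(A) rho."""
    Ad = A.conj().T
    return A @ rho @ Ad - 0.5 * (Ad @ A @ rho + rho @ Ad @ A)

def measurement_superop(A, rho):
    """Measurement superoperator H(A) rho = A rho + rho A^† - <A + A^†> rho."""
    Ad = A.conj().T
    expect = np.trace((A + Ad) @ rho).real
    return A @ rho + rho @ Ad - expect * rho

def measurement_step(rho, c_ops, dt, rng):
    """One Euler-Maruyama increment of the measurement terms only."""
    d_rho = np.zeros_like(rho)
    for C in c_ops:
        dW = rng.normal(0.0, np.sqrt(dt))  # Wiener increment with variance dt
        d_rho += dissipator(C, rho) * dt + measurement_superop(C, rho) * dW
    return rho + d_rho

eta, dt = 0.5, 0.01                        # illustrative values, units set by omega_m
c_ops = measurement_operators(eta, 2, 2)   # linear case: n = 0, 1

# Illustrative initial state (|0,1> + |1,0>)/sqrt(2)
psi = np.zeros(4, dtype=complex)
psi[1] = psi[2] = 1.0 / np.sqrt(2.0)
rho0 = np.outer(psi, psi.conj())

rng = np.random.default_rng(0)
rho1 = measurement_step(rho0, c_ops, dt, rng)
```

Because both $\mathcal{D}(\hat{A})\rho$ and $\mathcal{H}(\hat{A})\rho$ are traceless for a normalized $\rho$, the updated state keeps unit trace and remains Hermitian for any realization of the noise; positivity is only guaranteed for sufficiently small `dt` in this first-order scheme.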

Implementation details of deep RL

For simulating the linear or nonlinear quantum optomechanical system described by Eq. (2), we use the “taylor1.5” solver of the SME solver in the QuTiP package [97] with the tolerance $\mathrm{tol}=10^{-6}$ and time step size $dt=0.01\,\omega_{m}^{-1}$. The measurement current is simulated with the “homodyne” method, and the custom environment is constructed with the open-source platform OpenAI Gym [98]. For RL simulations, we construct the PPO agent [48] with “stable-baselines3” [99] in the A2C [100] settings, where the stochastic policy (actor) and the value function (critic) are modeled by two independent neural network function approximators, i.e., a set of fully connected feed-forward networks of dimensions $256\times128\times64$ with the hyperbolic tangent nonlinear activation function for each hidden layer. For the nonlinear quantum optomechanical configuration, in the target-utilization phase, the recurrent PPO agent outperforms the PPO agent, where both independent critic and actor networks are MLPs followed by one independent layer of LSTM with $256\times128\times64$ fully connected networks and $256$ hidden states. More details are described in Supplementary Note 6.
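The $256\times128\times64$ actor and critic networks with hyperbolic tangent activations can be pictured with the following NumPy sketch of two independent multilayer perceptrons. This is not the stable-baselines3 implementation used in the paper: the observation and action dimensions, the weight initialization, and the function names are hypothetical, and the LSTM layer of the recurrent agent is not included.

```python
import numpy as np

HIDDEN = (256, 128, 64)  # hidden-layer widths quoted in the text

def init_mlp(in_dim, out_dim, rng, hidden=HIDDEN):
    """Create (weight, bias) pairs for a fully connected feed-forward network."""
    sizes = (in_dim, *hidden, out_dim)
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Apply tanh after every hidden layer and keep the output layer linear."""
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

rng = np.random.default_rng(0)
obs_dim, act_dim = 1, 1                 # hypothetical: one observable, one control
actor = init_mlp(obs_dim, act_dim, rng) # policy network (mean of the action)
critic = init_mlp(obs_dim, 1, rng)      # independent value network

obs = np.array([[0.3]])                 # a batch with one illustrative observation
action_mean = forward(actor, obs)
value = forward(critic, obs)
```

The two parameter lists share nothing, mirroring the statement that the actor and the critic are independent function approximators.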

Data availability

The data generated in this study about training results can be found in this Zenodo repository: https://doi.org/10.5281/zenodo.12584159 [101].

Code availability

The codes used in this paper can be found in the repository:
https://github.com/liliyequantum/Entanglement-engineering-by-RL [102].

Supplementary information

Supplementary information for this study is available on GitHub:
https://github.com/liliyequantum/Entanglement-engineering-by-RL [102].

References

  • [1] Bennett, C. H. & DiVincenzo, D. P. Quantum information and computation. Nature 404, 247–255 (2000).
  • [2] Braunstein, S. L. & van Loock, P. Quantum information with continuous variables. Rev. Mod. Phys. 77, 513–577 (2005). URL https://link.aps.org/doi/10.1103/RevModPhys.77.513.
  • [3] Huang, H.-Y., Kueng, R. & Preskill, J. Information-theoretic bounds on quantum advantage in machine learning. Phys. Rev. Lett. 126, 190505 (2021). URL https://link.aps.org/doi/10.1103/PhysRevLett.126.190505.
  • [4] Huang, H.-Y. et al. Quantum advantage in learning from experiments. Science 376, 1182–1186 (2022).
  • [5] Lee, S. et al. Evaluating the evidence for exponential quantum advantage in ground-state quantum chemistry. Nat. Commun. 14, 1952 (2023).
  • [6] Degen, C. L., Reinhard, F. & Cappellaro, P. Quantum sensing. Rev. Mod. Phys. 89, 035002 (2017).
  • [7] Horodecki, R., Horodecki, P., Horodecki, M. & Horodecki, K. Quantum entanglement. Rev. Mod. Phys. 81, 865 (2009).
  • [8] Kimble, H. J. The quantum internet. Nature 453, 1023–1030 (2008).
  • [9] Wehner, S., Elkouss, D. & Hanson, R. Quantum internet: A vision for the road ahead. Science 362, eaam9288 (2018).
  • [10] Bouwmeester, D. et al. Experimental quantum teleportation. Nature 390, 575–579 (1997).
  • [11] Ren, J.-G. et al. Ground-to-satellite quantum teleportation. Nature 549, 70–73 (2017).
  • [12] Chen, Y.-A. et al. An integrated space-to-ground quantum communication network over 4,600 kilometres. Nature 589, 214–219 (2021).
  • [13] Pirandola, S., Laurenza, R., Ottaviani, C. & Banchi, L. Fundamental limits of repeaterless quantum communications. Nat. Commun. 8, 15043 (2017).
  • [14] De Leon, N. P. et al. Materials challenges and opportunities for quantum computing hardware. Science 372, eabb2823 (2021).
  • [15] Noiri, A. et al. Fast universal quantum gate above the fault-tolerance threshold in silicon. Nature 601, 338–342 (2022).
  • [16] Magann, A. B. et al. From pulses to circuits and back again: A quantum optimal control perspective on variational quantum algorithms. PRX Quantum 2, 010101 (2021).
  • [17] Banchi, L., Pancotti, N. & Bose, S. Quantum gate learning in qubit networks: Toffoli gate without time-dependent control. Npj Quantum Inf. 2, 1–6 (2016).
  • [18] Romero, G., Ballester, D., Wang, Y., Scarani, V. & Solano, E. Ultrafast quantum gates in circuit QED. Phys. Rev. Lett. 108, 120501 (2012).
  • [19] Van der Sar, T. et al. Decoherence-protected quantum gates for a hybrid solid-state spin register. Nature 484, 82–86 (2012).
  • [20] Chow, J. M. et al. Universal quantum gate set approaching fault-tolerant thresholds with superconducting qubits. Phys. Rev. Lett. 109, 060501 (2012).
  • [21] Bharti, K. et al. Noisy intermediate-scale quantum algorithms. Rev. Mod. Phys. 94, 015004 (2022).
  • [22] Mackeprang, J., Dasari, D. B. R. & Wrachtrup, J. A reinforcement learning approach for quantum state engineering. Quantum Mach. Intell. 2, 1–14 (2020).
  • [23] Giordano, S. & Martin-Delgado, M. A. Reinforcement-learning generation of four-qubit entangled states. Phys. Rev. Research 4, 043056 (2022).
  • [24] Zhang, Z.-Y. et al. Entanglement generation of polar molecules via deep reinforcement learning. J. Chem. Theory Comput. 20, 1811–1820 (2024).
  • [25] Metz, F. & Bukov, M. Self-correcting quantum many-body control using reinforcement learning with tensor networks. Nat. Mach. Intell. 5, 780–791 (2023).
  • [26] Bukov, M. et al. Reinforcement learning in different phases of quantum control. Phys. Rev. X 8, 031086 (2018).
  • [27] Guo, S.-F. et al. Faster state preparation across quantum phase transition assisted by reinforcement learning. Phys. Rev. Lett. 126, 060401 (2021).
  • [28] Sivak, V. et al. Model-free quantum control with reinforcement learning. Phys. Rev. X 12, 011059 (2022).
  • [29] Porotti, R., Essig, A., Huard, B. & Marquardt, F. Deep reinforcement learning for quantum state preparation with weak nonlinear measurements. Quantum 6, 747 (2022).
  • [30] Jayachandran, P., Zaw, L. H. & Scarani, V. Dynamics-based entanglement witnesses for non-Gaussian states of harmonic oscillators. Phys. Rev. Lett. 130, 160201 (2023).
  • [31] Ho, M., Oudot, E., Bancal, J.-D. & Sangouard, N. Witnessing optomechanical entanglement with photon counting. Phys. Rev. Lett. 121, 023602 (2018).
  • [32] Gut, C. et al. Stationary optomechanical entanglement between a mechanical oscillator and its measurement apparatus. Phys. Rev. Research 2, 033244 (2020).
  • [33] Hofer, S. G., Wieczorek, W., Aspelmeyer, M. & Hammerer, K. Quantum entanglement and teleportation in pulsed cavity optomechanics. Phys. Rev. A 84, 052327 (2011).
  • [34] Cai, Q. et al. Entangling optical and mechanical cavity modes in an optomechanical crystal nanobeam. Phys. Rev. A 108, 022419 (2023).
  • [35] Clarke, J. et al. Generating mechanical and optomechanical entanglement via pulsed interaction and measurement. New J. Phys. 22, 063001 (2020).
  • [36] Kiesewetter, S., He, Q., Drummond, P. & Reid, M. Scalable quantum simulation of pulsed entanglement and Einstein-Podolsky-Rosen steering in optomechanics. Phys. Rev. A 90, 043805 (2014).
  • [37] Wang, G., Huang, L., Lai, Y.-C. & Grebogi, C. Nonlinear dynamics and quantum entanglement in optomechanical systems. Phys. Rev. Lett. 112, 110406 (2014).
  • [38] Hofer, S. G. & Hammerer, K. Entanglement-enhanced time-continuous quantum control in optomechanics. Phys. Rev. A 91, 033822 (2015).
  • [39] Lin, Q., He, B., Ghobadi, R. & Simon, C. Fully quantum approach to optomechanical entanglement. Phys. Rev. A 90, 022309 (2014).
  • [40] Farace, A. & Giovannetti, V. Enhancing quantum effects via periodic modulations in optomechanical systems. Phys. Rev. A 86, 013820 (2012).
  • [41] Mari, A. & Eisert, J. Opto-and electro-mechanical entanglement improved by modulation. New J. Phys. 14, 075014 (2012).
  • [42] Mari, A. & Eisert, J. Gently modulating optomechanical systems. Phys. Rev. Lett. 103, 213603 (2009).
  • [43] Stefanatos, D. Maximising optomechanical entanglement with optimal control. Quantum Sci. Technol. 2, 014003 (2017).
  • [44] Guo, J. & Gröblacher, S. Coherent feedback in optomechanical systems in the sideband-unresolved regime. Quantum 6, 848 (2022).
  • [45] Harwood, A., Brunelli, M. & Serafini, A. Cavity optomechanics assisted by optical coherent feedback. Phys. Rev. A 103, 023509 (2021).
  • [46] Wang, Z. T., Ashida, Y. & Ueda, M. Deep reinforcement learning control of quantum cartpoles. Phys. Rev. Lett. 125, 100401 (2020). URL https://link.aps.org/doi/10.1103/PhysRevLett.125.100401.
  • [47] Borah, S., Sarma, B., Kewming, M., Milburn, G. J. & Twamley, J. Measurement-based feedback quantum control with deep reinforcement learning for a double-well nonlinear potential. Phys. Rev. Lett. 127, 190403 (2021). URL https://link.aps.org/doi/10.1103/PhysRevLett.127.190403.
  • [48] Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
  • [49] Essig, A. et al. Multiplexed photon number measurement. Phys. Rev. X 11, 031045 (2021).
  • [50] Smith, G. A., Chaudhury, S., Silberfarb, A., Deutsch, I. H. & Jessen, P. S. Continuous weak measurement and nonlinear dynamics in a cold spin ensemble. Phys. Rev. Lett. 93, 163602 (2004). URL https://link.aps.org/doi/10.1103/PhysRevLett.93.163602.
  • [51] Jacobs, K. & Steck, D. A. A straightforward introduction to continuous quantum measurement. Contemp. Phys. 47, 279–303 (2006).
  • [52] Ramírez, J., Yu, W. & Perrusquía, A. Model-free reinforcement learning from expert demonstrations: A survey. Artif. Intell. Rev. 55, 3213–3241 (2022).
  • [53] Kaiser, L. et al. Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374 (2019).
  • [54] Stockton, J. K., van Handel, R. & Mabuchi, H. Deterministic Dicke-state preparation with continuous measurement and control. Phys. Rev. A 70, 022106 (2004). URL https://link.aps.org/doi/10.1103/PhysRevA.70.022106.
  • [55] Liu, Y.-C., Xiao, Y.-F., Luan, X. & Wong, C. W. Dynamic dissipative cooling of a mechanical resonator in strong coupling optomechanics. Phys. Rev. Lett. 110, 153606 (2013).
  • [56] Qian, J., Clerk, A., Hammerer, K. & Marquardt, F. Quantum signatures of the optomechanical instability. Phys. Rev. Lett. 109, 253601 (2012).
  • [57] Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
  • [58] Dobrindt, J. M., Wilson-Rae, I. & Kippenberg, T. J. Parametric normal-mode splitting in cavity optomechanics. Phys. Rev. Lett. 101, 263602 (2008).
  • [59] Palomaki, T., Teufel, J., Simmonds, R. & Lehnert, K. W. Entangling mechanical motion with microwave fields. Science 342, 710–713 (2013).
  • [60] Barzanjeh, S. et al. Optomechanics for quantum technologies. Nat. Phys. 18, 15–24 (2022).
  • [61] Aspelmeyer, M., Kippenberg, T. J. & Marquardt, F. Cavity optomechanics. Rev. Mod. Phys. 86, 1391–1452 (2014). URL https://link.aps.org/doi/10.1103/RevModPhys.86.1391.
  • [62] Lemonde, M.-A., Didier, N. & Clerk, A. A. Nonlinear interaction effects in a strongly driven optomechanical cavity. Phys. Rev. Lett. 111, 053602 (2013).
  • [63] Marquardt, F. Quantum optomechanics (2014). https://mpl.mpg.de/fileadmin/user_upload/Marquardt_Division/Teaching/2014_ChapterDraftLesHouches.pdf.
  • [64] Song, X., Oksanen, M., Li, J., Hakonen, P. J. & Sillanpää, M. A. Graphene optomechanics realized at microwave frequencies. Phys. Rev. Lett. 113, 027404 (2014).
  • [65] Shin, J. et al. On-chip microwave frequency combs in a superconducting nanoelectromechanical device. Nano Lett. 22, 5459–5465 (2022).
  • [66] Seis, Y. et al. Ground state cooling of an ultracoherent electromechanical system. Nat. Commun. 13, 1507 (2022).
  • [67] Liu, Y. et al. Long-lived microwave electromechanical systems enabled by cubic silicon-carbide membrane crystals. arXiv preprint arXiv:2401.01020 (2024).
  • [68] Johnson, B. et al. Quantum non-demolition detection of single microwave photons in a circuit. Nat. Phys. 6, 663–667 (2010).
  • [69] Gleyzes, S. et al. Quantum jumps of light recording the birth and death of a photon in a cavity. Nature 446, 297–300 (2007).
  • [70] Vidal, G. & Werner, R. F. Computable measure of entanglement. Phys. Rev. A 65, 032314 (2002). URL https://link.aps.org/doi/10.1103/PhysRevA.65.032314.
  • [71] Plenio, M. B. Logarithmic negativity: A full entanglement monotone that is not convex. Phys. Rev. Lett. 95, 090503 (2005). URL https://link.aps.org/doi/10.1103/PhysRevLett.95.090503.
  • [72] Kitagawa, A., Takeoka, M., Sasaki, M. & Chefles, A. Entanglement evaluation of non-Gaussian states generated by photon subtraction from squeezed states. Phys. Rev. A 73, 042310 (2006).
  • [73] Shapourian, H., Liu, S., Kudler-Flam, J. & Vishwanath, A. Entanglement negativity spectrum of random mixed states: A diagrammatic approach. PRX Quantum 2, 030347 (2021). URL https://link.aps.org/doi/10.1103/PRXQuantum.2.030347.
  • [74] Vidal, G. Entanglement monotones. J. Mod. Opt. 47, 355–376 (2000).
  • [75] Eisert, J., Scheel, S. & Plenio, M. B. Distilling Gaussian states with Gaussian operations is impossible. Phys. Rev. Lett. 89, 137903 (2002).
  • [76] Fiurášek, J. Gaussian transformations and distillation of entangled Gaussian states. Phys. Rev. Lett. 89, 137904 (2002).
  • [77] Rains, E. M. A semidefinite program for distillable entanglement. IEEE Trans. Inf. Theory 47, 2921–2933 (2001).
  • [78] Pakniat, R., Zandi, M. H. & Tavassoly, M. K. On the entanglement swapping by using the beam splitter. Eur. Phys. J. Plus 132, 1–10 (2017).
  • [79] Bouchard, F. et al. Two-photon interference: the Hong–Ou–Mandel effect. Rep. Prog. Phys. 84, 012402 (2020).
  • [80] Kim, Y.-H. & Grice, W. P. Reliability of the beam-splitter–based Bell-state measurement. Phys. Rev. A 68, 062305 (2003).
  • [81] Behunin, R. O. & Rakich, P. T. Quantum optomechanics in tripartite systems. arXiv preprint arXiv:2210.14967 (2022).
  • [82] Wang, Y.-D., Chesi, S. & Clerk, A. A. Bipartite and tripartite output entanglement in three-mode optomechanical systems. Phys. Rev. A 91, 013807 (2015).
  • [83] Thekkadath, G. S. et al. Tuning between photon-number and quadrature measurements with weak-field homodyne detection. Phys. Rev. A 101, 031801 (2020). URL https://link.aps.org/doi/10.1103/PhysRevA.101.031801.
  • [84] Puentes, G. et al. Bridging particle and wave sensitivity in a configurable detector of positive operator-valued measures. Phys. Rev. Lett. 102, 080404 (2009).
  • [85] Lvovsky, A. I. & Raymer, M. G. Continuous-variable optical quantum-state tomography. Rev. Mod. Phys. 81, 299 (2009).
  • [86] Nixon, M. & Aguado, A. Feature extraction and image processing for computer vision (Academic press, 2019).
  • [87] Yuan, H. Y. et al. Steady Bell state generation via magnon-photon coupling. Phys. Rev. Lett. 124, 053602 (2020). URL https://link.aps.org/doi/10.1103/PhysRevLett.124.053602.
  • [88] Pleines, M., Pallasch, M., Zimmer, F. & Preuss, M. Generalization, mayhems and limits in recurrent proximal policy optimization. arXiv preprint arXiv:2205.11104 (2022).
  • [89] Li, Z.-Z., Chen, W., Abbasi, M., Murch, K. W. & Whaley, K. B. Speeding up entanglement generation by proximity to higher-order exceptional points. Phys. Rev. Lett. 131, 100202 (2023). URL https://link.aps.org/doi/10.1103/PhysRevLett.131.100202.
  • [90] Hill, C. & Ralph, J. Weak measurement and control of entanglement generation. Phys. Rev. A 77, 014305 (2008).
  • [91] Diotallevi, G. F., Annby-Andersson, B., Samuelsson, P., Tavakoli, A. & Bakhshinezhad, P. Steady-state entanglement production in a quantum thermal machine with continuous feedback control. New J. Phys. 26, 053005 (2024).
  • [92] Blaquiere, A., Diner, S. & Lochak, G. Information complexity and control in quantum physics. In Proceedings of the 4th International Seminar on Mathematical Theory of Dynamical Systems and Microphysics, Udine (Springer, Vienna, 1987).
  • [93] Bowen, W. P. & Milburn, G. J. Quantum optomechanics (CRC press, 2015).
  • [94] Wiseman, H. M. & Milburn, G. J. Quantum measurement and control (Cambridge university press, 2009).
  • [95] Nathan, F. & Rudner, M. S. Universal Lindblad equation for open quantum systems. Phys. Rev. B 102, 115109 (2020).
  • [96] Bai, S.-Y., Chen, C., Wu, H. & An, J.-H. Quantum control in open and periodically driven systems. Adv. Phys. X 6, 1870559 (2021).
  • [97] Johansson, J. R., Nation, P. D. & Nori, F. QuTiP: An open-source Python framework for the dynamics of open quantum systems. Comput. Phys. Commun. 183, 1760–1772 (2012).
  • [98] Brockman, G. et al. OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016).
  • [99] Raffin, A. et al. Stable-Baselines3: Reliable reinforcement learning implementations. J. Mach. Learn. Res. 22, 1–8 (2021). URL http://jmlr.org/papers/v22/20-1364.html.
  • [100] Mnih, V. et al. Asynchronous methods for deep reinforcement learning. In ICML, 1928–1937 (PMLR, 2016).
  • [101] Ye, L.-L. Entanglement engineering of optomechanical systems by reinforcement learning. https://doi.org/10.5281/zenodo.12584159 (2024).
  • [102] Ye, L.-L. Entanglement engineering by RL. https://github.com/liliyequantum/Entanglement-engineering-by-RL (2024).

Acknowledgment

We thank Dr. Kanu Sinha for discussions and comments. This work was supported by AFOSR under Grants Nos. FA9550-21-1-0438 and FA9550-21-1-0186. The Quantum Collaborative, led by Arizona State University, also provided valuable expertise and resources for this work through seed funding. The Quantum Collaborative connects top scientific programs, initiatives, and facilities with prominent industry partners to advance the science and engineering of quantum information science.

Author Contributions

L.-L. Y., C. A., J. L., and Y.-C. L. designed the research project, the models, and methods. L.-L. Y. performed the computations. L.-L. Y., C. A., J. L., and Y.-C. L. analyzed the data. L.-L. Y. and Y.-C. L. wrote and edited the manuscript.

Competing Interests

The authors declare no competing interests.