Acta Optica Sinica, Volume. 43, Issue 21, 2114002(2023)

Reinforcement Learning for Free Electron Laser Online Optimization

Jiacheng Wu1,2, Meng Cai3, Yujie Lu1,3, Nanshun Huang4,*, Chao Feng2,3, and Zhentang Zhao1,2,3
Author Affiliations
  • 1School of Physical Science and Technology, ShanghaiTech University, Shanghai 201210, China
  • 2Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China
  • 3Shanghai Institute of Applied Physics, Chinese Academy of Sciences, Shanghai 201800, China
  • 4Zhangjiang Laboratory, Shanghai 201210, China

    Objective

    X-ray free-electron lasers (FELs) have brought about a significant transformation in biology, chemistry, and materials science. Their capacity to produce femtosecond pulses with gigawatt peak power at wavelengths tunable down to less than 0.1 nm has stimulated the construction and operation of numerous FEL user facilities worldwide. The Shanghai soft X-ray free-electron laser (SXFEL) is the first X-ray FEL user facility in China. Its daily operation requires precise control of the accelerator state to ensure laser quality and stability, which necessitates high-dimensional, high-frequency, closed-loop control of beam parameters. Furthermore, the intricate demands that scientific experiments place on FEL characteristics such as wavelength, bandwidth, and brightness make the control and optimization of the facility even more challenging. This tuning is usually carried out by experienced commissioning personnel and requires a significant investment of time. Therefore, automated online optimization algorithms are crucial for improving the commissioning procedure.

    Methods

    A deep reinforcement learning method, in which the control policy is represented by a neural network, is employed in this study. Reinforcement learning updates its parameters using the positive and negative rewards obtained from the interaction between the agent and the environment. It requires no prior knowledge of the inner workings of the environment and does not depend on pre-collected data sets; in principle, the methodology can therefore be applied to optimize any given parameter of an online device. We employ the soft actor-critic (SAC), twin delayed deep deterministic policy gradient (TD3), and deep deterministic policy gradient (DDPG) algorithms to adjust multiple correction magnets and optimize the output power of the free-electron laser in a simulation environment. To simulate non-ideal orbit conditions, the beam trajectory is deflected by a magnet at the entrance of the first undulator. In the optimization task, the current values of seven correction magnets in both the horizontal and vertical directions (14 parameters in total) serve as the agent's action, and the x and y positions of the electron beam along the undulator line after it passes through the seven correction magnets serve as the environment's state. The intensity and roundness of the radiation spot are used as the evaluation criteria for laser quality. During the simulation, Python scripts modify the input file and magnetic-lattice file of Genesis 1.3 to execute the action, and the state and reward are obtained by reading and analyzing the power output and radiation field computed by Genesis 1.3. At each step of the optimization, the agent performs an action that adjusts the 14 magnet parameters to correct the orbit; the environment then changes and returns a reward to the agent according to the laser-quality criteria. The agent optimizes its policy to maximize the cumulative reward.
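    To make the setup concrete, the sketch below (not the authors' code) frames the task described above as a Gymnasium-style environment: the action is the normalized current of the 14 correctors, the state is the beam x/y positions along the undulator line, and the reward combines spot intensity and roundness. The class name, the reward weighting, and the dummy stand-in for the Genesis 1.3 file round trip are illustrative assumptions.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class FELOrbitEnv(gym.Env):
    """Orbit-correction task: 14 corrector currents in, beam positions and spot quality out."""

    def __init__(self, n_correctors=7, max_current=1.0):
        n_params = 2 * n_correctors  # horizontal + vertical correctors
        # Action: normalized current of each corrector, rescaled to physical units in step().
        self.action_space = spaces.Box(-1.0, 1.0, shape=(n_params,), dtype=np.float32)
        # State: beam x/y positions along the undulator line after the seven correctors.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(n_params,), dtype=np.float32)
        self.max_current = max_current

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        orbit, _, _ = self._run_genesis(np.zeros(self.action_space.shape, dtype=np.float32))
        return orbit, {}

    def step(self, action):
        currents = self.max_current * np.clip(action, -1.0, 1.0)
        orbit, intensity, roundness = self._run_genesis(currents)
        # Reward balances spot intensity against roundness; the 0.5 weighting is an assumption.
        reward = float(intensity + 0.5 * roundness)
        return orbit, reward, False, False, {}

    def _run_genesis(self, currents):
        """Stand-in for the Genesis 1.3 round trip.

        In the real setup this method would rewrite the Genesis 1.3 input and
        lattice files with the new corrector currents, run the simulation, and
        parse the beam orbit, radiation power, and transverse field profile
        from the output. Dummy values keep the sketch runnable on its own.
        """
        orbit = self.np_random.normal(0.0, 1e-4, size=self.observation_space.shape).astype(np.float32)
        intensity = float(np.exp(-1e6 * np.sum(orbit.astype(np.float64) ** 2)))  # toy proxy for FEL power
        roundness = float(self.np_random.uniform(0.5, 1.0))                      # toy proxy for spot roundness
        return orbit, intensity, roundness
```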

    Results and Discussions

    In the FEL simulation environment, we use the SAC, TD3, and DDPG algorithms with the parameters listed in Table 2 to optimize the beam orbit under different random-number seeds. Figure 2 shows the training results. As the learning process of the SAC and TD3 algorithms progresses, the reward function converges, and the FEL power eventually reaches saturation. The SAC and TD3 algorithms maximize the FEL intensity after about 400 steps, with the SAC algorithm converging to a better result than the TD3 algorithm. Compared with the DDPG algorithm on which it is built, the TD3 algorithm mitigates the impact of overestimated action values on policy updates and thereby stabilizes training; the SAC algorithm additionally maximizes the policy entropy while maximizing the expected reward, which increases the randomness of the policy and prevents it from prematurely converging to a local optimum. Furthermore, after convergence, the mean power obtained with the SAC algorithm is noticeably more stable than that obtained with the TD3 algorithm, and its confidence interval is smaller, indicating better stability. The gain curves of the three algorithms in the tuning task, together with the initial gain curve, are shown in Fig. 3(a). The SAC algorithm raises the output power from approximately 0.08 GW to 0.77 GW, slightly higher than that achieved by the TD3 algorithm and significantly higher than that of the DDPG algorithm. The optimized orbits of the three algorithms and the initial orbits are shown in Fig. 3(b). Because of the deflection magnet applied at the entrance of the system and the drift section placed there, the beam is deflected and diverges over the first 2.115 m of the undulator structure, and the uncorrected orbit maintains this state. The SAC, TD3, and DDPG algorithms all adjust the orbits. Figure 3(b) shows that the orbits optimized by the SAC algorithm are closer to the center of the undulator, i.e., the ideal orbits, in both the horizontal and vertical directions, which also explains why the output power obtained with SAC is higher than that obtained with TD3 and DDPG. To show the results of the orbit optimization more directly, we compare the initial light spot at the exit of the undulator with the light spots optimized by the three algorithms (Fig. 4). The initial light spot is offset in both the x and y directions and has weak intensity. The light spot optimized by SAC is fully centered in the undulator and has the highest intensity, whereas it remains offset in the x direction for the other two algorithms.
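    For illustration only, the short sketch below shows how such a three-algorithm comparison could be run with the Stable-Baselines3 implementations of SAC, TD3, and DDPG on the environment sketched in the Methods section. The library choice, episode length, and step budget are assumptions, and the hyperparameters of the paper's Table 2 are not reproduced here.

```python
from gymnasium.wrappers import TimeLimit
from stable_baselines3 import DDPG, SAC, TD3

# Bound the episode length so training proceeds in short orbit-correction episodes
# (the 50-step limit is an assumption, not a value taken from the paper).
env = TimeLimit(FELOrbitEnv(), max_episode_steps=50)

for name, Algo in [("SAC", SAC), ("TD3", TD3), ("DDPG", DDPG)]:
    model = Algo("MlpPolicy", env, seed=0, verbose=0)
    model.learn(total_timesteps=400)  # the paper reports power saturating near 400 steps
    print(f"{name} finished training")
```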

    Conclusions

    We employ deep reinforcement learning techniques to simultaneously control multiple correction magnets and thereby optimize the beam orbit within the undulator. The deep reinforcement learning approach learns rules from past experience, avoiding the need for training on a calibration data set. In contrast to heuristic algorithms, this approach is more efficient and less prone to local optima. In this study, the SAC and TD3 algorithms are shown to effectively optimize the beam orbit and improve the spot quality through analysis of the system state, balancing of the reward, and optimization of the action. The simulation results indicate that the TD3 algorithm optimizes the laser power to 0.71 GW, mitigating the bias in DDPG that arises from overestimated action values. Furthermore, the SAC algorithm optimizes the laser power to 0.77 GW, demonstrating a marked improvement in learning efficiency and performance over DDPG. SAC is based on the maximum-entropy principle and exhibits improved training effectiveness and stability. Thus, the SAC algorithm is robust and holds the potential to be used for automated online optimization of the SXFEL.

    Paper Information

    Category: Lasers and Laser Optics

    Received: Apr. 28, 2023

    Accepted: May 31, 2023

    Published Online: Nov. 16, 2023

    The Author Email: Huang Nanshun (huangns@zjlab.ac.cn)

    DOI:10.3788/AOS230893
