From f2f1d279db9654c81131a878cf06381fba9a8d52 Mon Sep 17 00:00:00 2001 From: LegionAtol Date: Thu, 22 Aug 2024 16:34:35 +0200 Subject: [PATCH 1/5] added RL algorithm --- doc/guide/guide-control.rst | 39 +++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/doc/guide/guide-control.rst b/doc/guide/guide-control.rst index b2029cd..6a0f9aa 100644 --- a/doc/guide/guide-control.rst +++ b/doc/guide/guide-control.rst @@ -195,6 +195,45 @@ experimental systematic noise, ...) can be done all in one, using this algorithm. +The RL Algorithm +================ +Reinforcement Learning (RL) represents a different approach compared to traditional +quantum control methods, such as GRAPE and CRAB. Instead of relying on gradients or +prior knowledge of the system, RL uses an agent that autonomously learns to optimize +control policies by interacting with the quantum environment. + +The RL algorithm consists of three main components: +**Agent**: The RL agent is responsible for making decisions regarding control +parameters at each time step. The agent observes the current state of the quantum +system and chooses an action (i.e., a set of control parameters) based on the current policy. +**Environment**: The environment represents the quantum system that evolves over time. +The environment is defined by the system's dynamics, which include drift and control Hamiltonians. +Each action chosen by the agent induces a response in the environment, which manifests as an +evolution of the system's state. From this, a reward can be derived. +**Reward**: The reward is a measure of how much the action chosen by the agent brings the +quantum system closer to the desired objective. In this context, the objective could be the +preparation of a specific state, state-to-state transfer, or the synthesis of a quantum gate. + +Each interaction between the agent and the environment defines a step. +A sequence of steps forms an episode.The episode ends when certain conditions, such as reaching +a specific fidelity, are met. +The reward function is a crucial component of the RL algorithm. It must be designed to +accurately reflect the objective of the quantum control problem. +The algorithm will aim to update its policy to maximize the reward obtained during the +various episodes of training. This highlights the importance of ensuring that the control +problem's objectives are well encoded in the reward function. For example, in a state-to-state +transfer problem, the reward might be based on the fidelity between the final state obtained +and the desired target state. A common choice is: +.. math:: R(s, a) = 1 - \text{infidelity}(s_{\text{final}}, s_{\text{target}}) - \text{step penalty} +Here, the step penalty is a small negative value that encourages the agent to reach the objective +in as few steps as possible. + +In QuTiP, the RL environment is modeled as a custom class derived from the gymnasium library. +This class allows defining the quantum system's dynamics at each step, the actions the agent +can take, the observation space, and so on. The RL agent can be trained using pre-existing +policies such as Proximal Policy Optimization (PPO) from the stable_baselines3 library. + + Optimal Quantum Control in QuTiP ================================ Defining a control problem with QuTiP is very easy. 
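To make the workflow described in this new section concrete, the following is a minimal, illustrative sketch and not the interface added by this patch series: a toy single-qubit state-to-state transfer environment written against the gymnasium API and trained with PPO from stable-baselines3. The class name ``StateTransferEnv``, the drift and control Hamiltonians, and all numerical settings (``dt``, the fidelity threshold, the step penalty) are assumptions chosen for illustration; the reward follows the fidelity-minus-step-penalty form described above.

.. code-block:: python

    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces
    import qutip as qt
    from stable_baselines3 import PPO


    class StateTransferEnv(gym.Env):
        """Toy single-qubit state-to-state transfer environment (illustrative only)."""

        def __init__(self, dt=0.1, max_steps=100, step_penalty=0.01):
            super().__init__()
            self.H_drift = qt.sigmaz()       # drift Hamiltonian
            self.H_ctrl = qt.sigmax()        # control Hamiltonian
            self.target = qt.basis(2, 1)     # desired target state |1>
            self.dt = dt
            self.max_steps = max_steps
            self.step_penalty = step_penalty
            # action: one bounded control amplitude per time step
            self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
            # observation: real and imaginary parts of the state vector
            self.observation_space = spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)

        def _obs(self):
            amps = self.psi.full().ravel()
            obs = np.concatenate([amps.real, amps.imag])
            return np.clip(obs, -1.0, 1.0).astype(np.float32)

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            self.psi = qt.basis(2, 0)        # start from |0> in every episode
            self.n_steps = 0
            return self._obs(), {}

        def step(self, action):
            u = float(action[0])
            # piecewise-constant control over one time step of length dt
            H = self.H_drift + u * self.H_ctrl
            self.psi = qt.sesolve(H, self.psi, [0.0, self.dt]).states[-1]
            self.n_steps += 1
            fid = qt.fidelity(self.psi, self.target)
            # reward = 1 - infidelity - step penalty = fidelity - step penalty
            reward = fid - self.step_penalty
            terminated = fid > 0.99                      # target (almost) reached
            truncated = self.n_steps >= self.max_steps   # episode length limit
            return self._obs(), reward, terminated, truncated, {}


    env = StateTransferEnv()
    model = PPO("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=10_000)

After training, ``model.predict(obs)`` returns the control amplitude that the learned policy proposes for a given observation; how the trained policy is actually exposed to users is defined by the implementation in these patches.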
From 1fd96626deedf74f9cb6b5bd08b86fd715234496 Mon Sep 17 00:00:00 2001 From: LegionAtol Date: Thu, 22 Aug 2024 16:50:45 +0200 Subject: [PATCH 2/5] rendering update --- doc/guide/guide-control.rst | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/doc/guide/guide-control.rst b/doc/guide/guide-control.rst index 6a0f9aa..2976fac 100644 --- a/doc/guide/guide-control.rst +++ b/doc/guide/guide-control.rst @@ -203,14 +203,15 @@ prior knowledge of the system, RL uses an agent that autonomously learns to opti control policies by interacting with the quantum environment. The RL algorithm consists of three main components: -**Agent**: The RL agent is responsible for making decisions regarding control + +- **Agent**: The RL agent is responsible for making decisions regarding control parameters at each time step. The agent observes the current state of the quantum system and chooses an action (i.e., a set of control parameters) based on the current policy. -**Environment**: The environment represents the quantum system that evolves over time. +- **Environment**: The environment represents the quantum system that evolves over time. The environment is defined by the system's dynamics, which include drift and control Hamiltonians. Each action chosen by the agent induces a response in the environment, which manifests as an evolution of the system's state. From this, a reward can be derived. -**Reward**: The reward is a measure of how much the action chosen by the agent brings the +- **Reward**: The reward is a measure of how much the action chosen by the agent brings the quantum system closer to the desired objective. In this context, the objective could be the preparation of a specific state, state-to-state transfer, or the synthesis of a quantum gate. @@ -224,7 +225,10 @@ various episodes of training. This highlights the importance of ensuring that th problem's objectives are well encoded in the reward function. For example, in a state-to-state transfer problem, the reward might be based on the fidelity between the final state obtained and the desired target state. A common choice is: -.. math:: R(s, a) = 1 - \text{infidelity}(s_{\text{final}}, s_{\text{target}}) - \text{step penalty} +.. math:: + + R(s, a) = 1 - \text{infidelity}(s_{\text{final}}, s_{\text{target}}) - \text{step penalty} + Here, the step penalty is a small negative value that encourages the agent to reach the objective in as few steps as possible. From 0d2df6f701f9dd05187c9d30147f339bc3f738c7 Mon Sep 17 00:00:00 2001 From: LegionAtol Date: Thu, 22 Aug 2024 17:00:17 +0200 Subject: [PATCH 3/5] rendering update --- doc/guide/guide-control.rst | 17 ++++++----------- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/doc/guide/guide-control.rst b/doc/guide/guide-control.rst index 2976fac..4212899 100644 --- a/doc/guide/guide-control.rst +++ b/doc/guide/guide-control.rst @@ -204,14 +204,14 @@ control policies by interacting with the quantum environment. The RL algorithm consists of three main components: -- **Agent**: The RL agent is responsible for making decisions regarding control +**Agent**: The RL agent is responsible for making decisions regarding control parameters at each time step. The agent observes the current state of the quantum system and chooses an action (i.e., a set of control parameters) based on the current policy. -- **Environment**: The environment represents the quantum system that evolves over time. 
+**Environment**: The environment represents the quantum system that evolves over time. The environment is defined by the system's dynamics, which include drift and control Hamiltonians. Each action chosen by the agent induces a response in the environment, which manifests as an evolution of the system's state. From this, a reward can be derived. -- **Reward**: The reward is a measure of how much the action chosen by the agent brings the +**Reward**: The reward is a measure of how much the action chosen by the agent brings the quantum system closer to the desired objective. In this context, the objective could be the preparation of a specific state, state-to-state transfer, or the synthesis of a quantum gate. @@ -223,14 +223,9 @@ accurately reflect the objective of the quantum control problem. The algorithm will aim to update its policy to maximize the reward obtained during the various episodes of training. This highlights the importance of ensuring that the control problem's objectives are well encoded in the reward function. For example, in a state-to-state -transfer problem, the reward might be based on the fidelity between the final state obtained -and the desired target state. A common choice is: -.. math:: - - R(s, a) = 1 - \text{infidelity}(s_{\text{final}}, s_{\text{target}}) - \text{step penalty} - -Here, the step penalty is a small negative value that encourages the agent to reach the objective -in as few steps as possible. +transfer problem, the reward could be based on the fidelity between the achieved final state +and the desired target state and subtract a constant penalty term. +The step penalty is a small value that encourages the agent to reach the objective in as few steps as possible. In QuTiP, the RL environment is modeled as a custom class derived from the gymnasium library. This class allows defining the quantum system's dynamics at each step, the actions the agent From 9d6fa29727d0a4f53f4dcb5548c4d6c98ef460c7 Mon Sep 17 00:00:00 2001 From: LegionAtol Date: Sun, 25 Aug 2024 14:06:13 +0200 Subject: [PATCH 4/5] small corrections --- doc/guide/guide-control.rst | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/doc/guide/guide-control.rst b/doc/guide/guide-control.rst index 4212899..a469716 100644 --- a/doc/guide/guide-control.rst +++ b/doc/guide/guide-control.rst @@ -216,21 +216,20 @@ quantum system closer to the desired objective. In this context, the objective c preparation of a specific state, state-to-state transfer, or the synthesis of a quantum gate. Each interaction between the agent and the environment defines a step. -A sequence of steps forms an episode.The episode ends when certain conditions, such as reaching +A sequence of steps forms an episode. The episode ends when certain conditions, such as reaching a specific fidelity, are met. -The reward function is a crucial component of the RL algorithm. It must be designed to -accurately reflect the objective of the quantum control problem. -The algorithm will aim to update its policy to maximize the reward obtained during the -various episodes of training. This highlights the importance of ensuring that the control -problem's objectives are well encoded in the reward function. For example, in a state-to-state -transfer problem, the reward could be based on the fidelity between the achieved final state -and the desired target state and subtract a constant penalty term. 
+The reward function is a crucial component of the RL algorithm, carefully designed to +reflect the objective of the quantum control problem. +It guides the algorithm in updating its policy to maximize the reward obtained during the various +training episodes. +For example, in a state-to-state transfer problem, the reward could be based on the fidelity +between the achieved final state and the desired target state and subtract a constant penalty term. The step penalty is a small value that encourages the agent to reach the objective in as few steps as possible. In QuTiP, the RL environment is modeled as a custom class derived from the gymnasium library. This class allows defining the quantum system's dynamics at each step, the actions the agent -can take, the observation space, and so on. The RL agent can be trained using pre-existing -policies such as Proximal Policy Optimization (PPO) from the stable_baselines3 library. +can take, the observation space, and so on. The RL agent is trained using the Proximal Policy Optimization +(PPO) algorithm from the stable-baselines3 library. Optimal Quantum Control in QuTiP From 54038092d56fab144d6282466856858387a6b668 Mon Sep 17 00:00:00 2001 From: LegionAtol Date: Mon, 9 Sep 2024 13:29:42 +0200 Subject: [PATCH 5/5] Minor text fixes --- doc/guide/guide-control.rst | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/doc/guide/guide-control.rst b/doc/guide/guide-control.rst index a469716..915b6b6 100644 --- a/doc/guide/guide-control.rst +++ b/doc/guide/guide-control.rst @@ -222,9 +222,10 @@ The reward function is a crucial component of the RL algorithm, carefully design reflect the objective of the quantum control problem. It guides the algorithm in updating its policy to maximize the reward obtained during the various training episodes. -For example, in a state-to-state transfer problem, the reward could be based on the fidelity -between the achieved final state and the desired target state and subtract a constant penalty term. -The step penalty is a small value that encourages the agent to reach the objective in as few steps as possible. +For example, in a state-to-state transfer problem, the reward is based on the fidelity +between the achieved final state and the desired target state. +In addition, a constant penalty term is subtracted in order to encourage the agent to reach the +objective in as few steps as possible. In QuTiP, the RL environment is modeled as a custom class derived from the gymnasium library. This class allows defining the quantum system's dynamics at each step, the actions the agent