Postgraduate thesis: Policy optimization for offline reinforcement learning
Field | Value |
---|---|
Title | Policy optimization for offline reinforcement learning |
Authors | Liu, Yang (刘阳) |
Advisors | Hofert, JM |
Issue Date | 2023 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Liu, Y. [刘阳]. (2023). Policy optimization for offline reinforcement learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Offline reinforcement learning (RL) aims to learn, from previously collected data, a policy that outperforms the behavior policy that generated the data. This thesis proposes policy optimization approaches from both the model-free and the model-based perspectives.
For model-free RL, this dissertation proposes to implicitly or explicitly unify Q-value maximization and behavior cloning to tackle the exploration-exploitation dilemma. A major problem in offline RL is distribution shift: the discrepancy between the target policy and the offline data causes overestimation of the Q-value. For implicit unification, we unify the action spaces with generative adversarial networks that try to make the actions of the target policy and the behavior policy indistinguishable. For explicit unification, we propose multiple importance sampling (MIS) to learn an advantage weight for each state-action pair, which is then used to either suppress that pair or exploit it fully.
For model-based RL, this dissertation proposes policy optimization by looking ahead (POLA). Existing approaches first learn a value function from historical data and then update the policy parameters by maximizing the value function at a single time step, i.e., they try to find the optimal action at each step.
We argue that this strategy is greedy and propose to optimize the policy by looking ahead to alleviate the greediness. Concretely, we look $T$ time steps ahead and optimize the policy on both the current state and future states, where the future states are predicted by a transition model. A trajectory contains numerous actions before the agent reaches the terminal state, and performing the best action at each time step does not necessarily yield an optimal trajectory in the end. Occasionally, we need to allow sub-optimal or negative actions.
In addition, hidden confounding factors may affect the decision-making process, and the policy should account for these unobserved variables when making decisions. To that end, we incorporate the correlations among the dimensions of a state into the policy, providing the policy with more information about the environment. The augmented state, which carries this correlation information, is then fed to a diffusion policy, which is good at generating diverse actions. Empirical results on the MuJoCo environments show the effectiveness of the proposed approach.
Extensive experiments have been conducted on the D4RL dataset; the results show that our approaches achieve superior performance.
Our results on the Maze2D data indicate that MIS handles heterogeneous data better than single importance sampling. We also propose a topK loss over ensemble Q-values to alleviate training instability, and we find that the topK loss and MIS stabilize the reward curve effectively.
For POLA, when the rollout length is set properly, the performance is better than without looking ahead; the optimal rollout length can differ across tasks. Incorporating the correlation information improves the convergence rate of the reward curve. (Illustrative code sketches of these ideas follow this table.) |
Degree | Master of Philosophy |
Subject | Reinforcement learning - Mathematical models |
Dept/Program | Statistics and Actuarial Science |
Persistent Identifier | http://hdl.handle.net/10722/335944 |
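The abstract's implicit unification of Q-value maximization and behavior cloning via generative adversarial networks can be illustrated with a rough, hedged sketch. Everything below is an assumption made for illustration only: the network architectures, the `gan_unified_update` helper, and the coefficient `alpha` are placeholders, not the thesis's actual design.

```python
import torch
import torch.nn as nn

# Placeholder networks; the thesis's actual architectures are not specified here.
state_dim, action_dim = 17, 6
policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, action_dim), nn.Tanh())
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, 1))
disc = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, 1))

pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def gan_unified_update(states, dataset_actions, alpha=1.0):
    """One illustrative update: the discriminator separates dataset actions from
    policy actions, while the policy maximizes Q and tries to fool the
    discriminator, keeping its actions close to the behavior policy's support."""
    # --- discriminator step ---
    with torch.no_grad():
        fake_actions = policy(states)
    real_logits = disc(torch.cat([states, dataset_actions], dim=-1))
    fake_logits = disc(torch.cat([states, fake_actions], dim=-1))
    d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits, torch.zeros_like(fake_logits))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- policy step: maximize Q while making policy actions indistinguishable ---
    actions = policy(states)
    q_val = q_net(torch.cat([states, actions], dim=-1)).mean()
    fool = bce(disc(torch.cat([states, actions], dim=-1)), torch.ones_like(real_logits))
    pi_loss = -q_val + alpha * fool
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    return d_loss.item(), pi_loss.item()

# Example with random tensors standing in for an offline batch:
s = torch.randn(32, state_dim)
a = torch.rand(32, action_dim) * 2 - 1
gan_unified_update(s, a)
```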
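For the model-based POLA idea, looking $T$ time steps ahead and optimizing the policy on both the current and the model-predicted future states could be sketched as follows. This is a minimal sketch under assumptions: `dynamics`, `q_net`, and the default `horizon` are invented placeholders, and the thesis's actual objective may differ.

```python
import torch
import torch.nn as nn

def pola_policy_loss(policy, q_net, dynamics, states, horizon=3):
    """Illustrative lookahead objective: accumulate (negative) Q-values of the
    policy's actions on the current state and on `horizon` model-predicted
    future states, so the update is not greedy with respect to a single step."""
    loss = 0.0
    s = states
    for _ in range(horizon + 1):
        a = policy(s)
        loss = loss - q_net(torch.cat([s, a], dim=-1)).mean()
        s = dynamics(torch.cat([s, a], dim=-1))  # predicted next states
    return loss / (horizon + 1)

# Toy usage with random networks standing in for trained components:
state_dim, action_dim = 17, 6
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))
loss = pola_policy_loss(policy, q_net, dynamics, torch.randn(32, state_dim), horizon=3)
loss.backward()
```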
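The correlation-augmented state fed to the diffusion policy is described only at a high level in the abstract. The sketch below shows one plausible construction, given purely as an assumption: the `correlation_augment` helper and the choice to append a correlation-weighted view of each state are illustrative, not the thesis's actual method.

```python
import torch

def correlation_augment(dataset_states, batch_states):
    """Illustrative augmentation: estimate the correlation matrix among state
    dimensions from the offline dataset and append a correlation-weighted view
    of each state, so the policy sees cross-dimension structure explicitly."""
    # torch.corrcoef expects variables in rows, observations in columns.
    corr = torch.corrcoef(dataset_states.T)          # (state_dim, state_dim)
    corr = torch.nan_to_num(corr, nan=0.0)           # guard constant dimensions
    mixed = batch_states @ corr.T                    # correlation-weighted features
    return torch.cat([batch_states, mixed], dim=-1)  # (batch, 2 * state_dim)

# The augmented states would then replace the raw states at the policy's input,
# e.g. diffusion_policy(correlation_augment(all_states, batch)).
dataset = torch.randn(10_000, 17)
batch = dataset[:32]
aug = correlation_augment(dataset, batch)
print(aug.shape)  # torch.Size([32, 34])
```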
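The abstract does not define the topK loss for ensemble Q-values. One possible reading, offered only as an assumption, aggregates the K most pessimistic ensemble estimates instead of taking the single minimum, which may reduce variance in the TD target; the `topk_target` helper below is a placeholder for whatever the thesis actually proposes.

```python
import torch

def topk_target(ensemble_q_values, k=2):
    """Illustrative aggregation: average the k smallest Q-estimates across an
    ensemble, a middle ground between the overly pessimistic min and the
    overestimation-prone mean."""
    # ensemble_q_values: (num_ensemble, batch)
    smallest, _ = torch.topk(ensemble_q_values, k, dim=0, largest=False)
    return smallest.mean(dim=0)  # (batch,)

q_preds = torch.randn(5, 32)              # 5 ensemble members, batch of 32
pessimistic_q = topk_target(q_preds, k=2)
# This aggregate could then enter a TD target, e.g.
# td_target = rewards + gamma * topk_target(next_q_preds, k=2)
```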
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Hofert, JM | - |
dc.contributor.author | Liu, Yang | - |
dc.contributor.author | 刘阳 | - |
dc.date.accessioned | 2023-12-29T04:05:03Z | - |
dc.date.available | 2023-12-29T04:05:03Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Liu, Y. [刘阳]. (2023). Policy optimization for offline reinforcement learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/335944 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Reinforcement learning - Mathematical models | - |
dc.title | Policy optimization for offline reinforcement learning | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Master of Philosophy | - |
dc.description.thesislevel | Master | - |
dc.description.thesisdiscipline | Statistics and Actuarial Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044751041503414 | - |