Postgraduate thesis: Policy optimization for offline reinforcement learning
Field | Value |
---|---|
Title | Policy optimization for offline reinforcement learning |
Authors | Liu, Yang (刘阳) |
Advisors | Hofert, JM |
Issue Date | 2023 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Liu, Y. [刘阳]. (2023). Policy optimization for offline reinforcement learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Offline reinforcement learning (RL) aims to learn, from previously collected data, a policy that outperforms the behavior policy that generated the data. This thesis proposes policy optimization approaches from both the model-free and the model-based perspectives.
For model-free RL, this dissertation proposes to implicitly or explicitly unify Q-value maximization and behavior cloning to tackle the exploration-exploitation dilemma. A major problem in offline RL is distribution shift: the discrepancy between the target policy and the offline data causes overestimation of the Q-value. For implicit unification, we unify the action spaces with generative adversarial networks that try to make the actions of the target policy and the behavior policy indistinguishable. For explicit unification, we propose multiple importance sampling (MIS) to learn an advantage weight for each state-action pair, which is then used to either suppress that pair or exploit it fully.
For model-based RL, this dissertation proposes policy optimization by looking ahead (POLA). Existing approaches first learn a value function from historical data and then update the policy parameters by maximizing the value function at a single time step, i.e., they try to find the optimal action at each step.
We argue that this strategy is greedy and propose to optimize the policy by looking ahead to alleviate the greediness. Concretely, we look $T$ time steps ahead and optimize the policy on both the current state and future states, where the future states are predicted by a transition model. A trajectory contains numerous actions before the agent reaches the terminal state, and performing the best action at each time step does not necessarily yield an optimal trajectory in the end. Occasionally, we need to allow sub-optimal or negative actions.
In addition, hidden confounding factors may affect the decision-making process, and the policy should account for these unobserved variables when making decisions. To that end, we incorporate the correlations among the dimensions of a state into the policy, providing the policy with more information about the environment. The augmented state, which carries this correlation information, is then fed to a diffusion policy, which is good at generating diverse actions. Empirical results on the MuJoCo environments show the effectiveness of the proposed approach.
Extensive experiments have been conducted on the D4RL dataset; the results show that our approaches achieve superior performance.
Our results on the Maze2D data indicate that MIS handles heterogeneous data better than single importance sampling. We also propose a topK loss over ensemble Q-values to alleviate training instability, and we find that the topK loss and MIS stabilize the reward curve effectively.
For POLA, when the rollout length is set properly, the performance is better than without looking ahead; the optimal rollout length can differ across tasks. Incorporating the correlation information improves the convergence rate of the reward curve. (Illustrative code sketches of these ideas follow this table.) |
Degree | Master of Philosophy |
Subject | Reinforcement learning - Mathematical models |
Dept/Program | Statistics and Actuarial Science |
Persistent Identifier | http://hdl.handle.net/10722/335944 |
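The abstract's implicit unification of Q-value maximization and behavior cloning via generative adversarial networks can be illustrated with a rough, hedged sketch. Everything below is an assumption made for illustration only: the network architectures, the `gan_unified_update` helper, and the coefficient `alpha` are placeholders, not the thesis's actual design.

```python
import torch
import torch.nn as nn

# Placeholder networks; the thesis's actual architectures are not specified here.
state_dim, action_dim = 17, 6
policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, action_dim), nn.Tanh())
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, 1))
disc = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, 1))

pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def gan_unified_update(states, dataset_actions, alpha=1.0):
    """One illustrative update: the discriminator separates dataset actions from
    policy actions, while the policy maximizes Q and tries to fool the
    discriminator, keeping its actions close to the behavior policy's support."""
    # --- discriminator step ---
    with torch.no_grad():
        fake_actions = policy(states)
    real_logits = disc(torch.cat([states, dataset_actions], dim=-1))
    fake_logits = disc(torch.cat([states, fake_actions], dim=-1))
    d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits, torch.zeros_like(fake_logits))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- policy step: maximize Q while making policy actions indistinguishable ---
    actions = policy(states)
    q_val = q_net(torch.cat([states, actions], dim=-1)).mean()
    fool = bce(disc(torch.cat([states, actions], dim=-1)), torch.ones_like(real_logits))
    pi_loss = -q_val + alpha * fool
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    return d_loss.item(), pi_loss.item()

# Example with random tensors standing in for an offline batch:
s = torch.randn(32, state_dim)
a = torch.rand(32, action_dim) * 2 - 1
gan_unified_update(s, a)
```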
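For the model-based POLA idea, looking $T$ time steps ahead and optimizing the policy on both the current and the model-predicted future states could be sketched as follows. This is a minimal sketch under assumptions: `dynamics`, `q_net`, and the default `horizon` are invented placeholders, and the thesis's actual objective may differ.

```python
import torch
import torch.nn as nn

def pola_policy_loss(policy, q_net, dynamics, states, horizon=3):
    """Illustrative lookahead objective: accumulate (negative) Q-values of the
    policy's actions on the current state and on `horizon` model-predicted
    future states, so the update is not greedy with respect to a single step."""
    loss = 0.0
    s = states
    for _ in range(horizon + 1):
        a = policy(s)
        loss = loss - q_net(torch.cat([s, a], dim=-1)).mean()
        s = dynamics(torch.cat([s, a], dim=-1))  # predicted next states
    return loss / (horizon + 1)

# Toy usage with random networks standing in for trained components:
state_dim, action_dim = 17, 6
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))
loss = pola_policy_loss(policy, q_net, dynamics, torch.randn(32, state_dim), horizon=3)
loss.backward()
```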
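The correlation-augmented state fed to the diffusion policy is described only at a high level in the abstract. The sketch below shows one plausible construction, given purely as an assumption: the `correlation_augment` helper and the choice to append a correlation-weighted view of each state are illustrative, not the thesis's actual method.

```python
import torch

def correlation_augment(dataset_states, batch_states):
    """Illustrative augmentation: estimate the correlation matrix among state
    dimensions from the offline dataset and append a correlation-weighted view
    of each state, so the policy sees cross-dimension structure explicitly."""
    # torch.corrcoef expects variables in rows, observations in columns.
    corr = torch.corrcoef(dataset_states.T)          # (state_dim, state_dim)
    corr = torch.nan_to_num(corr, nan=0.0)           # guard constant dimensions
    mixed = batch_states @ corr.T                    # correlation-weighted features
    return torch.cat([batch_states, mixed], dim=-1)  # (batch, 2 * state_dim)

# The augmented states would then replace the raw states at the policy's input,
# e.g. diffusion_policy(correlation_augment(all_states, batch)).
dataset = torch.randn(10_000, 17)
batch = dataset[:32]
aug = correlation_augment(dataset, batch)
print(aug.shape)  # torch.Size([32, 34])
```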
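The abstract does not define the topK loss for ensemble Q-values. One possible reading, offered only as an assumption, aggregates the K most pessimistic ensemble estimates instead of taking the single minimum, which may reduce variance in the TD target; the `topk_target` helper below is a placeholder for whatever the thesis actually proposes.

```python
import torch

def topk_target(ensemble_q_values, k=2):
    """Illustrative aggregation: average the k smallest Q-estimates across an
    ensemble, a middle ground between the overly pessimistic min and the
    overestimation-prone mean."""
    # ensemble_q_values: (num_ensemble, batch)
    smallest, _ = torch.topk(ensemble_q_values, k, dim=0, largest=False)
    return smallest.mean(dim=0)  # (batch,)

q_preds = torch.randn(5, 32)              # 5 ensemble members, batch of 32
pessimistic_q = topk_target(q_preds, k=2)
# This aggregate could then enter a TD target, e.g.
# td_target = rewards + gamma * topk_target(next_q_preds, k=2)
```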
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Hofert, JM | - |
dc.contributor.author | Liu, Yang | - |
dc.contributor.author | 刘阳 | - |
dc.date.accessioned | 2023-12-29T04:05:03Z | - |
dc.date.available | 2023-12-29T04:05:03Z | - |
dc.date.issued | 2023 | - |
dc.identifier.citation | Liu, Y. [刘阳]. (2023). Policy optimization for offline reinforcement learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/335944 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Reinforcement learning - Mathematical models | - |
dc.title | Policy optimization for offline reinforcement learning | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Master of Philosophy | - |
dc.description.thesislevel | Master | - |
dc.description.thesisdiscipline | Statistics and Actuarial Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044751041503414 | - |