Article: Implicit and explicit policy constraints for offline reinforcement learning

Title: Implicit and explicit policy constraints for offline reinforcement learning
Authors: Liu, Yang; Hofert, Jan Marius
Issue Date: 1-Apr-2024
Publisher: ML Research Press
Citation: Proceedings of Machine Learning Research, 2024, v. 236, p. 499-513
Abstract

Offline reinforcement learning (RL) aims to improve the target policy over the behavior policy based on historical data. A major problem of offline RL is the distribution shift that causes overestimation of the Q-value due to out-of-distribution actions. Most existing works focus on either behavioral cloning (BC) or maximizing Q-Learning methods to suppress the distribution shift. BC methods try to mitigate the shift by constraining the target policy to stay close to the offline data, but this makes the learned policy highly conservative. Maximizing Q-Learning methods, on the other hand, adopt a pessimism mechanism: they generate actions by maximizing the Q-value and penalize the Q-value according to the uncertainty of those actions. However, the generated actions might be arbitrary, making the predicted Q-values highly uncertain, which in turn misguides the policy when generating the next action. To alleviate the adverse effect of the distribution shift, we propose to constrain the policy implicitly and explicitly by unifying Q-Learning and behavior cloning, thereby tackling the exploration-exploitation dilemma. For the implicit constraint approach, we propose to unify the action space with generative adversarial networks dedicated to making the actions of the target policy and the behavior policy indistinguishable. For the explicit constraint approach, we propose multiple importance sampling (MIS) to learn an advantage weight for each state-action pair, which is then used to suppress or make full use of that pair. Extensive experiments on the D4RL dataset indicate that our approaches achieve superior performance. The results on the Maze2D data indicate that MIS handles heterogeneous data better than single importance sampling. We also find that MIS can stabilize the reward curve effectively.
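The implicit, GAN-style constraint described in the abstract can be pictured with a short sketch. The Python fragment below is a minimal illustration, not the authors' implementation: the network sizes, the tanh-squashed deterministic policy, and the weighting factor lambda_adv are assumptions, and both the critic's TD update and the paper's multiple-importance-sampling advantage weights are omitted. A discriminator is trained to separate dataset actions from policy actions, and the policy is trained to maximize Q while fooling the discriminator, which pulls its actions back toward the behavior distribution.

# Hedged sketch (assumed PyTorch setup; not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM = 17, 6   # placeholder sizes, e.g. a D4RL locomotion task

def mlp(in_dim, out_dim):
    return nn.Sequential(
        nn.Linear(in_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, out_dim),
    )

policy = mlp(STATE_DIM, ACTION_DIM)        # deterministic target policy pi(s)
q_net = mlp(STATE_DIM + ACTION_DIM, 1)     # critic Q(s, a); its TD update is omitted here
disc = mlp(STATE_DIM + ACTION_DIM, 1)      # discriminator D(s, a): dataset vs. policy actions

pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)

def update(states, dataset_actions, lambda_adv=1.0):
    """One illustrative update on a batch of offline (s, a) pairs."""
    # Discriminator step: label dataset actions as 1, current policy actions as 0.
    with torch.no_grad():
        policy_actions = torch.tanh(policy(states))
    real_logits = disc(torch.cat([states, dataset_actions], dim=-1))
    fake_logits = disc(torch.cat([states, policy_actions], dim=-1))
    d_loss = (
        F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
        + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    )
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Policy step: maximize Q while making the policy's actions indistinguishable
    # from dataset actions (the GAN-based implicit constraint).
    new_actions = torch.tanh(policy(states))
    q_term = -q_net(torch.cat([states, new_actions], dim=-1)).mean()
    new_logits = disc(torch.cat([states, new_actions], dim=-1))
    adv_term = F.binary_cross_entropy_with_logits(new_logits, torch.ones_like(new_logits))
    pi_loss = q_term + lambda_adv * adv_term
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()
    return d_loss.item(), pi_loss.item()

# Example call with a random placeholder batch of offline transitions.
s = torch.randn(256, STATE_DIM)
a = torch.rand(256, ACTION_DIM) * 2.0 - 1.0
update(s, a)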


Persistent Identifier: http://hdl.handle.net/10722/344214
ISSN: 2640-3498

DC Field                   Value
dc.contributor.author      Liu, Yang
dc.contributor.author      Hofert, Jan Marius
dc.date.accessioned        2024-07-16T03:41:42Z
dc.date.available          2024-07-16T03:41:42Z
dc.date.issued             2024-04-01
dc.identifier.citation     Proceedings of Machine Learning Research, 2024, v. 236, p. 499-513
dc.identifier.issn         2640-3498
dc.identifier.uri          http://hdl.handle.net/10722/344214
dc.description.abstract    (see Abstract above)
dc.language                eng
dc.publisher               ML Research Press
dc.relation.ispartof       Proceedings of Machine Learning Research
dc.rights                  This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.title                   Implicit and explicit policy constraints for offline reinforcement learning
dc.type                    Article
dc.identifier.volume       236
dc.identifier.spage        499
dc.identifier.epage        513
dc.identifier.issnl        2640-3498
