Conversation
Created Hybrid distr, populated HybridDirstribution and forward method (tbc) Co-authored-by: simrey <simonereynoso@gmail.com>
collect_rollouts overrides the one of PPO (in turn inherited by OnPolicyAlgorithm). It requires to implement new environment and rollout buffer, as they need to work with multiple actions (discrete and continuous). Co-authored-by: simrey <simonereynoso@gmail.com>
Calls super-method as done in PPO Co-authored-by: simrey <simonereynoso@gmail.com>
Co-authored-by: simrey <simonereynoso@gmail.com>
- Added evaluate_actions in HybridActorCriticPolicy - Completed log_prob and entropy of HybridDistribution Co-authored-by: simrey <simonereynoso@gmail.com>
|
Related https://github.com/adysonmaia/sb3-plus (I rediscovered it recently) |
RolloutBuffer for hybrid actions. HybridPPO.train() not adapted yet Co-authored-by: simrey <simonereynoso@gmail.com>
Co-authored-by: simrey <simonereynoso@gmail.com>
Co-authored-by: simrey <simonereynoso@gmail.com>
Plus fix some imports Co-authored-by: simrey <simonereynoso@gmail.com>
…te with the library PPO and the base algorithm do some validation on the action space. If we want our HybridPPO to be subclass of PPO, we need this wrapper for the integration with the library. Co-authored-by: simrey <simonereynoso@gmail.com>
Co-authored-by: simrey <simonereynoso@gmail.com>
This reverts commit 392e60f.
… integrate with the library" This reverts commit 365bdb9.
|
Hi @araffin, I have a little problem here: Our class
AlternativesNow, we have a couple of alternatives:
My opinionAlternative 1 is better. It adds flexibility without overhead (the parameter is optional). Regardless of how the code in this PR evolves, this change to PPO's parameters would be barely noticeable to sb3's users. Action itemsIf you want to proceed with alternative 1, I'd be happy to open a PR in sb3 for it (including creating an issue to discuss the change beforehand, if necessary). Let me know what you think so we can be unblocked 😄 |
This is fine, I don't think there should be too many duplicated code because of that, no? |
Ok, done in the latest commit. |
Co-authored-by: simrey <simonereynoso@gmail.com>
Instead of using the one of the superclass. Needed because the features extractor may be shared or not. Same mechanism of PPO, which also re-implemented this method. Co-authored-by: simrey <simonereynoso@gmail.com>
Refactor HybridDistribution log_prob method to accept discrete and continuous actions. Co-authored-by: simrey <simonereynoso@gmail.com>
Plus updated reward in CatchingPoint env (for initial tests) Co-authored-by: simrey <simonereynoso@gmail.com>
Implementation of Hybrid PPO
Description
Closes #202
(Description will follow)
Context
Types of changes
Checklist:
make format(required)make check-codestyleandmake lint(required)make pytestandmake typeboth pass. (required)Note: we are using a maximum length of 127 characters per line