Machine Learning Glossary: What are model training steps?

There may be many possible models for a given problem. Depending on your modeling decision, there are usually two ways to complete the machine learning lifecycle.

  • 1st scenario. Training a single model with a training dataset and final evaluation with the test set.

  • 2nd scenario. Training multiple models with training/validation dataset and final evaluation with the test set.

In the first scenario, you follow this approach:

  • Divide the data into training and test sets (usually a 70/30 split).

  • Select your preferred model.

  • Train it on the training dataset.

  • Assess the trained model on the test set (no separate validation step is needed).
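
A minimal sketch of this single-model workflow, using scikit-learn on a built-in toy dataset purely for illustration (the dataset and model choice are assumptions, not part of the original text):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 1. Divide the data into training and test sets (70/30 split).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Select a model.
model = LogisticRegression(max_iter=1000)

# 3. Train it on the training set.
model.fit(X_train, y_train)

# 4. Assess the trained model on the test set.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```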

In the second scenario, you follow this approach:

  • Divide the data into training, validation, and test sets (usually a 50/25/25 split).

  • Select the initial model/architecture.

  • Train the model on the training dataset.

  • Evaluate the model on the validation dataset.

  • Repeat the previous three steps (selection, training, validation) for different models or training parameters.

  • Select the best model based on the validation results and retrain it on the combined (training + validation) data.

  • Assess the trained model on the test set.
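
A small sketch of this validation-based model selection, again with scikit-learn and illustrative candidate models (the dataset, models, and split seeds are assumptions made for the example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 1. Divide the data into training, validation, and test sets (50/25/25).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 2-5. Train candidate models and evaluate each on the validation set.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=3),
}
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

# 6. Select the best model and retrain it on training + validation data.
best_name = max(val_scores, key=val_scores.get)
best_model = candidates[best_name]
best_model.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))

# 7. Final assessment on the held-out test set.
print(best_name, "test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))
```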

Machine Learning Glossary: What is model training in machine learning?

A machine learning model is represented by its parameters, which are the learnable quantities. Learning happens when these parameters are updated with suitable values so that the model can solve the given task. Training is the process of feeding a training dataset to the model. The training process uses an objective function (for example, MSE) to obtain feedback in each iteration. Since we are trying to improve the model's accuracy on a given input and lower the error between the model's prediction and the actual output, the training process is also called model optimization.
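
As a toy illustration of this loop (not tied to any specific library), here is a sketch that fits a single learnable weight by gradient descent on an MSE objective; the data and hyperparameters are made up:

```python
import numpy as np

# Toy data: y = 3x + noise. The "model" is a single learnable weight w.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

w = 0.0     # learnable parameter, initialized arbitrarily
lr = 0.1    # learning rate

for step in range(50):
    y_pred = w * x
    mse = np.mean((y_pred - y) ** 2)        # objective function (MSE) gives feedback
    grad = np.mean(2 * (y_pred - y) * x)    # gradient of MSE w.r.t. w
    w -= lr * grad                          # update the parameter to reduce the error
    if step % 10 == 0:
        print(f"step {step:2d}  mse={mse:.4f}  w={w:.3f}")
```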

Machine Learning Glossary: What is machine learning?

In machine learning, the learning process consists of understanding and extracting hidden patterns or features from data. Instead of relying on explicit logic supplied by people, machine learning has the capacity to learn from experience. Conventional systems are built with well-defined, human-set rules. Machine learning algorithms, in order to understand complicated patterns in the inputs (x), use the outputs (y) as a feedback signal. Thus, the ML system's final product is an intelligent program.

We often use a logical method to solve a problem: we break the task up into several smaller tasks and solve each one with a distinct piece of logic. When dealing with extremely complicated tasks, like stock price prediction, the patterns are always changing, which affects the results. That implies that, to solve such a problem logically, we would have to adjust our handwritten logic for each new change in the outputs. Machine Learning (ML), on the other hand, builds the model from a vast amount of data. The data gives the model all of its historical experience, which helps it better understand the pattern. Whenever the data changes, we simply retrain the model with fresh instances.

Paper Summary : Playing Atari With Deep Reinforcement Learning

Motivation

Deep Learning (DL) has proven to work well when we have large amounts of data. Unlike the supervised DL setup, Reinforcement Learning (RL) does not have direct access to targets/labels. An RL agent usually gets "delayed and sparse" rewards as the signal for understanding the environment and learning a policy for it. Another challenge is the distribution of the inputs. In supervised learning, each batch in the training loop is drawn randomly, which keeps the samples (approximately) independent so that the parameter updates do not overfit to some specific direction/class in the data. In RL, inputs are usually correlated. For example, when you collect image frames from a video game, the pixel content changes little from frame to frame; many samples look alike, which can lead to poor learning and a locally optimal solution. Another problem is the non-stationarity of the target: the target keeps changing across episodes as the agent learns new behaviour from the environment and adapts.

Contribution

The authors propose the 'Deep Q-Network' (DQN) learning algorithm with experience replay. This approach addresses both the correlated-inputs and the non-stationarity problems.

They use a CNN with a variant of the Q-learning algorithm and train it with stochastic gradient descent (SGD). They maintain a buffer, called 'experience replay', of the transitions observed while the agent navigates the environment. During SGD training, samples from this buffer are drawn to form mini-batches for training the network. This network is referred to as the Q-network, with parameters $ \theta $, and it minimizes a sequence of loss functions $ L_i(\theta_i) $:

$ L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)} \left[ (y_i - Q(s, a; \theta_i))^2 \right] $

Where $ y_i = \mathbb{E}_{s' \sim \varepsilon} \left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \right] $

is the target for iteration i.

They use the parameter values from the previous iteration ($ \theta_{i-1} $) to compute the target ($ y_i $). These target parameters are held fixed while $ L_i(\theta_i) $ is optimized, which keeps the target (approximately) stationary and makes training smoother. They also feed a concatenation of four video frames as input to the CNN to mitigate partial observability; with four frames, the CNN can infer the movement direction and speed of objects in the scene.

DQN is trained on Atari 2600 games. The video frames from the emulator are the observations, produced in response to the discrete actions (up, down, left, right, ...) of the agent in the environment. The network consists of two convolutional layers and two fully connected layers; the last layer outputs one Q-value per possible action.
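
The following is a minimal sketch (not the authors' code) of the two mechanisms described above: an experience-replay buffer of transitions and the target $ y_i $ computed with frozen parameters. The linear `q_values` function, the state dimensions, and the random transitions are placeholders for the CNN and the real emulator.

```python
import random
from collections import deque
import numpy as np

N_ACTIONS, STATE_DIM, GAMMA = 4, 8, 0.99

def q_values(state, params):
    """Toy stand-in for the CNN Q-network: one Q-value per action."""
    return state @ params

params = np.zeros((STATE_DIM, N_ACTIONS))   # theta_i (being trained)
target_params = params.copy()               # theta_{i-1} (held fixed for the targets)
replay = deque(maxlen=10_000)               # experience replay buffer

# Store transitions (s, a, r, s') while the agent navigates the environment.
for _ in range(1000):
    s, s_next = np.random.randn(STATE_DIM), np.random.randn(STATE_DIM)
    a, r = random.randrange(N_ACTIONS), np.random.randn()
    replay.append((s, a, r, s_next))

# Sample a mini-batch uniformly from the buffer and form the targets y_i.
batch = random.sample(list(replay), 32)
for s, a, r, s_next in batch:
    y = r + GAMMA * np.max(q_values(s_next, target_params))  # target uses frozen params
    td_error = y - q_values(s, params)[a]                     # drives the SGD update of params
```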


Paper Summary : Policy Gradient Methods for Reinforcement Learning with Function Approximation

Motivation

Reinforcement Learning (RL) addresses the problem of learning by experimenting in (dynamic) environments. The learner's objective is to find an optimal policy that guides the agent's navigation. This optimal policy is formulated in terms of maximizing the agent's future reward. The value function $ V_{\pi} (s) $ and the action-value function $ Q_{\pi}(s,a) $ are measures of potential future rewards.

  • $ V_{\pi} (s) $ : a measure of how good it is to be in state s and then follow policy $ \pi $
  • $ Q_{\pi}(s,a) $ : a measure of how good it is to be in state s, perform action a, and then follow policy $ \pi $

NOTE

  • Both $ V_{\pi} (s) $ and $ Q_{\pi}(s,a) $ are defined as expectations of the discounted future rewards, and their values are maintained in a lookup table.
  • Goal : We want to find the value of (state) or (state,action) in a given environment, so that the agent can follow an optimal path, collecting maximum rewards.

In large-scale RL problems, maintaining a lookup table leads to the 'curse of dimensionality'. This problem is commonly addressed with function approximation, which generalizes the estimate of the state value or state-action value from a set of features of the given state/observation. Most existing approaches approximate the value function and then derive a policy from it. The authors point out two major limitations of this approach:

a. This approach is oriented towards finding a deterministic policy, which may not be appropriate for complex problems/environments.

b. A small variation in the value estimate can cause a different action to be selected; the derived policy is sensitive to estimation error.

Contribution

The authors propose an alternative: approximate the policy directly with a parameterized function. We no longer store Q-values in a table; instead, the policy itself is learned by a function approximator. For example, the policy can be represented by a neural network (NN) that takes the state as input and outputs a probability distribution for action selection. With $ \theta $ as the parameters of the NN representing the policy and $ \rho $ as its performance measure, the parameter $ \theta $ is updated as:

$ \theta_{t+1} \gets \theta_t + \alpha \frac{\partial {\rho}}{ \partial{\theta}} $

Policy Gradient Theorem:

For any Markov Decision Process (MDP),

$ \nabla_{\theta} J(\theta) = \frac{\partial \rho(\pi)}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta} Q^{\pi}(s,a) $ ----------(a)

Here $ \rho(\pi) $ is the average reward under the current policy $ \pi $, and $ d^{\pi}(s) $ is the stationary distribution of states under $ \pi $.

  • The problem with the above formulation is 'how to get Q(s,a) ?' -> Q(s,a) must be estimated.

Note that the gradient expression does not involve the derivative of the state distribution with respect to the policy parameter $ \theta $. Since the gradient does not require knowledge of the MDP dynamics, it allows model-free learning in RL. Estimating the policy gradient with Monte-Carlo sampling gives the REINFORCE algorithm.

In Monte-Carlo sampling, we take N trajectories using the current policy $ \pi $ and collect the returns. However, these returns have high variance, and we may need many episodes for smooth convergence. The variance arises because we cannot collect the same trajectories multiple times (the agent's movement is itself stochastic) when using a stochastic policy in a stochastic environment.
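
As an illustration of this Monte-Carlo estimate (not taken from the paper), here is a small sketch with a tabular softmax policy on a hypothetical 5-state corridor environment; the environment, dimensions, and hyperparameters are invented for the example.

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA, ALPHA = 5, 2, 0.99, 0.1
theta = np.zeros((N_STATES, N_ACTIONS))   # tabular softmax policy parameters

def policy(state):
    prefs = theta[state]
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def run_episode(rng):
    """Corridor: action 1 moves right, action 0 moves left; +1 reward at the last state."""
    state, traj = 0, []
    for _ in range(20):
        action = rng.choice(N_ACTIONS, p=policy(state))
        next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        traj.append((state, action, reward))
        state = next_state
        if reward > 0:
            break
    return traj

rng = np.random.default_rng(0)
for it in range(200):
    grad = np.zeros_like(theta)
    for _ in range(10):                         # N sampled trajectories
        G = 0.0
        for state, action, reward in reversed(run_episode(rng)):
            G = reward + GAMMA * G              # Monte-Carlo return
            grad_log = -policy(state)           # grad log pi for softmax: one-hot(a) - pi(.|s)
            grad_log[action] += 1.0
            grad[state] += grad_log * G
    theta += ALPHA * grad / 10                  # noisy (high-variance) gradient estimate
```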

  • QUESTION : How to estimate Q-value in equation (a) ?

The authors use a function approximator $ f_w(s,a) $ with parameters $ w $ to estimate $ Q^{\pi}(s,a) $:

$ \nabla_{\theta} J(\theta) = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta} f_w(s,a) $ --------- (b)

Here $ f_w(s,a) $ is learned by following $ \pi $ and updating $ w $ to minimize the mean-squared error between Q-values, $ [ Q^{\pi}(s,a) - f_w(s,a) ]^2 $. In practice, $ Q^{\pi}(s,a) $ is estimated from the returns observed as the agent acts in the environment, while the critic predicts $ f_w(s,a) $ for the same state/action; the algorithm keeps the difference between these two as small as possible.

The resulting formulation (b) gives the idea of actor-critic architecture for RL where

i. $ \pi(s,a) $ is the actor, which learns to approximate the policy by maximizing (b).

ii. The critic $ f_w(s,a) $ learns to estimate the action-value function by minimizing the MSE between estimated and observed Q-values.
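
Below is a compact sketch (not from the paper) of this actor-critic formulation: a tabular softmax actor updated along (b), and a tabular critic $ f_w(s,a) $ fitted by MSE against sampled returns. The tiny corridor environment and all hyperparameters are placeholders.

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA, ALPHA_PI, ALPHA_W = 5, 2, 0.99, 0.05, 0.1
theta = np.zeros((N_STATES, N_ACTIONS))   # actor parameters
w = np.zeros((N_STATES, N_ACTIONS))       # critic parameters f_w(s, a)

def pi(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

rng = np.random.default_rng(1)
for episode in range(300):
    s, transitions = 0, []
    for _ in range(20):
        a = rng.choice(N_ACTIONS, p=pi(s))
        s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        transitions.append((s, a, r))
        s = s_next
        if r > 0:
            break
    G = 0.0
    for s_t, a_t, r_t in reversed(transitions):
        G = r_t + GAMMA * G
        w[s_t, a_t] += ALPHA_W * (G - w[s_t, a_t])       # critic: minimize (Q_est - f_w)^2
        grad_log = -pi(s_t)                              # grad log pi for a softmax policy
        grad_log[a_t] += 1.0
        theta[s_t] += ALPHA_PI * grad_log * w[s_t, a_t]  # actor: ascend (b) using f_w
```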


Paper Summary : Proximal Policy Optimization Algorithms

Motivation

Deep Q-learning, 'vanilla' policy gradient, and REINFORCE are examples of function-approximation approaches in RL. In RL, robustness and sample efficiency are the measures that define the effectiveness of an algorithm.

In the RL formulation, the agent needs to solve a task in an environment. The agent continuously interacts with the environment, which provides rewards, and thereby learns a policy to navigate and tackle the problem. At every time step, the RL agent has to make a decision by selecting a preferable action. To do so, the agent relies on the information of the current state and the knowledge (history of rewards) accumulated up to the current time step. Once the action is performed, the next state/observation is determined by some (stochastic) transition probability model, and a reward is signaled to the agent based on the new state and the action performed to get there. Overall, the goal of the agent is to maximize the expected cumulative reward.

In RL terms, this goal can be formulated as finding a policy $ \pi(a_t | s_t) $ such that the expected return $ \mathbb{E}_{\pi_{\theta}, \tau} [G_t] $ is maximized.

In high-dimensional or continuous action spaces, policy gradient methods can be used to solve this problem. In the "vanilla" policy gradient method, the policy is parameterized by some parameter $ \theta $, i.e. a parametric policy $ \pi_{\theta}(a_t | s_t) $, and we optimize the policy directly by finding the optimal parameter $ \theta $.

Even though the 'vanilla' policy gradient / REINFORCE methods are simple and easy to implement, they come with some learning issues:

PROBLEM : They usually give rise to high variance when estimating the gradient. This is because the objective function

$ J(\theta) = \mathbb{E}_{\pi_{\theta}, \tau} [G_t] $

contains an expectation, so we cannot compute the exact gradient directly. We use stochastic gradient estimates such as REINFORCE, based on a batch of trajectories. This sampling approximation adds variance, which means we need a large number of trajectories to get a good estimate. In addition, collecting trajectories can itself be a problem in complex environments, where it may take a long time.

Contribution

The authors introduce a family of policy optimization methods built on the ideas of Trust-Region Policy Optimization (TRPO). Two main ideas:

a. Clipped surrogate objective function : it avoids large deviations of the learned policy $ \pi_{\theta} $ from the old policy $ \pi_{\theta old} $, and is formulated as:

$ L^{clip}(\theta) = \mathbb{E}_t [ \min (r_t(\theta) \hat{A_t}, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A_t}) ] $

Here $ r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta old}(a_t | s_t)} $

$ \epsilon $ is a hyperparameter that restricts the new policy from moving too far from the old policy. $ \hat{A_t} $ can be the discounted return or an advantage estimate.

The clipping ensures the updates stay within a "trust region" and introduces less variance than vanilla policy gradient methods.
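
A small sketch of this clipped surrogate, assuming per-timestep arrays of log-probabilities and advantages collected under the old policy (the function and variable names are invented for illustration):

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    ratio = np.exp(logp_new - logp_old)                          # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))               # L^clip, to be maximized

# Example: a ratio of 1.5 with a positive advantage of 2.0 is capped at (1 + eps) * 2.0.
print(clipped_surrogate(np.log([1.5]), np.log([1.0]), np.array([2.0])))   # -> 2.4
```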

b. Multiple Epochs for policy update:

PPO allows running multiple epochs of optimization on the same trajectories with the objective $ L^{clip}(\theta) $, which reduces sample inefficiency during learning. To collect data, PPO runs the policy with parallel actors and then samples mini-batches of the data to train for K epochs using the objective function above.

Let's observe the behaviour of the objective function based on changes in advantage function.

CASE I : When $ \hat{A_t} $ is +ve :

The objective function can be written as :

$ L^{clip} (\theta) = min ( r_t(\theta), 1 + \epsilon ) \hat{A_t} $

Since $ \hat{A_t} $ is positive, the objective increases when the action becomes more likely (i.e. $ \pi_{\theta}(a_t | s_t) $ increases). The min operator limits this growth: when $ \pi_{\theta}(a_t | s_t) > (1 + \epsilon) \pi_{\theta old}(a_t | s_t) $, the objective is capped at $ (1 + \epsilon) \hat{A_t} $.

CASE II : When $ \hat{A_t} $ is -ve :

The objective function can be written as :

$ L^{clip} (\theta) = max ( r_t(\theta), 1 - \epsilon ) \hat{A_t} $

Now the objective only increases when the action becomes less likely (i.e. if $ \pi_{\theta}(a_t | s_t) $ decreases, $ L^{clip}(\theta) $ increases). In this case, when $ \pi_{\theta}(a_t | s_t) < (1 - \epsilon) \pi_{\theta old}(a_t | s_t) $, the max operator limits the value at $ (1 - \epsilon) \hat{A_t} $.


Paper Summary : Learning What Data to Learn


Motivation

The performance of Machine Learning and Deep Learning algorithms relies on the amount of training data. Having more data points also has the benefit of learning more generalized models and avoiding overfitting. However, collecting data is painstaking work. Instead, we can learn automatic and adaptive data selection during the training process and make learning faster with fewer data points.

Contribution

In this paper, the authors introduce the Neural Data Filter (NDF), an adaptive framework that learns a data selection policy using the deep reinforcement learning (DRL) algorithm 'policy gradient'. Two important aspects of this framework:

a. NDF filters data instances from randomly fetched mini-batches during the training process.

b. The training loop provides feedback to the NDF policy via a reward signal (e.g. computed on a validation set), and the NDF policy is trained using DRL.

NDF in detail

NDF is designed to filter out a portion of the training data based on a quality measure; the retained high-quality data points speed up the convergence of the model.

To formulate the Markov Decision Process (MDP) in NDF, the authors use an 'SGD-MDP' with the tuple < s, a, P, r, $ \gamma $ >:

  • s : the state, representing the mini-batch data and the current state of the model being trained (weights/biases)
  • a : binary filtering actions; $ a = {\{a_m\}}_{m=1}^M \in \{0, 1\}^M $, where M is the mini-batch size and $ a_m \in \{0,1\} $ indicates whether a particular data instance in the mini-batch is selected or not.
  • P : $ P(s' | s, a) $ is the transition probability
  • r = r(s,a), reward signal based on performance of the current model under consideration (e.g. validation accuracy),
  • $ \gamma \in [0,1] $, discounting factor

The NDF policy $ A(s, a; \Theta) $ can be represented by a binary classifier such as logistic regression or a deep NN, where $ \Theta $ is the policy parameter, updated as:

$ \Theta \gets \Theta + \alpha V_t \sum_m \frac{\partial \log P_{\Theta} (a_m|s_m)}{\partial \Theta} $

where $ V_t $ is the sampled estimate of the reward $ R(s_t, a_t) $ from one episode.
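
An illustrative sketch of one NDF step with a logistic-regression policy; the feature construction, the batch contents, and the reward value below are placeholders, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, M = 10, 32                      # feature size per instance, mini-batch size
Theta = np.zeros(FEAT_DIM)                # logistic-regression policy parameters
alpha = 0.01

def select(batch_features):
    """Return keep/drop actions a_m and their keep probabilities for one mini-batch."""
    p_keep = 1.0 / (1.0 + np.exp(-batch_features @ Theta))   # P_Theta(a_m = 1 | s_m)
    actions = (rng.random(M) < p_keep).astype(float)
    return actions, p_keep

# One NDF update: filter a mini-batch, train the base model on the kept instances
# (omitted here), observe a reward V_t (e.g. validation accuracy), then update Theta.
batch = rng.normal(size=(M, FEAT_DIM))
actions, p_keep = select(batch)
V_t = 0.8                                                    # placeholder reward signal
grad = ((actions - p_keep)[:, None] * batch).sum(axis=0)     # sum_m d log P(a_m|s_m) / d Theta
Theta += alpha * V_t * grad
```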


Paper Summary : Curiosity-driven Exploration by Self-supervised Prediction

Motivation

The policy learning process in Reinforcement Learning (RL) usually suffers from delayed/sparse rewards. The reward is a direct signal for the agent to evaluate how good the current action selection is. Since reward collection takes time, learning an optimal policy also takes longer. Another factor that influences the learning process is the human-designed reward function. Such reward functions might not provide optimal guidance for the agent's learning, or may not scale to real-world problems. We need a way to overcome reward sparsity that also improves the agent's exploration and makes learning more robust.

Human learning is guided not only by the final goal or achievement, but also by motivation or curiosity. Curiosity adds exploratory behaviour to the agent, allowing it to acquire new skills and gain new knowledge about the environment. It also makes the agent more robust, as performing curious actions ultimately reduces its uncertainty about the consequences of its own actions.

Contribution

In this paper, the authors propose curiosity-driven learning using an agent-intrinsic reward (i.e. a reward the agent generates itself from its understanding of the current environment and of possible changes in the states while navigating). To quantify curiosity, they introduce the "Intrinsic Curiosity Module" (ICM).

Intrinsic Curiosity Module (ICM)

The output of the ICM is the state prediction error, which serves as the curiosity reward. The module has two sub-components, each represented by a neural network.

a. Inverse Model :

This model learns a feature space using self-supervision. The new feature space is learned so as to discard features/information that are irrelevant to the agent's navigation. Learning the feature space involves two sub-modules:

i) The first module encodes the raw input state ($ s_t $) into a feature vector ($ \phi(s_t) $).

ii) The second module takes $ \phi(s_t) $ and $ \phi(s_{t+1}) $ as encoded feature inputs and predicts the action $ \hat{a_t} $ that the agent might have taken to go from $ s_t $ to $ s_{t+1} $.

$ \hat{a_t} = g( \phi(s_t), \phi(s_{t+1}), \theta_i ) $

Here the function g is a NN and $ \hat{a_t} $ is the estimated action. The learnable parameters $ \theta_i $ are trained with a loss function measuring the difference between the predicted action and the actual action, i.e. $ L_I( \hat{a_t}, a_t) $.

b. Forward Model :

This is a NN that predicts the feature representation of the next state ($ \phi(s_{t+1}) $), given $ \phi(s_t) $ and the action executed at $ s_t $ as inputs.

$ \hat{\phi(s_{t+1})} = f( \phi(s_t), a_t, \theta_F) $

$ \hat{\phi(s_{t+1})} $ is the predicted estimate of $ \phi(s_{t+1}) $ and $ \theta_F $ represents the trainable parameters, with the loss function:

$ L_F ( \phi(s_{t+1}), \hat{\phi(s_{t+1})}) = \frac{1}{2} || \hat{\phi(s_{t+1})} - \phi(s_{t+1}) ||^2 $, and the intrinsic curiosity reward is proportional to this error, $ r^i_t = \eta L_F $.

Both losses can be jointly expressed as :

$ \underset{\theta_i, \theta_F} {\min} [ (1-\beta) L_I + \beta L_F ] $

NOTE:

  • The ICM works with two connected modules: the inverse model (which learns the feature representations of the state and next state) and the forward model (which predicts the feature representation of the next state).
  • Curiosity is computed from the difference between the output of the forward model, i.e. $ \hat{\phi(s_{t+1})} $, and the encoded feature of the actual next state, $ \phi(s_{t+1}) $.
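
A minimal sketch of the two ICM modules in PyTorch (illustrative only: dense layers stand in for the paper's convolutional encoder, and the dimensions, random batches, and coefficients are placeholder assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, FEAT_DIM, N_ACTIONS, beta, eta = 16, 8, 4, 0.2, 1.0

encoder = nn.Sequential(nn.Linear(STATE_DIM, FEAT_DIM), nn.ReLU())        # phi(.)
inverse_head = nn.Linear(2 * FEAT_DIM, N_ACTIONS)                         # predicts a_t
forward_model = nn.Linear(FEAT_DIM + N_ACTIONS, FEAT_DIM)                 # predicts phi(s_{t+1})

s_t = torch.randn(32, STATE_DIM)          # batch of states (placeholder inputs)
s_next = torch.randn(32, STATE_DIM)       # batch of next states
a_t = torch.randint(0, N_ACTIONS, (32,))  # actions actually taken

phi_t, phi_next = encoder(s_t), encoder(s_next)

# Inverse model: predict the action from phi(s_t) and phi(s_{t+1}).
a_logits = inverse_head(torch.cat([phi_t, phi_next], dim=1))
L_I = F.cross_entropy(a_logits, a_t)

# Forward model: predict phi(s_{t+1}) from phi(s_t) and the action.
a_onehot = F.one_hot(a_t, N_ACTIONS).float()
phi_next_pred = forward_model(torch.cat([phi_t, a_onehot], dim=1))
L_F = 0.5 * ((phi_next_pred - phi_next.detach()) ** 2).sum(dim=1).mean()

# Intrinsic (curiosity) reward per sample, and the joint training objective.
intrinsic_reward = eta * 0.5 * ((phi_next_pred - phi_next) ** 2).sum(dim=1).detach()
loss = (1 - beta) * L_I + beta * L_F
loss.backward()
```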


Paper Summary : Asynchronous Methods for Deep Reinforcement Learning

Motivation

Deep Neural Networks (DNNs) are introduced into the Reinforcement Learning (RL) framework to make function approximation easier and scalable for large state-space problems. The DNN itself suffers from overfitting because of the correlated data gathered while navigating through the environment (e.g. when we play a game, consecutive moves of a player within a small timeframe look similar, which does not contribute much to learning). To avoid this, people started using experience replay, where the navigation experience (e.g. game screenshots) is stored in a buffer and used later when training/updating the policy/model parameters.

This works well, but only for off-policy algorithms like Q-learning. How can we use on-policy algorithms like SARSA and make learning stable with a DNN? Also, experience replay introduces extra memory requirements and computational delay for each update and each real interaction with the environment.

NOTE :

  • On-policy : the training data is generated by the same policy being trained, e.g. REINFORCE.
  • Off-policy : training data generated by another policy can be used to train the current policy, e.g. Q-learning.

Contribution

In this paper, the authors introduce an asynchronous training process that executes multiple agents in parallel in different instances of the same environment using multiple CPU cores. It uses multithreading to run these agents and updates the global model parameters asynchronously in an online fashion. The approach is reported to enable stable learning and faster convergence.

They introduce asynchronous variants of SARSA, 1-step/n-step Q-learning, and the advantage actor-critic algorithm. Let's discuss some details of asynchronous Q-learning and the Asynchronous Advantage Actor-Critic (A3C) algorithm.

Asynchronous Q-Learning

In Deep Q-Learning (DQN), the neural network (NN) approximates the Q-value function $ Q(s, a; \theta) $ with the loss formulated as:

$ L_{i} (\theta_i) = \mathbb{E} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right)^2 \right] $ ..........(i)

In asynchronous 1-step Q-learning, each thread maintains its own copy of the environment, and the agent traverses the environment with an $ \epsilon $-greedy policy. At each step, we compute the gradient of the loss (i) and accumulate gradients over multiple timesteps before updating the parameters.

Asynchronous Advantage Actor-Critic (A3C)

The actor-critic method combines value-based and policy-based methods.

It learns both a policy $ \pi(a_t | s_t; \theta) $ and a value function $ V(s_t; \theta_v) $. It uses a "forward view", i.e. it selects actions according to its exploration policy $ \pi(a_t | s_t; \theta) $ for up to $ t_{max} $ steps in the future, collecting up to $ t_{max} $ rewards since the last update.

Now, the policy and value function are updated after every $ t_{max} $ actions as:

  • Policy Network : $ \nabla_{\theta} \log \pi(a_t | s_t; \theta) (R_t - V(s_t; \theta_v)) $
  • Value Network : $ \nabla_{\theta_v} (R_t - V(s_t; \theta_v))^2 $

In this learning framework, parallel actor-learners update a shared model, which makes the learning process more robust and stable.
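
Below is a single-thread sketch of one such update for a $ t_{max} $-step rollout (illustrative only: the paper runs many of these actor-learner threads against shared parameters, and the random rollout here is a placeholder for real environment interaction):

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA, T_MAX = 8, 4, 0.99, 5

policy_net = nn.Linear(STATE_DIM, N_ACTIONS)   # pi(a | s; theta)
value_net = nn.Linear(STATE_DIM, 1)            # V(s; theta_v)

states = torch.randn(T_MAX, STATE_DIM)         # placeholder rollout
rewards = torch.randn(T_MAX)
dist = torch.distributions.Categorical(logits=policy_net(states))
actions = dist.sample()

# Forward view: n-step returns R_t computed backwards over the rollout.
R, rets = 0.0, []
for r in reversed(rewards.tolist()):
    R = r + GAMMA * R
    rets.append(R)
returns = torch.tensor(list(reversed(rets)))

values = value_net(states).squeeze(-1)
advantage = returns - values

policy_loss = -(dist.log_prob(actions) * advantage.detach()).mean()  # grad log pi * (R - V)
value_loss = advantage.pow(2).mean()                                  # (R - V)^2
(policy_loss + 0.5 * value_loss).backward()   # gradients would be applied to the shared model
```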


Paper Summary : Visual Reinforcement Learning Imagined Goals

Motivation

Humans can easily adjust or adapt themselves to new environments and learn new tasks by setting their own goals. In the Reinforcement Learning framework, we have to manually design the reward function, which gives an orientation towards the goal of a given task. For example, if we have to train a robot to pick up a package and deliver it to a destination, we have to set the reward based on the distance covered. Along with the delivery task, there might be other tasks, like adjusting the robot arm to pick up the package based on its shape/size, or placing the package at the destination without throwing it on the ground. For each of these tasks we could design a specific reward function, but that is not practical or scalable for real-world problems where an agent has to solve many tasks simultaneously.

Contribution

The authors propose a reinforcement learning framework where an agent learns general-purpose goal-conditioned policies by setting its own synthetic goals and learning to achieve them, without human intervention.

They refer to this framework as "reinforcement learning with imagined goals" (RIG).

Synthetic Goals

Initially, the agent itself generates a set of synthetic goals by exploring with a random policy. Both the state observations and the goals are image data (for example, in the case of robot navigation). Under the random policy, the agent executes random actions in the environment, and the trajectories, consisting of state observations, are stored for later use.

During the policy training phase, the agent can randomly fetch those stored observations to use as initial states or as goals.

Now we have all the information needed to train a goal-conditioned agent. The authors use a Q-learning agent with $ Q(s, a, g) $, where s is the state, a the action, and g the goal to be achieved by executing action a. The optimal policy can then be derived as: $ \pi(s, g) = \underset{a}{\arg\max}\, Q(s, a, g) $

In order to train this policy, two main issues need to be addressed:

a. How to design the reward function? The distance between images during navigation is one possible reward, but pixel-wise distance does not carry the semantic meaning of the actual distance between states, and it is also computationally involved.

b. How to represent the goals as a distribution so that we can sample goals for training?

The authors resolve these issues by using a Variational Autoencoder (VAE) to learn an encoded representation of the images. The VAE takes raw images (x) as input and generates a low-dimensional latent representation (z). Using this latent representation, we now have latent states (z) and latent goals ($ z_g $).

The working algorithm can be summarized as follows:

a. Initially, the agent explores the environment using a random policy, and the state observations are stored.

b. A VAE is trained on the raw images from (a) to learn a latent representation of all state observations.

c. Initial states (z) and goals ($ z_g $) are sampled from (b).

d. A goal-conditioned Q-function $ Q(z, a, z_g) $ is trained using data from (c), and the policy $ \pi_{\theta}(z, z_g) $ is learned in the latent space.
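
A schematic sketch of steps (c) and (d) as described above: sample latent goals from stored observations and reward the agent by negative distance in latent space. The `encode` function and the random observations are placeholders standing in for the trained VAE encoder and the stored image frames.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 4

def encode(image):
    """Placeholder for the trained VAE encoder: maps a raw image to a latent vector z."""
    return image.reshape(-1)[:LATENT_DIM] * 0.1

# Observations collected earlier by random exploration (random stand-ins here).
stored_observations = [rng.normal(size=(8, 8)) for _ in range(100)]

# Sample an initial state and an imagined goal from the stored observations.
z = encode(stored_observations[rng.integers(100)])
z_g = encode(stored_observations[rng.integers(100)])

# Latent-space reward used when training the goal-conditioned Q-function Q(z, a, z_g).
def reward(z, z_g):
    return -np.linalg.norm(z - z_g)

print("reward toward imagined goal:", reward(z, z_g))
```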
