What is underfitting in Machine Learning ?

Underfitting occurs in machine learning / data science when a model fails to capture the relationship between the input and output variables accurately, resulting in high error rates on both the training set and unseen data. It usually means the model was not trained long enough, is too simple, or the input variables are not informative enough to establish a meaningful relationship with the output. As the model learns, its bias diminishes, but its variance may increase, eventually leading to overfitting. The objective in model fitting is to identify the optimal balance between underfitting and overfitting (i.e., finding the sweet spot), allowing the model to capture the dominant trend in the training data and generalize effectively to new datasets.

Important details:

  1. A high-bias (underfitted) model is not able to learn even the basic/important patterns in the training data.

  2. Adding more data or making your model simpler won't help you avoid underfitting.

  3. One should try a more sophisticated model (e.g., a decision tree instead of kNN) or add complexity to the current model.

  4. Using a more complex model (for example, polynomial regression rather than plain linear regression) may help capture the relevant patterns in the training data; see the sketch after this list.

  5. Adding more features (or features derived from existing ones) also increases model capacity and helps avoid underfitting.

  6. If you see unacceptably high training error and test error, the model is underfitted.

  7. High bias and low variance are the indicators of an underfitted model.

  8. Underfitting is easier to detect than overfitting, since the performance can already be measured during the training phase.
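A minimal sketch of item 4, assuming scikit-learn and numpy are available (the data, polynomial degree, and noise level are made up for illustration): a plain linear model underfits a quadratic target, while adding polynomial features captures the dominant trend.

```python
# Sketch: underfitting a non-linear target with a linear model,
# then adding polynomial features to increase model capacity.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)  # quadratic target + noise

# Plain linear regression: high training error (underfitting).
linear = LinearRegression().fit(X, y)
print("linear train MSE:", mean_squared_error(y, linear.predict(X)))

# Polynomial features increase model capacity and capture the dominant trend.
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
poly = LinearRegression().fit(X_poly, y)
print("degree-2 train MSE:", mean_squared_error(y, poly.predict(X_poly)))
```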

What is overfitting in Machine Learning ?

Overfitting occurs when the model attempts to match the training set too closely. On fresh data, the overfitted model is unable to produce accurate predictions.

Important details:

  1. When the training dataset is small or the model is too complex, the model will match the data too closely and pick up on noise in the data.

  2. An overfitted model picks up patterns that are unique to the training set and overlooks the generic patterns.

  3. Regularization can reduce overfitting; see the sketch after this list.

  4. Overfitting can also be reduced by training on a large and diverse set of training data points.

  5. Overfitting can be detected by high variance, i.e., when the test data has a high error rate while the training data has a low error rate.

  6. A high variance model will overfit the data and is flexible in capturing every detail—relevant or not—and noise in the data.

  7. A high variance model is also indicated as: Training error << Validation error.

  8. More training data will improve the generalization of the given model and help avoid overfitting.
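A minimal sketch of item 3, again assuming scikit-learn (the dataset, polynomial degree, and regularization strength are illustrative): a high-degree polynomial fitted to a few noisy points overfits, and L2 regularization (Ridge) typically narrows the gap between training and test error.

```python
# Sketch: overfitting a small noisy dataset with a high-degree polynomial,
# then shrinking the coefficients with L2 regularization (Ridge).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
poly = PolynomialFeatures(degree=15)
X_train_p = poly.fit_transform(X_train)
X_test_p = poly.transform(X_test)

for name, model in [("unregularized", LinearRegression()), ("ridge", Ridge(alpha=1.0))]:
    model.fit(X_train_p, y_train)
    print(name,
          "train MSE:", mean_squared_error(y_train, model.predict(X_train_p)),
          "test MSE:", mean_squared_error(y_test, model.predict(X_test_p)))
```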

What is gradient descent ?

Gradient descent is a widely used optimization approach for training machine learning models and neural networks. Optimization is the process of minimizing (or maximizing) an objective function. In gradient descent, this entails calculating the gradient (the partial derivatives) of the cost function with respect to each parameter (weights and biases). To do this, the model is fed training data iteratively and the gradient is computed at each step. The gradient points in the direction of steepest increase of the function, so gradient descent lowers the cost function by stepping in the opposite direction, the direction of steepest decrease.
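A minimal numpy sketch of the idea, with made-up data, learning rate, and step count: batch gradient descent on the MSE cost of a simple linear model.

```python
# Sketch: batch gradient descent minimizing MSE for a linear model y ≈ w*x + b.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=100)  # "true" w = 3, b = 1

w, b = 0.0, 0.0   # learnable parameters
lr = 0.1          # learning rate (step size)

for step in range(1000):
    y_pred = w * x + b
    error = y_pred - y
    # Partial derivatives of the MSE cost with respect to w and b.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step against the gradient: the direction of steepest decrease.
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should end up close to 3 and 1
```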

What is Inductive Bias in Machine Learning ?

Inductive bias is an explicit or implicit assumption, or prior information about the model, that permits it to generalize beyond the training data.

Examples of inductive bias:

  1. When it comes to decision trees, shorter trees are assumed to generalize better than longer ones.

  2. In linear regression, the response variable (y) is assumed to vary linearly with the predictors (X).

  3. In general, the belief that the simplest hypothesis is more likely to be accurate than a more complicated one (Occam's razor).

What are the model training steps in machine learning ?

There may be many possible models for the problem at hand. Based on your modeling decision, there are usually two different ways to complete the machine learning lifecycle.

  • 1st scenario. Training a single model with a training dataset and final evaluation with the test set.

  • 2nd scenario. Training multiple models with training/validation dataset and final evaluation with the test set.

In the 1st scenario, you follow this approach:

  • Divide the data into training and test sets (usually a 70/30 split).

  • Select your preferred model.

  • Train it with a training dataset.

  • Assess the trained model on the test set (no separate validation step is needed).

In the 2nd scenario, you follow this approach (a short code sketch follows the list):

  • Divide the data into training, validation, and test sets (usually a 50/25/25 split).

  • Select the initial model/architecture.

  • Train the model with a training dataset.

  • Evaluate the model using the validation dataset.

  • Repeat the previous three steps (select, train, evaluate) for different models or training parameters.

  • Select the best model based on the validation results and retrain it on the combined (training + validation) data.

  • Assess the retrained model on the test set.
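A compact scikit-learn sketch of the 2nd scenario; the dataset, candidate models, and split sizes are illustrative, not prescribed here.

```python
# Sketch: train/validation/test workflow with two candidate models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# 50/25/25 split: carve out the test set first, then split the remainder.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=1/3, random_state=0)

candidates = {"knn": KNeighborsClassifier(), "tree": DecisionTreeClassifier(random_state=0)}

# Train each candidate on the training set and compare on the validation set.
scores = {name: model.fit(X_train, y_train).score(X_val, y_val)
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)

# Retrain the best model on training + validation, then run the final test evaluation.
best = candidates[best_name].fit(np.vstack([X_train, X_val]), np.hstack([y_train, y_val]))
print(best_name, "test accuracy:", best.score(X_test, y_test))
```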

What is model training in machine learning ?

A machine learning model is represented by its model parameters, which are the learnable parameters. Learning happens when these parameters are updated with suitable values so that the model is able to solve the given task. Training is the process of feeding a training dataset to your model. The training process uses an objective function (e.g., MSE) to get feedback in each iteration. Since we are trying to improve the accuracy of the model on a given input and lower the error between the model prediction and the actual output, the training process is also called model optimization.

What is machine learning ?

Understanding and extracting hidden patterns or features from data is the learning process in machine learning. Instead of using explicit logic supplied by people, machine learning has the capacity to learn from experience. Conventional systems are created with well-defined, human-set rules. Machine learning algorithms, in order to learn complicated patterns from inputs (x), use outputs (y) as a feedback signal. Thus, an intelligent program is the ML system's final product.

We often use a logical method to solve a problem: we break the task into several smaller tasks and solve each one with a distinct piece of logic. For extremely complicated jobs, like stock price prediction, the patterns keep changing, which affects the results. That implies that, to solve such a problem logically, we would have to adjust our handwritten logic for every new change in the outputs. Machine Learning (ML), on the other hand, builds the model from a large amount of data. The data gives the model all of its historical experience, which helps it better understand the pattern. Whenever the data changes, we simply retrain the model with fresh examples.

Paper Summary : Playing Atari With Deep Reinforcement Learning

Motivation

Deep Learning (DL) has proven to work well when we have a large amount of data. Unlike the supervised DL setup, Reinforcement Learning (RL) doesn't have direct access to targets/labels. An RL agent usually gets "delayed and sparse" rewards as the signal for understanding the environment and learning a policy for it. Another challenge is the distribution of the inputs. In supervised learning, each batch in the training loop is drawn randomly, which ensures the samples are independent and the parameter updates won't overfit to some specific direction/class in the data. In RL, inputs are usually correlated. For example, when you collect image frames from a game video, their pixel content changes little from frame to frame; many samples look alike, which might lead to poor learning and a locally optimal solution. Another problem is the non-stationarity of the target: the target keeps changing throughout the episodes as the agent learns new behaviour from the environment, or adapts to it.

Contribution

The authors proposed the 'Deep Q-Network' (DQN) learning algorithm with experience replay. This approach addresses both the correlated-inputs and non-stationarity problems.

They use a CNN with a variant of the Q-learning algorithm and train it with stochastic gradient descent (SGD). They maintain a buffer, the 'experience replay', of transitions collected while the agent navigates through the environment. During SGD training, samples drawn from this buffer form the mini-batches used to train the network. They refer to this network as the Q-network with parameters $ \theta $, which minimizes a sequence of loss functions $ L_i (\theta_i) $ :

$ L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)} \left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right] $

Where $ y_i = \mathbb{E}_{s' \sim \mathcal{E}} \left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \right] $

is the target for iteration i.

They use the parameter values from the previous iteration ($ \theta_{i-1} $) to calculate the target ($ y_i $). Holding these parameters fixed while computing the target makes the target (locally) stationary, which keeps training smooth. They also feed a concatenation of four video frames as input to the CNN in order to work around the partial observability of a single frame: with four frames, the CNN can infer the movement direction and speed of the objects in the scene.

DQN is trained on Atari 2600 games. The video frames from the emulator are the observations, produced in response to the agent's discrete actions (up, down, left, right, ...) in the environment. The network consists of two convolutional layers and two fully connected layers; the last layer outputs a Q-value for each possible action.
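A minimal PyTorch-style sketch of the two ideas above (assuming torch is available; the environment interaction is faked with random transitions, and the network sizes, learning rate, and refresh interval are made up): mini-batches are sampled from an experience replay buffer, and the target $ y_i $ is computed with a copy of the parameters that is held fixed for a while, which is one common way to implement the fixed-target idea described above.

```python
# Sketch: DQN update with experience replay and a frozen parameter copy for the target.
import random
import torch
import torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(n_states, 32), nn.ReLU(), nn.Linear(32, n_actions))
target_net = nn.Sequential(nn.Linear(n_states, 32), nn.ReLU(), nn.Linear(32, n_actions))
target_net.load_state_dict(q_net.state_dict())        # frozen parameters for the target
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

replay = []                                            # experience replay buffer
for _ in range(1000):                                  # fake transitions, for illustration only
    s, a = torch.randn(n_states), random.randrange(n_actions)
    r, s_next = random.random(), torch.randn(n_states)
    replay.append((s, a, r, s_next))

for step in range(100):
    batch = random.sample(replay, 32)                  # decorrelated mini-batch
    s = torch.stack([t[0] for t in batch])
    a = torch.tensor([t[1] for t in batch])
    r = torch.tensor([t[2] for t in batch])
    s_next = torch.stack([t[3] for t in batch])

    with torch.no_grad():                              # target uses the frozen parameters
        y = r + gamma * target_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta_i)
    loss = ((y - q) ** 2).mean()                       # L_i(theta_i)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:                                 # periodically refresh the frozen copy
        target_net.load_state_dict(q_net.state_dict())
```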


Paper Summary : Policy Gradient Methods for Reinforcement Learning with Function Approximation

Motivation

Reinforcement Learning (RL) addresses the problem of learning through experimentation in a (dynamic) environment. The learner's objective is to find an optimal policy that guides the agent's navigation; this optimal policy is formulated in terms of maximizing the agent's future reward. The value function $ V_{\pi} (s) $ and the action-value function $ Q_{\pi}(s,a) $ are measures of potential future rewards.

  • $ V_{\pi} (s) $ : Goodness measure of being in a state s and then following policy $ \pi $
  • $ Q_{\pi}(s,a) $ : Goodness measure of being in a state s, performing action a, and then following policy $ \pi $

NOTE

  • Both $ V_{\pi} (s) $ and $ Q_{\pi}(s,a) $ are defined as expectations of the discounted future rewards, and their values are maintained in a lookup table.
  • Goal : We want to find the value of each (state) or (state, action) pair in a given environment, so that the agent can follow an optimal path, collecting maximum rewards.

In a large-scale RL problem, maintaining a lookup table leads to the 'curse of dimensionality'. This problem is commonly solved with function approximation: the function approximator generalizes the estimate of the state value or state-action value from a set of features of the given state/observation. Most existing approaches follow the idea of approximating the value function and then deriving a policy from it. The authors point out two major limitations of this approach:

a. This approach is geared toward finding a deterministic policy, which might not be suitable for complex problems/environments.

b. A small variation in the value estimate might cause a different action to be selected; the derived policy is sensitive to the estimate.

Contribution

The authors propose an alternative: approximate the policy directly with a parameterized function. So, instead of storing Q-values in a table, the policy itself is learned with a function approximator. For example, the policy can be represented by a neural network (NN) that takes the state as input and outputs a probability distribution for action selection. Considering $ \theta $ as the parameters of the NN representing the policy and $ \rho $ as its performance measure (which can be a loss function), the parameter $ \theta $ is updated as:

$ \theta_{t+1} \gets \theta_t + \alpha \frac{\partial {\rho}}{ \partial{\theta}} $

Policy Gradient Theorem:

For any Markov Decision Process (MDP),

$ \nabla_{\theta} J(\theta) = \frac{\partial \rho(\pi)}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta} Q^{\pi}(s,a) $ ----------(a)

Here $ \rho(\pi) $ is the average reward under the current policy $ \pi $, and $ d^{\pi}(s) $ is the stationary distribution of states under $ \pi $.

  • The problem with the above formulation is 'how to get Q(s,a) ?' -> Q(s,a) must be estimated.

Notice that the gradient expression does not involve the derivative of the state distribution with respect to the policy parameter $ \theta $. Since the gradient does not depend on the MDP dynamics, it allows model-free learning in RL. Estimating the policy gradient with Monte-Carlo sampling gives the REINFORCE algorithm.

In Monte-Carlo sampling, we take N trajectories using the current policy $ \pi $ and collect the returns. However, these returns have high variance, and we might need many episodes for smooth convergence. The variance arises because we won't collect the same trajectories multiple times (i.e., the agent's movement is also dynamic) when using our stochastic policy in a stochastic environment.
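A small numpy sketch of that Monte-Carlo (REINFORCE-style) estimate of (a); everything here is illustrative: a linear-softmax policy, and fake states and returns standing in for real rollouts.

```python
# Sketch: Monte-Carlo (REINFORCE-style) estimate of the policy gradient
# for a linear-softmax policy over discrete actions.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
theta = rng.normal(size=(n_features, n_actions))   # policy parameters

def policy(s):
    """pi(a | s) as a softmax over linear scores."""
    logits = s @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def grad_log_pi(s, a):
    """Gradient of log pi(a | s) with respect to theta for the linear-softmax policy."""
    grad = -np.outer(s, policy(s))
    grad[:, a] += s
    return grad

# Pretend we rolled out N steps with the current policy and recorded the returns G_t.
alpha, N = 0.01, 100
grad_estimate = np.zeros_like(theta)
for _ in range(N):
    s = rng.normal(size=n_features)                # observed state (fake)
    a = rng.choice(n_actions, p=policy(s))         # action sampled from pi
    G = rng.normal()                               # return following (s, a) (fake)
    grad_estimate += grad_log_pi(s, a) * G
grad_estimate /= N

theta += alpha * grad_estimate                     # gradient ascent on expected reward
```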

  • QUESTION : How to estimate Q-value in equation (a) ?

The authors use a function approximator $ f_w (s,a) $ with parameters $ w $ to estimate $ Q^{\pi} (s,a) $, giving:

$ \nabla_{\theta} J(\theta) = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta} f_w(s,a) $ --------- (b)

Here $ f_w(s,a) $ is learned by following $ \pi $ and updating $ w $ to minimize the mean-squared error $ [ Q^{\pi}(s,a) - f_w(s,a) ]^2 $. The critic predicts a Q-value for a given state/action, while the returns observed when the agent acts in the environment provide the target; the algorithm tries to keep the difference between the two as small as possible.

The resulting formulation (b) gives the idea of the actor-critic architecture for RL, where

i. $ \pi(s,a) $ is the actor, which learns to approximate the policy by maximizing (b), and

ii. the critic $ f_w(s,a) $ learns to estimate the action-value function by minimizing the MSE between its estimates and the observed Q-values.
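A numpy sketch of this actor-critic split, in the same illustrative style as the REINFORCE sketch above (linear-softmax actor, linear critic, fake states and returns): the critic is fitted by a squared-error gradient step, and the actor update uses $ f_w(s,a) $ in place of the raw return.

```python
# Sketch: actor-critic update where the critic f_w(s, a) replaces the sampled return.
import numpy as np

rng = np.random.default_rng(1)
n_features, n_actions = 4, 3
theta = rng.normal(size=(n_features, n_actions))   # actor (policy) parameters
w = np.zeros((n_features, n_actions))              # critic parameters

def policy(s):
    logits = s @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def f_w(s, a):
    return s @ w[:, a]                             # linear critic estimate of Q(s, a)

alpha_actor, alpha_critic = 0.01, 0.1
for _ in range(100):
    s = rng.normal(size=n_features)                # observed state (fake)
    a = rng.choice(n_actions, p=policy(s))         # action sampled from the actor
    G = rng.normal()                               # observed return (fake)

    # Critic: gradient step on the squared error [G - f_w(s, a)]^2.
    w[:, a] += alpha_critic * (G - f_w(s, a)) * s

    # Actor: policy-gradient step using the critic's estimate instead of G.
    grad_log_pi = -np.outer(s, policy(s))
    grad_log_pi[:, a] += s
    theta += alpha_actor * grad_log_pi * f_w(s, a)
```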


Paper Summary : Proximal Policy Optimization Algorithms

Motivation

Deep Q-learning, 'vanilla' policy gradient, and REINFORCE are examples of function-approximation approaches in RL. In RL, robustness and sample efficiency are the measures that define the effectiveness of the applied algorithm.

In the RL formulation, the agent needs to solve a task/problem in an environment. The agent continuously interacts with the environment, which provides rewards, and thereby the agent learns a policy to navigate and tackle the problem. At every time step, the RL agent has to make a decision by selecting a preferable action. To do so, the agent relies fully on the information of the current state and the accumulated knowledge (history of rewards) up to the current time step. Once the action is performed, the next state/observation is determined by some (stochastic) transition probability model, and a reward is signaled to the agent based on this new state and the action performed to get there. Overall, the goal of the agent is to maximize the expected cumulative reward.

In RL terms, this goal can be formulated as finding a policy $ \pi (a_t | s_t) $ such that the expected return $ \mathbb{E}_{\tau \sim \pi_{\theta}} [G_t] $ is maximized.

In high-dimensional / continuous action spaces, policy gradient methods can be used to solve this problem. In the "vanilla" policy gradient method, the policy is parameterized by some parameter $ \theta $, i.e., a parametric policy $ \pi_{\theta} (a_t | s_t) $, and we directly optimize the policy by finding the optimal parameter $ \theta $.

Even though 'vanilla' policy gradient / REINFORCE are simple and easy to implement, they come with some learning issues:

PROBLEM : They usually give rise to high variance when estimating the gradient. This is because the objective function

$ J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} [G_t] $

contains an expectation, so we can't directly compute the exact gradient. We use stochastic gradient estimates such as REINFORCE, based on a batch of trajectories. This sampling approximation adds variance, which means we need a large number of trajectories to get a good estimate. In addition, collecting trajectories can itself be a problem in complex environments, as it might take a long time.

Contribution

The authors introduce a family of policy optimization methods built on the work on Trust Region Policy Optimization (TRPO). There are two main ideas:

a. Clipped Surrogate Objective Function : it avoids large deviations of the learned policy $ \pi_{\theta} $ from the old policy $ \pi_{\theta_{old}} $, and is formulated as:

$ L^{clip}(\theta) = \mathbb{E}_t [ min (r_t (\theta) \hat{A_t}, clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A_t}) ] $

Here $ r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} $

$ \epsilon $ is a hyperparameter that restricts the new policy from moving too far from the old policy. $ \hat{A_t} $ can be the discounted return or an advantage estimate.

The clipping ensures that the updates take place within a "trust region". It also introduces less variance than vanilla policy gradient methods.

b. Multiple Epochs for policy update:

PPO allows running multiple epochs over the same trajectories while optimizing the objective $ L^{clip}(\theta) $, which reduces sample inefficiency during learning. To collect data, PPO runs the policy with parallel actors and then samples mini-batches of the data to train for K epochs using the objective function above.

Let's observe the behaviour of the objective function based on the sign of the advantage function.

CASE I : When $ \hat{A_t} $ is +ve :

The objective function can be written as :

$ L^{clip} (\theta) = min ( r_t(\theta), 1 + \epsilon ) \hat{A_t} $

Since $ \hat{A_t} $ is +ve, the objective value increases when the action becomes more likely (i.e., $ \pi_{\theta}(a_t | s_t) $ increases). The min operator limits this increase: when $ \pi_{\theta}(a_t | s_t) > (1 + \epsilon) \pi_{\theta_{old}}(a_t | s_t) $, the objective is capped at $ (1 + \epsilon) \hat{A_t} $.

CASE II : When $ \hat{A_t} $ is -ve :

The objective function can be written as :

$ L^{clip} (\theta) = max ( r_t(\theta), 1 - \epsilon ) \hat{A_t} $

Now the objective value only increases when the action becomes less likely (i.e., if $ \pi_{\theta}(a_t | s_t) $ decreases, $ L^{clip} (\theta) $ increases). In this case, when $ (1 - \epsilon) \pi_{\theta_{old}}(a_t | s_t) > \pi_{\theta}(a_t | s_t) $, the max operator limits the value at $ (1 - \epsilon) \hat{A_t} $.
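Putting the pieces together, here is a small numpy sketch of computing $ L^{clip}(\theta) $ for a batch of timesteps, given per-timestep action probabilities under the new and old policies and advantage estimates (all numbers are made up for illustration):

```python
# Sketch: the PPO clipped surrogate objective L^clip for a batch of timesteps.
import numpy as np

epsilon = 0.2                                        # clipping hyperparameter
pi_new = np.array([0.30, 0.10, 0.60, 0.25])          # pi_theta(a_t | s_t), illustrative
pi_old = np.array([0.25, 0.20, 0.50, 0.25])          # pi_theta_old(a_t | s_t)
advantage = np.array([1.0, -0.5, 2.0, 0.3])          # advantage estimates A_t

ratio = pi_new / pi_old                              # r_t(theta)
unclipped = ratio * advantage
clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage

# Element-wise minimum, then the empirical expectation over the batch.
L_clip = np.mean(np.minimum(unclipped, clipped))
print(L_clip)
```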
