Search and Optimization

This is a Python library for Search and Optimization.

Welcome

This is a library for search and optimization algorithms. The basic topics are covered which include Descent Method, Stochastic Search, Path Search, MDP-related and RL related algorithms. By using this library, you are expected to see basic ideas behind the algorithms through simple but intuitive visualizations.

Hope you can have fun in search and optimization! Any problems with the algorithm or implementation or other problems please feel free to contact me!

Table of Contents

Requirements

The implementation is quite simple and intuitive. If you can use conda, you are ready to go! If not, the requirements are:

  • Python 3.8.x

  • numpy

  • matplotlib

How to use

  1. clone this repo.

git clone https://github.com/ruipengZ/Search-and-Optimization.git
  1. Install the required dependency

    • Using conda

      conda env create -f conda.yml
      
    • Using pip

      pip install -r requirements.txt
      
  2. Run certain algorithms in the library then see the visualization.

Algorithms

This repo covers the basic topics in Search and Optimization, including Descent Method, Stochastic Search, Path Search, MDP-related and RL related algorithms.

Descent Method

Descent Method in general starts at some initial point and tries to get to the local minimum in descendent way. It includes gradient descent, Newton’s Method and conjugate descent.

Gradient Descent

Gradient Descent is widely used in optimization and machine learning areas because its simple calculation. It tries to take a small step towards the gradient descent direction to minimize the function.

Here are the one and two dimensional example visualization for gradient descent.

  • One Dimensional Case

  • Two Dimensional Case

Newton’s Method

Gradient Descent uses the first derivative as guess of the function. Newton’s Method, instead uses a sequence of second-order Taylor approximations of around the iterates which gives rise to a faster descent process.

  • One Dimensional Case

  • Two Dimensional Case

However fast and do not need a step size (learning rate), Newton’s Method has some drawbacks and caveats:

  • The computation cost of inverting the Hessian could be non-trivial.

  • It does not work if the Hessian is not invertible.

  • It may not converge at all, but can enter a cycle having more than 1 point.

  • It can converge to a saddle point instead of to a local minimum

Reference:

See details on https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization

Conjugate Descent

For quadratic problems, we can do better than generic directions of gradient. In Gradient Descent if we take the steepest descent, we always go onthogonal in every step. Can we go faster? Yes, Newton’s method gives us a faster one. But at the same time, we also want to avoid calculating the inversion of certain matrics. Conjugate gradients give us a better way to perform descent method because they allow us to minimize convex quadratic objectives in at most n steps and without inverting the matrics.

  • One Dimensional Case

  • Two Dimensional Case

Reference:

you can find more details in the reference slides

http://www.cs.cmu.edu/~pradeepr/convexopt/Lecture_Slides/conjugate_direction_methods.pdf

Comparison of Descent Methods

We can compare the visualization of the above three descent methods as follows:

As you can see, Gradient Descent slowly move to the minimum whereas Conjugate Descent and Newton’s Method is faster. In the two dimensional case above, we can observe that Conjugate Descent can be regarded as being between the method of Gradient Descent (first-order method) and Newton’s method (second-order method).

Markov Decision Process

A Markov decision process is composed of states, actions, transition probabilities and reward of states. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. At each states, we not only want to maximize the short-term reward but also long-term. So evaluating the value of a state is essential in MDP.

Value Iteration and Policy Iteration are algorithms where we have the full knowledge of the MDP (the transition probabilities are known), computing the optimal policy and value.

Value Iteration

Value Iteration start with arbitrary state value and use Bellman Update to update the values and pick the best policy.

Here is the visualization. Green nodes are state node and blue nodes are the environment with the transition probability on the arrow pointing to the next state. When we do Bellman Update on a certain state, it will be marked as red and a new value is updated on its left. The best action is labeled as red arrow. We do Bellman Update for rounds until it converges.

Policy Iteration

Policy Iteration starts with an arbitrary policy and solve the Bellman Equations to evaluate the values defined by the policy, then check each state to see if we can improve the max value. We keep doing this until the policy is fixed.

Here is the visualization on the same MDP as in the Value Iteration. Nodes and arrows are the same as in Value Iteration, where as every state start with an arbitrary policy and update it every round.

Comparison of Value and Policy Iteration

Here are the final state of the two algorithms:

As we can see from the final state of the two algorithms given the same MDP, the two algorithms output the same optimal policy, whereas the Policy Iteration may not output the final value of states because the algorithm stops when policy is stabilized.

MDP with Unknown Environment

In real world, we do not have the full knowledge of the transition probabilities in MDP. Can we improve or optimize a given policy without the transition model? There are three algorithms covered including Monte Carlo Policy Evaluation, Temporal Difference Policy Evaluation and Tabular Q-Learning.

Given a fixed policy, Monte Carlo Policy Evaluation and Temporal Difference Policy Evaluation can evaluate the value of each state by using sampling.

Monte Carlo Policy Evaluation

Monte Carlo Policy Evaluation simulates a lot of state sequences and use the average as the value of states.

The visualization is as follow. Every time the algorithm generate a sequence of states and calculate their values and use the average as the value of states. Note that every state only have one action which is the fixed policy.

Temporal Difference Policy Evaluation

Monte Carlo Policy Evaluating generate the whole sequence every time, which could be very time-consuming and even not practical when the MDP is infinite or has circles. Temporal Difference Policy Evaluation, however, utilizes the Bellman Equations and update on the way, without waiting for full sequences. In every step, it updates estimated values based on next sampled state without waiting for a final outcome.

Here is the visualization. Each round we only update state values based on the next state.

Tabular Q-Learning

Monte Carlo Policy Evaluation and Temporal Difference Policy Evaluation only estimate the state values given the fixed policy. But how can we take the best action in MDP without a transition model? Q-learning solves the problem of learning about the environment and improving policy at the same time.

It starts with knowing nothing and pick actions based on epsilon-greedy policy (choose either exploration or exploitation with epsilon probability), and then update the value of state taking certain action, which is Q-value. By updating the Q-value on the fly, the policy will provably converge to the optimal.

The visualization is as follow. We plot Q-value Q(s,a) in red next to the environment nodes. After Q-Learning, we can output a policy without knowing the transition model.

Deep Q-Learning

Tabular Q-learning store Q value for every state-action pairs. But we can use expressive function approximations to store values or/and policies. By doing this, we are no longer limited by table size or discretization and have better generalizability. Also we can directly manipulate the policies.

Deep Q-learning substitue the storting process in Tabular Q-learning with a deep neural network. It keep adding new experience in the pool of samples “Experience Replay” and every time takes a batch from the pool and fit a network to the new Q-value.

Here we display a more interesting example - a catching block game. A board on the bottom catches the falling block. We treat the current image of the game as state and the bottom board can take actions to move to the left or right. By doing deep Q-learning, we can get a smart AI game player.