π in Reinforcement Learning


Reinforcement learning (RL) is one of the most exciting areas of artificial intelligence today. As mentioned before, at its essence it is about teaching an AI agent to make decisions that lead to the highest reward over time — much like how humans learn by trial and error.

In today's post we will discuss the policy, denoted π: the strategy that defines how an agent behaves in any given state of the environment. In other words, it defines the mapping from states to actions.

The primary goal of a reinforcement learning algorithm is to learn an optimal policy that maximizes the agent's expected cumulative reward over time. Let's explore this further.
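
Formally, this objective is often written as the expected discounted return. A standard textbook way to express it (assuming a discount factor γ and a reward r_t at each time step, consistent with the notation used later in this post) is:

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right],
\qquad
\pi^{*} = \arg\max_{\pi} J(\pi)
```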


Our Track Record

To recap, in previous posts on RL we've established a brief overview of some foundational concepts that underlie optimized decision-making:

First off we explored the Bellman Optimality Equation, which defines how to recursively calculate the maximum expected reward from any given state — helping to derive the optimal policy.

Connecting the ideas: while the policy defines the agent's strategy (how it chooses actions in each state), the Bellman optimality equation tells us how good each action is by evaluating the expected return. As a result, this enables the agent to iteratively improve its policy, step by step, until it converges to the optimal policy.

We then introduced Markov Decision Processes (MDPs), a formal framework for sequential decision-making, often under uncertainty. An MDP is commonly defined by states (S), actions (A), transition dynamics (P), a reward function (R), a discount factor (γ), and sometimes an initial-state distribution, which describes how the agent's starting situation is chosen at the beginning of each episode. Rather than always starting in the exact same state, the environment may randomly select a starting state according to a probability distribution, which adds variety to the experience the agent collects.

These ingredients let us write the Bellman optimality equation. In model-based settings, where P and R are known or can be estimated, the optimality equation can be used for planning; in model-free settings, algorithms can still learn optimal values from experience without explicitly building the full model.
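
For reference, one common form of the Bellman optimality equation for the state-value function, written with the S, A, P, R, γ ingredients above (and assuming rewards that may depend on the next state), is:

```latex
V^{*}(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^{*}(s') \right]
```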

Afterwards, we unveiled the Markov property, an assumption built into Markov models such as MDPs, which simplifies complex systems by ensuring that all relevant information from the past is captured in the present state. This allows models to make predictions efficiently while avoiding state-space explosions through the use of sufficient statistics.

Building on this we examined Markov Chains, which offer the mathematical framework to model how these systems move between states via transition probabilities.
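
To make transition probabilities concrete, here is a minimal Python sketch of a small, made-up Markov chain; the state names and probabilities are purely illustrative, not from any earlier post.

```python
import numpy as np

# Hypothetical 3-state Markov chain: each row of P gives the probabilities
# of moving from that state to every other state (each row sums to 1).
states = ["sunny", "cloudy", "rainy"]
P = np.array([
    [0.7, 0.2, 0.1],   # from "sunny"
    [0.3, 0.4, 0.3],   # from "cloudy"
    [0.2, 0.3, 0.5],   # from "rainy"
])

def sample_trajectory(start: int, steps: int, rng=np.random.default_rng(0)):
    """Walk the chain by repeatedly sampling the next state from P."""
    s, path = start, [states[start]]
    for _ in range(steps):
        s = int(rng.choice(len(states), p=P[s]))
        path.append(states[s])
    return path

print(sample_trajectory(start=0, steps=5))
```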

Last but not least, we also discussed the exploration vs. exploitation trade-off — a key principle in reinforcement learning that ensures agents don’t just stick to what they know, but also take random or calculated risks to discover better long-term strategies.

This ultimately influences how a policy is shaped over time. An optimal policy balances both: it exploits known high-reward actions while still exploring enough to discover potentially better strategies.

Algorithms like ε-greedy or softmax policies are designed to maintain this balance during training, steadily shifting toward exploitation as the agent learns more about its environment.
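
As a rough sketch of the idea (not tied to any particular library), an ε-greedy rule fits in a few lines; the action-value numbers below are made up for illustration.

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng=np.random.default_rng()) -> int:
    """With probability epsilon take a random action (explore),
    otherwise take the action with the highest estimated value (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

q = np.array([0.1, 0.5, 0.3])          # illustrative action-value estimates for one state
action = epsilon_greedy(q, epsilon=0.1)
```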

Taken together, these ideas form the foundation for teaching machines how to learn, adapt, and make smarter decisions over time.

In this post, we will further examine what policies are, their role in guiding optimal decision-making, and how current research uses them in real-world AI systems.


The 2 Types of Policies and Examples

A policy is the agent’s core decision-making mechanism, defining how each state of the environment is mapped to an action. Effectively it tells the agent: “When you see this particular state, do this action.”

There are two main types of policies:

  • Deterministic Policy – A fixed rule that maps each state to a single, specific action, with no randomness in its decision-making: for any given situation (state), it always chooses the same action. This predictability makes it useful for applications such as industrial control, unlike stochastic policies, which use probabilities. A deterministic policy is represented as a function π(s) = a, where state s leads to action a.
  • Stochastic Policy – Outputs a probability distribution over actions, so the agent may select different actions with certain probabilities; the policy assigns higher probabilities to favourable actions, and vice versa. This is useful for more complex systems, such as environments that require POMDPs or numerous agents (multi-agent RL). The notation used is π(a|s), which represents the conditional probability of taking a specific action a, given a particular state s. A small code sketch of both types follows below.

While π(a|s) is the most common, other equivalent or similar notations are also used.
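
To make the distinction concrete, here is a minimal sketch of how the two types of policy might be represented in code; the state and action counts and all probabilities are hypothetical.

```python
import numpy as np

n_states, n_actions = 3, 2

# Deterministic policy pi(s) = a: one fixed action index per state.
deterministic_pi = np.array([1, 0, 1])

def act_deterministic(state: int) -> int:
    return int(deterministic_pi[state])

# Stochastic policy pi(a | s): one probability distribution over actions per state.
stochastic_pi = np.array([
    [0.2, 0.8],
    [0.6, 0.4],
    [0.5, 0.5],
])

def act_stochastic(state: int, rng=np.random.default_rng()) -> int:
    return int(rng.choice(n_actions, p=stochastic_pi[state]))
```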

An example involving a stochastic policy is an agent that models cell differentiation, the process where an immature, unspecialized cell like a stem cell transforms into a specialized cell type like a neuron or white blood cell.

Using a single-cell RL framework, researchers treat cell differentiation as a partially observable, sequential decision process, where the agent learns a stochastic policy π(a|s) that maps observed cellular states to differentiation actions, helping us understand the stochastic nature of biological processes and what the agent can infer about critical intervention points.

Traditional methods have a hard time pinpointing when cells reach these inflection points. With methods like single-cell RL, by better understanding how cells develop into different types during growth and disease, scientists can guide stem cells into specific cell types, repair damaged tissues, and design treatment plans that intervene precisely in the pathways that have gone wrong.


How Policies Are Learned

There are three common approaches:

1. Policy‑Based Methods

Policy-based methods, which are prominent in robotics, are approaches where the agent learns the policy directly: it learns how to choose actions from states without first learning a value table or score for each action. Instead of asking “how good is this action?”, the agent adjusts the decision-making rule itself so that actions leading to higher long-term rewards become more likely over time. This is often done by slightly changing the policy after each episode or batch of experience in the direction that increases total reward, based on what has worked well in the past.
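
As a small illustration of the idea, here is a REINFORCE-style sketch with a tabular softmax policy; this is an assumption chosen for simplicity rather than the method of any particular paper, and the state/action counts and learning rate are placeholders.

```python
import numpy as np

n_states, n_actions, lr, gamma = 4, 2, 0.1, 0.99
theta = np.zeros((n_states, n_actions))    # one policy parameter per state-action pair

def policy(state):
    """pi(a | s): softmax over this state's parameters."""
    prefs = theta[state] - theta[state].max()
    probs = np.exp(prefs)
    return probs / probs.sum()

def update_from_episode(episode):
    """episode: list of (state, action, reward) tuples.
    Increase the log-probability of each action in proportion to the
    discounted return that followed it."""
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        grad_log = -policy(state)
        grad_log[action] += 1.0            # gradient of log softmax w.r.t. theta[state]
        theta[state] += lr * G * grad_log
```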

2. Value‑Based Methods

Alternatively, value-based methods in reinforcement learning are approaches where the agent learns how good different states or actions are, rather than learning the policy directly. The agent estimates a value function that predicts the expected long-term reward from a state or from taking a specific action, and then chooses actions by selecting the one with the highest estimated value.

The best policy is then obtained by selecting the action with the highest Q-value in each state, which we'll delve into further. In tabular settings, instead of storing a policy explicitly, these methods use lookup tables (Q-tables) to assign values to state-action pairs, Q(s, a).

To paraphrase the key point, value-based methods do not explicitly store or represent the policy — but they implicitly define one through the value function.
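
A tiny tabular Q-learning sketch shows this implicit policy: the only thing stored is the Q-table, and the policy is read off it by taking the argmax (the sizes and step sizes here are assumed for illustration).

```python
import numpy as np

n_states, n_actions, alpha, gamma = 5, 2, 0.1, 0.99
Q = np.zeros((n_states, n_actions))        # Q(s, a) lookup table

def q_learning_update(s: int, a: int, reward: float, s_next: int):
    """Move Q(s, a) toward the reward plus the best estimated value of the next state."""
    td_target = reward + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def greedy_policy(s: int) -> int:
    """No policy is stored explicitly; it is defined by the value table."""
    return int(np.argmax(Q[s]))
```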

3. Actor-Critic Methods

Actor-critic methods bridge policy-based and value-based learning. They maintain two components:

  • The actor, which learns and stores the policy directly and decides which actions to take.
  • The critic, which estimates the value function to judge how good the actor's actions are.

Both the actor and the critic maintain their own sets of parameters, typically as separate function approximators. The critic evaluates the outcomes of the actor's decisions and provides feedback, which the actor uses to update its policy. As with many AI systems, this dual structure allows for more stable learning.
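
Continuing the tabular sketches above (again with assumed sizes and step sizes), a one-step actor-critic update might look like this: the critic computes a TD error, and the actor shifts its policy in the direction the critic judged better or worse than expected.

```python
import numpy as np

n_states, n_actions = 4, 2
actor_theta = np.zeros((n_states, n_actions))   # actor: softmax policy parameters
critic_V = np.zeros(n_states)                   # critic: state-value estimates
lr_actor, lr_critic, gamma = 0.05, 0.1, 0.99

def policy(s):
    prefs = actor_theta[s] - actor_theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def actor_critic_update(s, a, reward, s_next, done):
    """Critic scores the transition (TD error); actor updates its policy with it."""
    td_error = reward + (0.0 if done else gamma * critic_V[s_next]) - critic_V[s]
    critic_V[s] += lr_critic * td_error
    grad_log = -policy(s)
    grad_log[a] += 1.0
    actor_theta[s] += lr_actor * td_error * grad_log
```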

Back in 2016 Google DeepMind developed an AI system that used Deep RL (including elements that function like an actor-critic model) to optimize the control of Google's data center cooling systems, achieving an impressive 40% reduction in energy used for cooling and a 15% reduction in overall Power Usage Effectiveness overhead.

The Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO) algorithms are well-suited for these types of data centre control problems.


Policies in Practice: On‑Policy vs Off‑Policy Learning

In reinforcement learning, on-policy and off-policy describe how an agent learns from experience relative to the policy it is trying to improve.

  • On‑Policy Methods — The agent improves the same policy that it uses to choose actions and collect data. Examples include SARSA.
  • Off‑Policy Methods — The agent collects experience using one policy (behavior policy) while learning or improving another policy (target policy). An example is Q‑learning.

Wrapping it up, on-policy learning (e.g., SARSA) updates the policy based on actions taken by the current policy during exploration. In contrast, off-policy learning (e.g., Q-learning) can learn from data generated by any policy, including past versions of itself or different exploration strategies, allowing it to optimize a target policy independently of the behavior policy used to collect experience.
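
The contrast is easiest to see in the standard textbook update rules (with learning rate α): SARSA bootstraps from the action the current policy actually took next, while Q-learning bootstraps from the best available action regardless of what was taken.

```latex
\begin{aligned}
\text{SARSA (on-policy):}\quad &
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \\
\text{Q-learning (off-policy):}\quad &
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
\end{aligned}
```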

On-policy methods are generally more stable because they learn directly from consistent behavior, but they are less sample-efficient, often requiring fresh data for each update. Off-policy methods are typically more sample-efficient because they can reuse previously collected data (e.g., via replay buffers), but they can be less stable due to the mismatch between the behavior policy and the target policy being learned.

These distinctions are important in research and practice because they influence sample efficiency, stability, and how experience can be reused.


Conclusion

In reinforcement learning:

  • The policy (π) is the agent’s strategy.
  • It’s central to decision‑making and shapes how an agent behaves in an environment.
  • Policies are learned through trial and error and improved iteratively to maximize rewards.

Hopefully you found this post insightful and valuable for your own journey. There are still some concepts I would like to cover in the future to build further intuition, such as Q-learning and policy-based methods in more depth. In the next post, however, I'll be adding a review, for the business side of the blog, of a standout episode from the Y Combinator podcast.

In that post I'll be exploring practical wisdom on designing products that truly differentiate themselves, things that have proven invaluable in my own startup journey.

I hope everyone’s having a marvellous Christmas break so far. Happy holidays and I'll be seeing you next time.
