Memoryless: First steps into Markov Worlds (part 1)

How a Century-Old Idea Powers Modern AI

When people think of artificial intelligence, they often imagine powerful models that remember everything — tracking patterns across vast amounts of historical data. And in many modern systems, like deep learning networks, that’s largely true.

But one of the most fundamental ideas in AI is built on the exact opposite assumption: that the past doesn't matter — only the present does.

This idea is called the Markov property, and it’s surprisingly useful. It's the basis for entire classes of AI models, from your phone’s next-word prediction to robotic systems making real-time decisions.

In this post, we’ll explore what the Markov property really means, why it’s so powerful, and how it shapes the way intelligent machines learn and act, without necessarily needing a memory.

What is the Markov Property? A Beginner's Guide

Let's start with a simple idea. Imagine that you're playing a board game. Your next move depends only on where your piece is right now, on the current square. It doesn't matter how you got there — whether you had a lucky streak of sixes or an unfortunate series of ones. All that matters is your present position.

That's the Markov property in essence.

It states: “the probability of the next state depends only on the current state, not on the sequence of events that preceded it.”

This means that if we want to predict the next state of a system, we only need to look at its current state (in AI, a state is the collection of information that represents the status of a system at a given moment). All the useful information about the past is encapsulated in that present state; it acts as a sufficient statistic.

This is most often described as being "memoryless", but as we will continue to see, it is really about information sufficiency.
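To make the board-game intuition concrete, here is a minimal Python sketch (the board size and rules are invented purely for illustration): the distribution of the next square depends only on the current square and a fresh die roll, never on how the piece got there.

```python
import random

def next_square(current_square, board_size=40):
    """Advance a piece: the next position depends only on where we are now."""
    roll = random.randint(1, 6)            # the only new information each turn
    return (current_square + roll) % board_size

# No matter what sequence of rolls led to square 17, the distribution of the
# next square is the same -- that is the Markov property.
position = 17
for _ in range(5):
    position = next_square(position)
    print(position)
```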


From Probability Theory to Reinforcement Learning

So how does a model that “forgets the past” become the foundation for teaching machines to make decisions?

In reinforcement learning (RL), agents learn to take actions in an environment to maximize long-term rewards.

At the core of RL lies a critical modeling step: depending on the problem, the environment or task can be formulated mathematically as a Markov Decision Process (MDP), which assumes the Markov property holds for the chosen state representation.
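As a rough sketch of what "modeled as an MDP" can look like in code, here is a toy machine-maintenance example; the states, actions, probabilities, and rewards are all made up for illustration and not taken from any real system.

```python
# A toy MDP: states, actions, transition probabilities, and rewards.
# All numbers are invented for illustration.
states = ["idle", "working", "broken"]
actions = ["run", "repair"]

# transitions[state][action] -> list of (next_state, probability)
transitions = {
    "idle":    {"run": [("working", 1.0)],                  "repair": [("idle", 1.0)]},
    "working": {"run": [("working", 0.9), ("broken", 0.1)], "repair": [("idle", 1.0)]},
    "broken":  {"run": [("broken", 1.0)],                   "repair": [("idle", 1.0)]},
}

# rewards[state][action] -> immediate reward
rewards = {
    "idle":    {"run": 0.0,  "repair": -1.0},
    "working": {"run": 1.0,  "repair": -1.0},
    "broken":  {"run": -2.0, "repair": -1.0},
}
```

The Markov assumption lives in the shape of these tables: the next-state distribution and the reward depend only on the current state and action, never on how the system got there.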

This simple assumption is quite effective and is the engine behind many technologies you use:

  1. Text Prediction and Autocomplete: When your smartphone keyboard suggests the next word, it's often using a Markov model. It calculates the probability of the next word appearing based on the last word you typed. For example, if you type "thank," the model knows there's a high probability the next word is "you." Everything that came before that is unimportant (see the short sketch after this list).

  2. Speech Recognition: Systems like Siri and Alexa use a more advanced version called a Hidden Markov Model (HMM), which we'll discuss further in a later post. They analyze sequences of acoustic features extracted from the raw sound waves, and the model predicts the most likely sequence of words that could have produced those sounds.

  3. Financial Modeling: In finance, some models adopt the Markov property as a simplifying assumption (e.g., classic models like Black-Scholes). However, many advanced AI models do move beyond this to analyze large historical datasets, incorporating factors like volatility clustering and long-term dependencies.
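Here is the short sketch promised above: a minimal bigram (last-word-only) next-word predictor in Python. The tiny corpus is invented purely for illustration; a real keyboard model is trained on vastly more text.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; a real keyboard model learns from far more data.
corpus = "thank you for the help thank you so much thank god".split()

# Count how often each word follows the previous one (a bigram model).
following = defaultdict(Counter)
for prev_word, next_word in zip(corpus, corpus[1:]):
    following[prev_word][next_word] += 1

def suggest(last_word):
    """Suggest the most likely next word given only the last word typed."""
    counts = following.get(last_word)
    return counts.most_common(1)[0][0] if counts else None

print(suggest("thank"))  # -> 'you' (2 of the 3 words after 'thank' are 'you')
```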

The Markov property is a modeling choice rather than a line of code you insert. When an AI researcher decides to use an HMM for speech recognition, or an MDP for teaching an agent to act in a manufacturing setting, they are making the structural assumption that the system they are modeling has this memoryless property.


The Limitations: When Memory Matters

This brings us to its drawbacks. The Markov property is not a perfect representation of reality. Its biggest limitation is the assumption itself. In many real-world scenarios, the past does matter.

For example, long-term climate patterns depend on accumulated conditions over time, not just the immediate previous day's weather. We'll expand further in the next section.

Defining the State

You can, of course, define the state to include whatever historical information is relevant for predicting the future.

In fact, you can theoretically transform any process with memory into a Markovian one. By folding more information into the state, you ensure that the augmented state alone contains everything relevant to future predictions, thereby satisfying the memoryless assumption.

However, this can lead to the "State Space Explosion" problem, which is a major bottleneck in applying classic Markov assumptions to complex systems.

The primary issue is that the augmented state space can become computationally intractable as the dimensionality grows; dimensionality here means the number of variables that describe the system's state at a given time.

That is why a 'sufficient statistic' is imperative — using summary variables instead of the entire past.

These compressed representations include running averages, belief states, and filters (like Kalman filters). We can talk more about these in a future post.
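As a toy sketch of the difference (with invented numbers): one option drags a window of raw history along as the state, while the other compresses the history into a single running average that can serve as a compact summary. Whether such a summary is truly sufficient depends on the process being modeled.

```python
observations = [3.0, 5.0, 4.0, 6.0, 2.0]  # invented data stream

# Option 1: augment the state with raw history (window of the last k values).
# The state grows with k -- the root of state-space explosion.
k = 3
windowed_state = tuple(observations[-k:])

# Option 2: compress the history into a single summary statistic.
# Here, a running mean updated incrementally -- no history stored at all.
count, running_mean = 0, 0.0
for x in observations:
    count += 1
    running_mean += (x - running_mean) / count

print(windowed_state)   # (4.0, 6.0, 2.0)
print(running_mean)     # 4.0
```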


Transformers: Beyond Markov

Modern architectures like the transformer, first introduced in 2017 in the well-known research paper "Attention Is All You Need," were developed specifically to overcome this limitation by incorporating mechanisms that handle long-range dependencies.

The key innovation was the capacity to process an entire data sequence simultaneously.

What Transformers Do Differently

Instead of assuming a "memoryless" process, transformers learn relationships across words, even those far apart in a sentence (see the short sketch after this list). This enables them to handle:

  • Context
  • Syntax
  • Semantics
  • And even multiple languages at once
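For a rough sense of what "processing the entire sequence at once" looks like in code, here is a minimal scaled dot-product self-attention sketch in plain NumPy. It uses random vectors in place of learned embeddings and omits the learned projections and multiple heads of a real transformer.

```python
import numpy as np

np.random.seed(0)
seq_len, dim = 6, 4                       # toy sequence of 6 token vectors
tokens = np.random.randn(seq_len, dim)    # stand-ins for learned embeddings

# Scaled dot-product self-attention (single head, no learned projections):
# every position attends to every other position in one shot.
scores = tokens @ tokens.T / np.sqrt(dim)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # softmax over the sequence
output = weights @ tokens

print(weights.shape)  # (6, 6): each row mixes information from all positions
```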

Are Transformers Still Linked to Markov Ideas?

In a way. Researchers have shown that transformers can both simulate Markov processes and go beyond them. So understanding Markov models helps you grasp how far AI has evolved, and which old limitations it had to break in order to capture long-range predictive structure without an explosion in state-space complexity.

This doesn't make the Markov property obsolete in the slightest; rather, the two concepts serve different purposes and operate in different domains of AI.

To demonstrate their adaptability, and to aid the theoretical understanding, I'll dive further into transformers in a later post and show how they can learn to model various kinds of Markov processes in-context.

The Role of Probability: The Engine of Prediction

So, how does the model "know" what comes next? The answer is probability.

The model won't make perfect predictions, only informed guesses.

Every prediction focusing only on the present state is based on probability, calculated from:

  • Past data
  • Transition frequencies
  • Observed outcomes

Understanding the Transition Matrix


Rather than tracking entire histories, the model learns how likely it is to move from one state to another, a concept known as transition probabilities. AI systems estimate these probabilities by analyzing large datasets — like thousands of weather sequences.
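Here is a minimal sketch of that estimation step, using a tiny invented weather sequence in place of the thousands of real sequences a production system would use.

```python
from collections import Counter, defaultdict

# A toy observed weather sequence; all values are invented for illustration.
history = ["sunny", "sunny", "cloudy", "rainy", "rainy", "sunny",
           "sunny", "cloudy", "cloudy", "rainy", "sunny", "sunny"]

# Count how often each state is followed by each other state ...
counts = defaultdict(Counter)
for today, tomorrow in zip(history, history[1:]):
    counts[today][tomorrow] += 1

# ... and normalize the counts into estimated transition probabilities.
estimated = {
    state: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
    for state, nexts in counts.items()
}

print(estimated["sunny"])  # -> {'sunny': 0.6, 'cloudy': 0.4} for this toy sequence
```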

For instance:
If it’s sunny today, what’s the chance it will be cloudy or rainy tomorrow?
If it’s rainy today, how likely is it to be sunny next?


These probabilities are organized into a transition matrix: a table where each row represents a current state (e.g., sunny or rainy), each column represents a possible next state, and each entry gives the probability of that particular transition.

When you define all possible transitions this way, you build a Markov Chain — a mathematical model that describes how a system evolves from one state to another, assuming the Markov property holds.

The transition matrix serves as the formal representation of this “memoryless” structure — where the current state is considered a sufficient statistic for predicting the next step.

One can begin to see how this concept applies to systems like keyboard text prediction or robotic navigation.
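Putting the pieces together, here is a minimal sketch of the weather chain in Python; the transition probabilities are invented for illustration.

```python
import numpy as np

states = ["sunny", "cloudy", "rainy"]

# Transition matrix: row = current state, column = next state.
# Each row sums to 1. All probabilities are made up for illustration.
P = np.array([
    [0.7, 0.2, 0.1],   # sunny  -> sunny / cloudy / rainy
    [0.3, 0.4, 0.3],   # cloudy -> sunny / cloudy / rainy
    [0.2, 0.4, 0.4],   # rainy  -> sunny / cloudy / rainy
])

rng = np.random.default_rng(42)
state = 0  # start sunny
forecast = []
for _ in range(7):
    state = rng.choice(len(states), p=P[state])  # next state depends only on the current one
    forecast.append(states[state])

print(forecast)
```

Each row of P sums to 1, and sampling the next state uses only the row for the current state, which is exactly the memoryless structure described above.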


To recap:

  • A Markov Chain is the full mathematical model — it describes a system that transitions from one state to another using probabilities governed by the Markov property.
  • A Transition Matrix is one component of a Markov Chain — it’s the table that holds the probabilities of moving between states.

The Mathematical Expression of the Assumption

Now that we’ve built an intuition for the Markov property, let us look at its formal definition using probability theory.

At its essence, the Markov property is expressed by the following equation:

P(next | now, past) = P(next | now)

In words:

“The probability of the next state, given the current state and all previous states, is the same as the probability of the next state given only the current state.”

The Markov property is the claim that all that extra information (the "past") on the left hand side (LHS) is actually irrelevant once you know the "now."

The equality says the LHS reduces to the Right-Hand Side (RHS), which conditions only on the "now".

If this equality doesn't hold, that is, if the past still influences the prediction, then the process isn't Markovian.

To express this idea using standard notation, let's define:

  • Xt = the state of the system at time step t
  • Xt+1 = the next state
  • Xt−1, Xt−2, ... = previous states
  • i, j, k = labels for particular state values (k is just a placeholder for any earlier state's value)

With these in place, the formal definition becomes:

P(Xt+1 = j | Xt = i, Xt−1 = k, ...) = P(Xt+1 = j | Xt = i)

In words: the probability that Xt+1 = j, given Xt = i and even given earlier values like Xt−1 = k, equals the probability that Xt+1 = j given only Xt = i.

A simpler way of stating this is:

“Given that the system is currently in state i, the probability of moving to state j next is unaffected by knowledge of earlier states like k.”

Or, in slightly more formal terms:

“Given Xt = i, the probability of Xt+1 = j is the same whether or not we also condition on earlier states (e.g., Xt−1 = k).”

This is known in probability as:

Conditional independence of the future from the past, given the present.

The formula is the definition itself, written in mathematical notation.
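To see the conditional independence empirically, the following sketch simulates a chain from an assumed transition matrix (the same invented numbers as the weather sketch earlier) and compares the estimate of P(Xt+1 = j | Xt = i) with the estimate that additionally conditions on Xt−1 = k. Up to sampling noise, the two agree.

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.4, 0.4]])

rng = np.random.default_rng(0)
chain = [0]
for _ in range(200_000):
    chain.append(rng.choice(3, p=P[chain[-1]]))
chain = np.array(chain)

i, j, k = 0, 1, 2  # e.g., sunny -> cloudy, with rainy two steps back

# P(X_{t+1} = j | X_t = i): look at every time the chain is in state i.
now_i = chain[:-1] == i
p_given_now = (chain[1:][now_i] == j).mean()

# P(X_{t+1} = j | X_t = i, X_{t-1} = k): also require state k one step earlier.
now_i_past_k = (chain[1:-1] == i) & (chain[:-2] == k)
p_given_now_and_past = (chain[2:][now_i_past_k] == j).mean()

print(p_given_now, p_given_now_and_past)  # both close to 0.2, the value of P[i, j]
```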


A Reminder on Notation

The subscript notation Xt+1, Xt−1, etc., simply refers to time steps. These aren’t mathematical operations like addition or subtraction, just markers:

  • Xt: the current state
  • Xt+1: the next time step
  • Xt−1: the previous one

This formalism helps describe sequences in a clean, mathematical way, and it’s used in AI research.


Although we've covered a fair amount, we're still only scratching the surface. I'll be dedicating at least five parts to Markov models in this blog; there are stacks of important notes to cover on the various kinds of Markov models before we reach a true understanding.

Conclusion and What's Next in the Series

I’ll be sharing Part 2 of this series soon, where we’ll walk through a short technical example that I dealt with some time ago and that demonstrates the Markov property in action. After that, I'll bring to light the world of Hidden Markov Models, where the true state of a system is hidden beneath the surface of what we observe.

The theory behind the Markov property is now over a century old: it was initially developed by Andrey Markov in St. Petersburg, Russia, in the early 1900s.

Andrey's first major work on dependent sequences appeared in 1906, challenging the dominant view that statistical laws required independence.

Although we have various other mathematicians to thank for significantly expanding the theory after Andrey Markov, it was Markov's original work that gave rise to what we now call Markov chains, and laid the mathematical groundwork for vast areas of probability theory, statistics, and modern AI.

Quite interesting to see just how far back many of these things go.

If you have questions, feedback, or would like to chat about any of these topics, feel free to reach out:
Leon.axel9821@gmail.com

Thanks for reading, until next time.