Past comments: https://news.ycombinator.com/item?id=17766358
This is timely! I have an assignment on these coming up soon. Can anyone with knowledge about this explain something? From what I can tell, many matrix multiplications move vectors so they are more in line with the eigenvectors, if they exist, so Markov chains are just a continual movement in this direction. Some examples that don't do this that I can think of are the identity matrix and rotations. Is there a way to test whether a matrix will have this effect? Is it just testing for the existence of eigenvectors?
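For what it's worth, that intuition is easy to check numerically; here's a minimal sketch with a made-up 3-state transition matrix. Repeatedly applying the chain is exactly power iteration, which converges precisely when there is a unique largest-magnitude eigenvalue — the identity and rotations fail that test, which is why they don't show the effect.

    import numpy as np

    # Made-up 3-state transition matrix: P[i, j] is the probability of
    # moving from state i to state j. Rows sum to 1 (row-stochastic).
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.7, 0.2],
                  [0.2, 0.3, 0.5]])

    # Start from an arbitrary distribution and apply the chain repeatedly.
    v = np.array([1.0, 0.0, 0.0])
    for _ in range(50):
        v = v @ P  # one step of the chain == one step of power iteration

    print("converged distribution:", v)

    # Compare against the left eigenvector of P with eigenvalue 1.
    w, vecs = np.linalg.eig(P.T)
    stat = np.real(vecs[:, np.argmin(np.abs(w - 1.0))])
    print("stationary eigenvector: ", stat / stat.sum())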
Previous submissions:
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
Markov chains are used for my favourite financial algorithm: the allocation of overhead costs in cost accounting. Wish there were an easy way to visualise a model with 500 nodes.
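For anyone curious, that's the reciprocal method of cost allocation, which can be framed as an absorbing Markov chain and solved as a linear system. A minimal sketch — the department names, costs, and allocation percentages below are all invented for illustration:

    import numpy as np

    # Invented example: two service departments (IT, HR) allocate costs
    # to each other and to two production departments (A, B).
    # Entry [i, j] is the fraction of department i's cost flowing to j.
    # Column order: IT, HR, A, B. A and B are absorbing states.
    alloc = np.array([
        [0.0, 0.2, 0.5, 0.3],   # IT
        [0.1, 0.0, 0.4, 0.5],   # HR
    ])
    direct = np.array([100_000.0, 80_000.0])  # departments' own costs

    # Reciprocal allocation: total_i = direct_i + sum_j total_j * alloc[j, i],
    # i.e. total = (I - Q^T)^-1 @ direct, with Q the service-to-service block.
    Q = alloc[:, :2]
    total = np.linalg.solve(np.eye(2) - Q.T, direct)

    # Costs absorbed by the production departments (sums to 180,000).
    absorbed = total @ alloc[:, 2:]
    print("A gets %.0f, B gets %.0f" % (absorbed[0], absorbed[1]))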
What is the secret sauce that makes an LLM better than a Markov chain?
Markov chains are a one-dimensional chain of immediately preceding states, and they base their prediction on that very specific linear chain of states.
A good way to immediately grasp the flaws of a Markov chain is to imagine it predicting a pixel in a 2D image.
Sure, in a 1D chain of states that goes something like [red green blue] repeatedly, a Markov chain is a great predictor. But imagine a 2D image where the pattern is purely in the vertical elements, but you're passing in elements left to right, then next row. It's going to try to make a prediction based on the recent horizontal elements, and there's no good way to have context of the pixel above. A Markov chain doesn't mix two states; its context is the immediate previous chain of states. It's really, really hard for a Markov chain to take the pixel above into context if you're feeding it pixels left to right (the same is true for words, but the pixel analogy is easier to grasp).
Now you may think a good solution to this is to give a Markov chain some mechanism to take multiple previous states from multiple contexts (in this case vertical and horizontal) and somehow weight each one to get the next state. Go down this path and you essentially go down the path of early neural networks.
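A tiny self-contained illustration of that failure mode, with an invented image where each column is a random colour (so the pattern is purely vertical, and the left-to-right Markov context is useless):

    import random
    from collections import Counter, defaultdict

    random.seed(0)
    W, H, COLORS = 32, 32, ["red", "green", "blue"]

    # Each column gets a random colour, constant vertically.
    cols = [random.choice(COLORS) for _ in range(W)]
    img = [[cols[x] for x in range(W)] for _ in range(H)]

    # Order-1 Markov model over raster (left-to-right) order: count
    # which colour follows which when the image is flattened row by row.
    counts = defaultdict(Counter)
    flat = [px for row in img for px in row]
    for prev, nxt in zip(flat, flat[1:]):
        counts[prev][nxt] += 1

    def markov_predict(prev):
        return counts[prev].most_common(1)[0][0]

    # Horizontal Markov context vs. simply copying the pixel above.
    markov_ok = sum(markov_predict(img[y][x - 1]) == img[y][x]
                    for y in range(H) for x in range(1, W))
    above_ok = sum(img[y - 1][x] == img[y][x]
                   for y in range(1, H) for x in range(W))
    print("markov (left pixel) accuracy:", markov_ok / (H * (W - 1)))
    print("copy pixel above accuracy:  ", above_ok / ((H - 1) * W))

The Markov predictor scores around chance (~1/3), while "copy the pixel above" is perfect — the information it needs simply isn't in its state.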
The way I see it is that an LLM *is* a Markov chain. The only difference is a very long context and a lossily compressed state transition "table".
But how to build a Markov chain that does what an LLM does?
I've actually thought about this from time to time and come up with things like 'skip' states that just don't change the context on meaningless words; I've thought of maintaining distant states in some type of context on the Markov chain, multiple contexts in multiple directions mixed with a neural network at the end (as per the 2D image example), etc. But then the big issue is in building the Markov state network.
The attention mechanism and the subsequent multi-layer neural network it uses are honestly extremely inelegant: basically an entire-datacenter sledgehammer of contexts and back-propagated neural networks to make something that works, while Markov chains would easily run on an Apple II if you got it right (it's very fast to linearly navigate a chain of states). But the hard part is making Markov chains that can represent language well. The best we've done is very simple linear next-letter predictors.
I do honestly think there may be some value in stealing word weights from an attention mechanism and making Markov chains out of each of them, so each word starts navigating a Markov chain of nearby words to end up at new states. You'd still mix all the contexts at the end with a neural network, but it skips the computationally expensive attention mechanism. Still a hard problem to build these Markov chains sensibly, though. DeepSeek has shown us there are huge opportunities in optimizing the attention mechanism, and Markov chains are computationally super simple, but 'how do you build good Markov chains' is a very hard problem.
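For scale: the "very simple linear next-letter predictor" mentioned above is genuinely tiny. A minimal order-2 character model over an invented toy corpus:

    import random
    from collections import Counter, defaultdict

    random.seed(1)
    text = "the cat sat on the mat and the cat ran " * 20  # toy corpus

    # Order-2 character model: the state is the last two characters.
    ORDER = 2
    model = defaultdict(Counter)
    for i in range(len(text) - ORDER):
        model[text[i:i + ORDER]][text[i + ORDER]] += 1

    # Generate by walking the chain: sample a next char, shift the state.
    state = text[:ORDER]
    out = state
    for _ in range(60):
        chars, weights = zip(*model[state].items())
        nxt = random.choices(chars, weights=weights)[0]
        out += nxt
        state = out[-ORDER:]
    print(out)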
LLMs share more with a Markov chain than many would like to admit, but the fundamental improvement is that the model is learning the state representation, and that latent representation is essentially a lossy compression of nearly all written human text.
People really underestimate both the magnitude of the parameter space for learning this and the scale of the training data.
But at the end of the day, the model is still just sampling one token at a time and updating its state (and of course, that state is conditioned on all previous tokens, so that's another point of departure).
In the standard next-token Markov model, you just have one node for every token and one probability for every edge. E.g., "if the last letter is 'B', the most likely next letter is 'C'".
To make a big language model in Markov space, you need very large n-grams: "if the last text is 'th', the next letter is 'e'".
These get incredibly sparse very quickly. Most sentences are novel in your dataset.
Neural networks (and other models too) get around this by placing tokens in a multidimensional space, instead of having individual nodes.
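That sparsity is easy to see empirically. A quick sketch with an invented mini-vocabulary, counting what fraction of n-gram contexts occur only once as n grows (on real text the effect is even starker):

    import random
    from collections import Counter

    random.seed(0)
    vocab = ["the", "cat", "dog", "sat", "ran", "on", "mat", "fast"]
    words = [random.choice(vocab) for _ in range(10_000)]

    for n in range(1, 8):
        grams = Counter(tuple(words[i:i + n])
                        for i in range(len(words) - n + 1))
        singles = sum(1 for c in grams.values() if c == 1)
        print(f"n={n}: {len(grams):6d} distinct contexts, "
              f"{singles / len(grams):5.1%} seen only once")

Even with an 8-word vocabulary and 10,000 tokens, by n=6 or so nearly every context has been seen exactly once, so the transition table carries almost no statistics.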
Attention. It's really all you need [1]
1: https://arxiv.org/abs/1706.03762
The attention mafia. Only one is a billionaire now, though. The rewards have been reaped by the adopters instead of the inventors.
Strictly speaking, (non-recurrent) LLMs are Markov chains.
Per e.g. Wikipedia: "In probability theory and statistics, a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event."
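Concretely: with a fixed context window of K tokens, the "state" is the entire window, and one decoding step is one Markov transition. A schematic sketch, with a stub function standing in for a real LLM forward pass (all names here invented):

    import random

    random.seed(0)
    VOCAB = ["a", "b", "c", "<eos>"]
    K = 8  # context window length

    def llm_next_token_dist(window):
        """Stub for a real LLM forward pass: any deterministic function
        of the current window yields valid transition probabilities."""
        h = hash(window) % 100
        w = [(h + 7 * i) % 13 + 1 for i in range(len(VOCAB))]
        s = sum(w)
        return [x / s for x in w]

    # One Markov transition: sample from a distribution that depends
    # ONLY on the current state (the window), then shift the window.
    state = ("a",) * K
    for _ in range(10):
        probs = llm_next_token_dist(state)
        tok = random.choices(VOCAB, probs)[0]
        print(tok, end=" ")
        state = state[1:] + (tok,)  # next state is a function of (state, token)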
Even a dead-simple two-layer neural network is a universal approximator, meaning it can model any relationship, given enough neurons in the first layer, with as much accuracy as you want, subject to available resources for training time and model size.
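A minimal illustration of that claim — a single hidden layer of random tanh features plus a least-squares output layer already fits a nonlinear function, and widening the layer improves the fit (the random-feature variant is chosen here just to keep it a few lines; the target function is invented):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 200).reshape(-1, 1)
    y = np.sin(2 * x) + 0.1 * x**2          # target relationship

    for width in (5, 50, 500):
        # Layer 1: random fixed weights with tanh nonlinearity.
        W = rng.normal(size=(1, width)) * 2.0
        b = rng.normal(size=width)
        H = np.tanh(x @ W + b)
        # Layer 2: linear readout, solved in closed form by least squares.
        coef, *_ = np.linalg.lstsq(H, y, rcond=None)
        err = np.sqrt(np.mean((H @ coef - y) ** 2))
        print(f"width {width:4d}: RMS error {err:.4f}")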
Specific deep learning architectures and training variants reflect the problems they are solving, speeding up training, and reducing model sizes considerably. Both keep getting better, so efficiency improvements are not likely to plateau anytime soon.
They readily handle both statistically driven and pattern-driven mappings in the data. Most modelling systems tend to be better at one of those dimensions than the other, or handle only one.
Learning patterns means they don't just go input->prediction. They learn to recognize and apply relationships in the middle. So when people say they are "just" predictors, that tends to be misleading. They end up predicting, but they will do whatever they need to in between in terms of processing information, in order to predict.
They can learn both hard and soft pattern boundaries, discrete and continuous relationships, and static and sequential patterns.
They can be trained directly on example outputs, or indirectly via reinforcement (learning what kind of outputs will get judged well and generating those). Those are only two of many flexible training schemes.
All those benefits and more make them exceptionally general, flexible, powerful, and efficient tools relative to other classes of universal approximators, for large and growing areas of data-defined problems.
Imagine the nodes in those visualizations are not just a static state, but each has metadata attached that is used to infer the next state in the "chain", keyed on a value assigned from a mysterious lookup table based on the current state — so each time the state shifts, the metadata on all states can also shift.
(There are also types of LLMs where that metadata is limited in access; one such type is when the current state can only check metadata on previous states and weigh it against the base value of the next states in the chain.)
Then, imagine each chain of states in a Markov chain is a 2D hash map, like a grid plot. Our current LLMs are like an Nth-dimensional hash map instead, with a finite but extremely large depth. This is pretty near impossible to visualize as a human, but if you're familiar with array/map operations, you should get the idea.
This is a very "base level" understanding, as my learning on LLMs stopped around the time TensorFlow stopped being the new hotness, but hopefully that gives you an idea.
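The "can only check metadata on previous states" variant described above is just the causal mask in attention. A bare numpy sketch of one masked attention step, with random vectors standing in for learned projections of the token sequence:

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 5, 8  # sequence length, head dimension

    # Random stand-ins for learned query/key/value projections.
    Q = rng.normal(size=(T, d))
    K = rng.normal(size=(T, d))
    V = rng.normal(size=(T, d))

    # Scaled dot-product scores; the causal mask blocks attention to
    # the future, so position t can only "check metadata" on the past.
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf

    # Softmax over each row, then mix values with the attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ V  # each row mixes only current and earlier positions

    print(np.round(weights, 2))  # lower-triangular, including diagonal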
scale
A Markov chain can't tell you the difference, and an LLM can.
Such an awesome project! The visualization is amazing!