What is entropy, exactly? First try an easier one: What is gravity? Suppose you had never heard of gravity and asked me what it was. I answer the usual, “attraction at a distance.”
At this point you are as badly off as you were before. Do only certain objects attract each other? How strong is this “attraction”? On what does it depend? In what proportions?
Now I give a better answer. Gravity is a force that attracts all objects directly as the product of their masses and inversely as the square of the distance between them. I may have to backtrack a bit and explain what I mean by “force,” “mass,” “directly,” “inversely,” and “square,” but finally we’re getting somewhere. All of a sudden you can answer every question in the previous paragraph.
Of course I am no longer really speaking English. I’m translating an equation, Fg = G*(m1*m2)/r2. It turns out that we’ve been asking the wrong question all along. We don’t really care what gravity is; there is some doubt that we even know what gravity is. We care about how those objects with m’s (masses) and r’s (distances) act on each other. The cash value is in all those little components on the right side of the equation; the big abstraction on the left is just a notational convenience. We write Fg (gravity) so we don’t have to write all the other stuff. You must substitute, mentally, the right side of the equation whenever you encounter the term “gravity.” Gravity is what the equation defines it to be, and that is all. So, for that matter, is alpha. The comments to the previous sections on alpha theory are loaded with objections that stem from an inability, or unwillingness, to keep this in mind.
In a common refrain of science popularizers, Roger Penrose writes, in the preface to The Road to Reality: “Perhaps you are a reader, at one end of the scale, who simply turns off whenever a mathematical formula presents itself… If so, I believe that there is still a good deal that you can gain from this book by simply skipping all the formulae and just reading the words.” Penrose is having his readers on. In fact if you cannot read a formula you will not get past Chapter 2. There is no royal road to geometry, or reality, or even to alpha theory.
Entropy is commonly thought of as “disorder,” which leads to trouble, even for professionals. Instead we will repair to Ludwig Boltzmann’s tombstone and look at the equation:
S = k log W
S is entropy itself, the big abstraction on the left that we will ignore for the time being. The right-hand side, as always, is what you should be looking at, and the tricky part there is W. W represents the number of equivalent microstates of a system. So what’s a microstate? Boltzmann was dealing with molecules in a gas. If you could take a picture of the gas, showing each molecule, at a single instant–you can’t, but if you could–that would be a microstate. Each one of those tiny suckers possesses kinetic energy; it careers around at staggering speeds, a thousand miles an hour or more. The temperature of the gas is the average of all those miniature energies, and that is the macrostate. Occasionally two molecules will collide. The first slows down, the second speeds up, and the total kinetic energy is a wash. Different (but equivalent) microstates, same macrostate.
The number of microstates is enormous, as you might imagine, and the rest of the equation consists of ways to cut it down to size. k is Boltzmann’s constant, a tiny number, 10-23 or so. The purpose of taking the logarithm of W will become apparent when we discuss entropy in communication theory.
An increase in entropy is usually interpreted, in statistical mechanics, as a decrease in order. But there’s another way to look at it. In a beaker of helium, there are far, far fewer ways for the helium molecules to cluster in one corner at the bottom than there are for them to mix throughout the volume. More entropy decreases order, sure, but it also decreases our ability to succinctly describe the system. The greater the number of possible microstates, the higher the entropy, and the smaller the chance we have of guessing the particular microstate in question. The higher the entropy, the less we know.
And this, it turns out, is how entropy applies in communication theory. (I prefer this term, as its chief figure, Claude Shannon, did, to “information theory.” Communication theory deals strictly with how some message, any message, is transmitted. It abstracts away from the specific content of the message.) In communication theory, we deal with signals and their producers and consumers. For Eustace, a signal is any modulatory stimulus. For such a stimulus to occur, energy must flow.
Shannon worked for the telephone company, and what he wanted to do was create a theoretical model for the transmission of a signal — over a wire, for the purposes of his employer, but his results generalize to any medium. He first asks what the smallest piece of information is. No math necessary to figure this one out. It’s yes or no. The channel is on or off, Eustace receives a stimulus or he doesn’t. This rock-bottom piece of information Shannon called a bit, as computer programmers still do today.
The more bits I send, the more information I can convey. But the more information I convey, the less certain you, the receiver, can be of what message I will send. The amount of information conveyed by a signal correlates with the uncertainty that a particular message will be produced, and entropy, in communication theory, measures this uncertainty.
Suppose I produce a signal, you receive it, and I have three bits to work with. How many different messages can I send you? The answer is eight:
000
001
010
011
100
101
110
111
Two possibilities for each bit, three bits, 23, eight messages. For four bits, 24, or 16 possible messages. For n bits, 2n possible messages. The relationship, in short, is logarithmic. If W is the number of possible messages, then log W is the number of bits required to send them. Shannon measures the entropy of the message, which he calls H, in bits, as follows:
H = log W
Look familiar? It’s Boltzmann’s equation, without the constant. Which you would expect, since each possible message corresponds to a possible microstate in one of Boltzmann’s gases. In thermodynamics we speak of “disorder,” and in communication theory of “information” or “uncertainty,” but the mathematical relationship is identical. From the above equation we can see that if there are eight possible messages (W), then there are three bits of entropy (H).
I have assumed that each of my eight messages is equally probable. This is perfectly reasonable for microstates of molecules in a gas; not so reasonable for messages. If I happen to be transmitting English, for example, “a” and “e” will appear far more often than “q” or “z,” vowels will tend to follow consonants, and so forth. In this more general case, we have to apply the formula to each possible message and add up the results. The general equation, Shannon’s famous theorem of a noiseless channel, is
H = – (p1log p1 + p2log p2 + … pWlog pW)
where W is, as before, the number of possible messages, and p is the probability of each. The right side simplifies to log W when each p term is equal, which you can calculate for yourself or take my word for. Entropy, H, assumes the largest value in this arrangement. This is the case with my eight equiprobable messages, and with molecules in a gas. Boltzmann’s equation turns out to be a special case of Shannon’s. (This is only the first result in Shannon’s theory, to which I have not remotely done justice. Pierce gives an excellent introduction, and Shannon’s original paper, “The Mathematical Theory of Communication,” is not nearly so abstruse as its reputation.)
This notion of “information” brings us to an important and familiar character in our story, Maxwell’s demon. Skeptical of the finality of the Second Law, James Clerk Maxwell dreamed up, in 1867, a “finite being” to circumvent it. This “demon” (so named by Lord Kelvin) was given personality by Maxwell’s colleague at the University of Edinburgh, Peter Guthrie Tait, as an “observant little fellow” who could track and manipulate individual molecules. Maxwell imagined various chores for the demon and tried to predict their macroscopic consequences.
The most famous chore involves sorting. The demon sits between two halves of a partitioned box, like the doorman at the VIP lounge. His job is to open the door only to the occasional fast-moving molecule. By careful selection, the demon could cause one half of the box to become spontaneously warmer while the other half cooled. Through such manual dexterity, the demon seemed capable of violating the second law of thermodynamics. The arrow of time could move in either direction and the laws of the universe appeared to be reversible.
An automated demon was proposed by the physicist Marian von Smoluchowski in 1914 and later elaborated by Richard Feynman. Smoluchowski soon realized, however, that Brownian motion heated up his demon and prevented it from carrying out its task. In defeat, Smoluchowski still offered hope for the possibility that an intelligent demon could succeed where his automaton failed.
In 1929, Leo Szilard envisioned a series of ingenious mechanical devices that require only minor direction from an intelligent agent. Szilard discovered that the demon’s intelligence is used to measure — in this case, to measure the velocity and position of the molecules. He concluded (with slightly incorrect details) that this measurement creates entropy.
In the 1950s, the IBM physicist Leon Brillouin showed that, in order to decrease the entropy of the gas, the demon must first collect information about the molecules he watches. This itself has a calculable thermodynamic cost. By merely watching and measuring, the demon raises the entropy of the world by an amount that honors the second law. His findings coincided with those of Dennis Gabor, the inventor of holography, and our old friend, Norbert Wiener.
Brillouin’s analysis led to the remarkable proposal that information is not just an abstract, ethereal construct, but a real, physical commodity like work, heat and energy. In the 1980s this model was challenged by yet another IBM scientist, Charles Bennett, who proposed the idea of the reversible computer. Pursuing the analysis to the final step, Bennett was again defeated by the second law. Computation requires storage, whether on a transistor or a sheet of paper or a neuron. The destruction of this information, by erasure, by clearing a register, or by resetting memory, is irreversible.
Looking back, we see that a common mistake is to “prove” that the demon can violate the second law by permitting him to violate the first law. The demon must operate as part of the environment rather than as a ghost outside and above it.
Having slain the demon, we shall now reincarnate him. Let’s return for a moment to the equation, the Universal Law of Life, in Part 6:
max E([α – αc]@t | F@t-)
The set F@t- represents all information available at some time t in the past. So far I haven’t said much about E, expected value; now it becomes crucial. Eustace exists in space, which means he deals with energy transfers that take place at his boundaries. He has been known to grow cilia and antennae (and more sophisticated sensory systems) to extend his range, but this is all pretty straightforward.
Eustace also exists in time. His environment is random and dynamic. Our equation spans this dimension as well.
t- : the past
t : the present
t+ : the future (via the expectation operator, E)
t+ is where the action is. Eustace evolves to maximize the expected value of alpha. He employs an alpha model, adapted to information, to deal with this fourth dimension, time. The more information he incorporates, the longer the time horizon, the better the model. Eustace, in fact, stores and processes information in exactly the way Maxwell’s imaginary demon was supposed to. To put it another way, Eustace is Maxwell’s demon.
Instead of sorting molecules, Eustace sorts reactions. Instead of accumulating heat, Eustace accumulates alpha. And, finally, instead of playing a game that violates the laws of physics, Eustace obeys the rules by operating far from equilibrium with a supply of free energy.
Even the simplest cell can detect signals from its environment. These signals are encoded internally into messages to which the cell can respond. A paramecium swims toward glucose and away from anything else, responding to chemical molecules in its environment. These substances act to attract or repel the paramecium through positive or negative tropism; they direct movement along a gradient of signals. At a higher level of complexity, an organism relies on specialized sensory cells to decode information from its environment to generate an appropriate behavioral response. At a higher level still, it develops consciousness.
As Edelman and Tononi (p. 109) describe the process:
What emerges from [neurons’] interaction is an ability to construct a scene. The ongoing parallel input of signals from many different sensory modalities in a moving animal results in reentrant correlations among complexes of perceptual categories that are related to objects and events. Their salience is governed in that particular animal by the activity of its value systems. This activity is influenced, in turn, by memories conditioned by that animal’s history of reward and punishment acquired during its past behavior. The ability of an animal to connect events and signals in the world, whether they are causally related or merely contemporaneous, and, then, through reentry with its value-category memory system, to construct a scene that is related to its own learned history is the basis for the emergence of primary consciousness.
The short-term memory that is fundamental to primary consciousness reflects previous categorical and conceptual experiences. The interaction of the memory system with current perception occurs over periods of fractions of a second in a kind of bootstrapping: What is new perceptually can be incorporated in short order into memory that arose from previous categorizations. The ability to construct a conscious scene is the ability to construct, within fractions of seconds, a remembered present. Consider an animal in a jungle, who senses a shift in the wind and a change in jungle sounds at the beginning of twilight. Such an animal may flee, even though no obvious danger exists. The changes in wind and sound have occurred independently before, but the last time they occurred together, a jaguar appeared; a connection, though not provably causal, exists in the memory of that conscious individual.
An animal without such a system could still behave and respond to particular stimuli and, within certain environments, even survive. But it could not link events or signals into a complex scene, constructing relationships based on its own unique history of value-dependent responses. It could not imagine scenes and would often fail to evade certain complex dangers. It is the emergence of this ability that leads to consciousness and underlies the evolutionary selective advantage of consciousness. With such a process in place, an animal would be able, at least in the remembered present, to plan and link contingencies constructively and adaptively in terms of its own previous history of value-driven behavior. Unlike its preconscious evolutionary ancestor, it would have greater selectivity in choosing its responses to a complex environment.
Uncertainty is expensive, and a private simulation of one’s environment as a remembered present is exorbitantly expensive. At rest, the human brain requires approximately 20% of blood flow and oxygen, yet it accounts for only 2% of body mass. It needs more fuel as it takes on more work.
The way information is stored and processed affects its energy requirements and, in turn, alpha. Say you need to access the digits of π. The brute-force strategy is to store as many of them as possible and hope for the best. This is costly in terms of uncertainty, storage, and maintenance.
Another approach, from analysis, is to use the Leibniz formula:
Î /4 = 1 – 1/3 + 1/5 – 1/7 + 1/9 – …
This approach, unlike the other, can supply any arbitrary digit of π. And here you need only remember the odd numbers and an alternating series of additions and subtractions.
Which method is more elegant and beautiful? Which is easier?
Human productions operate on this same principle of parsimony. Equations treat a complex relation among many entities with a single symbol. Concepts treat an indefinite number of percepts (or other concepts). Architects look at blueprints and see houses. A squiggle of ink can call up a mud puddle, or a bird in flight. The aim, in every case, is maximal information bang for minimal entropy buck.
In an unpredictable environment, decisions must be made with incomplete information. The epsilon of an alpha model depends on its accuracy, consistency and elegance. An accurate model corresponds well to the current environment, a consistent model reduces reaction time, and an elegant model reduces energy requirements. Everything, of course, is subject to change as the environment changes. The ability to adapt to new information and to discard outdated models is just as vital as the ability to produce models in the first place.
Thus Eustace generates his alpha* process, operating on some subset of F@t- where t is an index that represents the increasing set of available information F. As Eustace evolves, the complexity of his actions increases and his goals extend in space and time, coming to depend less on reflex and more on experience. He adapts to the expected value for alpha@t+, always working with an incomplete information set. As antennae extend into space, so Eustace’s alpha model extends into a predicted future constructed from an experienced past.