Paper Review: Symmetry and its Discontents

Why should one expect the future to resemble the past?

This is one formulation of Hume’s problem of induction. Consider the claim that the sun will rise tomorrow. Why should we expect this? It is true that it has risen every day we’ve been alive, and every day for a few billion years before that. But why should we expect it to rise tomorrow? It is not logically inconsistent for the sun to rise a large number of days in a row, and then one day not.

Sandy Zabell claims that a theorem due to Bruno de Finetti leads to a partial solution of Hume’s problem of induction. This is one of the central claims of his essay “Symmetry and its Discontents” in the volume of the same name. Since I found this result incredibly interesting, I will focus on this argument. However, the rest of the paper is certainly worth reading as well, and (for example) answered a question I’ve wondered about in the past: why didn’t the ancient Greeks have probability theory? Clearly, then, this post does not exhaust the insights of this paper.

Suppose you have found a coin, and you are flipping it. It can come up either heads or tails. You have certain beliefs about the outcomes of the tosses before you flip it. For example, even though you have never flipped this particular coin before, you might think that heads and tails are equally likely on the next flip. As you flip this coin and observe the outcomes, you might shift your beliefs. For example, if H=heads and T=tails, and you observe the sequence

HHTHTHHTHHTTHHHHHHTH

you might shift your beliefs so that you think heads is more likely than tails on the next toss, since you have observed more heads in this sequence.

One intuitive way to think of this, which you might already be employing, is to think of the bias of the coin. For example, the coin could be biased towards heads, so that the chance of the coin landing heads on each toss is 2/3 (and thus the chance of it landing tails is 1/3). When you first find the coin and you don’t know its bias, you might have a few different hypotheses about the chances. For example, you might think it very likely that the coin has a chance of 1/2 of coming down heads (and 1/2 of coming down tails). But you might also entertain the hypothesis that the coin has a chance of 3/4 of coming down heads, and another hypothesis that it has a chance of 1/4 of coming down heads.

These are different chance hypotheses. It is common in statistical inference to set up a statistical model that has these chance hypotheses, and then as we observe more data (for example, coin flips), we update our beliefs about the chances. We then use our beliefs about how likely it is that the coin has different biases to calculate how likely we think it is for the coin to come up heads (or tails) on the next toss. We also needn’t limit ourselves to considering only a finite number of chance hypotheses; using a little more sophisticated mathematics, we can have a probability distribution over all chance hypotheses.

So we have the following sketch: we find this coin. We have certain beliefs about how likely the coin is to come down heads or tails. We have some uncertainty about what the true chance of the coin coming up heads is, but as we observe more and more flips we can update our beliefs (for example, using Bayes’ theorem) about the chances. We then use our beliefs about the chances to make predictions about how likely the coin is to come up heads on the next toss. This is the kind of Bayesian inference that we take to characterize good inductive reasoning.
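To make this sketch concrete, here is a minimal Python illustration of my own (not from Zabell’s essay) of updating beliefs over a handful of chance hypotheses; the particular hypotheses, prior, and use of the flip sequence from above are just assumptions for the example.

```python
# A minimal sketch (my own illustration) of Bayesian updating over
# a few discrete chance hypotheses for a coin.

hypotheses = [0.25, 0.5, 0.75]   # assumed chance-of-heads hypotheses
prior      = [0.2, 0.6, 0.2]     # assumed prior beliefs over them

def update(beliefs, hypotheses, flip):
    """Update beliefs over chance hypotheses after one flip ('H' or 'T')."""
    likelihoods = [p if flip == 'H' else 1 - p for p in hypotheses]
    unnormalized = [b * lik for b, lik in zip(beliefs, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def prob_heads_next(beliefs, hypotheses):
    """Predictive probability of heads on the next flip."""
    return sum(b * p for b, p in zip(beliefs, hypotheses))

beliefs = prior
for flip in "HHTHTHHTHHTTHHHHHHTH":   # the sequence from the example above
    beliefs = update(beliefs, hypotheses, flip)

print("posterior over chances:", beliefs)
print("P(heads on next flip):", prob_heads_next(beliefs, hypotheses))
```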

In this context, we might think that we have the following Hume’s-problem-of-induction-flavoured problem: what justifies us in believing that there is such a chance in the background? That is, it seems that we are getting to make inferences about the future because we think that this coin has a chance that is not changing, and that this is why the future will resemble the past. What justifies us in believing there is such a chance?

One might also have concerns about chance for other reasons. Zabell notes, for example, that if the exact same coin were spun in the exact same way, it would come up the same way. Barring quantum phenomena, coins are actually deterministic objects. There isn’t really a chance that the coin has. It is more our uncertainty about the toss than the coin itself that is bringing in probability. Although Zabell does not say this in the paper, I have heard other philosophers say this: “chance is cheesy.”

How, then, are we justified in using a statistical model that invokes chances?

This is where de Finetti’s theorem comes in to save the day. Suppose we don’t like chances for one reason or another. I might still be fine with having beliefs about the sequences of observations I could make, without appealing to chance. For example, before I flip the coin, I might think that the sequence

HTHHT

is more likely than

HHHHH

Suppose my beliefs about how the coin will land satisfy the following condition: any two sequences of the same length with the same number of heads get the same degree of belief. Zabell uses the example of three flips. If my beliefs satisfy this condition, then my belief (represented here by the probability function P) is such that

P(HTT) = P(THT) = P(TTH)

and

P(HHT) = P(HTH) = P(THH)

That is, I assign the same probability to sequences with the same number of heads. We should notice two things here. The first is that P(HTT) need not equal P(HHT); these are sequences with different numbers of heads. The second thing to notice is that we have not referred to chances at all here. We are only talking about my belief (represented by the probability function P) that I will observe a certain sequence.

This condition is called exchangeability. Basically, if I assign a certain degree of belief p to observing a sequence (for example HHT), and I assign that same degree of belief p to every rearrangement of that sequence (for example THH), then my beliefs are said to be exchangeable.
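As a small illustration (my own, with a made-up belief assignment), exchangeability can be checked directly by grouping sequences that are rearrangements of one another and testing whether each group gets a single probability:

```python
from itertools import product

# A sketch (my own): check whether a belief function over length-3 sequences
# is exchangeable, i.e. assigns equal probability to rearrangements.

def P(seq):
    """A made-up assignment: probability depends only on the number of heads.
    The per-sequence weights below are illustrative and sum to 1 over all
    eight length-3 sequences."""
    weights = {0: 0.1, 1: 0.15, 2: 0.1, 3: 0.15}
    return weights[seq.count('H')]

def is_exchangeable(P, length=3):
    groups = {}
    for tup in product('HT', repeat=length):
        seq = ''.join(tup)
        groups.setdefault(seq.count('H'), set()).add(P(seq))
    # Exchangeable iff every group of rearrangements receives one value.
    return all(len(values) == 1 for values in groups.values())

print(is_exchangeable(P))  # True for this assignment
```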

Now we can (informally) state de Finetti’s influential result: if the probability function P satisfies exchangeability, then the following two things are true:

  1. The limiting relative frequency of heads in the infinite sequence exists (with probability one).
  2. We can represent this probability function using a distribution over chance hypotheses.

(Of course this is an incredibly informal statement of the theorem–for the actual statement, see Zabell’s essay.)

This is a fairly remarkable theorem. The first of the two results says that if our beliefs over a sequence of observations are exchangeable, then the infinite sequence of such observations will have a limiting relative frequency (something that is not guaranteed for arbitrary sequences). The second says that talking about chances and chance hypotheses is mathematically equivalent to talking about just our beliefs.

These results help to address our concerns. We were concerned before that talking about chances was cheesy for a number of reasons–we didn’t want to commit ourselves to believing in chances. But we see from this result that we don’t have to. Whenever we talk about chances, we could always in principle reduce our chance talk to just statements about our beliefs. Thus, even though we might use chance hypotheses and the associated statistical models for the sake of expediency, this does not commit us to believing in chances.

Furthermore, the first of the two results means that if our beliefs are exchangeable, then we expect that in the long run there will be some stable relative frequency of heads to tails. Of course, talk of heads and tails is not important, and the result is much more general–this can be about whether or not the sun rises for example.

This gives us what Zabell takes to be a partial resolution to Hume’s problem of induction, if the problem is cast a certain way. Zabell describes the problem as follows:

“In the coin-tossing situation, this [problem] reduces to: in a long sequence of tosses, if a coin comes up heads with a certain frequency, why are we justified in believing that in future tosses of the same coin, it will come up heads (approximately) the same fraction of the time?” (p. 5)

We see how de Finetti’s result helps us. If our beliefs about the outcomes are exchangeable, then we believe that there will be a relative frequency of heads to tails in the long run. Thus, from exchangeability alone we get that we expect the world (of observations) to have some kind of global structure. Furthermore, because we can mathematically represent our beliefs as if we had hypotheses about chances, we can do Bayesian inference in a chance framework. Moreover, for all but certain “highly opinionated, eccentric, or downright kinky ‘priors’” (Zabell, p. 5), the posterior distribution over chance hypotheses will be peaked around the observed relative frequency. For example, if we have observed one million tosses, and about 50% of them have come down heads and 50% have come down tails, then the posterior distribution over chance hypotheses will be peaked around the 1/2-heads chance hypothesis.
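To see the “peaking” claim concretely, here is a small sketch of my own using a uniform (Beta) prior over the chance of heads as one example of a non-kinky prior; the numbers are illustrative, not from Zabell.

```python
from math import sqrt

# A sketch (my own): with a Beta(1, 1) (uniform) prior over the chance of
# heads, the posterior after n flips with h heads is Beta(1 + h, 1 + n - h).
# Its mean approaches h/n and its spread shrinks as n grows.

def posterior_summary(h, n, a=1.0, b=1.0):
    a_post, b_post = a + h, b + (n - h)
    mean = a_post / (a_post + b_post)
    var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
    return mean, sqrt(var)

for n in [10, 1_000, 1_000_000]:
    h = n // 2                       # suppose about half the tosses were heads
    mean, sd = posterior_summary(h, n)
    print(f"n={n:>9}: posterior mean={mean:.4f}, sd={sd:.6f}")
```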

Thus, with respect to this particular version of Hume’s problem of induction, “de Finetti vanquishes Hume” (Zabell, p. 5).

As mentioned earlier, this is only one part of the paper. Zabell goes on to examine this assumption of exchangeability, and describes it as a kind of symmetry in our beliefs. He then explores justifications for such symmetry assumptions and the role they played in the historical development of inference and probability theory. I hope this summary has piqued your interest, and I strongly encourage you to read the paper (and the whole book!)–it is well worth it.

Paper Review: Free Variables and Local Causality

When can we treat certain experimental parameters as free variables?

Consider trying to figure out whether or not a certain drug is efficacious. We design an experiment that proceeds as follows:

  1. Break a collection of experimental subjects up into two groups randomly.
  2. To one of the groups give the drug, and to the other give a sugar pill.
  3. Record how the different groups respond to their different conditions.
  4. Analyze the results using some kind of statistical method. If the group that took the drug has a significantly better response than the group that didn’t, then tentatively conclude that the drug was effective. Otherwise conclude that the drug was ineffective.

Obviously this is a toy example; I’ve gone into none of the details, and in particular I do not claim that this exhausts our ability to design better experiments. But it will serve as an example for our purposes.

The reason why we might think something like this method would work is that the experimenters intervene on the system, and give different treatments to different groups. By comparing the results, we are trying to infer something about the causal structure of this situation. Whatever the virtues of the particular statistical test and experimental design we end up using, one of the key parts of this situation is that we think this intervention is independent in some sense from the rest of the situation.

Consider the following case. Suppose we do this experiment, and we get perfect results. By this I mean that everyone who takes the drug gets completely better (or the drug has the maximum intended effect), and the group that takes the sugar pill has no change. Suppose we do this as many times as you like. You can probably feel the pull to the conclusion that the drug is efficacious.

However, suppose I instead offer this hypothesis. There were aliens watching Earth who wanted to mess with us earthlings. So the aliens, using advanced technology, came down, invisible to us, and healed about half of the people who had been chosen for the experiment. They then hacked the brains of the scientists and the randomization process used to divide the groups, so that the people the aliens healed ended up in the group given the drug, and the people the aliens didn’t heal ended up in the sugar pill group. Furthermore, any time we ran another experiment with this drug on Earth, the aliens did the same thing.

This hypothesis also explains the data perfectly. However, it is clear that in this case we cannot make the inference that the drug works. The drug itself could be totally ineffective.

Now, at this moment a word might come to mind.

Conspiracy.

In this case, the experimental parameters were not free variables, since they were determined by the aliens. Thus, the inference we wanted to draw was undermined.

Now this might seem silly. Again, conspiratorial. No one in their right mind would have this as a viable hypothesis for the situation. Not just because of the aliens, mind you, but because the hypothesis that the drug works does not seek to describe the rest of the world. Thus, barring any additional information about aliens and what have you, the kind of hypothesis we care about is not one that lends itself to conspiratorial analysis.

How about in the case of physics? Unlike the drug case, physics seeks to describe the whole world. The world includes the experimenters. When we are evaluating a physical theory, can we treat the experimental parameters as free variables?

This is one of the central questions of physicist John S. Bell’s “Free Variables and Local Causality”.

***The original paper can be found here.***

In Epistemological Letters Bell had argued that quantum mechanics is not locally causal. By this he means that quantum mechanics allows for distant physical systems to affect each other. This violates the locality constraints of special relativity.

His argument relies on treating some experimental parameters as free variables. In particular, you need two humans (or at the very least measuring devices) at physically separated locations who can choose which quantity of a physical system to measure (x as opposed to y spin, for example). If this condition (more technically specified, and with the correct way of setting up the experiment) is satisfied then, Bell showed, quantum mechanics is non-local.

In “Free Variables and Local Causality” Bell is responding to critics of his argument. In particular, Clauser, Horne, and Shimony (CHS) in Epistemological Letters disagree that we can treat the experimental parameters as free variables.

In the context of this thought experiment (the one in which Bell argues quantum mechanics violates local causality), a free variable is one of which there is no record. We can see what this means in the context of the alien experiment. In that case, there are certainly records. In particular, the most relevant record of the choice of which person goes in which group is the set of people whom the aliens healed. It is because of this particular record that we cannot make the inference that the drug was effective. Similarly, in order for Bell’s argument to work there has to be no kind of “common hidden mechanism” (p. 1) correlating the results of the experiment for the spatially separated physical systems. Otherwise, in Bell’s words, “the apparent non-locality could be simulated” (p. 1).

Of course, in this case the idea is not that something like aliens might be the common hidden mechanism, but rather something in the nature of physics itself. Perhaps, for example, the starting conditions of the universe are such that whenever someone makes a choice of which quantity to measure, there will be no case in which non-locality occurs; it would be only simulated.

In order to defend this assumption, the one that rules out these kinds of physical conspiracies, Bell appeals to a distinction between “analyzing various physical theories, on the one hand, and philosophizing about the unique real world on the other hand” (p. 2). He points out that when it comes to the actual physical world we can never in reality see what would have happened had we done something different. This is a problem of counterfactuals. In particular, we cannot ever repeat the same experiment twice. In Bell’s poetic words, “the hands of the clock will have moved, and the moons of Jupiter” (p. 2). Using the theory, in this case quantum mechanics, we calculate the consequences of changing different free parameters in a hypothetical physical situation. We can then evaluate the theory partially on what it predicts will happen under these different hypothetical conditions.

Thus, on Bell’s picture of how we do theoretical physics, it is part of the methodology to see what the theory dictates in different physical setups. For the purposes of doing theoretical physics, then, we must be able to treat certain things in certain contexts as free variables.

The paper is short, and rewards reading. In particular, it is an interesting investigation of the methodology of theoretical physics. If Bell is right, then we must avoid such conspiratorial thinking in physics, for the sake of effective methodology.

Paper Review: Bayes, Bounds, and Rational Analysis

Bayesian learning and decision theory (arguably) characterize rationality for idealized agents. However, carrying out Bayesian calculations can often be costly. In particular, the kind of agent one is–whether a human, lizard, or computer–constrains the kind of information processing one can do. This gestures towards a question: what is the relationship between idealized rationality and rationality for real world agents?

***The original paper can be found here.***

Thomas Icard’s “Bayes, Bounds, and Rational Analysis” offers a very general theoretical model for evaluating the rationality of bounded agents. The natural application of this model is taking into account the computational costs of a physically implemented agent (human, lizard, computer). We can use this framework to tackle the above question. By taking into account the computational limits of a certain kind of actual agent, we can ask whether the optimal agent under those constraints approximates Bayesian rationality, or whether the optimal bounded agent instead follows some method that looks nothing like Bayes.

Icard is particularly interested in how the relationship between idealized Bayesian rationality and bounded rationality plays a methodological role in cognitive science. Cognitive science uses rational analysis to solve a particular problem. When cognitive scientists are trying to understand some aspect of our cognition, whether it is navigating a room or using a language, they try to develop models of how the brain performs this task. Though these models are constrained by the empirical data they gather from experiments, this is often not enough to identify a promising model–the space of possible models is too large. Thus, in conjunction with the empirical data, cognitive scientists often use rational analysis to help identify promising cognitive models. Icard, following Anderson, summarizes the methodology of rational analysis in six steps:

  1. Precisely specify the goals of the cognitive system.
  2. Develop a formal model of the environment to which the system is adapted.
  3. Make the minimal assumptions about computational limitations.
  4. Derive the optimal behavioral function given items 1–3.
  5. Examine the empirical literature to see whether the predictions are confirmed.
  6. If the predictions are off, iterate. (Icard 2018, 82)

Many of the functions that cognitive scientists are interested in can be characterized as inference problems under uncertainty: for example, categorizing new objects one has never seen before, or understanding a new sentence one has never heard before. Thus, since Bayesian methods excel at inference problems under uncertainty, cognitive scientists often use Bayesian rational analysis to identify a plausible model. Often, for a given task and a given (class of) people, Bayesian cognitive scientists will construct a reasonable prior probability distribution P(\theta) over some possible states of affairs and a conditional likelihood function P(s|\theta), which is how likely the agent thinks she is to observe some data s given that \theta is the case. The scientists then suppose that the agent’s responses to certain problems will directly rely on the posterior distribution P(\theta|s), which can be calculated using Bayes’ theorem. Thus, Bayesian cognitive scientists model aspects of our cognition as if we were performing Bayesian inference. The rationale is that Bayesian inference is, in a standard prediction problem, the optimal prediction method (this is Icard’s Fact 1 on page 89). Thus, the reasoning goes, since humans seem relatively successful at many tasks, and Bayesian methods are optimal for many of these tasks, it is a useful methodology to model our cognition as a Bayesian process.

However, there are issues with naïve application of Bayesian rational analysis. One is the problem we noted earlier: Bayesian methods can often be computationally costly and even intractable. A second is that the empirical results show that humans do not, in general, make optimal Bayesian predictions.

There is one very particular way in which humans deviate from making optimal predictions which I found incredibly striking. This is the phenomenon of posterior matching. Suppose you conduct an experiment in which you have a number of participants observe some new information and then perform some kind of prediction task. If we ask them to give a single best prediction, a Bayes-optimal response would be to predict something like the mode of their posterior. However, we do not find this. Instead, we find that the distribution of responses matches the posterior distribution in the model.
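To illustrate the contrast (this is my own toy simulation, with a made-up posterior over three candidate answers), compare a population of Bayes-optimal responders, who all report the posterior mode, with posterior matchers, who sample their responses from the posterior:

```python
import random
from collections import Counter

# A sketch (my own): contrast Bayes-optimal responding (always predict the
# posterior mode) with posterior matching (sample a response from the
# posterior). The posterior itself is made up for illustration.

posterior = {'A': 0.6, 'B': 0.3, 'C': 0.1}   # assumed posterior over answers
participants = 10_000
random.seed(0)

optimal = [max(posterior, key=posterior.get) for _ in range(participants)]
matching = random.choices(list(posterior),
                          weights=list(posterior.values()),
                          k=participants)

print("Bayes-optimal responses:   ", Counter(optimal))    # everyone says 'A'
print("Posterior-matching responses:", Counter(matching)) # ~60/30/10 split
```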

Think about that for a second. Obviously this means that most (or at least a good number of) participants are not making the Bayes-optimal prediction, but I also want you to reflect on how strange this is. Why would humans do this? Icard has a particularly interesting and beautiful hypothesis for this phenomenon that falls right out of his model for evaluating bounded agents. Keep this in the back of your mind–we will return to it later.

In response to both the cost of Bayesian computations and the disagreement with the empirical results, many people have suggested that humans may approximate Bayesian calculations. However, Icard notes that it is one thing to claim that the ideal rational agent is a Bayesian, and another to claim that the best method for bounded agents to follow is one that explicitly approximates Bayes. Thus, he suggests a middle path. He proposes using his model for evaluating bounded agents to see whether or not an approximately Bayesian agent is in fact the optimal agent in certain problems. If this is so, then in these cases cognitive scientists would be justified in using models that approximate Bayesian computations. Icard calls this procedure boundedly rational analysis. Instead of focusing on what the ideal agent would do, pick an appropriate way to bound the agent in a given problem, and then see what the best method is under those constraints. If it is an approximately Bayesian method, then use that. Otherwise, look elsewhere.

Let’s now examine Icard’s model. The details are in the paper itself, and reward study. In this summary I will only try to extract a sketch of the model and a few applications.

This model takes an outside perspective. What I mean by this is that we are not modelling the agent on her own terms, with respect to her own beliefs and goals, but rather with respect to an objective state of affairs (or probability distribution over states of affairs) and some goals. If the agent shares these explicit goals then that is fine, but if not then we can still evaluate her.

Icard builds up the model in a few steps. Consider first evaluating the agent when we know that the state of the world is \theta. The agent receives some data s from the world, and has to choose some action a. We evaluate how well the agent does with respect to a utility function U(a,\theta), which depends on the action she takes and the state of the world. We also suppose that there is some likelihood function P(s|\theta), which again captures our (not the agent’s) perspective of the situation.

Next, we consider the agent as some function \mathcal{A}, which is a function from observations s to probability distributions over actions. Thus, \mathcal{A}(s)(a) is the probability that the agent will respond with action a after observing s. No assumption is made that this agent is Bayesian. Indeed, we can consider any agent.

Now, since we know the state of the world \theta and the data s, we can define a scoring function \sigma that tells us how good \mathcal{A} is in this particular case. In particular, \sigma is a function of \theta, s, and \mathcal{A}. Icard has specific conditions on what type of function \sigma can be, and I encourage those who are interested to consult the paper.

Now we have a very flexible way to evaluate an agent in a particular environment after she has made a particular observation. Using this as the core, Icard builds up a general framework for evaluating arbitrary agents in environments with a probability distribution over states of affairs and signals. Importantly, Icard also extends this framework so that it can incorporate computation costs into the evaluation of how successful the agent is. This gives us a very general framework for evaluating not just the success of different agents in different situations, but also for evaluating the success of different boundedly rational agents in different situations.
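To fix ideas, here is one way the core of the framework might be rendered in code. This is my own toy rendering, not Icard’s formalism: the states, signals, utility table, likelihood, and the use of expected utility as the score \sigma are all assumptions for illustration, and Icard’s conditions on \sigma are more general.

```python
# A sketch (my own, toy version) of the outside-perspective evaluation:
# an agent is a function from observations s to a distribution over actions,
# and we score it against a known state theta, a likelihood P(s|theta),
# and a utility U(a, theta). All the particular numbers are assumptions.

states = ['rain', 'sun']
signals = ['dark_sky', 'clear_sky']

def likelihood(s, theta):
    """P(s | theta), chosen for illustration."""
    return {('dark_sky', 'rain'): 0.8, ('clear_sky', 'rain'): 0.2,
            ('dark_sky', 'sun'): 0.3,  ('clear_sky', 'sun'): 0.7}[(s, theta)]

def utility(a, theta):
    """U(a, theta), chosen for illustration."""
    return {('umbrella', 'rain'): 1.0, ('no_umbrella', 'rain'): -1.0,
            ('umbrella', 'sun'): -0.2, ('no_umbrella', 'sun'): 0.5}[(a, theta)]

def agent(s):
    """An arbitrary agent: a distribution over actions given the signal s."""
    return ({'umbrella': 0.9, 'no_umbrella': 0.1} if s == 'dark_sky'
            else {'umbrella': 0.2, 'no_umbrella': 0.8})

def score(theta, s, agent):
    """One simple choice of sigma: expected utility of the agent's response."""
    return sum(p * utility(a, theta) for a, p in agent(s).items())

def expected_score(theta, agent):
    """Average the score over signals the world might send, weighted by P(s|theta)."""
    return sum(likelihood(s, theta) * score(theta, s, agent) for s in signals)

for theta in states:
    print(theta, round(expected_score(theta, agent), 3))
```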

As an example application, Icard applies this model to the case of posterior matching described above. I won’t go into the details in this summary, though of course, as always, I encourage you to explore them yourself. However, I will provide a sketch of the approach.

Icard formulates a fairly general kind of cost function for physical agents. He considers and rejects one approach that immediately suggests itself. It seems quite natural at first to characterize mental computation costs in terms of the space, time, or energy requirements that a system would need in order to process stimuli and make a decision. However, the problem with this approach is that it is very sensitive to the specific computational architecture of the agent. Thus, we might want a more general treatment of what the costs could be like, one that applies across a broader range of architectures.

This leads Icard to consider the computational costs in terms of the general thermodynamic costs associated with computing. Using the physical idea of work, he defines a class of what he calls entropic cost functions. He extracts and summarizes the intuition in a way that is worth quoting in full:

The intuition here is simple. Without exercising any control over one’s actions, any action would be a priori as likely as any other. Configuring oneself to ensure that a specific action is taken requires some physical work, including in situations when such work takes the form of mere thinking (about what to do, what the answer is, etc.). Even when one implicitly “knows” the right solution to a problem, distilling this solution sufficiently that one may act on it is not a costless process. The assumption is that the cost of this work is inversely proportional to the uncertainty remaining in the agent’s behavioral dispositions. (95)

Making a decision, or changing the probability of taking a certain action, takes physical work. We can also see how this connects back to the posterior matching puzzle. Changing our distribution over actions takes work. Thus, we might perform enough work to move our prior distribution, however it is encoded, to a posterior, and then sample our action from that distribution. Indeed, what this analysis shows is that, when you work out the math, this kind of rule (called a Luce choice rule) is the boundedly optimal solution in a situation with entropic costs (this is his Fact 2, p. 96). This would generate the kind of posterior matching we observe empirically.
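To see how an entropic cost can produce this kind of behaviour, here is a sketch of my own that simplifies Icard’s setup: if we maximize expected utility minus a cost proportional to how far the action distribution is pushed away from maximum entropy, the optimum is a softmax (Luce choice) rule rather than a point mass on the single best action. The utilities and cost weights below are assumptions for illustration.

```python
from math import exp

# A sketch (my own simplification): maximizing expected utility minus an
# entropic cost lambda * (H_max - H(q)) over action distributions q yields a
# Luce-choice / softmax rule, q(a) proportional to exp(u(a) / lambda),
# instead of a point mass on the best action.

def luce_choice(utilities, lam):
    """Optimal action distribution under an entropic cost with weight lam."""
    weights = {a: exp(u / lam) for a, u in utilities.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Toy expected utilities for three responses (assumed for illustration).
utilities = {'A': 1.0, 'B': 0.3, 'C': -0.5}

for lam in [0.1, 1.0, 5.0]:
    dist = luce_choice(utilities, lam)
    print(f"lambda={lam}: " + ", ".join(f"{a}={p:.2f}" for a, p in dist.items()))
# A small lambda (cheap control) approaches the point prediction; a larger
# lambda leaves probability on the other responses, as in posterior matching.
```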

I found this to be a rather remarkable hypothesis to explain the phenomenon of posterior matching. Though I think it will of course take further empirical work to see whether or not this kind of process is in fact what the brain does, this is exactly the kind of application that Icard has in mind for his evaluation model: narrowing down the search space of models for cognitive scientists.

Icard also extracts the following lesson from the investigation of his framework:

The broader lesson of this illustration is that acting optimally across a sufficiently wide range of problems, reflecting the true intricacy of typical cognitive tasks our minds face, is non-trivial. While boundedly rational agents in this setting may fall short of ideal Bayesian optimality—which make sense especially when conditional probabilities P(ϑ|s) are difficult to determine—they must nonetheless move in the direction of what Bayes’ Rule dictates, acting as though they are appropriately, if only partially, incorporating their evidence. (97)

This is a useful moral not only for trying to figure out how existing agents (humans, lizards) work, but also for thinking about how to design intelligent but bounded artificial agents. Indeed, this seems loosely to share the spirit of the Ramsey-Savage style representation theorems. The representation theorems tell us that if an agent’s preferences over gambles conform to certain rationality constraints, then we can represent her as if she were maximizing some utility function with respect to a unique probability function. Whatever her cognitive architecture is in fact doing, we can think of her as if she had utilities and probabilities. The Icard moral is similar: whatever is in fact going on in the agent, whether it be a bacterium, computer, or human, if it is acting optimally (within its bounds) across a sufficiently wide range of problems, then we know that whatever it is doing, it is acting as if it were doing Bayesian updating, if only to an approximation.

I think this paper makes a valuable contribution to the literature on bounded rationality, and rationality in general. Both the model itself and the lessons we can extract from it merit our reflection.

Paper Review: Learnability can be undecidable

“Learnability can be undecidable” argues that the current mathematical frameworks we use to understand learnability are deeply flawed. It has been making big waves in the machine learning community, due to its rather spicy thesis. Thus, I figured it would be a good paper for me to review, and to explain its argument and implications.

***The original paper can be found here.***

The field of machine learning looks both at how we might build machines to learn certain concepts and at which concepts are in principle learnable. For example, I might be interested in having a computer that can tell whether or not an email I receive is spam. In order to design such a program I must think not only about the program itself, but also about how difficult it is to learn to distinguish “spam” from “not-spam”.

This paper is more focused on the way in which we characterize how difficult it is to learn a given class of concepts. It is not that we are trying to characterize how difficult it is to learn a particular concept, but rather how difficult it is to learn a class of concepts. For example, this approach would not be interested in how challenging it is to learn French, but rather how challenging it is to learn a language in general.

Knowing which types of concepts are learnable and which aren’t is very useful; it helps us to decide which kinds of problems we would want to tackle with machine learning. If we know in advance that a class of concepts is unlearnable, then we wouldn’t want to waste time by having a machine learner try to learn a concept from that class.

Ben-David et al. then note that we have rich mathematical frameworks for characterizing learnability. One of the main frameworks is probably approximately correct (PAC) learning. In particular, not only is this a useful framework, but with it we can characterize the learnability of a concept class by a single parameter: Vapnik-Chervonenkis (VC) dimension. Each concept class will have a VC dimension, and it is a theorem that a class is learnable if and only if its VC dimension is finite. The smaller it is, the less data the learner needs to successfully learn a concept from that class.
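To make the idea of VC dimension a little more concrete, here is a small example of my own (not from the paper): the class of one-dimensional threshold classifiers shatters any single point but no pair of points, so its VC dimension is 1.

```python
# A sketch (my own example): the class of 1-D threshold classifiers
# h_t(x) = 1 if x >= t else 0 shatters any one point but no two points,
# so its VC dimension is 1.

def shatters(points):
    """Can threshold classifiers realize every labelling of these points?"""
    # Labellings only change at the points themselves, so these candidate
    # thresholds suffice to enumerate everything the class can do.
    thresholds = [min(points) - 1] + list(points) + [max(points) + 1]
    realizable = set()
    for t in thresholds:
        realizable.add(tuple(1 if x >= t else 0 for x in points))
    return len(realizable) == 2 ** len(points)

print(shatters([0.5]))        # True: one point can be labelled either way
print(shatters([0.3, 0.7]))   # False: the labelling (1, 0) is impossible
```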

Thus, VC dimension is a fairly general way to characterize the (PAC) learnability of a class. There are generalizations of PAC and VC dimension, and all use similar definitions. However, as a corollary to the main result (the title) of their paper, Ben-David et al. claim that there can be no general dimension (like VC dimension) that characterizes learning.

What does it mean for learnability to be undecidable? As they put it, there are “simple scenarios where learnability cannot be proved or refuted using the standard axioms of mathematics” (44). In other words, the standard axioms of mathematics are silent on whether or not some simply characterized concept class can be learned. This is pretty surprising. We have this whole mathematical framework set up in order to answer exactly this kind of question, and yet Ben-David et al. show that not only have we not yet figured out how to characterize whether or not an arbitrary concept class is learnable within this framework, but that this is fundamentally impossible.

How do they do this? The broad strokes of the strategy are as follows: first, for a particular simple machine learning problem, they show that learnability is equivalent to compression (of a set into a smaller set) in some sense. Then they show that compression is intimately linked to the cardinality, or size, of a set. Then they show that the size of the set in the machine learning problem depends on the continuum hypothesis (which I will discuss later), which is famously independent of the standard axioms. Thus, the argument goes from learnability, to compression, to cardinality, to the continuum hypothesis. Since the standard axioms of mathematics are insufficient to determine whether or not the continuum hypothesis is true, they are also insufficient to determine whether or not the simple concept class Ben-David et al. construct is learnable.

I won’t go into the details of the proofs; those are already in the paper for those who want to understand them. But I will provide a sketch of some of the steps, in order to convey the spirit of the argument.

In order to show the link between learnability and compression they consider what they call the finite superset reconstruction game. Consider that we have some set of points X, which could be finite or infinite. We have two players, Alice and Bob. Alice receives a finite subset S \subseteq X. She then has to send a message S' to Bob, which is itself a subset of S. Thus, Bob receives the set S', which is such that S' \subseteq S \subseteq X. Furthermore, since S is finite we know that S' has to be finite. Now, Bob’s job is to look at S' and then make a guess about S. Bob uses some function \eta which outputs finite subsets of X to make his guess. We say that his guess \eta(S') is successful if it contains every member of S, that is, if S \subseteq \eta(S').

Obviously, if Alice just sends S' = S to Bob then Bob can just guess whatever he receives, but the more interesting case is if we require S' \subset S, so that S' contains fewer elements than S.

Clearly, if X is finite then this is trivial. Alice can send whatever S' she wants, and Bob will just guess X itself, since it is finite and certainly contains every member of S. However, things become more interesting when X is infinite, since Bob cannot just guess X at this point, as his guess always has to be a finite set.
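The rules of the game are easy to write down in code. Here is a sketch of my own of the finite case just described, in which Bob can win by ignoring Alice’s message entirely and guessing all of X:

```python
# A sketch (my own) of the finite superset reconstruction game when X is
# finite. Alice sees a finite S subset of X and sends Bob a subset S' of S;
# Bob applies some function eta to S' and succeeds if S is contained in eta(S').

X = set(range(10))                  # a finite "world" of points

def alice(S):
    """Alice may compress aggressively; here she sends nothing at all."""
    return set()

def bob(S_prime):
    """With X finite, Bob can ignore the message and guess all of X."""
    return X

def play(S):
    S_prime = alice(S)
    assert S_prime <= S             # Alice's message must be a subset of S
    guess = bob(S_prime)
    return S <= guess               # success: the guess contains all of S

print(play({1, 4, 7}))              # True: the trivial strategy always wins
```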

How does this connect to learning? The authors suggest that we think of Alice and Bob as two different parts of a learning algorithm. In particular, we might think of Alice as the part of the algorithm that receives data (S) characterizing a concept from the world (X). She then has to choose a subset of these data (S') to send to Bob. Thus, we see that Alice plays the role of a teacher, and Bob a student. Alice needs to choose a subset that Bob can effectively learn from, and Bob needs to figure out how to make a successful guess (\eta(S')) of the concept. The paper discusses in detail how this game is equivalent to the estimating the maximum (EMX) learning problem they consider.

This makes the link between compression and learning a little more obvious. Since EMX can in some sense be described as this game, we can characterize the learnability of the concept classes in EMX according to how much Alice can compress S into S'. The more she can compress it without Bob losing his ability to guess successfully, the easier it is for him to learn–he needs fewer examples to be successful. Thus, we have a deep link between learnability and compressibility that can be exploited later in their argument.

Now we turn to the connection between compressibility and cardinality. So far it has seemed that X has played little role in our discussion, other than the observation that if X is finite then the whole game is trivial. However, the cardinality of X plays a key role in determining whether or not a concept class can be learned.

Suppose that we have some infinite set X. Consider all the finite subsets of X, which I will denote as \mathcal{F}_{fin}^{X}. We can think of \mathcal{F}_{fin}^{X} as our set of possible concepts; each member of \mathcal{F}_{fin}^{X} is a finite subset of X. Thus, when Alice observes some S, S \in \mathcal{F}_{fin}^{X}.

Now, the question is, since X is not finite, can Alice compress S into a set S' \subset S such that Bob is guaranteed to make a good guess (which means that S \subseteq \eta(S'))?

The answer depends on the cardinality of X. In mathematics we denote the cardinality, or size, of the natural numbers with \aleph_{0} (we call a set of this size countably infinite, because you can put its elements in a one-to-one correspondence with the natural numbers). However, there are larger infinities as well. In general, if k > j, then \aleph_{k} is a larger infinity than \aleph_{j}.

We know from Cantor’s diagonal argument that the cardinality of the real numbers is larger than the cardinality of the natural numbers. However, there is a question–is the cardinality of the real numbers equal to \aleph_{1}, or not? In other words, are there any sets whose cardinality lies strictly between that of the naturals and the reals?

The hypothesis that the answer to this question is “no,” and that the cardinality of the reals is \aleph_{1}, is called the “continuum hypothesis” (CH). It is a famous result that the standard axioms of set theory (known as ZFC) are insufficient to determine whether or not CH is true. In other words, CH is neither provable nor disprovable from the standard axioms of mathematics. In particular, if we add stronger axioms we can make CH provable, but we could also add stronger axioms that make the negation of CH provable. In the latter case, the cardinality of the continuum is larger than \aleph_{1}; indeed, it is even consistent with ZFC that there is no natural number k such that \aleph_{k} is the cardinality of the continuum–that there are infinitely many other cardinalities between the cardinality of the naturals and the reals.

For the EMX problem used throughout the paper, the set X is the unit interval, [0,1], which has the cardinality of the continuum. Now we can start to see how learnability depends on CH. Ben-David et al. prove that if the set X has cardinality \aleph_{k} for some natural number k, then the finite subsets of X are compressible, and thus learnable. However, the implication also goes the other way–if the finite subsets of X are compressible, then the set X has cardinality \aleph_{k} for some natural number k.

The set X in the EMX problem is the unit interval, which has the cardinality of the reals. Now suppose CH is true. Then the cardinality of the unit interval is \aleph_{1}, and since 1 is a natural number, the concepts in the EMX problem are learnable. However, suppose instead that the cardinality of the unit interval is not \aleph_{k} for any natural number k (which, as noted above, is consistent with ZFC when CH fails). Then the finite sets of the unit interval are not compressible, and thus not learnable.

Thus, we see that whether or not CH is true directly influences whether or not a particular problem is learnable. Facts about whether or not a certain concept class is learnable are independent of the standard axioms of mathematics. As a corollary, Ben-David et al. show that because of this independence, there can be no parameter characterizing learnability, like we intended with VC dimension.

This is an unhappy state of affairs. The foundations of machine learning try to answer questions about what kinds of concept classes are learnable. To do this, we have developed a rich, expressive mathematical framework. However, Ben-David et al. show that this framework leads to learnability being undecidable.

There are a few lessons we can learn from this. As the authors write, “when trying to understand learnability, it is important to pay close attention to the mathematical formalism we choose to use” (48). They point to a particular problem in the formalism: our choice to talk about functions rather than algorithms. In our discussion of the finite superset reconstruction game and learning in general, we have focused on whether or not certain compression functions existed. There is a great advantage to this. The math is easier to work with, and as the authors point out, the current frameworks “separate the statistical or information-theoretic issues from any computational considerations” (48). However, there are many functions that are not computable–this means that no computer, no matter how much time or memory it was given, could compute the function. Thus, the set of functions on these kinds of infinite sets is a lot larger and more abstract than the set of computable functions. Furthermore, if we get even more specific than computable functions and want to talk about algorithms, we are bringing even more computational details to the table.

When I first heard of the result of this paper an odd question occurred to me, the answer to which I suspected was “no”: did this imply that we could use the physical fact of whether or not an actual computer could learn something to determine whether or not CH were true? However, we see from this insight of functions versus algorithms that the answer is in fact “no.” The learning frameworks we have in place now talk about the existence or non-existence of functions, not algorithms. However, any machine I build would be constrained by the physical laws, and by the limits of computation. Thus, whether or not there exist physically implementable algorithms that can learn is a different question from whether or not there exist functions that can learn, and so we could not infer from the learning success or failure of a physical machine that CH is true or false.

We have a methodological moral for computer science. When choosing a formalism in which to construct our best characterizations of learnability, there are trade-offs between ease and expressibility on the one hand and decidability on the other. In particular, avoiding the weeds of the computational details may push the whole enterprise to the point where it is too divorced from our purposes for it to be useful for us.