There are two main contributions from machine learning to statistical inference thus far: * The importance of polynomial time search through a space of possible hypotheses * Kernel (or nearest neighbor) methods, which propose a useful set of techniques for avoiding the "curse of dimensionality" This note covers the first (stay tuned for more about the second). -- * Machine Learning, or Inference as Search An important contribution of machine learning has been to look at inference as a search through a space of possible hypotheses for one (or more) that is well supported by a set of data. The idea of a hypothesis space is not novel in statistics -- robustness analysis uses the idea -- but the computational considerations associated with search are not generally considered in statistical analysis. Take a typical discrimination problem. Imagine there is a boolean feature vector F of size k (i.e. k input boolean variables) and we are trying to learn a method to map from instances of F to an boolean output (the dependent variable) from a set of N examples. A classical statistical approach would be to use multiple logistic regression or contingency table methods. If there are many input variables and they are not independent, this problem is generally considered intractable, since we have to compute the entire covariance matrix for the input variables. It is also often the case that we are not sure if each input variable is actually relevant to the discrimination, so some kind of dimensionality reduction might be necessary as well. One way to handle this kind of problem is what statisticians generally disdain as "fishing expeditions." We try looking at various subsets of the variables and the interaction terms, hoping to find a combination that gives good discrimination performance. * Multiple testing The reason that "fishing expeditions" are so discouraged is that it requires that we test many alternate hypotheses. When we do that, we are violating an assumption underlying most hypotheses testing methods, and our estimates of the probability of Type I errors (incorrectly rejecting the null) are too low. The traditional statistical approach to this problem is known as the Bonferoni correction. This correction, which is demonstrably correct when the hypotheses being tested are independent of each other, is in effect an exponential penalty. If A is the desired significance and we are going to test K different hypotheses, then we need to find a hypothesis with a significance level of A* where 1 - A = (1 - A*)^K. This derives simply from the definition of the probability of a Type I error for each hypothesis test and the worse case assumption that the type I errors are mutually exclusive among the hypotheses in the space, given the data that we have. Unfortunately, this exponential correction means that it is very difficult to find significant results when one tests more than a few hypotheses. It is important to remember that the Bonferroni correction is really an *upper bound* derived from a very unusual worst case. For example, the Bonferoni correction is almost certainly too strict in our discrimination problem example. The hypotheses that we are testing are a closely related family; they just involve adding interaction terms for variables we were already considering. Intuitively, it is quite likely that if the data were such that we would make a type I error on one of them, then it is also likely we would make a type I error on at least one other. The fact that type I errors are not mutually exclusive among the hypotheses means that the Bonferroni correction is incorrect. Unfortunately, there is no good model that allows us to quantify the dependencies among a set of related hypotheses, so we do not have a good model-based method for determining the correct significance of a hypothesis found by search. On the other hand, cross validation and randomization testing do allow us to make unbiased estimates of the accuracy of a hypothesis discovered by a search and its significance. Cross-validation and randomization testing are at this point well known statistical methods -- so what is really new and interesting and different about machine learning? * Searching hypothesis spaces One modest contribution is defining the idea of a hypothesis space, or a complete set of related hypotheses. We can make various inferences about such hypotheses spaces, including their size, how subsets of such spaces can be identified by the observations that are compatible with them, etc. However, it is not just the idea of a hypothesis space that is new in machine learning (although Tom Mitchell's 1980's "version space" conception of it is particularly clear and useful). The real contribution of many of these machine learning/data mining tools is the *search* component. In order to understand the importance of search, we have to look at a basic consideration in computer science that is absent (generally) from statistical inference: computational complexity, or the length of time necessary to perform a calculation. * Computational Complexity One of the most basic concepts in computer science is the idea of computational complexity. That is, what is the number of fundamental computational operations (e.g. comparisons, additions or multiplications) that must be performed in order to accomplish a particular task in the worst case? This complexity is generally formulated in terms of the size of the input to the program. This complexity is generally stated in "order" terms, ignoring specific coefficients and focusing on broad classes of computation complexities, such as linear, polynomial or exponential. Since the time it takes to accomplish an elementary operation is generally fixed in most computers (that's what the "Megahertz" ratings of a PC measures), we can talk about the number of operations and the time of execution interchangeably. For example, what is the worst case computational complexity of a program to count the number of occurrences of a particular number in a list of numbers? Since we are doing a worst case analysis, it is clear that the program must at the very least compare the target to each element of the input, so the number of elementary operations to accomplish this task must be at least linear in the size of the input. It is not difficult to show that the obvious program to step through the list and increment a counter when it finds an element which is equal to the target takes time linear in the size of its input. Our example program provides a linear method, and our lower bound says there is no way to improve by more than a constant factor, so we can say that we have an "big O" optimal algorithm ("little o" takes into account the actual coefficients, so there may be a better algorithm in that sense). A more interesting example is the problem of Boolean satisfiability. This problem takes as input a boolean formula, and asks if there is any assignment of its variables to true/false such that the formula as a whole is true. For example, the boolean formula "A and not B" is satisfied only when A is assigned true and B is assigned false. As far as we know, this problem requires time exponential in the size of the input to solve. No one has come up with an algorithm better than trying each possible assignment of truth values to variables, and the number of such assignments is 2^k when k is the number of variables in the formula. Because this takes time proportional to some number raised to the power of the size of the input, it is called an exponential algorithm, and is considered intractable. In general, tractable algorithms must have run time that is polynomial in the size of their inputs. For example, we can prove that the best possible program to sort a list of items must take time proportional to n times log n, where n is the number of items to sort. Many problems have an interesting property: although it takes an exponential time to find a solution to the problem, if one is provided with a possible solution, it takes only polynomial time to verify that the provided solution is indeed a solution. This class of problems is called NP, for non-deterministic polynomial. Boolean satisfiability is an NP problem; if I give you a correct assignment, then it is easy to show that verifying it takes polynomial (in fact, linear) time. The terms NP hard and NP complete, which you may have heard of, are just subsets of NP problems. * Inference as polynomial search Now, let's think about that discrimination problem as a machine learning person would. First, we'd like to define a hypothesis space which includes either the correct explanation for the observations or a reasonable approximation to it. Then, we'd like to describe a method for searching that space so that given some observations (the data) we have a good probability of finding that (approximately) correct hypothesis in the space in a polynomial amount of time. Finally, we'd like to generate an unbiased estimate of the accuracy of our induced hypothesis and the probability of having found it by chance. Let's take these in reverse order. First, we can use cross validation and randomization testing to develop unbiased estimates of the accuracy of our induced hypothesis and the chance of having found a hypothesis at least as predictive as the one we did find simply by chance. Second, we can use the PAC (probably approximately correct) framework to prove things about the ability of algorithms to find reasonably good hypotheses in exponentially large search spaces in polynomial time. What we try to prove is that an algorithm is able to guarantee that with some large probability p it is likely to find a hypothesis that is within some small error e of the correct one, in polynomial time. The computational complexity is generally dependent on p and e so that as you make them smaller and smaller, the complexity increases. You can think of these as approximation algorithms, which are polynomial time algorithms that offer results that are guaranteed to come within some distance of the optimal solution. Generally, approximation bounds are on the quality of the solution (e); in machine learning, we usually have to also introduce a term describing the probability of finding a solution of that quality (p). It is fairly hard to prove such things, but there is a growing collection of demonstrably good induction algorithms out there. There are also lots of polynomial time algorithms which don't have provable PAC properties, but have shown themselves to be of significant empirical value on the basis of cross validated accuracy (and significance) of the hypotheses that they find. The final issue, defining the hypothesis space itself, remains as difficult for machine learning as it is for statistical inference. Human intuition and knowledge must play the same kind of role in defining hypothesis spaces as it does in defining individual hypotheses to test. By allowing us to look at larger collections, we are removing some of the pressure on intuition to identify particular hypotheses, but that doesn't mean that the entire universe of possible hypotheses is searchable. In fact, there is an interesting set of results called the "No free lunch" theorems (Wolpert), which demonstrate that all induction methods, when applied to the entire universe of possible induction problems, perform exactly the same: they do not generalize at all. Fortunately, we don't care very much about the entire universe of possible induction problems. For example, most of us are only interested in induction problems whose solutions are compatible with the laws of physics. At any rate, although we cannot guarantee our induction methods will generalize any better than traditional ones, what we are trying to do in machine learning is to substitute large amounts of data and computational power for at least some human intuition. We need the additional data to avoid the multiple testing problem, and we need the computation to search through exponentially large spaces of hypotheses. If we have these, then we can reduce the demands on our intuition from defining specific hypotheses to test to defining large classes of hypotheses to test. I'm going to focus now on algorithms that are not generally proven in the PAC framework, but which have instead shown good practical utility. I'm going to talk about three different general search methods, and their applications to inductive inference. The search methods are: greedy search, beam search, and gradient descent. Greedy algorithms work by using some approximate decision measure, and reduce the size of the problem incrementally with it. Perhaps the best performing general class of machine learning algorithms are those that use greedy search based on information gain. These algorithms were pioneered by Ross Quinlan (the algorithms known as ID3 and C4.5), and have gotten some play in the statistical community as when Breiman added regression to them (CART). The basic idea in these algorithms is to build a discrimination tree so that after making a set of decisions based on the values of the input variables, the algorithm would reach a leaf node that is fairly "pure" in terms of the output class. Such a tree is built in greedy fashion: The first split is made by picking the input variable that gives the "purest" split, using information gain as the measure of purity. The algorithm is then applied recursively to each side of the split until the results splits are either completely pure, or that the number of observations is so small that further splits cannot be justified. Note that this approach has a problem shared by all greedy methods: it may well be the case that a better tree could be constructed if all of the splits were considered at once, instead of just one at a time. Of course, there are an exponential number of trees, so it is not possible to build a polynomial time algorithm that considers them all. Fixed amounts of "look ahead" can be included in polynomial time (e.g. looking at pairs of splits rather than single ones), but it is always possible that a greedy algorithm will miss a good combination. For this reason, there are no good PAC results for such algorithms, although their real world performance has been excellent. Another variation on the idea of greedy search is called "beam search." In beam search, we are also looking to solve an exponential search problem by looking at a series of simplified decisions, but instead of committing to one particular decision at each step, we keep track of some number of the best solutions seen so far. The number we keep is the width of the beam, and is generally fairly small. Genetic algorithm approaches to inductive inference are conducting beam searches through hypothesis space, with a beam equal to the size of the population. Although beam search also has yet to get any very interesting PAC results, genetic algorithms have also shown pretty good empirical results. A final type of search that we might use is familiar from numeric optimization methods: gradient descent. If we can define a measure on our hypothesis space with certain properties (e.g., having everywhere non-zero first and second derivatives) we can use that measure to direct a search for the global optimum, or at least for a good local optimum. Neural networks define spaces of non-linear combinations of their inputs with an error measure based on the training data, and then use gradient descent techniques to find a particular hypothesis that minimizes the error function. The different types of neural networks and training techniques are all variations that define different hypothesis spaces and optimization techniques, but they all depend on the same basic idea. In summary, one important set of contributions of machine learning to statistical inference is the idea of a polynomial time search through an exponential hypothesis space. If we are clever at how we define our hypothesis spaces, and careful about making unbiased estimates of the accuracy and significance of our results, we can use these techniques to trade data and compute power for fewer demands on our human intuitions.