**My e-learning has started, a week before the official start of the Stanford online classes.** Some videos and quizzes are already available, and we students have started strong. The Machine Learning class Twitter account has been announcing some interesting numbers: tens of thousands of students enrolled and quizzes attempted.

I have already started learning the basics of Machine Learning, beginning with the **definition of Machine Learning** itself. For some problems, it turns out to be better to let the computer learn by itself instead of programming it explicitly. That’s what the **1959** definition by **Arthur Samuel** is about: ML is the field of study that gives **computers the ability to learn without being explicitly programmed**. This may sound strange, since we know that computers can’t really do anything that we haven’t programmed them to do, or some variation of that. But think about this: Arthur Samuel was not a great checkers player, yet he managed to teach the computer to play and improve with every game, until the **computer became better at checkers than he was**. Likewise, when scientists wanted to teach a helicopter to fly autonomously, they found that the best approach was to let it learn on its own. But how does that happen?

Just as we do, computers can learn from experience. They repeat and learn, repeat and learn. That brings us to the second definition of Machine Learning: **it allows computers to improve their performance over time on a specific task, from experience**. It was formulated by **Tom Mitchell** in **1998**, and the complete phrasing is that a computer is said to *learn* from experience **E** with respect to some task **T** and some performance measure **P** if its performance on T, as measured by P, improves with experience E.

What does that mean concretely? One way is to give a computer a **training set** of data as input and let it look for a function that fits the data while keeping the error small. In other words, it can **infer a function** that it then uses to predict similar results for new inputs (like apartment prices given a set of variables). And I’ve already taken a sneak peek at the mathematics behind that.

All of this is extremely exciting, and I would like to thank Stanford, **Prof. Andrew Ng** and the whole ml-class.org team for this opportunity.


[EDIT] You can find some Machine Learning video lectures by Andrew Ng at http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning

Machine Learning is divided into **supervised and unsupervised** learning. From what I have understood, we use supervised learning when we already know what we are looking for. Either we are trying to classify something, for example deciding whether tomorrow’s weather is going to be cold, warm or hot, or we are looking for an exact numeric value, like an apartment price. **This means that supervised learning is divided into classification problems and regression problems (looking for the numeric value).**

Wikipedia puts it like this (retrieved 2011-01-09):

“Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier (if the output is discrete, see classification) or a regression function (if the output is continuous, see regression). The inferred function should predict the correct output value for any valid input object. This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way (see inductive bias).”

A simple example would be this: we try to predict apartment prices based on the distance from the town center. So we have a training set of many x’s and y’s, where x is the distance and y is the price. We put these on a graph and try to find a function that passes as close to these dots as possible. It doesn’t have to go through all of them; we are just trying to minimize the error. So let’s say the function we are trying to find is linear, like θ0 + θ1*x. That leads to a second function, the cost function J(θ0, θ1), which depends on the two thetas and is the one we have to minimize to make as few mistakes as possible. We can do this by brute force, trying different values and seeing what happens, or with more advanced algorithms. So far I’ve used a gradient descent algorithm, which finds the lowest point of the bowl-shaped cost function. Basically, it does what we would do when going downhill on a mountain: take little steps lower and lower, using the steepness (the derivative) to decide which way to step and how far.
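To make the downhill walk a bit more concrete, here is a minimal Python sketch of gradient descent for the line θ0 + θ1*x. A few things in it are my own assumptions rather than anything stated above: the cost J(θ0, θ1) is taken to be the usual halved mean squared error, the distances and prices are made-up numbers that lie exactly on the line 200 - 10*x, and the step size `alpha` and the number of steps were picked by hand for this toy data.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1): halved mean squared error of the line on the data."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, steps=5000):
    """Take small downhill steps on J, starting from theta0 = theta1 = 0."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        # Prediction error on every training example with the current thetas.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both thetas together, stepping against the slope.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up training set: distance from the center (km) and price (thousands).
distances = [1.0, 2.0, 3.0, 4.0, 5.0]
prices = [190.0, 180.0, 170.0, 160.0, 150.0]

theta0, theta1 = gradient_descent(distances, prices)
print("theta0 =", round(theta0, 3), "theta1 =", round(theta1, 3))
print("cost at start:", cost(0.0, 0.0, distances, prices))
print("cost at end:", cost(theta0, theta1, distances, prices))
```

If `alpha` is too large the steps overshoot the bottom of the bowl and the thetas blow up instead of settling, so the value above should be treated as something to tune, not a rule.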

That’s linear regression. With one variable. Just to get started.

**Unsupervised learning** sounds interesting – it lets us find **patterns that we didn’t define** ahead of time, like when looking for clusters on social networks, or when Google groups together its news. I haven’t learned much about that yet, but I’ll keep you posted.
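As a rough illustration of what finding clusters without defining them ahead of time could look like, here is a tiny Python sketch of k-means, one common clustering algorithm. It is my own choice of example and not something covered above. The points, the two starting centers and the helper name `k_means_1d` are all made up for illustration, and the data is one-dimensional to keep things short.

```python
def k_means_1d(points, centers, iterations=10):
    """Group 1-D points around the given number of centers (simple k-means)."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Made-up data with two visible groups; no labels are given to the algorithm.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, clusters = k_means_1d(points, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

Note that we only tell the algorithm how many groups to look for, not which point belongs where; the grouping itself comes out of the data.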

Have fun, and learn something new every day.