I’ve seen a lot of attempts to explain neural networks to general audiences lately, and I’ve been disappointed with the amount of mysticism in them. People talk about “brains” and “artificial intelligence” as though deep learning were summoning some dark force from the netherworld to rise up into our plane of existence and take all our jobs.
But neural nets aren’t magic.
Neural networks, like all machine learning techniques, are a way to write computer programs where you don’t know exactly all the rules. Normally when you program a computer you have to write down exactly what you want it to do in every contingency. “Generate a random number and put it in a memory cell called random_number. Print ‘Guess a number’ on the screen. Make a memory cell called user_input. Read the characters the user typed until they press enter, convert that to a number, and save it in user_input. If user_input equals random_number, print ‘Good job’, otherwise print ‘Guess again’.”
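Here’s roughly what that spelled-out recipe looks like in Python. It’s only a sketch: the function name play_round, the 1 to 10 range, and the read_line/write parameters (there so the game can be driven without a keyboard) are my own assumptions, not part of the recipe above.

```python
import random


def play_round(random_number, read_line=input, write=print):
    """Play one round of the guessing game and report whether the guess was right."""
    write("Guess a number")
    # Read what the user typed (up to Enter) and convert it to a number.
    user_input = int(read_line())
    if user_input == random_number:
        write("Good job")
        return True
    write("Guess again")
    return False


def new_secret(low=1, high=10):
    """Generate the random number to guess (the range is an arbitrary choice)."""
    return random.randint(low, high)
```

Calling play_round(new_secret()) plays one round at the terminal. Notice that every single step had to be written down by hand.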
Everything has to be spelled out precisely, and no background knowledge can be assumed. That’s fine if you (the programmer) know exactly what you want the computer to do. If you’re writing a piece of banking software you know exactly what the rules for tabulating balances are, so programming is just a matter of writing them down. If you’re doing something like route planning you might have to think a little harder about what the rules should be, but a lot of smart people have spent a long time thinking up rules that you can look up on Wikipedia. All is well.
But what if you want the computer to do something where nobody knows the rules, and the best minds have failed in their attempts to come up with them? Machine translation is a great example. As much as you might admire Chomsky, there still isn’t a complete system of rules to parse and translate between languages, despite decades of research.
So, being a pragmatic engineer with computing power to burn, what do you do? Well, while you might not know what the rules to translate text are, you do know how to write a computer program that can tell whether another computer program is able to translate text. You write some software that goes out on the web and slurps down millions of web pages that have been translated into multiple languages, and pulls out all the text. Then you write a program that tests a translation program by giving it text from a page in one language and comparing the translation program’s output to the human translation of that text from the same page. The more of the words that are the same between the program output and the human translation, the better the program is doing. (This is a massive oversimplification, and sentence alignment is really tricky, but let’s ignore that for the purposes of this post.) In the jargon this test program is called a loss function.
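To make the testing program concrete, here’s a minimal sketch of a word-overlap loss. The names translation_loss and average_loss are made up for illustration, and counting missing words is a deliberately crude stand-in for the metrics real translation systems use. I’ve flipped “more shared words is better” around so that a lower score is better, which is how losses are usually written.

```python
def translation_loss(program_output, human_translation):
    """Fraction of the human translation's words missing from the program's output.

    0.0 means every reference word showed up; 1.0 means none did.
    """
    reference_words = human_translation.lower().split()
    if not reference_words:
        return 0.0
    output_words = set(program_output.lower().split())
    missing = sum(1 for word in reference_words if word not in output_words)
    return missing / len(reference_words)


def average_loss(translate, sentence_pairs):
    """Score a translation program over (source_text, human_translation) pairs."""
    losses = [
        translation_loss(translate(source), reference)
        for source, reference in sentence_pairs
    ]
    return sum(losses) / len(losses)
```

Here translate is whatever candidate program you’re testing: any function that takes text in one language and returns text in the other.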
There’s another thing that works to your advantage. While computers might be simplistic, they’re blindingly fast. And what’s a technique that gets you a solution if you’re simple but fast? Trial and error! So you need two pieces: you need a way for the computer to make tweaks to a broken program, and you need a way for it to keep the tweaks that made it work better and throw away the ones that made it work worse.
The algorithm for doing that tweaking and checking is called Stochastic Gradient Descent. You represent your program as a massive grid of millions of numbers (called a weight matrix). You can think of it as a giant grid of knobs arranged in layers, where each layer is wired into the next one. Input goes into the bottom layer, and signals flow through until they get to the top, which is where you see the output. The positions of the knobs determine how the input gets transformed into the output. If you have enough knobs (and you have millions of them) that grid could do some very complicated things.
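If it helps to see the knobs, here’s a tiny sketch of a layered grid in plain Python. The names make_knobs and forward, the layer sizes, and the tanh squashing step between layers are all my assumptions for illustration; a real network would use a numerical library and millions of knobs rather than a few dozen.

```python
import math
import random


def make_knobs(layer_sizes, rng):
    """Build one grid of knobs per pair of neighbouring layers, set at random."""
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


def forward(knobs, signal):
    """Push an input signal through each layer in turn and return the output."""
    for layer in knobs:
        # Each output is a knob-weighted sum of the incoming signals,
        # squashed into the range -1..1 (tanh is an assumed choice here).
        signal = [
            math.tanh(sum(knob * s for knob, s in zip(row, signal)))
            for row in layer
        ]
    return signal


knobs = make_knobs([3, 4, 2], random.Random(0))  # 3 inputs, 4 hidden, 2 outputs
output = forward(knobs, [1.0, 0.5, -0.5])
```

Change any knob and the output changes with it, which is the whole point: the “program” is nothing but the knob positions.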
You start by setting the knobs to random positions. You run the loss function on the knobs and get a score. You make a random tweak to some of the knobs, run it again and get another score. You then compare the new score to the old score. If the score is better (I’m saying “better” and not “higher” because lower scores are better and that’s confusing), you tweak more in the same direction as last time. If the score is worse, you go in the opposite direction. In math terms you move iteratively down the gradient of the loss with respect to the weight matrix, which is where the “descent” in the name comes from.
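Here’s the tweak-and-check loop as a sketch. To be clear about what this is: it’s the trial-and-error version described above, not textbook Stochastic Gradient Descent, which works out the downhill direction with calculus instead of guessing at it. The name tune_knobs, the step size, the step count, and the toy loss are all assumptions for illustration.

```python
import random


def tune_knobs(loss, knobs, steps=2000, step_size=0.1, seed=0):
    """Trial-and-error tuning: keep tweaks that lower the loss, drop the rest."""
    rng = random.Random(seed)

    def random_tweak():
        return [rng.gauss(0.0, step_size) for _ in knobs]

    knobs = list(knobs)
    score = loss(knobs)
    tweak = random_tweak()
    try_opposite = True  # whether flipping the current tweak is still untested
    for _ in range(steps):
        candidate = [k + t for k, t in zip(knobs, tweak)]
        new_score = loss(candidate)
        if new_score < score:
            # Better: keep the change and keep going the same way next time.
            knobs, score = candidate, new_score
            try_opposite = False
        elif try_opposite:
            # Worse: go in the opposite direction.
            tweak = [-t for t in tweak]
            try_opposite = False
        else:
            # Neither way helps any more: make a fresh random tweak.
            tweak = random_tweak()
            try_opposite = True
    return knobs, score


def toy_loss(knobs, target=(3.0, -2.0)):
    """Stand-in loss: squared distance from a hidden target. Lower is better."""
    return sum((k - t) ** 2 for k, t in zip(knobs, target))


best_knobs, best_score = tune_knobs(toy_loss, [0.0, 0.0])
```

With a real network you’d swap toy_loss for the testing program and the two knobs for millions of them, at which point guessing directions gets hopelessly slow and the calculus earns its keep.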
You have your computer (or computers, as the case may be) make many, many, many tweaks, and if you’re lucky you wind up with a program that does what you want.
That’s really it. Neural networks are just computer programs that computers come up with by trial and error given a testing program written by a programmer. No magic here.
If you want a more in-depth discussion of neural nets that goes into the math I highly recommend this lecture.