I went to this talk yesterday.
Here is his Stanford page:
And the research paper he talked about.
Some really interesting stuff. The thing I wanted to see was that they have built a system to really parallelize gradient descent: basically unlimited data (the Internet), an unlimited-size model (billions of parameters), running in parallel across an arbitrary number of machines (2,000 in the talk). NOT done over MapReduce.
Two neat things: it's asynchronous, so the model servers at the bottom each run gradient descent on their own portion of the model and send their parameter DELTAs back up. And it works out that even though different servers are working on different versions and views of the parameters, it still converges.
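To make the delta idea concrete, here's a toy sketch of how I understand it (names like ParameterServer and push_delta are my own illustration, nothing from their actual system): several workers each fetch a possibly stale copy of the parameters, compute a gradient on their own data shard, and push back only the update.

```python
import numpy as np
from threading import Thread, Lock

class ParameterServer:
    """Holds the global parameters; workers send deltas, not full copies."""
    def __init__(self, dim):
        self.params = np.zeros(dim)
        self.lock = Lock()

    def fetch(self):
        with self.lock:
            return self.params.copy()   # may already be stale when used

    def push_delta(self, delta):
        with self.lock:
            self.params += delta        # apply the worker's parameter DELTA

def worker(server, data_shard, lr=0.01, steps=100):
    for _ in range(steps):
        w = server.fetch()                      # stale view of the model
        x, y = data_shard[np.random.randint(len(data_shard))]
        grad = (w @ x - y) * x                  # squared-error gradient, one example
        server.push_delta(-lr * grad)           # send only the update, asynchronously

# Toy problem: recover w_true from linear data split across "machines".
rng = np.random.default_rng(0)
w_true = np.array([2.0, -3.0, 0.5])
data = [(x, w_true @ x) for x in rng.normal(size=(1000, 3))]
shards = [data[i::4] for i in range(4)]         # 4 workers, 4 data shards

server = ParameterServer(dim=3)
threads = [Thread(target=worker, args=(server, s)) for s in shards]
for t in threads: t.start()
for t in threads: t.join()
print(server.params)                            # lands close to w_true despite stale reads
```

Even with four threads racing on the same parameter vector and reading stale values, the final weights land close to w_true, which is a small-scale version of the convergence result they reported.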
They also did an L-BFGS version, which I thought would be faster but actually wasn't, because they also came up with a new analytical method for adjusting the learning rate of gradient descent that was pretty slick.
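They didn't go into much detail that I caught, but the learning-rate trick sounds like an adaptive per-parameter rule in the AdaGrad family (my guess, not confirmed): each parameter's step size shrinks with the root of its accumulated squared gradients, so no hand-tuned global schedule is needed.

```python
import numpy as np

def adagrad_update(w, grad, hist, lr=0.1, eps=1e-8):
    """Per-parameter step: parameters with large past gradients get smaller steps.
    (AdaGrad-style rule; my assumption about the trick described in the talk.)"""
    hist += grad ** 2                        # accumulate squared gradients per parameter
    w -= lr * grad / (np.sqrt(hist) + eps)   # analytically scaled learning rate
    return w, hist
```

In an asynchronous setting this is appealing because every parameter effectively gets its own schedule, instead of one global rate that has to work for all of them.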
They also had a new version of deep learning (which gives some people convulsions, but I think is pretty cool) that was a relaxed version of sparse autoencoding. Deep learning uses unsupervised learning to find new features that are then fed into supervised learning, so the features themselves are learned instead of hand-engineered!
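For context on the sparse autoencoding they relaxed: the vanilla version minimizes reconstruction error plus a penalty that keeps hidden units mostly quiet. Here's a bare-bones sketch of that standard objective (not their relaxed variant, which I'd have to read the paper to reproduce):

```python
import numpy as np

def sparse_autoencoder_loss(X, W, b, W2, b2, rho=0.05, beta=3.0):
    """Reconstruction error plus a sparsity penalty that pushes the average
    activation of each hidden unit toward a small target rho."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))           # hidden activations (sigmoid)
    X_hat = H @ W2 + b2                              # linear reconstruction
    recon = np.mean(np.sum((X_hat - X) ** 2, axis=1))
    rho_hat = H.mean(axis=0)                         # average activation per hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)          # KL-divergence sparsity penalty
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return recon + beta * kl
```

The sparsity penalty is what forces each hidden unit to specialize, and the hidden activations H are the learned features you then hand to the supervised stage.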