I want to get more experience doing data science on large data sets with Spark.  I tend to learn something best by attempting a project I have intrinsic interest in.

So I’m looking for a large, publicly accessible data set that I could do interesting things with. It has to be large enough that distributed storage and processing are actually necessary – I want experience using Spark for its intended purpose, not just familiarity with the Spark APIs. Any suggestions?

Amazon hosts a number of large public data sets on its cloud services:

A lot of interesting stuff here, thank you! This looks like a good repository as well!

