I want to get more experience doing data science on large data sets with Spark. I tend to learn something best by attempting a project I have intrinsic interest in.
So I’m looking for some large, publicly accessible data set that I could do interesting things with. It has to be large enough that distributed storage and processing is actually necessary – I want experience using Spark for its intended purpose, not just familiarity with the Spark APIs. Any suggestions?
Amazon hosts a number of large public data sets on their cloud services: https://aws.amazon.com/public-datasets/
A lot of interesting stuff here, thank you!
This looks like a good collection as well: http://www.jenunderwood.com/2016/01/14/my-favorite-public-data-sources
