nostalgebraist:

furioustimemachinebarbarian:

nostalgebraist:

I want to get more experience doing data science on large data sets with Spark.  I tend to learn something best by attempting a project I have intrinsic interest in.

So I’m looking for some large, publicly accessible data set that I could do interesting things with.  It has to be large enough that distributed storage and processing is actually necessary – I want experience using Spark for its intended purpose, not just familiarity with the Spark APIs.  Any suggestions?

Amazon has a bunch of large public data sets in their cloud services: https://aws.amazon.com/public-datasets/

A lot of interesting stuff here, thank you!

http://www.jenunderwood.com/2016/01/14/my-favorite-public-data-sources this looks like a good repository as well!
