I want to get more experience doing data science on large data sets with Spark. I tend to learn something best by attempting a project I have intrinsic interest in.
So I’m looking for some large, publicly accessible data set that I could do interesting things with. It has to be large enough that distributed storage and processing is actually necessary – I want experience using Spark for its intended purpose, not just familiarity with the Spark APIs. Any suggestions?
Amazon hosts a number of large public data sets on their cloud services: https://aws.amazon.com/public-datasets/
A lot of interesting stuff here, thank you!
This looks like a good collection as well: http://www.jenunderwood.com/2016/01/14/my-favorite-public-data-sources
