I want to get more experience doing data science on large data sets with Spark. I tend to learn best by attempting a project that I find intrinsically interesting.

So I’m looking for a large, publicly accessible data set that I could do interesting things with. It has to be large enough that distributed storage and processing are actually necessary: I want experience using Spark for its intended purpose, not just familiarity with the Spark APIs. Any suggestions?

Amazon has a bunch of public large data sets in their cloud services:

A lot of interesting stuff here, thank you! This looks like a good repository as well!

