This repository contains all the code & scripts for my 'Analyzing StackExchange data with Azure Data Lake' talk. This talk highlights the power of Azure Data Lake Store & Analytics and how they can be the center of your big data ecosystem.
During the talk I used a StackExchange data dump to demo loading, storing, processing and visualizing data with Azure Data Lake Store, Data Lake Analytics & Power BI.
Stack Exchange has made the data from all their websites available under a Creative Commons license. It includes data about users, posts, comments, votes, etc. for every single site.
This data is used as a demo set since it reflects real-world data.
Here is an example of how the folder for coffee-stackexchange-com is structured:
+ coffee-stackexchange-com
- Badges.xml
- Comments.xml
- PostHistory.xml
- PostLinks.xml
- Posts.xml
- Tags.xml
- Users.xml
- Votes.xml
You can find the coffee-stackexchange-com sample here, download all the data here or find more information on StackExchange.
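To get a feel for what is inside these files before loading them into Azure Data Lake Store, here is a minimal Python sketch that is not part of the talk's scripts. It assumes each file in the dump is a flat XML document with one `<row>` element per record and the fields stored as attributes, and that in Posts.xml a `PostTypeId` of 1 marks a question and 2 an answer. The inline sample and the helper names `iter_rows` and `count_post_types` are hypothetical and only there for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical three-row excerpt in the assumed shape of Posts.xml;
# the attribute values are made up for illustration.
SAMPLE_POSTS_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="12" Title="How fine should I grind?" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="7" />
  <row Id="3" PostTypeId="1" Score="3" Title="Is cold brew less acidic?" />
</posts>
"""


def iter_rows(source):
    """Yield the attributes of each <row> element as a dict, streaming the input."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)  # copy the attributes before clearing
            elem.clear()  # release the element; the full dumps can be large


def count_post_types(source):
    """Count rows per PostTypeId (assumed: "1" = question, "2" = answer)."""
    return Counter(row.get("PostTypeId") for row in iter_rows(source))


counts = count_post_types(io.BytesIO(SAMPLE_POSTS_XML))
print(counts)
```

To try it on a downloaded file, pass a path such as `coffee-stackexchange-com/Posts.xml` instead of the in-memory sample; `ET.iterparse` accepts either a filename or a file object.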
The demo uses a CSV representing all the countries defined by ISO 3166. This can be found at lukes/ISO-3166-Countries-with-Regional-Codes.
Not a fan of this data set? caesar0301/awesome-public-datasets contains a ton of alternatives.
- Azure Data Lake GitHub repository (link)
- U-SQL Documentation (link)
- "Introducing Azure Data Lake" Microsoft Virtual Academy (link)
- U-SQL Tutorials (link)
- Comparison between Azure Blob Storage & Azure Data Lake Store (link)
- Martin Fowler on Data Lakes (link)
- "Mastering Azure Analytics" by Zoiner Tejada (link)
Licensed under the terms of the MIT license.