All told, the data set is a whopping 13.5TB and covers 110 billion unique interaction "events." Yahoo calls it the "largest machine learning dataset" ever publicly released, and we're inclined to believe them -- there aren't very many companies that could accumulate this much browsing data.
It's a huge amount of data, but fortunately you don't need to worry about advertisers mining it to make more targeted ads. Yahoo is specifically releasing it only to the academic research community to help people build more effective recommendation algorithms. As noted by the MIT Technology Review, the data set includes headlines that Yahoo's personalization algorithms show to visitors, a summary of the article, and which specific articles people click. There's also some demographic data for about 7 million users that includes age, gender and location -- but it's all been anonymized.
Improving recommendation algorithms is particularly relevant right now, as some of the biggest web properties depend on good recommendation engines to engage their users. Netflix, Amazon, Google, Apple and Facebook (just to name a few) all rely on serving relevant recommendations to keep people engaged with their products and services. Yes, it's a way for those companies to make more money, but it also generally makes for a better user experience -- as long as those recommendations are good. Yahoo's huge data release will probably go a long way towards meeting that goal.
[Image credit: Noah Berger/Bloomberg via Getty Images]