Yahoo releases massive 13.5TB web-browsing data set to researchers

The anonymous data will help create better recommendation engines.

Yahoo's business may be struggling, but millions of people still visit its site to read the news every day. That gives the company unique insights into browsing and reading habits, and today the company has released a huge swath of that data. The "Yahoo News Feed dataset" incorporates anonymous browsing habits of 20 million users between February and May of 2015 across a variety of Yahoo properties, including its home page, main news site, Yahoo Sports, Yahoo Finance, Yahoo Movies and Yahoo Real Estate.

All told, the data set is a whopping 13.5TB and covers 110 billion unique interaction "events." Yahoo calls it the "largest machine learning dataset" ever publicly released, and we're inclined to believe the company -- there aren't very many companies that could accumulate this much browsing data.

It's a huge amount of data, but fortunately you don't need to worry about advertisers mining it to make more targeted ads. Yahoo is specifically releasing it only to the academic research community to help people build more effective recommendation algorithms. As noted by the MIT Technology Review, the data set includes headlines that Yahoo's personalization algorithms show to visitors, a summary of the article, and which specific articles people click. There's also some demographic data for about 7 million users that includes age, gender and location -- but it's all been anonymized.

Improving recommendation algorithms is particularly relevant right now, as some of the biggest web properties depend on good recommendation engines to engage their users. Netflix, Amazon, Google, Apple and Facebook (just to name a few) all rely on serving relevant recommendations to keep people coming back to their products and services. Yes, it's a way for those companies to make more money, but it also generally makes for a better user experience -- as long as those recommendations are good. Yahoo's huge data release will probably go a long way towards meeting that goal.

[Image credit: Noah Berger/Bloomberg via Getty Images]