
Microsoft AI researchers mistakenly leaked 38TB of company data

The exposure came through a SAS link that granted access to an entire Azure storage account.


A Microsoft AI research team that uploaded training data to GitHub, in an effort to offer other researchers open-source code and AI models for image recognition, inadvertently exposed 38TB of personal data. Wiz, a cybersecurity firm, discovered that a link included in the files led to storage containing backups of Microsoft employees' computers. Those backups contained passwords to Microsoft services, secret keys and over 30,000 internal Teams messages from hundreds of the tech giant's employees, Wiz says. Microsoft says in its own report of the incident, however, that "no customer data was exposed, and no other internal services were put at risk."

The link was deliberately included with the files so that interested researchers could download pretrained models; that part was no accident. Microsoft's researchers used an Azure feature called "SAS tokens," which allows users to create shareable links that give other people access to data in their Azure Storage account. Users can choose what can be accessed through a SAS link, whether it's a single file, a full container or the entire storage account. In Microsoft's case, the researchers shared a link that had access to the full storage account.
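To make the difference between those scopes concrete, here is a minimal Python sketch that guesses the scope of a SAS link from its query string. It assumes the standard Azure SAS parameter names (`srt` and `ss` appear on account-level tokens, `sr` is `c` for a container or `b` for a blob, `sig` carries the signature). The function name `describe_sas_scope` and the example URLs are hypothetical, and this is not the token from the incident.

```python
from urllib.parse import urlsplit, parse_qs


def describe_sas_scope(url: str) -> str:
    """Return a rough label for what a SAS URL can reach (illustrative only)."""
    params = parse_qs(urlsplit(url).query)
    if "sig" not in params:
        # No signature parameter, so this does not look like a SAS URL.
        return "not a SAS URL"
    if "srt" in params or "ss" in params:
        # Account SAS tokens carry signed resource types / signed services.
        return "account"
    # Service SAS tokens name a single resource kind in `sr`.
    resource = params.get("sr", [""])[0]
    return {"c": "container", "b": "blob"}.get(resource, "unknown")


# Hypothetical URLs with placeholder signatures.
account_url = (
    "https://example.blob.core.windows.net/models"
    "?sv=2020-08-04&ss=b&srt=sco&sp=rl&se=2030-01-01T00:00:00Z&sig=PLACEHOLDER"
)
blob_url = (
    "https://example.blob.core.windows.net/models/weights.bin"
    "?sv=2020-08-04&sr=b&sp=r&se=2030-01-01T00:00:00Z&sig=PLACEHOLDER"
)
print(describe_sas_scope(account_url))  # account
print(describe_sas_scope(blob_url))     # blob
```

A link meant to share one model file would ideally fall in the "blob" case; an "account" result on a publicly posted link is the kind of mismatch described above.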

Wiz discovered and reported the security issue to Microsoft on June 22, and the company had revoked the SAS token by June 23. Microsoft also explained that it rescans all its public repositories, but its system had marked this particular link as a "false positive." The company has since fixed the issue, so that its system can detect SAS tokens that are more permissive than intended. While the particular link Wiz detected has been revoked, improperly configured SAS tokens elsewhere could still lead to data leaks and serious privacy problems. Microsoft acknowledges that "SAS tokens need to be created and handled appropriately" and has also published a list of best practices when using them, which it presumably (and hopefully) practices itself.
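As a rough illustration of what scanning for overly permissive tokens might involve, the sketch below searches a block of text for SAS URLs and flags those that grant write, create or delete permissions, or that expire far in the future. It is not Microsoft's scanner. The function name `audit_sas_urls`, the seven-day lifetime threshold and the sample URLs are assumptions chosen for the example, and it ignores tokens whose expiry is defined by a stored access policy rather than in the URL.

```python
import re
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit, parse_qs

URL_PATTERN = re.compile(r"https://[^\s\"'<>]+")
RISKY_PERMISSIONS = set("wdc")  # w = write, d = delete, c = create


def audit_sas_urls(text, now, max_lifetime=timedelta(days=7)):
    """Return (url, reasons) pairs for SAS URLs that look too permissive."""
    findings = []
    for url in URL_PATTERN.findall(text):
        params = parse_qs(urlsplit(url).query)
        if "sig" not in params:
            continue  # not a SAS URL
        reasons = []
        granted = set(params.get("sp", [""])[0])
        if granted & RISKY_PERMISSIONS:
            reasons.append("grants write, create, or delete")
        expiry_text = params.get("se", [""])[0]
        try:
            # Normalize a trailing "Z" so older Python versions can parse it.
            expiry = datetime.fromisoformat(expiry_text.replace("Z", "+00:00"))
        except ValueError:
            expiry = None
        if expiry is None:
            reasons.append("no parseable expiry")
        else:
            if expiry.tzinfo is None:
                expiry = expiry.replace(tzinfo=timezone.utc)
            if expiry - now > max_lifetime:
                reasons.append("expires too far in the future")
        if reasons:
            findings.append((url, reasons))
    return findings


# Hypothetical README text with placeholder signatures.
sample = """
Download: https://example.blob.core.windows.net/m/w.bin?sr=b&sp=r&se=2023-09-03T00:00:00Z&sig=AAA
Mirror: https://example.blob.core.windows.net/m?ss=b&srt=sco&sp=rwdl&se=2051-10-06T00:00:00Z&sig=BBB
"""
now = datetime(2023, 9, 1, tzinfo=timezone.utc)
findings = audit_sas_urls(sample, now)
for url, reasons in findings:
    print(url, reasons)
```

Only the second link is reported here, for both its permissions and its expiry date; the read-only link that expires within two days passes.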