IBM's CodeNet dataset can teach AI to translate computer languages

It is the ImageNet of code.


AI and machine learning systems have become increasingly competent in recent years, capable of not just understanding the written word but writing it as well. But while these artificial intelligences have nearly mastered the English language, they have yet to become fluent in the language of computers — that is, until now. IBM announced during its Think 2021 conference on Monday that its researchers have crafted a Rosetta Stone for programming code.

Over the past decade, advancements in AI have mainly been “driven by deep neural networks, and even that, it was driven by three major factors: data with the availability of large data sets for training, innovations in new algorithms, and the massive acceleration of faster and faster compute hardware driven by GPUs,” Ruchir Puri, IBM Fellow and Chief Scientist at IBM Research, said during his Think 2021 presentation, likening the new data set to the venerated ImageNet, which has spawned the recent computer vision land rush.

“Software is eating the world,” Marc Andreessen wrote in 2011. “And if software is eating the world, AI is eating software,” Puri remarked to Engadget. “It is this relationship between the visual tasks and the language tasks, when common algorithms could be used across them, that has led to the revolution in breakthroughs in natural language processing, starting with the advent of Watson Jeopardy, way back in 2012,” he continued.

In effect, we’ve taught computers how to speak human, so why not also teach computers to speak more computer? That’s what IBM’s Project CodeNet seeks to accomplish. “We need our ImageNet, which can snowball the innovation and can unleash this innovation in algorithms,” Puri said. CodeNet is essentially the ImageNet of code. It’s an expansive dataset designed to teach AI/ML systems how to translate code, and it consists of some 14 million snippets and 500 million lines spread across more than 55 legacy and active languages, from COBOL and FORTRAN to Java, C++, and Python.

“Since the data set itself contains 50 different languages, it can actually enable algorithms for many pairwise combinations,” Puri explained. “Having said that, there has been work done in human language areas, like neural machine translation which, rather than doing pairwise, actually becomes more language-independent and can derive an intermediate abstraction through which it translates into many different languages.” In short, the dataset is constructed in a manner that enables bidirectional translation. That is, you can take some legacy COBOL code — which, terrifyingly, still constitutes a significant amount of this country’s banking and federal government infrastructure — and translate it into Java as easily as you could take a snippet of Java and regress it back into COBOL.

“We believe natural language processing and machine learning can be applied to understanding software languages by doing automated reasoning and decision making, by being able to explain those decisions, just like we are able to do with computer vision and on the natural language processing side,” he said.

Just as with human languages, computer code is created to be understood within a specific context. Unlike our bipedal linguistics, however, “programming languages can be compared, very succinctly, on a metric of ‘does the program compile, does the program do what it was supposed to do problem and, if there is a test set, does it knows, solve, and meet the criteria of the test,’” Puri posited. Thus, CodeNet can be used for functions like code search and clone detection, in addition to its intended translational duties and its role as a benchmark dataset. Each sample is also labeled with its CPU run time and memory footprint, allowing researchers to run regression studies and potentially develop automated code correction systems.
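To make Puri's three-part metric concrete, here is a minimal sketch in Python of how a single submission might be graded: does it compile, does it run, and does its output match the test. This is an illustration only, not CodeNet's actual tooling; the function name grade and the four verdict labels are hypothetical, and it handles only Python submissions that read with input() and write with print().

```python
import contextlib
import io


def grade(source: str, stdin_text: str, expected_stdout: str) -> str:
    """Hypothetical grader: returns 'compile_error', 'runtime_error',
    'wrong_answer', or 'accepted' for one Python submission."""
    # Criterion 1: does the program compile?
    try:
        code = compile(source, "<submission>", "exec")
    except SyntaxError:
        return "compile_error"

    # Criterion 2: does it run to completion on the test input?
    lines = iter(stdin_text.splitlines())
    captured = io.StringIO()
    # Shadow the built-in input() so the submission reads the test input.
    env = {"input": lambda prompt="": next(lines)}
    try:
        with contextlib.redirect_stdout(captured):
            exec(code, env)
    except Exception:
        return "runtime_error"

    # Criterion 3: does the output meet the test's expected result?
    if captured.getvalue().strip() == expected_stdout.strip():
        return "accepted"
    return "wrong_answer"


if __name__ == "__main__":
    adder = "a, b = map(int, input().split())\nprint(a + b)"
    print(grade(adder, "2 3", "5"))
```

Because exec() runs the submission inside the grader's own process, anything like this would need proper sandboxing before being pointed at untrusted code.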

Project CodeNet consists of more than 14 million code samples along with 4,000-plus coding problems collected and curated from decades of programming challenges and competitions across the globe. “The way the data set actually came about,” Puri said, “there are many kinds of programming competitions and all kinds of problems, some of them more businesslike, some of them more academic. These are the languages that have been used over the last decade and a half in many of these competitions with 1000s of students or competitors submitting solutions.”

Additionally, users can run individual code samples “to extract metadata and verify outputs from generative AI models for correctness,” according to an IBM press release. “This will enable researchers to program intent equivalence when translating one programming language into another.”

While this dataset could theoretically be used to generate entirely new sequences of code, like what GPT-3 does with English, CodeNet’s strength lies in its ability to translate. “We are exactly trying to do what ImageNet did to computer vision,” Puri said. “It fundamentally changed the game, it was highly curated with a very targeted data set for a very broad domain. We hope CodeNet, with its diversity of tasks, its diversity of data, and with its large scale, will bring the same value.” Plus, Puri estimates that more than 80 percent of the problems already have more than 100 variant answers each, providing a broad array of possible solutions.

“We are very excited about this,” Puri exclaimed. “We hope and believe it will be to code what ImageNet was to computer vision.” IBM intends to release the CodeNet data to the public domain, allowing researchers worldwide equal and free access.