Date: 2020-09-24

Benchmarking is a crucial step in developing ever more sophisticated artificial intelligence. It provides a helpful abstraction of an AI’s capabilities and gives researchers a firm sense of how well the system performs on specific tasks. But benchmarks are not without their drawbacks. Once an algorithm masters the static dataset behind a given benchmark, researchers have to undertake the time-consuming process of developing a new one to further improve the AI. As AIs have improved, researchers have had to build new benchmarks with increasing frequency. As a Thursday Facebook post points out, “While it took the research community about 18 years to achieve human-level performance on MNIST and about six years to surpass humans on ImageNet, it took only about a year to beat humans on the GLUE benchmark for language understanding.”
What’s more, these benchmarks might contain biases that an algorithm can exploit to improve its score -- image recognition AIs, for example, can ignore the subtle contextual differences between “how much” and “how many” and simply answer “2”. So Facebook’s AI Research (FAIR) lab has taken a new approach to benchmarking: it has put humans in the loop to help train its natural language processing (NLP) AIs directly and dynamically.