How Facebook AI’s Dynabench will Change the Concept of Benchmarking in Machine Learning
To keep humans in the loop when training AI models and achieve better human-machine interaction, Facebook AI has released a new benchmarking technique. It works by collecting the errors a model makes and using them to correct it, and it can be applied to a range of natural language processing tasks.
Benchmarking, in general terms, is the study of how well a model performs compared with earlier versions of itself or with other models. It is a necessity in the Machine Learning (ML) community because it challenges existing models and provokes the development of new and better ones.
Need for a dynamic benchmark
Until now, we have relied on static benchmarks, which saturate very quickly given the rate at which AI is advancing. They also encourage researchers to overfit their models to the benchmark. Another issue is that data biases are nearly impossible to avoid, an unintended consequence of overlap between training and test sets.
The earlier unavailability of crowdsourcing platforms for data collection, and of the capacity to handle large-scale models, also forced the community towards static benchmarking, because collecting data was an expensive and time-consuming undertaking. Other challenges of static benchmarks include unintended biases introduced to improve a specific model's performance, annotation artifacts, and the way they strong-arm the community into focusing excessively on one goal, attaining higher accuracy, when the real aim should be attaining the lowest error rate. These issues can be observed in existing benchmarks such as MNIST, ImageNet, and GLUE.
To overcome the drawbacks of static benchmarks, the Facebook AI team released Dynabench, a new platform, the first of its kind, for dynamic benchmarking and data collection in AI over multiple rounds. Most researchers test their models on a standardized set of instances from a corpus, but such tests can become outdated very quickly given the field's rapid daily growth.
So, with Dynabench, new datasets are created by models and humans working in conjunction, to analyze the performance of NLP models more accurately. It also measures how easily a human can fool an AI model in a dynamic environment, aiming to provide a metric that says more about a model's quality than current benchmarks do.
To understand this better, consider an analogy. Suppose a student memorizes the material and aces the test, but starts fumbling when the questions are asked in an unfamiliar way. This happens because he has only 'learned' things, not 'understood' them. This is where Dynabench can do better.
By this analogy, static datasets and benchmarks resemble the student who learns classification from a pre-defined dataset, while the humans in Dynabench are the examiners, who keep coming up with new challenges and questions, so that models learn against growing challenges and test data that changes every day.
The platform uses the points where the system 'fumbles' to make the necessary modifications and retrain the model, which then feeds into next-generation AI models. Dynabench enlists people to question the AI model, inviting them to a website to interrogate the model behind it. Something similar is happening today with GPT-3, where people are probing its limits, or with chatbots being evaluated on whether they can pass as human.
The chief essence is dynamic adversarial data collection, aimed at improving on current benchmarks. It runs the benchmarking process on crowdsourcing platforms, with humans actively involved in evaluation, because a human can naturally assess a model's accuracy more efficiently than a pre-packaged set of test questions: people come up with dynamic, real-world queries with every use. It relies on asking people to pose challenging questions to a set of NLP models in order to identify their weak points, the areas that need more work.
The basic concept behind Dynabench is to use human creativity to challenge the model, because, as things stand, it is very easy for a human to fool an AI. In emotion detection, for instance, the wit, sarcasm, and hyperbole that humans use can fool a system very easily.
Thus, Dynabench uses these weak points to create harder datasets with each round.
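To make the emotion-detection example concrete, here is a minimal sketch (not Dynabench code; the word lists and names are made up for illustration) of a keyword-counting sentiment classifier and the kind of sarcastic input a human annotator could use to fool it:

```python
# Illustrative sketch only: a naive keyword-counting sentiment model
# of the kind a human annotator can fool with sarcasm.
import string

POSITIVE_WORDS = {"great", "love", "wonderful", "amazing"}
NEGATIVE_WORDS = {"terrible", "hate", "awful", "boring"}

def naive_sentiment(text: str) -> str:
    """Classify by counting sentiment keywords -- no grasp of context or tone."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    score = sum(w in POSITIVE_WORDS for w in words) \
          - sum(w in NEGATIVE_WORDS for w in words)
    return "positive" if score >= 0 else "negative"

# A literal sentence is handled correctly...
print(naive_sentiment("I hate this boring film"))  # negative
# ...but sarcasm, obvious to any human, flips it:
print(naive_sentiment("Oh great, another wonderful three-hour meeting"))  # positive
```

Examples like the sarcastic one above are exactly what gets collected and fed back into the next round.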
So, like every model to date, it starts by training on a data corpus chosen for the task at hand. The difference appears in the testing phase.
Where other approaches test on a fraction of the standardized corpus, Dynabench combines pre-packaged corpus instances with human challenges from the real world. Annotators can play tricks on the model, deliberately using misleading keywords or making references that exist only in the real world and that the machine may not know.
The human evaluator determines the accuracy of the model's performance and identifies the areas or queries the system did not handle well. These points are then trained on in the next round.
Thus, with every round the system faces more and more challenges that are not part of any existing dataset, or that come from the domain where the model is to be deployed.
When the model finishes a testing round, Dynabench identifies the areas where it was weak and fooled by humans, and compiles these into a new test dataset.
This test dataset can be used by researchers to build newer and more sophisticated models.
This results in a model that can answer questions the previous model couldn't; it then undergoes testing again to identify harder, more challenging questions it cannot answer, and to build the next test dataset.
This recurring process can be repeated frequently and easily, so that if any biases creep in over time, Dynabench can identify them and produce new examples to test whether those biases still exist.
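The multi-round cycle described above can be sketched in a few lines. This is an illustrative toy, with a stub "model" (a lookup table) and a stub "retraining" step (memorizing hard cases) standing in for real models and crowdworkers; every name here is hypothetical, but the loop has the same shape:

```python
# Toy sketch of dynamic adversarial data collection over multiple rounds.
# The stub model defaults to "positive"; retraining simply memorizes
# the examples that fooled it.

def predict(knowledge, example):
    return knowledge.get(example, "positive")  # stub model: optimistic default

def retrain(knowledge, fooled):
    return {**knowledge, **dict(fooled)}       # stub training: memorize hard cases

def dynamic_benchmark(knowledge, annotator_rounds):
    """Each round: keep the examples the current model gets wrong (the
    human 'fooled' it), compile them into a new test set, retrain."""
    new_test_sets = []
    for pool in annotator_rounds:              # fresh human-written examples per round
        fooled = [(x, y) for x, y in pool if predict(knowledge, x) != y]
        new_test_sets.append(fooled)           # harder dataset for the next generation
        knowledge = retrain(knowledge, fooled)
    return knowledge, new_test_sets

round1 = [("great film", "positive"), ("oh great, a flat tire", "negative")]
round2 = [("oh great, a flat tire", "negative")]  # annotator retries the old trick
model, tests = dynamic_benchmark({}, [round1, round2])
print(tests[0])  # round 1: the sarcastic example fooled the model
print(tests[1])  # round 2: [] -- the retrained model is no longer fooled
```

In the real platform the rounds never truly exhaust themselves, because annotators keep inventing tricks the current model has not seen.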
This cycle is nearly impossible to end, because it is difficult for a system ever to reach the point where it can answer every possible question.
Dynabench makes AI models more robust to vulnerabilities and weaknesses, because it is easy for human annotators to come up with plenty of examples that fool the system.
Models are served in the cloud via TorchServe, and crowdsourced annotators connect to the platform through Mephisto; humans interacting with the system receive instant feedback on the model's response.
What can it be used for?
Initially, the platform focuses on four core natural language processing tasks, since this is the field most affected by benchmark saturation: question answering, sentiment analysis, hate speech detection, and natural language inference.
Benefits of Dynabench
This metric better reflects how AI models perform in the situations that matter: interacting with people who differ from one another in many ways, and who behave and react in complex, changing ways that no static corpus can capture.
It also makes the system highly robust to vulnerabilities, as it tackles newer and more challenging tasks with each round. This improves the model's ability to deal with biases and artifacts that were introduced unintentionally.
The most important advantage, though, is that Dynabench overcomes the biggest drawback of static benchmarks: saturation. For Dynabench to saturate would take an enormous amount of time and a near-impossible number of test instances. The test is also much closer to real-world applications than anything that has existed until now.
Limitations of Dynabench
There is a high risk of cyclical progress and of catastrophic forgetting, in which improved models forget information that was once relevant, is not anymore, but may be needed again later.
It also does not provide any tools for bias mitigation. And crowdsourcing, the prime essence of Dynabench, may sometimes be difficult to implement.
So, where is the future for Dynabench headed?
For now, it works on language models only, but may eventually extend to other domains, such as image recognition and speech recognition systems. Beyond the four NLP tasks, the platform is planned to open up so that anyone can create their own tasks and recruit human annotators to find weaknesses in their models.
The main language of focus, as of now, is English, but other languages and modalities are expected to join in the future.
Further research is also needed to determine whether distribution shifts between rounds help avoid cyclical progress, and whether their adverse effects can be overcome.