Kaggle’s COVID-19 research challenge enters its 3rd week
In response to the mounting death toll from COVID-19, a virus regarded as among the deadliest of the century, the Allen Institute for AI has prepared the COVID-19 Open Research Dataset (CORD-19). The White House has partnered with several leading research groups at top US institutions. The dataset comprises around 44,000 scholarly articles about COVID-19 and related viruses.
The dataset is free of cost: just log in to Kaggle.com and download the 4 GB dataset, or start working in a Kaggle kernel directly. It is open to professional and aspiring data scientists and data enthusiasts alike. The data is provided to the AI community so that advances in Natural Language Processing (NLP), Deep Learning (DL) and other AI techniques can be applied to generate insights and analysis about the pandemic. One example is finding which words appear most often in the research; for instance, the word ‘virus’ is used 8,000 times across the 44,000 research articles in the dataset. I will explain how to work with the dataset in a later article.
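The word-count idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming each article's text has already been loaded as a plain string; the `count_words` helper and the sample sentences are made up for this example and are not taken from the CORD-19 files.

```python
import re
from collections import Counter


def count_words(texts):
    """Count how often each lowercase word appears across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase the text, then treat every run of letters as one word.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


# Hypothetical stand-ins for article text, not real CORD-19 content.
sample_texts = [
    "The virus spreads quickly. This virus mutates.",
    "Risk factors for the virus include age.",
]

counts = count_words(sample_texts)
print(counts["virus"])        # 3
print(counts.most_common(1))  # [('virus', 3)]
```

On the real dataset the same function could be fed the text of each article, although common words such as "the" would normally be filtered out first so that they do not dominate the ranking.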
There are ten tasks to perform, covering the origin, evolution, risk factors and other aspects of the pandemic. NLP will likely be the technique most commonly applied by data science enthusiasts, allowing extensive exploration of the research dataset. This is where NLP helps: it processes text and generates valuable insights from textual data, which is useful not only in research but in business as well. As covered in one of our previous articles, a leading Boston hospital extracted data from various online news sites and from the online search queries of people in the USA, then applied text analytics (NLP) to find probable or suspected cases of the pandemic. NLP gives good analytical insight into text and voice data, while Computer Vision helps generate insights from image or video data.
The prize for the competition is USD 1,000. The prize is not anyone's motive at the moment; what matters most is the exploratory analysis the 10 tasks will produce, and how it helps make people across the globe aware of how the virus spreads, its diagnostics, what has been published about its medical care, and so on. The deadline to submit the code is the 25th of March. The data is also available on other platforms such as Microsoft Academic and Semantic Scholar. It was created by the Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University, Microsoft Research and the National Library of Medicine. We will be discussing this dataset in future articles, and I will also share my learnings with you here. You can participate in the CORD-19 challenge on Kaggle.