COVID-19 Open Research Data Challenge - QA Model Solution
I would like to share my experience participating in the COVID-19 Open Research Dataset Challenge (CORD-19) hosted on Kaggle.
The figure below is a block diagram of our approach.
Dataset Details:
As part of the challenge, the CORD-19 dataset published by the Allen Institute for AI was provided.
Dataset Name: Allen COVID-19 dataset
Dataset link: https://www.semanticscholar.org/cord19/download
Dataset Update rate: Daily
Content:
- Parsed Documents: JSON files extracted from the COVID-19-related articles and publications in the Allen dataset, i.e., PubMed, WHO research articles, bioRxiv and medRxiv
- CORD-19 Embeddings: document embeddings produced with SPECTER
- Metadata: metadata for this dataset
Task List:
There are 18 tasks available in this challenge. Our team explored three of them, listed below:
- https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=570
- https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=583
- https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=587
Each task has a set of subtasks, each comprising a statement for which we need to find out what the literature says.
For example: what the literature says about the incubation period of the virus, methods for data gathering, understanding coverage policies, etc.
We observed that all of the above tasks can be solved with our proposed approach.
Our Approach:
We decided to take a question-answering (QA) model approach to this problem. Each statement for which we need to know what the literature says can be treated as a query.
There are many popular QA models available. We chose the BERT QA model for its popularity and its ability to generalize to this problem. BERT QA is a deep learning model based on the state-of-the-art Transformer architecture. The BERT variant we use takes inputs of up to 512 tokens, i.e., each document must be split into chunks of 512 words, and each chunk is passed to the model to locate the answer within it.

However, the dataset is very large and will grow even larger, given the pace at which COVID-19 research articles are being published. Processing the text of every document through the BERT QA model is therefore not feasible, considering that the model is large and would take considerable processing time. To tackle this, we decided to shortlist the 10 documents most likely to contain the answer for each query; this number can be raised depending on the available computational power and the acceptable inference time. Shortlisting approaches we attempted include TF-IDF vector similarity between the document text and the query, the SPECTER paper embeddings provided with the dataset for each document, and Whoosh index search.
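As an illustration of the TF-IDF option, the sketch below scores documents against a query with scikit-learn. The three documents here are invented toy examples, not CORD-19 text; in our pipeline the scoring would run over each document's body text.

```python
# Toy sketch of TF-IDF shortlisting: rank documents by cosine
# similarity between their TF-IDF vectors and the query vector.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The incubation period of the virus ranges from 2 to 14 days.",
    "Hand washing and masks reduce transmission in hospitals.",
    "Vaccine trials require careful data gathering and reporting.",
]
query = "incubation period for the virus"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)          # (n_docs, vocab)
query_vector = vectorizer.transform([query])               # (1, vocab)

scores = cosine_similarity(query_vector, doc_vectors)[0]   # one score per doc
top = scores.argsort()[::-1][:10]                          # top-10 candidate ids
print([(int(i), round(float(scores[i]), 3)) for i in top])
```

We ultimately preferred Whoosh over this option for its simplicity and lower runtime.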
In our final solution, we settled on Whoosh index search to retrieve the 10 most similar documents, given its simplicity of use and low runtime.
Indexing using Whoosh:
Whoosh is an open-source indexing and search library for Python. Please refer to the Whoosh documentation for how to use the library for indexing and search.
The dataset contains the literature in JSON format, with title, abstract, body_text, authors and other fields. Since the answers to the statements would most likely appear in the research content of the literature, they would be in body_text. We therefore restrict the search scope to body_text rather than the abstract or title, and index the body_text of each document with Whoosh.
Once we index the documents, Whoosh creates one large index containing the information from all of them.
Query Preparation:
Instead of passing the statements as-is, we can extract keywords from them. What are the possible keywords? To decide what to extract, it helps to consider what a question represents: a question interrogates an incident, object, living being or something else, and the interrogation centers on a noun, proper noun or noun phrase. Similarly, adjectives and verbs play a significant role in understanding questions (without going into the details of how). We therefore extract nouns, proper nouns, adjectives and verbs, and filter English stop-words from the query. We used spaCy to tag parts of speech, and then joined the keywords using "AND" and "OR" to form the query.
Search Query in Whoosh Index:
Once the statement is converted into a query, we use it to retrieve the top 10 similar documents with the Whoosh index searcher. Please refer to the Whoosh documentation for more on searching an index.
We can limit the search results to 10 and also configure the output format; we configured the results to include the title, body_text and document similarity score.
Get Answer thru QA Model:
As discussed earlier, we use a BERT QA model to answer the query. We came across a BERT model fine-tuned on CORD-19 data, available as an open-source download at https://huggingface.co/manueltonneau/clinicalcovid-bert-base-cased.
More information about how this model was trained can be found at https://github.com/manueltonneau/covid-berts. There are other pretrained models available, such as https://huggingface.co/deepset/covid_bert_base and https://www.kaggle.com/lvennak/covid-19-qa-bert, but we proceeded with the former.
For each of the top 10 similar documents, we take the body text and split it with a running window of 512 words, with a 10% overlap between consecutive windows. The words in each window are passed through the BERT tokenizer, which assigns a token ID to each word. This list of token IDs is fed to the BERT QA model, which returns the start and end indexes of the answer within the token list along with a confidence score. We keep the highest-confidence answer from each document.
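The running-window split described above can be sketched in plain Python. The window size and overlap match the text; tokenization and the QA model call are omitted here.

```python
# Sketch: split a document's body text into 512-word windows with a
# 10% (~51-word) overlap between consecutive windows.
def split_into_windows(text, window_size=512, overlap_frac=0.10):
    words = text.split()
    step = max(1, int(window_size * (1 - overlap_frac)))   # ~461-word stride
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break
    return windows

# A 1000-word document yields three windows of 512, 512 and 80 words;
# each window would then go through the tokenizer and the QA model.
windows = split_into_windows("word " * 1000)
print(len(windows), len(windows[0].split()))   # → 3 512
```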
The top 5 highest-confidence answers across all documents are shown as output to the user. In other words, from the end user's perspective, the 5 best-answering documents are displayed for each query.
Pros and Cons of our Approach:
Here are a few pros and cons we have identified in our approach.
Pros :
- This is a generalised approach which can be extended further to other similar tasks.
- This can be extended to multilingual queries with minor changes, i.e., replacing the current pre-trained English BERT model with a multilingual BERT model.
- Using the Whoosh indexer reduced the time taken to build an indexing system, which in turn reduced the search time for retrieving the relevant documents from the corpus.
Cons :
- Questions and answers are limited to a maximum of 512 tokens, a limitation of the BERT model's input length. This could be mitigated with a semantic-similarity solution.
- Given resource and time constraints, we were not able to retrain the BERT model, as that would require GPU capacity; instead we reused pre-trained models that were trained on COVID datasets. Even to extend this work further, resources would be a constraint.
- Throughout our work we used simple keyword search over the Whoosh index; with more advanced capabilities in place, Elasticsearch might have been a better choice, as it provides semantic search in addition to keyword search.
PS: Please feel free to comment on our article, and share it with your friends if you think they may find it useful.