Recommendation of Collaboration Opportunities for COVID-19 Trials
Link for the website: Trial Collaboration Website
Code: Github
Background
During the COVID-19 pandemic, on average 22 new clinical trials are initiated each day in the United States alone. The resulting competition for recruitment can potentially hurt all trials. Proactively identifying similar or related trials nearby would allow clinical trial designers and investigators to collaborate with peers in an effort to produce the most reliable results possible.
Here is a plot showing all the new interventional trials initiated across the US.
Hypothesis
A semi-automated clustering model or trial ‘recommendation engine’ will identify similar or related trials with minimal user input and allow researchers to collaborate in an effort to fight the COVID-19 pandemic.
Models and Analysis
The clinical trial data related to COVID-19 were extracted from ClinicalTrials.gov (ct.gov). The data contain only features and no target value, so I could only use unsupervised learning methods. I tried several clustering models.
Clustering Models
Since the number of data fields is very large and they include both structured and unstructured fields, I processed the two kinds separately.
Structured Fields
Clinical trials can be grouped into two main groups: interventional and observational trials. Their fields differ slightly, so I processed each trial according to the group it belonged to.
In the data preprocessing step, I handled each kind of field differently.
“Condition names” (example of one cell: Coronavirus|Coronavirus Infection|Covid-19) is a multi-valued categorical attribute, so I used multi-hot encoding. Some names refer to the same condition but are written differently, so I applied fuzzy matching to merge them.
“Intervention names” (example of one cell: COVID-19 Serology;Health Care Worker Survey) is also a multi-valued categorical attribute, so I used multi-hot encoding here as well.
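The multi-hot encoding with fuzzy matching described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the helper names, the 0.85 similarity threshold, and the greedy "first close match wins" rule are all assumptions made for the example.

```python
from difflib import SequenceMatcher

def canonicalize(name, vocabulary, threshold=0.85):
    """Map `name` to an existing vocabulary entry if one is similar enough,
    otherwise add it as a new entry. The threshold is an assumed value."""
    key = name.strip().lower()
    for known in vocabulary:
        if SequenceMatcher(None, key, known).ratio() >= threshold:
            return known
    vocabulary.append(key)
    return key

def multi_hot(cells, sep="|", threshold=0.85):
    """Turn cells such as 'A|B|C' into a vocabulary and a 0/1 matrix."""
    vocabulary = []
    rows = []
    for cell in cells:
        names = {canonicalize(n, vocabulary, threshold)
                 for n in cell.split(sep) if n.strip()}
        rows.append(names)
    # Build the matrix only after the full vocabulary is known.
    matrix = [[1 if term in names else 0 for term in vocabulary]
              for names in rows]
    return vocabulary, matrix

cells = [
    "Coronavirus|Coronavirus Infection|Covid-19",
    "COVID-19|Pneumonia",
    "Covid19",  # spelling variant that fuzzy matching should merge
]
vocabulary, matrix = multi_hot(cells)
print(vocabulary)
for row in matrix:
    print(row)
```

With this threshold, "Covid19" is merged into the "covid-19" column, while "Coronavirus" and "Coronavirus Infection" stay separate. Whether those two should also be merged is a judgment call that depends on the chosen threshold.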
The remaining fields fall into two categories. The first contains gender, healthy volunteers, observational model (observational trials), phase, primary purpose, intervention model, allocation and masking (interventional trials). These fields were one-hot encoded. The second contains maximum age and minimum age, which were processed with standard scaling.
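As a rough illustration of the two treatments above, the sketch below one-hot encodes a single-valued categorical field and standard-scales a numeric field. The sample values are made up for the example, and the helper names are hypothetical. Scaling uses the population standard deviation, which is one common convention.

```python
from statistics import mean, pstdev

def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def standard_scale(values):
    """Shift to zero mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - mu) / sigma for v in values]

# Illustrative values only, not taken from the real data set.
genders = ["All", "Female", "All", "Male"]
minimum_ages = [18, 18, 12, 40]

categories, encoded = one_hot(genders)
scaled = standard_scale(minimum_ages)
print(categories, encoded)
print([round(x, 3) for x in scaled])
```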
In the model building step, I tried three clustering models.
The first was K-means. I varied the number of clusters to find the best result.
The second was Agglomerative Clustering. I used four different linkage methods: single, average, complete and ward.
The third was Gaussian Mixture Model. I used four different covariance types: spherical, diag, tied and full.
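To make the "vary the number of clusters" step concrete, here is a small from-scratch K-means sketch that reports the within-cluster sum of squares (inertia) for several values of k. It is an illustration only: the project most likely used a library implementation, the toy points are invented, and using inertia as the selection criterion is an assumption, since the write-up does not say which criterion was used.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns the final centers and the inertia."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct points as initial centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members (keep it if empty).
        new_centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, inertia

# Three well-separated toy groups (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]

inertias = {k: kmeans(points, k)[1] for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

A sharp drop in inertia followed by a plateau is one common heuristic for picking k. On real trial data the drop is often much less clear, which is consistent with the parameter-selection difficulty noted later in this write-up.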
Unstructured Fields
Here, unstructured fields are the free-text fields: official title, study description, design description, outcome description and outcome measure.
I tried three methods to embed the text.
The first was BioBERT embedding. I used BioBERT to embed the words in an entry, averaged the word vectors, and computed the cosine similarity between each pair of entries.
The second was a simple bag-of-words embedding. I used it to compute a vector representation of the text and applied the clustering methods used for the structured fields to compute their similarity.
The third was Word2Vec embedding. I used Word2Vec to compute a vector representation of the text and applied the clustering methods used for the structured fields to compute their similarity.
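The two building blocks shared by these text methods, turning an entry into a vector and comparing vectors with cosine similarity, can be sketched with the standard library alone. The sketch below does not load BioBERT or Word2Vec. The `toy_vectors` dictionary is a hypothetical stand-in for real word embeddings, and whitespace tokenization is a simplifying assumption.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def average_vectors(tokens, word_vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def bag_of_words(texts):
    """Count-based vectors over a shared, sorted vocabulary."""
    tokenized = [t.lower().split() for t in texts]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

texts = [
    "Ruxolitinib to Combat COVID-19",
    "Duvelisib to Combat COVID-19",
    "Health Care Worker Survey",
]
vocabulary, vectors = bag_of_words(texts)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])

# Hypothetical 2-dimensional embeddings, standing in for BioBERT/Word2Vec output.
toy_vectors = {"combat": [1.0, 0.0], "covid-19": [0.0, 1.0]}
averaged = average_vectors(["to", "combat", "covid-19"], toy_vectors)
print(sim_related, sim_unrelated, averaged)
```

The first two texts share three of their four tokens, so their bag-of-words vectors are close, while the third shares none. Real embeddings would also capture similarity between different but related words, which raw counts cannot.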
Results
Here is a table of the trials recommended by BioBERT for the trial Duvelisib to Combat COVID-19 (Phase 2).
| phase | brief_title | prim_purpose |
|---|---|---|
| Phase 2 | Ruxolitinib to Combat COVID-19 | Treatment |
| Phase 2 | Early Infusion of Vitamin C for Treatment of Novel COVID-19 Acute Lung Injury (EVICT-CORONA-ALI) | Treatment |
| Phase 3 | Phase 3 Study to Evaluate Lenzilumab in Hospitalized Patients With COVID-19 Pneumonia | Prevention |
| Phase 2 | University of Utah COVID-19 Hydrochloroquine Trial | Treatment |
| Phase 2 | Sarilumab for Patients With Moderate COVID-19 Disease | Treatment |
| Phase 3 | COVID-19 Treated With Hydroxychloroquine Among In-patients With Symptomatic Disease | Treatment |
Issues
Since I am not a specialist in clinical trials, I asked a researcher in the lab, and he said the results were not ideal.
Beyond that, this method had four main issues.
- It was difficult to select the number of clusters and other parameters.
- The reasons behind a cluster were often unclear, but clinical trial designers wanted to know why trials were grouped together.
- It was also hard to change the weights of different fields because I wrote all the code in a Jupyter notebook.
- There were more unstructured fields than structured fields and I processed them separately.
Visualizing COVID-19 Trials
Heat map for all trials
Heat map for interventional trials
Solution to the above issues - another method (overlap computation)
The clustering methods above did not turn out well. Applying machine learning methods directly seemed neither wise nor very helpful, so I decided to use a more intuitive method to find similar trials for any target trial: field overlap for pairwise similarity computation.
For a given target clinical trial, compute the normalized total score between it and every other clinical trial. The calculation is shown below.
normalized_total_score =
(intervention_names_overlap * intervention_names_weight +
condition_names_overlap * condition_names_weight +
study_type_score * study_type_weight +
primary_purpose_score * primary_purpose_weight +
location_overlap * location_weight +
intervention_type_overlap * intervention_type_weight +
phase_score * phase_weight +
int_obs_score * int_obs_weight +
allocation_score * allocation_weight +
masking_score * masking_weight +
start_date_score * start_date_weight +
gender_score * gender_weight +
age_score * age_weight) / total_weight
Overlap = (overlap number) / (maximum overlap number)
Score = 1 if the two values match, 0 if they do not
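The weighted score above can be sketched in a few lines of Python. This is a hedged illustration rather than the site's actual implementation: the dictionary layout, the helper names, and the trial IDs `T1` to `T3` are hypothetical, and "maximum overlap number" is read here as the size of the smaller of the two sets, which is only one possible interpretation.

```python
SET_FIELDS = ("intervention_names", "condition_names", "location", "intervention_type")
MATCH_FIELDS = ("study_type", "primary_purpose", "phase", "int_obs", "allocation",
                "masking", "start_date", "gender", "age")

def overlap(a, b):
    """Shared items divided by the largest possible overlap (assumed: smaller set size)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def match(a, b):
    """1.0 if both values are present and equal, else 0.0."""
    return 1.0 if a is not None and a == b else 0.0

def normalized_total_score(target, other, weights):
    total = 0.0
    for field in SET_FIELDS:
        total += overlap(target.get(field, ()), other.get(field, ())) * weights.get(field, 0.0)
    for field in MATCH_FIELDS:
        total += match(target.get(field), other.get(field)) * weights.get(field, 0.0)
    total_weight = sum(weights.get(f, 0.0) for f in SET_FIELDS + MATCH_FIELDS)
    return total / total_weight if total_weight else 0.0

def top_neighbors(target_id, trials, weights, k):
    """Score every other trial against the target and keep the k best."""
    target = trials[target_id]
    scored = [(normalized_total_score(target, trial, weights), trial_id)
              for trial_id, trial in trials.items() if trial_id != target_id]
    return sorted(scored, reverse=True)[:k]

# Made-up trials with only a few fields filled in.
trials = {
    "T1": {"condition_names": {"covid-19"}, "intervention_names": {"duvelisib"},
           "phase": "Phase 2", "primary_purpose": "Treatment"},
    "T2": {"condition_names": {"covid-19", "pneumonia"}, "intervention_names": {"ruxolitinib"},
           "phase": "Phase 2", "primary_purpose": "Treatment"},
    "T3": {"condition_names": {"influenza"}, "intervention_names": {"survey"},
           "phase": "Phase 3", "primary_purpose": "Prevention"},
}
weights = {"condition_names": 2.0, "intervention_names": 2.0,
           "phase": 1.0, "primary_purpose": 1.0}

score = normalized_total_score(trials["T1"], trials["T2"], weights)
neighbors = top_neighbors("T1", trials, weights, k=2)
print(round(score, 3), neighbors)
```

Because every term is a plain product of a per-field score and a weight, the contribution of each field to the final ranking can be read off directly, which addresses the explainability issue raised for the clustering approach.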
There are two special fields: zip code and outcome measure. For zip code, I fixed a radius of 100 and used the original zip code as the center to find all zip codes within this radius, then computed the overlap between the two resulting zip code sets. Outcome measure is free text, so I used BioBERT embeddings and then computed the vector similarity.
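The zip code step can be illustrated as follows. Several things here are assumptions, not facts from the project: the radius is taken to be in miles (the unit is not stated above), distances use the haversine great-circle formula, the overlap is normalized by the smaller set, and the codes `A` to `D` with their coordinates are placeholders rather than real zip codes.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumption: radius of 100 is interpreted as miles

def haversine(p, q, earth_radius=EARTH_RADIUS_MILES):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * earth_radius * math.asin(math.sqrt(a))

def codes_within(center, coords, radius=100.0):
    """All codes whose coordinates lie within `radius` of the center code."""
    origin = coords[center]
    return {code for code, point in coords.items()
            if haversine(origin, point) <= radius}

def set_overlap(a, b):
    """Shared codes divided by the size of the smaller set (assumed normalization)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Placeholder codes on the equator, roughly 69 miles per degree of longitude.
coords = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (0.0, 2.0), "D": (0.0, 10.0)}

near_a = codes_within("A", coords)
near_c = codes_within("C", coords)
near_d = codes_within("D", coords)
print(sorted(near_a), sorted(near_c), set_overlap(near_a, near_c))
```

Comparing neighborhoods instead of the raw codes means two trials in adjacent zip codes still get a nonzero location score, even though their codes differ.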
The field weights are adjustable, so I built a website for clinical researchers.
Website
Example trial ID: NCT04372602
search page
Enter the NCT ID and the number of neighbors, then adjust the weights using the sliders.
result page