My comments are organized by stage of implementation. Stage 1 addresses cross-site data harmonization. Once data is harmonized, Stage 2 illustrates queries I've found useful in my own work. Once cross-site queries are in place, Stage 3 discusses how a database of these queries themselves might be leveraged.
Stage 1: Harmonization Services
Handwritten notes, PDF files, databases from multiple vendors: none of these work together "out of the box". It takes effort to combine them; the jargon is that data from these diverse sources must be "harmonized". This is hard even within a single hospital (see CogStack and MedCAT). The proposed system could provide harmonization (and other functions) as a service to encourage Federation membership. Many of these systems use NLP (Natural Language Processing) packages. For example, Ross Mitchell provides a "pathology report text"->"PurpleBook ICD code" translator as an open-source Dockerized container (GTC talk) that might help. Note that everyone benefits from services of this type, since translation accuracy increases as the corpus of pathology reports grows, and at some level queries need to be in a common language (e.g. ICD codes) across sites anyway. Finally, it would be helpful to anticipate working beyond kidney cancer. GA4GH creates standards for the multitude of genomic databases around the world. Its working hypothesis is that it will be technically possible to harmonize data (like the GWAS Catalog, on a grander scale) and to federate, or loosely link, the disparate data warehouses.
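As a toy illustration of what a harmonization service does at its simplest, the sketch below maps each site's local diagnosis labels onto a shared vocabulary. The site names, labels, and codes are hypothetical placeholders, not real mappings; a production service would sit behind NLP translators like the one mentioned above.

```python
# Minimal sketch of a harmonization step: translate site-local diagnosis
# labels into one shared coding vocabulary. All site names, labels, and
# codes below are hypothetical placeholders for illustration only.

SITE_VOCABULARIES = {
    "hospital_a": {"papillary RCC type 1": "C64.9", "clear cell RCC": "C64.9"},
    "hospital_b": {"pRCC-1": "C64.9", "ccRCC": "C64.9"},
}

def harmonize(site, local_label):
    """Map a site-local label to the shared code; return None when no
    mapping exists, flagging the label for human curation."""
    return SITE_VOCABULARIES.get(site, {}).get(local_label)
```

Unmapped labels returning `None` is the important design point: a federation needs a curation loop for vocabulary gaps, not silent guesses.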
Stage 2: Example Queries
Assuming the data is harmonized, what can be done with the collection? I get my data primarily from hackathons, where disparate teams of researchers work on a single patient case. Two Hackathon team queries that gave notable results were “Genetic Cohort Analysis” and “Genetic Pan Cancer and Familial Analysis”.
Genetic Cohort Analysis
- 2018 Hackathon Cohort Analysis: GEO - Rutgers (Saed Sayad) created his "Genes of Interest" list using GEO papillary kidney cancer data.
- 2020 Hackathon Cohort Analysis: TCGA - Clemson (Reed Bender) used my genome and TCGA papillary kidney cancer data to create an 18,000-gene list of differential RNA-seq expression values for me and 5 TCGA patients.
Common to both queries was the use of public datasets. So incorporating public datasets is clearly a requirement.
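To make the second query concrete, here is a minimal sketch of one common way to rank differential expression: log2 fold change of a patient's normalized counts against a public cohort's means. The gene symbols and counts are made-up toy values, and real pipelines (e.g. the TCGA analysis above) use far more sophisticated statistics.

```python
import math

# Sketch: rank genes by differential expression between one patient and
# a public cohort, using log2 fold change of normalized counts.
# Gene names and count values are toy examples, not real data.

def log2_fold_changes(patient, cohort_means, pseudocount=1.0):
    """Return (gene, log2 FC) pairs sorted by |log2 FC|, largest first.
    The pseudocount avoids division by zero for unexpressed genes."""
    scores = {}
    for gene, count in patient.items():
        ref = cohort_means.get(gene, 0.0)
        scores[gene] = math.log2((count + pseudocount) / (ref + pseudocount))
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

patient = {"MET": 900.0, "NF2": 40.0, "ACTB": 5000.0}
cohort = {"MET": 100.0, "NF2": 120.0, "ACTB": 4800.0}
ranked = log2_fold_changes(patient, cohort)
```

Here MET ranks first (strongly over-expressed relative to the cohort), while the housekeeping-style ACTB, nearly unchanged, falls to the bottom.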
Genetic Pan Cancer and Familial Analysis
In November 2020, Quantum Insights (a 2018 p1RCC Hackathon team) discovered that my 2020 RNA-seq data "clustered" close to TCGA's RNA-seq thyroid cancer dataset, and, unbeknownst to them, one of my siblings was diagnosed that year with thyroid cancer. So two queries might be:
- Pan Cancer Analysis - What "target cancer" does patient X’s RNA-seq data cluster closest to?
- Familial Analysis - Does Patient X have a relative with that "target cancer"?
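The Pan Cancer query above can be sketched as a nearest-centroid lookup: compare the patient's expression vector to a per-cancer-type centroid and return the most similar. The centroid values, gene ordering, and cancer labels below are hypothetical; real clustering (like Quantum Insights') works on thousands of genes with careful normalization.

```python
import math

# Sketch of the "Pan Cancer Analysis" query: which cancer-type centroid
# is a patient's expression vector closest to, by cosine similarity?
# Centroid values and labels are hypothetical toy data.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def closest_cancer(patient_vec, centroids):
    """Return the cancer type whose centroid is most similar."""
    return max(centroids, key=lambda c: cosine(patient_vec, centroids[c]))

centroids = {
    "thyroid": [0.9, 0.1, 0.4],
    "papillary_kidney": [0.2, 0.8, 0.5],
}
result = closest_cancer([0.85, 0.2, 0.35], centroids)
```

In a federated setting, each site could compute similarity to its own local centroids and return only scores, never raw patient data.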
Even if harmonized, querying across hospital systems (think Federated Computing) for data relating to a relative’s condition (think HIPAA) may be problematic.
“Genetic/Image based Comorbidity Analysis” and “Epigenetic Analysis”
I have not done these, but would like to be able to do so.
- Genetic/Image based Comorbidity Analysis - Imaging that reveals bilateral kidney tumors raises the odds of an HPRC diagnosis. I have a brain meningioma. And I've read that NF2 is implicated in both brain and kidney tumors. What is the shared expression pattern for non-metastatic patients with overlapping tumor locations? Note that a similar query, "What are the symptoms of non-metastatic patients with overlapping tumor locations?", does not even require genetic analysis or a diagnosis to reveal potentially useful biomarkers ("blood in urine", "flank pain", ...).
- Epigenetic Analysis - My sibling with thyroid cancer also has a lung carcinoid. In addition to having similar genomes, we have always lived close to one another (first in Houston, then in Silicon Valley). What is the overlap of our RNA-seq data? Such queries might also tie in the locations of Superfund sites (such as Google).
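The symptoms-only version of the comorbidity query above is simple enough to sketch directly against a harmonized record set. The records below are toy/hypothetical, and a real federated version would run the filter at each site and aggregate only the counts.

```python
from collections import Counter

# Sketch of: "What are the symptoms of non-metastatic patients with
# overlapping tumor locations?" The patient records are toy examples.

records = [
    {"metastatic": False, "sites": {"kidney", "brain"},
     "symptoms": {"blood in urine", "flank pain"}},
    {"metastatic": False, "sites": {"kidney", "brain"},
     "symptoms": {"flank pain", "headache"}},
    {"metastatic": True, "sites": {"kidney"}, "symptoms": {"fatigue"}},
]

def shared_symptoms(records, locations):
    """Count symptoms among non-metastatic patients whose tumor sites
    include all of the given locations."""
    counts = Counter()
    for r in records:
        if not r["metastatic"] and locations <= r["sites"]:
            counts.update(r["symptoms"])
    return counts

tally = shared_symptoms(records, {"kidney", "brain"})
```

Note that no genomic data is touched: tumor sites and symptoms alone can surface candidate biomarkers like "flank pain" appearing in every matching patient.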
Stage 3: Queries and Ranking Functions
Hackathons' research teams can be "reimagined" as ensembles of "classifiers". Applying ensemble learning techniques to an event's results has helped me determine where to look next in my own case (p1RCC). Central to the idea are
- Queries: the question being asked and
- Ranking functions: functions that assign a score to the proposed answers.
In the case of my 2018 hackathon, the question being asked was "what are Bill's genes of interest?" The most interesting ranking function was Reed Bender's list of Bill's tumor's differentially expressed genes obtained by RNA-seq analysis. This list was actually created during the 2020 hackathon and so acted as a "hold out" set to rank the answers from the 2018 hackathon. One of the seventeen participating 2018 teams, Biomarkers.ai (a Rutgers team), ranked extraordinarily well on this metric. So of the 17 teams' methods/answers from the 2018 hackathon I can explore, I am exploring Biomarkers.ai's first.
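The hold-out ranking idea above can be sketched in a few lines: score each team's "genes of interest" answer by its overlap with the later differential-expression list. The team names and gene symbols are illustrative, not the actual hackathon answers.

```python
# Sketch of a ranking function: score each team's gene list by overlap
# with a hold-out set (e.g. the 2020 differential-expression list).
# Team names and gene symbols are illustrative, not real answers.

def overlap_score(answer, holdout):
    """Fraction of the team's proposed genes confirmed by the hold-out set."""
    answer, holdout = set(answer), set(holdout)
    return len(answer & holdout) / len(answer) if answer else 0.0

holdout = ["MET", "NF2", "VEGFA", "CDKN2A"]
answers = {
    "team_x": ["MET", "NF2", "TP53"],
    "team_y": ["BRCA1", "EGFR"],
}
ranking = sorted(answers, key=lambda t: overlap_score(answers[t], holdout),
                 reverse=True)
```

Any scoring rule could be swapped in for `overlap_score`; the point is that a later dataset turns an unordered pile of hackathon answers into a ranked list of methods worth pursuing.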
Just like hackathon teams, federated queries run against KCA's collection of databases also form an ensemble. And although these queries will be less structured, they themselves form a corpus that can be queried in its own right. One set of queries might, for example, revolve around "what are a p1RCC patient's genes of interest?" In this case, I personally may not even need the answer; I may just need to know the queries and their ranking. An initial query table and rankings table could even be centralized and quite cheap to administer (e.g. simply published as HTML on the splash screen of the website). Note that there can be many different ranking systems (e.g. Yahoo and Google publish their queries ranked by "trending" and "most popular"). And if constructed properly, the queries could be the basis for a genetic programming system that synthesizes new queries when the system is not actively used.
Promotion, Other Efforts and Openness
- Promotion/Gamification - KCA could have its own hackathon. E.g. Kaggle, or Clinical Reporting of MultiOmics Data, or Pursuing Better Biomarkers for Immunotherapy Response in Cancer Through a Crowdsourced Data Challenge.
- Other Efforts - Pentland's Data Unions, Fajgenbaum's work, Andrew Beam's list, Covid-19 Commons. (Note that Matthew Trunnell argues that by itself, Federated Computing may not address the core rate-limiting factors--legal and ethical review--in biomedical research.)
- Openness - All the data mentioned in this post is open access. And all the work I do goes toward creating more open access data and research, because keeping it closed hinders research on my disease, and slowing down the research means more patients like me will die sooner. Hopefully this effort will provide some mechanism to create more open data.