The 2 Tiers of Biomedical Science

July 3, 2024
  1. Tier 1: Pattern Recognition
    • This involves large-scale data analysis to identify patterns and correlations in health and medical data. The primary tools here are statistical methods and machine learning algorithms, which can sift through vast amounts of data to find associations.
    • A common issue in this tier is the replication crisis. Findings often cannot be replicated because the analyses rely heavily on statistical correlations without understanding the underlying causal mechanisms.
  2. Tier 2: Causal Investigation
    • This tier focuses on understanding the causal mechanisms behind the observed patterns. It requires designing experiments or observational studies that can test causal hypotheses.
    • The data needs for this tier are more specific, and the data must be generated dynamically: as our understanding of causal mechanisms evolves, so do the data requirements.
    • Causal explanations are crucial for answering interventional and counterfactual queries, necessary ingredients for progress in personalized medicine.
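
To make the distinction between interventional and counterfactual queries concrete, here is a minimal sketch using a toy linear structural causal model. Everything in it is an assumption made for illustration: the `LinearSCM` class, the linear form, and the numbers (hours of sleep as the exposure, T-cell availability in arbitrary units as the outcome) are hypothetical and not drawn from any study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearSCM:
    """Toy structural causal model: outcome = baseline + effect * exposure + noise.

    The linear form and all parameter values are hypothetical.
    """
    baseline: float
    effect: float

    def outcome(self, exposure: float, noise: float) -> float:
        return self.baseline + self.effect * exposure + noise

    def counterfactual(self, observed_exposure: float,
                       observed_outcome: float,
                       new_exposure: float) -> float:
        # Abduction: recover this patient's noise term from what was observed.
        noise = observed_outcome - self.baseline - self.effect * observed_exposure
        # Action and prediction: set the exposure to its new value and
        # recompute the outcome while holding the patient's noise term fixed.
        return self.outcome(new_exposure, noise)


model = LinearSCM(baseline=40.0, effect=2.0)

# Interventional query (population level): the expected outcome if everyone
# slept 8 hours, assuming the noise term has mean zero.
population_level = model.outcome(8.0, 0.0)

# Counterfactual query (patient level): this patient slept 5 hours and had an
# outcome of 46. What would her outcome have been had she slept 8 hours?
patient_specific = model.counterfactual(5.0, 46.0, 8.0)

print(population_level)   # 56.0
print(patient_specific)   # 52.0
```

The two answers differ because the counterfactual carries forward what the observation reveals about this particular patient. Neither query can be posed without an explicit causal model; a correlation between sleep and T-cell counts does not supply one.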

Solving the central problems in biomedical science, especially as they relate to personalized medicine, is not analogous to solving the problems in natural language processing. Aggregating larger and more multimodal datasets is important, but its relative importance to the advancement of science is far less than to the advancement of large language models. Much more important is the ability to generate new knowledge and to apply it in a patient-specific way. Because the number of causal conjectures relevant to any area of study is virtually limitless, no dataset is likely to come close to answering the associated causal queries.

A brief example:

A cancer patient has just completed a clinical trial testing the efficacy of a new personalized tumor vaccine pipeline. After surgical removal and sequencing of her tumor, she was administered a personalized vaccine based on genetic features of her tumor (e.g., neoantigens) and her HLA profile. Another scientist conjectures that the time to remission after immunotherapy depends not only on the obvious factors (e.g., the precise composition of the vaccine and the immune evasion potential of her tumor's mutations) but also on an unstudied factor: the effects of sleep quantity and quality on the availability of cancer-killing T-cells in her blood.

Investigating causal questions relating sleep quality and tumor recurrence may be relatively straightforward with an appropriate observational study, provided the explicit causal model renders the causal effect identifiable. Alternatively, the causal question may not be answerable from observation alone (for example, when neither the backdoor criterion nor front-door adjustment applies), thereby necessitating a randomized controlled trial. In either case, resolving the causal questions depends on observing changes in a dynamical biological system and interpreting those changes in the context of a model. Previously collected datasets may be useful for the task but are likely to be far from sufficient.
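
As a rough illustration of the identifiable case, the sketch below simulates a toy system in which a single observed confounder satisfies the backdoor criterion. The causal structure is assumed, not established: stress is a hypothetical confounder that lowers both the chance of good sleep and T-cell availability, and the effect size, probabilities, and function names are all invented for the example.

```python
import random
from statistics import mean

TRUE_EFFECT = 5.0  # hypothetical causal effect of good sleep on T-cell availability


def simulate(n, seed=0):
    """Draw n patients from a toy model in which stress confounds sleep and T-cells."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        stress = rng.random() < 0.5
        # Stressed patients are less likely to sleep well.
        good_sleep = rng.random() < (0.2 if stress else 0.8)
        # Stress also lowers T-cell availability directly (arbitrary units).
        t_cells = 50.0 + TRUE_EFFECT * good_sleep - 10.0 * stress + rng.gauss(0, 5)
        rows.append((stress, good_sleep, t_cells))
    return rows


def naive_difference(rows):
    """Association only: mean outcome of good sleepers minus poor sleepers."""
    treated = [y for _, x, y in rows if x]
    control = [y for _, x, y in rows if not x]
    return mean(treated) - mean(control)


def backdoor_adjusted(rows):
    """Backdoor adjustment: average the within-stratum differences over P(stress)."""
    total = 0.0
    for z in (False, True):
        stratum = [(x, y) for s, x, y in rows if s == z]
        weight = len(stratum) / len(rows)
        treated = [y for x, y in stratum if x]
        control = [y for x, y in stratum if not x]
        total += weight * (mean(treated) - mean(control))
    return total


rows = simulate(20000)
naive = naive_difference(rows)
adjusted = backdoor_adjusted(rows)
print(f"naive difference:  {naive:.2f}")
print(f"backdoor-adjusted: {adjusted:.2f}")
```

Under these assumed parameters the naive difference is about 11 in expectation, more than double the true effect of 5, while the adjusted estimate should land close to 5. The adjustment works only because the confounder was measured and the causal graph was specified in advance. If stress had gone unrecorded, no amount of additional data of the same kind would repair the estimate.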

To move beyond the trope that progress in biomedical science is primarily a problem of aggregating large datasets, we need digital infrastructure and an institutional culture that enable a shift to Tier 2 science.