Solving the central problems in biomedical science, especially as they relate to personalized medicine, is not analogous to solving the problems in natural language processing. Aggregating larger and more multimodal datasets is important, but it matters far less to the advancement of science than to the advancement of large language models. Much more important is the ability to generate new knowledge and to apply it in a patient-specific way. Because the number of causal conjectures relevant to any area of study is virtually limitless, no dataset is likely to come close to answering the associated causal queries.
A brief example:
A cancer patient has just completed a clinical trial testing the efficacy of a new personalized tumor vaccine pipeline. After surgical removal and sequencing of her tumor, she was administered a personalized vaccine based on genetic features of her tumor (e.g., neoantigens) and her HLA profile. Another scientist conjectures that the time to remission after immunotherapy depends not only on the obvious factors (e.g., the precise composition of the vaccine and the immune evasion potential of her tumor's mutations) but also on an unstudied factor: the effects of sleep quantity and quality on the availability of cancer-killing T-cells in her blood.
Investigating causal questions relating sleep quality and tumor recurrence may be relatively straightforward with an appropriate observational study, provided the explicit causal model renders the causal effect identifiable. Alternatively, the causal question may not be answerable from observation alone (for example, when neither the backdoor nor the front-door criterion can be satisfied), thereby necessitating a randomized controlled trial. In either case, resolving the causal questions depends on observing changes in a dynamical biological system and interpreting those changes in the context of a model. Previously collected datasets may be useful to the task but are likely to be far from sufficient.
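To make the notion of identifiability concrete, here is a minimal sketch in Python of a backdoor adjustment on simulated data. Everything in it is an illustrative assumption rather than a result: the binary confounder (labeled "disease severity"), the probabilities, the effect sizes, and the function names are all hypothetical. It assumes a causal model in which a single observed confounder influences both sleep quality and T-cell availability, which is exactly the case where an observational study can recover the causal effect.

```python
import random


def simulate(n=20000, seed=0):
    """Simulate (z, x, y) rows under an assumed, purely illustrative causal model.

    z: hypothetical confounder (1 = severe disease)
    x: exposure (1 = good sleep), made less likely by severe disease
    y: outcome (a stand-in for T-cell availability)
    The true causal effect of x on y is fixed at 1.0 by construction.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        p_good_sleep = 0.2 if z == 1 else 0.8
        x = 1 if rng.random() < p_good_sleep else 0
        y = 1.0 * x - 2.0 * z + rng.gauss(0.0, 1.0)
        rows.append((z, x, y))
    return rows


def mean(values):
    return sum(values) / len(values)


def naive_effect(rows):
    """Unadjusted difference in mean outcome between exposed and unexposed."""
    exposed = [y for _, x, y in rows if x == 1]
    unexposed = [y for _, x, y in rows if x == 0]
    return mean(exposed) - mean(unexposed)


def backdoor_effect(rows):
    """Backdoor adjustment: average the within-stratum differences,
    weighting each stratum of z by its observed frequency P(z)."""
    n = len(rows)
    effect = 0.0
    for z_value in (0, 1):
        stratum = [row for row in rows if row[0] == z_value]
        p_z = len(stratum) / n
        exposed = [y for _, x, y in stratum if x == 1]
        unexposed = [y for _, x, y in stratum if x == 0]
        effect += p_z * (mean(exposed) - mean(unexposed))
    return effect


rows = simulate()
naive = naive_effect(rows)
adjusted = backdoor_effect(rows)
print(f"naive estimate:    {naive:.2f}")
print(f"adjusted estimate: {adjusted:.2f} (true effect is 1.0 by construction)")
```

In this toy model the naive comparison overstates the effect because patients with severe disease both sleep worse and have lower outcomes, while the adjusted estimate lands near the true value. If the confounder were unmeasured, no reweighting of the same data would help, which is the situation that calls for a randomized trial.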
To move beyond the trope that progress in biomedical science is primarily a matter of aggregating large datasets, we need digital infrastructure and an institutional culture that enable a shift to Tier 2 science.