Named entity linking (NEL) is the task of linking ambiguous mentions in text to entities in a knowledge base. NEL is a core preprocessing step in downstream applications, including search and question answering.
- Pre-deep-learning approaches to NEL were rule-based or relied on statistical techniques and manual feature engineering to filter and rank candidates (survey paper).
- In recent years, deep learning systems have become the new standard (overview paper of deep learning approaches to entity disambiguation and entity matching problems). The most recent state-of-the-art models generally combine deep contextual word embeddings with entity embeddings; see, for example, Pre-training of Deep Contextualized Embeddings of Words and Entities for Named Entity Disambiguation and Empirical Evaluation of Pretraining Strategies for Supervised Entity Linking.
- More recently there has been a shift toward simplifying the model even further, using transformers without explicit entity embeddings, as in BLINK (a bi-encoder) and the Dual and Cross-Attention Encoders (a cross-encoder); a sketch contrasting the two scoring strategies follows this list.
- Another trend is enhancing the training data itself: the Bootleg system uses weak labeling of the training data to noisily assign entity links to mentions, improving performance on rare entities.
- Ikuya Yamada maintains a wonderful GitHub survey of recent trends in entity linking.
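Below is a minimal sketch contrasting the bi-encoder and cross-encoder scoring strategies mentioned above. The tiny `TextEncoder` is a stand-in for the pretrained BERT encoders these systems actually use; all module and variable names are illustrative assumptions, not BLINK's implementation.

```python
# Minimal sketch: bi-encoder vs cross-encoder scoring for entity linking.
# Real systems use pretrained transformer encoders; the stand-in modules
# below keep the example self-contained and runnable.
import torch
import torch.nn as nn

EMB_DIM = 64

class TextEncoder(nn.Module):
    """Stand-in for a transformer encoder that maps token ids to one vector."""
    def __init__(self, vocab_size=1000, dim=EMB_DIM):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, token_ids):
        return self.embed(token_ids)

# Bi-encoder: mention context and entity description are encoded independently,
# so entity vectors can be precomputed and retrieval reduces to a dot product.
mention_encoder = TextEncoder()
entity_encoder = TextEncoder()

mention = torch.randint(0, 1000, (1, 16))              # tokenized mention context
candidate_entities = torch.randint(0, 1000, (5, 32))   # 5 candidate descriptions

mention_vec = mention_encoder(mention)                  # (1, dim)
entity_vecs = entity_encoder(candidate_entities)        # (5, dim)
bi_encoder_scores = mention_vec @ entity_vecs.T         # (1, 5) similarity scores

# Cross-encoder: each (mention, entity) pair is concatenated and encoded jointly,
# which is typically more accurate but cannot precompute entity vectors.
cross_encoder = nn.Sequential(TextEncoder(), nn.Linear(EMB_DIM, 1))
pairs = torch.cat([mention.repeat(5, 1), candidate_entities], dim=1)  # (5, 48)
cross_encoder_scores = cross_encoder(pairs).squeeze(-1)               # (5,)

print(bi_encoder_scores, cross_encoder_scores)
```

The practical trade-off is that the bi-encoder lets entity vectors be precomputed and indexed for fast candidate retrieval, while the cross-encoder must re-encode every mention-entity pair at inference time.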
Sensitive to inputs, not models
- Variation in imaging configurations (e.g. site locations), hardware, and processing techniques (e.g. CT windowing) leads to large performance shifts (see the windowing sketch after this list)
- Recent medical imaging challenges (knee and brain segmentation, MRI reconstruction) found that, to a large extent, the choice of model matters less than the underlying data distribution (e.g. disease extent)
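To make the processing-technique point concrete, here is a minimal sketch of CT windowing, assuming NumPy; the soft-tissue and lung presets are typical values used purely for illustration. The same Hounsfield-unit scan becomes two very different model inputs depending on the window applied.

```python
# Minimal sketch of CT windowing: one Hounsfield-unit scan, two windows,
# two different input distributions for a downstream model.
import numpy as np

def apply_window(hu_image, center, width):
    """Clip a Hounsfield-unit image to [center - width/2, center + width/2]
    and rescale to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    windowed = np.clip(hu_image, lo, hi)
    return (windowed - lo) / (hi - lo)

scan_hu = np.random.uniform(-1000, 1000, size=(512, 512))  # placeholder scan

soft_tissue = apply_window(scan_hu, center=40, width=400)   # soft-tissue preset
lung = apply_window(scan_hu, center=-600, width=1500)       # lung preset

# Identical anatomy, very different pixel statistics fed to the model.
print(soft_tissue.mean(), lung.mean())
```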
Towards multi-modal data fusion
- Radiologist reports (and text more generally) have been used to improve learned visual representations (e.g. ConVIRT) and to source weak labels in annotation-scarce settings (e.g. PET/CT); a sketch of a ConVIRT-style contrastive objective follows this list
- Auxiliary features from other rich, semi-structured data, such as electronic health records (EHRs), have successfully complemented standard image representations
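A minimal sketch of a ConVIRT-style image-text contrastive objective, assuming PyTorch: random vectors stand in for the image-encoder and report-encoder outputs, and the temperature value is illustrative rather than taken from the paper.

```python
# Minimal sketch of a contrastive image-report objective: paired image and
# text embeddings are pulled together, unpaired ones pushed apart.
import torch
import torch.nn.functional as F

batch, dim = 8, 128
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)  # image encoder output
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # report encoder output

temperature = 0.1
logits = image_emb @ text_emb.T / temperature   # (batch, batch) similarity matrix
targets = torch.arange(batch)                   # i-th image pairs with i-th report

# Symmetric InfoNCE: image-to-text plus text-to-image cross-entropy.
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```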
Data Models for Dataset Drift Controls in Machine Learning With Images
- Drift synthesis enables the controlled generation of physically faithful drift test cases. The experiments presented here show that the average decrease in model performance is four to ten times less severe than under post-hoc augmentation testing.
- The gradient connection between the task and data models enables drift forensics, which can identify performance-sensitive data models that should be avoided when a machine learning model is deployed (see the sketch after this list).
- Drift adjustment opens up the possibility of adjusting processing in the face of drift, which can speed up and stabilize classifier training by a margin of up to 20% in validation accuracy.
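The sketch below illustrates the core mechanism under stated assumptions: a single differentiable parameter (gamma correction) stands in for the full raw-processing data model, so drift can be synthesized by varying it and "forensics" can inspect the gradient of task loss with respect to it. The toy task model and all names are assumptions, not the paper's implementation.

```python
# Minimal sketch: an explicit, differentiable data model sits between raw data
# and the task model, enabling drift synthesis and gradient-based forensics.
import torch

torch.manual_seed(0)

raw = torch.rand(32, 1, 8, 8)                  # toy "raw sensor" images
labels = torch.randint(0, 2, (32,))

gamma = torch.tensor(1.0, requires_grad=True)  # data-model parameter
task_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 2))

def data_model(x, g):
    return x.clamp(min=1e-6) ** g              # differentiable processing step

# Drift synthesis: evaluate the same task model under shifted data models.
for g in [0.5, 1.0, 2.0]:
    logits = task_model(data_model(raw, torch.tensor(g)))
    acc = (logits.argmax(1) == labels).float().mean().item()
    print(f"gamma={g}: accuracy={acc:.2f}")

# Drift forensics: gradient of the task loss w.r.t. the data-model parameter
# indicates how sensitive performance is to this processing choice.
loss = torch.nn.functional.cross_entropy(task_model(data_model(raw, gamma)), labels)
loss.backward()
print("d(loss)/d(gamma) =", gamma.grad.item())
```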
Collecting the right data for training and evaluation can require wet-lab work, especially in computational drug discovery.
Non-standard data modalities are common in computational biology.
- Biological interaction networks (e.g. Network-based in silico drug efficacy screening, Identification of disease treatment mechanisms through the multiscale interactome)
- Chemical graphs (e.g. Strategies for pre-training graph neural networks); see the toy molecular-graph sketch after this list
- DNA, RNA, and amino acid sequences (e.g. Sequential regulatory activity prediction across chromosomes with convolutional neural networks)
- 3D structures (e.g. Learning from protein structure with geometric vector perceptrons)
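As referenced in the chemical-graphs item above, here is a toy sketch of treating a molecule as a graph, with atoms as nodes, bonds as edges, and one round of message passing, assuming PyTorch. Real pipelines featurize molecules with libraries such as RDKit and use full GNN architectures; everything here is a simplified illustration.

```python
# Minimal sketch of a molecular graph and one message-passing step.
import torch

# Ethanol-like toy graph: 3 heavy atoms (C, C, O) with bonds C-C and C-O.
num_atoms, feat_dim = 3, 4
node_features = torch.randn(num_atoms, feat_dim)       # placeholder atom features
adjacency = torch.tensor([[0., 1., 0.],
                          [1., 0., 1.],
                          [0., 1., 0.]])                # bond structure

weight = torch.randn(feat_dim, feat_dim)

# One message-passing step: each atom aggregates its bonded neighbors'
# features, then applies a learned transformation and nonlinearity.
messages = adjacency @ node_features                    # sum over neighbors
updated = torch.relu((node_features + messages) @ weight)

# A whole-molecule representation via mean pooling over atoms.
molecule_embedding = updated.mean(dim=0)
print(molecule_embedding.shape)  # torch.Size([4])
```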
To facilitate the extraction of relevant signal from large biological datasets, methods have been designed to prune irrelevant features and integrate knowledge across datasets.
- AMELIE helps improve diagnosis of Mendelian disorders by integrating information from a patient’s phenotype and genotype and automatically identifying relevant references to literature.
- This article discusses the importance of creating effective feature selection methods to filter irrelevant features from large whole genome datasets. Other works (such as this one and this one) discuss approaches for identifying putative genetic variants by incorporating information from interaction networks or utilizing independent control datasets.
- Approaches for extracting biological information from medical literature (such as chemical-disease relation extraction and genotype-phenotype association extraction) have benefited from data programming techniques as well as from incorporating weakly labeled data; a minimal data-programming sketch follows below.
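A minimal data-programming sketch, assuming a simple majority vote over hand-written labeling functions. Real frameworks (e.g. Snorkel) instead learn each function's accuracy with a generative model, and the keyword heuristics below are purely illustrative.

```python
# Minimal sketch of data programming for chemical-disease relation extraction:
# several noisy labeling functions vote on a sentence and their votes are
# combined into a weak label.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_induce_keyword(sentence):
    return POSITIVE if "induce" in sentence.lower() else ABSTAIN

def lf_treat_keyword(sentence):
    return NEGATIVE if "treat" in sentence.lower() else ABSTAIN

def lf_no_cooccurrence(sentence):
    # Placeholder for a richer check (e.g. requiring both a chemical and a
    # disease mention); it always abstains in this toy version.
    return ABSTAIN

labeling_functions = [lf_induce_keyword, lf_treat_keyword, lf_no_cooccurrence]

def weak_label(sentence):
    votes = [lf(sentence) for lf in labeling_functions]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)  # majority vote over non-abstains

print(weak_label("Aspirin was reported to induce gastric ulcers."))  # -> 1
print(weak_label("The drug is used to treat hypertension."))          # -> 0
```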