
Ling Chen


Computational biologist specializing in machine learning for genomics


About me


Hi! I’m Ling Chen, a PhD student in computational biology at Vanderbilt.

I’m passionate about using creative, data-driven approaches to make scientific discoveries and solve problems. My skills are broad: machine learning, statistical analysis, data visualization, and website development. And I’m always learning more!


My research

My research interest lies in using machine learning algorithms to model and understand the human genome, in order to uncover associations between non-coding mutations and disease. Here are a couple of projects that I’ve worked on:

Does DNA have a regulatory grammar?
Project repo: https://github.com/lingchen42/simDNA
Abstract:
Determining the effect of non-coding mutations on phenotypes remains challenging due to the complexity of mammalian gene regulatory elements. There are several models of how combinatorial binding of transcription factors, i.e., regulatory grammar, drives enhancer activity, ranging from the flexible TF billboard model to the stringent enhanceosome model. However, we have limited knowledge of the prevalence of these sequence architectures across enhancers. Deep neural networks (DNNs) have achieved state-of-the-art performance in predicting gene regulatory sequences, but they have provided limited insight into regulatory sequence architecture because it is difficult to interpret the biological meaning of the complex combinatorial patterns they learn. Here we developed several hypothesis-driven simulation analyses to explore the ability of convolutional neural networks (CNNs) to capture the regulatory grammar of enhancers.
We created a synthetic dataset with regulatory grammar based on previous hypotheses about combinatorial binding rules, including homotypic clusters, heterotypic clusters, and enhanceosomes, using real TF motifs from all TF families. We then trained a CNN variant, the residual neural network (ResNet), to model the sequences with simulated regulatory grammar under a range of scenarios that mimic real multi-label regulatory sequence prediction tasks, considering possible heterogeneity in the output enhancer categories and the fraction of transcription factor binding sites (TFBSs) outside of regulatory grammar. Next, we developed a gradient-based unsupervised clustering method to retrieve the regulatory grammar from the ResNet models. We found that the simulated regulatory grammar is best learned in the penultimate layer of the DNNs, and that the proposed method can accurately retrieve the regulatory grammar despite heterogeneity in the enhancer categories and a large fraction of TFBSs outside of regulatory grammar. We also identified scenarios where the DNNs failed to learn an accurate representation of regulatory grammar, including not using stringent controls as negative samples, output categories differing in multiple sequence features, and an overwhelming number of TFBSs outside of regulatory grammar. These results demonstrate that how efficiently DNNs learn regulatory grammar depends heavily on the prediction task, and they provide a framework for interpreting the regulatory grammar.
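
To give a feel for what "sequences with simulated regulatory grammar" can look like, here is a minimal sketch in Python. It is not code from the simDNA repo. The motif strings, the function names, and the `spacing`/`jitter` parameters are all hypothetical choices made for illustration, and the three toy consensus strings stand in for the real TF motifs mentioned above.

```python
import random

# Hypothetical consensus strings standing in for real TF motifs (assumption).
MOTIFS = {"TF_A": "TGACTCA", "TF_B": "GATAAG", "TF_C": "CACGTG"}


def random_background(length, rng):
    """Return a uniformly random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def embed_cluster(tf_names, seq_len=200, spacing=5, jitter=0, rng=None):
    """Embed the named motifs, in order, into a random background sequence.

    Each gap between neighboring motifs is `spacing` plus a random extra of
    0..`jitter` bases. jitter=0 gives a fixed arrangement (enhanceosome-like);
    jitter>0 gives a looser arrangement (billboard-like).
    Returns (sequence, start position of the cluster).
    """
    rng = rng or random.Random()
    parts = []
    for i, name in enumerate(tf_names):
        if i > 0:
            gap = spacing + rng.randint(0, jitter)
            parts.append(random_background(gap, rng))
        parts.append(MOTIFS[name])
    cluster = "".join(parts)
    if len(cluster) > seq_len:
        raise ValueError("cluster does not fit in the requested sequence length")
    start = rng.randrange(seq_len - len(cluster) + 1)
    left = random_background(start, rng)
    right = random_background(seq_len - start - len(cluster), rng)
    return left + cluster + right, start


rng = random.Random(0)
# Homotypic cluster: repeated sites for the same TF.
homotypic, h_start = embed_cluster(["TF_A"] * 3, rng=rng)
# Heterotypic cluster: sites for different TFs with fixed order and spacing.
heterotypic, het_start = embed_cluster(["TF_A", "TF_B", "TF_C"], rng=rng)
```

Sequences like these, labeled by the rule that generated them, are the kind of input a classifier could be trained on. A negative set under the same assumptions might contain the same motifs placed independently at random positions, so that only the arrangement differs between classes.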

Machine learning prediction framework for regulatory DNA evolution
Publication
Project repo: https://github.com/lingchen42/EnhancerCodeConservation
Abstract:
Genomic regions with gene regulatory enhancer activity turn over rapidly across mammals. In contrast, gene expression patterns and transcription factor binding preferences are largely conserved between mammalian species. Based on this conservation, we hypothesized that enhancers active in different mammals would exhibit conserved sequence patterns in spite of their different genomic locations. To investigate this hypothesis, we evaluated the extent to which sequence patterns that are predictive of enhancers in one species are predictive of enhancers in other mammalian species by training and testing two types of machine learning models. We trained support vector machine (SVM) and convolutional neural network (CNN) classifiers to distinguish enhancers defined by histone marks from the genomic background based on DNA sequence patterns in human, macaque, mouse, dog, cow, and opossum. The classifiers accurately identified many adult liver, developing limb, and developing brain enhancers, and the CNNs outperformed the SVMs. Furthermore, classifiers trained in one species and tested in another performed nearly as well as classifiers trained and tested on the same species. We observed similar cross-species conservation when applying the models to human and mouse enhancers validated in transgenic assays. This indicates that many short sequence patterns predictive of enhancers are largely conserved. The sequence patterns most predictive of enhancers in each species matched the binding motifs for a common set of TFs enriched for expression in relevant tissues, supporting the biological relevance of the learned features. Thus, despite the rapid change of active enhancer locations between mammals, cross-species enhancer prediction is often possible. Our results suggest that short sequence patterns encoding enhancer activity have been maintained across more than 180 million years of mammalian evolution.
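
One common way to turn "short sequence patterns" into classifier input is to count k-mers. The sketch below illustrates that idea only. The abstract does not specify the feature encoding, so the k-mer representation, the function names, and the choice of k are assumptions here, not the project's actual pipeline.

```python
from itertools import product


def kmer_index(k):
    """Map every possible DNA k-mer to a fixed column index (lexicographic order)."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def kmer_counts(seq, k=3):
    """Count occurrences of each k-mer in seq, skipping windows with non-ACGT bases."""
    index = kmer_index(k)
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # windows containing N or other symbols are ignored
            counts[j] += 1
    return counts


# Toy example: a short sequence with one ambiguous base.
vec = kmer_counts("ACGTACGTNAC", k=2)
```

Because the column order depends only on k, vectors computed from different species share the same feature space. That is what would let a model fit on enhancers from one species be scored directly on sequences from another.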

Gene Regulatory Enhancers with Evolutionarily Conserved Activity Are More Pleiotropic than Those with Species-Specific Activity
Publication
Abstract:
Studies of regulatory activity and gene expression have revealed an intriguing dichotomy: There is substantial turnover in the regulatory activity of orthologous sequences between species; however, the expression level of orthologous genes is largely conserved. Understanding how distal regulatory elements, for example, enhancers, evolve and function is critical, as alterations in gene expression levels can drive the development of both complex disease and functional divergence between species. In this study, we investigated determinants of the conservation of regulatory enhancer activity for orthologous sequences across mammalian evolution. Using liver enhancers identified from genome-wide histone modification profiles in ten diverse mammalian species, we compared orthologous sequences that exhibited regulatory activity in all species (conserved-activity enhancers) to shared sequences active only in a single species (species-specific-activity enhancers). Conserved-activity enhancers have greater regulatory potential than species-specific-activity enhancers, as quantified by both the density and diversity of transcription factor binding motifs. Consistent with their greater regulatory potential, conserved-activity enhancers have greater regulatory activity in humans than species-specific-activity enhancers: They are active across more cellular contexts, and they regulate more genes than species-specific-activity enhancers. Furthermore, the genes regulated by conserved-activity enhancers are expressed in more tissues and are less tolerant of loss-of-function mutations than those targeted by species-specific-activity enhancers. These consistent results across various stages of gene regulation demonstrate that conserved-activity enhancers are more pleiotropic than their species-specific-activity counterparts. This suggests that pleiotropy is associated with the conservation of regulatory activity across mammalian evolution.

High-throughput analysis tool for CRISPR repair patterns
Project repo: https://github.com/lingchen42/dsbra
Developed a high-throughput sequencing data analysis tool to perform bioinformatics analyses of CRISPR-induced DNA double-strand break patterns for our collaborators (>100k sequencing reads).


Work experience

Data Scientist Intern | Illumina
June 2019 - Present

  • Created predictive analytics solutions for sequencing run quality utilizing modern machine learning algorithms.
  • Developed a web dashboard to provide instant machine learning model insight for product engineers and customer support.

Teaching Assistant | Vanderbilt University
Sep 2014 - Feb 2017

  • Taught weekly lab sessions in the “Introduction to Biology Lab” course (90+ undergraduate students), where students learned biological lab and analytical skills, such as natural selection simulations, DNA extraction and PCR, and statistical analyses with R.

Data Analyst Intern | Beijing Genomics Institute (BGI)
Jul 2011 - Aug 2011

  • Performed a literature review of 300+ articles on disease mutations.
  • Organized disease mutations with literature support into a disease mutation database.

Education

Doctor of Philosophy | Vanderbilt University
Aug 2014 - Dec 2019 (expected)
Computational Biology
Advisor: Dr. Tony Capra

Bachelor’s degree | Nankai University
Sep 2010 - June 2014
Biological Sciences, 2014 Poling Class.
Advisor: Dr. Quan Chen, Dr. Yushan Zhu, Dr. Jinhong Zhang


Publications

My Google Scholar
2019

  • Colbran, Laura L., Ling Chen, and John A. Capra. “Sequence Characteristics Distinguish Transcribed Enhancers from Promoters and Predict Their Breadth of Activity.” Genetics (2019): genetics-301895.

2018

  • Chen, Ling*, Alexandra E. Fish*, and John A. Capra. “Prediction of gene regulatory enhancers across species reveals evolutionarily conserved sequence properties.” PLoS computational biology 14.10 (2018): e1006484.

2017

  • Fish, Alexandra E.*, Ling Chen*, and John A. Capra. “Gene Regulatory Enhancers with Evolutionarily Conserved Activity Are More Pleiotropic than Those with Species-Specific Activity.” Genome biology and evolution 9.10 (2017): 2615-2625.

  • Colbran, Laura L., Ling Chen, and John A. Capra. “Short DNA sequence patterns accurately identify broadly active human enhancers.” BMC genomics 18.1 (2017): 536.

2016

  • Ma, Biao, Hongcheng Cheng, Ruize Gao, Chenglong Mu, Ling Chen, Shian Wu, Quan Chen, and Yushan Zhu. “Zyxin-Siah2–Lats2 axis mediates cooperation between Hippo and TGF-β signalling pathways.” Nature communications 7 (2016): 11123.

2015

  • Ma, Biao*, Yan Chen*, Ling Chen, Hongcheng Cheng, Chenglong Mu, Jie Li, Ruize Gao et al. “Hypoxia regulates Hippo signalling through the SIAH2 ubiquitin E3 ligase.” Nature cell biology 17, no. 1 (2015): 95.