COMPUTATIONAL APPROACHES TO EXPAND THE APPLICATIONS OF CHROMATIN STATE ANNOTATIONS

24/3/2025

Abstract: 

Genome-wide mappings of chromatin marks such as histone modifications and open chromatin sites provide valuable information for annotating the non-coding genome. Computational approaches such as ChromHMM have been applied to discover the combinatorial and spatial patterns of chromatin marks in a biosample, characterize them as chromatin states, and subsequently annotate the biosample’s epigenome into chromatin states. As more biosamples’ chromatin marks data are generated, it becomes more challenging to manually study biological similarities and differences in the chromatin state maps across many biosamples. We therefore have developed methods to derive epigenome annotations that incorporate data from multiple biosamples and highlight notable epigenetic properties. 

First, we introduced a large-scale application of ChromHMM that generates a universal chromatin state map for the human genome that can be shared across cell types. In particular, we trained ChromHMM with input data from >1,000 experiments in >100 human biosamples from the Roadmap and ENCODE projects. We denoted the resulting chromatin state map the ‘full-stack’ annotation. We conducted comprehensive analyses to characterize the full-stack states’ biological interpretations, and uncovered patterns of cell-type-specific or constitutive regulatory activities in each state. The full-stack annotation, along with detailed state characterizations, is useful for researchers seeking to understand the epigenetic contexts of genomic loci of interest.

Building on this work, we developed and analyzed an equivalent universal chromatin state annotation for the mouse genome. We trained this annotation using input data from >900 ChIP-seq, ATAC-seq or DNase-seq experiments from the Mouse ENCODE project, and related the resulting states to those from the human full-stack model. Given the wide use of mice as a model organism to study human disease mechanisms, we expect the mouse full-stack annotation to be highly useful for researchers investigating the mouse epigenetic landscape.

Lastly, we developed a method named CSREP to derive a genome-wide probabilistic summary chromatin state map given data from a group of biosamples with common biological properties. We validated CSREP’s output summary chromatin state maps for groups of samples with shared tissue types from the Roadmap and EpiMap projects, and showed that CSREP predicts the genomic locations of individual chromatin states in held-out biosamples better than a counting-based baseline. We further showed an extension of CSREP where the summary chromatin state maps for two groups of samples are used to prioritize differential chromatin state changes between the two groups.

Overall, our work aims to derive genome-wide chromatin state annotations that aggregate and summarize the patterns of epigenetic assays within and across different cell identities. All methods we present can be widely applied to newer and larger datasets that will be made available in the future, while the annotation data we provide can be useful to the larger community in understanding the regulatory patterns across the genomes of various organisms.

Section 1: Introduction to Bioinformatics/Computational biology and Epigenetics

Bioinformatics and computational biology encompass the study of data science and machine learning and their application to assist and speed up biological discoveries. First, bioinformatics often involves scoping the right biological applications for existing machine learning and statistical frameworks that were originally designed for more structured and human-generated data such as images and text. Second, bioinformatics and computational biology also involve understanding the data formats and constraints (often through understanding the biological contexts and applications) and then developing novel statistical and machine learning frameworks that can accommodate the characteristics of the data.

My PhD thesis work in bioinformatics specializes in developing algorithms and data engineering solutions specifically for epigenetics datasets. To understand the context and impact of this work, we first introduce the general definition of epigenetics. Epigenetics is the study of changes in gene expression that do not involve DNA sequence changes, i.e. changes to cells’ molecular profiles that are often influenced by environmental factors rather than by changes to the heritable genetic code (the DNA). Unlike the genome, which is almost identical in all cells in one’s body, the epigenome differs between cell types, varies over time, and is influenced by the environment and the aging process. There are different epigenetic biomarkers, measured through experimental assays that either profile the 3D folding structure of the genome, or profile the presence/absence of different epigenomic marks (which are associated with various functions that control gene expression).

More specifically, my PhD thesis, summarized and outlined in this report, focuses on tackling the problem of human epigenome annotation, i.e. creating meaningful ‘labels’ for each position along the genome based on epigenetic biomarkers. These labels are of great value to biologists interested in understanding the mechanisms through which genes are turned on/off, which can in turn lead to changes in one’s phenotypes, e.g. health-disease status or physical traits. Since there exist many categories of epigenetic markers (each of which, if profiled genome-wide, can constitute gigabytes of data), machine learning and statistical methods are needed to discover the patterns of presence/absence of different biomarkers along the length of the genome, and then effectively categorize the genome into different labels. These learned labels are needed to interpret the functions of different regions of the genome in different cell types. For example, regions labelled as ‘promoter’ correspond to the beginning of genes, and active promoters turn on gene expression. On the other hand, regions labelled as ‘heterochromatin’ (often recognized by high signals of the epigenetic mark H3K9me3) correspond to regions of the genome that are often tightly packed and not accessible to other chromatin marks (hence, these regions may not actively increase gene expression). Overall, these epigenetics-based labels are highly useful for biologists to understand the potential contexts and mechanisms of the DNA code itself. For example, if one is interested in a particular genetic mutation that is found to be associated with a certain condition, one can investigate this mutation’s epigenetic context to understand whether the mutation lies in an active or inactive region that can consequently turn genes on/off.

Section 2: Introduction to three advancements in machine learning applications for genome annotation through epigenetics profiles

Epigenomic marks such as histone modifications, open chromatin regions or methylation can correspond to different categories of gene regulatory elements (Barski et al., 2007; Boyle et al., 2008; Thurman et al., 2012; Xie et al., 2013). Methods such as ChromHMM (Ernst and Kellis, 2010, 2012) or Segway (Hoffman et al., 2012) were developed to learn the recurrent combinatorial patterns of multiple chromatin marks across the genome of a biosample, classify these patterns into chromatin states, and eventually generate a genome annotation for that biosample. The resulting annotation maps out various regulatory elements such as promoters, enhancers, or inactive domains, and is useful in understanding the epigenomic contexts of biological phenomena. As epigenomic data becomes more abundant and diverse in assays and profiled biosamples, new questions emerge. 

The first question, which is addressed in Section 3, involves the possibility and applicability of aggregating the patterns across epigenomic mappings for multiple cell and tissue types, then learning a single chromatin state annotation shared across the input biosamples. This requires training a chromatin state discovery tool (ChromHMM) such that input datasets from different tissue types are stacked as independent tracks and the learned states are defined jointly across tissue types. This approach was previously limited by scalability challenges and differs from the per-biosample learning that was widely used before. Here, we developed a model, denoted the full-stack model, using data from 1032 experiments in 127 biosamples from different human cell/tissue types (ENCODE Project Consortium, 2012; Roadmap Epigenomics Consortium et al., 2015). This results in a universal human genome annotation that is shared across cell types, and whose states can correspond to either constitutive regulatory functions or cell-type-specific activities. We conducted thorough analyses to characterize the states and uncovered many states with different cell-type-specific regulatory functions, such as Brain-related enhancers, embryonic stem cell (ESC)-related bivalent promoters, and chromosome X-specific quiescent regions. We also used the full-stack annotation to analyze the epigenetic contexts of various classes of genetic variation, such as cancer-associated somatic mutations, structural variants, and rare and common variants. We argue, through reasoning and quantitative analyses with external genome annotations, that the full-stack annotation offers complementary value to existing biosample-specific annotations from the Roadmap (Roadmap Epigenomics Consortium et al., 2015) and ENCODE (ENCODE Project Consortium, 2012) consortia.

In Section 4, we extend the above-mentioned approach to genome annotation by training an equivalent stacked ChromHMM model on 901 ChIP-seq/DNase-seq/ATAC-seq datasets from 26 mouse cell types from the Mouse ENCODE project (Stamatoyannopoulos et al., 2012; Yue et al., 2014). We conducted analyses equivalent to those for the human model to characterize the biological implications of the resulting states, and generated a mouse full-stack annotation (H. T. Vu and Ernst, 2022). We also analyzed the relationships between states from the two organisms’ annotations, and how the similarities and differences between the two models are reflected in functional conservation scores between the two species (Kwon and Ernst, 2021). The human and mouse universal chromatin state maps have been adopted in several collaborative projects with other labs to elucidate the epigenetic contexts of regions involved in aging (Lu et al., 2022), mammalian maximum lifespan prediction (Li et al., 2021) and Huntington’s disease.

Another question that emerges with more abundant epigenomic data is how to capture probabilistic summary chromatin state annotations for a group of related biosamples (such as those with shared sex, tissue/cell type, case/control status, etc.). We developed a method named CSREP for this purpose, which is presented in Section 5. CSREP trains an ensemble of multi-class logistic regression classifiers that predict the state annotation in one biosample, given the corresponding annotations in the others. We can take the difference between CSREP’s summary chromatin state maps for two groups of samples to derive a map of genome-wide differential chromatin scores between these two groups. We conducted leave-one-out analyses, using chromatin state annotation data from 11 cell/tissue groups from Roadmap (Roadmap Epigenomics Consortium et al., 2015), to evaluate how well CSREP’s summary chromatin state map for a group of samples can predict the genomic locations of individual chromatin states in a held-out sample. Here, CSREP resulted in better predictions than a baseline approach that simply counts state frequencies across input samples. We further showed, through various analyses, that the differential chromatin scores output by CSREP for two groups of samples can predict external assays that distinguish the two groups. For example, CSREP differential scores between ESC and Brain sample groups can predict the genomic locations of Brain-specific or ESC-specific peaks of multiple chromatin marks (H3K27ac, H3K9ac and DNase). Using CSREP, we generated summary chromatin state maps for 11 cell/tissue groups from Roadmap (Roadmap Epigenomics Consortium et al., 2015) and 75 groups from EpiMap (Boix et al., 2021).
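
The subtraction step described above can be sketched in a few lines. This is a minimal illustration, assuming each group's summary map is already available as a positions-by-states matrix of probabilities; the function name and the toy matrices are hypothetical and are not taken from the CSREP code base.

```python
from typing import List

Matrix = List[List[float]]  # rows: genomic positions, columns: chromatin states

def differential_scores(summary_a: Matrix, summary_b: Matrix) -> Matrix:
    """Subtract group B's summary probabilities from group A's, entry by entry.

    Every entry of both inputs is a probability in [0, 1], so every output
    score lies in [-1, 1]; a positive value means the state is more probable
    in group A than in group B at that position.
    """
    if len(summary_a) != len(summary_b):
        raise ValueError("both groups must cover the same positions")
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(summary_a, summary_b)]

# Hypothetical toy input: 2 positions x 3 states.
brain = [[0.75, 0.0, 0.25], [0.0, 0.5, 0.5]]
esc = [[0.25, 0.0, 0.75], [0.0, 0.5, 0.5]]
scores = differential_scores(brain, esc)
```

Each column of the result can be read as one per-state score track: in this toy input the first state is favored in the first group at the first position, and the second position shows no difference for any state.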

Section 3: Universal annotation of the human genome through integration of over a thousand epigenomic datasets

Genome-wide maps of histone modifications, histone variants and open chromatin provide valuable information for annotating non-coding genome features, including various types of regulatory elements (Barski et al., 2007; Boyle et al., 2008; Ernst et al., 2011; Thurman et al., 2012; Xie et al., 2013). These maps, produced by assays such as chromatin immunoprecipitation followed by high-throughput sequencing (to map histone modifications) or DNase-seq (to map open chromatin), can facilitate our understanding of regulatory elements and genetic variants that are associated with disease (Claussnitzer et al., 2015; Gjoneska et al., 2015; Kheradpour et al., 2013; Lay et al., 2014; Lee et al., 2017; Taberlay et al., 2014; Varshney et al., 2017). Efforts by large-scale consortia as well as many individual labs have resulted in these maps for many different human cell and tissue types for multiple different chromatin marks (Barski et al., 2007; Consortium, 2007; ENCODE Project Consortium, 2012; Fernández et al., 2015; Kheradpour et al., 2013; Meuleman et al., 2015; Mikkelsen et al., 2007; Stunnenberg et al., 2016; Wang et al., 2020; Zhu et al., 2013).

The availability of maps for multiple different chromatin marks in the same cell type motivated the development of methods such as ChromHMM and Segway that learn ‘chromatin states’ based on the combinatorial and spatial patterns of marks in such data (Ernst and Kellis, 2010, 2012; Hoffman et al., 2012). These methods then annotate genomes in a per-cell-type manner based on the learned chromatin states. They have been applied to annotate more than a hundred diverse cell and tissue types (Ernst et al., 2011; Meuleman et al., 2015; Libbrecht et al., 2019). Previously, large collections of per-cell-type chromatin state annotations have been generated using either (1) independent models that learn a different set of states in each cell type or (2) a single model that is learned across all cell types, resulting in a common set of states across cell types, yet generating per-cell-type annotations (in some cases per-tissue-type annotations are generated, but we will use the terms cell-type and tissue interchangeably for ease of presentation). This latter approach is referred to as a ‘concatenated’ approach (Ernst and Kellis, 2012, 2017). Variants of the concatenated approach attempt to use information from related cell types to reduce the effect of noise, but still output per-cell-type annotations (Biesinger et al., 2013; Zhang et al., 2016). These models that produce per-cell-type annotations tend to be most appropriate in studies where researchers are interested in studying individual cell types. 

A complementary approach to applying ChromHMM to data across multiple different cell types referred to as the ‘stacked’ modeling approach was also previously suggested (Ernst and Kellis, 2012, 2017). Instead of learning per-cell-type annotations based on a limited number of datasets available in each cell type, the stacked modeling approach can learn a single universal genome annotation based on the combinatorial and spatial patterns in datasets from multiple marks across multiple cell types. This approach differs from the concatenated and independent modeling approaches as those approaches only identify combinatorial and spatial patterns present among datasets within one cell type. 
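
The difference between the two data layouts can be made concrete with a small sketch. It assumes binarized input (one 0/1 present call per dataset per genomic bin); the cell types, marks, values and function names are hypothetical and only illustrate the shapes involved, not ChromHMM's actual file handling.

```python
from typing import Dict, List, Tuple

# Hypothetical toy input: calls[cell_type][mark] holds one 0/1 present call
# per genomic bin. Cell types, marks and values are illustrative only.
Calls = Dict[str, Dict[str, List[int]]]

calls: Calls = {
    "Brain": {"H3K4me3": [1, 0, 0, 1], "H3K27ac": [1, 1, 0, 0]},
    "ESC": {"H3K4me3": [1, 0, 1, 1], "H3K27ac": [0, 0, 0, 1]},
}

def stacked_matrix(calls: Calls) -> Tuple[List[Tuple[str, str]], List[List[int]]]:
    """Stacked layout: one row per bin, one column per (cell type, mark) dataset."""
    columns = [(ct, mark) for ct in calls for mark in calls[ct]]
    first_ct, first_mark = columns[0]
    n_bins = len(calls[first_ct][first_mark])
    rows = [[calls[ct][mark][i] for ct, mark in columns] for i in range(n_bins)]
    return columns, rows

def concatenated_matrix(calls: Calls) -> Tuple[List[str], List[List[int]]]:
    """Concatenated layout: one row per (cell type, bin), one column per mark."""
    marks = list(next(iter(calls.values())))
    rows: List[List[int]] = []
    for ct in calls:
        n_bins = len(calls[ct][marks[0]])
        rows.extend([calls[ct][m][i] for m in marks] for i in range(n_bins))
    return marks, rows

stacked_cols, stacked_rows = stacked_matrix(calls)
concat_cols, concat_rows = concatenated_matrix(calls)
```

In the stacked layout every row carries the calls from all cell types at once, so a state learned from these rows can describe a cross-cell-type pattern. In the concatenated layout each row only sees the marks of one cell type, so the learned states are shared but the annotation remains per cell type.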

Such a universal annotation from stacked modeling provides potential complementary benefits to existing concatenated and independent chromatin state annotations. First, since the model can learn patterns from signals from the same assay across cell types, a stacked model may help differentiate regions with constitutive chromatin activities from those with cell-type-specific activities. Previously, subsets of the genome assigned to individual chromatin states from ‘concatenated’ annotations were post-hoc clustered to analyze chromatin dynamics across cell and tissue types (Ernst et al., 2011; Meuleman et al., 2015). However, such an approach does not provide a view of the dynamics of all the data at once, which the stacked modeling provides. Second, the stacked modeling approach bypasses the need to pick a specific cell or tissue type when analyzing a single partitioning and annotation of the genome. Focusing on a single cell or tissue type may not be desirable for many analyses involving other data that are not inherently cell-type-specific, such as those involving conserved DNA sequence or genetic variants. For example, when studying the relationship between chromatin states and evolutionarily conserved sequences, if one uses per-cell-type chromatin state annotations from one cell type, many bases will lack an informative chromatin state assignment (e.g. many bases are in a quiescent state), while subsets of those bases will have a more informative annotation in other cell types. Third, if one tries to analyze per-cell-type annotations across cell types, one would need a post-hoc method to reason about an exponentially large number of possible combinations of chromatin states across cell types (if each of K cell types has M states, there are M^K possible combinations of states for a genomic position), many of which would likely lack biologically meaningful distinctions.
In contrast, for the stacked model, there will be a single annotation per position out of a possibly much smaller fixed number of states (compared to M^K). These states are directly informative of cross-cell-type activity, though the state definitions can be more complex. Finally, annotations from the stacked modeling leverage a larger set of data, and thus have the potential to identify genomic elements with greater sensitivity and specificity.
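
To give a sense of scale for the combinatorial argument (M states in each of K cell types give M to the power K combinations), the short calculation below plugs in illustrative numbers. The values 18 states, 127 cell types and 100 full-stack states all appear elsewhere in this report, but pairing them in this way is an assumption made only for the example.

```python
# Illustrative values only: 18 states per cell type and 127 cell types are
# borrowed from numbers quoted elsewhere in this report, and 100 is the number
# of full-stack states; pairing them here is an assumption for the example.
M, K = 18, 127
per_cell_type_combinations = M ** K  # exact integer; Python ints are unbounded
full_stack_states = 100

digits = len(str(per_cell_type_combinations))
print(f"M^K has {digits} decimal digits; the stacked model uses {full_stack_states} states")
```

Even for modest M, the number of possible per-position state combinations is far beyond anything that could be enumerated, whereas a stacked model assigns each position to one state from a fixed, small set.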

Despite the potential complementary advantages of the ‘stacked’ modeling approach, it has only been applied on a limited scale to combine data from a small number of cell types for highly specialized purposes (Chronis et al., 2017; Mortazavi et al., 2013). No large-scale application of the stacked modeling approach to many diverse cell and tissue types has been previously demonstrated. This may have in part been due to large-scale applications of stacked modeling raising scalability challenges not present in modeling approaches for concatenated and independent annotations. 

Here, we present a large-scale application of the stacked modeling approach with more than a thousand human epigenomic datasets as input, using a version of ChromHMM for which we enhanced the scalability. We conduct various enrichment analyses on the states resulting from the stacked modeling and give biological interpretations to them. We show that compared to the per-cell-type annotations from independent and concatenated models, the stacked model’s annotation shows greater correspondence to various external genomic annotations not used in the model learning. We analyze the states in terms of enrichment with different types of variation, and highlight specific states of the stacked model that are enriched with phenotypically associated genetic variants, cancer-associated somatic mutations, and structural variants. We expect the stacked model annotations and detailed characterization of the states that we provide will be a valuable resource for studying the epigenome and non-coding genome, complementing existing per-cell-type annotations. 
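
As an illustration of what a state enrichment analysis computes, the sketch below uses a common fold-enrichment definition: the fraction of a state's bins that overlap an external annotation, divided by the fraction of all bins that overlap it. The exact statistic used in the published analyses is not restated here, so this definition, the equal-size-bin simplification, the function name and the toy data are all assumptions.

```python
from typing import List, Set

def fold_enrichment(state_per_bin: List[str], annotated_bins: Set[int], state: str) -> float:
    """Fold enrichment of `state` for an external annotation, over equal-size bins.

    (fraction of the state's bins that overlap the annotation)
    divided by (fraction of all bins that overlap the annotation).
    `annotated_bins` is assumed to contain only valid bin indices.
    """
    n_bins = len(state_per_bin)
    state_bins = [i for i, s in enumerate(state_per_bin) if s == state]
    if not state_bins or not annotated_bins:
        raise ValueError("state and annotation must each cover at least one bin")
    overlap = sum(1 for i in state_bins if i in annotated_bins)
    return (overlap / len(state_bins)) / (len(annotated_bins) / n_bins)

# Hypothetical segmentation of 8 bins and a hypothetical external annotation.
segmentation = ["TSS1", "TSS1", "Quies", "Quies", "Quies", "Quies", "EnhA6", "Quies"]
external_annotation_bins = {0, 1, 2}

tss_enrichment = fold_enrichment(segmentation, external_annotation_bins, "TSS1")
quies_enrichment = fold_enrichment(segmentation, external_annotation_bins, "Quies")
```

A value above 1 indicates that the state overlaps the annotation more often than expected from the annotation's genome-wide coverage, and a value below 1 indicates depletion.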

This work was published in Vu and Ernst, 2022. 

Section 4: Universal chromatin state annotation of the mouse genome

The mouse is widely adopted as a model organism for humans for many reasons, including its genetic and physiological proximity to humans, relatively short life span, and availability as a test subject for genetic manipulations (Vanhooren and Libert, 2013; Aitman et al., 2011; Perlman, 2016). A wealth of epigenomic datasets in mouse, including maps of histone modifications and variants and sites of accessible DNA, has accumulated thanks to efforts from different consortia and individual labs, and can be used to annotate the mouse genome, including non-coding regions (Kazachenka et al., 2018; Stamatoyannopoulos et al., 2012; Yue et al., 2014; Zhu et al., 2021; Hon et al., 2013; Tsai et al., 2009; Rugg-Gunn et al., 2010). This type of data has previously been integrated by methods such as ChromHMM and Segway (Ernst and Kellis, 2010, 2012; Hoffman et al., 2012; Libbrecht et al., 2021) to generate chromatin state maps for various organisms including different mouse and human cell and tissue types (Yue et al., 2014; ENCODE Project Consortium, 2012; van der Velde et al., 2021; Bogu et al., 2015; Sugathan and Waxman, 2013; Gorkin et al., 2020). These chromatin state maps have traditionally been used to annotate genomes in a per-cell-type manner using either the ‘independent’ or ‘concatenated’ modeling approaches (for ease of presentation, we will refer to tissue types also as cell types) (Ernst and Kellis, 2017; Libbrecht et al., 2021).

In the previous section, we presented an alternative ‘stacked’ modelling approach of ChromHMM to learn chromatin states from over 1000 human datasets representing more than 100 cell types, to generate a universal annotation of the human genome that can annotate all human cell types (H. Vu and Ernst, 2022). This modeling provided a single annotation of the genome per position based on data from all the input cell types. Such an annotation, denoted full-stack annotation, offers complementary advantages to per-cell-type annotations, such as differentiating constitutively active regions from cell-type-specific ones and simplifying genome annotations across cell types through a single annotation shared across cell types as opposed to one for each. Additionally, the full-stack annotation allows researchers to bypass picking a single cell type for analyses or conducting analyses separately for every cell type. This can be particularly useful in studies involving data that is not inherently cell-type-specific such as analyses of genetic variants or conserved DNA sequence. However, an analogous full-stack annotation has not been previously available in mouse.
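
Because a full-stack annotation assigns exactly one state to each position, annotating a variant or other locus reduces to a single lookup, with no cell type to choose. The sketch below assumes the annotation has been loaded as sorted, non-overlapping BED-style segments per chromosome; the coordinates are hypothetical, and the state labels are used only as examples.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Segment = Tuple[int, int, str]  # (start, end, state), 0-based half-open as in BED

# Hypothetical segments for one chromosome, sorted by start and non-overlapping.
annotation: Dict[str, List[Segment]] = {
    "chr1": [(0, 1000, "Quies"), (1000, 1400, "TSS1"), (1400, 3000, "EnhA6")],
}

def state_at(annotation: Dict[str, List[Segment]], chrom: str, pos: int) -> Optional[str]:
    """Return the state covering a 0-based position, or None if it is uncovered."""
    segments = annotation.get(chrom, [])
    starts = [start for start, _, _ in segments]
    idx = bisect_right(starts, pos) - 1  # last segment starting at or before pos
    if idx >= 0 and segments[idx][0] <= pos < segments[idx][1]:
        return segments[idx][2]
    return None

variant_state = state_at(annotation, "chr1", 1000)
```

For many lookups, the list of start coordinates would be built once per chromosome rather than on every call; it is rebuilt here only to keep the sketch short.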

To address this, we train a full-stack model with ChromHMM using input data from >900 mouse datasets of 14 chromatin marks from 26 mouse cell type groups. We analyze these states with respect to their enrichments with external datasets and annotations to provide detailed characterizations for each state. We also analyze to what extent each state is conserved in human. We expect the mouse full-stack annotations, along with the provided biological characterizations, will be a useful resource for studying this key model organism.

This work was published in Vu and Ernst, 2023.

Section 5: A framework for group-wise summarization and comparison of chromatin state annotations

Genome-wide maps of chromatin marks such as histone modifications and variants provide valuable information for annotating non-coding genome features (Barski et al., 2007; Ernst et al., 2011; Zhu et al., 2013; Xie et al., 2013). Efforts by large consortia and individual labs have produced chromatin state maps for many cell and tissue types (Roadmap Epigenomics Consortium et al., 2015; ENCODE Project Consortium, 2012; Zhu et al., 2013; Xie et al., 2013). A popular representation of such data is chromatin states defined by the combinatorial and spatial patterns of multiple marks, which are generated by methods such as ChromHMM and Segway (Libbrecht et al., 2021; Ernst and Kellis, 2010, 2012; Hoffman et al., 2012), and correspond to diverse classes of genomic elements including various types of enhancers and promoters.

Chromatin state maps have been produced for hundreds of different biological samples. In many cases there are multiple samples representing similar cell and tissue types (Boix et al., 2021; Roadmap Epigenomics Consortium et al., 2015). In such cases, to simplify analyses and visualizations, it may be desirable to have a single chromatin state annotation that summarizes the annotations for all samples in a pre-defined sample group of interest. A straightforward approach to this task is to take the most frequent chromatin state assigned at each position across samples in the group. However, when the number of samples in a group is small or the number of states is large, such an approach can be particularly vulnerable to noise. Furthermore, such an approach does not consider additional information available about the different chromatin states. For example, if a location was assigned to three different states in three samples, the summary annotation among these three states based on the frequency-based method would be arbitrary. However, by leveraging information about the co-occurrence of state assignments genome-wide, there is additional information to predict the most likely chromatin state annotation for a new sample from the group. 
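
The frequency-based approach and its tie problem can be sketched as follows. The function names and state labels are hypothetical, and the alphabetical tie-break is just one arbitrary choice, which is exactly the weakness described above.

```python
from collections import Counter
from typing import Dict, List, Tuple

def state_frequencies(states: List[str]) -> Dict[str, float]:
    """Fraction of samples in the group assigned to each state at one position."""
    counts = Counter(states)
    return {state: count / len(states) for state, count in counts.items()}

def most_frequent_state(states: List[str]) -> Tuple[str, List[str]]:
    """Return the summary state and the list of all states tied for first place.

    Ties are broken alphabetically, which is arbitrary: nothing in the counts
    themselves favors one tied state over another.
    """
    counts = Counter(states)
    top = max(counts.values())
    tied = sorted(state for state, count in counts.items() if count == top)
    return tied[0], tied

# Three samples, three different states at the same position: a three-way tie.
summary_state, tied_states = most_frequent_state(["TssA", "EnhA", "Quies"])
```

With three samples in three different states, every state has frequency 1/3, so the reported summary state depends only on the tie-breaking rule.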

A related challenge is to identify differences in chromatin state annotations between two groups at a high resolution and on a per-state basis. Several methods have been developed for comparing chromatin state annotations between groups of samples, but they typically either work at a coarse resolution, or do not identify differences on a per-chromatin-state basis. For instance, ChromDiff (Yen and Kellis, 2015) presents a statistical testing framework to uncover pre-defined broad regions such as gene bodies with significant differences for specific chromatin states across the two groups, but was not specifically designed for detecting differences at the resolution of the chromatin state annotations. EpiAlign (Ge et al., 2019) scores the alignment patterns between two user-input sequences of chromatin state annotations in two samples, and hence is also most applicable for comparing broad domains that encompass multiple chromatin state segments. Another method, chromswitch (Jessa and Kleinman, 2018), also offers a framework to score the differential chromatin state annotations within a broader user-specified genomic locus, and is not designed for detecting chromatin state differences genome-wide at the same resolution as the annotations. EpiCompare (He and Wang, 2017) is primarily a web tool that can be used for detecting cell-type-specific chromatin state differences in terms of enhancer or promoter states, but does not support detecting differences for individual states or other types of chromatin states. SCIDDO (Ebert and Schulz, 2021) conducts fast genome-wide detection of differential chromatin domains between two groups of samples while incorporating a measure of similarity among states. However, as SCIDDO provides a single differential score per position, it does not directly answer the question of which chromatin states change at each genomic position.
Another method, dPCA (Ji et al., 2013), works directly on chromatin mark signals and does not quantify state differences across groups of samples. 

To effectively summarize the chromatin state annotations for a group of samples and prioritize the chromatin state differences between two groups on a per-state basis, at high resolution, we introduce CSREP. CSREP leverages both the information about the input samples’ chromatin states at a position and information about the states’ co-occurrences in different samples within the same group across the genome. CSREP does this by first generating probabilistic estimates of chromatin state annotations to summarize a group of samples using an ensemble of multi-class logistic regression classifiers. These classifiers predict the state assignment in a sample at a position, given the annotations in other samples at the corresponding genomic position. From those predictions, CSREP is then able to produce a single summary state assignment per position. Furthermore, CSREP can use the difference of summary probabilistic predictions for two groups of samples to quantify the difference in state assignments between the two groups on a per-state basis, i.e. one genome-wide score track per chromatin state. The ability to summarize chromatin states for a group of samples beyond simple counting is a unique feature of CSREP relative to the existing methods for detecting differential chromatin states or domains mentioned above. CSREP is also distinguished from these existing methods by a combination of (1) considering differential chromatin state annotations at the resolution of the input annotations instead of over broad domains, (2) generating outputs genome-wide instead of at user-specified loci, and (3) providing state-specific and directionally meaningful scores for all states.
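
The summarization procedure described above can be sketched end to end on toy data. This is a simplified illustration, not the CSREP implementation: it uses a small hand-written multi-class logistic regression fitted by batch gradient descent, trains and predicts on the same handful of positions, and all function names, hyperparameters and data are hypothetical.

```python
import math
from typing import List, Tuple

def one_hot(states: List[int], n_states: int) -> List[float]:
    """Concatenate one indicator vector per input sample."""
    vec: List[float] = []
    for s in states:
        vec.extend(1.0 if k == s else 0.0 for k in range(n_states))
    return vec

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W: List[List[float]], b: List[float], x: List[float]) -> List[float]:
    """State probabilities for one position given its one-hot feature vector."""
    return softmax([sum(w * v for w, v in zip(row, x)) + bk for row, bk in zip(W, b)])

def fit(X: List[List[float]], y: List[int], n_states: int,
        lr: float = 0.5, epochs: int = 200) -> Tuple[List[List[float]], List[float]]:
    """Multi-class logistic regression fitted by plain batch gradient descent."""
    n, d = len(X), len(X[0])
    W = [[0.0] * d for _ in range(n_states)]
    b = [0.0] * n_states
    for _ in range(epochs):
        grad_W = [[0.0] * d for _ in range(n_states)]
        grad_b = [0.0] * n_states
        for x, target in zip(X, y):
            p = predict(W, b, x)
            for k in range(n_states):
                err = p[k] - (1.0 if k == target else 0.0)
                grad_b[k] += err
                for j in range(d):
                    grad_W[k][j] += err * x[j]
        for k in range(n_states):
            b[k] -= lr * grad_b[k] / n
            for j in range(d):
                W[k][j] -= lr * grad_W[k][j] / n
    return W, b

def summarize_group(maps: List[List[int]], n_states: int) -> List[List[float]]:
    """Average leave-one-out predictions over all samples in the group.

    maps[s][i] is the state index of sample s at genomic position i.
    Simplification: each classifier is trained and applied on the same positions.
    """
    n_samples, n_pos = len(maps), len(maps[0])
    summary = [[0.0] * n_states for _ in range(n_pos)]
    for target in range(n_samples):
        others = [s for s in range(n_samples) if s != target]
        X = [one_hot([maps[s][i] for s in others], n_states) for i in range(n_pos)]
        W, b = fit(X, maps[target], n_states)
        for i in range(n_pos):
            for k, p in enumerate(predict(W, b, X[i])):
                summary[i][k] += p / n_samples
    return summary

# Hypothetical group: 3 samples, 8 positions, 3 states (indices 0, 1, 2).
group = [
    [0, 0, 1, 1, 2, 2, 0, 1],
    [0, 0, 1, 1, 2, 2, 0, 2],
    [0, 0, 1, 1, 2, 2, 1, 1],
]
summary = summarize_group(group, n_states=3)
```

Each row of `summary` is a probability distribution over states for one position. Taking the most probable state per row gives a single summary assignment, and subtracting two groups' matrices gives per-state differential scores.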

Using CSREP, we generate the summary chromatin state maps for 11 groups of tissue/cell types from the Roadmap Epigenomics Project (Roadmap Epigenomics Consortium et al., 2015), and for 75 groups from the EpiMap Portal (Boix et al., 2021), which can be easily viewed on genome browsers (Vu et al., 2024, Data Availability). We show that CSREP can better predict chromatin state assignments in held-out samples than a counting-based baseline method. We also verify that the resulting summary chromatin maps show correspondence with the group’s average gene expression profile. Additionally, we show that CSREP’s differential scores can recover differential epigenetic signals on chromosome X between male and female samples. We also show that CSREP differential scores between samples from two different tissue groups can predict regions of differential peaks for various chromatin marks. The CSREP implementation is designed to be user-friendly and includes a detailed tutorial, available at https://github.com/ernstlab/csrep. We expect CSREP will be a useful tool for summarizing chromatin state maps within groups and finding differences across groups. Additionally, we expect the summary annotations for different tissue groups that we generated with CSREP to be a useful resource.

This work was published in Vu et al., 2024.


Figure 1: Illustration of full-stack modeling annotations.

The figure illustrates the full-stack modeling at two loci. The top track shows chromatin state annotations from the full-stack modeling, colored based on the legend at right. Below it are signal tracks for a subset of the 1032 input datasets. Data from seven (DNase I hypersensitivity, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, and H3K9me3) of the 32 chromatin marks are shown, colored based on the legend at right. These data are from 15 of the 127 reference epigenomes, each representing different cell and tissue groups. The locus on the left highlights a genomic region for which a portion is annotated as constitutive promoter states (TSS1-2). The locus on the right highlights a region for which a portion is annotated as a brain enhancer state (EnhA6), which has high signals of H3K27ac in reference epigenomes of the group Brain.


Figure 2.2: Full-stack state emission parameters.

(A) Each of the 100 rows in the heatmap corresponds to a full-stack state. Each of the 1032 columns corresponds to one dataset. For each state and each dataset, the heatmap gives the probability within the state of observing a binary present call for the dataset’s signal. Above the heatmap there are two rows, one indicating the cell or tissue type of the dataset and the other indicating the chromatin mark. The corresponding color legends are shown towards the bottom. The states are displayed in 16 groups with white space between each group. The states were grouped based on biological interpretations indicated by the color legend at the bottom. Full characterization of states is available in Vu and Ernst, 2022, Supplementary Data 2.1-5. The model’s transition parameters between states can be found in Vu and Ernst, 2022, Supplementary Figure 2.6. Columns are ordered such that datasets profiling the same chromatin marks are next to each other.

(B) Each row corresponds to a full-stack state as ordered in (A). The columns correspond to the top 10 datasets with the highest emission value for each state, in order of decreasing ranks, colored by their associated chromatin marks as in (A). 

(C) Similar to (B), but datasets are colored by the associated cell or tissue type group. On the right, the cell or tissue groups primarily associated with some of the enhancer states are noted.
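The ranking shown in panel (B) can be illustrated with a minimal Python sketch. The state names follow the captions above, but the dataset names and emission probabilities are hypothetical values invented for illustration, and the function `top_datasets` is not part of ChromHMM.

```python
from typing import Dict, List, Tuple


def top_datasets(
    emissions: Dict[str, Dict[str, float]], k: int = 10
) -> Dict[str, List[Tuple[str, float]]]:
    """For each state, return the k datasets with the highest emission probability."""
    ranked = {}
    for state, probs in emissions.items():
        # Sort datasets by emission probability, highest first.
        ordered = sorted(probs.items(), key=lambda item: item[1], reverse=True)
        ranked[state] = ordered[:k]
    return ranked


# Hypothetical emission probabilities: P(binary present call | state) per dataset.
emissions = {
    "TSS1": {"esc-H3K4me3": 0.98, "esc-H3K27me3": 0.02, "brain-H3K4me3": 0.95},
    "EnhA6": {"esc-H3K4me3": 0.05, "brain-H3K27ac": 0.81, "brain-H3K4me1": 0.66},
}

top = top_datasets(emissions, k=2)
for state, datasets in top.items():
    print(state, datasets)
```

In this toy input, the promoter-like state ranks H3K4me3 datasets from both sample groups highest, while the enhancer-like state ranks only brain datasets highest, mirroring the constitutive versus cell-type-specific distinction drawn in the captions.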

Image

Figure 4.1: Overview of CSREP.

(A) CSREP uses an ensemble of multi-class logistic regression models. In each model, the chromatin state map of the target sample is predicted from the one-hot encoding of chromatin state assignments at the corresponding genomic positions in the other samples. Multi-class logistic regression outputs the probabilities that each genomic position (row) in the target sample will be assigned to each state (column). CSREP averages the prediction matrices across target samples to output the summary state assignment probability matrix. Sam.: sample; P(Si=s): probability that genomic position i is annotated as state s.

(B) The operations to obtain differential chromatin state assignment scores between two groups with multiple samples. CSREP calculates the summary chromatin state assignment matrices for two groups and then subtracts one group’s summary matrix from the other’s to obtain differential chromatin scores. Differential chromatin scores are bounded between -1 (brown) and 1 (blue).

(C) Visualization of CSREP’s output in a genomic region (hg19, chr5:156,012,600-156,022,400). The top subpanel shows CSREP’s summary chromatin state probabilities for 18 states across seven Brain reference epigenomes. Each track shows the probabilities of assignment for one state, as named and colored on the left. The middle subpanel shows the 18-state chromatin state maps for seven Brain samples and five ESC samples from Roadmap Epigenomics (Roadmap Epigenomics Consortium et al., 2015), and CSREP’s output summary chromatin state maps for each group, outlined in black. States are colored as in the legend at the left of this subpanel. The bottom subpanel shows the differential chromatin scores when ESC’s summary state probabilities are subtracted from Brain’s. Each track shows one state’s differential scores. Scores between 0 and 1 are colored black, while those between -1 and 0 are colored grey. This region is also shown in an expanded format in Vu et al., 2024, Supplementary Figure 4.1.
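The averaging and subtraction steps in panels (A) and (B) can be sketched in a few lines of Python. This is a minimal illustration, not the CSREP implementation: the logistic regression models are not fitted here, one-hot matrices stand in for their per-target prediction matrices, and all function names and toy data are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # rows = genomic positions, columns = states


def one_hot(state_map: List[int], num_states: int) -> Matrix:
    """Encode a chromatin state map (one state index per position) as 0/1 rows."""
    return [[1.0 if s == k else 0.0 for k in range(num_states)] for s in state_map]


def average_matrices(matrices: List[Matrix]) -> Matrix:
    """Element-wise mean of per-target prediction matrices (the summary matrix)."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(m[i][k] for m in matrices) / n for k in range(cols)]
        for i in range(rows)
    ]


def differential_scores(summary_a: Matrix, summary_b: Matrix) -> Matrix:
    """Subtract group B's summary from group A's; every entry lies in [-1, 1]."""
    return [
        [a - b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(summary_a, summary_b)
    ]


# Hypothetical toy input: 3 states, 4 genomic positions, 2 samples per group.
# The one-hot matrices are placeholders for regression outputs (assumption).
group_a = [one_hot([0, 0, 1, 2], 3), one_hot([0, 1, 1, 2], 3)]
group_b = [one_hot([2, 2, 1, 2], 3), one_hot([2, 2, 1, 0], 3)]

summary_a = average_matrices(group_a)
summary_b = average_matrices(group_b)
diff = differential_scores(summary_a, summary_b)

for position, row in enumerate(diff):
    print(position, row)
```

Because each row of a summary matrix is a probability distribution over states, each row of the differential matrix sums to zero, and a score of 1 or -1 occurs only when the two groups assign a position to a state with complete and opposite certainty.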

Image

Figure 4.2: Performance of CSREP in summarizing multiple samples’ chromatin state maps from a group.

(A) Visualization of one arbitrarily selected 500-kb region (chr5: 42,821,109-43,321,109, hg19). The first 10 tracks show chromatin state maps of 10 samples of the Digestive group from the Roadmap Epigenomics Consortium, which were input to CSREP. The following track shows the summary chromatin state map from CSREP, which shows strong agreement with the input. States are colored based on the legend on the lower left. Each of the following 18 tracks shows CSREP’s probabilities of assignment for one of the 18 states, with the state annotations shown in the legend on the left.

(B) Boxplots showing the average, range, and 25% and 75% quantiles of the AUROCs across 64 samples for the CSREP and base_count methods, for each of the 18 chromatin states. The AUROCs were calculated in a leave-one-out cross-validation analysis in which we used a group’s summary probabilistic chromatin state map to predict the genomic locations of individual chromatin states in a left-out sample from the same cell/tissue group (Vu et al., 2024, Methods). States 1-18 (x-axis) are annotated as in (A).
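The AUROC for a single state and a single left-out sample can be computed as in the following Python sketch. It uses the standard rank-sum (Mann-Whitney U) formulation of the AUROC; the function name and toy values are hypothetical, and the exact evaluation procedure is the one described in Vu et al., 2024, Methods.

```python
from typing import List


def auroc(scores: List[float], labels: List[int]) -> float:
    """AUROC via the rank-sum formulation; tied scores receive average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the block of tied scores starting at sorted position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie block
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical toy data: a group's summary probabilities for one state at six
# positions, and whether the left-out sample carries that state at each position.
summary_prob = [0.9, 0.8, 0.3, 0.6, 0.1, 0.2]
held_out_is_state = [1, 1, 0, 0, 0, 1]

score = auroc(summary_prob, held_out_is_state)
print(round(score, 4))
```

Here seven of the nine positive-negative pairs are ranked correctly, so the AUROC is 7/9. In a leave-one-out analysis, the summary probabilities would be recomputed without the left-out sample before scoring it.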

(C) Boxplots showing the Spearman correlations between (1) a group of samples’ summary probabilities of state 1_TssA (active TSS) at annotated TSSs and (2) the corresponding group’s average gene expression (Vu et al., 2024, Methods). We obtained the correlations for 8 groups of cell types from the Roadmap Epigenomics Project and 65 groups from EpiMap. Each dot shows the Spearman correlation for data from one group of samples. Results of a paired t-test comparing the correlations from CSREP and base_count are shown on top. The alternative hypothesis for the t-test is that correlations resulting from CSREP are higher than those from base_count (Vu et al., 2024, Methods).
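The two statistics in panel (C) can be sketched with the Python standard library alone. This is an illustration under stated assumptions: all input values are hypothetical, the function names are invented here, and only the paired t statistic is computed. Turning it into a one-sided p-value requires the Student t distribution, for which a library routine such as SciPy's `ttest_rel` with `alternative="greater"` could be used.

```python
import math
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def paired_t_statistic(a: List[float], b: List[float]) -> float:
    """t statistic of the paired differences a - b (positive if a tends to be higher)."""
    diffs = [p - q for p, q in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)


# Hypothetical data for one group: summary TssA probability at four TSSs
# and the group's average expression of the corresponding genes.
tss_prob = [0.9, 0.1, 0.5, 0.7]
expression = [12.0, 0.5, 3.0, 8.0]
rho = spearman(tss_prob, expression)

# Hypothetical per-group correlations from the two methods (one pair per group).
csrep_corr = [0.60, 0.55, 0.70]
base_corr = [0.50, 0.50, 0.60]
t_stat = paired_t_statistic(csrep_corr, base_corr)
print(rho, t_stat)
```

A positive t statistic is the direction favored by the alternative hypothesis stated in the caption, namely that the first method's correlations are higher than the second's.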

References

  1. Aitman,T.J. et al. (2011) The future of model organisms in human disease research. Nat Rev Genet, 12, 575–582.

  2. Barski,A. et al. (2007) High-resolution profiling of histone methylations in the human genome. Cell, 129, 823–837.

  3. Biesinger,J. et al. (2013) Discovering and mapping chromatin states using a tree hidden Markov model. In BMC Bioinformatics. Springer, p. S4.

  4. Bogu,G.K. et al. (2015) Chromatin and RNA Maps Reveal Regulatory Long Noncoding RNAs in Mouse. Mol Cell Biol, 36, 809–819.

  5. Boix,C.A. et al. (2021) Regulatory genomic circuitry of human disease loci by integrative epigenomics. Nature, 590, 300–307.

  6. Boyle,A.P. et al. (2008) High-resolution mapping and characterization of open chromatin across the genome. Cell, 132, 311–322.

  7. Chronis,C. et al. (2017) Cooperative binding of transcription factors orchestrates reprogramming. Cell, 168, 442–459.

  8. Claussnitzer,M. et al. (2015) FTO obesity variant circuitry and adipocyte browning in humans. New England Journal of Medicine, 373, 895–907.

  9. Consortium,E.P. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447, 799.

  10. Ebert,P. and Schulz,M.H. (2021) Fast detection of differential chromatin domains with SCIDDO. Bioinformatics, 37, 1198–1205.

  11. ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57–74.

  12. Ernst,J. et al. (2011) Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473, 43–49.

  13. Ernst,J. and Kellis,M. (2017) Chromatin-state discovery and genome annotation with ChromHMM. Nature protocols, 12, 2478.

  14. Ernst,J. and Kellis,M. (2012) ChromHMM: automating chromatin-state discovery and characterization. Nature methods, 9, 215–216.

  15. Ernst,J. and Kellis,M. (2010) Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature biotechnology, 28, 817–825.

  16. Fernández,A.F. et al. (2015) H3K4me1 marks DNA regions hypomethylated during aging in human stem and differentiated cells. Genome research, 25, 27–40.

  17. Ge,X. et al. (2019) EpiAlign: an alignment-based bioinformatic tool for comparing chromatin state sequences. Nucleic acids research, 47, e77–e77.

  18. Gjoneska,E. et al. (2015) Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease. Nature, 518, 365–369.

  19. Gorkin,D.U. et al. (2020) An atlas of dynamic chromatin landscapes in mouse fetal development. Nature, 583, 744–751.

  20. He,Y. and Wang,T. (2017) EpiCompare: an online tool to define and explore genomic regions with tissue or cell type-specific epigenomic features. Bioinformatics, 33, 3268–3275.

  21. Hoffman,M.M. et al. (2012) Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature methods, 9, 473.

  22. Hon,G.C. et al. (2013) Epigenetic memory at embryonic enhancers identified in DNA methylation maps from adult mouse tissues. Nat Genet, 45, 1198–1206.

  23. Jessa,S. and Kleinman,C.L. (2018) Chromswitch: a flexible method to detect chromatin state switches. Bioinformatics, 34, 2286–2288.

  24. Ji,H. et al. (2013) Differential principal component analysis of ChIP-seq. PNAS, 110, 6789–6794.

  25. Kazachenka,A. et al. (2018) Identification, Characterization, and Heritability of Murine Metastable Epialleles: Implications for Non-genetic Inheritance. Cell, 175, 1259-1271.e13.

  26. Kheradpour,P. et al. (2013) Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay. Genome research, 23, 800–811.

  27. Kwon,S.B. and Ernst,J. (2021) Learning a genome-wide score of human–mouse conservation at the functional genomics level. Nature communications, 12, 1–14.

  28. Lay,F.D. et al. (2014) Reprogramming of the human intestinal epigenome by surgical tissue transposition. Genome research, 24, 545–553.

  29. Lee,J. et al. (2017) The LDB1 complex co-opts CTCF for erythroid lineage-specific long-range enhancer interactions. Cell reports, 19, 2490–2502.

  30. Li,C.Z. et al. (2021) Epigenetic predictors of maximum lifespan and other life history traits in mammals. bioRxiv.

  31. Libbrecht,M.W. et al. (2019) A unified encyclopedia of human functional DNA elements through fully automated annotation of 164 human cell types. Genome biology, 20, 180.

  32. Libbrecht,M.W. et al. (2021) Segmentation and genome annotation algorithms for identifying chromatin state and other genomic patterns. PLoS Computational Biology, 17, e1009423.

  33. Lu,A.T. et al. (2022) Universal DNA methylation age across mammalian tissues. bioRxiv, 2021.01.18.426733.

  34. Meuleman,W. et al. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518, 317–330.

  35. Mikkelsen,T.S. et al. (2007) Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature, 448, 553–560.

  36. Mortazavi,A. et al. (2013) Integrating and mining the chromatin landscape of cell-type specificity using self-organizing maps. Genome research, 23, 2136–2148.

  37. Perlman,R.L. (2016) Mouse models of human disease: An evolutionary perspective. Evolution, Medicine, and Public Health, 2016, 170–176.

  38. Roadmap Epigenomics Consortium et al. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518, 317–330.

  39. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes | Nature.

  40. Rugg-Gunn,P.J. et al. (2010) Distinct histone modifications in stem cell lines and tissue lineages from the early mouse embryo. Proc Natl Acad Sci U S A, 107, 10783–10790.

  41. Stamatoyannopoulos,J.A. et al. (2012) An encyclopedia of mouse DNA elements (Mouse ENCODE). Genome biology, 13, 1–5.

  42. Stunnenberg,H.G. et al. (2016) The International Human Epigenome Consortium: a blueprint for scientific collaboration and discovery. Cell, 167, 1145–1149.

  43. Sugathan,A. and Waxman,D.J. (2013) Genome-wide analysis of chromatin states reveals distinct mechanisms of sex-dependent gene regulation in male and female mouse liver. Mol Cell Biol, 33, 3594–3610.

  44. Taberlay,P.C. et al. (2014) Reconfiguration of nucleosome-depleted regions at distal regulatory elements accompanies DNA methylation of enhancers and insulators in cancer. Genome research, 24, 1421–1432.

  45. Thurman,R.E. et al. (2012) The accessible chromatin landscape of the human genome. Nature, 489, 75–82.

  46. Tsai,H.-W. et al. (2009) Sex differences in histone modifications in the neonatal mouse brain. Epigenetics, 4, 47–53.

  47. Vanhooren,V. and Libert,C. (2013) The mouse as a model organism in aging research: usefulness, pitfalls and possibilities. Ageing research reviews, 12, 8–21.

  48. Varshney,A. et al. (2017) Genetic regulatory signatures underlying islet gene expression and type 2 diabetes. Proceedings of the National Academy of Sciences, 114, 2301–2306.

  49. van der Velde,A. et al. (2021) Annotation of chromatin states in 66 complete mouse epigenomes during development. Communications biology, 4, 1–15.

  50. Vu,H. and Ernst,J. (2022) Universal annotation of the human genome through integration of over a thousand epigenomic datasets. Genome biology, 23, 1–37.

  51. Vu,H.T. and Ernst,J. (2022) Universal chromatin state annotation of the mouse genome. bioRxiv, 2022.12.19.521116.

  52. Wang,Q. et al. (2020) Imprecise DNMT1 activity coupled with neighbor-guided correction enables robust yet flexible epigenetic inheritance. Nature Genetics, 52, 828–839.

  53. Xie,W. et al. (2013) Epigenomic analysis of multilineage differentiation of human embryonic stem cells. Cell, 153, 1134–1148.

  54. Yen,A. and Kellis,M. (2015) Systematic chromatin state comparison of epigenomes associated with diverse properties including sex and tissue type. Nature communications, 6, 1–13.

  55. Yue,F. et al. (2014) A comparative encyclopedia of DNA elements in the mouse genome. Nature, 515, 355–364.

  56. Zhang,Y. et al. (2016) Jointly characterizing epigenetic dynamics across multiple human cell types. Nucleic acids research, 44, 6721–6731.

  57. Zhu,C. et al. (2021) Joint profiling of histone modifications and transcriptome in single cells from mouse brain. Nat Methods, 18, 283–292.

  58. Zhu,J. et al. (2013) Genome-wide chromatin state transitions associated with developmental and environmental cues. Cell, 152, 642–654.