Introduction

1-1 Overview

 Welcome to idCHD, an integrated database for coronary heart disease (CHD)! It provides a feature-rich and user-friendly web interface for data exploration and network analysis, built for the CHD and cardiology research community.
 The complex genetic synergism and tissue heterogeneity of the cardiac system make CHD treatment and drug R&D more difficult than for many other diseases. CHD remains the biggest threat to people's health worldwide, being responsible for approximately 9 million deaths per year. Single-cell omics is one of the most exciting techniques available, as it resolves tissue heterogeneity at single-cell resolution. However, large data volumes, non-standardized analysis pipelines, and the difficulty of analyzing single-cell data limit its wider use. We therefore re-analyzed public single-cell data under a unified framework and built a web interface for easy exploration. In addition, we draw on network science to help understand the complex interactions between genetic components, drugs, and diseases.
 We hope this tool helps your research a bit!

1-2 Content

 We collected 165 scRNA-seq samples from the Gene Expression Omnibus (GEO) database, totaling 585,412 cells from 29 major categories.
 Single-cell datasets can be accessed in CELLS, which offers a range of visualization modes for gene expression patterns, co-expression programs, gene set enrichment analysis (GSEA) scores, and predicted cell-cell interactions.
 Given the extensive influence of experimental conditions, operations, sample sources, and species on the transcriptional states of cells and cell type conversions, we have outlined the following aspects of single-cell research:
(1) Description: Includes the topic, purpose, or method employed in the research.
(2) Subject information: Covers the species, strain, and (if genetically edited) genotype of the subjects used in the study.
(3) Modeling: Specifies the modeling or operational approach utilized, as well as the time elapsed after the operation.
(4) Sampling: Provides details about the body parts from which the tissue was isolated, the specific marker employed in flow cytometry sorting, the cell number, and the main cell types defined by marker genes.
(5) Platform: Indicates the single-cell platform and sequencing devices used, along with the modality of the dataset.
 Additionally, CHD has been found to be associated with various co-morbidities. We believe that this relationship is partially attributable to the extensive genetic basis of CHD, and we propose that a "disease-gene-drug" network can help unravel this complex association. To that end, we conducted a comprehensive search of recent research on CHD and its complications, collecting genes primarily associated with CHD and noting the relationships between genes and complications. Furthermore, we integrated CHD-related drugs into the Network module, which highlights protein-protein interactions (PPI) and miRNA-target interactions (MTI). By adding human genes and transcripts to the NODES CART, users can construct their own CHD networks and download them to their local computers.
 In the Statistics section, we have summarized the key metrics of idCHD. Currently, the database includes 6,000+ genes or transcripts from five different species (human, mouse, rat, rabbit, and pig). It encompasses 136,020 PPIs sourced from STRING, 10,043 MTIs from miRTarBase, 3,157 drugs from DrugBank, and 48 CHD-related pathways. Moreover, we have recorded over 600 genetic variants or CpG loci. For each supporting article, we have meticulously reviewed the abstract and, in some cases, the full text, documenting the following aspects:
(1) Orientation of the Study: Examines what the study reveals about the gene and the disease. Does it identify new risk genes or genetic variants? Does the gene influence drug side effects or efficacy? The Orientation of the Study is categorized into six sections:
 A. Related Gene: Genes validated by experiments or predicted through bioinformatics. This category typically includes significant genes identified in cohort studies, hub genes in WGCNA networks, and functional genes validated through animal experiments.
 B. Drug-related Gene: Genes found to impact drug effects.
 C. TCM-related Gene: Genes identified as targets of Traditional Chinese Medicine (TCM) components.
 D. Gene Variant: Additional tag indicating possible single nucleotide polymorphisms (SNPs) or insertions-deletions (indels) that may affect disease phenotypes, either through dysregulation or protein dysfunction. Some gene variants may lack direct evidence associating them with specific genes, in which case we record the nearest genes.
 E. Sex-related Gene: Additional tag indicating gender differences in expression levels.
 F. Aberrantly Methylated Gene: Additional tag indicating observed hypermethylation or hypomethylation within a gene in patients. Verified CpG loci are recorded.
(2) Methodology of the Research: Categorizes research into six groups:
 A. Control Study (or Case-Control Study): Cross-sectional studies comparing gene expression levels between patients and healthy populations.
 B. Cohort Study: Longitudinal studies comparing phenotypes between cohorts with or without predisposing factors. For instance, the diagnostic potential of plasma concentration of a gene can be inferred from the prevalence rates of two cohorts, with one cohort exhibiting up-regulated plasma concentration and the other being normal.
 C. In vivo Experiment: Animal studies such as rat model-based control studies.
 D. In vitro Experiment: Utilizes tissue samples or cell lines derived from humans or animals, such as serum, umbilical vein endothelial cells, circulating monocytes, or extracellular vesicles.
 E. Summary Statistics: Utilizes data from previous studies to calculate statistical significance of gene expression differences.
 F. Integrated Bioinformatics: Integrates datasets from previous studies and employs weighted correlation network analysis (WGCNA), co-expression analysis, functional enrichment, and network analysis to identify core genes.
(3) Diseases and Subjects. As part of our mission to explore the intersection of the genetic basis of CHD and its co-morbidities, we have retained the exact phenotypes that have been studied and proven to be related to the gene/transcript in the References section. For the sake of concision, we use abbreviations to refer to certain diseases. The complete abbreviation table can be found in the Appendix. In terms of subjects, we have documented the body part from which the tissue samples or cell lines were obtained. Some cell types are also abbreviated, and their full names are provided in the Appendix.
(4) Ancestry and Population. For observational studies and certain in vitro studies, we have noted the ethnicity and specific phenotypes of the participants. In cases where ethnicity is not disclosed, we have indicated the country name.

1-3 Usage

For newcomers.
First time visiting? No worries! Let's quickly go through the key components of idCHD together.
idCHD consists of six interconnected divisions accessed through the navigation bar icons. The HOME page showcases the core features of idCHD and occasionally presents important announcements about upcoming features. On the BROWSE page, you can find tabulated information on CHD risk genes, supporting articles, single-cell datasets, drugs, and CHD-related pathways. The tables can be easily navigated using the tabs at the top of the left card.
The NETWORK page provides a web tool for swift network construction with genes, diseases, and drugs. Upon clicking the button, you'll initially see a demo network. Please note that there might be a brief delay of approximately 10 seconds as the website loads a substantial amount of data. After that, you can select your preferred genes and drugs and add them to the NODES CART. Click CONFIRM, and the system will generate your customized network! (Pro tip: Avoid selecting too many nodes to maintain network stability.)
The CELLS page allows exploration of single-cell data. Upon clicking the icon, you'll be presented with a scatter chart displaying the clustering result of dataset GSM2840136 using the t-SNE algorithm (UMAP and diffusion map are also available). Another scatter chart will show the same coordinates but color-coded by the expression level of the gene Serpinb2, which exhibits the highest variance among cells in this sample. You can easily switch the gene being presented using the dropdown list in the control panel. Additionally, the control panel provides a dataset panel where you can check the characteristics of each dataset and, if desired, switch to a different dataset.
The STATISTICS page presents the in-depth data of idCHD. For example, the Sankey chart reveals the quantitative relationships within the disease-gene-drug network, providing detailed statistics for each subgroup.
Lastly, the ABOUT page holds all the explanatory content of idCHD and allows visitors to download tabulated datasets.

 Documentation

2-1 Method

Genes. Genes and references were retrieved from PubMed using the keywords "coronary disease", restricting the publication date to the period from 2018/05 to 2023/03 and keeping only English-language records that have an abstract and are not reviews. The command used is shown below.

$ esearch -db pubmed -query "coronary disease AND english [LANG] AND has abstract [FILT] NOT review [PTYP]" | efilter -mindate 2018/05 | efetch -format xml | xtract -pattern PubmedArticle -element PMID AbstractText > File_Name.txt

In total, 43,088 PMIDs and abstracts were downloaded using EDirect in a Cygwin64 terminal. The Python package nltk (version 3.7) and human genome annotation datasets from NCBI were used for tokenization, data cleaning, and gene name extraction. This left 1,842 candidate gene symbols and their related PMIDs for manual confirmation. We checked each gene symbol against the corresponding abstracts and filtered out genes not related to CHD as well as tokens that did not refer to genes (for example, "CAD" may denote "carbamoyl-phosphate synthetase 2, aspartate transcarbamylase, and dihydroorotase", but may also abbreviate "coronary artery disease" or "computer-aided design"). At the same time, we read through the methods and results to add more details to the references (see the Introduction).
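
The symbol-matching step can be illustrated with a minimal sketch. It assumes a plain regular-expression tokenizer in place of nltk, a tiny hand-written symbol set in place of the NCBI annotation, and a hypothetical AMBIGUOUS set of symbols with known non-gene meanings. All names here are illustrative and are not taken from the idCHD code.

```python
import re

# Hypothetical stand-ins: the real workflow used nltk and the NCBI annotation.
KNOWN_SYMBOLS = {"CAD", "PCSK9", "APOE", "IL6"}
AMBIGUOUS = {"CAD"}  # also "coronary artery disease" or "computer-aided design"

TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def candidate_genes(abstracts):
    """Map each candidate gene symbol to the PMIDs whose abstract mentions it.

    `abstracts` is a dict of PMID -> abstract text. Matching is case-sensitive
    so that ordinary lowercase words are not mistaken for symbols.
    """
    hits = {}
    for pmid, text in abstracts.items():
        for token in set(TOKEN.findall(text)):
            if token in KNOWN_SYMBOLS:
                hits.setdefault(token, set()).add(pmid)
    return hits


def flag_ambiguous(hits):
    """Return matched symbols that also have a common non-gene meaning."""
    return sorted(symbol for symbol in hits if symbol in AMBIGUOUS)
```

Every candidate still goes through manual confirmation; the flag only marks the symbols where a curator should expect the token to mean something other than a gene.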

Single-cell datasets. CHD-related articles were screened manually from PubMed using "single-cell, coronary" as keywords. For each sample, we collected the species, strain, sex, genotype, and disease state of the animal or patient where provided, as well as the cell marker, cell counts, scRNA-seq method, and sequencing platform of the dataset. All of this information is accessible in CELLS. Publicly available human- and mouse-derived scRNA-seq datasets were downloaded from the GEO database and analyzed using a CHD-tailored pipeline. Researchers can refer to this pipeline when analyzing their own CHD datasets. The pipeline is principally based on the guidelines provided by Single-cell Best Practices (www.sc-best-practices.org):

★ Pre-processing
1. Data format transformation.
 Most of the datasets in idCHD were downloaded from the Gene Expression Omnibus (GEO). GEO datasets come mainly in three formats, all of which were converted to h5ad format in our pipeline. The 10X format (MTX, TSV) data were read by Scanpy using the read_10x() function sample by sample, then integrated using scvi-tools with default parameters. The TXT format matrices were first read by the read_text() function, then manually divided into samples by adding batch annotations to the .obs dataframe. We did not integrate samples belonging to the same subject or model, and kept their GEO titles as their sample names. As an exception, we integrated the replicate samples in GSE146285, as those samples were acquired with 384-well plates.
2. Cell quality control.
 CHD samples come mainly from heart, blood, vascular tissue, and atherosclerotic plaques, and therefore need specialized preprocessing parameters. While handling these datasets, we observed that some of them, for example GSE130699, contain cell clusters with high mitochondrial gene ratios (>6%). This may reflect the nature of cardiomyocytes, or cells undergoing apoptosis. Certain datasets, such as GSM4005125 and GSM4762820, contain cell clusters with >60% of transcripts from the top expressed genes, which may come from special cell types in the blood.
 Compared with whole-cell sequencing methods, single-nucleus RNA-seq (snRNA-seq) methods sometimes yield lower non-zero gene counts (median < 1,000), and in some cases this appears to stem from the batch of the model (GSE145154).
 Considering all these factors, we kept cells with a higher percentage of mitochondrial genes (8%) than the usual cutoff (<6%), applied a more permissive median absolute deviation (MAD) threshold to the percentage of counts from the 20 most expressed genes (7 MADs), and kept cells with fewer than 1,000 non-zero genes (5 MADs).
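
A minimal sketch of this kind of MAD-based filtering is given below, using only the Python standard library. The function names are hypothetical. Applying the 5-MAD rule to the log-transformed gene count follows common practice in Single-cell Best Practices and is an assumption here, as is treating the 8% mitochondrial cutoff as inclusive.

```python
import math
from statistics import median


def mad_outliers(values, n_mads):
    """Flag values lying more than n_mads MADs away from the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return [abs(v - center) > n_mads * mad for v in values]


def keep_cells(pct_mito, pct_top20, n_genes,
               mito_max=8.0, top20_mads=7, genes_mads=5):
    """Return one boolean per cell: True if the cell passes all three filters.

    pct_mito  -- percentage of counts from mitochondrial genes
    pct_top20 -- percentage of counts from the 20 most expressed genes
    n_genes   -- number of genes with non-zero counts
    """
    top20_out = mad_outliers(pct_top20, top20_mads)
    # Assumption: the MAD rule is applied to log1p(n_genes), not raw counts.
    genes_out = mad_outliers([math.log1p(n) for n in n_genes], genes_mads)
    return [
        mito <= mito_max and not top and not genes
        for mito, top, genes in zip(pct_mito, top20_out, genes_out)
    ]
```

Because the thresholds are expressed in MADs of each sample's own distribution, the same settings adapt to samples with very different sequencing depth, which is the reason for preferring them over fixed cutoffs.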
3. Normalization of expression matrix.
 The expression matrix was log-normalized for feature selection, or log-normalized to 10,000 reads per cell with Scanpy for cell type annotation.
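
To make the second variant concrete, here is a small standard-library sketch of scaling each cell to 10,000 counts and then applying log(1 + x). It is an illustration only: the function name is hypothetical, and in Scanpy the same effect is usually obtained with normalize_total followed by log1p.

```python
import math


def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell (row) to target_sum total counts, then take log(1 + x).

    `counts` is a list of rows, one row of raw gene counts per cell.
    Cells with zero total counts are returned as all zeros.
    """
    normalized = []
    for row in counts:
        total = sum(row)
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in row])
    return normalized
```

After this transformation, cells with the same relative expression profile get identical values regardless of how deeply they were sequenced.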
★ Dimension reduction
4. Feature selection
 Non-negative matrix factorization (NMF) is a noise-robust feature selection method with good interpretability, but it is time-consuming on large datasets. We therefore used the devianceFeatureSelection() function through the rpy2 interface to select the top 3,000 highly variable genes for NMF, which yielded a 30-dimensional representation of each sample, and used those features for downstream analysis. NMF was run with the NMF package, using method "snmf/r" and seed "nndsvd".
 For SCVI-integrated datasets, we used SCVI representations as the features for 2D embedding.
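
The idea behind deviance-based gene ranking can be sketched in plain Python. This is not the R implementation called above; it assumes the binomial form of the deviance, in which each gene is compared against a null model where it takes the same fraction of counts in every cell. Function names are hypothetical, and the input is assumed to contain at least one non-zero count.

```python
import math


def binomial_deviance(counts):
    """Per-gene binomial deviance for a cells x genes matrix of raw counts."""
    totals = [sum(row) for row in counts]  # total counts per cell
    grand_total = sum(totals)
    deviances = []
    for j in range(len(counts[0])):
        # Pooled fraction of all counts that belong to gene j (null model).
        pi = sum(row[j] for row in counts) / grand_total
        dev = 0.0
        for row, n in zip(counts, totals):
            y = row[j]
            if y > 0:
                dev += y * math.log(y / (n * pi))
            if n - y > 0:
                dev += (n - y) * math.log((n - y) / (n * (1 - pi)))
        deviances.append(2 * dev)
    return deviances


def top_genes(counts, k):
    """Indices of the k genes with the highest deviance."""
    dev = binomial_deviance(counts)
    return sorted(range(len(dev)), key=lambda j: dev[j], reverse=True)[:k]
```

A gene whose share of counts is constant across cells gets a deviance of zero, so ranking by deviance favors genes that vary between cells without requiring a prior normalization step.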
5. 2D embedding
 Uniform manifold approximation and projection (UMAP), t-distributed stochastic neighbor embedding (t-SNE), and diffusion map were each computed for 2D visualization. Neighborhood connections were calculated by Scanpy with method="gauss" and metric="cosine", except for several datasets where method was left at its default because of a broadcast error.
★ Cellular annotation
6. Cell typing
 Cell types were determined by celltypist using pretrained models, and subclusters were found by Leiden clustering at a resolution of 0.4. For mouse datasets, three celltypist models for murine heart, peripheral blood, and bone marrow were trained on the Mouse Cell Atlas (MCA) 2.0 datasets "Adult-Heart", "Peripheral-Blood", and a concatenation of "Bone-Marrow", "Bone-Marrow-c-kit", and "Bone-Marrow-Mesenchyme", respectively, as no public celltypist model was available for murine heart, bone marrow, or peripheral blood. For human datasets, the models "Immune_All_High" and "Healthy_Adult_Heart" were used.
7. Find DEGs
 We performed sample-wise and dataset-wise DEG analysis on the basis of cell types and experimental samples, respectively. For samples predicted to contain more than two major cell types, the DEG analysis was conducted on major cell types. For other samples, subtypes were used for the analysis.
8. Cell-cell interaction (CCI) analysis
 CCI analysis was performed with cellphonedb using the top 1,000 highly variable genes from each sample on all subtype pairs.
9. Gene set enrichment analysis (GSEA)
 GSEA was performed by running AUCell with the Reactome gene sets. The results were averaged within each cell subtype, and all-zero pathways were discarded. Then, the pathways and cell subtypes were hierarchically clustered using the AgglomerativeClustering function from the sklearn package.
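
The averaging and filtering step can be sketched as follows. This is an illustrative standard-library version with a hypothetical function name and input layout; it assumes per-cell pathway scores are already available (for example from AUCell) and stops before the hierarchical clustering.

```python
from collections import defaultdict


def mean_scores_by_subtype(scores, subtypes):
    """Average per-cell pathway scores within each cell subtype.

    `scores` maps pathway name -> list of per-cell scores, and `subtypes`
    gives the subtype label of each cell in the same order. Pathways whose
    averaged score is zero in every subtype are discarded.
    """
    cells_by_subtype = defaultdict(list)
    for index, label in enumerate(subtypes):
        cells_by_subtype[label].append(index)

    averaged = {}
    for pathway, values in scores.items():
        means = {
            label: sum(values[i] for i in cells) / len(cells)
            for label, cells in cells_by_subtype.items()
        }
        if any(mean != 0 for mean in means.values()):
            averaged[pathway] = means
    return averaged
```

The resulting pathway-by-subtype table is the kind of matrix that would then be passed to a hierarchical clustering routine to order rows and columns for display.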

2-2 Download

 About Us

3-1 Authors

Xiaohui Fan
Corresponding author

Professor | Director of Innovation Center of Yangtze River Delta, Zhejiang University.

Single-cell Omics
Spatial transcriptomics
Network Pharmacology
Systems Biology
E-mail: fanxh@zju.edu.cn
Address: Room 325, CPS, ZJU.
Website: Prof. Fan's Research Lab; Innovation Center of Yangtze River Delta, Zhejiang University

Jie Liao
Corresponding author

Professor | Principal Investigator of Innovation Center of Yangtze River Delta, Zhejiang University.

Spatial Omics
Single-cell Omics
Systems Biology
Bioinformatics
E-mail: liaojie@zju.edu.cn
Address: Room 103, YRD, ZJU.
Website: Prof. Liao's Personal Page; Innovation Center of Yangtze River Delta, Zhejiang University

Tianhao Wang
Co-first author

PhD student of College of Pharmaceutical Sciences, Zhejiang University.

Cardiovascular Disease
Single-cell Omics
Bioinformatics
Systems Biology
E-mail: wangth268@zju.edu.cn
Address: Room 326, CPS, ZJU.

Yining Hu
Co-first author

Master student of College of Pharmaceutical Sciences, Zhejiang University.

Coronary Heart Disease
Single-Cell Sequencing
E-mail: huyn299@163.com
Address: Room 324, CPS, ZJU.

Wenbo Guo
Co-first author

PhD of College of Pharmaceutical Sciences, Zhejiang University.

Single-cell Omics
Ischemic Stroke
E-mail: wb_guo@126.com
Address: Room 357, CPS, ZJU.

3-2 Publications

 Appendix

Appendix Table 1. Abbreviations and full names of CHD-related co-morbidities
Abbreviation | Disease Full Name | Abbreviation | Disease Full Name
CAD | coronary artery disease | ICAD | insignificant coronary stenosis
ACS | acute coronary syndrome | UA | unstable angina
AMI | acute myocardial infarction | STEMI | ST-segment elevation myocardial infarction
NSTEMI | non-ST-segment elevation myocardial infarction | SCD | sudden cardiac death
VD | ventricular dysfunction | CCS | chronic coronary syndrome
SA | stable angina | CAA | coronary artery aneurysm
MI | myocardial infarction | RI | myocardial ischemia/reperfusion injury
CAE | coronary artery ectasia | ISR | in-stent restenosis
INOCA | ischemia with non-obstructive coronary arteries | PCAD | premature coronary artery disease
CAC | coronary artery calcification | CMD | coronary microvascular dysfunction
PE | plaque erosion | PR | plaque rupture
CAM | coronary artery malformation | ACA | anomalous coronary artery
SCAD | spontaneous coronary artery dissection | AP | angina pectoris
AS | atherosclerosis | MIn | oxidative myocardial injury
HMI | hypoxic myocardial injury | AHF | acute heart failure
CC | coronary collateral | KD | Kawasaki disease
CGL | congenital generalized lipodystrophy | CI | cerebral infarction
DM | diabetes mellitus | T1DM | type 1 diabetes mellitus
T2DM | type 2 diabetes mellitus | CAV | cardiac allograft vasculopathy
MDD | major depressive disorder | HL | hearing loss
IS | ischemic stroke | HS | hemorrhagic stroke
TGCV | triglyceride deposit cardiomyovasculopathy | RAS | renal artery stenosis
PAH | pulmonary arterial hypertension | VT | ventricular tachycardia
CKD | chronic kidney disease | FGR | fetal growth restriction
AD | Alzheimer's disease | AB | aberrant birthweight
SMI | severe mental illness | SCZ | schizophrenia
GE | gastrointestinal event | FH | familial hypercholesterolemia
NAFLD | nonalcoholic fatty liver disease | NSCLC | non-small cell lung cancer
WCH | white-coat hypertension | AML | acute myeloid leukemia
AP | angina pectoris | SCI | spinal cord injury
cCS | congenital cold syndrome | CIMT | carotid artery intima-media thickness
FTD | frontotemporal dementia | CCM | cerebral cavernous malformation
RA | rheumatoid arthritis | MPS VII | mucopolysaccharidosis type VII
PAD | peripheral artery disease | SD | sleep duration
RR | repeat revascularization | AF | atrial fibrillation
PEc | preeclampsia | CR | clopidogrel resistance
LC | liver cirrhosis | COPD | chronic obstructive pulmonary disease
DR | diabetic retinopathy | MC | myotonia congenita
CIH | chronic intermittent hypoxia | HF | heart failure
MACE | major adverse cardiovascular events | CME | coronary microembolization
Appendix Table 2. Abbreviations and full names of tissues and cell lines involved in idCHD
Abbreviation | Full Name | Abbreviation | Full Name
PBMC | peripheral blood mononuclear cell | SMC | smooth muscle cell
EPC | endothelial progenitor cell | HUVEC | human umbilical venous endothelial cell
HCASMC | human coronary artery smooth muscle cell | EAT | epicardial adipose tissue
HCAEC | human coronary artery endothelial cell | EC | endothelial cell
VSMC | vascular smooth muscle cell | WBC | white blood cell
EV | extracellular vesicle | VEC | vascular endothelial cell
NKC | natural killer cell