Identification of key pathways and genes in polycystic ovary syndrome via integrated bioinformatics analysis and prediction of small therapeutic molecules

To enhance understanding of polycystic ovary syndrome (PCOS) at the molecular level; this investigation intends to examine the genes and pathways associated with PCOS by using an integrated bioinformatics analysis. Based on the expression profiling by high throughput sequencing data GSE84958 derived from the Gene Expression Omnibus (GEO) database, the differentially expressed genes (DEGs) between PCOS samples and normal controls were identified. We performed a functional enrichment analysis. A protein-protein interaction (PPI) network, miRNA- target genes and TF - target gene networks, were constructed and visualized, with which the hub gene nodes were identified. Validation of hub genes was performed by using receiver operating characteristic (ROC) and RT-PCR. Small drug molecules were predicted by using molecular docking. A total of 739 DEGs were identified, of which 360 genes were up regulated and 379 genes were down regulated. GO enrichment analysis revealed that up regulated genes were mainly involved in peptide metabolic process, organelle envelope and RNA binding and the down regulated genes were significantly enriched in plasma membrane bounded cell projection organization, neuron projection and DNA-binding transcription factor activity, RNA polymerase II-specific. REACTOME pathway enrichment analysis revealed that the up regulated genes were mainly enriched in translation and respiratory electron transport and the down regulated genes were mainly enriched in generic transcription pathway and transmembrane transport of small molecules. The top 10 hub genes (SAA1, ADCY6, POLR2K, RPS15, RPS15A, CTNND1, ESR1, NEDD4L, KNTC1 and NGFR) were identified from PPI network, miRNA - target gene network and TF - target gene network. The modules analysis showed that genes in modules were mainly associated with the transport of respiratory electrons and signaling NGF, respectively. We find a series of crucial genes along with the pathways that were most closely related with PCOS initiation and advancement. Our investigations provide a more detailed molecular mechanism for the progression of PCOS, detail information on the potential biomarkers and therapeutic targets.


Introduction
Polycystic ovary syndrome (PCOS) is one of the most prevalent endocrine disorder around the world, with an estimated about one in 15 women worldwide [1]. PCOS exposes patients to a major psychosocial burden and is characterized by hyperandrogenism and chronic anovulation [2]. Diabetes, heart disease, obesity, non-alcoholic fatty liver disease and hypertension are the risk factors associated with PCOS [3][4][5][6][7]. Therefore, it is of prime importance to identify the etiological factors, molecular mechanisms, and pathways to discover novel diagnostic markers, prognostic markers and therapeutic targets for PCOS.
Numerous research strategies have recently investigated the molecular mechanisms of PCOS. Highthroughput RNA sequencing technology has received extensive attention among these research strategies and has generated significant advances in the field of endocrine disorder with marked clinical applications ranging from molecular diagnosis to molecular classification, patient stratification to prognosis prediction, and discovery of new drug targets to response prediction [8]. In addition, gene expression profiling investigation on PCOS have been performed using high-throughput RNA sequencing, and several key genes and diagnostic biomarkers have been diagnosed for this syndrome, including the profiling of many of differentially expressed genes (DEGs) associated in different pathways, biological processes, or molecular functions [9]. Integrated bioinformatics analyses of expression profiling by high throughput sequencing data derived from different investigation of PCOS could help identify the novel diagnostic markers, prognostic markers and further demonstrate their related functions and potential therapeutic targets in PCOS.
Therefore, in the current investigation, the dataset (GSE84958) was then retrieved from the publicly available Gene Expression Omnibus database (GEO, http:// www.ncbi.nlm.nih.gov/geo/) [10] to identify DEGs and the associated biological processes PCOS using comprehensive bioinformatics analyses. The DEGs were subjected to functional enrichment and pathway analyses; moreover, a protein-protein interaction (PPI) network, miRNAs -target gene regulatory network and TFs -target gene regulatory network were constructed to screen for key genes, miRNA and TFs. The aim of this investigation was to identify key genes and pathways in PCOS using bioinformatics analysis, and then to explore the molecular mechanisms of PCOS and categorize new potential diagnostic therapeutic biomarkers of PCOS. We anticipated that these investigations will provide further understanding of PCOS pathogenesis and advancement at the molecular level.

RNA sequencing data
Expression profiling by high throughput sequencing dataset GSE84958 was downloaded from NCBI-GEO, a public database of next-generation sequencing, to filter the DEGs between PCOS and normal control. The expression profiling by high throughput sequencing GSE84958 was based on GPL16791 platforms (Illumina HiSeq 2500 (Homo sapiens)) and consisted of 30 PCOS samples and 23 normal control.

Identification of DEGs
The limma [11] in R bioconductor package was used to analyze the DEGs between PCOS samples and normal control samples in the expression profiling by high throughput sequencing data of GSE84958. The adjusted P-value and [logFC] were calculated. The Benjamini & Hochberg false discovery rate method was used as a correction factor for the adjusted P-value in limma [12]. The statistically significant DEGs were identified according to P<0.05, and [logFC] > 2.5 for up regulated genes and [logFC] < -1.5 for down regulated genes. All results of DEGs were downloaded in text format, hierarchical clustering analysis being conducted.

GO and pathway enrichment of DEGs in PCOS
To reflect gene functions, GO (http://geneontology.org/) [13] has been used in three terms: biological processes (BP), cellular component (CC) and molecular function (MF). ToppGene (ToppFun) (https://toppgene.cchmc. org/enrichment.jsp) [14] is an online database offering a comprehensive collection of resources for functional annotation to recognize the biological significance behind a broad list of genes. The functional enrichment analyses of DEGs, including GO analysis and REACTOME (https://reactome.org/) [15] pathway enrichment analysis, were performed using ToppGene in the present study, using the cut-off criterion P-value<0.05 and gene enrichment count>2.

PPI networks construction and module analysis
The Search Tool for the Retrieval of Interacting Genes/ Proteins (STRING: http://string-db.org/) is online biological database and website designed to evaluate PPI information [16] Proteins associated with DEGs were selected based on information in the STRING database (PPI score >0.7), and then PPI networks were constructed using Cytoscape software (http://cytoscape.org/) [17]. In this investigation, node degree [18], betweenness centrality [19], stress centrality [20] and closeness centrality [21], these constitutes a fundamental parameters in network theory, were adopted to calculate the nodes in a network. The topological properties of hub genes were calculated using Cytoscape plugin Network Analyzer. The PEWCC1 (http://apps.cytoscape.org/apps/ PEWCC1) [22], a plugin for Cytoscape, was used to screen the modules of the PPI network. The criteria were set as follows: degree cutoff=2, node score cutoff= 0.2, k-core=2 and maximum depth=100. Moreover, the GO and pathway enrichment analysis were performed for DEGs in these modules.

Construction of miRNA -target regulatory network
Furthermore, the target genes of the significant target genes were predicted by using miRNet database (https:// www.mirnet.ca/) [23], when the miRNAs shared a common target gene. Finally, the miRNA -target genes regulatory network depicted interactions between miRNAs and their potential targets in PCOS were visualized by using Cytoscape.

Construction of TF -target regulatory network
Furthermore, the target genes of the significant target genes were predicted by using TF database (https:// www.mirnet.ca/) [23], when the TFs shared common target genes. Finally, the TF-target genes regulatory network depicted interactions between TFs and their potential targets in PCOS were visualized by using Cytoscape.

Receiver operating characteristic (ROC) curve analysis
The ROC curve was used to evaluate classifiers in bioinformatics applications. To further assess the predictive accuracy of the hub genes, ROC analysis was performed to discriminate PCOS from normal control. ROC curves for hub genes were generated using pROC in R [24] based on the obtained hub genes and their expression profiling by high throughput sequencing data. The area under the ROC curve (AUC) was determined and used to compare the diagnostic value of hub genes.

Validation of the expression levels of candidate genes by RT-PCR
Total RNA was extracted from PCOS (UWB1.289 (ATCC® CRL-2945™)) and normal ovarian cell line (MES-OV (ATCC® CRL-3272™)) using TRI Reagent® (Sigma, USA). The Reverse transcription cDNA kit                   (Thermo Fisher Scientific, Waltham, MA, USA) and 7 Flex real-time PCR system (Thermo Fisher Scientific, Waltham, MA, USA) were used for reverse transcription and real-time quantitative reverse transcriptase polymerase chain reaction (qRT-PCR) assay. Polymerase chain reaction primer sequences are listed in Table 1. β-actin was used as an internal control for quantification. The relative expression levels of target transcripts were calculated using the 2 -ΔΔCt method [25]. The thermocycling conditions used for RT-PCR were as follows: initial denaturation at 95°C for 15 min, followed by 40 cycles at 95°C for 10 sec, 60°C for 20 sec and 72°C for 20 sec.

Molecular docking studies
Surflex-docking studies of the standard drug molecule used in polycystic ovary syndrome were used on over expressed genes and were collected from PDB data bank using perpetual SYBYL-X 2.0 software. Using ChemDraw Software, all the drug molecules were illustrated, imported and saved in sdf. templet using open babel free software. The protein structures of POLR2K (), RPS15, RPS15 alpha and SAA1 of their co-crystallised protein of PDB code 1LE9, 3OW2, 1G1X and 4IP8 respectively were extracted from Protein Data Bank [26][27][28]. Gasteiger Huckel (GH) charges were applied along with the TRIPOS force field to all the drug molecules and is standard for the structure optimization process. In addition, energy minimization was achieved using MMFF94s and MMFF94 algorithm methods. The protein preparation was carried out after incorporation of protein. The co-crystallized ligand was extracted from the crystal structure and all water molecules; more hydrogen was added and the side chain was set. For energy minimisation, the TRIPOS force field was used. The interaction efficiency of the compounds with the receptor was represented in kcal / mol units by the    -01  49  EPHB4,STK36,MAK,SPAST,ZFYVE27,TTLL3,MYO7A,MYO10,  SDCCAG8,SPTB,SARM1,RAP1GAP2,NPHP4,GFRA3,  AMIGO1,HECW2,GLI2,GLI3,SYNE2,WRAP73,NGFR,NTN1,  NLGN2,FGFR3,PLD1,CEP126,NYAP1,DNAH5,UBXN10,  PLXNB3,SEMA6B,FAT4,NPTX1,NEDD4L,STXBP5,FBF1,  LAMA5,RFFL,ODF2,FAS,RFX3,ATL1,TRAK1,WDR19,FEZ1 Surflex-Dock score. The interaction between the protein and the ligand, the best pose was incorporated into the molecular area. The visualization of ligand interaction with receptor is done by using discovery studio visualizer.

Identification of DEGs
Expression profiling by high throughput sequencing dataset was obtained from the National Center for Biotechnology Information GEO database containing PCOS samples and normal control samples: GSE84958. Then, the R package named "limma" was processed for analysis with adjusted P < 0.05, and [logFC] > 2.5 for up regulated genes and [logFC] < -1.5 for down regulated genes. All DEGs were displayed in volcano maps (Fig. 1). A total of 739 DEGs including 360 up regulated and 379 down regulated genes ( Table 2) were identified in PCOS samples compared to normal control samples. The results are shown in the heatmap (Fig. 2).

GO and pathway enrichment of DEGs in PCOS
The top 739 DEGs were chosen to perform GO and REACTOME pathway enrichment analyses. Gene Ontology (GO) analysis identified that the DEGs were significantly enriched in BP, including the peptide metabolic process, intracellular protein transport, plasma membrane bounded cell projection organization and cell morphogenesis (Table 3). In terms of CC, DEGs were mainly enriched in organelle envelope, catalytic complex, neuron projection and cell junction were the most significantly enriched GO term (Table 3). In addition, MF demonstrated that the DEGs were enriched in the RNA binding, transcription factor binding, DNA-binding transcription factor activity, RNA polymerase II-specific and ATP binding (Table 3). REACTOME pathway enrichment analysis was used to screen the signaling pathways for differential genes. These DEGs were mainly involved in the translation, respiratory electron transport, generic transcription pathway and transmembrane transport of small molecules (Table 4).

Construction of miRNA -target regulatory network
After combining the results of miRNA-target genes with the interactive network of miRNAs, 281 hub genes were selected and 2138 were miRNAs. The genes and miR-NAs are shown in Fig. 4a (Table 6).  (Table 6).

Validation of the expression levels of hub genes by RT-PCR
Aiming to further verify the expression patterns of selected hub genes, real-time PCR, which allows quantitative analysis of hub gene expression, was applied. The results showed that the relative expression levels of 10 hub genes including SAA1, ADCY6, POLR2K, RPS15, RPS15A, CTNND1, ESR1, NEDD4L, KNTC1 and NGFR were consistent with the expression profiling by high throughput sequencing (Fig. 6).

Molecular docking studies
In the present analysis, the docking simulations are performed to classify the active site conformation and significant interactions with the receptor binding sites responsible for complex stability. The over expressed genes is recognized in polycystic ovary syndrome and their x-ray crystallographic proteins structure are selected from PDB for docking studies. The standard drugs containing steroid nucleus are most commonly used either alone or in combination with other drugs. The docking studies of standard molecules containing the steroid ring have been carried out using Sybyl X 2.1 drug design software. The docking studies were performed to know the biding interaction of standard molecules on identified overexpressed genes of protein. The X-RAY crystallographic structure of one protein in each of four over expressed genes of POLR2K, RPS15, RPS15 and SAA1 of their co-crystallised protein of PDB code 1LE9, 3OW2, 1G1X and 4IP8 respectively were selected for the docking (Fig. 7). A total of three drug molecules of ethinylestradiol (ETE), levonorgestril (LNG) and desogestril (DSG) were docked with over expressed proteins to assess the binding affinity with proteins. The binding score greater than six are said to be good, all three drug molecules obtained binding score greater than 7 respectively. The molecules ETE obtained with a high binding score of 9.943 with SAA1 of PDB code 4IP8 and 8.260, 8 Table 7). The molecule ETE and LNG has highest binding score its interaction with protein 4IP8 and hydrogen bonding and other bonding interactions with amino acids are depicted by 3D (Fig. 8) and 2D ( Fig. 9)

Discussion
PCOS is a most prevalent endocrine disorder with hyperandrogenism and chronic anovulation [29]. If not treated promptly and effectively, PCOS can seriously reduce the quality of life. There is no doubt that considerate syndrome at the molecular level will help to develop their diagnosis and treatment [30].
Up to now, various biomarkers have been identified to be linked with PCOS and might be selected as therapeutic targets, but the detailed mechanism of gene regulation leading to syndrome advancement remains elusive [31]. In our investigation, we aimed to identify biomarkers of PCOS and uncover their biological functions through bioinformatics analysis. Dataset GSE84958 was selected as expression profiling by high throughput sequencing dataset in our analysis. As a result, 360 up regulated and 379 down regulated  genes at least 4-fold change between PCOS and normal control samples were screened out. ABI3BP protein expression in heart tissue was significantly related with cardiovascular disease [32], but this gene might be liable for progression of PCOS. Romo-Yáñez et al [33] have revealed the expression of BNIP3 was linked with diabetic in pregnancies, but this gene might be responsible for progression of PCOS. F13A1 is an essential regulatory factor to be associated in PCOS development [34]. An investigation has reported that the ITIH4 can promote non-alcoholic fatty liver disease [35], but this gene might be important for progression of PCOS. Da et al [36] have suggested that the TET3 is an important role in controlling type 2 diabetes progressions, but this gene might be key role in PCOS.
Among all three of molecules of ethinylestradiol, levonorgestrel and desogetril respectively, ethinylestradiolhas obtained highest binding score (c-score) of 9.943 with protein of PDB code 4IP8 and obtained 8. and GLN-66 respectively. It is assumed that the highest binding score (c-score) of ethinylestradiol is due to the presence of aromatic ring and the phenolic -OH group.
In conclusion, we used a series of bioinformatics analysis methods to find the crucial genes and pathways associated in PCOS initiation and development from expression profiling by high throughput sequencing containing PCOS samples and normal control samples. Our investigations provide a more specific molecular mechanism for the advancement of PCOS, detail information on the potential biomarkers and therapeutic targets. However, the interacting mechanism and function of genes need to be confirmed in further experiments.

Informed consent
No informed consent because this study does not contain human or animals participants.

Authors' contributions
Praveenkumar Devarbhavi -Investigation and resources. Lata Telang -Writing original draft and investigation. Basavaraj Vastrad -Writing original draft, and review and editing. Anandkumar Tengli -Investigation, and review and editing. Chanabasayya Vastrad -Software and investigation. Iranna Kotturshetti -Supervision and resources. The authors read and approved the final manuscript.