The Carolina Center for Exploratory Genetic Analysis (CCEGA) focused on developing an interdisciplinary infrastructure to identify the complex genetic traits that underlie human diseases, bringing together data from clinical studies, population studies and model systems. CCEGA believes the next breakthroughs in our understanding of biology and disease will be made possible by the integrated analysis of genetic data and its expression as phenotypes. CCEGA’s work centers on enabling this kind of multidisciplinary, multi-investigator research. The center involves three complementary groups of scientists at the University of North Carolina at Chapel Hill: (a) experimental geneticists, (b) quantitative experts in statistics and biostatistics, and (c) computer scientists with expertise in algorithm development, software construction, and high-performance computing.

Phase one of CCEGA focused on building a community of investigators and deploying a prototype infrastructure for analyzing relationships among genotypes and phenotypes in three contexts:

  • Family linkage studies, which examine the relationship between genotypes and susceptibility to specific diseases and conditions, in this case alcohol addiction.
  • Gene expression profile studies, which develop a picture of genes and cellular activity in order to identify patterns and signatures related to disease, in this case breast cancer.
  • Public health studies, which look at communities and their risk factors for diseases, in this case atherosclerosis.

To accommodate the diverse, multi-investigator databases necessary to answer these complex questions, RENCI worked with scientists to develop a prototype, extensible data model and provide access to data via a portal constructed using the Open Grid Computing Environment toolkit. The newest methods of integrated data analysis were incorporated into a portal-based workflow. These included new techniques in linkage analysis (oligogenic analysis, multivariate linkage analysis, epistasis, and genotype by environment interaction), subspace clustering, and association analysis (quantitative trait and nucleotide analysis).

RENCI and its scientific partners also explored new visualization techniques for examining and interacting with large data sets and high performance computing for implementing computationally intensive analysis techniques. To reduce the barriers between data providers and data analyzers, CCEGA and RENCI conducted intensive, specialized workshops, colloquia and intramural meetings.


National Institutes of Health/National Center for Research Resources, Grant Number 5-P20-RR020751-01-02

Co-Principal Investigators at UNC-Chapel Hill

  • James Evans, Terry Magnuson, Karen Mohlke, Fernando Pardo-Manuel, Charles Perou, Patrick Sullivan, David Threadgill, Kirk Wilhelmsen, Department of Genetics
  • Susan Paulsen, Jan Prins, Wei Wang, Department of Computer Science
  • Fred Wright, Fei Zou, Department of Biostatistics
  • Bradley Hemminger, School of Information and Library Science
  • Andrew Nobel, Department of Statistics
  • Kari North, Department of Epidemiology
  • Alexander Tropsha, School of Pharmacy
  • K.T.L. Vaughan, Health Sciences Library

Project Team

  • Charles Schmitt
  • Clark Jeffries
  • Jeff Tilson

Fred A. Wright, Hanwen Huang, Xiaojun Guan, Kevin Gamiel, Clark Jeffries, William T. Barry, Fernando Pardo-Manuel, Patrick F. Sullivan, Kirk C. Wilhelmsen, and Fei Zou. Simulating Association Studies: a Data-based Resampling Method for Candidate Regions or Whole Genome Scans (accepted for publication in Bioinformatics), 2007.

Jeffries, C. Hairpin Database: Why and How? Genomic Impact of Eukaryotic Transposable Elements conference, Asilomar, CA, April 2006

Jeffries, C. Bipartite and tripartite systems and matrices from genetic control research, Linear Algebra and its Applications 409 (2005) 70-78.

Jeffries, C., Jarstfer, M., Perkins, D.: Folded RNA from an intron of one gene might inhibit expression of a competing gene, in silico Biology 5 (2005), 0037.

Jeffries, C., Perkins, D., Jarstfer, M.: Systematic discovery of the grammar of translational inhibition by RNA hairpins, Journal of Theoretical Biology (accepted for publication).

J. Liu, S. Paulsen, X. Sun, W. Wang, A. Nobel, J. Prins, “Mining Approximate Frequent Itemsets In the Presence of Noise: Algorithm and Analysis”, SIAM Conference on Data Mining (SDM), 2006.

J. Liu, S. Paulsen, W. Wang, A. Nobel, J. Prins, “Mining approximate frequent itemset from noisy data”, Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), 2005.

Hemminger BM, Saelim B, Sullivan PF. TAMAL: An integrated approach to choosing SNPs for genetic studies of human complex traits. Bioinformatics 2006.

Introduction and Context: Dan Reed, Chancellor’s Eminent Professor, Vice-Chancellor for Information Technology and CIO, Director, Renaissance Computing Institute (RENCI)

Workshop Format: Kirk Wilhelmsen, Department of Genetics

Addiction Family Study: Kirk Wilhelmsen, Department of Genetics

Strong Heart: Kari North, Epidemiology

Diabetes, Fusion: Karen Mohlke, Department of Genetics

CATIE (Clinical Antipsychotic Trial of Intervention Effectiveness), Schizophrenia: Pat Sullivan, Department of Genetics

Cystic Fibrosis: Mike Knowles, Department of Medicine

Cancer Epidemiology: Bob Millikan, Epidemiology

Head and Neck Epidemiology: Andy Olshan, Epidemiology

Renal Disease Gene Expression: Ron Falk, Department of Medicine

ELSI/Prospective Studies: Jim Evans, Department of Genetics



HAP-SAMPLE is a web application for simulating SNP genotypes for case-control and affected-child trio studies by resampling from Phase I/II HapMap SNP data. The user provides a list of SNPs to be “genotyped,” along with a disease model file that describes causal SNPs and their effect sizes. The simulation tool is appropriate for candidate regions or whole-genome scans.
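The idea of resampling genotypes under a disease model can be illustrated with a small sketch. This is not HAP-SAMPLE's implementation, which is not described here; it is a toy example under stated assumptions. The function name simulate_case_control, the 0/1 haplotype encoding, and the risk_by_count table (probability of disease given the number of risk alleles at a single causal SNP) are all hypothetical stand-ins for the SNP list and disease model file mentioned above.

```python
import random


def simulate_case_control(haplotypes, genotyped, causal_index,
                          risk_by_count, n_cases, n_controls, seed=0):
    """Toy resampling sketch (hypothetical, not the HAP-SAMPLE algorithm).

    haplotypes    : pool of reference haplotypes, each a tuple of 0/1 alleles
    genotyped     : indices of the SNPs that are reported ("genotyped")
    causal_index  : index of the single causal SNP (may be ungenotyped)
    risk_by_count : dict mapping 0/1/2 risk alleles to a disease probability;
                    values must lie strictly between 0 and 1 so that both
                    quotas can eventually be filled
    """
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        # Build one individual by drawing two haplotypes with replacement.
        h1 = rng.choice(haplotypes)
        h2 = rng.choice(haplotypes)
        full = [a + b for a, b in zip(h1, h2)]  # allele counts 0, 1 or 2
        # Assign disease status from the causal SNP's risk-allele count.
        affected = rng.random() < risk_by_count[full[causal_index]]
        # Report only the SNPs the user asked to have genotyped.
        observed = [full[i] for i in genotyped]
        if affected and len(cases) < n_cases:
            cases.append(observed)
        elif not affected and len(controls) < n_controls:
            controls.append(observed)
    return cases, controls


# Made-up four-haplotype pool over three SNPs; SNP 0 is causal but unobserved.
pool = [(0, 0, 1), (1, 0, 0), (0, 1, 1), (1, 1, 0)]
risk = {0: 0.05, 1: 0.20, 2: 0.60}
cases, controls = simulate_case_control(
    pool, genotyped=[1, 2], causal_index=0,
    risk_by_count=risk, n_cases=50, n_controls=50)
print(len(cases), "cases and", len(controls), "controls simulated")
```

Because individuals are assembled from real haplotypes rather than from independent allele frequencies, correlation between neighboring SNPs in the pool carries over into the simulated data, which is the general motivation for a resampling approach.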


This project was supported by Grant 5-P20-RR020751-01-02 from the National Institutes of Health/National Center for Research Resources as part of the Carolina Center for Exploratory Genetic Analysis. Other sources of support included Carolina Environmental Research Center (EPA RD-83272001), NIGMS R01 GM074175, and CF Foundation Zou05P0. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH or the National Center for Research Resources.