CLCG Capabilities and Facilities

Research Capabilities Overview:
The Coordinated Laboratory for Computational Genomics (CLCG) was founded in 1996 and has been engaged in a variety of research topics in bioinformatics and computational biology. These include gene and mutation discovery, expression analysis and functional genomics, genetic and physical mapping, genome-scale analyses, and information management. The Center for Bioinformatics and Computational Biology (CBCB) was established in 2002 to expand involvement in bioinformatics and computational biology (BCB) research on the University of Iowa (UI) campus and to coalesce the efforts of current investigators. The CLCG is a charter member lab of the CBCB.

The CLCG has an integrated sequence processing pipeline that facilitates cDNA-based gene discovery projects. This pipeline accepts raw chromatogram files as well as sequence text as input and returns a set of high-quality, annotated sequences. The pipeline includes base-calling, novel and repeat/low-complexity feature identification, quality assessment and filtering, clustering, and annotation of the individual sequences. Standard components are used where applicable (phred, RepeatMasker, BLAST), and several components have been developed locally (ESTprep, UIcluster). To date, we have processed and submitted data for more than 1,000,000 ESTs across more than ten organisms (Scheetz et al., in press).
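The quality assessment and filtering stage described above can be illustrated with a minimal sketch. This is not the CLCG pipeline or the ESTprep implementation; the `Read` type, the function names, and the thresholds (`min_q`, `min_len`, `max_n_fraction`) are all hypothetical and chosen only to show the general idea of trimming low-quality ends and discarding reads that remain too short or too ambiguous.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    """A base-called read: name, bases, and one phred-style score per base."""
    name: str
    bases: str
    quals: List[int]


def trim_ends(read: Read, min_q: int = 20) -> Read:
    """Strip bases from both ends while their quality is below min_q."""
    start, end = 0, len(read.bases)
    while start < end and read.quals[start] < min_q:
        start += 1
    while end > start and read.quals[end - 1] < min_q:
        end -= 1
    return Read(read.name, read.bases[start:end], read.quals[start:end])


def passes_filter(read: Read, min_len: int = 100,
                  max_n_fraction: float = 0.05) -> bool:
    """Reject reads that are empty, too short, or contain too many N calls."""
    if len(read.bases) == 0 or len(read.bases) < min_len:
        return False
    n_fraction = read.bases.upper().count("N") / len(read.bases)
    return n_fraction <= max_n_fraction


def run_pipeline(reads: List[Read], min_q: int = 20,
                 min_len: int = 100) -> List[Read]:
    """Trim every read, then keep only those that pass the filter."""
    kept = []
    for read in reads:
        trimmed = trim_ends(read, min_q)
        if passes_filter(trimmed, min_len):
            kept.append(trimmed)
    return kept
```

In a real pipeline the surviving reads would then be passed on to repeat masking, clustering, and annotation; those stages are omitted here because they depend on external tools.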

The CLCG also has several years of experience with expression analysis, including microarray (synthetic oligonucleotide, cDNA, and Affymetrix), SAGE, and EST-based expression platforms. Current infrastructure for expression analysis includes a centralized data store (the UIMADS database and a hierarchical file system) for all expression-related data, regardless of source. This encompasses a broad spectrum of storage needs, from image and chromatogram files, which are stored on disk array and/or DVD archive media, to analyzed expression data and experimental descriptions, which are stored in a relational database. In addition to gathering and analyzing experimental expression data, we have experience in the design of the experiments themselves: controlling and minimizing sources of variability, designing custom probe sets, and designing oligonucleotide probes. We have used the 10-base SAGE tags (NlaI) for the past two years, and have recently moved to the longer 14-base system, which uses the RsaI restriction enzyme.
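As a rough illustration of how fixed-length SAGE tags relate to transcripts, the sketch below takes the bases immediately downstream of the 3'-most occurrence of an anchoring-enzyme recognition site and tallies them. It is a simplified model under stated assumptions, not the laboratory's protocol or software: the function names are hypothetical, and the recognition site and tag length are parameters the caller must supply (the `"CATG"` string in the usage note is a placeholder for illustration, not a value taken from the text above).

```python
from collections import Counter
from typing import Iterable, Optional


def extract_sage_tag(transcript: str, site: str,
                     tag_len: int) -> Optional[str]:
    """Return the tag_len bases following the 3'-most recognition site.

    Returns None when the site is absent or when fewer than tag_len
    bases remain after it.
    """
    seq = transcript.upper()
    pos = seq.rfind(site.upper())  # last (3'-most) occurrence of the site
    if pos == -1:
        return None
    start = pos + len(site)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None


def count_tags(transcripts: Iterable[str], site: str,
               tag_len: int) -> Counter:
    """Tally tags across transcripts, skipping those that yield no tag."""
    counts: Counter = Counter()
    for transcript in transcripts:
        tag = extract_sage_tag(transcript, site, tag_len)
        if tag is not None:
            counts[tag] += 1
    return counts
```

For example, `count_tags(sequences, "CATG", 10)` would tally 10-base tags for a list of transcript strings, and changing `tag_len` to 14 models the move to the longer tags.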

The CLCG supports both genetic and physical mapping projects. This infrastructure includes software (RHSCORER and RHMAPPER) and experience in radiation hybrid mapping, as well as a system (GenoScape and GenoMap) for storing and accessing high-throughput genotyping resources for genetic mapping.

Several analyses of genome-scale data sets are currently being pursued within the CLCG. Current projects include identification of microRNAs and their targets, full-length cDNAs, inter- and intra-species gene families, transcription factor binding sites, transcription start sites, alternative splice forms, alternative polyadenylation sites, and simple and complex repetitive elements. These projects utilize diverse publicly available and locally developed software.

We have also developed an integrated system for managing information relating to disease gene mutation identification. The TrAPSS system (Transcript Annotation Prioritization and Screening System) manages sets of candidate genes for a wide range of diseases and syndromes. The novel PAR algorithm (Prioritization of Annotated Regions; T. Braun, in preparation) is used to identify regions that are most likely to harbor disease-causing mutations. In addition, TrAPSS supports semi-automated primer selection and stores ordering and screening results, which greatly speeds the screening process.

Similar database-centric systems have also been developed to manage clinical and genotyping data for disease linkage experiments, and to manage the information related to full-length cDNA sequencing. GenoScape and GenoMap (Scheetz et al., in press) are related components that combine to facilitate high-throughput genotyping projects. Similarly, our full-length cDNA sequencing pipeline provides the ability to track the progress of individual cDNA libraries and clones, and to provide feedback into the system regarding the quality of the clones selected for full-length sequencing.

References:

GenoMap
T.E. Scheetz, T.A. Braun, T.L. Casavant, K.J. Munn, E.M. Stone, and V.C. Sheffield, GenoMap: A Distributed System for Unifying Genotyping and Genetic Linkage Analysis, Parallel Computing, Vol 24, 1998, pp. 1567-1592.

UIcluster
Pedretti K, Scheetz T, Braun T, Roberts C, Robinson N, Casavant T. A Parallel Expressed Sequence Tag (EST) Clustering Program. Lecture Notes in Computer Science, Vol 2127, 2001, pp. 490.

ESTprep
Scheetz TE, Trivedi N, Roberts CA, Kucaba T, Berger B, Robinson NL, Birkett CL, Gavin AJ, O’Leary B, Braun TA, Bonaldo MF, Robinson JP, Sheffield VC, Soares MB, Casavant TL. ESTprep: preprocessing cDNA sequence reads. Bioinformatics. 2003 Jul 22;19(11):1318-24.

Mapping
Scheetz TE, Raymond MR, Nishimura DY, McClain A, Roberts C, Birkett C, Gardiner J, Zhang J, Butters N, Sun C, Kwitek-Black A, Jacob H, Casavant TL, Soares MB, Sheffield VC. Generation of a high-density rat EST map. Genome Res. 2001 Mar;11(3):497-502.

Gene Discovery
Scheetz TE, Laffin JJ, Berger B, Mackerly S, Baumes SA, Brown II B, Chang S, Coco J, Conklin J, Crouch K, Donohue M, Doonan G, Estes C, Eyestone M, Fishler K, Gardiner J, Guo L, Johnson B, Keppel C, Kreger R, Lebeck M, Marcelino R, Miljkovich V, Perdue M, Qui L, Rehmann J, Reiter RS, Rhoads B, Schaefer K, Smith C, Sunjevaric I, Trout K, Wu N, Birkett CL, Bischof J, Gackle B, Gavin A, Mokrzycki B, Moressi C, O’Leary B, Pedretti K, Roberts C, Smith M, Tack D, Trivedi N, Kucaba T, Freeman T, Lin J, Bonaldo MF, Casavant TL, Sheffield VC, Soares MB. High-throughput gene discovery in the rat. Genome Research, in press.

Tuggle CK, Green JA, Fitzsimmons C, Woods R, Prather RS, Malchenko S, Soares BM, Kucaba T, Crouch K, Smith C, Tack D, Robinson N, O’Leary B, Scheetz T, Casavant T, Pomp D, Edeal BJ, Zhang Y, Rothschild MF, Garwood K, Beavis W. EST-based gene discovery in pig: virtual expression patterns and comparative mapping to human. Mamm Genome. 2003 Aug;14(8):565-79.

Dimopoulos G, Casavant TL, Chang S, Scheetz T, Roberts C, Donohue M, Schultz J, Benes V, Bork P, Ansorge W, Soares MB, Kafatos FC. Anopheles gambiae pilot gene discovery project: identification of mosquito innate immunity genes from expressed sequence tags generated from immune-competent cell lines. Proc Natl Acad Sci U S A. 2000 Jun 6;97(12):6619-24.

Computing Facilities and Space:
The CLCG and CBCB consist of approximately 30 full-time faculty, post-docs, staff, and students occupying 3,000 sq. ft. in the Seamans Center for the Engineering Arts and Sciences Building on the University of Iowa main campus. This recently remodeled facility is equipped for high-speed networking (10- and 100-megabit and gigabit Ethernet, hardwired and wireless), and includes 2 dedicated Linux clusters, more than 100 computing systems (workstations and servers), 137 CPUs, 97 gigabytes of RAM, and 4 terabytes of disk space. The following computer resources are available:

1. A dedicated compute server cluster of 18 Linux systems (36 CPUs) connected with a dedicated, switched, copper Gigabit Ethernet intranet. (18 Dual AMD MP-2400 (2.2 GHz, 2GB memory, 40GB disk each)).
2. A second dedicated compute server cluster of 16 Linux systems (32 CPUs) connected with a dedicated, switched, fiber-optic Gigabit Ethernet intranet. (12 Dual Pentium III (500 MHz, 1GB memory, 9GB disk each), and 4 Dual Pentium III (500 MHz, 2GB memory, 9GB disk each)).
3. A dedicated, dual fiber channel, redundant disk storage system (RAID), 1.2TB usable space.
4. A second dedicated, dual fiber channel, redundant disk storage system, 412GB usable space.
5. A collection of data-server systems. (1) Dual Xeon (2.4 GHz, 2GB memory, 110GB disk), (1) Dual Pentium III (600 MHz, 1GB memory, 507GB disk), (1) Dual Pentium III (500 MHz, 1GB memory, 81GB disk), (1) Dual Pentium III (550 MHz, 512MB memory, 80GB disk).
6. In addition, substantial computing infrastructure is currently in place for development and monitoring of production computing. This includes: (29) Pentium II/III/Athlon workstations running Linux (350-850 MHz, 128MB-1GB memory, 4-36GB disk per system); 27 laptop Linux systems; a SPARC-20 database server with 128MB of memory and 30GB disk; and a large collection of networked Windows 2000/XP and Macintosh systems.

Space:
Office and laboratory space is available for all Principal Investigators, co-investigators, post-docs, staff, and students. Convenient meeting space is also available. Offices are equipped with computers and printers, video conferencing, and telephone conferencing. PI offices are equipped with video projection capabilities for small to medium-sized collaborative meetings and for multi-site video conferencing. All computers are connected to a 100Mbit switched Ethernet backbone, and most of the space is covered by 802.11a and 802.11b wireless networking. All key personnel involved in this project have Linux and/or Mac/PC computers (desktop and portable) connected to the network. Many of the faculty and staff have high-speed connections at home as well.