CD-HIT is a computer program for clustering and comparing large sets of protein or nucleotide sequences. It significantly reduces the computational and manual effort in various sequence analysis tasks and aids in understanding the data structure and correcting the bias within a dataset. CD-HIT is 2 to 3 orders of magnitude faster than other methods. It can handle extremely large databases and has been used extensively in various fields. CD-HIT is becoming increasingly popular, judging by users' feedback and the growing number of publications that cite it. CD-HIT now has thousands of users and is routinely used in many popular databases, such as UniProt and PDB. Researchers now face serious challenges from the explosive growth of public sequence databases as a result of high-throughput genome sequencing projects and the very recent environmental metagenomic projects. Routine analysis, from searching a database to building a multiple alignment, is becoming more computationally expensive and complicated. An efficient clustering method is crucial to addressing many of these challenges. Currently, no other available program can replace CD-HIT in terms of speed and the ability to handle very large datasets. Therefore, CD-HIT will play a more important role in the future. The goal of this proposal is the further improvement and development of the CD-HIT program and related applications to better serve the growing user community and to address the issues raised by users of CD-HIT. The algorithm will be improved to achieve better performance and overcome the existing limitations. Effort will be directed towards more accurate clustering results while maintaining the ultrahigh speed. New functions will be implemented to meet various clustering and comparison needs.
Improved maintenance and better software engineering practices will be adopted to provide regular program releases and updates, better portability, shorter troubleshooting cycles, and richer documentation. Subject to University policies, CD-HIT will remain an open source package. In addition, a web server will be set up for easier public access to CD-HIT's applications. The server will provide further analysis and visualization tools, as well as interfaces and links to other bioinformatics resources. Pre-calculated popular datasets will be made available to the public to eliminate the need for individual labs to repeat the same work.

Project Narrative

CD-HIT is a fast computer program for clustering and comparing biological sequences used by thousands of researchers in public health related studies. It directly helps researchers significantly reduce the effort in sequence analysis and correct the bias within public databases. Continued development of CD-HIT will better serve researchers who face growing challenges in sequence analysis from the explosive growth of public sequence databases.

Agency: National Institutes of Health (NIH)
Institute: National Center for Research Resources (NCRR)
Type: Research Project (R01)
Project #: 3R01RR025030-02S1
Application #: 7892867
Study Section: Special Emphasis Panel (ZRG1-BST-Q (01))
Program Officer: Yang, Liming
Project Start: 2009-08-26
Project End: 2010-08-25
Budget Start: 2009-08-26
Budget End: 2010-08-25
Support Year: 2
Fiscal Year: 2009
Total Cost: $135,714
Indirect Cost:

Name: University of California San Diego
Department: Miscellaneous
Type: Schools of Arts and Sciences
DUNS #: 804355790
City: La Jolla
State: CA
Country: United States
Zip Code: 92093

Wu, Sitao; Li, Robert W; Li, Weizhong et al. (2012) Worm burden-dependent disruption of the porcine colon microbiota by Trichuris suis infection. PLoS One 7:e35470
Baldwin 6th, Ransom L; Wu, Sitao; Li, Weizhong et al. (2012) Quantification of Transcriptome Responses of the Rumen Epithelium to Butyrate Infusion using RNA-seq Technology. Gene Regul Syst Bio 6:67-80
Fu, Limin; Niu, Beifang; Zhu, Zhengwei et al. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28:3150-2
Li, Robert W; Wu, Sitao; Baldwin 6th, Ransom L et al. (2012) Perturbation dynamics of the rumen microbiota in response to exogenous butyrate. PLoS One 7:e29392
Li, Weizhong; Fu, Limin; Niu, Beifang et al. (2012) Ultrafast clustering algorithms for metagenomic sequence analysis. Brief Bioinform 13:656-68
Wu, Sitao; Zhu, Zhengwei; Fu, Liming et al. (2011) WebMGA: a customizable web server for fast metagenomic sequence analysis. BMC Genomics 12:444
Li, Robert W; Wu, Sitao; Li, Weizhong et al. (2011) Metagenome plasticity of the bovine abomasal microbiota in immune animals in response to Ostertagia ostertagi infection. PLoS One 6:e24417
Sun, Shulei; Chen, Jing; Li, Weizhong et al. (2011) Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource. Nucleic Acids Res 39:D546-51
Niu, Beifang; Zhu, Zhengwei; Fu, Limin et al. (2011) FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes. Bioinformatics 27:1704-5
Huang, Ying; Niu, Beifang; Gao, Ying et al. (2010) CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 26:680-2

Showing the most recent 10 out of 12 publications