CD-HIT: A Fast Program to Cluster and Compare Large Sets of Biological Sequences

Li, Weizhong

Abstract

CD-HIT is a computer program for clustering and comparing large sets of protein or nucleotide sequences. It significantly reduces the computational and manual effort in various sequence analysis tasks, aids in understanding the data structure, and helps correct the bias within a dataset. CD-HIT is 2 to 3 orders of magnitude faster than other methods. It can handle extremely large databases and has been used extensively in various fields. CD-HIT is becoming increasingly popular, as shown by users' feedback and the growing number of publications that cite it. CD-HIT now has thousands of users and is routinely used in many popular databases, such as UniProt and PDB.

Researchers now face serious challenges from the explosive growth of public sequence databases, a result of high-throughput genome sequencing projects and very recent environmental metagenomic projects. Routine analysis, from searching a database to building a multiple alignment, is becoming more computationally expensive and complicated. An efficient clustering method is crucial for addressing many of these challenges. Currently, no other available program can replace CD-HIT in terms of speed and the ability to handle very large datasets. Therefore, CD-HIT will play a more important role in the future.

The goal of this proposal is the further improvement and development of the CD-HIT program and related applications to better serve the growing user community and to address the issues raised by users of CD-HIT. The algorithm will be improved to achieve better performance and overcome the existing limitations. Effort will be devoted to more accurate clustering results while maintaining the ultrahigh speed. New functions will be implemented to meet various clustering and comparison needs. Enhanced maintenance and better software engineering techniques will be adopted to provide regular program releases and updates, better portability, shorter troubleshooting cycles, and richer documentation. Subject to University policies, CD-HIT will continue to be an open source package. In addition, a web server will be set up for easier public access to CD-HIT's applications. The server will provide further analysis and visualization tools, as well as interfaces and links to other bioinformatics resources. Pre-calculated popular datasets will be made available to the public to eliminate the need for individual labs to repeat the same work.

Project Narrative

CD-HIT is a fast computer program for clustering and comparing biological sequences, used by thousands of researchers in public health related studies. It directly helps researchers significantly reduce the effort in sequence analysis and correct the bias within public databases. Continued development of CD-HIT will better serve researchers who face growing challenges in sequence analysis from the explosive growth of public sequence databases.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Center for Research Resources (NCRR)
Type: Research Project (R01)
Project #: 5R01RR025030-02
Application #: 7682840
Study Section: Special Emphasis Panel (ZRG1-BST-Q (01))
Program Officer: Yang, Liming

Project Start: 2008-09-01
Project End: 2011-06-30
Budget Start: 2009-07-01
Budget End: 2010-06-30
Support Year: 2
Fiscal Year: 2009
Total Cost: $270,375
Indirect Cost

Institution

Name: University of California San Diego
Department: Miscellaneous
Type: Schools of Arts and Sciences
DUNS #: 804355790

City: La Jolla
State: CA
Country: United States
Zip Code: 92093

Related projects


NIH 2010 R01 RR	CD-HIT: A Fast Program to Cluster and Compare Large Sets of Biological Sequences	Li, Weizhong / University of California San Diego	$252,376
NIH 2009 R01 RR	CD-HIT: A Fast Program to Cluster and Compare Large Sets of Biological Sequences	Li, Weizhong / University of California San Diego	$270,375
NIH 2009 R01 RR	CD-HIT: A Fast Program to Cluster and Compare Large Sets of Biological Sequences	Li, Weizhong / University of California San Diego	$135,714
NIH 2008 R01 RR	CD-HIT: A Fast Program to Cluster and Compare Large Sets of Biological Sequences	Li, Weizhong / University of California San Diego	$347,625

Publications

Wu, Sitao; Li, Robert W; Li, Weizhong et al. (2012) Worm burden-dependent disruption of the porcine colon microbiota by Trichuris suis infection. PLoS One 7:e35470

Baldwin 6th, Ransom L; Wu, Sitao; Li, Weizhong et al. (2012) Quantification of Transcriptome Responses of the Rumen Epithelium to Butyrate Infusion using RNA-seq Technology. Gene Regul Syst Bio 6:67-80

Fu, Limin; Niu, Beifang; Zhu, Zhengwei et al. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28:3150-2

Li, Robert W; Wu, Sitao; Baldwin 6th, Ransom L et al. (2012) Perturbation dynamics of the rumen microbiota in response to exogenous butyrate. PLoS One 7:e29392

Li, Weizhong; Fu, Limin; Niu, Beifang et al. (2012) Ultrafast clustering algorithms for metagenomic sequence analysis. Brief Bioinform 13:656-68

Wu, Sitao; Zhu, Zhengwei; Fu, Limin et al. (2011) WebMGA: a customizable web server for fast metagenomic sequence analysis. BMC Genomics 12:444

Li, Robert W; Wu, Sitao; Li, Weizhong et al. (2011) Metagenome plasticity of the bovine abomasal microbiota in immune animals in response to Ostertagia ostertagi infection. PLoS One 6:e24417

Sun, Shulei; Chen, Jing; Li, Weizhong et al. (2011) Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource. Nucleic Acids Res 39:D546-51

Niu, Beifang; Zhu, Zhengwei; Fu, Limin et al. (2011) FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes. Bioinformatics 27:1704-5

Huang, Ying; Niu, Beifang; Gao, Ying et al. (2010) CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 26:680-2

Showing the most recent 10 out of 12 publications
