With little chance of detection and shrinking budgets, yet sustained pressure to publish, the unethical practices of duplicate publication and plagiarism are a significant problem. Editors and reviewers have had no robust method for identifying existing and potential duplicate scientific articles, so these practices have gone unchecked until now. eTBLAST, a text similarity search tool freely available on the web, has been used to demonstrate that putative duplicate or plagiarized articles can be detected with high sensitivity and specificity by systematically comparing each Medline abstract (or abstract in review) to all other Medline records. We hypothesize that three measures will be a substantial deterrent, ultimately improving the quality of reported science for all: rigorous identification of purveyors of this behavior, exhaustive tagging of duplicate articles, and a search tool customized for editors, reviewers, granting officials and others to detect potential problem manuscripts before they are accepted for publication. We will address this through the following specific aims: 1) Refine statistical predictors, thresholds, signatures and algorithms to maximize the efficiency with which we can detect putative duplicate and plagiarized articles within Medline. 2) Systematically check every Medline record against every other to develop a public database of questionable articles that have been reviewed and verified manually to assign a probability of duplication. 3) Perform an analysis of trends, rates and any statistically relevant distributions to understand and address root causes for this behavior. 4) Create a secure resource that is available and open to all journals and reviewers, enabling them to estimate novelty and probable overlap with previous publications prior to acceptance.
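The all-against-all comparison described in the abstract can be illustrated with a minimal sketch. This is not eTBLAST's algorithm, which the abstract does not specify; it swaps in a generic technique (Jaccard similarity over word shingles) purely for illustration. The function names, the shingle size `k`, and the `threshold` value are all assumptions, and a real Medline-scale run would need indexing rather than the quadratic loop shown here.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in the lowercased text (k is an assumed parameter)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Texts shorter than k words are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_similar_pairs(abstracts, threshold=0.5, k=3):
    """Compare every abstract with every other and return pairs scoring at or above threshold.

    `abstracts` maps a record identifier to its abstract text.
    The threshold is hypothetical; it is not a value taken from the project.
    """
    sets = {key: shingles(text, k) for key, text in abstracts.items()}
    flagged = []
    for x, y in combinations(sorted(sets), 2):
        score = jaccard(sets[x], sets[y])
        if score >= threshold:
            flagged.append((x, y, score))
    # Highest-scoring candidate pairs first, ready for manual review.
    return sorted(flagged, key=lambda pair: -pair[2])
```

Pairs returned by such a screen would only be candidates; as in aim 2, each would still need manual review before being labeled a duplicate.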

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Research Project (R01)
Project #
1R01LM009758-01
Application #
7286877
Study Section
Special Emphasis Panel (ZRG1-HOP-S (50))
Program Officer
Ye, Jane
Project Start
2007-09-30
Project End
2009-09-29
Budget Start
2007-09-30
Budget End
2008-09-29
Support Year
1
Fiscal Year
2007
Total Cost
$286,015
Indirect Cost
Name
University of Texas Sw Medical Center Dallas
Department
Other Health Professions
Type
Schools of Medicine
DUNS #
800771545
City
Dallas
State
TX
Country
United States
Zip Code
75390
Garner, H R (2011) Combating unethical publications with plagiarism detection services. Urol Oncol 29:95-9
McIver, L J; Fondon 3rd, J W; Skinner, M A et al. (2011) Evaluation of microsatellite variation in the 1000 Genomes Project pilot studies is indicative of the quality and utility of the raw data and alignments. Genomics 97:193-9
Galindo, Cristi L; McIver, Lauren J; Tae, Hongseok et al. (2011) Sporadic breast cancer patients' germline DNA exhibit an AT-rich microsatellite signature. Genes Chromosomes Cancer 50:275-83
Errami, Mounir; Sun, Zhaohui; George, Angela C et al. (2010) Identifying duplicate content using statistically improbable phrases. Bioinformatics 26:1453-7
Sun, Zhaohui; Errami, Mounir; Long, Tara et al. (2010) Systematic characterizations of text similarity in full text biomedical publications. PLoS One 5:e12704
Long, Tara C; Errami, Mounir; George, Angela C et al. (2009) Scientific integrity. Responding to possible plagiarism. Science 323:1293-4
Errami, Mounir; Sun, Zhaohui; Long, Tara C et al. (2009) Deja vu: a database of highly similar citations in the scientific literature. Nucleic Acids Res 37:D921-4
Errami, Mounir; Hicks, Justin M; Fisher, Wayne et al. (2008) Deja vu--a study of duplicate citations in Medline. Bioinformatics 24:243-9
Wren, Jonathan D (2008) URL decay in MEDLINE--a 4-year follow-up study. Bioinformatics 24:1381-5
Giles, Cory B; Wren, Jonathan D (2008) Large-scale directional relationship extraction and resolution. BMC Bioinformatics 9 Suppl 9:S11