Duplicate code can be hard to find, especially in a large project, but PMD's Copy/Paste Detector (CPD) can find it for you. CPD has been through three major incarnations.
Each rewrite made it much faster, and now it can process the JDK 1.4 java.* packages in about 4 seconds (on my workstation, at least).
Here's a screenshot of CPD after running on the JDK java.lang package.
Note that CPD works with Java, JSP, C, C++, Fortran and PHP code. Is your own language missing? See how to add it here.
Here are the duplicates CPD found in the JDK 1.4 source code.
Here are the duplicates CPD found in the APACHE_2_0_BRANCH branch of Apache (just the httpd-2.0/server/ directory).
Andy Glover wrote an Ant task for CPD; here's how to use it:
<target name="cpd">
    <taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask" />
    <cpd minimumTokenCount="100" outputFile="/home/tom/cpd.txt">
        <fileset dir="/home/tom/tmp/ant">
            <include name="**/*.java"/>
        </fileset>
    </cpd>
</target>
|Attribute||Description||Required|
|encoding||The character set encoding (e.g., UTF-8) to use when reading the source code files and when producing the report. A word of warning: even if you set the encoding value correctly, say to UTF-8, the report may not be valid UTF-8 if the source files CPD reads are actually encoded in CP1252. CPD copies pieces of source code directly into its report, so those fragments keep the encoding of the source files.||No|
|format||The format of the report (e.g. csv, text, xml); defaults to text.||No|
|ignoreLiterals||If true, CPD ignores literal value differences when evaluating a duplicate block. This means that foo=42; and foo=43; will be seen as equivalent. You may want to run CPD with this option off to start with and then switch it on to see what it turns up; defaults to false.||No|
|ignoreIdentifiers||Similar to ignoreLiterals but for identifiers, i.e., variable names, method names, and so forth; defaults to false.||No|
|language||Flag to select the appropriate language (e.g. cpp, cs, java, php, ruby, and ecmascript); defaults to java.||No|
|minimumTokenCount||A positive integer indicating the minimum duplicate size, in tokens.||Yes|
|outputFile||The destination file for the report. If not specified, the report is written to the console instead.||No|
Also, you can get verbose output from this task by running ant with the -v flag; i.e.:
ant -v -f mybuildfile.xml cpd
You can also get an HTML report from CPD by using the XSLT script in pmd/etc/xslt/cpdhtml.xslt. Just run the CPD task as usual and, right after it, invoke the Ant xslt task like this:
<xslt in="cpd.xml" style="etc/xslt/cpdhtml.xslt" out="cpd.html" />
To run CPD from the command line, just give it the minimum duplicate size and the source directory:
$ java net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /usr/local/java/src/java
You can also specify the language:
$ java net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /path/to/c/source --language cpp
You may wish to check sources that are stored in different directories:
$ java net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /path/to/other/source --files /path/to/other/source --files /path/to/other/source --language fortran
There should be no limit to the number of --files options you can add, but if you run into one, please tell us!
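As a rough sketch of the multi-directory invocation above, the following shell function assembles a command line with one --files option per directory. The function name build_cpd_command is hypothetical and not part of CPD, and the sketch assumes that directory names contain no whitespace. It only prints the command; it does not run it.

```shell
#!/bin/sh
# Hypothetical helper (not part of CPD): print a CPD command line that
# passes one --files option for each directory given as an argument.
# Assumption: directory names contain no whitespace.
build_cpd_command() {
    min_tokens=$1   # first argument: minimum duplicate size in tokens
    shift           # remaining arguments: source directories
    cmd="java net.sourceforge.pmd.cpd.CPD --minimum-tokens $min_tokens"
    for dir in "$@"; do
        cmd="$cmd --files $dir"
    done
    printf '%s\n' "$cmd"
}
```

Calling build_cpd_command 100 followed by two directories prints a command with two --files options; options such as --language can be appended to the printed line as needed.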
And if you're checking a C source tree with duplicate files in different architecture directories you can skip those using --skip-duplicate-files:
$ java net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /path/to/c/source --language cpp --skip-duplicate-files
You can also specify the encoding to use when parsing files:
$ java net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /usr/local/java/src/java --encoding utf-16le
You can also specify a report format - here we're using the XML report:
$ java net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /usr/local/java/src/java --format net.sourceforge.pmd.cpd.XMLRenderer
The default format is a text report, and there's also a net.sourceforge.pmd.cpd.CSVRenderer report.
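If a script selects the report format by a short name, a small lookup like the one below can translate that name into the --format argument shown above. This is only a sketch: the function name cpd_format_args and the short names xml, csv, and text are assumptions made for this example, and only the two renderer class names come from the text. For text it prints nothing, so the default report is used.

```shell
#!/bin/sh
# Hypothetical helper (not part of CPD): translate a short format name
# into the --format argument for the CPD command line.
cpd_format_args() {
    case "$1" in
        xml)  echo "--format net.sourceforge.pmd.cpd.XMLRenderer" ;;
        csv)  echo "--format net.sourceforge.pmd.cpd.CSVRenderer" ;;
        text) echo "" ;;  # text is the default, so no --format is needed
        *)    echo "unknown format: $1" >&2; return 1 ;;
    esac
}
```

The printed argument can then be appended to one of the command lines above.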
Note that CPD is pretty memory-hungry; you may need to give Java more memory to run it, like this:
$ java -Xmx512m net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /usr/local/java/src/java
Please note that if CPD detects duplicated source code, it exits with status 4 (since 5.0). This behavior was introduced to ease CPD integration into scripts or hooks, such as SVN hooks.
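A script or hook might interpret that exit status along the following lines. This is a minimal sketch, not an official recipe: the function name check_cpd_status is hypothetical, and treating 0 as "no duplicates" and every other value as an error are assumptions. Only status 4 is documented above.

```shell
#!/bin/sh
# Hypothetical helper (not part of CPD): turn a CPD exit status into a
# message and a pass/fail result for the calling script or hook.
# Assumptions: 0 means no duplicates were found; any status other than
# 0 or 4 is treated as an unexpected error.
check_cpd_status() {
    case "$1" in
        0) echo "CPD: no duplicates reported"; return 0 ;;
        4) echo "CPD: duplicated code detected"; return 1 ;;
        *) echo "CPD: unexpected exit status $1" >&2; return 2 ;;
    esac
}

# Typical use, right after running CPD:
#   java net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /path/to/source
#   check_cpd_status $? || exit 1
```

Keeping the mapping in one function lets a hook reject a commit on status 4 while still distinguishing other failures.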