I’ve previously written about PMD, a static code analysis tool that examines Java source files and detects potential problems. Another significant piece of PMD is CPD, the copy/paste detector. CPD can scan Java, JSP, C, C++, Fortran, or PHP source files and find sections of code that are repeated. By analyzing your source code with CPD, you can identify duplicate code that might be suitable for refactoring. CPD can ignore differences in literals (e.g., the same method with different hard-coded constants) and differences in identifiers (e.g., the same method with different variable names). It can also be configured to ignore duplicated blocks smaller than a certain size. Like PMD, CPD can be run from the command line, but it is more often invoked from Maven, Ant, or your IDE.

Ant

To use PMD CPD in an Ant project, add the following files to a directory in your source tree (e.g., tools/pmd-4.2.5):

  • lib/
    • asm-3.1.jar
    • jaxen-1.1.1.jar
    • pmd-4.2.5.jar
  • cpdhtml.xslt

 

If you prefer to keep the PMD libraries outside your source tree, change the locations of the tools.dir and/or the pmd.lib.dir properties in the lines below. Add the following lines to your build.xml file. build.dir, compile.classpath, junit.lib.dir, src.dir, and test.src.dir should be set appropriately.

<property name="tools.dir" location="tools" />
<property name="pmd.dir" location="${tools.dir}/pmd-4.2.5" />
<property name="pmd.lib.dir" location="${pmd.dir}/lib" />
<property name="pmd.cpd.xsl.file" location="${pmd.dir}/cpdhtml.xslt" />
<property name="pmd.build.dir" location="${build.dir}/pmd" />
<property name="report.dir" location="${build.dir}/reports" />
<property name="cpd.report.dir" location="${report.dir}/cpd" />
<property name="cpd.minimum.tokens" value="50" />
<property name="cpd.ignore.literals" value="true" />
<property name="cpd.ignore.identifiers" value="true" />

<path id="test.classpath">
  <path refid="compile.classpath" />
  <fileset dir="${junit.lib.dir}" includes="**/*.jar" />
</path>
<path id="pmd.classpath">
  <path refid="test.classpath" />
  <pathelement path="${pmd.build.dir}" />
  <fileset dir="${pmd.lib.dir}" includes="**/*.jar" />
</path>

<taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask" 
  classpathref="pmd.classpath" />

<target name="cpd" description="Search for cut-and-pasted code">
  <property name="cpd.report.xml" location="${cpd.report.dir}/cpd.xml" />
  <mkdir dir="${cpd.report.dir}" />

  <cpd minimumTokenCount="${cpd.minimum.tokens}" format="xml" 
    outputFile="${cpd.report.xml}" ignoreLiterals="${cpd.ignore.literals}" 
    ignoreIdentifiers="${cpd.ignore.identifiers}">
    <fileset dir="${src.dir}" includes="**/*.java" />
    <fileset dir="${test.src.dir}" includes="**/*.java" />
  </cpd>

  <property name="cpd.report.html" location="${cpd.report.dir}/index.html" />
  <xslt in="${cpd.report.xml}" style="${pmd.cpd.xsl.file}" out="${cpd.report.html}" />
  <echo message="CPD report is at ${cpd.report.html}" />
</target>

Matched block size

The cpd.minimum.tokens property controls the minimum length of the matched sections of code. A token translates roughly to a word, operator, function name, etc. Setting this value is truly a personal preference. Using 50 tokens might be a good start for new code, while 100 might be more appropriate to start with for legacy code. Tune the value to keep the list of findings low enough to address, and lower the value as duplicates are removed until you find that the sections identified as duplicates are smaller than you care to refactor out.
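To get a feel for how much code 50 tokens covers, a rough counter can help. The sketch below is not CPD's tokenizer (CPD uses its own language-specific tokenizers); it is a regex approximation, and the class name RoughTokenCount and its pattern are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RoughTokenCount {

    // Approximation only: string and char literals, identifiers and keywords,
    // integers, a few two-character operators, then any other single
    // non-whitespace character. Comments and escapes are not handled.
    private static final Pattern TOKEN = Pattern.compile(
        "\"[^\"]*\"|'[^']*'|[A-Za-z_][A-Za-z0-9_]*|\\d+"
        + "|\\+\\+|--|==|!=|<=|>=|&&|\\|\\||\\S");

    // Returns the number of rough tokens found in the given source text.
    static int count(final String code) {
        final Matcher matcher = TOKEN.matcher(code);
        int tokens = 0;
        while (matcher.find()) {
            tokens++;
        }
        return tokens;
    }

    public static void main(final String[] args) {
        System.out.println(count("int count = 0;"));
        System.out.println(count("for (int i = 0; i < word.length(); i++) {"));
    }
}
```

By this rough measure a simple declaration is about 5 tokens and a typical for-loop header is just under 20, so a 50-token threshold corresponds to only a handful of statements.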

Ignore literals

If the cpd.ignore.literals property is set to true, code that differs only in its literal values will still match. So the body of the method

public static void countEs(final String word) {
  int count = 0;
  for (int i = 0; i < word.length(); i++) {
    char letter = word.charAt(i);
    if ('e' == letter) {
      count++;
    }
  }
  System.out.println("Found " + count + " e's in the word " + word);
}

would match with the body of the method

public static void countAs(final String word) {
  int count = 0;
  for (int i = 0; i < word.length(); i++) {
    char letter = word.charAt(i);
    if ('a' == letter) {
      count++;
    }
  }
  System.out.println(count + " letter A's were found in the word " + word);
}

even though the characters in the if statement are different ('e' vs. 'a') and the output messages are different. In a case like this, the two methods could easily be refactored into one by extracting the literal values and passing them in as parameters.
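One possible shape for that refactoring is sketched below. The class name LetterCounter, the method names, and the single shared message format are assumptions made for this example; the original two methods print differently worded messages, so a real refactoring might pass the message text in as well.

```java
final class LetterCounter {

    // Counts how many times the given letter occurs in the word.
    // The character that was hard-coded ('e' or 'a') is now a parameter.
    static int countLetter(final String word, final char letter) {
        int count = 0;
        for (int i = 0; i < word.length(); i++) {
            if (word.charAt(i) == letter) {
                count++;
            }
        }
        return count;
    }

    // Builds one message for any letter, replacing the two hard-coded strings.
    static String describe(final String word, final char letter) {
        return "Found " + countLetter(word, letter) + " " + letter
            + "'s in the word " + word;
    }

    public static void main(final String[] args) {
        System.out.println(describe("elephant", 'e'));
        System.out.println(describe("banana", 'a'));
    }
}
```

Callers that used countEs(word) would call describe(word, 'e') and print the result, which leaves a single loop for CPD to see.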

Ignore identifiers

If the cpd.ignore.identifiers property is set to true, code that differs only in its identifier names will still match. So the method

public static void countEs(final String word) {
  int count = 0;
  for (int i = 0; i < word.length(); i++) {
    char letter = word.charAt(i);
    if ('e' == letter) {
      count++;
    }
  }
  System.out.println("Found " + count + " e's in the word " + word);
}

would match with the method

public static void countNumOfEs(final String str) {
  int num = 0;
  for (int j = 0; j < str.length(); j++) {
    char c = str.charAt(j);
    if ('e' == c) {
      num++;
    }
  }
  System.out.println("Found " + num + " e's in the word " + str);
}

even though the method name and variable names were different. In this case, refactoring could be as simple as removing one of the methods.
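Conceptually, both options can be pictured as normalizing the token stream before comparing it. The sketch below illustrates that idea only and is not how CPD is actually implemented; the class NormalizedCompare, the regex tokenizer, the short keyword list, and the <LIT> and <ID> placeholders are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NormalizedCompare {

    // Rough tokenizer: literals, words, integers, some operators, other symbols.
    private static final Pattern TOKEN = Pattern.compile(
        "\"[^\"]*\"|'[^']*'|[A-Za-z_][A-Za-z0-9_]*|\\d+"
        + "|\\+\\+|--|==|!=|<=|>=|&&|\\|\\||\\S");

    // Deliberately incomplete keyword list; keywords are never treated as identifiers.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "public", "static", "void", "final", "int", "char",
        "for", "if", "else", "while", "return", "new"));

    // Tokenizes the code, replacing literals and/or identifiers with placeholders.
    static List<String> normalize(final String code,
                                  final boolean ignoreLiterals,
                                  final boolean ignoreIdentifiers) {
        final List<String> tokens = new ArrayList<>();
        final Matcher matcher = TOKEN.matcher(code);
        while (matcher.find()) {
            final String token = matcher.group();
            final char first = token.charAt(0);
            final boolean literal =
                first == '"' || first == '\'' || Character.isDigit(first);
            final boolean identifier =
                (Character.isLetter(first) || first == '_')
                    && !KEYWORDS.contains(token);
            if (literal && ignoreLiterals) {
                tokens.add("<LIT>");
            } else if (identifier && ignoreIdentifiers) {
                tokens.add("<ID>");
            } else {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(final String[] args) {
        final String a = "char letter = word.charAt(i); if ('e' == letter) { count++; }";
        final String b = "char c = str.charAt(j); if ('e' == c) { num++; }";
        // Same structure, different names: equal only when identifiers are ignored.
        System.out.println(normalize(a, false, true).equals(normalize(b, false, true)));
        System.out.println(normalize(a, false, false).equals(normalize(b, false, false)));
    }
}
```

With identifiers ignored, the two loop bodies above reduce to the same token list; with the option off, they differ at the first variable name.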

Maven

To use PMD CPD in a Maven project, add the following lines to your pom.xml file:

<reporting>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-pmd-plugin</artifactId>
      <configuration>
        <minimumTokens>50</minimumTokens>
      </configuration>
    </plugin>
  </plugins>
</reporting>

You can run the CPD analysis directly with mvn pmd:cpd, or let it report its findings as part of the Maven site report (mvn site).

Options

The minimumTokens setting controls the minimum length of the matched sections of code, like the cpd.minimum.tokens property in the Ant build. There does not seem to be a way to control the literal and identifier matching in Maven as there is in Ant. Ignoring literals seems to be off, while ignoring identifiers is on. I’d be interested to hear from anyone who knows how to control these in Maven.

More Information

SecureCI™

SecureCI is a free virtual machine that includes a variety of code analysis tools, including PMD, integrated with Ant and Maven for build-time analysis and Hudson for continuous integration. Other tools in SecureCI include Subversion for source control, Trac for wiki and bug tracking, and ratproxy for security scanning of web applications.
