CPD works generically on the tokens produced by a `CpdLexer`. To add support for a new language, the crucial piece is writing a `CpdLexer` that splits the source file into the tokens specific to your language. Thankfully, you can use a stock Antlr grammar or JavaCC grammar to generate a lexer for you. If you cannot use a lexer generator, for instance because you are wrapping a lexer from another library, it is still relatively easy to implement the `CpdLexer` interface yourself.
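To make the idea of "working on tokens" concrete, here is a minimal standalone sketch. It does not use PMD's API: `ToyLexer` and its tokenization rules are hypothetical and only illustrate why two snippets that differ in layout can still be recognized as the same token sequence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, standalone illustration; this is NOT PMD's CpdLexer API.
final class ToyLexer {

    private ToyLexer() { }

    /** Splits source text into word tokens (letters, digits, '_') and single-character symbol tokens. */
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace separates tokens but is not a token itself
            } else if (isWordChar(c)) {
                int start = i;
                while (i < source.length() && isWordChar(source.charAt(i))) {
                    i++;
                }
                tokens.add(source.substring(start, i));
            } else {
                tokens.add(String.valueOf(c)); // any other character is its own token
                i++;
            }
        }
        return tokens;
    }

    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }
}
```

With this sketch, `ToyLexer.tokenize("x = y+1;")` and `ToyLexer.tokenize("x=y + 1 ;")` return equal lists, which is the property a duplicate detector relies on. A real lexer generated from a grammar handles literals, comments, and multi-character operators, which this toy deliberately ignores.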
Use the following guide to set up a new language module that supports CPD.
Add your new module to the parent pom as a `<module>` entry, so that it is built alongside the other languages.

Implement a `CpdLexer`.

For Antlr grammars, you can take the grammar from antlr/grammars-v4 and place it in `src/main/antlr4`, followed by the package name of the language. You then need to call the antlr4 plugin and the appropriate ant wrapper with target `cpd-language` to generate the lexer from the grammar. To do so, edit `pom.xml` (e.g. like the Golang module). Once that is done, `mvn generate-sources` should generate the lexer sources for you.
You can now implement a `CpdLexer`, for instance by extending `AntlrCpdLexer`. The following reproduces the Go implementation:
```java
// mind the package convention if you are going to make a PR
package net.sourceforge.pmd.lang.go.cpd;

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.Lexer;

import net.sourceforge.pmd.cpd.impl.AntlrCpdLexer;
import net.sourceforge.pmd.lang.go.ast.GolangLexer;

public class GoCpdLexer extends AntlrCpdLexer {

    @Override
    protected Lexer getLexerForSource(CharStream charStream) {
        // GolangLexer is the lexer class generated from the Antlr grammar
        return new GolangLexer(charStream);
    }
}
```

If your language is case-insensitive, you may want to override `getImage(AntlrToken)`. There you can change each token, e.g. into uppercase, so that CPD sees the same strings and can find duplicates even when the casing differs. See `TSqlCpdLexer` for an example. You will also need a “CaseChangingCharStream”, so that Antlr itself is case-insensitive.

For JavaCC grammars, place your grammar in `etc/grammar` and edit the `pom.xml` like the Python implementation does. You can then subclass `JavaccCpdLexer` instead of `AntlrCpdLexer`.

If your JavaCC-based language is case-insensitive (option `IGNORE_CASE=true`), then you need to implement `JavaccTokenDocument.TokenDocumentBehavior`, which can change each token, e.g. into uppercase. See `PLSQLParser` for an example.

Create a `Language` implementation, and make it implement `CpdCapableLanguage`. If your language only supports CPD, then you can subclass `CpdOnlyLanguageModuleBase` to get going:
```java
// mind the package convention if you are going to make a PR
package net.sourceforge.pmd.lang.go;

import net.sourceforge.pmd.cpd.CpdLexer;
import net.sourceforge.pmd.lang.LanguagePropertyBundle;
import net.sourceforge.pmd.lang.go.cpd.GoCpdLexer;
import net.sourceforge.pmd.lang.impl.CpdOnlyLanguageModuleBase;

public class GoLanguageModule extends CpdOnlyLanguageModuleBase {

    // A public noarg constructor is required.
    public GoLanguageModule() {
        super(LanguageMetadata.withId("go").name("Go").extensions("go"));
    }

    @Override
    public CpdLexer createCpdLexer(LanguagePropertyBundle bundle) {
        // This method should return an instance of the CpdLexer you created.
        return new GoCpdLexer();
    }
}
```

To make PMD find the language module at runtime, write the fully-qualified name of your language class into the file `src/main/resources/META-INF/services/net.sourceforge.pmd.lang.Language`.
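For the Go example, the fully-qualified name follows from the package and class declarations in the module code, so the services file would plausibly consist of a single line like this (shown only as an illustration of the file's content):

```
net.sourceforge.pmd.lang.go.GoLanguageModule
```

The file name itself is the fully-qualified name of the `Language` interface, as given in the path above.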
At this point the new language module should be available in CPD and usable by CPD like any other language.
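For case-insensitive languages, the effect of normalizing token images (as mentioned for the lexer step) can be pictured with a small standalone sketch. It does not use PMD or Antlr classes: `CaseInsensitiveImages` is a hypothetical helper, and the choice of `Locale.ROOT` is an assumption made here to keep the result independent of the machine's default locale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, standalone illustration; this is NOT PMD's getImage(AntlrToken).
final class CaseInsensitiveImages {

    private CaseInsensitiveImages() { }

    /** Returns a copy of the token images with every image converted to uppercase. */
    static List<String> normalize(List<String> images) {
        List<String> result = new ArrayList<>(images.size());
        for (String image : images) {
            // Locale.ROOT avoids locale-specific case mappings
            result.add(image.toUpperCase(Locale.ROOT));
        }
        return result;
    }
}
```

After normalization, the images `select Name from T` and `SELECT name FROM t` become the same sequence, so a comparison on images treats them as duplicates.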
Update the test that asserts the list of supported languages by updating the `SUPPORTED_LANGUAGES` constant in `BinaryDistributionIT`.
Add some tests for your CpdLexer by following the section below.
Add a page in the documentation. Create a new markdown file `<langId>.md` in `docs/pages/pmd/languages/`. This file should have the following frontmatter:
```
---
title: <Language Name>
permalink: pmd_languages_<langId>.html
last_updated: <Month> <Year> (<PMD Version>)
tags: [languages, CpdCapableLanguage]
---
```

On this page, language specifics can be documented, e.g. when the language was first supported by PMD. There is also the following Jekyll include, which creates a summary box for the language:
```
{% include language_info.html name='<Language Name>' id='<langId>' implementation='<langId>::lang.<langId>.<langId>LanguageModule' supports_cpd=true %}
```

To make the CpdLexer configurable, first define some property descriptors using `PropertyFactory`. Look at `CpdLanguageProperties` for some predefined ones which you can reuse (prefer reusing property descriptors if you can). You need to override `newPropertyBundle` and call `definePropertyDescriptor` to register the descriptors. After that, you can access the values of the properties from the parameter of `createCpdLexer`.
To implement simple token filtering, you can use `BaseTokenFilter` as a base class, or another base class in `net.sourceforge.pmd.cpd.impl`. Take a look at the Kotlin token filter implementation, or the Java one.
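As a rough picture of what a token filter does, the following standalone sketch drops a range of tokens from a stream. It does not extend `BaseTokenFilter` or use any PMD type: `ImportSkippingFilter`, the `import` keyword, and the `;` terminator are hypothetical choices made only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, standalone illustration; this is NOT PMD's BaseTokenFilter.
final class ImportSkippingFilter {

    private ImportSkippingFilter() { }

    /** Returns the tokens with every "import ... ;" statement removed. */
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        boolean inImport = false;
        for (String token : tokens) {
            if (!inImport && token.equals("import")) {
                inImport = true; // start discarding at the keyword
            } else if (inImport) {
                if (token.equals(";")) {
                    inImport = false; // the terminator is discarded as well
                }
            } else {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Filtering out such statements keeps boilerplate that is naturally repeated across files from being reported as duplication. The real base classes work on PMD's token types rather than plain strings.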
Add a Maven dependency on `pmd-lang-test` (scope `test`) in your `pom.xml`. This contains utilities to test your CpdLexer.
Create a test class extending from `CpdTextComparisonTest`. To add tests, you need to write regular JUnit `@Test`-annotated methods, and call the method `doTest` with the name of the test file.
For example, for the Dart language:
```java
package net.sourceforge.pmd.lang.dart.cpd;

public class DartTokenizerTest extends CpdTextComparisonTest {

    /**********************************
      Implementation of the superclass
     ***********************************/

    public DartTokenizerTest() {
        super("dart", ".dart"); // the ID of the language, then the file extension used by test files
    }

    @Override
    protected String getResourcePrefix() {
        // "testdata" is the default value, you don't need to override.
        // This specifies that you should place the test files in
        // src/test/resources/net/sourceforge/pmd/lang/dart/cpd/testdata
        return "testdata";
    }

    /**************
      Test methods
     ***************/

    @Test // don't forget the JUnit annotation
    public void testLiterals() {
        // This will look for a file named literals.dart
        // in the directory identified by getResourcePrefix,
        // tokenize it, then compare the result against a baseline
        // literals.txt file in the same directory

        // If the baseline file does not exist, it is created automatically
        doTest("literals");
    }
}
```
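The baseline mechanism described in the comments above can be sketched in a few lines of plain Java. This is not the code behind `doTest`: `BaselineCheck` and its `Result` enum are hypothetical, and whether a freshly created baseline should count as a pass or a failure is deliberately left to the caller, since the text here only states that the file is created.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical, standalone illustration; this is NOT the implementation of doTest.
final class BaselineCheck {

    enum Result { CREATED, MATCH, MISMATCH }

    private BaselineCheck() { }

    /** Creates the baseline from the actual output if it is missing, otherwise compares against it. */
    static Result check(Path baseline, String actual) {
        try {
            if (Files.notExists(baseline)) {
                Files.write(baseline, actual.getBytes(StandardCharsets.UTF_8));
                return Result.CREATED;
            }
            String expected = new String(Files.readAllBytes(baseline), StandardCharsets.UTF_8);
            return expected.equals(actual) ? Result.MATCH : Result.MISMATCH;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this workflow the first run records the token dump, later runs detect any change in the lexer's output, and an intended change is accepted by reviewing and committing the updated baseline file.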