A similarity measurer on two programming assignments on Online Judge.
StardustDL/codesim
Recommended OS: Ubuntu 20.04.
Other Linux distributions should work, but Windows and macOS with Python 3.10 may fail because codesim depends on ortools.
Install Python (>=3.7), pip, g++, and objdump.
An example script for Ubuntu 20.04:
    # Ubuntu 20.04 has Python 3.8 installed; use python3 to run Python
    apt update
    # Install g++ and objdump
    apt install build-essential

Development Way: install the requirements.

    cd src
    pip install -r requirements.txt

Package Way: build and install a portable Python wheel package.

    cp README.md ./src
    cd src
    python -m pip install --upgrade build twine
    python -m build -o ../dist
    python -m pip install ../dist/codesim-0.0.1-py3-none-any.whl

Development Way: run codesim as a module from the src directory.

    cd src
    python -m codesim <file1> <file2>
    # verbose mode to see the log
    python -m codesim <file1> <file2> [-v/-vv/-vvv..]

Package Way: if you have installed the built package, just use the installed package.

    python -m codesim <file1> <file2>
    codesim <file1> <file2>
The code similarity measuring algorithm originates from
Jiang Y, Xu C. Needle: Detecting code plagiarism on student submissions[C]//Proceedings of ACM Turing Celebration Conference-China. 2018: 27-32.
Some test cases are from the CodeNet Dataset.
Algorithm implementation details are from here.
We want to measure the similarity between two programming assignments that compile with g++ -std=c++17 -pedantic.
Compilation and optimization remove comments, macros, and unnecessary code, and ignore local variable names and code formatting. Many redundant changes have zero or minor impact after compiler optimization, so compiling is a good way to normalize a program. To further reduce the impact of obfuscating changes, we use the opcode sequence as a function's fingerprint and ignore operands.
A program is a set of functions, and a function is a sequence of opcodes.
We first compile the input code with g++ at the -O2 optimization level. To keep the generated object file clean, we use the -c option to avoid generating initialization functions.
Then we use objdump to disassemble the object file, and collect and filter the opcode sequences (ignoring nop and unrecognized opcodes).
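The compile-and-disassemble step above can be sketched as follows. This is a minimal illustration, not codesim's actual implementation: the names `parse_objdump` and `disassemble` are hypothetical, the parser assumes the tab-separated layout that GNU `objdump -d` prints for x86-64, and it treats any mnemonic starting with `nop` and the `(bad)` marker as the opcodes to filter out.

```python
import re
import subprocess
import tempfile
from pathlib import Path

# Header of a disassembled function, e.g. "0000000000000000 <_Z3addii>:"
FUNC_HEADER = re.compile(r"^[0-9a-f]+ <(.+)>:$")


def parse_objdump(text):
    """Map each function name to its opcode sequence, dropping operands."""
    functions = {}
    current = None
    for line in text.splitlines():
        header = FUNC_HEADER.match(line.strip())
        if header:
            current = functions.setdefault(header.group(1), [])
            continue
        # Instruction lines look like "  addr:\tbytes\tmnemonic operands".
        parts = line.split("\t")
        if current is None or len(parts) < 3 or not parts[2].strip():
            continue
        opcode = parts[2].split()[0]
        # Assumption: padding nops and undecodable bytes are the ones to skip.
        if opcode.startswith("nop") or opcode == "(bad)":
            continue
        current.append(opcode)
    return functions


def disassemble(source_path):
    """Compile with g++ -O2 -c, then disassemble the object file with objdump -d."""
    with tempfile.TemporaryDirectory() as tmp:
        obj = Path(tmp) / "a.o"
        subprocess.run(
            ["g++", "-std=c++17", "-pedantic", "-O2", "-c",
             str(source_path), "-o", str(obj)],
            check=True,
        )
        result = subprocess.run(
            ["objdump", "-d", str(obj)],
            check=True, capture_output=True, text=True,
        )
    return parse_objdump(result.stdout)


# Hand-written sample in objdump's layout, so the parser can be tried without g++.
SAMPLE = (
    "0000000000000000 <_Z3addii>:\n"
    "   0:\t8d 04 37             \tlea    (%rdi,%rsi,1),%eax\n"
    "   3:\tc3                   \tret\n"
    "   4:\t66 90                \txchg   %ax,%ax\n"
    "   6:\t0f 1f 40 00          \tnopl   0x0(%rax)\n"
)
print(parse_objdump(SAMPLE))
```

Taking only the first token of the mnemonic column keeps the parser short, but it means instruction prefixes such as `rep` or `lock` are recorded instead of the instruction that follows them; a fuller version would handle those cases.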
One common kind of obfuscation is splitting one function into many functions. To address this, we calculate inter-function similarity (which is the same as the program similarity) using intra-function similarity. The main idea is mapping each instruction in program
Intra-function similarity models the similarity of an instruction in a specific function context.
Let
Formally, the intra-function similarity between
For efficiency, we use the following strategies: use integer for opcode to speed up comparison, calculate
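The integer-opcode strategy can be illustrated with a small sketch. The helper name `encode_opcodes` and the shared `table` dictionary are hypothetical and only show one way to do the encoding; codesim's own representation may differ.

```python
def encode_opcodes(functions, table):
    """Replace opcode strings with small integers taken from a shared table.

    `functions` maps a function name to its list of opcode strings.
    `table` maps opcode string -> integer id and is extended in place,
    so the same opcode gets the same id in every program encoded with it.
    """
    return {
        name: [table.setdefault(op, len(table)) for op in opcodes]
        for name, opcodes in functions.items()
    }


table = {}
program_a = encode_opcodes({"f": ["push", "mov", "ret"]}, table)
program_b = encode_opcodes({"g": ["mov", "ret"]}, table)
print(program_a, program_b, table)
```

Both programs being compared need to be encoded with the same table; otherwise equal integers would not imply equal opcodes.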
We model the mapping problem as a weighted flow network graph
Let
We use sigmoid function's center part,
Then the unnormalized inter-function similarity from
Then normalize
Finally the inter-function similarity between
For efficiency, we use the following strategies: calculate