April 23, 2008
Getting Better Search ResultsSuperfluous Results
In reviewing the results of the comparison, often some specific files or specific source-code elements would show up throughout the results, skewing results and hiding important correlation information. For example, open-source files may have been used in one or both sets of files. In searching for plagiarized code, the open-source files would be highly correlated with each other, but these correlations were not important. Pieces of these files would show up throughout both sets of files and flagged as highly correlated.
Similarly, there were specific statements, comments, and identifiers that showed up in many places, but were not relevant to finding plagiarized code. Users searching for plagiarized code may find that two programs running on Linux both use the same system calls. Thus, files with these system calls will have a higher correlation. Common identifier names like "index," "count," and "result" showed up in many files, increasing correlation values, but were not necessarily signs of plagiarism.
Had these results been known upfront, some of them could have been eliminated before CodeMatch was run. However, given the number of files and the number of source-code elements, it was impractical to find these elements before performing the correlation. Also, the correlation itself pointed out many of these superfluous elements.
CodeMatch Post-Process Filtering
To make examining the correlation results more useful and let users focus on the kinds of correlation that are most important, I added the ability to filter the results. After CodeMatch produces a database of results, this filtering can be performed on the database:
After the filtering is performed on the database, the correlation scores between file pairs are adjusted accordingly. I found that for large file sets, this filtering reduced the manual process of reviewing the results in order to find plagiarized source-code files from days to hours or even minutes.
My experience using filtering with CodeMatch can be generalized to any kind of information retrieval process. To understand how filtering can be used, it is important to first understand the different kinds of information retrieval processes and information display methods.
|
|
||||||||||||||||||||||||||||||
|
|
|
|