Project Details

Plagiarism Detection in an LLM-World – Curating a Novel Dataset for Scientific Plagiarism Detection Systems

Subject Area Methods in Artificial Intelligence and Machine Learning
Term since 2024
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 554559555
 
The advent and widespread accessibility of generative Large Language Models (LLMs), such as GPT or LLaMA, and especially of publicly available applications built on them, such as ChatGPT, have markedly transformed many facets of our lives, with profound positive impacts. In academia in particular, tools like ChatGPT have become invaluable assets that assist in research, in drafting papers, and in enhancing the overall learning experience. However, the same attributes that make these technologies beneficial also pose significant threats. The proliferation of LLMs has introduced complex challenges, ranging from the facilitation of malware and social engineering attacks, through automated influence campaigns, spam, and harassment, to, of particular interest to this proposal, threats to academic integrity. A prime concern, and the focus of this proposal, is the anticipated increase in both the frequency and the sophistication of plagiarism cases facilitated by tools like ChatGPT. The lack of adequate, realistic, large-scale plagiarism datasets has significantly hindered the advancement and practical usefulness of automated Plagiarism Detection Systems (PDS). Previously, creating such datasets at scale was considered impossible because no automated solution, an "automatic plagiarist," existed. With the availability of LLMs, we argue that ChatGPT can fulfill the role of such an automatic plagiarist. This not only enables the generation of synthetic plagiarism on a large scale but, more importantly, allows for the creation of more realistic plagiarism. It thereby equips the research community with the long-awaited resources to achieve essential progress and the desired practicality in PDS.
The goal of this project is to develop and disseminate a realistic benchmark dataset for external PDS, tailored to the anticipated future challenges of real-world plagiarism, by exploiting the capabilities of generative LLMs to paraphrase and otherwise disguise plagiarized text to various degrees. Our synthetic plagiarism will be generated using multiple LLMs so that it represents the current market landscape of available models. We plan to design the dataset as an easy-to-use and expandable benchmark for PDS and to publish it as a shared task in a suitable setting to ensure high visibility and usability for the research community.
DFG Programme WBP Fellowship
International Connection Japan
 
 
