>> GIPP: Yes. Hello. Welcome. First, I'd like to mention that I'm here with my colleague
and co-author Norman Meuschke. He's over there. Let's start with the outline. The paper
is about a new plagiarism detection system. First, I want to talk about the motivation,
why we have developed this new approach. Then I will talk about the approach, how we try
to identify plagiarism, and the algorithms we have developed to identify it. I will give
a short overview of our prototype, it's called CitePlag. And, of course, if you build such
a system, you want to evaluate it and show whether it's actually better than the existing
systems, so we'll talk about that, and then I'll give an outlook. So, the currently
available plagiarism detection systems
have the problem that they are solely text-based. Professor Dr. Debora Weber-Wulff runs
evaluations of the existing systems, and her conclusion is that plagiarism detection
systems find copies but not plagiarism. The reason for that is that they compare the
text of the documents and try to find identical patterns, like n-grams, so words that
follow in the same sequence, and similar features. But if you paraphrase
something, or if you translate something, then the existing systems hardly find anything.
And, I mean, usually, if a scientist plagiarizes something, then he will make a little bit
of an effort and maybe try to change some words. So, for that reason, it's very, very
difficult to identify plagiarized documents. Of course, a human examiner who knows the
field might find it, but there are plenty of studies that have shown that even in
databases like arXiv, and even in well-known journals, the people who peer-reviewed the
articles before publication have not identified these plagiarized cases. Later, I will
talk about the doctoral thesis of the former German Defense Minister, and he was
caught--some people already laughed, yeah. He stepped down, but it took several years
before people actually found out. I mean, it wasn't just his thesis; he even published it
as a book. But still, it took a long time before he was identified as someone who
plagiarized. So, now, I'd like to give a brief overview of
existing systems. In general, you can distinguish the available systems into local
similarity assessment and global similarity assessment. You can see that here: the
left part [INDISTINCT] local similarity assessment, and the right part is the global.
That's it, the main... >> Sir, can you come back to the microphone?
>> Oh. Sorry, yes. Of course, for the recording. So, the existing systems are based on
text comparison, on string comparison, and keyword analysis. For example, one way these
similarities are calculated is just by performing a statistical analysis of the
frequencies and occurrences of certain keywords. Another very popular method is to
create fingerprints of documents. And another method that is used, though its
performance is not the best, is to analyze the use of certain words, grammatical
constructs, punctuation, and so on, so that you can identify whether the style is
consistent within one document or whether maybe different people have worked on it,
although that's not really helpful for identifying plagiarism in the context of
scientific articles. I'd like to talk about
fingerprinting a little bit more, because that's the most commonly used technique. You
see an example on the right side--I hope the font size is large enough. This is a
sentence; you can create a fingerprint from this sentence, that is, compute a hash
value, store it in a database, and then just compare these hash values in the large
database. In this way, you can see whether a sentence or a paragraph has been used
before somewhere else. And if you find a collision, so if you find two identical
sentences in the same bucket of such a hash table, then it might be worth looking at it
more closely to see whether this is an identical sentence or paragraph.
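The fingerprinting idea described here can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's implementation; the normalization step and the choice of SHA-1 are assumptions for the example:

```python
import hashlib

def fingerprint(sentence: str) -> str:
    """Normalize a sentence and return a hash value as its fingerprint."""
    normalized = " ".join(sentence.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# Build a lookup table of fingerprints for a small reference corpus.
corpus = ["The quick brown fox jumps over the lazy dog."]
table = {fingerprint(s): s for s in corpus}

# An identical sentence collides with a stored fingerprint -> candidate match.
print(fingerprint("The quick brown fox jumps over the lazy dog.") in table)  # True

# Altering a single word yields a completely different hash -> no match.
print(fingerprint("The quick brown fox leaps over the lazy dog.") in table)  # False
```

The second lookup demonstrates exactly the weakness discussed next: one changed word defeats the fingerprint.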
But this shows again the problem: if you create such a fingerprint and alter even a
single word, the hash value will be completely different, and you won't recognize that
the sentence may have been used in another source before. Now, I would like to talk
about the new approach we want to present. We call it citation-based plagiarism
detection. It might sound strange at first, because usually, if you plagiarize
something, you'd think you don't really use citations. But we have observed that when
people plagiarize, they do use citations, because if you want to publish something, you
need citations. Usually, people paraphrase the work but keep a lot of the citations that
were also in the original work, maybe altering them a little bit. So, in this
figure, you can see that document A cites documents C, D and E, and document B also has
references, or citations, to documents C, D and E, in a very similar order. Document B
has two additional citations, but otherwise it's pretty similar. So, if you just compare
the order of citations, you recognize a pattern. And the idea of citation-based
plagiarism detection is to look for these patterns in scientific documents and to check
them: well, there might be something wrong here, maybe something is plagiarized. So, the
suitability for plagiarism detection depends on certain characteristics.
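The basic pattern check from the figure can be sketched as follows: reduce each document to its in-text citation sequence, keep only the citations both documents share, and compare the resulting orders. The code is an illustrative assumption, not the authors' implementation; the citation labels come from the figure:

```python
def shared_citation_order(doc_a: list, doc_b: list) -> tuple:
    """Keep only the citations that appear in both documents,
    preserving the order in which each document cites them."""
    shared = set(doc_a) & set(doc_b)
    return ([c for c in doc_a if c in shared],
            [c for c in doc_b if c in shared])

# Citation sequences as in the figure: document B has two extra citations.
doc_a = ["C", "D", "E"]
doc_b = ["C", "X", "D", "E", "Y"]
a, b = shared_citation_order(doc_a, doc_b)
print(a == b)  # True: the shared citations appear in the same order
```

Whether such a match is meaningful depends on the characteristics discussed next.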
For example, the order of citations, or the textual proximity of citations: whether the
citations are in the same paragraph or maybe in the same sentence. Another factor we
take into account is the probability of co-occurrence. For example, if there are two
papers and both papers are very popular and have been cited thousands of times, of
course it's more likely that they are co-cited in other documents than if the two
documents have hardly been cited at all. Another factor that we consider is the citation
style. Depending on the citation style, the order of citations can change. You can have
citations five, three and four in one bracket, or you can just have "three to five". In
that case, citations five and four would not be right next to each other. Or, very often
in scientific documents, you have tables that contain multiple citations, maybe even
taken over from another source. And then we have some challenges in identifying these
patterns: transpositions, where the order of citations changes; scaling, where some
other citations appear in between the shared citations; or a completely different
alignment, where maybe a whole paragraph from the conclusion of document A is used in
the introduction of another document, and so on, so that it's just mixed up. In order to
identify these patterns,
we have developed some algorithms. They are called Longest Common Citation Sequence,
Greedy Citation Tiling and Citation Chunking. Bibliographic Coupling is an approach that
is already being used, not for the purpose of plagiarism detection, but for citation
analysis purposes. In this table, we distinguish whether they look for global similarity
or local similarity, and whether the citation order is preserved or not. And now, I'll
talk a little bit
about these algorithms. On the left, we see Bibliographic Coupling. Bibliographic
Coupling just describes, or measures, the number of citations shared between two
documents. In this case, we see that document A and document B both cite documents C and
D, so the coupling strength is two. In our analysis of the PubMed database, we have
found documents that share more than 20 citations. But of course, this alone doesn't
indicate or prove that something might be plagiarized; it's quite normal that
publications from a certain field share references.
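Bibliographic coupling as just described is simply the size of the intersection of two documents' reference sets. A minimal sketch in Python, with invented reference labels matching the slide's example:

```python
def coupling_strength(refs_a: set, refs_b: set) -> int:
    """Bibliographic coupling strength: the number of references
    that two documents have in common."""
    return len(refs_a & refs_b)

# Documents A and B both cite C and D, as in the example on the slide.
refs_a = {"C", "D", "F"}
refs_b = {"C", "D", "G"}
print(coupling_strength(refs_a, refs_b))  # 2
```

Note that, unlike the pattern-based algorithms, this measure ignores the order of citations entirely.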
On the right is the description of the Longest Common Citation Sequence. This just
examines whether two documents share citations in the same order. All the numbers are
citations, and the red numbers are the citations that are shared by both documents. And
in this case, three, four and five would be the longest sequence.
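The Longest Common Citation Sequence can be sketched as the standard longest-common-subsequence dynamic program applied to the two citation orders. This is an illustrative sketch with invented citation sequences, not the authors' code:

```python
def longest_common_citation_sequence(a: list, b: list) -> list:
    """Longest common subsequence of two documents' in-text citation orders."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack to recover one longest shared citation sequence.
    seq, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            seq.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return seq[::-1]

# Citations (by reference id) in the order each document cites them.
doc_a = [1, 3, 4, 5, 2]
doc_b = [3, 7, 4, 5, 8]
print(longest_common_citation_sequence(doc_a, doc_b))  # [3, 4, 5]
```

As in the slide's example, the shared citations 3, 4 and 5 form the longest sequence even though other citations are interleaved.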
Then, we have the algorithm Greedy Citation Tiling, and Greedy Citation Tiling aims to
identify all matching substrings of citations. I think it's pretty easy to understand,
so I won't talk too much about it. A bit more complex is Citation Chunking. With
Citation Chunking, we don't have a fixed length for the citation string we analyze;
instead, the size of a chunk depends on the citations that occur before and after it. If
there are similarities, they are included in the citation chunk. This way, we have an
algorithm that works quite well for local similarity--not so much for comparing
similarities between two documents in general, but on a very local basis, for example
within one paragraph or one section. I think there is not enough time to talk about the
algorithms in detail, but if there are questions later, I will come back to that. So, if
you develop a new plagiarism detection system, it's very
difficult to evaluate, because if you use an artificially created test set, it might not
be really realistic: there are very different methods to disguise plagiarism. The
simplest form is just copy and paste, or you paraphrase everything a little bit, or you
paraphrase it a lot. So, by creating an artificial test set, the results might not be
that realistic. Another possibility would be to use actually plagiarized documents, but
there you have the problem that you usually don't know where the plagiarism is in the
document. You don't have the ground truth; you don't know how many sections were
plagiarized and what the sources were. And if you use the available systems, then you
only find what you can find with the available systems anyway, so you will not find more
than that. Obviously, it would be best to have a real-world example that is not
artificially created, but it should also be well examined, so that you really know which
parts of the document have been plagiarized. So, I
will talk a little bit about our evaluation of the real plagiarism case of the former
German Defense Minister, Mr. Guttenberg. It was his doctoral thesis, I think around 350
or 400 pages. And I will talk briefly about our evaluation on the PubMed database; we
use an open access part of that. And here you see a little summary of the different test
collections that we considered using, with their advantages and disadvantages. The PAN
Corpus is an artificially created corpus which we would have loved to use, because it
was created for a conference to measure the performance of different plagiarism
detection tools, so it sounds perfect. The only problem is that all existing systems
only analyze the text of documents, so in this artificially created corpus, no citation
information is available. That's why we couldn't use it. So, now we'll come to the
thesis of Mr. Guttenberg.
What you can see on the left is the thesis with all its pages, and all the colored parts
show plagiarized passages. So you see, there's a lot in there, and from different
sources. This figure was derived from the findings of the GuttenPlag project, a
crowdsourcing project where hundreds of volunteers tried to find all the pieces that had
been stolen. So, we believe that it's quite comprehensive; it probably doesn't include
all the plagiarized parts, but at least it should be a very good data set to evaluate
different algorithms. What you see on the right is a figure that
shows how long it took to identify all these various plagiarized sections. What is quite
obvious is that in the beginning it might be quite easy to find a plagiarized section,
but it gets harder and harder. The first five percent of the plagiarism is pretty easy:
you just need to digitize the document--in this case only a book was available, but you
can run OCR on it--and then use the existing systems, and then you might find around
five percent. That's what we found with the software tool called Ephorus, which is
supposed to be one of the best plagiarism checkers. So, with this commercial tool we
found five percent, or even less, of the translated plagiarism, I have to say. Overall,
63% of the lines in this document are plagiarized, and 94% of the pages contain
plagiarism. This slide shows, on the left, the original source document, and underneath
you see Guttenberg's doctoral thesis. The same colors symbolize the same sources, or the
same citations. For this evaluation, we used the 16 sections in the thesis that have
been identified as plagiarized and that were all translated from other sources--so not
just German or English, but translated, for example, from English to German. And the
existing tools couldn't identify any of these translated sections. So, we wanted to know
how well would the citation-based
approach work to identify these sections. And, as you see, even without applying any
algorithms, there are certain patterns you can recognize. In the big graph at the bottom
right, you see--there are only two pages, but they contain, I don't know how many
citations, maybe fifteen, something like that. And you see that there are some
similarities. If you apply the algorithms I just presented and clean up the whole
sequence of citations, then the result is what you see at the very bottom right: the
order of citations is quite similar. And that's not a coincidence, obviously. So, if you
use these algorithms and see that there are more than three or four citations in the
same order, that might be a strong hint that a section has been plagiarized. The table
above shows some figures for the copy-and-paste plagiarism that was identified in the
thesis. If you use the existing software,
you have pretty good results. So, you find, like, 70% or more of the copy-and-paste
plagiarism sections. The citation-based approach is not suitable for identifying
copy-and-paste plagiarism, because obviously, if a section is very short, it contains
maybe one or two citations, or maybe not even a single citation, so it doesn't deliver
very good results. For the sections of disguised plagiarism, the existing tools
identified less than 10%, depending on how strongly it is disguised. In our case, it was
around 30%. Then, idea plagiarism is of course very, very hard to find, because if you
compare text, you might not find any words that are similar, or if you find them, there
may be a hundred words in between. So the existing systems are not really suitable for
identifying idea plagiarism. With the citation-based approach, you can find it
sometimes; it really depends on how someone plagiarizes. And even for a human examiner,
it's very hard to identify idea plagiarism; that's just the nature of it. But where the
citation-based approach delivered pretty good results was with the translated
plagiarism. Of these sixteen examples, you see that for some of them, for example on
page 224, there were only two citations, so you wouldn't be able to identify such a
section as plagiarized with the citation approach, because if you looked at all such
sections, you would just have way too many false positives. But if you set a threshold
of maybe three or four citations in the same order, then you get pretty useful results.
Are there any questions about this figure? Okay. This graph shows the results when we
ran the analysis on PubMed. I won't go into too much detail; I just want to show some
figures. What you see here: on the left, it's not a logarithmic scale, but on the right
it is. On the Y axis, you see the number of document pairs, and on the X axis you see
the coupling strength. I talked about this: the number of shared references between two
documents. And you see, for example, that if you look for a coupling strength of eight,
you still find quite a lot of documents. So, this alone is not really useful for
identifying plagiarism, because you would just have way too many false positives. That's
why it's important to combine it with the other algorithms, where you consider the order
of citations and patterns and so on. The strengths and weaknesses of the algorithms are
displayed in this table. So, what you
see is that the Citation Chunking algorithm delivers pretty good results for most cases
and most forms of plagiarism, Greedy Citation Tiling has some advantages in some cases,
and the same is true for the Longest Common Citation Sequence. In the following, I'd
like to talk about the software system that we have developed to identify plagiarism,
and the steps that are involved. In the first step, we just convert the PDF document to
an XML document and identify the citations. We use open source software for that, called
ParsCit, with some minor modifications. For example, in our case it's important that we
know in which sentence a citation occurs and where exactly it stands. So, we're not just
interested in extracting the bibliography of a document; we want to know where in the
full text the corresponding citations are located. Then we match them against our
database, do some article deduplication and author name disambiguation, then we run the
citation-based plagiarism detection algorithms and compare the entries in the database
to see whether we find any matching entries. And then we generate a report using the
citation-based approach. We also use text-based plagiarism detection software, because,
as I said, short sentences and so on can be better detected with text-based approaches.
Yes, in the very end they are combined and you get the results. So, I think my time is
nearly over. So, to conclude: the existing systems do a pretty good job for
copy-and-paste plagiarism, but they struggle to identify plagiarism that has been
disguised, for example if it has been paraphrased or translated, or idea plagiarism. We
see the strength of the citation-based approach in cases where the document has been
altered a lot, as in translations or strong paraphrasing. The citation-based approach
should not be seen as a substitute but rather as an extension of the existing
approaches, and, yes, we were pretty happy with the results. We hope to implement this
on a larger scale, establish a larger database with citation information, and see where
we can apply it in the future. Maybe I'll talk about the outlook a little bit.
Something else we want to include is consideration of the argumentative structure: that
we analyze keywords, for example that someone said "Hamilton argued something" whereas
someone else said something else, so that you include these words and compare not just
the pattern of citations but also how these citations stand in relation to each other,
in what kind of context. And last but not least, we don't only want to look for
plagiarism, because in many cases it can be interesting to see, if you read a paper,
what other papers have been read by the author but not been cited, which does not mean
that it's plagiarism. By putting the threshold a bit lower, you can identify documents
that have maybe been read, for example, sources that have been used for the author's own
work. So, it can give you some ideas for further reading that might be interesting,
since this reading was at least of interest to the author. Yes, that's it. Thank you.
Any questions? >> Okay. We're running a little bit behind
time but we have time for one, possibly two questions. Okay. I'm going to ask you a question.
What's the performance of your algorithm in practice? And how long does it take to do
the analysis?
>> Yes. So, at the moment--for example, on PubMed there were several hundred thousand
articles--depending on the algorithm, they have different running times, up to about a
week. But if you have precomputed these in advance, then obviously it's way faster. If
you just have a new document, then it's a matter of a few seconds.
>> So, you take [INDISTINCT] >> Yes. Yes.
>> I'm curious--instead of a [INDISTINCT], what applications have you tried to use this
for--not so much [INDISTINCT], but where you're trying to determine how much is new in
the paper?
>> Yes. >> Could you repeat the question for...
>> So, if I understood the question correctly, the question is whether you could use
this technique to identify what's worth reading in the paper you got, or whether it's
just something that has been published before, maybe even by the same author?
>> Exactly. >> Yes. So, I think that is a really good question, because if you use this
technique, you could, for example, highlight the parts of a document that are really
new. So, you would save a lot of time for the reader, and he could focus on what's
really relevant and worth reading, and not just re-read all the related work that you
have read somewhere else before. So you can use it for recommendation systems. That's
definitely a good way to use it.
>> Okay. >> Okay.
>> Thank you very much. >> Thank you.