>> GIPP: Yes. Hello. Welcome. First, I'd like to mention that I'm here with my colleague
and co-author Norman Meuschke. He's over there. Let's start with the outline. The paper
is about a new plagiarism detection system. First, I want to talk about the motivation,
why we have developed this new approach. Then I will talk about the approach, how we try
to identify plagiarism, and the algorithms we have developed to identify it. I will give
a short overview of our prototype, it's called CitePlag. And, of course, if you build such
a system, you want to evaluate it and show whether it's actually better than the existing
systems, so we'll talk about that, and then I'll give an outlook. So, the currently
available plagiarism detection systems
have the problem that they are solely text-based. Professor Dr. Debora Weber-Wulff runs
evaluations of the existing systems, and her conclusion is that plagiarism detection
systems find copies but not plagiarism. The reason for that is that they compare the
text of the documents and try to find identical patterns, like n-grams, so words that
follow in the same sequence, and similar features. But if you paraphrase
something, or if you translate something, then the existing systems hardly find anything.
And, I mean, usually, if a scientist plagiarizes something, then he will make a little bit
of an effort and maybe try to change some words. So, for that reason, it's very, very
difficult to identify plagiarized documents. Of course, a human examiner who knows the
field might find it, but there are plenty of studies that have shown that even in
databases like arXiv, and even in well-known journals, the people who peer-reviewed the
articles before publication have not identified these plagiarized cases. Later, I will
talk about the doctoral thesis of the former German Defense Minister, and he was
caught--some people already laughed, yeah. He stepped down, but it took several years
before people actually found out. I mean, it wasn't just his thesis; he even published it
as a book. But still, it took a long time before he was identified as someone who
plagiarized. So, now, I'd like to give a brief overview of
existing systems. In general, you can distinguish the available systems into local
similarity assessment and global similarity assessment. You can see that here: the
left part [INDISTINCT] local similarity assessment, and the right part is the global.
That's it, the main... >> Sir, can you come back to the microphone?
>> Oh. Sorry, yes. Of course, for the recording. So, the existing systems are based on
text comparison, on string comparison, and keyword analysis. For example, one way these
similarities are calculated is just by performing a statistical analysis of the
frequencies and occurrences of certain keywords. Another very popular method is to
create fingerprints of documents. And another method that is used, though its
performance is not the best, is to analyze the use of certain words, grammatical
constructs, punctuation, and so on, so that you can identify whether the style is
consistent within one document or whether maybe different people have worked on it,
although that's not really helpful for identifying plagiarism in the context of
scientific articles. I'd like to talk about
fingerprinting a little bit more, because that's the most commonly used technique. You
see an example on the right side--I hope the font size is large enough. This is a
sentence; you can create a fingerprint from this sentence, that is, compute a hash
value, store it in a database, and then just compare these hash values in the large
database. In this way, you can see whether a sentence or a paragraph has been used
before somewhere else. And if you find a collision, so if you find two identical
sentences in the same bucket of such a hash table, then it might be worth looking at it
more closely to see whether this is an identical sentence or paragraph.
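The fingerprinting idea described here can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's implementation; the normalization step and the choice of SHA-1 are assumptions for the example:

```python
import hashlib

def fingerprint(sentence: str) -> str:
    """Normalize a sentence and return a hash value as its fingerprint."""
    normalized = " ".join(sentence.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# Build a lookup table of fingerprints for a small reference corpus.
corpus = ["The quick brown fox jumps over the lazy dog."]
table = {fingerprint(s): s for s in corpus}

# An identical sentence collides with a stored fingerprint -> candidate match.
print(fingerprint("The quick brown fox jumps over the lazy dog.") in table)  # True

# Altering a single word yields a completely different hash -> no match.
print(fingerprint("The quick brown fox leaps over the lazy dog.") in table)  # False
```

The second lookup demonstrates exactly the weakness discussed next: one changed word defeats the fingerprint.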
But this shows again the problem: if you create such a fingerprint and alter even a
single word, the hash value will be completely different, and you won't recognize that
the sentence may have been used in another source before. Now, I would like to talk
about the new approach we want to present. We call it citation-based plagiarism
detection. It might sound strange at first, because usually, if you plagiarize
something, you'd think you don't really use citations. But we have observed that when
people plagiarize, they do use citations, because if you want to publish something, you
need citations. Usually, people paraphrase the work but keep a lot of the citations that
were also in the original work, maybe altering them a little bit. So, in this
figure, you can see that document A cites documents C, D and E, and document B also has
references, or citations, to documents C, D and E, in a very similar order. Document B
has two additional citations, but otherwise it's pretty similar. So, if you just compare
the order of citations, you recognize a pattern. And the idea of citation-based
plagiarism detection is to look for these patterns in scientific documents and to check
them: well, there might be something wrong here, maybe something is plagiarized. So, the
suitability for plagiarism detection depends on certain characteristics.
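The basic pattern check from the figure can be sketched as follows: reduce each document to its in-text citation sequence, keep only the citations both documents share, and compare the resulting orders. The code is an illustrative assumption, not the authors' implementation; the citation labels come from the figure:

```python
def shared_citation_order(doc_a: list, doc_b: list) -> tuple:
    """Keep only the citations that appear in both documents,
    preserving the order in which each document cites them."""
    shared = set(doc_a) & set(doc_b)
    return ([c for c in doc_a if c in shared],
            [c for c in doc_b if c in shared])

# Citation sequences as in the figure: document B has two extra citations.
doc_a = ["C", "D", "E"]
doc_b = ["C", "X", "D", "E", "Y"]
a, b = shared_citation_order(doc_a, doc_b)
print(a == b)  # True: the shared citations appear in the same order
```

Whether such a match is meaningful depends on the characteristics discussed next.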
For example, the order of citations, or the textual proximity of citations: whether the
citations are in the same paragraph or maybe in the same sentence. Another factor we
take into account is the probability of co-occurrence. For example, if there are two
papers and both papers are very popular and have been cited thousands of times, of
course it's more likely that they are co-cited in other documents than if the two
documents have hardly been cited at all. Another factor that we consider is the citation
style. Depending on the citation style, the order of citations can change. You can have
citations five, three and four in one bracket, or you can just have "three to five". In
that case, citations five and four would not be right next to each other. Or, very often
in scientific documents, you have tables that contain multiple citations, maybe even
taken over from another source. And then we have some challenges in identifying these
patterns: transpositions, where the order of citations changes; scaling, where some
other citations appear in between the shared citations; or a completely different
alignment, where maybe a whole paragraph from the conclusion of document A is used in
the introduction of another document, and so on, so that it's just mixed up. In order to
identify these patterns,
we have developed some algorithms. They are called Longest Common Citation Sequence,
Greedy Citation Tiling and Citation Chunking. Bibliographic Coupling is an approach that
is already being used, not for the purpose of plagiarism detection, but for citation
analysis purposes. In this table, we distinguish whether they look for global similarity
or local similarity, and whether the citation order is preserved or not. And now, I'll
talk a little bit
about these algorithms. On the left, we see Bibliographic Coupling. Bibliographic
Coupling just describes, or measures, the number of citations shared between two
documents. In this case, we see that document A and document B both cite documents C and
D, so the coupling strength is two. In our analysis of the PubMed database, we have
found documents that share more than 20 citations. But of course, this alone doesn't
indicate or prove that something might be plagiarized; it's quite normal that
publications from a certain field share references.
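Bibliographic coupling as just described is simply the size of the intersection of two documents' reference sets. A minimal sketch in Python, with invented reference labels matching the slide's example:

```python
def coupling_strength(refs_a: set, refs_b: set) -> int:
    """Bibliographic coupling strength: the number of references
    that two documents have in common."""
    return len(refs_a & refs_b)

# Documents A and B both cite C and D, as in the example on the slide.
refs_a = {"C", "D", "F"}
refs_b = {"C", "D", "G"}
print(coupling_strength(refs_a, refs_b))  # 2
```

Note that, unlike the pattern-based algorithms, this measure ignores the order of citations entirely.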
On the right is the description of the Longest Common Citation Sequence. This just
examines whether two documents share citations in the same order. All the numbers are
citations, and the red numbers are the citations that are shared by both documents. And
in this case, three, four and five would be the longest sequence.
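The Longest Common Citation Sequence can be sketched as the standard longest-common-subsequence dynamic program applied to the two citation orders. This is an illustrative sketch with invented citation sequences, not the authors' code:

```python
def longest_common_citation_sequence(a: list, b: list) -> list:
    """Longest common subsequence of two documents' in-text citation orders."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack to recover one longest shared citation sequence.
    seq, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            seq.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return seq[::-1]

# Citations (by reference id) in the order each document cites them.
doc_a = [1, 3, 4, 5, 2]
doc_b = [3, 7, 4, 5, 8]
print(longest_common_citation_sequence(doc_a, doc_b))  # [3, 4, 5]
```

As in the slide's example, the shared citations 3, 4 and 5 form the longest sequence even though other citations are interleaved.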
Then, we have the algorithm Greedy Citation Tiling, and Greedy Citation Tiling aims to
identify all matching substrings of citations. I think it's pretty easy to understand,
so I won't talk too much about it. A bit more complex is Citation Chunking. With
Citation Chunking, we don't have a fixed length for the citation string we analyze;
instead, the size of a chunk depends on the citations that occur before and after it. If
there are similarities, they are included in the citation chunk. This way, we have an
algorithm that works quite well for local similarity--not so much for comparing
similarities between two documents in general, but on a very local basis, for example
within one paragraph or one section. I think there is not enough time to talk about the
algorithms in detail, but if there are questions later, I will come back to that. So, if
you develop a new plagiarism detection system, it's very
difficult to evaluate, because if you use an artificially created test set, it might not
be really realistic: there are very different methods to disguise plagiarism. The
simplest form is just copy and paste, or you paraphrase everything a little bit, or you
paraphrase it a lot. So, by creating an artificial test set, the results might not be
that realistic. Another possibility would be to use actually plagiarized documents, but
there you have the problem that you usually don't know where the plagiarism is in the
document. You don't have the ground truth; you don't know how many sections were
plagiarized and what the sources were. And if you use the available systems, then you
only find what you can find with the available systems anyway, so you will not find more
than that. Obviously, it would be best to have a real-world example that is not
artificially created, but it should also be well examined, so that you really know which
parts of the document have been plagiarized. So, I
will talk a little bit about our evaluation of the real plagiarism case of the former
German Defense Minister, Mr. Guttenberg. It was his doctoral thesis, I think around 350
or 400 pages. And I will talk briefly about our evaluation on the PubMed database; we
use an open access part of that. And here you see a little summary of the different test
collections that we considered using, with their advantages and disadvantages. The PAN
Corpus is an artificially created corpus which we would have loved to use, because it
was created for a conference to measure the performance of different plagiarism
detection tools, so it sounds perfect. The only problem is that all existing systems
only analyze the text of documents, so in this artificially created corpus, no citation
information is available. That's why we couldn't use it. So, now we'll come to the
thesis of Mr. Guttenberg.
What you can see on the left is the thesis with all its pages, and all the colored parts
show plagiarized passages. So you see, there's a lot in there, and from different
sources. This figure was derived from the findings of the GuttenPlag project, a
crowdsourcing project where hundreds of volunteers tried to find all the pieces that had
been stolen. So, we believe that it's quite comprehensive; it probably doesn't include
all the plagiarized parts, but at least it should be a very good data set to evaluate
different algorithms. What you see on the right is a figure that
shows how long it took to identify all these various plagiarized sections. What is quite
obvious is that in the beginning it might be quite easy to find a plagiarized section,
but it gets harder and harder. The first five percent of the plagiarism is pretty easy:
you just need to digitize the document--in this case only a book was available, but you
can run OCR on it--and then use the existing systems, and then you might find around
five percent. That's what we found with the software tool called Ephorus, which is
supposed to be one of the best plagiarism checkers. So, with this commercial tool we
found five percent, or even less, of the translated plagiarism, I have to say. Overall,
63% of the lines in this document are plagiarized, and 94% of the pages contain
plagiarism. This slide shows, on the left, the original source document, and underneath
you see Guttenberg's doctoral thesis. The same colors symbolize the same sources, or the
same citations. For this evaluation, we used the 16 sections in the thesis that have
been identified as plagiarized and that were all translated from other sources--so not
just German or English, but translated, for example, from English to German. And the
existing tools couldn't identify any of these translated sections. So, we wanted to know
how well would the citation-based
approach work to identify these sections. And, as you see, even without applying any
algorithms, there are certain patterns you can recognize. In the big graph at the bottom
right, you see--there are only two pages, but they contain, I don't know how many
citations, maybe fifteen, something like that. And you see that there are some
similarities. If you apply the algorithms I just presented and clean up the whole
sequence of citations, then the result is what you see at the very bottom right: the
order of citations is quite similar. And that's not a coincidence, obviously. So, if you
use these algorithms and see that there are more than three or four citations in the
same order, that might be a strong hint that a section has been plagiarized. The table
above shows some figures for the copy-and-paste plagiarism that was identified in the
thesis. If you use the existing software,
you have pretty good results. So, you find, like, 70% or more of the copy-and-paste
plagiarism sections. The citation-based approach is not suitable for identifying
copy-and-paste plagiarism, because obviously, if a section is very short, it contains
maybe one or two citations, or maybe not even a single citation, so it doesn't deliver
very good results. For the sections of disguised plagiarism, the existing tools
identified less than 10%, depending on how strongly it is disguised. In our case, it was
around 30%. Then, idea plagiarism is of course very, very hard to find, because if you
compare text, you might not find any words that are similar, or if you find them, there
may be a hundred words in between. So the existing systems are not really suitable for
identifying idea plagiarism. With the citation-based approach, you can find it
sometimes; it really depends on how someone plagiarizes. And even for a human examiner,
it's very hard to identify idea plagiarism; that's just the nature of it. But where the
citation-based approach delivered pretty good results was with the translated
plagiarism. Of these sixteen examples, you see that for some of them, for example on
page 224, there were only two citations, so you wouldn't be able to identify such a
section as plagiarized with the citation approach, because if you looked at all such
sections, you would just have way too many false positives. But if you set a threshold
of maybe three or four citations in the same order, then you get pretty useful results.
Are there any questions about this figure? Okay. This graph shows the results when we
ran the analysis on PubMed. I won't go into too much detail; I just want to show some
figures. What you see here: on the left, it's not a logarithmic scale, but on the right
it is. On the Y axis, you see the number of document pairs, and on the X axis you see
the coupling strength. I talked about this: the number of shared references between two
documents. And you see, for example, that if you look for a coupling strength of eight,
you still find quite a lot of documents. So, this alone is not really useful for
identifying plagiarism, because you would just have way too many false positives. That's
why it's important to combine it with the other algorithms, where you consider the order
of citations and patterns and so on. The strengths and weaknesses of the algorithms are
displayed in this table. So, what you
see is that the Citation Chunking algorithm delivers pretty good results for most cases
and most forms of plagiarism, Greedy Citation Tiling has some advantages in some cases,
and the same is true for the Longest Common Citation Sequence. In the following, I'd
like to talk about the software system that we have developed to identify plagiarism,
and the steps that are involved. In the first step, we just convert the PDF document to
an XML document and identify the citations. We use open source software for that, called
ParsCit, with some minor modifications. For example, in our case it's important that we
know in which sentence a citation occurs and where exactly it stands. So, we're not just
interested in extracting the bibliography of a document; we want to know where in the
full text the corresponding citations are located. Then we match them against our
database, do some article deduplication and author name disambiguation, then we run the
citation-based plagiarism detection algorithms and compare the entries in the database
to see whether we find any matching entries. And then we generate a report using the
citation-based approach. We also use text-based plagiarism detection software, because,
as I said, short sentences and so on can be better detected with text-based approaches.
Yes, in the very end they are combined and you get the results. So, I think my time is
nearly over. So, to conclude: the existing systems do a pretty good job for
copy-and-paste plagiarism, but they struggle to identify plagiarism that has been
disguised, for example if it has been paraphrased or translated, or idea plagiarism. We
see the strength of the citation-based approach in cases where the document has been
altered a lot, as in translations or strong paraphrasing. The citation-based approach
should not be seen as a substitute but rather as an extension of the existing
approaches, and, yes, we were pretty happy with the results. We hope to implement this
on a larger scale, establish a larger database with citation information, and see where
we can apply it in the future. Maybe I'll talk about the outlook a little bit.
Something else we want to include is consideration of the argumentative structure: that
we analyze keywords, for example that someone said "Hamilton argued something" whereas
someone else said something else, so that you include these words and compare not just
the pattern of citations but also how these citations stand in relation to each other,
in what kind of context. And last but not least, we don't only want to look for
plagiarism, because in many cases it can be interesting to see, if you read a paper,
what other papers have been read by the author but not been cited, which does not mean
that it's plagiarism. By putting the threshold a bit lower, you can identify documents
that have maybe been read, for example, sources that have been used for the author's own
work. So, it can give you some ideas for further reading that might be interesting,
since this reading was at least of interest to the author. Yes, that's it. Thank you.
Any questions? >> Okay. We're running a little bit behind
time but we have time for one, possibly two questions. Okay. I'm going to ask you a question.
What's the performance of your algorithm in practice? And how long does it take to do
the analysis?
>> Yes. So, at the moment--for example, on PubMed there were several hundred thousand
articles--depending on the algorithm, they have different running times, up to about a
week. But if you have precomputed these in advance, then obviously it's way faster. If
you just have a new document, then it's a matter of a few seconds.
>> So, you take [INDISTINCT] >> Yes. Yes.
>> I'm curious--instead of a [INDISTINCT], what applications have you tried to use this
for--not so much [INDISTINCT], but where you're trying to determine how much is new in
the paper?
>> Yes. >> Could you repeat the question for...
>> So, if I understood the question correctly, the question is whether you could use
this technique to identify what's worth reading in the paper you got, or whether it's
just something that has been published before, maybe even by the same author?
>> Exactly. >> Yes. So, I think that is a really good question, because if you use this
technique, you could, for example, highlight the parts of a document that are really
new. So, you would save a lot of time for the reader, and he could focus on what's
really relevant and worth reading, and not just re-read all the related work that you
have read somewhere else before. So you can use it for recommendation systems. That's
definitely a good way to use it.
>> Okay. >> Okay.
>> Thank you very much. >> Thank you.