LISTSERV 16.5 - CODE4LIB Archives

The code used for overlap detection within the arXiv corpus (see [1] 
which significantly extended earlier work [2]) does a matching based on 
a sliding window of hashed 7-word sequences on extracted ASCII text. 
Perhaps more than required for the case in question, but this approach 
scales to a corpus of 1M articles. I'm afraid it is a little finicky to 
compile on current systems given all the changes in C++ library 
organization since I wrote it in 2005/2009. I'm working off-and-on to 
tidy it up but haven't got there yet... So, FWIW, code at:

https://github.com/zimeon/docsim
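
To make the idea concrete, here is a minimal Python sketch of the sliding-window approach described above. It is not the docsim code: the function names, the SHA-1 hash truncated to 64 bits, the overlap score, and the 0.2 threshold are all assumptions made for illustration. It also assumes the text has already been extracted from each submission.

```python
import hashlib
import re
from itertools import combinations


def shingle_hashes(text, k=7):
    """Return the set of hashes of every k-word window in the text."""
    # Lowercase and keep only ASCII letters/digits, so punctuation and
    # whitespace differences do not affect matching.
    words = re.findall(r"[a-z0-9]+", text.lower())
    hashes = set()
    for i in range(len(words) - k + 1):
        window = " ".join(words[i:i + k])
        digest = hashlib.sha1(window.encode("ascii")).digest()
        # Keep 64 bits of the digest (an arbitrary choice for this sketch).
        hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def overlap(a, b):
    """Fraction of the smaller document's windows also found in the other."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def suspicious_pairs(docs, threshold=0.2, k=7):
    """Compare all documents pairwise; docs maps a name to extracted text."""
    hashed = {name: shingle_hashes(text, k) for name, text in docs.items()}
    pairs = []
    for x, y in combinations(sorted(hashed), 2):
        score = overlap(hashed[x], hashed[y])
        if score >= threshold:
            pairs.append((x, y, score))
    return pairs
```

For about 100 assignments the all-pairs loop is cheap. A much larger corpus would probably want an index from hash to documents instead of comparing every pair.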

Cheers,
Simeon

[1] http://arxiv.org/abs/1412.2716
[2] http://arxiv.org/abs/cs/0702012

On 1/23/15 9:44 AM, Mark A. Matienzo wrote:
> I believe Turnitin and SafeAssign both compare the text of submissions
> against external sources (e.g., SafeAssign uses ABI/INFORM, among others).
> I am not certain if they compare submissions against each other.
>
> However, if you're looking for something along the lines of what Dre
> suggests, you could use ssdeep, which is an implementation of a piecewise
> hashing algorithm [0]. The issue with that is that you would have to assume
> all students would probably be using the same file format.
>
> You could also use something like Tika [1] to extract the text content from
> all the submissions, and then compare them against each other.
>
> [0] http://ssdeep.sourceforge.net/
> [1] http://tika.apache.org/
>
> Mark
>
> --
> Mark A. Matienzo
> Director of Technology, Digital Public Library of America
>
> On Fri, Jan 23, 2015 at 8:47 AM, Andreas Orphanides wrote:
>
>> My first thought was something like programmatically doing a pairwise diff
>> of the files, 5500 times. I was surprised I couldn't find a utility that
>> just does this.
>>
>> But I did find something called diffuse [1], that allows you to graphically
>> compare any number of text files in a diff-like fashion. This would
>> probably at least be able to tell you which files need closer scrutiny.
>>
>> I think you'd presumably have to be able to extract the text from each
>> file; I doubt it would work on raw Word docs or PDFs, so that might be a
>> stopper.
>>
>> It seems like the realm of source control has a lot of software designed to
>> help with this problem, so there might be other similar things out there.
>> But probably not anything designed to natively handle print-ready files.
>>
>> -dre.
>>
>>
>> [1] http://diffuse.sourceforge.net/about.html
>>
>> On Fri, Jan 23, 2015 at 7:26 AM, Judy Meirose wrote:
>>
>>> Can anyone recommend a plagiarism checking software besides Turnitin and
>>> SafeAssign?  I need to compare about 100 student assignments against each
>>> other to make sure they don't copy each other's assignments.
>>>
>>> Thanks.
>>>
>>> Judy K. Meirose
>>> Systems Librarian
>>> Florida Coastal School of Law
>>> 8787 Baypine Rd
>>> Jacksonville, FL
>>> (904)680-7603
>>>
>>> This email transmission, and any documents, files or previous e-mail
>>> messages attached to it, may contain confidential, privileged and/or
>>> proprietary information for the sole use of the intended recipient(s). If
>>> you are not an intended recipient or a person responsible for delivering
>>> it to an intended recipient, any disclosure, copying, distribution or use of
>>> any of the information contained in or attached to this transmission is
>>> strictly prohibited. If you have received this transmission in error,
>>> please: (1) immediately notify me by reply e-mail; and (2) destroy the
>>> original (and any copies) of this transmission and its attachments
>>> without reading or saving in any manner.
>>>
>>