The code used for overlap detection within the arXiv corpus (see 
which significantly extended earlier work ) does a matching based on
a sliding window of hashed 7-word sequences on extracted ASCII text.
Perhaps more than required for the case in question, but this approach
scales to a corpus of 1M articles. I'm afraid it is a little finicky to
compile on current systems given all the changes in C++ library
organization since I wrote it in 2005/2009. I'm working off-and-on to
tidy it up but haven't got there yet... So, FWIW, code at:
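
For anyone who just wants the gist of the sliding-window idea, a minimal Python sketch follows. It is not the arXiv code (that is C++): the function names `shingle_hashes` and `overlap_fraction`, the lowercase/alphanumeric normalization, and the truncated SHA-256 fingerprint are all assumptions made for illustration.

```python
import hashlib
import re


def shingle_hashes(text, n=7):
    """Return the set of fingerprints of every n-word window in text."""
    # Assumed normalization: lowercase, keep only ASCII letters and digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    hashes = set()
    # Slide a window of n consecutive words across the text.
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        # Truncated SHA-256 is used only as a compact fingerprint (assumption).
        digest = hashlib.sha256(window.encode("ascii")).hexdigest()
        hashes.add(digest[:16])
    return hashes


def overlap_fraction(text_a, text_b, n=7):
    """Fraction of text_a's n-word windows that also occur in text_b."""
    a = shingle_hashes(text_a, n)
    b = shingle_hashes(text_b, n)
    if not a:
        # Fewer than n words: nothing to compare.
        return 0.0
    return len(a & b) / len(a)
```

Because each document is reduced to a set of fingerprints, the sets can be indexed once and looked up, rather than comparing every pair of full texts; that is the property that lets this kind of approach scale.
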
On 1/23/15 9:44 AM, Mark A. Matienzo wrote:
> I believe Turnitin and SafeAssign both compare the text of submissions
> against external sources (e.g., SafeAssign uses ABI/INFORM, among others).
> I am not certain if they compare submissions against each other.
> However, if you're looking for something along the lines of what Dre
> suggests, you could use ssdeep, which is an implementation of a piecewise
> hashing algorithm. The issue with that is that you would have to assume that
> all students are using the same file format.
> You could also use something like Tika to extract the text content from
> all the submissions, and then compare them against each other.
>  http://ssdeep.sourceforge.net/
>  http://tika.apache.org/
> Mark A. Matienzo
> Director of Technology, Digital Public Library of America
> On Fri, Jan 23, 2015 at 8:47 AM, Andreas Orphanides wrote:
>> My first thought was something like programmatically doing a pairwise diff
>> of the files, 4,950 times. I was surprised I couldn't find a utility that
>> just does this.
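
A brute-force pairwise comparison of this kind is easy to sketch with the Python standard library. For 100 files there are 100 × 99 / 2 = 4,950 unordered pairs. The sketch below assumes the text has already been extracted from each file; the names `pairwise_similarity` and `suspicious_pairs`, the file names, and the 0.5 threshold are hypothetical choices for illustration, not a recommendation.

```python
import difflib
import itertools


def pairwise_similarity(texts):
    """Yield (name_a, name_b, ratio) for every unordered pair of texts.

    texts maps a (hypothetical) file name to its already-extracted text.
    """
    for (name_a, text_a), (name_b, text_b) in itertools.combinations(
            sorted(texts.items()), 2):
        # Compare word lists rather than characters to keep the cost down.
        matcher = difflib.SequenceMatcher(
            None, text_a.split(), text_b.split(), autojunk=False)
        yield name_a, name_b, matcher.ratio()


def suspicious_pairs(texts, threshold=0.5):
    """Return pairs whose similarity ratio meets the threshold, highest first."""
    hits = [p for p in pairwise_similarity(texts) if p[2] >= threshold]
    return sorted(hits, key=lambda p: p[2], reverse=True)
```

The ratio is only a rough signal: a high score flags a pair for closer reading, and the threshold would need tuning against real assignments.
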
>> But I did find something called diffuse, which allows you to graphically
>> compare any number of text files in a diff-like fashion. This would
>> probably at least be able to tell you which files need closer scrutiny.
>> I think you'd presumably have to be able to extract the text from each
>> file; I doubt it would work on raw Word docs or PDFs, so that might be a
>> It seems like the realm of source control has a lot of software designed to
>> help with this problem, so there might be other similar things out there.
>> But probably not anything designed to natively handle print-ready files.
>>  http://diffuse.sourceforge.net/about.html
>> On Fri, Jan 23, 2015 at 7:26 AM, Judy Meirose wrote:
>>> Can anyone recommend a plagiarism checking software besides Turnitin and
>>> SafeAssign? I need to compare about 100 student assignments against each
>>> other to make sure they don't copy each other's assignments.
>>> Judy K. Meirose
>>> Systems Librarian
>>> Florida Coastal School of Law
>>> 8787 Baypine Rd
>>> Jacksonville, FL