The code used for overlap detection within the arXiv corpus (see [1], which
significantly extended earlier work [2]) does matching based on a sliding
window of hashed 7-word sequences on extracted ASCII text. That is perhaps
more than required for the case in question, but this approach scales to a
corpus of 1M articles. I'm afraid it is a little finicky to compile on
current systems given all the changes in C++ library organization since I
wrote it in 2005/2009. I'm working off-and-on to tidy it up but haven't got
there yet... So, FWIW, code at:

https://github.com/zimeon/docsim

Cheers,
Simeon

[1] http://arxiv.org/abs/1412.2716
[2] http://arxiv.org/abs/cs/0702012

On 1/23/15 9:44 AM, Mark A. Matienzo wrote:
> I believe Turnitin and SafeAssign both compare the text of submissions
> against external sources (e.g., SafeAssign uses ABI/INFORM, among others).
> I am not certain if they compare submissions against each other.
>
> However, if you're looking for something along the lines of what Dre
> suggests, you could use ssdeep, which is an implementation of a piecewise
> hashing algorithm [0]. The issue with that is that you would have to assume
> that all students would probably be using the same file format.
>
> You could also use something like Tika [1] to extract the text content from
> all the submissions, and then compare them against each other.
>
> [0] http://ssdeep.sourceforge.net/
> [1] http://tika.apache.org/
>
> Mark
>
> --
> Mark A. Matienzo <[log in to unmask]>
> Director of Technology, Digital Public Library of America
>
> On Fri, Jan 23, 2015 at 8:47 AM, Andreas Orphanides <[log in to unmask]>
> wrote:
>
>> My first thought was something like programmatically doing a pairwise diff
>> of the files, 5500 times. I was surprised I couldn't find a utility that
>> just does this.
>>
>> But I did find something called diffuse [1], that allows you to graphically
>> compare any number of text files in a diff-like fashion.
>> This would probably at least be able to tell you which files need closer
>> scrutiny.
>>
>> I think you'd presumably have to be able to extract the text from each
>> file; I doubt it would work on raw Word docs or PDFs, so that might be a
>> stopper.
>>
>> It seems like the realm of source control has a lot of software designed to
>> help with this problem, so there might be other similar things out there.
>> But probably not anything designed to natively handle print-ready files.
>>
>> -dre.
>>
>> [1] http://diffuse.sourceforge.net/about.html
>>
>> On Fri, Jan 23, 2015 at 7:26 AM, Judy Meirose <[log in to unmask]> wrote:
>>
>>> Can anyone recommend a plagiarism checking software besides Turnitin and
>>> SafeAssign? I need to compare about 100 student assignments against each
>>> other to make sure they don't copy each other's assignments.
>>>
>>> Thanks.
>>>
>>> Judy K. Meirose
>>> Systems Librarian
>>> Florida Coastal School of Law
>>> 8787 Baypine Rd
>>> Jacksonville, FL
>>> (904)680-7603
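
P.S. For a set of ~100 assignments, the idea at the top of this message (a
sliding window of hashed 7-word sequences over extracted text) can be sketched
in a few lines of Python. This is only an illustration and not the docsim
code: the function names, the use of MD5 truncated to 16 hex digits, the
"fraction of the smaller document" score, and the 0.2 threshold are all
assumptions made for the example. It also assumes the plain text has already
been extracted from each file (e.g., with something like Tika).

```python
import hashlib
import re
from itertools import combinations


def shingle_hashes(text, n=7):
    """Return the set of hashes of every n-word window in the text."""
    # Lowercase and keep only ASCII letters/digits, so that punctuation and
    # capitalisation do not hide an otherwise identical passage.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {
        # Hash choice is an assumption for this sketch, not docsim's scheme.
        hashlib.md5(" ".join(words[i:i + n]).encode("ascii")).hexdigest()[:16]
        for i in range(len(words) - n + 1)
    }


def overlap(a, b):
    """Fraction of the smaller document's windows that also occur in the other."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def flag_pairs(docs, threshold=0.2, n=7):
    """Compare every pair in docs (name -> text); return pairs at or above threshold."""
    hashes = {name: shingle_hashes(text, n) for name, text in docs.items()}
    flagged = []
    for x, y in combinations(sorted(hashes), 2):
        score = overlap(hashes[x], hashes[y])
        if score >= threshold:
            flagged.append((x, y, score))
    return flagged
```

Calling flag_pairs() on a dict mapping filename to extracted text returns the
pairs worth a closer look by eye. Hashing each document once and comparing sets
keeps the all-pairs step cheap at this size; a corpus of 1M articles would need
an index over the hashes rather than pairwise comparison.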