On Jan 23, 2015, at 9:44 AM, Mark A. Matienzo wrote:
> I believe Turnitin and SafeAssign both compare the text of submissions
> against external sources (e.g., SafeAssign uses ABI/INFORM, among others).
> I am not certain if they compare submissions against each other.
My understanding of TurnItIn, at least initially, was that they
built their corpus on existing submissions.
(They had some deals with universities back when they started up
to use their service for free or cheap, so that they could build
up their corpus.)
> However, if you're looking for something along the lines of what Dre
> suggests, you could use ssdeep, which is an implementation of a piecewise
> hashing algorithm. The issue with that is that you would have to assume
> all students would probably be using the same file format.
> You could also use something like Tika to extract the text content from
> all the submissions, and then compare them against each other.
I'd agree on extracting the text. MS Word used to store documents
as strings of edits, making it difficult to compare two
documents for similarity without parsing the format.
(I don't know if they still do this in .docx.)
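
Once the text is extracted, the pairwise comparison itself can be fairly
small. Here is a rough sketch in Python using only the standard library's
difflib. It assumes the plain text of each submission has already been
pulled out (by Tika or anything else) into a dict keyed by student. The
function names and the 0.8 threshold are my own placeholders, not anything
Turnitin, SafeAssign, or ssdeep actually does.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a 0.0-1.0 similarity ratio between two normalized texts."""
    # autojunk=False turns off difflib's heuristic that ignores very common
    # characters in long sequences, which can distort ratios for prose.
    matcher = SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()


def flag_similar(submissions, threshold=0.8):
    """Compare every pair of submissions; return pairs at or above threshold.

    `submissions` maps a (hypothetical) student identifier to extracted text.
    The threshold is an arbitrary starting point and would need tuning.
    """
    flagged = []
    for (name_a, text_a), (name_b, text_b) in combinations(
        sorted(submissions.items()), 2
    ):
        score = similarity(text_a, text_b)
        if score >= threshold:
            flagged.append((name_a, name_b, score))
    # Most similar pairs first.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    sample = {
        "alice": "The quick brown fox jumps over the lazy dog.",
        "bob": "The quick  brown fox jumps over the LAZY dog.",
        "carol": "Completely unrelated essay about library cataloguing.",
    }
    for name_a, name_b, score in flag_similar(sample):
        print(f"{name_a} / {name_b}: {score:.2f}")
```

This compares every pair character by character, so it gets slow with many
or very long submissions. For a single class it is probably fine; beyond
that, a fuzzy hash or shingling approach would likely scale better.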