You might want to check out the BASE system from Bielefeld, Germany (http://www.base-search.net), which has access to a lot of OA sources and has implemented its own DDC classification system (metadata + full text), trained on a semi-automatically generated corpus covering all disciplines (it might also be useful as a benchmark). They reported last year that ~1.4% of their OA metadata carried a manually assigned subject classification.
On 29.06.2012 at 15:29, Arash.Joorabchi wrote:
> Hi all,
> I am working on developing a software system designed to analyze the
> content of research documents (e.g., research papers, articles, etc.)
> archived in scientific repositories (e.g., http://citeseerx.ist.psu.edu/,
> http://arxiv.org/) and automatically
> classify them according to FAST and DDC. In order to objectively qualify
> the performance of the system, a collection of research documents which
> have been manually classified according to the DDC and assigned
> FAST subject headings would be required. I was wondering if anyone is
> aware of such a dataset existing online.