Hi,

I have thought a lot about this question in the past, and my answer is:
yes, you can apply statistical formulas. But you should know each field
of your records well: what kind of information it can contain, and
whether you can define rules for it that can be applied to individual
records. Some important factors:

- the "completeness" of the records: the ratio of filled and unfilled fields
- whether the value of an individual field matches the rules (say you
expect a number in the range of 1 to 5, but you get 6)
- the probability that a given field value could be unique
- the probability that a record is not a duplicate of another record
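
A minimal sketch in Python of how these four factors could be measured.
The record layout (plain dicts) and the field names "title", "rating"
and "lang" are assumptions made up for illustration, and completeness is
taken here as the share of filled fields among all expected fields:

```python
from collections import Counter

def is_filled(value):
    """A field counts as filled if it holds something other than blanks."""
    return value is not None and str(value).strip() != ""

def completeness(record, fields):
    """Share of the expected fields that are filled in this record."""
    return sum(1 for f in fields if is_filled(record.get(f))) / len(fields)

def in_range(value, low=1, high=5):
    """Rule check: the value must be an integer between low and high."""
    try:
        return low <= int(value) <= high
    except (TypeError, ValueError):
        return False

def uniqueness(records, field):
    """Distinct values divided by filled values for one field."""
    values = [r[field] for r in records if is_filled(r.get(field))]
    return len(set(values)) / len(values) if values else 0.0

def duplicate_ratio(records, fields):
    """Share of records that are exact copies of an earlier record."""
    keys = Counter(tuple(r.get(f) for f in fields) for r in records)
    extra_copies = sum(n - 1 for n in keys.values())
    return extra_copies / len(records) if records else 0.0

# Hypothetical sample data
fields = ["title", "rating", "lang"]
records = [
    {"title": "A", "rating": 3, "lang": "en"},
    {"title": "A", "rating": 3, "lang": "en"},
    {"title": "B", "rating": 6, "lang": ""},
]
```

Exact matching is only the simplest way to spot duplicates; real data
usually needs some normalization of the values first.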

Some concrete examples from my Europeana past:
- there are mandatory fields, and if they are empty, the quality goes down
- there are fields which should match a known standard, for example
ISO language codes - you can apply rules to decide whether the value
fits or not
- the "data provider" field is free text, with no formal rule, but no
individual record should contain a unique value, and when you import
several thousand new records, they should not contain more than a
couple of new values
- there are fields which should contain URLs, emails or dates; we
can check whether they fit formal rules and whether their content is in
a reasonable range (we should not have a record created in the future,
for example)
- you can measure whether the optional fields are filled in, and in what ratio
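
Some of these checks could be sketched in Python roughly as follows.
This is not the code used at Europeana: the field name "dataProvider",
the tiny language list and the YYYY-MM-DD date format are assumptions
chosen to keep the example short.

```python
from datetime import date
from urllib.parse import urlparse

# Tiny sample of ISO 639-1 codes, for illustration only; a real check
# would load the complete list.
KNOWN_LANGUAGES = {"de", "en", "fr", "hu", "nl"}

def valid_language(value):
    """Does the value match one of the known language codes?"""
    return value.strip().lower() in KNOWN_LANGUAGES

def valid_url(value):
    """Formal rule: an http(s) URL with a host part."""
    try:
        parts = urlparse(value.strip())
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def plausible_date(value, today=None):
    """A YYYY-MM-DD date that parses and does not lie in the future."""
    today = today or date.today()
    try:
        return date.fromisoformat(value.strip()) <= today
    except ValueError:
        return False

def new_values(batch, known, field="dataProvider"):
    """Values of `field` in an imported batch that were not seen before."""
    return {r[field] for r in batch if r.get(field)} - set(known)
```

If new_values() returns more than a handful of entries for a batch of
several thousand records, the free-text field probably needs a closer
look.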

In the end you will have a couple of measurements, and you can apply
weighting to calculate a final classification number.
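
One simple way to do the weighting is a weighted mean, sketched below in
Python. It assumes every measurement has already been scaled to the
range 0 to 1; the measurement names and the weights are invented for
the example and would have to be tuned for a real collection.

```python
def weighted_score(measurements, weights):
    """Weighted mean of measurements that are each scaled to 0..1."""
    total = sum(weights[name] for name in measurements)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(measurements[name] * weights[name]
               for name in measurements) / total

# Hypothetical measurements for one record, with hypothetical weights
measurements = {"mandatory": 1.0, "valid_language": 0.0, "optional": 0.5}
weights = {"mandatory": 3, "valid_language": 2, "optional": 1}

score = weighted_score(measurements, weights)  # (3 + 0 + 0.5) / 6
```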

You can do a lot to set up rules with faceted search, and of course
you can use statistical tools such as R or Julia, which help to get a
picture of the distribution of the values.
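
If no search engine is at hand, a facet-like value count is easy to
imitate. A small Python sketch, assuming the records are plain dicts
and using a made-up "language" field:

```python
from collections import Counter

def facet(records, field, top=10):
    """Most frequent values of one field, like a facet in a search UI."""
    counts = Counter(r.get(field) or "(empty)" for r in records)
    return counts.most_common(top)

# Hypothetical sample data
sample = [{"language": "en"}, {"language": "en"}, {"language": "de"}, {}]
distribution = facet(sample, "language")
```

Rare values at the bottom of such a list are often typos or values that
break the rules, so they are a good place to start looking.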

Hope it helps.

Regards,
Péter

-- 
Péter Király
software developer

Göttingen Society for Scientific Data Processing - http://gwdg.de
eXtensible Catalog - http://eXtensibleCatalog.org