You may want to let people simply give you the robots.txt file, which references the sitemap.  I also register our site's sitemaps individually with the big search engines, but I found that very large sitemaps aren't processed very well.  So for our site I think I limited the number of items per sitemap to 40,000, which results in ten sitemaps for the digital objects and an additional sitemap for all the collections.

Or else perhaps provide more boxes on the form, so we can include all the sitemaps used in our systems.
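
For anyone who wants to split a large sitemap the same way, here is a minimal sketch of the idea, not the code our site actually runs. It assumes you already have the item URLs in a list; the function names, the example.org URLs, and the file naming are made up for illustration. The 40,000 cap is the figure mentioned above (the sitemap protocol itself allows up to 50,000 URLs per file).

```python
from typing import Iterator, Sequence
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Cap mentioned in this message; the protocol's own limit is 50,000 URLs per file.
MAX_URLS_PER_SITEMAP = 40_000


def chunked(urls: Sequence[str], size: int = MAX_URLS_PER_SITEMAP) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `urls`, each holding at most `size` entries."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap(urls: Sequence[str]) -> str:
    """Return a <urlset> document listing the given URLs (XML-escaped)."""
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n{entries}\n</urlset>\n'
    )


def build_sitemap_index(sitemap_urls: Sequence[str]) -> str:
    """Return a <sitemapindex> document pointing at the individual sitemap files."""
    entries = "\n".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}\n</sitemapindex>\n'
    )


if __name__ == "__main__":
    # Hypothetical item URLs; a real site would pull these from its database.
    item_urls = [f"https://example.org/item/{i}" for i in range(100_000)]
    parts = [build_sitemap(part) for part in chunked(item_urls)]
    # Hypothetical public locations of the generated files.
    locations = [f"https://example.org/sitemap-{n}.xml" for n in range(1, len(parts) + 1)]
    index_xml = build_sitemap_index(locations)
    print(f"{len(parts)} sitemap files plus one index")
```

A single "Sitemap:" line in robots.txt pointing at the index file would then be enough for a crawler to find every part, which is one reason accepting a robots.txt URL could be simpler than adding more boxes.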



Mark V Sullivan
Digital Development and Web Coordinator
Technology and Support Services
University of Florida Libraries
352-273-2907 (office)
352-682-9692 (mobile)
[log in to unmask]

From: Code for Libraries [[log in to unmask]] on behalf of Jason Ronallo [[log in to unmask]]
Sent: Friday, February 01, 2013 11:14 AM
To: [log in to unmask]
Subject: [CODE4LIB] digital collections sitemaps


I've seen registries for digital collections that make their metadata
available through OAI-PMH, but I have yet to see a listing of digital
collections that just make their resources available on the Web the
way the Web works [1]. Sitemaps are the main mechanism for listing Web
resources for automated crawlers [2]. Knowing about all of these
various sitemaps could have many uses for research and improving the
discoverability of digital collections on the open Web [3].

So I thought I'd put up a quick form to start collecting digital
collections sitemaps. There is one required field, for the sitemap itself.
Please take a few seconds to add any digital collections sitemaps you
know about--they don't necessarily have to be yours.
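
If the form were also to accept the text of a robots.txt file, as suggested in the reply at the top of this thread, the sitemap locations could be pulled out of it roughly as follows. This is only a sketch under assumptions: the function name and the sample robots.txt content are hypothetical, and it simply looks for "Sitemap:" lines without fetching anything.

```python
from typing import List


def sitemap_urls_from_robots(robots_txt: str) -> List[str]:
    """Collect the values of all 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments; this assumes sitemap URLs carry no '#' fragment.
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first colon only, so the colon in 'https://' stays in the value.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


if __name__ == "__main__":
    # Hypothetical robots.txt content for illustration.
    sample = (
        "User-agent: *\n"
        "Disallow: /private/\n"
        "Sitemap: https://example.org/sitemap-index.xml\n"
    )
    for url in sitemap_urls_from_robots(sample):
        print(url)
```

The field name is matched case-insensitively and independently of any User-agent group, since a Sitemap line may appear anywhere in the file.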

At this point I'll make the data available to anyone who asks for it.

Thank you,


[1] At least I don't recall seeing such a sitemap registry site or
service. If you know of an existing registry of digital collections
sitemaps, please let me know about it!
[2] For more information on robots see
[3] For instance you can see how I've started to investigate whether
digital collections are being crawled by the Common Crawl: