Hi Mark,

Actually, the sitemaps.org protocol allows a sitemap index file to
reference multiple child sitemaps:
http://www.sitemaps.org/protocol.html#index

Which is what we did at my former employer:
http://digitalcollections.library.gsu.edu/sitemap/sitemap.xml

And thus the robots.txt only includes a single sitemap:
http://digitalcollections.library.gsu.edu/robots.txt
When we add extra collections, they just go into the sitemap.xml, so we are
not continuously updating the robots.txt.
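
For anyone who wants to try the same setup, here is a rough sketch of
generating a sitemap index and the single robots.txt line that points at
it. The element names and namespace come from the sitemaps.org protocol;
the example.org URLs and the function names are made up for illustration
and are not what we actually ran.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(child_sitemap_urls):
    """Return a sitemap index (as a string) listing each child sitemap."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in child_sitemap_urls:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = url
    return ET.tostring(root, encoding="unicode")


def robots_txt_sitemap_line(index_url):
    """Return the one robots.txt line that references the index."""
    return "Sitemap: " + index_url


# Hypothetical URLs; adding a collection only means adding an entry here.
children = [
    "http://example.org/sitemap/collection1.xml",
    "http://example.org/sitemap/collection2.xml",
]
index_xml = build_sitemap_index(children)
robots_line = robots_txt_sitemap_line("http://example.org/sitemap/sitemap.xml")
```

The robots.txt line stays the same no matter how many child sitemaps the
index lists, which is the whole point.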

Chad



On Fri, Feb 1, 2013 at 11:33 AM, Sullivan, Mark V <[log in to unmask]> wrote:

> Jason,
>
> You may want to allow people just to give you the robots.txt file which
> references the sitemap.  I also register the sitemaps individually with the
> big search engines for our site, but I found that very large sitemaps
> aren't processed very well.  So, for our site I think I limited the number
> of items per sitemap to 40,000, which results in ten sitemaps for the
> digital objects and an additional sitemap for all the collections.
>
> http://ufdc.ufl.edu/robots.txt
>
> Or else perhaps give more boxes, so we can include all the sitemaps
> utilized in our systems.
>
> Cheers!
>
> Mark
>
>
> Mark V Sullivan
> Digital Development and Web Coordinator
> Technology and Support Services
> University of Florida Libraries
> 352-273-2907 (office)
> 352-682-9692 (mobile)
> [log in to unmask]
>
>
>
> ________________________________________
> From: Code for Libraries [[log in to unmask]] on behalf of Jason
> Ronallo [[log in to unmask]]
> Sent: Friday, February 01, 2013 11:14 AM
> To: [log in to unmask]
> Subject: [CODE4LIB] digital collections sitemaps
>
> Hi,
>
> I've seen registries for digital collections that make their metadata
> available through OAI-PMH, but I have yet to see a listing of digital
> collections that just make their resources available on the Web the
> way the Web works [1]. Sitemaps are the main mechanism for listing Web
> resources for automated crawlers [2]. Knowing about all of these
> various sitemaps could have many uses for research and improving the
> discoverability of digital collections on the open Web [3].
>
> So I thought I'd put up a quick form to start collecting digital
> collections sitemaps. The only required field is the sitemap itself.
> Please take a few seconds to add any digital collections sitemaps you
> know about--they don't necessarily have to be yours.
>
>
> https://docs.google.com/spreadsheet/viewform?formkey=dE1JMDRIcXJMSzJ0YVlRaWdtVnhLcmc6MQ#gid=0
>
> At this point I'll make the data available to anyone that asks for it.
>
> Thank you,
>
> Jason
>
> [1] At least I don't recall seeing such a sitemap registry site or
> service. If you know of an existing registry of digital collections
> sitemaps, please let me know about it!
> [2] http://www.sitemaps.org/ For more information on robots see
> http://wiki.code4lib.org/index.php/Robots_Are_Our_Friends
> [3] For instance you can see how I've started to investigate whether
> digital collections are being crawled by the Common Crawl:
> http://jronallo.github.com/blog/common-crawl-url-index/
>