Mark,

Thank you for your feedback. Yes, I see what you mean. I've changed
the form a bit to allow for adding a block of sitemaps.

In cases where I reach the size limit for a sitemap, I'll use a
sitemap index file to group multiple sitemaps. I can see, though, how
that isn't always straightforward and some systems may make it difficult.
http://www.sitemaps.org/protocol.html#index
(Oh, and what Chad said about not having to update the robots.txt all the time.)
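
In case it's useful, here's a rough sketch in Python of what I mean:
split a list of URLs into sitemap files (capped at 40,000 entries, as
Mark does below) and write a sitemap index that points at them. The
domain, file names, and the urls argument are all placeholders, not
anything from a real system.

    # Rough sketch only: chunk a URL list into sitemap files and tie them
    # together with a sitemap index per http://www.sitemaps.org/protocol.html
    # BASE, the file names, and the urls argument are placeholders.
    from xml.sax.saxutils import escape

    SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
    BASE = "http://example.org"   # placeholder site root
    CHUNK = 40000                 # well under the 50,000-URL protocol limit

    def write_sitemaps(urls):
        index_entries = []
        for i in range(0, len(urls), CHUNK):
            name = "sitemap-%d.xml" % (i // CHUNK + 1)
            with open(name, "w") as f:
                f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
                f.write('<urlset xmlns="%s">\n' % SITEMAP_NS)
                for url in urls[i:i + CHUNK]:
                    f.write("  <url><loc>%s</loc></url>\n" % escape(url))
                f.write("</urlset>\n")
            index_entries.append("%s/%s" % (BASE, name))
        # The index file is the single URL you'd register or autodiscover.
        with open("sitemap_index.xml", "w") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<sitemapindex xmlns="%s">\n' % SITEMAP_NS)
            for loc in index_entries:
                f.write("  <sitemap><loc>%s</loc></sitemap>\n" % escape(loc))
            f.write("</sitemapindex>\n")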

I didn't make a space for robots.txt because lots of sites still do
not add their sitemaps to their robots.txt. And since robots.txt
always lives at the site root, any sitemaps it does list can be
autodiscovered anyway.
I've found in my previous explorations of sites in collection
registries that while there is often a robots.txt it usually does not
contain any sitemaps. That might just be the set that I've dealt with,
though.
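
To show what I mean by autodiscovery, here's a minimal sketch (Python 3;
the example.org URL is just a placeholder) that fetches a robots.txt and
pulls out any Sitemap: lines:

    # Minimal sketch: fetch a site's robots.txt and list any declared sitemaps.
    # The example.org URL is a placeholder, not a real collection.
    from urllib.request import urlopen

    def discover_sitemaps(site_root):
        robots_url = site_root.rstrip("/") + "/robots.txt"
        body = urlopen(robots_url).read().decode("utf-8", "replace")
        sitemaps = []
        for line in body.splitlines():
            # The Sitemap field may appear multiple times; match it
            # case-insensitively to be safe.
            if line.lower().startswith("sitemap:"):
                sitemaps.append(line.split(":", 1)[1].strip())
        return sitemaps

    print(discover_sitemaps("http://example.org"))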

Jason

On Fri, Feb 1, 2013 at 11:33 AM, Sullivan, Mark V <[log in to unmask]> wrote:
> Jason,
>
> You may want to allow people just to give you the robots.txt file which references the sitemaps.  I also register the sitemaps individually with the big search engines for our site, but I found that very large sitemaps aren't processed very well.  So, for our site I think I limited the number of items per sitemap to 40,000, which results in ten sitemaps for the digital objects and an additional sitemap for all the collections.
>
> http://ufdc.ufl.edu/robots.txt
>
> Or else perhaps give more boxes, so we can include all the sitemaps utilized in our systems.
>
> Cheers!
>
> Mark
>
>
> Mark V Sullivan
> Digital Development and Web Coordinator
> Technology and Support Services
> University of Florida Libraries
> 352-273-2907 (office)
> 352-682-9692 (mobile)
> [log in to unmask]
>
>
>
> ________________________________________
> From: Code for Libraries [[log in to unmask]] on behalf of Jason Ronallo [[log in to unmask]]
> Sent: Friday, February 01, 2013 11:14 AM
> To: [log in to unmask]
> Subject: [CODE4LIB] digital collections sitemaps
>
> Hi,
>
> I've seen registries for digital collections that make their metadata
> available through OAI-PMH, but I have yet to see a listing of digital
> collections that just make their resources available on the Web the
> way the Web works [1]. Sitemaps are the main mechanism for listing Web
> resources for automated crawlers [2]. Knowing about all of these
> various sitemaps could have many uses for research and improving the
> discoverability of digital collections on the open Web [3].
>
> So I thought I'd put up a quick form to start collecting digital
> collections sitemaps. One required field for the sitemap itself.
> Please take a few seconds to add any digital collections sitemaps you
> know about--they don't necessarily have to be yours.
>
> https://docs.google.com/spreadsheet/viewform?formkey=dE1JMDRIcXJMSzJ0YVlRaWdtVnhLcmc6MQ#gid=0
>
> At this point I'll make the data available to anyone that asks for it.
>
> Thank you,
>
> Jason
>
> [1] At least I don't recall seeing such a sitemap registry site or
> service. If you know of an existing registry of digital collections
> sitemaps, please let me know about it!
> [2] http://www.sitemaps.org/ For more information on robots see
> http://wiki.code4lib.org/index.php/Robots_Are_Our_Friends
> [3] For instance you can see how I've started to investigate whether
> digital collections are being crawled by the Common Crawl:
> http://jronallo.github.com/blog/common-crawl-url-index/

