For code4lib.org server-related stuffs, I'm your huckleberry.

Screen scraping an HTTPS site can be complicated for a number of reasons,
mostly depending on how capable the scraper's HTTP client is and on the
"quality" of the certificate (for example, whether it chains to a CA the
client already trusts).
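
To illustrate where a scraper tends to trip, here is a minimal Python sketch
using only the standard library. The function names `fetch` and
`classify_failure` are hypothetical, and nothing here describes how any
particular scraper of code4lib.org actually works. It assumes the scraper
verifies certificates in the default way, which is when a self-signed or
expired certificate turns into a hard failure.

```python
import ssl
import urllib.request
from urllib.error import URLError


def fetch(url, timeout=10.0):
    """Fetch a page with normal certificate verification (hypothetical helper)."""
    # The default context checks the certificate chain and the hostname.
    context = ssl.create_default_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as resp:
        return resp.read()


def classify_failure(exc):
    """Report whether a fetch error came from certificate verification."""
    # urllib wraps the underlying error in URLError.reason.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, ssl.SSLCertVerificationError):
        return "certificate problem: " + str(reason)
    return "other problem: " + str(reason)
```

A scraper would call `fetch(...)` inside a `try` block and pass any
`URLError` it catches to `classify_failure` to tell a certificate problem
apart from an ordinary network error.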

I would be happy to make the webserver logs available to someone if they
wanted to try to determine if something is screen scraping the site.

If we just want to make HTTPS an option, that's easy enough. It's also easy
enough to just redirect anything trying to hit HTTP to HTTPS. If someone
wants to handle the procurement for the certificate, putting a new one in
place every year is not much hassle.
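
As a rough sketch of the two chores mentioned above (redirecting HTTP to
HTTPS, and keeping track of when the certificate needs replacing), something
like the following Python would do. It is only an illustration, not the
configuration of the code4lib.org server; the names `https_location`,
`RedirectToHTTPS`, and `days_until_expiry` are made up for the example, and
the host parsing assumes a plain hostname rather than an IPv6 literal.

```python
import ssl
import time
from http.server import BaseHTTPRequestHandler


def https_location(host, path):
    """Build the HTTPS URL that a plain-HTTP request should be sent to."""
    hostname = host.split(":", 1)[0]  # drop an explicit port such as :80
    return "https://" + hostname + path


class RedirectToHTTPS(BaseHTTPRequestHandler):
    """Answer every GET or HEAD on the HTTP port with a permanent redirect."""

    def do_GET(self):
        # Fallback host is an assumption for requests without a Host header.
        host = self.headers.get("Host", "code4lib.org")
        self.send_response(301)
        self.send_header("Location", https_location(host, self.path))
        self.end_headers()

    do_HEAD = do_GET


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter date.

    not_after uses the format found in ssl.SSLSocket.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400.0
```

The handler can be served with `http.server.HTTPServer`, and
`days_until_expiry` could run from a scheduled job that sends a reminder
when the result drops below some threshold, such as 30 days.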

On Tue, Nov 5, 2013 at 9:07 AM, William Denton wrote:

> On 4 November 2013, Ross Singer wrote:
>
>> While I'm not opposed to providing code4lib.org via HTTPS, I don't think
>> it's as simple as "let's just do it!".  Who will be responsible for making
>> sure the cert is up to date?
>>
>
> I will for a while!  I'll make some entries in my calendar.
>
>> Who will pay for certs (if we don't go with startcom)?
>>
>
> Good question.  There was a small working group formed a little while ago
> that was looking at a formal Code4Lib organization ... did anything come of
> that?  Cary Gordon kicked it off, I think.  If there was a formal
> arrangement then that would be the right place to manage the costs of an
> SSL cert.
>
> But there is no formal arrangement yet, so we could rustle it up amongst
> ourselves (I'll chip in) or we could make it part of the annual conference
> costs ($100ish isn't an onerous burden).
>
> We don't have to get it working forever right now.  We just need to get it
> working.  Then we can worry about it next year.
>
> I've forgotten who at Oregon State is tending the server ... whoever it
> is, can you email me?
>
> By the way, if anyone out there has been thinking about privacy
> post-Snowden and has some ideas about what libraries and archives can do
> about it, this would be a good subject for a talk at the conference next
> year [0] ...
>
>> Also, forcing all traffic to HTTPS unnecessarily complicates some things,
>> e.g. screen scrapers (and before you say, "well, screen scraping sucks,
>> anyway!", I think it's not a stretch to say that "microdata parser" falls
>> under "screen scraping".  Or RDFa.).
>>
>
> Fair enough, but even if not mandatory or preferred, HTTPS should be
> available everywhere HTTP is used, and that's something we can work
> towards.  People log in to code4lib.org and wiki.code4lib.org by sending
> their passwords in the clear!  That is uncool.
>
> (Question:  Why does HTTPS complicate screen-scraping?  Every decent tool
> and library supports HTTPS, doesn't it?)
>
> Bill
>
> [0] http://wiki.code4lib.org/index.php/2014_Prepared_Talk_Proposals
> --
> William Denton
> Toronto, Canada
> http://www.miskatonic.org/
>