Web scraping
by RS, admin@robinsnyder.com


1. Web scraping
Web scraping is a technique for automatically or semi-automatically downloading content from the Internet.

2. Ethical issue
Before proceeding, some ethical considerations are covered.

3. Levels of scraping
There are several levels of scraping, which may have different rules for how the content may be acquired and used. These levels include the following. Scraping for personal use (keeping click rates and download volume close to what a person browsing would generate) is treated fairly permissively.

At the other end of the spectrum, scraping for commercial use should be thoroughly investigated, and approval should be obtained from the host site (by checking its terms of use, etc.), before any scraping is done.

4. General rules
As a general personal rule, I sometimes scrape sites that want their content to be consumed and that often provide ways to do so.

So, if I could sit at a computer for an hour, click on a few hundred links, do a "save as" command on each, and process or consume the content, then automating that same process at a comparable rate seems reasonable.

5. Headlines
The example used here is that of headline feeds, whereby news headlines collected over time can be used for (personal) research purposes to identify and study trends, sentiment, etc. The full articles can be used, but for many research purposes just a headline and an accompanying paragraph are sufficient.

News feeds can be obtained via a type of XML (Extensible Markup Language) feed called RSS (Really Simple Syndication).
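
For illustration, here is a minimal Python sketch that parses the usual RSS 2.0 shape using only the standard library. The feed content below is made up, but the channel/item structure is the standard one.

import xml.etree.ElementTree as ET

# A made-up feed illustrating the usual RSS 2.0 shape.
rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Headlines</title>
    <item>
      <title>First headline</title>
      <description>An accompanying paragraph.</description>
      <pubDate>Mon, 01 Jan 2024 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
for item in root.iter("item"):
    # Each item carries a headline and a short description.
    print(item.findtext("title"), "|", item.findtext("pubDate"))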

6. Data analysis
There are many ways to analyze the collected data, but first one must have a way to collect the data and then actually collect the data.
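
As a small, hypothetical example of one such analysis, assuming the feeds have already been saved to dated XML files (as sketched under the approaches below), word frequencies across the collected headlines can be tallied in Python:

import collections
import pathlib
import xml.etree.ElementTree as ET

# Count word frequencies across all saved headline files.
counts = collections.Counter()
for path in sorted(pathlib.Path("feeds").glob("headlines-*.xml")):
    root = ET.parse(path).getroot()
    for item in root.iter("item"):
        title = item.findtext("title") or ""
        counts.update(word.lower().strip(".,!?") for word in title.split())

print(counts.most_common(10))  # the ten most common headline words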

7. Academic conferences
Another feed type that I have used is academic conference announcements - to consolidate and display upcoming conferences of interest.

8. One approach
One (manual) approach is to use an RSS reader. I have used the open-source Thunderbird email client to subscribe to RSS feeds. Thunderbird stores the feed items in a standard email (mbox) format.

I have used Lua to read that email format (for conference announcements).

I have used Python to (more easily) read that email format (for student submissions via email, headlines, etc.).
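
For example, Python's standard mailbox module can read an mbox-format mail folder directly. A sketch, with a hypothetical profile path:

import mailbox

# Hypothetical path to a Thunderbird feed folder stored in mbox format.
mbox_path = "/path/to/Thunderbird/Profile/Mail/Feeds/News"
for message in mailbox.mbox(mbox_path):
    # Each feed item is stored as a message; the subject line is the headline.
    print(message["subject"], "|", message["date"])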

9. Another approach
Another approach is to use Python to directly read the RSS feed on a periodic basis and save the results for later processing.

An RSS reader will check the RSS feed for updates many times a day (depending on its settings).

I have found that, for research purposes, a once-a-day download and save is sufficient.
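
A minimal once-a-day fetch-and-save sketch in Python might look like the following; the feed URL and output directory are placeholders:

import datetime
import pathlib
import urllib.request

FEED_URL = "https://www.example.com/rss.xml"  # placeholder feed URL
OUT_DIR = pathlib.Path("feeds")               # placeholder output directory

def fetch_and_save():
    # Save today's copy of the feed under a dated filename.
    OUT_DIR.mkdir(exist_ok=True)
    stamp = datetime.date.today().isoformat()  # e.g. 2024-01-01
    out_file = OUT_DIR / ("headlines-" + stamp + ".xml")
    with urllib.request.urlopen(FEED_URL) as response:
        out_file.write_bytes(response.read())

if __name__ == "__main__":
    fetch_and_save()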

10. Refined approach
Here is the refined approach for collecting the raw data. This works well because most RSS feeds contain data for the last several days.

Note that sites occasionally change their RSS URL, so a good program should detect that so the URL can be adjusted.
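
One way to detect such a change (a sketch, not the exact method used here) is to watch for redirects and HTTP errors when fetching the feed:

import urllib.error
import urllib.request

def check_feed(url):
    # Fetch the feed; report redirects and HTTP errors that may signal
    # that the feed URL has moved.
    try:
        with urllib.request.urlopen(url) as response:
            if response.geturl() != url:
                print("Feed redirected:", url, "->", response.geturl())
            return response.read()
    except urllib.error.HTTPError as e:
        print("Feed fetch failed with HTTP", e.code, "- has the URL moved?")
        return None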

11. RSS Reader
Whenever accessing RSS feeds programmatically, it can be useful to subscribe to the same feed using an actual RSS reader so you can see what you should be getting.

The tradeoff in not using an actual RSS reader is that one must then ensure that the program is run once a day. Forget a day, and a day of data is missing.

12. Once a day
How does one ensure that the program is run once a day? I typically set the job to run in the middle of the morning (Eastern Time Zone), when there is a lot of unused bandwidth in the United States.

13. Always running computer
How does one ensure that the computer on which the job runs is up and running every day?

I have used a small, older, very low-power Raspberry Pi that runs all the time (on a UPS), runs the designated cron jobs, and saves the data locally.
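
For example, a crontab entry along the following lines (the time and paths are placeholders) would run the fetch script once a day and log its output:

# Hypothetical crontab entry: run the fetch script daily at 09:00
# and append its output to a log file.
0 9 * * * /usr/bin/python3 /home/pi/fetch_feed.py >> /home/pi/fetch_feed.log 2>&1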

14. Approaches
There are (at least) two general approaches to automated scraping of web sites, as described above: using an RSS reader, or fetching the feeds directly with a program.
