|My Workshop for the South African|
Online Users Group Meeting in
beautiful Pretoria, South Africa (6/7/16)
Last week, I had the absolute pleasure of traveling to Pretoria, South Africa to present at the South African Online Users Group meeting. In addition to presenting at a local law firm and giving a keynote speech on the value of law library service, I also got to teach a three hour workshop on web scraping and website monitoring. We kept it simple at first, and eventually got so far in depth with the discussion that even I got confused. I'll hit some of the highlights of my workshop here, specifically with using Microsoft's Excel.
Web scraping is the action of using computer software techniques to extract information directly from websites. Since many of us use Excel to compile the information, whether it is text, numbers, geographical information, or currency, I like to show all the ways that you can directly pull data into Excel and start manipulating it.
So why would you use web scraping? There are a number of reasons. You may want to create some type of prototype that shows how you can interpret and analyze data and you need some sample data to test or demo your project. You may want to arrange information in different ways to make the data more valuable to you or your audience. Data analysis is a common reason to pull blocks of data to prove a trend or tenancy from the data. And finally, you may want to take external data and combine it with internal data (say Industry Data combined with client data) and compare or evaluate internal trends.
Web scraping is something that you really want to do as a last resort. It can be complicated, and time consuming to set up, so you want to see if the website may have an API that you can use or if you can buy the data from the owners. If it is a simple one-time thing for a small amount of information, then think about copying and pasting rather than scraping. If it is something that you'll do over and over again, or it is a large amount of data, then scraping may be the better option. You also want to make sure that you aren't breaking any laws or terms of services for the website. The key phrase that I borrow for this is, "Don't Be Evil." People work hard to collect and produce data, don't blatantly steal someone else's work, especially if it will do financial or other harm to them.
Now that the do's and don't's are out of the way, let's look at some options you can use to pull data into Excel from web pages.
- Simple Cut & Paste:
If the data is pretty clean, and within a table on a webpage, you can keep it very easy by
- highlighting the information and copy it (CTRL+C or right-click "Copy")
- the trick in pasting the data is using the "Paste Special" option when you paste the data back to Excel (right-click -> paste special -> text (or unicode text) )
- Data -> From Web:
Excel has a very simple tool on the "Data" tab called "From Web." This allows you to simply put in the URL to a website, and Excel will bring up a browser screen that will let you pick the data from that website. This option is simple, but sometimes will pull in all of the data from the page, and you'll have to do clean up later. Still much easier than recreating the data, and much cleaner than a copy and paste. Here's a short video that shows how to do that.
- Excel's Power Query:
In Excel 2016, there is a new built-in tool for downloading from the web that you can use through the "Data" tab. If you have this version, go to Data -> New Query -> From other Sources -> From Web.
If you do not have this version, don't worry. You can download the Power Query tool bar add-in from Microsoft from this page.
Here is a good (and short) video on how to use Power Query for Excel.
- Using APIs with Excel
Now we start getting a little more complicated with the use of Excel formulas to pull data via an API. In order to access an API, you will need to sign up for the API and receive a "Key" that is unique to you (and you need to protect that Key from others seeing it.) Once you have that (or you can use open XML code like RSS feeds) you'll use two functions in Excel:
Again, it is much easier to show you a quick video than to walk you through this process, so here is a short video explaining how to use Excel and APIs.
- Specialized Add-Ins for Excel
There are add-ins that you can upload into Excel to expand functions that pull data from websites and social media outlets.
SEO Tools for Excel - This is a very powerful tool that allows you to pull data from different types of websites and social media sites. There is a free version that comes with lots of functionality, but an anoying start up screen whenever you launch Excel that makes you wait for a few seconds before you can use Excel. The paid version comes with extra functions and no anoying pop-up start screen.
Excel Scrape HTML Add-In - This is a free add-in that allows you some additional function features in Excel to pull data from websites.
Both of these add-ins are a little more complicated than I want to go into at this point, but you should know that these (and others) exist out there and my help you in your data gathering.