My Workshop for the South African Online Users Group Meeting in beautiful Pretoria, South Africa (6/7/16)

It’s been a while now since I’ve written a truly geeky post that didn’t focus on law libraries. I’m going to try to make up for that with this one on how to pull data from websites using Excel.

Last week, I had the absolute pleasure of traveling to Pretoria, South Africa to present at the South African Online Users Group meeting. In addition to presenting at a local law firm and giving a keynote speech on the value of law library service, I also got to teach a three-hour workshop on web scraping and website monitoring. We kept it simple at first, and eventually got so far in depth with the discussion that even I got confused. I’ll hit some of the highlights of my workshop here, specifically the parts using Microsoft Excel.

Web scraping is the practice of using software to extract information directly from websites. Since many of us use Excel to compile that information, whether it is text, numbers, geographic information, or currency, I like to show all the ways that you can pull data directly into Excel and start manipulating it.

So why would you use web scraping? There are a number of reasons. You may want to create some type of prototype that shows how you can interpret and analyze data, and you need some sample data to test or demo your project. You may want to arrange information in different ways to make the data more valuable to you or your audience. Data analysis is another common reason: pulling blocks of data to demonstrate a trend or tendency in the data. And finally, you may want to take external data and combine it with internal data (say, industry data combined with client data) to compare or evaluate internal trends.

Web scraping is something that you really want to treat as a last resort. It can be complicated and time-consuming to set up, so first see whether the website has an API you can use, or whether you can buy the data from the owners. If it is a simple one-time need for a small amount of information, think about copying and pasting rather than scraping. If it is something that you’ll do over and over again, or it involves a large amount of data, then scraping may be the better option. You also want to make sure that you aren’t breaking any laws or the website’s terms of service. The key phrase that I borrow for this is, “Don’t Be Evil.” People work hard to collect and produce data; don’t blatantly steal someone else’s work, especially if it will do financial or other harm to them.

Now that the dos and don’ts are out of the way, let’s look at some options you can use to pull data into Excel from web pages.

  1. Simple Cut & Paste: 
    If the data is pretty clean and sits in a table on a webpage, you can keep it very easy:
    – highlight the information and copy it (CTRL+C or right-click “Copy”)
    – the trick is in the pasting: use the “Paste Special” option when you paste the data back into Excel (right-click -> Paste Special -> Text (or Unicode Text))
  2. Data -> From Web:
    Excel has a very simple tool on the “Data” tab called “From Web.” It lets you enter the URL of a website, and Excel will bring up a browser window where you can pick the data you want from that page. This option is simple, but it will sometimes pull in all of the data from the page, and you’ll have to do some cleanup later. Still, it is much easier than recreating the data, and much cleaner than a copy and paste. Here’s a short video that shows how to do that.

  3. Excel’s Power Query:
    In Excel 2016, there is a new built-in tool for pulling data from the web, available through the “Data” tab. If you have this version, go to Data -> New Query -> From Other Sources -> From Web.
    If you do not have this version, don’t worry. You can download the Power Query add-in from Microsoft from this page.
    Here is a good (and short) video on how to use Power Query for Excel.

  4. Using APIs with Excel
    Now we start getting a little more complicated with the use of Excel formulas to pull data via an API. In order to access most APIs, you will need to sign up and receive a “key” that is unique to you (and you need to protect that key from being seen by others). Once you have a key (or if you are pulling openly available XML, like an RSS feed), you’ll use two functions in Excel:
    =WEBSERVICE, and
    =FILTERXML
    Again, it is much easier to show you a quick video than to walk you through this process, so here is a short video explaining how to use Excel and APIs. (There is also a small worked example of these two functions just after this list.)

  5. Specialized Add-Ins for Excel
    There are add-ins that you can upload into Excel to expand functions that pull data from websites and social media outlets.
    SEO Tools for Excel – This is a very powerful tool that allows you to pull data from different types of websites and social media sites. There is a free version that comes with lots of functionality, but it has an annoying start-up screen whenever you launch Excel that makes you wait a few seconds before you can use Excel. The paid version comes with extra functions and no annoying pop-up start screen.
    Excel Scrape HTML Add-In – This is a free add-in that gives you some additional functions in Excel for pulling data from websites.
    Both of these add-ins are a little more complicated than I want to go into at this point, but you should know that these (and others) exist out there and may help you in your data gathering.
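
Before moving on, here is a minimal sketch of how the two functions from option 4 fit together. Treat it as an illustration rather than a recipe: the feed URL (https://example.com/feed.xml) is a placeholder, not one we used in the workshop, and WEBSERVICE and FILTERXML are only available in Excel 2013 and later on Windows.

    Pull the raw XML that the URL returns into a cell, say A1:
    =WEBSERVICE("https://example.com/feed.xml")

    Then extract the piece you want with an XPath expression; this one grabs item titles from an RSS feed:
    =FILTERXML(A1, "//item/title")

    Or nest the two in a single cell:
    =FILTERXML(WEBSERVICE("https://example.com/feed.xml"), "//item/title")

In versions of Excel with dynamic arrays, the FILTERXML formula spills every matching title down the column; older versions return only the first match. Swap in your own API URL (with your key) and the XPath for the fields you need, and the same pattern applies.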

This covers most of the Excel portion of the workshop I conducted last week. It shows that there are some powerful built-in functions in Excel that can help you with data gathering and make it much easier to manipulate and analyze the data you find on the web. Go forth and scrape!

One of the best features that Lex Machina provides for Intellectual Property attorneys is the increased accuracy of the information it pulls from PACER. The improvements that Lex Machina has made on Cause-of-Action (CoA) and Nature-of-Suit (NoS) codes entered into PACER make it an invaluable resource to clearly identify relevant matters and weed out irrelevant cases. By improving the data, Lex Machina reduces the “garbage in – garbage out” effect that exists in the PACER database.

Now Lex Machina has turned its focus to cleaning up another annoyance found in PACER data, as well as in many of the other platforms that pull data from PACER. Its new Attorney Data Engine analyzes the PACER information and identifies the attorneys that are actually associated with a case, even if those attorneys do not show up on the attorney list in PACER.

I talked with Karl Harris, Vice President of Products at Lex Machina, a couple of weeks ago, and he gave me some insights on the new Attorney Data Engine and how they are increasing the accuracy of identifying the attorneys and law firms that are actually working on the cases filed through PACER. Karl mentioned that in New Jersey and Delaware, two very important states when it comes to Intellectual Property cases, only about 54% of the attorneys who work on the cases actually show up in the PACER information. That means that nearly half of the attorneys are missing from the metadata produced by PACER. When accuracy is important, missing nearly half of the attorney names can cause quite a problem.

For those of us who have ever demoed docket information for an attorney, we know that one of the first questions the attorney asks is “can you find ‘X’ case, in which I represented client ‘Y’?” If you cannot find that information, the demo may as well end right there. Attorneys are issue spotters. If you cannot give them accurate information, they will not trust that the product actually works.

With the new Lex Machina Attorney Data Engine, you should be able to find the attorney information, even if PACER missed it.

Here is an overview of the three components of the Attorney Data Engine:

  1. The PACER metadata itself: Every time Lex Machina crawls PACER data, they keep a historical record and can identify when attorneys are added to or removed from a case over time. This makes the PACER data better by itself.
  2. Pro Hac Vice Extractor: Docket entries will mention when attorneys are added pro hac vice to a case. Lex Machina also keeps a record of which attorneys are associated with which law firms over time.
  3. Signature Block Analyzer: Lex Machina analyzes the documents attached to the docket entries and identifies the signature blocks for each attorney. Even if the attorney’s name doesn’t show up in the Docket entry, if they have a signature block, they are then associated with the case. 
Karl Harris states that the Attorney Data Engine makes Lex Machina “the best source for reliably figuring out which attorneys are involved in which cases.” 
It will be interesting to watch Lex Machina grow over the next couple of years, and to see how its new parent company, Lexis, assists in advancing its progress through access to additional data points. It is not a far jump to see how the Attorney Data Engine processes can be turned into a Company Data Engine using Lexis’ company information databases. Lexis has the content, and Lex Machina has the analytical resources to make that content better. It should make for some interesting results as the two companies learn how to adapt processes to the different products. 

Next Thursday afternoon at 3:30 PM in the Maryland A conference room of the Gaylord National Resort in Washington D.C., Sean Brady and I will be giving our presentation: APP 13 – The Future of Data Delivery Or: How I Learned to Stop Browsing and Love the App.  

Ours is the last session on the last day of ILTA.  For those of you who have ever been to ILTA you know that by Thursday afternoon, most people have learned, networked, and partied (not necessarily in that order) all they possibly can for one week.  Many will have already checked out and be headed toward the airport by the time Sean and I take the stage.  Even those staying at the hotel Thursday night will have mostly “checked out” mentally for the week.   In fact, it is entirely possible that Sean and I will be having a private conversation on The Future of Data Delivery in the Maryland A conference room of the Gaylord National Resort in Washington D.C. at 3:30 PM next Thursday afternoon.  In order to limit the likelihood of that happening, we are asking for your help.

Our presentation is a follow-up to the presentation we gave last year called: The Future of the Browser: Moving Beyond IE, which ILTA members can listen to here.  In that presentation, I argued that the future of the browser was moot because the browser was dead, to be replaced by the App.  And Sean argued that the browser was the end-all and be-all of human endeavor, never to be surpassed by…

     I didn’t say that!!

Hey, Sean, I’m writing this blog post!

     Yeah, but…

Look, you can give your version of events at our session.

     But no one’s going to be at our session.

That’s not my problem.  Go write your own blog post.

Sorry folks, he’s a little touchy because as it turns out, I was pretty much right about the App thing last year.

     Oh Really!?  OK genius, what App are you writing this blog post in?

Uh… Chrome.

    A browser.  Ha!  I rest my case!

Anyway, as I was saying… we need your help.  Our presentation this year is a follow-up to last year’s talk.  We’re going to discuss the current trends in data delivery, the continued shift to mobile, the dominance of the app, the consumerization of everything, and what we as legal technologists can and should do about it.  I am sure we will present points of view on this topic that you will not hear anywhere else at ILTA this year.

I can be sure of that because in addition to our own points of view, we would like to present YOUR points of view.  We have a quick little Google form we would like you to fill out…

     In a browser!

Yes, in a browser!  Are you done?!

    Sorry.

Ahem… We would like you to give us a little information about your thoughts on the subject.  We want to know how your firm is handling these issues and your general thoughts on The Future of Data Delivery.  We’re looking for everything from the mundane (the DMS will be social), to the ridiculous (after the singularity we will all communicate via techno-telepathy).

For those of you attending ILTA, stop us in the hall and let us know your thoughts; we will also be handing out our business cards all week with the web form URL on the back. http://goo.gl/FTC58

     Why didn’t you just make the web form into an App?

Because I didn’t think of it in time, smarty.

    So Apps take time, and web forms IN BROWSERS don’t.

Please folks, fill out the form and if you’re at ILTA, stick around Thursday afternoon and attend APP13 at 3:30 PM in Maryland A, to see what we and your peers have to say about The Future of Data Delivery.

    It’s the browser.

Ughhh.

If you have a service that gets 20 million unique users a month, and you have personal information on those users, what could you do with it? For the staffers at the music streaming site Grooveshark, the answer was to build a free tool that shows the demographics, culture and lifestyles, habits, and product preferences of the users who listen to a particular artist. This morning, Grooveshark launched a data analytics resource called Beluga. This resource allows you to type in the name of a musical artist and find out information about the fans that listen to their music. The press release that Grooveshark issued this morning says that the information is designed to help artists better position themselves to their fanbase (it could also be related to the fact that all four major record labels are suing Grooveshark… but we’ll stick to what Beluga does for this post.)

Grooveshark’s co-founder, Josh Greenberg, believes that by exposing these analytics, artists can “learn about their fans, route their tours, sell merchandise, work on building a following, and take their careers to the next level.” He thinks there is also value from the sales and marketing side of the record labels, since they can position themselves “to partner with artists who connect with their target audience, presenting endless opportunities.”

If you’ve used Grooveshark in the past, you know that from time to time they ask you to fill out a quick survey. It seems that we are seeing the first of the results of those surveys, and it makes for some interesting information when looking at musical groups. Here’s a sample of results I got when I searched on Eric Clapton:

  • 55+year old females are moderately over-represented (and yes, they look Wonderful Tonight)…
  • Fans of Clapton seemed to have retired in the past year (remembering Days of Old)…
  • Israelis seem to love themselves some Slowhand
  • Clapton users seem to show some Old Love and rent their movies from Blockbuster online…
  • Poor people don’t really like to listen to Clapton (no Hard Times for Clapton fans)…
  • Clapton listeners tend to bank at Citibank (cause, Nobody Knows You When You’re Down and Out)…
  • Apparently, Clapton sounds best when you’re driving and Traveling Alone.
What is the accuracy of these survey results?? I’m not sure. I guess the case could be made for not putting too much faith in the results of a voluntary web survey. However, I did look at the survey results for some smaller-name bands that I like, and the results fell about where I’d expect for those bands, although the geography preferences tended to trend outside the US (maybe because Americans are less likely to fill out the survey??)
Regardless of whether or not you trust the results, if you love to see examples of what can be done with large amounts of data, then you’ll find yourself having some fun looking at Beluga.

Earlier today, I saw an interesting visual on a Thomson Reuters site showing where all the Dewey & LeBoeuf partners went. The graphic illustrates the firms that picked up the laterals, the names of the lateral partners, and their practice areas. The Dewey situation has been like watching a train slowly wreck, day after day, for about four months now. The visualization helps you see where everyone is going and which practice areas are affected.
[Note: I got a notice from the AmLaw Daily that they are keeping an up-to-date list and visual as well.]
[Note2: Also, Law Shucks has a great Lateral Tracker to monitor more than just the Dewey defections.]

I thought I’d see if we could take the visualization a bit further and pull out a couple of handy tools to do this, and maybe get a little use out of some crowdsourcing at the same time.

  1. Google Docs – I’ve uploaded a freely editable (by anyone) spreadsheet that lists the Dewey departures, along with the practice groups, offices, date announced, and the link to a news article verifying the departure. Feel free to add or edit the Google Doc as you see fit!
  2. ManyEyes – We’ve used this before to graph out information. This is a great free resource from IBM.
Although the implosion is no laughing matter (unless you like all the puns that Above The Law has been using for the past few weeks), it does make for some interesting graphics to show how substantial the losses are.