Archive for the ‘Web Scrapers’ Category

Building a Telecom Dictionary scraping web using rvest in R [Tunable Transparency]

Tuesday, December 5th, 2017

Building a Telecom Dictionary scraping web using rvest in R by Abdul Majed Raja.

From the post:

One of the biggest problems in Business to carry out any analysis is the availability of Data. That is where in many cases, Web Scraping comes very handy in creating that data that’s required. Consider the following case: To perform text analysis on Textual Data collected in a Telecom Company as part of Customer Feedback or Reviews, primarily requires a dictionary of Telecom Keywords. But such a dictionary is hard to find out-of-box. Hence as an Analyst, the most obvious thing to do when such dictionary doesn’t exist is to build one. Hence this article aims to help beginners get started with web scraping with rvest in R and at the same time, building a Telecom Dictionary by the end of this exercise.

Great for scraping an existing glossary but as always, it isn’t possible to extract information that isn’t captured by the original glossary.

Things like the scope of applicability for the terms, language, author, organization, even characteristics of the subjects the terms represent.

Of course, if your department invested in collecting that information for every subject in the glossary, there is no external requirement that on export all that information be included.

That is, your “data silo” can have tunable transparency: you enable others to use your data with as much or as little semantic friction as the situation merits.

Some data borrowers get opaque spreadsheet field names: column1, column2, etc.

Other data borrowers, perhaps those willing to help defray the cost of semantic annotation, well, they get a more transparent view of the data.

One possible method of making semantic annotation and its maintenance a revenue center rather than a cost center.
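The scraping step itself is straightforward. The post uses rvest in R; the same term/definition extraction can be sketched with Python's standard-library parser (the glossary markup and telecom terms below are stand-ins, not the post's actual source):

```python
from html.parser import HTMLParser

class GlossaryParser(HTMLParser):
    """Collect term/definition pairs from <dt>/<dd> elements."""
    def __init__(self):
        super().__init__()
        self._field = None    # "dt" or "dd" while inside one
        self.entries = []     # list of [term, definition] pairs

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._field = tag

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "dt":
            self.entries.append([text, ""])
        elif self.entries:
            self.entries[-1][1] = (self.entries[-1][1] + " " + text).strip()

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._field = None

html = """
<dl>
  <dt>LTE</dt><dd>Long Term Evolution, a 4G standard.</dd>
  <dt>RAN</dt><dd>Radio Access Network.</dd>
</dl>
"""
parser = GlossaryParser()
parser.feed(html)
dictionary = {term: definition for term, definition in parser.entries}
```

The rvest equivalent is a couple of calls to html_nodes() and html_text(); either way, the point is pairing each term with its definition.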

Web Scraping Reference: …

Thursday, April 6th, 2017

Web Scraping Reference: A Simple Cheat Sheet for Web Scraping with Python by Hartley Brody.

From the post:

Once you’ve put together enough web scrapers, you start to feel like you can do it in your sleep. I’ve probably built hundreds of scrapers over the years for my own projects, as well as for clients and students in my web scraping course.

Occasionally though, I find myself referencing documentation or re-reading old code looking for snippets I can reuse. One of the students in my course suggested I put together a “cheat sheet” of commonly used code snippets and patterns for easy reference.

I decided to publish it publicly as well – as an organized set of easy-to-reference notes – in case they’re helpful to others.

Brody uses Beautiful Soup, a Python library that will parse even the worst formed HTML.

I mention this so I will remember, the next time I scrape Wikileaks, that instead of downloading, repairing with Tidy, and then parsing with Saxon/XQuery, there are easier ways to do the job!
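Beautiful Soup's selling point is exactly that tolerance for bad markup. Even Python's standard-library parser muddles through fairly sloppy HTML; here is a minimal sketch (the malformed snippet is invented):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Pull href targets out of anchor tags, even in sloppy HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Deliberately malformed: unclosed <p> and <a>, uppercase tag, unquoted attribute.
messy = '<P>See <a href=/docs>docs<a href="http://example.com">home</a>'
extractor = LinkExtractor()
extractor.feed(messy)
```

Beautiful Soup adds tree navigation and repair on top of this kind of event-based parsing, which is why it handles even worse input.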


Text [R, Scraping, Text]

Wednesday, August 17th, 2016

Text by Amelia McNamara.

Covers “scraping, text, and timelines.”

Using R, focuses on scraping, works through some of “…Scott, Karthik, and Garrett’s useR tutorial.”

In case you don’t know the useR tutorial:

Also known as (AKA) “Extracting data from the web: APIs and beyond”:

No matter what your domain of interest or expertise, the internet is a treasure trove of useful data that comes in many shapes, forms, and sizes, from beautifully documented fast APIs to data that need to be scraped from deep inside of 1990s html pages. In this 3 hour tutorial you will learn how to programmatically read in various types of web data from experts in the field (Founders of the rOpenSci project and the training lead of RStudio). By the end of the tutorial you will have a basic idea of how to wrap an R package around a standard API, extract common non-standard data formats, and scrape data into tidy data frames from web pages.

Covers other resources and materials.
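The “standard API” portion of the tutorial boils down to: send a request, parse the JSON, flatten it into a tidy table. A hedged Python sketch of the flattening step (the payload is canned; a real call would fetch it from the API first):

```python
import json

def tidy_rows(payload, fields):
    """Flatten a JSON API response (a list of records) into uniform rows,
    one tuple per record, with missing fields as None."""
    records = json.loads(payload)
    return [tuple(rec.get(field) for field in fields) for rec in records]

# Canned response standing in for a live API result.
payload = '[{"name": "rOpenSci", "lang": "R"}, {"name": "rvest", "lang": "R"}]'
rows = tidy_rows(payload, ["name", "lang"])
```

In R the same shape falls out of jsonlite plus a data frame; the tutorial's “tidy data frames” are exactly this: one row per record, one column per field.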


4330 Data Scientists and No Data Science Renee

Monday, April 11th, 2016

After I posted 1880 Big Data Influencers in CSV File, I got a tweet from Data Science Renee pointing out that her name wasn’t in the list.

Renee does a lot more on “data science” and not so much on “big data,” which sounded like a plausible explanation.

Even if “plausible,” I wanted to know if there was some issue with my scraping of Right Relevance.

Knowing that Renee’s influence score for “data science” is 81, I set the query to scrape the list between 65 and 98, just to account for any oddities in being listed.

The search returned 1832 entries. Search for Renee, nada, no got. Here’s the 1832-data-science-list.

In an effort to scrape all the listings, which should be 10,375 influencers, I set the page delay up to Ted Cruz reading speed. Ten entries every 72,000 milliseconds. 😉

That resulted in 4330-data-science-list.

No joy, no Renee!
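For the curious, the throttled paging amounts to this pattern; the fetcher below is a stand-in for Web Scraper's actual requests:

```python
import time

def scrape_pages(fetch, n_pages, delay_ms=72000, sleep=time.sleep):
    """Fetch pages one at a time, pausing between requests to stay polite."""
    results = []
    for page in range(1, n_pages + 1):
        results.extend(fetch(page))
        if page < n_pages:
            sleep(delay_ms / 1000.0)   # wait before the next request
    return results

# Stand-in fetcher; a real one would issue the HTTP request for that page.
fake_pages = {1: ["alice", "bob"], 2: ["carol"]}
names = scrape_pages(fake_pages.get, 2, delay_ms=0)
```

Injecting the fetch and sleep functions makes the loop testable without a network, and makes it obvious the delay is the only tunable part.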

It isn’t clear to me why my scraping fails before recovering the entire data set, but in any reasonable sort order, a listing of roughly 10K data scientists should have Renee in the first 100 entries, let alone the first 1,000 or the first 4K.

Something is clearly amiss with the data but what?

Check me on the first ten entries for data science as the search term but I find:

  • Hilary Mason
  • Kirk Borne – no data science
  • Nathan Yau
  • Gregory Piatetsky – no data science
  • Randy Olson
  • Jeff Hammerbacher – no data science
  • Chris Dixon @cdixon – no data science
  • dj patil @dpatil
  • Doug Laney – no data science
  • Big Data Science – no data science

The notation, “no data science,” means that entry does not have a label for data science. Odd considering that my search was specifically for influencers in “data science.” The same result obtains if you choose one of the labels instead of searching. (I tried.)

Clearly all of these people could be listed for “data science,” but if I am searching for that specific category, why is that missing from six of the first ten “hits?”

As far as Data Science Renee, I can help you with that to a degree. Follow @BecomingDataSci, or @DataSciGuide, @DataSciLearning & @NewDataSciJobs. Visit her website: Podcasts, interviews, posts, just a hive of activity.

On the mysteries of Right Relevance and its data I’m not sure what to say. I posted feedback a week ago mentioning the issue with scraping and ordering, but haven’t heard back.

The site has a very clever idea but looking in from the outside with a sample size of 1, I’m not impressed with its delivery on that idea.

Issues I don’t know about with Web Scraper?

If you have contacts with Right Relevance could you gently ping them for me? Thanks!

1880 Big Data Influencers in CSV File

Friday, April 8th, 2016

If you aren’t familiar with Right Relevance, you are missing an amazing resource for cutting through content clutter.

Starting at the default homepage:


You can search for “big data” and the default result screen appears:


If you switch to “people,” the following screen appears:


The “topic score” line moves, so you can require a higher or lower score for inclusion in the listing. That is helpful if you want only the top people, articles, etc. on a topic or want to reach deeper into the pool of data.

As of yesterday, if you set the “topic score” to the range 70 to 98, the number of people influencers was 1880.

The interface allows you to follow and/or tweet to any of those 1880 people, but only one at a time.

I submitted feedback to Right Relevance on Monday of this week pointing out how useful lists of Twitter handles could be for creating Twitter seed lists, etc., but have not gotten a response.

Part of my query to Right Relevance concerned the failure of a web scraper to match the totals listed in the interface (a far lower number of results than expected).

In the absence of an answer, I continue to experiment with the Web Scraper extension for Chrome to extract data from the site.

Caveat: In order to set the delay for requests in Web Scraper, I have found the settings under “Scrape” ineffectual:


In order to induce enough delay to capture the entire list, I set the delay in the exported sitemap (in JSON) and then imported it into another sitemap. Could have reached the same point by setting the delay under the top selector, which was also set to SelectorElementScroll.

To successfully retrieve the entire list, that delay setting was 16000 milliseconds.
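Editing the exported sitemap is just a JSON edit. A sketch (the sitemap is trimmed and its fields only approximate Web Scraper's export format, so treat them as illustrative):

```python
import json

# A trimmed-down sitemap; field names approximate Web Scraper's exported JSON.
sitemap = json.loads("""
{"_id": "bigdata-right-relevance",
 "startUrl": ["http://example.com/search"],
 "selectors": [{"id": "rows", "type": "SelectorElementScroll",
                "selector": "div.row", "delay": 2000}]}
""")

# Raise the scroll delay so the full list has time to load before scraping.
for selector in sitemap["selectors"]:
    if selector.get("type") == "SelectorElementScroll":
        selector["delay"] = 16000

edited = json.dumps(sitemap)   # re-import this JSON as a new sitemap
```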

There may be more performant solutions but since it ran in a separate browser tab and notified me of completion, time wasn’t an issue.

I created a sitemap that obtains the user’s name, Twitter handle and number of Twitter followers, bigdata-right-relevance.txt.

Oh, the promised 1880-big-data-influencers.csv. (File renamed post-scraping due to naming constraints in Web Scraper.)

At best I am a casual user of Web Scraper so suggestions for improvements, etc., are greatly appreciated.

101 webscraping and research tasks for the data journalist

Monday, August 17th, 2015

101 webscraping and research tasks for the data journalist by Dan Nguyen.

From the webpage:

This repository contains 101 Web data-collection tasks in Python 3 that I assigned to my Computational Journalism class in Spring 2015 to give them regular exercise in programming and conducting research, and to expose them to the variety of data published online.

The hard part of many of these tasks is researching and finding the actual data source. The scripts need only concern themselves with fetching the data and printing the answer in the least painful way possible. Since the Computational Journalism class wasn’t intended to be an actual programming class, adherence to idioms and best code practices was not emphasized…(especially since I’m new to Python myself!)

Too good of an idea to not steal! Practical and immediate results, introduction to coding, etc.

What 101 tasks do you want to document and with what tool?

PS: The Computational Journalism class site has a nice set of online references for Python.

Harvesting Listicles

Friday, May 22nd, 2015

Scrape website data with the new R package rvest by

From the post:

Copying tables or lists from a website is not only a painful and dull activity but it’s error prone and not easily reproducible. Thankfully there are packages in Python and R to automate the process. In a previous post we described using Python’s Beautiful Soup to extract information from web pages. In this post we take advantage of a new R package called rvest to extract addresses from an online list. We then use ggmap to geocode those addresses and create a Leaflet map with the leaflet package. In the interest of coding local, we opted to use, as the example, data on wineries and breweries here in the Finger Lakes region of New York.

Lists and listicles are a common form of web content. Unfortunately, both are difficult to improve without harvesting the content and recasting it.

This post will put you on the right track to harvesting with rvest!

BTW, as a benefit to others, post data that you clean/harvest in a clean format. Yes?

rvest: easy web scraping with R

Monday, November 24th, 2014

rvest: easy web scraping with R

rvest is a new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces.

Great overview of rvest and its use for web scraping in R.

Axiom: You will have web scraping with you always. 😉 Not only because we are lazy, but disorderly to boot.

At CRAN: (Author: Hadley Wickham)

‘Magic’ – Quickest Way to Turn Webpage Into Data

Thursday, November 6th, 2014

‘Magic’ – Quickest Way to Turn Webpage Into Data

From the post: recently launched its newest feature called ‘Magic’.

The tool, which they are providing free of charge, is useful for transforming web page(s) into a table that can be downloaded as a static CSV or accessed via a live API.

To use Magic, users simply need to paste in a URL, hit the “Get Data” button and’s algorithms will turn that page into a table of data. The user does not need to download or install anything.

I tried this on a JIRA page (no go) but it worked fine on CEUR Workshop Proceedings.

I will be testing it on a number of pages. Could be the very thing for one page or lite mining that comes up.


I first saw this in a tweet by Christophe Lalanne.

Scrape the Gibson: Python skills for data scrapers

Monday, October 13th, 2014

Scrape the Gibson: Python skills for data scrapers by Brian Abelson.

From the post:

Two years ago, I learned I had superpowers. Steve Romalewski was working on some fascinating analyses of CitiBike locations and needed some help scraping information from the city’s data portal. Cobbling together the little I knew about R, I wrote a simple scraper to fetch the json files for each bike share location and output it as a csv. When I opened the clean data in Excel, the feeling was tantamount to this scene from Hackers:

Ever since then I’ve spent a good portion of my life scraping data from websites. From movies, to bird sounds, to missed connections, and john boards (don’t ask, I promise it’s for good!), there’s not much I haven’t tried to scrape. In many cases, I don’t even analyze the data I’ve obtained, and the whole process amounts to a nerdy version of sport hunting, with my comma-delimited trophies mounted proudly on Amazon S3.

Important post for two reasons:

  • Good introduction to the art of scraping data
  • Set the norm for sharing scraped data
    • The people who force scraping of data don’t want it shared, combined, merged or analyzed.

      You can help in disappointing them! 😉
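The fetch-JSON, emit-CSV step Abelson describes is a small amount of code. A sketch with invented station records standing in for the fetched JSON files:

```python
import csv
import io
import json

def json_records_to_csv(json_blobs, fields):
    """Merge one JSON record per location into a single CSV string."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for blob in json_blobs:
        writer.writerow(json.loads(blob))
    return out.getvalue()

# Invented records; a real scraper would have fetched one blob per station.
blobs = ['{"id": 1, "name": "W 52 St", "docks": 39}',
         '{"id": 2, "name": "Franklin St", "docks": 33}']
csv_text = json_records_to_csv(blobs, ["id", "name", "docks"])
```

Opening the result in Excel is the Hackers moment the post describes.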

Scrapy and Elasticsearch

Wednesday, July 30th, 2014

Scrapy and Elasticsearch by Florian Hopf.

From the post:

On 29.07.2014 I gave a talk at Search Meetup Karlsruhe on using Scrapy with Elasticsearch, the slides are here. This post evolved from the talk and introduces you to web scraping and search with Scrapy and Elasticsearch.

Web Crawling

You might think that web crawling and scraping only is for search engines like Google and Bing. But a lot of companies are using it for different purposes: Price comparison, financial risk information and portals all need a way to get the data. And at least sometimes the way is to retrieve it through some public website. Besides these cases where the data is not in your hand it can also make sense if the data is aggregated already. For intranet and portal search engines it can be easier to just scrape the frontend instead of building data import facilities for different, sometimes even old systems.

The Example

In this post we are looking at a rather artificial example: Crawling the page for recent meetups to make them available for search. Why artificial? Because has an API that provides all the data in a more convenient way. But imagine there is no other way and we would like to build a custom search on this information, probably by adding other event sites as well. (emphasis in original)

Not everything you need to know about Scrapy but enough to get you interested.

APIs for data are on the up swing but web scrapers will be relevant to data mining for decades to come.
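The indexing half of that pipeline is worth a sketch: Elasticsearch's _bulk endpoint takes newline-delimited JSON, one action line plus one document line per item. (The index name and fields below are illustrative; a real run would POST this body to the cluster's /_bulk endpoint.)

```python
import json

def bulk_index_body(index, docs):
    """Build an Elasticsearch _bulk request body: one action line plus one
    source line per document, newline-delimited."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"   # the bulk API requires a trailing newline

# An illustrative scraped item, shaped like the meetup data in the talk.
docs = [{"title": "Search Meetup Karlsruhe", "date": "2014-07-29"}]
body = bulk_index_body("meetups", docs)
```

Scrapy item pipelines are a natural place to accumulate items and flush them in bulk batches like this.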

Crowdscraping – You Game?

Tuesday, July 8th, 2014

Launching #FlashHacks: a crowdscraping movement to release 10 million data points in 10 days. Are you in? by Hera.

From the post:

The success story that is OpenCorporates is very much a team effort – not just the tiny OpenCorporates core team, but the whole open data community, who from the beginning have been helping us in so many ways, from writing scrapers for company registers, to alerting us when new data is available, to helping with language or data questions.

But one of the most common questions has been, “How can I get data into OpenCorporates“. Given that OpenCorporates‘ goal is not just every company in the world but also all the public data that relates to those companies, that’s something we’ve wanted to allow, as we would not achieve that alone, and it’s something that will make OpenCorporates not just the biggest open database of company data in the world, but the biggest database of company data, open or proprietary.

To launch this new era in corporate data, we are launching a #FlashHacks campaign.

Flash What? #FlashHacks.

We are inviting all Ruby and Python botwriters to help us crowdscrape 10 million data points into OpenCorporates in 10 days.

How you can join the crowdscraping movement

  • Join and sign up!
  • Have a look at the datasets we have listed on the Campaign page as inspiration. You can either write bots for these or even chose your own!
  • Sign up to a mission! Send a tweet pledge to say you have taken on a mission.
  • Write the bot and submit on the platform.
  • Tweet your success with the #FlashHacks tag! Don’t forget to upload the FlashHack design as your twitter cover photo and facebook cover photo to get more people involved.

Join us on our Google Group, share problems and solutions, and help build the open corporate data community.

If you are interested in covering this story, you can view the press release here.

Also of interest: Ruby and Python coders – can you help us?

To join this crowdscrape, sign up at:

Tweet, email, post, etc.

Could be the start of a new social activity, the episodic crowdscrape.

Are crowdscrapes an answer to massive data dumps from corporate interests?

I first saw this in a tweet by Martin Tisne.

Large Scale Web Scraping

Friday, May 9th, 2014

We Just Ran Twenty-Three Million Queries of the World Bank’s Website – Working Paper 362 by Sarah Dykstra, Benjamin Dykstra, and Justin Sandefur.


Much of the data underlying global poverty and inequality estimates is not in the public domain, but can be accessed in small pieces using the World Bank’s PovcalNet online tool. To overcome these limitations and reproduce this database in a format more useful to researchers, we ran approximately 23 million queries of the World Bank’s web site, accessing only information that was already in the public domain. This web scraping exercise produced 10,000 points on the cumulative distribution of income or consumption from each of 942 surveys spanning 127 countries over the period 1977 to 2012. This short note describes our methodology, briefly discusses some of the relevant intellectual property issues, and illustrates the kind of calculations that are facilitated by this data set, including growth incidence curves and poverty rates using alternative PPP indices. The full data can be downloaded at

That’s what I would call large scale web scraping!

Useful model to follow for many sources, such as the U.S. Department of Agriculture. A gold mine of reports, data, statistics, but all broken up for the manual act of reading. Or at least that is a charitable explanation for their current data organization.
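The kind of calculation those 10,000 CDF points enable, e.g. a poverty headcount at an arbitrary line, is simple linear interpolation. A sketch with toy numbers (not the paper's data):

```python
def headcount_ratio(cdf_points, poverty_line):
    """Share of the population below a poverty line, by linear interpolation
    over sorted (income, cumulative_share) points."""
    pts = sorted(cdf_points)
    if poverty_line <= pts[0][0]:
        return pts[0][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= poverty_line <= x1:
            return y0 + (y1 - y0) * (poverty_line - x0) / (x1 - x0)
    return pts[-1][1]

# Toy distribution: 20% live below $1/day, 50% below $2/day, 90% below $4/day.
cdf = [(1.00, 0.20), (2.00, 0.50), (4.00, 0.90)]
rate = headcount_ratio(cdf, 1.50)
```

Swap in a different PPP-adjusted poverty line and the same points answer a different question, which is exactly the flexibility the authors wanted.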

Web Scraping: working with APIs

Tuesday, March 18th, 2014

Web Scraping: working with APIs by Rolf Fredheim.

From the post:

APIs present researchers with a diverse set of data sources through a standardised access mechanism: send a pasted together HTTP request, receive JSON or XML in return. Today we tap into a range of APIs to get comfortable sending queries and processing responses.

These are the slides from the final class in Web Scraping through R: Web scraping for the humanities and social sciences

This week we explore how to use APIs in R, focusing on the Google Maps API. We then attempt to transfer this approach to query the Yandex Maps API. Finally, the practice section includes examples of working with the YouTube V2 API, a few ‘social’ APIs such as LinkedIn and Twitter, as well as APIs less off the beaten track (Cricket scores, anyone?).
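Whatever the client language, the mechanics are the ones Rolf describes: paste together an HTTP request, parse the JSON or XML that comes back. A Python sketch of the request-building half (the endpoint and the old sensor parameter are shown for illustration only):

```python
from urllib.parse import urlencode

def geocode_url(address,
                base="https://maps.googleapis.com/maps/api/geocode/json"):
    """Assemble a geocoding-style GET request URL."""
    return base + "?" + urlencode({"address": address, "sensor": "false"})

url = geocode_url("10 Downing Street, London")
```

urlencode handles the escaping (spaces, commas) that hand-pasted query strings tend to get wrong.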

The final installment of Rolf’s course for humanists. He promises to repeat it next year. Should be interesting to see how techniques and resources evolve over the next year.

Forward the course link to humanities and social science majors.

Web Scraping part2: Digging deeper

Tuesday, February 25th, 2014

Web Scraping part2: Digging deeper by Rolf Fredheim.

From the post:

Slides from the second web scraping through R session: Web scraping for the humanities and social sciences.

In which we make sure we are comfortable with functions, before looking at XPath queries to download data from newspaper articles. Examples including BBC news and Guardian comments.

Download the .Rpres file to use in Rstudio here.

A regular R script with the code only can be accessed here.

A great part 2 on web scrapers!

Web-Scraping: the Basics

Friday, February 21st, 2014

Web-Scraping: the Basics by Rolf Fredheim.

From the post:

Slides from the first session of my course about web scraping through R: Web scraping for the humanities and social sciences

Includes an introduction to the paste function, working with URLs, functions and loops.

Putting it all together we fetch data in JSON format about Wikipedia page views from

Solutions here:

Download the .Rpres file to use in Rstudio here
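The exercise of pasting together URLs in a loop and summing JSON page views looks like this in Python (the stats.grok.se URL pattern is my recollection of that era's API, and the response here is canned):

```python
import json

def pageview_urls(article, year, months):
    """One stats URL per month, following the stats.grok.se pattern."""
    return ["http://stats.grok.se/json/en/%d%02d/%s" % (year, month, article)
            for month in months]

urls = pageview_urls("Web_scraping", 2013, [1, 2])

# Parsing a canned response and summing the daily views:
sample = '{"daily_views": {"2013-01-01": 420, "2013-01-02": 390}}'
total = sum(json.loads(sample)["daily_views"].values())
```

The R version in the slides does the same with paste() and a loop; the concepts transfer one-for-one.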

Hard to say how soon but eventually data in machine readable formats is going to be the default and web scraping will be a historical footnote.

But it hasn’t happened yet so pass this on to newbies who need advice.

Snowden Used Low-Cost Tool to Best N.S.A.

Sunday, February 9th, 2014

Snowden Used Low-Cost Tool to Best N.S.A. by David E. Sanger and Eric Schmitt.

From the post:

Intelligence officials investigating how Edward J. Snowden gained access to a huge trove of the country’s most highly classified documents say they have determined that he used inexpensive and widely available software to “scrape” the National Security Agency’s networks, and kept at it even after he was briefly challenged by agency officials.

Using “web crawler” software designed to search, index and back up a website, Mr. Snowden “scraped data out of our systems” while he went about his day job, according to a senior intelligence official. “We do not believe this was an individual sitting at a machine and downloading this much material in sequence,” the official said. The process, he added, was “quite automated.”

The findings are striking because the N.S.A.’s mission includes protecting the nation’s most sensitive military and intelligence computer systems from cyberattacks, especially the sophisticated attacks that emanate from Russia and China. Mr. Snowden’s “insider attack,” by contrast, was hardly sophisticated and should have been easily detected, investigators found.

Moreover, Mr. Snowden succeeded nearly three years after the WikiLeaks disclosures, in which military and State Department files, of far less sensitivity, were taken using similar techniques.

Mr. Snowden had broad access to the N.S.A.’s complete files because he was working as a technology contractor for the agency in Hawaii, helping to manage the agency’s computer systems in an outpost that focuses on China and North Korea. A web crawler, also called a spider, automatically moves from website to website, following links embedded in each document, and can be programmed to copy everything in its path.

A highly amusing article that explains the ongoing Snowden leaks and offers a basis for projecting when they will stop… not any time soon! The suspicion is that Snowden may have copied 1.7 million files.

Not with drag-n-drop but using a program!
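The “program” in question is conceptually tiny. A crawler is a breadth-first walk over links; here is a sketch run against an in-memory site so it works without a network:

```python
from collections import deque

def crawl(start, get_links):
    """Breadth-first crawl: visit each reachable page exactly once."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# In-memory stand-in for fetching a page and extracting its links.
site = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": [], "/c": ["/"]}
visited = crawl("/", lambda page: site.get(page, []))
```

Replace get_links with a real fetch-and-parse function and add a save step, and you have the “quite automated” process the officials describe.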

I’m sure that was news to a lot of managers in both industry and government.

Now of course the government is buttoning up all the information (allegedly), which will hinder access to materials by those with legitimate need.

It’s one thing to have these “true to your school” types in management at agencies where performance isn’t expected or tolerated. But in a spy agency that you are trying to use to save your citizens from themselves, that’s just self-defeating.

The real solution for the NSA and any other agency where you need high grade operations is to institute an Apache meritocracy process to manage both projects and to fill management slots. It would not be open source or leak to the press, at least not any more than it does now.

The upside would be the growth, over a period of years, of highly trained and competent personnel who would institute procedures that assisted with their primary functions, not simply to enable the hiring of contractors.

It’s worth a try, the NSA could hardly do worse than it is now.

PS: I do think the NSA is violating the U.S. Constitution but the main source of my ire is their incompetence in doing so. Gathering up phone numbers because they are easy to connect for example. Drunks under the streetlight.

PPS: This is also a reminder that it isn’t the cost/size of the tool but the effectiveness with which it is used that makes a real difference.

How NetFlix Reverse Engineered Hollywood [+ Perry Mason Mystery]

Saturday, January 4th, 2014

How NetFlix Reverse Engineered Hollywood by Alexis C. Madrigal.

From the post:

If you use Netflix, you’ve probably wondered about the specific genres that it suggests to you. Some of them just seem so specific that it’s absurd. Emotional Fight-the-System Documentaries? Period Pieces About Royalty Based on Real Life? Foreign Satanic Stories from the 1980s?

If Netflix can show such tiny slices of cinema to any given user, and they have 40 million users, how vast did their set of “personalized genres” need to be to describe the entire Hollywood universe?

This idle wonder turned to rabid fascination when I realized that I could capture each and every microgenre that Netflix’s algorithm has ever created.

Through a combination of elbow grease and spam-level repetition, we discovered that Netflix possesses not several hundred genres, or even several thousand, but 76,897 unique ways to describe types of movies.

There are so many that just loading, copying, and pasting all of them took the little script I wrote more than 20 hours.

We’ve now spent several weeks understanding, analyzing, and reverse-engineering how Netflix’s vocabulary and grammar work. We’ve broken down its most popular descriptions, and counted its most popular actors and directors.

To my (and Netflix’s) knowledge, no one outside the company has ever assembled this data before.

What emerged from the work is this conclusion: Netflix has meticulously analyzed and tagged every movie and TV show imaginable. They possess a stockpile of data about Hollywood entertainment that is absolutely unprecedented. The genres that I scraped and that we caricature above are just the surface manifestation of this deeper database.

If you like data mining war stories in detail, then you will love this post by Alexis.

Along the way you will learn about:

  • Ubot Studio – Web scraping.
  • AntConc – Linguistic software.
  • Exploring other information to infer tagging practices.
  • More details about Netflix genres in general terms.

Be sure to read to the end to pick up on the Perry Mason mystery.

The Perry Mason mystery:

Netflix’s Favorite Actors (by number of genres)

  1. Raymond Burr (who played Perry Mason)
  2. Bruce Willis
  3. George Carlin
  4. Jackie Chan
  5. Andy Lau
  6. Robert De Niro
  7. Barbara Hale (also on Perry Mason)
  8. Clint Eastwood
  9. Elvis Presley
  10. Gene Autry

Question: Why is Raymond Burr in more genres than any other actor?
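The ranking itself is just counting actors across genre lists. A sketch with a toy genre-to-cast mapping (invented, but it shows why two Perry Mason regulars can top such a list):

```python
from collections import Counter

# Invented genre-to-cast mapping; the post's ranking is the same idea
# run over all 76,897 scraped micro-genres.
genres = {
    "Courtroom Dramas": ["Raymond Burr", "Barbara Hale"],
    "Classic TV Mysteries": ["Raymond Burr", "Barbara Hale"],
    "Action Thrillers": ["Bruce Willis"],
}
counts = Counter(actor for cast in genres.values() for actor in cast)
```

Counting appearances across genre labels is all the ranking measures, so it rewards regulars on long-running series that straddle many micro-genres.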

Some additional reading for this post: Selling Blue Elephants

Just as a preview, the “Blue Elephants” book/site is about selling what consumers want to buy. Not about selling what you think is a world saving idea. Those are different. Sometimes very different.

I first saw this in a tweet by Gregory Piatetsky.

Web-Scraper for Google Scholar Updated!

Saturday, September 8th, 2012

Web-Scraper for Google Scholar Updated! by Kay Cichini.

From the post:

I have updated the Google Scholar Web-Scraper Function GScholarScaper_2 to GScholarScraper_3 (and GScholarScaper_3.1) as it was deprecated due to changes in the Google Scholar html-code. The new script is more slender and faster. It returns a dataframe or optionally a CSV-file with the titles, authors, publications & links. Feel free to report bugs, etc.

An R function to use in web-scraping Google Scholar.

Remember, anything we can “see,” can be “mapped.”

Clunky interfaces and not using REST can make capture/mapping more difficult but local delivery means data can be captured and mapped. (full stop)

Web scraping with Python – the dark side of data

Thursday, December 29th, 2011

Web scraping with Python – the dark side of data

From the post:

In searching for some information on web-scrapers, I found a great presentation given at Pycon in 2010 by Asheesh Laroia. I thought this might be a valuable resource for R users who are looking for ways to gather data from user-unfriendly websites.

“..user-unfriendly websites.”? What about “user-hostile websites?” 😉

Looks like a good presentation up to “user-unfriendly.”

It will be useful for anyone who needs data from sites that are not configured to deliver it properly (that is, to users).

I suppose “user-hostile” would fall under some prohibited activity.

Would make a great title for a book: “Penetration and Mapping of Hostile Hosts.” Could map vulnerable hosts with their exploits as a network graph.