rvest is an R package that makes it easy to scrape text from web pages.
This code is from the GitHub page for the package. It shows how to scrape the rating, cast, and poster for The Lego Movie from IMDb.
library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
rating <- lego_movie %>%
html_nodes("strong span") %>%
html_text() %>%
as.numeric()
rating
## [1] 7.8
cast <- lego_movie %>%
html_nodes("#titleCast .itemprop span") %>%
html_text()
cast
## [1] "Will Arnett" "Elizabeth Banks" "Craig Berry"
## [4] "Alison Brie" "David Burrows" "Anthony Daniels"
## [7] "Charlie Day" "Amanda Farinos" "Keith Ferguson"
## [10] "Will Ferrell" "Will Forte" "Dave Franco"
## [13] "Morgan Freeman" "Todd Hansen" "Jonah Hill"
#Scrape the website for the url of the movie poster
poster <- lego_movie %>%
html_nodes("#img_primary img") %>%
html_attr("src")
poster
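
The selectors above depend on how the IMDb page happened to be marked up, so they may stop matching if the page changes. As a minimal sketch for trying the same three calls offline, the snippet below runs them against a small inline document. The HTML is a made-up stand-in that only mimics the ids and classes used above; it is not IMDb's actual markup.

```r
library(rvest)

# Hypothetical stand-in page: the structure is invented so that the
# selectors from the example above have something to match.
page <- read_html('
  <html><body>
    <strong><span>7.8</span></strong>
    <div id="titleCast">
      <div class="itemprop"><span>Will Arnett</span></div>
      <div class="itemprop"><span>Elizabeth Banks</span></div>
    </div>
    <div id="img_primary"><img src="poster.jpg"></div>
  </body></html>
')

# Text of the <span> inside <strong>, converted to a number
rating <- page %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()

# Text of every matching <span>, returned as a character vector
cast <- page %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()

# html_attr() reads an attribute value instead of the node text
poster <- page %>%
  html_nodes("#img_primary img") %>%
  html_attr("src")
```

Note that html_text() returns what is between the tags, while html_attr("src") returns the value of the src attribute, which is why the poster example uses the latter.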

CSS selector
The trick to all of this is the text you put in the html_nodes function. For example, this bit of code uses the text “#titleStoryLine p” to grab the storyline on the IMDb page.
storyline <- lego_movie %>%
html_nodes("#titleStoryLine p") %>%
html_text()
cat(storyline)
##
## The LEGO Movie is a 3D animated film which follows lead character, Emmet a completely ordinary LEGO mini-figure who is identified as the most "extraordinary person" and the key to saving the Lego universe. Emmet and his friends go on an epic journey to stop the evil tyrant, Lord Business. Written by
## DeAlan Wilson www.ComedyE.com
The text “#titleStoryLine p” is a CSS selector that identifies the part of the page that you want to scrape. Here’s a fun tutorial on how to use selectors to grab elements on a page: http://flukeout.github.io/.
So the text “#titleStoryLine p” selects every “p” element nested inside the element with the id “titleStoryLine”.
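
To see what the space in that selector does, here is a small sketch on an invented document (the HTML is hypothetical, not taken from IMDb). A space matches “p” elements at any depth inside the element, while “>” matches only direct children.

```r
library(rvest)

# Hypothetical markup: one <p> directly inside #titleStoryLine,
# one nested a level deeper, and one outside it entirely.
doc <- read_html('
  <div id="titleStoryLine">
    <p>Storyline paragraph</p>
    <div><p>Nested paragraph</p></div>
  </div>
  <p>Unrelated paragraph</p>
')

# Descendant selector: every <p> anywhere inside #titleStoryLine
inside <- doc %>%
  html_nodes("#titleStoryLine p") %>%
  html_text()

# Child selector: only <p> elements that are direct children
direct <- doc %>%
  html_nodes("#titleStoryLine > p") %>%
  html_text()

inside
direct
```

Here inside holds both the storyline and the nested paragraph, direct holds only the storyline paragraph, and the unrelated paragraph is matched by neither.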
SelectorGadget
But how do you know which CSS selector to use to grab what you want on a page? Well, I could just look at the text in the lego_movie object.
lego_movie
## {xml_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body id="styleguide-v2" class="fixed">\n<script>\n if (typeof ue ...
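Printing the object only shows the top of the document. One slightly more targeted way to look around, sketched here on an invented inline page rather than the real one, is to list the ids that exist, since ids are often good starting points for a selector.

```r
library(rvest)

# Hypothetical page with two elements that carry an id
doc <- read_html('
  <html><body>
    <div id="titleCast"><span>Will Arnett</span></div>
    <div id="titleStoryLine"><p>Storyline paragraph</p></div>
    <p>No id here</p>
  </body></html>
')

# "[id]" is an attribute selector: any element that has an id attribute
ids <- doc %>%
  html_nodes("[id]") %>%
  html_attr("id")

# Tag names of the elements directly under <body>
top_level <- doc %>%
  html_nodes("body > *") %>%
  html_name()

ids
top_level
```

On a real page this list can be long, so it is a rough aid rather than a replacement for inspecting the page in a browser.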
Or I could use http://selectorgadget.com/. It’s a tool for using your cursor to find the selector you need on a webpage. To use it, drag the SelectorGadget bookmarklet link to your bookmark bar. Then go to IMDb, or whatever page you want to scrape, and click the SelectorGadget bookmark. Hover over the element you want to select and click.
In the image below, I’ve hovered over the Storyline paragraph and clicked. The paragraph is green but you can see that another part of the page is highlighted in yellow.
This means that if I use the CSS selector in the box at the bottom of the image (“p”), I will get the Storyline and all of the other text that is highlighted in yellow. To narrow my selection to just the Storyline, I click on the yellow text and exclude it from my selection.
The yellow highlight turns to red, and you can see the CSS selector “#titleStoryLine p” in the box at the bottom of the image.
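The same narrowing can be double-checked in R by counting how many nodes each candidate selector matches before scraping with it. This is only a sketch: count_matches is a hypothetical helper written for this example, and the page is invented.

```r
library(rvest)

# Hypothetical page: one storyline paragraph plus two other paragraphs
doc <- read_html('
  <html><body>
    <div id="titleStoryLine"><p>Storyline paragraph</p></div>
    <div id="other"><p>Some other text</p></div>
    <p>Footer text</p>
  </body></html>
')

# Hypothetical helper: number of nodes a selector matches on a page
count_matches <- function(page, selector) {
  length(html_nodes(page, selector))
}

broad  <- count_matches(doc, "p")                 # every paragraph
narrow <- count_matches(doc, "#titleStoryLine p") # storyline only

broad
narrow
```

If the broad count is larger than expected, the selector is picking up the equivalent of the yellow-highlighted text and needs to be made more specific.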