rvest is an R package that makes it easy to scrape text from web pages.
This code is from the GitHub page for the package. It shows how to scrape the rating, cast, and poster for The Lego Movie from IMDb.
library(rvest)

lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")

rating <- lego_movie %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating
## [1] 7.8

cast <- lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
cast
## [1] "Will Arnett" "Elizabeth Banks" "Craig Berry"
## [4] "Alison Brie" "David Burrows" "Anthony Daniels"
## [7] "Charlie Day" "Amanda Farinos" "Keith Ferguson"
## [10] "Will Ferrell" "Will Forte" "Dave Franco"
## [13] "Morgan Freeman" "Todd Hansen" "Jonah Hill"

# Scrape the website for the url of the movie poster
poster <- lego_movie %>%
  html_nodes("#img_primary img") %>%
  html_attr("src")
poster

CSS selector

The trick to all of this is the text you put in the html_nodes function.
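To see what each selector string picks out, it can help to try them on a small page you control. The sketch below is only an illustration: the HTML fragment is made up for this example (it is not IMDb's real markup, and `poster.jpg` is a placeholder), but the three selectors are the same ones used above.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a real page;
# the structure is an assumption, not IMDb's actual markup.
page <- read_html('
  <html><body>
    <strong><span>7.8</span></strong>
    <div id="titleCast">
      <div class="itemprop"><span>Will Arnett</span></div>
      <div class="itemprop"><span>Elizabeth Banks</span></div>
    </div>
    <div id="img_primary"><img src="poster.jpg"></div>
  </body></html>
')

# "strong span": every <span> nested inside a <strong>
rating <- page %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()

# "#titleCast .itemprop span": <span> elements inside an element with
# class "itemprop", which is itself inside the element with id "titleCast"
cast <- page %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()

# "#img_primary img": <img> inside the element with id "img_primary";
# html_attr() then reads its src attribute instead of its text
poster <- page %>%
  html_nodes("#img_primary img") %>%
  html_attr("src")
```

A space in a selector means "somewhere inside", `#` matches an id, and `.` matches a class, so each string reads from the outermost container to the node you want.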
If you’ve been introduced to R as a simple way to do data analysis you might have come across this strange operator, %>%. It’s called a pipe because it passes data from one function to another. Here’s an example of subsetting and transforming data using the pipe from the magrittr package:
library(magrittr)

dat <- airquality %>%
  subset(Ozone > 40) %>%
  transform(Celsius = (Temp - 32) * (5/9)) %>%
  head()
dat
## Ozone Solar.