Web scraping & APIs

Motivation

Navigating the digital age means unlocking the treasure trove of data available online. Web scraping and APIs aren’t just technical skills; they’re your keys to a universe of data for any project you can imagine. Think about the ease of analyzing trends, financial markets, or even sports—all through data you gather and analyze yourself.

In this section, Ilia walks us through the essentials of harnessing web data, offering a powerful alternative for those looking to source unique datasets for their projects. Knowing these techniques empowers you to find and utilize data that sparks your curiosity and fuels your research. Let’s dive in and discover how these tools can transform your approach to data collection.

Video lecture

First, we start with a video lecture given by Ilia on web scraping and the use of APIs during the Text Mining course of 2022. The rest of this page contains a set of practice exercises that were shared during this lecture.


In-class R script shown in the video above 📄

In-class and practice datasets 📁

Selector gadget 🔗

Practice web scraping in R

Note

Unlike the lab sessions, we do not provide the Python code, but the principles behind web scraping in R and Python remain the same.

Using CSS

In this practice, we learn how to use the rvest package to extract information about the 50 most popular movies from the famous IMDb (Internet Movie Database) site (https://www.imdb.com/search/title/?groups=top_250&sort=user_rating). The page was saved (downloaded) and is also available in the data/ folder. Alternatively, you can work directly on the link. However, bear in mind that the structure of online websites can change over time, so the code below might need adjustments (i.e., changes in tags).

First, we load the page.

library(rvest)
library(magick)
library(tidyverse)
library(flextable)
library(pdftools)
library(tesseract)
# local file (html)
imdb.html <- read_html("data/IMDb _Top 250.html") 
# or alternatively use the link
# imdb.html <- read_html("https://www.imdb.com/search/title/?groups=top_250&sort=user_rating") # webpage

Now, we identify the positions of the titles. On the web page (preferably opened with Chrome), right-click on a title and select “Inspect”. The tag corresponding to the titles appears in the developer window (partially reproduced below).

<div class="lister-item-content">
<h3 class="lister-item-header">
    <span class="lister-item-index unbold text-primary">1.</span>
    <a href="https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt">The Shawshank Redemption</a>
    <span class="lister-item-year text-muted unbold">(1994)</span>
</h3>
[...]
</div>

Looking above, the title (“The Shawshank Redemption”) sits under the div tag with class="lister-item-content", then inside the h3 sub-tag within it, and finally inside the a tag. The html_nodes function can target this tag: the dot after div introduces the class value, and the selector actually matches all such tags.

titles <- imdb.html %>%
  html_nodes("div.lister-item-content h3 a") 
head(titles) 

The results are stripped of the HTML code (i.e., only the text remains) using the html_text2 function.

titles <- html_text2(titles)
head(titles)

Another way would have been to use the fact that the targeted h3 tags themselves have a class value. Modify the previous code to extract the a tags within the h3 tags with class value “lister-item-header”.

Answer

titles <- imdb.html %>% 
  html_nodes("h3.lister-item-header a") %>% 
  html_text2()
titles

Now, repeat that approach for the year and the run time. You may use the function substr to extract the year from the text.

Answer

For the years:

## Extract the years
years <- imdb.html %>% 
  html_nodes("div.lister-item-content h3 .lister-item-year") %>%
  html_text2()
years <- as.numeric(substr(years,
                           start = 2,
                           stop = 5))
# take only characters 2 to 5 corresponding to the year
##years <- as.numeric(gsub("[^0-9.-]", "", years)) # an alternative: keep only the numbers in a string

For the run times, they are first extracted in the format “120 min”. Each run time is then split on the space, which gives “120” and “min”. The unlist command flattens the result into a vector, from which we keep every other element (those corresponding to the minutes).

runtimes <- imdb.html %>%
  html_nodes("span.runtime") %>%
  html_text2()
runtimes <- as.numeric(
  unlist(
    strsplit(runtimes, " ")          # split each "120 min" on the space
  )[seq(from = 1, by = 2, len = 50)] # keep every other element (the minutes)
)

Suppose that now we want to extract the description. In this case, there is no unique class value identifying the field (see the html code). However, one can note that it is the 4th paragraph (element) within a div tag with a useful class value. To access the k-th paragraph you can use p:nth-child(k) starting from the correct hierarchical position. For example, p:nth-child(2) extracts the 2nd paragraph.

For the 4th paragraph (i.e., the wanted description), a possible code is thus:

desc <- imdb.html %>% 
  html_nodes("div.lister-item-content p:nth-child(4)") %>% 
  html_text2()
head(desc)

To finish, we build a data frame containing this information (tibble format below).

imdb.top.50 <- tibble(
  Titles = titles,
  Years = years,
  RunTimes = runtimes,
  Desc = desc)
imdb.top.50 %>% 
  head() %>% 
  flextable() %>% 
  autofit()

XPath

In the previous part, we used CSS selectors to identify the tags. We now use an alternative: XPath. XPath is preferable when we want to extract one specific piece of text. For example, to extract the description of the first movie: right-click on it and select “Inspect”. Then right-click the corresponding code line and select “Copy XPath”. Pass this to the xpath parameter of html_nodes as below:

desc1 <- imdb.html %>% 
  html_nodes(xpath="//*[@id='main']/div/div[3]/div/div[1]/div[3]/p[2]/text()") %>% 
  html_text2()
desc1
Note

In the XPath, you must turn the double quotes around main into single quotes.

This is convenient when you want to extract a particular piece of text. You can also use the Selector Gadget from Chrome to obtain XPaths matching multiple elements at once (see the sketch below).
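
As an illustration (a sketch, not part of the original script), an XPath expression can also match many elements at once. Assuming the page structure shown earlier, where the description is the 4th child element of each div with class "lister-item-content", the following reproduces the CSS-based extraction of all descriptions:

## XPath counterpart of the earlier CSS selector: the 4th child element
## of every div with class "lister-item-content" (assumed page structure)
desc.xpath <- imdb.html %>% 
  html_nodes(xpath = "//div[@class='lister-item-content']/*[4]") %>% 
  html_text2()
head(desc.xpath)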

Parsing PDF files

In this part, we practice extracting text from a PDF file. First, we use the pdf_text function to read the text from the file “cs.pdf”.

cs.text <- pdf_text("data/cs.pdf")
cs.text[[1]]

The resulting object is a vector of strings (one element per page). By inspecting the first one, you see that there are lots of end-of-line characters (\n). Suppose now that we want to extract all the lines separately; we can use the function readr::read_lines, which will split them accordingly.

cs.text <- cs.text %>% 
  read_lines()
head(cs.text)
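
As an optional follow-up (not part of the original script), the lines can be lightly cleaned before further processing, for instance by squishing extra whitespace and dropping empty lines:

## optional cleaning: collapse repeated whitespace and drop empty lines
cs.lines <- str_squish(cs.text)
cs.lines <- cs.lines[cs.lines != ""]
head(cs.lines)
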
The package tabulizer may not work

Please note that the package tabulizer is no longer regularly updated. Therefore, even when trying to get it from its GitHub page (https://github.com/ropensci/tabulizer), it may not work. Nevertheless, we have kept the details here, so you can try it on your side or alternatively look for similar packages online.

if (!require("remotes")) {
    install.packages("remotes")
}
# on 64-bit Windows
remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"), INSTALL_opts = "--no-multiarch")
# elsewhere
remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))

Suppose now that we want to extract the table in Appendix p.10. The library tabulizer and its function extract_tables allow us to make a step in this direction.

cs.table <- tabulizer::extract_tables("data/cs.pdf",
  output = "data.frame",
  pages = 10, 
  guess = TRUE
)
cs.table[[1]][1:10,] ## 10 first rows of the table (which is stored in the first element of a list)

We can note that special characters (here special dashes “-”) could not be recognized. However, the encoding can now be worked out and the table cleaned for further usage (a possible cleaning step is sketched below). The technique is far from perfect and is unlikely to be of any use for automatic extraction from a large number of documents, unless they are all very well structured.
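
As a sketch of such a cleaning step, assuming the unrecognized characters are simply mis-encoded dashes, one could transliterate the character columns to ASCII and replace any non-convertible byte with a plain dash:

## cleaning sketch: transliterate character columns to ASCII and replace
## non-convertible bytes by "-" (assumption about the nature of the garbling)
cs.table.clean <- cs.table[[1]] %>% 
  mutate(across(where(is.character),
                ~ iconv(.x, to = "ASCII//TRANSLIT", sub = "-")))
cs.table.clean[1:10, ]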

Optical Character Recognition (OCR)

In this part, we use the Optical Character Recognition (OCR) functionalities accessible via R to read the text from an image. First, we read the image using image_read from the package magick; the result is then passed to the image_ocr function (also in magick, which relies on the tesseract package).

text <- image_read("data/TextFR.png") %>%
  image_ocr()
text

Depending on your OS, you may find several mistakes in the final result. An improvement can be obtained by indicating the language (here French) to image_ocr(). You may need to download the language data first:

tesseract::tesseract_download("fra")
text <- image_read("data/TextFR.png") %>%
  image_ocr(language = "fra")
text

It is still not perfect. Further improvement can be obtained by cleaning the image before applying the OCR. This can be done with several tools from the magick package. Below is one proposal:

text <- image_read("data/TextFR.png") %>%
  image_resize("4000x") %>% ## resize the picture width -> 4000 keeping the ratio
  image_convert(colorspace = 'gray') %>% ## convert the image to grayscale
  image_trim() %>% ## removes edges that are the background color from the image
  image_ocr(language = "fra") 
text
cat(text) ## a more friendly version

The result should be close to perfect.

API

In this final part, you are invited to create your API key for The Guardian Open Platform and to use it to extract articles using the guardianapi package.

First, register with the open platform and save your key string. Then we request articles about “Arsenal Football Club” published between 1 and 22 August 2022.

library(guardianapi)
gu_api_key()
#My Key: "fa9a4ddf-1e70-404f-889c-70ef31414cc5"
#Enter your key
Arsenal <- gu_content("Arsenal Football Club",
                      from_date = "2022-08-01",
                      to_date = "2022-08-22")

As we can see from the first article, the text read in this way is HTML code. We now turn it into plain text using rvest.

library(rvest)
read_html(Arsenal$body[1]) %>% html_text()

From this point onwards, you can tokenize and analyze the data on your own.
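
For instance, a minimal tokenization sketch, assuming the tidytext package is installed and that every body field contains valid HTML, could look like this:

library(tidytext)
## convert each article body from HTML to plain text, then split into word tokens
arsenal.tokens <- Arsenal %>% 
  transmute(text = map_chr(body, ~ read_html(.x) %>% html_text())) %>% 
  unnest_tokens(word, text)
## most frequent tokens
arsenal.tokens %>% 
  count(word, sort = TRUE) %>% 
  head()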