library(rvest)
library(magick)
library(tidyverse)
library(flextable)
library(pdftools)
library(tesseract)
Web scraping & APIs
Motivation
Navigating the digital age means unlocking the treasure trove of data available online. Web scraping and APIs aren’t just technical skills; they’re your keys to a universe of data for any project you can imagine. Think about the ease of analyzing trends, financial markets, or even sports—all through data you gather and analyze yourself.
In this section, Ilia walks us through the essentials of harnessing web data, offering a powerful alternative for those looking to source unique datasets for their projects. Knowing these techniques empowers you to find and utilize data that sparks your curiosity and fuels your research. Let’s dive in and discover how these tools can transform your approach to data collection.
Video lecture
First, we start with a video lecture given by Ilia on web scraping and the use of APIs during the Text Mining course of 2022. The rest of this page contains a set of practice exercises that were shared during this lecture.
In-class R script shown in the video above 📄
Practice web scraping in R
Unlike the lab sessions, we do not provide the Python code, but the principles behind web scraping in R and Python remain the same.
Using CSS
In this practice, we learn how to use the rvest package to extract information from the famous IMDb (Internet Movie Database) page listing the 50 most popular movies (https://www.imdb.com/search/title/?groups=top_250&sort=user_rating). The page was saved (downloaded) and is also available in the data/ folder. Alternatively, you can work directly on the link. However, bear in mind that the structure of online websites can change over time; therefore, the code below might need adjustments (i.e., changes in the tags).
First, we load the page.
# local file (html)
imdb.html <- read_html("data/IMDb _Top 250.html")
# or alternatively use the link
# imdb.html <- read_html("https://www.imdb.com/search/title/?groups=top_250&sort=user_rating") # webpage
Now, we identify the positions of the titles. On the web page (opened preferably with Chrome) right-click on a title and select “Inspect”. The tag corresponding to the titles appears on the developer window (partially reproduced below).
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt">The Shawshank Redemption</a>
<span class="lister-item-year text-muted unbold">(1994)</span>
</h3>
[...]
</div>
Looking above, the title (“The Shawshank Redemption”) is under the div tag with class="lister-item-content", then the sub-tag h3 within it, then the tag a within that. The html_nodes function can target this tag. The “dot” after div indicates the class value. It actually targets all such tags.
titles <- imdb.html %>%
  html_nodes("div.lister-item-content h3 a")
head(titles)
The results are cleaned of the HTML code (i.e., only the text remains) using the html_text2 function.
titles <- html_text2(titles)
head(titles)
Another way would have been to use the fact that the targeted h3
tags have a class value. Modify the previous code to extract tags a within h3
with class value “lister-item-header”.
Answer
titles <- imdb.html %>%
  html_nodes("h3.lister-item-header a") %>%
  html_text2()
titles
Now, repeat that approach for the year and the run time. You may use the function substr
to extract the year from the text.
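As a quick illustration of the hint (a minimal example, not part of the original script), substr keeps only the characters between two given positions:

substr("(1994)", start = 2, stop = 5)              # returns "1994"
as.numeric(substr("(1994)", start = 2, stop = 5))  # returns 1994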
Answer
For the years:
## Extract the years
years <- imdb.html %>%
  html_nodes("div.lister-item-content h3 .lister-item-year") %>%
  html_text2()
years <- as.numeric(substr(years,
                           start = 2,
                           stop = 5)) # take only characters 2 to 5 corresponding to the year
## years <- as.numeric(gsub("[^0-9.-]", "", years)) # an alternative: keep only the numbers in a string
For the run times, they are first extracted in the format “120 min”. Then, each run time is split by space, which gives “120” and “min”. The unlist command casts this to a vector. Then we keep every other element (the ones containing the minutes).
runtimes <- imdb.html %>%
  html_nodes("span.runtime") %>%
  html_text2()
runtimes <- as.numeric(
  unlist(
    strsplit(runtimes, " ")             # split by space
  )[seq(from = 1, by = 2, len = 50)]    # keep every other element (the minutes)
)
Suppose now that we want to extract the description. In this case, there is no unique class value identifying the field (see the html code). However, one can note that it is the 4th element (a paragraph) within a div tag with a useful class value. To access the k-th element you can use p:nth-child(k), starting from the correct hierarchical position; this selects a p tag that is the k-th child of its parent. For example, p:nth-child(2) selects the 2nd child, provided it is a paragraph.
For the 4th child (i.e., the wanted description), a possible code is thus:
desc <- imdb.html %>%
  html_nodes("div.lister-item-content p:nth-child(4)") %>%
  html_text2()
head(desc)
To finish, we build a data frame containing this information (tibble format below).
imdb.top.50 <- tibble(data.frame(
  Titles = titles,
  Years = years,
  RunTimes = runtimes,
  Desc = desc))
imdb.top.50 %>%
  head() %>%
  flextable() %>%
  autofit()
XPath
In the previous part, we used CSS selectors to identify the tags. We now use an alternative: XPath. XPath is preferably used when we want to extract a specific text. For example, suppose we want to extract the description of the first movie: right-click on it and select “Inspect”. Then right-click the corresponding code line and select “Copy XPath”. Pass this to the xpath parameter of html_nodes, like below:
desc1 <- imdb.html %>%
  html_nodes(xpath = "//*[@id='main']/div/div[3]/div/div[1]/div[3]/p[2]/text()") %>%
  html_text2()
desc1
In the xpath, you must turn the double quotes around main into single quotes, since the whole XPath expression is already enclosed in double quotes.
This is convenient when you want to extract a particular text. You can also use the SelectorGadget extension for Chrome to extract multiple XPaths.
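XPath can also target several nodes at once. As a minimal sketch (the exact expression depends on the current structure of the page and may need adjusting; titles.xpath is just an illustrative name), the titles extracted earlier with CSS can equally be obtained with XPath:

titles.xpath <- imdb.html %>%
  html_nodes(xpath = "//div[@class='lister-item-content']/h3/a") %>%
  html_text2()
head(titles.xpath)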
Parsing PDF files
In this part, we practice text extraction from a PDF file. First, we use the pdf_text function to read the text of the file “cs.pdf”.
cs.text <- pdf_text("data/cs.pdf")
cs.text[[1]]
The resulting object is a vector of strings (one element per page). By inspecting the first one, you see that there are lots of end-of-line characters (\n). Suppose now that we want to extract all the lines separately; we can use the function readr::read_lines, which will split them accordingly.
cs.text <- cs.text %>%
  read_lines()
head(cs.text)
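Depending on the goal, a possible next step (a small optional sketch, not part of the original script) is to remove the extra spaces and empty lines created by the page layout:

cs.lines <- str_squish(cs.text)       # remove leading/trailing and repeated spaces
cs.lines <- cs.lines[cs.lines != ""]  # drop the empty lines
head(cs.lines)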
tabulizer may not work
Please note that the package tabulizer is no longer updated as often. Therefore, even when trying to install it from its GitHub page (https://github.com/ropensci/tabulizer), it may not work. Nevertheless, we have kept the details here, and you can try it on your side or alternatively look for similar packages online.
if (!require("remotes")) {
  install.packages("remotes")
}
# on 64-bit Windows
remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"), INSTALL_opts = "--no-multiarch")
# elsewhere
remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))
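If tabulizer cannot be installed, one possible alternative (a rough sketch only, not used in the rest of this page) is pdf_data() from pdftools, which is already loaded above. It returns, for each page, a data frame with every word and its coordinates; table rows can then be rebuilt by grouping words with roughly the same vertical position:

cs.words <- pdf_data("data/cs.pdf")[[10]]  # words of page 10 with their x/y positions
head(cs.words)
# words sharing (approximately) the same y value belong to the same table row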
Suppose now that we want to extract the table in the Appendix on p. 10. The library tabulizer and its function extract_tables allow us to make a step in this direction.
cs.table <- tabulizer::extract_tables("data/cs.pdf",
                                      output = "data.frame",
                                      pages = 10,
                                      guess = TRUE)
cs.table[[1]][1:10, ] ## first 10 rows of the table (which is stored in the first element of a list)
We can note that special characters (here special dashes “-”) could not be recognized. But the encoding can now be worked out and the table cleaned for further usage. The technique is far from perfect and is unlikely to be of any use for automatic extraction from a large number of documents, unless they are all very well structured.
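For instance, a minimal cleaning sketch (the exact characters to repair depend on your platform and locale; cs.clean is an illustrative name) replaces any non-ASCII character in the character columns by a regular dash:

cs.clean <- cs.table[[1]]
cs.clean[] <- lapply(cs.clean, function(x) {
  if (is.character(x)) gsub("[^ -~]+", "-", x) else x  # replace non-ASCII characters by "-"
})
head(cs.clean)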
Optical Character Recognition (OCR)
In this part, we use the Optical Character Recognition (OCR) functionalities accessible via R to read text from an image. First, we read the image using image_read from the package magick; the result is then passed to the image_ocr function from the package tesseract.
text <- image_read("data/TextFR.png") %>%
  image_ocr()
text
Depending on your OS, you may find several mistakes in the final result. An improvement can be obtained by indicating the language, here French, to image_ocr(). You may need to download the language first:
tesseract::tesseract_download("fra")
text <- image_read("data/TextFR.png") %>%
  image_ocr(language = "fra")
text
It is still not perfect. Further improvement can be obtained by cleaning the image before applying the OCR. This can be done with several tools in the package magick. Below is one proposition:
text <- image_read("data/TextFR.png") %>%
  image_resize("4000x") %>%               ## resize the picture width to 4000 pixels, keeping the ratio
  image_convert(colorspace = 'gray') %>%  ## convert the image to grayscale
  image_trim() %>%                        ## remove edges that are the background color from the image
  image_ocr(language = "fra")
cat(text) ## a more friendly display
The result should be close to perfect.
API
In this final part, you are invited to create your own API key for The Guardian Open Platform and to use it to extract articles using the guardianapi package.
First, register on the open platform and save your key. Then we request articles about “Arsenal Football Club” published between 1 and 22 August 2022.
library(guardianapi)
gu_api_key()
#My Key: "fa9a4ddf-1e70-404f-889c-70ef31414cc5"
#Enter your key
Arsenal <- gu_content("Arsenal Football Club",
                      from_date = "2022-08-01",
                      to_date = "2022-08-22")
As we see with the first article, the text retrieved this way is HTML code. We now turn it into plain text using rvest.
library(rvest)
read_html(Arsenal$body[1]) %>% html_text()
From this point onwards, you can tokenize and analyze the data on your own.
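As one possible starting point (a hedged sketch; it assumes the tidytext package is installed, which is not loaded above, and the object names are illustrative), the article bodies can be converted to plain text and tokenized:

library(tidytext)

arsenal.text <- tibble(
  doc  = seq_along(Arsenal$body),
  text = sapply(Arsenal$body, function(b) html_text(read_html(b)))
)

arsenal.tokens <- arsenal.text %>%
  unnest_tokens(word, text)  # one row per word
head(arsenal.tokens)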