Let’s extract a table from a PDF file
Jul 22, 2021
Some annual reports come with important tables that we need to extract for further analysis. Using the ‘tabulizer’ package, we are going to extract the list of countries and their hunger index scores from the Global Hunger Index (GHI) 2020 report:
The table of interest is located on page 11:
So let’s get to work:
library(tabulizer)
library(xlsx)
Load the PDF from its URL:
pdf <- "https://www.globalhungerindex.org/pdf/en/2020.pdf"
The minimal command for extracting the table is:
ghi2020 <- extract_tables(pdf, pages = 11, output = "data.frame")
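Before saving anything, it can help to look at what came back. With output = "data.frame", extract_tables() returns a list of data frames, one per detected table, and the columns typically arrive as text. The sketch below uses a small stand-in list so it runs without the PDF; the rows, the column names and the "<5" notation for censored scores are illustrative assumptions, not values taken from the report.

```r
# Stand-in for the list returned by extract_tables(); the rows are
# hypothetical placeholders, not values from the report.
tables <- list(
  data.frame(
    Rank    = c("1", "2", "3"),
    Country = c("Country A", "Country B", "Country C"),
    Score   = c("<5", "12.3", "27.1"),
    stringsAsFactors = FALSE
  )
)

length(tables)            # how many tables were detected on the page
ghi_table <- tables[[1]]  # take the first table as a plain data frame
str(ghi_table)            # check column names and types

# Extracted columns usually arrive as text. Flag censored entries such as
# "<5" (assumed notation), then convert the remaining scores to numbers.
ghi_table$below_5   <- grepl("^<", ghi_table$Score)
ghi_table$score_num <- suppressWarnings(as.numeric(ghi_table$Score))
```

With the real extraction, replace tables with ghi2020 and adjust the column names to whatever str() reports.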
Job done! extract_tables() returns a list of data frames, one per detected table, so let us save the first one and see the output:
write.xlsx(ghi2020[[1]], file = "ghi2020.xlsx")
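If a plain-text copy is enough, base R can write a CSV instead, which avoids the Java dependency that the xlsx package brings. This is a minimal sketch using a hypothetical stand-in data frame and a temporary file; in practice you would pass the extracted table and a real path such as "ghi2020.csv".

```r
# Hypothetical stand-in for the extracted table.
ghi_table <- data.frame(
  Country = c("Country A", "Country B"),
  Score   = c(12.3, 27.1)
)

# Write to a temporary CSV file (swap in a real path for actual use).
csv_path <- tempfile(fileext = ".csv")
write.csv(ghi_table, csv_path, row.names = FALSE)

# Read the file back to confirm it holds what we expect.
check <- read.csv(csv_path)
head(check)
```

Setting row.names = FALSE keeps R's row numbers out of the file, so the CSV has only the table's own columns.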
Thanks to rOpenSci for this lifesaver!