Let’s extract a table from a PDF file

Jul 22, 2021

Some annual reports contain important tables that we need to extract for further analysis. Using the ‘tabulizer’ package, we are going to extract the list of countries with their hunger index scores from the GHI 2020 report.

The table of interest is located on page 11.

So let’s get to work:

library(tabulizer)
library(xlsx)

Load the PDF from its URL:

pdf <- "https://www.globalhungerindex.org/pdf/en/2020.pdf"

The minimal command for extracting the table is:

ghi2020 <- extract_tables(pdf, pages = 11, output = "data.frame")
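Note that extract_tables() returns a list of data frames rather than a single one, so a page with more than one detected table yields several elements. Below is a minimal sketch of how one might pick the main table out of such a list. The helper name pick_largest_table and the mock values are assumptions made for illustration; they are not part of tabulizer and not real GHI data.

```r
# Mock stand-in for the list that extract_tables() returns
# (made-up values, not taken from the GHI report)
tables <- list(
  data.frame(Rank = 1:2, Country = c("A", "B"), Score = c(5.1, 6.3)),
  data.frame(Note = "footnote text")
)

# Hypothetical helper: return the data frame with the most rows
pick_largest_table <- function(tables) {
  stopifnot(is.list(tables), length(tables) > 0)
  n_rows <- vapply(tables, nrow, integer(1))
  tables[[which.max(n_rows)]]
}

main_table <- pick_largest_table(tables)
str(main_table)
```

If the page holds a single table, taking the first element of the list directly does the same job.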

Job done! Let us save the table to an Excel file:

# extract_tables() returns a list of tables; write the first one
write.xlsx(ghi2020[[1]], file = "ghi2020.xlsx")
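If an Excel file is not strictly needed, a plain CSV written with base R is a possible alternative that needs no extra package. The sketch below uses a made-up data frame in place of the extracted table and a temporary path, both assumptions for illustration, and reads the file back to check that nothing was lost.

```r
# Made-up stand-in for the extracted table (not real GHI data)
scores <- data.frame(Country = c("A", "B"), Score = c(5.1, 6.3))

# Write to a temporary location without the row-name column
csv_path <- file.path(tempdir(), "ghi2020.csv")
write.csv(scores, file = csv_path, row.names = FALSE)

# Read the file back to confirm the round trip
check <- read.csv(csv_path, stringsAsFactors = FALSE)
print(check)
```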

Thanks to rOpenSci for this lifesaver!


Written by infoart.ca
