Let’s extract a table from a PDF file

infoart.ca
Jul 22, 2021


Some annual reports contain important tables that we need to extract for further analysis. Using the ‘tabulizer’ package, we are going to extract the list of countries with their hunger index scores from the GHI 2020 report.

The table of interest is located on page 11.

So let’s get to work:

library(tabulizer)
library(xlsx)

Load the PDF from its URL:

pdf <- "https://www.globalhungerindex.org/pdf/en/2020.pdf"

Now the minimal command for extracting the table will be:

extract_tables(pdf, pages = 11, output = "data.frame") -> ghi2020
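
One thing worth knowing: with output = "data.frame", extract_tables() should hand back a list of data frames (one per table it detects on the page) rather than a single data frame. The sketch below shows one way to pull a single table out of such a list. The helper name pick_largest_table and the mock_tables stand-in are hypothetical and only there so the snippet runs without the PDF. The values are placeholders, not figures from the report.

```r
# Hypothetical helper (not part of tabulizer): return the data frame
# with the most rows from a list of extracted tables.
pick_largest_table <- function(tables) {
  stopifnot(is.list(tables), length(tables) > 0)
  n_rows <- vapply(tables, nrow, integer(1))
  tables[[which.max(n_rows)]]
}

# Stand-in for the list returned by extract_tables(); placeholder values only.
mock_tables <- list(
  data.frame(Note = c("header fragment", "footer fragment")),
  data.frame(Country = c("Country A", "Country B", "Country C"),
             Score   = c("5.0", "10.2", "<5"))
)

ghi_table <- pick_largest_table(mock_tables)
str(ghi_table)  # inspect column names and types before going further
```

With the real result you would call pick_largest_table(ghi2020), or simply ghi2020[[1]] if only one table was detected on the page.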

Job done! Let us save the file and see the output:

write.xlsx(ghi2020, file = "ghi2020.xlsx")
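
If you would rather not depend on the xlsx package for this step, a plain CSV written with base R is a possible alternative. This is only a sketch under assumptions: ghi_table below is a placeholder data frame with made-up values standing in for the extracted table, and the file goes to a temporary directory so nothing in your working folder gets overwritten.

```r
# Placeholder standing in for the extracted table; values are illustrative only.
ghi_table <- data.frame(
  Country = c("Country A", "Country B"),
  Score   = c(5.0, 10.2)
)

# Write to a temporary location and read it back as a quick sanity check.
out_file <- file.path(tempdir(), "ghi2020.csv")
write.csv(ghi_table, out_file, row.names = FALSE)

check <- read.csv(out_file)
print(check)
```

Reading the file back like this is a cheap way to confirm that the column names and row count survived the export.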

Thanks to rOpenSci for this lifesaver!

Written by infoart.ca

Center for Social Capital & Environmental Research | Posts by Bishwajit Ghose, BI consultant and lecturer at the University of Ottawa
