Extracting and using World Bank data in R

May 12, 2023

In a previous post, we saw how to extract human development indicator data from the World Bank database in Stata. In this post, we will do the same using R.

The basic commands to load the wbstats library and extract data for an indicator are:

library(wbstats)
wb_data('your indicator')

We will start by searching for an indicator related to school enrollment among women:

print(wb_search('school enrollment'), n=50)

# A tibble: 30 × 3
   indicator_id      indicator                                                                   indicator_desc   
   <chr>             <chr>                                                                       <chr>            
 1 2.0.cov.Sch       Coverage: School Enrollment                                                 The coverage rat…
 2 2.0.hoi.Sch       HOI: School Enrollment                                                      The Human Opport…
 3 HD.HCI.EYRS       Expected Years of School                                                    Expected Years o…
 4 HD.HCI.EYRS.FE    Expected Years of School, Female                                            Expected Years o…
 5 HD.HCI.EYRS.MA    Expected Years of School, Male                                              Expected Years o…
 6 SE.ENR.PRIM.FM.ZS School enrollment, primary (gross), gender parity index (GPI)               Gender parity in…
 7 SE.ENR.PRSC.FM.ZS School enrollment, primary and secondary (gross), gender parity index (GPI) Gender parity in…
 8 SE.ENR.SECO.FM.ZS School enrollment, secondary (gross), gender parity index (GPI)             Gender parity in…
 9 SE.ENR.TERT.FM.ZS School enrollment, tertiary (gross), gender parity index (GPI)              Gender parity in…
10 SE.PRE.ENRR       School enrollment, preprimary (% gross)                                     Gross enrollment…
11 SE.PRE.ENRR.FE    School enrollment, preprimary, female (% gross)                             Gross enrollment…
12 SE.PRE.ENRR.MA    School enrollment, preprimary, male (% gross)                               Gross enrollment…
13 SE.PRM.ENRR       School enrollment, primary (% gross)                                        Gross enrollment…
14 SE.PRM.ENRR.FE    School enrollment, primary, female (% gross)                                Gross enrollment…
15 SE.PRM.ENRR.MA    School enrollment, primary, male (% gross)                                  Gross enrollment…
16 SE.PRM.NENR       School enrollment, primary (% net)                                          Net enrollment r…
17 SE.PRM.NENR.FE    School enrollment, primary, female (% net)                                  Net enrollment r…
18 SE.PRM.NENR.MA    School enrollment, primary, male (% net)                                    Net enrollment r…
19 SE.PRM.PRIV.ZS    School enrollment, primary, private (% of total primary)                    Private enrollme…
20 SE.SEC.ENRR       School enrollment, secondary (% gross)                                      Gross enrollment…
21 SE.SEC.ENRR.FE    School enrollment, secondary, female (% gross)                              Gross enrollment…
22 SE.SEC.ENRR.MA    School enrollment, secondary, male (% gross)                                Gross enrollment…
23 SE.SEC.NENR       School enrollment, secondary (% net)                                        Net enrollment r…
24 SE.SEC.NENR.FE    School enrollment, secondary, female (% net)                                Net enrollment r…
25 SE.SEC.NENR.MA    School enrollment, secondary, male (% net)                                  Net enrollment r…
26 SE.SEC.PRIV.ZS    School enrollment, secondary, private (% of total secondary)                Private enrollme…
27 SE.TER.ENRR       School enrollment, tertiary (% gross)                                       Gross enrollment…
28 SE.TER.ENRR.FE    School enrollment, tertiary, female (% gross)                               Gross enrollment…
29 SE.TER.ENRR.MA    School enrollment, tertiary, male (% gross)                                 Gross enrollment…
30 SI.POV.ENRL.MI    Multidimensional poverty, Educational enrollment (% of population deprived) Multidimensional…
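
With 30 rows to scan, it can help to narrow the list programmatically. The following is a minimal sketch in base R, assuming a small data frame named `results` that stands in for the tibble returned by `wb_search()`; the rows are copied from the output above, and the variable names are only illustrative.

```r
# A few rows copied from the search output above (indicator_desc omitted).
# In practice `results` would be the tibble returned by wb_search().
results <- data.frame(
  indicator_id = c("SE.SEC.ENRR", "SE.SEC.ENRR.FE", "SE.SEC.ENRR.MA",
                   "SE.SEC.NENR.FE", "SE.TER.ENRR.FE"),
  indicator = c("School enrollment, secondary (% gross)",
                "School enrollment, secondary, female (% gross)",
                "School enrollment, secondary, male (% gross)",
                "School enrollment, secondary, female (% net)",
                "School enrollment, tertiary, female (% gross)"),
  stringsAsFactors = FALSE
)

# Rows whose name mentions "female" (case-insensitive).
is_female <- grepl("female", results$indicator, ignore.case = TRUE)

# Rows for gross secondary enrollment; fixed = TRUE treats "% gross"
# as a literal string rather than a regular expression.
is_secondary_gross <- grepl("secondary", results$indicator, fixed = TRUE) &
  grepl("% gross", results$indicator, fixed = TRUE)

matches <- results[is_female & is_secondary_gross, ]
print(matches)
```

The same row indexing should work on the tibble itself, since a tibble can be subset with a logical vector like a data frame.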

Looks like the 21st indicator “SE.SEC.ENRR.FE” is a good match for what we are looking for, so let’s plug that in:

wb_data('SE.SEC.ENRR.FE')

The output is:

# A tibble: 13,671 × 9
   iso2c iso3c country  date SE.SEC.ENRR.FE unit  obs_status footnote
   <chr> <chr> <chr>   <dbl>          <dbl> <chr> <chr>      <chr>   
 1 AW    ABW   Aruba    1960             NA NA    NA         NA      
 2 AW    ABW   Aruba    1961             NA NA    NA         NA      
 3 AW    ABW   Aruba    1962             NA NA    NA         NA      
 4 AW    ABW   Aruba    1963             NA NA    NA         NA      
 5 AW    ABW   Aruba    1964             NA NA    NA         NA      
 6 AW    ABW   Aruba    1965             NA NA    NA         NA      
 7 AW    ABW   Aruba    1966             NA NA    NA         NA      
 8 AW    ABW   Aruba    1967             NA NA    NA         NA      
 9 AW    ABW   Aruba    1968             NA NA    NA         NA      
10 AW    ABW   Aruba    1969             NA NA    NA         NA      
# ℹ 13,661 more rows
# ℹ 1 more variable: last_updated <date>
# ℹ Use `print(n = ...)` to see more rows

The output shows that the basic command extracts data for all countries starting from 1960. However, the package allows more customized searches to specify the countries and time periods:

wb_data('your indicator', country = '', start_date =  , end_date =  )

The 3-letter country codes (iso3c) are in the second column of the wb_data output above. To list all of them:

> df=wb_data('SE.SEC.ENRR.FE', start_date = 2000)

> unique(df$iso3c)
  [1] "ABW" "AFG" "AGO" "ALB" "AND" "ARE" "ARG" "ARM" "ASM" "ATG" "AUS" "AUT"
 [13] "AZE" "BDI" "BEL" "BEN" "BFA" "BGD" "BGR" "BHR" "BHS" "BIH" "BLR" "BLZ"
 [25] "BMU" "BOL" "BRA" "BRB" "BRN" "BTN" "BWA" "CAF" "CAN" "CHE" "CHI" "CHL"
 [37] "CHN" "CIV" "CMR" "COD" "COG" "COL" "COM" "CPV" "CRI" "CUB" "CUW" "CYM"
 [49] "CYP" "CZE" "DEU" "DJI" "DMA" "DNK" "DOM" "DZA" "ECU" "EGY" "ERI" "ESP"
 [61] "EST" "ETH" "FIN" "FJI" "FRA" "FRO" "FSM" "GAB" "GBR" "GEO" "GHA" "GIB"
 [73] "GIN" "GMB" "GNB" "GNQ" "GRC" "GRD" "GRL" "GTM" "GUM" "GUY" "HKG" "HND"
 [85] "HRV" "HTI" "HUN" "IDN" "IMN" "IND" "IRL" "IRN" "IRQ" "ISL" "ISR" "ITA"
 [97] "JAM" "JOR" "JPN" "KAZ" "KEN" "KGZ" "KHM" "KIR" "KNA" "KOR" "KWT" "LAO"
[109] "LBN" "LBR" "LBY" "LCA" "LIE" "LKA" "LSO" "LTU" "LUX" "LVA" "MAC" "MAF"
[121] "MAR" "MCO" "MDA" "MDG" "MDV" "MEX" "MHL" "MKD" "MLI" "MLT" "MMR" "MNE"
[133] "MNG" "MNP" "MOZ" "MRT" "MUS" "MWI" "MYS" "NAM" "NCL" "NER" "NGA" "NIC"
[145] "NLD" "NOR" "NPL" "NRU" "NZL" "OMN" "PAK" "PAN" "PER" "PHL" "PLW" "PNG"
[157] "POL" "PRI" "PRK" "PRT" "PRY" "PSE" "PYF" "QAT" "ROU" "RUS" "RWA" "SAU"
[169] "SDN" "SEN" "SGP" "SLB" "SLE" "SLV" "SMR" "SOM" "SRB" "SSD" "STP" "SUR"
[181] "SVK" "SVN" "SWE" "SWZ" "SXM" "SYC" "SYR" "TCA" "TCD" "TGO" "THA" "TJK"
[193] "TKM" "TLS" "TON" "TTO" "TUN" "TUR" "TUV" "TZA" "UGA" "UKR" "URY" "USA"
[205] "UZB" "VCT" "VEN" "VGB" "VIR" "VNM" "VUT" "WSM" "XKX" "YEM" "ZAF" "ZMB"
[217] "ZWE"
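
Before passing codes to the country argument, it can be worth checking them against this list to catch typos. Here is a small sketch in base R, assuming `available` stands in for `unique(df$iso3c)`; only a subset of the codes printed above is included, and "XYZ" is a deliberately invalid, hypothetical code.

```r
# A subset of the iso3c codes printed above, standing in for unique(df$iso3c).
available <- c("AFG", "BGD", "BTN", "IND", "LKA", "MDV", "NPL", "PAK")

# Codes we intend to request; "XYZ" is a made-up code that simulates a typo.
wanted <- c("BGD", "IND", "MDV", "NPL", "PAK", "XYZ")

# setdiff() returns the elements of `wanted` that are not in `available`.
unknown <- setdiff(wanted, available)

# intersect() keeps only the codes that are present in both vectors.
valid <- intersect(wanted, available)

print(unknown)
print(valid)
```

Only the codes in `valid` would then be passed on, for example as `country = valid`.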

Here is example code to extract data on “School enrollment, secondary, female (% gross)” for selected South Asian countries from 2000 to 2020:

df = wb_data('SE.SEC.ENRR.FE', country = c('BGD', 'IND', 'MDV', 'NPL', 'PAK'), start_date = 2000 , end_date = 2020)

> df
# A tibble: 105 × 9
   iso2c iso3c country     date SE.SEC.ENRR.FE unit  obs_status footnote last_updated
   <chr> <chr> <chr>      <dbl>          <dbl> <chr> <chr>      <chr>    <date>      
 1 BD    BGD   Bangladesh  2000           50.5 NA    NA         NA       2023-05-10  
 2 BD    BGD   Bangladesh  2001           53.1 NA    NA         NA       2023-05-10  
 3 BD    BGD   Bangladesh  2002           54.7 NA    NA         NA       2023-05-10  
 4 BD    BGD   Bangladesh  2003           54.5 NA    NA         NA       2023-05-10  
 5 BD    BGD   Bangladesh  2004           49.0 NA    NA         NA       2023-05-10  
 6 BD    BGD   Bangladesh  2005           48.5 NA    NA         NA       2023-05-10  
 7 BD    BGD   Bangladesh  2006           48.9 NA    NA         NA       2023-05-10  
 8 BD    BGD   Bangladesh  2007           49.6 NA    NA         NA       2023-05-10  
 9 BD    BGD   Bangladesh  2008           49.0 NA    NA         NA       2023-05-10  
10 BD    BGD   Bangladesh  2009           51.9 NA    NA         NA       2023-05-10  
# ℹ 95 more rows
# ℹ Use `print(n = ...)` to see more rows

We can now do some cleaning to keep only the columns we need:

> library(dplyr)
> df <- df %>% dplyr::select(country, date, SE.SEC.ENRR.FE) %>% dplyr::rename(education = SE.SEC.ENRR.FE)

> df
# A tibble: 105 × 3
   country     date education
   <chr>      <dbl>     <dbl>
 1 Bangladesh  2000      50.5
 2 Bangladesh  2001      53.1
 3 Bangladesh  2002      54.7
 4 Bangladesh  2003      54.5
 5 Bangladesh  2004      49.0
 6 Bangladesh  2005      48.5
 7 Bangladesh  2006      48.9
 8 Bangladesh  2007      49.6
 9 Bangladesh  2008      49.0
10 Bangladesh  2009      51.9
# ℹ 95 more rows
# ℹ Use `print(n = ...)` to see more rows

We can now plot the data using the ggplot2 library:

library(ggplot2)
library(RColorBrewer)  # provides brewer.pal()

# create a color palette with one color per country
num_colors <- length(unique(df$country))
colors <- brewer.pal(num_colors, "Set1")

ggplot(df, aes(x = date, y = education, color = country)) +
  geom_line() +
  labs(x = "Year", y = "Education", color = "Country") +
  scale_color_manual(values = colors, name = "Country")

Looks like there are many missing values for education, so we’ll just remove them from the dataframe before charting:

df %>% tidyr::drop_na(education) %>%
  ggplot(aes(x = date, y = education, color = country)) +
  geom_line() +
  labs(x = "Year", y = "Education", color = "Country") +
  scale_color_manual(values = colors, name = "Country")
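
Before dropping rows, it may be useful to see how many values are missing for each country. The sketch below uses base R on a toy data frame shaped like the cleaned df: the Bangladesh values are taken from the output above, while the "Country X" rows are invented purely to illustrate missing values.

```r
# Toy data in the shape of the cleaned df. The Bangladesh values come from
# the output above; the "Country X" rows are hypothetical.
toy <- data.frame(
  country = c(rep("Bangladesh", 3), rep("Country X", 3)),
  date = rep(2000:2002, times = 2),
  education = c(50.5, 53.1, 54.7, NA, 41.0, NA),
  stringsAsFactors = FALSE
)

# Count missing observations per country.
missing_by_country <- tapply(is.na(toy$education), toy$country, sum)
print(missing_by_country)

# Base R equivalent of drop_na(education): keep only rows with a value.
toy_complete <- toy[!is.na(toy$education), ]
print(nrow(toy_complete))
```

A country with many gaps will produce a sparse line in the chart, so this count gives a quick sense of how much each series can be trusted.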

Written by infoart.ca