Extracting and using World Bank data in R
In a previous post, we saw how to extract human development indicator data from the World Bank database in Stata. In this post, we will do the same in R.
Load the wbstats library, then pass an indicator ID to wb_data() to extract the data:
library(wbstats)
wb_data('your indicator')
We will start by searching for an indicator related to school enrollment among women:
print(wb_search('school enrollment'), n=50)
# A tibble: 30 × 3
indicator_id indicator indicator_desc
<chr> <chr> <chr>
1 2.0.cov.Sch Coverage: School Enrollment The coverage rat…
2 2.0.hoi.Sch HOI: School Enrollment The Human Opport…
3 HD.HCI.EYRS Expected Years of School Expected Years o…
4 HD.HCI.EYRS.FE Expected Years of School, Female Expected Years o…
5 HD.HCI.EYRS.MA Expected Years of School, Male Expected Years o…
6 SE.ENR.PRIM.FM.ZS School enrollment, primary (gross), gender parity index (GPI) Gender parity in…
7 SE.ENR.PRSC.FM.ZS School enrollment, primary and secondary (gross), gender parity index (GPI) Gender parity in…
8 SE.ENR.SECO.FM.ZS School enrollment, secondary (gross), gender parity index (GPI) Gender parity in…
9 SE.ENR.TERT.FM.ZS School enrollment, tertiary (gross), gender parity index (GPI) Gender parity in…
10 SE.PRE.ENRR School enrollment, preprimary (% gross) Gross enrollment…
11 SE.PRE.ENRR.FE School enrollment, preprimary, female (% gross) Gross enrollment…
12 SE.PRE.ENRR.MA School enrollment, preprimary, male (% gross) Gross enrollment…
13 SE.PRM.ENRR School enrollment, primary (% gross) Gross enrollment…
14 SE.PRM.ENRR.FE School enrollment, primary, female (% gross) Gross enrollment…
15 SE.PRM.ENRR.MA School enrollment, primary, male (% gross) Gross enrollment…
16 SE.PRM.NENR School enrollment, primary (% net) Net enrollment r…
17 SE.PRM.NENR.FE School enrollment, primary, female (% net) Net enrollment r…
18 SE.PRM.NENR.MA School enrollment, primary, male (% net) Net enrollment r…
19 SE.PRM.PRIV.ZS School enrollment, primary, private (% of total primary) Private enrollme…
20 SE.SEC.ENRR School enrollment, secondary (% gross) Gross enrollment…
21 SE.SEC.ENRR.FE School enrollment, secondary, female (% gross) Gross enrollment…
22 SE.SEC.ENRR.MA School enrollment, secondary, male (% gross) Gross enrollment…
23 SE.SEC.NENR School enrollment, secondary (% net) Net enrollment r…
24 SE.SEC.NENR.FE School enrollment, secondary, female (% net) Net enrollment r…
25 SE.SEC.NENR.MA School enrollment, secondary, male (% net) Net enrollment r…
26 SE.SEC.PRIV.ZS School enrollment, secondary, private (% of total secondary) Private enrollme…
27 SE.TER.ENRR School enrollment, tertiary (% gross) Gross enrollment…
28 SE.TER.ENRR.FE School enrollment, tertiary, female (% gross) Gross enrollment…
29 SE.TER.ENRR.MA School enrollment, tertiary, male (% gross) Gross enrollment…
30 SI.POV.ENRL.MI Multidimensional poverty, Educational enrollment (% of population deprived) Multidimensional…
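With 30 matches, it can help to narrow the list before choosing. The sketch below is one possible way to do that, assuming the search result keeps the indicator_id and indicator columns shown above; the names hits and female_hits are ours, not part of wbstats.

```r
library(wbstats)

# Run the same search as before and keep the result as a tibble.
hits <- wb_search("school enrollment")

# Keep only rows whose indicator name mentions "female" (case-insensitive).
female_hits <- hits[grepl("female", hits$indicator, ignore.case = TRUE), ]

# Show just the ID and name columns so the rows stay readable.
print(female_hits[, c("indicator_id", "indicator")], n = 50)
```

The exact rows returned depend on the version of the indicator list that wbstats is using, so the count may differ from the output above.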
Looks like the 21st indicator “SE.SEC.ENRR.FE” is a good match for what we are looking for, so let’s plug that in:
wb_data('SE.SEC.ENRR.FE')
And the output is:
# A tibble: 13,671 × 9
iso2c iso3c country date SE.SEC.ENRR.FE unit obs_status footnote
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 AW ABW Aruba 1960 NA NA NA NA
2 AW ABW Aruba 1961 NA NA NA NA
3 AW ABW Aruba 1962 NA NA NA NA
4 AW ABW Aruba 1963 NA NA NA NA
5 AW ABW Aruba 1964 NA NA NA NA
6 AW ABW Aruba 1965 NA NA NA NA
7 AW ABW Aruba 1966 NA NA NA NA
8 AW ABW Aruba 1967 NA NA NA NA
9 AW ABW Aruba 1968 NA NA NA NA
10 AW ABW Aruba 1969 NA NA NA NA
# ℹ 13,661 more rows
# ℹ 1 more variable: last_updated <date>
# ℹ Use `print(n = ...)` to see more rows
The output shows that the basic command extracts data for all countries starting from 1960. However, the package allows more customized searches to specify the countries and time periods:
wb_data('your indicator', country = '', start_date = , end_date = )
The 3-letter country codes (iso3c) are in the second column of the output above and can be listed as follows:
> df=wb_data('SE.SEC.ENRR.FE', start_date = 2000)
> unique(df$iso3c)
[1] "ABW" "AFG" "AGO" "ALB" "AND" "ARE" "ARG" "ARM" "ASM" "ATG" "AUS" "AUT"
[13] "AZE" "BDI" "BEL" "BEN" "BFA" "BGD" "BGR" "BHR" "BHS" "BIH" "BLR" "BLZ"
[25] "BMU" "BOL" "BRA" "BRB" "BRN" "BTN" "BWA" "CAF" "CAN" "CHE" "CHI" "CHL"
[37] "CHN" "CIV" "CMR" "COD" "COG" "COL" "COM" "CPV" "CRI" "CUB" "CUW" "CYM"
[49] "CYP" "CZE" "DEU" "DJI" "DMA" "DNK" "DOM" "DZA" "ECU" "EGY" "ERI" "ESP"
[61] "EST" "ETH" "FIN" "FJI" "FRA" "FRO" "FSM" "GAB" "GBR" "GEO" "GHA" "GIB"
[73] "GIN" "GMB" "GNB" "GNQ" "GRC" "GRD" "GRL" "GTM" "GUM" "GUY" "HKG" "HND"
[85] "HRV" "HTI" "HUN" "IDN" "IMN" "IND" "IRL" "IRN" "IRQ" "ISL" "ISR" "ITA"
[97] "JAM" "JOR" "JPN" "KAZ" "KEN" "KGZ" "KHM" "KIR" "KNA" "KOR" "KWT" "LAO"
[109] "LBN" "LBR" "LBY" "LCA" "LIE" "LKA" "LSO" "LTU" "LUX" "LVA" "MAC" "MAF"
[121] "MAR" "MCO" "MDA" "MDG" "MDV" "MEX" "MHL" "MKD" "MLI" "MLT" "MMR" "MNE"
[133] "MNG" "MNP" "MOZ" "MRT" "MUS" "MWI" "MYS" "NAM" "NCL" "NER" "NGA" "NIC"
[145] "NLD" "NOR" "NPL" "NRU" "NZL" "OMN" "PAK" "PAN" "PER" "PHL" "PLW" "PNG"
[157] "POL" "PRI" "PRK" "PRT" "PRY" "PSE" "PYF" "QAT" "ROU" "RUS" "RWA" "SAU"
[169] "SDN" "SEN" "SGP" "SLB" "SLE" "SLV" "SMR" "SOM" "SRB" "SSD" "STP" "SUR"
[181] "SVK" "SVN" "SWE" "SWZ" "SXM" "SYC" "SYR" "TCA" "TCD" "TGO" "THA" "TJK"
[193] "TKM" "TLS" "TON" "TTO" "TUN" "TUR" "TUV" "TZA" "UGA" "UKR" "URY" "USA"
[205] "UZB" "VCT" "VEN" "VGB" "VIR" "VNM" "VUT" "WSM" "XKX" "YEM" "ZAF" "ZMB"
[217] "ZWE"
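A bare list of codes does not say which code belongs to which country. As a rough sketch, the country table bundled with the package in wb_cachelist can be filtered by region instead. This assumes the cached table has iso3c, country and region columns and that the region is labelled "South Asia"; check names(wb_cachelist$countries) if your version differs.

```r
library(wbstats)

# Country metadata shipped with the package (assumed columns: iso3c, country, region).
countries <- wb_cachelist$countries

# Keep the South Asian countries; which() drops rows where region is NA.
south_asia <- countries[which(countries$region == "South Asia"),
                        c("iso3c", "country")]

print(south_asia)
```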
Here is example code to extract data on school enrollment, secondary, female (% gross), for selected South Asian countries between 2000 and 2020:
> df = wb_data('SE.SEC.ENRR.FE', country = c('BGD', 'IND', 'MDV', 'NPL', 'PAK'), start_date = 2000, end_date = 2020)
> df
# A tibble: 105 × 9
iso2c iso3c country date SE.SEC.ENRR.FE unit obs_status footnote last_updated
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <date>
1 BD BGD Bangladesh 2000 50.5 NA NA NA 2023-05-10
2 BD BGD Bangladesh 2001 53.1 NA NA NA 2023-05-10
3 BD BGD Bangladesh 2002 54.7 NA NA NA 2023-05-10
4 BD BGD Bangladesh 2003 54.5 NA NA NA 2023-05-10
5 BD BGD Bangladesh 2004 49.0 NA NA NA 2023-05-10
6 BD BGD Bangladesh 2005 48.5 NA NA NA 2023-05-10
7 BD BGD Bangladesh 2006 48.9 NA NA NA 2023-05-10
8 BD BGD Bangladesh 2007 49.6 NA NA NA 2023-05-10
9 BD BGD Bangladesh 2008 49.0 NA NA NA 2023-05-10
10 BD BGD Bangladesh 2009 51.9 NA NA NA 2023-05-10
# ℹ 95 more rows
# ℹ Use `print(n = ...)` to see more rows
We can now do some cleaning to keep only the columns we need:
> library(dplyr)
> df <- df %>% dplyr::select(country, date, SE.SEC.ENRR.FE) %>% dplyr::rename(education = SE.SEC.ENRR.FE)
> df
# A tibble: 105 × 3
country date education
<chr> <dbl> <dbl>
1 Bangladesh 2000 50.5
2 Bangladesh 2001 53.1
3 Bangladesh 2002 54.7
4 Bangladesh 2003 54.5
5 Bangladesh 2004 49.0
6 Bangladesh 2005 48.5
7 Bangladesh 2006 48.9
8 Bangladesh 2007 49.6
9 Bangladesh 2008 49.0
10 Bangladesh 2009 51.9
# ℹ 95 more rows
# ℹ Use `print(n = ...)` to see more rows
We can then plot the cleaned data using the ggplot2 library:
library(ggplot2)
library(RColorBrewer) # provides brewer.pal()
# build a palette with one color per country
num_colors <- length(unique(df$country))
colors <- brewer.pal(num_colors, "Set1")
# one line per country, colored by country
ggplot(df, aes(x = date, y = education, color = country)) +
geom_line() +
labs(x = "Year", y = "Education", color = "Country") +
scale_color_manual(values = colors, name = "Country")
Looks like there are many missing values for education, so we’ll just remove them from the dataframe before charting:
df %>% tidyr::drop_na(education) %>%
ggplot(aes(x = date, y = education, color = country)) +
geom_line() +
labs(x = "Year", y = "Education", color = "Country") +
scale_color_manual(values = colors, name = "Country")
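Before dropping rows, it can be worth checking how much data each country actually has. The helper below is a small sketch of that check; count_missing is a hypothetical name of ours, and it assumes the cleaned columns country, date and education used in this post.

```r
library(dplyr)

# Count total and missing observations per country.
# Expects a data frame with the columns country and education.
count_missing <- function(data) {
  data %>%
    group_by(country) %>%
    summarise(
      n_years   = n(),                     # rows (years) per country
      n_missing = sum(is.na(education)),   # years with no value
      .groups   = "drop"
    )
}
```

Calling count_missing(df) on the cleaned dataframe returns one row per country, which makes it easy to see which lines in the chart are built from only a few observations.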