Extracting World Bank data in Stata
Stata is one of the most popular programs used in data science and analytics. It is a powerful statistical software package that was first released in 1985 and is now developed by StataCorp. Since then, it has been widely adopted by individuals and organizations for its ability to analyze large datasets.
Stata’s graphical user interface (GUI) makes complex tasks relatively easy to navigate, which helps with manipulating and visualizing data efficiently. It also has an extensive library of user-written packages that add capabilities not available in the base installation of the software.
In this post, I’d like to give a brief sketch of a package to extract data from the World Bank data repo, of which I happen to be a great fan!
The package can be installed using the following command:
ssc install wbopendata
An excellent manual is available for starters through the help command:
help wbopendata
Now, to load the GUI, just type
db wbopendata
The interface is neat and self-explanatory, but I generally write the command myself:
wbopendata, indicator(indicator1; indicator2; ...) long clear
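The ‘long’ option is optional: it returns the data with one row per country and year, which is the layout used in the rest of this post. The ‘clear’ option is effectively required whenever a dataset is already in memory, because the download replaces it.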
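The database features hundreds of indicators, each with a unique ID or code. To find a code, search for the indicator on the World Bank data site and open its page; the code appears in the address bar right after https://data.worldbank.org/indicator/. For example, the address of the overall life expectancy page ends in SP.DYN.LE00.IN.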
Once the codes are recorded, the data can be requested using the following command:
wbopendata, indicator(SP.DYN.LE00.FE.IN; SP.DYN.LE00.MA.IN; SP.DYN.LE00.IN; SI.POV.DDAY) long clear
The variables are life expectancy at birth (female), life expectancy at birth (male), life expectancy at birth (overall), and the poverty headcount ratio at $2.15 a day (2017 PPP, % of population).
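Before going further, it can help to check what actually arrived. The lines below are a minimal sketch, assuming the download above succeeded and that the indicator codes were turned into lowercase variable names with underscores (the naming used in the plotting commands later in this post):

```stata
* Overview of all variables, their types, and their labels
describe

* Basic summary statistics for the four requested indicators
summarize sp_dyn_le00_fe_in sp_dyn_le00_ma_in sp_dyn_le00_in si_pov_dday
```

The observation counts reported by summarize give a quick sense of how sparse each indicator is; the poverty series will likely have far fewer non-missing values than the life expectancy series.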
Let’s do some exploration of the variables. Right now, the country variable seems to be in string format. We can encode it to numeric and generate an ID variable with that:
encode countryname, g(country)
egen id = group(country), label
Success! Now we’ll declare the dataset as a panel, with id as the panel variable and year as the time variable:
xtset id year
With the panel structure set, we can plot life expectancy for women and men over the whole time period:
twoway (scatter sp_dyn_le00_fe_in year) (scatter sp_dyn_le00_ma_in year)
Looks a bit messy. Let’s rename the variables and make a multi-country time series chart:
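The rename step itself is not shown in this post, but the commands that follow use the names LE_Female, LE_Male, and LE_ALL. A minimal sketch of what it presumably looks like, assuming those three names map to the female, male, and overall life expectancy series respectively:

```stata
* Assumed mapping, inferred from the names used in the later plot commands
rename sp_dyn_le00_fe_in LE_Female
rename sp_dyn_le00_ma_in LE_Male
rename sp_dyn_le00_in    LE_ALL
```

The poverty variable is left under its original name, since it is not used in the charts. With the new names in place, the multi-country chart is: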
xtline LE_ALL if inlist(id, 1, 12, 34, 32, 35, 45), overlay
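The numbers passed to inlist() are simply positions in the alphabetical encoding of countryname, so it is worth checking which entries they refer to (the World Bank list may include regional and income-group aggregates as well as individual countries). A small sketch, assuming the id and countryname variables created earlier; the helper variable one_per_id is a hypothetical name introduced only for this check:

```stata
* Flag exactly one observation per id
egen byte one_per_id = tag(id)

* Show the numeric id next to the country name for the ids used in the chart
list id countryname if one_per_id & inlist(id, 1, 12, 34, 32, 35, 45), noobs nolabel

* Remove the helper variable again
drop one_per_id
```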
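To conclude, we will limit the chart to the first country in the database only, and of course, use a cute colour scheme to give the chart a proper look. The plottig scheme is not part of base Stata; it comes with the user-written blindschemes package, which can be installed with ssc install blindschemes: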
twoway (scatter LE_Female year) (scatter LE_Male year) if id == 1, scheme(plottig) xlabel(1960(5)2021)
That’s a really quick tutorial of this really mighty package that I hope you enjoyed, and I shall be back with more soon!