Importing and cleaning Wikipedia tables in R
Wikipedia contains a mine of tables that make suitable material for practicing data wrangling.
R is our excavator for today, so let’s do some mining!
Setting the libraries to work
library(XML)      # readHTMLTable()
library(httr)     # GET()
library(stringr)  # str_remove()
library(tidyr)    # separate()
library(dplyr)    # mutate(), across()
library(magrittr) # the %<>% assignment pipe
We are going to extract the world population table from here:
df <- readHTMLTable(doc=content(GET("https://en.wikipedia.org/wiki/World_population"), "text"))
Check the type of the object df
class(df)
[1] "list"# And the namesnames(df)[1] "World population .mw-parser-output .nobold{font-weight:normal}(millions, UN estimates)[14]\n"
[2] "Population by region (2020 estimates)\n"
[3] "NULL"
[4] "NULL"
[5] "NULL"
[6] "10 most densely populated countries (with population above 5 million)\n"
[7] "Countries ranking highly in both total population (more than 20 million people) and population density (more than 250 people per square kilometer):\n"
[8] "Global annual population growth[111]"
[9] "World historical and predicted populations (in millions)[115][116][117]"
[10] "World historical and predicted populations by percentage distribution[115][116]"
[11] "NULL"
[12] "NULL"
[13] "UN (medium variant – 2019 revision) and US Census Bureau (June 2015) estimates[137][138]"
[14] "UN 2019 estimates and medium variant projections (in millions)[137]"
[15] "Starting at 500 million\n"
[16] "Starting at 375 million\n"
[17] "NULL"
[18] "NULL"
[19] "NULL"
[20] "NULL"
[21] "NULL"
[22] "NULL"
[23] "NULL"
[24] "NULL"
[25] "NULL"
[26] "NULL"
A rather messy-looking output, but we are interested in the tables it contains (the non-NULL ones) rather than the output itself. Let’s get the second one, ‘Population by region (2020 estimates)’.
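(Side note: if you prefer to keep only the actual tables before digging in, something along these lines should do it. Treat it as a sketch, since the unnamed entries may be either empty elements or tables simply named "NULL", and the name df_tables is just for illustration.)
# a sketch: drop empty entries and entries without a meaningful name
df_tables <- df[!sapply(df, is.null) & names(df) != "NULL"]
names(df_tables)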
Since the object is a list, we have to use [ ] to reach its content
df[2]
Looks messy enough. Let’s save it as a data frame first before getting our hands dirty.
world <- as.data.frame(df[2])
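A quick aside on the indexing: single brackets return a one-element list, while double brackets pull out the element itself, which here should already be a data frame, so df[[2]] would work as well.
class(df[2])   # a one-element list
class(df[[2]]) # the table itself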
Step 1: Setting the column names
The actual column names were read in as row 1, so we assign the values of the first row to the column names
colnames(world) <- world[1, ]
And remove the now-redundant first row
world <- world[-1, ]
# And check the names
names(world)
[1] "Region"
[2] "Density(inhabitants/km2)"
[3] "Population(millions)"
[4] "Most populous country"
[5] "Most populous city (metropolitan area)"
Let’s also make the names syntactically valid; make.names() replaces spaces and special characters with dots
names(world) <- make.names(names(world))
names(world)
[1] "Region"
[2] "Density.inhabitants.km2."
[3] "Population.millions."
[4] "Most.populous.country"
[5] "Most.populous.city..metropolitan.area."
And remove the trailing dots from column names for extra finesse
names(world) <- sub("\\.$", "", names(world))
Step 2: Cleaning the columns
The fourth column Most.populous.country contains two different pieces of information: a population figure and a country name.
Similarly, the fifth column contains four pieces of information: the most populous city, the metropolitan area, and their respective population figures. The city and metropolitan-area names are separated by an en dash – (not a hyphen -), and their values by /.
We can separate the values from names for each column using the separate function from tidyr:
world <- separate(data = world, col = Most.populous.country, into = c("Country.Population", "Country.Name"), sep = "–")
# Check the names
names(world)
[1] "Region"
[2] "Density.inhabitants.km2"
[3] "Population.millions"
[4] "Country.Population"
[5] "Country.Name"
[6] "Most.populous.city..metropolitan.area"
The same goes for the fifth column
world <- separate(data = world, col = Most.populous.city..metropolitan.area, into = c("Metropolitan", "City"), sep = "–")
# Check the names
names(world)
[1] "Region" "Density.inhabitants.km2"
[3] "Population.millions" "Country.Population"
[5] "Country.Name" "Metropolitan"
[7] "City"
Let’s also split the newest columns Metropolitan and City by the delimiter /
world <- separate(data = world, col = Metropolitan, into = c("Metropolitan.Population", "City.Population"), sep = "/")
world <- separate(data = world, col = City, into = c("Metropolitan.Name", "City.Name"), sep = "/")
# And check the names
names(world)
[1] "Region"
[2] "Density.inhabitants.km2"
[3] "Population.millions"
[4] "Country.Population"
[5] "Country.Name"
[6] "Metropolitan.Population"
[7] "City.Population"
[8] "Metropolitan.Name"
[9] "City.Name"
Step 3: Removing the [ ] brackets along with their content from the affected columns
world %<>%
  mutate(Region = str_remove(Region, "\\[.*"),
         Population.millions = str_remove(Population.millions, "\\[.*"),
         Country.Population = str_remove(Country.Population, "\\[.*"),
         Metropolitan.Population = str_remove(Metropolitan.Population, "\\[.*"))
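The pattern "\\[.*" matches an opening bracket and everything after it, so the bracketed references disappear in one go. A quick illustration on a made-up value:
str_remove("745[note 1]", "\\[.*")
# [1] "745"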
Much better, but not done yet. Let’s get rid of the leading zeros from Country.Population
world$Country.Population <- sub("^0+", "", world$Country.Population)
Almost there. Just change the Country.Name entry from "Russia;approx. 110 million in Europe" to what it actually is…
world$Country.Name[grepl("Russ", world$Country.Name)] <- "Russia"
And set the "N/A" and blank cells to missing values (NA)
world$Country.Population[grepl("N/A", world$Country.Population)] <- NA
world %<>%
  mutate(across(everything(), ~ ifelse(. == "", NA, as.character(.))))
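As an optional last check, counting the missing values per column is a handy way to see that the blanks and N/A entries ended up as NA:
colSums(is.na(world))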
Looks like we did it! Here is the good old before & after as a parting pic…
Happy scraping!