Importing and cleaning Wikipedia tables in R

infoart.ca
Jun 26, 2022

Wikipedia contains a gold mine of tables that make excellent material for practicing data wrangling.

R is our excavator for today, let’s do some mining!

Putting the libraries to work

library(XML)
library(httr)
library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)

We are going to extract the world population table from here:

df <- readHTMLTable(doc=content(GET("https://en.wikipedia.org/wiki/World_population"), "text"))
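Before parsing, it is good practice to check that the request actually succeeded; otherwise you may end up parsing an error page. A defensive sketch of the same one-liner (my addition, using httr's standard `stop_for_status()` guard):

```r
library(httr)
library(XML)

# Defensive version of the one-liner above: fail loudly on a bad response
resp <- GET("https://en.wikipedia.org/wiki/World_population")
stop_for_status(resp)  # errors on HTTP 4xx/5xx instead of silently parsing an error page

df <- readHTMLTable(doc = content(resp, "text"))
```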

Check the type of the object df

class(df)
[1] "list"
# And the names
names(df)
[1] "World population .mw-parser-output .nobold{font-weight:normal}(millions, UN estimates)[14]\n"
[2] "Population by region (2020 estimates)\n"
[3] "NULL"
[4] "NULL"
[5] "NULL"
[6] "10 most densely populated countries (with population above 5 million)\n"
[7] "Countries ranking highly in both total population (more than 20 million people) and population density (more than 250 people per square kilometer):\n"
[8] "Global annual population growth[111]"
[9] "World historical and predicted populations (in millions)[115][116][117]"
[10] "World historical and predicted populations by percentage distribution[115][116]"
[11] "NULL"
[12] "NULL"
[13] "UN (medium variant – 2019 revision) and US Census Bureau (June 2015) estimates[137][138]"
[14] "UN 2019 estimates and medium variant projections (in millions)[137]"
[15] "Starting at 500 million\n"
[16] "Starting at 375 million\n"
[17] "NULL"
[18] "NULL"
[19] "NULL"
[20] "NULL"
[21] "NULL"
[22] "NULL"
[23] "NULL"
[24] "NULL"
[25] "NULL"
[26] "NULL"

The output looks a bit messy, but we are interested in the tables it contains (the non-NULL ones) rather than the output itself. Let’s get the second one, called ‘Population by region’.

Since the object is a list, we have to use [ ] to reach the content

df[2]

Looks messy enough. Let’s save it as a data frame first before getting our hands dirty.

world <- as.data.frame(df[2])
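As an aside, single brackets return a one-element sub-list while double brackets return the element itself, so `df[[2]]` would hand you the data frame directly. A toy illustration of the difference:

```r
# Toy list illustrating single vs double bracket indexing
lst <- list(a = 1:3, b = letters[1:3])

class(lst[1])    # "list": [ ] returns a sub-list of length one
class(lst[[1]])  # "integer": [[ ]] returns the element itself
```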

Step 1: Setting the column names

As the column names are wrongly treated as row 1, we have to promote the first row to column names

colnames(world) <- world[1, ]

And remove the incorrect(first) row

world <- world[-1, ]

# And check the names
names(world)
[1] "Region"
[2] "Density(inhabitants/km2)"
[3] "Population(millions)"
[4] "Most populous country"
[5] "Most populous city (metropolitan area)"

Let’s also make the names syntactically valid; make.names() replaces spaces and special characters with dots

names(world) <- make.names(names(world))

names(world)
[1] "Region"
[2] "Density.inhabitants.km2."
[3] "Population.millions."
[4] "Most.populous.country"
[5] "Most.populous.city..metropolitan.area."

And remove the trailing dots from column names for extra finesse

names(world) <- sub("\\.$", "", names(world))
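The pattern `\\.$` matches a literal dot anchored at the end of the string, so interior dots survive. A quick sanity check:

```r
# "\\.$" matches a literal dot anchored at the end of the string
sub("\\.$", "", "Population.millions.")  # "Population.millions": trailing dot gone
sub("\\.$", "", "Region")                # "Region": nothing to strip
```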

Step 2: Cleaning the columns

The fourth column Most.populous.country contains two different pieces of information: population and country.

Similarly, the fifth column contains four pieces of information: most populous city, metropolitan area, and their respective values. The names of the city and metropolitan area are separated by an en dash – (not a hyphen -), and their values by /.
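The distinction matters because the two characters look alike but are different code points, and a split on the wrong one silently fails. A quick check (the "Tokyo–Yokohama" value is illustrative):

```r
# The en dash (U+2013) is a different character from the hyphen-minus (U+002D)
"–" == "-"                    # FALSE
utf8ToInt("–")                # 8211
utf8ToInt("-")                # 45
grepl("–", "Tokyo–Yokohama")  # TRUE, so sep = "–" splits such pairs correctly
```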

We can separate the values from names for each column using the separate function from tidyr:

world <- separate(data = world, col = Most.populous.country, into = c("Country.Population", "Country.Name"), sep = "–")

# Check the names
names(world)
[1] "Region"
[2] "Density.inhabitants.km2"
[3] "Population.millions"
[4] "Country.Population"
[5] "Country.Name"
[6] "Most.populous.city..metropolitan.area"

Same goes for the fifth column

world <- separate(data = world, col = Most.populous.city..metropolitan.area, into = c("Metropolitan", "City"), sep = "–")

# Check the names
names(world)
[1] "Region" "Density.inhabitants.km2"
[3] "Population.millions" "Country.Population"
[5] "Country.Name" "Metropolitan"
[7] "City"

Let’s also split the newest columns Metropolitan and City by the delimiter /

world <- separate(data = world, col = Metropolitan, into = c("Metropolitan.Population", "City.Population"), sep = "/")
world <- separate(data = world, col = City, into = c("Metropolitan.Name", "City.Name"), sep = "/")

# And check the names
names(world)
[1] "Region"
[2] "Density.inhabitants.km2"
[3] "Population.millions"
[4] "Country.Population"
[5] "Country.Name"
[6] "Metropolitan.Population"
[7] "City.Population"
[8] "Metropolitan.Name"
[9] "City.Name"

Step 3: Remove the [ ] brackets along with their content from the affected columns

world %<>%
  mutate(Region = str_remove(Region, "\\[.*"),
         Population.millions = str_remove(Population.millions, "\\[.*"),
         Country.Population = str_remove(Country.Population, "\\[.*"),
         Metropolitan.Population = str_remove(Metropolitan.Population, "\\[.*"))

Much better, but not done yet. Let’s get rid of the leading zeros from Country.Population

world$Country.Population <- sub("^0+", "", world$Country.Population)
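The `^` anchor ties the match to the start of the string, so only leading zeros are removed while interior zeros stay put. A quick check on made-up values:

```r
# "^0+" anchors at the start, so only leading zeros are removed
sub("^0+", "", "0146")  # "146"
sub("^0+", "", "1046")  # "1046": interior zero untouched
```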

Almost there. Just change the Country.Name value ‘Russia; approx. 110 million in Europe’ to what it is…

world$Country.Name[grepl("Russ", world$Country.Name)] <- "Russia" 

And set N/A and all blank cells to missing (NA)

world$Country.Population[grepl("N/A", world$Country.Population)] <- NA

world %<>%
  mutate(across(everything(), ~ifelse(. == "", NA, as.character(.))))
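At this point every column is still character. As an optional extra step (my assumption about a natural follow-up, not part of the original walkthrough), the numeric-looking columns can be converted after stripping the thousands separators:

```r
# Strip thousands separators, then convert to numeric (base R, no extra packages)
pop <- c("1,439", "146", "38,505,000")  # toy stand-in for a scraped column
as.numeric(gsub(",", "", pop))          # 1439 146 38505000
```

On the cleaned table that would look like `world %<>% mutate(across(c(Population.millions, Country.Population), ~as.numeric(gsub(",", "", .))))`, with the column names as produced above.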

Looks like we did it! Here is the good old before & after as a parting pic…

Happy scraping!


Center for Social Capital & Environmental Research | Posts by Bishwajit Ghose, lecturer at the University of Ottawa