Importing and cleaning Wikipedia tables in R
Wikipedia contains a mine of tables that make suitable material for practicing data wrangling.
R is our excavator for today, so let’s do some mining!
Setting the libraries to work
library(XML)      # readHTMLTable()
library(httr)     # GET()
library(stringr)  # str_remove()
library(tidyr)    # separate()
library(dplyr)    # mutate(), across()
library(magrittr) # the %<>% assignment pipe
We are going to extract the world population table from here:
df <- readHTMLTable(doc=content(GET("https://en.wikipedia.org/wiki/World_population"), "text"))
Check the type of the object df
class(df)
[1] "list"# And the namesnames(df)[1] "World population .mw-parser-output .nobold{font-weight:normal}(millions, UN estimates)[14]\n"
[2] "Population by region (2020 estimates)\n"
[3] "NULL"
[4] "NULL"
[5] "NULL"
[6] "10 most densely populated countries (with population above 5 million)\n"
[7] "Countries ranking highly in both total population (more than 20 million people) and population density (more than 250 people per square kilometer):\n"
[8] "Global annual population growth[111]"
[9] "World historical and predicted populations (in millions)[115][116][117]"
[10] "World historical and predicted populations by percentage distribution[115][116]"
[11] "NULL"
[12] "NULL"
[13] "UN (medium variant – 2019 revision) and US Census Bureau (June 2015) estimates[137][138]"
[14] "UN 2019 estimates and medium variant projections (in millions)[137]"
[15] "Starting at 500 million\n"
[16] "Starting at 375 million\n"
[17] "NULL"
[18] "NULL"
[19] "NULL"
[20] "NULL"
[21] "NULL"
[22] "NULL"
[23] "NULL"
[24] "NULL"
[25] "NULL"
[26] "NULL"
A rather messy-looking output, but we are interested in the tables it contains (the non-NULL ones) rather than the output itself. Let’s get the second one, ‘Population by region (2020 estimates)’.
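(Side note: if you prefer to keep only the actual tables before digging in, something along these lines should do it. Treat it as a sketch, since the unnamed entries may be either empty elements or tables simply named "NULL", and the name df_tables is just for illustration.)
# a sketch: drop empty entries and entries without a meaningful name
df_tables <- df[!sapply(df, is.null) & names(df) != "NULL"]
names(df_tables)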
Since the object is a list, we have to use [ ] to reach its content
df[2]
Looks messy enough. Let’s save it as a data frame first before getting our hands dirty.
world <- as.data.frame(df[2])
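A quick aside on the indexing: single brackets return a one-element list, while double brackets pull out the element itself, which here should already be a data frame, so df[[2]] would work as well.
class(df[2])   # a one-element list
class(df[[2]]) # the table itself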
Step 1: Setting the column names
The actual column names were read in as row 1, so we assign the values of the first row to the column names
colnames(world) <- world[1, ]
And remove the now-redundant first row
world <- world[-1, ]
# And check the names
names(world)
[1] "Region"
[2] "Density(inhabitants/km2)"
[3] "Population(millions)"
[4] "Most populous country"
[5] "Most populous city (metropolitan area)"
Let’s also make the names syntactically valid; make.names() replaces spaces and special characters with dots
names(world) <- make.names(names(world))
names(world)
[1] "Region"
[2] "Density.inhabitants.km2."
[3] "Population.millions."
[4] "Most.populous.country"
[5] "Most.populous.city..metropolitan.area."
And remove the trailing dots from column names for extra finesse
names(world) <- sub("\\.$", "", names(world))
Step 2: Cleaning the columns
The fourth column Most.populous.country contains two different pieces of information: a population figure and a country name.
Similarly, the fifth column contains four pieces of information: the most populous city, the metropolitan area, and their respective population figures. The city and metropolitan-area names are separated by an en dash – (not a hyphen -), and their values by /.
We can separate the values from names for each column using the separate function from tidyr:
world <- separate(data = world, col = Most.populous.country, into = c("Country.Population", "Country.Name"), sep = "–")
# Check the names
names(world)
[1] "Region"
[2] "Density.inhabitants.km2"
[3] "Population.millions"
[4] "Country.Population"
[5] "Country.Name"
[6] "Most.populous.city..metropolitan.area"
The same goes for the fifth column
world <- separate(data = world, col = Most.populous.city..metropolitan.area, into = c("Metropolitan", "City"), sep = "–")
# Check the names
names(world)
[1] "Region" "Density.inhabitants.km2"
[3] "Population.millions" "Country.Population"
[5] "Country.Name" "Metropolitan"
[7] "City"
Let’s also split the newest columns Metropolitan and City by the delimiter /
world <- separate(data = world, col = Metropolitan, into = c("Metropolitan.Population", "City.Population"), sep = "/")
world <- separate(data = world, col = City, into = c("Metropolitan.Name", "City.Name"), sep = "/")
# And check the names
names(world)
[1] "Region"
[2] "Density.inhabitants.km2"
[3] "Population.millions"
[4] "Country.Population"
[5] "Country.Name"
[6] "Metropolitan.Population"
[7] "City.Population"
[8] "Metropolitan.Name"
[9] "City.Name"
Step 3: Removing the [ ] brackets along with their content from the affected columns
world %<>%
  mutate(Region = str_remove(Region, "\\[.*"),
         Population.millions = str_remove(Population.millions, "\\[.*"),
         Country.Population = str_remove(Country.Population, "\\[.*"),
         Metropolitan.Population = str_remove(Metropolitan.Population, "\\[.*"))
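The pattern "\\[.*" matches an opening bracket and everything after it, so the bracketed references disappear in one go. A quick illustration on a made-up value:
str_remove("745[note 1]", "\\[.*")
# [1] "745"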
Much better, but not done yet. Let’s get rid of the leading zeros from Country.Population
world$Country.Population <- sub("^0+", "", world$Country.Population)
Almost there. Just change the Country.Name entry from "Russia;approx. 110 million in Europe" to what it actually is…
world$Country.Name[grepl("Russ", world$Country.Name)] <- "Russia"
And set the "N/A" and blank cells to missing values (NA)
world$Country.Population[grepl("N/A", world$Country.Population)] <- NA
world %<>%
  mutate(across(everything(), ~ ifelse(. == "", NA, as.character(.))))
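As an optional last check, counting the missing values per column is a handy way to see that the blanks and N/A entries ended up as NA:
colSums(is.na(world))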
Looks like we did it! Here is the good old before & after as a parting pic…
Happy scraping!