Visualizing Cities

For this data visualization project, I used Bloomberg's Global Cities Index dataset from 2014. It contains interesting economic and environmental information about major cities around the world, although it does not include a single city from Africa, Southeast Asia, or Eastern Europe.

From the dataset, I selected the 51 cities with the biggest metropolitan areas and the 11 parameters I found most interesting. This is the Python script that did the cherry-picking from the original .csv.
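A minimal sketch of this kind of selection, using only the standard library, could look like the following. It is not the original script: the column names (metro_size, gdp_per_capita, and so on) and the sample rows are made up for illustration, and the real dataset's headers will differ.

```python
import csv
import io

# Hypothetical column names; the real dataset's headers may differ.
SIZE_COLUMN = "metro_size"
KEEP_COLUMNS = ["city", "country", "gdp_per_capita", "unemployment_rate"]


def pick_cities(csv_file, top_n=51):
    """Keep the top_n rows by SIZE_COLUMN, restricted to KEEP_COLUMNS."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: float(row[SIZE_COLUMN]), reverse=True)
    return [{name: row[name] for name in KEEP_COLUMNS} for row in rows[:top_n]]


# Tiny made-up sample standing in for the original .csv.
sample_csv = io.StringIO(
    "city,country,metro_size,gdp_per_capita,unemployment_rate,extra\n"
    "A,X,100,30000,5.0,foo\n"
    "B,Y,300,20000,9.5,bar\n"
    "C,Z,200,45000,3.2,baz\n"
)
selected = pick_cities(sample_csv, top_n=2)
print([row["city"] for row in selected])
```

With the real file, the StringIO object would be replaced by an open file handle and KEEP_COLUMNS by the 11 chosen parameters.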

I stored the final processed data in a Google Sheet so that it is easy to change the data if needed (and also because, in this century, the cloud is sexy). The sheet is accessed by Tabletop and transformed into JSON each time this page is loaded. Below is the table showing the selected data I visualized.

I wanted to see if cities could be clustered based on their eleven parameters. To plot this in 2D, the 11 dimensions need to be reduced to just 2. For this, I used Principal Component Analysis (PCA) to calculate the principal components of the original dataset, and selected the two with the highest variance. The data was centered and standardized, and matplotlib's PCA function was used. To plot the results, I used a scatter plot from Highcharts.
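To make the reduction step concrete, here is a dependency-free sketch of PCA. It is not matplotlib's PCA function: it standardizes the columns, builds the covariance matrix, and finds the two leading eigenvectors by power iteration with deflation, a technique swapped in here so the example runs without extra libraries. The sample numbers are made up.

```python
import math


def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
            for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of rows that are already centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=500):
    """Power iteration: returns (eigenvalue, unit eigenvector)."""
    d = len(matrix)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * matrix[i][j] * vec[j]
                for i in range(d) for j in range(d))
    return value, vec


def pca_2d(rows):
    """Project rows onto the two principal components with highest variance."""
    data = standardize(rows)
    cov = covariance(data)
    components = []
    for _ in range(2):
        value, vec = top_eigenvector(cov)
        components.append(vec)
        # Deflation: subtract the component just found so the next
        # power iteration converges to the following one.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(len(vec))]
               for i in range(len(vec))]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in data]


# Made-up sample: 5 "cities" with 3 parameters each.
sample = [
    [1.0, 2.0, 9.0],
    [2.0, 4.1, 7.0],
    [3.0, 5.9, 8.0],
    [4.0, 8.2, 3.0],
    [5.0, 9.8, 4.0],
]
projected = pca_2d(sample)
for x, y in projected:
    print(round(x, 3), round(y, 3))
```

Each output pair is one point for the scatter plot. The sign of each axis is arbitrary in PCA, so a plot made this way may be mirrored relative to another implementation's.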

I was very happy to see that nearly all US cities are clustered together in the bottom right corner. The two outliers, St. Louis and Minneapolis, sit above the cluster (due to their extremely large green area per million people). According to the dataset, nearly all US cities have high GDP per capita, a low unemployment rate, low population density, and high CO2 emissions per capita. Each city's population is also small compared to the overall national population.

South Korean and Japanese cities form a small cluster in the bottom middle, below the European cities. Their unemployment rate is very low, population density is high, green area per million people is extremely low, and air pollution is high.

Seoul is the most visible outlier. Why? To understand that, I sorted each column in the table and checked whether Seoul was at the top or at the bottom of the list. I found that Seoul has the highest population density among the 51 cities, as well as the highest share of the country's population living in the metropolitan area (almost 50%!), and as a result a very high percentage of the country's GDP comes from there. It has one of the lowest unemployment rates, very little green area, low CO2 emissions, yet extremely high air pollution.
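The same check can be done programmatically. The sketch below ranks one city in every numeric column; the city names, column names, and values are placeholders, not figures from the dataset.

```python
def rank_in_column(table, city, column):
    """1-based rank of `city` when rows are sorted by `column`, highest first."""
    ordered = sorted(table, key=lambda row: row[column], reverse=True)
    return [row["city"] for row in ordered].index(city) + 1


def ranks_for(table, city):
    """Rank of `city` in every column except the city name itself."""
    columns = [name for name in table[0] if name != "city"]
    return {name: rank_in_column(table, city, name) for name in columns}


# Placeholder rows; the values are invented for the example.
table = [
    {"city": "P", "density": 900, "unemployment": 4.0},
    {"city": "Q", "density": 300, "unemployment": 12.0},
    {"city": "R", "density": 500, "unemployment": 7.5},
]
print(ranks_for(table, "P"))
```

A rank of 1 or a rank equal to the number of cities flags the columns in which a city is an extreme, which is what makes it stand out in the PCA plot.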

Athens is similar in that 1/3 of Greece's citizens live in the capital's metropolitan area, and as a result 44% of the country's GDP is produced there. But unlike Seoul, Athens has the highest unemployment rate among all compared cities.

I was expecting to see London, New York, and Tokyo closer together, but they are spread out across the bottom part of the graph. Vienna, considered the most liveable city for many years, turned out to have the highest y coordinate!

In the second visualization, I wanted to show a map with all the data I had for each city. Instead of designing complicated glyphs able to represent 11 parameters, I used a timeline-like approach: every two seconds or so, the information changes. But instead of showing historic data for one parameter, as one might expect, every update of my map shows a completely different parameter.

For the map, I used my beloved Leaflet with Stamen's basemap. I chose to represent each city with a circle icon whose radius depends on the current parameter's value for that city. To get the coordinates of each city, I put to use an old script I developed for Google Sheets, geocoder-for-google-sheets. For each parameter, I determined the minimum value (radius = 2px) and the maximum value (radius = 20px). I then calculated the radius for each data point (city) as a value between 2 and 20 pixels.
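One way to compute such a radius is min-max scaling. The sketch below assumes the mapping is linear, which the text does not state; the function names are illustrative.

```python
MIN_RADIUS_PX = 2
MAX_RADIUS_PX = 20


def radius_for(value, lowest, highest):
    """Linearly map value from [lowest, highest] to [2px, 20px]."""
    if highest == lowest:
        # All cities share the same value: use the middle radius.
        return (MIN_RADIUS_PX + MAX_RADIUS_PX) / 2
    share = (value - lowest) / (highest - lowest)
    return MIN_RADIUS_PX + share * (MAX_RADIUS_PX - MIN_RADIUS_PX)


def radii(values):
    """Radii for one parameter across all cities."""
    lowest, highest = min(values), max(values)
    return [radius_for(v, lowest, highest) for v in values]


print(radii([10, 20, 30]))  # [2.0, 11.0, 20.0]
```

Because the scaling is redone for every parameter, the smallest circle always has a 2px radius and the largest a 20px radius, regardless of the parameter's units.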

This visualization turned out to be very effective for spotting regional features of cities. For example, it is easy to observe that US cities have very high GDP per capita and CO2 emission rates, that Southern European cities (Lisbon, Madrid, Barcelona, Naples, Athens) have ridiculously high unemployment rates compared to the rest of the world, or that the share of urbanised area in the US is higher on the East Coast than in the rest of the country.

The final visualization is a map that allows you to compare two parameters across all cities. The value of the first parameter is expressed through the circle's radius, whereas the value of the second parameter determines its color (green for low values, red for high values). The idea is that users can find relationships between two different parameters. For instance, if you compare Population Density and Air Pollution, you may observe that higher density (bigger circles) often comes with higher air pollution (more red). But remember that correlation does not imply causation!

To calculate colors for different values, I discovered an elegant solution: use the HSL color model and calculate the hue as var hue = ((1 - value) * 120), where value is between 0 and 1. When value is 1, hue = 0 (red). When value is 0, hue = 120 (green).
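The same mapping can be sketched in Python to check the endpoints. The full saturation and 50% lightness are assumptions, as is the clamping of out-of-range values; only the hue formula comes from the text above.

```python
import colorsys


def hue_for(value):
    """Hue in degrees: 120 (green) for value 0, 0 (red) for value 1."""
    value = min(1.0, max(0.0, value))  # clamp to [0, 1] (an added safeguard)
    return (1 - value) * 120


def hsl_css(value, saturation=100, lightness=50):
    """CSS color string such as 'hsl(120, 100%, 50%)'."""
    return f"hsl({hue_for(value):.0f}, {saturation}%, {lightness}%)"


def hex_color(value):
    """Equivalent hex color; note that colorsys takes H, L, S in that order."""
    r, g, b = colorsys.hls_to_rgb(hue_for(value) / 360, 0.5, 1.0)
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255))


print(hsl_css(0), hex_color(0))  # lowest value: green
print(hsl_css(1), hex_color(1))  # highest value: red
```

Values in between pass through yellow and orange, which is why interpolating the hue works better here than blending red and green directly in RGB.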

shape vs color