Architecture and Beyond with Data Science and Open Data

B Birdsell
6 min read · Jul 29, 2020

Central to what I want to do with my coding skills is dipping into the ever-flowing streams of data around us. There's a very distinct connection in my mind between understanding the context of a project site through data and delivering a better architectural project. There's still room in the creative process to visit a site and get inspired by it. But these are expensive projects, and as stakeholders seek ever higher levels of energy performance and positive contributions to surrounding communities, there's an increased focus on evidence-based design (a strength of data, I would argue).

Smart Cities use data and sensors to become more sustainable, resilient and adaptable.

Through networks of sensors and databases within their jurisdictions, Smart Cities strive to use data to become more sustainable, resilient and adaptable in the face of climate change. It's fair to say Vancouver is a global leader in this regard, with open access to all sorts of data. Gaps exist, however, and I was hoping to create a series of visualizations describing the current scope and nature of smart city open data in Canada.

The first thing that comes to mind is how fragmented the subject matter is. There is very little uniformity in how jurisdictions post data. The part of me that loves creativity dies a bit to see such extreme fragmentation. It adds a layer of complexity and messiness to the problem that seems intractable.

For example, business license data available across Metro Vancouver is not consistently labelled. This triggers the need for pre-processing before any conclusions can be drawn. Pandas for Python makes alignment a relatively straightforward process, but more headaches are encountered if certain data is missing. Though most municipal databases conveniently package the GPS coordinates with the business address, not all do. Python again proves to be a flexible language for such chores. Programming an address parser is not a trivial exercise when the amount of data becomes substantial. The problem necessitates a certain degree of robustness in the code so it does not error out or time out during the computation. In a few short lines of code, Python delivers a parser consistently capable of churning out thousands of rows per hour directly into the project's GeoPandas DataFrame.

Python GeoPandas DataFrame Address parser example
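
For illustration, here is a minimal sketch of what such a parser could look like. The file name, column names, and the choice of geopy's Nominatim geocoder are my assumptions rather than the original code, but they show the shape of the approach: tolerate failures, and land everything in a GeoPandas DataFrame.

```python
import geopandas as gpd
import pandas as pd
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
from shapely.geometry import Point

# Hypothetical input: each municipality labels its license data differently,
# so the address column would be mapped per source during pre-processing.
licenses = pd.read_csv("business_licenses.csv")  # assumes an "address" column

geolocator = Nominatim(user_agent="lower-mainland-business-density")
# Rate limiting and retries keep a long run from erroring out or timing out
# when the service is slow or an address cannot be resolved.
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def to_point(address):
    """Return a Point for the address, or None if it cannot be geocoded."""
    location = geocode(address)
    return Point(location.longitude, location.latitude) if location else None

licenses["geometry"] = licenses["address"].map(to_point)
gdf = gpd.GeoDataFrame(
    licenses.dropna(subset=["geometry"]), geometry="geometry", crs="EPSG:4326"
)
```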

Other parts of the project exposed my lack of formal computer science education, but also, I hope in retrospect, proved my resilience and curiosity to learn. Once the more than 750,000 business entries were collected and consistently geolocated, it was time to calculate their density. Laying out the geometry of the polygons over the Lower Mainland was good sport, but my first attempt to count them unwisely used a double loop, requiring the computation to go through all the columns, then rows, again and again, for each point counted. I had inadvertently created a hot spot in my code. The algorithm was so inefficient that even after taking a subset of the data and lowering the resolution of the polygons, the computation still took 20 hours on my desktop computer.
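
To make the hot spot concrete, here is a rough sketch of that kind of nested loop. The synthetic points and regular rectangular grid are stand-ins I've made up (the original used polygons over the Lower Mainland), but the shape of the problem is the same: every point forces a scan of every column and every row.

```python
import numpy as np

# Made-up stand-in for the geolocated business points (longitude, latitude).
rng = np.random.default_rng(0)
points = rng.uniform([-123.3, 49.0], [-122.2, 49.4], size=(10_000, 2))

# A regular grid of cells covering the study area (resolution is made up).
x_edges = np.linspace(-123.3, -122.2, 101)  # 100 columns
y_edges = np.linspace(49.0, 49.4, 51)       # 50 rows

# The slow version: for each point, scan every row and every column.
# Work grows as points x rows x cols, so it balloons with 750,000 points.
counts = np.zeros((len(y_edges) - 1, len(x_edges) - 1))
for x, y in points:
    for i in range(len(y_edges) - 1):
        for j in range(len(x_edges) - 1):
            if x_edges[j] <= x < x_edges[j + 1] and y_edges[i] <= y < y_edges[i + 1]:
                counts[i, j] += 1
```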

So little code can cause such a headache.
1200 minutes is 20 hours for a subset of the data. The computation spent the majority of the time in the above loop.
Rewritten code using Python Numpy module.
Much more reasonable results.

Determined to find a way to do the calculation, I learned a lot from solving the problem about how to use well-known Python modules to push the work down to a lower level, much faster. My second attempt, with NumPy arrays instead of a loop, was orders of magnitude faster, allowing me to complete the higher-resolution visualization below in seconds. If access to open data were more uniform across the Lower Mainland, one would expect to see a wider, more colourful patchwork below. Instead we see large urban areas with no data, and other Metro Vancouver communities left out for the same reason.
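
The vectorized equivalent might look something like the sketch below, again with my made-up rectangular bins rather than the original polygons. np.histogram2d pushes the counting into compiled code, so the same job finishes in a fraction of a second rather than hours.

```python
import numpy as np

# Same made-up points and grid edges as in the previous sketch.
rng = np.random.default_rng(0)
points = rng.uniform([-123.3, 49.0], [-122.2, 49.4], size=(10_000, 2))
x_edges = np.linspace(-123.3, -122.2, 101)
y_edges = np.linspace(49.0, 49.4, 51)

# One call bins every point at once: latitudes map to rows, longitudes to
# columns, replacing the triple loop with a single vectorized operation.
counts, _, _ = np.histogram2d(points[:, 1], points[:, 0], bins=[y_edges, x_edges])
```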

If municipalities ever wish to see the economic benefit of uniform data standards, ones that businesses and policy makers can use to drive good decision-making, what common ground can be found? I alluded above to the seemingly insurmountable challenge of harmonizing data formats; however, there is a type of geospatial visualization that suggests otherwise. The blue areas below represent where half of the population lives. It is a vast amount of space, but really, change is only required in a small proportion of it for the rest to follow. Initially, I didn't know the technical name for such a statistical representation, but the logic made sense to me. Using open census data from the United States and Canada, one need only sort the population of each county from largest to smallest and add them together, starting from the top, until the sum reaches half the total population. A few tags later to represent the colours, and one has an image representing the area-weighted average of a country's population.
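
In code the idea is little more than a cumulative sum. The sketch below assumes a hypothetical census extract with one row per county or census division; the file and column names are mine, not the original dataset's schema.

```python
import pandas as pd

# Hypothetical census extract: one row per county / census division.
counties = pd.read_csv("census_divisions.csv")  # columns: name, population

half_total = counties["population"].sum() / 2

# Sort from largest to smallest and accumulate until half the population
# is reached; those rows are the ones coloured blue on the map.
counties = counties.sort_values("population", ascending=False)
counties["in_top_half"] = counties["population"].cumsum() <= half_total

print(counties[["name", "population", "in_top_half"]].head())
```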

Blue areas represent where half the population lives. Area-weighted mean of Canadian Population.

While the above visualizations nicely represent facets of the problem, there is also formal research being done on the subject. I wanted to make readers aware of an excellent 2015 paper from Natural Resources Canada which really resonated with me and my experiences using open data for building analysis. There is a matrix at the bottom of the linked page (which I hesitate to copy directly here) that I thought summarized well the state of the subject in Canada.

Smart Cities use data and sensors to improve the lives of their citizens

If such factors come together, there is a chance Smart Cities can use data to improve the lives of their citizens. With this goal within reach, please support access to open data locally. The more interoperable the data formats become, the wider the positive impact can be. In this time and place, access to these sophisticated data science tools is impressively democratic. It is trivial to apply them; what isn't trivial are the problems we apply them to.

— — — — — —

Blair is a parametric modeler in Vancouver B.C. who has previously written on the subject of data science and architecture. Follow him on Instagram or connect with him on Linkedin.
