How Spatial Polygons Shape Our World - Amelia McNamara

>> All right. So I'm Amelia McNamara. On Twitter, I'm @ameliaMN. I tweeted these links, so you can look them up there. I want to talk about how spatial polygons shape our world. It's about spatial data. It's about how the world is different from the visualization of the world. But specifically, I'm talking about the polygons created by humans.
When I think about geographic data, I think about three major types. I think about points, which might be where people are, or trees or houses, something like that. I think about lines, maybe rivers or roads. And then the part that I'm most interested in is polygons. Polygons are closed shapes that describe some area. Again, a house could be described as a polygon as well, but usually we're thinking about polygons that are larger and contain some amount of human population. So when we're mapping spatial polygons, we can have regular or irregular polygons. You can see this example of hexagons on a map, or polygons like the U.S. states. I grew up in Minnesota, so here are some irregular polygons in the upper Midwest.
And polygons are almost always colored by a value. So we make choropleth maps, with areas, polygons, that are shaded according to the value of some variable. This example I pulled off of Flickr. It's from Prop 1 in Seattle, where they were voting on a light rail measure. You can see there are different areas, and they're shaded according to whether the people voted for or against this measure. Red is less than 50% voting for the measure and blue is more than 50%. And you can already see a problem with plotting this kind of data, which is that small areas look unimportant and big areas look very important. But if you're from Seattle, which I know many of you are, I believe this measure passed. And that's because the places where it's very blue, where people voted for it, are very dense places in the city of Seattle.
And these enormous swaths of red, those are the less urban areas surrounding Seattle. So Andrew Gelman has a great paper called "All Maps of Parameter Estimates are Misleading," and it's about this exact problem. If you have some map of polygons and you color it according to some absolute number, then places that have really big populations are going to have a lot of that thing going on, and places that have small populations are going to have less of it. Often one way that we think about correcting for this problem is to divide by some other measure. Say we're going to look at the percentage of people who have cancer in a particular county. So rather than showing the absolute number of cancer cases in a county, we could divide by the size of the population. And the problem here (this example comes from Ben Jones) is that you end up with really high variance in low-population areas. I think it's Pulaski County in Illinois that has the highest rate of kidney cancer. It's a small population. Polk County, Wisconsin, is also a relatively low population, and it has a low rate. And that's basically because if you don't have many people in an area and a few get cancer, then it's a high rate. If no one gets cancer, it's a low rate. So it doesn't work to just plot the absolute numbers, and it doesn't work to plot the percentages.
And, you know, Gelman has some suggestions for fixing this problem. One of the solutions that I have seen recently that I liked was this paper, "Surprise! Bayesian Weighting for De-Biasing Thematic Maps." You have the absolute numbers, and that's not good. You have the percentages, which have this high variance in small-population areas. And then there's this technique to do a Bayesian weighting. We know places with small populations have a high variance, so is this value surprising or not, given that underlying distribution?
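To make the small-population problem above concrete, here is a minimal sketch in Python. The county names and counts are made up for illustration, and the shrinkage formula is a simple stand-in that pulls each rate toward the pooled rate. It is not the method from the "Surprise!" paper or from Gelman's paper, and `PRIOR_WEIGHT` is an assumed tuning value.

```python
# Hypothetical counties: name -> (cancer cases, population). Made-up numbers.
COUNTIES = {
    "small_high": (3, 100),
    "small_low": (0, 100),
    "large": (1000, 100_000),
}


def raw_rate(cases, population):
    """Observed rate: cases divided by population."""
    return cases / population


def shrunken_rate(cases, population, prior_rate, prior_weight):
    """Pull the observed rate toward prior_rate.

    prior_weight acts like a pseudo-population: the smaller the real
    population is relative to it, the more the estimate moves toward
    prior_rate.
    """
    return (cases + prior_rate * prior_weight) / (population + prior_weight)


# Pooled rate across all counties, used here as the prior.
total_cases = sum(cases for cases, _ in COUNTIES.values())
total_pop = sum(pop for _, pop in COUNTIES.values())
pooled_rate = total_cases / total_pop

PRIOR_WEIGHT = 1000  # assumed value for illustration, not from the talk

for name, (cases, pop) in COUNTIES.items():
    raw = raw_rate(cases, pop)
    shrunk = shrunken_rate(cases, pop, pooled_rate, PRIOR_WEIGHT)
    print(f"{name}: raw={raw:.4f} shrunken={shrunk:.4f}")
```

In this sketch the two small counties, which look like the extremes on a raw-rate map, both move close to the pooled rate, while the large county barely changes.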
And then the other problem, other than the absolute numbers and the percentages, is, again, that big areas get a lot of visual weight. So this is the 2008 election map. I'm sure you're all familiar with it. When you look at a map like this, it looks like the red polygons are taking over the map, right? There's a lot more red area than there is blue area. But in this particular election, that's not the way things swung. You can see at the top the electoral count, which is heavily weighted toward the Democrats. I didn't put in a map from this year, but the map doesn't look that different, because you have the large red states that are getting a lot of visual weight. And the question is, how can you solve that? So one solution would be to plot the number of people that voted one way or another by county, or to use some scale that includes purple. But, again, it doesn't really solve the problem of aggregating to spatial polygons. Because when we make our electoral decisions in the United States, we're basing them on these polygons. We're sending a certain number of electoral votes from a particular state. Showing how people voted within a number of states is interesting, and it shows we're not as divided as we think we are, but it's not giving us more information about how the vote actually turned out.
So people make cartograms. I think this is from the 2004 presidential election. It's taking the counties, sizing each shape based on the population, and then coloring based on whether people voted Democrat or Republican. Again, it might give you an idea that we're not as divided as we thought we were. But it's hard to look up results. If you want to look at your county and see how it voted, well, you'd better live in Los Angeles County, otherwise it's going to be hard to see. Another type of cartogram does the opposite. Instead of distorting the shape and making areas larger or smaller, we make every area exactly the same.
So you can have a cartogram where you use squares, and then every state, in this case, is given the same visual weight. This is an example from NPR. They're visualizing which states have nondiscrimination laws for LGBT people. I think the darker the blue, the better the laws are. And then the gray is where there weren't laws passed yet. And, again, I'm from Minnesota, so when I look at this map, I can tell that there's something wrong about it. Minnesota is not next to Illinois like that. So we've gained something in that all the states have the same visual weight, but we've lost something about the relationship between states and how they connect. So here's maybe a slightly better cartogram, which uses hexagons. And now Minnesota and Wisconsin are sitting next to each other. But if you have contextual knowledge about the United States, you'll find things that aren't quite right. And people have measures for how accurate a cartogram is, whether it's distorting the picture or not. But it still has this problem where we're weighting things in a particular way, and maybe we're losing the spatial relationships.
Okay. So with that in mind, I want to talk about some common spatial polygons that get used to aggregate data. We've already talked about states, which are probably the easiest to think about. For the census, this is the drop-down. If you go download data off of the Census website, these are the enumeration units that you can choose for census data. There are census blocks, census tracts, counties, zip codes, school districts, and so many more. And one of the things that I want to emphasize is that almost none of these spatial polygons are naturally occurring. They all arose from some human political process where a person said, this is where we're going to draw the line between these voting districts or these zip codes. And so it often does not make sense to aggregate data into spatial polygons.
But people do it because it's easy and it looks really pretty. I'm a data science professor, so for me one of the problems comes when you have data at two different spatial aggregation levels, but you would like to combine it. Say you'd like to make a map that shows two variables, one at the school district level and one at the state level, and you need to combine them together. If you're an R programmer, like me, and you had tabular data, you would use something like dplyr or another package, do a SQL-flavored join, and match up things that had matching IDs. The spatial combination problem is that you don't have matching IDs. If you have point data and you want to combine it with polygon data, the IDs for your points are not going to match up with the IDs for your polygons. So you have to do something to solve that kind of data science problem.
Spatial statisticians call this type of problem the change of support problem. And there are different change of support methods for moving between different types of spatial data. If you have point data and what you really need is polygon data, they call that upscaling. If you have polygon data and you really need point data, that's downscaling. And if you have polygon data that's aggregated at some particular level and you need polygon data aggregated at another level that's overlapping, that's sidescaling.
I'm going to talk about the easiest first: upscaling. You have point data and what you actually want is polygon data. And now we're bringing in another geographic problem, which is the modifiable areal unit problem. In the modifiable areal unit problem, you have point data and you aggregate it into polygons, and the choice of your polygon boundaries makes a huge difference to the visual distribution you see in the polygons. So this image here may be the only image associated with the modifiable areal unit problem. It is on every website about this problem. But I think it's really nice.
So the yellow dots are the points, and then you can see the choices for breaking the area down into polygons. The way we're going to color the polygons is by counting how many points fall into each area and then coloring by that value. And you can see that sometimes it looks like there's a huge group, you know, a dark region, a clump, in the center. And then with other ways of aggregating, I can really make that melt away. I don't think it should be incredibly surprising that the way you aggregate data can have a big impact on the distribution that you see.
This is some work in progress with my collaborator, Aran Lunzer. We have been thinking about histograms for a long time. With a histogram, you have point data and you're aggregating it into bins. It's just in one dimension. But if you change the way the bins are defined, either making the bins wider or skinnier, or just adjusting the bin offset or the left or right closedness of the bins, you can really change the visual distribution that you see in your histogram. We have an alpha version of this up right now that you can play with. We're soliciting feedback, so if you have suggestions, we would love to hear them. But the problem that I'm concerned with in terms of spatial polygons, and aggregating in this two-dimensional setting, is that it's this histogram problem, which I think people already have trouble wrapping their minds around, but then we've jumped into two dimensions.
And with regard to things that impact us and shape our world, the way that aggregating from point data to polygon data can have a huge impact on us is gerrymandering. So gerrymandering, as you probably know, was named after a former governor of Massachusetts, Governor Elbridge Gerry. Not Jerry, for whatever reason. And he helped redraw state Senate districts to benefit his party in 1812.
So the political cartoonists of the time drew this salamander that was taking over and changing the way that people were getting aggregated into districts. If you've seen that recently, it was probably on that Last Week Tonight segment that came out about two weeks ago. And when I saw this video, I thought, maybe I should just play this as my OpenVis Conf talk. John Oliver does a really nice job of explaining gerrymandering and pulls up a lot of the examples that I'm going to talk about.
So when I think about gerrymandering: people love this Washington Post article, "This is the best explanation of gerrymandering you'll ever see." It's not points; they're little squares to represent people. Each square is a voter, and we have 50 voters, 60% blue, 40% red. We can divide them up into polygons in a variety of different ways, and then we can see how the election result would shake out. So in the first scenario, you end up with three blue districts and two red districts. Blue wins. In the second one, you end up with five blue districts and zero red districts, and blue wins. And then if you do this kind of gerrymandering process, you can get two blue districts and three red, and red wins.
And gerrymandering is a hard problem to solve. When we define voting districts, we have a variety of measures that we would like to optimize. We would like to have compact, contiguous voting districts that respect political subdivisions or communities defined by actual shared interests. And it's very hard to draw a polygon that respects all three of those things. And sometimes people have even more things that they would like to have in their voting districts. Like maybe you think that competitive voting districts are a valuable thing. Or maybe you want safe districts. And I think our political process hasn't really decided what is legal, what's illegal, and what's even right.
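The 50-voter example above can be sketched in a few lines of Python. The grid layout here is an assumption made for illustration (10 rows by 5 columns, with the three left columns voting blue), not the exact layout of the Washington Post graphic, and the three plans are hypothetical ways of drawing five districts of ten voters each.

```python
from collections import Counter


def voter_color(col):
    """Assumed layout: columns 0-2 vote blue, columns 3-4 vote red."""
    return "blue" if col < 3 else "red"


def districts_won(plan):
    """Count districts won by each side.

    plan is a list of 10 strings of 5 characters; each character is the
    district label for the voter at that row and column.
    Returns (blue_districts, red_districts). None of the plans below has
    a tied district, so ties are not handled specially.
    """
    ballots = {}
    for row in plan:
        for col, district in enumerate(row):
            ballots.setdefault(district, Counter())[voter_color(col)] += 1
    wins = Counter()
    for counts in ballots.values():
        wins["blue" if counts["blue"] > counts["red"] else "red"] += 1
    return wins["blue"], wins["red"]


# Plan 1: each column is a district.
COLUMNS = ["01234"] * 10

# Plan 2: each pair of rows is a district (6 blue and 4 red voters each).
ROW_PAIRS = [str(r // 2) * 5 for r in range(10)]

# Plan 3: blue voters packed into districts 3 and 4, so red wins 0, 1, 2.
GERRYMANDER = [
    "30000",
    "33000",
    "33000",
    "31111",
    "33111",
    "33111",
    "42222",
    "44222",
    "44222",
    "44444",
]

for name, plan in [("columns", COLUMNS), ("row pairs", ROW_PAIRS),
                   ("gerrymander", GERRYMANDER)]:
    blue, red = districts_won(plan)
    print(f"{name}: blue={blue} red={red}")
```

The voters never change between plans; only the district labels do, which is the upscaling choice the talk is describing.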
So when people talk about gerrymandering, they often pull up this example of North Carolina's 12th district. This is not a compact district. But it's a majority-minority district that perhaps does have shared interests: a bunch of African-American people along an interstate corridor there. Another example would be California's 33rd district. I lived in L.A. for quite a while. And, again, this isn't a compact district, but maybe you think the people living along the coastline, with beachfront property, have a shared political interest.
Okay. There are lots of ways to understand that upscaling problem, where you have point data and you're trying to aggregate it into polygon data. I love this redistricting game. This came out of USC, and I recommend going and playing it. It has a ton of levels, and I'm really bad at it, as you can see from this video. It sort of steps you through a number of challenges. So initially you can do whatever you want. And then it wants to have equal population in each of the districts. It wants them to be compact and contiguous. Of course, you need representatives to live in the district that they represent. And so the challenge of the redistricting game is to aggregate this point data into a number of different polygons that maybe help your party win or make the other party lose.
And there's lots more to understand about the problem of gerrymandering. One of the projects that I have been most inspired by recently is that talismanic redistricting tool, which generates many different electoral maps with different districts and looks at the outcomes, so you could compare a real redistricting plan with a bunch of theoretical, simulated redistricting plans and see if the one that was proposed or used was unfair in some way. But I think it's an open problem to decide the most fair way to aggregate us, as people, as points, up into these polygons that decide things about our political process. So the next problem is the downscaling problem.
And this is where you have polygon data and what you really want is point data. So this is an example from New York City. This project, I think, is called Disser. It's the population by census block. Census blocks are similar to city blocks, but not exactly. And then you have to decide how you're going to disaggregate it into points. One way that you could do this is to just say, okay, in this block there are a thousand people, so I'm going to randomly distribute a thousand points. Spatial statisticians and geostatisticians don't like that, because it's not respecting things about the environment. So there's a method called dasymetric mapping, which takes auxiliary information, information about the geographic landscape. Are there mountains, or are there lakes or rivers? If so, we're not going to try to put any people there. And it uses these additional pieces of information to show the way that things are distributed in reality.
So in this project, the Disser project, this is zoomed in on the Lower East Side. This is Stuy Town, and if you were going to disaggregate into a random selection of dots, that's the picture you would get. But what the Disser project has done is combine this with polygon data with the footprints of the houses, the buildings in each of these blocks. And if you randomly place the dots within the houses, then you're probably going to get a better visual distribution of how people actually live. So there's the comparison of those three maps.
There's another very famous example of taking census data, which is usually given to you in an aggregated way, and then disaggregating it into dots. This is the Racial Dot Map. It has one dot per person, and the dots are colored according to race. And, again, I think this is a really beautiful map, and it's done a nice job of using additional information about where people don't live, so that it's pretty accurate. But it's hard to get the kind of Gestalt of what's happening with people.
You have to zoom in and look at the detail. That's not always what people want with visualization.
The last problem is sidescaling. That's the hardest problem. You have polygon data and want polygon data, but the polygons are not exactly the same. I'm going to start by cheating. This is upscaling again, and this is more work with my collaborator, Aran Lunzer. We built this tool, which is an explanation of the modifiable areal unit problem, but it's interactive. It lets you see how changing the way that you aggregate this point data, which is the location of earthquakes in Southern California, changing the binning, where the bins are located, how they're oriented, how large or small they are, changes the distribution that you can see. And so I think that you can see that changing from one polygon level to another is going to make a difference.
So back to our sidescaling problem. If we have nesting polygons, this is easy. Some of you might have seen this example from Kevin Hayes Wilson, Redraw the States. This is taking the electoral map from 2016 and just moving some counties from one state to another. So you can go and play with this. This is me, and I'm moving counties from Florida to Alabama. And I only have to move three to get Florida to swap from being a red state to being a blue state. Just by changing the shape of the spatial polygon, the State of Florida, we're able to get the distribution to be different enough that we can change the way the electoral vote turns out. So you could play with this tool. It lets you adjust the edges of any state, not just Florida. And there's a great Medium post that goes along with it. I don't think we're going to be able to change the outlines of the states. Those are probably spatial polygons that are here to stay. But the districts within the states, that's something we might be able to adjust, again, with this idea of gerrymandering. So the nested polygon problem, that's easy. Misaligned polygons are the big problem.
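The nested case described above is easy because re-aggregating is just a lookup and a sum. Here is a minimal Python sketch with made-up county vote counts and two hypothetical states, "A" and "B"; none of these numbers come from the Redraw the States tool.

```python
# Hypothetical county results: county -> (blue votes, red votes).
COUNTY_VOTES = {
    "a1": (500, 300),
    "a2": (200, 350),
    "a3": (100, 200),
    "b1": (100, 400),
}


def state_winners(county_to_state):
    """Sum county votes up to states and report each state's winner."""
    totals = {}
    for county, state in county_to_state.items():
        blue, red = COUNTY_VOTES[county]
        running = totals.setdefault(state, [0, 0])
        running[0] += blue
        running[1] += red
    return {state: ("blue" if blue > red else "red")
            for state, (blue, red) in totals.items()}


# Original boundaries: three counties in state A, one in state B.
original = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}

# Redrawn boundaries: the same votes, but county a3 now belongs to B.
redrawn = dict(original, a3="B")

print("original:", state_winners(original))
print("redrawn: ", state_winners(redrawn))
```

Because every county sits entirely inside one state, no vote ever has to be split between polygons. That clean lookup is exactly what is missing in the misaligned case.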
And thinking about misaligned polygons, I think the most compelling recent example is the problem of lead in the water in Flint, Michigan. There were a lot of things that Flint did badly when they did their analysis of lead. They weren't using a very good sampling method. They had some strange protocols for people who were collecting water in their own homes. They did bad things with outliers. As a statistician, there are many things that I'm upset about. But one of the things was that they were collecting data at a spatial aggregation level that was inappropriate for the task at hand. They were reporting the lead levels at the zip code level versus the municipal level. The water was distributed at the municipal level, but they were reporting things at the zip code level. This is from a great blog post by a statistician who did some analysis here. And he's showing how the zip codes and the city boundaries are misaligned. The zip codes are the blue outlines, and the city is that green outline.
And, I mean, zip codes in general: if you're plotting zip codes, you're probably doing something wrong. Zip codes are probably the most human-generated and nonsensical of the spatial polygons. They were generated to make it easy for the USPS to deliver mail. They're not respecting, like, geographic boundaries. They're not respecting communities of interest. That's not what they're about. But in this case, it was essentially taking data which had high values in the city and lower values outside, and averaging them.
So I have some students who are working on this problem, actually doing the analysis with data. But for the purposes of this talk, I just sort of mocked something up in Photoshop. So maybe we had a choropleth map that looked like this and wanted to choose the value for the green region. It turns out that one method for finding the value of a region like this is just to find the exact center of that polygon, your target polygon.
Say what value is under that point. And just say that's the value. Okay? So this is in the spatial statistics literature as one of the methods that you could use. I think that's a bad method, and we should do better than that. So what's a little bit better? Well, we could say we're going to weight the areas that fall in that target enumeration unit that we're interested in. Weight those values by how much of the area is shown there. So if we did that, maybe we would find, you know, a slightly different value that might be more accurate.
But there are many methods to do a better job, a more statistically rigorous job. Most of them rely on this rule of thumb, Tobler's first law of geography: everything is related to everything else, but near things are more related than distant things. And that means that if you have two polygons next to each other, you think that maybe there's some continuous relationship between the stuff that's going on in the one over here and the one over here, because that line was just drawn by a human. And Tobler also has this idea of the pycnophylactic property. If you're going to move from one type of map to another, you need to make sure that whatever your smoothing or interpolation method is, it can be reversed back to the way it was before. If you're taking polygon data and turning it into a smoothed surface, it needs to reaggregate back to the values you had before.
So I have some joint work with some students from last year, which is working with the pycno package in R. This is a classic statistics example. This is data from North Carolina on births in 1974. And in order to get this map you have to know things about the sp package and rgdal. And I think there might be some ggplot2 going on here. And you get this beautiful map of the places where there were lots of births and where there were not many births. So the bright spots are lots of births.
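The area-weighting idea and the "it has to add back up" requirement described above can be sketched in one dimension, using intervals as stand-ins for polygons so that overlap is just a length. This is plain areal weighting with made-up zones and counts. It is not Tobler's smooth pycnophylactic interpolation and not what the pycno package does; it only illustrates the mass-preserving check.

```python
def overlap(a, b):
    """Length of the overlap between intervals a=(lo, hi) and b=(lo, hi)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def area_weighted(source, targets):
    """Re-aggregate counts from source zones onto misaligned target zones.

    source:  list of ((lo, hi), count) pairs
    targets: list of (lo, hi) intervals
    Each source count is split in proportion to the share of the source
    zone that falls inside each target zone.
    """
    results = []
    for target in targets:
        total = 0.0
        for zone, count in source:
            share = overlap(zone, target) / (zone[1] - zone[0])
            total += count * share
        results.append(total)
    return results


# Hypothetical source zones (think zip codes) with counts.
SOURCE = [((0, 4), 100), ((4, 10), 60)]

# Hypothetical target zones (think city boundaries) that cut across them.
TARGETS = [(0, 2), (2, 7), (7, 10)]

estimates = area_weighted(SOURCE, TARGETS)
print("target estimates:", estimates)

# Mass-preserving check: the targets tile the same extent as the sources,
# so the estimates should sum back to the original total.
print("totals match:",
      abs(sum(estimates) - sum(count for _, count in SOURCE)) < 1e-9)
```

Areal weighting assumes the count is spread evenly inside each source zone, which is exactly the assumption that smoother methods, or auxiliary information about where people actually are, try to improve on.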
And the pycno package is going to observe the pycnophylactic property and create a smooth surface. So if you run the package, depending on your parameters, this is the smooth surface you might get. And if I overlay the counties again, you can kind of see how things have smooshed out from their original aggregated distribution. When you compare these two maps together, it's not clear that they're going to, you know, observe the pycnophylactic property. It doesn't seem like that bottom one is going to reaggregate to the top one. But it turns out that it actually does. Because pixel values are sort of colored by individuals. So if you have a smaller area, you're going to have more bright spots. So even this place over here, where things look like they're not going to match up very well, does actually aggregate back up to the original data.
We tried this with some other data. This is population data in New England at the voting district level and at the county level. And we've created some different smooth surfaces from each of those levels, and we are working to do the reaggregation from one level of spatial support to another. This is using, again, the pycno package. There is great progress on other tools. I just heard about coGraham in JavaScript today. There's a new package in R, sf, which maybe will not fix the pycno issue, but maybe will help with working on spatial polygon data frames. And there are other methods. There are interpolation methods, which are going to respect observed values and smooth things out in between, and there are smoothing methods, which might not end up matching any of the points. People talk about kriging a lot in spatial statistics. So that's a good interpolation method.
If you're interested in this stuff, I'm happy to talk more with you about it. I think my main takeaways are: if you don't have to aggregate data to polygons, don't do it. If you are going to aggregate, pay attention to the spatial polygons that you're using.
Don't use zip codes or something else that's kind of meaningless. And if you happen to have auxiliary information about where housing is located or where rivers or lakes are, use that information to help you when you're moving between levels of spatial support. So because spatial polygons are impacting us on a day-to-day basis, I think it's really important that we keep working on these methods and being cognizant of the way that they can impact the visualizations that we make. Thank you. [ Applause ]