How Geographic Bias Can Creep Into AI
Artificial Intelligence is incredible. In healthcare, business, and government, AI has the potential to reduce discrimination and provide fairer access for everyone.
Unless we teach our AI to discriminate.
As AI systems are deployed around the world, we’re starting to see that computers can be just as discriminatory and bigoted as human beings. Trained on the wrong data sets, an algorithm can be racist, sexist, or exclusionary.
Another type of bias that doesn’t get as much attention is geographic bias. This is when an AI makes false assumptions based on locale, or fails to account for regional differences. It’s already caused a stir in healthcare.
The Healthcare AI that repeated old mistakes
Stanford’s Institute for Human-Centered Artificial Intelligence (HAI) recently highlighted a concerning study of health-oriented machine learning.
The study looked at data sets used to train 74 different machine learning and AI tools. Researchers took a look at the geographic origin data and made a shocking discovery. Almost three-quarters of tools used data from just three states: New York, California, and Massachusetts. In total, only 16 states appeared in the data, with 34 states totally invisible.
This is worrying for many reasons. Healthcare is not evenly distributed across the nation. People face different challenges in each state: access to health services, exposure to chemicals, proximity to fracking or other hazardous activities, and local epidemiological issues. Not to mention the fact that the racial and economic profile of each state can affect health outcomes.
But perhaps what’s most worrying is that everybody already knew all this. Geographic bias was a major issue in the past, even before AI. Location bias leads to bad healthcare data, which in turn leads to bad medical decisions. In the 1990s, almost 30 years ago, the federal government created new guidelines to ensure that all medical research uses geographically diverse testing.
It’s a reminder that we can’t assume AI will always be impartial. An algorithm can only be as objective as the data it’s working with. To quote a recent McKinsey report: “AI can help reduce bias, but it can also bake in and scale bias.”
5 Kinds of geographic bias in AI
Geographic bias leads to AI making bad decisions. But how does AI acquire this bias in the first place?
It’s rarely intentional. Instead, it’s the result of bad data management. Here are five common mistakes.
Establishing one location as a universal standard
From a data scientist’s point of view, it’s tempting to focus on one city and then try to extrapolate a universal template.
This seems to be the problem identified in the Stanford report. Medical researchers were trying to use data from three states as a universal model for the whole country. But when you’re dealing with healthcare, you have to consider any regional variations that might impact your analysis.
It’s the same with any kind of regional bias. When you’re comparing populations, you have to look at things like:
- Population density
- Access to services
- Local economic conditions
An AI can only work with the data it’s given. It can’t tell the difference between different populations unless it has the right data.
Absence of data from some locations
As mentioned in the example above, the biggest problem is that some locations simply aren’t represented in the data. It might seem like this is an easy issue to avoid, but incomplete data can emerge in several ways:
- Data obfuscation: When you’re working with sensitive information, many of the values may be masked or obfuscated. This means that you can’t always tell which areas you’re looking at.
- Granularity: Coverage maps can have different levels of granularity. For example, you can organize data by state, by county, by city, or by ZIP code. If the data isn’t granular enough, you might not identify any blind spots.
- Inaccurate borders: Physical borders and cultural borders often don’t correspond. For example, many cities have distinct Uptown, Midtown, and Downtown areas, each with very different characteristics. If you just look at average figures for the whole city, you’ll miss that nuance.
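One rough way to catch missing locations is to count records per region and compare the result against the list of regions the AI is supposed to serve. The sketch below is only an illustration: the `state` field, the `EXPECTED_REGIONS` set, and the `find_coverage_gaps` helper are all hypothetical names, and the records are made up.

```python
from collections import Counter

# Hypothetical list of regions the model is expected to serve.
# A real audit would list every region (or county, or ZIP code) in scope.
EXPECTED_REGIONS = {"CA", "NY", "MA", "WY", "TX"}


def find_coverage_gaps(records, expected_regions, min_count=1):
    """Return {region: record_count} for regions with fewer than min_count records."""
    # Records whose location is masked or missing are skipped, not guessed.
    counts = Counter(r["state"] for r in records if r.get("state"))
    return {
        region: counts.get(region, 0)
        for region in sorted(expected_regions)
        if counts.get(region, 0) < min_count
    }


# Made-up sample records; None stands for an obfuscated location.
records = [
    {"state": "CA"}, {"state": "CA"}, {"state": "NY"},
    {"state": "MA"}, {"state": None},
]

gaps = find_coverage_gaps(records, EXPECTED_REGIONS)
print(gaps)  # {'TX': 0, 'WY': 0}
```

Running the same check at a finer level of granularity (county or ZIP code instead of state) may reveal blind spots that a state-level count hides.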
Major locations overrepresented in data
Even when smaller locations are represented in data, larger locations can often drown them out.
Geographical data tends to be skewed towards urban areas, simply because more people live in cities than in rural areas. California has a population almost 70 times the size of Wyoming’s, which means that the Golden State’s residents will vastly outnumber those of smaller states in a nationwide data set.
There are also other factors to consider, such as the availability of technology in large population centers. A study of user-submitted data on the OpenStreetMap project found a heavy bias towards urban areas. This is because most project contributors are themselves urban residents, and users tend to focus on mapping the areas they know.
On the other hand, rural residents and people in underprivileged areas are less likely to get involved in OpenStreetMap. As a result, these regions are less well-represented.
An AI algorithm can only work with the data it’s given. If a dataset mostly focuses on New York and LA, the AI will also focus on those areas.
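To put a number on overrepresentation, you can compare each region’s share of the data set with its share of the real population. The following is a minimal sketch under stated assumptions: `representation_ratios` is a hypothetical helper, and the region names and figures are invented for illustration.

```python
def representation_ratios(sample_counts, population):
    """Compare each region's share of the data with its share of the population.

    A ratio of 1.0 means proportional representation, above 1.0 means the
    region is overrepresented, and below 1.0 means it is underrepresented.
    """
    total_sample = sum(sample_counts.values())
    total_population = sum(population.values())
    if total_sample == 0 or total_population == 0:
        raise ValueError("sample and population totals must be non-zero")

    ratios = {}
    for region, people in population.items():
        sample_share = sample_counts.get(region, 0) / total_sample
        population_share = people / total_population
        ratios[region] = sample_share / population_share
    return ratios


# Invented figures: residents per region and records per region.
population = {"Metro": 800, "Suburb": 150, "Rural": 50}
sample_counts = {"Metro": 90, "Suburb": 9, "Rural": 1}

ratios = representation_ratios(sample_counts, population)
for region, ratio in ratios.items():
    print(region, round(ratio, 3))
# Metro 1.125
# Suburb 0.6
# Rural 0.2
```

In this made-up example, rural residents appear in the data at only a fifth of the rate their population would suggest. Whether to fix that by collecting more data or by reweighting existing records is a judgment call that depends on the project.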
Assumptions about elements based on location
Over time, AI can begin to develop a bias of its own. A machine learning paper by Daniel Shapiro examines this problem, offering an example of an algorithm that categorizes businesses based on their names.
Imagine two businesses: Daniel’s Gems and Sandy’s Gems. The AI may know that “Gems” might indicate jewelry or homecrafts. If the data shows that men are more likely to own high-value businesses, the AI could assume that Daniel’s Gems is a jeweler’s store and Sandy’s Gems sells knick-knacks. Something similar was reported at Amazon in 2018, when an experimental recruitment AI taught itself to penalize female candidates.
The same bias can creep in with location names. For example, imagine a data set with two stores, one called Eagleton Gems and the other called Pawnee Gems. If the previous data indicated that Eagleton typically has higher-value businesses than Pawnee, the AI might assume that Eagleton Gems sells more expensive goods than Pawnee Gems. This is an example of geographic bias.
Data not detecting cultural and local variations
Sometimes, it’s simply a matter of comparing apples and oranges. For example, imagine two hypothetical towns: Packersville and Lakerston.
A sports marketing company is using Big Data and AI to identify opportunities. The company trains its algorithms on basketball data: how many people in each town attend games or watch the NBA on TV. Their analysis shows that the people of Lakerston are highly engaged, while the people of Packersville don’t care so much for sports.
Here’s the problem: the people of Packersville actually love sports. They’re all football fanatics. But the algorithm won’t detect this, because it’s been trained to think that basketball data is the only indicator of sports engagement.
Some regions are so culturally different from each other that you can’t compare them with the same data. Instead, you need to train an AI from scratch using a new data set that reflects regional preferences.
How to avoid geographic bias in AI
Whether you’re building an AI from scratch or training a machine learning algorithm, there are a few steps you can take to avoid inherent bias.
Be open about biases
Human beings have geographic biases. Sometimes, this means they have a negative opinion of a locale. More often, it’s simply that they just don’t know about the place, and therefore think it’s not important. It’s good to sit down and have an open, non-judgmental conversation about your team’s biases. When you’re aware these biases exist, you can start watching out for them in data.
Audit training data sets
An AI understands the world by analyzing data. To develop an unbiased AI, therefore, you have to feed it unbiased data. Humans need to intervene at this stage to ensure that the training data is fair, accurate and complete.
Use human circuit-breakers
AI doesn’t have to be completely independent. Human operators can intervene at various stages in the process to ensure that bias hasn’t crept in. For example, suppose an AI is generating letters to clients. In that case, a human admin can audit those letters before they go out.
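As a rough illustration of a circuit-breaker, the sketch below holds back any AI-generated letter that either targets a region with thin training data or comes with a low confidence score, and releases the rest. Everything here is an assumption made for the example: the `Draft` fields, the `confidence` score, the 0.8 threshold, and the set of low-data regions.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    client_id: str
    region: str
    confidence: float  # hypothetical model score between 0 and 1


def needs_human_review(draft, low_data_regions, min_confidence=0.8):
    """Hold a draft if its region is thinly covered or the model is unsure."""
    return draft.region in low_data_regions or draft.confidence < min_confidence


def route(drafts, low_data_regions, min_confidence=0.8):
    """Split drafts into (auto_send, human_review) lists, preserving order."""
    auto_send, human_review = [], []
    for draft in drafts:
        if needs_human_review(draft, low_data_regions, min_confidence):
            human_review.append(draft)
        else:
            auto_send.append(draft)
    return auto_send, human_review


# Made-up drafts: c2 is confident but targets a low-data region,
# c3 targets a well-covered region but has a low score.
drafts = [
    Draft("c1", "CA", 0.95),
    Draft("c2", "WY", 0.97),
    Draft("c3", "NY", 0.55),
]

auto_send, human_review = route(drafts, low_data_regions={"WY"})
print([d.client_id for d in auto_send])     # ['c1']
print([d.client_id for d in human_review])  # ['c2', 'c3']
```

The point of the design is that a high confidence score alone does not release a letter: if the AI has seen little data from a region, a human still looks at the result.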
Do a sense check on outcomes
AI and machine learning can sometimes produce surprising results. This is why they’re useful – they identify patterns that we can’t see. However, it’s good to review automated processes and do a quick sense check to ensure there are no glaring errors in the output.
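A quick sense check can be partly automated. The sketch below compares the rate of positive outcomes in each region with the overall rate and flags any region that sits far from it. The `flag_outlier_regions` helper, the 0.2 threshold, and the outcome data are hypothetical, and a flagged region is only a prompt for a human to take a closer look, not proof of bias.

```python
from collections import defaultdict


def flag_outlier_regions(outcomes, max_gap=0.2):
    """Flag regions whose positive-outcome rate is far from the overall rate.

    outcomes: iterable of (region, is_positive) pairs.
    Returns {region: rate} for regions more than max_gap away from overall.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for region, is_positive in outcomes:
        totals[region] += 1
        positives[region] += int(is_positive)

    record_count = sum(totals.values())
    if record_count == 0:
        return {}
    overall_rate = sum(positives.values()) / record_count

    flagged = {}
    for region, total in totals.items():
        rate = positives[region] / total
        if abs(rate - overall_rate) > max_gap:
            flagged[region] = rate
    return flagged


# Made-up outcomes: 9 of 10 positive in Eagleton, 2 of 10 in Pawnee,
# and 6 of 10 elsewhere.
outcomes = (
    [("Eagleton", True)] * 9 + [("Eagleton", False)] * 1
    + [("Pawnee", True)] * 2 + [("Pawnee", False)] * 8
    + [("Elsewhere", True)] * 6 + [("Elsewhere", False)] * 4
)

flagged = flag_outlier_regions(outcomes)
print(sorted(flagged))  # ['Eagleton', 'Pawnee']
```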
Add diversity to the AI team
Most importantly, make sure that there’s diversity in the AI training team. A diverse team can identify all kinds of bias at source, including geographical bias. Make sure that everyone on the team has an equal voice and can raise their concerns if need be.
AI can be an incredibly useful tool for equality and representation. AI doesn’t set out to discriminate; it just processes data and follows the rules. Discrimination is usually a by-product of bad data. It’s up to us to ensure that algorithms get the right data.