The Dataset

Facebook and the Data for Good Initiative

Facebook, Inc. is a Fortune 50 social media conglomerate best known for creating and operating its flagship product, Facebook.com. With over 2.5 billion users across the globe, and over 190 million users in the United States alone, the company’s influence and reputation have grown continually since its inception in 2004.1 With access to the personal information and media files of virtually all of its users, Facebook owns one of the largest datasets of personal information in the world today.

The company’s Data for Good initiative was created to harness its data in an anonymized fashion in order to assist researchers, policymakers, and non-profit organizations in benefiting their communities.2 The dataset used here and its associated research paper were authored by researchers from Facebook, Harvard University, Princeton University, and New York University.3 It is not known, however, whether these associated institutions provided further funding and resources toward the project.

Our project references the Social Connectedness Index (SCI), which quantifies the relative connection between different localities around the country and around the world. While the available data is global, this project focuses on the domestic data, which compares all pairs of counties within the United States.

The county-to-county data is available via the Humanitarian Data Exchange, an open-source data-sharing platform provided by the United Nations Office for the Coordination of Humanitarian Affairs (OCHA). OCHA created the service in furtherance of its goal to help the world become more informed on humanitarian crises.4

County-by-county Social Connectedness Indexes

The following interactive map depicts Facebook’s publicly available SCI data in what we consider a more readable format. You can either type in or select a county from the module on the bottom right. The color of each other county represents the relative likelihood that an individual within your chosen county is Facebook friends with an individual from that county, compared to all other counties. The data include all counties within the contiguous United States, in addition to Alaska, Hawaii, and Puerto Rico.
(Note: Your selected county is represented in black)

Data Critique

Before we begin to analyze and explore the social implications of this data, it is important that we understand the data collection process and any surrounding biases. To do this, we need to ask and answer the following questions:

How was this data generated?

Why did Facebook seek to create this? What were their methodologies?

What does this data tell us?

What are the limits to what we can derive? What questions can we answer using this data?
The public release of the Social Connectedness Index dataset includes solely an anonymized and calculated metric representing internal Facebook data. The Facebook researchers behind the creation of this dataset claim that each Facebook user is assigned to a location based on their information and activity on Facebook, including the stated city on their Facebook profile, device, and connection information. “Friendship” on Facebook.com is a binary system in which two users must mutually consent to a connection, and for the large majority of the population, “Friends” on Facebook are only those they know in real life.5 From this sample, they apply the following calculation to the anonymized data to produce a Social Connectedness Index:
$$\text{Social Connectedness Index}_{x,y} = \frac{\text{Facebook Connections}_{x,y}}{\text{FB Users}_x \times \text{FB Users}_y}$$
Here, x and y represent two locations, Facebook Connections represents the number of Facebook friendships between the two locations, and FB Users represents the total number of Facebook users in each location. The calculated values are then normalized to a scale of 1 to 1,000,000. This measurement illustrates the relative likelihood that somebody in location x is connected to a user in location y.
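To make the calculation concrete, the following is a minimal sketch in Python. It assumes the index is the number of friendships between two locations divided by the product of their user counts. The function names, the toy counts, and the exact rescaling rule (divide by the largest value, round, and floor at 1) are our own assumptions for illustration; the source states only that values are normalized to a scale of 1 to 1,000,000.

```python
def raw_connectedness(connections, users_x, users_y):
    """Friendships between x and y per possible pair of users."""
    return connections / (users_x * users_y)


def scale_index(raw_values, top=1_000_000):
    """Rescale raw values so the largest becomes `top` and none falls below 1.

    ASSUMPTION: dividing by the maximum and rounding is one plausible way
    to reach a 1..1,000,000 scale; it is not a documented Facebook procedure.
    """
    largest = max(raw_values.values())
    return {pair: max(1, round(top * value / largest))
            for pair, value in raw_values.items()}


# Invented toy counts for three hypothetical locations, not real Facebook figures.
users = {"A": 1000, "B": 2000, "C": 500}
friendships = {("A", "A"): 5000, ("A", "B"): 400, ("A", "C"): 50, ("B", "C"): 10}

raw = {(x, y): raw_connectedness(n, users[x], users[y])
       for (x, y), n in friendships.items()}
sci = scale_index(raw)

for pair, value in sorted(sci.items()):
    print(pair, value)
```

In this toy example the A-to-B value comes out at twice the A-to-C value, which is how the index should be read: as a relative likelihood of friendship between locations, not as a count of friendships.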

By collecting location information from all users within distinct areas, Facebook hopes to create a representative sample of the population. However, while the public version of the Social Connectedness Index consists solely of anonymous data, the researchers who built the Index had to process raw, non-anonymous data to create the release.6 Because access to this larger dataset is available through formal research requests, this raises privacy concerns, given Facebook’s history of scandals involving both the company itself and the third parties with which it contracts.
2008: The T3 Dataset

2013: Contact Information Data Breach

2018: Cambridge Analytica Scandal

Our concerns over privacy drive us toward the question of why Facebook would seek to extract this data, and why it deemed the data well-suited for the Humanitarian Data Exchange. Given the company’s history, there are valid concerns over whether the information gathered for the Data for Good initiative is being used strictly for socially focused purposes rather than for financial benefit. While speculative, it is possible that this initiative and the distribution of these carefully cleaned data serve to rehabilitate the brand after its privacy scandals and to boost its political standing.

What's In? What's Out?

The information provided by the data is grand in scope, yet shallow in depth and partially incomplete. Globally, the researchers aimed to include information from all countries, yet were prevented from providing data on the following areas:7

Afghanistan, Western Sahara, China, Cuba, Iraq, Israel, Iran, North Korea, Russia, Syria, Somalia, South Sudan, Sudan, Venezuela, Yemen, Crimea, Jammu and Kashmir, Donetsk, Luhansk, Sevastopol, West Bank, and Gaza

However, the methodology does not tell us why each of these areas was omitted or what factors determined the omission. Likewise, the data excludes the regions with the fewest active users under the Database of Global Administrative Areas (GADM). Countries with a population of fewer than one million people are not subdivided, unlike their larger counterparts, which are split into GADM level 1 and 2 regions. Within the United States, the dataset also excludes counties with fewer than 100 active users.8

But while broad, the Social Connectedness Index itself only tells us so much. We can understand the magnitude of connections between two locations relative to connections between other locations, yet we cannot understand the strength of these connections or what the Index implies. It does not specify whether connections involve consistent interaction, nor does it account for identities such as ethnicity, gender, race, age, income, or political affiliation. Furthermore, this dataset represents only individuals who have access to Facebook and actively use its services. Almost 45% of Americans are not represented here. This 45% figure may in itself be an indication of low inter-geographical connectedness, yet we cannot use this data to determine that.

Additionally, this dataset represents a snapshot in time from August 2020; therefore, we cannot explore how connection trends may have shifted over time, especially with the growth of technology and access to Facebook.

U.S. Department of Labor

The U.S. Bureau of Labor Statistics produces information on employment and unemployment in over 7,500 areas on a monthly basis. This information comes from the Local Area Unemployment Statistics (LAUS) program, which covers states, census regions and divisions, cities with a population of 25,000 or more, and small labor markets. According to the LAUS program, these statistics are a clear indicator of local economic conditions. The estimates draw on the Current Population Survey (CPS), the national household survey that focuses on the unemployment rate, together with the Current Employment Statistics (CES) program and state Unemployment Insurance (UI) systems. States use the CPS, CES, and UI data together to estimate state and local unemployment.9

The information provided by the U.S. Bureau of Labor Statistics helps state and local governments allocate funds to areas in need of labor training and local employment. For example, if the City of Ontario, CA, is dealing with a high number of unemployed residents, the state can use this information to provide training for individuals in that area. On the federal level, this information is used to direct more funding to states in high need of assistance. This can include an increased budget for unemployment insurance or help for states to increase their budgets for necessary training. In private use, this data helps identify local areas that are in need of development and allows comparisons across states.

Using this data, we can better understand and articulate labor differences at both the county level and the state level. With the inclusion of important statistics, such as unemployment rates and household incomes, it may be possible to explore linkages between economic performance and social connectivity. However, as the dataset only provides aggregate measures of geographic areas, we cannot understand the sentiments or individual habits of those at the highest or lowest points of society.
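As one hypothetical way to begin exploring such a linkage, the sketch below pairs each county-to-county index value with the gap in unemployment rates between the two counties and computes a Pearson correlation. The county codes, the rates, and the index values are invented placeholders rather than real figures, and the choice of "unemployment gap" as the comparison measure is our own assumption, not a method drawn from either dataset's documentation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pair_rows(sci_pairs, rates):
    """Join index values with the unemployment-rate gap of each county pair.

    Pairs involving a county that is missing from `rates` are skipped.
    """
    rows = []
    for (county_a, county_b), index_value in sci_pairs.items():
        if county_a in rates and county_b in rates:
            gap = abs(rates[county_a] - rates[county_b])
            rows.append((index_value, gap))
    return rows


# Placeholder values for illustration only; none of these are real data.
unemployment = {"00001": 12.0, "00002": 10.5, "00003": 8.0}
sci_pairs = {
    ("00001", "00002"): 40000,
    ("00001", "00003"): 3000,
    ("00002", "00003"): 1500,
}

rows = pair_rows(sci_pairs, unemployment)
r = pearson([index for index, _ in rows], [gap for _, gap in rows])
print(f"correlation between index and unemployment gap: {r:.2f}")
```

A correlation computed this way describes only aggregate county pairs, so it inherits the same limitation noted above: it says nothing about individuals within those counties.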

Data Ontology

It’s difficult to determine any ideological biases at play in the SCI dataset due to its architectural simplicity. At first glance there seems to be no narrative at play; however, the implicit narrative is that Facebook.com users constitute a statistically significant sample of the population. So, while the numbers themselves hold no biases, the methods by which the data were collected leave room for interpretation. As listed on their Data for Good page, these datasets are primarily for researchers, yet Facebook has power over its samples and who is represented within them.

Furthermore, if this were the only dataset used in this project, our conclusions would likely be more surface-level, and many possible determinants and effects might go unanalyzed. It should be pointed out, however, that this data does draw on governmental distinctions of location. In the U.S., there are no rules governing the drawing of county lines. Gerrymandering is the practice of dividing geographies into sections with politically charged intentions. While this practice does not directly affect the creation of this data, it is important to consider that the distinction between two adjacent counties is not necessarily clear on political, racial, or other demographic fronts. Therefore, we have to be careful about the inferences we draw from trends within the data, as each county does not represent a homogeneous mix of individuals.

Citations

  1. Facebook, Inc, “Form 10-K Annual Report,” SEC EDGAR, Dec. 31, 2019, https://www.sec.gov/ix?doc=/Archives/edgar/data/1326801/000132680120000013/fb-12312019x10k.htm.
  2. Facebook, Inc., “Approach,” Facebook Data for Good, https://dataforgood.fb.com/approach.
  3. Michael Bailey et al., “Social Connectedness: Measurement, Determinants, and Effects,” Journal of Economic Perspectives 32, no. 3 (August 2018): 259–80, https://doi.org/10.1257/jep.32.3.259.
  4. United Nations OCHA, “About – Humanitarian Data Exchange,” accessed December 10, 2020, https://data.humdata.org/about/terms.
  5. Bailey et al., “Social Connectedness.”
  6. Bailey et al., “Social Connectedness.”
  7. Facebook, Inc., “Social Connectedness Index Methodology,” Facebook Data for Good (blog), https://dataforgood.fb.com/docs/social-connectedness-index-methodology/.
  8. Facebook, Inc., “Social Connectedness Index Methodology.”
  9. U.S. Bureau of Labor Statistics, “Local Area Unemployment Statistics Home Page,” Local Area Employment Statistics, accessed December 11, 2020, https://www.bls.gov/lau/.
Images:
  1. Harvard University, Homepage Background, accessed December 10, 2020, https://cdnsecakmi.kaltura.com/p/1423662/sp/142366200/serveFlavor/entryId/1_tpf74teu/flavorId/1_th4ziv01/forceproxy/true/name/2018092601.mp4.
  2. Getty Images, “#CDUdigital Conference In Berlin,” accessed December 10, 2020, https://techcrunch.com/wp-content/uploads/2018/09/gettyimages-4878672681.jpg?w=1390&crop=1.
  3. Getty Images, “If You Don’t Fully Understand the Cambridge Analytica Scandal, Read This Simplified Version,” accessed December 10, 2020, https://www.incimages.com/uploaded_files/image/1920x1080/getty_935015144_351238.jpg.