Systemic Racism in the NYPD

ProPublica recently released a partial database of New York Police Department (NYPD) disciplinary records. An analysis of substantiated complaints of police misconduct reveals clear systemic racism. Black people face wildly disproportionate amounts of police misconduct regardless of the race or gender of individual officers.

Following a change to the New York law that kept police officers’ disciplinary records secret—and amid an ongoing lawsuit—ProPublica has released a searchable database of complaints to the Civilian Complaint Review Board (CCRB). The database, which can be downloaded in its entirety here, includes allegations against the nearly 4,000 officers who have at least one substantiated complaint against them. The CCRB’s powers are extremely circumscribed, and these data reflect the board’s limitations. The CCRB “exonerates” officers whose conduct is ruled to fall within departmental guidelines, no matter how egregious. Allegations may remain “unsubstantiated” due to a routine lack of NYPD cooperation (in violation of the law), and even “substantiated” allegations lead only to suggestions, which the department is free to ignore.

For the purpose of this analysis, despite these limitations, I am dealing only with substantiated complaints. Such complaints represent only a tiny slice of NYPD misconduct (in 2018 only 73 cases were substantiated out of about 3,000 allegations), but they still reveal striking patterns in whom this misconduct affects. Each complaint may contain multiple allegations, and I am treating each substantiated allegation as a separate incident, even if several happened at the same time.
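To make that counting rule concrete, here is a minimal sketch in base R. It is not the code from the repository: the data frame, the column names (`complaint_id`, `board_disposition`), and every value in it are invented for illustration, and the assumption that substantiated dispositions begin with the word "Substantiated" may not match the real file exactly.

```r
# Toy stand-in for the allegations table; all rows are invented.
allegations <- data.frame(
  complaint_id = c(1, 1, 2, 3, 3, 3),
  board_disposition = c(
    "Substantiated (Charges)", "Unsubstantiated", "Exonerated",
    "Substantiated (Command Discipline)", "Substantiated (Charges)",
    "Unsubstantiated"
  ),
  stringsAsFactors = FALSE
)

# Keep every allegation whose disposition starts with "Substantiated"
# (an assumption about how the dispositions are labeled).
is_substantiated <- startsWith(allegations$board_disposition, "Substantiated")
substantiated <- allegations[is_substantiated, ]

# Each remaining row counts as one incident, even when rows share a complaint.
n_incidents  <- nrow(substantiated)
n_complaints <- length(unique(substantiated$complaint_id))
```

In this toy table, counting rows rather than distinct complaints gives a larger number, which is the trade-off that comes with treating each allegation separately.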

Even a quick glance through these records turns up numerous officers who have committed repeated, serious, substantiated misconduct while rising through the ranks. But looking at these data from a bird’s-eye view also reveals some striking patterns in NYPD misconduct. Black people bear the brunt of NYPD misconduct, and in this matter, the race and gender of the individual officer in question do not seem to make any difference.

Anyone who is familiar with New York City would not expect police misconduct, and thus complaints about police misconduct, to be evenly distributed geographically, and indeed they are not.

By far the largest number of substantiated complaints comes from the Seventy-Fifth Precinct in East New York, Brooklyn, the site of a major corruption scandal. Nearby neighborhoods in Brooklyn also see a disproportionate number of complaints, as does the South Bronx. This map seems to show the results of over-policing minoritized communities.

The ProPublica database records the ethnicity of both the complainant and the accused officer.

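In the majority of substantiated complaints the officers were white. In cases where the ethnicity of the complainant is known, the majority of complainants are Black, even though Black residents make up only about a quarter of the city’s population. According to the Census Bureau’s American Community Survey, New York City is 42.7% white, 29.1% Latino, 24.3% Black, and 13.9% Asian. (These figures sum to more than 100% because the Census Bureau counts Latino residents as being of any race.)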

When we take a look at the ethnicity of these police officers, however, it does not seem to make much of a difference.

Officers of every ethnicity commit substantiated cases of misconduct against Black people at similar rates. White officers have a much greater overall number of cases, regardless of complainant’s ethnicity. It’s not clear from this data set whether this number is disproportionate to the number of white police in New York during this time period, but the disproportionate number of white officers is itself a symptom of systemic racism within the NYPD.
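The distinction between the overall number of cases and the rate at which they involve Black complainants can be illustrated with a small cross-tabulation. This is a sketch only: the `incidents` data frame and its column names are hypothetical, and the rows are made up so that the arithmetic is easy to follow. They are not drawn from the ProPublica data.

```r
# Invented incidents: one row per substantiated allegation.
incidents <- data.frame(
  officer_ethnicity = c("White", "White", "White", "White",
                        "Black", "Black", "Hispanic", "Hispanic"),
  complainant_ethnicity = c("Black", "Black", "White", "Hispanic",
                            "Black", "White", "Black", "Hispanic"),
  stringsAsFactors = FALSE
)

# Raw counts: rows are officer ethnicity, columns are complainant ethnicity.
counts <- table(incidents$officer_ethnicity, incidents$complainant_ethnicity)

# Row shares: within each officer group, the fraction of incidents
# involving complainants of each ethnicity.
row_shares <- prop.table(counts, margin = 1)

# Total incidents per officer group, for comparison with the shares.
totals <- rowSums(counts)
```

In this toy example white officers account for the most incidents in `totals`, yet the share involving Black complainants in `row_shares` is the same for every officer group. The same tabulation works for any officer attribute, such as gender.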

Similarly, officers’ gender does not seem to make a difference when it comes to racist policing.

The same pattern of misconduct holds whether the officer in question is a man or a woman. Men commit many, many more acts of misconduct overall, but determining whether this is out of proportion to their numbers on the force during this time period would require a different data set.

The data analyzed here are limited and partial, but they corroborate what Black New Yorkers, other New Yorkers of color, and their white allies already know from experience: the NYPD is a profoundly racist institution, not because of a few bad apples, but on a structural level.

All code is available on GitHub.


I’ve been exploring my personal Twitter data using the Twitter API (with the rtweet package) and the tidytext text-mining package. I haven’t come up with any mind-blowing conclusions, but it’s been fun to see who my favorite tweeters are, who their favorite tweeters are, what we tweet about, and how the sentiment of my tweets has changed over time. I did not like it when Bernie Sanders dropped out of the presidential race or when Brian Kemp reopened Georgia’s economy! If you’re interested in seeing the details or reading my code, check out the GitHub repository.

Tidy Astronauts

As part of the TidyTuesday project, I created this visualization of who has gone into space based on gender and nationality. This is my first attempt at mapping data geographically! I’m pretty pleased with how it turned out, but I would welcome any feedback. The code is available here.

COVID-19 in DeKalb County, Georgia

So this post may be something of a cautionary tale about getting ahead of yourself when it comes to analyzing data. The DeKalb County Board of Health releases numbers on the spread of COVID-19 in the county, most recently on July 6. Included with these data is a breakdown of the county’s 7,043 cases by ZIP Code. It is immediately apparent that while this disease has affected the entire county, its effects have not been felt evenly.

It is difficult not to speculate on the causes of this variation. Have the policies of local governments prevented or exacerbated outbreaks? Has the politicization of mask wearing and social distancing led to increased spread in more conservative areas? Have certain communities had more access to testing resources—or chosen to get tested more often—than others? Are higher density areas more likely to see outbreaks than more suburban areas of the county?

I had a hypothesis that while any or all of these factors may contribute to the spread of COVID-19, a major contributor to COVID-19 variability from ZIP Code to ZIP Code in DeKalb is income inequality. So I extracted the raw data from the county’s Board of Health and combined it with census data to calculate cases per thousand residents. I immediately noticed what seemed like a strong correlation between median household income and COVID-19 rates. The poorest ZIP Code in the county when measured by median household income showed 13.5 positive cases per thousand residents while the richest showed only 1.7 cases per thousand.
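The join and the per-thousand calculation described above can be sketched in a few lines of base R. The two small tables below are invented stand-ins for the Board of Health and census data: the ZIP labels, case counts, and populations are hypothetical, chosen only so that two of the resulting rates echo the 13.5 and 1.7 figures quoted in the text.

```r
# Hypothetical case counts by ZIP Code (invented values).
cases <- data.frame(
  zip = c("A", "B", "C"),
  cases = c(270, 85, 120),
  stringsAsFactors = FALSE
)

# Hypothetical census table with population by ZIP Code (invented values).
census <- data.frame(
  zip = c("C", "A", "B"),
  population = c(30000, 20000, 50000),
  stringsAsFactors = FALSE
)

# Join the two tables on ZIP Code; merge() sorts by the key by default.
zip_data <- merge(cases, census, by = "zip")

# Cases per thousand residents.
zip_data$cases_per_thousand <- zip_data$cases / zip_data$population * 1000
```

Normalizing by population is what makes ZIP Codes of very different sizes comparable; raw case counts alone would mostly reflect how many people live in each area.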

When I analyzed the numbers more formally, I found that while the differences in COVID-19 rates seemed most pronounced at the extremes, there seemed to be a clear correlation between rates of infection and median household income throughout the county. For every extra $10,000 in median household income, the number of cases per thousand seemed to drop by 1.15. Median household income seemed to account for around a quarter of COVID-19’s geographical variability in DeKalb County.
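A statement like "cases per thousand drop by some amount for every extra $10,000" comes from the slope of a simple linear regression, and "accounts for around a quarter of the variability" corresponds to its R-squared. The sketch below shows how those two numbers could be read off a model in base R. The four data points are invented, so the results do not reproduce the figures above, and the variable names are assumptions rather than the ones used in the original analysis.

```r
# Invented data: four areas with made-up incomes and rates.
toy <- data.frame(
  median_income = c(30000, 50000, 70000, 90000),
  cases_per_thousand = c(12, 10, 7, 5)
)

# Rescale income so that one unit equals $10,000; the slope is then
# the change in cases per thousand per extra $10,000 of income.
toy$income_10k <- toy$median_income / 10000

fit <- lm(cases_per_thousand ~ income_10k, data = toy)

# Slope: change in rate per $10,000 of median household income.
slope <- unname(coef(fit)["income_10k"])

# R-squared: share of the variation in rates explained by income.
r_squared <- summary(fit)$r.squared
```

Rescaling the predictor before fitting changes only the units of the slope, not the fit itself, which makes the coefficient easier to report in plain language.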

So I got ready to write a fiery article arguing that low-wage service workers’ exposure to unsafe working conditions was leading to higher rates of COVID-19 in their communities. (I still believe this to be true, but it’s not evident at the level of ZIP Code in DeKalb.)

But as I was writing up my results, I realized I had made a fatal mistake. ZIP Codes do not follow county boundaries! Fortunately, the U.S. Census Bureau provides data on what portion of a ZCTA (a Census category roughly analogous to ZIP Codes) falls within a particular county. I used this information to re-run my analysis and found out that while a correlation exists, it’s not as substantial as I had thought. I made a plot of this data and it seemed at first that while there was not necessarily a linear relationship between median household income and COVID-19 rates, it did seem like the wealthiest ZIP Codes had the lowest rates.
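One plausible way to use the Census share, assuming the county's case counts cover only residents of the county, is to scale each ZCTA's population by the fraction that lives inside the county before computing a rate. The sketch below illustrates that idea with invented numbers. The column name `share_in_county` is hypothetical, and this may not be exactly how the adjustment was made in the original analysis.

```r
# Invented ZCTAs: "A" lies wholly inside the county, "B" only half.
zcta <- data.frame(
  zcta = c("A", "B"),
  cases = c(100, 90),              # cases reported by the county
  population = c(20000, 30000),    # total ZCTA population
  share_in_county = c(1.0, 0.5),   # fraction of population inside the county
  stringsAsFactors = FALSE
)

# Population that actually lives inside the county.
zcta$county_population <- zcta$population * zcta$share_in_county

# Naive rate uses the whole ZCTA population as the denominator.
zcta$naive_rate <- zcta$cases / zcta$population * 1000

# Adjusted rate uses only the in-county population.
zcta$adjusted_rate <- zcta$cases / zcta$county_population * 1000
```

For a ZCTA that straddles the county line, the naive denominator is too large, so the naive rate understates the in-county rate. In this toy example the rate for "B" doubles once the denominator is corrected.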

So I recoded income as a categorical variable, sorting each ZIP Code into one of two groups: the wealthiest neighborhoods, with median household incomes of at least $80,000, and everyone else. And it turns out that there is a substantial and statistically significant difference between these neighborhoods and less well-off ones.

## Call:
## lm(formula = cases_per_thousand ~ neighborhood_wealth, data = dekalb)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7480 -2.5631 -0.9299  1.6408  8.4739 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              10.2078     0.7294  13.994 6.82e-14 ***
## neighborhood_wealthrich  -4.1825     1.6037  -2.608   0.0147 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 3.498 on 27 degrees of freedom
## Multiple R-squared:  0.2012, Adjusted R-squared:  0.1716 
## F-statistic: 6.802 on 1 and 27 DF,  p-value: 0.01466
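For readers less used to this kind of output, the numbers above can be unpacked with a little arithmetic. With a single two-level predictor, the intercept is the average rate in the reference group (here, the non-wealthy ZIP Codes) and the `neighborhood_wealthrich` coefficient is the difference between the two group averages. The sketch below only re-derives quantities from the printed values; the variable names are mine.

```r
# Values copied from the regression output above.
intercept   <- 10.2078   # mean cases per thousand, non-wealthy ZIP Codes
rich_offset <- -4.1825   # difference for wealthy ZIP Codes
std_error   <- 1.6037    # standard error of that difference
resid_df    <- 27        # residual degrees of freedom

# Implied mean rate in the wealthy group.
mean_rich <- intercept + rich_offset

# t statistic is the estimate divided by its standard error.
t_value <- rich_offset / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = resid_df)

# Two coefficients were estimated, so observations = residual df + 2.
n_zips <- resid_df + 2
```

So the model estimates roughly 10.2 cases per thousand in the less wealthy ZIP Codes and about 6.0 in the wealthiest ones, based on 29 ZIP Codes, and the R-squared of 0.2012 means the two-group split explains about a fifth of the variation in rates.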

However, as I looked more closely at the visualization, I noticed that while none of the wealthiest ZIP Codes are among the most affected, most of the other ZIP Codes have similar rates to the wealthiest ones. Just five or six ZIP Codes stand out as having elevated rates.

I thought about what sets these ZIP Codes apart from the rest of the county, and I decided to make a map. (While this map does not exactly trace the boundaries of DeKalb, the data represented include only the portions of these ZIP Codes that are in DeKalb.)

From this perspective, it looks like the highest COVID-19 rates are clustered in the northeastern part of the county, but I don’t yet know what, if anything, this tells us about the spread of COVID-19.

Even though this particular exploration didn’t uncover clear evidence that local income inequality is contributing to an uneven spread of coronavirus, it doesn’t mean that this isn’t the case.

Certainly, other forms of inequality also contribute to a greater risk of contracting COVID-19. The New York Times has reported that Black and Latino people in the United States are more than three times as likely as white people to contract the virus and more than twice as likely to die from it.

This pandemic has brought so many facets of U.S. society into sharp relief. In DeKalb, in Georgia, in the United States, and throughout the world, we’re all feeling the effects of this crisis, but we’re not feeling them equally.

For more details on my methodology or to check out my code, visit my GitHub page.