ProPublica recently released a partial database of New York Police Department (NYPD) disciplinary records. An analysis of substantiated complaints of police misconduct reveals clear systemic racism. Black people face wildly disproportionate amounts of police misconduct regardless of the race or gender of individual officers.
Following a change to the New York law that kept police officers’ disciplinary records secret—and amid an ongoing lawsuit—ProPublica has released a searchable database of complaints to the Civilian Complaint Review Board (CCRB). The database, which can be downloaded in its entirety here, includes allegations against the nearly 4,000 officers who have at least one substantiated complaint against them. The CCRB’s powers are extremely circumscribed, and these data reflect the board’s limitations. The CCRB “exonerates” officers whose conduct is ruled to fall within departmental guidelines, no matter how egregious. Allegations may remain “unsubstantiated” due to a routine lack of NYPD cooperation (in violation of the law), and even “substantiated” allegations lead only to suggestions, which the department is free to ignore.
For the purpose of this analysis, despite these limitations, I am dealing only with substantiated complaints. Such complaints only represent a tiny slice of NYPD misconduct—in 2018 only 73 cases were substantiated out of about 3,000 allegations—but they still reveal striking patterns as to who this misconduct affects. Each complaint may contain multiple allegations, but I am treating each substantiated case of misconduct as a separate incident even though they may have happened at the same time.
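As a rough illustration of the counting rule described above, the sketch below filters a toy table down to substantiated allegations and counts each one as its own incident. The column names (`complaint_id`, `board_disposition`, `complainant_ethnicity`) and every value are assumptions made for the example, not records from the database.

```r
# Toy stand-in for allegation-level records; names and values are hypothetical.
allegations <- data.frame(
  complaint_id = c(1, 1, 2, 3, 3, 4),
  board_disposition = c(
    "Substantiated (Charges)", "Unsubstantiated", "Exonerated",
    "Substantiated (Command Discipline)", "Substantiated (Charges)",
    "Unsubstantiated"
  ),
  complainant_ethnicity = c("Black", "Black", "White", "Hispanic", "Hispanic", NA),
  stringsAsFactors = FALSE
)

# Keep every allegation whose disposition starts with "Substantiated".
is_substantiated <- startsWith(allegations$board_disposition, "Substantiated")
substantiated <- allegations[is_substantiated, ]

# Each substantiated allegation counts as one incident, even when
# several of them share a complaint_id.
incidents_by_ethnicity <- table(substantiated$complainant_ethnicity, useNA = "ifany")
print(incidents_by_ethnicity)
```

Counting allegations rather than complaints means one encounter can contribute several incidents, which is the trade-off the paragraph above accepts.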
Even a quick glance through these records turns up numerous officers who have committed repeated, serious, substantiated misconduct while rising through the ranks. But looking at these data from a bird’s eye view also reveals some striking patterns in NYPD misconduct. Black people bear the brunt of NYPD misconduct, and in this matter, the race and gender of the individual officer in question does not seem to make any difference.
Anyone familiar with New York City would not expect police misconduct (and thus complaints about police misconduct) to be evenly distributed geographically, and indeed it is not.
The most substantiated complaints by far are found in the Seventy-Fifth Precinct in East New York, Brooklyn, the location of a major corruption scandal, but nearby neighborhoods in Brooklyn also see a disproportionate number of complaints, as does the South Bronx. This map seems to show the results of over-policing minoritized communities.
The ProPublica database records the ethnicity of both the complainant and the accused officer.
In the majority of substantiated complaints the officers were white. In cases where the ethnicity of the complainant is known, the majority of complainants are Black. According to the Census Bureau’s American Community Survey, New York City is 42.7% white, 29.1% Latino, 24.3% Black, and 13.9% Asian.
When we take a look at the ethnicity of these police officers, however, it does not seem to make much of a difference.
Officers of every ethnicity commit substantiated cases of misconduct against Black people at similar rates. White officers have a much greater overall number of cases, regardless of complainant’s ethnicity. It’s not clear from this data set whether this number is disproportionate to the number of white police in New York during this time period, but the disproportionate number of white officers is itself a symptom of systemic racism within the NYPD.
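One way to read "similar rates" is as the share of each officer group's substantiated cases that involve a Black complainant, which separates the pattern from the much larger raw count for white officers. The sketch below computes those row shares on made-up data; the column names `mos_ethnicity` and `complainant_ethnicity` are assumptions for illustration.

```r
# Hypothetical substantiated incidents: one row per incident.
complaints <- data.frame(
  mos_ethnicity = c("White", "White", "White", "White",
                    "Black", "Black", "Hispanic", "Hispanic"),
  complainant_ethnicity = c("Black", "Black", "White", "Hispanic",
                            "Black", "Hispanic", "Black", "White"),
  stringsAsFactors = FALSE
)

# Raw counts: officer ethnicity in rows, complainant ethnicity in columns.
counts <- table(
  officer = complaints$mos_ethnicity,
  complainant = complaints$complainant_ethnicity
)

# Row-wise proportions: within each officer group, the share of incidents
# involving each complainant group. Each row sums to 1.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Comparing rows of `shares` rather than rows of `counts` keeps the comparison from being driven by how many officers of each ethnicity appear in the data.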
Similarly, officers’ gender does not seem to make a difference when it comes to racist policing.
The same pattern of misconduct holds whether the officer in question is a man or a woman. Men commit many, many more acts of misconduct overall, but determining whether this is out of proportion to their numbers on the force during this time period would require a different data set.
The data analyzed here are limited and partial, but they corroborate what Black New Yorkers, other New Yorkers of color, and their white allies already know from experience: the NYPD is a profoundly racist institution, not because of a few bad apples, but on a structural level.
All code is available on GitHub.
I’ve been exploring my personal Twitter data using the Twitter API (with the rtweet package) and the tidytext text-mining package. I haven’t come up with any mind-blowing conclusions but it’s been fun to see who my favorite tweeters are, who their favorite tweeters are, what we tweet about, and how the sentiment of my tweets has changed over time. I did not like it when Bernie Sanders dropped out of the presidential race or when Brian Kemp reopened Georgia’s economy! If you’re interested in seeing the details or reading my code, check out the GitHub repository.
So this post may be something of a cautionary tale about getting ahead of yourself when it comes to analyzing data. The DeKalb County Board of Health releases numbers on the spread of COVID-19 in the county, most recently on July 6. Included with these data is a breakdown of the county’s 7,043 cases by ZIP Code. It is immediately apparent that while this disease has affected the entire county, its effects have not been felt evenly.
It is difficult not to speculate on the causes of this variation. Have the policies of local governments prevented or exacerbated outbreaks? Has the politicization of mask wearing and social distancing led to increased spread in more conservative areas? Have certain communities had more access to testing resources—or chosen to get tested more often—than others? Are higher density areas more likely to see outbreaks than more suburban areas of the county?
I had a hypothesis that while any or all of these factors may contribute to the spread of COVID-19, a major contributor to COVID-19 variability from ZIP Code to ZIP Code in DeKalb is income inequality. So I extracted the raw data from the county’s Board of Health and combined it with census data to calculate cases per thousand residents. I immediately noticed what seemed like a strong correlation between median household income and COVID-19 rates. The poorest ZIP Code in the county when measured by median household income showed 13.5 positive cases per thousand residents while the richest showed only 1.7 cases per thousand.
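The rate calculation itself is simple arithmetic: cases divided by population, times one thousand. A minimal sketch follows, with hypothetical ZIP Code labels, case counts, and populations chosen only so the arithmetic is easy to follow.

```r
# Hypothetical inputs; none of these counts come from the county's data.
zips <- data.frame(
  zip = c("A", "B", "C"),
  cases = c(270, 85, 120),
  population = c(20000, 50000, 30000)
)

# Cases per thousand residents.
zips$cases_per_thousand <- zips$cases / zips$population * 1000
print(zips)
```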
When I analyzed the numbers more formally, I found that while the differences in COVID-19 rates seemed most pronounced at the extremes, there seemed to be a clear correlation between rates of infection and median household income throughout the county. For every extra $10,000 in median household income, the number of cases per thousand seemed to drop by 1.15. Median household income seemed to account for around a quarter of COVID-19’s geographical variability in DeKalb County.
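A slope "per $10,000" can be read directly off a linear model if income is rescaled inside the formula. The sketch below shows the idea on invented numbers; the data frame, its column names, and its values are assumptions, so the fitted slope will not match the figure quoted above.

```r
# Invented data for illustration only.
dekalb_toy <- data.frame(
  median_income = c(35000, 42000, 51000, 60000, 72000, 88000, 105000),
  cases_per_thousand = c(13.1, 9.8, 11.0, 7.9, 8.4, 4.6, 3.0)
)

# Dividing income by 10,000 inside I() makes the coefficient read as
# "change in cases per thousand for each extra $10,000 of income".
fit <- lm(cases_per_thousand ~ I(median_income / 10000), data = dekalb_toy)

slope_per_10k <- unname(coef(fit)[2])
r_squared <- summary(fit)$r.squared  # share of variance explained by income

print(slope_per_10k)
print(r_squared)
```

The R-squared value is what a statement like "accounts for around a quarter of the variability" refers to.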
So I got ready to write a fiery article arguing that low-wage service workers’ exposure to unsafe working conditions was leading to higher rates of COVID-19 in their communities. (I still believe this to be true, but it’s not evident at the level of ZIP Code in DeKalb.)
But as I was writing up my results, I realized I had made a fatal mistake. ZIP Codes do not follow county boundaries! Fortunately, the U.S. Census Bureau provides data on what portion of a ZCTA (a Census category roughly analogous to ZIP Codes) falls within a particular county. I used this information to re-run my analysis and found out that while a correlation exists, it’s not as substantial as I had thought. I made a plot of this data and it seemed at first that while there was not necessarily a linear relationship between median household income and COVID-19 rates, it did seem like the wealthiest ZIP Codes had the lowest rates.
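To see why the boundary problem matters, consider a ZIP Code that only partly lies in the county: the county reports only its own cases, so dividing by the ZIP Code's full population understates the rate. The sketch below is one plausible way to adjust the denominator, not necessarily the exact method used here; the column names and all numbers are hypothetical.

```r
# Hypothetical ZCTAs: "A" has only a quarter of its residents in the county.
zcta <- data.frame(
  zcta = c("A", "B"),
  county_cases = c(90, 120),           # cases reported by the county
  total_population = c(40000, 30000),  # residents of the whole ZCTA
  pct_pop_in_county = c(25, 100)       # percent of residents inside the county
)

# Population of the portion of each ZCTA that falls inside the county.
zcta$county_population <- zcta$total_population * zcta$pct_pop_in_county / 100

# Naive rate uses the whole ZCTA; adjusted rate uses only the county portion.
zcta$naive_rate <- zcta$county_cases / zcta$total_population * 1000
zcta$adjusted_rate <- zcta$county_cases / zcta$county_population * 1000
print(zcta)
```

In this toy case the split ZCTA moves from looking like the least affected area to the most affected one, which is the kind of shift that can change a correlation.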
So I recoded each ZIP Code as a binary category: the wealthiest neighborhoods, with median household incomes of at least $80,000, and everyone else. It turns out that there is a substantial and statistically significant difference between these neighborhoods and less well-off ones.
    ## 
    ## Call:
    ## lm(formula = cases_per_thousand ~ neighborhood_wealth, data = dekalb)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -4.7480 -2.5631 -0.9299  1.6408  8.4739 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)              10.2078     0.7294  13.994 6.82e-14 ***
    ## neighborhood_wealthrich  -4.1825     1.6037  -2.608   0.0147 *  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 3.498 on 27 degrees of freedom
    ## Multiple R-squared:  0.2012, Adjusted R-squared:  0.1716 
    ## F-statistic: 6.802 on 1 and 27 DF,  p-value: 0.01466
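With a single two-level predictor, the regression output above has a simple reading: the intercept is the mean rate in the reference group, and the `neighborhood_wealthrich` coefficient is the difference between the two group means. The sketch below demonstrates this on invented data; the reference level name `"other"` and all values are assumptions, since the output only reveals the `"rich"` level.

```r
# Invented data for illustration only.
dekalb_toy <- data.frame(
  median_income = c(38000, 45000, 52000, 61000, 83000, 97000),
  cases_per_thousand = c(12.0, 9.5, 10.5, 8.0, 6.5, 5.5)
)

# Two-level factor: at least $80,000 is "rich", everything else is "other".
dekalb_toy$neighborhood_wealth <- factor(
  ifelse(dekalb_toy$median_income >= 80000, "rich", "other"),
  levels = c("other", "rich")
)

fit <- lm(cases_per_thousand ~ neighborhood_wealth, data = dekalb_toy)

# Group means, for comparison with the coefficients.
group_means <- tapply(
  dekalb_toy$cases_per_thousand, dekalb_toy$neighborhood_wealth, mean
)

print(coef(fit))
print(group_means)
```

Read this way, the real output says the less wealthy ZIP Codes average about 10.2 cases per thousand and the wealthiest average about 4.2 fewer.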
However, as I looked more closely at the visualization, I noticed that while none of the wealthiest ZIP Codes are among the most affected, most of the other ZIP Codes have similar rates to the wealthiest ones. Just five or six ZIP Codes stand out as having elevated rates.
I thought about what sets these ZIP Codes apart from the rest of the county and decided to make a map. (While this map does not exactly trace the boundaries of DeKalb, the data represented include only the portions of these ZIP Codes that are in DeKalb.)
From this perspective, it looks like the highest COVID-19 rates are clustered in the Northeast part of the county, but I don’t yet know what if anything this tells us about the spread of COVID-19.
Even though this particular exploration didn’t uncover clear evidence that local income inequality is contributing to an uneven spread of coronavirus, it doesn’t mean that this isn’t the case.
Certainly, other forms of inequality also contribute to a greater risk of contracting COVID-19. The New York Times has reported that Black and Latino people in the United States are more than three times as likely to contract the virus as white people and more than twice as likely to die from it.
This pandemic has brought so many facets of U.S. society into sharp relief. In DeKalb, in Georgia, in the United States, and throughout the world, we’re all feeling the effects of this crisis, but we’re not feeling them equally.
For more details on my methodology or to check out my code, visit my GitHub page.
I have been recording my runs in Strava for about five years. I wanted to see if I could use this data to make predictions about my racing pace. I downloaded my data from the website, including a spreadsheet collecting all my activity data (I’ve deleted some data from this file for privacy reasons).
I spent some time using visualizations and linear models to determine which variables would provide the most predictive power. I looked into trying to predict both race and non-race paces, and I considered factors including age, distance, elevation gain (both relative and absolute), amount of training, recovery, and season.
Although I have far more data on my everyday running than on the small number of races I have run, my everyday pace varies too much in ways that these data cannot explain. Was I taking it easy or doing a workout? Had I eaten breakfast? How hot was it?
With races, much of this variability no longer applies. In a race, I’m going to try to go as fast as possible and pay attention to proper rest and nutrition, as far as it is under my control.
After trying out various models, I found the model with the most significance and explanatory power was a simple one that only took into account the length of the race and the amount of training I did over the previous twelve weeks.
Racing Pace Model
    ## 
    ## Call:
    ## lm(formula = pace ~ Distance + training, data = races)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -20.537  -7.097   1.153  10.061  20.001 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 342.06054   10.41302  32.849 1.61e-11 ***
    ## Distance      2.80869    0.32545   8.630 6.02e-06 ***
    ## training     -0.17157    0.03272  -5.243 0.000377 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 13.32 on 10 degrees of freedom
    ## Multiple R-squared:  0.8875, Adjusted R-squared:  0.865 
    ## F-statistic: 39.46 on 2 and 10 DF,  p-value: 1.8e-05
Analysis of Variance Table
    ## Analysis of Variance Table
    ## 
    ## Response: pace
    ##           Df Sum Sq Mean Sq F value    Pr(>F)    
    ## Distance   1 9122.4  9122.4  51.421 3.029e-05 ***
    ## training   1 4876.8  4876.8  27.489 0.0003772 ***
    ## Residuals 10 1774.1   177.4                      
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##                   2.5 %       97.5 %
    ## (Intercept) 318.8588806 365.26219053
    ## Distance      2.0835409   3.53383942
    ## training     -0.2444886  -0.09866009
This simple model explains much of the variability in my racing pace (a multiple R-squared of about 0.89), and both predictors are statistically significant. It seems that how much I train for a race does indeed have a direct and measurable impact on my performance. Adjusting for the length of the race, every kilometer I run in the twelve weeks prior to the race corresponds to an improvement of 0.17 seconds per kilometer (with a 95% confidence interval between 0.10 and 0.24 seconds).
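To put the training coefficient on a more tangible scale, the short calculation below multiplies the reported estimate and its confidence bounds by a hypothetical 100 extra kilometers of training. The coefficient and bounds are copied from the model output; the 100 km figure is an arbitrary choice for illustration.

```r
# Estimate and 95% confidence bounds for the training coefficient,
# in seconds per kilometer of pace per kilometer trained.
coef_training <- -0.17157
ci_training <- c(lower = -0.2444886, upper = -0.09866009)

extra_km <- 100  # hypothetical extra training volume

# Expected change in race pace (seconds per kilometer), holding distance fixed.
change_in_pace <- coef_training * extra_km
ci_change <- ci_training * extra_km

print(change_in_pace)
print(ci_change)
```

Under the model, 100 extra kilometers of training would go with a pace roughly 17 seconds per kilometer faster, with a plausible range of about 10 to 24 seconds.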
With this model, it’s a simple task to craft a tool that allows me to predict my race pace based on the length of the race and how much I’ve trained.
In this case the tool is a Shiny app. For now, this app is very simple. The user simply enters the length of the race in either kilometers or miles and the distance I’ve trained in the preceding twelve weeks, and the app returns a predicted race time and pace.
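The calculation behind such an app can be sketched as a plain function built on the coefficients reported above. This is not the app's actual source: the function names are hypothetical, and the sketch assumes that `Distance` and `training` are measured in kilometers and that pace is in seconds per kilometer.

```r
# Predict pace and finish time from the reported model coefficients.
# Assumes distance and training in kilometers, pace in seconds per kilometer.
predict_race <- function(distance, training_km, unit = c("km", "mi")) {
  unit <- match.arg(unit)
  distance_km <- if (unit == "mi") distance * 1.609344 else distance
  pace <- 342.06054 + 2.80869 * distance_km - 0.17157 * training_km
  list(pace_sec_per_km = pace, total_sec = pace * distance_km)
}

# Format a number of seconds as minutes:seconds.
format_mmss <- function(seconds) {
  s <- as.integer(round(seconds))
  sprintf("%d:%02d", s %/% 60L, s %% 60L)
}

# Example: a 10 km race after 300 km of training.
result <- predict_race(10, 300)
cat("Pace per km:", format_mmss(result$pace_sec_per_km), "\n")
cat("Finish time:", format_mmss(result$total_sec), "\n")
```

Wrapping the conversion from miles inside the function keeps the model itself in one set of units, whichever unit the user enters.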
The next steps are to enable the app to communicate uncertainty by adding (and perhaps visualizing) confidence intervals.
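One possible way to do this in R is `predict()` with `interval = "prediction"`, which returns a lower and upper bound alongside the point estimate. The sketch below fits the same model form to simulated data, since the real race data are not reproduced here; every value in `races_toy` is invented.

```r
set.seed(1)

# Simulated races; distances in km, training in km over the prior twelve weeks.
races_toy <- data.frame(
  Distance = c(5, 5, 10, 10, 21.1, 21.1, 42.2, 5, 10, 21.1, 5, 10),
  training = c(150, 320, 200, 410, 350, 500, 600, 250, 300, 450, 400, 180)
)
races_toy$pace <- 340 + 2.8 * races_toy$Distance -
  0.17 * races_toy$training + rnorm(nrow(races_toy), sd = 10)

fit <- lm(pace ~ Distance + training, data = races_toy)

new_race <- data.frame(Distance = 10, training = 300)

# Prediction interval: range for a single future race.
pred_int <- predict(fit, newdata = new_race, interval = "prediction", level = 0.95)
# Confidence interval: range for the average pace under these conditions.
conf_int <- predict(fit, newdata = new_race, interval = "confidence", level = 0.95)

print(pred_int)
print(conf_int)
```

For a tool that forecasts one upcoming race, the prediction interval is probably the more honest choice, because it includes race-to-race scatter as well as uncertainty in the coefficients.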
The source code for this project is available on GitHub.
Early on in the pandemic, I became frustrated by the lack of quality visualizations of local COVID-19 data, particularly concerning Metro Atlanta, where I live, so I set out to create a set of visualizations of these data. Since then, the situation has improved greatly, but these plots still provide some details and comparisons that I have not seen elsewhere.
The first item I set out to create is an R Markdown document focused on Metro Atlanta that visualizes the distribution of cases and deaths as they change over time. The charts in this document track new cases and deaths in the core Metro Atlanta counties, both in absolute and relative terms. Subsequent charts provide context by tracking new cases and deaths in Metro Atlanta, Georgia, and the United States.
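Because the source data report cumulative totals, tracking new cases means differencing consecutive days, and the relative version divides by population. A minimal sketch follows, using an invented five-day series and a hypothetical county population.

```r
# Invented cumulative case counts for one county.
county <- data.frame(
  date = as.Date("2020-07-01") + 0:4,
  cases = c(100, 112, 112, 130, 145)
)

# New cases per day: difference between consecutive cumulative totals.
# The first day has no previous value, so it is NA.
county$new_cases <- c(NA, diff(county$cases))

# Relative terms: new cases per 100,000 residents.
population <- 750000  # hypothetical county population
county$new_cases_per_100k <- county$new_cases / population * 1e5

print(county)
```

Expressing counts per 100,000 residents is what makes counties of very different sizes comparable on one chart.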
The second product of this project is a Shiny web app, which visualizes new case and death data for the entire United States, with a focus on county-level data in the context of state and national data. This reactive app allows the user to look at data from any state and county in the United States, subject to the geographic limitations of the original data set.
The source code for this project is available on GitHub.
All data come from the New York Times’ ongoing repository of COVID-19 cases and deaths in the United States.