Utilizing 2011 - 2018 crime archives to forecast and compare against 2019 data | November 2025
Author: Andrew Castro
This project builds an end-to-end analytics pipeline that cleans, processes, and forecasts FBI NIBRS crime statistics across nine years (2011-2019) of multi-state data. It includes the full data engineering workflow, predictive modeling, MAE evaluation, and a multi-page Power BI dashboard that compares forecasted 2019 crime counts (autoregression and linear regression) against the actual 2019 counts from the FBI NIBRS archives.
More info below.
Using real-world 2019 FBI NIBRS data as ground truth:
States that reported zero population coverage in consecutive years during 2011-2018 were excluded from the forecasting model, because their data was insufficient for time-series forecasting comparable to that of states with population coverage.
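The exclusion rule above can be sketched roughly as follows. This is a minimal illustration, not the project's actual cleaning code: the `coverage` layout, the function names, and the `min_run` threshold (how many consecutive zero-coverage years trigger exclusion) are all assumptions, since the write-up does not state the exact cutoff.

```python
YEARS = range(2011, 2019)  # training years 2011 through 2018 inclusive


def longest_zero_run(coverage_by_year, years=YEARS):
    """Length of the longest streak of consecutive years with zero coverage.

    Assumption: a year missing from the mapping is treated as zero coverage.
    """
    longest = current = 0
    for year in years:
        if coverage_by_year.get(year, 0) == 0:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def states_to_forecast(coverage, years=YEARS, min_run=2):
    """Keep states whose longest zero-coverage streak is shorter than min_run.

    min_run=2 is a hypothetical threshold chosen for illustration.
    """
    return sorted(
        state
        for state, by_year in coverage.items()
        if longest_zero_run(by_year, years) < min_run
    )


# Made-up example data: population covered per state and year.
coverage = {
    "State A": {year: 1_000_000 for year in YEARS},  # always reported
    "State B": {year: 0 for year in YEARS},          # never reported
    "State C": {**{year: 500_000 for year in YEARS}, 2014: 0},  # one gap year
}

print(states_to_forecast(coverage))
```

With `min_run=2`, a single gap year (State C) is tolerated, while a state with no coverage at all (State B) is dropped.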
The cleaned dataset used for 'Total Offenses' contained an actual 2019 count of 1,725,358. In this 'apples-to-apples' comparison, based on the sum of errors (MAE) rather than the error of the sum, the autoregression model achieved a forecasting accuracy of 91.69%, with an average forecasting error of -4.53% by state.
The cleaned dataset used for 'Homicide Offenses' contained an actual 2019 count of 6,719. In the same 'apples-to-apples' comparison, based on the sum of errors (MAE) rather than the error of the sum, the autoregression model achieved a forecasting accuracy of 82.45% for Homicide Offenses, with an average forecasting error of -13.96% by state.
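To illustrate why the "sum of errors" matters, here is a small sketch that contrasts it with the "error of the sum". The function names, the two-state example, and the sign convention (forecast minus actual) are assumptions made for illustration, not the project's own code.

```python
def sum_of_errors_accuracy(actual, forecast):
    """Accuracy from the sum of per-state absolute errors.

    Over- and under-forecasts cannot cancel each other out.
    """
    total_actual = sum(actual.values())
    total_abs_error = sum(abs(forecast[s] - actual[s]) for s in actual)
    return 1 - total_abs_error / total_actual


def error_of_sum_accuracy(actual, forecast):
    """Accuracy from the error of the summed totals.

    Offsetting state-level errors cancel, which can flatter the model.
    """
    total_actual = sum(actual.values())
    total_forecast = sum(forecast[s] for s in actual)
    return 1 - abs(total_forecast - total_actual) / total_actual


def mean_pct_error_by_state(actual, forecast):
    """Average signed percentage error per state (forecast minus actual)."""
    pct_errors = [(forecast[s] - actual[s]) / actual[s] for s in actual]
    return sum(pct_errors) / len(pct_errors)


# Made-up counts: one state over-forecast by 20, the other under-forecast by 20.
actual = {"State A": 100, "State B": 200}
forecast = {"State A": 120, "State B": 180}

print(f"sum of errors: {sum_of_errors_accuracy(actual, forecast):.2%}")
print(f"error of sum:  {error_of_sum_accuracy(actual, forecast):.2%}")
print(f"mean % error:  {mean_pct_error_by_state(actual, forecast):+.2%}")
```

In this toy case the error-of-the-sum view reports a perfect forecast because the two misses cancel, while the sum-of-errors view still charges the model for both.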
The Autoregression Model was selected as the final forecasting methodology because it predicted 2019 crime statistics more accurately. Compared to Linear Regression, the Autoregressive approach improved accuracy by 8.02 percentage points for Total Offenses and 5.59 percentage points for Homicide Offenses. While these margins may appear modest, the Mean Absolute Error (MAE) reveals a much larger gap in predictive reliability and error magnitude. A detailed performance analysis follows below.
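For readers unfamiliar with the two approaches, the sketch below shows how they differ on a single state's yearly series. It assumes a first-order autoregression, AR(1), fitted by ordinary least squares; the write-up does not state the model order or fitting method actually used, and the function names and sample counts are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # all x values identical: fall back to a flat line
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_linear(years, counts, target_year):
    """Linear regression: regress counts on the calendar year, then extrapolate."""
    slope, intercept = fit_line(years, counts)
    return slope * target_year + intercept


def forecast_ar1(counts):
    """AR(1): regress each year's count on the previous year's count,
    then apply the fitted relation to the most recent observation."""
    slope, intercept = fit_line(counts[:-1], counts[1:])
    return slope * counts[-1] + intercept


# Made-up yearly counts for one state, 2011 through 2018.
years = list(range(2011, 2019))
counts = [5200, 5100, 4950, 4900, 5050, 5300, 5250, 5150]

print(round(forecast_linear(years, counts, 2019)))
print(round(forecast_ar1(counts)))
```

The linear model extrapolates a single trend over the calendar years, whereas the autoregressive model predicts each year from the level of the year before it, so the two can diverge when recent years depart from the long-run trend.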
A surface-level comparison of the Total Offenses accuracy metrics (91.69% vs 83.67%) disguises the true performance gap between the two models.
By examining the inverse metric, the Error Rate, we see that the Linear Regression Model's error rate (16.33%) was nearly twice that of the Autoregression Model (8.31%).
This is confirmed by the Mean Absolute Error (MAE), which shows that the Linear Model's total error magnitude was 96.58% greater than that of the Autoregressive approach.
In short: the 8-point gap in accuracy corresponds to a roughly twofold gap in error magnitude.
| Metric | Autoregression (AR) | Linear Regression (LR) |
|---|---|---|
| Forecast Accuracy | 91.69% | 83.67% |
| Error Rate (100% - Forecast Accuracy) | 100% - 91.69% = 8.31% | 100% - 83.67% = 16.33% (nearly double) |
| Mean Absolute Error (MAE) | 143,350 | 281,800 |
Calculation of the relative increase in MAE:
((Linear MAE - AR MAE) / AR MAE) * 100
((281,800 - 143,350) / 143,350) * 100 = 96.58% increase in error
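The figures above can be rechecked with a few lines of arithmetic. The sketch below uses only numbers quoted in this write-up; dividing each MAE by the actual Total Offenses count is an assumption about how the error rates were derived, though it does reproduce the reported percentages.

```python
# Values quoted in this write-up (Total Offenses, 2019).
total_actual = 1_725_358
ar_mae = 143_350   # Autoregression
lr_mae = 281_800   # Linear Regression

# Assumed derivation: error rate = MAE / actual total count.
ar_error_rate = ar_mae / total_actual
lr_error_rate = lr_mae / total_actual

# Relative increase in error of Linear Regression over Autoregression.
relative_increase = (lr_mae - ar_mae) / ar_mae

print(f"AR error rate:     {ar_error_rate:.2%}")      # 8.31%
print(f"LR error rate:     {lr_error_rate:.2%}")      # 16.33%
print(f"AR accuracy:       {1 - ar_error_rate:.2%}")  # 91.69%
print(f"LR accuracy:       {1 - lr_error_rate:.2%}")  # 83.67%
print(f"Increase in error: {relative_increase:.2%}")  # 96.58%
```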