
Bayesian analysis of urban theft crime in 674 Chinese cities

Study area and data sources

Researchers have used a variety of core indicators to study theft crimes. Population size5,6,7,8,30,37,38, land use patterns13,31,32, and urban economic conditions6,33,39 have been used extensively in these studies, whereas relatively little research has examined point-of-interest (POI) counts in the context of urban theft. Because of their wide application and importance in analyzing the spatial distribution of theft crimes, this study selects four key indicators for in-depth study: urban population size, GDP, built-up area, and POI count. The choice of these indicators is supported by the studies discussed in the literature review section.

The base research units are the urban areas of 674 cities in China, and the analysis focuses on the annual average number of burglaries from 2018 to 2020. The data came from “China Judgments Online”. The retrieval criteria were as follows: case type: criminal cases; place and court: District XX or county-level city XX; plea: theft; procedure: first-instance criminal case; year of judgment: 2018, 2019, 2020; document type: judgment. The search was conducted in August 2023 and returned a total of 356,105 burglary cases decided by district-level (or county-level) people's courts in China, all of which were included in the analysis. For the adjustment factors, the resident population data come from the tabulation of the 2020 Chinese census by district; the GDP data come from the China City Statistical Yearbook 2021 (statistics for 2020); the built-up area data come from the 2020 China Urban and Rural Construction Yearbook; and the POI data are from Amap for 2022. Descriptive statistics of these factors are shown in Table 1.

Table 1 Descriptive statistics on factors influencing theft crimes.

Spatial autocorrelation analysis

Using a spatial weight matrix built with the inverse distance squared method and a threshold distance of 50 km, we calculated the global Moran's I index to analyze spatial autocorrelation in the distribution of theft incidents across the 674 cities. A Moran's I value in the range (0, 1] indicates clustering that strengthens as the value increases; a value in [-1, 0) indicates dispersion that strengthens as the value decreases; and a value near 0 indicates a random distribution. The z-score is a standardized score that measures how far the observed value departs from its expected value, and in spatial autocorrelation analysis it is used to assess the statistical significance of Moran's I. Generally, a z-score greater than +1.65 or less than −1.65 indicates statistically significant clustering or dispersion. The p-value is the probability of observing a Moran's I at least as extreme as the current one under the null hypothesis that the data are randomly distributed in space. A smaller p-value is stronger evidence against that null hypothesis. Typically, when the p-value is less than 0.05, the spatial autocorrelation is considered significant40.
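The calculation described above can be illustrated with a minimal Python sketch. It assumes city coordinates are already projected to kilometres, and it swaps in a permutation test for the z-score and p-value rather than the analytical formulas a GIS package would normally use. The function names and the toy grid are hypothetical and are not taken from the study.

```python
import math
import random


def inverse_distance_squared_weights(coords_km, threshold_km=50.0):
    """w_ij = 1 / d_ij**2 for pairs within the threshold distance, else 0."""
    n = len(coords_km)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d = math.dist(coords_km[i], coords_km[j])
                if 0.0 < d <= threshold_km:
                    weights[i][j] = 1.0 / d ** 2
    return weights


def morans_i(values, weights):
    """Global Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i**2."""
    n = len(values)
    mean = sum(values) / n
    z = [x - mean for x in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi * zi for zi in z)


def permutation_test(values, weights, n_perm=999, seed=0):
    """Assumption: significance by random relabelling, not the analytical z-score."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    sims = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        sims.append(morans_i(shuffled, weights))
    mu = sum(sims) / n_perm
    sd = math.sqrt(sum((s - mu) ** 2 for s in sims) / (n_perm - 1))
    z_score = (observed - mu) / sd
    extreme = sum(1 for s in sims if abs(s - mu) >= abs(observed - mu))
    p_value = (extreme + 1) / (n_perm + 1)  # two-sided pseudo p-value
    return observed, z_score, p_value


# Toy example: a 4 x 4 grid of "cities" 30 km apart, counts rising from west to east.
grid = [(30.0 * c, 30.0 * r) for r in range(4) for c in range(4)]
counts = [10 * c + r for r in range(4) for c in range(4)]
w = inverse_distance_squared_weights(grid, threshold_km=50.0)
i_obs, z_obs, p_obs = permutation_test(counts, w)
print(f"Moran's I = {i_obs:.3f}, z = {z_obs:.2f}, p = {p_obs:.3f}")
```

With a 50 km threshold on this toy grid, each city is linked to its side neighbours (30 km) and diagonal neighbours (about 42 km), and the closer pairs receive the larger weights.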

Bayesian model-based fitting of urban burglary crimes

The strengths of Bayesian models in theft crime research lie mainly in how they exploit the characteristics of the data and handle common research obstacles. Judicial document data available online are not exhaustive, and some cities have comparatively few recorded crimes, which raises small-sample problems; Bayesian models address this by borrowing information from neighboring cities to fit the overall crime distribution. When theft crime data are incomplete, Bayesian models combine prior knowledge with the available data to infer unknown parameters, which mitigates the problem of missing data. Because theft crimes show pronounced spatial and temporal patterns, Bayesian models can incorporate spatio-temporal effects that help reveal crime distribution patterns and trends. By providing a full probability distribution for each parameter rather than a single estimate, these models convey parameter uncertainty and variability and so give policy formulation a more complete informational basis. Furthermore, criteria within the Bayesian framework, such as the Deviance Information Criterion (DIC), offer tools for selecting and evaluating crime prediction models and help researchers determine the best-fitting model41,42.

In analyzing theft crime data, Bayesian models are particularly advantageous in several situations. First, when crime data display a clustered spatial distribution, Bayesian models can account for spatial autocorrelation and so predict crime hotspots more precisely. Second, when data structures are complex and encompass multi-level data from different regions and time periods, the flexibility of Bayesian models makes them a natural choice. Third, Bayesian models are well suited to forecasting future crime trends or evaluating crime risk in specific regions. Fourth, when research requires integrating diverse data sources such as police records, demographic statistics, and economic indicators, Bayesian models offer an intuitive way to combine them. Finally, when missing data are extensive, Bayesian models can estimate the missing values from prior distributions and the other observed data, so that the analysis can still proceed.

This study uses a Bayesian model to fit crime data from 674 cities in China. In contrast to traditional frequentist methods, the Bayesian approach treats parameters as uncertain random variables rather than fixed values. By assigning a prior probability distribution to each parameter and combining it with the observed data through the likelihood function, the Bayesian method derives the posterior probability distribution of the parameters. The posterior mean is often used as the point estimate of each parameter, while the 95% credible interval gives the range that contains the parameter value with 95% posterior probability. This probabilistic description provides point estimates and also quantifies parameter uncertainty, making the analysis results more complete and reliable. This study uses the Poisson distribution to model the random process generating crime counts, and incorporates fixed effects and spatial random effects to fit the spatial differentiation of crimes43. The relevant parameters and formulas of the Bayesian model are as follows:

$$Y_i \sim \text{Poisson}(\lambda_i)$$

(1)

$$\lambda_i = E_i \rho_i$$

(2)

$$\log(\rho_i) = \eta_i$$

(3)

$$\eta_i = b_0 + u_i + v_i$$

(4)

where Y_i is the number of crimes in city i, assumed to follow a Poisson distribution; λ_i in Eq. (1) is the mean parameter of the Poisson distribution, given by Eq. (2); E_i in Eq. (2) is the expected number of burglaries in city i, which can be calculated from different city size indicators; and ρ_i is the relative risk of crime in city i, reflecting the level of crime compared with its expected value. Equations (3) and (4) model the relative risk ρ_i. Because the relative risk must be positive, it is modeled on the logarithmic scale. Here η_i is the natural logarithm of the relative risk; b_0 is the intercept of the model, representing the average log relative risk across cities; u_i is the spatially structured random effect of city i, measuring the influence of neighboring cities on this city (i.e., spatial correlation or spatial autocorrelation effects); and v_i is the spatially unstructured random effect of city i, which is not influenced by other regions and measures spatial heterogeneity owing to random disturbances. Including this random disturbance allows the model to represent overdispersion in the data, which stabilizes the model.
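As a small worked illustration of Eqs. (1) to (4), the sketch below computes expected counts and Poisson means for three made-up cities. It assumes that E_i is obtained by allocating the total number of cases in proportion to a single size indicator (population here). The text only says that E_i is based on city size indicators, so this allocation rule, the function names, and the numbers are all hypothetical.

```python
import math


def expected_counts(observed, size):
    """Assumed rule: E_i = size_i * (total observed cases / total size)."""
    overall_rate = sum(observed) / sum(size)
    return [s * overall_rate for s in size]


def poisson_means(expected, b0, u, v):
    """lambda_i = E_i * exp(b0 + u_i + v_i), combining Eqs. (2), (3) and (4)."""
    return [e * math.exp(b0 + ui + vi) for e, ui, vi in zip(expected, u, v)]


def poisson_log_likelihood(y, lam):
    """Log-likelihood of the counts under Eq. (1)."""
    return sum(yi * math.log(li) - li - math.lgamma(yi + 1) for yi, li in zip(y, lam))


# Hypothetical data for three cities: annual cases and population in millions.
cases = [120, 45, 300]
population = [1.0, 0.5, 2.0]

E = expected_counts(cases, population)
crude_risk = [y / e for y, e in zip(cases, E)]  # raw ratio Y_i / E_i

# Illustrative values of the random effects, not estimates from the study.
lam = poisson_means(E, b0=0.0, u=[-0.1, -0.3, 0.1], v=[0.0, 0.0, 0.0])

print("expected counts:", [round(e, 1) for e in E])
print("crude relative risk:", [round(r, 2) for r in crude_risk])
print("log-likelihood:", round(poisson_log_likelihood(cases, lam), 2))
```

A city whose relative risk ρ_i exceeds 1 records more cases than its size alone would predict; the random effects u_i and v_i absorb that departure on the log scale.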

To fit the spatial random effects (u_i + v_i) of the model, the widely used BYM model proposed by Besag et al.44 is adopted. The prior distribution of the spatially structured random effect u_i is specified using a conditional autoregressive (CAR) model. The prior distribution of the spatially unstructured random effect v_i is assumed to be Gaussian with a mean of 0.
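For reference, the two priors can be written out in the form most commonly used for the BYM model. This is the standard intrinsic CAR specification with binary neighbor weights; the text does not state how neighbors were defined for the 674 cities, so the neighbor set \(\partial_i\), the neighbor count \(n_i\), and the variance symbols \(\sigma_u^2\) and \(\sigma_v^2\) are notation assumed here for illustration.

$$u_i \mid u_{j \ne i} \sim \mathcal{N}\left( \frac{1}{n_i} \sum_{j \in \partial_i} u_j,\ \frac{\sigma_u^2}{n_i} \right), \qquad v_i \sim \mathcal{N}\left( 0,\ \sigma_v^2 \right)$$

Under this form, the structured effect of a city is pulled toward the average of its neighbors, and more strongly so when it has many neighbors, while the unstructured effect varies independently from city to city.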

Model parameters were fitted and computed in the R environment using the R-INLA package, which implements the Integrated Nested Laplace Approximation (INLA). Compared with the commonly used MCMC method, a key advantage of INLA is that it returns accurate parameter estimates in less time while remaining flexible45,46. For the other parameters and hyperparameters in the model, the default minimally informative priors provided by the R-INLA package were adopted.

During model fitting, multiple sensitivity tests were conducted in which various vague priors were assigned to the relevant parameters and hyperparameters. The optimal model was then selected by comparing the Deviance Information Criterion (DIC) values of the models. DIC is a statistical criterion for model selection and evaluation that combines the goodness of fit of the model with its complexity. A notable advantage of DIC is that it allows model fit to be assessed and compared without pre-specifying the number of model parameters. Consequently, DIC is often more suitable than the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) for evaluating and selecting hierarchical Bayesian models47,48. In general, a lower DIC value indicates a better balance between fitting the data and keeping the model simple, but a difference of more than 5 in DIC is typically required before one model is considered a significant improvement over another49,50. The sensitivity tests indicate that the model's performance is stable across different choices of prior, with DIC values varying by no more than 5. We therefore adopted Gamma(0.001, 0.001) as the prior distribution for the precision hyperparameters in the analytical model. The default prior settings are intended to let the data speak as much as possible, so all selected priors are of the high-variance, non-informative type. Such settings ensure that the model is not unduly influenced by prior assumptions and instead relies on the observed data to uncover the underlying patterns41.
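The DIC comparison can be made concrete with a short sketch. It uses the usual definition, DIC = mean deviance + p_D, where p_D is the mean deviance minus the deviance at the posterior mean, and it assumes posterior draws of the Poisson means are available as plain lists. R-INLA reports DIC directly, so this is only an illustration of the quantity, with hypothetical function names and toy numbers.

```python
import math


def poisson_deviance(y, lam):
    """Deviance D = -2 * log-likelihood of counts y given Poisson means lam."""
    loglik = sum(yi * math.log(li) - li - math.lgamma(yi + 1) for yi, li in zip(y, lam))
    return -2.0 * loglik


def dic(y, lambda_draws):
    """Return (DIC, p_D) from posterior draws of the Poisson means.

    Assumption: the plug-in point is the posterior mean of lambda itself.
    """
    n_draws = len(lambda_draws)
    mean_deviance = sum(poisson_deviance(y, draw) for draw in lambda_draws) / n_draws
    lam_bar = [sum(draw[i] for draw in lambda_draws) / n_draws for i in range(len(y))]
    p_d = mean_deviance - poisson_deviance(y, lam_bar)  # effective number of parameters
    return mean_deviance + p_d, p_d


def clearly_better(dic_candidate, dic_reference, threshold=5.0):
    """Rule of thumb from the text: DIC must drop by more than 5."""
    return (dic_reference - dic_candidate) > threshold


# Toy example: two cities, three posterior draws per model (made-up numbers).
y = [3, 5]
draws_a = [[2.5, 4.5], [3.0, 5.0], [3.5, 5.5]]
draws_b = [[1.0, 9.0], [1.5, 9.5], [2.0, 10.0]]
dic_a, pd_a = dic(y, draws_a)
dic_b, pd_b = dic(y, draws_b)
print(f"model A: DIC = {dic_a:.2f}, p_D = {pd_a:.3f}")
print(f"model B: DIC = {dic_b:.2f}, p_D = {pd_b:.3f}")
print("A clearly better than B:", clearly_better(dic_a, dic_b))
```

Under the threshold rule, two prior specifications whose DIC values differ by 5 or less would be treated as fitting equally well, which is how the sensitivity result above is read.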

After the results were obtained with the R-INLA package, the marginal posterior distributions of the sum of u_i and v_i were extracted and exponentiated [i.e., exp(u_i + v_i)], following the approach proposed by Blangiardo et al. The posterior mean of this transformed quantity was calculated to represent the relative risk of crime due to spatial random effects45. Using ArcGIS 10.5, the spatial effects estimated for each city under the different prior specifications were mapped so that their spatial distribution characteristics could be analyzed further.
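The transformation step matters because the posterior mean of exp(u_i + v_i) is not the same as the exponential of the posterior mean. The sketch below estimates E[exp(X)] by trapezoid-rule integration over a marginal density tabulated on a grid, which is the general shape in which such marginals are usually stored. The Gaussian marginal, its parameters, and the function name are assumptions chosen so that the result can be checked against a closed form; they are not values from the study.

```python
import math


def expectation_of_exp(xs, density):
    """Trapezoid-rule estimate of E[exp(X)] from a density tabulated on grid xs."""
    weighted = 0.0
    total = 0.0
    for k in range(len(xs) - 1):
        h = xs[k + 1] - xs[k]
        f0, f1 = density[k], density[k + 1]
        weighted += 0.5 * h * (math.exp(xs[k]) * f0 + math.exp(xs[k + 1]) * f1)
        total += 0.5 * h * (f0 + f1)  # renormalise in case the grid truncates the tails
    return weighted / total


# Hypothetical marginal for u_i + v_i: Gaussian with mean 0.2 and sd 0.3.
mu, sd = 0.2, 0.3
n_points = 2001
grid = [mu - 8 * sd + 16 * sd * k / (n_points - 1) for k in range(n_points)]
dens = [math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi)) for x in grid]

relative_risk = expectation_of_exp(grid, dens)
print(f"posterior mean of exp(u+v): {relative_risk:.4f}")
print(f"exp of posterior mean:      {math.exp(mu):.4f}")
```

For a Gaussian marginal the exact answer is exp(mu + sd²/2), so the integrated value sits above exp(mu); the gap widens as the posterior uncertainty of a city's spatial effect grows.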