Size Distribution of States, Counties, and Cities in the USA: New Inequality Form Information

In this article, we propose a new approach for studying the patterns of size distribution in settlement systems, based on the analysis of the shape of the Pareto curve (PC). To study the shape of the PC, we used the Gini coefficient, the asymmetry coefficient, and, by analogy with the physics of phase transitions, the critical exponent, that is, the exponent of the PC in the neighborhood of zero. An empirical analysis of the PC at various levels of aggregation in the US settlement system has been performed. The form of the size distribution of states was studied by decade from 1790 to 2010. The spatial analysis of the PC shape for counties and cities was performed for 2010. The results of the empirical study showed that the PC of the states had a left-hand asymmetry over 220 years. The PCs of counties and cities had both right-hand and left-hand asymmetries. The obtained results explain in which cases the Pareto distribution, which has a PC with a right-hand asymmetry, and the lognormal distribution, which has a symmetric PC, may not correspond objectively to real settlement systems. As an alternative to the power-law and lognormal distributions, we considered an analytically simple two-parameter model with a wide range of PC asymmetry that combines the properties of both distributions. Verification of the model showed that it adequately described the sizes of settlements in homogeneous settlement systems.


Introduction
The relevance of studying the order regularities for settlement systems is determined by the search for optimal ways to develop production, distribution and consumption. In the first half of the 20th century, empirical studies of this problem focused on the distribution of cities by size (population) in developed countries, especially in the United States. The fundamental principles of size distribution were formulated by Felix Auerbach (Auerbach, 1913) and Robert Gibrat (Gibrat, 1931). Having compared lists of cities ranked in descending order of size in five European countries and the United States, Auerbach concluded that, for large cities, the product of city size and city rank is approximately equal to the size of the largest city in the country (the rank-size rule), which formally corresponds to the Pareto distribution with an exponent equal to 1. Gibrat proposed a proportional growth rule which, applied to urban settlement systems, states that large cities do not, on average, grow faster or slower than smaller cities. The proportional growth rule results in a lognormal distribution. The problem is that if the true distribution is lognormal, then the entire distribution can never simultaneously correspond to the Pareto distribution. All subsequent studies of settlement systems can be briefly described as an analysis of situations where one of these two distributions is preferable to the other (for example, see (Rosen and Resnick, 1980; Carroll, 1982; Clauset, Shalizi and Newman, 2009; Berry and Okulicz-Kozaryn, 2012; Soo, 2012; Bee et al., 2013; Arshad, Hu and Ashraf, 2018; Bee, Riccaboni and Schiavo, 2019)). However, in many cases, neither the power-law distribution nor the lognormal distribution describes settlement systems in a satisfactory way (for example, see (Soo, 2012)).
Given this fact, as early as in 1982, Glenn Carroll aimed at extracting new information (the basis) from the existing empirical data to explain the reason for this discrepancy (Carroll, 1982). This paper suggests that the missing information can be obtained from the analysis of the shape of the Pareto curve (PC).
The objective of this paper is to select tools for studying the form of inequality of empirical PCs and to analyze the shape of the PC at various levels of aggregation in the US settlement system. The paper is structured as follows. Section 2 briefly outlines the theoretical issues underlying the subsequent empirical analysis. Section 3 describes the empirical data and the methods used to process them. The results of the analysis of the PC shape for the three-level settlement system in the US are presented in Section 4. Section 5 discusses the results obtained.

Pareto Curve
Let the settlement system consist of $n$ elements (states, counties, cities) that have shares of population $w_i$. Let us arrange the system elements in descending order of $w$. As a result, we have a sorted (ranked) list of elements, in which the ordinal number of each element is called its rank and is denoted by the letter $r$. The settlement with the largest population will have rank 1, the second most populous settlement will have rank 2, and so on.
The Pareto curve $y(x)$ relates the share of ranks $x = r/n$ to the cumulative share of population of the $r$ largest elements, $y(r/n) = \sum_{i=1}^{r} w_i$, with the boundary conditions $y(0) = 0$ and $y(1) = 1$. The PC shape characterizes the degree of unevenness of the size distribution of settlements. The steeper the curve and the further it is from the line of absolute equality, the greater the inequality in the size distribution, and vice versa.
Although the PC does not depend on the number of settlements, for a given $n$ it allows the shares of population to be calculated using the formula $w_r = y(r/n) - y((r-1)/n)$. Formally, for $n \gg 1$, the approximate equality $w_r \approx \frac{1}{n}\, y'(r/n)$ (2) is valid. At the same time, the ratio (2) applies only to small-sized settlements, where the relative difference between two sequential elements is very small. The first ranking elements of the system, which differ significantly in size, cannot be expressed in terms of a PC derivative (Grachev, 2009, 2010). Consequently, even formally, the proportions of the first ranking elements of the system cannot be described by the rank-size rule, which is confirmed by numerous empirical studies (for example, see the review (Arshad, Hu and Ashraf, 2018)).
It follows from (2) that at the point $\bar{p}$ that satisfies the equation $y'(\bar{p}) = 1$, the equality $w = 1/n$ is valid. In other words, the average-sized element of the system lies on the axis of shares of ranks at the point where the PC derivative is equal to 1, which is also the point at which the PC is at the maximum distance from the egalitarian line $y = x$ (Kakwani, 1980).
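The construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' procedure: it assumes the PC is the cumulative population share of the largest-first ranking plotted against the share of ranks, and the function names and the sample populations are hypothetical.

```python
def pareto_curve(populations):
    """Return (x, y): shares of ranks and cumulative population shares,
    with elements ranked from the largest to the smallest."""
    sizes = sorted(populations, reverse=True)
    total = sum(sizes)
    n = len(sizes)
    x, y = [0.0], [0.0]          # the curve starts at the origin
    running = 0.0
    for rank, size in enumerate(sizes, start=1):
        running += size
        x.append(rank / n)
        y.append(running / total)
    return x, y


def max_distance_point(populations):
    """Share of ranks at which the empirical PC is farthest above y = x,
    together with that vertical distance."""
    x, y = pareto_curve(populations)
    gaps = [yi - xi for xi, yi in zip(x, y)]
    best = max(range(len(gaps)), key=gaps.__getitem__)
    return x[best], gaps[best]


example = [50, 20, 10, 10, 5, 5]      # hypothetical populations
x, y = pareto_curve(example)
p_bar, gap = max_distance_point(example)
```

For the sample list the mean size is 100/6, so only the two largest elements exceed the average, and the maximum distance from the egalitarian line is reached at the share of ranks 2/6, in line with the statement about the average-sized element.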

Gini coefficient
The most common integral measure of size inequality is the Gini coefficient (Gini, 1921): $G = 2\int_0^1 \left(y(x) - x\right)dx$ (3), where $y(x)$ denotes the PC as a function of the share of ranks $x$. Integrating by parts, we have: $G = 1 - 2\int_0^1 x\, y'(x)\, dx$ (4). From relations (3) and (4), it can be seen that $G$ can be interpreted as the average distance from the PC to the egalitarian line, as the double area between the PC and the egalitarian line (Xu, 2005), and as the normalized coordinate of the center of mass of the system. The Gini coefficient ranges from a minimum value of zero, where all elements are equal, to a theoretical maximum of 1, where all elements except one have a size of zero.
The main disadvantage of $G$ is that it assigns the same value to PCs of different shapes (Voitchovsky, 2005; Osberg, 2017; Clementi et al., 2019).
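The insensitivity of G to the PC shape can be checked numerically. The sketch below is only an illustration under stated assumptions: G is taken as twice the area between a concave curve and the line y = x, the exponent value is hypothetical, and the second curve is simply the mirror image of the first about the alternative diagonal, not a curve taken from the paper's Table 1.

```python
def gini_from_curve(curve, steps=20_000):
    """Twice the area between the curve and the line y = x (trapezoid rule)."""
    area = 0.0
    for k in range(steps):
        a, b = k / steps, (k + 1) / steps
        area += 0.5 * (b - a) * ((curve(a) - a) + (curve(b) - b))
    return 2.0 * area


beta = 0.5  # hypothetical exponent, chosen only for illustration


def power_curve(x):
    # Concave power curve y = x**beta
    return x ** beta


def mirrored_curve(x):
    # Reflection of the power curve about the alternative diagonal y = 1 - x
    return 1.0 - (1.0 - x) ** (1.0 / beta)


g_power = gini_from_curve(power_curve)
g_mirror = gini_from_curve(mirrored_curve)
```

Both values come out close to (1 - beta) / (1 + beta), although the two curves lean in opposite directions, which is why a separate measure of asymmetry is needed alongside G.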

PC asymmetry coefficient
Pareto curves are symmetric with respect to the alternative diagonal when the equality $p_a = \bar{p}$ is satisfied, from which the PC symmetry condition (Kakwani, 1980) follows. Table 1 lists three single-parameter PCs with different asymmetries: the first row contains a power-law PC with a right-hand asymmetry, the second row a power-law PC with a left-hand asymmetry, and the third row a symmetric PC of the Burr III distribution (Burr, 1942).
From the curves presented in Fig. 1, it can be seen that the asymmetry is greatest at small values of $G$ and smallest as $G \to 1$. Consequently, an increase in size differentiation in the system leads to a decrease in the PC asymmetry.
Most often, the classical coefficient of asymmetry ($A$) is used to measure the PC asymmetry (Dagum, 1980; Kakwani, 1980; Damgaard and Weiner, 2000). For symmetric PCs, $A = 1$; values of $A$ on either side of 1 correspond to a left-hand or a right-hand asymmetry. After calculating the lower and upper boundaries of $A$, we introduce the normalized asymmetry coefficient ($\gamma$), which takes values in a finite interval.
Note that it follows from the dependencies presented in Fig. 1 that the 80/20 ratio (Pareto, 1897), obtained by Vilfredo Pareto when studying the distribution of land in Italy, occurs only at the value 0.787.

Critical exponent of PC
where A, D are constants specific to each settlement system.
The parameter $D$ is interpreted as the fractal dimension of the order parameter, and $\beta$ is called the critical exponent (Stanley, 1971). Note that the fractal dimension in (11) differs from the fractal dimension of Yanguang Chen (Chen, 2016) not only in magnitude, but also in physical meaning.
The critical exponent provides objective information about the shape of the PC in the neighborhood of zero and can be used to compare the sizes of the first ranking elements of the system. For example, the primacy index, the ratio of the sizes of the two first ranking settlements, can be calculated using the formula $w_1 / w_2 = 1 / (2^{\beta} - 1)$. From the rank-size rule, $w_1 / w_2 = 2$. Thus, a primate city that has a size of $w_1 \ge 2w_2$ (Jefferson, 1939) appears in settlement systems when their fractal dimension satisfies the corresponding condition; accordingly, the critical exponent is less than or equal to 0.585. The emergence of primates can be viewed as the result of asymmetric size competition, in which larger elements of the system suppress the growth of their smaller neighbors. Models of such competition can be found in ecology (for example, see (Weiner and Damgaard, 2006)).
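The threshold of 0.585 can be reproduced with a short calculation. The sketch assumes that the PC behaves as a power function with exponent beta over the first two ranks, so that the largest share is the curve value at 1/n and the second share is the increment up to 2/n; the function name is hypothetical.

```python
import math


def primacy_index(beta):
    """Ratio w1 / w2 when the PC is proportional to x**beta near zero.

    With w1 = y(1/n) and w2 = y(2/n) - y(1/n), the factor n**(-beta)
    cancels, leaving 1 / (2**beta - 1)."""
    return 1.0 / (2.0 ** beta - 1.0)


# Exponent at which the largest element is exactly twice the second one:
# 2**beta - 1 = 1/2, hence beta = log2(1.5)
threshold = math.log2(1.5)
```

Exponents below the threshold give a primacy index above 2, and an exponent of 1, which corresponds to equal shares, gives an index of 1.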

The Pareto curve model
The search for distributions that replace the lognormal distribution and combine the properties of the power-law distribution and the lognormal distribution has been going on for several decades (for example, see (Champernowne, 1956; Ghosh and Basu, 2019; Sánchez, 2019)). An analytically simple two-parameter PC model that gives a wide range of asymmetries can be obtained by generalizing the single-parameter PCs presented in Table 1. The obvious advantages of the model (13) are a simple formula for calculating the PC and an explicit analytical expression for the Gini coefficient. It is known that the model (13) has shown good results in the study of income inequality (Sarabia, Jordá and Trueba, 2013).
To estimate the critical exponent of the model (13), we use the limit (Stanley, 1971): $\lim_{x \to 0} \ln y(x) / \ln x$. Applying L'Hôpital's rule, we find: $\lim_{x \to 0} x\, y'(x) / y(x) = \beta$ (16). Relation (16) suggests that, in the neighborhood of zero, the shape of the two-parameter PC (13) does not depend on the parameter α and can be approximated by a power function with the critical exponent β.
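The limit can be illustrated numerically as the local slope of ln y against ln x close to zero. The curve used below is a stand-in chosen only because it behaves like a constant times x**beta near the origin; it is an assumption for illustration and is not claimed to be the paper's model (13). All names and parameter values are hypothetical.

```python
import math


def stand_in_curve(x, alpha, beta):
    """Hypothetical two-parameter curve 1 - (1 - x**beta)**alpha,
    evaluated with log1p/expm1 to stay accurate for very small x."""
    return -math.expm1(alpha * math.log1p(-(x ** beta)))


def local_exponent(curve, x, ratio=10.0):
    """Slope of ln y versus ln x between the points x and ratio * x."""
    return (math.log(curve(ratio * x)) - math.log(curve(x))) / math.log(ratio)


beta = 0.6
# The same exponent is estimated for two very different values of alpha
estimates = [
    local_exponent(lambda x, a=a: stand_in_curve(x, a, beta), 1e-12)
    for a in (0.5, 3.0)
]
```

For this stand-in curve both estimates approach 0.6: the second parameter changes only the prefactor of the power function, not the exponent measured near zero.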
According to an empirical study (Ghosh,

Data
The main source of empirical data for state sizes (the first level of aggregation) is the US census (Historic US Census data), which has been conducted every decade since 1790. In the first census in 1790, there were only 18 states. Since 1959, the United States has comprised 50 states.
However, data for all 51 states, including the District of Columbia, have been present in the census since 1900.
As a database of the sizes of counties (the second level of aggregation) and cities (the third level), the website http://www.citypopulation.de/en/usa/ was used.

Data processing methodology
To calculate the Gini coefficient, the ratio (4) was used, which for a discrete data set was calculated using the formula of Sen (Sen, 1973). The estimates of the parameters α and β were obtained using the Solver add-in in Microsoft Excel. As the objective function, we minimized the Kolmogorov-Smirnov statistic, which is the most common choice for non-normal data and is equal to the maximum difference between the empirical and theoretical PCs, the latter being the estimated Pareto curve.
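The two processing steps can be sketched as follows. Several assumptions apply: the discrete Gini coefficient is written in the standard form of Sen's formula with ranks counted from the largest element, a plain grid search stands in for the Excel optimizer, and the fitted curve is a one-parameter power curve used as a placeholder rather than the paper's two-parameter model (13). All function names and the synthetic data are hypothetical.

```python
def sen_gini(populations):
    """Discrete Gini coefficient: 1 + 1/n - 2 * sum(r * s_r) / (n * total),
    where s_r is the size of the element with rank r (largest first)."""
    sizes = sorted(populations, reverse=True)
    n, total = len(sizes), sum(sizes)
    weighted = sum(rank * size for rank, size in enumerate(sizes, start=1))
    return 1.0 + 1.0 / n - 2.0 * weighted / (n * total)


def ks_distance(populations, curve):
    """Largest absolute gap between the empirical PC and a theoretical curve."""
    sizes = sorted(populations, reverse=True)
    n, total = len(sizes), sum(sizes)
    running, worst = 0.0, 0.0
    for rank, size in enumerate(sizes, start=1):
        running += size
        worst = max(worst, abs(running / total - curve(rank / n)))
    return worst


def fit_power_curve(populations, grid=1000):
    """Grid search for the exponent b of the placeholder curve y = x**b
    that minimizes the Kolmogorov-Smirnov distance."""
    candidates = [k / grid for k in range(1, grid + 1)]
    return min(
        candidates,
        key=lambda b: ks_distance(populations, lambda x: x ** b),
    )


# Synthetic shares generated from y = x**0.5, so the fit should recover 0.5
n = 100
synthetic = [(r / n) ** 0.5 - ((r - 1) / n) ** 0.5 for r in range(1, n + 1)]
beta_hat = fit_power_curve(synthetic)
gini_example = sen_gini([50, 20, 10, 10, 5, 5])
```

A two-parameter fit would replace the one-dimensional grid with a search over both parameters; the objective function stays the same.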

The system of settlement in states
The first and most aggregated level of the US settlement system is the settlement in states.
The rank distributions of states in 1800, 1850, 1900, 1950, and 2000 in the bilogarithmic coordinate system are shown by markers in Fig. 3. The results of their approximation by the model (13) are indicated with a dotted line.

Fig. 3. Rank distributions of state sizes
It can be seen from the rank distributions presented in Fig. 3 that the model (13) adequately describes the sizes of states in the entire range of ranks.
The dynamics of the parameters $G$, $\beta$, $p_a$ and $\bar{p}$ is indicated in Fig. 4 by markers, and the linear trends by a dashed line. The standard error of the exponent does not exceed 0.005, and that of the asymmetry coefficient does not exceed 0.01. The behavior of the PC shape parameters presented in Fig. 4 shows that for 220 years, the PC of the state settlement system had a left-hand asymmetry, which decreased as the settlement system developed.

County-based and city-based systems of settlement in states
The empirical dependences of $\beta$ and $\gamma$ on $G$ for county-based ($n \ge 14$) and city-based settlement systems are shown in Fig. 5.

(Figure panels: county-based settlement system; city-based settlement system.)
It can be seen from the empirical rank distributions presented in Fig. 6 that the uniformity of the size distribution in the city-based settlement system is higher than in the county-based settlement system.

Discussion of the results. Conclusions
The initial premise of this paper is the assumption that the inadequacy of describing settlement systems by a Pareto distribution, which has a PC with a right-hand asymmetry, and by a lognormal distribution, which has a symmetric PC, occurs in cases where real Pareto curves have a left-hand asymmetry.
To test this hypothesis, the paper analyzes the behavior of the Pareto curve shape in the system of settlement in states and provides a spatial analysis of the PC shape of the county-based and city-based settlement systems of the United States. The Gini coefficient, the asymmetry coefficient and the critical exponent of the PC were used as indicators of the PC shape.
An empirical analysis of the behavior of the PC shape in the system of settlement in states showed that the PC had a left-hand asymmetry between 1790 and 2010, which explains why the Pareto and lognormal distributions do not describe the system of settlement in states in a satisfactory way (Soo, 2012).
Spatial analysis of the systems of settlement in counties and cities in 2010 showed that, in both systems, PCs with left-hand and right-hand asymmetries occur in approximately equal numbers, which explains why the Pareto and lognormal distributions are not applicable to these systems in half of the cases.
When analyzing the influence of aggregation of settlement systems on the parameters of the PC shape (see Table 2), we can see that the system of settlement in states has the lowest average Gini coefficient and the highest average critical exponent. The system of settlement in counties has intermediate average values of $G$ and $\beta$, while the system of settlement in cities has the highest average Gini coefficient and the lowest average critical exponent. Given that primates can only occur in systems with a critical exponent $\beta \le 0.585$, the result explains why they appear primarily in city-based settlement systems in the United States. However, excluding them from the data sample when evaluating model parameters, as suggested in the paper (Benguigui and Blumenfeld-Lieberthal, 2007), is allowed only in exceptional cases.

A verification of the two-parameter model built in (Antoniou et al., 2004), which combines the properties of the power-law distribution and the lognormal distribution, based on empirical data of the US settlement system, has shown that it adequately describes homogeneous settlement systems.
A comparison of the uniformity of the sizes of counties and cities in the states showed that the uniformity of the size distribution in the city-based settlement system is higher than in the county-based settlement system. Most likely, this is because the city-based settlement system is self-organized under the influence of objective laws of geographical systems (Batty, 2008), while the aggregation of settlements into counties depends not only on objective historical and local events, but also on the subjective views of state leaders on ways to improve the efficiency of the region (Allen, 1996).
Since the government can influence the areas of development of settlement systems (for example, various tax benefits for certain territories, federal subsidies to small cities to attract residents, etc.), the question arises: which asymmetry of the Pareto curves is appropriate to take as a basis for managing the development of settlement systems?
According to the isoperimetric inequality, the more "regular" the shape of a flat figure of a given area, the smaller its perimeter (Polya and Szegö, 1951). By the same token, for the same value of the Gini coefficient, a symmetric PC is shorter than an asymmetric one. The isoperimetric inequality is conceptually related to the principle of least action, which explains why oil spots on a water surface have the shape of a circle and a drop of water has the shape of a ball. Given this, as well as the decrease in the PC asymmetry of the state-based settlement system over time and the zero average asymmetry of the county-based and city-based systems, it is logical to assume that a normally functioning system should have a symmetrical PC.
Thus, the study confirmed that in settlement systems, the shape of the Pareto curve has a wide range of asymmetries that cannot be described by Pareto and lognormal distributions.
It has been established that primates in settlement systems occur when the critical exponent is less than 0.585. An increase in the Gini coefficient usually leads to an increase in the probability of a primate occurring.
The tools for evaluating the PC shape and the two-parameter PC model tested in this study can be used in other fields of knowledge.