API Reference

Tools for computing diversity, integration, and segregation metrics.

class divintseg.SimilarityReference(reference: pd.DataFrame | typing.Mapping[str, int | float])

Bases: object

An object that computes dissimilarity from a reference.

Parameters:

reference – The reference community. It should be either a mapping from group name to count, or a dataframe with a single row and one column per group holding the reference population of that group.

dissimilarity(df_communities: pd.DataFrame) → pd.Series

Compute the dissimilarity index of one or more communities relative to a reference community.

Parameters:

df_communities – The communities. This is a DataFrame with each row representing a community and each column representing a group.

Returns:

  • The dissimilarity index of each community relative to the reference community.

similarity(df_communities: pd.DataFrame) → pd.Series

Compute the representation index of one or more communities.

If sim_ref is a SimilarityReference, then sim_ref.similarity(communities) is equal to 1.0 - sim_ref.dissimilarity(communities).

Parameters:

df_communities – The communities. This is a DataFrame with each row representing a community and each column representing a group.

Returns:

  • The similarity of each community relative to the reference community.
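
The relationship between the two methods can be illustrated with a small sketch. It does not call divintseg. It assumes the conventional dissimilarity index (half the sum of absolute differences between group shares), which this reference does not spell out, so treat the formula and the helper names shares, dissimilarity_sketch, and similarity_sketch as hypothetical.

```python
from typing import Dict, Mapping


def shares(counts: Mapping[str, float]) -> Dict[str, float]:
    """Convert group counts into population shares that sum to 1."""
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def dissimilarity_sketch(
    community: Mapping[str, float], reference: Mapping[str, float]
) -> float:
    """ASSUMED definition: half the sum of absolute share differences."""
    p = shares(community)
    q = shares(reference)
    groups = set(p) | set(q)
    return 0.5 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in groups)


def similarity_sketch(
    community: Mapping[str, float], reference: Mapping[str, float]
) -> float:
    """Similarity is one minus dissimilarity, as stated in this reference."""
    return 1.0 - dissimilarity_sketch(community, reference)


reference = {"A": 50, "B": 50}
community = {"A": 75, "B": 25}

d = dissimilarity_sketch(community, reference)
s = similarity_sketch(community, reference)
print(d, s)
```

Whatever formula the library actually uses, the two values for a given community should always sum to 1.0.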

divintseg.bells(df_communities: pd.DataFrame, group_name: str, by: str, over: str) → pd.DataFrame

Computes the isolation of a group using the isolation index by Wendell Bell.

Parameters:
  • df_communities – A pd.DataFrame of communities.

  • group_name – The name of the group (name of a column in df_communities) whose isolation we wish to compute.

  • by – The column or index to group by in order to partition the rows into communities.

  • over – The column to group by in order to partition the rows of each community into smaller aggregation units where the base diversity will be computed.

Returns:

  • A dataframe with one row for each unique value of the by column indicating the Bell’s Index of the group_name column with respect to all of the other columns in the data frame. If a community’s population consists exclusively of group_name, 1.0 will appear in the dataframe cell corresponding to that region.

divintseg.di(df_communities: pd.DataFrame, by=None, over=None, *, add_segregation: bool = False, drop_non_numeric: bool = False) → pd.DataFrame

Compute the diversity, integration, and optionally the segregation of each of a collection of communities.

Parameters:
  • df_communities – A pd.DataFrame of communities.

  • by – The column or index to group by in order to partition the rows into communities.

  • over – The column to group by in order to partition the rows of each community into smaller aggregation units where the base diversity will be computed. If None then each row is assumed to represent a different community.

  • add_segregation – If True, add a column to the results for segregation.

  • drop_non_numeric – If True, then any non-numeric column other than those specified by by and over will be implicitly dropped. This is useful if there are columns naming other levels of geographic aggregation that should be ignored.

Returns:

  • A DataFrame containing the diversity, integration, and optionally the segregation of each community.

divintseg.dissimilarity(df_communities: pd.DataFrame, reference: pd.DataFrame | typing.Mapping[str, int | float]) → pd.Series

Compute the dissimilarity index of one or more communities relative to a reference community.

If you want to compute dissimilarity or similarity many times against a common reference, then creating a SimilarityReference is a more efficient option.

Parameters:
  • df_communities – The communities. This is a DataFrame with each row representing a community and each column representing a group.

  • reference – The reference community. It should be a single row with a column with the reference population of each group.

Returns:

  • The dissimilarity index of each community relative to the reference community.

divintseg.diversity(communities: pd.DataFrame | typing.Iterable[float]) → pd.Series | float

Compute the diversity of one or more communities.

Parameters:

communities – The communities. This is either an iterable over the population of each group in the community or, more commonly, a DataFrame with each row representing a community and each column representing a group.

Returns:

  • The diversity of the community or, if passed a DataFrame, a Series with one entry for the diversity of each community.
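
As a rough illustration of the iterable form of the input, the sketch below computes a diversity value from a list of group populations. This reference does not state the formula, so the sketch assumes diversity means the probability that two members drawn at random (with replacement) belong to different groups, that is, one minus the sum of squared group shares. The function name diversity_sketch is hypothetical and is not part of divintseg.

```python
from typing import Iterable


def diversity_sketch(populations: Iterable[float]) -> float:
    """ASSUMED definition: 1 - sum of squared group shares."""
    counts = list(populations)
    total = sum(counts)
    if total == 0:
        raise ValueError("community has no population")
    return 1.0 - sum((count / total) ** 2 for count in counts)


# One community per row, one group per position, mirroring the DataFrame layout.
communities = [
    [100, 0, 0],   # a single group: no diversity
    [50, 50],      # two equal groups
    [50, 50, 50],  # three equal groups
]

diversities = [diversity_sketch(row) for row in communities]
print(diversities)
```

Under this assumed definition the value is 0 for a single-group community and approaches 1 as the population spreads evenly over more groups.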

divintseg.exposure(df_communities: pd.DataFrame, primary_group_name: str, by: str, over: str, secondary_group_name: str | None = None) → pd.DataFrame

Compute the exposure of a group in a community. Exposure measures a group’s average local exposure to members of another group.

Parameters:
  • df_communities – A pd.DataFrame of communities.

  • primary_group_name – The name of the group (name of a column in df_communities) whose exposure we wish to compute relative to the group specified in secondary_group_name.

  • by – The column or index to group by in order to partition the rows into communities.

  • over – The column to group by in order to partition the rows of each community into smaller aggregation units where the base diversity will be computed.

  • secondary_group_name – The name of the group whose exposure should be calculated relative to primary_group_name. If None, every single group will have its exposure calculated.

Returns:

  • A dataframe with one row for each unique value of the by column and one column for each group column other than primary_group_name, indicating the exposure of the primary_group_name column with respect to that other group. If secondary_group_name is not None, the only column in the returned dataframe will be the exposure of primary_group_name to secondary_group_name.

Example

>>> import pandas as pd
>>> df = pd.DataFrame(
...     [
...         ['Region 1', 'Subregion A', 100, 0, 0],
...         ['Region 1', 'Subregion B', 50, 50, 50],
...         ['Region 2', 'Subregion C', 0, 110, 100],
...         ['Region 2', 'Subregion D', 0, 50, 0],
...         ['Region 2', 'Subregion E', 10, 90, 0],
...     ],
...     columns=['REGION', 'SUBREGION', 'A', 'B', 'C']
... )
>>> df
     REGION    SUBREGION    A    B    C
0  Region 1  Subregion A  100    0    0
1  Region 1  Subregion B   50   50   50
2  Region 2  Subregion C    0  110  100
3  Region 2  Subregion D    0   50    0
4  Region 2  Subregion E   10   90    0
>>> from divintseg import exposure
>>> exposure(df, "A", by="REGION", over="SUBREGION")
     REGION        B       C
0  Region 1   0.3333  0.3333
1  Region 2    0.036       0

The value in row R and column C is the exposure of primary_group_name (here A) to group C in region R. For example, the exposure of A to B is 0.036 in Region 2.

Calculating the likelihood of A:

Region    Subregion    Likelihood of A
Region 1  Subregion A  100 / (100 + 0 + 0) = 1
Region 1  Subregion B  50 / (50 + 50 + 50) = 0.3333
Region 2  Subregion C  0 / (0 + 110 + 100) = 0
Region 2  Subregion D  0 / (0 + 50 + 0) = 0
Region 2  Subregion E  10 / (10 + 90 + 0) = 0.1

Computing the fraction of all B’s in each subregion of each region:

Region    Subregion    Fraction of all B’s in Region
Region 1  Subregion A  0 / 50 = 0
Region 1  Subregion B  50 / 50 = 1
Region 2  Subregion C  110 / 250 = 0.44
Region 2  Subregion D  50 / 250 = 0.2
Region 2  Subregion E  90 / 250 = 0.36

Multiplying and summing for the subregions in each region:

For Region 1, we get

\[(1 * 0) + (0.3333 * 1) = 0.3333.\]

For Region 2, we get

\[0 * 0.44 + 0 * 0.2 + 0.1 * 0.36 = 0.036.\]

Repeating the process for C:

Region    Subregion    Fraction of all C’s in Region
Region 1  Subregion A  0 / 50 = 0
Region 1  Subregion B  50 / 50 = 1
Region 2  Subregion C  100 / 100 = 1
Region 2  Subregion D  0 / 100 = 0
Region 2  Subregion E  0 / 100 = 0

For Region 1, we get

\[(1 * 0) + (0.3333 * 1) = 0.3333.\]

For Region 2, we get

\[0 * 1 + 0 * 0 + 0.1 * 0 = 0.\]
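
The worked arithmetic above can be reproduced in a few lines of plain Python. This is only a sketch of the steps as described in this example (likelihood of the primary group in each subregion, multiplied by the fraction of the region's secondary group living there, summed per region). It does not call divintseg, and the name exposure_sketch is hypothetical.

```python
from collections import defaultdict

# (region, subregion, population of each group), copied from the example above.
rows = [
    ("Region 1", "Subregion A", {"A": 100, "B": 0, "C": 0}),
    ("Region 1", "Subregion B", {"A": 50, "B": 50, "C": 50}),
    ("Region 2", "Subregion C", {"A": 0, "B": 110, "C": 100}),
    ("Region 2", "Subregion D", {"A": 0, "B": 50, "C": 0}),
    ("Region 2", "Subregion E", {"A": 10, "B": 90, "C": 0}),
]


def exposure_sketch(rows, primary, secondary):
    """Follow the worked example; assumes no empty subregion or region total."""
    # Total population of the secondary group in each region.
    region_totals = defaultdict(float)
    for region, _subregion, counts in rows:
        region_totals[region] += counts[secondary]

    result = defaultdict(float)
    for region, _subregion, counts in rows:
        # Likelihood of the primary group within this subregion.
        likelihood = counts[primary] / sum(counts.values())
        # Fraction of the region's secondary group living in this subregion.
        fraction = counts[secondary] / region_totals[region]
        result[region] += likelihood * fraction
    return dict(result)


a_to_b = exposure_sketch(rows, "A", "B")
a_to_c = exposure_sketch(rows, "A", "C")
print(a_to_b)
print(a_to_c)
```

The results agree with the B and C columns of the output shown earlier, up to rounding.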

divintseg.integration(df_communities: pd.DataFrame, by=None, over=None, *, drop_non_numeric: bool = False) → pd.DataFrame

Compute the integration of one or more communities over a nested level of population aggregation. For example, with US census data we might compute integration of block groups over blocks.

Parameters:
  • df_communities – A pd.DataFrame of communities.

  • by – The column or index to group by in order to partition the rows into communities.

  • over – The column to group by in order to partition the rows of each community into smaller aggregation units where the base diversity will be computed. If None then each row is assumed to represent a different community.

  • drop_non_numeric – If True, then any non-numeric column other than those specified by by and over will be implicitly dropped. This is useful if there are columns naming other levels of geographic aggregation that should be ignored.

Returns:

  • A Series containing the integration of each community.
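
To make the roles of by and over concrete, the sketch below partitions rows into communities (by) and smaller units (over), computes a base diversity for each unit, and combines them per community. Two things are assumptions that this reference does not state: that diversity is one minus the sum of squared group shares, and that integration is the population-weighted average of unit diversities. The names diversity_sketch and integration_sketch are hypothetical and the code does not call divintseg.

```python
from collections import defaultdict

# (by value, over value, population of each group in that unit)
rows = [
    ("Region 1", "Subregion A", [100, 0]),
    ("Region 1", "Subregion B", [50, 50]),
    ("Region 2", "Subregion C", [0, 100]),
    ("Region 2", "Subregion D", [0, 50]),
    ("Region 2", "Subregion E", [10, 90]),
]


def diversity_sketch(counts):
    """ASSUMED base diversity: 1 - sum of squared group shares."""
    total = sum(counts)
    if total == 0:
        return 0.0  # an empty unit carries zero weight below anyway
    return 1.0 - sum((count / total) ** 2 for count in counts)


def integration_sketch(rows):
    """ASSUMED: population-weighted mean of unit diversity per community."""
    weighted_sum = defaultdict(float)
    population = defaultdict(float)
    for community, _unit, counts in rows:
        unit_population = sum(counts)
        weighted_sum[community] += unit_population * diversity_sketch(counts)
        population[community] += unit_population
    return {c: weighted_sum[c] / population[c] for c in weighted_sum}


integration = integration_sketch(rows)
print(integration)
```

Under these assumptions Region 1 comes out at 0.25 and Region 2 at about 0.072, since only one subregion in each region contains more than one group.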

divintseg.isolation(df_communities: pd.DataFrame, group_name: str, by: str, over: str) → pd.DataFrame

Compute the isolation of a group in a community. Isolation is the average, over all members of a group in a community, of the proportion of the population of the smaller area they reside in that belongs to their own group.

Parameters:
  • df_communities – A pd.DataFrame of communities.

  • group_name – The name of the group (name of a column in df_communities) whose isolation we wish to compute.

  • by – The column or index to group by in order to partition the rows into communities.

  • over – The column to group by in order to partition the rows of each community into smaller aggregation units where the base diversity will be computed.

Returns:

  • A dataframe with one row for each unique value of the by column indicating the isolation of the group_name column with respect to all of the other columns in the data frame.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame(
...     [
...         ['Region 1', 'Subregion A', 100, 0],
...         ['Region 1', 'Subregion B', 50, 50],
...         ['Region 2', 'Subregion C', 0, 100],
...         ['Region 2', 'Subregion D', 0, 50],
...         ['Region 2', 'Subregion E', 10, 90],
...     ],
...     columns=['REGION', 'SUBREGION', 'S', 'T']
... )
>>> df
     REGION    SUBREGION    S    T
0  Region 1  Subregion A  100    0
1  Region 1  Subregion B   50   50
2  Region 2  Subregion C    0  100
3  Region 2  Subregion D    0   50
4  Region 2  Subregion E   10   90
>>> from divintseg import isolation
>>> isolation(df, "S", by="REGION", over="SUBREGION")
     REGION        S
0  Region 1  0.83333
1  Region 2      0.1

Let’s look at what this example computed. First, we have to see how likely each person in group S is to see other members of their own group in their subregion. This is as follows:

Region    Subregion    Likelihood of an S
Region 1  Subregion A  100 / (100 + 0) = 1.0
Region 1  Subregion B  50 / (50 + 50) = 0.5
Region 2  Subregion C  0 / (0 + 100) = 0.0
Region 2  Subregion D  0 / (0 + 50) = 0.0
Region 2  Subregion E  10 / (10 + 90) = 0.1

Next, we can compute the fraction of all S’s in each subregion of each region. There are 150 S’s in Region 1 and 10 S’s in Region 2. Therefore, we have:

Region    Subregion    Fraction of all S’s in Region
Region 1  Subregion A  100 / 150 = 0.6667
Region 1  Subregion B  50 / 150 = 0.3333
Region 2  Subregion C  0 / 10 = 0.0000
Region 2  Subregion D  0 / 10 = 0.0000
Region 2  Subregion E  10 / 10 = 1.0000

Finally, for each subregion, we multiply these two values together, then add up the products for the subregions in each region. For Region 1, we get

\[(0.6667 * 1.0) + (0.3333 * 0.5) = 0.8333.\]

For Region 2, we get

\[0.0 * 0.0 + 0.0 * 0.0 + 1.000 * 0.1 = 0.1.\]

Note that the implementation may not do this math exactly as specified here, but it will do something equivalent.
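
For readers who prefer code to tables, the steps above can be sketched in plain Python. This mirrors the worked example only. It does not call divintseg, it is not necessarily how the library computes the result, and the name isolation_sketch is hypothetical.

```python
from collections import defaultdict

# (region, subregion, population of each group), copied from the example above.
rows = [
    ("Region 1", "Subregion A", {"S": 100, "T": 0}),
    ("Region 1", "Subregion B", {"S": 50, "T": 50}),
    ("Region 2", "Subregion C", {"S": 0, "T": 100}),
    ("Region 2", "Subregion D", {"S": 0, "T": 50}),
    ("Region 2", "Subregion E", {"S": 10, "T": 90}),
]


def isolation_sketch(rows, group):
    """Follow the worked example; assumes no empty subregion or region total."""
    # Total population of the group in each region.
    region_totals = defaultdict(float)
    for region, _subregion, counts in rows:
        region_totals[region] += counts[group]

    result = defaultdict(float)
    for region, _subregion, counts in rows:
        # Likelihood of meeting a member of the group within this subregion.
        likelihood = counts[group] / sum(counts.values())
        # Fraction of the region's group members living in this subregion.
        fraction = counts[group] / region_totals[region]
        result[region] += fraction * likelihood
    return dict(result)


isolation_of_s = isolation_sketch(rows, "S")
print(isolation_of_s)
```

The values match the 0.83333 and 0.1 shown in the example output, up to rounding.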

divintseg.segregation(df_communities: pd.DataFrame, by=None, over=None, *, drop_non_numeric: bool = False) → pd.DataFrame

Compute the segregation of one or more communities over a nested level of population aggregation. For example, with US census data we might compute segregation of block groups over blocks.

Parameters:
  • df_communities – A pd.DataFrame of communities.

  • by – The column or index to group by in order to partition the rows into communities.

  • over – The column to group by in order to partition the rows of each community into smaller aggregation units where the base diversity will be computed. If None then each row is assumed to represent a different community.

  • drop_non_numeric – If True, then any non-numeric column other than those specified by by and over will be implicitly dropped. This is useful if there are columns naming other levels of geographic aggregation that should be ignored.

Returns:

  • A Series containing the segregation of each community.

divintseg.similarity(df_communities: pd.DataFrame, reference: pd.DataFrame | typing.Mapping[str, int | float]) → pd.Series

Compute the similarity index of one or more communities relative to a reference community.

Note that similarity is just one minus dissimilarity.

If you want to compute dissimilarity or similarity many times against a common reference, then creating a SimilarityReference is a more efficient option.

Parameters:
  • df_communities – The communities. This is a DataFrame with each row representing a community and each column representing a group.

  • reference – The reference community. It should be a single row with a column with the reference population of each group.

Returns:

  • The similarity of each community relative to the reference community.