Diversity index

A diversity index is a quantitative measure of how many different types, such as species, are present in a dataset, such as a community. Different indices capture different aspects of biodiversity: richness (how many types there are), evenness (how evenly individuals are spread among the types), and dominance (which types are most common).

In ecology, the types studied are usually species, but they can also be other categories, such as genera, families, or functional types. The entities of interest are usually individual organisms, such as plants or animals, and the abundance of each type can be measured in different ways, for example by counting individuals, measuring biomass, or estimating the area covered. In demography, the entities are people, and the types are demographic groups, such as age or gender groups. In information science, the entities can be characters, and the types the different letters of the alphabet. The most commonly used diversity indices are simple transformations of the effective number of types, but each index can also be interpreted in its own right as a measure of a different phenomenon.

Many indices account only for categorical diversity, that is, which type each entity belongs to. They do not capture the total variation among entities, which appears only when both categorical and qualitative differences are taken into account.

Diversity indices described in this article include:

  • Richness
  • Shannon index
  • Simpson index
  • Berger–Parker index

Some more advanced indices also consider how closely related the types are. These are called phylo-divergence indices, but they are not covered here.

Effective number of species or Hill numbers

True diversity, or the effective number of types, is the number of equally common types needed so that their average share in a group matches the actual average share found in a dataset (where types may not all be equally common). To calculate true diversity, first find the weighted generalized mean $M_{q-1}$ of the proportional abundances of the types in the dataset, then take its reciprocal. The formula is:

$$ {}^{q}D = \frac{1}{M_{q-1}} = \frac{1}{\left( \sum_{i=1}^{R} p_i \, p_i^{q-1} \right)^{1/(q-1)}} = \left( \sum_{i=1}^{R} p_i^{q} \right)^{1/(1-q)} $$

Here, $M_{q-1}$ is the average proportional abundance of the types in the dataset, calculated as the weighted generalized mean with exponent q − 1. In the formula, R is the total number of types (richness), and p_i is the proportional abundance of the i-th type. The proportional abundances themselves are used as weights. The values ${}^{q}D$ are known as Hill numbers of order q, or effective numbers of species.

At q = 1 the formula is undefined, because the exponent 1/(1 − q) involves division by zero. However, the limit as q approaches 1 is well defined, and the corresponding diversity is:

$$ {}^{1}D = \frac{1}{\prod_{i=1}^{R} p_i^{p_i}} = \exp\left( -\sum_{i=1}^{R} p_i \ln p_i \right) $$

This is the exponential of the Shannon entropy calculated with natural logarithms (see the Shannon index section below). In other fields, this value is also called perplexity.
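The two formulas above can be combined into a single routine. The following is a minimal sketch, not a reference implementation; the function name `hill_number` and the example communities are illustrative assumptions, and the q = 1 case is handled by switching to the limit formula.

```python
import math

def hill_number(proportions, q):
    """Effective number of types (Hill number) of order q.

    `proportions` are proportional abundances that sum to 1.
    The name and interface are illustrative, not taken from any library.
    """
    # Drop absent types: they would otherwise be counted at q = 0 (0 ** 0 == 1).
    p = [x for x in proportions if x > 0]
    if math.isclose(q, 1.0):
        # Limit as q -> 1: exponential of the Shannon entropy (natural log).
        return math.exp(-sum(x * math.log(x) for x in p))
    basic_sum = sum(x ** q for x in p)
    return basic_sum ** (1.0 / (1.0 - q))

even = [0.25, 0.25, 0.25, 0.25]
uneven = [0.7, 0.1, 0.1, 0.1]

print(hill_number(even, 2))    # four equally common types give 4 effective types
print(hill_number(uneven, 0))  # order 0 counts the types present
print(hill_number(uneven, 1))  # about 2.56
print(hill_number(uneven, 2))  # about 1.92
```

For the even community every order gives 4, while for the uneven community the effective number of types falls as q grows.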

The general formula for diversity is often written as:

$$ {}^{q}D = \left( \sum_{i=1}^{R} p_i^{q} \right)^{1/(1-q)} $$

The term inside the parentheses is called the basic sum. Many widely used diversity indices correspond to the basic sum calculated with different values of q.

Sensitivity of the diversity value to rare vs. abundant species

The value of q is often called the order of the diversity. It determines how sensitive true diversity is to rare versus abundant species by changing how the weighted mean of the species' proportional abundances is calculated. For some values of q, the generalized mean $M_{q-1}$ reduces to a familiar kind of weighted mean:

  • When q = 0, $M_{q-1}$ is the weighted harmonic mean.
  • When q = 1, $M_{q-1}$ is the weighted geometric mean.
  • When q = 2, $M_{q-1}$ is the weighted arithmetic mean.
  • As q approaches infinity, $M_{q-1}$ approaches the largest p_i, the proportional abundance of the most common species.
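The special cases in this list can be checked numerically. The sketch below is illustrative only: `weighted_power_mean` and `mean_abundance` are hypothetical helper names, and the three-species community is made-up example data.

```python
import math

def weighted_power_mean(values, weights, exponent):
    """Weighted generalized (power) mean; weights are assumed to sum to 1."""
    if exponent == 0:
        # Limit as the exponent approaches 0: the weighted geometric mean.
        return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)))
    total = sum(w * v ** exponent for v, w in zip(values, weights))
    return total ** (1.0 / exponent)

p = [0.5, 0.3, 0.2]  # proportional abundances, also used as the weights

def mean_abundance(q):
    """M_{q-1}: the mean proportional abundance for diversity order q."""
    return weighted_power_mean(p, p, q - 1)

# Direct definitions of the three classical weighted means, for comparison.
harmonic = 1.0 / sum(w / v for v, w in zip(p, p))
geometric = math.prod(v ** w for v, w in zip(p, p))
arithmetic = sum(w * v for v, w in zip(p, p))

print(mean_abundance(0), harmonic)    # both equal 1/R
print(mean_abundance(1), geometric)
print(mean_abundance(2), arithmetic)
print(mean_abundance(200), max(p))    # a large q approaches the largest p_i
```

Because the weights are the proportional abundances themselves, the weighted harmonic mean at q = 0 is always 1/R, whatever the abundances are.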

Increasing q gives more weight to the most common species. This results in a larger $M_{q-1}$ and a smaller true diversity ${}^{q}D$.

When q = 1, the weighted geometric mean of the proportional abundances is used, and each species is weighted exactly by its proportional abundance. When q > 1, the most common species are given more weight. When q < 1, rare species are given more weight. At q = 0, the species weights exactly cancel the proportional abundances, so the mean proportional abundance equals 1/R even when species are not equally common. The effective number of species ${}^{0}D$ then equals the actual number of species R.

In diversity studies, q is usually limited to non-negative values. Negative values of q would give rare species so much more weight than common ones that ${}^{q}D$ would exceed R.
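The effect of q can be seen by computing true diversity for one community at several orders. This is a rough sketch under stated assumptions: `hill_number` is an illustrative helper, and the community and the list of orders are arbitrary example values.

```python
import math

def hill_number(proportions, q):
    """True diversity of order q (illustrative helper, not a library function)."""
    p = [x for x in proportions if x > 0]
    if math.isclose(q, 1.0):
        # Limit case: exponential of the Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

community = [0.7, 0.1, 0.1, 0.1]  # one dominant species, three rare ones
richness = len(community)

orders = [-1, 0, 0.5, 1, 2, 10]
profile = [hill_number(community, q) for q in orders]

for q, d in zip(orders, profile):
    print(f"q = {q:>4}: D = {d:.3f}")
```

In this example the diversity falls steadily as q increases, equals the richness of 4 at q = 0, and exceeds the richness at q = −1.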

Richness

Richness R is the number of different types in a dataset. For example, species richness (usually written as S) is the number of species found in a specific area. Richness is a simple measure that needs no abundance data, so it is popular in ecology, where such data are often unavailable. True diversity calculated with q = 0 equals richness, that is, ${}^{0}D$ = R.

Shannon index

The Shannon index is a widely used measure of diversity in ecology. It is also known as Shannon's diversity index or the Shannon–Wiener index, and is sometimes mistakenly called the Shannon–Weaver index. The measure was introduced by Claude Shannon in 1948 to quantify the entropy of strings of text, that is, how unpredictable they are. Entropy describes how difficult it is to guess the next letter in a string. If a string contains many different letters that appear in similar proportions, the next letter is harder to predict and the entropy is higher. The Shannon entropy is calculated using the formula:

$$ H' = -\sum_{i=1}^{R} p_i \ln p_i $$

Here, p_i is the proportion of characters belonging to the i-th type of letter in the string. In ecology, p_i is usually the proportion of individuals belonging to the i-th species in the dataset, and the index then quantifies the uncertainty in predicting the species of an individual drawn at random from the dataset.

The logarithm used in the formula can have any base, such as 2, 10, or e. Each base corresponds to a different unit of measurement: bits (base 2), decits (base 10), and nats (base e). To compare Shannon entropy values calculated with different logarithm bases, first convert them to a common base: a value calculated in base a is converted to base b by multiplying it by $\log_b a$.
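The base conversion can be illustrated with a short sketch. The helper names `shannon_entropy` and `convert_base` and the example proportions are assumptions made for illustration.

```python
import math

def shannon_entropy(proportions, base=math.e):
    """Shannon entropy with a chosen logarithm base (default: nats)."""
    return -sum(p * math.log(p, base) for p in proportions if p > 0)

def convert_base(value, from_base, to_base):
    """Convert an entropy between log bases by multiplying by log_to(from)."""
    return value * math.log(from_base, to_base)

p = [0.5, 0.25, 0.25]

h_bits = shannon_entropy(p, base=2)     # 1.5 bits for these proportions
h_nats = shannon_entropy(p)             # the same quantity in nats
h_decits = shannon_entropy(p, base=10)  # the same quantity in decits

print(h_bits, h_nats, h_decits)
print(convert_base(h_bits, 2, math.e))  # agrees with h_nats up to rounding
```

The three values describe the same uncertainty; only the unit changes.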

The Shannon index (H') is related to the weighted geometric mean of the proportional abundances of the types. Specifically, it equals the natural logarithm of true diversity of order q = 1. This follows from rewriting the formula as:

$$ H' = -\sum_{i=1}^{R} p_i \ln p_i = -\sum_{i=1}^{R} \ln p_i^{p_i} $$

This can be further simplified to:

$$ H' = -\ln \prod_{i=1}^{R} p_i^{p_i} = \ln \frac{1}{\prod_{i=1}^{R} p_i^{p_i}} = \ln {}^{1}D $$

Since the p_i values sum to 1, the denominator is the weighted geometric mean of the p_i values, with the p_i values themselves acting as weights. The term inside the logarithm is therefore the true diversity ${}^{1}D$, and H' equals its natural logarithm.

When all types in a dataset are equally common, each p_i value is 1/R, and the Shannon index equals ln(R). If the abundances of types are uneven, the weighted geometric mean of the p_i values increases, and the Shannon entropy decreases. If nearly all individuals belong to one type, and others are very rare, the Shannon entropy approaches zero. When only one type exists in the dataset, the Shannon entropy is exactly zero, as there is no uncertainty in predicting the type of a randomly selected individual.
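The limiting cases described above can be reproduced with a few lines of code. This is a minimal sketch; `shannon_index` is an illustrative name and the four example communities are invented.

```python
import math

def shannon_index(proportions):
    """Shannon index H' using natural logarithms; zero proportions are skipped."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

even = [0.25, 0.25, 0.25, 0.25]
uneven = [0.7, 0.1, 0.1, 0.1]
dominated = [0.997, 0.001, 0.001, 0.001]
single = [1.0]

# Weighted geometric mean of the proportions, weighted by themselves.
geometric_mean = math.prod(p ** p for p in uneven)

print(shannon_index(even), math.log(4))  # equal abundances give ln(R)
print(shannon_index(uneven))             # lower than ln(R)
print(shannon_index(dominated))          # close to zero
print(abs(shannon_index(single)))        # a single type gives zero
print(math.log(1 / geometric_mean))      # matches shannon_index(uneven)
```

The last line checks the relationship between H' and the weighted geometric mean for the uneven community.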

In machine learning, Shannon entropy is the basis of the "information gain" criterion, which measures the reduction in entropy produced by splitting a dataset.

The Rényi entropy generalizes the Shannon entropy to values of q other than 1. It is expressed as:

$$ {}^{q}H = \frac{1}{1-q} \ln \left( \sum_{i=1}^{R} p_i^{q} \right) $$

This can also be written as:

$$ {}^{q}H = \ln \frac{1}{\left( \sum_{i=1}^{R} p_i \, p_i^{q-1} \right)^{1/(q-1)}} = \ln {}^{q}D $$

This means that taking the logarithm of true diversity of any order q gives the Rényi entropy for the same q.
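The identity between Rényi entropy and the logarithm of true diversity can be checked numerically. The sketch below assumes two illustrative helpers, `renyi_entropy` and `hill_number`, and an arbitrary example community.

```python
import math

def renyi_entropy(proportions, q):
    """Rényi entropy of order q in nats; q = 1 falls back to Shannon entropy."""
    p = [x for x in proportions if x > 0]
    if math.isclose(q, 1.0):
        return -sum(x * math.log(x) for x in p)
    return math.log(sum(x ** q for x in p)) / (1.0 - q)

def hill_number(proportions, q):
    """True diversity of order q (illustrative helper)."""
    p = [x for x in proportions if x > 0]
    if math.isclose(q, 1.0):
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

community = [0.5, 0.3, 0.2]

for q in (0, 0.5, 1, 2, 5):
    entropy = renyi_entropy(community, q)
    log_diversity = math.log(hill_number(community, q))
    print(f"q = {q}: entropy = {entropy:.6f}, ln(D) = {log_diversity:.6f}")
```

The two columns agree for every order, including q = 1, where both reduce to the Shannon entropy.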

Simpson index

The Simpson index was introduced in 1949 by Edward H. Simpson to measure how concentrated individuals are when classified into types. The same index was rediscovered in 1950 by Orris C. Herfindahl, and its square root had already been used in 1945 by Albert O. Hirschman. As a result, the measure is called the Simpson index in ecology and the Herfindahl index or the Herfindahl–Hirschman index (HHI) in economics.

The Simpson index calculates the chance that two randomly chosen items from a dataset belong to the same type. The formula is:

$$ \lambda = \sum_{i=1}^{R} p_i^{2} $$

where R is the total number of types in the dataset, and p_i is the proportional abundance of the i-th type. This formula is also the weighted arithmetic mean of the proportional abundances, with the proportional abundances themselves acting as the weights. λ is at most 1, which occurs when only one type is present, and at least 1/R, which occurs when all types are equally common.

Comparing this formula with the one for true diversity shows that 1/λ equals ${}^{2}D$, the true diversity of order q = 2. The original Simpson index is therefore the basic sum for q = 2.

The Simpson index assumes that items are sampled with replacement, meaning the same item can be chosen twice. In large datasets, sampling without replacement (so that the two items must be distinct) gives nearly the same result. In small datasets, however, the difference can be substantial. If sampling without replacement is assumed, the probability of selecting two items of the same type is:

$$ \ell = \frac{\sum_{i=1}^{R} n_i (n_i - 1)}{N (N - 1)} $$

where n_i is the number of items in the i-th type, and N is the total number of items. This version is called the Hunter–Gaston index in microbiology.
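The difference between the two sampling assumptions can be illustrated from raw counts. This is a minimal sketch; the function names and the count data are assumptions made for illustration, and the second function requires at least two items in total.

```python
def simpson_with_replacement(counts):
    """lambda: sum of squared proportions."""
    total = sum(counts)
    return sum((n / total) ** 2 for n in counts)

def simpson_without_replacement(counts):
    """l: probability that two distinct items are of the same type.

    Requires at least two items in total.
    """
    total = sum(counts)
    return sum(n * (n - 1) for n in counts) / (total * (total - 1))

small = [5, 3, 2]           # 10 items in three types
large = [5000, 3000, 2000]  # the same proportions with 10,000 items

print(simpson_with_replacement(small))     # about 0.38
print(simpson_without_replacement(small))  # 28/90, about 0.311
print(simpson_with_replacement(large))     # about 0.38
print(simpson_without_replacement(large))  # about 0.3799
```

With 10 items the two versions differ noticeably, while with 10,000 items they are almost identical.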

The Simpson index (λ) becomes smaller in highly diverse datasets and larger in less diverse ones. This is the opposite of what is expected for a diversity measure, so other versions of the index are often used. These include the inverse Simpson index (1/λ) and the Gini–Simpson index (1 − λ). Both are sometimes called the Simpson index in ecology, so care must be taken to avoid confusion.

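The inverse Simpson index equals:

$$ \frac{1}{\lambda} = \frac{1}{\sum_{i=1}^{R} p_i^{2}} = {}^{2}D $$

This is the true diversity of order 2, that is, the effective number of types obtained when the weighted arithmetic mean is used to quantify the average proportional abundance of types.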

The inverse Simpson index is also used in political science to measure the effective number of political parties.

The original Simpson index (λ) is the probability that two items selected at random with replacement are of the same type. Its transformation 1 − λ is therefore the probability that the two items are of different types. This measure is known in ecology as the probability of interspecific encounter (PIE) or the Gini–Simpson index, and in machine learning as Gini impurity or Gini's diversity index. It can be expressed as:

$$ 1 - \lambda = 1 - \sum_{i=1}^{R} p_i^{2} = 1 - \frac{1}{{}^{2}D} $$

The Gibbs–Martin index, used in sociology, psychology, and management studies, is the same as the Gini–Simpson index.

In population genetics, this measure is also known as expected heterozygosity.
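The three Simpson-based measures can be computed side by side. The following sketch uses illustrative names and made-up proportions; it also checks the interpretation of 1 − λ by summing the probabilities of all ordered pairs of different types.

```python
def simpson_index(proportions):
    """lambda: probability that two items drawn with replacement share a type."""
    return sum(p * p for p in proportions)

community = [0.5, 0.3, 0.2]

lam = simpson_index(community)  # about 0.38
inverse_simpson = 1.0 / lam     # effective number of types of order 2, about 2.63
gini_simpson = 1.0 - lam        # about 0.62

# Probability that two draws with replacement give different types,
# summed over all ordered pairs of distinct types.
different_types = sum(
    p_i * p_j
    for i, p_i in enumerate(community)
    for j, p_j in enumerate(community)
    if i != j
)

print(lam, inverse_simpson, gini_simpson, different_types)
```

The Gini–Simpson value and the pairwise sum agree, and the inverse Simpson index stays between 1 and the number of types.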

Berger–Parker index

The Berger–Parker index, named after Wolfgang H. Berger and Frances Lawrence Parker, is the largest p_i value in the dataset, that is, the proportional abundance of the most common type. It equals the limit of the weighted generalized mean of the p_i values as q approaches infinity. The Berger–Parker index is therefore the inverse of the true diversity of order infinity, $1/{}^{\infty}D$.
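The connection to true diversity of very high order can be seen numerically. This is a rough sketch; `berger_parker` and `hill_number` are illustrative names, and the orders are kept moderate because very large powers of small proportions underflow in floating-point arithmetic.

```python
import math

def berger_parker(proportions):
    """Proportional abundance of the most common type."""
    return max(proportions)

def hill_number(proportions, q):
    """True diversity of order q (illustrative helper)."""
    p = [x for x in proportions if x > 0]
    if math.isclose(q, 1.0):
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

community = [0.5, 0.3, 0.2]

print(berger_parker(community))  # 0.5
for q in (2, 10, 100):
    # The inverse of true diversity approaches the Berger–Parker index.
    print(q, 1.0 / hill_number(community, q))
```

As q grows, the inverse of the true diversity moves from 0.38 at q = 2 toward the Berger–Parker value of 0.5.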
