cape.statutils: Statistics tools

This module provides shorthand calls to statistical functions from scipy.stats. Its primary tool calculates coverage ranges at 99% (or any other fraction) for two data sets.

This module depends on scipy.stats from the SciPy package. To ensure that this package is installed, even without root privileges on your system, run

pip install --user --upgrade scipy
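After installing, a quick check can confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name has_module is hypothetical and is not part of cape.statutils.

```python
import importlib.util


def has_module(name):
    """Return True if the top-level module `name` can be located."""
    # find_spec returns None when no importable module of that name exists
    return importlib.util.find_spec(name) is not None


if not has_module("scipy"):
    print("scipy not found; try: pip install --user --upgrade scipy")
```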

This module is not a general-purpose statistical toolkit that wraps a complete package like scipy.stats. Instead, it provides a small set of tools that are often needed when handling data for aerosciences databases but are rarely found in standard statistics libraries.

cape.statutils.check_outliers(dx, cov=None, **kw)

Find outliers in a data set

Call:
>>> I = check_outliers(dx, cov, **kw)
Inputs:
dx: np.ndarray[float]

Array of signed deltas

cov, Coverage: {None} | 0 < float < 1

Strict coverage fraction

ksig, CoverageSigma: {None} | float

Number of standard deviations to cover (default based on cov; user must supply either cov or ksig or both)

cdf, CoverageCDF: {cov} | 0 < float < 1

Fraction to use to define ksig

osig, OutlierSigma: {1.5*ksig} | float

Multiple of standard deviation to identify outliers

Outputs:
I: np.ndarray[bool]

Flags for non-outlier cases, False if case is an outlier

Versions:
  • 2019-02-04 @ddalle: Version 1.0

  • 2021-09-20 @ddalle: Version 1.1
    • use _parse_options()

    • allow 100% coverage
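The relationship between cov, ksig, and osig described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in cape.statutils: it uses a Gaussian multiplier from the standard library in place of scipy.stats, uses the population standard deviation, and requires cov < 1. The function name flag_non_outliers is hypothetical.

```python
from statistics import NormalDist, fmean, pstdev


def flag_non_outliers(dx, cov=None, ksig=None, osig=None):
    """Return True for each entry kept, False for each suspected outlier."""
    if ksig is None:
        if cov is None:
            raise ValueError("supply cov, ksig, or both")
        # Assumption: two-sided Gaussian multiplier covering fraction cov
        ksig = NormalDist().inv_cdf(0.5 + 0.5 * cov)
    if osig is None:
        # Documented default: 1.5 times the coverage multiplier
        osig = 1.5 * ksig
    mu = fmean(dx)
    # Assumption: population standard deviation about the sample mean
    sig = pstdev(dx)
    return [abs(x - mu) <= osig * sig for x in dx]


# Twenty small deltas and one large one
dx = [0.1, -0.1] * 10 + [5.0]
flags = flag_non_outliers(dx, cov=0.95)
print(flags.count(False))
```

In this sample only the final entry falls outside osig standard deviations of the mean, so it is the only one flagged False.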

cape.statutils.check_outliers_range(R, cov=None, **kw)

Find outliers in an array of ranges

Call:
>>> I = check_outliers_range(R, cov, **kw)
Inputs:
R: np.ndarray[float]

Array of ranges (unsigned deltas)

cov, Coverage: {None} | 0 < float < 1

Strict coverage fraction

ksig, CoverageSigma: {None} | float

Number of standard deviations to cover (default based on cov; user must supply either cov or ksig or both)

cdf, CoverageCDF: {cov} | 0 < float < 1

Fraction to use to define ksig

osig, OutlierSigma: {1.5*ksig} | float

Multiple of standard deviation to identify outliers

Outputs:
I: np.ndarray[bool]

Flags for non-outlier cases, False if case is an outlier

Versions:
  • 2021-02-20 @ddalle: Version 1.0

cape.statutils.get_cov_interval(dx, cov=None, **kw)

Calculate Student’s t-distribution confidence range

If the nominal application of the Student’s t-distribution fails to cover a high enough fraction of the data, the bounds are extended until cov (user-defined fraction) of the data is covered.

Call:
>>> a, b = get_cov_interval(dx, cov, **kw)
Inputs:
dx: np.ndarray[float]

Array of signed deltas

cov, Coverage: {None} | 0 < float < 1

Strict coverage fraction

ksig, CoverageSigma: {None} | float

Number of standard deviations to cover (default based on cov; user must supply either cov or ksig or both)

cdf, CoverageCDF: {cov} | 0 < float < 1

Fraction to use to define ksig

osig, OutlierSigma: {1.5*ksig} | float

Multiple of standard deviation to identify outliers

Outputs:
a: float

Lower bound of coverage interval

b: float

Upper bound of coverage interval

Versions:
  • 2019-02-04 @ddalle: Version 1.0

  • 2021-09-20 @ddalle: Version 1.1
    • use _parse_options()

    • allow 100% coverage

    • remove confusing kcov vs ksig scaling
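The "extend until covered" behavior described for this function can be sketched as below. This is a rough illustration only: it swaps in a Gaussian multiplier for the Student's t-distribution so that it runs without scipy, and it assumes the interval is symmetric about the sample mean, which the documentation above does not state. The name coverage_interval is hypothetical.

```python
import math
from statistics import NormalDist, fmean, stdev


def coverage_interval(dx, cov, cdf=None):
    """Interval (a, b) about the mean containing at least cov of dx."""
    cdf = cov if cdf is None else cdf
    # Nominal half-width from a Gaussian multiplier (stand-in for Student's t)
    ksig = NormalDist().inv_cdf(0.5 + 0.5 * cdf)
    mu = fmean(dx)
    width = ksig * stdev(dx)
    # Number of points that must fall inside the interval
    need = math.ceil(cov * len(dx))
    # The need-th smallest distance from the mean is the smallest
    # half-width that covers the required number of points
    dist = sorted(abs(x - mu) for x in dx)
    width = max(width, dist[need - 1])
    return mu - width, mu + width


# Bimodal sample: the nominal interval covers nothing, so it is extended
dx = [-1.0] * 50 + [1.0] * 50
a, b = coverage_interval(dx, 0.5)
print(a, b)
```

For the bimodal sample the nominal half-width is less than 1, so the bounds are pushed out to the data at -1.0 and 1.0.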

cape.statutils.get_coverage(dx, cov=None, **kw)

Calculate Student’s t-distribution confidence range

If the nominal application of the Student’s t-distribution fails to cover a high enough fraction of the data, the bounds are extended until cov (user-defined fraction) of the data is covered.

Call:
>>> width = get_coverage(dx, cov, **kw)
Inputs:
dx: np.ndarray[float]

Array of signed deltas

cov, Coverage: {None} | 0 < float < 1

Strict coverage fraction

ksig, CoverageSigma: {None} | float

Number of standard deviations to cover (default based on cov; user must supply either cov or ksig or both)

cdf, CoverageCDF: {cov} | 0 < float < 1

Fraction to use to define ksig

osig, OutlierSigma: {1.5*ksig} | float

Multiple of standard deviation to identify outliers

Outputs:
width: float

Half-width of confidence region

Versions:
  • 2019-02-04 @ddalle: Version 1.0

  • 2021-09-20 @ddalle: Version 1.1
    • use _parse_options()

    • allow 100% coverage

    • remove confusing kcov vs ksig scaling

cape.statutils.get_ordered_lower(V, cov)

Calculate value less than fraction cov of V’s values

Call:
>>> v = get_ordered_lower(V, cov)
Inputs:
V: np.ndarray[float]

Array of scalar values

cov: float

Coverage fraction, 0 < cov <= 1

Outputs:
v: float

Value such that cov*V.size entries in V are greater than or equal to v; may be interpolated between sorted values of V

Versions:
  • 2021-09-30 @ddalle: Version 1.0
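One way to picture the interpolation between sorted values is the sketch below. The indexing convention (a fractional position counted down from the top of the sorted list) is an assumption made for illustration and may differ from what get_ordered_lower actually does; the name ordered_lower is hypothetical.

```python
import math


def ordered_lower(V, cov):
    """Value v such that about cov*len(V) entries of V are >= v."""
    U = sorted(V)
    n = len(U)
    # Fractional 0-based position, clamped to the valid index range
    p = min(max(n - cov * n, 0.0), n - 1.0)
    i = math.floor(p)
    j = min(i + 1, n - 1)
    # Linear interpolation between neighboring sorted values
    return U[i] + (p - i) * (U[j] - U[i])


V = [5.0, 1.0, 4.0, 2.0, 3.0]
print(ordered_lower(V, 1.0))  # smallest value; every entry is >= it
print(ordered_lower(V, 0.5))  # halfway between 3.0 and 4.0
```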

cape.statutils.get_ordered_stats(V, cov=None, onesided=False, **kw)

Calculate coverage using ordered statistics

Call:
>>> vmin, vmax = get_ordered_stats(V, cov)
>>> vmin, vmax = get_ordered_stats(V, **kw)
>>> vlim = get_ordered_stats(V, cov, onesided=True)
>>> vlim = get_ordered_stats(V, onesided=True, **kw)
Inputs:
V: np.ndarray[float]

Array of scalar values

cov: float

Coverage fraction, 0 < cov <= 1

onesided: True | {False}

Option to find coverage of one-sided distribution

ksig: {None} | float

Option to calculate cov based on Gaussian distribution

tsig: {None} | float

Option to calculate cov based on Student’s t-distribution

Outputs:
vmin: float

Lower limit of two-sided coverage interval

vmax: float

Upper limit of two-sided coverage interval

vlim: float

Upper limit of one-sided coverage interval

Versions:
  • 2021-09-30 @ddalle: Version 1.0
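The difference between the two-sided and one-sided calls, and the conversion of ksig to a coverage fraction, can be sketched as follows. This is a hedged illustration, not the cape.statutils code: the index convention is assumed, the two-sided case is assumed to leave equal fractions in each tail, and the tsig option is omitted because the Student's t-distribution is not available in the standard library. The names ksig_to_cov and ordered_limits are hypothetical.

```python
import math


def ksig_to_cov(ksig):
    """Fraction of a Gaussian within +/- ksig standard deviations."""
    return math.erf(ksig / math.sqrt(2.0))


def _interp(U, p):
    """Linearly interpolate sorted list U at fractional index p."""
    n = len(U)
    p = min(max(p, 0.0), n - 1.0)
    i = math.floor(p)
    j = min(i + 1, n - 1)
    return U[i] + (p - i) * (U[j] - U[i])


def ordered_limits(V, cov=None, onesided=False, ksig=None):
    if cov is None:
        if ksig is None:
            raise ValueError("supply cov or ksig")
        cov = ksig_to_cov(ksig)
    U = sorted(V)
    n = len(U)
    if onesided:
        # Upper limit with fraction cov of the entries at or below it
        return _interp(U, cov * n - 1.0)
    # Two-sided: leave (1 - cov)/2 of the entries in each tail
    c = 0.5 * (1.0 + cov)
    return _interp(U, n - c * n), _interp(U, c * n - 1.0)


V = [float(i) for i in range(1, 101)]
vmin, vmax = ordered_limits(V, 0.9)
vlim = ordered_limits(V, 0.9, onesided=True)
print(vmin, vmax, vlim)
```

With the integers 1 through 100, the two-sided limits land at about 6 and 95 (90 entries inside), and the one-sided limit lands at about 90.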

cape.statutils.get_ordered_upper(V, cov)

Calculate value greater than fraction cov of V’s values

Call:
>>> v = get_ordered_upper(V, cov)
Inputs:
V: np.ndarray[float]

Array of scalar values

cov: float

Coverage fraction, 0 < cov <= 1

Outputs:
v: float

Value such that cov*V.size entries in V are less than or equal to v; may be interpolated between sorted values of V

Versions:
  • 2021-09-30 @ddalle: Version 1.0
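The defining property in the Outputs description can be checked directly by counting entries. The sketch below assumes a particular fractional-index convention, which may not match get_ordered_upper exactly; the name ordered_upper is hypothetical.

```python
import math


def ordered_upper(V, cov):
    """Value v such that about cov*len(V) entries of V are <= v."""
    U = sorted(V)
    n = len(U)
    # Fractional 0-based position, clamped to the valid index range
    p = min(max(cov * n - 1.0, 0.0), n - 1.0)
    i = math.floor(p)
    j = min(i + 1, n - 1)
    # Linear interpolation between neighboring sorted values
    return U[i] + (p - i) * (U[j] - U[i])


V = [5.0, 1.0, 4.0, 2.0, 3.0]
v = ordered_upper(V, 0.8)
# Count the entries at or below the returned value
count = sum(x <= v for x in V)
print(v, count)
```

Here cov*V.size is 4, and four of the five entries are at or below the returned value.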

cape.statutils.get_range(R, cov=None, **kw)

Calculate Student’s t-distribution confidence range

If the nominal application of the Student’s t-distribution fails to cover a high enough fraction of the data, the bounds are extended until cov (user-defined fraction) of the data is covered.

Call:
>>> width = get_range(R, cov, **kw)
Inputs:
R: np.ndarray[float]

Array of ranges (absolute values of deltas)

cov, Coverage: {None} | 0 < float < 1

Strict coverage fraction

ksig, CoverageSigma: {None} | float

Number of standard deviations to cover (default based on cov; user must supply either cov or ksig or both)

cdf, CoverageCDF: {cov} | 0 < float < 1

Fraction to use to define ksig

osig, OutlierSigma: {1.5*ksig} | float

Multiple of standard deviation to identify outliers

Outputs:
width: float

Half-width of confidence region

Versions:
  • 2018-09-28 @ddalle: Version 1.0

  • 2021-09-20 @ddalle: Version 1.1
    • use _parse_options()

    • allow 100% coverage

    • remove confusing kcov vs ksig scaling
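For unsigned ranges, a comparable sketch is shown below. It rests on assumptions not stated in this documentation: the underlying deltas are treated as zero-mean, so the root-mean-square of R serves as the standard deviation estimate, and a Gaussian multiplier stands in for the Student's t-distribution. The name range_width is hypothetical, and this is not the cape.statutils implementation.

```python
import math
from statistics import NormalDist


def range_width(R, cov):
    """Half-width that is >= at least fraction cov of the ranges R."""
    # Assumption: RMS of the ranges estimates sigma of zero-mean deltas
    sig = math.sqrt(sum(r * r for r in R) / len(R))
    # Gaussian multiplier (stand-in for Student's t)
    ksig = NormalDist().inv_cdf(0.5 + 0.5 * cov)
    width = ksig * sig
    # Extend the width, if needed, until enough ranges are covered
    need = math.ceil(cov * len(R))
    return max(width, sorted(R)[need - 1])


R = [0.1 * i for i in range(1, 21)]
w = range_width(R, 0.9)
print(w, sum(r <= w for r in R))
```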