Statistics

lazygrid.statistics

lazygrid.statistics.confidence_interval_mean_t(x: numpy.ndarray, cl: float = 0.05) → List

Compute the confidence interval of the mean from sample data.

Parameters:
  • x – Sample
  • cl – Confidence level
Returns:

confidence interval

Return type:

List

Examples

>>> import numpy as np
>>> import lazygrid as lg
>>>
>>> np.random.seed(42)
>>> x = np.random.normal(loc=0, scale=2, size=10)
>>> confidence_level = 0.05
>>>
>>> lg.statistics.confidence_interval_mean_t(x, confidence_level)
[-0.13829578539063092, 1]

Notes

You should use the t distribution rather than the normal distribution when the population variance is unknown and has to be estimated from the sample data.

When the sample size is large, say 100 or above, the t distribution is very similar to the standard normal distribution. With smaller sample sizes, however, the t distribution is leptokurtic: it has relatively more mass in its tails than the normal distribution. As a result, the interval has to extend farther from the mean to contain a given proportion of the area.
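
For intuition, the same kind of interval can be sketched with scipy directly. This is a minimal illustration of the note above, not lazygrid's implementation, and the variable names are mine:

>>> import numpy as np
>>> from scipy import stats
>>>
>>> np.random.seed(42)
>>> x = np.random.normal(loc=0, scale=2, size=10)
>>> n, mean, sem = x.size, x.mean(), stats.sem(x)
>>>
>>> # t-based interval: population variance estimated from the sample
>>> t_lo, t_hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
>>> # normal-based interval: assumes the variance is known
>>> z_lo, z_hi = stats.norm.interval(0.95, loc=mean, scale=sem)
>>>
>>> (t_hi - t_lo) > (z_hi - z_lo)  # the t interval is wider for small n
True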

lazygrid.statistics.find_best_solution(solutions: list, test: Callable = scipy.stats.mannwhitneyu, alpha: float = 0.05, **kwargs) → Tuple[int, list, list]

Find the best solution in a list of candidates, according to a statistical test and a significance level (alpha).

The best solution is defined as the one having the highest mean value.

Parameters:
  • solutions – List of candidate solutions
  • test – Statistical test
  • alpha – Significance level
  • kwargs – Keyword arguments required by the statistical test
Returns:

  • the position of the best solution inside the input list of candidates;
  • the positions of the solutions that are not statistically separable from the best one;
  • the list of p-values returned by the statistical test when comparing the best solution to each candidate

Return type:

Tuple
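
For intuition, the selection logic described above can be sketched as follows. This is a hypothetical reimplementation, assuming the test callable follows scipy's convention of returning a (statistic, pvalue) pair; it is not lazygrid's actual code:

>>> import numpy as np
>>> from scipy.stats import mannwhitneyu
>>>
>>> def find_best_solution_sketch(solutions, test=mannwhitneyu, alpha=0.05, **kwargs):
...     # the best solution is the one with the highest mean value
...     best_idx = int(np.argmax([np.mean(s) for s in solutions]))
...     best_solutions_idx, pvalues = [], []
...     for idx, candidate in enumerate(solutions):
...         _, pvalue = test(solutions[best_idx], candidate, **kwargs)
...         pvalues.append(pvalue)
...         if pvalue > alpha:  # not separable from the best at level alpha
...             best_solutions_idx.append(idx)
...     return best_idx, best_solutions_idx, pvalues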

Examples

>>> from sklearn.linear_model import LogisticRegression, RidgeClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import cross_val_score
>>> import lazygrid as lg
>>>
>>> x, y = make_classification(random_state=42)
>>>
>>> model1 = LogisticRegression(random_state=42)
>>> model2 = RandomForestClassifier(random_state=42)
>>> model3 = RidgeClassifier(random_state=42)
>>> model_names = ["LogisticRegression", "RandomForestClassifier", "RidgeClassifier"]
>>>
>>> score1 = cross_val_score(estimator=model1, X=x, y=y, cv=10)
>>> score2 = cross_val_score(estimator=model2, X=x, y=y, cv=10)
>>> score3 = cross_val_score(estimator=model3, X=x, y=y, cv=10)
>>>
>>> scores = [score1, score2, score3]
>>> best_idx, best_solutions_idx, pvalues = lg.statistics.find_best_solution(scores)
>>> model_names[best_idx]
'LogisticRegression'
>>> best_solutions_idx
[0, 2]
>>> pvalues #doctest: +ELLIPSIS
[0.4782..., 0.0360..., 0.1610...]
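
Since test is a generic callable and keyword arguments are forwarded to it, a different scipy test can in principle be swapped in. The call below is a hypothetical usage sketch based on the parameter descriptions above (reusing scores from the example), not an example from the lazygrid documentation:

>>> from scipy.stats import ttest_ind
>>> best_idx, best_solutions_idx, pvalues = lg.statistics.find_best_solution(
...     scores, test=ttest_ind)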