Statistics

lazygrid.statistics

lazygrid.statistics.confidence_interval_mean_t(x: numpy.ndarray, cl: float = 0.05) → List

Compute the confidence interval of the mean from sample data.

Parameters:
  • x – Sample
  • cl – Confidence level
Returns:

confidence interval

Return type:

List

Examples

>>> import numpy as np
>>> import lazygrid as lg
>>>
>>> np.random.seed(42)
>>> x = np.random.normal(loc=0, scale=2, size=10)
>>> confidence_level = 0.05
>>>
>>> lg.statistics.confidence_interval_mean_t(x, confidence_level)
[-0.13829578539063092, 1]

Notes

You should use the t distribution rather than the normal distribution when the population variance is unknown and has to be estimated from the sample data.

When the sample size is large, say 100 or above, the t distribution is very similar to the standard normal distribution. With smaller sample sizes, however, the t distribution is leptokurtic: it has relatively more mass in its tails than the normal distribution. As a result, the interval has to extend farther from the mean to contain a given proportion of the area.
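
For intuition, the same kind of interval can be sketched with scipy directly. This is a minimal illustration of the note above, not lazygrid's implementation, and the variable names are mine:

>>> import numpy as np
>>> from scipy import stats
>>>
>>> np.random.seed(42)
>>> x = np.random.normal(loc=0, scale=2, size=10)
>>> n, mean, sem = x.size, x.mean(), stats.sem(x)
>>>
>>> # t-based interval: population variance estimated from the sample
>>> t_lo, t_hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
>>> # normal-based interval: assumes the variance is known
>>> z_lo, z_hi = stats.norm.interval(0.95, loc=mean, scale=sem)
>>>
>>> (t_hi - t_lo) > (z_hi - z_lo)  # the t interval is wider for small n
True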

lazygrid.statistics.find_best_solution(solutions: list, test: Callable = scipy.stats.mannwhitneyu, alpha: float = 0.05, **kwargs) → Tuple[int, list, list]

Find the best solution in a list of candidates, according to a statistical test and a significance level (alpha).

The best solution is defined as the one having the highest mean value.

Parameters:
  • solutions – List of candidate solutions
  • test – Statistical test
  • alpha – Significance level
  • kwargs – Keyword arguments required by the statistical test
Returns:

  • the position of the best solution inside the input list of candidates;
  • the positions of the solutions that are not statistically separable from the best one;
  • the list of p-values returned by the statistical test when comparing the best solution to each candidate

Return type:

Tuple
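
For intuition, the selection logic described above can be sketched as follows. This is a hypothetical reimplementation, assuming the test callable follows scipy's convention of returning a (statistic, pvalue) pair; it is not lazygrid's actual code:

>>> import numpy as np
>>> from scipy.stats import mannwhitneyu
>>>
>>> def find_best_solution_sketch(solutions, test=mannwhitneyu, alpha=0.05, **kwargs):
...     # the best solution is the one with the highest mean value
...     best_idx = int(np.argmax([np.mean(s) for s in solutions]))
...     best_solutions_idx, pvalues = [], []
...     for idx, candidate in enumerate(solutions):
...         _, pvalue = test(solutions[best_idx], candidate, **kwargs)
...         pvalues.append(pvalue)
...         if pvalue > alpha:  # not separable from the best at level alpha
...             best_solutions_idx.append(idx)
...     return best_idx, best_solutions_idx, pvalues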

Examples

>>> from sklearn.linear_model import LogisticRegression, RidgeClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import cross_val_score
>>> import lazygrid as lg
>>>
>>> x, y = make_classification(random_state=42)
>>>
>>> model1 = LogisticRegression(random_state=42)
>>> model2 = RandomForestClassifier(random_state=42)
>>> model3 = RidgeClassifier(random_state=42)
>>> model_names = ["LogisticRegression", "RandomForestClassifier", "RidgeClassifier"]
>>>
>>> score1 = cross_val_score(estimator=model1, X=x, y=y, cv=10)
>>> score2 = cross_val_score(estimator=model2, X=x, y=y, cv=10)
>>> score3 = cross_val_score(estimator=model3, X=x, y=y, cv=10)
>>>
>>> scores = [score1, score2, score3]
>>> best_idx, best_solutions_idx, pvalues = lg.statistics.find_best_solution(scores)
>>> model_names[best_idx]
'LogisticRegression'
>>> best_solutions_idx
[0, 2]
>>> pvalues #doctest: +ELLIPSIS
[0.4782..., 0.0360..., 0.1610...]
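
Since test is a generic callable and keyword arguments are forwarded to it, a different scipy test can in principle be swapped in. The call below is a hypothetical usage sketch based on the parameter descriptions above (reusing scores from the example), not an example from the lazygrid documentation:

>>> from scipy.stats import ttest_ind
>>> best_idx, best_solutions_idx, pvalues = lg.statistics.find_best_solution(
...     scores, test=ttest_ind)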