In this post we'll take a look at the wine dataset. (I'll skip plain KFold and go straight to StratifiedKFold.)
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split,StratifiedKFold,cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.utils import all_estimators
import warnings
warnings.filterwarnings('ignore')
datasets = load_wine()
x = datasets.data
y = datasets.target.reshape(-1,1)
print(x.shape,y.shape)
(178, 13) (178, 1)
As you can see below, the wine dataset is a classification problem with three classes in total.
print(datasets.feature_names)
print(datasets.DESCR)
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
        - Alcohol
        - Malic acid
        - Ash
        - Alcalinity of ash
        - Magnesium
        - Total phenols
        - Flavanoids
        - Nonflavanoid phenols
        - Proanthocyanins
        - Color intensity
        - Hue
        - OD280/OD315 of diluted wines
        - Proline
        - class:
            - class_0
            - class_1
            - class_2

    :Summary Statistics:

    ============================= ==== ===== ======= =====
                                   Min   Max   Mean     SD
    ============================= ==== ===== ======= =====
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Flavanoids:                   0.34  5.08    2.03  1.00
    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12
    Proanthocyanins:              0.41  3.58    1.59  0.57
    Colour Intensity:              1.3  13.0     5.1   2.3
    Hue:                          0.48  1.71    0.96  0.23
    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71
    Proline:                       278  1680     746   315
    ============================= ==== ===== ======= =====

    :Missing Attribute Values: None
    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of wine.

Original Owners:
Forina, M. et al, PARVUS - An Extendible Package for Data Exploration,
Classification and Correlation. Institute of Pharmaceutical and Food Analysis
and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.

Citation:
Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml].
Irvine, CA: University of California, School of Information and Computer Science.

.. topic:: References

    (1) S. Aeberhard, D. Coomans and O. de Vel,
        Comparison of Classifiers in High Dimensional Settings,
        Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and
        Dept. of Mathematics and Statistics, James Cook University of
        North Queensland. (Also submitted to Technometrics).
        The data was used with many others for comparing various classifiers.
        The classes are separable, though only RDA has achieved 100% correct
        classification.
        (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
        (All results using the leave-one-out technique)

    (2) S. Aeberhard, D. Coomans and O. de Vel,
        "THE CLASSIFICATION PERFORMANCE OF RDA"
        Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and
        Dept. of Mathematics and Statistics, James Cook University of
        North Queensland. (Also submitted to Journal of Chemometrics).
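Since the whole point of stratification is class balance, it's worth confirming the class counts directly. A minimal check (the bincount line is my addition, not part of the original post):

import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
# Samples per class; should print [59 71 48], matching the DESCR above.
print(np.bincount(wine.target))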
columns = datasets.feature_names.copy()  # copy() so we don't mutate the Bunch's own list on re-runs
columns.append("Target(Wine)")
data = np.concatenate([x,y],axis=1)
dataframe = pd.DataFrame(data,columns = columns)
dataframe
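As an aside, the reshape/concatenate step can be skipped entirely; the same frame can be built column by column from the Bunch loaded above (the variable name df is mine):

df = pd.DataFrame(datasets.data, columns=datasets.feature_names)
df["Target(Wine)"] = datasets.target  # no reshape or concatenate needed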
  | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | Target(Wine) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0.0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0.0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0.0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0.0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
173 | 13.71 | 5.65 | 2.45 | 20.5 | 95.0 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740.0 | 2.0 |
174 | 13.40 | 3.91 | 2.48 | 23.0 | 102.0 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750.0 | 2.0 |
175 | 13.27 | 4.28 | 2.26 | 20.0 | 120.0 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835.0 | 2.0 |
176 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840.0 | 2.0 |
177 | 14.13 | 4.10 | 2.74 | 24.5 | 96.0 | 2.05 | 0.76 | 0.56 | 1.35 | 9.20 | 0.61 | 1.60 | 560.0 | 2.0 |
178 rows × 14 columns
datasets = dataframe.values  # note: this reassigns datasets from the sklearn Bunch to a plain ndarray
x = datasets[:,:-1]
y = datasets[:,-1]
print(x.shape,y.shape)
(178, 13) (178,)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)
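One caveat: without random_state this split changes on every run, and without stratify the 59/71/48 class ratio isn't guaranteed to carry over into the test set. A hedged variant if you want reproducibility (the seed 42 is an arbitrary choice of mine):

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42)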
K-fold cross-validation builds K folds of the data and then repeats 1) training and 2) validation K times.
The purpose of cross-validation is to make the evaluation of a model's performance generalize.
It does not directly improve the model's performance.
It does, however, let you find the best-performing settings when combined with hyperparameter tuning.
We use cross_val_score to run the k-fold cross-validation.
Advantages: every sample gets used for both training and validation, and the score no longer hinges on one lucky (or unlucky) split. Because the classes here are unbalanced (59/71/48), StratifiedKFold additionally keeps the class ratio in every fold; see the sketch below.
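Before the full sweep, here is a minimal sketch of StratifiedKFold with cross_val_score on a single model (the choice of RandomForestClassifier is purely for illustration); it reuses x_train/y_train from the split above:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True)
# Each validation fold keeps roughly the same class ratio as the whole set.
for train_idx, val_idx in skf.split(x_train, y_train):
    print(np.bincount(y_train[val_idx].astype(int)))
scores = cross_val_score(RandomForestClassifier(), x_train, y_train, cv=skf)
print(scores, scores.mean())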
kfold = StratifiedKFold(n_splits=5, shuffle=True)
all_Algorithm = all_estimators(type_filter='classifier')
scaler_list = [StandardScaler(), MinMaxScaler()]
best_acc_score = []

for scaler in scaler_list:
    # Fit each scaler on the raw split and transform into new variables;
    # overwriting x_train here would make the second scaler fit on
    # already-standardized data.
    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.transform(x_test)  # kept for the final test-set check below
    for (name, algorithm) in all_Algorithm:
        try:
            score = cross_val_score(algorithm(), x_train_scaled, y_train, cv=kfold)
            print("Model : ", name, "\n Mean Score : ", score.mean(), "\n")
            print(score)
            best_acc_score.append((name, score.mean()))  # append returns None, so no assignment
        except Exception:
            # Some estimators need extra required arguments or reject this data; skip them.
            continue

print("Best Model")
print(max(best_acc_score, key=lambda x: x[1]))
(StandardScaler pass)
Model : AdaBoostClassifier  Mean Score : 0.8322660098522168  [0.68965517 0.79310345 0.89285714 0.89285714 0.89285714]
Model : BaggingClassifier  Mean Score : 0.9504926108374384  [0.96551724 0.96551724 0.89285714 1. 0.92857143]
Model : BernoulliNB  Mean Score : 0.9295566502463055  [0.93103448 0.93103448 0.89285714 1. 0.89285714]
Model : CalibratedClassifierCV  Mean Score : 0.993103448275862  [1. 0.96551724 1. 1. 1.]
Model : CategoricalNB  Mean Score : nan  [nan nan nan nan nan]
Model : ComplementNB  Mean Score : nan  [nan nan nan nan nan]
Model : DecisionTreeClassifier  Mean Score : 0.9366995073891626  [0.93103448 0.93103448 0.92857143 0.92857143 0.96428571]
Model : DummyClassifier  Mean Score : 0.3310344827586207  [0.20689655 0.44827586 0.28571429 0.35714286 0.35714286]
Model : ExtraTreeClassifier  Mean Score : 0.8945812807881774  [0.89655172 0.86206897 0.89285714 0.89285714 0.92857143]
Model : ExtraTreesClassifier  Mean Score : 1.0  [1. 1. 1. 1. 1.]
Model : GaussianNB  Mean Score : 0.9716748768472907  [1. 0.96551724 1. 0.96428571 0.92857143]
Model : GaussianProcessClassifier  Mean Score : 0.9788177339901478  [1. 0.96551724 1. 0.96428571 0.96428571]
Model : GradientBoostingClassifier  Mean Score : 0.9440886699507389  [0.93103448 0.89655172 0.96428571 1. 0.92857143]
Model : HistGradientBoostingClassifier  Mean Score : 0.9714285714285715  [1. 1. 1. 0.85714286 1.]
Model : KNeighborsClassifier  Mean Score : 0.9507389162561577  [0.93103448 0.96551724 0.96428571 0.92857143 0.96428571]
Model : LabelPropagation  Mean Score : 0.9652709359605911  [0.93103448 0.93103448 1. 0.96428571 1.]
Model : LabelSpreading  Mean Score : 0.9509852216748769  [0.96551724 0.89655172 0.96428571 0.92857143 1.]
Model : LinearDiscriminantAnalysis  Mean Score : 0.993103448275862  [0.96551724 1. 1. 1. 1.]
Model : LinearSVC  Mean Score : 0.9928571428571429  [1. 1. 0.96428571 1. 1.]
Model : LogisticRegression  Mean Score : 0.9857142857142858  [1. 1. 0.96428571 0.96428571 1.]
Model : LogisticRegressionCV  Mean Score : 0.9785714285714286  [1. 1. 0.96428571 0.96428571 0.96428571]
Model : MLPClassifier  Mean Score : 0.9857142857142858  [1. 1. 0.96428571 0.96428571 1.]
Model : MultinomialNB  Mean Score : nan  [nan nan nan nan nan]
Model : NearestCentroid  Mean Score : 0.9788177339901478  [1. 0.96551724 0.96428571 0.96428571 1.]
Model : NuSVC  Mean Score : 0.972167487684729  [0.89655172 1. 1. 1. 0.96428571]
Model : PassiveAggressiveClassifier  Mean Score : 0.993103448275862  [0.96551724 1. 1. 1. 1.]
Model : Perceptron  Mean Score : 0.9790640394088669  [0.93103448 1. 0.96428571 1. 1.]
Model : QuadraticDiscriminantAnalysis  Mean Score : 0.9857142857142858  [1. 1. 1. 0.92857143 1.]
Model : RandomForestClassifier  Mean Score : 0.9859605911330049  [0.96551724 1. 1. 1. 0.96428571]
Model : RidgeClassifier  Mean Score : 0.9719211822660098  [1. 0.93103448 1. 1. 0.92857143]
Model : RidgeClassifierCV  Mean Score : 0.993103448275862  [0.96551724 1. 1. 1. 1.]
Model : SGDClassifier  Mean Score : 0.9647783251231526  [0.96551724 0.96551724 0.92857143 1. 0.96428571]
Model : SVC  Mean Score : 0.9785714285714286  [1. 1. 0.96428571 0.96428571 0.96428571]

(MinMaxScaler pass)
Model : AdaBoostClassifier  Mean Score : 0.7411330049261083  [0.5862069 0.65517241 0.96428571 0.92857143 0.57142857]
Model : BaggingClassifier  Mean Score : 0.9576354679802955  [1. 0.93103448 0.96428571 0.92857143 0.96428571]
Model : BernoulliNB  Mean Score : 0.3238916256157635  [0.34482759 0.31034483 0.35714286 0.32142857 0.28571429]
Model : CalibratedClassifierCV  Mean Score : 0.9788177339901478  [1. 0.96551724 0.96428571 0.96428571 1.]
Model : ComplementNB  Mean Score : 0.8662561576354679  [0.86206897 0.86206897 0.82142857 0.82142857 0.96428571]
Model : DecisionTreeClassifier  Mean Score : 0.9152709359605913  [0.89655172 0.96551724 0.85714286 0.96428571 0.89285714]
Model : DummyClassifier  Mean Score : 0.33103448275862063  [0.37931034 0.27586207 0.39285714 0.32142857 0.28571429]
Model : ExtraTreeClassifier  Mean Score : 0.8657635467980296  [1. 0.79310345 0.85714286 0.85714286 0.82142857]
Model : ExtraTreesClassifier  Mean Score : 0.9859605911330049  [1. 0.96551724 0.96428571 1. 1.]
Model : GaussianNB  Mean Score : 0.9857142857142858  [1. 1. 0.96428571 0.96428571 1.]
Model : GaussianProcessClassifier  Mean Score : 0.9859605911330049  [0.96551724 1. 1. 1. 0.96428571]
Model : GradientBoostingClassifier  Mean Score : 0.9435960591133006  [0.96551724 0.93103448 1. 1. 0.82142857]
Model : HistGradientBoostingClassifier  Mean Score : 0.9716748768472907  [0.96551724 1. 0.92857143 0.96428571 1.]
Model : KNeighborsClassifier  Mean Score : 0.9716748768472907  [0.96551724 1. 0.92857143 1. 0.96428571]
Model : LabelPropagation  Mean Score : 0.9788177339901478  [0.96551724 1. 0.96428571 0.96428571 1.]
Model : LabelSpreading  Mean Score : 0.97192118226601  [0.93103448 1. 0.96428571 0.96428571 1.]
Model : LinearDiscriminantAnalysis  Mean Score : 1.0  [1. 1. 1. 1. 1.]
Model : LinearSVC  Mean Score : 0.9857142857142858  [1. 1. 0.92857143 1. 1.]
Model : LogisticRegression  Mean Score : 0.9859605911330049  [1. 0.96551724 1. 0.96428571 1.]
Model : LogisticRegressionCV  Mean Score : 0.9857142857142858  [1. 1. 1. 0.96428571 0.96428571]
Model : MLPClassifier  Mean Score : 0.9719211822660098  [0.96551724 0.96551724 0.96428571 1. 0.96428571]
Model : MultinomialNB  Mean Score : 0.9647783251231526  [1. 0.93103448 1. 0.89285714 1.]
Model : NearestCentroid  Mean Score : 0.9509852216748769  [0.93103448 0.93103448 0.96428571 0.96428571 0.96428571]
Model : NuSVC  Mean Score : 0.9790640394088669  [1. 0.93103448 1. 0.96428571 1.]
Model : PassiveAggressiveClassifier  Mean Score : 0.9857142857142858  [1. 1. 0.92857143 1. 1.]
Model : Perceptron  Mean Score : 0.9928571428571429  [1. 1. 1. 0.96428571 1.]
Model : QuadraticDiscriminantAnalysis  Mean Score : 0.993103448275862  [1. 0.96551724 1. 1. 1.]
Model : RadiusNeighborsClassifier  Mean Score : 0.9438423645320195  [1. 0.86206897 0.96428571 0.89285714 1.]
Model : RandomForestClassifier  Mean Score : 0.9719211822660098  [0.96551724 0.96551724 0.96428571 1. 0.96428571]
Model : RidgeClassifier  Mean Score : 0.993103448275862  [0.96551724 1. 1. 1. 1.]
Model : RidgeClassifierCV  Mean Score : 1.0  [1. 1. 1. 1. 1.]
Model : SGDClassifier  Mean Score : 0.9790640394088669  [0.93103448 1. 1. 1. 0.96428571]
Model : SVC  Mean Score : 0.9862068965517242  [0.93103448 1. 1. 1. 1.]

Best Model
('ExtraTreesClassifier', 1.0)
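accuracy_score was imported at the top but never used; to close the loop, one could refit the winning model and score it on the held-out test set. A sketch, assuming we pick ExtraTreesClassifier (the best mean CV score) and reuse the last scaled arrays from the loop above; the exact number will vary with the random split:

from sklearn.ensemble import ExtraTreesClassifier

best_model = ExtraTreesClassifier()
best_model.fit(x_train_scaled, y_train)     # x_train_scaled: MinMax-scaled train data from the final loop pass
y_pred = best_model.predict(x_test_scaled)  # the matching scaled test data
print(accuracy_score(y_test, y_pred))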