In this post we'll take a look at the wine dataset. (I'll skip plain KFold and go straight to StratifiedKFold.)
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split,StratifiedKFold,cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.utils import all_estimators
import warnings
warnings.filterwarnings('ignore')
datasets = load_wine()
x = datasets.data
y = datasets.target.reshape(-1,1)
print(x.shape,y.shape)
(178, 13) (178, 1)
As you can see below, the wine dataset is a classification problem with three classes in total.
print(datasets.feature_names)
print(datasets.DESCR)
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
        - Alcohol
        - Malic acid
        - Ash
        - Alcalinity of ash
        - Magnesium
        - Total phenols
        - Flavanoids
        - Nonflavanoid phenols
        - Proanthocyanins
        - Color intensity
        - Hue
        - OD280/OD315 of diluted wines
        - Proline
        - class:
            - class_0
            - class_1
            - class_2

    :Summary Statistics:

    ============================= ==== ===== ======= =====
                                   Min   Max   Mean     SD
    ============================= ==== ===== ======= =====
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Flavanoids:                   0.34  5.08    2.03  1.00
    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12
    Proanthocyanins:              0.41  3.58    1.59  0.57
    Colour Intensity:              1.3  13.0     5.1   2.3
    Hue:                          0.48  1.71    0.96  0.23
    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71
    Proline:                       278  1680     746   315
    ============================= ==== ===== ======= =====

    :Missing Attribute Values: None
    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of wine.

Original Owners:
Forina, M. et al, PARVUS - An Extendible Package for Data Exploration,
Classification and Correlation. Institute of Pharmaceutical and Food Analysis
and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.

Citation:
Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml].
Irvine, CA: University of California, School of Information and Computer Science.

.. topic:: References

    (1) S. Aeberhard, D. Coomans and O. de Vel,
        Comparison of Classifiers in High Dimensional Settings,
        Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and
        Dept. of Mathematics and Statistics, James Cook University of
        North Queensland. (Also submitted to Technometrics).
        The data was used with many others for comparing various classifiers.
        The classes are separable, though only RDA has achieved 100% correct
        classification.
        (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
        (All results using the leave-one-out technique)

    (2) S. Aeberhard, D. Coomans and O. de Vel,
        "THE CLASSIFICATION PERFORMANCE OF RDA"
        Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and
        Dept. of Mathematics and Statistics, James Cook University of
        North Queensland. (Also submitted to Journal of Chemometrics).
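Since the whole point of stratification is class balance, it's worth confirming the class counts directly. A minimal check (the bincount line is my addition, not part of the original post):

import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
# Samples per class; should print [59 71 48], matching the DESCR above.
print(np.bincount(wine.target))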
columns = datasets.feature_names.copy()  # copy() so we don't mutate the Bunch's own list on re-runs
columns.append("Target(Wine)")
data = np.concatenate([x,y],axis=1)
dataframe = pd.DataFrame(data,columns = columns)
dataframe
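As an aside, the reshape/concatenate step can be skipped entirely; the same frame can be built column by column from the Bunch loaded above (the variable name df is mine):

df = pd.DataFrame(datasets.data, columns=datasets.feature_names)
df["Target(Wine)"] = datasets.target  # no reshape or concatenate needed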
  | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | Target(Wine) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0.0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0.0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0.0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0.0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
173 | 13.71 | 5.65 | 2.45 | 20.5 | 95.0 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740.0 | 2.0 |
174 | 13.40 | 3.91 | 2.48 | 23.0 | 102.0 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750.0 | 2.0 |
175 | 13.27 | 4.28 | 2.26 | 20.0 | 120.0 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835.0 | 2.0 |
176 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840.0 | 2.0 |
177 | 14.13 | 4.10 | 2.74 | 24.5 | 96.0 | 2.05 | 0.76 | 0.56 | 1.35 | 9.20 | 0.61 | 1.60 | 560.0 | 2.0 |
178 rows × 14 columns
datasets = dataframe.values  # note: this reassigns datasets from the sklearn Bunch to a plain ndarray
x = datasets[:,:-1]
y = datasets[:,-1]
print(x.shape,y.shape)
(178, 13) (178,)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)
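One caveat: without random_state this split changes on every run, and without stratify the 59/71/48 class ratio isn't guaranteed to carry over into the test set. A hedged variant if you want reproducibility (the seed 42 is an arbitrary choice of mine):

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42)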
K-fold cross-validation builds K folds of the data and then repeats 1) training and 2) validation K times.
The purpose of cross-validation is to make the evaluation of a model's performance generalize.
It does not directly improve the model's performance.
It does, however, let you find the best-performing settings when combined with hyperparameter tuning.
We use cross_val_score to run the k-fold cross-validation.
Advantages: every sample gets used for both training and validation, and the score no longer hinges on one lucky (or unlucky) split. Because the classes here are unbalanced (59/71/48), StratifiedKFold additionally keeps the class ratio in every fold; see the sketch below.
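Before the full sweep, here is a minimal sketch of StratifiedKFold with cross_val_score on a single model (the choice of RandomForestClassifier is purely for illustration); it reuses x_train/y_train from the split above:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True)
# Each validation fold keeps roughly the same class ratio as the whole set.
for train_idx, val_idx in skf.split(x_train, y_train):
    print(np.bincount(y_train[val_idx].astype(int)))
scores = cross_val_score(RandomForestClassifier(), x_train, y_train, cv=skf)
print(scores, scores.mean())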
kfold = StratifiedKFold(n_splits=5, shuffle=True)
all_Algorithm = all_estimators(type_filter='classifier')
scaler_list = [StandardScaler(), MinMaxScaler()]
best_acc_score = []

for scaler in scaler_list:
    # Fit each scaler on the raw split and transform into new variables;
    # overwriting x_train here would make the second scaler fit on
    # already-standardized data.
    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.transform(x_test)  # kept for the final test-set check below
    for (name, algorithm) in all_Algorithm:
        try:
            score = cross_val_score(algorithm(), x_train_scaled, y_train, cv=kfold)
            print("Model : ", name, "\n Mean Score : ", score.mean(), "\n")
            print(score)
            best_acc_score.append((name, score.mean()))  # append returns None, so no assignment
        except Exception:
            # Some estimators need extra required arguments or reject this data; skip them.
            continue

print("Best Model")
print(max(best_acc_score, key=lambda x: x[1]))
(StandardScaler pass)
Model : AdaBoostClassifier  Mean Score : 0.8322660098522168  [0.68965517 0.79310345 0.89285714 0.89285714 0.89285714]
Model : BaggingClassifier  Mean Score : 0.9504926108374384  [0.96551724 0.96551724 0.89285714 1. 0.92857143]
Model : BernoulliNB  Mean Score : 0.9295566502463055  [0.93103448 0.93103448 0.89285714 1. 0.89285714]
Model : CalibratedClassifierCV  Mean Score : 0.993103448275862  [1. 0.96551724 1. 1. 1.]
Model : CategoricalNB  Mean Score : nan  [nan nan nan nan nan]
Model : ComplementNB  Mean Score : nan  [nan nan nan nan nan]
Model : DecisionTreeClassifier  Mean Score : 0.9366995073891626  [0.93103448 0.93103448 0.92857143 0.92857143 0.96428571]
Model : DummyClassifier  Mean Score : 0.3310344827586207  [0.20689655 0.44827586 0.28571429 0.35714286 0.35714286]
Model : ExtraTreeClassifier  Mean Score : 0.8945812807881774  [0.89655172 0.86206897 0.89285714 0.89285714 0.92857143]
Model : ExtraTreesClassifier  Mean Score : 1.0  [1. 1. 1. 1. 1.]
Model : GaussianNB  Mean Score : 0.9716748768472907  [1. 0.96551724 1. 0.96428571 0.92857143]
Model : GaussianProcessClassifier  Mean Score : 0.9788177339901478  [1. 0.96551724 1. 0.96428571 0.96428571]
Model : GradientBoostingClassifier  Mean Score : 0.9440886699507389  [0.93103448 0.89655172 0.96428571 1. 0.92857143]
Model : HistGradientBoostingClassifier  Mean Score : 0.9714285714285715  [1. 1. 1. 0.85714286 1.]
Model : KNeighborsClassifier  Mean Score : 0.9507389162561577  [0.93103448 0.96551724 0.96428571 0.92857143 0.96428571]
Model : LabelPropagation  Mean Score : 0.9652709359605911  [0.93103448 0.93103448 1. 0.96428571 1.]
Model : LabelSpreading  Mean Score : 0.9509852216748769  [0.96551724 0.89655172 0.96428571 0.92857143 1.]
Model : LinearDiscriminantAnalysis  Mean Score : 0.993103448275862  [0.96551724 1. 1. 1. 1.]
Model : LinearSVC  Mean Score : 0.9928571428571429  [1. 1. 0.96428571 1. 1.]
Model : LogisticRegression  Mean Score : 0.9857142857142858  [1. 1. 0.96428571 0.96428571 1.]
Model : LogisticRegressionCV  Mean Score : 0.9785714285714286  [1. 1. 0.96428571 0.96428571 0.96428571]
Model : MLPClassifier  Mean Score : 0.9857142857142858  [1. 1. 0.96428571 0.96428571 1.]
Model : MultinomialNB  Mean Score : nan  [nan nan nan nan nan]
Model : NearestCentroid  Mean Score : 0.9788177339901478  [1. 0.96551724 0.96428571 0.96428571 1.]
Model : NuSVC  Mean Score : 0.972167487684729  [0.89655172 1. 1. 1. 0.96428571]
Model : PassiveAggressiveClassifier  Mean Score : 0.993103448275862  [0.96551724 1. 1. 1. 1.]
Model : Perceptron  Mean Score : 0.9790640394088669  [0.93103448 1. 0.96428571 1. 1.]
Model : QuadraticDiscriminantAnalysis  Mean Score : 0.9857142857142858  [1. 1. 1. 0.92857143 1.]
Model : RandomForestClassifier  Mean Score : 0.9859605911330049  [0.96551724 1. 1. 1. 0.96428571]
Model : RidgeClassifier  Mean Score : 0.9719211822660098  [1. 0.93103448 1. 1. 0.92857143]
Model : RidgeClassifierCV  Mean Score : 0.993103448275862  [0.96551724 1. 1. 1. 1.]
Model : SGDClassifier  Mean Score : 0.9647783251231526  [0.96551724 0.96551724 0.92857143 1. 0.96428571]
Model : SVC  Mean Score : 0.9785714285714286  [1. 1. 0.96428571 0.96428571 0.96428571]

(MinMaxScaler pass)
Model : AdaBoostClassifier  Mean Score : 0.7411330049261083  [0.5862069 0.65517241 0.96428571 0.92857143 0.57142857]
Model : BaggingClassifier  Mean Score : 0.9576354679802955  [1. 0.93103448 0.96428571 0.92857143 0.96428571]
Model : BernoulliNB  Mean Score : 0.3238916256157635  [0.34482759 0.31034483 0.35714286 0.32142857 0.28571429]
Model : CalibratedClassifierCV  Mean Score : 0.9788177339901478  [1. 0.96551724 0.96428571 0.96428571 1.]
Model : ComplementNB  Mean Score : 0.8662561576354679  [0.86206897 0.86206897 0.82142857 0.82142857 0.96428571]
Model : DecisionTreeClassifier  Mean Score : 0.9152709359605913  [0.89655172 0.96551724 0.85714286 0.96428571 0.89285714]
Model : DummyClassifier  Mean Score : 0.33103448275862063  [0.37931034 0.27586207 0.39285714 0.32142857 0.28571429]
Model : ExtraTreeClassifier  Mean Score : 0.8657635467980296  [1. 0.79310345 0.85714286 0.85714286 0.82142857]
Model : ExtraTreesClassifier  Mean Score : 0.9859605911330049  [1. 0.96551724 0.96428571 1. 1.]
Model : GaussianNB  Mean Score : 0.9857142857142858  [1. 1. 0.96428571 0.96428571 1.]
Model : GaussianProcessClassifier  Mean Score : 0.9859605911330049  [0.96551724 1. 1. 1. 0.96428571]
Model : GradientBoostingClassifier  Mean Score : 0.9435960591133006  [0.96551724 0.93103448 1. 1. 0.82142857]
Model : HistGradientBoostingClassifier  Mean Score : 0.9716748768472907  [0.96551724 1. 0.92857143 0.96428571 1.]
Model : KNeighborsClassifier  Mean Score : 0.9716748768472907  [0.96551724 1. 0.92857143 1. 0.96428571]
Model : LabelPropagation  Mean Score : 0.9788177339901478  [0.96551724 1. 0.96428571 0.96428571 1.]
Model : LabelSpreading  Mean Score : 0.97192118226601  [0.93103448 1. 0.96428571 0.96428571 1.]
Model : LinearDiscriminantAnalysis  Mean Score : 1.0  [1. 1. 1. 1. 1.]
Model : LinearSVC  Mean Score : 0.9857142857142858  [1. 1. 0.92857143 1. 1.]
Model : LogisticRegression  Mean Score : 0.9859605911330049  [1. 0.96551724 1. 0.96428571 1.]
Model : LogisticRegressionCV  Mean Score : 0.9857142857142858  [1. 1. 1. 0.96428571 0.96428571]
Model : MLPClassifier  Mean Score : 0.9719211822660098  [0.96551724 0.96551724 0.96428571 1. 0.96428571]
Model : MultinomialNB  Mean Score : 0.9647783251231526  [1. 0.93103448 1. 0.89285714 1.]
Model : NearestCentroid  Mean Score : 0.9509852216748769  [0.93103448 0.93103448 0.96428571 0.96428571 0.96428571]
Model : NuSVC  Mean Score : 0.9790640394088669  [1. 0.93103448 1. 0.96428571 1.]
Model : PassiveAggressiveClassifier  Mean Score : 0.9857142857142858  [1. 1. 0.92857143 1. 1.]
Model : Perceptron  Mean Score : 0.9928571428571429  [1. 1. 1. 0.96428571 1.]
Model : QuadraticDiscriminantAnalysis  Mean Score : 0.993103448275862  [1. 0.96551724 1. 1. 1.]
Model : RadiusNeighborsClassifier  Mean Score : 0.9438423645320195  [1. 0.86206897 0.96428571 0.89285714 1.]
Model : RandomForestClassifier  Mean Score : 0.9719211822660098  [0.96551724 0.96551724 0.96428571 1. 0.96428571]
Model : RidgeClassifier  Mean Score : 0.993103448275862  [0.96551724 1. 1. 1. 1.]
Model : RidgeClassifierCV  Mean Score : 1.0  [1. 1. 1. 1. 1.]
Model : SGDClassifier  Mean Score : 0.9790640394088669  [0.93103448 1. 1. 1. 0.96428571]
Model : SVC  Mean Score : 0.9862068965517242  [0.93103448 1. 1. 1. 1.]

Best Model
('ExtraTreesClassifier', 1.0)
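accuracy_score was imported at the top but never used; to close the loop, one could refit the winning model and score it on the held-out test set. A sketch, assuming we pick ExtraTreesClassifier (the best mean CV score) and reuse the last scaled arrays from the loop above; the exact number will vary with the random split:

from sklearn.ensemble import ExtraTreesClassifier

best_model = ExtraTreesClassifier()
best_model.fit(x_train_scaled, y_train)     # x_train_scaled: MinMax-scaled train data from the final loop pass
y_pred = best_model.predict(x_test_scaled)  # the matching scaled test data
print(accuracy_score(y_test, y_pred))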