이번 포스팅에서는 Pandas와 Numpy 를 이용하여 실제 데이터를 만져볼 것입니다. 해당 파일은 첨부되어 있습니다.
numpy 와 pandas는 관용적으로 np, pd로 import 합니다
import numpy as np
import pandas as pd
해당 함수는 pandas 모듈에있는 read_csv라는 함수로 일반적으로 데이터를 읽어올 때 많이 사용됩니다. csv파일이나 excel파일을 읽어오는데 용이합니다. (csv파일은 엑셀파일과 같다고 생각하시면 됩니다.)
해당 csv파일을 읽어오기 위해선 경로를 잡아줘야하는데 이 스크립트를 실행하는 위치와 같은 위치에 위치시켜서 놓으시면 됩니다.
ecom = pd.read_csv('Ecommerce Purchases.csv')
이제 head, describe, info 등의 함수로 ecom이라는 dataset이 어떻게 구성되어 있는지 파악해보겠습니다. head는 상위 몇개의 자료를 보여주는 함수입니다. describe함수는 각 열에대해 std,중간값,사분위수 등의 정보를 보여주는 함수입니다. info함수는 각 열에대한 데이터 타입, 값이 비어있는 non 값의 개수 등을 보여줍니다.
ecom.head()
Address | Lot | AM or PM | Browser Info | Company | Credit Card | CC Exp Date | CC Security Code | CC Provider | Job | IP Address | Language | Purchase Price | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16629 Pace Camp Apt. 448\nAlexisborough, NE 77... | 46 in | PM | Opera/9.56.(X11; Linux x86_64; sl-SI) Presto/2... | Martinez-Herman | 6011929061123406 | 02/20 | 900 | JCB 16 digit | pdunlap@yahoo.com | Scientist, product/process development | 149.146.147.205 | el | 98.14 |
1 | 9374 Jasmine Spurs Suite 508\nSouth John, TN 8... | 28 rn | PM | Opera/8.93.(Windows 98; Win 9x 4.90; en-US) Pr... | Fletcher, Richards and Whitaker | 3337758169645356 | 11/18 | 561 | Mastercard | anthony41@reed.com | Drilling engineer | 15.160.41.51 | fr | 70.73 |
2 | Unit 0065 Box 5052\nDPO AP 27450 | 94 vE | PM | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ... | Simpson, Williams and Pham | 675957666125 | 08/19 | 699 | JCB 16 digit | amymiller@morales-harrison.com | Customer service manager | 132.207.160.22 | de | 0.95 |
3 | 7780 Julia Fords\nNew Stacy, WA 45798 | 36 vm | PM | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0 ... | Williams, Marshall and Buchanan | 6011578504430710 | 02/24 | 384 | Discover | brent16@olson-robinson.info | Drilling engineer | 30.250.74.19 | es | 78.04 |
4 | 23012 Munoz Drive Suite 337\nNew Cynthia, TX 5... | 20 IE | AM | Opera/9.58.(X11; Linux x86_64; it-IT) Presto/2... | Brown, Watson and Andrews | 6011456623207998 | 10/25 | 678 | Diners Club / Carte Blanche | christopherwright@gmail.com | Fine artist | 24.140.33.94 | es | 77.82 |
print(ecom.shape)
print(ecom.columns)
(10000, 14) Index(['Address', 'Lot', 'AM or PM', 'Browser Info', 'Company', 'Credit Card', 'CC Exp Date', 'CC Security Code', 'CC Provider', 'Email', 'Job', 'IP Address', 'Language', 'Purchase Price'], dtype='object')
ecom.describe()
Credit Card | CC Security Code | Purchase Price | |
---|---|---|---|
count | 1.000000e+04 | 10000.000000 | 10000.000000 |
mean | 2.341374e+15 | 907.217800 | 50.347302 |
std | 2.256103e+15 | 1589.693035 | 29.015836 |
min | 6.040186e+10 | 0.000000 | 0.000000 |
25% | 3.056322e+13 | 280.000000 | 25.150000 |
50% | 8.699942e+14 | 548.000000 | 50.505000 |
75% | 4.492298e+15 | 816.000000 | 75.770000 |
max | 6.012000e+15 | 9993.000000 | 99.990000 |
ecom.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10000 entries, 0 to 9999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Address 10000 non-null object 1 Lot 10000 non-null object 2 AM or PM 10000 non-null object 3 Browser Info 10000 non-null object 4 Company 10000 non-null object 5 Credit Card 10000 non-null int64 6 CC Exp Date 10000 non-null object 7 CC Security Code 10000 non-null int64 8 CC Provider 10000 non-null object 9 Email 10000 non-null object 10 Job 10000 non-null object 11 IP Address 10000 non-null object 12 Language 10000 non-null object 13 Purchase Price 10000 non-null float64 dtypes: float64(1), int64(2), object(11) memory usage: 1.1+ MB
unique함수는 해당 열에 몇가지 종류의 값이 있는지 리턴해줍니다. categorical변수 (남자/여자, 나이대)의 개수를 파악하기 용이합니다.
ecom["AM or PM"].unique()
array(['PM', 'AM'], dtype=object)
해당 열에 대하여 데이터당 갯수를 알려줍니다.
ecom["AM or PM"].value_counts()
PM 5068 AM 4932 Name: AM or PM, dtype: int64
ecom.mean()
Credit Card 2.341374e+15 CC Security Code 9.072178e+02 Purchase Price 5.034730e+01 dtype: float64
print(ecom["Purchase Price"].min())
print(ecom["Purchase Price"].max())
0.0 99.99
round(ecom["Purchase Price"].min())
0
round(ecom["Purchase Price"].max())
100
ecom[ecom['Language'] == 'en']['Credit Card'].count()
1098
ecom.loc[ecom['Language'] == 'en', 'Credit Card'].count()
1098
len(ecom.loc[ecom['Language'] == 'en', 'Credit Card'].unique())
1098
ecom[ecom['Job'] == 'Lawyer'].count().max()
30
ecom['AM or PM'].value_counts()
PM 5068 AM 4932 Name: AM or PM, dtype: int64
ecom['Job'].value_counts().head()
Interior and spatial designer 31 Lawyer 30 Social researcher 28 Research officer, political party 27 Purchasing manager 27 Name: Job, dtype: int64
ecom['Job'].value_counts().tail()
Prison officer 7 Actuary 6 Editor, film/video 6 Information systems manager 5 Investment analyst 5 Name: Job, dtype: int64
ecom['Job'].value_counts().sort_values().head()
Investment analyst 5 Information systems manager 5 Actuary 6 Editor, film/video 6 Special effects artist 7 Name: Job, dtype: int64
ecom.loc[ecom['Lot'] == '90 WT', 'Purchase Price']
513 75.1 Name: Purchase Price, dtype: float64
ecom.loc[ecom['Credit Card']==4926535242672853, 'Email']
1234 bondellen@williams-garza.com Name: Email, dtype: object
new_ecom = ecom[ecom['CC Provider']=='American Express']
new_ecom[new_ecom['Purchase Price']>95].count()
Address 39 Lot 39 AM or PM 39 Browser Info 39 Company 39 Credit Card 39 CC Exp Date 39 CC Security Code 39 CC Provider 39 Email 39 Job 39 IP Address 39 Language 39 Purchase Price 39 dtype: int64
ecom[
(ecom['CC Provider']=='American Express') & (ecom['Purchase Price'] > 95 )
].count()
Address 39 Lot 39 AM or PM 39 Browser Info 39 Company 39 Credit Card 39 CC Exp Date 39 CC Security Code 39 CC Provider 39 Email 39 Job 39 IP Address 39 Language 39 Purchase Price 39 dtype: int64
new_ecom = ecom[ecom['Purchase Price'] > 95]
new_ecom.append(
ecom[ecom['Purchase Price'] < 10]
).count()
Address 1502 Lot 1502 AM or PM 1502 Browser Info 1502 Company 1502 Credit Card 1502 CC Exp Date 1502 CC Security Code 1502 CC Provider 1502 Email 1502 Job 1502 IP Address 1502 Language 1502 Purchase Price 1502 dtype: int64
ecom[
(ecom['Purchase Price'] > 95) | (ecom['Purchase Price'] < 10)
].count()
Address 1502 Lot 1502 AM or PM 1502 Browser Info 1502 Company 1502 Credit Card 1502 CC Exp Date 1502 CC Security Code 1502 CC Provider 1502 Email 1502 Job 1502 IP Address 1502 Language 1502 Purchase Price 1502 dtype: int64
len(ecom[
(ecom['Purchase Price'] > 95) | (ecom['Purchase Price'] < 10)
])
1502
for idx, row in ecom.iterrows():
print(row)
break
Address 16629 Pace Camp Apt. 448\nAlexisborough, NE 77... Lot 46 in AM or PM PM Browser Info Opera/9.56.(X11; Linux x86_64; sl-SI) Presto/2... Company Martinez-Herman Credit Card 6011929061123406 CC Exp Date 02/20 CC Security Code 900 CC Provider JCB 16 digit Email pdunlap@yahoo.com Job Scientist, product/process development IP Address 149.146.147.205 Language el Purchase Price 98.14 Name: 0, dtype: object
ecom['CC Exp Date'].apply(
lambda x:(x.split('/')[1]=='25')
).value_counts()
False 8967 True 1033 Name: CC Exp Date, dtype: int64
ecom[
ecom['CC Exp Date'].apply(
lambda x:(x.split('/')[1]=='25')
)
].count()
Address 1033 Lot 1033 AM or PM 1033 Browser Info 1033 Company 1033 Credit Card 1033 CC Exp Date 1033 CC Security Code 1033 CC Provider 1033 Email 1033 Job 1033 IP Address 1033 Language 1033 Purchase Price 1033 dtype: int64
ecom['Email Provider'] = ecom['Email'].apply(
lambda email:email.split('@')[1]
)
ecom.head(2)
Address | Lot | AM or PM | Browser Info | Company | Credit Card | CC Exp Date | CC Security Code | CC Provider | Job | IP Address | Language | Purchase Price | Email Provider | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16629 Pace Camp Apt. 448\nAlexisborough, NE 77... | 46 in | PM | Opera/9.56.(X11; Linux x86_64; sl-SI) Presto/2... | Martinez-Herman | 6011929061123406 | 02/20 | 900 | JCB 16 digit | pdunlap@yahoo.com | Scientist, product/process development | 149.146.147.205 | el | 98.14 | yahoo.com |
1 | 9374 Jasmine Spurs Suite 508\nSouth John, TN 8... | 28 rn | PM | Opera/8.93.(Windows 98; Win 9x 4.90; en-US) Pr... | Fletcher, Richards and Whitaker | 3337758169645356 | 11/18 | 561 | Mastercard | anthony41@reed.com | Drilling engineer | 15.160.41.51 | fr | 70.73 | reed.com |
ecom['Email Provider'].value_counts().head(5)
hotmail.com 1638 yahoo.com 1616 gmail.com 1605 smith.com 42 williams.com 37 Name: Email Provider, dtype: int64
ecom.hist()
array([[<AxesSubplot:title={'center':'Credit Card'}>, <AxesSubplot:title={'center':'CC Security Code'}>], [<AxesSubplot:title={'center':'Purchase Price'}>, <AxesSubplot:>]], dtype=object)
Pandas2 (0) | 2021.06.20 |
---|---|
Pandas_1 (0) | 2021.06.18 |
Numpy_2 (0) | 2021.06.14 |
Numpy_1 (0) | 2021.06.11 |
댓글 영역