[筆記] pandas 用法 (1) 基本功能 indexing 設值

pandas 用法 (1)

pandas 跟 numpy 的使用情境:
numpy 用來替代 python 當中的 list (陣列,矩陣)
pandas 用來替代 python 當中的 dictionary (字典)

pandas 基本功能

  • Pandas Series

    會自動加上字典的 index
    #coding=utf-8 import numpy as np import pandas as pd s = pd.Series([1,3,6, np.nan, 44, 1]) print(s) ''' 0 1.0 1 3.0 2 6.0 3 NaN 4 44.0 5 1.0 dtype: float64 '''
  • Pandas date_range
    #coding=utf-8 import numpy as np import pandas as pd dates = pd.date_range('20170101',periods=6) print(dates) ''' DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05', '2017-01-06'], dtype='datetime64[ns]', freq='D') '''
  • Pandas DataFrame 基本性質
    #coding=utf-8 import numpy as np import pandas as pd # 宣告 data frame df=pd.DataFrame(np.random.randn(6,4), index=dates, columns=['a','b','c','d']) print(df) ''' a b c d 2017-01-01 1.159450 -1.578301 -0.652926 0.375754 2017-01-02 0.393195 -0.346388 -0.754170 -0.008133 2017-01-03 1.192782 0.804110 -0.842838 0.012096 2017-01-04 -1.058825 -0.323025 0.722646 -1.166700 2017-01-05 -1.180235 -0.450869 1.203138 -0.815103 2017-01-06 -1.722095 0.304247 0.175714 -1.654301 ''' # 預設的 index 為 0 1 2 3 ... df1=pd.DataFrame(np.arange(12).reshape((3,4))) print(df1) ''' 0 1 2 3 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 '''
  • Pandas DataFrame 用法:使用 dictionary 宣告
    #coding=utf-8 import numpy as np import pandas as pd # 使用 dictionary 定義 data frame df2=pd.DataFrame({'A':1., 'B':pd.Timestamp('20170101'), 'C':pd.Series(1, index=list(range(4)),dtype='float32'), 'D':np.array([3]*4, dtype='int32'), 'E':pd.Categorical(["test","train","test","train"]), 'F':'foo'}) print(df2) ''' A B C D E F 0 1.0 2017-01-01 1.0 3 test foo 1 1.0 2017-01-01 1.0 3 train foo 2 1.0 2017-01-01 1.0 3 test foo 3 1.0 2017-01-01 1.0 3 train foo '''
  • Pandas DataFrame 使用 dtype 查看資料型態
    # 查看資料格式 print(df2.dtypes) ''' A float64 B datetime64[ns] C float32 D int32 E category F object dtype: object '''
  • Pandas DataFrame 查看 index 與資料數值
    # 輸出所有 row index print(df2.index) ''' Int64Index([0, 1, 2, 3], dtype='int64') ''' # 輸出所有 column print(df2.columns) ''' Index([u'A', u'B', u'C', u'D', u'E', u'F'], dtype='object') ''' # 輸出所有值 values print(df2.values) ''' [[1.0 Timestamp('2017-01-01 00:00:00') 1.0 3 'test' 'foo'] [1.0 Timestamp('2017-01-01 00:00:00') 1.0 3 'train' 'foo'] [1.0 Timestamp('2017-01-01 00:00:00') 1.0 3 'test' 'foo'] [1.0 Timestamp('2017-01-01 00:00:00') 1.0 3 'train' 'foo']] '''
  • Pandas DataFrame 使用 describe 查看數字資料描述
    # 輸出 describe 查看數字形式的特性 print(df2.describe()) ''' A C D count 4.0 4.0 4.0 mean 1.0 1.0 3.0 std 0.0 0.0 0.0 min 1.0 1.0 3.0 25% 1.0 1.0 3.0 50% 1.0 1.0 3.0 75% 1.0 1.0 3.0 max 1.0 1.0 3.0 '''
  • Pandas DataFrame 轉置資料
    # transpose 轉置矩陣 print(df2.T) ''' 0 1 2 3 A 1 1 1 1 B 2017-01-01 00:00:00 2017-01-01 00:00:00 2017-01-01 00:00:00 2017-01-01 00:00:00 C 1 1 1 1 D 3 3 3 3 E test train test train F foo foo foo foo '''
  • Pandas DataFrame 排序資料 sort_index 與 sort_values
    # 排序,對於 axis=1 以 row 方向排序 ascending=False 倒排序 print(df2.sort_index(axis=1, ascending=False)) ''' 排序結果: F E D C B A F E D C B A 0 foo test 3 1.0 2017-01-01 1.0 1 foo train 3 1.0 2017-01-01 1.0 2 foo test 3 1.0 2017-01-01 1.0 3 foo train 3 1.0 2017-01-01 1.0 ''' # 排序,對於 axis=0 以 col 方向反向排序 (ascending=Fals) print(df2.sort_index(axis=0, ascending=False)) ''' 排序結果: 3 2 1 0 A B C D E F 3 1.0 2017-01-01 1.0 3 train foo 2 1.0 2017-01-01 1.0 3 test foo 1 1.0 2017-01-01 1.0 3 train foo 0 1.0 2017-01-01 1.0 3 test foo ''' # 排序 sort_values 針對單行的值進行排序 print(df2.sort_values(by='E')) ''' 排序結果: test train 兩倆排在一起 A B C D E F 0 1.0 2017-01-01 1.0 3 test foo 2 1.0 2017-01-01 1.0 3 test foo 1 1.0 2017-01-01 1.0 3 train foo 3 1.0 2017-01-01 1.0 3 train foo '''

pandas indexing

  • data frame 基本 indexing 方法
    #coding=utf-8 import pandas as pd import numpy as np dates = pd.date_range('2017-01-01',periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) print(df['A']) ''' 2017-01-01 0 2017-01-02 4 2017-01-03 8 2017-01-04 12 2017-01-05 16 2017-01-06 20 Freq: D, Name: A, dtype: int64 ''' print(df.A) ''' 2017-01-01 0 2017-01-02 4 2017-01-03 8 2017-01-04 12 2017-01-05 16 2017-01-06 20 Freq: D, Name: A, dtype: int64 ''' print(df[0:3]) ''' A B C D 2017-01-01 0 1 2 3 2017-01-02 4 5 6 7 2017-01-03 8 9 10 11 ''' print(df['2017-01-02':'2017-01-04']) ''' A B C D 2017-01-02 4 5 6 7 2017-01-03 8 9 10 11 2017-01-04 12 13 14 15 '''
  • data frame location 用法 loc
    # 使用 data frame loc 來選擇查看數據 # 查看指定的 row 的內容 print(df.loc['2017-01-02']) ''' A 4 B 5 C 6 D 7 Name: 2017-01-02 00:00:00, dtype: int64 ''' # 篩選 col 的指令欄位 row 部分全部印出 print(df.loc[:,['A','B']]) ''' A B 2017-01-01 0 1 2017-01-02 4 5 2017-01-03 8 9 2017-01-04 12 13 2017-01-05 16 17 2017-01-06 20 21 ''' # 指定 row 指定 col print(df.loc['2017-01-03',['A','C']]) ''' A 8 C 10 Name: 2017-01-03 00:00:00, dtype: int64 '''
  • select py position 使用 iloc
    # 使用 index 數字來指定 select by position: iloc print(df.iloc[3]) ''' A 12 B 13 C 14 D 15 Name: 2017-01-04 00:00:00, dtype: int64 ''' # 第 3 row, 第 1 col print(df.iloc[3,1]) # 13 # 第 3~5 row, 第 1~3 col print(df.iloc[3:5,1:3]) ''' B C 2017-01-04 13 14 2017-01-05 17 18 '''
  • 混合篩選 使用 ix
    # 混合篩選 使用 ix print(df.ix[:3, ['A','C']]) ''' A C 2017-01-01 0 2 2017-01-02 4 6 2017-01-03 8 10 '''
  • Booling indexing
    # Booling indexing print(df) ''' A B C D 2017-01-01 0 1 2 3 2017-01-02 4 5 6 7 2017-01-03 8 9 10 11 2017-01-04 12 13 14 15 2017-01-05 16 17 18 19 2017-01-06 20 21 22 23 ''' # 針對 A 欄位資料 > 8 的篩選出來 print(df[df.A > 8]) ''' A B C D 2017-01-04 12 13 14 15 2017-01-05 16 17 18 19 2017-01-06 20 21 22 23 '''

pandas 設值

  • 在資料表格中根據 index 設定值
    #coding=utf-8 import pandas as pd import numpy as np dates = pd.date_range('2017-01-01',periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) print(df) ''' A B C D 2017-01-01 0 1 2 3 2017-01-02 4 5 6 7 2017-01-03 8 9 10 11 2017-01-04 12 13 14 15 2017-01-05 16 17 18 19 2017-01-06 20 21 22 23 ''' df.iloc[2,2] = 1111 print(df) ''' A B C D 2017-01-01 0 1 2 3 2017-01-02 4 5 6 7 2017-01-03 8 9 1111 11 2017-01-04 12 13 14 15 2017-01-05 16 17 18 19 2017-01-06 20 21 22 23 ''' df.loc['2017-01-02','C'] = 222 print(df) ''' A B C D 2017-01-01 0 1 2 3 2017-01-02 4 5 222 7 2017-01-03 8 9 1111 11 2017-01-04 12 13 14 15 2017-01-05 16 17 18 19 2017-01-06 20 21 22 23 '''
  • 設定符合條件,整欄位修改
    #coding=utf-8 import pandas as pd import numpy as np dates = pd.date_range('2017-01-01',periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) # 全部都改到 df[df.A>8] = 0 print(df) ''' A B C D 2017-01-01 0 1 2 3 2017-01-02 4 5 222 7 2017-01-03 8 9 1111 11 2017-01-04 0 0 0 0 2017-01-05 0 0 0 0 2017-01-06 0 0 0 0 '''
  • 設定符合條件,指定特定欄位才修改
    #coding=utf-8 import pandas as pd import numpy as np dates = pd.date_range('2017-01-01',periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) # 指定特定欄位修改 df.A[df.A>8] = 0 print(df) ''' A B C D 2017-01-01 0 1 2 3 2017-01-02 4 5 222 7 2017-01-03 8 9 1111 11 2017-01-04 0 13 14 15 2017-01-05 0 17 18 19 2017-01-06 0 21 22 23 ''' dates = pd.date_range('2017-01-01',periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) # 指定特定欄位修改 df.B[df.A>8] = 0 print(df) ''' A B C D 2017-01-01 0 1 2 3 2017-01-02 4 5 6 7 2017-01-03 8 9 10 11 2017-01-04 12 0 14 15 2017-01-05 16 0 18 19 2017-01-06 20 0 22 23 '''
  • 新增欄位並設定值
    #coding=utf-8 import pandas as pd import numpy as np dates = pd.date_range('2017-01-01',periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) # 新增欄位 df['E'] = pd.Series([1,2,3,4,5,6],index=pd.date_range('2017-01-01',periods=6)) df['F'] = np.nan print(df) ''' A B C D E F 2017-01-01 0 1 2 3 1 NaN 2017-01-02 4 5 6 7 2 NaN 2017-01-03 8 9 10 11 3 NaN 2017-01-04 12 13 14 15 4 NaN 2017-01-05 16 17 18 19 5 NaN 2017-01-06 20 21 22 23 6 NaN '''

pandas 處理資料表中的 “空值”

  • dropna 丟掉整欄/整列資料
#coding=utf-8 import pandas as pd import numpy as np dates = pd.date_range('2017-01-01',periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) print(df) ''' A B C D 2017-01-01 0 1 2 3 2017-01-02 4 5 6 7 2017-01-03 8 9 10 11 2017-01-04 12 13 14 15 2017-01-05 16 17 18 19 2017-01-06 20 21 22 23 ''' # 設定兩個空值 df.iloc[0,1] = np.nan df.iloc[1,2] = np.nan print(df) ''' A B C D 2017-01-01 0 NaN 2.0 3 2017-01-02 4 5.0 NaN 7 2017-01-03 8 9.0 10.0 11 2017-01-04 12 13.0 14.0 15 2017-01-05 16 17.0 18.0 19 2017-01-06 20 21.0 22.0 23 ''' # dropna: how = any 偵測 row (axis=0) 當中有 nan 的就整個丟掉 print(df.dropna(axis=0, how='any')) # how = 'any' 'all' ''' A B C D 2017-01-03 8 9.0 10.0 11 2017-01-04 12 13.0 14.0 15 2017-01-05 16 17.0 18.0 19 2017-01-06 20 21.0 22.0 23 ''' # dropna: how = any 偵測 col (axis=1) 當中有 nan 的就整個丟掉 print(df.dropna(axis=1, how='any')) # how = 'any' 'all' ''' A D 2017-01-01 0 3 2017-01-02 4 7 2017-01-03 8 11 2017-01-04 12 15 2017-01-05 16 19 2017-01-06 20 23 ''' # p.s. 如果 how = all 的話,要整排 NaN 才會丟掉
  • fillna 在 NaN 空格中填入指定的值 value
# fillna: 遇到 NaN 填入指定的數字 print(df.fillna(value=0)) ''' A B C D 2017-01-01 0 0.0 2.0 3 2017-01-02 4 5.0 0.0 7 2017-01-03 8 9.0 10.0 11 2017-01-04 12 13.0 14.0 15 2017-01-05 16 17.0 18.0 19 2017-01-06 20 21.0 22.0 23 '''
  • isnull 查看資料是否為空值 NaN
# isnull 查看資料格是否是 NaN print(df.isnull()) ''' A B C D 2017-01-01 False True False False 2017-01-02 False False True False 2017-01-03 False False False False 2017-01-04 False False False False 2017-01-05 False False False False 2017-01-06 False False False False ''' # isnull 的應用:查看整個資料表是否有空值 NaN print(np.any(df.isnull())==True) # output: True ''' True 代表 df.isnull 當中有 True --> 意思是資料表中有 NaN (丟失的數據) '''

留言

這個網誌中的熱門文章

[筆記] CRLF跟LF之區別 --- 隱形的 bug

[ML筆記] Batch Normalization

[筆記] 統計實習(1) SAS 基礎用法 (匯入資料並另存SAS新檔,SUBSTR,計算總和與平均,BMI)

[ML筆記] Ensemble - Bagging, Boosting & Stacking