Pandas_DataFrame

1 Creation
2 API Reference
- 2.1 General functions
3 ============= Special features =============
4 Column operations
5 Indexing
6 Filtering
- 6.1 pd.DataFrame()
  - 6.1.1 df[df['column']>value]
  - 6.1.2 df.query(expr, inplace=False, **kwargs)
7 Alignment
- 7.1 Broadcast
- 7.2 Boolean operators
8 Comparisons

import numpy as np
import pandas as pd

Creation

pandas.DataFrame(
          data=None
        , index=None
        , columns=None
        , dtype=None
        , copy=False
        )
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
Dict can contain Series, arrays, constants, or list-like objects
Changed in version 0.23.0: If data is a dict, argument order is maintained for Python 3.6 and later.
index : Index or array-like
Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided
columns : Index or array-like
Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided
dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input
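
As a quick check of the defaults described above, the following minimal sketch builds a frame from a 2-D array without passing index or columns, then forces a single dtype. The array contents and the variable names (`arr`, `df_default`, `df_float`) are illustrative assumptions, not part of the original notes.

```python
import numpy as np
import pandas as pd

# No index/columns given: both axes fall back to a RangeIndex.
arr = np.arange(6).reshape(2, 3)
df_default = pd.DataFrame(arr)
print(df_default.index)    # RangeIndex over the 2 rows
print(df_default.columns)  # RangeIndex over the 3 columns

# dtype forces one type for the whole frame.
df_float = pd.DataFrame(arr, columns=list('ABC'), dtype=float)
print(df_float.dtypes)
```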

From a Series

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name provided).

s=pd.Series(np.random.randn(5))
pd.DataFrame(s) # each element of the Series becomes one row of the DataFrame

s=pd.Series([1,2,3,4,5],name='Jasper')
pd.DataFrame(s, columns=['Jasper','Casper'])

From a dict of Series

d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])
    }
pd.DataFrame(d)

pd.DataFrame(d, index=['d', 'b', 'a'])

pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

df=pd.DataFrame(d)
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

pd.DataFrame(d)
df.columns

Index(['one', 'two'], dtype='object')

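From a dict of lists

When a plain list is passed, each element becomes one row of the DataFrame; in a dict of lists, each key becomes a column and each list supplies that column's values.

# A DataFrame can also be built directly from a list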
l=[1., 2., 3., 4.]
pd.DataFrame(l, columns=['one'])

d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}
pd.DataFrame(d)

pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}
pd.DataFrame.from_dict(d)

# orient='index'
# If you pass orient='index', the keys will be the row labels.
# In this case, you can also pass the desired column names
d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}
pd.DataFrame.from_dict(d
                       , orient='index'
                       , columns=['one', 'two', 'three', 'four']
)

From a dict of ndarrays

d = {'one': np.array([1.,2.,3.,4.]), 'two': np.array([4.,2.,3.,1.])}
pd.DataFrame(d)

d = {'one': np.array([1.,2.,3.,4.]), 'two': np.array([4.,2.,3.,1.])}
pd.DataFrame(d, index=['A','B','C','D'])

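From a list of dicts

As before, each element of the list becomes one row of the DataFrame, but here every row is a dict: its keys become the column labels and its values fill the cells.
When collecting data with a web crawler, each dict-like record can be appended to a list one by one, and the list is then turned into a DataFrame.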

l = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(l, index=['first', 'second'], columns=['c', 'a', 'b'])
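
The scraping workflow mentioned above can be sketched as follows. The records and their keys (`id`, `title`, `lang`) are hypothetical stand-ins for whatever a crawler would return; the point is only that dicts are collected in a list first and converted once at the end.

```python
import pandas as pd

records = []  # one dict per scraped item (hypothetical data)
for i in range(3):
    records.append({'id': i, 'title': f'page-{i}'})
# A record may carry extra keys; other rows get NaN in that column.
records.append({'id': 3, 'title': 'page-3', 'lang': 'en'})

df_records = pd.DataFrame(records)
print(df_records)
```

Building the list first and calling the constructor once is usually cheaper than growing a DataFrame row by row.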

From a dict of tuples

t={  # a dict of dicts: the outer keys are tuples, and the values are dicts keyed by tuples
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
    ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
    ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
    ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}
}
for i in t.keys():    # iterate over the dict keys
    print(type(i), i)
print('-'*100)
for v in t.values():    # iterate over the dict values
    print(type(v), v)
print('-'*100)
for j in t.keys():    # access the elements of each key tuple by position
    print(j[0], j[1])
print('-'*100)
for x, y in t.keys():   # alternative: unpack the elements of each key tuple
    print(x, y)
print('-'*100)
for v2 in t.values():   # iterate over the keys and values of each inner dict
    for k3, v3 in v2.items():
        print(k3, ' | ', v3)
    print('-'*20)
pd.DataFrame(t)

<class 'tuple'> ('a', 'b')
<class 'tuple'> ('a', 'a')
<class 'tuple'> ('a', 'c')
<class 'tuple'> ('b', 'a')
<class 'tuple'> ('b', 'b')
----------------------------------------------------------------------------------------------------
<class 'dict'> {('A', 'B'): 1, ('A', 'C'): 2}
<class 'dict'> {('A', 'C'): 3, ('A', 'B'): 4}
<class 'dict'> {('A', 'B'): 5, ('A', 'C'): 6}
<class 'dict'> {('A', 'C'): 7, ('A', 'B'): 8}
<class 'dict'> {('A', 'D'): 9, ('A', 'B'): 10}
----------------------------------------------------------------------------------------------------
a b
a a
a c
b a
b b
----------------------------------------------------------------------------------------------------
a b
a a
a c
b a
b b
----------------------------------------------------------------------------------------------------
('A', 'B')  |  1
('A', 'C')  |  2
--------------------
('A', 'C')  |  3
('A', 'B')  |  4
--------------------
('A', 'B')  |  5
('A', 'C')  |  6
--------------------
('A', 'C')  |  7
('A', 'B')  |  8
--------------------
('A', 'D')  |  9
('A', 'B')  |  10
--------------------

pd.DataFrame({  # the tuples in the outer keys become the column levels,
                # the tuples in the inner dict keys become the row levels,
                # and the inner dict values become the DataFrame cells
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
})

# Another example
pd.DataFrame({
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
})
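
To make the mapping explicit, this small sketch reuses the two-column example and inspects the resulting axes. The variable name `df_t` is an assumption for illustration.

```python
import pandas as pd

df_t = pd.DataFrame({
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
})
print(type(df_t.columns).__name__, df_t.columns.nlevels)
print(type(df_t.index).__name__, df_t.index.nlevels)
# A cell is addressed by a (row tuple, column tuple) pair.
print(df_t.loc[('A', 'B'), ('a', 'a')])
```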

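From a 2-D ndarray

# Note the difference from a dict of ndarrays: here the DataFrame is built directly from the array, not from a dict
ar=np.array([[1.,2.,3.,4.],[3,4,5,6]])
print(pd.DataFrame(ar, dtype=int, columns=list('ABCD'))) # each inner list is one row; its elements are spread across the columns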
print(pd.DataFrame(ar).index) 
pd.DataFrame(ar)

   A  B  C  D
0  1  2  3  4
1  3  4  5  6
RangeIndex(start=0, stop=2, step=1)

From a structured (record) array

data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'S10')])  # 'S10' is a 10-byte string; the older 'a10' alias was removed in NumPy 2.0
data[:] = [(1,2.,'Hello'), (2,3.,"World")]
print(type(data))
print(data)
pd.DataFrame(data, index=['first', 'second'], columns=['C', 'A', 'B'])

<class 'numpy.ndarray'>
[(1, 2., b'Hello') (2, 3., b'World')]

pd.DataFrame.from_records(data, index='C')

Creating a DataFrame with a multi-level index using MultiIndex

MultiIndex.from_arrays

arrays = [[1, 1, 2, 2], ['Oct', 'Nov', 'Oct', 'Nov']]
index = pd.MultiIndex.from_arrays(arrays, names=('Quarter', 'periods'))
print(index)
columns = pd.MultiIndex.from_tuples([('company1', 'area1'),
                                     ('company2', 'area2')])
print(columns)
df = pd.DataFrame([(389.0, 18),
                   ( 24.0, 59),
                   ( 80.5, None),
                   (np.nan, 550)],
                  index=index,
                  columns=columns)
df

MultiIndex(levels=[[1, 2], ['Nov', 'Oct']],
           labels=[[0, 0, 1, 1], [1, 0, 1, 0]],
           names=['Quarter', 'periods'])
MultiIndex(levels=[['company1', 'company2'], ['area1', 'area2']],
           labels=[[0, 1], [0, 1]])

MultiIndex.from_tuples

index = pd.MultiIndex.from_tuples([('bird', 'falcon'),  # works well in combination with zip()
                                   ('bird', 'parrot'),
                                   ('mammal', 'lion'),
                                   ('mammal', 'monkey')],
                                  names=['class', 'name'])
columns = pd.MultiIndex.from_tuples([('speed', 'max'),
                                     ('species', 'type')])
print(index)
df = pd.DataFrame([(389.0, 'fly'),
                   ( 24.0, 'fly'),
                   ( 80.5, 'run'),
                   (np.nan, 'jump')],
                  index=index,
                  columns=columns)
df

MultiIndex(levels=[['bird', 'mammal'], ['falcon', 'lion', 'monkey', 'parrot']],
           labels=[[0, 0, 1, 1], [0, 3, 1, 2]],
           names=['class', 'name'])

m_index=pd.Index([("A","x1"),("A","x2"),("B","y1"),("B","y2"),("B","y3"),("B","y4"),("B","y5"),("B","y6"),("B","y7")],name=["class1","class2"])
print(m_index)
df = pd.DataFrame(
      data=[[45, 30], [200, 100], [1.5, 1], [30, 20], [250, 150], [1.5, 0.8], [320, 250], [1, 0.8], [0.3,0.2]]
    , index=m_index
    , columns=['big', 'small']
)
df

MultiIndex(levels=[['A', 'B'], ['x1', 'x2', 'y1', 'y2', 'y3', 'y4', 'y5', 'y6', 'y7']],
           labels=[[0, 0, 1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7, 8]],
           names=['class1', 'class2'])

MultiIndex.from_product

m_index=pd.MultiIndex.from_product([["Y1","Y2"],['Q1','Q2','Q3','Q4']],names=["Year","Quarter"])
print(m_index)
df=pd.DataFrame(
                  np.random.randint(5,15,(3,8))
                , index=['one','two','three']
                , columns=m_index
)
df

MultiIndex(levels=[['Y1', 'Y2'], ['Q1', 'Q2', 'Q3', 'Q4']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]],
           names=['Year', 'Quarter'])

API Reference

General functions

Data manipulations

pd.melt()

pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
The inverse of a pivot table: converts a wide table into a long one by turning column names into the values of a variable column.
"Unpivots" a DataFrame from wide format to long format, optionally leaving identifier variables set.
frame : DataFrame
The dataset to process.
id_vars : tuple, list, or ndarray, optional
Column(s) to use as identifier variables; these columns are not unpivoted.
value_vars : tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name : scalar
Name to use for the 'variable' column. If None it uses frame.columns.name or 'variable'.
value_name : scalar, default 'value'
Name to use for the 'value' column.
col_level : int or string, optional
If columns are a MultiIndex then use this level to melt.

df = pd.DataFrame({
                    'A': {0: 'a', 1: 'b', 2: 'c'}
                  , 'B': {0: 1, 1: 3, 2: 5}
                  , 'C': {0: 2, 1: 4, 2: 6}
                  })
print(df)
pd.melt(df, id_vars=['A'], value_vars=['B'])

   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6

df.columns = [list('ABC'), list('DEF')]
print(df)
pd.melt(
      df
    , id_vars=[('A','D')]
    , value_vars=[('B','E'),('C','F')]
)

   A  B  C
   D  E  F
0  a  1  2
1  b  3  4
2  c  5  6
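
As a sketch of how `var_name` and `value_name` shape the result, the block below melts a small frame and then pivots it back. The names `wide`, `long`, `restored`, `metric` and `reading` are assumptions chosen for this illustration.

```python
import pandas as pd

wide = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1, 3, 5], 'C': [2, 4, 6]})

# Wide -> long: B and C are folded into a (metric, reading) pair of columns.
long = pd.melt(wide, id_vars=['A'], value_vars=['B', 'C'],
               var_name='metric', value_name='reading')
print(long)

# Long -> wide again with pivot, the inverse operation.
restored = long.pivot(index='A', columns='metric', values='reading')
print(restored)
```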

pd.pivot_table()

df.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
Creates a pivot table from a long table.
Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
values : column to aggregate, optional
index : column, Grouper, array, or list of the previous
If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values.
columns : column, Grouper, array, or list of the previous
If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table column. If an array is passed, it is being used as the same manner as column values.
aggfunc : function, list of functions, dict, default numpy.mean
If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If a dict is passed, the key is the column to aggregate and the value is a function or list of functions.
fill_value : scalar, default None
Value to replace missing values with
margins : boolean, default False
Add all row / columns (e.g. for subtotal / grand totals)
dropna : boolean, default True
Do not include columns whose entries are all NaN
margins_name : string, default ‘All’
Name of the row / column that will contain the totals when margins is True.

df = pd.DataFrame({
                    'A': {0: 'a', 1: 'b', 2: 'c'}
                  , 'B': {0: 1, 1: 3, 2: 5}
                  , 'C': {0: 2, 1: 4, 2: np.nan}
                  })
print(df)
print('-'*50)
df.columns = [list('ABC'), list('DEF')]
print(df)
print('-'*50)
df_m=pd.melt(
      df
    , id_vars=[('A','D')]
    , value_vars=[('B','E'),('C','F')]
)
df_m.columns = list('ABCD')
print(df_m)
print('-'*50)
table = pd.pivot_table(
      df_m
    , values='D'
    , index=['B','C']
    , columns=['A']
    , aggfunc=np.mean
    , fill_value=100
)
table

   A  B    C
0  a  1  2.0
1  b  3  4.0
2  c  5  NaN
--------------------------------------------------
   A  B    C
   D  E    F
0  a  1  2.0
1  b  3  4.0
2  c  5  NaN
--------------------------------------------------
   A  B  C    D
0  a  B  E  1.0
1  b  B  E  3.0
2  c  B  E  5.0
3  a  C  F  2.0
4  b  C  F  4.0
5  c  C  F  NaN
--------------------------------------------------
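
The `fill_value`, `margins` and `margins_name` parameters described above can be seen together in a small sketch. The sales data and column names here are made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    'region':  ['east', 'east', 'west', 'west', 'east'],
    'product': ['apple', 'pear', 'apple', 'apple', 'apple'],
    'sales':   [10, 5, 7, 3, 2],
})

table = pd.pivot_table(
    sales,
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    fill_value=0,          # west sold no pears, so that cell becomes 0
    margins=True,          # add row/column totals
    margins_name='Total',
)
print(table)
```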

pd.crosstab()

pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
Compute a simple cross-tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed

a = np.array(["foo", "foo", "foo", "foo", "nam", "bar", "bar", "nam", "foo", "foo", "foo"], dtype=object)
b = np.array(["one", "one", "one", "two", "one", "one", "one", "two", "two", "two", "one"], dtype=object)
c = np.array(["dul", "dul", "shi", "dul", "jas", "shi", "shi", "cas", "shi", "shi", "shi"], dtype=object)
d = np.array([100,1,1,1,1,1,1,1,1,1,1], dtype=int)

pd.crosstab(
      a
    , [b, c]
    , values=d
    , rownames=['a']
    , colnames=['Jasper', 'Casper']
    , aggfunc=np.sum
    , dropna=True
    , margins=True
    , margins_name='Total'
)

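# Swapping index and columns (together with their names) is equivalent to transposing the table
# No values/aggfunc are passed here, so counts are computed; counting is the more common use case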
pd.crosstab(
      [b, c]
    , a
    , rownames=['Jasper', 'Casper']
    , colnames=['a']
)
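
A minimal sketch of the default frequency table, plus row-wise proportions via `normalize='index'`. The short arrays are invented for this example so the counts are easy to verify by hand.

```python
import numpy as np
import pandas as pd

a = np.array(['foo', 'foo', 'bar', 'bar', 'foo'], dtype=object)
b = np.array(['one', 'two', 'one', 'one', 'one'], dtype=object)

counts = pd.crosstab(a, b, rownames=['a'], colnames=['b'])
print(counts)

# normalize='index' turns each row into proportions that sum to 1.
shares = pd.crosstab(a, b, rownames=['a'], colnames=['b'], normalize='index')
print(shares)
```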

pd.cut()

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
Converts numeric values into intervals (bins).
Typical workflow: bin with pd.cut(), count frequencies with Series.value_counts(), then order with Series.sort_values().
Bin values into discrete intervals.
Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.
Returns: pandas.Categorical, Series, or ndarray

print(type(pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)))
print(type(pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)))
pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)

<class 'pandas.core.arrays.categorical.Categorical'>
<class 'tuple'>

([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.994, 3.0]]
 Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] < (5.0, 7.0]],
 array([0.994, 3.   , 5.   , 7.   ]))

print(pd.cut(np.array([1, 7, 5, 4, 6, 3, 9, 9]), 3, labels=["bad", "medium", "good"]))
pd.cut(np.array([1, 7, 5, 4, 6, 3, 9, 9]), 3, labels=["bad", "medium", "good"]).value_counts().sort_values(ascending=False)

[bad, good, medium, medium, medium, bad, good, good]
Categories (3, object): [bad < medium < good]

good      3
medium    3
bad       2
dtype: int64

s = pd.Series(np.array([2, 4, 6, 8, 10]), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print(pd.cut(s, 3))
pd.cut(s, 3).value_counts()

a     2
b     4
c     6
d     8
e    10
dtype: int64
a    (1.992, 4.667]
b    (1.992, 4.667]
c    (4.667, 7.333]
d     (7.333, 10.0]
e     (7.333, 10.0]
dtype: category
Categories (3, interval[float64]): [(1.992, 4.667] < (4.667, 7.333] < (7.333, 10.0]]

(7.333, 10.0]     2
(1.992, 4.667]    2
(4.667, 7.333]    1
dtype: int64
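
The age-range use case mentioned in the description can be sketched with a pre-specified array of bin edges. The ages, edges and labels below are assumptions picked for illustration.

```python
import pandas as pd

ages = [5, 17, 18, 35, 64, 65, 90]
# Intervals are right-closed by default: (0, 17], (17, 64], (64, 120].
groups = pd.cut(ages, bins=[0, 17, 64, 120], labels=['minor', 'adult', 'senior'])
print(list(groups))
print(pd.Series(groups).value_counts().sort_index())
```

Passing explicit edges keeps the group boundaries stable across datasets, whereas an integer bin count derives the edges from the data's own min and max.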

pd.qcut()

pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
Keeps the number of values in each bin the same.
Quantile-based discretization function. Discretize variable into equal-sized buckets based on rank or based on sample quantiles. For example 1000 values for 10 quantiles would produce a Categorical object indicating quantile membership for each data point.
Comparing pd.cut() and pd.qcut() on the same kind of data makes the difference between them clear.

pd.qcut(np.random.randn(1000000), 5).value_counts() # every bin holds the same number of values

(-4.53, -0.844]     200000
(-0.844, -0.256]    200000
(-0.256, 0.251]     200000
(0.251, 0.84]       200000
(0.84, 4.777]       200000
dtype: int64

pd.cut(np.random.randn(1000000), 5).value_counts() # equal-width bins: the counts follow the shape of the normal distribution

(-4.81, -2.902]       1830
(-2.902, -1.004]    156230
(-1.004, 0.894]     656875
(0.894, 2.793]      182473
(2.793, 4.691]        2592
dtype: int64
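
Because the runs above use random data, the same contrast is sketched here on a deterministic, skewed sample (the squares of 1 to 100, an assumption made for this example) so the bin counts can be checked exactly.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 101) ** 2  # skewed, deterministic data: 1, 4, 9, ..., 10000

# qcut: edges are quantiles, so every bin receives the same number of values.
by_quantile = pd.Series(pd.qcut(x, 4)).value_counts().sort_index()
# cut: edges are equally spaced, so counts depend on where the data is dense.
by_width = pd.Series(pd.cut(x, 4)).value_counts().sort_index()

print(by_quantile.tolist())
print(by_width.tolist())
```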

pd.merge()

pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Merge DataFrame objects by performing a database-style join operation by columns or indexes.
If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
Options for how: inner, outer, left, right
Equivalent to a SQL JOIN

df1=pd.DataFrame({'name':['kate','herz','catherine','sally'], 'age':[25,28,39,35]}) 
df2=pd.DataFrame({'name':['kate','herz','sally'], 'score':[70,60,90]}) 
print(df1)
print('-'*50)
print(df2)
print('-'*50)
print(pd.merge(df1,df2))
print('-'*50)
print(pd.merge(df1,df2, how='outer'))

        name  age
0       kate   25
1       herz   28
2  catherine   39
3      sally   35
--------------------------------------------------
    name  score
0   kate     70
1   herz     60
2  sally     90
--------------------------------------------------
    name  age  score
0   kate   25     70
1   herz   28     60
2  sally   35     90
--------------------------------------------------
        name  age  score
0       kate   25   70.0
1       herz   28   60.0
2  catherine   39    NaN
3      sally   35   90.0

df1.merge(df2)

df1=pd.DataFrame({'name1':['kate','herz','catherine','sally'], 'age':[25,28,39,35]}) 
df2=pd.DataFrame({'name2':['kate','herz','sally'], 'score':[70,60,90]}) 
print(df1.merge(df2, left_on='name1', right_on='name2', how='outer'))
df1.merge(df2, left_on='name1', right_on='name2', how='outer').drop('name2',axis=1).fillna(0)

       name1  age  name2  score
0       kate   25   kate   70.0
1       herz   28   herz   60.0
2  catherine   39    NaN    NaN
3      sally   35  sally   90.0
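
The `indicator` parameter in the signature above is handy for auditing an outer merge. The sketch below reuses the name/age/score frames and lists the rows that found no match; the variable names `merged` and `unmatched` are assumptions.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['kate', 'herz', 'catherine', 'sally'], 'age': [25, 28, 39, 35]})
df2 = pd.DataFrame({'name': ['kate', 'herz', 'sally'], 'score': [70, 60, 90]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(df1, df2, on='name', how='outer', indicator=True)
print(merged)

# Rows that found no partner in df2.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'name'].tolist()
print(unmatched)
```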

pd.concat()

pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
Simply combines data along the chosen axis.
concat does not remove duplicates; use the drop_duplicates method on the result to deduplicate.
Concatenate pandas objects along a particular axis with optional set logic along the other axes.
Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.

df1 = pd.DataFrame([['a', 1], ['b', 2], ['c', 5]], columns=['letter', 'number'])
print(df1)
print('-'*50)
df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])
print(df2)
print('-'*50)
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']], columns=['letter', 'number', 'animal'])
print(df3)
print('-'*50)
print(pd.concat([df1, df2]))
print('-'*50)
print(pd.concat([df1, df3], sort=False, keys=['oneblock', 'twoblock'], names=['block', 'number']))
print('-'*50)
print(pd.concat([df1, df3], sort=False, ignore_index=True)) # discard the original index and build a new one

  letter  number
0      a       1
1      b       2
2      c       5
--------------------------------------------------
  letter  number
0      c       3
1      d       4
--------------------------------------------------
  letter  number animal
0      c       3    cat
1      d       4    dog
--------------------------------------------------
  letter  number
0      a       1
1      b       2
2      c       5
0      c       3
1      d       4
--------------------------------------------------
                letter  number animal
block    number                      
oneblock 0           a       1    NaN
         1           b       2    NaN
         2           c       5    NaN
twoblock 0           c       3    cat
         1           d       4    dog
--------------------------------------------------
  letter  number animal
0      a       1    NaN
1      b       2    NaN
2      c       5    NaN
3      c       3    cat
4      d       4    dog

print(pd.concat([df1, df2], axis=1, join='inner'))
print('-'*50)
print(pd.concat([df1, df2], axis=1, join='outer'))

  letter  number letter  number
0      a       1      c       3
1      b       2      d       4
--------------------------------------------------
  letter  number letter  number
0      a       1      c     3.0
1      b       2      d     4.0
2      c       5    NaN     NaN
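
A minimal sketch of the deduplication point: concatenate first, then call `drop_duplicates`. The frames `part1` and `part2` are illustrative and share one identical row on purpose.

```python
import pandas as pd

part1 = pd.DataFrame([['a', 1], ['b', 2], ['c', 5]], columns=['letter', 'number'])
part2 = pd.DataFrame([['c', 5], ['d', 4]], columns=['letter', 'number'])  # ('c', 5) repeats

stacked = pd.concat([part1, part2], ignore_index=True)        # keeps the repeated row
deduped = stacked.drop_duplicates().reset_index(drop=True)    # removes it
print(len(stacked), len(deduped))
```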

DataFrame.join()

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
The join method aligns on the index by default and performs a left outer join by default (how='left').
Join columns with other DataFrame either on index or on a key column. Efficiently Join multiple DataFrame objects by index at once by passing a list.

df1=pd.DataFrame({'Red':[1,3,5],'Green':[5,0,3]},index=list('abd'))
df2=pd.DataFrame({'Blue':[1,9],'Yellow':[6,6]},index=list('ce'))
df3=pd.DataFrame({'Brown':[3,4,5],'White':[1,1,2]},index=list('aed'))
print(df1.join([df2,df3]))  # left join by default
#df1.join([df2,df3], sort=False, how='outer') # A future version of pandas will change to not sort by default.
df1.join(df2, sort=False, how='outer').join(df3, sort=False, how='outer')

   Red  Green  Blue  Yellow  Brown  White
a    1      5   NaN     NaN    3.0    1.0
b    3      0   NaN     NaN    NaN    NaN
d    5      3   NaN     NaN    5.0    2.0

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                    'key': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({'C': ['C0', 'C1'],
                      'D': ['D0', 'D1']},
                      index=['K0', 'K1'])
right2 = pd.DataFrame({'v': [7, 8, 9]}, index=['K1', 'K1', 'K2'])
print(left)
print(right)
print(right2)
print(left.join(right, on='key').join(right2, on='key'))
result = left.join([right, right2], how='outer')
result

    A   B key
0  A0  B0  K0
1  A1  B1  K1
2  A2  B2  K0
3  A3  B3  K1
     C   D
K0  C0  D0
K1  C1  D1
    v
K1  7
K1  8
K2  9
    A   B key   C   D    v
0  A0  B0  K0  C0  D0  NaN
1  A1  B1  K1  C1  D1  7.0
1  A1  B1  K1  C1  D1  8.0
2  A2  B2  K0  C0  D0  NaN
3  A3  B3  K1  C1  D1  7.0
3  A3  B3  K1  C1  D1  8.0
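
The `lsuffix` and `rsuffix` parameters matter when both frames share a column name. This sketch uses two tiny made-up frames (`left_df`, `right_df`) that both contain a column `v`.

```python
import pandas as pd

left_df = pd.DataFrame({'v': [1, 2]}, index=['x', 'y'])
right_df = pd.DataFrame({'v': [3, 4]}, index=['y', 'z'])

# Without suffixes this join would raise because both frames have a column 'v'.
joined = left_df.join(right_df, lsuffix='_left', rsuffix='_right')
print(joined)
```

Since the default is a left join, only the index labels of `left_df` survive, and the unmatched row gets NaN on the right side.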

pd.get_dummies()

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
Convert categorical variable into dummy/indicator variables
Vectorizes (one-hot encodes) a categorical variable.

df = pd.DataFrame({'key':['b','b','a','c','a','b'], 'data':np.random.randn(6)})
print(df)
print(pd.get_dummies(df['key']))

  key      data
0   b -0.194219
1   b  0.591529
2   a  0.435119
3   c  0.345278
4   a  0.324429
5   b -0.866458
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0

dummies = pd.get_dummies(df['key'],prefix = 'key')
dummies

df_with_dummy = df[['data']].join(dummies) # combine with the numeric column of the original df
print(df_with_dummy)

       data  key_a  key_b  key_c
0 -0.194219      0      1      0
1  0.591529      0      1      0
2  0.435119      1      0      0
3  0.345278      0      0      1
4  0.324429      1      0      0
5 -0.866458      0      1      0

s = ['a', 'b', np.nan, 'a']
print(s)
print('-'*50)
print(pd.get_dummies(s))
print('-'*50)
print(pd.get_dummies(s, dummy_na=True)) # add a column that flags NaN

['a', 'b', nan, 'a']
--------------------------------------------------
   a  b
0  1  0
1  0  1
2  0  0
3  1  0
--------------------------------------------------
   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1
3  1  0    0
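
The 0/1 tables printed above come from an older pandas release. In pandas 2.0 and later the default dummy dtype is bool, so the same calls print True/False. A sketch that requests integer output explicitly through the `dtype` parameter, and also shows `drop_first`, assuming a small made-up key series:

```python
import pandas as pd

keys = pd.Series(['b', 'b', 'a', 'c', 'a', 'b'])

# dtype=int reproduces the 0/1 layout regardless of the pandas default.
dummies_int = pd.get_dummies(keys, prefix='key', dtype=int)
print(dummies_int)

# drop_first=True drops the first category (key_a) to avoid a redundant column.
dummies_reduced = pd.get_dummies(keys, prefix='key', dtype=int, drop_first=True)
print(list(dummies_reduced.columns))
```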

pandas.DataFrame.reorder_levels

index = pd.MultiIndex.from_tuples([('bird', 'falcon'),  # works well in combination with zip()
                                   ('bird', 'parrot'),
                                   ('mammal', 'lion'),
                                   ('mammal', 'monkey')],
                                  names=['class', 'name'])
columns = pd.MultiIndex.from_tuples([('speed', 'max'),
                                     ('species', 'type')])
df = pd.DataFrame([(389.0, 'fly'),
                   ( 24.0, 'fly'),
                   ( 80.5, 'run'),
                   (np.nan, 'jump')],
                  index=index,
                  columns=columns)
print(df)
print('-'*30)
order=[1,0] # passing 3 elements raises an error: Too many levels: Index has only 2 levels, not 3
print(df.reorder_levels(order)) # swaps the positions of the class and name row levels
print('-'*30)
print(df.reorder_levels(order, axis=1)) # swaps the upper and lower column label levels

               speed species
                 max    type
class  name                 
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump
------------------------------
               speed species
                 max    type
name   class                
falcon bird    389.0     fly
parrot bird     24.0     fly
lion   mammal   80.5     run
monkey mammal    NaN    jump
------------------------------
                 max    type
               speed species
class  name                 
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump

pandas.DataFrame.sort_values

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'A'],
    'col2': [3, 1, 9, 8, 7, 2],
    'col3': [0, 1, 9, 4, 2, 3],
})
print(df)
print('-'*30)
print(df.sort_values(by=['col1']))
print('-'*30)
print(df.sort_values(by=['col1', 'col2']))
print('-'*30)
print(df.sort_values(by='col1', ascending=False)) # NaN always sorts last, regardless of direction
print('-'*30)
df.sort_values(by='col1', ascending=False, na_position='first') # with na_position='first', NaN is placed at the top

  col1  col2  col3
0    A     3     0
1    A     1     1
2    B     9     9
3  NaN     8     4
4    D     7     2
5    A     2     3
------------------------------
  col1  col2  col3
0    A     3     0
1    A     1     1
5    A     2     3
2    B     9     9
4    D     7     2
3  NaN     8     4
------------------------------
  col1  col2  col3
1    A     1     1
5    A     2     3
0    A     3     0
2    B     9     9
4    D     7     2
3  NaN     8     4
------------------------------
  col1  col2  col3
4    D     7     2
2    B     9     9
0    A     3     0
1    A     1     1
5    A     2     3
3  NaN     8     4
------------------------------
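
When sorting by several columns, `ascending` can also be a list with one flag per column. A minimal sketch on the same kind of data (the frame `df_sort` is a reduced copy made for this example):

```python
import numpy as np
import pandas as pd

df_sort = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'A'],
    'col2': [3, 1, 9, 8, 7, 2],
})
# col1 ascending, then col2 descending within equal col1 values.
mixed = df_sort.sort_values(by=['col1', 'col2'], ascending=[True, False])
print(mixed)
```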

pandas.DataFrame.sort_index

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'A'],
    'col2': [3, 1, 9, 8, 7, 2],
    'col3': [0, 1, 9, 4, 2, 3],
})
print(df.sort_values(by='col1', ascending=False, na_position='first'))
df.sort_values(by='col1', ascending=False, na_position='first').sort_index()

  col1  col2  col3
3  NaN     8     4
4    D     7     2
2    B     9     9
0    A     3     0
1    A     1     1
5    A     2     3

pandas.DataFrame.nlargest

DataFrame.nlargest(n, columns, keep='first')
Return the first n rows ordered by columns in descending order.
Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.

df = pd.DataFrame({'a': [1, 10, 8, 10, -1, 5, 7],
                   'b': list('abdcefg'),
                   'c': [21.0, 2.0, np.nan, 5.0, 4.0, 2, 1]})
print(df)
print('-'*30)
df['rows_total']=df.apply(lambda x: x['a']+x['c'], axis=1)
print(df.sort_values(by='rows_total', ascending=False))
df.nlargest(5, ['c', 'a'], keep='first')  # orders by 'c' first and uses 'a' only to break ties (not by the row total computed above), then keeps the top n rows

    a  b     c
0   1  a  21.0
1  10  b   2.0
2   8  d   NaN
3  10  c   5.0
4  -1  e   4.0
5   5  f   2.0
6   7  g   1.0
------------------------------
    a  b     c  rows_total
0   1  a  21.0        22.0
3  10  c   5.0        15.0
1  10  b   2.0        12.0
6   7  g   1.0         8.0
5   5  f   2.0         7.0
4  -1  e   4.0         3.0
2   8  d   NaN         NaN
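
With a list of columns, `nlargest` orders by the first column and uses the later ones only to break ties. The sketch below, which rebuilds the same frame under the assumed name `df_n`, compares it with an explicit multi-column sort and with an ordering by the row sum to show that the latter is a different selection.

```python
import numpy as np
import pandas as pd

df_n = pd.DataFrame({'a': [1, 10, 8, 10, -1, 5, 7],
                     'b': list('abdcefg'),
                     'c': [21.0, 2.0, np.nan, 5.0, 4.0, 2, 1]})

top = df_n.nlargest(5, ['c', 'a'])
# Same rows as sorting by c, then a, both descending, and taking the head.
by_sort = df_n.sort_values(['c', 'a'], ascending=False).head(5)
# Ordering by the row sum a + c selects a different set of rows.
by_sum = (df_n['a'] + df_n['c']).sort_values(ascending=False).head(5)

print(list(top.index), list(by_sort.index), list(by_sum.index))
```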

pandas.DataFrame.nsmallest

df = pd.DataFrame({'a': [1, 10, 8, 10, -1, 5, 7],
                   'b': list('abdcefg'),
                   'c': [21.0, 2.0, np.nan, 5.0, 4.0, 2, 1]})
print(df)
print('-'*30)
df['rows_total']=df.apply(lambda x: x['a']+x['c'], axis=1)
print(df.sort_values(by='rows_total', ascending=True))
df.nsmallest(5, ['c', 'a'], keep='first')

    a  b     c
0   1  a  21.0
1  10  b   2.0
2   8  d   NaN
3  10  c   5.0
4  -1  e   4.0
5   5  f   2.0
6   7  g   1.0
------------------------------
    a  b     c  rows_total
4  -1  e   4.0         3.0
5   5  f   2.0         7.0
6   7  g   1.0         8.0
1  10  b   2.0        12.0
3  10  c   5.0        15.0
0   1  a  21.0        22.0
2   8  d   NaN         NaN

pandas.DataFrame.swaplevel

i, j : int, string (can be mixed)
- Level of index to be swapped. Can pass level name as string.

index = pd.MultiIndex.from_tuples([('bird', 'falcon'),  # works well in combination with zip()
                                   ('bird', 'parrot'),
                                   ('mammal', 'lion'),
                                   ('mammal', 'monkey')],
                                  names=['class', 'name'])
columns = pd.MultiIndex.from_tuples([('speed', 'max'),
                                     ('species', 'type')])
df = pd.DataFrame([(389.0, 'fly'),
                   ( 24.0, 'fly'),
                   ( 80.5, 'run'),
                   (np.nan, 'jump')],
                  index=index,
                  columns=columns)
print(df)
print('-'*30)
print(df.swaplevel(i=1, j=0, axis=1))
df.swaplevel(i='name', j='class', axis=0)

               speed species
                 max    type
class  name                 
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump
------------------------------
                 max    type
               speed species
class  name                 
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump

pandas.DataFrame.stack

df_single_level_cols = pd.DataFrame([[0, 1], [2, 3]],
                                    index=['cat', 'dog'],
                                    columns=['weight', 'height'])
print(df_single_level_cols)
print('-'*30)
print(df_single_level_cols.stack().index)
df_single_level_cols.stack()  # pivots the columns into the innermost row index level

     weight  height
cat       0       1
dog       2       3
------------------------------
MultiIndex(levels=[['cat', 'dog'], ['weight', 'height']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

cat  weight    0
     height    1
dog  weight    2
     height    3
dtype: int64

multicol2 = pd.MultiIndex.from_tuples([('weight', 'kg'),
                                       ('height', 'm')])
df_multi_level_cols2 = pd.DataFrame([[np.nan, 2.0], [3.0, 4.0]],
                                    index=['cat', 'dog'],
                                    columns=multicol2)
print(df_multi_level_cols2)
print('-'*30)
print(df_multi_level_cols2.stack().index)
print('-'*30)
print(df_multi_level_cols2.stack(dropna=False)) # keep rows that are entirely NaN; by default they are dropped
df_multi_level_cols2.stack(dropna=True)

    weight height
        kg      m
cat    NaN    2.0
dog    3.0    4.0
------------------------------
MultiIndex(levels=[['cat', 'dog'], ['kg', 'm']],
           labels=[[0, 1, 1], [1, 0, 1]])
------------------------------
        height  weight
cat kg     NaN     NaN
    m      2.0     NaN
dog kg     NaN     3.0
    m      4.0     NaN

pandas.DataFrame.unstack

index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
                                   ('two', 'a'), ('two', 'b')])
df = pd.DataFrame(np.random.randint(1, 100, (4,2)), index=index, columns=['x', 'y'])
print(df)
print('-'*30)
print(df.unstack(level=0))
print('-'*30)
print(df.unstack(level=1))
print('-'*30)
print(df.unstack(level=-1)) # -1 means the last level, which here is level 1

        x   y
one a  27  61
    b  25  34
two a  86  92
    b   6   7
------------------------------
    x       y    
  one two one two
a  27  86  61  92
b  25   6  34   7
------------------------------
      x       y    
      a   b   a   b
one  27  25  61  34
two  86   6  92   7
------------------------------
      x       y    
      a   b   a   b
one  27  25  61  34
two  86   6  92   7
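
The round trip between unstack and stack can be checked directly. The following is a minimal sketch, assuming a small deterministic frame (np.arange instead of the random integers used above) so that the values are reproducible; the names wide and back are illustrative only.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
                                   ('two', 'a'), ('two', 'b')])
# Deterministic values 0..7 instead of random integers
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['x', 'y'])

wide = df.unstack(level=-1)  # the inner row level ('a'/'b') moves into the columns
back = wide.stack()          # the innermost column level moves back into the rows

print(wide)
print(back)
print(back.equals(df))       # compares values and labels of the round trip
```

Because no cell is missing here, nothing is dropped on the way back; with missing combinations, stack may drop all-NaN rows, so the round trip is not guaranteed in general.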

pandas.DataFrame.swapaxes¶

index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
                                   ('two', 'a'), ('two', 'b')])
df = pd.DataFrame(np.random.randint(1, 100, (4,2)), index=index, columns=['x', 'y'])
print(df)
print('-'*30)
print(df.swapaxes(1,0)) # swap the row axis and the column axis

        x   y
one a  35  57
    b  53   3
two a  39   6
    b  90  27
------------------------------
  one     two    
    a   b   a   b
x  35  53  39  90
y  57   3   6  27

df = pd.DataFrame([[0, 1], [2, 3]], index=['cat', 'dog'], columns=['weight', 'height'])
print(df)
print('-'*30)
print(df.swapaxes(1,0))

     weight  height
cat       0       1
dog       2       3
------------------------------
        cat  dog
weight    0    2
height    1    3

pandas.DataFrame.T¶

index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
                                   ('two', 'a'), ('two', 'b')])
df = pd.DataFrame(np.random.randint(1, 100, (4,2)), index=index, columns=['x', 'y'])
print(df)
print('-'*30)
print(df.T) # swap the row axis and the column axis

        x   y
one a  44  25
    b  15  73
two a  93  90
    b  59  19
------------------------------
  one     two    
    a   b   a   b
x  44  15  93  59
y  25  73  90  19

df = pd.DataFrame([[0, 1], [2, 3]], index=['cat', 'dog'], columns=['weight', 'height'])
print(df)
print('-'*30)
print(df.T)

     weight  height
cat       0       1
dog       2       3
------------------------------
        cat  dog
weight    0    2
height    1    3

pandas.DataFrame.transpose¶

df = pd.DataFrame([[0, 1], [2, 3]], index=['cat', 'dog'], columns=['weight', 'height'])
print(df)
print('-'*30)
print(df.transpose())

     weight  height
cat       0       1
dog       2       3
------------------------------
        cat  dog
weight    0    2
height    1    3

index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
                                   ('two', 'a'), ('two', 'b')])
df = pd.DataFrame(np.random.randint(1, 100, (4,2)), index=index, columns=['x', 'y'])
print(df)
print('-'*30)
print(df.transpose()) # swap the row axis and the column axis

        x   y
one a   7  89
    b  37  15
two a  71  86
    b  43  16
------------------------------
  one     two    
    a   b   a   b
x   7  37  71  43
y  89  15  86  16
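
For a 2-D DataFrame, swapaxes(1, 0), .T and transpose() all produce the same result. The sketch below checks the last two against each other on the cat/dog frame from above. swapaxes is left out on purpose: to my knowledge it is deprecated since pandas 2.1 in favor of transpose.

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [2, 3]], index=['cat', 'dog'],
                  columns=['weight', 'height'])

t = df.T                         # property form
print(t.equals(df.transpose()))  # method form gives the same frame
print(t.T.equals(df))            # transposing twice restores the original
```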

pandas.DataFrame.append¶

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
print(df1)
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
print(df2)
df1.append(df2, sort=False, ignore_index=True)

   A  B
0  1  2
1  3  4
   A  B
0  5  6
1  7  8
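
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above only runs on older versions. A minimal sketch of the same operation with pd.concat, which works on old and new versions alike (the name combined is illustrative):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# Stack the frames row-wise; ignore_index=True renumbers the rows 0..n-1
combined = pd.concat([df1, df2], ignore_index=True, sort=False)
print(combined)
```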

pandas.DataFrame.assign¶

df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
print(df)
print('-'*30)
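print(df.assign(total = lambda x:x['A']+ x['B'])) # arithmetic between columns; for row-wise arithmetic, shift first, or transpose, compute, and transpose back
print('-'*30)
print(df.assign(total = lambda x:np.sum([x.B]))) # the column total, repeated on every row; useful for computing each row's share of the total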
print('-'*30)
print(df.assign(ln_A = lambda x: np.log(x.A)))

    A         B
0   1  1.591474
1   2 -0.057835
2   3  1.953241
3   4 -1.062957
4   5  1.014981
5   6 -0.767510
6   7  0.773790
7   8 -0.311985
8   9  0.315448
9  10  0.604212
------------------------------
    A         B      total
0   1  1.591474   2.591474
1   2 -0.057835   1.942165
2   3  1.953241   4.953241
3   4 -1.062957   2.937043
4   5  1.014981   6.014981
5   6 -0.767510   5.232490
6   7  0.773790   7.773790
7   8 -0.311985   7.688015
8   9  0.315448   9.315448
9  10  0.604212  10.604212
------------------------------
    A         B     total
0   1  1.591474  4.052859
1   2 -0.057835  4.052859
2   3  1.953241  4.052859
3   4 -1.062957  4.052859
4   5  1.014981  4.052859
5   6 -0.767510  4.052859
6   7  0.773790  4.052859
7   8 -0.311985  4.052859
8   9  0.315448  4.052859
9  10  0.604212  4.052859
------------------------------
    A         B      ln_A
0   1  1.591474  0.000000
1   2 -0.057835  0.693147
2   3  1.953241  1.098612
3   4 -1.062957  1.386294
4   5  1.014981  1.609438
5   6 -0.767510  1.791759
6   7  0.773790  1.945910
7   8 -0.311985  2.079442
8   9  0.315448  2.197225
9  10  0.604212  2.302585
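
The "share of the total" idea mentioned in the comments can be written in one assign call. This is a small sketch with made-up values; the column name share_B is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'A': range(1, 5), 'B': [10.0, 20.0, 30.0, 40.0]})

# Each row's B divided by the column total; assign returns a new frame
out = df.assign(share_B=lambda x: x['B'] / x['B'].sum())
print(out)
print('share_B' in df.columns)  # the original frame is left untouched
```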

pandas.DataFrame.update¶

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [400, 500, 600]})
new_df = pd.DataFrame({'B': [4, 5, 6],
                       'C': [7, 8, 9]})
print(df)
print(new_df)
df.update(new_df)
df

   A    B
0  1  400
1  2  500
2  3  600
   B  C
0  4  7
1  5  8
2  6  9

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [400, 500, 600]})
new_df = pd.DataFrame({'B': [4, np.nan, 6],
                       'C': [7, 8, 9]})
print(df)
print(new_df)
df.update(new_df) # where new_df holds NaN, the original value in df is kept
df

   A    B
0  1  400
1  2  500
2  3  600
     B  C
0  4.0  7
1  NaN  8
2  6.0  9
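
The updated frames are not printed above, so here is a sketch that makes the behavior explicit. It assumes float columns throughout to avoid dtype changes when NaN is involved; otherwise it mirrors the example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [400.0, 500.0, 600.0]})
new_df = pd.DataFrame({'B': [4.0, np.nan, 6.0], 'C': [7.0, 8.0, 9.0]})

result = df.update(new_df)  # modifies df in place and returns None
print(result)
print(df)                   # B is overwritten except where new_df has NaN; C is not added
```

Only labels that already exist in df are updated, which is why column C from new_df does not appear in the result.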

Missing data handling¶

pandas.isnull(obj)¶

Detect missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).

print(pd.isnull('dog'))
print(pd.isnull(np.nan))
print(pd.isnull(None))

False
True
True

df = pd.DataFrame([['ant', 'bee', np.nan], ['dog', None, 'fly']])
print(df)
pd.isnull(df)

     0     1    2
0  ant   bee  NaN
1  dog  None  fly

pandas.isna(obj)¶

Detect missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).

print(pd.isna('dog'))
print(pd.isna(np.nan))
print(pd.isna(None))

False
True
True

df = pd.DataFrame([['ant', 'bee', np.nan], ['dog', None, 'fly']])
print(df)
pd.isna(df)

     0     1    2
0  ant   bee  NaN
1  dog  None  fly

pandas.notnull(obj)¶

Detect non-missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are valid (not missing, which is NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).

print(pd.notnull('dog'))
print(pd.notnull(np.nan))
print(pd.notnull(None))

True
False
False

df = pd.DataFrame([['ant', 'bee', np.nan], ['dog', None, 'fly']])
print(df)
pd.notnull(df)

     0     1    2
0  ant   bee  NaN
1  dog  None  fly

pandas.notna(obj)¶

Detect non-missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are valid (not missing, which is NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).

print(pd.notna('dog'))
print(pd.notna(np.nan))
print(pd.notna(None))

True
False
False

df = pd.DataFrame([['ant', 'bee', np.nan], ['dog', None, 'fly']])
print(df)
pd.notna(df)

     0     1    2
0  ant   bee  NaN
1  dog  None  fly
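
The four functions above come in two pairs: isnull/isna give the same mask, and notnull/notna give its element-wise negation. The DataFrame results are not printed above, so the sketch below checks these relations and counts missing cells per column (missing_per_column is an illustrative name).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['ant', 'bee', np.nan], ['dog', None, 'fly']])

mask = pd.isna(df)
print(mask)
print(pd.isnull(df).equals(mask))       # isnull and isna agree
print(pd.notna(df).equals(~mask))       # notna is the negation of isna

missing_per_column = df.isna().sum()    # True counts as 1 when summed
print(missing_per_column)
```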

pandas.DataFrame.dropna¶

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),pd.NaT]
                  })
print(df)
print('~-'*25)
print(df.dropna())
print('~-'*25)
# Drop the columns where at least one element is missing.
print(df.dropna(axis='columns'))
print('~-'*25)
# Drop the rows where all elements are missing.
print(df.dropna(how='all'))
print('~-'*25)
# Keep only the rows with at least 2 non-NA values.
print('thresh=2\n',df.dropna(thresh=2)) # keep rows with at least 2 non-NA values
print('~-'*25)
# Define in which columns to look for missing values.
print(df.dropna(subset=['name', 'toy']))
print('~-'*25)
# Keep the DataFrame with valid entries in the same variable.
df.dropna(inplace=True) # inplace=True modifies the original DataFrame
df

       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
     name        toy       born
1  Batman  Batmobile 1940-04-25
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
       name
0    Alfred
1    Batman
2  Catwoman
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
thresh=2
        name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-

pandas.DataFrame.fillna¶

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series.
pad / ffill: propagate the last valid observation forward to the next valid one.
backfill / bfill: use the next valid observation to fill the gap.

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                   columns=list('ABCD'))
print(df)
print('~-'*25)
print(df.fillna(0))
print('~-'*25)
# We can also propagate non-null values forward or backward.
print('use method:\n', df.fillna(method='bfill', axis=1))
print('~-'*25)
#Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
print(df.fillna(value=values))
print('~-'*25)
#Only replace the first NaN element.
df.fillna(value=values, limit=1)

     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5
3  0.0  3.0  0.0  4
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
use method:
      A    B    C    D
0  2.0  2.0  0.0  0.0
1  3.0  4.0  1.0  1.0
2  5.0  5.0  5.0  5.0
3  3.0  3.0  4.0  4.0
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
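
Passing method= to fillna is deprecated as of pandas 2.1, as far as I know. The dedicated ffill and bfill methods do the same job and also exist in older versions. A minimal sketch on the same data (filled_back and filled_fwd are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

filled_back = df.bfill(axis=1)  # along each row, take the next valid value to the right
filled_fwd = df.ffill()         # down each column, carry the last valid value forward

print(filled_back)
print(filled_fwd)
```

Column C stays entirely NaN under ffill because it has no valid observation to propagate.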

pandas.DataFrame.replace¶

DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
Replace values given in to_replace with value.
With regex=True, `to_replace` is interpreted as a regular expression.

df = pd.DataFrame({'A': [9, 0, 8, 7, 4],
                   'B': [5, 0, 1, 3, 3],
                   'C': ['a', 'b', 'c', 'd', 'e']})
print(df)
print('~-'*25)
print(df.replace(0, 5))
print('~-'*25)
print(df.replace([0, 1, 2, 3], 4)) # replace every element equal to 0, 1, 2 or 3 with 4
print(df.replace([0, 1, 2, 3], [4, 3, 2, 1]))
print('~-'*25)
print(df.replace({0: 10, 1: 100}))
print('~-'*25)
#Regular expression `to_replace`
df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
                   'B': ['abc', 'bar', 'xyz']})
print(df)
print(df.replace(regex={r'^ba.$':'new', 'foo':'xyz'}))
df.replace(to_replace=r'^ba.*$', value='new', regex=True)

   A  B  C
0  9  5  a
1  0  0  b
2  8  1  c
3  7  3  d
4  4  3  e
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
   A  B  C
0  9  5  a
1  5  5  b
2  8  1  c
3  7  3  d
4  4  3  e
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
   A  B  C
0  9  5  a
1  4  4  b
2  8  4  c
3  7  4  d
4  4  4  e
   A  B  C
0  9  5  a
1  4  4  b
2  8  3  c
3  7  1  d
4  4  1  e
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
    A    B  C
0   9    5  a
1  10   10  b
2   8  100  c
3   7    3  d
4   4    3  e
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
      A    B
0   bat  abc
1   foo  bar
2  bait  xyz
      A    B
0   new  abc
1   xyz  new
2  bait  xyz

Top-level conversions¶

pd.to_numeric(arg, errors='raise', downcast=None)¶

Convert argument to a numeric type.

df = pd.DataFrame([['1', '5', np.nan], [None, '9', '7']])
print(df)
print('-'*50)
print(pd.to_numeric(df[0]))     # column 0; to_numeric takes a list or Series, not a DataFrame
print('-'*50)
print(pd.to_numeric(df.T[0].T)) # row 0, selected through a transpose, as a Series

      0  1    2
0     1  5  NaN
1  None  9    7
--------------------------------------------------
0    1.0
1    NaN
Name: 0, dtype: float64
--------------------------------------------------
0    1.0
1    5.0
2    NaN
Name: 0, dtype: float64
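
Since to_numeric works on one-dimensional input, a whole DataFrame can be converted by applying it column by column. The sketch below also shows errors='coerce', which turns unparseable entries into NaN instead of raising; the sample Series is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['1', '5', np.nan], [None, '9', '7']])

numeric_df = df.apply(pd.to_numeric)  # to_numeric is called once per column
print(numeric_df)
print(numeric_df.dtypes)

# 'x' cannot be parsed; with errors='coerce' it becomes NaN
coerced = pd.to_numeric(pd.Series(['1', 'x', None]), errors='coerce')
print(coerced)
```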

Top-level dealing with datetimelike¶

pandas.to_datetime()¶

pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, box=True, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=False)
Convert argument to datetime.

df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5]
                  })
pd.to_datetime(df)

0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]

pd.to_datetime(
      np.arange(8)
    , unit='ns'
    , origin=pd.Timestamp('2018-11-06')
)
# unit : string, default ‘ns’
# unit of the arg (D,s,ms,us,ns) denote the unit, 
# which is an integer or float number. 
# This will be based off the origin. 
# Example, with unit=’ms’ and origin=’unix’ (the default), 
# this would calculate the number of milliseconds to the unix epoch start.

DatetimeIndex([          '2018-11-06 00:00:00',
               '2018-11-06 00:00:00.000000001',
               '2018-11-06 00:00:00.000000002',
               '2018-11-06 00:00:00.000000003',
               '2018-11-06 00:00:00.000000004',
               '2018-11-06 00:00:00.000000005',
               '2018-11-06 00:00:00.000000006',
               '2018-11-06 00:00:00.000000007'],
              dtype='datetime64[ns]', freq=None)
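
The unit and origin parameters are easier to read with a coarser unit than nanoseconds. A short sketch counting whole days from the same origin, plus errors='coerce' for unparseable input (the strings are made up for illustration):

```python
import pandas as pd

# 0, 1 and 2 days after the chosen origin
days = pd.to_datetime([0, 1, 2], unit='D', origin=pd.Timestamp('2018-11-06'))
print(days)

# Unparseable strings become NaT instead of raising
parsed = pd.to_datetime(['2018-11-06', 'not a date'], errors='coerce')
print(parsed)
```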

pandas.to_timedelta¶

pandas.to_timedelta(arg, unit='ns', box=True, errors='raise')
Convert argument to timedelta

pd.to_timedelta(np.arange(8), unit='D')
# unit : unit of the arg (D,h,m,s,ms,us,ns) denote the unit, which is an integer/float number

TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days', '5 days',
                '6 days', '7 days'],
               dtype='timedelta64[ns]', freq=None)

pandas.date_range()¶

Return a fixed frequency DatetimeIndex.

Syntax

pandas.date_range(
          start=None
        , end=None
        , periods=None
        , freq='D'
        , tz=None
        , normalize=False
        , name=None
        , closed=None
        , **kwargs
)

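This function generates a DatetimeIndex with a fixed frequency.
Two of start, end, and periods must be specified when calling it; otherwise an error is raised.

Main parameters

periods: number of periods to generate; an integer or None
freq: date offset; a string or DateOffset, default 'D'
normalize: if True, normalize the start and end values to midnight before generating the range
name: name of the resulting index; a string or None
closed: which endpoints to keep. With closed=None both start and end are included; closed='left' keeps start and drops end (left-closed, right-open); closed='right' drops start and keeps end (left-open, right-closed)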

All parameters

start : str or datetime-like, optional
Left bound for generating dates.
end : str or datetime-like, optional
Right bound for generating dates.
periods : integer, optional
Number of periods to generate.
freq : str or DateOffset, default ‘D’ (calendar daily)
Frequency strings can have multiples, e.g. ‘5H’. See the offset aliases table below for a list of frequency aliases.
tz : str or tzinfo, optional
Time zone name for returning localized DatetimeIndex, for example ‘Asia/Hong_Kong’. By default, the resulting DatetimeIndex is timezone-naive.
normalize : bool, default False
Normalize start/end dates to midnight before generating date range.
name : str, default None
Name of the resulting DatetimeIndex.
closed : {None, ‘left’, ‘right’}, optional
Make the interval closed with respect to the given frequency to the ‘left’, ‘right’, or both sides (None, the default).
**kwargs :
For compatibility. Has no effect on the result.
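
A small sketch of how the parameters combine: the same three-day range is specified once from its start and once from its end, and normalize=True moves a start time to midnight. The closed parameter is not used here because, to my knowledge, it was renamed to inclusive in later pandas versions.

```python
import pandas as pd

by_start = pd.date_range(start='2018-01-01', periods=3)  # start + periods
by_end = pd.date_range(end='2018-01-03', periods=3)      # end + periods
print(by_start.equals(by_end))

# The 10:30 start time is normalized to 00:00 before the range is generated
normalized = pd.date_range(start='2018-01-01 10:30', periods=2, normalize=True)
print(normalized)
```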

freq Offset Aliases

A number of string aliases are given to useful common time series frequencies. We will refer to these aliases as offset aliases.

Alias	Description
B	business day frequency
C	custom business day frequency
D	calendar day frequency
W	weekly frequency
M	month end frequency
SM	semi-month end frequency (15th and end of month)
BM	business month end frequency
CBM	custom business month end frequency
MS	month start frequency
SMS	semi-month start frequency (1st and 15th)
BMS	business month start frequency
CBMS	custom business month start frequency
Q	quarter end frequency
BQ	business quarter end frequency
QS	quarter start frequency
BQS	business quarter start frequency
A, Y	year end frequency
BA, BY	business year end frequency
AS, YS	year start frequency
BAS, BYS	business year start frequency
BH	business hour frequency
H	hourly frequency
T, min	minutely frequency
S	secondly frequency
L, ms	milliseconds
U, us	microseconds
N	nanoseconds

pd.date_range(start='1/1/2018', end='1/11/2018')

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10', '2018-01-11'],
              dtype='datetime64[ns]', freq='D')

# freq='W-MON': weekly, anchored on Mondays
dtindex = pd.date_range(start='10/28/2018', end='11/30/2018', freq='W-MON') 
dtindex

DatetimeIndex(['2018-10-29', '2018-11-05', '2018-11-12', '2018-11-19',
               '2018-11-26'],
              dtype='datetime64[ns]', freq='W-MON')

# freq='W': weekly, anchored on Sundays (same as 'W-SUN'); the start date 2018-10-28 happens to be a Sunday
dtindex = pd.date_range(start='10/28/2018', end='11/30/2018', freq='W') 
dtindex

DatetimeIndex(['2018-10-28', '2018-11-04', '2018-11-11', '2018-11-18',
               '2018-11-25'],
              dtype='datetime64[ns]', freq='W-SUN')

# freq='M': monthly, on the last day of each month
dtindex = pd.date_range(start='1/28/2018', end='11/30/2018', freq='M') 
dtindex

DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30'],
              dtype='datetime64[ns]', freq='M')

# freq=pd.offsets.MonthBegin(3), i.e. '3MS': every 3 months, on the first day of the month
dtindex = pd.date_range(start='1/28/2018', end='11/30/2018', freq=pd.offsets.MonthBegin(3)) 
dtindex

DatetimeIndex(['2018-02-01', '2018-05-01', '2018-08-01', '2018-11-01'], dtype='datetime64[ns]', freq='3MS')

# freq='SM': semi-monthly, on the 15th and the last day of each month
dtindex = pd.date_range(start='10/28/2018', end='11/30/2018', freq='SM') 
dtindex

DatetimeIndex(['2018-10-31', '2018-11-15', '2018-11-30'], dtype='datetime64[ns]', freq='SM-15')

# freq='SM-2': semi-monthly, on the 2nd and the last day of each month
dtindex = pd.date_range(start='10/28/2018', end='11/30/2018', freq='SM-2') 
dtindex

DatetimeIndex(['2018-10-31', '2018-11-02', '2018-11-30'], dtype='datetime64[ns]', freq='SM-2')

# freq='Q': the last day of each quarter (same as 'Q-DEC')
dtindex = pd.date_range(start='4/28/2018', end='12/31/2018', freq='Q') 
dtindex

DatetimeIndex(['2018-06-30', '2018-09-30', '2018-12-31'], dtype='datetime64[ns]', freq='Q-DEC')

# freq='Q-NOV': the year ends in November, so every quarter end falls one month earlier than with 'Q-DEC'
dtindex = pd.date_range(start='4/28/2018', end='12/31/2018', freq='Q-NOV')
dtindex

DatetimeIndex(['2018-05-31', '2018-08-31', '2018-11-30'], dtype='datetime64[ns]', freq='Q-NOV')

# freq='Q-OCT': the year ends in October, so every quarter end falls two months earlier than with 'Q-DEC'
dtindex = pd.date_range(start='4/28/2018', end='12/31/2018', freq='Q-OCT')
dtindex

DatetimeIndex(['2018-04-30', '2018-07-31', '2018-10-31'], dtype='datetime64[ns]', freq='Q-OCT')

# pd.date_range().to_period(): shown as year-month periods, without the day
dtindex = pd.date_range(start='4/28/2018', end='12/31/2018', freq='M').to_period()
dtindex

PeriodIndex(['2018-04', '2018-05', '2018-06', '2018-07', '2018-08', '2018-09',
             '2018-10', '2018-11', '2018-12'],
            dtype='period[M]', freq='M')

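# pd.date_range().to_period().to_timestamp(): shown as the first day of each month
# December is missing because the range ends on 12/30, before the month end (12/31),
# so date_range(freq='M') never generates a December date and no 2018-12-01 appears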
dtindex = pd.date_range(start='4/28/2018', end='12/30/2018', freq='M').to_period().to_timestamp()
dtindex

DatetimeIndex(['2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01',
               '2018-08-01', '2018-09-01', '2018-10-01', '2018-11-01'],
              dtype='datetime64[ns]', freq='MS')

pandas.period_range()¶

pandas.period_range(start=None, end=None, periods=None, freq='D', name=None)
Return a fixed frequency PeriodIndex, with day (calendar) as the default frequency

pd.period_range(start='2017-01-01', end='2018-01-01', freq='M')

PeriodIndex(['2017-01', '2017-02', '2017-03', '2017-04', '2017-05', '2017-06',
             '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12',
             '2018-01'],
            dtype='period[M]', freq='M')

pandas.timedelta_range()¶

pandas.timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None)
Return a fixed frequency TimedeltaIndex, with day as the default frequency

pd.timedelta_range(start='1 day', periods=4)

TimedeltaIndex(['1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq='D')

Top-level dealing with intervals¶

pandas.interval_range()¶

pandas.interval_range(start=None, end=None, periods=None, freq=None, name=None, closed='right')
Return a fixed frequency IntervalIndex

pd.interval_range(start=0, end=6)

IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5], (5, 6]]
              closed='right',
              dtype='interval[int64]')

pd.interval_range(start=0, end=6, periods=4)

IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]]
              closed='right',
              dtype='interval[float64]')

pd.interval_range(start=pd.Timestamp('2017-01-01'),end=pd.Timestamp('2017-01-04'))

IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03], (2017-01-03, 2017-01-04]]
              closed='right',
              dtype='interval[datetime64[ns]]')

pd.interval_range(start=0, periods=4, freq=1.5)

IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]]
              closed='right',
              dtype='interval[float64]')

Attributes and underlying data¶

pandas.DataFrame.index¶

df=pd.DataFrame([[1.,2.,3.],[7.,8.,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
df.index

     X    Y    Z
a  1.0  2.0  3.0
b  7.0  8.0  NaN

Index(['a', 'b'], dtype='object')

pandas.DataFrame.columns¶

df=pd.DataFrame([[1.,2.,3.],[7.,8.,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
df.columns

     X    Y    Z
a  1.0  2.0  3.0
b  7.0  8.0  NaN

Index(['X', 'Y', 'Z'], dtype='object')

pandas.DataFrame.dtypes¶

df=pd.DataFrame([[1.,2.,3.],[7.,8.,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
df.dtypes

     X    Y    Z
a  1.0  2.0  3.0
b  7.0  8.0  NaN

X    float64
Y    float64
Z    float64
dtype: object

pandas.DataFrame.ftypes¶

df=pd.DataFrame([[1.,2.,3.],[7.,8.,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
df.ftypes

     X    Y    Z
a  1.0  2.0  3.0
b  7.0  8.0  NaN

X    float64:dense
Y    float64:dense
Z    float64:dense
dtype: object

pandas.DataFrame.get_dtype_counts¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
df.get_dtype_counts()

   X  Y    Z
a  1  2  3.0
b  7  8  NaN

float64    1
int64      1
object     1
dtype: int64
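
get_dtype_counts (like ftypes above) was removed in pandas 1.0. A sketch of the usual replacement, counting the entries of dtypes, on the same frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['1', 2, 3.], ['7', 8, np.nan]],
                  index=['a', 'b'], columns=list('XYZ'))

counts = df.dtypes.value_counts()  # number of columns per dtype
print(counts)
```

The exact dtype shown for the string column depends on the pandas version, so only the counts are relied on here.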

pandas.DataFrame.select_dtypes¶

DataFrame.select_dtypes(include=None, exclude=None)

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
print('-'*50)
print(df.select_dtypes(include='float'))
print('-'*50)
print(df.select_dtypes(include='int'))
print('-'*50)
print(df.select_dtypes(include='bool'))
print('-'*50)
print(df.select_dtypes(exclude=['int','bool']))

   X  Y    Z
a  1  2  3.0
b  7  8  NaN
--------------------------------------------------
     Z
a  3.0
b  NaN
--------------------------------------------------
   Y
a  2
b  8
--------------------------------------------------
Empty DataFrame
Columns: []
Index: [a, b]
--------------------------------------------------
   X    Z
a  1  3.0
b  7  NaN
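
select_dtypes also accepts the generic selector 'number', which covers integer and float columns in one call. A minimal sketch on the same frame (numeric and other are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['1', 2, 3.], ['7', 8, np.nan]],
                  index=['a', 'b'], columns=list('XYZ'))

numeric = df.select_dtypes(include='number')  # int and float columns
other = df.select_dtypes(exclude='number')    # everything else
print(numeric)
print(other)
```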

pandas.DataFrame.values¶

The underlying data as an array, without the column and index labels.

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
print('-'*50)
df.values

   X  Y    Z
a  1  2  3.0
b  7  8  NaN
--------------------------------------------------

array([['1', 2, 3.0],
       ['7', 8, nan]], dtype=object)

pandas.DataFrame.get_values¶

The underlying data as an array, without the column and index labels.
This is the same as .values for non-sparse data.
For sparse data contained in a pandas.SparseArray, the data are first converted to a dense representation.

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
print('-'*50)
df.get_values()

   X  Y    Z
a  1  2  3.0
b  7  8  NaN
--------------------------------------------------

array([['1', 2, 3.0],
       ['7', 8, nan]], dtype=object)
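
get_values was removed in pandas 1.0. The sketch below uses to_numpy, available since pandas 0.24, which returns the same kind of array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['1', 2, 3.], ['7', 8, np.nan]],
                  index=['a', 'b'], columns=list('XYZ'))

arr = df.to_numpy()  # NumPy array of the cell values, labels dropped
print(type(arr), arr.shape)
print(arr)
```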

pandas.DataFrame.axes¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
print('-'*50)
df.axes

   X  Y    Z
a  1  2  3.0
b  7  8  NaN
--------------------------------------------------

[Index(['a', 'b'], dtype='object'), Index(['X', 'Y', 'Z'], dtype='object')]

pandas.DataFrame.ndim¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
print('-'*50)
df.ndim

   X  Y    Z
a  1  2  3.0
b  7  8  NaN
--------------------------------------------------

2

pandas.DataFrame.size¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
print('-'*50)
df.size

   X  Y    Z
a  1  2  3.0
b  7  8  NaN
--------------------------------------------------

6

pandas.DataFrame.shape¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
print('-'*50)
df.shape

   X  Y    Z
a  1  2  3.0
b  7  8  NaN
--------------------------------------------------

(2, 3)

pandas.DataFrame.memory_usage¶

Return the memory usage of each column in bytes.

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
print('-'*50)
print(df.memory_usage())
print('-'*50)
print(df.memory_usage(index=False))
print('-'*50)
print(df.memory_usage(deep=True))
print('-'*50)
df.info()

   X  Y    Z
a  1  2  3.0
b  7  8  NaN
--------------------------------------------------
Index    16
X        16
Y        16
Z        16
dtype: int64
--------------------------------------------------
X    16
Y    16
Z    16
dtype: int64
--------------------------------------------------
Index    132
X        132
Y         16
Z         16
dtype: int64
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, a to b
Data columns (total 3 columns):
X    2 non-null object
Y    2 non-null int64
Z    1 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 64.0+ bytes

pandas.DataFrame.empty¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
print('-'*50)
df.empty

   X  Y    Z
a  1  2  3.0
b  7  8  NaN
--------------------------------------------------

False

Conversion¶

pandas.DataFrame.astype()¶

DataFrame.astype(dtype, copy=True, errors='raise', **kwargs)
Cast a pandas object to a specified dtype.

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df)
print('-'*50)
print(df.dtypes)
print('-'*50)
print(df.fillna(0).astype('int64').dtypes)
df.fillna(0).astype('int64')

   X  Y    Z
a  1  2  3.0
b  7  8  NaN
--------------------------------------------------
X     object
Y      int64
Z    float64
dtype: object
--------------------------------------------------
X    int64
Y    int64
Z    int64
dtype: object
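
astype also accepts a dict mapping column names to dtypes, which converts only the listed columns and avoids the fillna step when a NaN column should stay float. A minimal sketch on the same frame (converted is an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['1', 2, 3.], ['7', 8, np.nan]],
                  index=['a', 'b'], columns=list('XYZ'))

# Convert X from strings to integers and Y to float; Z is left as it is
converted = df.astype({'X': 'int64', 'Y': 'float64'})
print(converted)
print(converted.dtypes)
```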

pandas.DataFrame.copy¶

deep : bool, default True
Make a deep copy, including a copy of the data and the indices.
With deep=False neither the indices nor the data are copied.

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
df_c=df.copy(deep=True)
print(df)
print(df_c)
print('-'*50)
df_c.iloc[0,0]=100
print(df)
print(df_c)

   X  Y    Z
a  1  2  3.0
b  7  8  NaN
   X  Y    Z
a  1  2  3.0
b  7  8  NaN
--------------------------------------------------
   X  Y    Z
a  1  2  3.0
b  7  8  NaN
     X  Y    Z
a  100  2  3.0
b    7  8  NaN

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
df_c=df.copy(deep=False)
print(df)
print(df_c)
print('-'*50)
df_c.iloc[0,0]=100
print(df)
print(df_c)

   X  Y    Z
a  1  2  3.0
b  7  8  NaN
   X  Y    Z
a  1  2  3.0
b  7  8  NaN
--------------------------------------------------
     X  Y    Z
a  100  2  3.0
b    7  8  NaN
     X  Y    Z
a  100  2  3.0
b    7  8  NaN

Indexing, iteration¶

pandas.DataFrame.head¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
df.head(1)

pandas.DataFrame.at¶

Access a single element by its row and column labels.

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
df.at['b','Y']

8

pandas.DataFrame.iat¶

Access a single element by its integer row and column positions.

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
df.iat[1,1]

8

pandas.DataFrame.loc¶

Access a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array.

df=pd.DataFrame([['1',2,3.],['7',8,np.nan]], index=['a','b'], columns=list('XYZ'))
print(df.loc['b'])
print('-'*50)
print(df.loc[['b','a']])
print('-'*50)
print(df.loc['a':'b','Z'])
print('-'*50)
print(
    df.loc[
        df['Y']>5,['Z']
          ]
     )
print('-'*50)
print(
    df.loc[
        df['Y']>5
          ]
     )
print('-'*50)
print(
    df.loc[
        ['b','a'],['Y']
          ]
     )
print('-'*50)
print(
    df.loc[
        ['b','a'],'X':'Z'
          ]
     )
print('-'*50)
print(
    df.loc[
        :,'Y':'Z'
          ]
     )

X      7
Y      8
Z    NaN
Name: b, dtype: object
--------------------------------------------------
   X  Y    Z
b  7  8  NaN
a  1  2  3.0
--------------------------------------------------
a    3.0
b    NaN
Name: Z, dtype: float64
--------------------------------------------------
    Z
b NaN
--------------------------------------------------
   X  Y   Z
b  7  8 NaN
--------------------------------------------------
   Y
b  8
a  2
--------------------------------------------------
   X  Y    Z
b  7  8  NaN
a  1  2  3.0
--------------------------------------------------
   Y    Z
a  2  3.0
b  8  NaN

pandas.DataFrame.iloc¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan],['10',np.nan,9]], index=['a','b','c'], columns=list('XYZ'))
print(df.iloc[1])
print('-'*50)
print(df.iloc[[2,1]])
print('-'*50)
print(df.iloc[0:1,1])
print('-'*50)
print(
    df.iloc[
        [0,1],[2]
          ]
     )
print('-'*50)
print(
    df.iloc[
        :,0:2
          ]
     )
print('-'*50)
print(
    df.iloc[
        [2,1],0:3
          ]
     )
print('-'*50)
print(
    df[df.Z > 5]
     )
print('-'*50)
print(
    df[df.Z > 1].X
     )

X      7
Y      8
Z    NaN
Name: b, dtype: object
--------------------------------------------------
    X    Y    Z
c  10  NaN  9.0
b   7  8.0  NaN
--------------------------------------------------
a    2.0
Name: Y, dtype: float64
--------------------------------------------------
     Z
a  3.0
b  NaN
--------------------------------------------------
    X    Y
a   1  2.0
b   7  8.0
c  10  NaN
--------------------------------------------------
    X    Y    Z
c  10  NaN  9.0
b   7  8.0  NaN
--------------------------------------------------
    X   Y    Z
c  10 NaN  9.0
--------------------------------------------------
a     1
c    10
Name: X, dtype: object

pandas.DataFrame.insert¶

DataFrame.insert(loc, column, value, allow_duplicates=False)
Insert column into DataFrame at specified location.
Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.
Inserts a column at the specified position.
To add a row instead, concat the new row at the end.

df=pd.DataFrame([['1',2,3.],['7',8,np.nan],['10',np.nan,9]], index=['a','b','c'], columns=list('XYZ'))
print(df)
print('-'*50)
df.insert(1,'AAA+',['100',np.nan,300.])
print(df)
pd.concat([df, pd.DataFrame([['1',2,3.,4]], columns=list('XYZJ')) ], ignore_index=True, sort=False)

    X    Y    Z
a   1  2.0  3.0
b   7  8.0  NaN
c  10  NaN  9.0
--------------------------------------------------
    X AAA+    Y    Z
a   1  100  2.0  3.0
b   7  NaN  8.0  NaN
c  10  300  NaN  9.0

pandas.DataFrame.items¶

pandas.DataFrame.iteritems is an older alias of items; it was removed in pandas 2.0, so prefer items.

df=pd.DataFrame([['1',2,3.],['7',8,np.nan],['10',np.nan,9]], index=['a','b','c'], columns=list('XYZ'))
for i in df.items():   # iterate over the columns as (label, Series) pairs
    print(i)
    print('-'*30)

('X', a     1
b     7
c    10
Name: X, dtype: object)
------------------------------
('Y', a    2.0
b    8.0
c    NaN
Name: Y, dtype: float64)
------------------------------
('Z', a    3.0
b    NaN
c    9.0
Name: Z, dtype: float64)
------------------------------

pandas.DataFrame.iterrows¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan],['10',np.nan,9]], index=['a','b','c'], columns=list('XYZ'))
for i in df.iterrows():    # iterate over rows as (index, Series) pairs
    print(i[0])
    print('-'*30)
    print(i[1])
    print('~'*30)

a
------------------------------
X    1
Y    2
Z    3
Name: a, dtype: object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
b
------------------------------
X      7
Y      8
Z    NaN
Name: b, dtype: object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
c
------------------------------
X     10
Y    NaN
Z      9
Name: c, dtype: object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pandas.DataFrame.keys¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan],['10',np.nan,9]], index=['a','b','c'], columns=list('XYZ'))
df.keys()

Index(['X', 'Y', 'Z'], dtype='object')

pandas.DataFrame.lookup¶

DataFrame.lookup(row_labels, col_labels)

df=pd.DataFrame([['1',2,3.],['7',8,np.nan],['10',np.nan,9]], index=['a','b','c'], columns=list('XYZ'))
print(df)
print(type(df.lookup('c','Z')))
df.lookup('c','Z')

    X    Y    Z
a   1  2.0  3.0
b   7  8.0  NaN
c  10  NaN  9.0
<class 'numpy.ndarray'>

array([9.])

df=pd.DataFrame([['1',2,3.],['7',8,np.nan],['10',np.nan,9]], index=['a','b','c'], columns=list('XYZ'))
result = []
for row, col in zip(df.index,df.columns):
    print(row, col)
    result.append(df.at[row, col])
result

a X
b Y
c Z

['1', 8.0, 9.0]
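
`DataFrame.lookup` was deprecated in pandas 1.2 and removed in 2.0. The loop above still works; a vectorized alternative, sketched here under the assumption that every label exists in the frame, converts labels to positions and indexes the underlying array pairwise. The names `rows`, `cols` and `picked` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['1', 2, 3.], ['7', 8, np.nan], ['10', np.nan, 9]],
                  index=['a', 'b', 'c'], columns=list('XYZ'))

rows = ['a', 'b', 'c']   # row labels to look up
cols = ['X', 'Y', 'Z']   # column label paired with each row label

# Translate labels to integer positions, then pick one cell per (row, col) pair.
row_pos = df.index.get_indexer(rows)
col_pos = df.columns.get_indexer(cols)
picked = df.to_numpy()[row_pos, col_pos]
print(list(picked))
```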

pandas.DataFrame.pop¶

DataFrame.pop(item)
Return item and drop from frame. Raise KeyError if not found.

df=pd.DataFrame([['1',2,3.],['7',8,np.nan],['10',np.nan,9]], index=['a','b','c'], columns=list('XYZ'))
print(df)
print('-'*30)
y=df.pop('Y')
print(type(y),'\n', 'y=\n', y)
df

    X    Y    Z
a   1  2.0  3.0
b   7  8.0  NaN
c  10  NaN  9.0
------------------------------
<class 'pandas.core.series.Series'> 
 y=
 a    2.0
b    8.0
c    NaN
Name: Y, dtype: float64

pandas.DataFrame.tail¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan],['10',np.nan,9]], index=['a','b','c'], columns=list('XYZ'))
df.tail(1)

pandas.DataFrame.xs¶

DataFrame.xs(key, axis=0, level=None, drop_level=True)
Returns the specified row(s) or column(s).
Returns a cross-section (row(s) or column(s)) from the Series/DataFrame. Defaults to cross-section on the rows (axis=0).

df=pd.DataFrame([['1',2,3.],['7',8,np.nan],['10',np.nan,9]], index=['a','b','c'], columns=list('XYZ'))
print(df)
print('-'*30)
print(df.xs('a'))
print('-'*30)
print(df.xs('Z', axis=1))

    X    Y    Z
a   1  2.0  3.0
b   7  8.0  NaN
c  10  NaN  9.0
------------------------------
X    1
Y    2
Z    3
Name: a, dtype: object
------------------------------
a    3.0
b    NaN
c    9.0
Name: Z, dtype: float64

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
          ['1', '1', '1', '1', '2', '2', '2', '2']
         ]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second', 'third'])
print(len(index))
df = pd.DataFrame(np.random.randn(8,5), index=index, columns=list('ABCDE'))
print(df)
print('-'*50)
print(df.xs(('foo', 'two')))
print('='*50)
print(df.xs('qux', level=0))
print('-'*50)
print(df.xs('one', level=1))
print('*'*50)
print(df.xs('1', level=2))
print('*'*50)
print(df.xs(('baz', 'one'), level=[0, 'second']))
df.xs(('baz', '1'), level=[0, 'third']) # the level list maps one-to-one onto the key tuple: here levels 0 and 2 are constrained to 'baz' and '1'

8
                           A         B         C         D         E
first second third                                                  
bar   one    1      1.209468  1.290964 -0.376996 -0.107501  0.835098
      two    1     -0.834633  0.171100 -0.808162 -0.043249 -0.125283
baz   one    1      0.191019  0.115292 -1.219449  0.326575  1.170924
      two    1     -0.497769  0.247957 -0.116349 -1.485385  0.224618
foo   one    2     -1.155593 -0.248171 -0.238434 -0.238925  0.468336
      two    2     -0.282162  0.203174  0.278302 -0.454708 -1.855569
qux   one    2     -0.321771 -1.595031 -1.876123 -0.812769 -0.349010
      two    2     -0.683977 -0.423224 -0.644195 -1.602555  0.568056
--------------------------------------------------
              A         B         C         D         E
third                                                  
2     -0.282162  0.203174  0.278302 -0.454708 -1.855569
==================================================
                     A         B         C         D         E
second third                                                  
one    2     -0.321771 -1.595031 -1.876123 -0.812769 -0.349010
two    2     -0.683977 -0.423224 -0.644195 -1.602555  0.568056
--------------------------------------------------
                    A         B         C         D         E
first third                                                  
bar   1      1.209468  1.290964 -0.376996 -0.107501  0.835098
baz   1      0.191019  0.115292 -1.219449  0.326575  1.170924
foo   2     -1.155593 -0.248171 -0.238434 -0.238925  0.468336
qux   2     -0.321771 -1.595031 -1.876123 -0.812769 -0.349010
**************************************************
                     A         B         C         D         E
first second                                                  
bar   one     1.209468  1.290964 -0.376996 -0.107501  0.835098
      two    -0.834633  0.171100 -0.808162 -0.043249 -0.125283
baz   one     0.191019  0.115292 -1.219449  0.326575  1.170924
      two    -0.497769  0.247957 -0.116349 -1.485385  0.224618
**************************************************
              A         B         C         D         E
third                                                  
1      0.191019  0.115292 -1.219449  0.326575  1.170924
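
For comparison, a cross-section on an inner level can also be written with `loc` and a slice. This is a sketch on a small made-up frame (not the random one above), assuming a sorted MultiIndex.

```python
import pandas as pd

# Hypothetical two-level frame, used only for this illustration.
idx = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                 names=['first', 'second'])
df = pd.DataFrame({'A': [1, 2, 3, 4]}, index=idx)

# xs selects on the inner level and drops that level from the result.
by_xs = df.xs('one', level='second')

# The same selection with loc keeps the level, so drop it explicitly.
by_loc = df.loc[(slice(None), 'one'), :].droplevel('second')

print(by_xs.equals(by_loc))
```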

pandas.DataFrame.get¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan],['10',np.nan,9]], index=['a','b','c'], columns=list('XYZ'))
print(df)
df.get('X')  # get() only retrieves columns

    X    Y    Z
a   1  2.0  3.0
b   7  8.0  NaN
c  10  NaN  9.0

a     1
b     7
c    10
Name: X, dtype: object

pandas.DataFrame.isin¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan],['10',np.nan,9]], index=['a','b','c'], columns=list('XYZ'))
print(df)
df.isin([8, '7', np.nan])  # NaN does not take part in the comparison here, so those cells are also False; matching ignores position

    X    Y    Z
a   1  2.0  3.0
b   7  8.0  NaN
c  10  NaN  9.0

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
other = pd.DataFrame({'A': [1, 3, 3, 2], 'B': ['e', 'f', 'f', 'e']})
print(df)
print(other)
df.isin(other) # Column A in `other` has a 3, but not at index 1.
# When the argument is a DataFrame, membership is checked cell by cell at matching index and column labels

   A  B
0  1  a
1  2  b
2  3  f
   A  B
0  1  e
1  3  f
2  3  f
3  2  e
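
The result of the last call is not shown above. The sketch below reproduces it and contrasts it with a dict argument, which checks each column against its own set of values regardless of position. The dict contents are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
other = pd.DataFrame({'A': [1, 3, 3, 2], 'B': ['e', 'f', 'f', 'e']})

# DataFrame argument: a cell is True only if the same label pair holds the same value.
by_label = df.isin(other)

# Dict argument (illustrative values): per-column membership, position ignored.
by_value = df.isin({'A': [2, 3], 'B': ['f']})

print(by_label)
print(by_value)
```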

pandas.DataFrame.where¶

df=pd.DataFrame([['1',2,3.],['7',8,np.nan],['10',np.nan,9]], index=['a','b','c'], columns=list('XYZ'))
print(df)
df2=df.fillna(0).astype('float')
print(df2)
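print(df2.where(df2['X']>7)) # condition on one column: rows where it holds keep all their values, the other rows become NaN
df2.where(df2.iloc[2:]>7) # condition on a row slice: only cells of that slice that satisfy it are kept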

    X    Y    Z
a   1  2.0  3.0
b   7  8.0  NaN
c  10  NaN  9.0
      X    Y    Z
a   1.0  2.0  3.0
b   7.0  8.0  0.0
c  10.0  0.0  9.0
      X    Y    Z
a   NaN  NaN  NaN
b   NaN  NaN  NaN
c  10.0  0.0  9.0

pandas.DataFrame.mask¶

DataFrame.mask(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False, raise_on_error=None)
Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
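# When one dimension is given as -1, reshape infers it from the array size and the other dimension.
print(df)
m = df % 3 == 0
df.mask(m, -df) # where the condition is False keep self's element; where it is True take the element at the same position from other (here -df)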

   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

# Compare with where() to understand what mask() does
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
m = df % 3 == 0
df.where(m, -df)  # exactly the opposite of mask()

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
print(df)
print('-'*50)
print(df.where(df>=5))
df.mask(df>=5)

   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
--------------------------------------------------
     A    B
0  NaN  NaN
1  NaN  NaN
2  NaN  5.0
3  6.0  7.0
4  8.0  9.0
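
The complementary behaviour of `mask` and `where` can be checked directly. This sketch reuses the frame and condition from the cells above; `masked` and `kept` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
m = df % 3 == 0

masked = df.mask(m, -df)   # replace where m is True
kept = df.where(m, -df)    # replace where m is False

# mask(cond) gives the same frame as where(~cond).
same = masked.equals(df.where(~m, -df))
print(masked)
print(kept)
print(same)
```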

pandas.DataFrame.query¶

df = pd.DataFrame(np.random.randn(10, 2), columns=list('ab'))
df.query('a > b')

df[df.a > df.b]  # same result as the previous expression

Binary operator functions¶

pandas.DataFrame.add¶

a = pd.DataFrame([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'], columns=['one'])
print(a)
print('-'*30)
b = pd.DataFrame(dict(one=[1, np.nan, 1, np.nan], two=[np.nan, 2, np.nan, 2]), index=['a', 'b', 'd', 'e'])
print(b)
a.add(b, fill_value=100) # fill_value is used only when exactly one of the two operands is missing; if both are NaN the result stays NaN

   one
a  1.0
b  1.0
c  1.0
d  NaN
------------------------------
   one  two
a  1.0  NaN
b  NaN  2.0
d  1.0  NaN
e  NaN  2.0
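
The result of `a.add(b, fill_value=100)` is not printed above. The sketch below recomputes it and spot-checks a few cells to illustrate the "exactly one side missing" rule.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'], columns=['one'])
b = pd.DataFrame(dict(one=[1, np.nan, 1, np.nan], two=[np.nan, 2, np.nan, 2]),
                 index=['a', 'b', 'd', 'e'])

result = a.add(b, fill_value=100)

# Both present: plain sum. One side missing: the missing side becomes 100.
# Both missing (after alignment on index and columns): stays NaN.
print(result)
```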

pandas.DataFrame.radd¶

a = pd.DataFrame([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'], columns=['one'])
print(a)
print('-'*30)
b = pd.DataFrame(dict(one=[1, np.nan, 1, np.nan], two=[np.nan, 2, np.nan, 2]), index=['a', 'b', 'd', 'e'])
print(b)
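a.radd(b, fill_value=100) # fill_value is used only when exactly one of the two operands is missing; if both are NaN the result stays NaN
# radd() differs from add() only in operand order: a.radd(b) computes b + a. Addition is commutative, so the result is the same; for sub, div and pow, swapping a and b changes the result.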

   one
a  1.0
b  1.0
c  1.0
d  NaN
------------------------------
   one  two
a  1.0  NaN
b  NaN  2.0
d  1.0  NaN
e  NaN  2.0
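
A short sketch of the operand-order point, using two made-up single-column frames without missing values: the reflected method swaps the operands, which matters for subtraction but not for addition.

```python
import pandas as pd

# Hypothetical frames, used only for this illustration.
a = pd.DataFrame({'one': [4, 6, 8]})
b = pd.DataFrame({'one': [1, 2, 3]})

add_same = a.radd(b).equals(a.add(b))   # b + a equals a + b
forward = a.sub(b)                      # a - b
reflected = a.rsub(b)                   # b - a

print(add_same)
print(forward['one'].tolist(), reflected['one'].tolist())
```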

pandas.DataFrame.floordiv¶

a = pd.DataFrame([4, 6, 8, np.nan], index=['a', 'b', 'c', 'd'], columns=['one'])
print(a)
print('-'*30)
b = pd.DataFrame(dict(one=[2, np.nan, 3, np.nan], two=[np.nan, 3, np.nan, 2]), index=['a', 'b', 'd', 'e'])
print(b)
print('-'*30)
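print(10%3, '10 divided by 3: the remainder is 1')
print(10//3, '10 divided by 3: the quotient is 3')
a.floordiv(b, fill_value=100)  # floor division (quotient)

   one
a  4.0
b  6.0
c  8.0
d  NaN
------------------------------
   one  two
a  2.0  NaN
b  NaN  3.0
d  3.0  NaN
e  NaN  2.0
------------------------------
1 10 divided by 3: the remainder is 1
3 10 divided by 3: the quotient is 3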

pandas.DataFrame.mod¶

a = pd.DataFrame([4, 6, 8, np.nan], index=['a', 'b', 'c', 'd'], columns=['one'])
print(a)
print('-'*30)
b = pd.DataFrame(dict(one=[2, np.nan, 3, np.nan], two=[np.nan, 3, np.nan, 2]), index=['a', 'b', 'd', 'e'])
print(b)
print('-'*30)
print(10%3, '10 divided by 3: the remainder is 1')
print(10//3, '10 divided by 3: the quotient is 3')
a.mod(b, fill_value=100)  # remainder (modulo)

   one
a  4.0
b  6.0
c  8.0
d  NaN
------------------------------
   one  two
a  2.0  NaN
b  NaN  3.0
d  3.0  NaN
e  NaN  2.0
------------------------------
1 10 divided by 3: the remainder is 1
3 10 divided by 3: the quotient is 3
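
`floordiv` and `mod` are tied together by the usual identity a = (a // b) * b + a % b. A minimal check on made-up integer frames (no NaN, no zero divisors):

```python
import pandas as pd

# Hypothetical integer frames, used only for this illustration.
a = pd.DataFrame([[7, 10], [9, 4]])
b = pd.DataFrame([[2, 3], [4, 5]])

quotient = a.floordiv(b)
remainder = a.mod(b)

# Quotient times divisor plus remainder gives back the dividend.
rebuilt = quotient * b + remainder
print(quotient)
print(remainder)
print(rebuilt.equals(a))
```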

pandas.DataFrame.pow¶

a = pd.DataFrame([2, 2, 2, np.nan], index=['a', 'b', 'c', 'd'], columns=['one'])
print(a)
print('-'*30)
b = pd.DataFrame(dict(one=[2, np.nan, 3, np.nan], two=[np.nan, 3, np.nan, 2]), index=['a', 'b', 'd', 'e'])
print(b)
print('-'*30)
print(2**2)
print(2**3)
a.pow(b, fill_value=3)  # exponentiation (power)

   one
a  2.0
b  2.0
c  2.0
d  NaN
------------------------------
   one  two
a  2.0  NaN
b  NaN  3.0
d  3.0  NaN
e  NaN  2.0
------------------------------
4
8

pandas.DataFrame.dot¶

a = pd.DataFrame([[2, 2, 2, 2]], columns=list('abcd'))
print(a.shape, '\n', a)
print('-'*30)
b = pd.DataFrame([[2, 2, 2, 2]], columns=list('abcd'))
print(b.T.shape, '\n', b.T)
print('-'*30)
a.dot(b.T)

(1, 4) 
    a  b  c  d
0  2  2  2  2
------------------------------
(4, 1) 
    0
a  2
b  2
c  2
d  2
------------------------------

a = pd.DataFrame([[2, 4, 6, 8],[1, 1, 1, 1]], columns=list('abcd'))
print(a.shape, '\n', a)
print('-'*30)
b = pd.DataFrame([[1, 1, 1, 1],[1, 3, 5, 7]], columns=list('abcd'))
print(b.T.shape, '\n', b.T)
print('-'*30)
a.dot(b.T) # row 0 of a is multiplied with column 0 and column 1 of b.T in turn, giving the two elements of row 0 of the result

(2, 4) 
    a  b  c  d
0  2  4  6  8
1  1  1  1  1
------------------------------
(4, 2) 
    0  1
a  1  1
b  1  3
c  1  5
d  1  7
------------------------------

a = pd.DataFrame([[2, 4, 6, 8],[1, 1, 1, 1]], columns=list('abcd'))
print(a.T.shape, '\n', a.T)
print('-'*30)
b = pd.DataFrame([[1, 1, 1, 1],[1, 3, 5, 7]], columns=list('abcd'))
print(b.shape, '\n', b)
print('-'*30)
a.T.dot(b)

(4, 2) 
    0  1
a  2  1
b  4  1
c  6  1
d  8  1
------------------------------
(2, 4) 
    a  b  c  d
0  1  1  1  1
1  1  3  5  7
------------------------------
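
The products themselves are not printed above. The sketch below recomputes the two 2-row examples and cross-checks them against NumPy's `@` operator. Note that `dot` aligns the columns of the left frame with the index of the right one, which is why `b.T` (indexed a to d) is needed.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[2, 4, 6, 8], [1, 1, 1, 1]], columns=list('abcd'))
b = pd.DataFrame([[1, 1, 1, 1], [1, 3, 5, 7]], columns=list('abcd'))

small = a.dot(b.T)   # (2, 4) x (4, 2) -> (2, 2)
large = a.T.dot(b)   # (4, 2) x (2, 4) -> (4, 4)

print(small)
print(large)
```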

pandas.DataFrame.eq¶

a = pd.DataFrame([[2, 4, 6, 8],[1, 1, 1, 1]], columns=list('abcd'))
print(a)
b = pd.DataFrame([[1, 1, 1, 1],[1, 3, 5, 7]], columns=list('abcd'))
print(b)
a.eq(b)

   a  b  c  d
0  2  4  6  8
1  1  1  1  1
   a  b  c  d
0  1  1  1  1
1  1  3  5  7

pandas.DataFrame.combine¶

df1 = pd.DataFrame({'A': [0, 2], 'B': [np.nan, 4], 'C': [0, 0]})
print(df1)
print('~'*50)
print(df1.sum())
print('-'*50)
df2 = pd.DataFrame({'A': [7, 5], 'B': [3, np.nan]})
print(df2)
print('~'*50)
print(df2.sum())
df1.combine(df2, lambda s1, s2: s1 if s1.sum() < s2.sum() else s2) # compare the column sums and keep the column with the smaller sum

   A    B  C
0  0  NaN  0
1  2  4.0  0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A    2.0
B    4.0
C    0.0
dtype: float64
--------------------------------------------------
   A    B
0  7  3.0
1  5  NaN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A    12.0
B     3.0
dtype: float64

pandas.DataFrame.combine_first¶

df1 = pd.DataFrame({'A': [0, np.nan], 'B': [np.nan, 6]})
print(df1)
print('~'*50)
print(df1.sum())
print('-'*50)
df2 = pd.DataFrame({'A': [7, 5], 'B': [3, 1]})
print(df2)
print('~'*50)
print(df2.sum())
df1.combine_first(df2) # df1’s values prioritized, use values from df2 to fill holes:

     A    B
0  0.0  NaN
1  NaN  6.0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A    0.0
B    6.0
dtype: float64
--------------------------------------------------
   A  B
0  7  3
1  5  1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A    12
B     4
dtype: int64
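
The combined frame is not printed above. A sketch that recomputes it: every non-missing value of df1 is kept, and only the holes are filled from df2.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [0, np.nan], 'B': [np.nan, 6]})
df2 = pd.DataFrame({'A': [7, 5], 'B': [3, 1]})

# df1 wins wherever it has a value; df2 only fills the NaN cells.
filled = df1.combine_first(df2)
print(filled)
```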

Function application & GroupBy¶

pandas.DataFrame.apply¶

DataFrame.apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds)
Apply a function along an axis of the DataFrame.

df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
print(df)
print('-'*50)
print(df.apply(np.sqrt))
print('-'*50)
print(df.apply(np.sum, axis=0))
print('-'*50)
print(df.apply(np.sum, axis=1))
print('-'*50)
print(type(df.apply(lambda x: [x.iloc[0]**1, x.iloc[1]**2], axis=1)))
print(df.apply(lambda x: [x.iloc[0]**1, x.iloc[1]**2], axis=1))
print('-'*50)
# with result_type='expand' the result is a DataFrame
# with result_type='reduce' the result is a Series
print(type(df.apply(lambda x: [x.iloc[0]**1, x.iloc[1]**2], axis=1, result_type='broadcast')))
print(df.apply(lambda x: [x.iloc[0]**1, x.iloc[1]**2], axis=1, result_type='broadcast'))
# result_type : {‘expand’, ‘reduce’, ‘broadcast’, None}, default None
#   These only act when axis=1 (columns):
#       ‘expand’ : list-like results will be turned into columns.
#       ‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
#       ‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
#   The default behaviour (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns.
#   New in version 0.23.0.
print('-'*50)
print(type(df.apply(lambda x: str(x)).iloc[0]))
print(df.apply(lambda x: str(x))) # apply() passes a whole column (or row) to the function, so str() receives the entire Series
print('~'*30)
for i, element in enumerate(df.apply(lambda x: str(x)).iloc[0]):
     print(i, element) # shows how apply() differs from applymap(): apply() handles a whole row or column at once, applymap() handles each element
df.apply(lambda x: len(str(x)))

   A  B
0  4  9
1  4  9
2  4  9
--------------------------------------------------
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0
--------------------------------------------------
A    12
B    27
dtype: int64
--------------------------------------------------
0    13
1    13
2    13
dtype: int64
--------------------------------------------------
<class 'pandas.core.series.Series'>
0    [4, 81]
1    [4, 81]
2    [4, 81]
dtype: object
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
   A   B
0  4  81
1  4  81
2  4  81
--------------------------------------------------
<class 'str'>
A    0    4\n1    4\n2    4\nName: A, dtype: int64
B    0    9\n1    9\n2    9\nName: B, dtype: int64
dtype: object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0 0
1  
2  
3  
4  
5 4
6 

7 1
8  
9  
10  
11  
12 4
13 

14 2
15  
16  
17  
18  
19 4
20 

21 N
22 a
23 m
24 e
25 :
26  
27 A
28 ,
29  
30 d
31 t
32 y
33 p
34 e
35 :
36  
37 i
38 n
39 t
40 6
41 4

A    42
B    42
dtype: int64
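
The cell above prints the default and the 'broadcast' variants only. A sketch of `result_type='expand'` on the same frame, for comparison; `as_lists` and `expanded` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

# Default: each list returned by the function becomes one element of a Series.
as_lists = df.apply(lambda row: [row['A'], row['B'] ** 2], axis=1)

# 'expand': the list elements are spread over new columns 0, 1, ...
expanded = df.apply(lambda row: [row['A'], row['B'] ** 2], axis=1,
                    result_type='expand')

print(type(as_lists).__name__, type(expanded).__name__)
print(expanded)
```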

df = pd.DataFrame(np.random.randn(4, 5), columns=['A', 'B', 'C', 'D', 'E'])
df['Col_sum'] = df.apply(lambda x: x.sum(), axis=1)
df.loc['Row_sum'] = df.apply(lambda x: x.sum())
df

pandas.DataFrame.applymap¶

Apply a function to a DataFrame elementwise. Since pandas 2.1 the method is called DataFrame.map, and applymap is deprecated.

df = pd.DataFrame([[1, 2.12], [3.356, 4.5678]])
print(df)
print(df.applymap(lambda x: str(x))) # applymap() calls the function on every element and returns the per-element results
df.applymap(lambda x: len(str(x)))

       0       1
0  1.000  2.1200
1  3.356  4.5678
       0       1
0    1.0    2.12
1  3.356  4.5678
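
The length table produced by the last call is not shown above. The sketch below recomputes it; because the elementwise method is named `DataFrame.map` from pandas 2.1 on, it picks whichever name the installed version offers. The `elementwise` helper name is illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2.12], [3.356, 4.5678]])

# Use DataFrame.map where it exists (pandas >= 2.1), else fall back to applymap.
elementwise = df.map if hasattr(df, 'map') else df.applymap

# Each cell is converted on its own, so the result keeps the frame's shape.
lengths = elementwise(lambda x: len(str(x)))
print(lengths)
```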

pandas.DataFrame.pipe¶

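DataFrame.pipe(func, *args, **kwargs)
Apply func(self, *args, **kwargs)
In practice, pipe differs from apply mainly in how the call is written: pipe hands the whole DataFrame to func and forwards the extra arguments as they are, which keeps chained calls concise.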

fruits = ['ap1', 'or2', 'le3', 'bb4'] * 2
N = len(fruits)
df = pd.DataFrame(
    {
        'fruit': fruits
      , 'basket_id': np.arange(N)
      , 'count': np.random.randint(3, 15, size=N)
      , 'weight': np.random.uniform(0, 4, size=N)
    }
    , columns=['basket_id', 'fruit', 'count', 'weight']
)
print(df)
print('-'*70)

def first(x):
    return x*2
def second(x, sec):
    return x*sec
def third(x, t1, t2, t3=-1):
    return x*t1*t2

print(df.pipe(first).pipe(second,2).pipe(third,t1=1,t2=2,t3=0))
print('-'*70)
print(df.apply(first).apply(second,args=(2,)).apply(third,args=(1,2),t3=0))

   basket_id fruit  count    weight
0          0   ap1      7  0.336429
1          1   or2     11  3.921252
2          2   le3      7  1.776452
3          3   bb4      4  3.948572
4          4   ap1      7  1.104207
5          5   or2      4  1.063009
6          6   le3      3  0.594073
7          7   bb4     12  3.029662
----------------------------------------------------------------------
   basket_id                     fruit  count     weight
0          0  ap1ap1ap1ap1ap1ap1ap1ap1     56   2.691432
1          8  or2or2or2or2or2or2or2or2     88  31.370014
2         16  le3le3le3le3le3le3le3le3     56  14.211620
3         24  bb4bb4bb4bb4bb4bb4bb4bb4     32  31.588578
4         32  ap1ap1ap1ap1ap1ap1ap1ap1     56   8.833655
5         40  or2or2or2or2or2or2or2or2     32   8.504072
6         48  le3le3le3le3le3le3le3le3     24   4.752583
7         56  bb4bb4bb4bb4bb4bb4bb4bb4     96  24.237294
----------------------------------------------------------------------
   basket_id                     fruit  count     weight
0          0  ap1ap1ap1ap1ap1ap1ap1ap1     56   2.691432
1          8  or2or2or2or2or2or2or2or2     88  31.370014
2         16  le3le3le3le3le3le3le3le3     56  14.211620
3         24  bb4bb4bb4bb4bb4bb4bb4bb4     32  31.588578
4         32  ap1ap1ap1ap1ap1ap1ap1ap1     56   8.833655
5         40  or2or2or2or2or2or2or2or2     32   8.504072
6         48  le3le3le3le3le3le3le3le3     24   4.752583
7         56  bb4bb4bb4bb4bb4bb4bb4bb4     96  24.237294

pandas.DataFrame.agg¶

DataFrame.agg(func, axis=0, *args, **kwargs)
Aggregate using one or more operations over the specified axis.

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [np.nan, np.nan, np.nan]],
                   columns=['A', 'B', 'C'])
print(df)
# Aggregate these functions over the rows.
print(df.fillna(0).agg(['sum', 'min', 'mean'])) # three aggregations at once
print(df.agg(['sum', 'min', 'mean']))
print(df.agg(('mean'), axis=1)) # a single aggregation
df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max'], 'C' :'mean'}) # a different set of aggregations per column

     A    B    C
0  1.0  2.0  3.0
1  4.0  5.0  6.0
2  7.0  8.0  9.0
3  NaN  NaN  NaN
         A      B     C
sum   12.0  15.00  18.0
min    0.0   0.00   0.0
mean   3.0   3.75   4.5
         A     B     C
sum   12.0  15.0  18.0
min    1.0   2.0   3.0
mean   4.0   5.0   6.0
0    2.0
1    5.0
2    8.0
3    NaN
dtype: float64

pandas.DataFrame.transform¶

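Applies func column by column.
DataFrame.transform(func, *args, **kwargs)
Call function producing a like-indexed NDFrame and return a NDFrame with the transformed values
The function sees one Series at a time, so an operation defined across columns, such as column 'a' minus column 'b', raises an exception.
With groupby, it must return a sequence of the same length as the group (one value per row).
Returning a single scalar per group also works, e.g. .transform(sum); the scalar is repeated for every row of the group.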

df = pd.DataFrame(np.random.randn(4, 4), columns=['A', 'B', 'C', 'D'], index=pd.date_range('1/1/2018', periods=4))
df.iloc[1:2] = np.nan
print(df)
df.transform(lambda x: (x - x.mean()) / x.std())

                   A         B         C         D
2018-01-01 -0.796542 -1.414477  0.300034  0.156123
2018-01-02       NaN       NaN       NaN       NaN
2018-01-03 -1.535938 -0.169594 -1.565617 -1.673637
2018-01-04  2.602354  0.735667 -0.364206  0.507863

df.transform(lambda x: x-x+x.mean()) # column mean

df.transform(lambda x: x-x+x.std()) # column standard deviation

df.transform(max) # sum, max, min, np.mean, np.std # only ran correctly when the df had as many rows as columns

A    2.602354
B    0.735667
C    0.300034
D    0.507863
dtype: float64

pandas.DataFrame.groupby¶

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

df=pd.DataFrame([
     ['exact',9720,'2017-10-01',515,458]
    ,['exact',9720,'2017-10-01',510,896]
    ,['fuzzy',8242,'2017-11-01',122,415]
    ,['fuzzy',8242,'2017-11-01',128,782]
], columns=['type','id','date','code','amount'])
df

df.groupby(['type','id','date']).mean()

print(type(df.groupby(['type','id'])['amount'].mean()))
print(df.groupby(['type','id'])['amount'].mean().index)
df.groupby(['type','id'])['amount'].mean()

<class 'pandas.core.series.Series'>
MultiIndex(levels=[['exact', 'fuzzy'], [8242, 9720]],
           labels=[[0, 1], [1, 0]],
           names=['type', 'id'])

type   id  
exact  9720    677.0
fuzzy  8242    598.5
Name: amount, dtype: float64

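Python '~' (bitwise NOT) operator¶

# Background:
# Signed numbers have three machine representations: sign-magnitude, ones' complement and two's complement.
# All three consist of a sign bit and value bits; the sign bit is 0 for positive and 1 for negative, while the value bits differ between the three.
# Computers represent and store integers in two's complement.
# The reason is that two's complement lets the sign bit be processed together with the value bits, and lets addition and subtraction be handled uniformly.
# The two's complement of a positive integer is its plain binary form, identical to its sign-magnitude form.
# The two's complement of a negative integer is obtained by inverting every bit of the corresponding positive number (sign bit included, 0 to 1 and 1 to 0) and then adding 1.
# Worked examples:
# -6: invert +6 (0000 0110) and add 1: (1111 1001) + (0000 0001) = (1111 1010), so -6 is stored as (1111 1010); inverting every bit gives (0000 0101), which is the answer 5
# -4: invert +4 (0000 0100) and add 1: (1111 1011) + (0000 0001) = (1111 1100), so -4 is stored as (1111 1100); inverting every bit gives (0000 0011), which is the answer 3
# Bitwise NOT in Python:
a=-4
print('~-4 =',~a)
# The two's complement of 4 equals its sign-magnitude form, (0000 0100); inverting every bit gives (1111 1011)
# A leading 1 marks a negative number,
# so (1111 1011) is negative.
# Which decimal value is it?
# Take its two's complement: invert the bits to get 0000 0100, then add 1 to get 0000 0101.
# (0000 0101) is 5, and the number is known to be negative, so the answer is -5
b=4
print('~4 =',~b)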

~-4 = 3
~4 = -5
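
The bit patterns in the comments can be reproduced in code. This sketch assumes an 8-bit view obtained by masking with 0xFF (Python integers themselves have no fixed width); `to_8bit` is a helper written for this illustration.

```python
def to_8bit(x):
    """Return the low 8 bits of x as a binary string (two's complement view)."""
    return format(x & 0xFF, '08b')

# For any integer, bitwise NOT equals negate-and-subtract-one.
identity_holds = all(~x == -x - 1 for x in (-6, -4, 0, 4, 127))

for x in (-6, -4, 4):
    print(x, to_8bit(x), '->', ~x, to_8bit(~x))
print(identity_holds)
```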

Example: combining aggregation functions¶

df=pd.DataFrame([
     ['exact',9720,'2018-10-01',515,458]
    ,['exact',9720,'2018-10-01',510,896]
    ,['fuzzy',8242,'2018-11-01',122,415]
    ,['fuzzy',8242,'2018-11-01',128,782]
    ,['fuzzy',8243,'2018-11-02',128,782]
    ,['fuzzy',8243,'2018-11-03',128,782]
], columns=['type','id','date','code','amount'])
df

exact_rows = df['type'] != 'fuzzy'
exact_rows

0     True
1     True
2    False
3    False
4    False
5    False
Name: type, dtype: bool

df.loc[exact_rows]

df.loc[~exact_rows]

df.loc[~exact_rows].groupby('id').apply(lambda x:x.sum())

df.loc[~exact_rows].groupby('id').apply(lambda x:x.sum())['amount']

id
8242    1197
8243    1564
Name: amount, dtype: int64

df.loc[~exact_rows].groupby('id').transform('nunique') # number of distinct values
# e.g. for the date column, uniqueness is evaluated within each group

grouped = df.loc[~exact_rows].groupby('id').apply(lambda g: g.sort_values('code', ascending=False))
print(grouped)
print('~'*50)
grouped = df.loc[~exact_rows].sort_values(['id','date'], ascending=True).groupby('id')
print(grouped)
print(type(grouped))
for i in grouped:
    print(i[0])
    print(i[1])

         type    id        date  code  amount
id                                           
8242 3  fuzzy  8242  2018-11-01   128     782
     2  fuzzy  8242  2018-11-01   122     415
8243 4  fuzzy  8243  2018-11-02   128     782
     5  fuzzy  8243  2018-11-03   128     782
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f78f8d8ae80>
<class 'pandas.core.groupby.groupby.DataFrameGroupBy'>
8242
    type    id        date  code  amount
2  fuzzy  8242  2018-11-01   122     415
3  fuzzy  8242  2018-11-01   128     782
8243
    type    id        date  code  amount
4  fuzzy  8243  2018-11-02   128     782
5  fuzzy  8243  2018-11-03   128     782

for i in grouped:
    print(i[0])
    print(i[1])

8242
    type    id        date  code  amount
2  fuzzy  8242  2018-11-01   122     415
3  fuzzy  8242  2018-11-01   128     782
8243
    type    id        date  code  amount
4  fuzzy  8243  2018-11-02   128     782
5  fuzzy  8243  2018-11-03   128     782

for i in grouped['code']:
    print(i[0])
    print(i[1])

8242
2    122
3    128
Name: code, dtype: int64
8243
4    128
5    128
Name: code, dtype: int64

grouped['code'].transform('nunique') # uniqueness is evaluated within each group, not over the whole column

2    2
3    2
4    1
5    1
Name: code, dtype: int64

a = np.where(grouped['code'].transform('nunique') == 2, 18, 81)
# where(cond, option1, option2)
a

array([18, 18, 81, 81])

apply() vs transform(): example of the difference¶

Both compute features on a DataFrame and are often used together with groupby().

data = pd.DataFrame({'state':['Florida','Florida','Texas','Texas'], 'a':[4,5,1,3], 'b':[6,10,3,11] }) 
print(data)
print('-'*50)
def sub_two(X):
    return X['a'] - X['b']
data1 = data.groupby(data['state']).apply(sub_two) # using transform here raises an error
print(data1)
print('-'*50)
def group_sum(x):
    return x.sum()
data3 = data.groupby(data['state']).transform(group_sum) # returns as many rows as the original df
print(data3)
print('-~'*30)
data4 = data.groupby(data['state'])[['a','b']].apply(group_sum) # select several columns with a list; the tuple form ['a','b'] is no longer accepted
print(data4)
print(data4)

     state  a   b
0  Florida  4   6
1  Florida  5  10
2    Texas  1   3
3    Texas  3  11
--------------------------------------------------
state     
Florida  0   -2
         1   -5
Texas    2   -2
         3   -8
dtype: int64
--------------------------------------------------
   a   b
0  9  16
1  9  16
2  4  14
3  4  14
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
         a   b
state         
Florida  9  16
Texas    4  14
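
Because transform keeps the original row count, its result can be assigned straight back as a new column, whereas an aggregation returns one row per group. A minimal sketch on the same data; the column name `a_state_sum` is made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({'state': ['Florida', 'Florida', 'Texas', 'Texas'],
                     'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

# One value per original row: fits directly into the frame.
data['a_state_sum'] = data.groupby('state')['a'].transform('sum')

# One value per group: a shorter Series indexed by state.
per_state = data.groupby('state')['a'].sum()

print(data)
print(per_state)
```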

Computations / Descriptive Stats¶

pandas.DataFrame.abs¶

df = pd.DataFrame({
    'a': [4, 5, 6, 7],
    'b': [10, 20, 30, 40],
    'c': [100, 50, -30, -50]
})
print(df)
print('-~'*25)
print(df.abs())
print('-~'*25)
print(df.loc[[1,0,2,3]]) # (df.c - 43).abs().argsort() returns [1,0,2,3]; see the next cell
df.loc[(df.c - 43).abs().argsort()]

   a   b    c
0  4  10  100
1  5  20   50
2  6  30  -30
3  7  40  -50
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
   a   b    c
0  4  10  100
1  5  20   50
2  6  30   30
3  7  40   50
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
   a   b    c
1  5  20   50
0  4  10  100
2  6  30  -30
3  7  40  -50

# argsort() ranks the elements from smallest to largest and returns the positions (indices) that would sort them
df = pd.DataFrame({
    'a': [4, 5, 6, 7],
    'b': [10, 20, 30, 40],
    'c': [100, 50, -30, -50]
})
print(df)
print('-~'*25)
print((df.c - 43))
print('-~'*25)
print((df.c - 43).abs())
print('-~'*25)
print((df.c - 43).abs().argsort())

   a   b    c
0  4  10  100
1  5  20   50
2  6  30  -30
3  7  40  -50
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
0    57
1     7
2   -73
3   -93
Name: c, dtype: int64
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
0    57
1     7
2    73
3    93
Name: c, dtype: int64
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
0    1
1    0
2    2
3    3
Name: c, dtype: int64
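
Since argsort returns positions rather than labels, feeding it to `loc` only works here because the index is the default 0..n-1. With any other index, `iloc` is the matching accessor. A sketch with a made-up string index:

```python
import pandas as pd

# Same values as above, but with a hypothetical non-default index.
df = pd.DataFrame({'c': [100, 50, -30, -50]}, index=list('wxyz'))

# Positions that sort the rows by distance from 43.
order = (df['c'] - 43).abs().argsort()

# Positions go with iloc.
nearest_first = df.iloc[order.to_numpy()]
print(nearest_first)
```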

pandas.DataFrame.all¶

DataFrame.all(axis=0, bool_only=None, skipna=True, level=None, **kwargs)
Return whether all elements are True, potentially over an axis.
Returns True only if every element is True; otherwise False.

df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})
print(df)
print('-~'*25)
print(df.all())
print('-~'*25)
print(df.all(axis='columns'))
print('-~'*25)
print(df.all(axis=None))

   col1   col2
0  True   True
1  True  False
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
col1     True
col2    False
dtype: bool
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
0     True
1    False
dtype: bool
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
False

pandas.DataFrame.any¶

DataFrame.any(axis=0, bool_only=None, skipna=True, level=None, **kwargs)
Return whether any element is True over requested axis.
The counterpart of all(): returns True as soon as a single element is True.

df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})
print(df)
print('-~'*25)
print(df.any())
print('-~'*25)
print(df.any(axis='columns'))
print('-~'*25)
print(df.any(axis=None))

   col1   col2
0  True   True
1  True  False
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
col1    True
col2    True
dtype: bool
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
0    True
1    True
dtype: bool
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
True

pandas.DataFrame.clip¶

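pandas.DataFrame.clip_lower (removed in pandas 1.0; use clip(lower=...))
pandas.DataFrame.clip_upper (removed in pandas 1.0; use clip(upper=...))
DataFrame.clip(lower=None, upper=None, axis=None, inplace=False, *args, **kwargs)
Trim values at input threshold(s).
Bounds the values of the DataFrame by the given thresholds.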

data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
df = pd.DataFrame(data)
print(df)
print('-~'*25)
print(df.clip(-4, 6))
print('-~'*25)
print(df.clip(lower=-2))  # same result as the older clip_lower(-2)
print('-~'*25)
print(df.clip(upper=3))   # same result as the older clip_upper(3)

   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
   col_0  col_1
0      9     -2
1     -2     -2
2      0      6
3     -1      8
4      5     -2
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
   col_0  col_1
0      3     -2
1     -3     -7
2      0      3
3     -1      3
4      3     -5

pandas.DataFrame.compound¶

DataFrame.compound(axis=None, skipna=None, level=None)
Return the compound percentage of the values for the requested axis

data = {'col_0': [1, 2, 5], 'col_1': [10, 20, 50]}
df = pd.DataFrame(data)
print(df)
print('-~'*25)
print(df.prod())
df.compound() 
# col_0: (1+1)*(1+2)*(1+5)-1=2*3*6-1=35
# col_1: (1+10)*(1+20)*(1+50)-1=11*21*51-1=11780

   col_0  col_1
0      1     10
1      2     20
2      5     50
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
col_0       10
col_1    10000
dtype: int64

col_0       35
col_1    11780
dtype: int64
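
`compound` is no longer available in recent pandas releases. The arithmetic spelled out in the comments above can be written with `prod` instead; this is a sketch of that formula, not a pandas-provided replacement.

```python
import pandas as pd

df = pd.DataFrame({'col_0': [1, 2, 5], 'col_1': [10, 20, 50]})

# Compound value per column: product of (1 + x) over the rows, minus 1.
compounded = (1 + df).prod() - 1
print(compounded)
```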

pandas.DataFrame.count¶

DataFrame.count(axis=0, level=None, numeric_only=False)
Count non-NA cells for each column or row.

df = pd.DataFrame({"Person":["John", "Myla", None, "John", "Myla"],
                   "Age": [24., np.nan, 21., 33, 26],
                   "Single": [False, True, True, True, False]
                  })
print(df)
print('-~'*25)
print(df.count())  # Notice the uncounted NA values
print('-~'*25)
print((df.count(axis='columns')))  # Counts for each row
print('-~'*25)
#Counts for one level of a MultiIndex:
df.set_index(["Person", "Single"]).count(level="Person")

  Person   Age  Single
0   John  24.0   False
1   Myla   NaN    True
2   None  21.0    True
3   John  33.0    True
4   Myla  26.0   False
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
Person    4
Age       4
Single    5
dtype: int64
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
0    3
1    2
2    2
3    3
4    3
dtype: int64
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~

df.groupby('Person')['Age'].count()

Person
John    2
Myla    1
Name: Age, dtype: int64

pandas.DataFrame.cov¶

DataFrame.cov(min_periods=None)
Compute pairwise covariance of columns, excluding NA/null values.
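Covariance of random variables: like the expectation and the variance, it is a population parameter of the distribution. In probability theory and statistics, covariance measures the degree of linear association in the joint distribution of two random variables. The stronger the linear relationship, the larger the covariance in absolute value; if the two variables are linearly unrelated, the covariance is zero.
Sample covariance: a statistic of the sample that serves as an estimate of the population parameter of the joint distribution. In practice it is usually the sample covariance that is computed.
Computing the covariance for every pair of dimensions yields an n×n matrix called the covariance matrix. It is symmetric, and its diagonal entries are the variances of the individual variables.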

df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=['dogs', 'cats'])
print(df)
df.cov()

   dogs  cats
0     1     2
1     0     3
2     2     0
3     1     1
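
The covariance matrix for this frame is not printed above. The sketch below computes the sample covariance by hand (centre each column, multiply, divide by n - 1) and compares it with `df.cov()`; `centered` and `manual` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=['dogs', 'cats'])

# Sample covariance: sum of products of deviations from the mean, over n - 1.
centered = df - df.mean()
manual = centered.T.dot(centered) / (len(df) - 1)

print(manual)
print(np.allclose(manual, df.cov()))
```

The diagonal holds the variances of dogs and cats, and the off-diagonal entry is negative because the two columns move in opposite directions.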

pandas.DataFrame.cummax¶

Return cumulative maximum over a DataFrame or Series axis.

df = pd.DataFrame([[2.0, 1.0],
                   [3.0, np.nan],
                   [1.0, 0.0]],
                   columns=list('AB'))
print(df)
print('-~'*25)
# By default, iterates over rows and finds the maximum in each column. 
# This is equivalent to axis=None or axis='index'.
print(df.cummax())
print('-~'*25)
print(df.cummax(axis=1))

     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
     A    B
0  2.0  1.0
1  3.0  NaN
2  3.0  1.0
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  1.0

pandas.DataFrame.cummin¶

df = pd.DataFrame([[2.0, 1.0],
                   [3.0, np.nan],
                   [1.0, 0.0]],
                   columns=list('AB'))
print(df)
print('-~'*25)
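print(df.cummin()) # compare each row with the rows above it and keep the running minimum per column
print('-~'*25)
print(df.cummin(axis=1)) # compare each column with the columns to its left and keep the running minimum per row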

     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
     A    B
0  2.0  1.0
1  2.0  NaN
2  1.0  0.0
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

pandas.DataFrame.cumprod¶

df = pd.DataFrame([[2.0, 1.0, 5],
                   [3.0, np.nan, None],
                   [1.0, 0.0, 3]],
                   columns=list('ABC'))
print(df)
print('-~'*25)
print(df.cumprod())
print('-~'*25)
print(df.cumprod(axis=1))

     A    B    C
0  2.0  1.0  5.0
1  3.0  NaN  NaN
2  1.0  0.0  3.0
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
     A    B     C
0  2.0  1.0   5.0
1  6.0  NaN   NaN
2  6.0  0.0  15.0
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
     A    B     C
0  2.0  2.0  10.0
1  3.0  NaN   NaN
2  1.0  0.0   0.0

pandas.DataFrame.cumsum¶

df = pd.DataFrame([[2.0, 1.0, 5],
                   [3.0, np.nan, None],
                   [1.0, 0.0, 3]],
                   columns=list('ABC'))
print(df)
print('-~'*25)
print(df.cumsum())
print('-~'*25)
print(df.cumsum(axis=1))

     A    B    C
0  2.0  1.0  5.0
1  3.0  NaN  NaN
2  1.0  0.0  3.0
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
     A    B    C
0  2.0  1.0  5.0
1  5.0  NaN  NaN
2  6.0  1.0  8.0
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
     A    B    C
0  2.0  3.0  8.0
1  3.0  NaN  NaN
2  1.0  1.0  4.0

pandas.DataFrame.describe¶

df = pd.DataFrame({ 'object': ['a', 'b', 'b'],
                    'numeric': [1, 2, 3],
                    'categorical': pd.Categorical(['g','e','f'])
                  })
print(df)
print('-~'*25)
print(df.describe())
print('-~'*25)
df.describe(include='all')

  object  numeric categorical
0      a        1           g
1      b        2           e
2      b        3           f
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~

pandas.DataFrame.diff¶

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 1, 2, 3, 5, 8],
                   'c': [1, 4, 9, 16, 25, 36]})
print(df)
print('-~'*25)
print(df.diff()) # difference from the previous row
print('-~'*25)
print(df.diff(axis=1)) # difference from the previous (left) column
print('-~'*25)
print(df.diff(periods=3)) # difference from the row 3 positions earlier

   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
     a    b     c
0  NaN  NaN   NaN
1  1.0  0.0   3.0
2  1.0  1.0   5.0
3  1.0  1.0   7.0
4  1.0  2.0   9.0
5  1.0  3.0  11.0
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
    a    b     c
0 NaN  0.0   0.0
1 NaN -1.0   3.0
2 NaN -1.0   7.0
3 NaN -1.0  13.0
4 NaN  0.0  20.0
5 NaN  2.0  28.0
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
     a    b     c
0  NaN  NaN   NaN
1  NaN  NaN   NaN
2  NaN  NaN   NaN
3  3.0  2.0  15.0
4  3.0  4.0  21.0
5  3.0  6.0  27.0

pandas.DataFrame.mad¶

DataFrame.mad(axis=None, skipna=None, level=None)
Return the mean absolute deviation of the values for the requested axis
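Take each element's deviation from the mean, take the absolute value, then average: the result is the mean absolute deviation. Note that DataFrame.mad was removed in pandas 2.0; the manual expression (df-df.mean()).abs().mean() in the code below gives the same value.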

df = pd.DataFrame({'A': [2, 2, 4]})
print(df)
print('-~'*25)
print(df.mean())
print('-~'*25)
print((df-df.mean()).abs().mean())
print('-~'*25)
print(df.mad())

   A
0  2
1  2
2  4
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
A    2.666667
dtype: float64
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
A    0.888889
dtype: float64
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
A    0.888889
dtype: float64

pandas.DataFrame.median¶

Median

df = pd.DataFrame({'A': [1, 2, 1, 102, 1, 3, 8]})
print(df)
print('-~'*25)
print(df.sort_values(by=['A']).reset_index())
print('-~'*25)
print(type(df.median()))
print('-~'*25)
print(df.median())
print('-~'*25)
print(df.median().iloc[0])  # .iloc selects by position; [0] on a label-indexed Series is deprecated
print('-~'*25)
print(df[df['A']==df.median().iloc[0]])
print('-~'*25)
df[df['A']==df.median().iloc[0]].index.tolist()

     A
0    1
1    2
2    1
3  102
4    1
5    3
6    8
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
   index    A
0      0    1
1      2    1
2      4    1
3      1    2
4      5    3
5      6    8
6      3  102
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
<class 'pandas.core.series.Series'>
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
A    2.0
dtype: float64
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
2.0
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
   A
1  2
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~

[1]

pandas.DataFrame.mode¶

Mode

df = pd.DataFrame({'A': [1, 1, 3, 2, 3, 3, 3], 'B': [9, 9, 9, 9, 3, 3, 3]})
print(df)
df.mode()

   A  B
0  1  9
1  1  9
2  3  9
3  2  9
4  3  3
5  3  3
6  3  3

# Group by column a first, then take the mode of column b within each group (A and B).
df = pd.DataFrame({'a':['A','A','A','A','B','B','B','B','B', 'B'],'b':[2,1,2,3,1,2,2,3,3, 3]})
print(df)
# dir(df.groupby('a')) lists the available methods
df.groupby('a')['b'].apply(lambda x:x.mode())

   a  b
0  A  2
1  A  1
2  A  2
3  A  3
4  B  1
5  B  2
6  B  2
7  B  3
8  B  3
9  B  3

a   
A  0    2
B  0    3
Name: b, dtype: int64

pandas.DataFrame.pct_change¶

DataFrame.pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
Percentage change between the current and a prior element.
Useful for computing period-over-period change (for example, month over month).

df = pd.DataFrame({
    'FR': [4.0405, 4.0963, 4.3149],
    'GR': [1.7246, 1.7482, 1.8519],
    'IT': [804.74, 810.01, 860.13]},
    index=['2018-11-01', '2018-12-01', '2019-01-01'])
print(df)
print('~-'*25)
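print(df.diff())    # difference from the previous row; with a monthly index this is the month-over-month change
print('~-'*25)
print(df.shift(1))  # shift down one row so each row lines up with the previous one
print('~-'*25)
print(df.diff().div(df.shift(1))) # (current - previous) / previous, which is the value pct_change() returns
# With monthly data, periods=12 gives the year-over-year change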
df.pct_change()

                FR      GR      IT
2018-11-01  4.0405  1.7246  804.74
2018-12-01  4.0963  1.7482  810.01
2019-01-01  4.3149  1.8519  860.13
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
                FR      GR     IT
2018-11-01     NaN     NaN    NaN
2018-12-01  0.0558  0.0236   5.27
2019-01-01  0.2186  0.1037  50.12
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
                FR      GR      IT
2018-11-01     NaN     NaN     NaN
2018-12-01  4.0405  1.7246  804.74
2019-01-01  4.0963  1.7482  810.01
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
                  FR        GR        IT
2018-11-01       NaN       NaN       NaN
2018-12-01  0.013810  0.013684  0.006549
2019-01-01  0.053365  0.059318  0.061876

df = pd.DataFrame({
     '2016': [1769950, 30586265],
     '2015': [1500923, 40912316],
     '2014': [1371819, 41403351]},
     index=['GOOG', 'APPL'])
print(df)
df.pct_change(axis='columns')

          2016      2015      2014
GOOG   1769950   1500923   1371819
APPL  30586265  40912316  41403351

pandas.DataFrame.prod¶

Product of the elements along rows or columns.

df = pd.DataFrame({
     'A': [1, 3],
     'B': [3, 4],
     'C': [2, 4]},
     index=['GOOG', 'APPL'])
print(df)
print('~-'*25)
print(df.prod())
print('~-'*25)
print(df.prod(axis='columns'))

      A  B  C
GOOG  1  3  2
APPL  3  4  4
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
A     3
B    12
C     8
dtype: int64
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
GOOG     6
APPL    48
dtype: int64

pandas.DataFrame.quantile¶

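p-quantile: p can be any value between 0 and 1.
Position of the p-quantile (with the default linear interpolation):
- pos = 1+(n-1)*p
- value=i+(j-i)*fraction
To find the p-quantile, first sort the data in ascending order.
If pos from the formula above is an integer, take the value at that position directly;
if it is not an integer, the position falls between two values, and the value formula gives the interpolated result.
fraction is the fractional part of pos,
and i, j are the values just before and just after that position.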

df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]), columns=['a', 'b'])
print(df)
print('~-'*25)
print(df.quantile(.1))  
# Column a:
# pos = 1 + (4 - 1)*0.1 = 1.3 
# fraction = 0.3
# ret = 1 + (2 - 1) * 0.3 = 1.3
# Column b:
# pos = 1.3 
# ret = 1 + (10 - 1) * 0.3 = 3.7
print('~-'*25)
print(type(df.quantile([.1, .5, .1, .5, .3, .2])))
print(df.quantile([.1, .5, .1, .5, .3, .2]))

   a    b
0  1    1
1  2   10
2  3  100
3  4  100
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
a    1.3
b    3.7
Name: 0.1, dtype: float64
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
<class 'pandas.core.frame.DataFrame'>
       a     b
0.1  1.3   3.7
0.5  2.5  55.0
0.1  1.3   3.7
0.5  2.5  55.0
0.3  1.9   9.1
0.2  1.6   6.4
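
The position formula described above can be sketched directly and compared with df.quantile. The helper linear_quantile is hypothetical, written only for this illustration; it assumes the default linear interpolation and data without NaN.

```python
import numpy as np
import pandas as pd

def linear_quantile(values, p):
    # Hypothetical helper implementing pos = 1 + (n - 1) * p with 1-based positions.
    data = np.sort(np.asarray(values, dtype=float))   # ascending order
    pos = 1 + (len(data) - 1) * p
    lower = int(np.floor(pos))                        # 1-based position of i
    fraction = pos - lower                            # fractional part of pos
    i = data[lower - 1]
    j = data[min(lower, len(data) - 1)]               # next value, clamped at the end
    return i + (j - i) * fraction

df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]), columns=['a', 'b'])
manual = {col: linear_quantile(df[col], 0.1) for col in df.columns}
builtin = df.quantile(0.1)
print(manual)
print(builtin)
```

Both results should agree with the hand calculation in the comments: 1.3 for column a and 3.7 for column b.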

pandas.DataFrame.rank¶

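rank gives each value's position within the original Series or DataFrame column when sorted in ascending order; tied values receive the average of their ranks (the default, method='average').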

df = pd.DataFrame(np.array([[1, 3], [2., 2], [3, 1], [3, 1]]), columns=['a', 'b'])
print(df)
print('~-'*25)
print(df.rank())

     a    b
0  1.0  3.0
1  2.0  2.0
2  3.0  1.0
3  3.0  1.0
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
     a    b
0  1.0  4.0
1  2.0  3.0
2  3.5  1.5
3  3.5  1.5
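
Averaging is only the default tie-breaking rule. A small sketch, reusing the values of column b, shows how a few other method options treat the tied pair; the column names of the ranks frame are chosen here for display only.

```python
import pandas as pd

s = pd.Series([3, 2, 1, 1], name='b')

ranks = pd.DataFrame({
    'average': s.rank(method='average'),  # ties share the mean of their ranks
    'min': s.rank(method='min'),          # ties share the lowest rank in the group
    'dense': s.rank(method='dense'),      # like 'min', but ranks increase by 1 between groups
    'first': s.rank(method='first'),      # ties ranked in order of appearance
})
print(ranks)
```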

pandas.DataFrame.round¶

DataFrame.round(decimals=0, *args, **kwargs)
Round a DataFrame to a variable number of decimal places.
Rounds each value to the nearest number at the given precision.

df = pd.DataFrame(np.random.random([3, 3]), columns=['A', 'B', 'C'], index=['first', 'second', 'third'])
print(df)
print('~-'*25)
print(df.round(2))
print('~-'*25)
df.round({'A': 1, 'C': 2})

               A         B         C
first   0.704474  0.686108  0.084263
second  0.834782  0.348655  0.857906
third   0.627151  0.743559  0.115237
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
           A     B     C
first   0.70  0.69  0.08
second  0.83  0.35  0.86
third   0.63  0.74  0.12
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-

pandas.DataFrame.nunique¶

Return the number of distinct values.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
print(df)
print('~-'*25)
print(df.nunique())
df.nunique(axis=1)

   A  B
0  1  1
1  2  1
2  3  1
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
A    3
B    1
dtype: int64

0    1
1    2
2    2
dtype: int64

Reindexing / Selection / Label manipulation¶

pandas.DataFrame.add_prefix¶

pandas.DataFrame.add_suffix¶

df = pd.DataFrame({'A': [1, 2, 3, 4],  'B': [3, 4, 5, 6]})
print(df)
print('~-'*25)
print(df.add_prefix('col_'))
print('~-'*25)
print(df.add_suffix('_col'))

   A  B
0  1  3
1  2  4
2  3  5
3  4  6
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
   col_A  col_B
0      1      3
1      2      4
2      3      5
3      4      6
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
   A_col  B_col
0      1      3
1      2      4
2      3      5
3      4      6

pandas.DataFrame.align¶

df1 = pd.DataFrame({'A': [1, 2, 3, 4],  'B': [3, 4, 5, 6]})
df2 = pd.DataFrame({'X': [1, 2, 3, 4],  'Y': [3, 4, 5, 6]}, index=[3,2,1,4])
print(df1)
print('~-'*25)
print(df2)
print('~-'*25)
print(type(df1.align(df2, join='inner', axis=0)))
print(df1.align(df2, join='outer', axis=0)[0])
print('~-'*25)
print(df1.align(df2, join='outer', axis=0)[1])

   A  B
0  1  3
1  2  4
2  3  5
3  4  6
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
   X  Y
3  1  3
2  2  4
1  3  5
4  4  6
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
<class 'tuple'>
     A    B
0  1.0  3.0
1  2.0  4.0
2  3.0  5.0
3  4.0  6.0
4  NaN  NaN
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
     X    Y
0  NaN  NaN
1  3.0  5.0
2  2.0  4.0
3  1.0  3.0
4  4.0  6.0

pandas.DataFrame.at_time¶

i = pd.date_range('2018-12-09', periods=4, freq='12H')
ts = pd.DataFrame({'A': [1,2,3,4]}, index=i)
print(ts)
ts.at_time('12:00')

                     A
2018-12-09 00:00:00  1
2018-12-09 12:00:00  2
2018-12-10 00:00:00  3
2018-12-10 12:00:00  4

pandas.DataFrame.between_time¶

i = pd.date_range('2018-12-09', periods=6, freq='4H')
ts = pd.DataFrame({'A': [1,2,3,4, 5, 6]}, index=i)
print(ts)
ts.between_time('4:15', '16:45')

                     A
2018-12-09 00:00:00  1
2018-12-09 04:00:00  2
2018-12-09 08:00:00  3
2018-12-09 12:00:00  4
2018-12-09 16:00:00  5
2018-12-09 20:00:00  6

pandas.DataFrame.drop¶

df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D'])
print(df)
print('~-'*25)
print(df.drop(['B', 'C'], axis=1))
print('~-'*25)
print(df.drop(columns=['B', 'C']))
print('~-'*25)
df.drop([0, 1])

   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
   A   D
0  0   3
1  4   7
2  8  11
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
   A   D
0  0   3
1  4   7
2  8  11
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-

# from_product builds the same 3x3 index; the labels= keyword of the MultiIndex
# constructor was renamed to codes= in pandas 0.24, so it is avoided here.
midx = pd.MultiIndex.from_product([['lama', 'cow', 'falcon'],
                                   ['speed', 'weight', 'length']])
df = pd.DataFrame(index=midx, columns=['big', 'small'],
                  data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
                        [250, 150], [1.5, 0.8], [320, 250],
                        [1, 0.8], [0.3,0.2]])
print(df)
print('~-'*25)
print(df.drop(index='cow', columns='small'))
print('~-'*25)
print(df.drop(index='length', level=1)) # level must name the index level that holds the label ('length' is in level 1); otherwise the label is not found

                 big  small
lama   speed    45.0   30.0
       weight  200.0  100.0
       length    1.5    1.0
cow    speed    30.0   20.0
       weight  250.0  150.0
       length    1.5    0.8
falcon speed   320.0  250.0
       weight    1.0    0.8
       length    0.3    0.2
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
                 big
lama   speed    45.0
       weight  200.0
       length    1.5
falcon speed   320.0
       weight    1.0
       length    0.3
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
                 big  small
lama   speed    45.0   30.0
       weight  200.0  100.0
cow    speed    30.0   20.0
       weight  250.0  150.0
falcon speed   320.0  250.0
       weight    1.0    0.8

pandas.DataFrame.drop_duplicates¶

Remove duplicate rows.
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
Return DataFrame with duplicate rows removed, optionally only considering certain columns
- keep : {‘first’, ‘last’, False}, default ‘first’
  - first : Drop duplicates except for the first occurrence.
  - last : Drop duplicates except for the last occurrence.
  - False : Drop all duplicates.

# from_product builds the same 3x3 index; the labels= keyword of the MultiIndex
# constructor was renamed to codes= in pandas 0.24, so it is avoided here.
midx = pd.MultiIndex.from_product([['lama', 'cow', 'falcon'],
                                   ['speed', 'weight', 'length']])
df = pd.DataFrame(index=midx, columns=['big', 'small'],
                  data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
                        [250, 150], [1.5, 0.8], [45, 250],
                        [1.5, 0.8], [0.3,0.2]])
print(df)
print('~'*25)
print(df.drop_duplicates(subset='big', keep='last', inplace=False))

                 big  small
lama   speed    45.0   30.0
       weight  200.0  100.0
       length    1.5    1.0
cow    speed    30.0   20.0
       weight  250.0  150.0
       length    1.5    0.8
falcon speed    45.0  250.0
       weight    1.5    0.8
       length    0.3    0.2
~~~~~~~~~~~~~~~~~~~~~~~~~
                 big  small
lama   weight  200.0  100.0
cow    speed    30.0   20.0
       weight  250.0  150.0
falcon speed    45.0  250.0
       weight    1.5    0.8
       length    0.3    0.2

df = pd.DataFrame({'A': [1, 2, 3, 4],  'B': [3, 4, 3, 3]})
print(df)
print('~-'*25)
print(df.drop_duplicates('B', keep='last'))

   A  B
0  1  3
1  2  4
2  3  3
3  4  3
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
   A  B
1  2  4
3  4  3
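
The three keep options listed above can be compared side by side on the same small frame. This is a minimal sketch; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 3, 3]})

keep_first = df.drop_duplicates('B', keep='first')  # keeps the first row for each value of B
keep_last = df.drop_duplicates('B', keep='last')    # keeps the last row for each value of B
keep_none = df.drop_duplicates('B', keep=False)     # drops every row whose B value is repeated

print(keep_first)
print(keep_last)
print(keep_none)
```

With keep=False only the row with B == 4 survives, because 3 occurs more than once.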

pandas.DataFrame.duplicated¶

df = pd.DataFrame({'A': [1, 2, 3, 1],  'B': [3, 4, 3, 3]})
print(df)
print('~-'*25)
print(df.duplicated(['A','B'], keep='last'))

   A  B
0  1  3
1  2  4
2  3  3
3  1  3
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
0     True
1    False
2    False
3    False
dtype: bool

pandas.DataFrame.equals¶

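Returns True only when the two objects have the same shape and the same elements; NaN values in the same location are treated as equal.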

df1 = pd.DataFrame({'A': [1, 2, 3, 1],  'B': [3, 4, np.nan, 3]})  # np.NAN was removed in NumPy 2.0; np.nan works everywhere
df2 = pd.DataFrame({'A': [1, 2, 3, 1],  'B': [3, 4, np.nan, 3]})
print(df1)
print('~-'*25)
df1.equals(df2)

   A    B
0  1  3.0
1  2  4.0
2  3  NaN
3  1  3.0
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-

True

pandas.DataFrame.filter¶

DataFrame.filter(items=None, like=None, regex=None, axis=None)
Subset rows or columns of dataframe according to labels in the specified index.
Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.

df=pd.DataFrame([
     [1,2,3]
    ,[4,5,6]
], index=['mouse','rabbit'], columns=['one','two','three'])
print(df)
print('-'*30)
print(df.filter(items=['one','two']))  # keep only the columns named in the list
print('-'*30)
print(df.filter(regex='e$', axis=1))   # filter columns: keep those whose label ends with "e"
print('-'*30)
print(df.filter(like='bbi', axis=0))   # filter rows: keep those whose index label contains "bbi"

        one  two  three
mouse     1    2      3
rabbit    4    5      6
------------------------------
        one  two
mouse     1    2
rabbit    4    5
------------------------------
        one  three
mouse     1      3
rabbit    4      6
------------------------------
        one  two  three
rabbit    4    5      6
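
Because filter only looks at labels, selecting by cell values needs a boolean mask instead. A short sketch of the contrast, reusing the frame above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=['mouse', 'rabbit'], columns=['one', 'two', 'three'])

by_label = df.filter(like='o', axis=1)  # columns whose label contains "o": one, two
by_value = df[df['one'] > 1]            # rows whose value in column 'one' is greater than 1

print(by_label)
print(by_value)
```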

pandas.DataFrame.first¶

i = pd.date_range('2018-04-09', periods=8, freq='2D')
ts = pd.DataFrame({'A': [1,2,3,4,2,3,4,2]}, index=i)
print(ts)
print('-'*30)
# Get the rows for the first 4 days:
print(ts.first('4D'))
print('-'*30)
ts.first('5D')

            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
2018-04-17  2
2018-04-19  3
2018-04-21  4
2018-04-23  2
------------------------------
            A
2018-04-09  1
2018-04-11  2
------------------------------

pandas.DataFrame.head¶

i = pd.date_range('2018-04-09', periods=8, freq='2D')
ts = pd.DataFrame({'A': [1,2,3,4,2,3,4,2]}, index=i)
print(ts)
print('-'*30)
print(ts.head())
ts.head(3)

            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
2018-04-17  2
2018-04-19  3
2018-04-21  4
2018-04-23  2
------------------------------
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
2018-04-17  2

pandas.DataFrame.idxmax¶

pandas.DataFrame.idxmin¶

i = pd.date_range('2018-04-09', periods=8, freq='2D')
ts = pd.DataFrame({'A': [1,2,3,16,2,3,4,2], 'B': [5,8,3,9,2,3,6,12]}, index=i)
print(ts)
print('-'*30)
# Index label of the maximum value in each column:
print(ts.idxmax(axis=0))
print('-'*30)
print(ts.idxmax(axis=1)) # on a tie, the first column (A) is returned
print('-'*30)
print(ts.idxmin(axis=0))
print('-'*30)
ts.idxmin(axis=1) # on a tie, the first column (A) is returned

             A   B
2018-04-09   1   5
2018-04-11   2   8
2018-04-13   3   3
2018-04-15  16   9
2018-04-17   2   2
2018-04-19   3   3
2018-04-21   4   6
2018-04-23   2  12
------------------------------
A   2018-04-15
B   2018-04-23
dtype: datetime64[ns]
------------------------------
2018-04-09    B
2018-04-11    B
2018-04-13    A
2018-04-15    A
2018-04-17    A
2018-04-19    A
2018-04-21    B
2018-04-23    B
Freq: 2D, dtype: object
------------------------------
A   2018-04-09
B   2018-04-17
dtype: datetime64[ns]
------------------------------

2018-04-09    A
2018-04-11    A
2018-04-13    A
2018-04-15    B
2018-04-17    A
2018-04-19    A
2018-04-21    A
2018-04-23    A
Freq: 2D, dtype: object

pandas.DataFrame.last¶

i = pd.date_range('2018-04-09', periods=8, freq='2D')
ts = pd.DataFrame({'A': [1,2,3,4,2,3,4,2]}, index=i)
print(ts)
print('-'*30)
# Get the rows for the last 4 days:
print(ts.last('4D'))
print('-'*30)
ts.last('5D')

            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
2018-04-17  2
2018-04-19  3
2018-04-21  4
2018-04-23  2
------------------------------
            A
2018-04-21  4
2018-04-23  2
------------------------------

pandas.DataFrame.reindex¶

DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index.
A new object is produced unless the new index is equivalent to the current one and copy=False

index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({
    'http_status': [200,200,404,404,301],
    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
    index=index)
print(df)
print('-'*30)
new_index= ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
print(df.reindex(new_index))
df.reindex(new_index, fill_value=0)

           http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00
------------------------------
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02

date_index1 = pd.date_range('11/1/2018', periods=6, freq='D')
date_index2 = pd.date_range('10/29/2018', periods=10, freq='D')
print(date_index1)
print('-'*30)
print(date_index2)
df = pd.DataFrame({"prices": [100, 105, np.nan, 100, 89, 88]},
                    index=date_index1)
print(df)
print('~'*50)
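print(df.reindex(date_index2)) # align on the new index; labels missing from the original become NaN
df.reindex(date_index2, method='bfill') # backward fill: each new label takes the next valid observation (b stands for "backward"); NaN values already present in the original are not filled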

DatetimeIndex(['2018-11-01', '2018-11-02', '2018-11-03', '2018-11-04',
               '2018-11-05', '2018-11-06'],
              dtype='datetime64[ns]', freq='D')
------------------------------
DatetimeIndex(['2018-10-29', '2018-10-30', '2018-10-31', '2018-11-01',
               '2018-11-02', '2018-11-03', '2018-11-04', '2018-11-05',
               '2018-11-06', '2018-11-07'],
              dtype='datetime64[ns]', freq='D')
            prices
2018-11-01   100.0
2018-11-02   105.0
2018-11-03     NaN
2018-11-04   100.0
2018-11-05    89.0
2018-11-06    88.0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            prices
2018-10-29     NaN
2018-10-30     NaN
2018-10-31     NaN
2018-11-01   100.0
2018-11-02   105.0
2018-11-03     NaN
2018-11-04   100.0
2018-11-05    89.0
2018-11-06    88.0
2018-11-07     NaN
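
The output of the method='bfill' call is not shown above. The sketch below rebuilds the same frame and checks which positions backward fill actually touches; the name filled is illustrative.

```python
import numpy as np
import pandas as pd

date_index1 = pd.date_range('2018-11-01', periods=6, freq='D')
date_index2 = pd.date_range('2018-10-29', periods=10, freq='D')
df = pd.DataFrame({'prices': [100, 105, np.nan, 100, 89, 88]}, index=date_index1)

# New labels before 2018-11-01 take the next valid observation (100.0).
# The NaN at 2018-11-03 came from the original data, so it is left as is.
# 2018-11-07 has no later observation to borrow from, so it stays NaN.
filled = df.reindex(date_index2, method='bfill')
print(filled)
```

To fill NaN values that were already in the data, a separate call such as fillna on the result would be needed.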

df = pd.DataFrame({"A": [100, 105, np.nan],"B": [100, 105, 100],"C": [100, 105, np.nan]}, index=['a', 'b', 'c'])
print(df)
print('~'*50)
df.reindex(['B', 'C', 'D'], axis=1)

       A    B      C
a  100.0  100  100.0
b  105.0  105  105.0
c    NaN  100    NaN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pandas.DataFrame.reindex_like¶

df1 = pd.DataFrame({"X": [10000, 10500, np.nan],"B": [10000, 10500, 10000],"C": [10000, 10500, np.nan]}, index=['x', 'b', 'c'])
df2 = pd.DataFrame({"A": [1, 1.5, np.nan],"B": [1, 1.5, 1],"C": [1, 1.5, np.nan]}, index=['a', 'b', 'c'])
print(df1)
print('~'*50)
print(df2)
print('~'*50)
df1.reindex_like(df2) # same row and column labels as df2, but the data still comes from df1; anything df1 lacks becomes NaN

         X      B        C
x  10000.0  10000  10000.0
b  10500.0  10500  10500.0
c      NaN  10000      NaN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     A    B    C
a  1.0  1.0  1.0
b  1.5  1.5  1.5
c  NaN  1.0  NaN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pandas.DataFrame.rename¶

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print(df)
#df.rename(index={0:'a', 1:'b', 2:'c'}, columns={"A": "a", "B": "c"})
df.rename(index={1:'a', 0:'b', 2:'啊'}, columns={"A": "a", "B": "c"}).rename(str.upper) # (str.lower, axis='columns')

   A  B
0  1  4
1  2  5
2  3  6

pandas.DataFrame.reset_index¶

df = pd.DataFrame([('bird',    389.0),
                   ('bird',     24.0),
                   ('mammal',   80.5),
                   ('mammal', np.nan)],
                  index=['falcon', 'parrot', 'lion', 'monkey'],
                  columns=('class', 'max_speed'))
print(df)
print('~'*50)
print(df.reset_index())
print(df.reset_index().columns)
df.reset_index(drop=True) # use the drop parameter to avoid the old index being added as a column

         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
Index(['index', 'class', 'max_speed'], dtype='object')

index = pd.MultiIndex.from_tuples([('bird', 'falcon'),  # from_tuples works well together with zip()
                                   ('bird', 'parrot'),
                                   ('mammal', 'lion'),
                                   ('mammal', 'monkey')],
                                  names=['class', 'name'])
columns = pd.MultiIndex.from_tuples([('speed', 'max'),
                                     ('species', 'type')])
df = pd.DataFrame([(389.0, 'fly'),
                   ( 24.0, 'fly'),
                   ( 80.5, 'run'),
                   (np.nan, 'jump')],
                  index=index,
                  columns=columns)
print(df)
print('-'*30)
print(df.columns)
print('-'*30)
#If the index has multiple levels, we can reset a subset of them:
print(df.reset_index(level='class', col_level=0, col_fill='genus').index)
print('-'*30)
print(df.reset_index(level='class', col_fill='genus')) 
print(df.reset_index(level='class', col_fill='genus').columns)

               speed species
                 max    type
class  name                 
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump
------------------------------
MultiIndex(levels=[['species', 'speed'], ['max', 'type']],
           labels=[[1, 0], [0, 1]])
------------------------------
Index(['falcon', 'parrot', 'lion', 'monkey'], dtype='object', name='name')
------------------------------
         class  speed species
         genus    max    type
name                         
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump
MultiIndex(levels=[['species', 'speed', 'class'], ['max', 'type', 'genus']],
           labels=[[2, 1, 0], [2, 0, 1]])

df = pd.DataFrame(np.random.random((4,6)), index=[['a','a','c','c'],['x','z','z','w']])
df.columns = pd.MultiIndex.from_product([[1,2],['E','C','A']])
print(df)
print('-'*30)
print(df.columns)
print('-'*30)
print(df.reset_index(level=0, col_fill='B'))
print('-'*30)
df.reset_index(level=0, col_fill='B').columns 
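# In levels and labels, the entries for the index level that became a column are ordered in a notable way:
# levels=[[1, 2, 'level_0'], ['A', 'C', 'E', 'B']],
# labels=[[2, 0, 0, 0, 1, 1, 1], [3, 2, 1, 0, 2, 1, 0]]
# Note the leading 2 and 3 in labels: they point at 'level_0' and 'B', which were appended at the end of levels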

            1                             2                    
            E         C         A         E         C         A
a x  0.752234  0.081645  0.667921  0.527488  0.523513  0.336706
  z  0.725801  0.911460  0.618994  0.799133  0.610035  0.613607
c z  0.257035  0.323279  0.662518  0.077544  0.822074  0.287594
  w  0.231728  0.030180  0.366634  0.966586  0.408579  0.840229
------------------------------
MultiIndex(levels=[[1, 2], ['A', 'C', 'E']],
           labels=[[0, 0, 0, 1, 1, 1], [2, 1, 0, 2, 1, 0]])
------------------------------
  level_0         1                             2                    
        B         E         C         A         E         C         A
x       a  0.752234  0.081645  0.667921  0.527488  0.523513  0.336706
z       a  0.725801  0.911460  0.618994  0.799133  0.610035  0.613607
z       c  0.257035  0.323279  0.662518  0.077544  0.822074  0.287594
w       c  0.231728  0.030180  0.366634  0.966586  0.408579  0.840229
------------------------------

MultiIndex(levels=[[1, 2, 'level_0'], ['A', 'C', 'E', 'B']],
           labels=[[2, 0, 0, 0, 1, 1, 1], [3, 2, 1, 0, 2, 1, 0]])

pandas.DataFrame.sample¶

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
Return a random sample of items from an axis of object.
frac : float, optional
Fraction of axis items to return. Cannot be used with n.

df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
print(df)
df.sample(frac=0.3, replace=True)

          A         B         C         D
0  0.034693  1.644373  0.439421  0.309513
1  2.473348 -2.996134 -2.336114  0.085081
2  0.013747  0.752300 -2.211867 -0.964746
3 -1.073599 -0.170450 -0.635899  1.534373
4  0.515041  1.196360 -0.994070 -0.828308
5  1.727775 -0.952100  0.907272  1.660095
6  0.077633 -1.607453  0.378108 -0.113058
7  0.184237  0.382458 -0.111344 -0.026671
8  0.342763 -0.770604 -0.795568 -0.677436
9  0.141384 -0.043347  2.473006  0.814057
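
As a minimal sketch of the parameters described above: frac and n are alternatives, and random_state makes a draw repeatable. The frame here uses np.arange instead of random data so the result is easy to inspect; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list('ABCD'))

by_frac = df.sample(frac=0.3, random_state=0)   # 30% of 10 rows -> 3 rows
repeat = df.sample(frac=0.3, random_state=0)    # same seed, same rows
by_n = df.sample(n=3, random_state=0)           # ask for a row count instead of a fraction

# n and frac cannot be combined; pandas raises ValueError.
try:
    df.sample(n=3, frac=0.3)
    both_allowed = True
except ValueError:
    both_allowed = False

print(by_frac)
print(both_allowed)
```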

pandas.DataFrame.set_axis¶

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print(df)
print('-'*50)
print(df.set_axis(['a', 'b', 'c'], axis='index', inplace=False))
print('-'*50)
print(df.set_axis(['I', 'II'], axis='columns', inplace=False))

   A  B
0  1  4
1  2  5
2  3  6
--------------------------------------------------
   A  B
a  1  4
b  2  5
c  3  6
--------------------------------------------------
   I  II
0  1   4
1  2   5
2  3   6

pandas.DataFrame.set_index¶

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2018, 2018, 2019, 2018],
                   'sale':[55, 40, 84, 31]})
print(df)
df.sort_values(['year', 'month']).set_index(['year', 'month'])

   month  year  sale
0      1  2018    55
1      4  2018    40
2      7  2019    84
3     10  2018    31

pandas.DataFrame.tail¶

i = pd.date_range('2018-04-09', periods=8, freq='2D')
ts = pd.DataFrame({'A': [1,2,3,4,2,3,4,2]}, index=i)
print(ts)
print('-'*30)
print(ts.tail())
ts.tail(3)

            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
2018-04-17  2
2018-04-19  3
2018-04-21  4
2018-04-23  2
------------------------------
            A
2018-04-15  4
2018-04-17  2
2018-04-19  3
2018-04-21  4
2018-04-23  2

pandas.DataFrame.take¶

df = pd.DataFrame([('falcon', 'bird',    389.0),
                   ('parrot', 'bird',     24.0),
                   ('lion',   'mammal',   80.5),
                   ('monkey', 'mammal', np.nan)],
                   columns=['name', 'class', 'max_speed'],
                   index=[0, 2, 3, 1])
print(df)
print('-'*30)
print(df.take([0, 3]))  # select rows by position (not by index label)
print('-'*30)
print(df.take([-1, -2]))  # negative positions count from the end
print('-'*30)
print(df.take([1, 2], axis=1))  # select columns by position

     name   class  max_speed
0  falcon    bird      389.0
2  parrot    bird       24.0
3    lion  mammal       80.5
1  monkey  mammal        NaN
------------------------------
     name   class  max_speed
0  falcon    bird      389.0
1  monkey  mammal        NaN
------------------------------
     name   class  max_speed
1  monkey  mammal        NaN
3    lion  mammal       80.5
------------------------------
    class  max_speed
0    bird      389.0
2    bird       24.0
3  mammal       80.5
1  mammal        NaN
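
Since the index here is deliberately out of order, it is a convenient case for checking that take works on positions, like iloc, and not on labels, like loc. A short sketch with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([('falcon', 'bird',    389.0),
                   ('parrot', 'bird',     24.0),
                   ('lion',   'mammal',   80.5),
                   ('monkey', 'mammal', np.nan)],
                  columns=['name', 'class', 'max_speed'],
                  index=[0, 2, 3, 1])

by_take = df.take([0, 3])   # rows at positions 0 and 3
by_iloc = df.iloc[[0, 3]]   # same positional selection
by_loc = df.loc[[0, 3]]     # rows whose index labels are 0 and 3

print(by_take)
print(by_loc)
```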

pandas.DataFrame.truncate¶

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
                   'B': ['f', 'g', 'h', 'i', 'j'],
                   'C': ['k', 'l', 'm', 'n', 'o']},
                   index=[1, 2, 3, 4, 5])
print(df)
print('-'*30)
print(df.truncate(before=2, after=4)) # include 2 and 4
print('-'*30)
print(df.truncate(before="A", after="B", axis="columns"))

   A  B  C
1  a  f  k
2  b  g  l
3  c  h  m
4  d  i  n
5  e  j  o
------------------------------
   A  B  C
2  b  g  l
3  c  h  m
4  d  i  n
------------------------------
   A  B
1  a  f
2  b  g
3  c  h
4  d  i
5  e  j

Time series-related¶

pandas.DataFrame.asfreq¶

Convert TimeSeries to specified frequency.

index = pd.date_range('11/15/2018', periods=4, freq='T')
series = pd.Series([0.0, None, 2.0, 3.0], index=index)
df = pd.DataFrame({'s':series})
print(df)
print('-'*30)
print(df.asfreq(freq='30S'))
print('-'*30)
print(df.asfreq(freq='30S', fill_value=9.0))
print('-'*30)
df.asfreq(freq='30S', method='bfill')

                       s
2018-11-15 00:00:00  0.0
2018-11-15 00:01:00  NaN
2018-11-15 00:02:00  2.0
2018-11-15 00:03:00  3.0
------------------------------
                       s
2018-11-15 00:00:00  0.0
2018-11-15 00:00:30  NaN
2018-11-15 00:01:00  NaN
2018-11-15 00:01:30  NaN
2018-11-15 00:02:00  2.0
2018-11-15 00:02:30  NaN
2018-11-15 00:03:00  3.0
------------------------------
                       s
2018-11-15 00:00:00  0.0
2018-11-15 00:00:30  9.0
2018-11-15 00:01:00  NaN
2018-11-15 00:01:30  9.0
2018-11-15 00:02:00  2.0
2018-11-15 00:02:30  9.0
2018-11-15 00:03:00  3.0
------------------------------
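The third output above suggests that `fill_value` only touches the rows created by upsampling, while a NaN already present in the data is left alone. A minimal sketch checking this, and the `bfill` case as well. It assumes the alias spellings `'min'` and `'30s'`; the cells above use the older `'T'` and `'30S'` spellings for the same frequencies.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2018-11-15", periods=4, freq="min")
df = pd.DataFrame({"s": [0.0, np.nan, 2.0, 3.0]}, index=index)

upsampled = df.asfreq("30s", fill_value=9.0)

# Only the slots created by upsampling receive fill_value ...
new_slots = upsampled.index.difference(df.index)
print(upsampled.loc[new_slots, "s"].tolist())

# ... while the NaN that was already in the data stays NaN.
original_gap = upsampled.loc[pd.Timestamp("2018-11-15 00:01:00"), "s"]
print(original_gap)

# method='bfill' fills each new slot from the next original observation,
# so a slot just before an original NaN is NaN as well.
backfilled = df.asfreq("30s", method="bfill")
print(backfilled)
```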

pandas.DataFrame.shift¶

DataFrame.shift(periods=1, freq=None, axis=0)

index = pd.date_range('11/15/2018', periods=4, freq='T')
series = pd.Series([0.0, 8.5, 2.0, 3.0], index=index)
df = pd.DataFrame({'s':series})
print(df)
print('-'*30)
df.shift(periods=-2)

                       s
2018-11-15 00:00:00  0.0
2018-11-15 00:01:00  8.5
2018-11-15 00:02:00  2.0
2018-11-15 00:03:00  3.0
------------------------------
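The `periods` and `freq` arguments in the signature behave differently: without `freq` the values move and the index stays, with `freq` the index moves and the values stay. A short sketch on the same data (the `'min'` alias is an assumption; the cell above spells it `'T'`):

```python
import pandas as pd

index = pd.date_range("2018-11-15", periods=4, freq="min")
df = pd.DataFrame({"s": [0.0, 8.5, 2.0, 3.0]}, index=index)

# Without freq: values move up two rows, the index stays put,
# and the two vacated rows at the end become NaN.
moved_values = df.shift(periods=-2)
print(moved_values)

# With freq: values stay attached to their rows and the index moves instead.
moved_index = df.shift(periods=2, freq="min")
print(moved_index)
```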

pandas.DataFrame.slice_shift¶

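The shifted data will not include the dropped periods, so the shifted axis is shorter than the original. Note that slice_shift was deprecated in pandas 1.2 and removed in pandas 2.0, so the cell below only runs on older versions.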

index = pd.date_range('11/15/2018', periods=4, freq='T')
series = pd.Series([0.0, 8.5, 2.0, 3.0], index=index)
df = pd.DataFrame({'s':series})
print(df)
print('-'*30)
df.slice_shift(periods=2)

                       s
2018-11-15 00:00:00  0.0
2018-11-15 00:01:00  8.5
2018-11-15 00:02:00  2.0
2018-11-15 00:03:00  3.0
------------------------------
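Where `slice_shift` is unavailable (it no longer exists in recent pandas releases), the same result can be approximated with `shift` followed by a positional slice. This is a sketch of an equivalent for positive `periods`, not the library's own implementation:

```python
import pandas as pd

index = pd.date_range("2018-11-15", periods=4, freq="min")
df = pd.DataFrame({"s": [0.0, 8.5, 2.0, 3.0]}, index=index)

periods = 2
# Shift the values down, then drop the leading rows that shift left empty,
# so the result is shorter than the original frame.
sliced = df.shift(periods).iloc[periods:]
print(sliced)
```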

pandas.DataFrame.tshift¶

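Shift the time index, using the index’s frequency if available. Note that tshift was deprecated in pandas 1.1 and removed in pandas 2.0; on newer versions use shift with the freq argument.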

index = pd.date_range('11/15/2018', periods=4, freq='T')
series = pd.Series([0.0, 8.5, 2.0, 3.0], index=index)
df = pd.DataFrame({'s':series})
print(df)
print('-'*30)
df.tshift(periods=2)

                       s
2018-11-15 00:00:00  0.0
2018-11-15 00:01:00  8.5
2018-11-15 00:02:00  2.0
2018-11-15 00:03:00  3.0
------------------------------

index = pd.date_range('11/15/2018', periods=5, freq='M')
df = pd.DataFrame({'C': ['k', 'l', 'm', 'n', 'o']
                 , 'D': ['a', 'c', 'g', 'j', 'r']
                  },
                   index=index)
print(df)
print('-'*30)
df.tshift(periods=2) # the index is shifted, not the data

            C  D
2018-11-30  k  a
2018-12-31  l  c
2019-01-31  m  g
2019-02-28  n  j
2019-03-31  o  r
------------------------------
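On pandas versions without `tshift`, passing the index's own frequency to `shift` gives the same index-only movement. A minimal sketch; it builds the month-end index from a `pd.offsets.MonthEnd()` object rather than the `'M'` string alias, which is an assumption made to stay independent of alias spelling:

```python
import pandas as pd

index = pd.date_range("2018-11-15", periods=5, freq=pd.offsets.MonthEnd())
df = pd.DataFrame({"C": list("klmno"), "D": list("acgjr")}, index=index)

# Passing the index's own frequency moves every label two month-ends forward;
# the values are not touched.
moved = df.shift(periods=2, freq=df.index.freq)
print(moved)
```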

pandas.DataFrame.first_valid_index¶

Return index for first non-NA/null value.

pandas.DataFrame.last_valid_index¶

Return index for last non-NA/null value.

index = pd.date_range('11/15/2018', periods=4, freq='D')
series = pd.Series([0.0, np.nan, 2.0, 3.0], index=index)
df = pd.DataFrame({'s':series})
print(df)
print('-'*30)
print(df.first_valid_index())
print('-'*30)
print(df.last_valid_index())

              s
2018-11-15  0.0
2018-11-16  NaN
2018-11-17  2.0
2018-11-18  3.0
------------------------------
2018-11-15 00:00:00
------------------------------
2018-11-18 00:00:00

pandas.DataFrame.resample¶

DataFrame.resample(rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0, on=None, level=None)
Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.

index = pd.date_range('1/1/2018', periods=12, freq='M')
df = pd.DataFrame(np.random.randn(12, 3), columns=['Sales','Cost','Profit'], index=index)
print(df)
print('-'*50)
print(df.resample('Q').sum())
df.resample('Y').sum()

               Sales      Cost    Profit
2018-01-31  1.066053 -0.396406 -0.490727
2018-02-28 -1.275683  1.533675  0.743364
2018-03-31  0.771554  1.199510  0.671340
2018-04-30  1.167655 -1.289586  0.932610
2018-05-31 -0.621035  1.137168 -1.493918
2018-06-30 -1.854931  0.847472 -0.090845
2018-07-31  1.234532  0.332266  0.352186
2018-08-31  0.287898 -0.632708 -0.279520
2018-09-30 -0.842835  0.144228 -0.395220
2018-10-31  1.934353 -0.922857  0.290504
2018-11-30  0.061530 -1.381387 -0.278487
2018-12-31  0.961158  0.081062  1.555879
--------------------------------------------------
               Sales      Cost    Profit
2018-03-31  0.561924  2.336778  0.923978
2018-06-30 -1.308311  0.695054 -0.652153
2018-09-30  0.679595 -0.156215 -0.322554
2018-12-31  2.957041 -2.223182  1.567896

def custom_resampler(array_like):
    return np.sum(array_like)+5
df.resample('Q').apply(custom_resampler) # apply a custom function to each group
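Because the frame above is random, the quarterly sums are hard to verify by eye. The sketch below repeats the idea with the values 1 to 12 so each quarter can be checked by hand. It passes offset objects (`MonthEnd`, `QuarterEnd`) instead of the `'M'` and `'Q'` string aliases, which is an assumption made to avoid alias differences between pandas versions; `sum_plus_five` is a hypothetical helper mirroring `custom_resampler`.

```python
import pandas as pd

index = pd.date_range("2018-01-31", periods=12, freq=pd.offsets.MonthEnd())
df = pd.DataFrame({"Sales": range(1, 13)}, index=index)

quarter_end = pd.offsets.QuarterEnd(startingMonth=12)

# Built-in aggregation: each quarter sums its three months.
quarterly = df.resample(quarter_end).sum()
print(quarterly)

# A custom function is called with one quarter's data at a time.
def sum_plus_five(values):
    return values.sum() + 5

adjusted = df.resample(quarter_end).apply(sum_plus_five)
print(adjusted)
```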

pandas.DataFrame.to_period¶

Convert DataFrame from DatetimeIndex to PeriodIndex with desired frequency.

index = pd.date_range('1/1/2018', periods=2, freq='M')
df = pd.DataFrame(np.random.randn(2, 3), columns=['Sales','Cost','Profit'], index=index)
print(df)
print('-'*50)
print(df.to_period())  # convert month-end timestamps to monthly periods

               Sales      Cost    Profit
2018-01-31  0.333414 -0.087775 -1.214567
2018-02-28  0.182817  0.856936 -0.405778
--------------------------------------------------
            Sales      Cost    Profit
2018-01  0.333414 -0.087775 -1.214567
2018-02  0.182817  0.856936 -0.405778

Plotting¶

pandas.DataFrame.plot¶

data : DataFrame

x : label or position, default None

y : label, position or list of label, positions, default None

Allows plotting of one column versus another

kind : str

line : line plot (default)
bar : vertical bar plot
barh : horizontal bar plot
hist : histogram
box : boxplot
kde : Kernel Density Estimation plot
density : same as ‘kde’
area : area plot
pie : pie plot
scatter : scatter plot
hexbin : hexbin plot

ax : matplotlib axes object, default None

subplots : boolean, default False

Make separate subplots for each column

sharex : boolean, default True if ax is None else False

In case subplots=True, share x axis and set some x axis labels to invisible; defaults to True if ax is None otherwise False if an ax is passed in; Be aware, that passing in both an ax and sharex=True will alter all x axis labels for all axis in a figure!

sharey : boolean, default False

In case subplots=True, share y axis and set some y axis labels to invisible

layout : tuple (optional)

(rows, columns) for the layout of subplots

figsize : a tuple (width, height) in inches

use_index : boolean, default True

Use index as ticks for x axis

title : string or list

Title to use for the plot. If a string is passed, print the string at the top of the figure. If a list is passed and subplots is True, print each item in the list above the corresponding subplot.

grid : boolean, default None (matlab style default)

Axis grid lines

legend : False/True/’reverse’

Place legend on axis subplots

style : list or dict

matplotlib line style per column

logx : boolean, default False

Use log scaling on x axis

logy : boolean, default False

Use log scaling on y axis

loglog : boolean, default False

Use log scaling on both x and y axes

xticks : sequence

Values to use for the xticks

yticks : sequence

Values to use for the yticks

xlim : 2-tuple/list

ylim : 2-tuple/list

rot : int, default None

Rotation for ticks (xticks for vertical, yticks for horizontal plots)

fontsize : int, default None

Font size for xticks and yticks

colormap : str or matplotlib colormap object, default None

Colormap to select colors from. If string, load colormap with that name from matplotlib.

colorbar : boolean, optional

If True, plot colorbar (only relevant for ‘scatter’ and ‘hexbin’ plots)

position : float

Specify relative alignments for bar plot layout. From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center)

table : boolean, Series or DataFrame, default False

If True, draw a table using the data in the DataFrame and the data will be transposed to meet matplotlib’s default layout. If a Series or DataFrame is passed, use passed data to draw a table.

yerr : DataFrame, Series, array-like, dict and str

See Plotting with Error Bars for detail.

xerr : same types as yerr.

stacked : boolean, default False in line and bar plots, and True in area plot

If True, create stacked plot.

sort_columns : boolean, default False

Sort column names to determine plot ordering

secondary_y : boolean or sequence, default False

Whether to plot on the secondary y-axis. If a list/tuple, which columns to plot on the secondary y-axis.

mark_right : boolean, default True

When using a secondary_y axis, automatically mark the column labels with “(right)” in the legend

**kwds : keywords

Options to pass to matplotlib plotting method
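To make a few of the parameters above concrete, here is a minimal sketch combining `subplots`, `layout`, `sharex`, `figsize`, `title` and `grid`. The data, the title text and the choice of the non-interactive `Agg` backend are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch runs without a display

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((12, 3)), index=range(1, 13),
                  columns=["Product_1", "Product_2", "Product_3"])

axes = df.plot(
    kind="line",
    subplots=True,      # one Axes per column
    layout=(3, 1),      # (rows, columns) of the subplot grid
    sharex=True,        # the three subplots share one x axis
    figsize=(6, 8),     # width and height in inches
    title="Monthly values per product",
    grid=True,
)
print(axes.shape)
```

With `subplots=True` the call returns an array of Axes rather than a single Axes, so follow-up tweaks such as `axes[1].legend(...)` need an index.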

pandas.DataFrame.plot.area¶

df = pd.DataFrame(
    np.random.rand(12,3),
    index=range(1, 13),
    columns =['Product_1','Product_2','Product_3']
)
print(df)
ax = df.plot(kind='area')
ax = df.plot.area(x=None, y=None)

    Product_1  Product_2  Product_3
1    0.568601   0.270091   0.847199
2    0.989733   0.041083   0.769295
3    0.653774   0.275836   0.553220
4    0.814210   0.703038   0.149856
5    0.958921   0.470318   0.993422
6    0.902969   0.032810   0.120629
7    0.355975   0.114863   0.546863
8    0.961236   0.779842   0.161180
9    0.353689   0.836001   0.292414
10   0.066087   0.770575   0.149908
11   0.733022   0.025898   0.339293
12   0.029467   0.285820   0.007905

pandas.DataFrame.plot.bar¶

#df.plot(kind='bar',stacked=True)
df = pd.DataFrame({'Product Category':['Product_1', 'Product_2', 'Product_3'], 'Sales':[32, 45, 20], 'Cost':[25, 30, 15]})
print(df)
ax = df.plot.bar(x='Product Category', y=['Sales', 'Cost'], rot=45) # rot: rotation of the tick labels, in degrees

  Product Category  Sales  Cost
0        Product_1     32    25
1        Product_2     45    30
2        Product_3     20    15

speed = [0.1, 17.5, 40, 48, 52, 69, 88]
lifespan = [2, 8, 70, 1.5, 25, 12, 28]
index = ['snail', 'pig', 'elephant', 'rabbit', 'giraffe', 'coyote', 'horse']
df = pd.DataFrame({'speed': speed,
                   'lifespan': lifespan}
                  , index=index
                 )
print(df)
df.plot(kind='bar',rot=0)
ax = df.plot.bar(rot=0)
print('1'*50)
axes = df.plot.bar(rot=0, subplots=True)
print('2'*50)
axes[1].legend(loc=2)
ax = df.plot.bar(y='lifespan', rot=0)
ax = df.plot.bar(x='lifespan', rot=0)

          speed  lifespan
snail       0.1       2.0
pig        17.5       8.0
elephant   40.0      70.0
rabbit     48.0       1.5
giraffe    52.0      25.0
coyote     69.0      12.0
horse      88.0      28.0
11111111111111111111111111111111111111111111111111
22222222222222222222222222222222222222222222222222

pandas.DataFrame.plot.barh¶

df.plot(kind='barh')
axes = df.plot.barh(rot=0, subplots=True)

pandas.DataFrame.plot.box¶

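A box plot (also called a box-and-whisker plot) is a statistical chart that shows how a set of data is dispersed.
It reflects the distribution of the raw data and also lets the distributions of several groups be compared.
A box plot is drawn as follows:
- First find the maximum, the minimum, the median and the two quartiles of the data.
- Then connect the two quartiles to draw the box.
- Finally connect the maximum and the minimum to the box; the median lies inside the box.
- Quartiles are the values at the three cut points obtained when all values are sorted in ascending order and divided into four equal parts.
- They are mostly used for drawing box plots. The lower and upper quartiles are the values at the 25% and 75% positions of the sorted data.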
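The quartiles described above can be computed directly, which helps when reading the box plot that follows. A minimal sketch using the `speed` values from this section; note that matplotlib, which pandas uses for drawing, by default ends the whiskers at the most extreme points within 1.5 times the interquartile range of the box, so they do not always reach the minimum and maximum.

```python
import pandas as pd

speed = pd.Series([0.1, 17.5, 40, 48, 52, 69, 88], name="speed")

# Lower quartile, median and upper quartile (linear interpolation by default).
q1, median, q3 = speed.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range: the height of the box
print(q1, median, q3, iqr)

# describe() reports the same three values as 25%, 50% and 75%.
summary = speed.describe()
print(summary[["25%", "50%", "75%"]].tolist())
```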

df.plot(kind='box')
print(df)
df.describe()

          speed  lifespan
snail       0.1       2.0
pig        17.5       8.0
elephant   40.0      70.0
rabbit     48.0       1.5
giraffe    52.0      25.0
coyote     69.0      12.0
horse      88.0      28.0

pandas.DataFrame.boxplot¶

np.random.seed(1234)
df = pd.DataFrame(np.random.randn(10,4), columns=['Col1', 'Col2', 'Col3', 'Col4'])
boxplot = df.boxplot(column=['Col1', 'Col2', 'Col3', 'Col4'])
df

pandas.DataFrame.plot.line¶

df.plot(kind='line')

<matplotlib.axes._subplots.AxesSubplot at 0x7f78f5648ba8>

pandas.DataFrame.plot.pie¶

df = pd.DataFrame({'Sales': [330, 487, 597],
                   'profit': [24.397, 60.518, 63.781]},
                  index=['P1', 'P2', 'P3'])
plot = df.plot.pie(y='Sales', figsize=(5, 5))
df

pandas.DataFrame.plot.scatter¶

df = pd.DataFrame(np.random.randn(1000, 3), columns=['length', 'width', 'species'])
ax1 = df.plot.scatter(x='length', y='width', c='DarkBlue')

=============Special features=============¶

Column operations¶

Add: df['name']=0
Modify: df['name']=1

d_6 = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
       'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
d_6

{'one': a    1.0
 b    2.0
 c    3.0
 dtype: float64, 'two': a    1.0
 b    2.0
 c    3.0
 d    4.0
 dtype: float64}

df6=pd.DataFrame(d_6)
df6

df['column']¶

df6['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

df['newcolumn']=value¶

When inserting a scalar value, it will naturally be propagated to fill the column

df6['three'] = df6['one'] * df6['two']
df6['flag'] = df6['one'] > 2
df6['newtest'] = 666
df6

df6['column'][:2] Copy selected rows of a column into a new column¶

df6['one_trunc'] = df6['one'][:2] # copy only the first two rows of the column; the remaining rows become NaN
df6

df.assign()¶

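Assigns new columns (an existing column name is replaced; a new name is added).
Returns a new DataFrame and leaves the original DataFrame unmodified.
A value may also be a lambda (any callable), which receives the DataFrame being assigned to as its argument.

df6.assign(one_three_ratio=df6['one']/df6['three'])

df6.assign(proc_one=lambda x: x['one']**2)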

df.assign().assign()¶

dependent = pd.DataFrame({"A": [1, 1, 1]})
print(dependent)
dependent.assign(A=lambda x: x['A'] + 1).assign(B=lambda x: x['A'] + 2)

   A
0  1
1  1
2  1

dependent.assign(A=lambda x: x["A"] + 1,B=lambda x: x["A"] + 2)
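The two cells above produce the same frame. A minimal sketch that checks this and confirms the original is left untouched; the note about in-order evaluation assumes pandas 0.23 or later on Python 3.6 or later.

```python
import pandas as pd

dependent = pd.DataFrame({"A": [1, 1, 1]})

chained = (dependent
           .assign(A=lambda x: x["A"] + 1)
           .assign(B=lambda x: x["A"] + 2))

# In a single call the keyword arguments are evaluated in order,
# so B already sees the updated A.
single = dependent.assign(A=lambda x: x["A"] + 1,
                          B=lambda x: x["A"] + 2)

print(chained.equals(single))
print(single["B"].tolist())
print(dependent.columns.tolist())  # the original frame is unchanged
```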

df.pop('column')¶

pop removes a column in place and returns it.

three = df6.pop('three')
print(type(three))
print(three)
df6

<class 'pandas.core.series.Series'>
a    1.0
b    4.0
c    9.0
d    NaN
Name: three, dtype: float64

df.insert(location, newcolumnname, value)¶

df6.insert(2,'one_copy',df6.pop('one'))  # pop() returns the removed column
df6

df.drop('column')¶

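Removes the specified columns; several rows or columns can be dropped at once.
axis=1 drops columns, axis=0 drops rows.
Pass inplace=True to modify the original data; the default is inplace=False.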

df6_1=df6.drop('flag', axis=1, inplace=False)
print(df6)
df6_1

   two   flag  one_copy  newtest  one_trunc
a  1.0  False       1.0      666        1.0
b  2.0  False       2.0      666        2.0
c  3.0   True       3.0      666        NaN
d  4.0  False       NaN      666        NaN

del df['column']¶

Deletes in place; only one column can be deleted per statement.

del df6['two']  # delete the column directly from the original data
df6
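The three ways of removing a column shown above differ in what they return and whether they modify the frame. A side-by-side sketch on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [5, 6], "four": [7, 8]})

# drop returns a new frame and can remove several columns at once.
smaller = df.drop(columns=["one", "two"])
print(df.columns.tolist(), smaller.columns.tolist())

# pop removes one column in place and hands it back as a Series.
three = df.pop("three")
print(type(three).__name__, df.columns.tolist())

# del removes one column in place and returns nothing.
del df["four"]
print(df.columns.tolist())
```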

Indexing¶

The basics of indexing are as follows:

Operation	Syntax	Result
Select column	df[col]	Series
Select row by label	df.loc[label]	Series
Select row by integer location	df.iloc[loc]	Series
Select columns by integer location	df.iloc[:, loc]	Series
Slice rows	df[5:10]	DataFrame
Select rows by boolean vector	df[bool_vec]	DataFrame

Row selection, for example, returns a Series whose index is the columns of the DataFrame:

df_i=pd.DataFrame({'one':[1., 2., 3., np.nan]
                   ,'bar':[1., 2., 3., np.nan]
                   ,'flag':[False, False, True, False]
                   ,'foo':['bar', 'bar', 'bar', 'bar']
                   ,'one_trunc':[1., 2., np.nan, np.nan]
                  }, index=['a','b','c','d'])
df_i

df_i['bar'] # Select column

a    1.0
b    2.0
c    3.0
d    NaN
Name: bar, dtype: float64

df_i.iloc[:, 0:4] # Select columns by integer location (a slice returns a DataFrame)

df_i.loc['b'] # Select row by label

one              2
bar              2
flag         False
foo            bar
one_trunc        2
Name: b, dtype: object

df_i.iloc[-1] # Select row by integer location

one            NaN
bar            NaN
flag         False
foo            bar
one_trunc      NaN
Name: d, dtype: object

df_i[1:3] # Slice rows
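The Result column of the table above can be confirmed by checking the type each expression returns. A minimal sketch on a reduced version of `df_i` (two columns only, an assumption to keep it short):

```python
import numpy as np
import pandas as pd

df_i = pd.DataFrame({"one": [1.0, 2.0, 3.0, np.nan],
                     "flag": [False, False, True, False]},
                    index=["a", "b", "c", "d"])

results = {
    "df[col]": df_i["one"],
    "df.loc[label]": df_i.loc["b"],
    "df.iloc[loc]": df_i.iloc[-1],
    "df[1:3]": df_i[1:3],
    "df[bool_vec]": df_i[df_i["flag"]],
}
kinds = {syntax: type(obj).__name__ for syntax, obj in results.items()}
for syntax, kind in kinds.items():
    print(f"{syntax:15} -> {kind}")
```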

Filtering¶

from numpy.random import randn
from pandas import DataFrame
df_q = pd.DataFrame(randn(10, 2), columns=list('ab'))
df_q

pd.DataFrame()¶

df[df['column']>value]¶

df_q[df_q['a']>0]

df.query(expr, inplace=False, **kwargs)¶

df_q.query('a > b')

df_q[df_q.a > df_q.b]  # same result as the previous expression
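Since `df_q` is random, the equivalence of the two forms is easier to see on fixed values. The sketch below also shows referencing a Python variable inside the expression with `@`; the `threshold` variable is hypothetical.

```python
import pandas as pd

df_q = pd.DataFrame({"a": [1.0, -2.0, 3.0, 0.5],
                     "b": [0.0, 1.0, 5.0, 0.25]})

by_mask = df_q[df_q["a"] > df_q["b"]]
by_query = df_q.query("a > b")
print(by_mask.equals(by_query))

# Prefix a Python variable with @ to use it inside the expression.
threshold = 0.1   # hypothetical cutoff, chosen for illustration
combined = df_q.query("a > b and b > @threshold")
print(combined)
```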

Alignment¶

df_align_1 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
print(df_align_1)
df_align_2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
print(df_align_2)

          A         B         C         D
0 -0.212638  1.383167  0.481298  0.097620
1 -1.020895  0.000780  0.180238 -1.130841
2 -0.916284  0.214761  1.419390 -0.534460
3  0.798228 -0.663361  0.230646  1.692948
4  0.386671  1.322776 -0.650471 -0.037331
5  0.195719 -1.773187 -1.713767  1.229453
6  1.327582  1.630621 -0.230174 -0.440809
7 -2.205679 -1.375005 -0.359556 -1.193813
8  0.449054  0.448315  0.249817  0.363814
9 -0.062270 -0.536387  0.453623 -0.588019
          A         B         C
0  0.658425 -0.277770 -0.186852
1  1.239148  0.664975 -1.033754
2  0.459361  0.391003  0.319687
3 -0.060047  1.452571  1.736672
4 -2.036418 -1.741175 -1.695446
5  0.662236 -0.570800  0.887765
6 -0.973147 -1.801421  1.403986

df_align_1+df_align_2

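df_align_1 - df_align_1.iloc[9] # subtracts row 9 from every row; this works for a single row (a Series), whereas subtracting a multi-row DataFrame aligns on the index and leaves NaN in every row without a match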
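The two alignment rules used in this section can be seen on small fixed frames. A minimal sketch with hypothetical data: frame-with-frame arithmetic aligns on both axes, while frame-with-Series arithmetic matches the Series index to the columns.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
right = pd.DataFrame({"A": [100, 200]})          # fewer rows, no column B

# Frame + frame aligns on both index and columns; anything present
# on only one side becomes NaN.
total = left + right
print(total)

# Frame - Series matches the Series index to the columns and
# repeats the subtraction for every row.
centered = left - left.iloc[0]
print(centered)
```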

Broadcast¶

df

df*10

1/df

(df*100).round(2)

Boolean operators¶

dfb1 = pd.DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)
dfb1

dfb2 = pd.DataFrame({'a' : [0, 0, 1], 'b' : [1, 1, 0] }, dtype=bool)
dfb2

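AND (&) df1 & df2¶

dfb1 & dfb2 # AND: True only where both frames are True at that position, otherwise False

OR (|) df1 | df2¶

dfb1 | dfb2 # OR: True where at least one of the frames is True

XOR (^) df1 ^ df2¶

If the two values a and b differ, the XOR result is True (1). If they are the same, the result is False (0).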

dfb1 ^ dfb2
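The three operators can be checked against each other on the same `dfb1` and `dfb2` frames. A short sketch; the variable names `both`, `either` and `differ` are introduced here for illustration.

```python
import pandas as pd

dfb1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)
dfb2 = pd.DataFrame({"a": [0, 0, 1], "b": [1, 1, 0]}, dtype=bool)

both = dfb1 & dfb2      # element-wise AND
either = dfb1 | dfb2    # element-wise OR
differ = dfb1 ^ dfb2    # element-wise XOR
print(both, either, differ, sep="\n\n")

# XOR is True exactly where the two frames disagree.
xor_is_inequality = differ.equals(dfb1 != dfb2)
print(xor_is_inequality)
```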

Comparisons¶

df_align_1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
print(df_align_1)
df_align_2 = pd.DataFrame(np.random.randn(2, 2), columns=['A', 'B'])
print(df_align_2)
df_align_3 = df_align_1.add(df_align_2, fill_value=0)
print(df_align_3)

          A         B         C
0 -1.198445 -0.239510  1.655233
1  0.104431  1.293790 -1.114275
2  0.512665  0.130736  0.316020
3  0.229451 -0.431436 -1.013582
4  0.310191  1.877913 -0.701835
          A         B
0 -3.233505  0.200243
1 -0.139337 -0.037094
          A         B         C
0 -4.431950 -0.039267  1.655233
1 -0.034906  1.256696 -1.114275
2  0.512665  0.130736  0.316020
3  0.229451 -0.431436 -1.013582
4  0.310191  1.877913 -0.701835
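In the output above, cells that exist in only one frame keep their value because the missing operand is treated as `fill_value`. A minimal sketch on hypothetical fixed data that contrasts this with the plain `+` operator and shows that a cell missing on both sides still ends up NaN:

```python
import numpy as np
import pandas as pd

big = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [10.0, 20.0, 30.0]})
small = pd.DataFrame({"A": [100.0]})

plain = big + small                      # unmatched cells become NaN
filled = big.add(small, fill_value=0)    # a missing operand counts as 0

print(plain)
print(filled)
```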

	0
0	0.053168
1	0.644198
2	-1.012411
3	2.371181
4	1.440502

	(A, D)	variable_0	variable_1	value
0	a	B	E	1
1	b	B	E	3
2	c	B	E	5
3	a	C	F	2
4	b	C	F	4
5	c	C	F	6

	a	b
3	1.692565	-0.367258
4	-1.228971	-1.649726
5	-0.935311	-1.194938
7	0.383545	0.216440
8	1.320212	-2.022758
9	0.670143	-0.378956

		company1	company2
		area1	area2
Quarter	periods
1	Oct	389.0	18.0
1	Nov	24.0	59.0
2	Oct	80.5	NaN
2	Nov	NaN	550.0

		speed	species
		max	type
class	name
bird	falcon	389.0	fly
bird	parrot	24.0	fly
mammal	lion	80.5	run
mammal	monkey	NaN	jump

		big	small
class1	class2
A	x1	45.0	30.0
A	x2	200.0	100.0
B	y1	1.5	1.0
	y2	30.0	20.0
	y3	250.0	150.0
	y4	1.5	0.8
	y5	320.0	250.0
	y6	1.0	0.8
	y7	0.3	0.2

Year	Y1				Y2
Quarter	Q1	Q2	Q3	Q4	Q1	Q2	Q3	Q4
one	9	11	13	8	11	10	7	13
two	14	14	9	12	12	12	13	14
three	13	12	11	11	6	6	14	9

Jasper	one			two			Total
Casper	dul	jas	shi	cas	dul	shi
a
bar	NaN	NaN	2.0	NaN	NaN	NaN	2
foo	101.0	NaN	2.0	NaN	1.0	2.0	106
nam	NaN	1.0	NaN	1.0	NaN	NaN	2
Total	101.0	1.0	4.0	1.0	1.0	2.0	110

	Red	Green	Blue	Yellow	Brown	White
a	1.0	5.0	NaN	NaN	3.0	1.0
b	3.0	0.0	NaN	NaN	NaN	NaN
c	NaN	NaN	1.0	6.0	NaN	NaN
d	5.0	3.0	NaN	NaN	5.0	2.0
e	NaN	NaN	9.0	6.0	4.0	1.0

	A	B	key	C	D	v
0	A0	B0	K0	NaN	NaN	NaN
1	A1	B1	K1	NaN	NaN	NaN
2	A2	B2	K0	NaN	NaN	NaN
3	A3	B3	K1	NaN	NaN	NaN
K0	NaN	NaN	NaN	C0	D0	NaN
K1	NaN	NaN	NaN	C1	D1	7.0
K1	NaN	NaN	NaN	C1	D1	8.0
K2	NaN	NaN	NaN	NaN	NaN	9.0

	A	B	C	D	E
second
one	0.191019	0.115292	-1.219449	0.326575	1.170924
two	-0.497769	0.247957	-0.116349	-1.485385	0.224618

	A	B	C	D	E	Col_sum
0	0.576207	1.424097	-0.237222	1.205378	0.467291	3.435751
1	0.015648	-1.629934	0.398383	0.397965	-0.567927	-1.385865
2	0.234611	-0.453310	-2.221688	-1.048314	0.062570	-3.426131
3	0.143311	-1.669318	0.108048	-0.186586	-0.093079	-1.697625
Row_sum	0.969777	-2.328465	-1.952480	0.368443	-0.131144	-3.073870

	A	B	C	D
2018-01-01	-0.401679	-1.048301	0.891787	0.420646
2018-01-02	NaN	NaN	NaN	NaN
2018-01-03	-0.736705	0.104867	-1.081139	-1.141609
2018-01-04	1.138384	0.943434	0.189352	0.720963

	A	B
C
b'Hello'	1	2.0
b'World'	2	3.0

	A	variable	value
0	a	B	1
1	b	B	3
2	c	B	5

	A	B	C	D
2018-01-01	0.089958	-0.282801	-0.543263	-0.336551
2018-01-02	NaN	NaN	NaN	NaN
2018-01-03	0.089958	-0.282801	-0.543263	-0.336551
2018-01-04	0.089958	-0.282801	-0.543263	-0.336551

	A	B	C	D
2018-01-01	2.206983	1.079533	0.945626	1.171231
2018-01-02	NaN	NaN	NaN	NaN
2018-01-03	2.206983	1.079533	0.945626	1.171231
2018-01-04	2.206983	1.079533	0.945626	1.171231

	type	id	date	code	amount
0	exact	9720	2017-10-01	515	458
1	exact	9720	2017-10-01	510	896
2	fuzzy	8242	2017-11-01	122	415
3	fuzzy	8242	2017-11-01	128	782

			code	amount
type	id	date
exact	9720	2017-10-01	512.5	677.0
fuzzy	8242	2017-11-01	125.0	598.5

	type	id	date	code	amount
0	exact	9720	2018-10-01	515	458
1	exact	9720	2018-10-01	510	896
2	fuzzy	8242	2018-11-01	122	415
3	fuzzy	8242	2018-11-01	128	782
4	fuzzy	8243	2018-11-02	128	782
5	fuzzy	8243	2018-11-03	128	782

	type	id	date	code	amount
id
8242	fuzzyfuzzy	16484	2018-11-012018-11-01	250	1197
8243	fuzzyfuzzy	16486	2018-11-022018-11-03	256	1564

	object	numeric	categorical
count	3	3.0	3
unique	2	NaN	3
top	b	NaN	g
freq	2	NaN	1
mean	NaN	2.0	NaN
std	NaN	1.0	NaN
min	NaN	1.0	NaN
25%	NaN	1.5	NaN
50%	NaN	2.0	NaN
75%	NaN	2.5	NaN
max	NaN	3.0	NaN

	FR	GR	IT
2018-11-01	NaN	NaN	NaN
2018-12-01	0.013810	0.013684	0.006549
2019-01-01	0.053365	0.059318	0.061876

	A	B	C
first	0.7	0.686108	0.08
second	0.8	0.348655	0.86
third	0.6	0.743559	0.12

	A
2018-12-09 08:00:00	3
2018-12-09 12:00:00	4
2018-12-09 16:00:00	5

	http_status	response_time
Safari	404	0.07
Iceweasel	0	0.00
Comodo Dragon	0	0.00
IE10	404	0.08
Chrome	200	0.02

	prices
2018-10-29	100.0
2018-10-30	100.0
2018-10-31	100.0
2018-11-01	100.0
2018-11-02	105.0
2018-11-03	NaN
2018-11-04	100.0
2018-11-05	89.0
2018-11-06	88.0
2018-11-07	NaN

	A	B	C	D
8	0.342763	-0.770604	-0.795568	-0.677436
5	1.727775	-0.952100	0.907272	1.660095
7	0.184237	0.382458	-0.111344	-0.026671

	s
2018-11-15 00:00:00	0.0
2018-11-15 00:00:30	NaN
2018-11-15 00:01:00	NaN
2018-11-15 00:01:30	2.0
2018-11-15 00:02:00	2.0
2018-11-15 00:02:30	3.0
2018-11-15 00:03:00	3.0

	s
2018-11-15 00:02:00	0.0
2018-11-15 00:03:00	8.5
2018-11-15 00:04:00	2.0
2018-11-15 00:05:00	3.0

	C	D
2019-01-31	k	a
2019-02-28	l	c
2019-03-31	m	g
2019-04-30	n	j
2019-05-31	o	r