pandas Notes: Handling Missing Data

2018-12-17 10:50:25 Source: cnblogs (博客园)


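pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. The sessions below appear to assume that np, Series and DataFrame have already been imported (for example, import numpy as np and from pandas import Series, DataFrame):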
In [14]: string_data = Series(['aardvark','artichoke',np.nan,'avocado'])
In [15]: string_data
Out[15]:
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
In [16]: string_data.isnull()
Out[16]:
0    False
1    False
2     True
3    False
dtype: bool

Python's built-in None value is also treated as NA:
In [17]: string_data[0] = None
In [18]: string_data.isnull()
Out[18]:
0     True
1    False
2     True
3    False
dtype: bool
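
The two sessions above can be reproduced as a standalone script. This is a minimal sketch, assuming pandas and NumPy are installed and imported under their usual aliases pd and np; the names mask_before and mask_after are illustrative and do not appear in the original.

```python
import numpy as np
import pandas as pd

# A Series of strings with one NaN entry
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

# isnull() flags the NaN entry as missing
mask_before = string_data.isnull()

# Assigning Python's None also produces a missing value
string_data[0] = None
mask_after = string_data.isnull()

print(mask_before.tolist())
print(mask_after.tolist())
```

After the assignment, positions 0 and 2 are both reported as missing, which matches the output shown in Out[18].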

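1. Filtering Out Missing Data
On a Series, dropna returns a new Series that contains only the non-null data and their index values (NA here is presumably an alias for NaN, e.g. from numpy import nan as NA):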
In [20]: data = Series([1,NA,3.5,NA,7])
In [21]: data.dropna()
Out[21]:
0    1.0
2    3.5
4    7.0
dtype: float64

The same result can be obtained with boolean indexing:
In [22]: data[data.notnull()]
Out[22]:
0    1.0
2    3.5
4    7.0
dtype: float64
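
To check that the two approaches really are interchangeable, the results can be compared directly. A minimal sketch, assuming the usual pd and np aliases; the names dropped and filtered are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

dropped = data.dropna()            # drop the missing entries
filtered = data[data.notnull()]    # keep entries where notnull() is True

# Both results hold the same labels and values
print(dropped.equals(filtered))
print(dropped.index.tolist())

# dropna returns a new Series; the original still has its two NaN entries
print(int(data.isnull().sum()))
```

Note that the surviving values keep their original index labels (0, 2, 4) rather than being renumbered.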

For a DataFrame, dropna by default drops every row that contains at least one missing value:
In [23]: data = DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3.]])
In [24]: cleaned = data.dropna()
In [25]: data
Out[25]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
In [26]: cleaned
Out[26]:
     0    1    2
0  1.0  6.5  3.0

Passing how="all" drops only the rows in which every value is NaN:
In [27]: data.dropna(how="all")
Out[27]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
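
The difference between the default behaviour and how="all" is easiest to see by comparing which row labels survive. A minimal sketch, assuming the usual pd and np aliases; NA is defined here as an alias for np.nan, and any_dropped and all_dropped are illustrative names.

```python
import numpy as np
import pandas as pd

NA = np.nan  # alias assumed to match the NA used in the sessions above

data = pd.DataFrame([[1., 6.5, 3.],
                     [1., NA, NA],
                     [NA, NA, NA],
                     [NA, 6.5, 3.]])

any_dropped = data.dropna()            # default how="any": drop rows with any NaN
all_dropped = data.dropna(how="all")   # drop only rows that are entirely NaN

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
```

Only row 0 is complete, so it alone survives the default call; with how="all" only row 2, which has no valid values at all, is removed.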

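To drop columns instead of rows, pass axis=1. Here how="all" drops only the columns that consist entirely of NaN: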
In [28]: data[4] = NA
In [29]: data
Out[29]:
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
In [30]: data.dropna(axis=1,how="all")
Out[30]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
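
Combining axis=1 with how="all" matters for this particular frame, because every column contains at least one NaN. The sketch below contrasts the two settings, assuming the usual pd and np aliases; cols_all and cols_any are illustrative names.

```python
import numpy as np
import pandas as pd

NA = np.nan  # alias assumed to match the NA used in the sessions above

data = pd.DataFrame([[1., 6.5, 3.],
                     [1., NA, NA],
                     [NA, NA, NA],
                     [NA, 6.5, 3.]])
data[4] = NA  # add a column that is entirely NaN

# Drop only the columns in which every value is NaN
cols_all = data.dropna(axis=1, how="all")

# Default how="any" along axis=1: drop every column that has any NaN
cols_any = data.dropna(axis=1)

print(cols_all.columns.tolist())
print(cols_any.shape)
```

With the default how="any", no column survives, so the result has four rows and zero columns.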

The thresh argument keeps only the rows that contain at least a given number of non-null values:
In [41]: df
Out[41]:
          0         1         2
0 -0.184676       NaN       NaN
1  0.565214       NaN       NaN
2  0.440203       NaN       NaN
3  0.188283       NaN  0.146847
4  1.696903       NaN  0.554640
5 -1.287915  0.139527 -0.494558
6  0.854922  0.299511  0.773247
In [42]: df.dropna(thresh=2)  # thresh=2: keep rows with at least two non-null values
Out[42]:
          0         1         2
3  0.188283       NaN  0.146847
4  1.696903       NaN  0.554640
5 -1.287915  0.139527 -0.494558
6  0.854922  0.299511  0.773247
In [43]: df.dropna(thresh=1)
Out[43]:
          0         1         2
0 -0.184676       NaN       NaN
1  0.565214       NaN       NaN
2  0.440203       NaN       NaN
3  0.188283       NaN  0.146847
4  1.696903       NaN  0.554640
5 -1.287915  0.139527 -0.494558
6  0.854922  0.299511  0.773247
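
The original does not show how df was built. The sketch below assumes a 7x3 frame of random numbers with the same NaN pattern as the output above (the seed and the construction are assumptions), and then checks thresh against an explicit count of non-null values per row.

```python
import numpy as np
import pandas as pd

# Assumed construction: random values with the NaN pattern shown above
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((7, 3)))
df.iloc[:5, 1] = np.nan   # rows 0-4 are missing column 1
df.iloc[:3, 2] = np.nan   # rows 0-2 are missing column 2

at_least_two = df.dropna(thresh=2)
at_least_one = df.dropna(thresh=1)

# thresh=n is equivalent to keeping rows whose non-null count is >= n
manual = df[df.count(axis=1) >= 2]

print(at_least_two.index.tolist())
print(at_least_one.index.tolist())
print(at_least_two.equals(manual))
```

Rows 0 to 2 have only one valid value each, so they are dropped by thresh=2 but kept by thresh=1.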

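2. Filling In Missing Data
fillna is the main method for filling in missing data (the df below comes from a fresh session, so its random values differ from the earlier example):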
In [9]: df.fillna(0)
Out[9]:
          0         1         2
0  0.863556  0.000000  0.000000
1 -0.099558  0.000000  0.000000
2 -0.605804  0.000000  0.000000
3 -0.934688  0.000000 -1.198976
4  0.741383  0.000000  0.229845
5 -1.415495  0.511485 -0.086808
6 -0.748325  0.437964 -2.458319
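
One detail worth verifying is that fillna returns a new object and leaves the original untouched by default. A minimal sketch with a small hand-written frame (the values are made up for illustration), assuming the usual pd and np aliases:

```python
import numpy as np
import pandas as pd

# Small illustrative frame with three missing values
df = pd.DataFrame({0: [1.0, 2.0, 3.0],
                   1: [np.nan, np.nan, 0.5],
                   2: [np.nan, -1.2, 0.2]})

filled = df.fillna(0)   # replace every NaN with 0

print(int(filled.isnull().sum().sum()))  # missing values left in the result
print(int(df.isnull().sum().sum()))      # missing values still in the original
```

To keep the filled values, assign the result back to a name, for example df = df.fillna(0).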

Calling fillna with a dict fills each column with its own value; the dict keys are column labels:
In [11]: df.fillna({1:0.5,2:-1})
Out[11]:
          0         1         2
0  0.863556  0.500000 -1.000000
1 -0.099558  0.500000 -1.000000
2 -0.605804  0.500000 -1.000000
3 -0.934688  0.500000 -1.198976
4  0.741383  0.500000  0.229845
5 -1.415495  0.511485 -0.086808
6 -0.748325  0.437964 -2.458319
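
In the session above, column 0 has no missing values, so it does not show what happens to a column that is absent from the dict. The sketch below uses a small made-up frame in which column 0 also contains a NaN, assuming the usual pd and np aliases:

```python
import numpy as np
import pandas as pd

# Illustrative frame: the first row is missing in every column
df = pd.DataFrame({0: [np.nan, 2.0],
                   1: [np.nan, 0.5],
                   2: [np.nan, 0.2]})

# Fill column 1 with 0.5 and column 2 with -1; column 0 is not in the dict
filled = df.fillna({1: 0.5, 2: -1})

print(filled.loc[0, 1])
print(filled.loc[0, 2])
print(bool(pd.isna(filled.loc[0, 0])))
```

Columns that are not listed in the dict are left as they are, so the NaN in column 0 remains.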
