
Python: decoding with decode(), encoding with encode(), and error handling U…

2018-06-18 02:21:10

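The encoding method: encode()

Description

encode() encodes a string using the specified encoding, which defaults to 'utf-8'. It converts the string from type str to type bytes.

The corresponding decoding method is bytes.decode().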

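Syntax

    str.encode(encoding='utf-8', errors='strict')

str is the string to encode; it must be of type str.
encoding -- optional; the encoding to use. Defaults to 'utf-8'.
errors -- optional; the error-handling scheme. The default is 'strict', meaning that an encoding error raises a UnicodeError (specifically its subclass UnicodeEncodeError). Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace', and any name registered via codecs.register_error().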

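Return value

The method returns the encoded string as a bytes object. This bytes object is what the decode() method described below takes as its input.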
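As a rough illustration of the errors parameter, the sketch below encodes a sample string to ASCII, which cannot represent two of its characters, once under each built-in handler. The sample string and the variable names are made up for this example.

```python
# Sample text (an assumption for illustration): "café 我".
# Neither 'é' (U+00E9) nor '我' (U+6211) can be represented in ASCII.
text = "caf\u00e9 \u6211"

# 'strict' (the default) raises UnicodeEncodeError, a subclass of UnicodeError.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode("ascii", errors="ignore")              # drop unencodable characters
replaced = text.encode("ascii", errors="replace")            # substitute '?'
xml_refs = text.encode("ascii", errors="xmlcharrefreplace")  # XML character references
escaped = text.encode("ascii", errors="backslashreplace")    # Python-style escapes

print(strict_failed)  # True
print(ignored)        # b'caf '
print(replaced)       # b'caf? ?'
print(xml_refs)       # b'caf&#233; &#25105;'
print(escaped)        # b'caf\\xe9 \\u6211'
```

Only 'xmlcharrefreplace' and 'backslashreplace' keep enough information to recover the original characters later; 'ignore' and 'replace' lose them.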

From the official documentation:

    str.encode(encoding="utf-8", errors="strict")

Return an encoded version of the string as a bytes object. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

Changed in version 3.1: Support for keyword arguments added.

------------------------------------------------------------------------------------------------------------------------------------------------

The decoding method: decode()

decode() decodes a bytes object using the encoding named by the encoding argument and returns a str. The default is encoding='utf-8'.

Syntax:

    bytes.decode(encoding='utf-8', errors='strict')

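Parameters

bytes is the byte representation of a string, as produced by the encode() method.

encoding -- the encoding to decode with, e.g. "UTF-8".

errors -- the error-handling scheme. The default is 'strict', meaning that a decoding error raises a UnicodeError (specifically its subclass UnicodeDecodeError). Other possible values are 'ignore', 'replace', 'backslashreplace', and any name registered via codecs.register_error(). 'xmlcharrefreplace' applies only to encoding, not to decoding.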

Return value:

The method returns the decoded string, a str object.
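The decoding side of the errors parameter can be sketched in the same way. The byte string below is an assumed sample containing one byte, 0xff, that can never occur in valid UTF-8.

```python
# Sample bytes (an assumption for illustration): 0xff is never valid in UTF-8.
data = b"abc\xffdef"

# 'strict' (the default) raises UnicodeDecodeError, a subclass of UnicodeError.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = data.decode("utf-8", errors="ignore")            # drop the bad byte
replaced = data.decode("utf-8", errors="replace")          # insert U+FFFD in its place
escaped = data.decode("utf-8", errors="backslashreplace")  # keep it as a \xNN escape

print(strict_failed)   # True
print(ignored)         # abcdef
print(ascii(replaced)) # 'abc\ufffddef'
print(escaped)         # abc\xffdef
```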

From the official documentation:

    bytes.decode(encoding="utf-8", errors="strict")
    bytearray.decode(encoding="utf-8", errors="strict")

Return a string decoded from the given bytes. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

Note

Passing the encoding argument to str allows decoding any bytes-like object directly, without needing to make a temporary bytes or bytearray object.

Changed in version 3.1: Added support for keyword arguments.
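To make the note about str concrete, here is a small sketch that decodes a memoryview slice directly with str(), without first copying it into a bytes object. The variable names are illustrative only.

```python
# A mutable buffer holding UTF-8 data; each of these characters takes 3 bytes.
raw = bytearray("我是文本", "utf-8")

# Slicing a memoryview does not copy the underlying data.
view = memoryview(raw)[:6]  # the bytes of the first two characters

# str() with an encoding argument decodes any bytes-like object directly.
text = str(view, encoding="utf-8")
print(text)  # 我是
```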

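The relationship between encoding and decoding is simply:

str -> bytes: encode

bytes -> str: decode

A string becomes bytes by being encoded, and bytes become a string again by being decoded. Put another way, encoding maps the characters of a string to a sequence of bytes according to the rules of a particular encoding, and decoding interprets a sequence of bytes as characters according to those same rules.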

For example:

    >>> text = '我是文本'
    >>> text
    '我是文本'
    >>> print(text)
    我是文本
    >>> bytesText = text.encode()
    >>> bytesText
    b'\xe6\x88\x91\xe6\x98\xaf\xe6\x96\x87\xe6\x9c\xac'
    >>> print(bytesText)
    b'\xe6\x88\x91\xe6\x98\xaf\xe6\x96\x87\xe6\x9c\xac'
    >>> type(text)
    <class 'str'>
    >>> type(bytesText)
    <class 'bytes'>
    >>> textDecode = bytesText.decode()
    >>> textDecode
    '我是文本'
    >>> print(textDecode)
    我是文本
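One way to see that the bytes depend on the chosen encoding while the text does not is to compare lengths. This is a small sketch reusing the sample string above; the other names are illustrative.

```python
text = '我是文本'

utf8_bytes = text.encode('utf-8')  # 3 bytes per character for these characters
gbk_bytes = text.encode('gbk')     # 2 bytes per character for these characters

print(len(text), len(utf8_bytes), len(gbk_bytes))  # 4 12 8

# Decoding each byte string with the encoding that produced it restores the text.
round_trip_ok = utf8_bytes.decode('utf-8') == gbk_bytes.decode('gbk') == text
print(round_trip_ok)  # True
```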

Example 2

    >>> text = '我好吗'
    >>> byteText = text.encode('gbk')
    >>> byteText
    b'\xce\xd2\xba\xc3\xc2\xf0'
    >>> strText = byteText.decode('gbk')
    >>> strText
    '我好吗'
    >>> byteText.decode('utf-8')
    Traceback (most recent call last):
      File "G:\softs\Anaconda\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-11-f0ef1443f388>", line 1, in <module>
        byteText.decode('utf-8')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte

The final call, byteText.decode('utf-8'), fails because text = '我好吗' was encoded with 'gbk' but is being decoded with the rules of 'utf-8'. The bytes produced by the 'gbk' codec are not a valid UTF-8 sequence, so decoding fails. Always decode with the same encoding that was used to encode. Some ways to handle this kind of error are given further below.
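The fields in a UnicodeDecodeError message are also available as attributes on the exception, which can help when tracking down the offending byte. A minimal sketch using the same GBK bytes (the names info and bad_byte are illustrative):

```python
byteText = '我好吗'.encode('gbk')

try:
    byteText.decode('utf-8')
except UnicodeDecodeError as exc:
    # encoding: codec that failed; start: index of the first bad byte;
    # reason: short description; object: the bytes being decoded.
    info = (exc.encoding, exc.start, exc.reason)
    bad_byte = exc.object[exc.start]

print(info)           # ('utf-8', 0, 'invalid continuation byte')
print(hex(bad_byte))  # 0xce
```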

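When you see an error like this:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xab in position 11126: illegal multibyte sequence

Encoding and decoding problems come up often when handling text in Python, and the error above is one of the most common decoding failures. The fixes below are written for 'gbk', but they apply equally when the codec named in the message is 'utf-8'.
(1) Specify the encoding when opening the file, e.g. open('1.txt', encoding='gbk'). Without an encoding argument, open() falls back to a platform-dependent default, so name the encoding the file was actually written in.
(2) If (1) does not help, the text may contain characters outside the range that GBK covers. Try the broader 'gb18030', e.g. open('1.txt', encoding='gb18030').
(3) If (2) still fails, the file contains bytes that are not valid even in 'gb18030'. You can skip them with the errors parameter, e.g. open('1.txt', encoding='gb18030', errors='ignore'). Note that the offending bytes are silently dropped.

(4) Another common fix is to read the file as bytes and decode explicitly: open('1.txt', 'rb').read().decode('gb18030', 'ignore'). The file must be opened in binary mode, because in Python 3 a text-mode read() returns a str, which has no decode() method.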
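The steps above can be tried end to end on a throwaway file. The sketch below is only an illustration: it writes a temporary file containing GBK text plus one deliberately invalid trailing byte, and the file contents and variable names are assumptions made for the demo.

```python
import os
import tempfile

# Build a small test file: GBK-encoded text followed by a stray 0xff byte,
# which is not valid in GBK or GB18030.
path = os.path.join(tempfile.mkdtemp(), '1.txt')
with open(path, 'wb') as f:
    f.write('我好吗'.encode('gbk') + b'\xff')

# (1)/(2): naming the encoding alone still fails here because of the stray byte.
try:
    with open(path, encoding='gb18030') as f:
        f.read()
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# (3): errors='ignore' drops the undecodable byte.
with open(path, encoding='gb18030', errors='ignore') as f:
    text_from_open = f.read()

# (4): read raw bytes in binary mode, then decode explicitly.
with open(path, 'rb') as f:
    text_from_bytes = f.read().decode('gb18030', 'ignore')

print(strict_ok)        # False
print(text_from_open)   # 我好吗
print(text_from_bytes)  # 我好吗
```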

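The decoding error in the code for Chapter 4 (naive Bayes) of Machine Learning in Action was fixed with the errors='ignore' approach from (3), with 'utf-8' as the encoding: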

    def spamTest():
        # textParse() is the tokenizer defined elsewhere in the chapter's code.
        docList = []; classList = []; fillText = []
        emailDir = 'D:/machinelearning data/machinelearninginaction/Ch04/email'
        for i in range(1, 26):
            # errors='ignore' skips bytes that are not valid UTF-8 instead of raising.
            with open('%s/spam/%d.txt' % (emailDir, i), encoding='utf-8', errors='ignore') as f:
                wordList = textParse(f.read())
            docList.append(wordList)
            fillText.extend(wordList)
            classList.append(1)  # label 1: spam
            with open('%s/ham/%d.txt' % (emailDir, i), encoding='utf-8', errors='ignore') as f:
                wordList = textParse(f.read())
            docList.append(wordList)
            fillText.extend(wordList)
            classList.append(0)  # label 0: ham

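The original version of this code failed with a decoding error while parsing 23.txt in the ham folder, which stopped the whole script. Opening every file as 'utf-8' and ignoring the decoding errors lets it run normally.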
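Ignoring errors silently discards data. When the file's encoding is simply unknown, one possible alternative is to try a few candidate encodings in turn and only fall back to lossy decoding at the end. The helper below is a hypothetical sketch, not part of the book's code; its name and its default candidate list are assumptions.

```python
def decode_with_fallback(data, encodings=('utf-8', 'gb18030')):
    """Return (text, encoding) for the first candidate that decodes data cleanly.

    Hypothetical helper for illustration. If no candidate works, decode with
    the first one and mark undecodable bytes with U+FFFD instead of dropping them.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return data.decode(encodings[0], errors='replace'), encodings[0]


gbk_result = decode_with_fallback('我好吗'.encode('gbk'))
utf8_result = decode_with_fallback('我好吗'.encode('utf-8'))
print(gbk_result)   # ('我好吗', 'gb18030')
print(utf8_result)  # ('我好吗', 'utf-8')
```

A clean decode does not prove the guess is right, since some byte sequences are valid in more than one encoding, so the order of the candidates matters.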
