A Simple E-mail Crawler in Python
2018-07-20 Source: open-open
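import re

import requests

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# Regular expressions for e-mail addresses and href links
email_re = re.compile(r'([\w.]+@[\w.]+\.\w+)')
link_re = re.compile(r'href="(.*?)"')


def crawl(url):
    """Return the set of e-mail addresses found on url and on the pages it links to."""
    result = set()

    req = requests.get(url, timeout=10)

    # Give up if the start page could not be fetched
    if req.status_code != 200:
        return result

    # Find all e-mails on the start page itself
    result.update(email_re.findall(req.text))

    # Find links
    links = link_re.findall(req.text)
    print("\nFound {} links".format(len(links)))

    # Search each linked page for e-mails
    for link in links:
        # Get an absolute URL for the link
        link = urljoin(url, link)

        # Skip links that are not web pages (mailto:, javascript:, ...)
        if not link.startswith(('http://', 'https://')):
            continue

        try:
            page = requests.get(link, timeout=10)
        except requests.RequestException:
            continue

        # Find all e-mails on the linked page
        if page.status_code == 200:
            result.update(email_re.findall(page.text))

    return result


if __name__ == '__main__':
    emails = crawl('http://www.realpython.com')

    print("\nScraped e-mail addresses:")
    for email in emails:
        print(email)
    print("\n")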
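
The link and e-mail extraction steps can be tried without any network access. The following is a minimal sketch, not part of the original script: the helper name `extract`, the sample HTML, and the example.com addresses are all made up for illustration, and the e-mail pattern is a simplified assumption rather than a complete address matcher. It targets Python 3 only.

```python
import re
from urllib.parse import urljoin

# Illustrative patterns (assumptions; simplified, not a full RFC 5322 matcher)
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
LINK_RE = re.compile(r'href="(.*?)"')


def extract(base_url, html):
    """Return (emails, absolute_links) found in an HTML string."""
    emails = set(EMAIL_RE.findall(html))
    # urljoin resolves relative hrefs against the page they were found on
    links = [urljoin(base_url, href) for href in LINK_RE.findall(html)]
    return emails, links


# Made-up sample page used only for this demonstration
SAMPLE_HTML = (
    '<a href="/about">About</a> '
    '<a href="https://example.org/contact">Contact</a> '
    '<p>Write to info@example.com or sales@example.co.uk.</p>'
)

emails, links = extract('http://example.com/index.html', SAMPLE_HTML)
print(sorted(emails))
print(links)
```

Keeping extraction separate from fetching makes the patterns easy to check on a fixed string before pointing a crawler at a live site.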