Python Web Scraping Tutorial: Getting Started

Published 2017-04-25, updated 2018-11-23. Category: quick-start, Python web scraping tutorial. Word count: 9.5k. Reading time: about 9 minutes.

IDE used: PyCharm CE (Community Edition)
Python library documentation: The Python Standard Library

Python 3 web scraping

Creating a scraper

BeautifulSoup

Network connections

from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
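
html.read() returns raw bytes, which is why the output of the snippet above prints as a b'...' literal. A minimal sketch of decoding the response into text follows. It assumes a data: URL as a stand-in for the real page (so the example runs without network access) and falls back to UTF-8 when the response declares no charset; both choices are illustrative, not part of the original tutorial.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical stand-in for a real page: a data: URL carries its own content,
# so urlopen can "fetch" it without touching the network.
sample_html = "<html><body><h1>An Interesting Title</h1></body></html>"
url = "data:text/html;charset=utf-8," + quote(sample_html)

with urlopen(url) as response:
    raw = response.read()  # bytes, not str
    # Use the charset declared in the Content-Type header, if there is one.
    charset = response.headers.get_content_charset() or "utf-8"
    text = raw.decode(charset)

print(type(raw).__name__)
print(text)
```

With a real http:// URL the same three steps apply: read the bytes, look up the declared charset, decode.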

BeautifulSoup

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Installing pip
- Follow https://pip.pypa.io/en/latest/installing/ to install pip
Installing BeautifulSoup
$ pip3 install beautifulsoup4

Usage

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())
print(bsObj.h1)

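Called this way, BeautifulSoup emits a warning because no parser is named. The correct usage passes one explicitly: BeautifulSoup([your markup], "html.parser").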

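Handling HTTP exceptions

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # Return a null value, stop the program, or fall back to another plan
else:
    # The program continues. Note: if the except block above already returns or breaks,
    # the else clause is unnecessary, and this code would not run in that case anyway.
    pass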
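
The try/except/else pattern above can be wrapped in a reusable function. The sketch below is one possible shape, not the tutorial's own code: the name fetch and its opener parameter are hypothetical, the latter added only so the function can be exercised without a network connection. It also catches URLError, which urllib raises when the server cannot be reached at all; HTTPError is a subclass of URLError, so it has to be caught first.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request fails.

    `opener` is a hypothetical injection point used for offline testing.
    """
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", e.code)
        return None
    except URLError as e:
        # The server could not be reached (bad host name, no network).
        print("Connection failed:", e.reason)
        return None
    else:
        return response.read()


def always_404(url):
    # Stand-in opener that simulates a missing page.
    raise HTTPError(url, 404, "Not Found", None, None)


print(fetch("http://example.invalid/missing.html", opener=always_404))
```

Returning None keeps the caller's code simple: it only has to check the result once before handing it to a parser.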

Handling BeautifulSoup exceptions

try:
    badContent = bsObj.nonExistingTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)

Advanced HTML parsing

Selecting elements by CSS attribute with BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")
nameList = bsObj.findAll("span", {"class":"green"})
for name in nameList:
    print(name.get_text())

Navigating the HTML tag tree with BeautifulSoup

Getting child tags: .children

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
for child in bsObj.find("table",{"id":"giftList"}).children:
    print(child)

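Getting the siblings after a tag: .next_siblings
- Returns the tag's siblings, excluding the tag itself, and only those that come after it.
  html = urlopen("http://www.pythonscraping.com/pages/page3.html")
  bsObj = BeautifulSoup(html, "html.parser")
  for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
      print(sibling)
- The counterparts are .previous_siblings (all siblings before the tag), .previous_sibling (the single sibling immediately before it) and .next_sibling (the single sibling immediately after it). All of these are attributes, not methods, so they take no parentheses.
Getting parent tags
- .parent and .parents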

Regular expressions and BeautifulSoup

Find the images whose relative path starts with ../img/gifts/img and ends with .jpg

import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
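
To see what the pattern accepts and rejects without fetching the page, it can be tried against a few strings directly. The sketch below uses only the re module; the src values are made-up examples, not taken from the real page.

```python
import re

# Equivalent to the pattern above, written as a raw string so the
# backslashes reach the regex engine unchanged.
gift_image = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Hypothetical src values of the kind found on a product page.
sources = [
    "../img/gifts/img1.jpg",
    "../img/gifts/img6.jpg",
    "../img/gifts/logo.jpg",       # file name does not start with "img"
    "../img/gifts/img3.png",       # wrong extension
    "/static/img/gifts/img2.jpg",  # no leading "../"
]

matches = [src for src in sources if gift_image.search(src)]
print(matches)
```

Only the first two entries survive the filter, which is the behavior the prose description of the pattern asks for.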

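Accessing attributes

Often you do not need a tag's content but one of its attributes.
myTag.attrs returns all of a tag object's attributes as a dictionary. For example, to get the src attribute: myImgTag.attrs["src"]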
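
The idea that a tag's attributes form a dictionary can be shown without any third-party package. The sketch below swaps in the standard library's html.parser.HTMLParser for BeautifulSoup, purely as an illustration; the class name ImgAttrCollector and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Collect the attributes of every <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "img":
            self.images.append(dict(attrs))


collector = ImgAttrCollector()
collector.feed('<p><img src="../img/gifts/img1.jpg" alt="Vegetable basket"></p>')

first = collector.images[0]
print(first)         # every attribute of the tag
print(first["src"])  # a single attribute, looked up by name
```

The dictionary lookup on the last line plays the same role as myImgTag.attrs["src"] in BeautifulSoup.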

Starting to crawl

Traversing a single domain

Find a URL, fetch the page, pick another link from it, fetch that page in turn, and keep repeating the process.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re
random.seed(datetime.datetime.now().timestamp())  # seed with a float; Python 3.11+ rejects a datetime object

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a",href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")

while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

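The getLinks function keeps only the links that point to other articles. The loop then picks one of those links at random, fetches the new page, and repeats.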
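
The filtering step can be checked in isolation. The sketch below applies the same regular expression to a handful of hrefs; the list is a made-up sample of the kinds of links an article body contains, not real crawl output.

```python
import random
import re

# The href filter used by getLinks: starts with /wiki/ and contains no colon.
article_link = re.compile(r"^(/wiki/)((?!:).)*$")

# Hypothetical hrefs of the kind found in a Wikipedia article body.
hrefs = [
    "/wiki/Footloose_(1984_film)",
    "/wiki/Category:American_film_actors",  # namespace page, has a colon
    "/wiki/Philadelphia",
    "/wiki/Talk:Kevin_Bacon",               # namespace page, has a colon
    "#cite_note-1",                         # in-page anchor
    "//en.wikipedia.org/wiki/Main_Page",    # does not start with /wiki/
]

articles = [h for h in hrefs if article_link.match(h)]
print(articles)

# random.choice is equivalent to indexing with random.randint(0, len - 1).
print(random.choice(articles))
```

The negative lookahead (?!:) is what excludes namespace pages such as Category: and Talk:, leaving only ordinary articles.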

Crawling an entire site

De-duplicate URLs and print (or collect) the information you need

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()  # every /wiki/ link seen so far

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing some attributes, but that is nothing to worry about.")

    # This loop belongs inside the function: it reads bsObj and recurses.
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have found a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
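
The crawler above calls itself once for every new page, so on a large site it can run into Python's recursion limit (typically 1000 frames). A possible alternative, sketched below, keeps the same set-based de-duplication but replaces recursion with a queue. The SITE dictionary is a hypothetical in-memory stand-in for real pages, and crawl and get_links are illustrative names.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
SITE = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/C"],
    "/wiki/C": ["/wiki/A", "/wiki/D"],
    "/wiki/D": [],
}


def crawl(start, get_links):
    """Visit every page reachable from start exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:  # the de-duplication step
                seen.add(link)
                queue.append(link)
    return order


print(crawl("/wiki/A", lambda page: SITE.get(page, [])))
```

In a real crawler, get_links would fetch the page and return its /wiki/ hrefs, as the function above does with BeautifulSoup.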

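Crawling with Scrapy

Scrapy is a Python library that greatly reduces the work of finding and identifying links on web pages, making it easy to crawl one or more domains.
Scrapy 1.3.0 already supports Python 3.3+.

Official documentation: Scrapy 1.3 documentation
Install: pip3 install scrapy

Creating a Scrapy project

scrapy startproject wikiSpider

Running the command above generates the following directory tree: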

- wikiSpider
	* scrapy.cfg 
	- wikiSpider 
		- __pycache__
		- spiders
		* __init__.py
		* items.py
		* middlewares.py
		* pipelines.py
		* settings.py

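Creating a Scrapy spider

1. Add a class to items.py

import scrapy

class Article(scrapy.Item):
    title = scrapy.Field()

Each Scrapy Item object represents one page of the site. You can define whatever fields you need (url, content, header image and so on), but for now this example collects only the title field of each page.

2. Add an articleSpider.py file to the wikiSpider/wikiSpider/spiders/ folder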

from scrapy.selector import Selector
from scrapy import Spider
from wikiSpider.items import Article

class ArticleSpider(Spider):
    name="article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page","http://en.wikipedia.org/wiki/Python_%28programming_language%29"]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: "+title)
        item['title'] = title
        return item

Running scrapy crawl article prints a large amount of log output, along with:

Title is: Main Page
Title is: Python (programming language)

Parsing JSON

import json
jsonString = '{"arrayOfNums":[{"number":0},{"number":1},{"number":2}],' \
             '"arrayOfFruits":[{"fruit":"apple"},{"fruit":"banana"},{"fruit":"pear"}]}'
jsonObj = json.loads(jsonString)
print(jsonObj.get("arrayOfNums")) 
print(jsonObj.get("arrayOfNums")[1]) 
print(jsonObj.get("arrayOfNums")[1].get("number")+jsonObj.get("arrayOfNums")[2].get("number")) 
print(jsonObj.get("arrayOfFruits")[2].get("fruit"))

Storing data

Saving files locally with urlretrieve

import os
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"

def getAbsoluteURL(baseUrl, source):
    if source.startswith("http://www."):
        url = "http://"+source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        # Strip the leading "www." and add the scheme in one step
        url = "http://"+source[4:]
    else:
        url = baseUrl+"/"+source
    if baseUrl not in url:
        return None
    return url

def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory+path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
downloadList = bsObj.findAll(src=True)

for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None:
        print(fileUrl)
        # Download inside the loop so every internal file is saved, not only the last one
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))

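The script selects every tag on the home page that has a src attribute, then cleans and normalizes each URL into an absolute path (dropping external links along the way). Finally, each file is downloaded into a downloaded folder inside the directory the program runs from.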
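
The standard library can do much of this URL clean-up. The sketch below is an alternative to the hand-written getAbsoluteURL, not a drop-in replacement: it resolves a src value with urllib.parse.urljoin and rejects external hosts by comparing host names. The names to_internal_url and _host are hypothetical. Unlike the original, it does not treat a bare "www.example.com/x" string as a host, because urljoin reads that as a relative path.

```python
from urllib.parse import urljoin, urlparse


def _host(url):
    """Host name of url, lower-cased and without a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def to_internal_url(page_url, source):
    """Resolve source against page_url; return None for external hosts."""
    absolute = urljoin(page_url, source)
    return absolute if _host(absolute) == _host(page_url) else None


page = "http://pythonscraping.com/pages/page3.html"
for src in ["../img/gifts/img1.jpg", "/misc/jquery.js",
            "http://www.pythonscraping.com/logo.png",
            "//cdn.example.com/lib.js"]:
    print(src, "->", to_internal_url(page, src))
```

Resolving against the URL of the page that contained the link, rather than a fixed base URL, is what lets relative paths such as ../img/gifts/img1.jpg come out right.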

Saving an HTML table from a web page as CSV

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Comparison_of_text_editors")
bsObj = BeautifulSoup(html, "html.parser")
# The main comparison table is the first table on the page
table = bsObj.findAll("table",{"class":"wikitable"})[0]
rows = table.findAll("tr")

csvFile = open("../files/editors.csv", 'wt', newline='', encoding='utf-8')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()
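
The writing half of this script can be tried on its own. The sketch below is a self-contained variant under two assumptions: the rows are made-up stand-ins for the cell text of each table row, and an in-memory io.StringIO replaces the real file so nothing is written to disk. It also uses a with block, which closes the file in the same way the try/finally pair does.

```python
import csv
import io

# Hypothetical rows, standing in for the cell text pulled from each <tr>.
rows = [
    ["Editor", "License"],
    ["Editor A", "GPL"],
    ["Editor B, portable", "MIT"],  # contains a comma, so it gets quoted
]

# io.StringIO stands in for open(path, "w", newline="", encoding="utf-8").
with io.StringIO(newline="") as csv_file:
    writer = csv.writer(csv_file)
    for row in rows:
        writer.writerow(row)  # one call per table row
    output = csv_file.getvalue()

print(output)
```

The csv module handles quoting for cells that contain commas, which is the main reason to prefer it over joining strings by hand.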

Advanced scraping

POST and logging in

These examples use the third-party Requests library.
Install: pip3 install requests

Usage:

import requests
params = {'firstname': 'Ryan', 'lastname': 'Mitchell'}
r = requests.post("http://pythonscraping.com/files/processing.php", data=params)
print(r.text)

Handling cookies

Tracking cookies with the requests library

import requests

params = {'username': 'Ryan', 'password': 'password'}
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", params) 
print("Cookie is set to:")
print(r.cookies.get_dict())
print("-----------")
print("Going to profile page...")
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php",cookies=r.cookies)
print(r.text)
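
What "tracking cookies" amounts to is parsing the Set-Cookie headers of one response and sending the values back in the Cookie header of the next request. The sketch below shows that round trip with the standard library's http.cookies module instead of requests, purely as an illustration; the header values are hypothetical examples of what a login response might send.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie values, as a login response might send them.
set_cookie_headers = [
    "loggedin=1; Path=/",
    "username=Ryan; Path=/",
]

jar = SimpleCookie()
for header in set_cookie_headers:
    jar.load(header)  # parse the name, value and attributes

stored = {name: morsel.value for name, morsel in jar.items()}
print(stored)

# The Cookie request header that would accompany the next request.
cookie_header = "; ".join(f"{name}={value}" for name, value in stored.items())
print(cookie_header)
```

Passing cookies=r.cookies in the requests example performs this step for you.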

Handling sessions

Tracking a session with the requests library

import requests
session = requests.Session()
params = {'username': 'username', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", params) 
print("Cookie is set to:")
print(s.cookies.get_dict())
print("-----------")
print("Going to profile page...")
s = session.get("http://pythonscraping.com/pages/cookies/profile.php") 
print(s.text)

To make requests look more like they come from a browser, modify the request headers

import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome",
           "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}
url = "https://www.whatismybrowser.com/developers/what-http-headers-is-my-browser-sending" 
req = session.get(url, headers=headers)
bsObj = BeautifulSoup(req.text, "html.parser")
print(bsObj.find("table",{"class":"table-striped"}).get_text())
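
Custom headers can also be attached with the standard library alone. The sketch below builds a urllib.request.Request carrying a browser-like User-Agent and inspects it without sending anything, so it runs offline; the target URL is simply reused from earlier in this tutorial.

```python
from urllib.request import Request

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) "
                  "AppleWebKit 537.36 (KHTML, like Gecko) Chrome",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

# Building the Request does not send anything; urlopen(request) would.
request = Request("http://pythonscraping.com/pages/page1.html", headers=headers)

# urllib stores header names via str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
print(request.header_items())
```

Without an explicit User-Agent, urllib identifies itself as Python-urllib, which some sites reject.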
