Python Web Scraping Tutorial: Getting Started

Python 3 web scraping

Building a scraper

BeautifulSoup

Making a network connection

from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

BeautifulSoup

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

  1. Install pip
  2. Install BeautifulSoup

    $ pip3 install beautifulsoup4
  3. Usage

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
    bsObj = BeautifulSoup(html.read(), "html.parser")
    print(bsObj.h1)
    • Without an explicit parser argument this call produces a warning; the correct usage is BeautifulSoup([your markup], "html.parser"), as shown above.

Handling HTTP Exceptions

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # return None, break out, or try another approach
else:
    # the program continues. Note: if you already returned or broke out in the
    # exception handler above, you don't need this else clause and this code
    # will never run
    pass

Handling BeautifulSoup Exceptions

try:
    badContent = bsObj.nonExistingTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent == None:
        print("Tag was not found")
    else:
        print(badContent)

Parsing Complex HTML

Using BeautifulSoup to extract tags with a specific CSS attribute

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")
nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())

Navigating the HTML tag tree with BeautifulSoup

  1. Get child tags with .children

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html, "html.parser")
    for child in bsObj.find("table", {"id": "giftList"}).children:
        print(child)
  2. Get the siblings that follow a tag with .next_siblings

    • Returns the tag's siblings, excluding the tag itself, and only the siblings that come after it.

      html = urlopen("http://www.pythonscraping.com/pages/page3.html")
      bsObj = BeautifulSoup(html, "html.parser")
      for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
          print(sibling)
    • There are also .previous_siblings for the siblings before a tag, plus .previous_sibling and .next_sibling for the single sibling immediately before or after it.

  3. Get parent tags

    • .parent and .parents (see the sketch after this list)
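
As a quick illustration of these sibling and parent accessors, here is a minimal sketch that walks from a product image up to its parent cell and back to the sibling cell before it; it assumes the gift table on page3.html keeps the price cell immediately before the image cell.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

# .parent climbs from the <img> to its enclosing <td>,
# .previous_sibling then steps back to the price cell (assumed table layout)
price = bsObj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text()
print(price)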

Regular Expressions and BeautifulSoup

Find the images whose relative path starts with ../img/gifts/img and ends with .jpg

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])

Getting Attributes

Often you don't need a tag's content at all, just its attributes.
myTag.attrs returns all of a tag object's attributes as a dictionary; for example, get the src attribute with myImgTag.attrs["src"].
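
A minimal sketch of reading attributes (myImgTag above is just a placeholder name), reusing the first image on page3.html:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

imgTag = bsObj.find("img")
print(imgTag.attrs)         # the full attribute dictionary of the tag
print(imgTag.attrs["src"])  # a single attribute, here the image's src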

Starting to Crawl

Traversing a Single Domain

Find a URL, fetch the page, find another link in its content, fetch that page, and keep repeating the process.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# seed the random generator with the current time
# (newer Python versions require a numeric, string, or bytes seed)
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")

while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

The getLinks function filters out the links that point to other articles; we then pick one of those links at random, fetch the new page, and repeat the cycle.

Crawling an Entire Site

Deduplicate the URLs and print (collect) the information you need.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing some attributes! No worries though!")

    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # we have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")

Crawling with Scrapy

Scrapy is a Python library that dramatically reduces the complexity of finding and identifying links on web pages, making it easy to crawl one or more domains.
Scrapy 1.3.0 already supports Python 3.3+.

Official docs: Scrapy 1.3 documentation
Install: pip3 install scrapy

Creating a Scrapy Project

scrapy startproject wikiSpider

Running the command above generates the following directory structure:

- wikiSpider
  * scrapy.cfg
  - wikiSpider
    - __pycache__
    - spiders
    * __init__.py
    * items.py
    * middlewares.py
    * pipelines.py
    * settings.py

Creating a Scrapy Spider

1. Add a class to items.py

import scrapy

class Article(scrapy.Item):
    title = scrapy.Field()

Each Scrapy Item object represents a single page on the website. You can of course define whatever fields you need (for example url, content, header image), but for now I'm only demonstrating collection of each page's title field.
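
For instance, a sketch of an Item with a couple of extra fields; the url and content field names are only illustrative, not something Scrapy requires:

import scrapy

class Article(scrapy.Item):
    # each scrapy.Field() declares one piece of data to collect per page
    title = scrapy.Field()
    url = scrapy.Field()
    content = scrapy.Field()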

2. Add an articleSpider.py file to the wikiSpider/wikiSpider/spiders/ folder

from scrapy.selector import Selector
from scrapy import Spider
from wikiSpider.items import Article

class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page",
                  "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: "+title)
        item['title'] = title
        return item

Running scrapy crawl article produces a big pile of log output, plus:

Title is: Main Page
Title is: Python (programming language)
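
Scrapy's built-in feed exporters can also write the collected items straight to a file rather than just printing them; something like the following should produce a CSV (the output filename is arbitrary):

scrapy crawl article -o articles.csv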

Parsing JSON

import json

jsonString = '{"arrayOfNums":[{"number":0},{"number":1},{"number":2}],' \
             '"arrayOfFruits":[{"fruit":"apple"},{"fruit":"banana"},{"fruit":"pear"}]}'
jsonObj = json.loads(jsonString)
print(jsonObj.get("arrayOfNums"))
print(jsonObj.get("arrayOfNums")[1])
print(jsonObj.get("arrayOfNums")[1].get("number") + jsonObj.get("arrayOfNums")[2].get("number"))
print(jsonObj.get("arrayOfFruits")[2].get("fruit"))
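
For reference, the four print calls above should produce output along these lines, since json.loads turns the string into ordinary Python lists and dicts:

[{'number': 0}, {'number': 1}, {'number': 2}]
{'number': 1}
3
pear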

Storing Data

Saving files locally with urlretrieve

import os
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"

# normalize a src value into an absolute URL on baseUrl, or None for external links
def getAbsoluteURL(baseUrl, source):
    if source.startswith("http://www."):
        url = "http://"+source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://"+source[4:]
    else:
        url = baseUrl+"/"+source
    if baseUrl not in url:
        return None
    return url

# map an absolute URL to a local path under downloadDirectory, creating directories as needed
def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory+path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
downloadList = bsObj.findAll(src=True)

for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None:
        print(fileUrl)
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))

This grabs every tag on the home page that has a src attribute, then cleans and normalizes each URL into an absolute path (and discards external links). Finally, each file is downloaded into a downloaded folder inside the directory the program runs from.

Saving an HTML table from a web page as CSV

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Comparison_of_text_editors")
bsObj = BeautifulSoup(html, "html.parser")
# the main comparison table is the first table on the page
table = bsObj.findAll("table", {"class": "wikitable"})[0]
rows = table.findAll("tr")

csvFile = open("../files/editors.csv", 'wt', newline='', encoding='utf-8')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()

Advanced Scraping

POST and Logins

Use the third-party Requests library.
Install: pip3 install requests

Usage:

import requests

params = {'firstname': 'Ryan', 'lastname': 'Mitchell'}
r = requests.post("http://pythonscraping.com/files/processing.php", data=params)
print(r.text)

Handling Cookies

Tracking cookies with the requests library

import requests

params = {'username': 'Ryan', 'password': 'password'}
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:")
print(r.cookies.get_dict())
print("-----------")
print("Going to profile page...")
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies=r.cookies)
print(r.text)

Handling Sessions

Tracking a session with the requests library

import requests

session = requests.Session()
params = {'username': 'username', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print("-----------")
print("Going to profile page...")
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)

Modifying Headers

To make requests look more like they came from a real browser, change the request headers.

import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}
url = "https://www.whatismybrowser.com/developers/what-http-headers-is-my-browser-sending"
req = session.get(url, headers=headers)
bsObj = BeautifulSoup(req.text, "html.parser")
print(bsObj.find("table", {"class": "table-striped"}).get_text())