Python: scraping web page titles with BeautifulSoup4

woff · Posted on 2020-10-8 15:41:05


First, install the BeautifulSoup 4 module from a terminal. With pip:

pip install beautifulsoup4

With Anaconda:

conda install beautifulsoup4

This example uses the PTT Gossiping board.
Full source code:

import urllib.request as req

import bs4  # conda install beautifulsoup4


def getData(url):
    # Send browser-like headers; the over18 cookie skips the age-confirmation page
    request = req.Request(url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
        "cookie": "over18=1",
    })
    with req.urlopen(request) as response:
        data = response.read().decode("utf-8")  # decode the response as UTF-8
    # print(data)
    # Parse the HTML source and extract the article titles
    root = bs4.BeautifulSoup(data, "html.parser")  # parse as HTML
    titles = root.find_all("div", class_="title")  # every <div class="title">
    # print(titles)
    for title in titles:
        if title.a is not None:  # print only titles that still contain an <a> tag (not deleted)
            print(title.a.string)
    nextLink = root.find("a", string="‹ 上頁")  # the <a> tag whose text is "‹ 上頁" (previous page)
    # print(nextLink["href"])
    return nextLink["href"]


# Main program: print the titles from 5 pages
pageURL = "https://www.ptt.cc/bbs/Gossiping/index.html"  # starting URL
count = 0
while count < 5:  # fetch 5 pages
    pageURL = "https://www.ptt.cc" + getData(pageURL)
    count += 1
    # print(pageURL)

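The parsing half of getData can be tried without touching the network. The sketch below runs the same find_all and find calls against a small hand-written HTML string. The markup and the helper name parse_page are made up for illustration and only approximate what a PTT index page looks like.

```python
import bs4

# Hypothetical sample markup, loosely modelled on a PTT index page
SAMPLE_HTML = """
<div class="btn-group">
  <a class="btn" href="/bbs/Gossiping/index38999.html">‹ 上頁</a>
</div>
<div class="title"><a href="/bbs/Gossiping/M.1.html">First post</a></div>
<div class="title">(this post has been deleted)</div>
<div class="title"><a href="/bbs/Gossiping/M.2.html">Second post</a></div>
"""


def parse_page(html):
    """Return (list of titles, href of the previous-page link) for one page."""
    root = bs4.BeautifulSoup(html, "html.parser")
    titles = [
        div.a.string
        for div in root.find_all("div", class_="title")
        if div.a is not None  # deleted posts have no <a> tag
    ]
    prev_link = root.find("a", string="‹ 上頁")
    return titles, prev_link["href"]


titles, prev_href = parse_page(SAMPLE_HTML)
print(titles)
print("https://www.ptt.cc" + prev_href)
```

Returning the titles instead of printing them inside the function is a design choice of this sketch, not of the original script; it makes the result easier to check or save.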

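Explanation:
The following snippet makes the request look like it comes from a user's web browser. "cookie":"over18=1" records that the "I am over 18" button has already been clicked.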

request = req.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
    "cookie": "over18=1",
})
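To check what the Request object will actually send, the headers can be inspected before any connection is made. This is a minimal sketch that sends nothing; the User-Agent value is shortened here for readability.

```python
import urllib.request as req

# URL taken from the post; this sketch never opens it
url = "https://www.ptt.cc/bbs/Gossiping/index.html"

request = req.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # shortened for the example
    "cookie": "over18=1",
})

# Request stores header names capitalized: "User-agent" and "Cookie"
print(request.header_items())
print(request.get_header("Cookie"))
```

Note that urllib normalizes the header names, so the lowercase "cookie" key in the dictionary is looked up as "Cookie" afterwards.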


You can see this cookie in Chrome by pressing F12 (or opening Developer Tools) and selecting the Application tab.
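If a site needs more than one cookie, the name/value pairs shown in the Application tab can be joined into a single header value. The sketch below assumes the pairs are copied by hand into a dictionary; cookie_header is a hypothetical helper, not part of urllib.

```python
# Name/value pairs as read from the Application tab (entered by hand)
cookies = {"over18": "1"}


def cookie_header(pairs):
    """Join cookie name/value pairs into one Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in pairs.items())


print(cookie_header(cookies))
```

The resulting string is what goes into the "cookie" entry of the headers dictionary.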
