Python Web Crawler Internship Report (Python网络爬虫实习报告文档格式.docx)

Uploaded by: b****1; document ID: 4908289; upload date: 2023-05-04; format: DOCX; pages: 8; size: 22.09 KB

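The Python-goose framework can extract the following information:

1> the main body of the article;
2> the main image of the article;
3> any YouTube/Vimeo videos embedded in the article;
4> the meta description;
5> the meta tags.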

1 Analyzing the web page

# Fetch the HTML source
def __getHtml():
    pageSize = 0
    try:
        while (pageSize <= 125):
            # headers = {'User-Agent': 'Mozilla/... (Windows NT ...) AppleWebKit/... (KHTML, like Gecko) Chrome/... Safari/...'
            # 'Referer': None  # Note: if the page still cannot be fetched, the host of the target site can be set here
            # }
            # opener = ...
            # ... = [headers]
            url = "..." + str(pageSize) + "&filter=" + str(pageNum)
            # data['html%s' % i] = ...("utf-8")
            data.append(...("utf-8"))
            pageSize += 25
            pageNum += 1
            print(pageSize, pageNum)
    except Exception as e:
        raise e
    return data
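The paging loop in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's original code: the base URL is a hypothetical stand-in (the real one did not survive in the text), the starting value of the `filter` counter is assumed to be 0, and the helper names `build_page_urls`, `http_get` and `fetch_pages` are invented for the example. It uses only the standard library.

```python
import urllib.request
from typing import Callable, List

# Hypothetical endpoint; the base URL used in the report is not recoverable.
BASE_URL = "https://example.com/top250?start="


def build_page_urls(base_url: str = BASE_URL, step: int = 25,
                    last_offset: int = 125, first_filter: int = 0) -> List[str]:
    """Return one URL per result page: offsets 0, 25, ..., last_offset."""
    urls = []
    page_num = first_filter  # assumption: the initial value is not shown in the report
    for offset in range(0, last_offset + 1, step):
        urls.append(base_url + str(offset) + "&filter=" + str(page_num))
        page_num += 1
    return urls


def http_get(url: str, timeout: float = 10.0) -> str:
    """Fetch one page with a browser-like User-Agent and decode it as UTF-8."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def fetch_pages(urls: List[str],
                fetch: Callable[[str], str] = http_get) -> List[str]:
    """Download every URL in order and return the HTML documents as a list."""
    return [fetch(url) for url in urls]
```

Passing the download function as a parameter keeps the URL logic checkable without network access; a stub such as `lambda url: "<html></html>"` can stand in for `http_get`.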

2 Scraping the data

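def __getData(html):
    title = []  # movie titles
    # rating_num = []  # ratings
    range_num = []  # rankings
    # rating_people_num = []  # number of raters
    movie_author = []  # directors
    data = {}
    # Parse the HTML with bs4
    soup = BeautifulSoup(html, "html.parser")
    for li in soup.find("ol", attrs={'class': 'grid_view'}).find_all("li"):
        title.append(li.find("span", class_="title").text)
        # rating_num.append(li.find("div", class_='star').find(..., class_='rating_num')...)
        range_num.append(li.find(..., class_='pic').find("em")...)
        # spans = ...find_all(...)
        # for x in range(len(spans)):
        #     if x <= 2:
        #         pass
        #     else:
        #         ...(spans[x].string[-len(spans[x].string):-3])
        str = li.find(..., class_='bd').find("p")...()
        index = str.find("主")  # "主" is the first character of "主演" (starring)
        if (index == -1):
            index = str.find("...")
        print(...)
        if (... == 210):
            index = 60
        # print("aaa")
        # print(str[4:index])
        movie_author.append(str[4:index])
    data['title'] = title
    # data['rating_num'] = rating_num
    data['range_num'] = range_num
    # data['rating_people_num'] = rating_people_num
    data['movie_author'] = movie_author
    return data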
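The director field in this step is cut out of a longer info line by slicing from position 4 up to the first "主" (the start of "主演", starring), falling back to a literal "..." when the cast is truncated. A small sketch of that slicing, assuming the line has the shape `导演: <name> 主演: <cast>`; the function name `extract_director` and the final length cap are assumptions for the example, not part of the report:

```python
def extract_director(info: str, max_len: int = 60) -> str:
    """Return the director part of a line such as '导演: A   主演: B'."""
    text = info.lstrip()
    end = text.find("主")       # first character of "主演" (starring)
    if end == -1:
        end = text.find("...")  # the cast list may be cut off with an ellipsis
    if end == -1:
        # Assumption: with no marker at all, keep at most max_len characters.
        end = min(len(text), max_len)
    # The first 4 characters are the "导演: " label.
    return text[4:end].strip()


if __name__ == "__main__":
    print(extract_director("  导演: Frank Darabont   主演: Tim Robbins"))
```

The fixed offset 4 only holds while the label is exactly `导演: `; a label with different spacing would need a search for the colon instead.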

3 Organizing and converting the data

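def getMovies(datas):
    f.write("<html>")
    f.write("<head>")
    f.write("<meta charset='UTF-8'>")
    f.write("<title>Insert title here</title>")
    f.write("</head>")
    f.write("<body>")
    f.write("<h1>Scraping Douban Movies</h1>")
    f.write("<h4>Author: 刘文斌</h4>")
    f.write("<h4>Time: " + nowtime + "</h4>")
    f.write("<hr>")
    f.write("<table width='800px' border='1' align=center>")
    f.write("<thead>")
    f.write("<tr>")
    f.write("<th><font size='5' color=green>Movie</font></th>")
    # f.write("<th width='50px'><font size='5' color=green>Rating</font></th>")
    f.write("<th width='50px'><font size='5' color=green>Rank</font></th>")
    # f.write("<th width='100px'><font size='5' color=green>Number of raters</font></th>")
    f.write("<th><font size='5' color=green>Director</font></th>")
    f.write("</tr>")
    f.write("</thead>")
    f.write("<tbody>")
    for data in datas:
        for i in range(0, 25):
            f.write("<tr>")
            f.write("<td style='color:orange;text-align:center'>%s</td>" % data['title'][i])
            # f.write("<td style='color:blue;text-align:center'>%s</td>" % data['rating_num'][i])
            f.write("<td style='color:red;text-align:center'>%s</td>" % data['range_num'][i])
            # f.write("<td style='color:blue;text-align:center'>%s</td>" % data['rating_people_num'][i])
            f.write("<td style='color:black;text-align:center'>%s</td>" % data['movie_author'][i])
            f.write("</tr>")
    f.write("</tbody>")
    f.write("</table>")
    f.write("</body>")
    f.write("</html>")
    f.close()

if __name__ == '__main__':
    datas = []
    htmls = __getHtml()
    for i in range(len(htmls)):
        data = __getData(htmls[i])
        datas.append(data)
    getMovies(datas)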
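The conversion step turns the per-page dictionaries (keys `title`, `range_num`, `movie_author`) into table rows. The sketch below shows one way to do that safely; it is an illustration rather than the report's code, and the name `render_movie_table` and the simplified markup are assumptions. It escapes each value with the standard-library `html.escape` and iterates with `zip`, so a page with fewer than 25 entries does not raise `IndexError`.

```python
import html
from typing import Dict, List


def render_movie_table(pages: List[Dict[str, List[str]]]) -> str:
    """Build an HTML table with one row per movie from the per-page dicts."""
    rows = []
    for page in pages:
        # zip() stops at the shortest list, so short pages are handled safely.
        for title, rank, director in zip(page["title"], page["range_num"],
                                         page["movie_author"]):
            rows.append(
                "<tr><td>%s</td><td>%s</td><td>%s</td></tr>"
                % (html.escape(str(title)),
                   html.escape(str(rank)),
                   html.escape(str(director)))
            )
    header = "<tr><th>Movie</th><th>Rank</th><th>Director</th></tr>"
    return ("<table border='1'><thead>%s</thead><tbody>%s</tbody></table>"
            % (header, "".join(rows)))
```

Returning a string instead of writing directly to a file lets the caller decide where the report goes, for example `open(path, "w", encoding="utf-8").write(...)`.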

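4 Saving and displaying the data

The result is shown in the figure that follows:

5 Technical difficulties and key points

Data scraping in practice (scraping housing data from 搜房网)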

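from bs4 import BeautifulSoup
import requests

rep = requests.get('...')
rep.encoding = "gb2312"  # set the encoding
html = rep.text
soup = BeautifulSoup(html, 'html.parser')
f = open('...', 'w', encoding='utf-8')
f.write("<html>")
f.write("<head>")
f.write("<meta charset='UTF-8'>")
f.write("<title>Insert title here</title>")
f.write("</head>")
f.write("<body>")
f.write("<center><h1>Top 3 New-Home Sales</h1></center>")
f.write("<table border='1px' width='1000px' height='...' align=center><tr>")
f.write("<th><h2>Address</h2></th>")
f.write("<th><h2>Sales volume</h2></th>")
f.write("<th><h2>Average price</h2></th></tr>")
for li in soup.find("ul", class_="ul02").find_all("li"):
    name = li.find(..., class_="pbtext").find(...).text
    chengjiaoliang = li.find("span", class_="red-f3").text
    try:
        ...(..., class_="gray-9")  # ...('?平方米' ...
        junjia = ...(..., class_="ohter")...
    except Exception as e:
        ...
    f.write("<tr><td align=center><font size='5px' color=red>%s</font></td>" % name)
    f.write("<td align=center><font size='5px' color=blue>%s</font></td>" % chengjiaoliang)
    f.write("<td align=center><font size='5px' color=green>%s</font></td></tr>" % junjia)
    print(name)
f.write("</table>")
f.write("</body>")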
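This example reads a page served as GB2312 and writes its output file as UTF-8. The encoding hand-off can be sketched in isolation as below. It is a minimal illustration with the standard library only; the function name `transcode_to_utf8` and the use of `errors="replace"` are assumptions for the example, not details taken from the report.

```python
def transcode_to_utf8(raw: bytes, out_path: str,
                      source_encoding: str = "gb2312") -> str:
    """Decode bytes from the page's encoding and save the text as UTF-8."""
    # errors="replace" keeps going if the page contains bytes outside GB2312,
    # substituting U+FFFD instead of raising UnicodeDecodeError.
    text = raw.decode(source_encoding, errors="replace")
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text
```

Decoding with the wrong source encoding is the usual cause of mojibake in scraped Chinese pages, which is why the encoding is set explicitly before the text is parsed.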

VI. Summary

Instructor's comments:
