Python Web Crawler Internship Report

Uploaded by: b****3; Document ID: 6932645; Uploaded: 2023-05-07; Format: DOCX; Pages: 7; Size: 16.42 KB


newspaper is a Python crawler framework for extracting news stories and articles and for analyzing their content.

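Python-goose framework:

Python-goose can extract the following information:

1> the main body of the article;

2> the main image of the article;

3> any YouTube/Vimeo videos embedded in the article;

4> the meta description;

5> the meta tags.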
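To make items 4> and 5> concrete, the following is a minimal sketch of pulling the title, meta description, and meta keywords out of a page. It deliberately uses only the standard library's `html.parser` and is not Python-goose's API; the names `MetaExtractor`, `extract_meta`, and `SAMPLE_HTML` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects <meta name=... content=...> pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "meta" and name:
            # Store meta tags by lower-cased name, e.g. "description".
            self.meta[name.lower()] = attrs.get("content") or ""
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_meta(html):
    """Return the title, meta description and meta keywords of a page."""
    parser = MetaExtractor()
    parser.feed(html)
    keywords = parser.meta.get("keywords", "")
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
    }


# Hypothetical sample page used only for this illustration.
SAMPLE_HTML = """<html><head><title>Sample article</title>
<meta name="description" content="A short summary.">
<meta name="keywords" content="python, crawler">
</head><body><p>Body text.</p></body></html>"""

if __name__ == "__main__":
    print(extract_meta(SAMPLE_HTML))
```

A dedicated extraction framework does considerably more than this (body detection, image ranking, video discovery); the sketch covers only the metadata part.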

V. Data Crawling in Practice (Scraping Movie Data from Douban)

1 Analyzing the page

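# Fetch the HTML source of each page
def __getHtml():
    data = []
    pageNum = 1
    pageSize = 0
    try:
        while (pageSize <= 125):
            # (the loop body and the rest of this function are not included in the source)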
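The loop above stops at an offset of 125, which suggests the listing is fetched page by page. A minimal sketch of that pagination follows. The base URL, the `start` query parameter, and the step of 25 items per page are assumptions (the report does not state them), and `build_page_urls` and `fetch_pages` are hypothetical helper names.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"  # assumed; not stated in the report
PAGE_STEP = 25     # assumed number of items per page
LAST_OFFSET = 125  # upper bound taken from the loop condition in the report


def build_page_urls(base_url=BASE_URL, step=PAGE_STEP, last_offset=LAST_OFFSET):
    """Return one URL per page, for offsets 0, step, 2*step, ... <= last_offset."""
    urls = []
    offset = 0
    while offset <= last_offset:
        urls.append(f"{base_url}?{urlencode({'start': offset})}")
        offset += step
    return urls


def fetch_pages(urls, fetch):
    """Download each URL with the supplied `fetch` callable, skipping failures."""
    pages = []
    for url in urls:
        try:
            pages.append(fetch(url))
        except Exception as exc:  # keep going if a single page fails
            print(f"failed to fetch {url}: {exc}")
    return pages
```

Passing the downloader in as `fetch` keeps the sketch testable offline; in a real run it could be something like `lambda url: requests.get(url, timeout=10).text`.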

2 Scraping the data

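def __getData(html):
    title = []  # movie titles
    #rating_num = []  # ratings
    range_num = []  # ranks
    #rating_people_num = []  # number of ratings
    movie_author = []  # directors
    data = {}
    # Parse the HTML with bs4
    soup = BeautifulSoup(html, "html.parser")
    for li in soup.find("ol", attrs={'class': 'grid_view'}).find_all("li"):
        title.append(li.find("span", class_="title").text)
        #rating_num.append(li.find("div", class_='star').find("span", class_='rating_num').text)
        range_num.append(li.find("div", class_='pic').find("em").text)
        #spans = li.find("div", class_='star').find_all("span")
        #for x in range(len(spans)):
        #    if x <= 2:
        #        pass
        #    else:
        #        rating_people_num.append(spans[x].string[-len(spans[x].string):-3])
        info = li.find("div", class_='bd').find("p", class_='').text.lstrip()
        # The director's name runs from character 4 up to the "主演" (cast) marker
        index = info.find("主")
        if (index == -1):
            index = info.find("...")
        print(li.find("div", class_='pic').find("em").text)
        # .text is a string, so compare against "210" rather than the integer 210
        if (li.find("div", class_='pic').find("em").text == "210"):
            index = 60
        #print("aaa")
        #print(info[4:index])
        movie_author.append(info[4:index])
    data['title'] = title
    #data['rating_num'] = rating_num
    data['range_num'] = range_num
    #data['rating_people_num'] = rating_people_num
    data['movie_author'] = movie_author
    return data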
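The extraction step can be tried offline against a small hand-written fragment. The sketch below is an alternative formulation using CSS selectors, not the report's own code. `SAMPLE_HTML` is hypothetical markup that only imitates the structure the selectors above expect, and `extract_director` and `parse_movies` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the report's selectors expect.
SAMPLE_HTML = """
<ol class="grid_view">
  <li>
    <div class="pic"><em>1</em></div>
    <div class="info">
      <span class="title">肖申克的救赎</span>
      <div class="bd"><p class="">
        导演: 弗兰克·德拉邦特 Frank Darabont   主演: 蒂姆·罗宾斯 Tim Robbins
      </p></div>
    </div>
  </li>
</ol>
"""


def extract_director(info):
    """Return the text between the 4-character '导演: ' prefix and '主演'."""
    info = info.strip()
    end = info.find("主")
    if end == -1:
        end = info.find("...")  # long credits are truncated with an ellipsis
    if end == -1:
        end = len(info)
    return info[4:end].strip()


def parse_movies(html):
    """Return one dict per movie instead of several parallel lists."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for li in soup.select("ol.grid_view > li"):
        movies.append({
            "rank": li.select_one("div.pic em").get_text(strip=True),
            "title": li.select_one("span.title").get_text(strip=True),
            "director": extract_director(li.select_one("div.bd p").get_text()),
        })
    return movies


if __name__ == "__main__":
    print(parse_movies(SAMPLE_HTML))
```

Keeping each movie in one record avoids the parallel lists drifting out of step when a field is missing for one entry.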

3 Organizing and converting the data

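# (the start of __getMovies(datas), including the line that opens the output file f, is not included in the source)
    f.write("<thead>")
    f.write("<tr>")
    f.write("<th><font size='5' color=green>Movie</font></th>")
    #f.write("<th width='50px'><font size='5' color=green>Rating</font></th>")
    f.write("<th width='50px'><font size='5' color=green>Rank</font></th>")
    #f.write("<th width='100px'><font size='5' color=green>Number of ratings</font></th>")
    f.write("<th><font size='5' color=green>Director</font></th>")
    f.write("</tr>")
    f.write("</thead>")
    f.write("<tbody>")
    for data in datas:
        # Each page holds 25 movies
        for i in range(0, 25):
            f.write("<tr>")
            f.write("<td style='color:orange;text-align:center'>%s</td>" % data['title'][i])
            #f.write("<td style='color:blue;text-align:center'>%s</td>" % data['rating_num'][i])
            f.write("<td style='color:red;text-align:center'>%s</td>" % data['range_num'][i])
            #f.write("<td style='color:blue;text-align:center'>%s</td>" % data['rating_people_num'][i])
            f.write("<td style='color:black;text-align:center'>%s</td>" % data['movie_author'][i])
            f.write("</tr>")
    f.write("</tbody>")
    f.write("</table>")
    f.write("</body>")
    f.write("</html>")
    f.close()

if __name__ == '__main__':
    datas = []
    htmls = __getHtml()
    for i in range(len(htmls)):
        data = __getData(htmls[i])
        datas.append(data)
    __getMovies(datas)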
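Writing the table one `f.write` call at a time works, but scraped text can contain characters such as `&` or `<` that break the generated page. A minimal alternative sketch follows, assuming the rows are already available as tuples; `render_table` and `save_table` are hypothetical names, and the column headers are the ones used in the report.

```python
from html import escape


def render_table(rows, headers=("Movie", "Rank", "Director")):
    """Build an HTML table as a string, escaping every cell value."""
    parts = ["<table>", "<thead><tr>"]
    parts += [f"<th>{escape(h)}</th>" for h in headers]
    parts.append("</tr></thead>")
    parts.append("<tbody>")
    for row in rows:
        cells = "".join(f"<td>{escape(str(cell))}</td>" for cell in row)
        parts.append(f"<tr>{cells}</tr>")
    parts += ["</tbody>", "</table>"]
    return "\n".join(parts)


def save_table(rows, path):
    """Write the table to `path`; the with-block closes the file even on error."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(render_table(rows))
```

Building the string first and writing it once also makes the output easy to inspect before anything touches the disk.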

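4 Saving and displaying the data

The result is shown in the figure below:

5 Technical difficulties and key points

Data Crawling in Practice (Scraping Housing Data from SouFun)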

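from bs4 import BeautifulSoup
import requests

rep = requests.get(url)  # url is a placeholder; the actual address is omitted in the source
rep.encoding = "gb2312"  # set the encoding
html = rep.text
soup = BeautifulSoup(html, 'html.parser')
f = open(output_path, 'w', encoding='utf-8')  # output_path is a placeholder; the path is omitted in the source
f.write("<html>")
f.write("<head>")
f.write("<meta charset='UTF-8'>")
f.write("<title>Insert title here</title>")
f.write("</head>")
f.write("<body>")
f.write("<center><h1>Top 3 New-Home Sales</h1></center>")
f.write("<table border='1px' width='1000px' height='800px' align=center>")
f.write("<tr>")
f.write("<th><h2>Address</h2></th>")
f.write("<th><h2>Sales volume</h2></th>")
f.write("<th><h2>Average price</h2></th>")
f.write("</tr>")
# The tag names passed to find() are not recoverable from the source,
# so the lookups below match on the class name only.
for li in soup.find("ul", class_="ul02").find_all("li"):
    try:
        name = li.find(class_="pbtext").text
        chengjiaoliang = li.find(class_="red-f3").text
        junjia = li.find(class_="ohter").find(class_="gray-9")  #.text.replace('?O', '平方米')
    except Exception as e:
        ...  # the handler body is not preserved in the source
    f.write("<tr>")
    f.write("<td align=center><font size='5px' color=red>%s</font></td>" % name)
    f.write("<td align=center><font size='5px' color=blue>%s</font></td>" % chengjiaoliang)
    # the colour of this last column is not recoverable from the source; green is assumed
    f.write("<td align=center><font size='5px' color=green>%s</font></td>" % junjia)
    f.write("</tr>")
    print(name)
f.write("</table>")
f.write("</body>")
f.write("</html>")
f.close()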
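The script sets `rep.encoding = "gb2312"` before reading `rep.text` and then writes the output file as UTF-8. The sketch below illustrates, with the standard library only, why the declared codec matters. `decode_page` is a hypothetical helper, and the sample string stands in for bytes a GB2312 server would send.

```python
def decode_page(raw_bytes, declared_encoding="gb2312"):
    """Decode response bytes, replacing undecodable bytes instead of raising."""
    return raw_bytes.decode(declared_encoding, errors="replace")


sample = "新房成交TOP3"
raw = sample.encode("gb2312")  # stand-in for the bytes received from the server

decoded_ok = decode_page(raw)                        # correct codec: text is intact
decoded_bad = raw.decode("utf-8", errors="replace")  # wrong codec: garbled text

# Once decoding was correct, re-encoding as UTF-8 for the output file is lossless.
utf8_bytes = decoded_ok.encode("utf-8")

if __name__ == "__main__":
    print(decoded_ok)
    print(decoded_bad)
```

If a page mixes in characters outside GB2312, declaring the superset `gb18030` instead is a common fallback.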

VI. Summary

Instructor's comments:

Grade:

Supervisor:
