当前位置:  编程语言>python

Python下Html/xml解析库Beautiful Soup快速入门教程

 
    发布时间:2013-8-2  


    本文导语: BeautifulSoup是一个用来解析HTML/XML文档的库。Beautiful Soup在Windows 7上的简易安装方法1.使用git下载最新的代码后进入代码顶级目录。2.运行:python setup.py install即可安装完成。简易入门教程假设解析的HTML文档为:html_doc = """ <html><...

BeautifulSoup是一个用来解析HTML/XML文档的库。

Beautiful Soup在Windows 7上的简易安装方法

1.使用git下载最新的代码后进入代码顶级目录

2.运行:

python setup.py install

即可安装完成。

简易入门教程

假设解析的HTML文档为:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
""

输出解析后的文档:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link2">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

其它一些常用用法:

soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u'The Dormouse's story'
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# u'title'
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

提取文档中所有的URL:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

提取所有的文本:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...


  • 本站(WWW.169IT.COM)旨在分享和传播互联网科技相关的资讯和技术,将尽最大努力为读者提供更好的信息聚合和浏览方式。
    本站(WWW.169IT.COM)站内文章除注明原创外,均为转载,整理或搜集自网络.欢迎任何形式的转载,转载请注明出处.
    转载请注明:文章转载自:[169IT-IT技术资讯]
    本文标题:Python下Html/xml解析库Beautiful Soup快速入门教程
相关文章推荐:


站内导航:


特别声明:169IT网站部分信息来自互联网,如果侵犯您的权利,请及时告知,本站将立即删除!

©2012-2019,169IT.COM,E-mail:www_169it_com#163.com(请将#改为@)

浙ICP备11055608号