  • 2008-07-01

    Trying out 代码发芽网 (Fayaa): code highlighting on a blog without any plugin

    Copyright notice: when reposting, please link to the article's original source and include the author information and this notice.
    http://fayaa.blogbus.com/logs/23891688.html

    Python code: a Python script that scrapes the first 100 pages of 糗事百科 (qiushibaike.com)
    #coding=utf-8
    # Requires BeautifulSoup: http://crummy.com/software/BeautifulSoup
    # Python 2 script (uses urllib2 and the print statement).

    import urllib
    import urllib2
    from xml.sax.saxutils import unescape
    from BeautifulSoup import BeautifulSoup          # For processing HTML

    def formalize(text):
        # Strip every line, drop empty ones, and put a blank line after each kept line.
        result = ''
        lines = text.split(u'\n')
        for line in lines:
            line = line.strip()
            if len(line) == 0:
                continue
            result += line + u'\n\n'
        return result

    outfile = open("qiushi.txt", "w")
    count = 0
    for i in range(1, 101):
        url = "http://qiushibaike.com/qiushi/best/all/page/%d" % i
        data = urllib2.urlopen(url).readlines()
        soup = BeautifulSoup("".join(data))
        # Each story sits in a <div class="content"> element.
        contents = soup.findAll('div', "content")
        stories = [str(text) for text in contents]
        for story in stories:
            count += 1
            print "processing page %d, %d items added" % (i, count)
            minisoup = BeautifulSoup(story)
            # Keep only the text nodes, discarding all markup.
            text = ''.join([e for e in minisoup.recursiveChildGenerator() if isinstance(e, unicode)])
            # Decode HTML entities (including &quot;) and percent-escapes.
            text = urllib.unquote(unescape(text, {'&quot;': '"'}))
            text = formalize(text).encode("utf-8")
            # Write a numbered separator line, then the story itself.
            print >> outfile, '-' * 20 + " %05d " % count + '-' * 20 + "\n"
            print >> outfile, text + "\r\n"
    outfile.close()
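
The script above needs Python 2 and the old BeautifulSoup 3 package. For anyone who wants to try the same text-cleaning idea on Python 3 without installing anything, here is a rough offline sketch using only the standard library. It is not the original method: `ContentExtractor`, `extract_stories`, and the `SAMPLE` markup are hypothetical names and data made up for illustration, the network fetch is left out entirely, and matching `content` as one of several class names is an assumption about the page markup.

```python
# Python 3 sketch of the extraction and cleanup steps (standard library only).
from html.parser import HTMLParser
from urllib.parse import unquote


def formalize(text):
    """Strip each line, drop empty ones, and follow each kept line with a blank line."""
    lines = (line.strip() for line in text.split("\n"))
    return "".join(line + "\n\n" for line in lines if line)


class ContentExtractor(HTMLParser):
    """Hypothetical helper: collect the text inside every <div class="content">."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &quot; before handle_data.
        super().__init__(convert_charrefs=True)
        self.stories = []   # one raw string per matching div
        self._depth = 0     # div nesting depth inside a story; 0 means outside
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1            # nested div inside a story
        elif "content" in (dict(attrs).get("class") or "").split():
            self._depth = 1             # entering a story
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:        # leaving the story
                self.stories.append("".join(self._parts))

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def extract_stories(html):
    """Return the cleaned text of every story found in an HTML string."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return [formalize(unquote(story)) for story in parser.stories]


# Made-up sample markup, not taken from the real site.
SAMPLE = """
<div class="content">
  First line &quot;quoted&quot;<br/>
  second%20line
</div>
<div class="sidebar">ignored</div>
"""

stories = extract_stories(SAMPLE)
```

Each entry in `stories` is the cleaned text of one story, ready to be written to a file with a numbered separator the way the original script does.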

    Comments

  • Heh, the support is practically perfect!
    Not even the font or the background color changed!
