Everyone probably has a pile of HTML documents on hand, covering all sorts of topics. But the content you actually care about is buried among advertisements, layout tables, formatting markup, and countless links. Worse still, you want the text that comes from menus, headers, and footers filtered out. If you don't want to write a complicated extraction program for every type of HTML file, I have a solution.
This article shows how to write a simple script that pulls the body text out of a mass of HTML, without knowing anything about the file's structure or the tags it uses. It works on any news article or blog page that contains body text. The method runs in five steps:
- 1. Parse the HTML code and keep track of the number of bytes processed.
- 2. Store the parser's text output on a per-line (or per-paragraph) basis.
- 3. Associate with each line of text the number of bytes of HTML it took to produce it.
- 4. Compute the text density of each line as the ratio of text length to byte count (a worked example follows this list).
- 5. Finally, use a neural network to decide whether each line is part of the body content.
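To get a feel for the density measure in step 4, here is a tiny worked example with made-up numbers:

# Hypothetical figures: one content line, one menu line.
content_density = 200 / float(250)  # 200 chars of text from 250 bytes of HTML -> 0.8
menu_density = 20 / float(400)      # 20 chars of text from 400 bytes of HTML -> 0.05

Body text tends to sit well above markup-heavy navigation lines, which is what makes a simple threshold workable.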
Converting the HTML to Text
import htmllib, formatter, StringIO
from formatter import AbstractFormatter

def extract_text(html):
    # Derive from formatter.AbstractWriter to store paragraphs.
    writer = LineWriter()
    # Default formatter sends commands to our writer.
    formatter = AbstractFormatter(writer)
    # Derive from htmllib.HTMLParser to track parsed bytes.
    parser = TrackingParser(writer, formatter)
    # Give the parser the raw HTML data.
    parser.feed(html)
    parser.close()
    # Filter the paragraphs stored and output them.
    return writer.output()
class TrackingParser(htmllib.HTMLParser):
    """Try to keep accurate pointer of parsing location."""
    def __init__(self, writer, *args):
        htmllib.HTMLParser.__init__(self, *args)
        self.writer = writer

    def parse_starttag(self, i):
        index = htmllib.HTMLParser.parse_starttag(self, i)
        self.writer.index = index
        return index

    def parse_endtag(self, i):
        self.writer.index = i
        return htmllib.HTMLParser.parse_endtag(self, i)
class Paragraph:
    def __init__(self):
        self.text = ''
        self.bytes = 0
        self.density = 0.0
class LineWriter(formatter.AbstractWriter):
    def __init__(self, *args):
        self.last_index = 0
        # Start the index at zero in case text arrives before any tag.
        self.index = 0
        self.lines = [Paragraph()]
        formatter.AbstractWriter.__init__(self)

    def send_flowing_data(self, data):
        # Work out the length of this text chunk.
        t = len(data)
        # We've parsed more text, so increment index.
        self.index += t
        # Calculate the number of bytes since last time.
        b = self.index - self.last_index
        self.last_index = self.index
        # Accumulate this information in current line.
        l = self.lines[-1]
        l.text += data
        l.bytes += b

    def send_paragraph(self, blankline):
        """Create a new paragraph if necessary."""
        if self.lines[-1].text == '':
            return
        self.lines[-1].text += '\n' * (blankline+1)
        self.lines[-1].bytes += 2 * (blankline+1)
        self.lines.append(Paragraph())

    def send_literal_data(self, data):
        self.send_flowing_data(data)

    def send_line_break(self):
        self.send_paragraph(0)
Analyzing the Data
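The interesting part is how the density jumps between boilerplate and body text. One minimal way to eyeball that signal is to print it line by line, using the compute_density method shown below (dump_densities is a hypothetical helper, not part of the original script):

def dump_densities(writer):
    # Print each line's density next to a short preview of its text.
    writer.compute_density()
    for l in writer.lines:
        print '%.2f  %s' % (l.density, l.text.strip()[:40])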
Filtering the Lines of Text
def compute_density(self):
    """Calculate the density for each line, and the average."""
    total = 0.0
    for l in self.lines:
        # Guard against an empty trailing paragraph with zero bytes.
        if l.bytes > 0:
            l.density = len(l.text) / float(l.bytes)
        total += l.density
    # Store for optional use by the neural network.
    self.average = total / float(len(self.lines))
def output(self):
    """Return a string with the useless lines filtered out."""
    self.compute_density()
    output = StringIO.StringIO()
    for l in self.lines:
        # Check density against threshold.
        # Custom filter extensions go here.
        if l.density > 0.5:
            output.write(l.text)
    return output.getvalue()
Supervised Machine Learning
- The density of the current line.
- The number of HTML bytes on the current line.
- The length of the output text on the current line.
- The same three values for the previous line.
- ... and for the next line.
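FANN reads its training patterns from a plain-text file: a header line giving the number of pairs, inputs and outputs, followed by alternating input and output lines. Here is a sketch of writing such a file using only the three current-line features and hand-made 0/1 labels (write_training_file and labels are illustrative names, not from the original script):

def write_training_file(lines, labels, path='training.txt'):
    # Header: number of pairs, number of inputs, number of outputs.
    f = open(path, 'w')
    f.write('%d 3 1\n' % len(lines))
    for l, label in zip(lines, labels):
        # Input line: the three features for this line of text.
        f.write('%f %d %d\n' % (l.density, l.bytes, len(l.text)))
        # Output line: 1 for body content, 0 for boilerplate.
        f.write('%d\n' % label)
    f.close()

Feeding the previous and next lines' values as well just widens each input line to nine numbers and changes the header to match.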
from pyfann import fann, libfann

# This creates a new single-layer perceptron with 1 output and 3 inputs.
obj = libfann.fann_create_standard_array(2, (3, 1))
ann = fann.fann_class(obj)

# Load the data we described above.
patterns = fann.read_train_from_file('training.txt')
ann.train_on_data(patterns, 1000, 1, 0.0)

# Then test it with different data.
# validation_data: (inputs, expected output) pairs held out from training.
for datin, datout in validation_data:
    result = ann.run(datin)
    print 'Got:', result, ' Expected:', datout
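Once trained, the network can stand in for the fixed 0.5 density threshold in output(). A sketch of such a variant on LineWriter, assuming ann.run() returns a sequence of outputs as in the validation loop above:

def output_with_ann(self, ann):
    """Like output(), but let the trained network pick the lines."""
    self.compute_density()
    output = StringIO.StringIO()
    for l in self.lines:
        # Feed the same three features that were used for training.
        result = ann.run([l.density, l.bytes, len(l.text)])
        if result[0] > 0.5:
            output.write(l.text)
    return output.getvalue()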