Collecting information from web pages poses two challenges. The first is writing a spider that can traverse pages by following links; the ChilkatDotNet component gives us good support here. The second is extracting the desired information from each page; there are many ways to do this, and here I chose regular expressions.
First, a screenshot of the program at runtime:

The UI is simple: three TextBoxes, one RichTextBox, and two Buttons. The three TextBoxes take the site address, the starting URL, and the number of links to crawl; the RichTextBox holds the collected results, in this case each page's URL and the email addresses found on it.
The program has two main parts. The first traverses the site:
Chilkat.Spider spider = new Chilkat.Spider();
string website = this.textWebsite.Text;
string url = this.textUrl.Text;
int links = Int32.Parse(this.textLinks.Text);

// The spider object crawls a single web site at a time. As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web. For now, we'll simply spider the requested number of pages.
spider.Initialize(website);

// Add the 1st URL:
spider.AddUnspidered(url);

// Begin crawling the site by calling CrawlNext repeatedly.
for (int i = 0; i < links; i++)
{
    bool success = spider.CrawlNext();
    if (success)
    {
        Invoke(new AppendTextDelegate(AppendText), new object[] { spider.LastUrl + "\r\n" });
        GetAllURL(spider.LastUrl);
    }
    else
    {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0)
        {
            MessageBox.Show("No more URLs to spider");
        }
        else
        {
            MessageBox.Show(spider.LastErrorText);
        }
        // Either way, there is nothing further to crawl this pass.
        break;
    }

    // Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}
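The Invoke call above assumes a delegate and a matching method for appending text to the RichTextBox from outside the UI thread. The source does not show them, but they would look roughly like this (names taken from the snippet; the RichTextBox field name is assumed):

```csharp
// Assumed declarations matching the Invoke call in the crawl loop:
private delegate void AppendTextDelegate(string text);

private void AppendText(string text)
{
    // richTextBox1 is the assumed name of the results box.
    this.richTextBox1.AppendText(text);
}
```

Routing the update through Invoke keeps the UI responsive if the crawl is later moved onto a background thread.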
This is similar to the sample code that ships with ChilkatDotNet; I only added the code that reads the initial parameters from the text boxes. Once a URL has been obtained, the page content is fetched, and the email addresses are then extracted with a regular expression.
Fetching the page content:
HttpWebRequest webRequest1 = (HttpWebRequest)WebRequest.Create(new Uri(URlStr));
webRequest1.Method = "GET";
HttpWebResponse response = (HttpWebResponse)webRequest1.GetResponse();
Stream stream = response.GetResponseStream();
StreamReader streamReader = new StreamReader(stream, Encoding.Default);
string textData = streamReader.ReadToEnd();
streamReader.Close();
response.Close();

There are many tutorials on regular expressions online; any of them will cover what is needed here.
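The extraction step itself is not shown above. A minimal sketch of pulling email addresses out of the downloaded text with .NET's Regex class might look like this (the pattern is a simple illustration, not a complete address validator):

```csharp
using System.Text.RegularExpressions;

// textData is the page source fetched above.
// A deliberately simple pattern; real-world addresses are messier.
MatchCollection matches = Regex.Matches(
    textData,
    @"[\w\.\-]+@[\w\-]+(\.[\w\-]+)+",
    RegexOptions.IgnoreCase);

foreach (Match m in matches)
{
    // Append each address to the results box, e.g. via the same
    // Invoke/AppendText mechanism used for the URLs.
    Console.WriteLine(m.Value);
}
```

Deduplicating the matches (for example with a HashSet&lt;string&gt;) is worthwhile, since the same address often appears several times on one page.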
Here I only collected email addresses from a single site, but with the ChilkatDotNet component it is not hard to extend this to collect information across the web. Interested readers can explore that on their own.
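As a starting point for crawling beyond a single site, the Chilkat spider records the outbound (off-site) links found on each crawled page. A sketch of collecting them, based on the Chilkat.Spider API rather than code from this article:

```csharp
// After a successful spider.CrawlNext() call, outbound links
// are available on the spider object:
for (int j = 0; j < spider.NumOutboundLinks; j++)
{
    string outboundUrl = spider.GetOutboundLink(j);
    // Since one spider instance crawls one site at a time, each
    // outbound URL could seed a new Chilkat.Spider, e.g. from a queue.
    Console.WriteLine(outboundUrl);
}
```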