Writing an Email Harvesting Tool with a .NET Component

ssbird, 2008-01-02 18:51:30. Category: DotNet

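© Copyright belongs to the author. This is an original work by ssbird, published on the 51CTO blog. Contact the author for permission before reposting; otherwise legal liability will be pursued.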

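I previously reposted an article introducing the ChilkatDotNet component. Below I use that component to write a tool that collects email addresses from web pages.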

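Collecting information from web pages poses two difficulties. The first is writing a spider that traverses pages by following links; the ChilkatDotNet component already supports this well. The second is extracting the required information from each page. There are many ways to do this, and here I chose regular expressions.
First, a screenshot of the program at run time: [screenshot]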

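The interface is simple: three TextBoxes, one RichTextBox and two Buttons. The three TextBoxes take the site address, the starting URL and the number of links to traverse. The RichTextBox holds the information collected from the pages; here I store each page's URL and the email addresses found on it.
The program has two main parts. The first is traversing the site, with the following code: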

// Requires: using System; using System.Windows.Forms; and a reference to ChilkatDotNet.
Chilkat.Spider spider = new Chilkat.Spider();

// Read the starting conditions from the three text boxes.
string website = this.textWebsite.Text;
string url = this.textUrl.Text;
int links = Int32.Parse(this.textLinks.Text);

// The spider object crawls a single web site at a time. Outbound links
// can be collected and used to crawl the wider web; here we stay on the
// one site entered above.
spider.Initialize(website);

// Add the 1st URL:
spider.AddUnspidered(url);

// Begin crawling the site by calling CrawlNext repeatedly,
// at most "links" times (the original "i <= links" crawled one page too many).
for (int i = 0; i < links; i++)
{
    bool success = spider.CrawlNext();
    if (success)
    {
        // Show the crawled URL on the UI thread, then scan that page for emails.
        Invoke(new AppendTextDelegate(AppendText), new object[] { spider.LastUrl + "\r\n" });
        GetAllURL(spider.LastUrl);
    }
    else
    {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0)
        {
            MessageBox.Show("No more URLs to spider");
            break; // nothing left to crawl, so stop looping
        }
        else
        {
            MessageBox.Show(spider.LastErrorText);
        }
    }

    // Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}

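This is similar to the sample code that comes with ChilkatDotNet; it only adds the code that reads the starting conditions from the text boxes. Once a URL has been obtained, the page content has to be fetched, and the email addresses are then extracted with a regular expression.
Fetching the page content: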

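// Requires: using System; using System.IO; using System.Net; using System.Text;
// URlStr is the address of the page to download.
HttpWebRequest webRequest1 = (HttpWebRequest)WebRequest.Create(new Uri(URlStr));
webRequest1.Method = "GET";

String textData;
// The using blocks close the response, stream and reader even if reading throws.
using (HttpWebResponse response = (HttpWebResponse)webRequest1.GetResponse())
using (Stream stream = response.GetResponseStream())
// Encoding.Default decodes with the system's default code page.
using (StreamReader streamReader = new StreamReader(stream, Encoding.Default))
{
    textData = streamReader.ReadToEnd();
}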
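
The fetch code above always decodes the page with Encoding.Default, which can garble pages served in a different character set. As a minimal sketch, and not part of the original tool, the helper below maps a charset name to an Encoding and falls back to a caller-supplied default when the name is missing or unknown. The names PageEncoding and Resolve are hypothetical; the assumption is that the charset name would come from the response, for example HttpWebResponse.CharacterSet.

```csharp
using System;
using System.Text;

// Hypothetical helper, not from the original article.
public static class PageEncoding
{
    // Maps a charset name (for example the value of HttpWebResponse.CharacterSet)
    // to an Encoding. Returns the fallback when the name is empty or not recognised.
    public static Encoding Resolve(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return fallback;
        }

        try
        {
            // Some servers wrap the charset value in quotes; strip them first.
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return fallback;
        }
    }
}
```

With such a helper, the reader could be created as new StreamReader(stream, PageEncoding.Resolve(response.CharacterSet, Encoding.Default)), so the original behaviour is kept whenever the server gives no usable charset.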

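Regular expression for extracting email addresses (its character classes list uppercase letters only, so it has to be applied with RegexOptions.IgnoreCase to match lowercase addresses):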

@"(?<EmailStr>\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b)"
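
The article does not show how the pattern is applied to the downloaded text, so here is a minimal sketch of one way to do it. The class and method names (EmailExtractor, Extract) are hypothetical, and removing duplicates without regard to case is an assumption made for illustration, not something the original tool is known to do.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original article.
public static class EmailExtractor
{
    // The pattern from the article. IgnoreCase is needed because the
    // character classes only list uppercase letters.
    private static readonly Regex EmailRegex = new Regex(
        @"(?<EmailStr>\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns each distinct address found in the text, in order of first appearance.
    public static List<string> Extract(string text)
    {
        // Assumption: addresses that differ only in letter case are duplicates.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();

        foreach (Match m in EmailRegex.Matches(text ?? string.Empty))
        {
            string email = m.Groups["EmailStr"].Value;
            if (seen.Add(email))
            {
                result.Add(email);
            }
        }
        return result;
    }
}
```

Calling EmailExtractor.Extract(textData) on the page text fetched earlier would give the list of addresses to append to the RichTextBox. Note that the {2,4} limit in the pattern means top-level domains longer than four letters are not matched in full.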

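As for how to use regular expressions, there are plenty of tutorials online; any one of them will do.
Here I only collected email addresses from a single site. With the ChilkatDotNet component it is not hard to collect information from the whole web; interested readers can look into it themselves.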
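
Going from one site to many mostly needs some bookkeeping about which sites are still waiting to be crawled. The sketch below is one possible way to keep that list and is independent of ChilkatDotNet: it takes outbound links as plain strings, however they were obtained, and queues each new host once. SiteFrontier, Offer and TryNext are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not from the original article.
public class SiteFrontier
{
    private readonly Queue<string> pending = new Queue<string>();
    private readonly HashSet<string> seenHosts =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Queues the site that hosts the given link unless that host was seen before.
    // Returns true when a new site was queued.
    public bool Offer(string link)
    {
        Uri uri;
        if (!Uri.TryCreate(link, UriKind.Absolute, out uri))
        {
            return false; // not an absolute URL
        }
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
        {
            return false; // skip mailto:, ftp:, file: and similar links
        }
        if (!seenHosts.Add(uri.Host))
        {
            return false; // this host is already queued or crawled
        }

        // Keep only scheme and host, e.g. "http://www.example.com".
        pending.Enqueue(uri.GetLeftPart(UriPartial.Authority));
        return true;
    }

    // Hands out the next site to crawl; returns false when none are left.
    public bool TryNext(out string site)
    {
        if (pending.Count == 0)
        {
            site = null;
            return false;
        }
        site = pending.Dequeue();
        return true;
    }
}
```

An outer loop could take a site from TryNext, run the single-site crawl shown earlier on it, and pass every outbound link found along the way to Offer.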
