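I previously reposted an article introducing the ChilkatDotNet component. In this post I use that component to write a tool that collects email addresses from web pages.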

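       Collecting information from web pages involves two difficulties. The first is writing a spider that traverses pages by following links, which the ChilkatDotNet component already supports well. The second is extracting the wanted information from each page. There are many ways to do this, and here I chose regular expressions.
       Here is a screenshot of the program running: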
      

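       The interface is simple: three TextBoxes, one RichTextBox and two Buttons. The three TextBoxes take the site address, the starting URL and the number of links to traverse. The RichTextBox holds the collected information, which here is each page's URL and the email addresses found on that page.
       The program has two main parts. The first is traversing the site, and the code is as follows: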
      
            Chilkat.Spider spider = new Chilkat.Spider();

            //  Read the starting conditions from the three text boxes.
            string website = this.textWebsite.Text;
            string url = this.textUrl.Text;
            int links = Int32.Parse(this.textLinks.Text);

            //  The spider object crawls a single web site at a time.  Outbound
            //  links can be collected and used to crawl the wider web, but here
            //  we only spider the requested number of pages of one site.
            spider.Initialize(website);

            //  Add the 1st URL:
            spider.AddUnspidered(url);

            //  Begin crawling the site by calling CrawlNext repeatedly.
            //  "<" (not "<=") so that exactly `links` pages are requested.
            for (int i = 0; i < links; i++)
            {
                bool success = spider.CrawlNext();

                if (success)
                {
                    //  Invoke marshals the text update onto the UI thread.
                    Invoke(new AppendTextDelegate(AppendText), new object[] { spider.LastUrl + "\r\n" });
                    //  GetAllURL (not shown here) processes the page at this URL.
                    GetAllURL(spider.LastUrl);
                }
                else
                {
                    //  Did we get an error or are there no more URLs to crawl?
                    if (spider.NumUnspidered == 0)
                    {
                        MessageBox.Show("No more URLs to spider");
                        break;  //  nothing left, so stop instead of repeating the message
                    }
                    else
                    {
                        MessageBox.Show(spider.LastErrorText);
                    }
                }

                //  Sleep 1 second before spidering the next URL.
                spider.SleepMs(1000);
            }

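       This is much like the sample code that comes with ChilkatDotNet. I only added the code that reads the starting conditions from the text boxes. Once a page's URL is known, the next step is to fetch the page content and then pull the email addresses out of it with a regular expression.
       Fetching the page content: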
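            HttpWebRequest webRequest1 = (HttpWebRequest)WebRequest.Create(new Uri(URlStr));
            webRequest1.Method = "GET";
            String textData;
            // The using blocks release the response and reader even if reading throws.
            using (HttpWebResponse response = (HttpWebResponse)webRequest1.GetResponse())
            using (Stream stream = response.GetResponseStream())
            // Encoding.Default is the system code page; pages served in another
            // charset (for example UTF-8) may come out garbled.
            using (StreamReader streamReader = new StreamReader(stream, Encoding.Default))
            {
                textData = streamReader.ReadToEnd();
            }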
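Because the fetch above decodes every page with `Encoding.Default`, one possible refinement is to choose the encoding from the charset the server reports. The following is only a sketch: `PageEncoding.Resolve` is a hypothetical helper that is not part of the original program, and falling back to UTF-8 is my assumption rather than anything the article prescribes.

```csharp
using System;
using System.Text;

static class PageEncoding
{
    // Hypothetical helper: map a charset name to an Encoding,
    // falling back to UTF-8 when the name is missing or unknown.
    public static Encoding Resolve(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;

        try
        {
            // Some servers send the value quoted, e.g. "utf-8".
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // The name is not a charset this platform knows.
            return Encoding.UTF8;
        }
    }
}
```

It could be used as `new StreamReader(stream, PageEncoding.Resolve(response.CharacterSet))`. Note that `CharacterSet` only reflects the HTTP header, so a charset declared solely in a page's `<meta>` tag would still be missed.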
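     The regular expression for extracting email addresses (it lists only uppercase letters, so it has to be used with RegexOptions.IgnoreCase to match lowercase addresses):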
@"(?<EmailStr>\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b)"

     As for how to use regular expressions, there are plenty of tutorials online, and any one of them is enough to get started.
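As one illustration of applying the pattern above to the downloaded page text, here is a minimal sketch. `EmailExtractor.Extract` is a hypothetical helper, not the article's own method (which is not shown), and the de-duplication step is my addition. It reads each address through the named group `EmailStr` and passes `RegexOptions.IgnoreCase` because the pattern lists only `A-Z`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EmailExtractor
{
    // The pattern from the article, compiled once and matched case-insensitively.
    private static readonly Regex EmailRegex = new Regex(
        @"(?<EmailStr>\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Hypothetical helper: return the distinct addresses in a page,
    // in order of first appearance (duplicates compared case-insensitively).
    public static List<string> Extract(string html)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();

        foreach (Match m in EmailRegex.Matches(html ?? string.Empty))
        {
            string email = m.Groups["EmailStr"].Value;
            if (seen.Add(email))
                result.Add(email);
        }
        return result;
    }
}
```

For example, `EmailExtractor.Extract(textData)` could be called on the string read from the response. Because of the `{2,4}` quantifier, addresses whose top-level domain is longer than four letters (such as `.museum`) are not matched by this pattern.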
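     Here I only collected the email addresses of a single site. With the ChilkatDotNet component it is not hard to extend this to collecting information across the whole web, and interested readers can explore that themselves.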
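One way to approach the multi-site extension is to keep a queue of sites discovered through outbound links and crawl them one at a time. The sketch below covers only that bookkeeping and does not depend on Chilkat. `SiteFrontier` is a hypothetical class of my own. In a real crawl the URLs would come from the spider's outbound links, which I believe Chilkat exposes as `NumOutboundLinks` and `GetOutboundLink(index)`; treat those names as an assumption to check against the Chilkat reference.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers which hosts have been seen and which are still waiting.
class SiteFrontier
{
    private readonly Queue<string> pending = new Queue<string>();
    private readonly HashSet<string> seen =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Queue the host of an outbound URL. Returns true only for a new http/https host.
    public bool Add(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return false;                       // relative or malformed link
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
            return false;                       // mailto:, javascript:, file:, ...
        if (!seen.Add(uri.Host))
            return false;                       // host already queued or crawled
        pending.Enqueue(uri.Host);
        return true;
    }

    // Hand out the next host to crawl, or return false when the queue is empty.
    public bool TryNext(out string host)
    {
        if (pending.Count == 0)
        {
            host = null;
            return false;
        }
        host = pending.Dequeue();
        return true;
    }
}
```

Each host returned by `TryNext` could then be passed to `spider.Initialize` and crawled with the same loop as for a single site, feeding that site's outbound links back into `Add`.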