(Introduction)

The process of collecting information from a website (or websites) is often referred to as either web scraping or web crawling. Web scraping is the process of scanning a webpage/website and extracting information out of it, whereas web crawling is the process of iteratively finding and fetching web links starting from a URL or list of URLs.

While there are differences between the two, you might have heard the two words used interchangeably. Although this article will be a guide on how to scrape information, the lessons learned here can very easily be used for the purposes of ‘crawling’.

Hopefully I don’t need to spend much time talking about why we would look to scrape data from an online resource, but quite simply, if there is data you want to collect from an online resource, scraping is how we would go about it. And if you would prefer to avoid the rigour of going through each page of a website manually, we now have tools that can automate the process.

I’ll also take a moment to add that the process of web scraping is a legal grey area. You will generally be on the safer side of the law if you are collecting data for personal use and the data is otherwise freely available. Scraping data that is not freely available is where things enter murky waters. Many websites will also have policies relating to how data can be used, so please bear those policies in mind. With all of that out of the way, let’s get into it.

For the purposes of demonstration, I will be scraping my own website and will be downloading a copy of the scraped data. In doing so, we will:

  1. Set up an environment that allows us to watch the automation if we choose to (the alternative is to run this in what is known as a ‘headless’ browser; more on that later);
  2. Automate the visit to my website;
  3. Traverse the DOM;
  4. Collect pieces of data;
  5. Download pieces of data;
  6. Learn how to handle asynchronous requests;
  7. And my favourite bit: end up with a complete project that we can reuse whenever we want to scrape data.

Now in order to do all of these, we will be making use of two things: Node.js and Puppeteer. Chances are you have already heard of Node.js before, so we won’t go into what that is; just know that we will be using one Node.js module: FS (File System).

Let’s briefly explain what Puppeteer is.

(Puppeteer)

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Most things that you can do manually in the browser can be done using Puppeteer.

The Puppeteer website provides a bunch of examples, such as taking screenshots and generating PDFs of webpages, automating form submission, testing UI, and so on.

One thing they don’t expressly mention is the concept of data scraping, likely due to the potential legal issues mentioned earlier. But as it states, anything you can do manually in a browser can be done with Puppeteer. Automating those things means that you can do it way, way faster than any human ever could.

This is going to be your new favourite website: https://pptr.dev/. Once you’re finished with this article, I’d recommend bookmarking this link, as you will want to refer to their API if you plan to do any super advanced things.

(Installation)

If you don’t already have Node installed, go to https://nodejs.org/en/download/ and install the relevant version for your computer. That will also install something called npm, which is a package manager and allows us to be able to install third party packages (such as Puppeteer).

We will then go and create a directory and create a package.json by typing npm init inside of the directory. Note: I actually use yarn instead of npm, so feel free to use yarn if that's what you prefer.

From here on, we are going to assume that you have a basic understanding of package managers such as npm/yarn and have an understanding of Node environments. Next, go ahead and install Puppeteer by running npm i puppeteer or yarn add puppeteer.

(Directory Structure)

Okay, so after running npm init/yarn init and installing puppeteer, we currently have a directory made up of a node_modules folder, a package.json and a package-lock.json.

Now we want to try and create our app with some separation of concerns in mind. So to begin with, we'll create a file in the root of our directory called main.js. main.js will be the file that we execute whenever we want to run our app. In our root, we will then create a folder called api. This api folder will include most of the code our application will be using.

Inside of this api folder we will create three files: interface.js, system.js, and utils.js. interface.js will contain any puppeteer-specific code (so things such as opening the browser, navigating to a page etc), system.js will include any node-specific code (such as saving data to disk, opening files etc), utils.js will include any reusable bits of JavaScript code that we might create along the way.

Note: In the end, we didn’t make use of utils.js in this tutorial so feel free to remove it if you think your own project will make use of it.

(Basic Commands)

Okay, now because a lot of the code we will be writing depends on network requests, waiting for responses and so on, we tend to write a lot of our Puppeteer code asynchronously. Because of this, it is common practice to wrap all of your executing code inside of an async IIFE.

If you’re unsure what an IIFE is, it’s basically a function that executes immediately after its creation. For more info, here’s an article I wrote about IIFEs. To make our IIFE asynchronous, we just add the async keyword to the beginning of it, like so:

(async () => {

})();

Right, so we’ve set up our async IIFE, but so far we have nothing to run in there. Let’s fix that by enabling our ability to open a browser with Puppeteer. Let’s open api/interface.js and begin by creating an object called interface. We will also want to export this object. Therefore, our initial boilerplate code inside of api/interface.js will look like this:

const interface = {};

module.exports = interface;

As we are going to be using Puppeteer, we’ll need to import it. Therefore, we’ll require() it at the top of our file by writing const puppeteer = require("puppeteer");. Inside of our interface object, we will create an async function called init(). As mentioned earlier, a lot of our code is going to be asynchronous.

Now because we want to open a browser, that may take a few seconds. We will also want to save some information into variables off the back of this. Therefore, we'll need to make this asynchronous so that our variables get the responses assigned to them. There are two pieces of data that will come from our init() function that we are going to want to store into variables inside of our interface object. Because of this, let's go ahead and create two key:value pairings inside of our interface object, like so:

const interface = {
  browser: null,
  page: null,
};

module.exports = interface;

Now that we have those set up, let’s write a try/catch block inside of our init() function.

For the catch part, we'll simply console.log out our error. If you'd like to handle this another way, by all means go ahead - the important bits here are what we will be putting inside of the try part. We will first set this.browser to await puppeteer.launch(). As you may expect, this simply launches a browser.

The launch() function can accept an optional object where you can pass in many different options. We will leave it as is for the moment but we will return to this in a little while. Next we will set this.page to await this.browser.newPage(). As you may imagine, this will open a tab in the puppeteer browser. So far, this gives us the following code:

const puppeteer = require("puppeteer");

const interface = {
  browser: null,
  page: null,

  async init() {
    try {
      this.browser = await puppeteer.launch();
      this.page = await this.browser.newPage();
    } catch (err) {
      console.log(err);
    }
  },
};

module.exports = interface;

We’re also going to add two more functions into our interface object. The first is a visitPage() function which we will use to navigate to certain pages. You will see below that it accepts a url param which will basically be the full URL that we want to visit. The second is a close() function which will basically kill the browser session.

These two functions look like this:

async visitPage(url) {
  await this.page.goto(url);
},

async close() {
  await this.browser.close();
},
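
As an aside, goto() accepts an optional second argument of navigation options. If you plan to scrape pages that load content dynamically, the waitUntil option is worth knowing about. Here’s a sketch of how visitPage might use it (the "networkidle2" value is part of Puppeteer’s standard options, though this walkthrough doesn’t rely on it):

async visitPage(url) {
  // Consider navigation finished once there have been no more than
  // 2 network connections for at least 500ms (good for script-heavy pages).
  await this.page.goto(url, { waitUntil: "networkidle2" });
},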

Now before we try to run any code, let’s add some arguments into the puppeteer.launch() function that sits inside of our init() function. As mentioned before, launch() accepts an object as its argument. So let's write the following: puppeteer.launch({headless: false}). This will mean that when we do try to run our code, a browser will open and we will be able to see what is happening.

This is great for debugging purposes as it allows us to see what is going on in front of our very eyes. As an aside, the default option here is headless: true, and I would strongly advise that you keep this option set to true if you plan to run anything in production, as your code will use less memory and will run faster. Some environments, such as cloud functions, also have to run headless.

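For reference, a production-style launch might look something like the sketch below. Note that the extra Chrome flags are an assumption on my part (they are commonly needed in containerised/cloud environments) rather than something this guide requires:

// A sketch of a production-oriented launch (assumed flags, not from this guide)
this.browser = await puppeteer.launch({
  headless: true, // the default; uses less memory and runs faster
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});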

Anyway, this gives us this.browser = await puppeteer.launch({headless: false}). There's also an args: [] key which takes an array as its value. Here we can add certain things such as use of proxy IPs, incognito mode etc. Finally, there's a slowMo key that we can pass in to our object which we can use to slow down the speed of our Puppeteer interactions. There are many other options available but these are the ones that I wanted to introduce to you so far. So this is what our init() function looks like for now (use of incognito and slowMo have been commented out but left in to provide a visual aid):

async init() {
  try {
    this.browser = await puppeteer.launch({
      args: [
        // "--incognito",
      ],
      headless: false,
      // slowMo: 250,
    });
    this.page = await this.browser.newPage();
  } catch (err) {
    console.log(err);
  }
},

There’s one other line of code we are going to add, which is await this.page.setViewport({ width: 1279, height: 768 });. This isn’t necessary, but I wanted to put in the option of being able to set the viewport so that when you view what is going on, the browser width and height will seem a bit more normal. Feel free to adjust the width and height to be whatever you want them to be (I’ve set mine based on the screen size for a 13" Macbook Pro). You’ll notice in the code block below that this setViewport function sits below the this.page assignment. This is important because you have to set this.page before you can set its viewport.

So now if we put everything together, this is how our interface.js file looks:

const puppeteer = require("puppeteer");

const interface = {
  browser: null,
  page: null,

  async init() {
    try {
      this.browser = await puppeteer.launch({
        args: [
          // `--proxy-server=http=${randProxy}`,
          // "--incognito",
        ],
        headless: false,
        // slowMo: 250,
      });
      this.page = await this.browser.newPage();
      await this.page.setViewport({ width: 1279, height: 768 });
    } catch (err) {
      console.log(err);
    }
  },

  async visitPage(url) {
    await this.page.goto(url);
  },

  async close() {
    await this.browser.close();
  },
};

module.exports = interface;

Now, let’s move back to our main.js file in the root of our directory and put some of the code we have just written to use. Add the following code so that your main.js file now looks like this:

const interface = require("./api/interface");

(async () => {
  await interface.init();
  await interface.visitPage("https://sunilsandhu.com");
})();

Now go to your command line, navigate to the directory for your project and type node main.js. Provided everything has worked okay, your application will proceed to load up a browser and navigate to sunilsandhu.com (or any other website, if you happened to put something else in). Pretty neat!

Now during the process of writing this piece, I actually encountered an error while trying to execute this code. The error said something along the lines of Error: Could not find browser revision 782078. Run "PUPPETEER_PRODUCT=firefox npm install" or "PUPPETEER_PRODUCT=firefox yarn install" to download a supported Firefox browser binary. This seemed quite strange to me, as I was not trying to use Firefox and had not encountered this issue when using the same code for a previous project. It turns out that when installing puppeteer, it hadn’t downloaded a local version of Chrome to use from within the node_modules folder.

I'm not entirely sure what caused this issue (it may have been because I was hotspotting off of my phone at the time), but I managed to fix the issue by simply copying the missing files over from another project of mine that was using the same version of Puppeteer.

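If you run into the same thing, re-running npm install puppeteer (or yarn add puppeteer) is probably a sensible first step, since Puppeteer's install step is normally what downloads its bundled copy of Chromium.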

If you encounter a similar issue, please let me know and I'd be curious to hear more.

(Advanced Commands)

Okay, so we’ve managed to navigate to a page, but how do we gather data from the page? This bit may look a bit confusing, so be ready to pay attention! We’re going to create two functions here, one that mimics document.querySelectorAll and another that mimics document.querySelector.

The difference here is that our functions will return whichever attribute(s) you are looking for from the selected elements. Both functions actually use querySelector/querySelectorAll under the hood, and if you have used them before, you might wonder why I am asking you to pay attention.

The reason is that retrieving attributes from the results is not the same as it is when you're traversing the DOM in a browser. Before we talk about how the code works, let's take a look at what our final function looks like:

async querySelectorAllAttributes(selector, attribute) {
  try {
    return await this.page.$$eval(
      selector,
      (elements, attribute) => {
        return elements.map((element) => element[attribute]);
      },
      attribute
    );
  } catch (error) {
    console.log(error);
  }
},

So, we’re writing another async function and we’ll wrap the contents inside of a try/catch block. To begin with, we will await and return the value from an $$eval function which we have available for execution on our this.page value. Therefore, we're running return await this.page.$$eval(). $$eval is just a wrapper around document.querySelectorAll.

There’s also an $eval function available (note that this one only has 1 dollar sign), which is the equivalent for using document.querySelector.

The $eval and $$eval functions accept two parameters. The first is the selector we want to run the query against. So for example, if I want to find div elements, the selector would be 'div'. The second is a function which retrieves specific attributes from the result of the query selection. You will see that we are passing two parameters into this function: the first, elements, is basically just the entire result from the previous query selection. The second is an optional value that we have decided to pass in, this being attribute.

$eval$$eval函数接受两个参数。 第一个是我们要再次运行它的选择器。 因此,例如,如果我要查找div元素,则选择器将为“ div”。 第二个功能是从查询选择的结果中检索特定属性。 您将看到我们向该函数传递了两个参数,第一个elements基本上只是先前查询选择的整个结果。 第二个是我们决定传递的可选值,这是attribute

We then map over our query selection and find the specific attribute that we passed in as the parameter. You’ll also notice that after the curly brace, we pass in the attribute again. This is necessary because when we use $$eval and $eval, the page function is executed in a different environment (the browser) to where the initial code was executed (in Node). When this happens, it loses context. However, we can fix this by passing the value in at the end. This is simply a quirk specific to Puppeteer that we have to account for.

With regard to our function that simply returns one attribute, the difference in the code is that we simply return the attribute value rather than mapping over an array of values. Okay, so we are now in a position where we are able to query elements and retrieve values. This puts us in a great position to now be able to collect data.

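That single-attribute function isn’t shown above, so here is a minimal sketch of what it might look like (the name querySelectorAttribute is my own choice, not something the article specifies):

async querySelectorAttribute(selector, attribute) {
  try {
    // $eval wraps document.querySelector, so this returns the requested
    // attribute from the first matching element only.
    return await this.page.$eval(
      selector,
      (element, attribute) => element[attribute],
      attribute
    );
  } catch (error) {
    console.log(error);
  }
},

You could then call something like await interface.querySelectorAttribute("h1", "textContent") to grab the text of the first heading on a page.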

So let’s go back into our main.js file. I’ve decided that I would like to collect all of the links from my website. Therefore, I’ll use the querySelectorAllAttributes function and will pass in two parameters: "a" for the selector in order to get all of the <a> tags, then "href" for the attribute in order to get the link from each <a> tag. Let's see how that code looks:

const interface = require("./api/interface");

(async () => {
  await interface.init();
  await interface.visitPage("https://sunilsandhu.com");
  let links = await interface.querySelectorAllAttributes("a", "href");
  console.log(links);
})();

Let’s run node main.js again. If you already have it running from before, type cmd+c/ctrl+c and hit enter to kill the previous session. In the console you should be able to see a list of links retrieved from the website. Tip: What if you wanted to then go and visit each link? Well you could simply write a loop function that takes each value and passes it in to our visitPage function. It might look something like this:

for await (const link of links) {
  await interface.visitPage(link);
}
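
In practice, the href list will often contain duplicates and non-HTTP values (mailto: addresses, page anchors and so on), so you may want to tidy it up first. A possible refinement, offered as my own suggestion rather than part of the original walkthrough:

// De-duplicate and keep only http(s) links before visiting each one
// (an assumed refinement; adjust the filter to suit your target site).
const uniqueLinks = [...new Set(links)].filter(
  (link) => link && link.startsWith("http")
);

for await (const link of uniqueLinks) {
  await interface.visitPage(link);
}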

(Saving data)

Great, so we are able to visit pages and collect data. Let’s take a look at how we can save this data. Note: There are, of course, many options here when it comes to saving data, such as saving to a database. We are, however, going to look at how we would use Node.js to save data locally to our hard drive. If this isn’t of interest to you, you can probably skip this section and swap it out for whatever approach you’d prefer to take.

Let’s switch gears and go into our empty system.js file. We’re just going to create one function. This function will take three parameters, but we are going to make two of them optional. Let’s take a look at what our system.js file looks like, then we will review the code:

const fs = require("fs");

const system = {
  async saveFile(data, filePath = Date.now(), fileType = "json") {
    fs.writeFile(`${filePath}.${fileType}`, JSON.stringify(data), function (err) {
      if (err) return console.log(err);
    });
  },
};

module.exports = system;

So the first thing you will notice is that we are requiring the fs module at the top. This is a Node.js-specific module that is available to you as long as you have Node installed on your device. We then have our system object, which we export at the bottom; this is the same process we followed for the interface.js file earlier.

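To close the loop, here is a sketch of how main.js might tie everything together by saving the collected links to disk before closing the browser (the "links" filename is just an example; saveFile falls back to a timestamp if you omit it):

const interface = require("./api/interface");
const system = require("./api/system");

(async () => {
  await interface.init();
  await interface.visitPage("https://sunilsandhu.com");
  let links = await interface.querySelectorAllAttributes("a", "href");
  await system.saveFile(links, "links"); // writes links.json to the project root
  await interface.close();
})();

One design note: saveFile is declared async but wraps the callback version of fs.writeFile, so awaiting it won’t actually wait for the write to complete; if you need that guarantee, fs.promises.writeFile is the more natural fit.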

(Conclusion)

And there we have it! We have created a new project from scratch that allows you to automate the collection of data from a website. We have gone through each of the steps involved, from the initial installation of packages right up to downloading and saving collected data. You now have a project that allows you to input any website and collect and download all of the links from it.

Hopefully the methods we have outlined provide you with enough knowledge to be able to adapt the code accordingly (e.g., if you want to gather a different HTML tag besides <a> tags).

Translated from: https://medium.com/javascript-in-plain-english/how-to-scrape-data-from-a-website-with-javascript-9c93bbb4de51