January 10, 2020


Headless browsers have become very popular with the rise of automated UI tests in the application development process. There are also countless use cases for website crawlers and HTML-based content analysis.

In 99 percent of these cases, you don’t actually need a browser GUI because the work is fully automated. Running a GUI is also more expensive than spinning up a Linux-based server or scaling a simple Docker container across a cluster managed by an orchestrator such as Kubernetes.

But I digress. Put simply, it has become increasingly critical to have a Docker container-based headless browser to maximize flexibility and scalability. In this tutorial, we’ll demonstrate how to create a Dockerfile to set up a Headless Chrome browser in Node.js.

Headless Chrome with Node.js

Node.js is the main language interface used by the Google Chrome development team for browser automation, through Puppeteer, a library that integrates with Chrome almost natively. Puppeteer talks to the browser using the DevTools Protocol, over either a WebSocket or a system pipe. It can take screenshots, measure page load metrics, connection speeds, and downloaded content size, and more. You can also test your UI under different device emulations. Most importantly, Puppeteer doesn’t require a running GUI; everything can be done in headless mode.

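const puppeteer = require('puppeteer');
const fs = require('fs');

// Function declarations are hoisted, so Screenshot can be called before its definition.
// Log any failure instead of leaving the promise rejection unhandled.
Screenshot('https://google.com').catch(console.error);

async function Screenshot(url) {
    // Launch Chrome without a GUI
    const browser = await puppeteer.launch({
        headless: true,
        args: [
            "--no-sandbox",
            "--disable-gpu",
        ]
    });

    const page = await browser.newPage();
    // timeout: 0 disables the navigation timeout;
    // networkidle0 waits until there are no open network connections
    await page.goto(url, {
        timeout: 0,
        waitUntil: 'networkidle0',
    });
    // Capture a low-quality JPEG and write it to disk
    const screenData = await page.screenshot({encoding: 'binary', type: 'jpeg', quality: 30});
    fs.writeFileSync('screenshot.jpg', screenData);

    await page.close();
    await browser.close();
}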


The code above is all it takes to capture a screenshot with Headless Chrome. Note that we are not specifying Google Chrome’s executable path because Puppeteer’s npm package downloads a compatible Chromium build when it is installed. Chrome’s dev team did a great job of keeping the library simple to use and minimizing the required setup. This also makes our job of embedding this code inside the Docker container much easier.

Google Chrome inside a Docker container

Running a browser inside a container seems simple based on the code above, but it’s important not to overlook security. By default, everything inside a container runs under the root user, and the browser executes JavaScript files locally.

Of course, Google Chrome is secure, and it doesn’t allow browser-based scripts to access local files, but there are still potential security risks. You can minimize many of these risks by creating a new user for the specific purpose of running the browser itself. Chrome also has sandbox mode enabled by default, which restricts external scripts from accessing the local environment; note that the --no-sandbox flag used in the code samples in this tutorial turns that protection off.
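To illustrate how the dedicated user and the sandbox relate, here is a minimal sketch that only disables the sandbox when the process runs as root, since Chromium refuses to start as root with the sandbox enabled. The helper name chromiumArgs is hypothetical and not part of Puppeteer. Depending on the host, the sandbox may also need additional container privileges, so treat this as a starting point rather than a complete setup.

```javascript
// Sketch: choose Chromium flags based on the user the process runs as.
// chromiumArgs is a hypothetical helper, not part of Puppeteer.
function chromiumArgs(uid) {
    const args = ['--disable-gpu'];
    // Chromium will not start as root with its sandbox enabled,
    // so the sandbox is dropped only in that case.
    if (uid === 0) {
        args.push('--no-sandbox');
    }
    return args;
}

// process.getuid is not available on Windows, so guard the call.
const currentUid = typeof process.getuid === 'function' ? process.getuid() : -1;
console.log(chromiumArgs(currentUid));
```

The result can be passed as the args option of puppeteer.launch, so a container that runs under a non-root user keeps the sandbox on.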

Below is the Dockerfile sample responsible for the Google Chrome setup. We will choose Alpine Linux as our base container because it has a minimal footprint as a Docker image.

FROM alpine:3.6

RUN apk update && apk add --no-cache nmap && \
    echo @edge http://nl.alpinelinux.org/alpine/edge/community >> /etc/apk/repositories && \
    echo @edge http://nl.alpinelinux.org/alpine/edge/main >> /etc/apk/repositories && \
    apk update && \
    apk add --no-cache \
      chromium \
      harfbuzz \
      "freetype>2.8" \
      ttf-freefont \
      nss

ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true

....
....


The RUN instruction adds the Alpine edge repositories, then installs Chromium for Linux and the libraries it needs to run on Alpine. The tricky part is making sure we don’t also download the Chrome build that ships with Puppeteer. That would waste space in our container image, which is why we set the PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true environment variable.

After running the Docker build, we get our Chromium executable: /usr/bin/chromium-browser. This should be the executable path we give Puppeteer.
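One way to keep the same script usable both locally (with the browser Puppeteer downloads) and inside the container (with the system Chromium) is to read the path from the environment. The following is a minimal sketch under that assumption; buildLaunchOptions and the CHROMIUM_PATH variable are hypothetical names chosen for this example, not something Puppeteer defines.

```javascript
// Sketch: build Puppeteer launch options for local and container runs.
// CHROMIUM_PATH is a hypothetical variable name chosen for this example.
function buildLaunchOptions(env) {
    const options = {
        headless: true,
        args: ['--no-sandbox', '--disable-gpu'],
    };
    // When executablePath is omitted, Puppeteer uses the browser it downloaded itself.
    if (env.CHROMIUM_PATH) {
        options.executablePath = env.CHROMIUM_PATH;
    }
    return options;
}

console.log(buildLaunchOptions({CHROMIUM_PATH: '/usr/bin/chromium-browser'}));
```

With this in place, the launch call becomes puppeteer.launch(buildLaunchOptions(process.env)), and the container image only needs to set CHROMIUM_PATH to /usr/bin/chromium-browser.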


Now let’s jump to our JavaScript code and complete the Dockerfile.

Combining Node.js Server and Chromium container

Before we continue, let’s adapt our code to work as a microservice that takes screenshots of given websites. For that, we’ll use Express.js to spin up a basic HTTP server.
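Because such a service opens whatever address the caller passes in, it can help to validate the url parameter before handing it to the browser. The sketch below is one possible approach, not part of the original server code; parseTargetUrl is a hypothetical helper that accepts only http and https addresses and returns null for anything else.

```javascript
// Sketch: validate the `url` query parameter before opening it in the browser.
// parseTargetUrl is a hypothetical helper name.
function parseTargetUrl(raw) {
    if (typeof raw !== 'string' || raw.length === 0) {
        return null;
    }
    let parsed;
    try {
        parsed = new URL(raw); // throws a TypeError on malformed input
    } catch (err) {
        return null;
    }
    // Only allow web pages; this rejects file:, data:, javascript: and similar schemes.
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        return null;
    }
    return parsed.href;
}

console.log(parseTargetUrl('https://google.com'));
console.log(parseTargetUrl('file:///etc/passwd'));
```

A request handler could call this helper first and respond with an error whenever it returns null.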

// server.js
const express = require('express');
const puppeteer = require('puppeteer');

const app = express();

// /?url=https://google.com
// The handler must be async because it awaits Screenshot()
app.get('/', async (req, res) => {
    const {url} = req.query;
    if (!url || url.length === 0) {
        return res.status(400).json({error: 'url query parameter is required'});
    }

    try {
        const imageData = await Screenshot(url);

        res.set('Content-Type', 'image/jpeg');
        res.set('Content-Length', imageData.length);
        res.send(imageData);
    } catch (err) {
        // Express 4 does not catch rejected promises, so respond explicitly
        res.status(500).json({error: 'failed to take screenshot'});
    }
});

app.listen(process.env.PORT || 3000);

async function Screenshot(url) {
    const browser = await puppeteer.launch({
        headless: true,
        executablePath: '/usr/bin/chromium-browser',
        args: [
            "--no-sandbox",
            "--disable-gpu",
        ]
    });

    const page = await browser.newPage();
    await page.goto(url, {
        timeout: 0,
waitUntil: