Website Content Transfer (Webscraping using jQuery Selectors and Node.js)

I was looking at how to transfer product information from a custom shop to WooCommerce in WordPress.

My challenge was to get all the data for a mass product transfer without doing a lot of manual entry.

Product pages on a site generally share a consistent HTML structure, so the information you need sits in divs with the same classes and IDs on every page.

So how do you get the information from the source website to a CSV file?

First, use wget to take a copy of the target site (http://example.com) or part of it (http://example.com/shop):

wget --no-clobber --page-requisites --convert-links \
     --random-wait -r -p -E -e robots=off \
     -U mozilla http://example.com

# the above doesn't fetch files that the site stores on other servers,
# so try this instead (-H lets wget follow links to other hosts)
wget -E -H -k -K -p http://example.com

This will create a directory named example.com containing all the scripts, images and pages of the site.
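
Before scraping, it can help to check which pages were actually downloaded, since a recursive wget can nest pages in subdirectories. The following is a minimal Node.js sketch, not part of the original script, that walks the mirror and lists every .html file. The function name listHtmlFiles is hypothetical, and it assumes the mirror sits in a directory named example.com below the current directory.

```javascript
var fs = require('fs');
var path = require('path');

// Hypothetical helper: recursively collect every .html file under dir.
function listHtmlFiles(dir) {
    var found = [];
    fs.readdirSync(dir).forEach(function (name) {
        var full = path.join(dir, name);
        var stats = fs.statSync(full);
        if (stats.isDirectory()) {
            // descend into subdirectories created by wget -r
            found = found.concat(listHtmlFiles(full));
        } else if (stats.isFile() && /\.html$/i.test(name)) {
            found.push(full);
        }
    });
    return found;
}

// Assumption: "example.com" is the directory wget created;
// change it to match your own mirror.
if (fs.existsSync('example.com')) {
    listHtmlFiles('example.com').forEach(function (file) {
        console.log(file);
    });
}
```

If the product pages all turn out to be in one directory, the simpler non-recursive listing used in the script further down is enough.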

Next you need some sort of scraping tool. I used the information on this page ==> http://blog.miguelgrinberg.com/post/easy-web-scraping-with-nodejs to get started. From there I modified the script to read the files wget downloaded from the local hard disk and write the relevant information out to a file.

Once you have the HTML files containing the information you need on your local hard disk, make a directory to hold the Node.js script (I called mine "scraping") alongside those files, because the script reads them from its parent directory ("../"). Install the Node.js modules there as the local user (e.g. npm install cheerio), so that you end up with a folder named node_modules with the modules in it. From there, run node filename.js, where filename.js contains the script below.

This is the code I ended up with.

var cheerio = require('cheerio');
var fs = require('fs');
// I found the pipe character worked best as the field separator.
// Once the out.csv file is imported into LibreOffice Calc you can
// save it as a comma-separated CSV and add extra columns for the
// import to WordPress
var sep = "|";

// read the directory that contains the html files
// you want to scrape
// this will return an array containing file and directory
// names
var html = fs.readdirSync("../");

var data = "";

for (var i = 0; i < html.length; i++) {

    // return a stats object so you can check
    // if it's a file
    var mystats = fs.statSync("../" + html[i]);

    // check it's a file and it's a .html
    var patt = /.*\.html$/i;
    if (mystats.isFile() && patt.test(html[i])) {
        console.log(html[i]);

        // first field is the filename the information came
        // from
        data += html[i] + sep;

        // parse the page; reading as utf8 hands cheerio a string
        var $ = cheerio.load(fs.readFileSync("../" + html[i], "utf8"));

        // use jquery selectors to find the information
        // you want and store in data for later writing
        // to file
        // this is the product title field
        $('h1.main-content-title').each(function() {
            console.log($(this).text());
            data += $(this).text() + sep;
        });

        // this field contains html markup (product description)
        // so use .html() method not .text()
        $('div.field-type-text-with-summary').each(function() {
            var ret = $(this).html();
            ret = ret.replace(/(\r\n|\n|\r)/gm, "");
            console.log(ret);
            data += ret + sep;
        });

        // price field
        $('div.field-name-commerce-price').each(function() {
            console.log($(this).text());
            data += $(this).text() + sep;
        });

        // here we find the product gallery images
        $('div.field-slideshow a.colorbox').each(function() {
            var href = $(this).attr('href');
            console.log(href);
            data += href + sep;
        });
        data += "\n";
    }
    
}

// finally write the data to a file for further processing
// in LibreOffice Calc.
fs.writeFileSync("out.csv", data);

What you end up with is a file separated by the | character which contains everything you need to transfer the data to a WordPress site using any one of the WordPress data import plugins.
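
The pipe separator works as long as no scraped field contains a | itself, and in the script above the number of image fields varies with how many gallery images a product has. As a rough sketch of one alternative, not what I used, you could write properly quoted CSV rows with a fixed set of columns. The helper names csvField and csvRow and the record layout are hypothetical, and joining the image URLs with a comma is an assumption, so check what your import plugin expects.

```javascript
// Hypothetical helper: quote one value so it is safe inside a CSV field.
function csvField(value) {
    var text = String(value == null ? '' : value);
    // double any embedded quotes, then wrap the whole field in quotes
    return '"' + text.replace(/"/g, '""') + '"';
}

// Hypothetical helper: build one CSV line with a fixed set of columns.
function csvRow(record) {
    return [
        record.file,
        record.title,
        record.description,
        record.price,
        // Assumption: all gallery images share one column, comma-separated
        (record.images || []).join(',')
    ].map(csvField).join(',');
}

// Made-up sample record, only to show the shape of the output.
var row = csvRow({
    file: 'widget.html',
    title: 'Blue "Deluxe" Widget',
    description: '<p>Small | light</p>',
    price: '$9.99',
    images: ['a.jpg', 'b.jpg']
});
console.log(row);
```

In the scraping loop you would fill one such record per page instead of appending to data directly, then write the rows joined by "\n".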
