
I've got a little spending habit at a certain hi-tech retailer that lists second-hand computer parts on their website.
There are bargains to be found here, but I find the daily exploration a pain, with the multiple categories of products shown on separate, slow pages. Almost a bit like this blog ;-)
So my quickly written script does the boring work instead, using wget, pup and jq, and leaves me a spreadsheet to browse over.
The script requires the following tools to be installed:
- pup : https://github.com/ericchiang/pup
- "pup is a command line tool for processing HTML. It reads from stdin, prints to stdout, and allows the user to filter parts of the page using CSS selectors."
- jq : https://stedolan.github.io/jq/
- "jq is a lightweight and flexible command-line JSON processor. jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data..."
I think the wget tool should already be present in most Linux distributions.
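If you haven't met pup and jq before, here's a quick, throwaway illustration of how the two chain together (the HTML fragment is made up and has nothing to do with the actual site):
echo '<a href="a.html">Widget</a><a href="b.html">Gadget</a>' |
    pup 'a attr{href}' |
    jq --raw-input '{ link: . }'
pup lists the two href values, one per line, and jq wraps each line into a small JSON object.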
The Script
I chose to use pup after looking at the search results returned by the website. These statically coded HTML pages contain a structured 'window.universal_variable' JSON object. Holding a maximum of 16 data records, this object covered the data I was interested in. I did also find further records delivered by dynamic XMLHttpRequest calls, but with not that many items per category of interest shown in their stores, that mechanism was rarely used. So, no need for me to write code to interface with it :)
In the rest of my script, jq is used to filter, collect and convert these JSON objects to csv.
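To sketch what that looks like, an embedded block is roughly of this shape (the item fields shown here are invented for illustration; the real pages carry their own):
<script>
window.universal_variable = {
    "listing": {
        "items": [
            { "name": "250GB IDE Hard Drive", "price": "8.00" },
            { "name": "320GB Laptop Hard Drive", "price": "10.00" }
        ]
    }
};
</script>
Strip the 'window.universal_variable =' prefix and the trailing semicolon, and what's left is plain JSON that jq can pick apart with '.listing.items[]'.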
Running the script for the first time creates a file 'subcat_list.html'. This lists the subcategories and their internal links found on the website, and from then on acts as a configuration file that needs some simple editing to enable the script's CSV output. An extract looks like this:
...
...
<a href="product.php?mode=buy&plid=8">
Data Storage
</a>
<a href="search/index.php?stext=*&section=&catid=1">
* IDE Hard Drives
</a>
<a href="search/index.php?stext=*&section=&catid=2">
* Laptop Hard Drives
</a>
<a href="search/index.php?stext=*§ion=&catid=3">
Network Attached Hard Drives
</a>
...
To mark the subcategories I'm interested in, I just place a '*' next to the category title. By way of example, I've done that above for the first two hard drive categories.
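With those two entries marked, the pup selector used later in the script picks out just their links, roughly like this (the exact escaping depends on how pup serialised the file, which is why the script converts &amp; entities back to plain '&'):
cat ./subcat_list.html | pup 'a:contains("*") attr{href}'
search/index.php?stext=*&amp;section=&amp;catid=1
search/index.php?stext=*&amp;section=&amp;catid=2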
Here's the entire script:
Mars, Jupiter, Pluto and moonbasealphax.com are all, of course, completely fictitious.
#!/bin/bash
# root url of supplier website
baseurl="https://moonbasealphax.com"
# create ./tmp
mkdir -p ./tmp
# create subcategory list for components
if [ ! -f './subcat_list.html' ]; then
    url="$baseurl/product.php?scid=3"
    wget --output-document=./tmp/index.html "$url"
    cat ./tmp/index.html | pup ".featureBoxContent li a" > ./subcat_list.html
fi
# create csv output script for jq JSON processor
if [ ! -f './tocsv.jq' ]; then
    cat << 'EOF' > ./tocsv.jq
# commands for jq to create csv from a JSON array
def tocsv:
  if length == 0 then empty
  else
    (.[0] | keys_unsorted) as $keys
    | (map(keys) | add | unique) as $allkeys
    | ($keys + ($allkeys - $keys)) as $cols
    | ($cols, (.[] as $row | $cols | map($row[.])))
    | @csv
  end ;
tocsv
EOF
    # ensure no space before EOF on previous line!
fi
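# Illustrative example (made-up records, not real site data): if
# ./tmp/items.json ends up containing
#   { "name": "Drive A", "price": "8.00" }
#   { "name": "Drive B", "price": "12.00", "cond": "tested" }
# then slurping it through tocsv.jq gives
#   "name","price","cond"
#   "Drive A","8.00",
#   "Drive B","12.00","tested"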
function getStoreStockAsCSV {
    # call parameters:
    # $1 : numerical store ID
    # $2 : alpha store location
    store=$1
    paras="&rad_which_stock=3&refinebystore=$store"
    # output csv file name
    csv_file="$2_items.csv"
    rm -f ./tmp/items.json
    i=0
    # get href links (those selected with *)
    cat ./subcat_list.html | pup 'a:contains("*") attr{href}' |
    while read url; do
        i=$((i+1))
        # replace &amp; html entities with plain &
        furl=${url//&amp;/&}
        # download html
        wget --output-document="./tmp/$i.html" "$baseurl/$furl$paras"
        # find JSON statement in script block
        script=`cat "./tmp/$i.html" |
            pup 'script:contains("window.universal") text{}'`
        # remove variable name
        jvar=${script/window.universal_variable =/}
        # remove trailing semicolon and collect item records
        echo ${jvar%;*} | jq '.listing.items[]' >> ./tmp/items.json 2>/dev/null
    done
    # with jq, slurp up the collected items into an array and output as csv
    cat ./tmp/items.json |
        jq --raw-output --slurp --from-file tocsv.jq > "$csv_file"
    echo "****** stock list for $2 **********"
    cat "$csv_file"
}
title="Which Store?"
prompt="Pick a store:"
locations=("10 Mars" "11 Jupiter" "12 Pluto")
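# Show the store picker. A selection comes back as e.g. "10 Mars";
# it is deliberately passed unquoted below so word splitting hands the
# numeric store ID and the location name to the function as $1 and $2.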
while storelocation=$(zenity --title="$title" --text="$prompt" --list \
        --column="Location" "${locations[@]}" 2>/dev/null); do
    if [ -n "$storelocation" ]; then
        clear
        getStoreStockAsCSV $storelocation
    fi
done
These names will need to be changed to real ones, should you want to modify and reuse this script for some other purpose. Enjoy :)