I've got a little spending habit at a certain hi-tech retailer that lists second hand computer parts on their website.
There are bargains to be found here, but I find the daily trawl a pain, with the multiple categories of products spread across separate, slow pages. Almost a bit like this blog ;-)
So my quickly written script does the boring work for me using wget, pup and jq, and leaves me a spreadsheet to browse through.
The script requires these tools to be installed:
- pup : https://github.com/ericchiang/pup
- "pup is a command line tool for processing HTML. It reads from stdin, prints to stdout, and allows the user to filter parts of the page using CSS selectors."
- jq : https://stedolan.github.io/jq/
- "jq is a lightweight and flexible command-line JSON processor. jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data..."
I think wget should already be present in most Linux distributions.
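If pup and jq are new to you, here's a quick taste of how they behave on the command line. The HTML and JSON snippets below are made up purely for illustration:

# pup: pull the text of every list link out of a scrap of HTML
echo '<ul><li><a href="a.php">Widgets</a></li><li><a href="b.php">Gadgets</a></li></ul>' |
pup 'li a text{}'

# jq: turn a small JSON array into CSV rows
echo '[{"name":"Widgets","price":5},{"name":"Gadgets","price":9}]' |
jq --raw-output '.[] | [.name, .price] | @csv'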
The Script
I chose to use pup after looking at the search results returned by the website. These static HTML pages contain a structured 'window.universal_variable' JSON object which, although it stores a maximum of 16 data records, held all the data I was interested in.
I did also find further records delivered by dynamic XMLHttpRequest, but since their stores rarely show that many items in any category I care about, this mechanism is seldom triggered. So, no need for me to write code to interface with that :)
In the rest of the script, jq is used to filter, collect and convert these JSON objects to CSV.
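As a rough sketch of that idea: the '.listing.items' path below is the one the script uses later, but the item fields are made-up stand-ins for whatever the site actually returns. Extracting items from one embedded object and turning them into CSV looks something like this:

# a cut-down, invented example of the JSON object embedded in each results page
json='{"listing":{"items":[{"name":"500GB IDE drive","price":10},{"name":"1TB laptop drive","price":15}]}}'
# pull out the individual items, then slurp them back into one array and emit CSV
echo "$json" | jq '.listing.items[]' |
jq --raw-output --slurp '(.[0] | keys_unsorted), (.[] | [.[]]) | @csv'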
Running it for the first time creates a file 'subcat_list.html'. An extract looks like this:
...
<a href="product.php?mode=buy&plid=8">
Data Storage
</a>
<a href="search/index.php?stext=*§ion=&catid=1">
* IDE Hard Drives
</a>
<a href="search/index.php?stext=*§ion=&catid=2">
* Laptop Hard Drives
</a>
<a href="search/index.php?stext=*§ion=&catid=3">
Network Attached Hard Drives
</a>
...
This lists the subcategories and their internal links found on the website, and now acts as a configuration file that needs some simple editing before the script will produce CSV output.
To mark the subcategories I'm interested in, I just place a '*' next to the category title. By way of an example, I've done that above for the first two hard drive categories.
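That single character is all the script looks for. The same pup selector used later in the script picks out just the marked anchors and prints their links; the exact '&'/'&amp;' escaping of the output depends on pup, which is why the script strips '&amp;' before reusing the links:

cat ./subcat_list.html | pup 'a:contains("*") attr{href}'
# prints something like:
#   search/index.php?stext=*&amp;section=&amp;catid=1
#   search/index.php?stext=*&amp;section=&amp;catid=2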
Here's the entire script:
#!/bin/bash
# root url of supplier website
baseurl="https://moonbasealphax.com"
# create ./tmp
mkdir -p ./tmp
# create subcategory list for components
if [ ! -f './subcat_list.html' ]; then
url="$baseurl/product.php?scid=3"
wget --output-document=./tmp/index.html "$url"
cat ./tmp/index.html | pup ".featureBoxContent li a" > ./subcat_list.html
fi
# create csv output script for jq JSON processor
if [ ! -f './tocsv.jq' ]; then
cat << 'EOF' > ./tocsv.jq
# commands for jq to create csv from JSON array
def tocsv:
  if length == 0 then empty
  else
    # take the first item's key order, then append any extra keys
    # that only appear in later items
    (.[0] | keys_unsorted) as $keys
    | (map(keys) | add | unique) as $allkeys
    | ($keys + ($allkeys - $keys)) as $cols
    # emit the header row, then one row of values per item
    | ($cols, (.[] as $row | $cols | map($row[.])))
    | @csv
  end ;
tocsv
EOF
# ensure no space before EOF on previous line!
fi
function getStoreStockAsCSV {
# call parameters:
# $1 : numerical store ID
# $2 : alpha store location
store=$1
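# extra query string appended to each subcategory search; 'refinebystore' limits results to the chosen store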
paras="&rad_which_stock=3&refinebystore=$store"
# output csv file name
csv_file="$2_items.csv"
rm -f ./tmp/items.json
i=0
# get href link (selected with *)
cat ./subcat_list.html | pup 'a:contains("*") attr{href}' |
while read url; do
i=$((i+1))
# replace '&amp;' html entities in the link with plain '&'
furl=${url//&amp;/&}
# download html
wget --output-document="./tmp/$i.html" "$baseurl/$furl$paras"
# find JSON statement in script block
script=`cat "./tmp/$i.html" |
pup 'script:contains("window.universal") text{}'`
# remove variable name
jvar=${script/window.universal_variable =/}
# remove trailing semicolon and collect item
echo "${jvar%;*}" | jq '.listing.items[]' >> ./tmp/items.json 2>/dev/null
done
# With jq slurp up the collected items into an array and output as csv
cat ./tmp/items.json |
jq --raw-output --slurp --from-file tocsv.jq > "$csv_file"
echo "****** stock list for $2 **********"
cat "$csv_file"
}
title="Which Store?"
prompt="Pick a store:"
locations=("10 Mars" "11 Jupiter" "12 Pluto")
while storelocation=$(zenity --title="$title" --text="$prompt" --list \
--column="Location" "${locations[@]}" 2>/dev/null); do
if [ -n "$storelocation" ]; then
clear
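# note: $storelocation is deliberately left unquoted so that e.g. "10 Mars"
# word-splits into $1=10 (store ID) and $2=Mars (location name)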
getStoreStockAsCSV $storelocation
fi
done
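And that's it. For completeness, a typical run looks something like this; 'stockscan.sh' is just my example name for the saved script:

chmod +x stockscan.sh
./stockscan.sh
# first run: edit the generated ./subcat_list.html to mark categories with '*', then run again
# pick a store in the zenity dialog; the stock list is printed and saved as e.g. Mars_items.csv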
Mars, Jupiter, Pluto and moonbasealphax.com are all, of course, completely fictitious.
They will need to be changed to real sites and stores should you want to modify and reuse this script for some other purpose. Enjoy :)