
I've got a little spending habit at a certain hi-tech retailer that lists second-hand computer parts on their website.
There are bargains to be found here, but I find the daily exploration a pain, with the multiple categories of products shown on separate, slow pages. Almost a bit like this blog ;-)
So my quickly written script does the boring work instead, using wget, pup and jq, and leaves me a spreadsheet to browse over.
The script requires the following tools to be installed:
- pup : https://github.com/ericchiang/pup
- "pup is a command line tool for processing HTML. It reads from stdin, prints to stdout, and allows the user to filter parts of the page using CSS selectors."
- jq : https://stedolan.github.io/jq/
- "jq is a lightweight and flexible command-line JSON processor. jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data..."
I think the wget tool should already be present in most Linux distributions.
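If you haven't met pup and jq before, here's a quick, throwaway illustration of how the two chain together (the HTML fragment is made up and has nothing to do with the actual site):
echo '<a href="a.html">Widget</a><a href="b.html">Gadget</a>' |
    pup 'a attr{href}' |
    jq --raw-input '{ link: . }'
pup lists the two href values, one per line, and jq wraps each line into a small JSON object.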
The Script
I chose to use pup after looking at the search results returned by the website. These statically coded HTML pages contain a structured 'window.universal_variable' JSON object. Holding a maximum of 16 data records, this object covered the data I was interested in. I did also find further records delivered by dynamic XMLHttpRequest calls, but with not that many items per category of interest shown in their stores, that mechanism was rarely used. So, no need for me to write code to interface with it :)
In the rest of my script, jq is used to filter, collect and convert these JSON objects to csv.
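To sketch what that looks like, an embedded block is roughly of this shape (the item fields shown here are invented for illustration; the real pages carry their own):
<script>
window.universal_variable = {
    "listing": {
        "items": [
            { "name": "250GB IDE Hard Drive", "price": "8.00" },
            { "name": "320GB Laptop Hard Drive", "price": "10.00" }
        ]
    }
};
</script>
Strip the 'window.universal_variable =' prefix and the trailing semicolon, and what's left is plain JSON that jq can pick apart with '.listing.items[]'.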
Running the script for the first time creates a file 'subcat_list.html'. This lists the subcategories and their internal links found on the website, and from then on acts as a configuration file that needs some simple editing to enable the script's CSV output. An extract looks like this:
...
...
<a href="product.php?mode=buy&plid=8">
Data Storage
</a>
<a href="search/index.php?stext=*&section=&catid=1">
* IDE Hard Drives
</a>
<a href="search/index.php?stext=*&section=&catid=2">
* Laptop Hard Drives
</a>
<a href="search/index.php?stext=*§ion=&catid=3">
Network Attached Hard Drives
</a>
...
To mark the subcategories I'm interested in, I just place a '*' next to the category title. By way of example, I've done that above for the first two hard drive categories.
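With those two entries marked, the pup selector used later in the script picks out just their links, roughly like this (the exact escaping depends on how pup serialised the file, which is why the script converts &amp; entities back to plain '&'):
cat ./subcat_list.html | pup 'a:contains("*") attr{href}'
search/index.php?stext=*&amp;section=&amp;catid=1
search/index.php?stext=*&amp;section=&amp;catid=2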
Here's the entire script:
Mars, Jupiter, Pluto and moonbasealphax.com are all, of course, completely fictitious.
#!/bin/bash
# root url of supplier website
baseurl="https://moonbasealphax.com"
# create ./tmp
mkdir -p ./tmp
# create subcategory list for components
if [ ! -f './subcat_list.html' ]; then
    url="$baseurl/product.php?scid=3"
    wget --output-document=./tmp/index.html "$url"
    cat ./tmp/index.html | pup ".featureBoxContent li a" > ./subcat_list.html
fi
# create csv output script for jq JSON processor
if [ ! -f './tocsv.jq' ]; then
    cat << 'EOF' > ./tocsv.jq
# commands for jq to create csv from a JSON array
def tocsv:
  if length == 0 then empty
  else
    (.[0] | keys_unsorted) as $keys
    | (map(keys) | add | unique) as $allkeys
    | ($keys + ($allkeys - $keys)) as $cols
    | ($cols, (.[] as $row | $cols | map($row[.])))
    | @csv
  end ;
tocsv
EOF
    # ensure no space before EOF on previous line!
fi
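# Illustrative example (made-up records, not real site data): if
# ./tmp/items.json ends up containing
#   { "name": "Drive A", "price": "8.00" }
#   { "name": "Drive B", "price": "12.00", "cond": "tested" }
# then slurping it through tocsv.jq gives
#   "name","price","cond"
#   "Drive A","8.00",
#   "Drive B","12.00","tested"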
function getStoreStockAsCSV {
    # call parameters:
    # $1 : numerical store ID
    # $2 : alpha store location
    store=$1
    paras="&rad_which_stock=3&refinebystore=$store"
    # output csv file name
    csv_file="$2_items.csv"
    rm -f ./tmp/items.json
    i=0
    # get href links (those selected with *)
    cat ./subcat_list.html | pup 'a:contains("*") attr{href}' |
    while read url; do
        i=$((i+1))
        # replace &amp; html entities with plain &
        furl=${url//&amp;/&}
        # download html
        wget --output-document="./tmp/$i.html" "$baseurl/$furl$paras"
        # find JSON statement in script block
        script=`cat "./tmp/$i.html" |
            pup 'script:contains("window.universal") text{}'`
        # remove variable name
        jvar=${script/window.universal_variable =/}
        # remove trailing semicolon and collect item records
        echo ${jvar%;*} | jq '.listing.items[]' >> ./tmp/items.json 2>/dev/null
    done
    # with jq, slurp up the collected items into an array and output as csv
    cat ./tmp/items.json |
        jq --raw-output --slurp --from-file tocsv.jq > "$csv_file"
    echo "****** stock list for $2 **********"
    cat "$csv_file"
}
title="Which Store?"
prompt="Pick a store:"
locations=("10 Mars" "11 Jupiter" "12 Pluto")
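# Show the store picker. A selection comes back as e.g. "10 Mars";
# it is deliberately passed unquoted below so word splitting hands the
# numeric store ID and the location name to the function as $1 and $2.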
while storelocation=$(zenity --title="$title" --text="$prompt" --list \
        --column="Location" "${locations[@]}" 2>/dev/null); do
    if [ -n "$storelocation" ]; then
        clear
        getStoreStockAsCSV $storelocation
    fi
done
These names will need to be changed to real ones, should you want to modify and reuse this script for some other purpose. Enjoy :)