Hello all, this is my first attempt to use Python so I'd appreciate any criticism of this little script. It seems to work, but that might be by accident, and I might well have variable types mixed up or something, or be doing things very inefficiently...
The task: go to the Adobe Flash info page ( https://www.adobe.com/software/flash/about/ ) and retrieve the current PPAPI Flash version for "Chromium-based browsers" on Linux. The page is a morass of JavaScript and CSS, but the relevant section is a simple table. Its code:
<table class="data-bordered max">
<tbody>
<tr>
<th scope="col">Platform</th>
<th scope="col">Browser</th>
<th scope="col">Player version</th>
</tr>
<tr>
<td rowspan="8"><strong>Windows</strong></td>
<td>Internet Explorer - ActiveX</td>
<td>23.0.0.162</td>
</tr>
<tr>
<td>Internet Explorer (embedded - Windows 8.1) - ActiveX</td>
<td>23.0.0.162</td>
</tr>
<tr>
<td>Edge (embedded - Windows 10) - ActiveX</td>
<td>23.0.0.162</td>
</tr>
<tr>
<td>Firefox - NPAPI</td>
<td>23.0.0.162</td>
</tr>
<tr>
<td>Chrome (embedded) - PPAPI</td>
<td>23.0.0.162</td>
</tr>
<tr>
<td>Opera, Chromium-based browsers - PPAPI</td>
<td>23.0.0.162</td>
</tr>
<tr>
<td>Internet Explorer – ActiveX (Extended Support Release)</td>
<td>18.0.0.375</td>
</tr>
<tr>
<td>Firefox – NPAPI (Extended Support Release)</td>
<td>18.0.0.375</td>
</tr>
<tr>
<td rowspan="4"><strong>Macintosh<br>OS X</strong></td>
<td>Firefox, Safari - NPAPI</td>
<td>23.0.0.162</td>
</tr>
<tr>
<td>Chrome (embedded) - PPAPI</td>
<td>23.0.0.162</td>
</tr>
<tr>
<td>Opera, Chromium-based browsers - PPAPI</td>
<td>23.0.0.162</td>
</tr>
<tr>
<td>Firefox, Safari – NPAPI (Extended Support Release)</td>
<td>18.0.0.375</td>
</tr>
<tr>
<td rowspan="3"><strong>Linux</strong></td>
<td>Firefox - NPAPI (Extended Support Release)</td>
<td>11.2.202.635</td>
</tr>
<tr>
<td>Chrome (embedded) - PPAPI</td>
<td>23.0.0.162</td>
</tr>
<tr>
<td>Chromium-based browsers - PPAPI</td>
<td>23.0.0.162</td>
</tr>
<tr>
<td><strong>ChromeOS</strong></td>
<td>ChromeOS - PPAPI</td>
<td>23.0.0.162</td>
</tr>
</tbody>
</table>
It would be easy enough just to count so many cells this way and that, but if Adobe ever added or removed a browser, or changed the order of entries, it would break, so I wanted to make something that would identify the "Linux" section and look in there for the browser we want. It's a bit tricky because each OS section spans several rows (except for "ChromeOS "), defined by "rowspan" in the first <td>. That throws out the cell positioning in the subsequent rows of that section. Anyway, here's my attempt to do this:
#!/usr/bin/env python2.7
from bs4 import BeautifulSoup
import urllib, re, sys

section_regex = re.compile('.*Linux')
browser_regex = re.compile('Chromium-based .* PPAPI')

r = urllib.urlopen('https://www.adobe.com/software/flash/about/').read()
soup = BeautifulSoup(r)
table = soup.find('table')
section_span = 1
n = 1
for row in table.find_all('tr'):
    cells = row.find_all('td')
    if not cells:
        continue
    if n == 1:
        section_name = cells[0].text
        if section_regex.match(section_name):
            browser_name = cells[1].text
            flash_version = cells[2].text
        if cells[0].has_attr('rowspan'):
            section_span = cells[0]['rowspan']
            n += 1
    else:
        if section_regex.match(section_name):
            browser_name = cells[0].text
            flash_version = cells[1].text
        if n < int(section_span):
            n += 1
        else:
            n = 1
    if not section_regex.match(section_name):
        continue
    if browser_regex.match(browser_name):
        print flash_version
        sys.exit(0)
sys.exit(1)
The two regex strings at the top can be adjusted to pull out any flash version you want. It works atm, but as I said, I'm a Python beginner and any comments are appreciated.
...elevator in the Brain Hotel, broken down but just as well...
( a boring Japan blog (currently paused), now on Bluesky, there's also some GitStuff )
Thanks! That's very neat.
Maybe there's a library that makes transforming HTML tables into matrices / array of lists a single line of code.
That's what I thought. It seems like an obvious task for an html-parsing module like BeautifulSoup, but I haven't yet been able to google up anything except home-made functions on Stackexchange etc. Yours looks nicer in fact. To make it quite generic and re-usable it needs to deal with <td colspan=... too, but, compared with rowspan, duplicating td's in the same row might be fairly easy.
I thought of using the "requests" module, but it would mean adding a dependency on python-requests to the package. What advantage does requests.get() have over the built-in urllib.urlopen() ( or possibly urllib2.urlopen() ) in this case?
Could be simplified in some places too..
Missing but necessary: Exception handling.
Hmm some food for thought...
(This has been a great crash course in Python btw - something I'd been too lazy to try up to now.)
Rowspan support was added because the table used it; I didn't have in mind a general solution.
Indeed. In fact, to cope with rowspan in cells later in the row is non-trivial, and would need some way of keeping track of column numbers. While a generic table2matrix function might have been nice to have I'm not going to pursue it at this point.
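For anyone who does want to try the generic version, here is a rough sketch of the column-tracking idea. It is not the code from this thread: it uses only the Python 3 standard library (html.parser instead of BeautifulSoup), the names TableGrid and table_to_grid are made up for illustration, and it assumes a single, non-nested table with explicit closing </td> and </tr> tags.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Hypothetical helper: collect cell text into a grid, expanding rowspan/colspan."""

    def __init__(self):
        super().__init__()
        self.grid = []      # finished rows, each a list of cell strings
        self.pending = {}   # column index -> [rows still to fill, text]
        self.row = None     # row being built, or None outside <tr>
        self.cell = None    # (rowspan, colspan, [text chunks]) or None

    def _fill_pending(self):
        # Copy down cells from earlier rows that span into the next free column.
        while self.row is not None and len(self.row) in self.pending:
            col = len(self.row)
            left, text = self.pending[col]
            self.row.append(text)
            if left > 1:
                self.pending[col] = [left - 1, text]
            else:
                del self.pending[col]

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.row = []
        elif tag in ('td', 'th') and self.row is not None:
            a = dict(attrs)
            self.cell = (int(a.get('rowspan', 1)), int(a.get('colspan', 1)), [])

    def handle_data(self, data):
        if self.cell is not None:
            self.cell[2].append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self.cell is not None:
            rowspan, colspan, chunks = self.cell
            text = ' '.join(' '.join(chunks).split())  # normalise whitespace
            self._fill_pending()
            for _ in range(colspan):
                if rowspan > 1:
                    # Remember this cell for the same column in later rows.
                    self.pending[len(self.row)] = [rowspan - 1, text]
                self.row.append(text)
            self.cell = None
        elif tag == 'tr' and self.row is not None:
            self._fill_pending()  # spans that reach the end of the row
            self.grid.append(self.row)
            self.row = None


def table_to_grid(html):
    parser = TableGrid()
    parser.feed(html)
    parser.close()
    return parser.grid


SAMPLE = """
<table>
<tr><th>Platform</th><th>Browser</th><th>Player version</th></tr>
<tr><td rowspan="2"><strong>Linux</strong></td><td>Firefox - NPAPI</td><td>11.2.202.635</td></tr>
<tr><td>Chromium-based browsers - PPAPI</td><td>23.0.0.162</td></tr>
<tr><td><strong>ChromeOS</strong></td><td>ChromeOS - PPAPI</td><td>23.0.0.162</td></tr>
</table>
"""

for row in table_to_grid(SAMPLE):
    print(row)
```

The pending dictionary is the "keeping track of column numbers" part: a spanning cell is recorded against its column index and copied into that same column of the following rows, wherever in the row it first appeared.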
---
when things get more complex, requests is easier to use than urllib2
I've read strong recommendations of requests but I found just loading the module was taking a ridiculous amount of time. Reading from a local html file, the same script took ~100ms with urllib2 and ~700ms if requests was imported! (That's just loading the module.) For this job urllib2 seems fine.
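If anyone wants to repeat that kind of measurement, a minimal sketch follows (Python 3 standard library only; import_seconds is a made-up helper name). It only gives a meaningful number the first time a module is imported in a fresh interpreter, since later imports come from the module cache.

```python
import importlib
import time


def import_seconds(module_name):
    """Hypothetical helper: wall-clock seconds spent importing module_name."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


# 'json' is used here only because it is always available;
# substitute the module you actually want to compare.
print('json: %.1f ms' % (import_seconds('json') * 1000))
```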
---
One question: urllib2 seems to need a path to a collection of certificates to use https, so I ran it with
urllib2.urlopen('https://www.adobe.com/software/flash/about/', capath='/etc/ssl/certs')
It appears to be working, and throws no errors, but do you know of any way to test if it's really using those certificates and authenticating the connection?
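One way to probe this, sketched in Python 3 terms rather than the urllib2 call above (so treat the function names as assumptions to adapt): build the SSL context explicitly, inspect its settings, and pass it to urlopen. The helpers make_verifying_context, describe and fetch are invented for this example.

```python
import ssl
from urllib.request import urlopen


def make_verifying_context(capath=None):
    """Hypothetical helper: an SSLContext that requires a valid certificate."""
    # create_default_context() turns on certificate and hostname checks.
    # With capath given, only that directory of CA certificates is trusted;
    # with no arguments, the system default CA store is loaded instead.
    return ssl.create_default_context(capath=capath)


def describe(ctx):
    """Report whether the context will actually authenticate the server."""
    return {
        'verifies_certificate': ctx.verify_mode == ssl.CERT_REQUIRED,
        'checks_hostname': ctx.check_hostname,
    }


def fetch(url, ctx):
    # If verification fails, urlopen raises urllib.error.URLError
    # wrapping the underlying SSL error.
    return urlopen(url, context=ctx).read()


print(describe(make_verifying_context()))
```

An empirical check along the same lines: point capath at an empty directory and call fetch(). If the request then fails with a certificate error while it succeeds with /etc/ssl/certs, the certificates in that directory really are being used to authenticate the connection.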
---
rowspan handling can be rewritten without duplicating the m.append() part using just one loop
I tried this, but it's no shorter really:
for td in cells:
    try:
        rowsp = int(td["rowspan"])
    except KeyError:
        rowsp = 1
    for j in xrange(i, i + int(rowsp)):
        m[j].append(td.text)
---
exception handling is just a try/except around the things that can fail. It probably is enough to just use one try/except as we do not need to do fine-grained error handling or recovery.
I've thrown in a couple of tests, and was thinking of having the calling shell script do a regex check on the returned string too.
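The same sanity check could also live in the Python script. A minimal sketch, assuming a version is always four dot-separated numbers as in the table (23.0.0.162, 11.2.202.635); the names VERSION_RE and looks_like_version are made up here:

```python
import re

# Assumption: versions look like N.N.N.N, as in the table values above.
VERSION_RE = re.compile(r'^\d+(\.\d+){3}$')


def looks_like_version(text):
    """Hypothetical check that scraped text has the expected version shape."""
    return bool(VERSION_RE.match(text.strip()))


print(looks_like_version('23.0.0.162'))
print(looks_like_version('Player version'))
```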
---
Your code seemed to have a Python3-ism:
version, *_ = list(map(lambda xs: xs[Co...
The asterisk on the left of the assignment isn't accepted in Python2. The script ran with the '*_' removed, but throws a "too many values to unpack" error if the list has more than one element (it shouldn't of course). In fact I got the impression that filter, map and lambda might be dropped from Python some day, so tried rewriting that part with a "comprehension":
version, = [x[Column.VERSION] for x in matrix if os_re.match(x[Column.OS]) and type_re.match(x[Column.TYPE])]
and then ran a "try" test on that function.
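To illustrate what the "version, =" unpacking buys, here is a small self-contained sketch with a hand-written matrix (rows copied from the Linux section of the table; pick_version and the plain OS/TYPE/VERSION constants are stand-ins for this example, not the script's own names). The unpacking raises ValueError unless exactly one row matches, so a regex that is too loose or too strict is caught instead of silently returning the wrong row.

```python
import re

OS, TYPE, VERSION = 0, 1, 2  # column positions, as in the Column class

matrix = [
    ['Linux', 'Firefox - NPAPI (Extended Support Release)', '11.2.202.635'],
    ['Linux', 'Chrome (embedded) - PPAPI', '23.0.0.162'],
    ['Linux', 'Chromium-based browsers - PPAPI', '23.0.0.162'],
]


def pick_version(rows, os_pattern, type_pattern):
    """Hypothetical helper: return the version of the single matching row."""
    matches = [r[VERSION] for r in rows
               if re.search(os_pattern, r[OS]) and re.search(type_pattern, r[TYPE])]
    version, = matches  # ValueError unless there is exactly one match
    return version


print(pick_version(matrix, 'Linux', 'Chromium.* PPAPI'))

try:
    pick_version(matrix, 'Linux', 'PPAPI')  # matches two rows
except ValueError as e:
    print('ambiguous match:', e)
```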
---
A couple of other small changes I made:
Replaced repeat with xrange to avoid another import:
def table2matrix(t):
    rows = t.find_all("tr")
    m = [ [] for _ in xrange(len(rows)) ]
Made the list search a regex match not a hard string match. (as above)
Deleted excess members of the m list if no <td>'s were found in a row - saving the final filter. I don't know if that actually makes any difference to execution time.
cells = tr.find_all("td")
if not cells:
    del m[-1]
    continue
Added some system exit calls.
Anyway the new version - hoping that my changes didn't break anything:
#!/usr/bin/env python2.7
from __future__ import print_function
from bs4 import BeautifulSoup
import urllib2, re, sys

OS_regex = '.*Linux'
TYPE_regex = 'Chromium.* PPAPI'
adobe_url = 'https://www.adobe.com/software/flash/about/'
ca_certs_path = '/etc/ssl/certs' # ca-certificates on Debian Jessie

class Column:
    OS = 0
    TYPE = 1
    VERSION = 2

os_re = re.compile(OS_regex)
type_re = re.compile(TYPE_regex)

def err_exit(msg):
    print(msg, file=sys.stderr)
    sys.exit(1)

def table2matrix(t):
    rows = t.find_all("tr")
    m = [ [] for _ in range(len(rows)) ]
    i = 0
    for tr in rows:
        cells = tr.find_all("td")
        if not cells:
            del m[-1]
            continue
        for td in cells:
            try:
                rowsp = int(td["rowspan"])
            except KeyError:
                rowsp = 1
            for j in xrange(i, i + int(rowsp)):
                m[j].append(td.text)
        i += 1
    return m

def get_version():
    version, = [x[Column.VERSION] for x in matrix if os_re.match(x[Column.OS]) and type_re.match(x[Column.TYPE])]
    return version

try:
    r = urllib2.urlopen(adobe_url, capath=ca_certs_path).read()
except IOError:
    err_exit('IO Error: Bad URL or certs path?')

soup = BeautifulSoup(r, 'html.parser')
matrix = table2matrix(soup.find('table', { 'class': 'data-bordered' }))

try:
    print(get_version())
except ValueError:
    err_exit('Parsing error: regular expressions wrong?')
else:
    sys.exit(0)
Last edited by johnraff (2016-10-06 05:45:04)
^EDIT: Restore call to Python2.7 (the earlier claim that it also runs on Python3 was not true, but the version below does), add stderr printing, move user-tweakable variables to top.
Last edited by johnraff (2016-10-13 08:22:31)
Just in case anyone in the future reads this (please don't pay it too much attention), a slightly tidied-up version here (this one does run on both python 2 & 3):
#!/usr/bin/env python2.7
from __future__ import print_function
from bs4 import BeautifulSoup
import re, sys
try:
    # For Python 3
    from urllib.request import urlopen, URLError
except ImportError:
    # For Python 2
    from urllib2 import urlopen, URLError

# These regexes will be *searched* for in the <td> text.
# Use ^ and $ as necessary.
OS_regex = 'Linux'
TYPE_regex = 'Chromium.* PPAPI'
adobe_url = 'https://www.adobe.com/software/flash/about/'
ca_certs_path = '/etc/ssl/certs' # ca-certificates on Debian Jessie

def err_exit(*msgs):
    print('\n'.join(msgs), file=sys.stderr)
    sys.exit(1)

class Column:
    OS = 0
    TYPE = 1
    VERSION = 2

def table2matrix(t):
    rows = t.find_all("tr")
    m = [ [] for _ in range(len(rows)) ]
    i = 0
    for tr in rows:
        for td in tr.find_all("td"):
            if 'rowspan' in td.attrs:
                rowsp = int(td["rowspan"])
            else:
                rowsp = 1
            for j in range(i, i + int(rowsp)):
                m[j].append(td.get_text(separator=' ', strip=True))
        i += 1
    return [ x for x in m if x ]

def latest_version():
    try:
        body = urlopen(adobe_url, capath=ca_certs_path).read()
    except URLError as e:
        err_exit('URLError: Bad URL or certs path?', str(e.reason))
    soup = BeautifulSoup(body, 'html.parser')
    matrix = table2matrix(soup.find('table', { 'class': 'data-bordered' }))
    version, = [x[Column.VERSION] for x in matrix if re.search(OS_regex, x[Column.OS]) and re.search(TYPE_regex, x[Column.TYPE])]
    return version

try:
    print(latest_version())
except ValueError:
    err_exit('Parsing error: regular expressions wrong?')
Last edited by johnraff (2016-10-13 08:25:56)