Python: parse html table - first attempt

johnraff · 2016-10-02 07:01:30

Hello all, this is my first attempt to use Python so I'd appreciate any criticism of this little script. It seems to work, but that might be by accident, and I might well have variable types mixed up or something, or be doing things very inefficiently...

The task: go to the Adobe Flash info page ( https://www.adobe.com/software/flash/about/ ) and retrieve the current PPAPI flash version for "Chromium-based browsers" on Linux. The page is a morass of javascript & css but the relevant section is a simple table. Its HTML source:

<table class="data-bordered max">
    <tbody>
    <tr>
      <th scope="col">Platform</th>
      <th scope="col">Browser</th>
      <th scope="col">Player&nbsp;version</th>
    </tr>
     <tr>
      <td rowspan="8"><strong>Windows</strong></td>
      <td>Internet Explorer - ActiveX</td>
      <td>23.0.0.162</td>
    </tr>
    <tr>
      <td>Internet Explorer (embedded - Windows 8.1) - ActiveX</td>
      <td>23.0.0.162</td>
    </tr>
   
    <tr>
      <td>Edge (embedded - Windows 10) - ActiveX</td>
      <td>23.0.0.162</td>
    </tr>
     <tr>
      <td>Firefox - NPAPI</td>
      <td>23.0.0.162</td>
    </tr>
    <tr>
      <td>Chrome (embedded) - PPAPI</td>
      <td>23.0.0.162</td>
    </tr>
    <tr>
      <td>Opera, Chromium-based browsers - PPAPI</td>
      <td>23.0.0.162</td>
    </tr>
    <tr>
      <td>Internet Explorer – ActiveX (Extended Support Release)</td>
      <td>18.0.0.375</td>
    </tr>
 <tr>
      <td>Firefox – NPAPI (Extended Support Release)</td>
      <td>18.0.0.375</td>
    </tr>
    
    <tr>
      <td rowspan="4"><strong>Macintosh<br>OS X</strong></td>
        <td>Firefox, Safari - NPAPI</td>
    <td>23.0.0.162</td>
    </tr>
    <tr>
      <td>Chrome (embedded) - PPAPI</td>
    <td>23.0.0.162</td>
    </tr>
    <tr>
      <td>Opera, Chromium-based browsers - PPAPI</td>
    <td>23.0.0.162</td>
    </tr>
     <tr>
      <td>Firefox, Safari – NPAPI (Extended Support Release)</td>
    <td>18.0.0.375</td>
    </tr>
   <tr>
      <td rowspan="3"><strong>Linux</strong></td>
      <td>Firefox - NPAPI (Extended Support Release)</td>
      <td>11.2.202.635</td>
    </tr>
    <tr>
      <td>Chrome (embedded) - PPAPI</td>
      <td>23.0.0.162</td>
    </tr>
        <tr>
      <td>Chromium-based browsers - PPAPI</td>
      <td>23.0.0.162</td>
    </tr>
    <tr>
      <td><strong>ChromeOS</strong></td>
      <td>ChromeOS - PPAPI</td>
      <td>23.0.0.162</td>  
    </tr>
  </tbody>
</table>

It would be easy enough just to count so many cells this way and that, but if Adobe ever added or removed a browser, or changed the order of entries, that would break. So I wanted to make something that would identify the "Linux" section and look in there for the browser we want. It's a bit tricky because each OS section spans several rows (except for "ChromeOS"), as defined by "rowspan" in the first <td>. That throws off the cell positions in the subsequent rows of that section, which have one <td> fewer. Anyway, here's my attempt to do this:

#!/usr/bin/env python2.7

from bs4 import BeautifulSoup
import urllib, re, sys

section_regex = re.compile('.*Linux')
browser_regex = re.compile('Chromium-based .* PPAPI')

r = urllib.urlopen('https://www.adobe.com/software/flash/about/').read()
soup = BeautifulSoup(r)
table = soup.find('table')

section_span = 1
n = 1
for row in table.find_all('tr'):
    cells = row.find_all('td')
    if not cells:
        continue
    if n == 1:
        section_name = cells[0].text
        if section_regex.match(section_name):
            browser_name = cells[1].text
            flash_version = cells[2].text
        if cells[0].has_attr('rowspan'):
            section_span = cells[0]['rowspan']
            n += 1
    else:
        if section_regex.match(section_name):
            browser_name = cells[0].text
            flash_version = cells[1].text
        if n < int(section_span):
            n += 1
        else:
            n = 1
    if not section_regex.match(section_name):
        continue
    if browser_regex.match(browser_name):
        print flash_version
        sys.exit(0)
sys.exit(1)

The two regex strings at the top can be adjusted to pull out any flash version you want. It works atm, but as I said, I'm a Python beginner and any comments are appreciated.
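
For anyone who wants to experiment with the "remember which section we're in" idea without hitting the network, here is a minimal sketch. It is not the script above: it is Python 3, it swaps BeautifulSoup for the standard library's html.parser so it runs with no extra packages, and the names RowCollector, find_version and SAMPLE (a cut-down copy of the table with one version number changed so the Linux row stands out) are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Cut-down, made-up sample of the table; the Linux version differs on purpose.
SAMPLE = """
<table>
  <tr><th>Platform</th><th>Browser</th><th>Player version</th></tr>
  <tr><td rowspan="2"><strong>Windows</strong></td><td>Firefox - NPAPI</td><td>23.0.0.162</td></tr>
  <tr><td>Chromium-based browsers - PPAPI</td><td>23.0.0.162</td></tr>
  <tr><td rowspan="2"><strong>Linux</strong></td><td>Firefox - NPAPI</td><td>11.2.202.635</td></tr>
  <tr><td>Chromium-based browsers - PPAPI</td><td>23.0.0.163</td></tr>
</table>
"""

class RowCollector(HTMLParser):
    """Collect each <tr> as a list of (text, rowspan) pairs, <td> cells only."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None          # text pieces of the <td> being read, or None
        self._rowspan = 1

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'td':
            self._cell = []
            self._rowspan = int(dict(attrs).get('rowspan', 1))

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self.rows[-1].append((''.join(self._cell).strip(), self._rowspan))
            self._cell = None

def find_version(html, section_re, browser_re):
    parser = RowCollector()
    parser.feed(html)
    section, remaining = None, 0
    for cells in parser.rows:
        if not cells:
            continue                       # header row: only <th> cells
        if remaining == 0:                 # first row of a new section
            section, remaining = cells[0]  # section name and its rowspan
            cells = cells[1:]              # drop the section cell
        remaining -= 1
        browser, version = cells[0][0], cells[1][0]
        if re.search(section_re, section) and re.search(browser_re, browser):
            return version
    return None

print(find_version(SAMPLE, 'Linux', 'Chromium-based .* PPAPI'))  # prints 23.0.0.163
```

Counting down the remaining rows of a section replaces the n / section_span bookkeeping, and a section without a rowspan attribute (like ChromeOS) simply counts as one row.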

johnraff · 2016-10-03 06:14:52

Thanks! That's very neat.

twoion wrote:

Maybe there's a library that makes transforming HTML tables into matrices / array of lists a single line of code.

That's what I thought. It seems like an obvious task for an html-parsing module like BeautifulSoup, but I haven't yet been able to google up anything except home-made functions on Stackexchange etc. Yours looks nicer in fact. To make it quite generic and re-usable it needs to deal with <td colspan=... too, but, compared with rowspan, duplicating td's in the same row might be fairly easy.
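
The colspan half might look something like the sketch below. It assumes the cells of one row have already been pulled out as (text, colspan) pairs; expand_colspan is a made-up name, not part of any library.

```python
def expand_colspan(cells):
    """cells: list of (text, colspan) pairs for one row -> flat list of texts."""
    row = []
    for text, colspan in cells:
        # Repeat the cell's text once for every column it covers.
        row.extend([text] * colspan)
    return row

print(expand_colspan([('Linux', 1), ('n/a', 2)]))  # prints ['Linux', 'n/a', 'n/a']
```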

I thought of using the "requests" module, but it would mean adding a dependency on python-requests to the package. What advantage does requests.get() have over the built-in urllib.urlopen() (or possibly urllib2.urlopen()) in this case?

Could be simplified in some places too..
Missing but necessary: Exception handling.

Hmm some food for thought...

johnraff · 2016-10-05 09:38:59

(This has been a great crash course in Python btw - something I'd been too lazy to try up to now.)

earlybird wrote:

Rowspan support was added because the table used it; I didn't have in mind a general solution.

Indeed. In fact, to cope with rowspan in cells later in the row is non-trivial, and would need some way of keeping track of column numbers. While a generic table2matrix function might have been nice to have I'm not going to pursue it at this point.
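
For the record, one way the column tracking could be done is sketched below. It is only an illustration under some assumptions: the rows are already parsed into (text, rowspan) pairs, colspan is ignored, and table2grid is a hypothetical name. Each row keeps a dict of the columns already filled from above, and a new cell goes into the first free column.

```python
def table2grid(rows):
    """rows: list of rows, each a list of (text, rowspan) pairs.
    Returns a list of lists with spanned cells copied into every row they cover."""
    grid = [dict() for _ in rows]          # one {column: text} dict per row
    for r, cells in enumerate(rows):
        col = 0
        for text, rowspan in cells:
            while col in grid[r]:          # skip columns filled from rows above
                col += 1
            # Copy the cell into this row and the rows it spans (clipped to the table).
            for rr in range(r, min(r + rowspan, len(rows))):
                grid[rr][col] = text
            col += 1
    return [[row[c] for c in sorted(row)] for row in grid]

rows = [[('A', 1), ('B', 2), ('C', 1)],
        [('D', 1), ('E', 1)]]
print(table2grid(rows))  # prints [['A', 'B', 'C'], ['D', 'B', 'E']]
```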

---

when things get more complex, requests is easier to use than urllib2

I've read strong recommendations of requests but I found just loading the module was taking a ridiculous amount of time. Reading from a local html file, the same script took ~100ms with urllib2 and ~700ms if requests was imported! (That's just loading the module.) For this job urllib2 seems fine.
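
A rough way to measure the import cost on its own, separate from the rest of the script, is sketched below (Python 3; import_seconds is a made-up helper). It only gives a meaningful number for the first import in a fresh interpreter, because later imports are served from the sys.modules cache. The example times the standard library's json module so that it runs anywhere; swap in 'requests' if it is installed.

```python
import importlib
import time

def import_seconds(module_name):
    """Return how long importing module_name takes, in seconds."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

print('%.1f ms' % (import_seconds('json') * 1000))
```

On Python 3.7 and later, running the interpreter with -X importtime prints a per-module breakdown without any extra code.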

---

One question: urllib2 seems to need a path to a collection of certificates to use https, so I ran it with

urllib2.urlopen('https://www.adobe.com/software/flash/about/', capath='/etc/ssl/certs')

It appears to be working, and throws no errors, but do you know of any way to test if it's really using those certificates and authenticating the connection?
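
One offline check is to look at the settings of the SSL context that gets used. The sketch below is Python 3 and passes an explicit context to urlopen() rather than the capath= argument used above; verifying_context and fetch are hypothetical names. A context made by ssl.create_default_context() requires a valid certificate chain and a matching hostname.

```python
import ssl
from urllib.request import urlopen

def verifying_context(capath=None):
    """Build a client context that insists on certificate and hostname checks.
    With capath=None the system's default CA certificates are loaded."""
    return ssl.create_default_context(capath=capath)

def fetch(url, capath=None):
    """Fetch url over a verified connection (not called in this sketch)."""
    return urlopen(url, context=verifying_context(capath)).read()

ctx = verifying_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)  # prints True True
```

A practical test would be to point capath at an empty directory: with no trusted certificates available I'd expect the request to fail with a certificate verification error, which would show the certificates really are being consulted.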

---

rowspan handling can be rewritten without duplicating the m.append() part using just one loop

I tried this, but it's no shorter really:

        for td in cells:
            try:
                rowsp = int(td["rowspan"])
            except KeyError:
                rowsp = 1
            for j in xrange(i, i + int(rowsp)):
                m[j].append(td.text)
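
The try/except could probably be shortened with a dict-style lookup that supplies a default. A bs4 Tag keeps its attributes in a plain dict (.attrs) and also has a .get() method, so the sketch below uses ordinary dicts as stand-ins; rowspan_of is a made-up helper name.

```python
def rowspan_of(attrs):
    """attrs: mapping of attribute names to values, e.g. a bs4 Tag's .attrs dict."""
    # A missing rowspan attribute means the cell covers a single row.
    return int(attrs.get('rowspan', 1))

print(rowspan_of({'rowspan': '3'}), rowspan_of({}))  # prints 3 1
```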

---

exception handling is just a try/except around the things that can fail. It probably is enough to just use one try/except as we do not need to do fine-grained error handling or recovery.

I've thrown in a couple of tests, and was thinking of having the calling shell script do a regex check on the returned string too.
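
The same sanity check could also be done on the Python side. The sketch below assumes a version is always four dot-separated numbers, as in the table above; VERSION_RE and looks_like_version are names made up for the example.

```python
import re

# Assumption: versions look like 23.0.0.162 (four numeric fields).
VERSION_RE = re.compile(r'\d+(\.\d+){3}$')

def looks_like_version(s):
    """True if s, ignoring surrounding whitespace, is four dot-separated numbers."""
    return bool(VERSION_RE.match(s.strip()))

print(looks_like_version('23.0.0.162'), looks_like_version('Chromium'))  # prints True False
```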

---

Your code seemed to have a Python3-ism:

version, *_ = list(map(lambda xs: xs[Co...

The asterisk on the left of the assignment isn't accepted in Python2. The script ran with the '*_' removed, but throws a "too many values to unpack" error if the list has more than one element (it shouldn't of course). In fact I got the impression that filter, map and lambda might be dropped from Python some day, so tried rewriting that part with a "comprehension":

version, = [x[Column.VERSION] for x in matrix if os_re.match(x[Column.OS]) and type_re.match(x[Column.TYPE])]

and then ran a "try" test on that function.
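
The single-target unpacking is what makes that test work: it raises ValueError whenever the list does not hold exactly one element, whether it is empty or has several. A small sketch (the helper name only is made up):

```python
def only(items):
    """Return the single element of items, or raise ValueError."""
    item, = items          # fails unless there is exactly one element
    return item

print(only(['23.0.0.162']))  # prints 23.0.0.162

for bad in ([], ['a', 'b']):
    try:
        only(bad)
    except ValueError as e:
        print('ValueError:', e)
```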

---

A couple of other small changes I made:

Replaced repeat with xrange to avoid another import:

def table2matrix(t):
    rows = t.find_all("tr")
    m = [ [] for _ in xrange(len(rows)) ]

Made the list search a regex match not a hard string match. (as above)

Deleted excess members of the m list if no <td>'s were found in a row - saving the final filter. I don't know if that actually makes any difference to execution time.

        cells = tr.find_all("td")
        if not cells:
            del m[-1]
            continue

Added some system exit calls.

Anyway the new version - hoping that my changes didn't break anything:

#!/usr/bin/env python2.7

from __future__ import print_function
from bs4 import BeautifulSoup
import urllib2, re, sys

OS_regex = '.*Linux'
TYPE_regex = 'Chromium.* PPAPI'
adobe_url = 'https://www.adobe.com/software/flash/about/'
ca_certs_path = '/etc/ssl/certs' # ca-certificates on Debian Jessie

class Column:
    OS = 0
    TYPE = 1
    VERSION = 2

os_re = re.compile(OS_regex)
type_re = re.compile(TYPE_regex)

def err_exit(msg):
    print(msg, file=sys.stderr)
    sys.exit(1)

def table2matrix(t):
    rows = t.find_all("tr")
    m = [ [] for _ in range(len(rows)) ]
    i = 0
    for tr in rows:
        cells = tr.find_all("td")
        if not cells:
            del m[-1]
            continue
        for td in cells:
            try:
                rowsp = int(td["rowspan"])
            except KeyError:
                rowsp = 1
            for j in xrange(i, i + int(rowsp)):
                m[j].append(td.text)
        i += 1
    return m

def get_version():
    version, = [x[Column.VERSION] for x in matrix if os_re.match(x[Column.OS]) and type_re.match(x[Column.TYPE])]
    return version

try:
    r = urllib2.urlopen(adobe_url, capath=ca_certs_path).read()
except IOError:
    err_exit('IO Error: Bad URL or certs path?')

soup = BeautifulSoup(r, 'html.parser')
matrix = table2matrix(soup.find('table', { 'class': 'data-bordered' }))

try:
    print(get_version())
except ValueError:
    err_exit('Parsing error: regular expressions wrong?')
else:
    sys.exit(0)

Last edited by johnraff (2016-10-06 05:45:04)

johnraff · 2016-10-06 05:45:57

^EDIT: Restore call to Python2.7 (my earlier claim that it runs on Python3 too was not true, but the version below does), add stderr printing, move user-tweakable variables to top.

Last edited by johnraff (2016-10-13 08:22:31)

johnraff · 2016-10-12 09:09:47

Just in case anyone in the future reads this (please don't pay it too much attention), a slightly tidied-up version here (this one does run on both python 2 & 3):

#!/usr/bin/env python2.7

from __future__ import print_function
from bs4 import BeautifulSoup
import re, sys
try:
    # For Python 3
    from urllib.request import urlopen, URLError
except ImportError:
    # For Python 2
    from urllib2 import urlopen, URLError

# These regexes will be *searched* for in the <td> text.
# Use ^ and $ as necessary.
OS_regex = 'Linux'
TYPE_regex = 'Chromium.* PPAPI'

adobe_url = 'https://www.adobe.com/software/flash/about/'
ca_certs_path = '/etc/ssl/certs' # ca-certificates on Debian Jessie

def err_exit(*msgs):
    print('\n'.join(msgs), file=sys.stderr)
    sys.exit(1)

class Column:
    OS = 0
    TYPE = 1
    VERSION = 2

def table2matrix(t):
    rows = t.find_all("tr")
    m = [ [] for _ in range(len(rows)) ]
    i = 0
    for tr in rows:
        for td in tr.find_all("td"):
            if 'rowspan' in td.attrs:
                rowsp = int(td["rowspan"])
            else:
                rowsp = 1
            for j in range(i, i + int(rowsp)):
                m[j].append(td.get_text(separator=' ', strip=True))
        i += 1
    return [ x for x in m if x ]

def latest_version():
    try:
        body = urlopen(adobe_url, capath=ca_certs_path).read()
    except URLError as e:
        err_exit('URLError: Bad URL or certs path?', str(e.reason))
    soup = BeautifulSoup(body, 'html.parser')
    matrix = table2matrix(soup.find('table', { 'class': 'data-bordered' }))
    version, = [x[Column.VERSION] for x in matrix if re.search(OS_regex, x[Column.OS]) and re.search(TYPE_regex, x[Column.TYPE])]
    return version

try:
    print(latest_version())
except ValueError:
    err_exit('Parsing error: regular expressions wrong?')

Last edited by johnraff (2016-10-13 08:25:56)
