Select Page

Some technical details about the process for the Hansard data analysis project:

  • Downloading Hansard transcripts (XML) from the Parliament of Australia website using a Python script (some Javascript hackery was required to get the proper path)
  • Python with BeautifulSoup to parse the XML and write individual words to a local database, with a reference to the Member along with the context of the word (session, time, speech, question, answer, etc)
  • Stripping out punctuation, numbers, apostrophe + ‘s’, URLs, etc.
  • Dealing with some “interesting” data at times:
<span style="font-style:italic;">Bringing </span>, <span style="font-style:italic;">t</span>, <span style="font-style:italic;">hem </span>, <span style="font-style:italic;">h</span>, <span style="font-style:italic;">ome</span>
  • Python dictionary keys to determine the ongoing unique words (and count) for each member
  • Google Cloud Platform to try BigTable, BigQuery and Datalabs
  • A realisation that my laptop could manage the DB and queries as long as I was prepared to wait occasionally
  • Juypter Notebook
  • Wrangling the data with numpy, panda and matplotlib
  • Plenty of research on Stack Overflow