
PyCon Philippines 2015

Earlier this month we attended PyCon Philippines as a gold sponsor and presented on the second day. This was particularly exciting because it was the first time the whole Philippines team was together in one place, and it was great to meet each other in person!

Check out the slides below:

[slideshare id=50539838&doc=scrapinghubpyconphilippines2015-150715055151-lva1-app6892]

The talk started with how people used to scrape manually and the pain of handling timeouts, retries, HTTP errors and so forth. We presented Scrapy as a solution to these issues and explained how to address them with it, and we gave a brief history of how Scrapy came to be.
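
To give a feel for the manual approach, here is a minimal sketch of a hand-rolled fetch loop using only the Python standard library. It is not code from the talk: the function name `fetch_with_retries`, the set of retryable status codes, and the backoff values are all assumptions chosen for illustration.

```python
import socket
import time
import urllib.error
import urllib.request

# Assumption: treat these server-side errors as temporary and worth retrying.
RETRYABLE_STATUS = {500, 502, 503, 504}


def fetch_with_retries(url, opener=urllib.request.urlopen, retries=3,
                       timeout=10.0, backoff=1.0, sleep=time.sleep):
    """Fetch a URL by hand, retrying on timeouts and temporary errors.

    Hypothetical helper: `opener` and `sleep` are injectable so the
    function can be exercised without touching the network.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.HTTPError as exc:
            # Errors such as 404 will not fix themselves, so give up at once.
            if exc.code not in RETRYABLE_STATUS:
                raise
            last_error = exc
        except (urllib.error.URLError, socket.timeout, TimeoutError) as exc:
            # Connection problems and timeouts are worth another attempt.
            last_error = exc
        if attempt < retries:
            # Exponential backoff: wait 1x, 2x, 4x ... the base delay.
            sleep(backoff * 2 ** attempt)
    raise last_error
```

Every manual scraper ends up carrying some version of this boilerplate. In Scrapy the same concerns are handled by the framework and tuned through settings such as `RETRY_TIMES`, `RETRY_HTTP_CODES` and `DOWNLOAD_TIMEOUT`.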


We proceeded with a live demo showing how to scrape the Republic Acts of the Philippines from the Philippine government website as well as scraping clickthecity.com to retrieve cinema screening schedules in Metro Manila. Some of the audience joined in during the demo and we helped answer their questions.

We also talked about some of the projects we have done for customers at Scrapinghub:

  • Collecting product information from various retailers worldwide for a UK analytics company. This data is used to discover who has the cheapest products, which retailers are running promotions, and so on. This is useful for customers who want to find the best deals, for retailers who want to see how they compare, and for brands who want to ensure retailers are conforming to their guidelines.
  • Scraping home appliances along with their price, specifications and ratings for a U.S. Department of Energy laboratory. This data is used to better understand the relationship between product price, energy efficiency and other factors, and how they evolve over time.
  • DARPA's Memex project.

We then showed a number of side projects including:

  • Using Scrapy to crawl the Metro Rail Transit website and applying computer vision to its CCTV images to gauge the number of people on a scale of 1 to 10. The collected data was visualised and presented, along with an explanation of how historical data could be used to predict future results.
  • Minibalita.com: Jolo showed how he scrapes Philippine news websites and runs the articles through his site TextTeaser to produce summaries. Balita is the Tagalog word for news.
  • Mikko presented his crawling of the Philippines’ 2013 general election, emphasising the power of structured data in finding trends. Some unusual trends discussed were:
    • How one clustered precinct of around 400 people only voted for one party list despite having the option to vote for two.
    • How one clustered precinct only voted for one senator, out of the maximum of 12 votes they can cast for the position.
    • How 70 clustered precincts recorded a 100% voter turnout, which is highly unlikely.
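
Once election results are in structured form, checks like the ones above become simple filters. The sketch below is illustrative only and is not Mikko's code: the `Precinct` record, its field names and the sample rows are hypothetical and do not come from the actual 2013 data.

```python
from dataclasses import dataclass


@dataclass
class Precinct:
    # Hypothetical record layout; the real scraped data may differ.
    precinct_id: str
    registered_voters: int
    ballots_cast: int
    senators_with_votes: int  # distinct senators who received any vote


def full_turnout(precincts):
    """Return IDs of precincts where every registered voter cast a ballot."""
    return [p.precinct_id for p in precincts
            if p.registered_voters > 0
            and p.ballots_cast == p.registered_voters]


def single_senator(precincts):
    """Return IDs of precincts where only one senator received votes."""
    return [p.precinct_id for p in precincts
            if p.ballots_cast > 0 and p.senators_with_votes == 1]


# Made-up sample rows, purely to exercise the filters.
sample = [
    Precinct("A-001", registered_voters=400, ballots_cast=400, senators_with_votes=30),
    Precinct("A-002", registered_voters=350, ballots_cast=270, senators_with_votes=1),
    Precinct("A-003", registered_voters=500, ballots_cast=380, senators_with_votes=33),
]

print(full_turnout(sample))
print(single_senator(sample))
```

The same pattern extends to other questions, such as counting how many party lists received votes in each precinct.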

Finally, we discussed some of the legalities of scraping, which prompted a lot of questions. We also had many questions about how we deal with complaints and sites blocking us, and how to handle sites that make heavy use of JavaScript and AJAX.


Afterwards we had dinner in a nearby mall with the other speakers, and it was great meeting like-minded people from the Python community.