How to fill login forms automatically

Read Time

2 Mins

Posted on

October 26, 2012

By

Pablo Hoffman

We often have to write spiders that need to log in to sites in order to scrape data from them. Our customers provide us with the site, username and password, and we do the rest.

The classic way to approach this problem is:

  1. launch a browser, go to the site and find the login page
  2. inspect the source code of the page to find out:
    1. which form is the login form (a page can have many forms, but usually only one of them is for logging in)
    2. which field names are used for the username and password (these can vary a lot)
    3. whether there are other fields that must be submitted (like an authentication token)
  3. write the Scrapy spider to replicate the form submission using FormRequest (here is an example)
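
As a rough illustration of what step 2 involves, the sketch below lists every form on a page together with its input fields, using only the Python standard library. The `FormLister` class, the `list_forms` helper and the `SAMPLE_PAGE` markup are made up for this example; real login pages are usually messier.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect every <form> on a page together with its <input> fields."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per <form> found
        self._current = None   # the form being parsed right now, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").upper(),
                "inputs": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            self._current["inputs"].append({
                "name": attrs.get("name"),
                "type": (attrs.get("type") or "text").lower(),
                "value": attrs.get("value") or "",
            })

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_forms(html):
    """Return a list of dicts describing the forms found in `html`."""
    parser = FormLister()
    parser.feed(html)
    parser.close()
    return parser.forms


# Hypothetical page with a search form and a login form.
SAMPLE_PAGE = """
<form action="/search" method="get"><input type="text" name="q"></form>
<form action="/session" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="secret">
</form>
"""

if __name__ == "__main__":
    for form in list_forms(SAMPLE_PAGE):
        print(form["method"], form["action"])
        for field in form["inputs"]:
            print("   ", field["type"], field["name"], repr(field["value"]))
```

Reading that output by hand answers the three questions in step 2: the second form is the login form, its fields are `login` and `secret`, and the hidden `token` field has to be submitted along with them.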

Being fans of automation, we figured we could write some code to automate step 2, which is actually the most time-consuming one. The result is loginform, a library that automatically fills login forms given the login page, username and password.
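
To give a feel for how this kind of automation could work, here is a small sketch that scores candidate forms and fills in the best one. This is not loginform's actual algorithm: the scoring weights, the `score_form` and `fill_form` names and the sample data are all assumptions made for illustration.

```python
def score_form(form):
    """Higher score = more likely to be a login form (made-up heuristic)."""
    types = [field["type"] for field in form["inputs"]]
    score = 0
    if types.count("password") == 1:
        score += 10   # exactly one password field is typical of login forms
    elif types.count("password") > 1:
        score -= 5    # two or more usually means sign-up or password change
    if types.count("text") + types.count("email") == 1:
        score += 5    # a single visible text field for the username
    if form["method"] == "POST":
        score += 1
    return score


def fill_form(forms, username, password):
    """Pick the best-scoring form and return (args, action, method)."""
    form = max(forms, key=score_form)
    args = []
    user_filled = False
    for field in form["inputs"]:
        if not field["name"]:
            continue  # unnamed inputs are never submitted
        if field["type"] == "password":
            value = password
        elif field["type"] in ("text", "email") and not user_filled:
            value = username
            user_filled = True
        else:
            value = field["value"]  # keep hidden tokens and other defaults
        args.append((field["name"], value))
    return args, form["action"], form["method"]


# Hypothetical, already-parsed forms from a page.
SAMPLE_FORMS = [
    {"action": "/search", "method": "GET",
     "inputs": [{"name": "q", "type": "text", "value": ""}]},
    {"action": "/session", "method": "POST",
     "inputs": [
         {"name": "token", "type": "hidden", "value": "abc123"},
         {"name": "login", "type": "text", "value": ""},
         {"name": "secret", "type": "password", "value": ""},
     ]},
]

if __name__ == "__main__":
    print(fill_form(SAMPLE_FORMS, "foo", "bar"))
```

A heuristic like this will guess wrong on some pages, which is why a collection of real HTML samples is useful for testing it.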

Here is the code of a simple spider that uses loginform to log in to sites automatically:

# Note: BaseSpider from scrapy.spider is now scrapy.Spider in current Scrapy releases
from scrapy import Spider
from scrapy.http import FormRequest
from loginform import fill_login_form

class LoginSpider(Spider):

    name = "login"  # Scrapy requires every spider to have a name
    start_urls = ["http://github.com/login"]
    login_user = "foo"
    login_pass = "bar"

    def parse(self, response):
        # Find the login form on the page and fill in the credentials
        args, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_pass)
        return FormRequest(url, method=method, formdata=args, callback=self.after_login)

    def after_login(self, response):
        # you are logged in here
        pass

In addition to being open source, the loginform code is very simple and easy to hack (check the README on GitHub for more details). It also contains a collection of HTML samples to keep the library well-tested, and a convenient tool to manage them. Even with the simple code so far, we have seen accuracy rates of 95% in our tests. We encourage everyone with similar needs to give it a try, provide feedback and contribute patches.

We contributed this project to the Scrapy group to better encourage community adoption and contributions. Like w3lib and scrapely, loginform is completely decoupled and not dependent on the Scrapy crawling framework.
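
Because loginform does not depend on Scrapy, the `(args, url, method)` triple it returns can be submitted with any HTTP client. The sketch below uses the standard library's `urllib`. The `build_login_request` and `login_with_loginform` helpers are hypothetical names introduced here, and the sketch assumes `args` is a sequence of `(name, value)` pairs, as in the spider above.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_login_request(args, url, method):
    """Turn an (args, url, method) triple into a urllib Request."""
    encoded = urlencode(args)
    if method.upper() == "POST":
        # POST forms send the fields in the request body
        return Request(url, data=encoded.encode("utf-8"), method="POST")
    # GET forms send the fields in the query string instead
    separator = "&" if "?" in url else "?"
    return Request(url + separator + encoded, method="GET")


def login_with_loginform(login_url, username, password):
    """Fetch the login page, fill the form with loginform and submit it.

    Needs the third-party loginform package and network access.
    """
    from loginform import fill_login_form

    body = urlopen(login_url).read()
    args, url, method = fill_login_form(login_url, body, username, password)
    return urlopen(build_login_request(args, url, method))


# Example call (not run here):
# response = login_with_loginform("https://github.com/login", "foo", "bar")
```

A real session would also need cookie handling, for instance an opener built with `http.cookiejar`, so that later requests stay logged in.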