BioStar users update: using mechanize to fetch user IP addresses

In my previous post I outlined a clumsy, manual method to retrieve user IP addresses from BioStar using JavaScript. Jukka left a helpful comment explaining how to send an authentication cookie to a website, so we’re now in a position to automate the fetching of user IP addresses.

Before we get started, a few words of caution. First, passing secret authentication tokens around in code might not be a great idea from a security point of view. Second, some websites do not like to be “scraped” by code – so check the site’s policy first. Third, even when a site does permit scraping, exercise some restraint and common sense: it’s not polite to write code that hammers a server with 1,000 requests per second, and doing so may get your IP address blacklisted. Finally, be respectful of confidential data. I’m publishing only cities and their coordinates, not the details of the users at those coordinates.
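As a minimal sketch of that restraint, here’s one way to enforce a pause between successive requests. The `Throttle` class and the one-second delay are my own illustration, not part of the script below:

```ruby
# Minimal politeness throttle: guarantees at least `delay` seconds
# between successive requests. Call `wait` before each request.
class Throttle
  def initialize(delay)
    @delay = delay
    @last  = nil
  end

  def wait
    if @last
      elapsed = Time.now - @last
      sleep(@delay - elapsed) if elapsed < @delay
    end
    @last = Time.now
  end
end

throttle = Throttle.new(1.0)
# e.g. urls.each { |u| throttle.wait; agent.get(u) }
```

One call per request is enough; the first call returns immediately and each later call sleeps only for whatever remains of the delay.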

1. Find your authentication cookie
For this approach to work, you need to log in to BioStar (via a web browser) and be a moderator. I use the Chrome browser, so from any BioStar page I right-click in the body of the page, select “Inspect Element” and then click the “Storage” tab in the Developer Tools console. From there, click “biostar.stackexchange.com” under “Cookies” in the left panel and look for the cookie named “user”. Copy its value.

2. Ruby code
To fetch IP addresses, I wrote some code around the Ruby mechanize gem. The mechanize library has been ported to several languages and is a great way to automate interaction with websites. Here’s the code:

#!/usr/bin/ruby

require "rubygems"
require "mechanize"
require "logger"

log    = Logger.new('ip.txt')
url    = "http://biostar.stackexchange.com/users"
auth   = ARGV[0] or abort("Usage = biostar.rb auth-cookie-string")
agent  = Mechanize.new
cookie = Mechanize::Cookie.new("user", auth)
cookie.domain = ".stackexchange.com"
cookie.path   = "/"

# login first
page  = agent.get(url)
agent.cookie_jar.add(page.uri, cookie)

i = 1
loop do
  page  = agent.get("#{url}?page=#{i}")
  users = []

  page.links.each do |link|
    if link.uri.to_s =~ /\/users\/\d+\// && link.text != ""
      users.push(link)
    end
  end

  # skip the first user link (= you, shown at the top of the page after login)
  users[1..-1].each do |user|
    ip = ""
    userpage = user.click
    lastip   = userpage.search(".//div[@class='summaryinfo']").inner_text
    if lastip =~ /from\s+(.*?)$/
      ip = $1
    end
    log.debug "#{user.text}\t#{ip}"
  end
  break if page.link_with(:text => ' next').nil?
  i += 1
end

The script is named biostar_get_user_ip.rb and is run by typing ruby biostar_get_user_ip.rb “cookie” (including the quotes), where cookie is the value of the user cookie. Line 11 of the script creates the cookie and line 17 adds it to the agent’s cookie jar for future requests.
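As an aside, the `auth = ARGV[0] or abort(...)` line relies on Ruby’s low-precedence `or`: the assignment binds first, and `abort` fires only when no argument was given. A small sketch of the same idiom, using a hypothetical `parse_auth` helper of my own (the real script calls `abort` directly):

```ruby
# Mirror of the script's argument check: `=` binds tighter than `or`,
# so the right-hand side runs only when argv[0] is nil.
def parse_auth(argv)
  auth = argv[0] or raise(ArgumentError, "Usage = biostar.rb auth-cookie-string")
  auth
end
```

Calling `parse_auth(["secret"])` returns the cookie string; calling it with no arguments raises.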

The core of the code is lines 20-42. We loop through each page of users and store the links to their profile pages in an array, users. Then we loop through that array, visit each profile page and look for the div containing the user’s last known IP address, which is shown only to authenticated moderators. I’ve used Logger to write the results to a log file, but you could just as easily print to the console or write to a file directly. The loop breaks after the final page is processed, detected by the absence of a “next” link.
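To make the two regular expressions concrete, here is a small, self-contained sketch using made-up link paths and a made-up summary string – the real values come from the live pages:

```ruby
# Filter links that look like user profiles: /users/<digits>/...
links = ["/users/42/fred", "/users/", "/questions/7/"]
user_links = links.select { |path| path =~ /\/users\/\d+\// }
# => ["/users/42/fred"]

# Extract everything after "from" in the moderator-only summary text
summary = "last seen Jan 1 from 192.0.2.1"
ip = summary =~ /from\s+(.*?)$/ ? $1 : ""
# => "192.0.2.1"
```

Note that the second regex captures whatever follows “from”, so if the summary text ends with a city rather than an IP address, that is what gets logged.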

This code retrieved 578 IP addresses; I’ve updated both the GitHub repository and the ZeeMaps map with the new data. In addition, I now have a working solution that I can re-run at any time as new users join BioStar.