WordPress.com search results filtering script (one week scripting exercise n. 1)

I thought it would be nice to publish some code snippets and scripts on this blog, starting with simple exercises I can finish in less than a week (to avoid complicating my projects and exercises so much that they remain unfinished).

WordPress.com gave me the idea for the first of these exercises. I searched wordpress.com for blogs on themes similar to mine, but many of the results came from blogs with very few posts, so I decided to write a simple script that searches wordpress.com and then filters the results, keeping only posts from blogs with more than a certain number of posts.

The idea behind my script is simple. For each blog, I figure out how many posts per page it has, and to find out whether it has at least n posts, I calculate where that nth post would be found (on which page, and in which position on that page) and check whether it exists.
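The page-and-position arithmetic can be sketched like this (a toy illustration; `locate_post` is a hypothetical helper, not a function from the script):

```python
def locate_post(n, posts_per_page):
    # Hypothetical helper: on which page, and in which position on
    # that page, does the nth post appear? Pages are numbered from 1.
    page = (n + posts_per_page - 1) // posts_per_page  # ceiling division
    position = n - (page - 1) * posts_per_page
    return page, position

print(locate_post(37, 10))  # (4, 7): seventh post on the fourth page
print(locate_post(30, 10))  # (3, 10): tenth post on the third page
```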

This means a lot of (blocking) HTTP requests, and it would be MUCH FASTER if I knew how to use threads, asynchronous sockets, or something like that. It would be nice to use a JavaScript-like approach, with requests and callbacks. I even spent two days on it (not whole days; I have other programming work to do besides these exercises), and finally decided that it would be too much for a less-than-a-week beginner's exercise. But that is definitely something I need to look into soon!
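For the record, one way to issue several requests concurrently (something this script does not do yet) is a thread pool. A minimal sketch, with a stub fetch function standing in for a real HTTP call such as urllib's urlopen:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real HTTP request; it just echoes the URL
    # so the sketch stays self-contained and runnable offline.
    return "page for " + url

urls = ["http://example.wordpress.com/page/%d" % i for i in range(1, 4)]
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() dispatches the calls to worker threads and collects
    # the results in the original order.
    pages = list(pool.map(fetch, urls))
print(pages[0])  # page for http://example.wordpress.com/page/1
```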

I have always wanted to take my web experience beyond just browsing. As Davide Eynard explains in his powerbrowsing tutorial, "a computer is not a TV". Anyone can write scripts that help see the web and the Internet from a variety of interesting points of view, and so can I. So I slowly started to explore. First I read the page source of websites I visited, and switched CSS and JavaScript off and on to see what that does. Then I studied some JavaScript, and started to understand the DOM (how the browser sees the page as a tree) and HTTP requests and responses. I installed Mozilla add-ons like the DOM Inspector, Firebug and Live HTTP Headers (more on those in some future post). Then I started writing some simple scripts. This is one of them.

You can download the script here.

A screenshot of a wordpress.com search, with the Live HTTP Headers Mozilla add-on in use.

The script takes some simple command-line arguments:
-p or --posts for the number of posts the blogs should have
-s or --search for the search string
-f or --from and -t or --to for the range of result pages
-v or --verbose for the amount of detail to display about each post
-a or --all if you want to see the posts that didn't qualify, separately
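For instance, the flags above could be parsed with Python's argparse module (a sketch of equivalent option handling, not necessarily how the script itself does it):

```python
import argparse

parser = argparse.ArgumentParser(description="Filter wordpress.com search results")
parser.add_argument('-p', '--posts', type=int, default=10,
                    help="number of posts the blogs should have")
parser.add_argument('-s', '--search', required=True, help="search string")
# 'from' is a Python keyword, so it needs an explicit dest
parser.add_argument('-f', '--from', dest='from_page', type=int, default=1)
parser.add_argument('-t', '--to', dest='to_page', type=int, default=1)
parser.add_argument('-v', '--verbose', type=int, default=0,
                    help="amount of detail about each post")
parser.add_argument('-a', '--all', action='store_true',
                    help="also show posts that didn't qualify")

args = parser.parse_args(['-s', 'knitting', '-p', '50', '-a'])
print(args.search, args.posts, args.all)  # knitting 50 True
```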

Screenshot of my script for searching wordpress.com and filtering results based on blogs' post numbers.

SOME SNIPPETS:

1) Separating the page into segments I can parse easily. Each segment contains the HTML describing one found blog post. This could probably be improved, to make it less vulnerable to possible page variations, but for today it will do.

chunks = site.split('<!-- // result -->')
chunks[0] = chunks[0].rpartition('<!-- google_ad_section_start -->')[2]
chunks = chunks[:-1]
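On a toy page built with the same markers, those three lines behave like this:

```python
# Made-up miniature of a results page, using the real markers
site = ('header junk<!-- google_ad_section_start -->'
        'result one<!-- // result -->'
        'result two<!-- // result -->'
        'trailing junk')

chunks = site.split('<!-- // result -->')
# Strip everything before the first result from the first chunk
chunks[0] = chunks[0].rpartition('<!-- google_ad_section_start -->')[2]
# The last chunk is only trailing junk after the final marker
chunks = chunks[:-1]
print(chunks)  # ['result one', 'result two']
```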

2) Parsing the HTML with regular expressions. I have used the "non-greedy" version of the match-any-number-of-any-characters regexp, that is (.*?). For instance, if I search for something between '<a href="bla">' and '</a>', it will fetch all the characters between them, no matter what they are (apart from a newline, and even that can be overridden in Python's re module; see the DOTALL flag), but it will also stop at the first '</a>', since it is non-greedy: it grabs as little as possible. Information about this simple method can be found on the web, and it is also mentioned in Davide Eynard's powerbrowsing tutorial as "a regular expression which matches (…) the smallest text chunk between the two conditions". I keep the information on posts in a list of dictionaries, so I can easily refer to any found post by its index in the list, and immediately find its title, URL, etc. in the dictionary.

regobj = re.compile('a href="(.*?)">(.*?)</a>')
for i, item in enumerate(chunks):
    item_dicts.append({})
    values = regobj.findall(item)[0]
    item_dicts[i]['post_url'] = values[0]
    title = values[1].replace('<strong>', '')
    title = title.replace('</strong>', '')
    item_dicts[i]['post_title'] = title
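To see the difference the ? makes, compare the greedy and non-greedy patterns on a line with two links (a toy example with made-up URLs):

```python
import re

html = ('<a href="http://one.example">first</a> and '
        '<a href="http://two.example">second</a>')

# Greedy: (.*) runs to the LAST '">' and the LAST '</a>',
# swallowing both links in one huge match.
greedy = re.findall('a href="(.*)">(.*)</a>', html)
# Non-greedy: (.*?) stops at the first possible boundary,
# so each link is matched separately.
lazy = re.findall('a href="(.*?)">(.*?)</a>', html)

print(len(greedy))  # 1
print(lazy)         # [('http://one.example', 'first'), ('http://two.example', 'second')]
```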

3) The function that counts posts on each page is very basic. It just counts the occurrences of the HTML strings '<div id="post-' and '<div class="post-'. Nothing fancy, but for today it's enough.

count = page.count('<div id="post-')
count = count + page.count('<div class="post-')
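For instance, on a made-up fragment with one post of each kind:

```python
# Toy page fragment with one post of each markup style
page = ('<div id="post-101" class="post">first post</div>'
        '<div class="post-102 hentry">second post</div>')

count = page.count('<div id="post-')
count = count + page.count('<div class="post-')
print(count)  # 2
```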

4) The function that checks whether the nth post we are interested in exists:

def check_nth_post(wp_posts, posts_per_page, blog_url):
    # This function checks if this blog has at least n posts,
    # by checking if that post exists in its calculated position
    if posts_per_page >= wp_posts:
        return 1
    elif posts_per_page > 0:
        # For instance, if we have a blog with 10 posts on every page,
        # the 37th post is seventh on the fourth page.
        # So page is 4 (37/10 + 1), and nth is 7 (37 % 10).
        # If we have the same blog, but want the 30th post,
        # it will be tenth on the third page:
        # page will be 3 (30/10 + 0), and nth will be 10.
        # Probably there are more elegant ways to calculate that,
        # but for now this works.
        page = int(wp_posts / posts_per_page) + ((wp_posts % posts_per_page) != 0)
        nth = wp_posts % posts_per_page
        if nth == 0:
            nth = posts_per_page  # the nth post is the last one on its page
        if get_posts_per_page(blog_url + "page/" + str(page)) >= nth:
            return 1
        else:
            return 0

5) I keep track of excluded posts, and if you want to see the list of those too, you add the -a or --all flag when launching the script from the command line. I keep track of possible strange cases as well, like pages that don't turn up any '<div id="post-' or '<div class="post-' strings I can use to count posts. Those are very rare, and I might look into them in the future. Maybe.

I have learned new things writing this script, and I have also used it to discover some interesting blogs.

I have also discovered some areas I need to work on:

1) I am now working on improving my use of regular expressions.
2) I intend to study threading/asynchronous sockets in Python, because blocking
HTTP requests are making this script very slow 😦
3) I need to understand the best way to interface Python to the web,
especially outside frameworks like Django.
4) I need to better understand URL and HTML encoding and decoding in Python.
5) I should probably handle exceptions and possible problems more carefully,
checking my data as I pass it along, and as functions return it.
6) And so on.

I kept the default user-agent. I could've changed it like so:

import urllib

class AppURLopener(urllib.FancyURLopener):
    version = "App/1.7"

urllib._urlopener = AppURLopener()

http://docs.python.org/library/urllib.html#urllib._urlopener
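(For what it's worth, in the newer urllib.request interface a custom user-agent is just a header on the request; a sketch, assuming Python 3:)

```python
import urllib.request

# The same user-agent override, expressed as a request header.
# Building the Request does not hit the network yet.
req = urllib.request.Request("http://wordpress.com/",
                             headers={"User-Agent": "App/1.7"})
# Request stores header names capitalized, i.e. as "User-agent"
print(req.get_header("User-agent"))  # App/1.7
```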

I could've done so much more, I know, but an important part of the purpose
of this exercise is to get into the habit of finishing and delivering, instead
of entering the endless loop of modifying and researching, branching off
endlessly and getting confused by the wealth of issues, options and choices.

And doing will lead to improving, more than getting lost trying does.

Feedback is welcome. Please be gentle 🙂

Some links:
*Davide Eynard on powerbrowsing http://davide.eynard.it/malawiki/PowerBrowsing
*Urllib and urllib2 http://docs.python.org/library/urllib.html, http://docs.python.org/library/urllib2.html
*asyncore — Asynchronous socket handler – http://docs.python.org/library/asyncore.html
*http://stackoverflow.com/questions/668257/python-simple-async-download-of-url-content

(If  any other links come to mind, I’ll add them here.)


About apprenticecoder

My blog is about me learning to program, and trying to narrate it in interesting ways. I love to learn and to learn through creativity. For example I like computers, but even more I like to see what computers can do for people. That's why I find web programming and scripting especially exciting. I was born in Split, Croatia, went to college in Bologna, Italy and now live in Milan. I like reading, especially non-fiction (lately). I'd like to read more poetry. I find architecture inspiring. Museums as well. Some more than others. Interfaces. Lifestyle magazines with interesting points of view. Semantic web. Strolls in nature. The sea.
