Atm-mi path-finder scraping script update (combining two scrapes)

I’ve upgraded my atm-mi interface-for-means-of-transportation-finder script. Now you can not only ask for the best way to get from point A to point B in Milan, but also get the news about the lines you are advised to use (which sometimes change course for whatever reason). Click ‘Continue reading’ for more details, some thoughts (‘powerbrowsing is more than just scraping’), a link to another interesting script, anecdotes and fun screenshots.

See the script work here: http://oljapetrovic.com/atm/

And download the new source code here: http://oljapetrovic.com/atm/atm-mi_script_current.tar.gz

(0.2.0 Added lookup for news related to tram and subway lines, news get added to the page, 0.2.1 Minor fix, checking if variables set/empty – $url, $news_urls, $metro_lines)

Powerbrowsing is more than just scraping. It is about combining, mixing and matching information, using and combining data flows, and creating new ones. Like computing in general, it is about input, processing, and output, and you can put these steps together as if they were Lego. You can take different inputs, process them, use the output as input for something else, or output your data in different formats. As Davide Eynard shows in his new blog post “Perl Hacks: automatically get info about a movie from IMDB”, you can take a cryptic file name like The.Best.Movie.Ever(2011).SiLeNT.[dvdrip].md.xvid.whatever_else.avi, extract a movie title, and look it up on IMDb (the Internet Movie Database). Once you find it there you can scrape the info, and after that you can do any number of things: access this data from your movie center, or look up data on actors, and keep searching until something fun happens.
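To make the file name step a bit more concrete, here is a minimal sketch in Python. It is not Davide's Perl code, just my guess at one way to do it: I assume the title is everything before the first four-digit year, and the function name `guess_title_and_year` is made up for this example.

```python
import re

def guess_title_and_year(filename):
    """Guess (title, year) from a release-style file name, or return None."""
    # Assumption: a 4-digit year (19xx or 20xx) surrounded by brackets,
    # dots, spaces or underscores marks the end of the title.
    match = re.search(r"[\(\[\.\s_]((?:19|20)\d{2})[\)\]\.\s_]", filename)
    if match is None:
        return None
    # Everything before the year is treated as the title; dots and
    # underscores are turned back into spaces.
    title = re.sub(r"[._]+", " ", filename[:match.start()]).strip()
    return title, int(match.group(1))

print(guess_title_and_year(
    "The.Best.Movie.Ever(2011).SiLeNT.[dvdrip].md.xvid.whatever_else.avi"
))
```

The resulting title and year could then be used as the search terms for the lookup step.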

Although I am mostly studying frameworks this week, I managed to find some time to update my atm-mi script, which serves as an interface for Milan’s public transportation website. It only does what you as a user would do with a browser: it enters two addresses and asks for a way to get from one to the other, gets the answer, and then presents it to you in a simplified format.

In the first version of the script I was scraping just one web page, and this time I decided to combine different pages into one, adding any relevant news about the tram and subway lines that the atm-mi site is suggesting (because the path is calculated for the theoretical state of things, and the tram may be taking another route for any number of reasons).

This image illustrates two methods you might use to find out if there are modifications to your tram line. You can ask for a path from A to B first, and if the path includes a tram you will get a link to the relevant news (for some reason you don’t get that with a subway line). Or you can ask for information about a single line and get that same link (still no luck with the subway, so for that I modified the URL to get the info; more about that later in this post).

atm-mi.it interface analysis

So, the tram lines were easy. Ask the question, get the answer, see if the answer contains a link to a page of news, get the link, follow the link, extract the news, add it to the output.

The urls to the news pages have the format: http://www.atm-mi.it/it/ViaggiaConNoi/InfoTraffico/Pagine/default.aspx?l=X, where X is the tram number. For the subway lines, the X is negative, so for line 1 we have ?l=-1, for the subway 2 we have ?l=-2, etc.
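The "see if the answer contains a link to a page of news" step could look roughly like this in Python. This is only a sketch, not the code from my script: the sample HTML is invented, and I am assuming that any link whose address contains `InfoTraffico` is a news link.

```python
from html.parser import HTMLParser

class NewsLinkFinder(HTMLParser):
    """Collect the href of every <a> tag that points at an InfoTraffico page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value can be None.
        href = dict(attrs).get("href") or ""
        if "InfoTraffico" in href:
            self.links.append(href)

def find_news_links(html):
    finder = NewsLinkFinder()
    finder.feed(html)
    return finder.links

# Invented sample, only loosely shaped like a results page.
sample = (
    '<p>Take tram 14 '
    '<a href="/it/ViaggiaConNoi/InfoTraffico/Pagine/default.aspx?l=14">news</a></p>'
    '<a href="/it/other">something else</a>'
)
print(find_news_links(sample))
```

Each link found this way would then be fetched and scraped in turn.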

atm-mi.it screenshot analysis

(Yeah, I need to figure out how to stop Gimp from SOMETIMES blurring images on resize. Ghgh.)

So, when the line is a tram line, the link to the list of relevant news is provided to me with the results. For the subway lines I create a URL in that format myself and check it for news. That slows down the script (an extra HTTP request/response), but never mind. I’m still not sure I cover all the possibilities (‘metro leggera’, a small piece of subway that connects a hospital; ‘linee interurbane’, the lines that go outside Milan; …). That will take some testing.
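Building the news URL by hand is simple enough. Here is a small Python sketch of the idea (the function name `news_url` and the `is_subway` flag are my own inventions for this example; the base address and the negative numbers for subway lines are the ones described above):

```python
from urllib.parse import urlencode

NEWS_BASE = "http://www.atm-mi.it/it/ViaggiaConNoi/InfoTraffico/Pagine/default.aspx"

def news_url(line_number, is_subway=False):
    """Build the news page URL for a tram line or a subway line."""
    # Subway lines use the negated line number (?l=-1 for subway line 1).
    value = -line_number if is_subway else line_number
    return NEWS_BASE + "?" + urlencode({"l": value})

print(news_url(14))                  # a tram line
print(news_url(1, is_subway=True))   # subway line 1
```

The special cases I mentioned (‘metro leggera’, ‘linee interurbane’) are not handled here, because I don't know yet what their numbers look like.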

And testing can reveal interesting things. For example, I was using my interface to figure out a way to get from some place to San Siro, and I was offered a select which let me choose between ‘San Siro’ (the stadium’s old name and nickname) and ‘via san siro’ (the street), but when I chose San Siro, I kept getting the same select. I tried the atm-mi site and that worked OK, no infinite select. But of course, the atm-mi site uses a POST request and I use a GET request. GET is what they use when you e-mail a friend a URL to a page with the path description, because with GET everything fits into the URL, while with POST part of the data is in the request body. In fact, if you try to send an e-mail explaining how to get to San Siro with their ‘e-mail this path to your friend’ service, you get the same problem. I noticed the same thing happens with ‘San Vittore’ (a prison) as well, that is, whenever several strings contain the string you are searching for, even if one of them is a perfect match. I’ll have to study these mechanisms more, but not today.

Here’s an illustration of differences between GET and POST:

A screenshot of a HTTP Header

A screenshot of a HTTP Header
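The same difference can be seen in code. This Python sketch builds a GET and a POST request with the same parameters without sending anything. The endpoint and the parameter names `from` and `to` are placeholders I made up, not the real atm-mi ones.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_requests(endpoint, params):
    """Return a (GET, POST) pair of requests carrying the same parameters."""
    query = urlencode(params)
    # GET: the parameters travel inside the URL itself.
    get_request = Request(endpoint + "?" + query)
    # POST: the URL stays clean and the parameters go into the request body.
    post_request = Request(endpoint, data=query.encode("ascii"))
    return get_request, post_request

get_req, post_req = build_requests(
    "http://example.com/path-finder",  # placeholder endpoint
    {"from": "Viale Monza 140", "to": "Piazzale Loreto"},
)
print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

This is why only the GET version can be pasted into an e-mail as a link: the POST body is not part of the address.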

I finished 98% of the scripting tonight at about 1:30. I tried to test the subway line news gathering, but as I asked for paths that should require the subway, atm-mi kept telling me to go by foot. Nice of them to worry about my health and fitness, but walking 3 km didn’t sound reasonable. Then I remembered that the subway closes after 00:30, and that my script can only ask for the path at the moment when you use it; I haven’t yet implemented a way to set any day and hour you want. So I had to stop working on the script and go to sleep. No! Finished it this morning. Kind of funny.

A funny screenshot

Ok, that’s pretty much it for now. I know the script is far from perfect, so any suggestions are welcome. Every time I write code, powerbrowse, scrape, I learn something. Even from studying the interfaces I need to scrape. It was fun. See you next time. Have fun doing whatever you are going to be doing.


About apprenticecoder

My blog is about me learning to program, and trying to narrate it in interesting ways. I love to learn and to learn through creativity. For example I like computers, but even more I like to see what computers can do for people. That's why I find web programming and scripting especially exciting. I was born in Split, Croatia, went to college in Bologna, Italy and now live in Milan. I like reading, especially non-fiction (lately). I'd like to read more poetry. I find architecture inspiring. Museums as well. Some more than others. Interfaces. Lifestyle magazines with interesting points of view. Semantic web. Strolls in nature. The sea.

3 Responses to Atm-mi path-finder scraping script update (combining two scrapes)

  1. claudio brandolino says:

    Looks great! Just a small observation about your extract_number(): it does not accept numbers > 1000, ranges of numbers (via Duomo, 11/13) and stuff like “42/a”, “42/bis”.

    Maybe it’s not relevant, since the fallback – just ignoring the numbers, what the ATM site will do – is not that bad. In case I’d add a comment specifying it.

    BTW, we love your blog!

    Claudio and Federica.

  2. Hi Claudio and Federica!

    Thank you so much for commenting, and for reading my code, and so carefully, too! In the next releases I will definitely improve the quality of the comments. I’ve heard stories of people being against comments, saying only the actual code is to be trusted, so comments are not useful. I think comments are great, and I’ll try to improve mine.

    I’ve looked into the function that you’ve mentioned and from what I’ve seen, it works as I expected, both calling the function from a test script and in the interface. I have specifically tried the examples you ask about, and they seem to work. So, let’s look into the regular expression at the heart of the extract_number() function.
    '#\d+(/\w{1,3})?$#'
    I’ve used hashes as delimiters so I wouldn’t have to escape the slash later on.
    I accept any number of digits, as long as there is at least one present (\d+),
    and then an optional combination (the whole parenthesis is followed by a question mark – so the whole combination is optional). The combination has to include a slash and 1-3 ‘word characters’, which can be letters, digits or underscores (“may vary if locale-specific matching is taking place”, like accented letters may be considered or not). Maybe I can increase the range, I’ll think about it.
    http://it.php.net/manual/en/regexp.reference.escape.php (use find-on-page to skip to \w, wish I could link to an anchor)
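To check the pattern against the examples from your comment, here is a small Python re-creation. It is a sketch, not the PHP function from my script: I am assuming the function simply returns the matched text, and note that Python's \w also accepts accented letters by default.

```python
import re

# Same pattern as above, without the PHP '#' delimiters.
STREET_NUMBER = re.compile(r"\d+(/\w{1,3})?$")

def extract_number(address):
    """Return the trailing street number of an address, or None."""
    match = STREET_NUMBER.search(address.strip())
    return match.group(0) if match else None

examples = [
    "Viale Monza 140",
    "via Duomo, 11/13",
    "via Roma 42/a",
    "via Roma 42/bis",
    "via Roma 1200",
    "Piazzale Loreto",
    "via Roma 42/quater",  # suffix longer than 3 characters
]
for address in examples:
    print(address, "->", extract_number(address))
```

The last example shows the limit of the {1,3} range: a suffix longer than three characters makes the whole match fail.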

    As for the atm-mi website, it doesn’t ignore the street numbers when processing the results. In fact, it gives very different answers for going from Viale Monza 1 to Piazzale Loreto (it tells you to walk) and for going from Viale Monza 140 (the Zelig comedy club 😀 ) to Piazzale Loreto (it tells you to take the subway). The atm-mi website loses your street number when you ask for an address that is imprecise and needs disambiguation (like ‘Corso Monza 140’ instead of ‘Viale Monza 140’). It gives you a select to choose from similar addresses, but loses the street number you have provided. So you will get ‘Viale Monza’ without the number, which is treated like street number 1, and you will be told to walk for 250 meters 😀 I tried to fix that by reintroducing the number I’ve extracted and saved into the select.
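The idea of putting the saved number back can be sketched like this in Python. The function name `reattach_number` and the plain list of options are assumptions for illustration only; in the actual script the options come from the scraped select.

```python
def reattach_number(options, number):
    """Append the saved street number to every suggested address."""
    # If no number was extracted, leave the suggestions untouched.
    if not number:
        return list(options)
    return [f"{option} {number}" for option in options]

# 'Corso Monza 140' was ambiguous; the site suggested these without the number.
suggestions = ["Viale Monza", "Via Monza"]
print(reattach_number(suggestions, "140"))
```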

    I’m not sure I understand what you mean by number ranges in this context?

    Feel free to ask again, and correct me if I’m missing something! Thank you again for this comment, I learn so much every time I get feedback.

    I’ve found your blog on the Cherryblossom website,
    http://cherryblossomweb.de/blawg/
    I like it very much and have subscribed to your RSS feed. I’m looking forward to reading your future posts (hopefully soon and often), and to commenting on your posts, future and already published, as soon as I come up with something interesting.

    Nice to get in touch,
    greets,
    Olja

  3. claudio brandolino says:

    Hi Olja,

    thanks for your answer, and sorry – I guess I need to get more sleep: the first time I checked out the code, I didn’t get why it was matching things != numbers, and I “corrected” the regexp on my copy and forgot about it.

    Then, when I got back to it to read the whole thing, I saw the regex and thought: oh, that will just match digits.

    -.-‘

    I really appreciate your patience and human skills, I would have hated anyone “correcting” my code like that.

    ATM site’s fallback: not that it’s important now, but what I meant is, if a number is unmatched by the regexp it will be ignored in the suggestions, so no great harm is done.

    As for our blawg, we’ll try to give it more love.

    Best,
    Cla.
