I’ve upgraded my atm-mi interface-for-means-of-transportation-finder script. Now you can not only ask for the best way to get from point A to point B in Milan, but also get the news about the lines you are advised to use (which sometimes change course for whatever reason). Click ‘Continue reading’ for more details, some thoughts (‘powerbrowsing is more than just scraping’), a link to another interesting script, anecdotes and fun screenshots.
See the script work here: http://oljapetrovic.com/atm/
And download the new source code here: http://oljapetrovic.com/atm/atm-mi_script_current.tar.gz
(0.2.0: Added lookup for news related to tram and subway lines; the news gets added to the page. 0.2.1: Minor fix, checking whether variables are set/empty – $url, $news_urls, $metro_lines)
Powerbrowsing is more than just scraping. It is about combining, mixing and matching information, using and combining data flows, and creating new ones. Like computing in general, it is about input, processing, and output, and you can put these steps together as if they were Lego. You can take different inputs, process them, use the output as input for something else, or output your data in different formats. As Davide Eynard shows in his new blog post “Perl Hacks: automatically get info about a movie from IMDB“, you can take a cryptic file name like The.Best.Movie.Ever(2011).SiLeNT.[dvdrip].md.xvid.whatever_else.avi, extract a movie title, and look it up on IMDb (the Internet Movie Database). Once you find it there you can scrape the info, and after that you can do any number of things: access this data from your movie center, or look up data on actors, and keep searching until something fun happens.
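To give an idea of that first step (file name in, movie title out), here is a minimal Python sketch. It is not Davide's Perl code, just one possible approach under a simple assumption: everything before the first four-digit year is the title. The function name `guess_title_and_year` is made up for this illustration.

```python
import re


def guess_title_and_year(filename):
    """Guess (title, year) from a release-style file name.

    Assumption for this sketch: the title is everything that comes
    before the first 4-digit year starting with 19 or 20.
    """
    # Drop a trailing file extension such as ".avi" or ".mkv".
    name = re.sub(r"\.[A-Za-z0-9]{2,4}$", "", filename)

    # Look for a standalone year like 1999 or 2011.
    year_match = re.search(r"(?<!\d)(?:19|20)\d{2}(?!\d)", name)
    if year_match:
        title_part = name[:year_match.start()]
        year = int(year_match.group())
    else:
        title_part, year = name, None

    # Turn dots, underscores and brackets into single spaces.
    title = re.sub(r"[._()\[\]\s]+", " ", title_part).strip()
    return title, year


print(guess_title_and_year(
    "The.Best.Movie.Ever(2011).SiLeNT.[dvdrip].md.xvid.whatever_else.avi"
))
```

The resulting title (and year, if one was found) is what you would then feed into a search on IMDb, which is the next Lego brick in the chain.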
Although I am mostly studying frameworks this week, I managed to find some time to update my atm-mi script, which serves as an interface for Milan’s public transportation website. It only does what you as a user would do with a browser: it enters two addresses and asks for a way to get from one to the other. It asks the question, gets the answer and then presents it to you in a simplified format.
In the first version of the script I was scraping just one web page, and this time I decided to combine different pages into one, adding any relevant news about the tram and subway lines that the atm-mi site is suggesting (because the path is calculated for the theoretical state of things, and the tram may be taking another route for any number of reasons).
This image illustrates two methods you might use to find out if there are modifications to your tram line. You can ask for a path from A to B first, and if the path uses a tram you get a link to the relevant news (for some reason you don’t get that with a subway line). Or you can ask for information about a single line and get that same link (still no luck with the subway, so for that I modified the URL to get the info; more about that later in this post).
So, the tram lines were easy. Ask the question, get the answer, see if the answer contains a link to a page of news, get the link, follow the link, extract the news, add it to the output.
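The "see if the answer contains a link to a page of news" step could be sketched like this in Python. This is not the code from my script, just an illustration of the idea; the sample HTML, the `NewsLinkFinder` class and the `find_news_urls` function are invented for the example, and I am assuming that news links can be recognised by the `InfoTraffico/Pagine/default.aspx` part of their address.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: every news link contains this path fragment.
NEWS_PATH = "/InfoTraffico/Pagine/default.aspx"


class NewsLinkFinder(HTMLParser):
    """Collects the href of every <a> tag that points to a news page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.news_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and NEWS_PATH in href:
            # Make relative links absolute and skip duplicates.
            url = urljoin(self.base_url, href)
            if url not in self.news_urls:
                self.news_urls.append(url)


def find_news_urls(html, base_url="http://www.atm-mi.it/"):
    finder = NewsLinkFinder(base_url)
    finder.feed(html)
    return finder.news_urls


# Made-up fragment standing in for a results page.
SAMPLE = """
<p>Take tram 14
  <a href="/it/ViaggiaConNoi/InfoTraffico/Pagine/default.aspx?l=14">news</a></p>
<p><a href="/it/other.aspx">something else</a></p>
"""

print(find_news_urls(SAMPLE))
```

Each URL found this way would then be fetched in turn, and the news extracted from that page is appended to the output.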
The URLs of the news pages have the format: http://www.atm-mi.it/it/ViaggiaConNoi/InfoTraffico/Pagine/default.aspx?l=X, where X is the tram number. For the subway lines, X is negative, so for subway line 1 we have ?l=-1, for subway line 2 we have ?l=-2, etc.
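As a small illustration of that URL scheme, here is a Python sketch that builds the news URL for a given line. The function name `news_url` and its `is_subway` flag are my own inventions for the example; only the URL format itself comes from the site.

```python
NEWS_BASE = (
    "http://www.atm-mi.it/it/ViaggiaConNoi/InfoTraffico/Pagine/default.aspx"
)


def news_url(line_number, is_subway=False):
    """Build the news page URL for a tram or subway line.

    Tram lines use their own number, subway lines the negated number.
    """
    if line_number <= 0:
        raise ValueError("line_number must be a positive integer")
    value = -line_number if is_subway else line_number
    return f"{NEWS_BASE}?l={value}"


print(news_url(14))                  # tram 14   -> ...default.aspx?l=14
print(news_url(1, is_subway=True))   # subway 1  -> ...default.aspx?l=-1
```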
(Yeah, I need to figure out how to stop Gimp from SOMETIMES blurring images on resize. Ghgh.)
So, when the line is a tram line, the link to the list of relevant news is provided to me with the results. For the subway lines I create a URL like that myself, and check it for news. It slows down the script (an extra HTTP Request/Response), but never mind. I’m still not sure if I cover all possibilities (‘metro leggera’, a small piece of subway that connects a hospital; ‘linee interurbane’, the lines that go outside Milan; …). That will take some testing.
And testing can reveal interesting things. For example, I was using my interface to figure out a way to get from some place to San Siro, and I was offered a select box which let me choose between ‘San Siro’ (the stadium’s old name and nickname) and ‘via san siro’ (the street), but when I chose San Siro, I kept getting the select box again. I tried the atm-mi site and that worked ok, no infinite select. But of course, the atm-mi site uses a POST request and I use a GET request, the same one they use when you e-mail a friend a URL to a page with the path description: with GET everything fits into a URL, while with POST part of the data is in the Request contents. In fact, if you try to send an e-mail explaining how to get to San Siro with their ‘e-mail this path to your friend’ service, you get the same problem. I noticed the same thing happens with ‘San Vittore’ (a prison) as well, that is, whenever more than one string contains the string you are searching for, even if one of them is a perfect match. I’ll have to study these mechanisms more, but not today.
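To make the GET/POST difference concrete, here is a minimal Python sketch using only the standard library. Nothing is actually sent over the network; the two requests are only built and inspected. The endpoint and the parameter names `from` and `to` are placeholders I made up, not the real atm-mi form fields.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint and hypothetical parameter names.
ENDPOINT = "http://example.com/path-finder"
params = {"from": "Piazza Duomo", "to": "San Siro"}

query = urlencode(params)  # "from=Piazza+Duomo&to=San+Siro"

# GET: all the data travels inside the URL itself.
get_request = Request(f"{ENDPOINT}?{query}")

# POST: the URL stays clean, the data goes into the request body.
post_request = Request(ENDPOINT, data=query.encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

This is why a GET request can be pasted into an e-mail as a single link, while a POST request cannot: the body is not part of the URL.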
Here’s an illustration of differences between GET and POST:
I finished 98% of the scripting tonight at about 1:30. I tried to test the subway line news gathering, but when I asked for paths that should require the subway, atm-mi kept telling me to go on foot. Nice of them to worry about my health and fitness, but walking 3 km didn’t sound reasonable. Then I remembered that the subway closes after 00:30, and that my script can only ask for the path at the moment when you use it; I haven’t yet implemented a way to set any day and hour you want. So I had to stop working on the script and go to sleep. I finished it this morning. Kind of funny.
Ok, that’s pretty much it for now. I know the script is far from perfect, so any suggestions are welcome. Every time I write code, powerbrowse, or scrape, I learn something, even from studying the interfaces I need to scrape. It was fun. See you next time. Have fun doing whatever you are going to be doing.