Why screenscraping is a bad idea
As a developer I see screenscraping as an absolute last resort or temporary option at best. If the data the customer needs is completely under their control then there is not a good, logical reason (that I can think of) for screenscraping data. The recent ditching of a service based on scraping del.icio.us brings this practice to the foreground.
Richard MacManus wonders if the homepage change by del.icio.us was intentional or not:
This plainly illustrates the danger for remix or mash-up service providers who rely on third party sites for their data. del.icio.us can not only giveth, it can taketh away. Now, it appears as if del.cio.us celebrated its second birthday by re-designing its homepage. I’m curious if they intended to take away the data that populicio.us needed to operate, or was it an unintended consequence?
Dare Obasanjo hits the proverbial nail:
Versioning APIs is hard enough, let alone trying to figure out how to version an HTML website so screen scrapers are not broken. Web 2.0 isn’t about screenscraping. Turning the Web into an online platform isn’t about legitimizing bad practices from the early days of the Web. Screen scraping needs to die a horrible death. Web APIs and Web feeds are the way of the future.
Amen, Dare.
Good programming practice includes routine reviews of the fault points in a program's design and implementation. Bad programming practice is throwing code up and never changing it because it just “works.” I’m not bashing the programmer in question, who might in fact be a really good programmer, but I am saying that relying on a data scheme that is not fault tolerant is a practice to avoid.
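To make the "fault point" idea concrete, here is a minimal Python sketch of a scraper that at least fails loudly when the page stops looking the way it expects. Everything in it is an assumption for illustration: the `class="item"` markup, the `ScrapeError` name, and the cached fallback are hypothetical and are not how Populicio.us or any real program works.

```python
import re


class ScrapeError(Exception):
    """Raised when the page no longer matches what the scraper expects."""


# Hypothetical markup: assumes every link looks like <a class="item" href="...">.
ITEM_LINK = re.compile(r'<a class="item" href="([^"]+)">')


def extract_links(html):
    """Return the item links, or raise instead of silently returning nothing."""
    links = ITEM_LINK.findall(html)
    if not links:
        # This is the fault point: an empty result usually means a redesign.
        raise ScrapeError("no items found; page layout may have changed")
    return links


def links_or_cached(html, cached):
    """Return (links, is_fresh), falling back to the last known good data."""
    try:
        return extract_links(html), True
    except ScrapeError:
        return cached, False
```

This does not make scraping safe. It only turns a silent failure into a visible one and buys time until the scraper is fixed or replaced with a feed or API.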
I wonder if the developer who was scraping the del.icio.us page ever contacted the del.icio.us team and asked for access to the information he needed to run his program? His comments do not make this answer clear:
I’m so sad for this but there is no other way. Del.icio.us doesn’t serve its homepage as it did and I’m not able to get all needed data to continue Populicio.us.
So does this mean he asked and was denied? Maybe he has asked by now, in light of the news coverage that has flared up, but it seems that this should have been the #1 priority as soon as the program was released to the public, and especially once the program started to gain some traction.
This seems to me like one of the terrible side effects of programs being released as long-running betas: developers are rolling out code that should never see the light of day for public use.
BTW, a lot of these Google Adsense stats notification programs are screenscraping the data. You’ll notice this by the fact that almost every time Google changes their design or adds new features that alter the HTML, the developers behind these programs have to scramble to make their programs function properly again. Some of these developers have commercial programs too. The Revenue Checker for Adsense program I recommended a while back scrapes the data and there has been some fallout related to this issue. The developer of that program released a new version today, in fact, so he is trying to stay up with this type of stuff, but I wonder how many of his customers realize that the program they bought may just end up worthless overnight? I do, and don’t mind the price of admission because I don’t want to take the time to work out the new screenscraping code every time Google adds something new. And yes, I do realize there are free plugins that do this (a Firefox plugin and even a new Google sidebar plugin). Interestingly enough, the Google Adsense sidebar plugin was broken when Google made changes recently too.
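Here is a rough Python sketch of why a redesign breaks a scraper while a feed keeps working. The two HTML snippets and the RSS snippet are made up for illustration (they are not real del.icio.us or Adsense markup), and `PostScraper`, `scrape_links`, and `feed_links` are hypothetical names.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Made-up "before" and "after" pages: same link, different markup.
OLD_PAGE = '<ul><li class="post"><a href="http://example.com/a">A</a></li></ul>'
NEW_PAGE = '<div class="entry"><h3><a href="http://example.com/a">A</a></h3></div>'

# Made-up RSS 2.0 feed carrying the same link.
FEED = """<rss version="2.0"><channel><title>t</title>
<item><title>A</title><link>http://example.com/a</link></item>
</channel></rss>"""


class PostScraper(HTMLParser):
    """Collects hrefs of <a> tags found inside <li class="post">."""

    def __init__(self):
        super().__init__()
        self.in_post = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "post":
            self.in_post = True
        elif tag == "a" and self.in_post and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_post = False


def scrape_links(html):
    """Scrape links by relying on the page's presentation markup."""
    scraper = PostScraper()
    scraper.feed(html)
    return scraper.links


def feed_links(xml_text):
    """Read links from the feed, whose element names do not change with the design."""
    root = ET.fromstring(xml_text)
    return [item.findtext("link") for item in root.iter("item")]
```

With these snippets, `scrape_links(OLD_PAGE)` finds the link, `scrape_links(NEW_PAGE)` comes back empty even though the link is still on the page, and `feed_links(FEED)` is unaffected by the redesign.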
The bottom line is that screenscraping is a very risky and potentially very time-wasting way to handle data in a program. In some cases screenscraping may violate the copyright of a website, which is a whole other can of worms to consider.
Developer and user beware.

[…] Then there is this whole mashup thing. Why are financial people (VCs primarily) getting all jazzed up about technology that more often than not violates the TOS of other sites when used for commercial purposes? Are they hoping the big companies will buy these crappy programs without consulting their gargantuan legal teams? It seems like a very dicey argument when Adsense is placed next to some mashup programs out there saying they are not being commercially used. Some of these programs are skipping the APIs altogether and scraping pages. Hey, if you won’t give us the data, we’ll just take it. That hasn’t ever been considered a stable method of data collection and is downright dishonest. […]
Pingback by Make You Go Hmm: » Web Pooh-Point … Oh — December 19, 2005 @ 3:04 pm PST
[…] This reminds me of the way some of these mashup sites are scraping pages without permission. It happened with Craigslist and some of the pushback suggested that just because it was on the web meant it was freely available to be used. No, just because it’s on the web doesn’t mean it’s ok to scrape it and use it elsewhere, for free, for profit or for any reason. Just because it is in a full text RSS feed doesn’t mean it is for the taking either. […]
Pingback by Make You Go Hmm: » YouTube great example that Wild, Wild, Web still alive — February 21, 2006 @ 10:40 am PST
[…] covered why screenscraping is bad from a developer perspective over two years ago. My feelings on scraping, if anything, have […]
Pingback by Mint’s unrefreshing contracted web scraping » Make You Go Hmm — September 28, 2007 @ 7:11 am PST
[…] is one of manners. Taking without permission on the web is bad netiquette. It’s like screenscraping or hotlinking without permission. There is a lot of great information on IRC and that’s what […]
Pingback by IRCSeeK protects privacy and respects others in IRC how? » Make You Go Hmm — November 30, 2007 @ 8:40 am PST