Before reading the post below: if you want to learn from more examples and sample code, I have authored a book on Beautiful Soup 4, and you can find more details here.
You also have a chance to win free copies of Getting Started with Beautiful Soup, by Packt Publishing. Please follow the instructions in this post.
Web scraping is one of the easiest data tasks you can take on. It is quite useful because, even if you don't have access to a website's database, you can still get the data out of the site by scraping it. For web data extraction we write a script that visits the pages and extracts the data you want, usually without any extra authentication, and with such scripts it is easy to gather a lot of data in little time.
I have always relied on Python for tasks like this, and here too there is a good third-party library: Beautiful Soup. The official site has good, clearly written documentation. For those who don't want to read through all of it and just want to try something with Beautiful Soup and Python, here is a simple script with an explanation.
Task: extract all U.S. university names and URLs from the University of Texas website in CSV (comma-separated values) format.
Dependencies: Python and Beautiful Soup
Script with explanation:
from BeautifulSoup import BeautifulSoup
import urllib2
We use urllib2 to open the URL. Before we proceed further, we should know one thing: web scraping is effective only if we can find the patterns a website uses to mark up its content. For example, if you view the source of the University of Texas page, you can see that all university names follow a common format, as shown in the screenshot below.
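In case the screenshot does not render, the pattern in question looks roughly like the fragment below. This is an invented illustration, not a verbatim copy of the page: the point is simply that each university name is the text of an anchor tag whose href is that university's URL, which is the kind of regularity a scraper can exploit.

```html
<!-- Hypothetical fragment illustrating the pattern on the listing page -->
<a href="http://www.acu.edu/">Abilene Christian University</a><br/>
<a href="http://www.adelphi.edu/">Adelphi University</a><br/>
```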
url = "http://www.utexas.edu/world/univ/alpha/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
Here we opened the University of Texas page using urllib2.urlopen(url) and created a BeautifulSoup object with soup = BeautifulSoup(page.read()). Now we can manipulate the webpage using the methods of the soup object.
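To finish the task, one way forward is to pull every anchor tag from the soup and write the name/URL pairs out as CSV. The sketch below is a minimal illustration using Python 3 and the modern bs4 package (urllib2 became urllib.request, and BeautifulSoup is now imported from bs4); the inline HTML sample is invented, since the real page's markup may differ, and in a real run you would pass page.read() to BeautifulSoup instead.

```python
# A minimal sketch, assuming each university is an <a> tag whose text is
# the name and whose href is the URL. Uses Python 3 + bs4; on Python 2
# the imports were `urllib2` and `from BeautifulSoup import BeautifulSoup`.
import csv
import io

from bs4 import BeautifulSoup

# Invented sample standing in for page.read() on the real listing page.
html = """
<a href="http://www.acu.edu/">Abilene Christian University</a><br/>
<a href="http://www.adelphi.edu/">Adelphi University</a><br/>
"""

soup = BeautifulSoup(html, "html.parser")

# Every anchor that has an href becomes one (name, url) row.
rows = [(a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)]

# Write the rows out as CSV text.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "url"])
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```

Writing to an io.StringIO buffer keeps the sketch self-contained; in practice you would open a real file with `open("universities.csv", "w", newline="")` and pass that to csv.writer instead.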