Before reading the post below: if you want to learn from more examples and sample code, I have authored a book on Beautiful Soup 4, and you can find the details here

Getting Started with Beautiful Soup

Update 1: 

You have a chance to win free copies of Getting Started with Beautiful Soup by Packt Publishing. Please follow the instructions in this post

Web scraping is one of the easiest ways to start extracting data from the web. It is quite useful because even if you don’t have access to a website’s database, you can still get the data out of its pages. For web data extraction we create a script that visits the website and extracts the data you want, without any extra authentication, and with such scripts it is easy to get a lot of data from these sites in little time.
I have always relied on Python for tasks like this, and here too there is a good third-party library: Beautiful Soup. The official site has good, clearly written documentation. For those who don’t want to read through all of it and just want to try something out with Beautiful Soup and Python, read this simple script along with its explanation.
Task: extract every U.S. university name and URL from the University of Texas website in CSV (comma-separated values) format.
Dependencies: Python and Beautiful Soup

Script with explanation:

from BeautifulSoup import BeautifulSoup
import urllib2


We have to use urllib2 to open the URL. Before we proceed, we should know one thing: web scraping is effective only if we can find the patterns a website uses to mark up its content. For example, on the University of Texas website, if you view the source of the page you can see that all university names share a common format, as shown in the screenshot below.

[Screenshot: View Source, patterns found in the University of Texas page]
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

Here we opened the University of Texas page using urllib2.urlopen(url) (where url holds the page’s address) and created a BeautifulSoup object with soup = BeautifulSoup(page). Now we can manipulate the web page using the methods of the soup object.
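The remaining steps, which the comments below quote piecemeal, collect the matching anchor tags with findAll and print each one as a url,name CSV row. Here is a self-contained sketch in Python 3 with bs4, run against a small inline snippet that mimics the repeating pattern (the institution class name is an assumption made for illustration, not taken from the actual site):

```python
from bs4 import BeautifulSoup

# A stand-in for the fetched page: the repeating pattern spotted in view-source.
html = """
<li><a class="institution" href="http://www.acu.edu">Abilene Christian University</a></li>
<li><a class="institution" href="http://www.adelphi.edu">Adelphi University</a></li>
"""

soup = BeautifulSoup(html, "html.parser")

# The shared class is the pattern; find_all turns it into a list of matching tags.
universities = soup.find_all("a", {"class": "institution"})
for eachuniversity in universities:
    # Each anchor's href attribute plus its text makes one CSV row.
    print(eachuniversity["href"] + "," + eachuniversity.string)
```

On the real page you would build soup from the downloaded page object instead of the inline string; the loop itself stays the same.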

37 thoughts on “Let’s scrape the page, Using Python Beautiful Soup”

  1. Good work buddy..
    very concise and great content for neophytes like me…
    although to import BeautifulSoup you may want to edit the command published on this web page..

    from bs4 import BeautifulSoup

  2. I get the following error after typing the page= statement. Please let me know the solution to the issue.

    >>> from BeautifulSoup import BeautifulSoup
    >>> import urllib2
    >>> url=""
    >>> page=urllib2.urlopen(url)

    Traceback (most recent call last):
    File "", line 1, in
    File "C:\Users\us67771\Python27\lib\", line 126, in urlopen
    return, data, timeout)
    File "C:\Users\us67771\Python27\lib\", line 400, in open
    response = self._open(req, data)
    File "C:\Users\us67771\Python27\lib\", line 418, in _open
    '_open', req)
    File "C:\Users\us67771\Python27\lib\", line 378, in _call_chain
    result = func(*args)
    File "C:\Users\us67771\Python27\lib\", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
    File "C:\Users\us67771\Python27\lib\", line 1177, in do_open
    raise URLError(err)

  3. The syntax error is likely due to the last line not being indented if you just copy and paste the above code.

    Needs to be:

    for eachuniversity in universities:
        print eachuniversity['href']+","+eachuniversity.string

  4. Encountered this error:

    UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position

    What is it?
    How do I avoid it?
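This error usually means a non-ASCII character (here the en dash, u'\u2013') is being printed to a console whose codec (often Windows cp1252, the 'charmap' codec) cannot represent it. One common workaround, sketched in Python 3, is to encode explicitly and choose how unencodable characters are handled:

```python
# The en dash from the error message; printing it to a cp1252 console can fail.
text = u"University of Texas \u2013 Austin"

# Encoding explicitly decides what happens to unrepresentable characters.
utf8_bytes = text.encode("utf-8")             # always succeeds; keeps the dash
safe_ascii = text.encode("ascii", "replace")  # the en dash becomes '?'

print(utf8_bytes)
print(safe_ascii.decode("ascii"))
```

Writing the scraped text to a file opened with an explicit UTF-8 encoding avoids the console codec entirely.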

  5. Thank you, I found this extremely useful and straightforward. For my code, I eventually added a try/except clause in there in case the page I was looking for didn’t exist:

    try:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page)
        # ...other code...
    except urllib2.HTTPError:
        # ...more code...

  6. Hello and thanks for the tutorial. When I essentially just copied and pasted your code into my IDE, it worked. Also I was able to follow along. However, when I tried to modify it for my own purposes I ran into trouble, maybe you can help? I’m trying to scrape all the comments from articles on various websites. So, taking this very website as an example:

    I noticed that the comments appear under the following tag:

    And that is about as far as I got, which isn’t very far! My code looked like this:

    from BeautifulSoup import BeautifulSoup
    import urllib2
    soup = BeautifulSoup(
    for eachcomment in replies:
    print eachcomment['span7']+","+eachcomment.string

    And it did nothing. Anyone want to give me any hints? Thanks!!

    • Hi Kristina,
      On this page the comments appear under a div tag with class "comment-content span7". Try replies=soup.findAll('div',{'class':'comment-content'})
      for eachcomment in replies:
          print eachcomment.
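The suggestion above can be sketched as a self-contained Python 3 / bs4 snippet (the sample markup is made up, modeled on the class names mentioned in the reply):

```python
from bs4 import BeautifulSoup

# A stand-in for the article page, using the class names mentioned above.
html = """
<div class="comment-content span7"><p>Good work buddy..</p></div>
<div class="comment-content span7"><p>Encountered this error:</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
# Matching on one class value is enough; bs4 matches any element whose
# class list contains it, so "comment-content span7" divs are found.
replies = soup.find_all("div", {"class": "comment-content"})
for eachcomment in replies:
    print(eachcomment.get_text().strip())
```

get_text() collects the text of nested tags too, which is handy when a comment div contains paragraphs and links rather than a single string.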

  7. I am getting the following error. I am a noob.

    print eachuniversity['href']+","+eachuniversity.string
    IndentationError: expected an indented block

  8. Pingback: How to extract the critical information from an html file with BeautifulSoup? | CopyQuery


  10. Great job! I converted the code over to Python 3.3 (and BeautifulSoup bs4):

    import urllib.request
    import urllib.parse
    from bs4 import BeautifulSoup
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page)
    print("soup.title ", soup.title)
    for school in universities:

  11. Pingback: Python Beautiful Soup 4 Example - Data extraction from website using Beautiful Soup4 |

  12. I enjoy what you guys tend to be up to. This type
    of clever work and coverage! Keep up the awesome work, guys; I’ve added you to my
    own blogroll.

  13. Nice article… recently I came across some information about two Python libraries for interacting with the web: 1. urllib2 and 2. requests. You can find more information in their documentation.

  14. I am getting the desired result but I need to know about writing the output data to a CSV or JSON file for future use.
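For that last step, Python's standard csv and json modules are enough. A minimal Python 3 sketch, assuming the scraped rows are (url, name) pairs (the file names here are made up):

```python
import csv
import json

# Assume these rows came out of the scraping loop.
rows = [
    ("http://www.acu.edu", "Abilene Christian University"),
    ("http://www.adelphi.edu", "Adelphi University"),
]

# CSV: newline="" prevents blank lines between rows on Windows.
with open("universities.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "name"])  # header row
    writer.writerows(rows)

# JSON: a list of objects is easy to reload later with json.load().
with open("universities.json", "w") as f:
    json.dump([{"url": u, "name": n} for u, n in rows], f, indent=2)
```

csv.writer also takes care of quoting, so commas inside a university name will not break the file the way a hand-rolled print loop can.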

  15. I have run the same program from a separate file, in the same manner, to scrape a website for video cards and devices.
    I want the video devices and cards to be output for the user, so I ran:

    videos=soup.findall('a',{'Video Cards':'Price'})

    But I get this error:

    Traceback (most recent call last):
    File "", line 1, in
    videos1=soup.findall('a',{'Video Card':'Price'})
    TypeError: 'NoneType' object is not callable

    Can someone tell me how to extract the device items, prices and specifications as a CSV file?
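A likely cause, offered as a guess: the method name is findAll (or find_all in bs4), not findall; Beautiful Soup treats an unknown lowercase name as a lookup for a tag of that name and hands back None, hence 'NoneType' object is not callable. Also, the second argument filters on HTML attributes, so {'Video Cards':'Price'} looks for an attribute literally named "Video Cards". A runnable bs4 sketch over made-up product markup:

```python
from bs4 import BeautifulSoup

# Made-up markup for a product listing; the class name is an assumption.
html = """
<a class="video-card" href="/gpu/alpha">Alpha GPU, $199</a>
<a class="video-card" href="/gpu/beta">Beta GPU, $299</a>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all (note the name) with a real attribute filter: class="video-card".
videos = soup.find_all("a", {"class": "video-card"})
for video in videos:
    print(video["href"] + "," + video.string)
```

To build the real filter, view the page source and copy whichever attribute the listing actually repeats, exactly as in the university example above.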

  16. I’m getting this error:

    Traceback (most recent call last):
    File "E:\Python\", line 18, in
    from BeautifulSoup import BeautifulSoup
    ImportError: No module named 'BeautifulSoup'

