The How To Decode A Web Page Two exercise was a follow-up to the extremely challenging prelude, How To Decode A Web Page. The purpose of the original (and the follow-up exercise) was to give people a chance to poke around at a close-to-real application of Python.
A few people submitted solutions to this exercise (one example is given at the bottom of the page), and many of them used things that I haven’t yet talked about in the exercises. I will not go into extreme detail about how to solve this problem - I will just present the important points for one way you could solve this problem. For a more detailed solution to a similar problem, I encourage you to look at the solution to the previous exercise (How To Decode A Web Page).
And without further ado, a plan for achieving the solution:
requests
library to load the HTML of the page into PythonBeautifulSoup
to process HTML and extract the text of the articleIn my opinion, the difficult part of this exercise is figuring out the elements on the page (whether that is through HTML or CSS selectors) that contain all the text of the article. It requires knowing or exploring a little bit about CSS selectors and reading documentation. Such is the way of working with web pages.
A few observations on that point:
p
elements that are inside div
elements of class body
that are inside div
elements that have classes parbase
and cn_text
. But it also seems that there are other things like hyperlinks and article comments also fall inside this structure.BeautifulSoup
, the first 7 elements that come out are links and the introduction, so we just want to ignore those.Given these observations, this solution:
prints to the screen the full text of the article.
The CSS selectors are complex, and it takes some practice to get comfortable with them. We’ll re-visit this later (perhaps much later), but for now I encourage you to poke around the internet and look at sources of websites, playing with BeautifulSoup
and figuring out how CSS selectors behave.
Here is an example of a user-submitted solution that is not completely automated, and that does not use the exact article specified. However, it does show you that you can start solving your own problems using the techniques I am talking about on this blog.