
How to scrape a website using Python

The web contains a vast amount of diverse, valuable data. Say you are looking for your dream apartment but don't want to go through hundreds of ads every day, or you are part of a company that wants to compare prices, generate potential leads, or check customer satisfaction. In all those cases, if you wanted to access this information you would either have to work within the format of the website or copy and paste it manually into a new document, a very tedious endeavor. This is where web scraping comes in, so welcome to your Scrapy journey!

In this paper we will talk about what web scraping is, why we should learn it, whether it is legal, how we should approach a website we wish to scrape, and what the fundamental computational techniques of web scraping are.

What is web scraping? Web scraping (also called "web harvesting", "web data extraction", or even "web data mining") can be defined as the construction of an agent to download, parse, and organize data from the web in an automated manner. (1)

The web contains lots of interesting data sources, but its current unstructured nature does not always make it easy to gather or export this data. (1) In an ideal world, web scraping would not be necessary, and each website would provide an API to share its data in a structured format. Indeed, some websites do provide APIs, but they are typically restricted in what data is available and how frequently it can be accessed. This means we cannot rely on APIs and need to learn web scraping techniques, (2) which are an important tool for a data scientist or a data engineer because they essentially turn the entire internet into one large database.

Website data is like free samples, and web scraping is the method of taking the free samples. (3) If the scraped data is used for personal purposes, in practice there is usually no problem. However, if the data is going to be republished, then the samples are no longer free. (2) Because a growing number of people depend on scraping for profit, content theft is becoming a major concern, affecting the competitiveness of businesses. (4) Several court cases around the world have helped establish what is permissible when scraping a website: when the scraped data constitutes facts, it can be republished; however, if the data is original, it most likely cannot be republished for copyright reasons. (2) Also, repeated scraping can use up bandwidth and lead to network crashes, and scraping can have the unintended (or intentional) consequence of slowing access to the scraped site and disrupting service to consumers. (5) In any case, when scraping the web remember that you are a guest in someone else's home and you need to play by the rules; websites will not hesitate to ban your IP address or proceed with legal action. This means that you should make download requests at a reasonable rate.

The web scraping process can be divided into three stages: setup, acquisition, and processing. In the setup stage, we understand what we want to do and find the sources to help us do it. In the acquisition stage, we read in the raw data and format it into a chosen, usable data structure. Finally, in the processing stage, we run the structured data through whatever analysis or processes are needed to achieve the desired goal. (6)

Let's say we completed the setup stage: we found the data we want to scrape and the website to scrape it from. It is also important to develop an understanding of the scale and structure of our target website, the technology it uses, and, when relevant, who owns it, since all of these might affect how we crawl the website. Information about a website can also be gathered from its robots.txt file, which is used to tell web crawlers what they can scrape and what they should not touch, and from its sitemap files. (2)

In this paper we will focus on the acquisition stage: retrieving the HTML data from the domain, parsing that data into the desired format, and storing the target information.

Retrieve the HTML data from the domain name

  • urllib
  • requests

One useful library is urllib, a standard Python library that contains functions for requesting data across the web, handling cookies, and even changing metadata such as headers and your user agent.

In IDLE's interactive window, type the following import and pass the urlopen() function a link from http://toscrape.com/, a site dedicated to practicing web scraping.
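
The exact page is not shown in the original screenshot, so this minimal sketch assumes the random-quote page on toscrape.com:

    from urllib.request import urlopen

    # Open a connection and grab the page; urlopen() returns an HTTPResponse object.
    page = urlopen("http://quotes.toscrape.com/random")
    html_bytes = page.read()   # HTTPResponse.read() returns a sequence of bytes
    print(html_bytes[:100])    # peek at the first bytes of the raw HTML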


The urlopen() function opens a connection and grabs the page, returning an HTTPResponse object. We then read the HTML data from the page with HTTPResponse.read(), which returns a sequence of bytes.

Another well-known library is requests, which gives the same functionality with fewer lines of code; it is simple and efficient.
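
A sketch of the same request with requests (the same example page is assumed; requests is not part of the standard library, so it must be installed first, e.g. with pip install requests):

    import requests

    # Fetch the same page; .text gives the response body already decoded to a string.
    response = requests.get("http://quotes.toscrape.com/random")
    html = response.text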

Parse that data in the desired format

Basic methods

There are many ways to extract information from a web page's HTML; we will talk about:

  • String methods
  • Regular expressions

Using string methods basically means using the built-in functionality of strings to look for a relevant tag in the HTML and extract it from the web page. For instance, we can use the .find() or .index() methods, both of which return the index of the first occurrence of a substring. In our case, we need to search through the text of the HTML for the relevant tag and extract the text of the web page. Make sure you write the tag according to the notation used on the website (CSS or XPath, etc.); in the following example we used a tag in CSS notation.


In our example we fetched a random quote and we'd like to print it. Remember that until now we have only extracted the HTML, but it is not yet a string, so first we decode the HTML to a string; now it is easier for humans to read.

To find the quote's starting and end index we need to decide on the text we are going to search for: the tag. For that purpose we searched through the HTML. The quote belongs to a tag that also comes with attributes, so we define start_tag to include them all. By the same reasoning, since the same tag appears many times in the HTML, to find the end of the quote we search for the closing tag together with a quotation mark (").

When computing the quote's starting index we take into account the length of the start tag itself, and for the quote's end index we remember to include the quotation mark (").

Finally, you can extract the quote by slicing the HTML string.
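
Putting the pieces together, a minimal sketch of the string-method approach, assuming the quote sits inside a <span class="text" ...> tag as on quotes.toscrape.com (the exact tags in the original example may differ):

    from urllib.request import urlopen

    html_bytes = urlopen("http://quotes.toscrape.com/random").read()
    html = html_bytes.decode("utf-8")      # decode the bytes into a string

    # The opening tag, including its attributes, and the closing tag.
    start_tag = '<span class="text" itemprop="text">'
    end_tag = '</span>'

    # Skip past the start tag itself, then find the first closing tag after it.
    start_index = html.find(start_tag) + len(start_tag)
    end_index = html.find(end_tag, start_index)

    quote = html[start_index:end_index]    # slice the HTML string
    print(quote)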

We can see that there is still a bit of HTML mixed in with the quote. Real-world HTML can be much more complicated, less predictable, and even unclean (as in our case). Moreover, what if we have more than one quote? That is why we'll introduce you to more reliable ways to extract text from HTML.

Regular expressions (regex), available in the standard Python library as the re module, are strings containing a combination of normal characters and special metacharacters that describe patterns used to find text or positions within a text. More information can be found at https://docs.python.org/3.8/howto/regex.html.

Some of these patterns look a bit strange because they contain both the content we want to match and special characters that change how the pattern is interpreted. Regular expressions are useful for parsing string information.

An example of a simple regex: r'st\d\s\w'

st – normal characters, matched literally

\d, \s, \w – metacharacters that represent types of characters:

\d - digit, \D - non-digit, \w - word character, \W - non-word character, \s - whitespace, \S - non-whitespace

We'll start with a few easy examples. We'll use the .findall() function, which returns a list of all matches of the given regex pattern. If no match is found, then .findall() returns an empty list.
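
A couple of quick illustrations (the example strings here are made up for this sketch):

    import re

    # All matches of the pattern are returned as a list.
    print(re.findall(r'st\d\s\w', 'test1 a best2 b'))   # ['st1 a', 'st2 b']

    # If nothing matches, an empty list is returned.
    print(re.findall(r'\d+', 'no digits here'))          # []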

Now we want to use this on our HTML data and find all the links on the page.

+ – matches the preceding element once or more times
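
A rough first attempt might look like this (the greedy pattern is an assumption standing in for the original figure):

    import re
    import requests

    html = requests.get("http://quotes.toscrape.com/").text

    # Greedily match from href= to the end of the line: this finds the links,
    # but drags a lot of surrounding HTML along with them.
    links = re.findall(r'href=.+', html)
    print(links[:5])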

As we can see, we got the links, but along with them a lot of unnecessary HTML code; let's try to fix that.

text[\'"s] – says to match "text", possibly followed by a ', ", or s; loose patterns like this are often used because it is hard to say how messy the HTML you are looking at will be.

? – the preceding element appears zero times or once

[^\'" >]+ – says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that mark the end of the URL. It lets us avoid trying to write a regex that reliably matches a full URL, which can be quite complicated.

Enclosing part of a regex in parentheses "()" makes it a "group", which means that part is split out and returned to us separately.
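
A sketch of the refined pattern, reusing the html string from the previous example (matching on href= is an assumption about what the original pattern targeted):

    # Optional quote after href=, then capture everything up to a character
    # that ends the URL; only the captured group is returned by findall().
    links = re.findall(r'href=[\'"]?([^\'" >]+)', html)
    print(links[:5])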

Regular expressions are a powerful tool when used correctly, and we have only scratched the surface; I encourage you to learn more about them. Although regular expressions are great for pattern matching in general, sometimes it's easier to use an HTML parser that's explicitly designed for parsing HTML pages.

HTML Parser

There are many Python tools written for this purpose, but two stand out:

  • Beautiful Soup
  • Scrapy

The Beautiful Soup library helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures. Because Beautiful Soup is not a default Python library, it must be installed.

Go to the command prompt you are using and run the following command in your terminal:
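
    python -m pip install beautifulsoup4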

So now, as before, we are going to retrieve the HTML data from the domain (I used the requests library) and then create a Beautiful Soup object called html_soup. For that I used the BeautifulSoup() constructor, which receives two arguments: the HTML to be parsed and the parser we want to use. In this case I used "html.parser", Python's built-in HTML parser.
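
A minimal sketch of that setup; books.toscrape.com is assumed as the example page here, since the later image and side-category examples match that site:

    import requests
    from bs4 import BeautifulSoup

    url = "http://books.toscrape.com/"          # assumed example page
    response = requests.get(url)

    # Two arguments: the HTML to parse and the parser to use.
    html_soup = BeautifulSoup(response.text, "html.parser")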


Beautiful Soup offers the ability to call any tag in the HTML by name. For example, let's try the title tag:
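
    # Calling a tag by its name returns the first occurrence of that tag.
    print(html_soup.title)        # the whole <title> element
    print(html_soup.title.text)   # just the text inside it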

We can also choose a tag by referring to the hierarchy of tags, and Beautiful Soup will show us the first occurrence of that hierarchy.
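
For example (the exact hierarchy used in the original is not shown, so head and title are used here as an illustration):

    # The first <title> inside the first <head> tag.
    print(html_soup.head.title)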

Beautiful Soup also offers methods; the common ones are .find(), .find_all(), and .get_text(). The .get_text() method can be used to extract all the text from the document, automatically removing all HTML tags but leaving blank lines behind; to remove the blank lines I used the .replace() method.
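
A small sketch of that step; the exact replacement used originally is an assumption:

    # Extract all the visible text, then collapse the blank lines left behind
    # by the removed tags.
    text = html_soup.get_text()
    text = text.replace("\n\n", "\n")
    print(text[:500])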

The .find_all() method returns a list of all instances of a particular tag. Given that our HTML now contains images, we will look for the img tag.
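
For example:

    # A list of every <img> element on the page.
    images = html_soup.find_all("img")
    print(len(images))
    print(images[0])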

We can also save the images to an object, which will still be a Beautiful Soup element that can now be called through its own properties.
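
For instance (src and alt are the usual image attributes; this is a sketch):

    first_image = images[0]          # still a Beautiful Soup Tag object
    print(first_image["src"])        # the image's source URL
    print(first_image["alt"])        # its alternative text, if present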

Now let's say I want to find a tag but the HTML is too messy to read; in that case I can use the .prettify() method to print it with readable indentation. We can also use the .find_all() method with a filtering characteristic on the desired tag, as in the example below.

Let's say I want all the side categories on the website. I used the .prettify() method and found that they sit under a ul tag with the class 'nav nav-list'. I then used .find_all(), which created a list object containing all occurrences of the class 'nav nav-list'; because there is only one, I chose the first item in the list. But under the ul tag the actual categories are nested within several other tags, so we used the .find_all() method again.
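
A sketch of those steps, assuming the sidebar markup on books.toscrape.com:

    # Print the document with readable indentation to inspect its structure.
    print(html_soup.prettify()[:1000])

    # The side categories sit under a <ul> tag with the class 'nav nav-list';
    # there is only one such element, so we take the first item of the list.
    side_categories = html_soup.find_all("ul", class_="nav nav-list")[0]

    # The actual category names are nested inside further tags.
    for link in side_categories.find_all("a"):
        print(link.get_text(strip=True))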

When scraping data from websites with Python, you're often interested in specific parts of the page. By looking through the HTML, you can identify tags and attributes that you can use to extract the data. That way, instead of building complicated solutions with regular expressions, Beautiful Soup allows you to directly access the tag you are interested in and extract the data.

The Scrapy library is a web framework for scraping data from various sources in a robust and efficient manner. It allows us to cascade operations that clean, form, and enrich data and store them in databases, while enjoying very low degradation in performance. Because Scrapy is also not a default Python library, it must be installed.

Go to the command prompt you are using and run the following command in your terminal:
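
    python -m pip install scrapy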

Again, we are going to retrieve the HTML data from the domain by using the requests library. The first Scrapy element that we'll talk about is the Selector, the Scrapy object used to select portions of the HTML using XPath or CSS. The sel Selector first wraps the entire HTML, and queries on it return a SelectorList of Selector objects.

Each object can be queried using XPath or CSS notation, for example by selecting all of the HTML with a wildcard expression, as sketched below.
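
A minimal sketch (the variable name sel follows the text above; the URL is an assumption):

    import requests
    from scrapy import Selector

    html = requests.get("http://books.toscrape.com/").text

    # Wrap the whole HTML document in a Selector.
    sel = Selector(text=html)

    # Queries return a SelectorList of Selector objects.
    print(sel.xpath("//*"))                            # XPath notation
    print(sel.css("img::attr(src)").extract()[:5])     # CSS notation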

Unlike the tools we've seen before, Scrapy allows us to define a "spider" that crawls through the web on multiple pages and scrapes each of those pages automatically.

First, we import the necessary modules.
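
A sketch of the imports assumed for this walk-through (the original listing is not shown):

    import csv

    import scrapy
    from scrapy.crawler import CrawlerProcess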

Next, we define the code for the actual spider, which tells Scrapy which websites to scrape and how. The code comes in the form of a class that groups together the variables and methods we need to scrape the web.

The start_requests() method tells the spider which site or sites we want to scrape and where to send the content from the site to be parsed (see the callback argument). In this case I gave only one URL, but you can most definitely use more. As you can see, Scrapy doesn't need an additional library to retrieve the HTML data from the URL (see scrapy.Request).

The parse() method parses the data; here we implement all the knowledge we gathered so far and tell the spider which data from the HTML to parse and how. Here I introduce another important Scrapy element, the Response. Response has all the tools that Selector has, but the Response object also keeps track of the URL the HTML code was loaded from and helps us move from site to site, so that we can "crawl" the web while scraping. In the parse() method we can also store the data; as shown here, the image sources are saved to a CSV file. In the same way we could also follow the links on the page and crawl them too; I leave that to you to try.
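
A minimal sketch of such a spider; the class name, target URL, and images.csv output file are illustrative assumptions, not the original code:

    class ImageSpider(scrapy.Spider):
        name = "image_spider"

        def start_requests(self):
            # The site(s) we want to scrape; the parsed response is sent to
            # self.parse via the callback argument.
            urls = ["http://books.toscrape.com/"]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Response supports the same css()/xpath() queries as Selector.
            image_urls = response.css("img::attr(src)").extract()

            # Store the scraped data, here in a CSV file.
            with open("images.csv", "w", newline="") as f:
                writer = csv.writer(f)
                for image_url in image_urls:
                    writer.writerow([image_url])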

Lastly, we need to initiate the crawl process.
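
A sketch of that step, using the CrawlerProcess imported above:

    # Initiate the crawl process and run the spider defined above.
    process = CrawlerProcess()
    process.crawl(ImageSpider)
    process.start()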

As we have seen throughout this paper, there are many popular scrapers, but you must choose the best scraping tool to mine your data effectively. So one would ask: Scrapy or Beautiful Soup? First, Beautiful Soup only parses and extracts data from HTML files, while Scrapy downloads, processes, and saves data. Scrapy is very good at automatically following links in a site, no matter what the format of those links is, while Beautiful Soup needs an additional content downloader such as requests to download those HTML files. In this sense, Beautiful Soup is a content parser, while Scrapy is a full web spider and scraper.

Conclusion

In this paper we learned how to request a web page using Python's built-in urllib module and the requests library. We learned about parsing methods, both basic ones like string methods and regular expressions and more advanced ones like the HTML parsers Beautiful Soup and Scrapy.

Although it is possible to extract and parse data using standard Python libraries, there are many tools that can make the process easier and more efficient.

Remember, when doing web scraping you are looking through someone else's home. Always check the website's Terms of Use before you start scraping, and make your requests in a timely manner.

Writing automated web scraping programs is fun and challenging, and there is no shortage of content for you to experiment on. I encourage you to try it; go be magicians!

Sources

(1) Seppe vanden Broucke and Bart Baesens, Practical Web Scraping for Data Science: Best Practices and Examples with Python, 1st ed., Apress, Berkeley, CA, 2018.

(2) Richard Lawson, Web Scraping with Python, Packt Publishing, UK, 2015.

(3) Brian Kladko, "Screen Scrapers: Web Sites Fight to Stop Theft of Free Data," N.J. Rec., Apr. 3, 2005, available at 2005 WLNR 26680149.

(4) Christin McMeley et al., "Supreme Court Grants Cert in Spokeo v. Robins," Privacy & Security Law Blog (Apr. 27, 2015), http://www.privsecblog.com/2015/04/articles/marketing-and-consumer-privacy/supremecourt-grants-cert-in-spokeo-v-robins/.

(5) Philip H. Liu and Mark Edward Davis, "Web Scraping - Limits on Free Samples," Landslide, Vol. 8, Issue 2, American Bar Association, November-December 2015.

(6) DataCamp, "Web Scraping in Python," 2020, https://learn.datacamp.com/courses/web-scraping-with-python.

(7) Ryan Mitchell, Web Scraping with Python: Collecting More Data from the Modern Web, 2nd ed., O'Reilly Media, 2018.

(8) Rakesh Vidya Chandra and Bala Subrahmanyam Varanasi, Python Requests Essentials, Packt Publishing, 2015.

(9) Dimitrios Kouzis-Loukas, Learning Scrapy, Packt Publishing, 2016.