Any language that provides an HTTP client and an HTML parser can be used for web scraping, and Go is no exception. Since Go is compiled and statically typed, it lends itself well to writing efficient, fast and scalable scrapers. In this post we’ll build a simple web scraper in Go. Note that I didn’t say web crawler, because our scraper will only be going one level deep (maybe I’ll cover crawling in another post).
Web scraping is practically parsing the HTML output of a website and taking the parts you want to use for something. In theory, that’s a big part of how Google works as a search engine. It goes to every web page it can find and stores a copy locally.
For this tutorial, you should have Go installed and ready to go, as in, your $GOPATH set and the required compiler installed.
Parsing a page with goQuery
goQuery is pretty much like jQuery, just for Go. It gives you easy access to the HTML structure of a page and enables you to pick which elements you want to access by attribute or content. If you compare the functions, they are very close to jQuery, with .Text() for the text content of an element and .Attr() or .AttrOr() for attribute values.
In order to get started with goQuery, just run the following in your terminal:
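```
go get github.com/PuerkitoBio/goquery
```

If you’re on a newer Go version that uses modules, run this inside a project directory where you’ve already run go mod init, and the dependency will be recorded in your go.mod.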
Scraping Links of a Page with golang and goQuery
Now let’s create our test project; I did that with the following:
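```
# the path is only an example, pick whatever project name you like
mkdir -p $GOPATH/src/github.com/your-username/goquery-scraper
cd $GOPATH/src/github.com/your-username/goquery-scraper
```

The path and project name above are placeholders; any directory under $GOPATH/src works, and with Go modules you can put the project wherever you like.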
Now we can create the example files for the programs listed below. Usually you shouldn’t have multiple main() functions inside one directory, but we’ll make an exception, because we’re beginners, right?
List all Posts on Blog Page
The following program will list all articles on my blog’s front page, composed of their title and a link to the post.
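A minimal sketch of that program could look like this; the URL is a placeholder for the blog front page you want to scrape, and the #main article .entry-title selector matches the theme’s markup for post titles:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// placeholder URL: swap in the front page you actually want to scrape
	res, err := http.Get("https://example-blog.com/")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	// parse the response body into a goquery document
	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// one callback call per post title found on the front page
	doc.Find("#main article .entry-title").Each(func(index int, item *goquery.Selection) {
		title := item.Text()
		linkTag := item.Find("a")
		link, _ := linkTag.Attr("href")
		fmt.Printf("Post #%d: %s - %s\n", index, title, link)
	})
}
```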
Since we’re using .Each() we also get a numeric index, which starts at 0 and runs up to the number of elements matching the selector #main article .entry-title on the page. If you come from a language where functions can’t have multiple return values, look at this for a second: link, _ := linkTag.Attr("href"). If we defined a name instead of _ and called it something like present, we could test whether the attribute is actually set.
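Inside the .Each() callback from the listing above, that check could look like this:

```go
link, present := linkTag.Attr("href")
if !present {
	// this <a> tag has no href attribute at all, skip it
	return
}
fmt.Printf("Post #%d: %s - %s\n", index, title, link)
```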
The output of the above program should be something like the following:
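```
Post #0: My First Example Post - https://example-blog.com/my-first-example-post/
Post #1: Another Example Post - https://example-blog.com/another-example-post/
Post #2: A Third Example Post - https://example-blog.com/a-third-example-post/
```

The titles and links here are only placeholders; the real output is whatever currently sits on the front page you point the scraper at.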
Scrape all Links on the Page with Go
Scraping all links on a page doesn’t look much different, to be honest: we just use a more general selector, body a, and log each of the links. I’m getting the text content of the respective <a> tag by using linkText := linkTag.Text().
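A sketch of that version, again with a placeholder URL; the only real changes are the selector and what we print:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// placeholder URL: any page with links in its body will do
	res, err := http.Get("https://example-blog.com/")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// every <a> tag anywhere inside the body
	doc.Find("body a").Each(func(index int, linkTag *goquery.Selection) {
		linkText := linkTag.Text()
		link, _ := linkTag.Attr("href")
		fmt.Printf("Link #%d: '%s' - '%s'\n", index, linkText, link)
	})
}
```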
The output of the above code should be something like:
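```
Link #0: 'Home' - 'https://example-blog.com/'
Link #1: 'About' - 'https://example-blog.com/about/'
Link #2: 'My First Example Post' - 'https://example-blog.com/my-first-example-post/'
```

As before, these values are placeholders standing in for the page’s actual links.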
Now we know how to get all links from a page, including their link text! That would probably be pretty useful to SEO or analytics people, because it shows the context in which another website is linked and what kind of keywords it should be associated with.
Get Title and Meta Data with Golang scraping
Lastly, we should cover something we typically don’t select with jQuery very often: the page title and the meta description.
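A sketch of that program, once more with a placeholder URL; the description is read from the standard meta[name=description] tag:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// placeholder URL
	res, err := http.Get("https://example-blog.com/")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// the <title> tag is usually unique, so we can grab its contents directly
	pageTitle := doc.Find("title").Contents().Text()

	// read the content attribute of the description meta tag,
	// falling back to an empty string if it isn't set
	pageDescription := doc.Find("meta[name=description]").AttrOr("content", "")

	fmt.Printf("Page Title: '%s'\n", pageTitle)
	fmt.Printf("Page Description: '%s'\n", pageDescription)
}
```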
This should yield:
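```
Page Title: 'Example Blog'
Page Description: 'An example description for an example blog.'
```

Again, placeholder values; the real title and description come from whatever page you scrape.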
Now what’s a little bit different in the above example is that we’re using AttrOr(attributeName, fallbackValue) in order to be sure we have data at all. This is a kind of shorthand instead of writing a check for whether an attribute is found or not.
For the title we can just plainly select the Contents of a *Selection, because it’s typically the only tag of its kind on a website: pageTitle := doc.Find("title").Contents().Text().
Summary
Go is still pretty new to me, but it’s getting more and more familiar. Some things the compiler worries about make me re-think how I think of code in general, which is a great thing. In terms of libraries, goQuery is very awesome and I want to thank the author a lot for providing such a powerful parsing library that is so incredibly easy to use.
Do you do web scraping / crawling? What do you use it for? Did you like the post or do you have some suggestions? Let me know in the comments!
Thank you for reading! If you have any comments, additions or questions, please leave them in the form below! You can also tweet them at me.
If you want to read more like this, follow me on feedly or other RSS readers.