Automating Away Repetitive Web Tasks
Do you have menial or repetitive tasks you find yourself performing manually within a web browser, over and over again? Maybe your job requires you to search for or message relevant people on LinkedIn? Maybe you run a publishing company that wastes countless hours uploading your ebooks and their metadata to different services like Amazon and Pubit? If you were given an intern who could follow instructions perfectly, would you be able to describe your exact process for completing these repetitive tasks? If so, this tutorial walks you through how you can go about automating such workflows.
Webdriving is the act of having a computer perform steps and actions within a web browser on your behalf, such as typing keys, performing searches, submitting forms, and clicking links/buttons. It is advantageous for automating menial, repetitive tasks (kind of like what Mechanical Turk [mturk] is used for) and also in software development, where it lets programmers write tests and catch irregularities when their web application is run on different browsers (e.g. Firefox, Chrome, Safari, Opera, etc.). In some organizations, hundreds of people independently make contributions to the code which runs a web application. Can you imagine comprehensively testing every web page feature using every browser, every time one of these changes occurs?
Writing Webdrivers isn't "free", as it takes a significant time investment to formalize the steps of your routine and translate your manual workflow into steps the computer can understand. For the non-programmer, expect your first Selenium script to take you about a week (2h a day for 5 days). Most of your time will be spent installing, setting up, and learning the basics, reading the included resources, and above all, trying to map/reconcile your visual interpretation of the elements you see within a webpage to the browser's underlying/internal identifiers which it understands. More on this in Render v. Markup.
Webdriving excels when your workflow is:
Maintenance & Execution Costs
In addition to programming costs, there's also the cost of execution and maintenance. When we as humans run into exceptions in our workflow (e.g. if LinkedIn updates their website and changes the location of a button), we can adapt. Your computer program will break and halt its work. If you aren't a good manager (i.e. you don't check your script often), you can lose significant time and progress.
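One way to limit the damage when a script halts is to record progress as it goes, so a rerun picks up where the last one stopped. The following is a minimal sketch in plain Python, not tied to Selenium. The names `run_with_checkpoint` and `flaky_upload` and the `progress.json` file are made up for illustration, and `flaky_upload` only simulates a page whose button has moved.

```python
import json
import os
import tempfile


def run_with_checkpoint(items, do_step, checkpoint_path):
    """Process items in order, remembering which ones already succeeded."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)  # progress saved by an earlier run

    for item in items:
        if item in done:
            continue  # finished in an earlier run, skip it
        try:
            do_step(item)
        except Exception as exc:
            # Stop here and report, rather than pressing on blindly.
            print(f"Stopped at {item!r}: {exc}")
            break
        done.append(item)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # save after every successful step
    return done


def flaky_upload(title):
    # Hypothetical step: pretend the website changed before the third book.
    if title == "book-3":
        raise RuntimeError("button not found")


path = os.path.join(tempfile.mkdtemp(), "progress.json")
finished = run_with_checkpoint(["book-1", "book-2", "book-3"], flaky_upload, path)
print(finished)  # ['book-1', 'book-2']
```

Once the broken step is fixed, running the same call again skips the two finished books and only retries the third.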
Webdriving is very rarely a silver bullet, especially since many workflows require the intervention of an expert. Some tasks have very repetitive steps which can be automated in the background, but occasionally the workflow has specific steps which require completion by a human expert. For example, with a LinkedIn webdriver which finds relevant people, we may want to pause the webdriving process to write each person a personal message. This pausing disrupts the whole automation process, requires the human to become involved, and eliminates most of the advantage of using a webdriver.
One of the biggest components of designing effective automation pipelines is getting around this type of problem. Optimizing such scenarios is one of the biggest opportunities for gains. We won't be going too deep into this now, but you can imagine solving this problem by templating or batching.
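To make templating and batching a little more concrete, here is a small sketch using only Python's standard library. It assumes the webdriver has already collected a batch of profiles in an unattended run. The profile records and the message wording are invented for illustration.

```python
from string import Template

# A message template: the $placeholders are filled in per person.
MESSAGE = Template(
    "Hi $name, I saw your work on $topic at $company and would love to connect."
)

# Hypothetical records a webdriver might have gathered in one batch.
profiles = [
    {"name": "Ada", "topic": "compilers", "company": "Example Corp"},
    {"name": "Grace", "topic": "databases", "company": "Sample LLC"},
]

# Batching: draft every message in one pass, so the human expert can review
# and personalize them all in a single sitting instead of being interrupted
# once per profile.
drafts = [MESSAGE.substitute(profile) for profile in profiles]

for draft in drafts:
    print(draft)
```

The human step still exists, but it happens once per batch rather than in the middle of every automated run.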
A popular tool for "Webdriving" is Selenium. Selenium is a toolkit which you can import and program using languages such as Java and Python. In my opinion, the Python interface (the set of functions and the conventions for calling these functions) for Selenium is much simpler for the first-time programmer, in installing and setting up the tools, in writing the code, and in executing the code.
First Step: A Visual (Non Programming) Example
Before getting started with Webdriver programming, there's a neat way to use Selenium interactively through Firefox, without requiring direct programming: http://www.seleniumhq.org/projects/ide/ (see the download link to install and try the plugin). You can watch a short video explaining how to use the graphical interface to record your browser actions here: https://www.youtube.com/watch?v=gsHyDIyA3dg
At this point you should be familiar, at a high level, with the concept of recording your browser actions, but many of the features and specifics (such as addressing elements) may still leave you a bit confused. This will be addressed in Render v. Markup.
Usage + Example Program
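As a rough idea of what a small Selenium program can look like, here is a minimal sketch that assumes Selenium 4 for Python (`pip install selenium`) and a local Firefox install. The function names `search` and `main` are made up for illustration. It also assumes that the search box on google.com has the name attribute `q`, which may change or be hidden behind a consent page.

```python
def search(driver, query):
    """Type `query` into a search box and submit the form."""
    # Step 1: navigate to the page, as if typing the URL into the address bar.
    driver.get("https://www.google.com")
    # Step 2: find the search box. "name" is the same locator strategy as
    # Selenium's By.NAME; "q" is assumed to be the box's name on this page.
    box = driver.find_element("name", "q")
    # Step 3: type the query, then submit the form the box belongs to.
    box.send_keys(query)
    box.submit()
    # The new page may still be loading here; real scripts usually wait first.
    return driver.title


def main():
    # Imported here so the rest of the file loads without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # opens a real Firefox window
    try:
        print(search(driver, "selenium webdriver"))
    finally:
        driver.quit()  # always close the browser, even after an error
```

Calling `main()` opens the browser and performs the steps. Keeping the steps in `search` separate from the browser setup makes it easy to reuse them with a different browser later.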
Referencing Webpage (DOM) Elements
The power of Selenium is being able to programmatically address different browser elements. To do this, we need to understand the language the browser uses, learn how the browser addresses elements, and see how this differs from how we talk about browser elements.
Render v. Markup
Have you ever tried to give directions using visual landmarks to a friend who only knows places by name? Perhaps you forget the exact address of a location, but you know it's next to a red coffee shop with a big willow tree out front. This means nothing to your friend; they need the coffee shop's name. Webdriving is a comparably frustrating task.
One thing you'll notice (and the hardest part about webdriving using Selenium) is that the browser understands a web page and its structure very differently than a human does. While different elements of a webpage (such as a hypertext link or a button) may appear identical to us humans, or visually appear to be in a certain logical order, the browser may have an entirely different internal representation of what the page "looks like". The actual structure and organization of a webpage is not how it visually appears to the user (this is called rendering), but how the web page's HTML elements are marked up.
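To see the gap between rendering and markup, consider two elements that would both show up on screen as the word "Next". The sketch below uses Python's built-in `html.parser` (not Selenium) to list what the markup actually contains. The snippet of HTML and the `TagLister` class are invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup: both elements render as the word "Next".
PAGE = '<a href="/page/2" class="nav">Next</a> <button id="next-btn">Next</button>'


class TagLister(HTMLParser):
    """Collect each opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(PAGE)

for tag, attrs in lister.seen:
    print(tag, attrs)
# a {'href': '/page/2', 'class': 'nav'}
# button {'id': 'next-btn'}
```

To a human the two look alike, but in the markup one is a link with a class and the other is a button with an ID, and those details are what a webdriver has to work with.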
What a webpage is
Without diving in too deep: when you (a user) visit a webpage in your browser (say, by navigating to the URL google.com), what you see is your browser's rendering of an HTML file sent by the server you requested. In this case, Google's server returns an HTML file for google.com to your browser, which interprets this HTML markup and renders it according to its own rules for HTML.
Viewing a Webpage's Source
By default, we see the rendered version of the resources (web pages) we request. However, you can view the actual underlying HTML markup that your browser has received from the server by viewing the webpage's source (Ctrl + U) or, as most people prefer, by using the browser's wonderful element inspector developer tool (Chrome and Firefox both have comparable inspectors).
Tag Names, Element IDs, CSS Selectors, and XPath
Within the source, you will see a bunch of HTML tags. These tags are all a part of the Document Object Model (collectively called the DOM). Tags within the DOM can be referenced in various ways. Just as humans have names, social security numbers, parents or children, telephone numbers, and email addresses (each a way of being referenced), every HTML tag within the DOM can be referenced in several ways.
A browser refers to elements by their tag's ID (if it exists):
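As a small illustration of looking an element up by ID, the sketch below uses Python's built-in `xml.etree.ElementTree` as a stand-in for the browser, so it runs without Selenium. The markup and the ID `search-box` are invented. In Selenium 4's Python bindings the corresponding call is `driver.find_element(By.ID, "search-box")`, shown here as a comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the ids are made up for illustration.
PAGE = """
<html><body>
  <form>
    <input id="search-box" type="text" name="q" />
    <input id="submit-button" type="submit" value="Search" />
  </form>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# An id is meant to be unique within a page, so we expect a single match.
# Selenium 4 equivalent: driver.find_element(By.ID, "search-box")
box = root.find(".//*[@id='search-box']")
print(box.tag, box.attrib["name"])  # input q
```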
By a tag name:
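A tag name usually matches many elements at once. The sketch below collects every `<a>` (link) tag in a made-up snippet, again using Python's built-in `xml.etree.ElementTree` in place of a real browser. With Selenium 4 for Python the corresponding call is `driver.find_elements(By.TAG_NAME, "a")`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with two links, one nested inside a paragraph.
PAGE = """<html><body>
  <a href="/home">Home</a>
  <p>Some text with <a href="/about">a link</a> inside.</p>
</body></html>"""

root = ET.fromstring(PAGE)

# Find every <a> element anywhere in the document, in document order.
# Selenium 4 equivalent: driver.find_elements(By.TAG_NAME, "a")
links = root.findall(".//a")
hrefs = [link.attrib["href"] for link in links]
print(hrefs)  # ['/home', '/about']
```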
By a class name:
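Classes group similar elements, such as every row in a list of search results. The sketch below picks out the elements with the invented class `result`, using Python's built-in `xml.etree.ElementTree` rather than a browser. One caveat of this stand-in: it compares the whole `class` attribute, whereas Selenium's `driver.find_elements(By.CLASS_NAME, "result")` also matches elements that carry additional classes.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the class names are made up for illustration.
PAGE = """<html><body>
  <div class="result">First</div>
  <div class="result">Second</div>
  <div class="sidebar">Ads</div>
</body></html>"""

root = ET.fromstring(PAGE)

# Select any element whose class attribute is exactly "result".
# Selenium 4 equivalent: driver.find_elements(By.CLASS_NAME, "result")
results = root.findall(".//*[@class='result']")
texts = [element.text for element in results]
print(texts)  # ['First', 'Second']
```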
By its XPath (the location of this element relative to others within the document / the page's DOM):
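An XPath reads like a set of directions through the DOM. The sketch below walks a made-up document to "the second paragraph inside the second div". It uses Python's built-in `xml.etree.ElementTree`, which supports only a limited subset of XPath and starts its paths from the root element. In Selenium 4 for Python the same element could be addressed with `driver.find_element(By.XPATH, "/html/body/div[2]/p[2]")`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with two <div> blocks.
PAGE = """<html><body>
  <div><p>Intro</p></div>
  <div>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Read the path left to right: from <html> go into <body>, take the
# 2nd <div>, then the 2nd <p> inside it. Positions start at 1, not 0.
# Selenium 4 equivalent: driver.find_element(By.XPATH, "/html/body/div[2]/p[2]")
target = root.find("./body/div[2]/p[2]")
print(target.text)  # Second paragraph
```

Notice how fragile this is: if the site adds one more div above, the same path points at a different element.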
Using IDs, tag names, and class names is generally more convenient than using XPaths -- the following explains effective ways of determining the XPaths of elements: