It took me a long time to come up with a solution that I was willing to share publicly. I feel like my time was split almost equally between determining how to slice up the page, generating the regular expressions that would be required to get the information I needed to build a feed and packaging the whole thing into something remotely elegant.
I’ll do my best to keep the feed up to date and working (if only because I’m using it myself), but realize that the slightest change on Google’s end can break the pipe. (Hopefully Google will soon provide a proper RSS feed, or at the very least not make frequent changes to the structure of the site’s HTML.)
For those interested in the pipe itself, please see the screenshot below.
Essentially, the pipe grabs the page, splits it according to the delimiter I gave it, copies each of the resulting data chunks into various elements of an RSS feed and then uses a regular expression to pull out the data relevant to the particular element.
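For anyone who would rather read code than a pipe diagram, the same cut, split and copy steps can be sketched roughly in Python. This is only an illustration, not what Pipes runs internally: the function names and the sample markup are made up for the example, the two markers are the fragments quoted in this post, and it assumes the markers themselves are dropped from the cut content.

```python
# Markers as quoted in this post; the sample page below is invented for illustration.
START = "g-section cx-search-result>"
END = "g-section cx-paginator>"

SAMPLE_PAGE = (
    "<html><body>"
    "<div class=g-section cx-search-result><h2>First</h2><p>One</p></div>"
    "<div class=g-section cx-search-result><h2>Second</h2><p>Two</p></div>"
    "<div class=g-section cx-paginator>1 2 3</div>"
    "</body></html>"
)


def cut(page, start, end):
    """Return the text between the first start marker and the next end marker."""
    begin = page.index(start) + len(start)
    stop = page.index(end, begin)
    return page[begin:stop]


def split_items(content, delimiter):
    """Split the cut content into one chunk per item, dropping empty chunks."""
    return [chunk for chunk in content.split(delimiter) if chunk.strip()]


def replicate(chunk):
    """Copy the same raw chunk into every feed element; regexes trim it later."""
    return {"title": chunk, "link": chunk, "description": chunk}


content = cut(SAMPLE_PAGE, START, END)
items = [replicate(chunk) for chunk in split_items(content, START)]
```

At this point every element of every item still holds the whole chunk; narrowing each one down is the job of the regular expressions.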
While it is not terribly clear in the above image, I cut the content from g-section cx-search-result> to g-section cx-paginator>, and used g-section cx-search-result> as the delimiter to get at the individual extensions. I then replicated the data chunks captured by the Fetch Page module across the description, link and title elements of the feed.
Finally, I built regular expressions for each of the elements I wanted in the feed, namely .*<p>(.*)</p>.* (for description), .*<h2>.*href=(.*).* (for link) and .*<h2>.*\n\W*(.*)\n.* (for title), each with the s option turned on so that the dot character also matches newlines.
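To see why the s option matters, here is a small Python sketch using the description pattern from above. It assumes the pipe replaces the whole match with the first captured group, which `re.sub` with `\1` approximates; the helper name and the sample chunk are invented for the example. Python's `re.DOTALL` flag plays the role of the s option.

```python
import re

# Description pattern as given in this post.
DESCRIPTION = r".*<p>(.*)</p>.*"

# Invented sample chunk whose paragraph spans two lines.
chunk = (
    "<h2><a href=/extensions/detail/abc>Example Extension</a></h2>\n"
    "<p>Does one thing\nand does it well.</p>\n"
)


def extract(pattern, text, dot_matches_newline=True):
    """Replace the matched text with its first captured group."""
    flags = re.DOTALL if dot_matches_newline else 0
    return re.sub(pattern, r"\1", text, count=1, flags=flags)


with_s = extract(DESCRIPTION, chunk)
without_s = extract(DESCRIPTION, chunk, dot_matches_newline=False)
```

With the flag on, `with_s` is just the paragraph text, line break included. With it off, the dot cannot cross the line break inside the paragraph, nothing matches, and `without_s` comes back as the untouched chunk.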