Question

Scraping URLs from Google Search Results

  • 14 August 2024
  • 2 replies
  • 32 views

I’m trying to build a bot in A360 that searches a string in Google and clicks on each of the links that are returned. The issue is that Google randomises the path each time, so it’s almost impossible to loop through the links using a counter in the DOMX path. I also looked at scraping the URLs from the page source, but it’s all embedded in JS. Any ideas?

 

The DOMX path is working consistently for me on Google searches. Try capturing the entire box at the top of a search result.

That has a DOMX path of: //div[@id='rso']/div[3]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]

The first div index increments with each result, so I’ve inserted a variable there: //div[@id='rso']/div[$nSearchResultsRow.Number:toString$]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]

Make sure the only things you are using for the object properties are the HTML Tag, the DOMX Path, and maybe the HTML HasFrame.

I’ve placed the recorder action in a loop that runs 5 times. I start with nSearchResultsRow equal to 3, since that seems to be the first row, and then increment it in the loop.
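A minimal sketch of that counter logic in Python (the real bot would build this with A360 Loop and String actions; the path shape is taken from the DOMX above, and `domx_for_row` is just an illustrative helper name):

```python
def domx_for_row(n: int) -> str:
    """Build the per-row DOMX path, substituting the loop counter
    for the result-row index; the trailing segments stay fixed."""
    return f"//div[@id='rso']/div[{n}]" + "/div[1]" * 6

# 5 iterations, starting at row 3 (the first result row)
for n in range(3, 8):
    print(domx_for_row(n))
```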

To get the URL, I’m grabbing the “HTML InnerText” property, which looks something like this:

Best Vegan Chocolates: Ideal for Plant-Based TreatsDallmann Confectionshttps://dallmannconfections.com › collections › vegan-c..

So you would need to use the string tools to isolate the URL out of there!
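The string logic could look something like this Python sketch (in A360 you’d replicate it with the String package actions; note Google shows the URL breadcrumb-style with “ › ” separators, and the trailing “..” in the example is Google’s own truncation):

```python
import re

def extract_url(inner_text: str) -> str:
    """Pull the displayed URL out of a search result's InnerText:
    take everything from 'https://' onward and turn the ' › '
    breadcrumb separators back into '/'."""
    match = re.search(r"https?://\S.*", inner_text)
    if not match:
        return ""
    return match.group(0).replace(" › ", "/")

text = ("Best Vegan Chocolates: Ideal for Plant-Based Treats"
        "Dallmann Confections"
        "https://dallmannconfections.com › collections › vegan-c..")
print(extract_url(text))
```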

Since the number of results will vary, you’ll need to error-trap the case where you run out of rows and have to click the “More Options” button to expand the results, and then the Next button. Note that clicking Next probably resets the rows, so you’ll need to set the variable back to 3 to start scraping again.


This is the first thing I tried, but the order of results can be random, i.e. if you enter a completely different search string, sometimes the DOMX index might be -1 and not follow an incremental order. I ended up using the headless-browser method instead: a REST GET with the search URL in the URI, then string manipulation to capture all the URLs.
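The string-manipulation step could be sketched like this in Python (in A360 the GET would come from the REST Web Service action and this logic would be built with String actions; the HTML sample and `extract_result_urls` name are illustrative, and real result pages may wrap links differently):

```python
import re
from urllib.parse import urlparse

def extract_result_urls(html: str) -> list[str]:
    """Pull absolute result URLs out of a search results page body
    returned by a plain GET, dropping Google's own links and dupes."""
    links = re.findall(r'href="(https?://[^"]+)"', html)
    seen, urls = set(), []
    for u in links:
        if "google." in urlparse(u).netloc or u in seen:
            continue
        seen.add(u)
        urls.append(u)
    return urls

# Illustrative page fragment, not a real Google response
sample = ('<a href="https://www.google.com/preferences">Settings</a>'
          '<a href="https://example.com/page1">Result 1</a>'
          '<a href="https://example.com/page1">dup</a>'
          '<a href="https://dallmannconfections.com/collections">Result 2</a>')
print(extract_result_urls(sample))
```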

Thanks for your suggestion.

