Best practices to scrape this amount of data

Hi all, I need to scrape all 548 records from the site pictured, both the active tab and the background data for each record. What is the best practice for this, given that the first page shows only 28 records and the scraper will need to go through more than 10 pages? In some cases I will need to scrape 2,000 records. I worry about getting blocked by the website and want to avoid that at all costs. What is the best way to go about this? Thanks

Hi @rubyjoy2019

Your use case seems quite doable!

It seems this page doesn’t have a next button for pagination, so it’ll require some workarounds.

Some questions that can help us understand the case better.

  1. Is this a public website? If so, can you share the link?
  2. Does the URL change when you move from page to page? For example, when you paginate to page 2, does the URL change to Home Page | PageGroup?

We don’t really have context on what this website is about.

That said, very few websites have methods to detect scraping.

Some best practices to avoid being blocked when scraping are:

  1. Use delays between your scraper’s actions.
  2. Don’t overuse the scraper (e.g. opening 1,000 pages in a few minutes). → This only applies to certain sites, like LinkedIn, that can actually detect unusual behaviour.
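To illustrate the first point outside of any particular tool, here is a minimal Python sketch of randomized delays between scraper actions. The function names, the 2 second base and the 1.5 second jitter are assumptions chosen for illustration, not settings from this thread or from any product.

```python
import random
import time


def polite_delay(base_seconds=2.0, jitter_seconds=1.5):
    """Return a randomized pause length so actions are not evenly spaced."""
    return base_seconds + random.uniform(0.0, jitter_seconds)


def scrape_with_delays(items, scrape_one, sleep=time.sleep):
    """Call scrape_one on each item, pausing between consecutive calls."""
    results = []
    for index, item in enumerate(items):
        if index > 0:
            # Pause before every action except the first one.
            sleep(polite_delay())
        results.append(scrape_one(item))
    return results
```

Here `scrape_one` stands in for whatever fetches a single record. Passing `sleep` as a parameter makes it possible to try the loop without actually waiting.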

Still, since your case is about 500 items, I don’t think there will be an issue.

More info:
FAQ: Scraper: Are there any risks to scraping LinkedIn? | Bardeen.ai.

Hi Ivan,

Thanks for your contribution, greatly appreciated. Regarding pagination, here is the website. What would be the best way to handle its pagination?

https://registers.maryland.gov/RowNetWeb/Estates/frmEstateSearch2.aspx

Also, I’m not only looking to scrape 500 records. I will go up to 15,000 at some point, but as you said, I will use a time delay action. Should I use time delays in both the active tab and the background data?
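For a rough sense of scale, the sketch below estimates how many result pages and how much waiting time these record counts imply. The 28 records per page comes from the first post. The 3 second delay and the assumption of one background page load per record are illustrative guesses, not measured values.

```python
import math

RECORDS_PER_PAGE = 28  # from the first post: one results page shows 28 records


def pages_needed(total_records, per_page=RECORDS_PER_PAGE):
    """Number of result pages required to list total_records."""
    return math.ceil(total_records / per_page)


def estimated_minutes(total_records, delay_seconds=3.0, per_page=RECORDS_PER_PAGE):
    """Waiting time only: one delay per results page plus one per record page."""
    page_loads = pages_needed(total_records, per_page) + total_records
    return page_loads * delay_seconds / 60


for total in (548, 2000, 15000):
    print(total, pages_needed(total), round(estimated_minutes(total)))
```

Under these assumptions the delays alone grow from under half an hour for 548 records to several hours for 15,000, so the larger runs are probably best split into batches.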

Looking forward to hearing back.

Hi @ivan, I tried to run the playbook, but pagination is a problem: the robot doesn’t recognise it and so copies from the same page multiple times. To answer your question above, the page URL doesn’t seem to change when you click to go to the next page.

What’s the workaround for this? I would really appreciate your help with this as it’s quite urgent. Thanks
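This does not solve the pagination itself, but the duplicate copies can at least be caught. Below is a minimal Python sketch that fingerprints each page of records and stops as soon as a page repeats. `fetch_page` is a hypothetical callable standing in for whatever returns the records on a given page, and the 100 page cap is an arbitrary assumption.

```python
import hashlib


def page_fingerprint(records):
    """Hash a page's records so two pages can be compared cheaply."""
    joined = "\n".join(str(record) for record in records)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def collect_until_repeat(fetch_page, max_pages=100):
    """Collect records page by page, stopping at an empty or repeated page."""
    seen = set()
    collected = []
    for page_number in range(1, max_pages + 1):
        records = fetch_page(page_number)
        fingerprint = page_fingerprint(records)
        # A repeated fingerprint means the pager did not actually advance.
        if not records or fingerprint in seen:
            break
        seen.add(fingerprint)
        collected.extend(records)
    return collected
```

If the pager never advances, the result contains the first page exactly once instead of many copies of it.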

Pagination is done by JavaScript on this site, and I cannot think of a way to get through the pages. I would advise not wasting time on it.

Thanks for your reply. Surely there must be a way around this? @ivan, what are your thoughts on this?

I can’t think of a way to approach this either.

Since the URL doesn’t change and there isn’t any way to trigger the pagination, this website is an outlier case. I don’t think it’s possible to scrape it in bulk.

That said, we are creating a web agent that can extract data from a prompt; it’s currently in demo. I think cases like this will be approachable with the agent in the near future.

@michael.lutz thoughts?
