Scraping using wildcards (e.g. www.*.org)

Hello,

I need a way to scrape the web in a wildcard fashion (I’m trying to create a dataset). In this case, I’m looking to scrape .org domains across the web for the wishlists and needs lists that nonprofits often post on their websites.

Can anyone help me do this using Bardeen? Any and all help is greatly appreciated.

Thank you,

Mark

Hi Lucy,

It may be just one example, but I have since reached out to the root DNS domain registry.

I reached out to IANA, and they pointed me to ICANN.

Essentially, we can approach this by pulling the full list of .org domains first.

AI doesn’t seem to actually digest the web yet, which is understandable but also misguided for research purposes.

I will try to work with ICANN first to get the list of .orgs. Then we can feed that aggregate into the scraping logic.
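
To make that concrete, here is a rough Python sketch of what I mean by building the aggregate. It assumes the registry dump has already been downloaded locally; the file names are placeholders and the zone-file layout (one resource record per line, domain name in the first column) is just my assumption.

```python
# Rough sketch only: assumes a locally downloaded .org dump (file names are
# hypothetical) in zone-file layout, i.e. one resource record per line with
# the domain name in the first column, repeated once per NS/DS record.
import gzip

def build_org_list(zone_path="org.zone.gz", out_path="org-domains.txt"):
    domains = set()
    with gzip.open(zone_path, "rt", encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            if not line.strip() or line.startswith(";"):  # skip blanks/comments
                continue
            name = line.split()[0].rstrip(".").lower()
            # keep only registrable second-level names like "example.org"
            if name.endswith(".org") and name.count(".") == 1:
                domains.add(name)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(sorted(domains)))
    return len(domains)

if __name__ == "__main__":
    print(build_org_list(), "unique .org domains written")
```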

Do you have any other ideas for how to approach wildcard requests besides this?

Mark

Hi Mark,

Let me check with the team and circle back.

Thanks!
Lucy

Customer Support - bardeen.ai
Explore | @bardeenai | Bardeen Community

Hi Lucy,

I was able to get the .org domain list from ICANN. Is there a way to use Bardeen to go through the list and initiate the scraping process?

Thank you,

Mark

Hi Mark,

Good to hear from you. Can you get them into a Google sheet and share it with me so I can check it out with our engineers?

Thanks!
Lucy

Customer Support - bardeen.ai
Explore | @bardeenai | Bardeen Community

Hi Lucy,

The file is way too large for that; it contains all the .org domains across the web. I’ve attached it here as a compressed file:

org.txt.gz

Please confirm you can access it when possible.
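
If it helps, this is a quick way to sanity-check the archive on your side (assuming it is plain text inside the gzip, with one entry per line):

```python
# Quick sanity check of the shared archive. Assumption: org.txt.gz is plain
# text with one domain (or one zone record) per line.
import gzip
import itertools

with gzip.open("org.txt.gz", "rt", encoding="utf-8", errors="ignore") as fh:
    head = list(itertools.islice(fh, 5))  # peek at the first few entries
    remaining = sum(1 for _ in fh)        # count the rest without loading it all

print("sample entries:")
for line in head:
    print(" ", line.rstrip())
print("total lines:", len(head) + remaining)
```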

Mark

Thanks Mark. I will have the engineer take a look. Give us a few days to wrap our heads around this.

Cheers,
Lucy

Customer Support - bardeen.ai
Explore | @bardeenai | Bardeen Community

Thank you Lucy,

Certainly. Essentially, what I was hoping to accomplish is (a rough sketch follows the list):

  1. run a script that goes through the .org listings,

  2. search each site for keywords such as ‘wishlist’ and ‘needs list’,

  3. return the results that fall under those sections on nonprofit sites, and

  4. ultimately capture product needs lists from across domains and filter them down to identify patterns and groupings.
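
As a rough illustration of steps 1 to 3 (a sketch only: the input file name, the keyword set, and checking homepages only are my assumptions, and a real run would also need robots.txt handling, rate limiting, and retries):

```python
# Sketch of the loop described above: walk the domain list, fetch each
# homepage, and record pages that mention wishlist-style keywords.
# Assumptions: one domain per line in the input file; only homepages are
# checked; robots.txt, rate limiting, and retries are omitted for brevity.
import csv
import requests

KEYWORDS = ("wishlist", "wish list", "needs list")

def scan(domains_path="org-domains.txt", out_path="hits.csv", limit=100):
    with open(domains_path, encoding="utf-8") as fh, \
         open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["domain", "keyword"])
        for i, domain in enumerate(line.strip() for line in fh):
            if i >= limit:          # cap the crawl while testing
                break
            if not domain:
                continue
            try:
                resp = requests.get(f"https://{domain}", timeout=10)
                text = resp.text.lower()
            except requests.RequestException:
                continue            # unreachable site; move on
            for kw in KEYWORDS:
                if kw in text:
                    writer.writerow([domain, kw])
                    break

if __name__ == "__main__":
    scan()
```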

Hopefully this helps summarize for your team.

Mark

Hi Mark,

Thanks for your patience. Our scraper engineers had a chance to review this in detail, and it looks like we won’t be able to handle the scope of this project just yet. The browser agent we are developing is still in beta and isn’t yielding consistent enough results, and our scraper models would not be able to grab the data you need because the site structures vary so greatly.

We appreciate you considering Bardeen for this!
Lucy

Customer Support - bardeen.ai
Explore | @bardeenai | Bardeen Community