Scraping using wildcards (e.g. www.*.org)

Hello,

I need a way to scrape the web in a wildcard fashion (I’m trying to create a dataset). In this case, I’m looking to scrape .org domains across the web for the wishlists and needs lists that nonprofits often post on their websites.

Can anyone help me do this using Bardeen? Any and all help is greatly appreciated.

Thank you,

Mark

Hi Lucy,

It may be just one example, but I have since reached out to the root DNS domain registry.

I reached out to IANA, and they pointed me to ICANN.

Essentially, we can approach this by pulling the full list of .org domains first.

AI doesn’t seem to actually digest the web yet, which is understandable but also misguided for research purposes.

I will try to work with ICANN first to get the list of .orgs. Then we can feed that aggregate into the scraping logic.
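
To make that concrete, here is a rough Python sketch of what I mean by building the aggregate. It assumes the registry dump has already been downloaded locally; the file names are placeholders and the zone-file layout (one resource record per line, domain name in the first column) is just my assumption.

```python
# Rough sketch only: assumes a locally downloaded .org dump (file names are
# hypothetical) in zone-file layout, i.e. one resource record per line with
# the domain name in the first column, repeated once per NS/DS record.
import gzip

def build_org_list(zone_path="org.zone.gz", out_path="org-domains.txt"):
    domains = set()
    with gzip.open(zone_path, "rt", encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            if not line.strip() or line.startswith(";"):  # skip blanks/comments
                continue
            name = line.split()[0].rstrip(".").lower()
            # keep only registrable second-level names like "example.org"
            if name.endswith(".org") and name.count(".") == 1:
                domains.add(name)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(sorted(domains)))
    return len(domains)

if __name__ == "__main__":
    print(build_org_list(), "unique .org domains written")
```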

Do you have any other ideas for how to approach wildcard requests besides this?

Mark

Hi Mark,

Let me check with the team and circle back.

Thanks!
Lucy

Customer Support - bardeen.ai
Explore | @bardeenai | Bardeen Community

Hi Lucy,

I was able to get the .org domain list from ICANN. Is there a way to use Bardeen to go through the list and initiate the scraping process?

Thank you,

Mark

Hi Mark,

Good to hear from you. Can you get them into a Google sheet and share it with me so I can check it out with our engineers?

Thanks!
Lucy

Customer Support - bardeen.ai
Explore | @bardeenai | Bardeen Community

Hi Lucy,

The file is way too large for that; it contains all the .org domains across the web. I’ve attached it here as a compressed file:

org.txt.gz

Please confirm you can access it when possible.
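
If it helps, this is a quick way to sanity-check the archive on your side (assuming it is plain text inside the gzip, with one entry per line):

```python
# Quick sanity check of the shared archive. Assumption: org.txt.gz is plain
# text with one domain (or one zone record) per line.
import gzip
import itertools

with gzip.open("org.txt.gz", "rt", encoding="utf-8", errors="ignore") as fh:
    head = list(itertools.islice(fh, 5))  # peek at the first few entries
    remaining = sum(1 for _ in fh)        # count the rest without loading it all

print("sample entries:")
for line in head:
    print(" ", line.rstrip())
print("total lines:", len(head) + remaining)
```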

Mark

Thanks Mark. I will have the engineer take a look. Give us a few days to wrap our heads around this.

Cheers,
Lucy

Customer Support - bardeen.ai
Explore | @bardeenai | Bardeen Community

Thank you Lucy,

Certainly. Essentially, what I was hoping to accomplish is (a rough sketch follows the list):

  1. run a script that goes through the .org listings,

  2. search each site for keywords such as ‘wishlist’ and ‘needs list’,

  3. return the results that fall under those sections on nonprofit sites, and

  4. ultimately capture product needs lists from across domains and filter them down to identify patterns and groupings.
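
As a rough illustration of steps 1 to 3 (a sketch only: the input file name, the keyword set, and checking homepages only are my assumptions, and a real run would also need robots.txt handling, rate limiting, and retries):

```python
# Sketch of the loop described above: walk the domain list, fetch each
# homepage, and record pages that mention wishlist-style keywords.
# Assumptions: one domain per line in the input file; only homepages are
# checked; robots.txt, rate limiting, and retries are omitted for brevity.
import csv
import requests

KEYWORDS = ("wishlist", "wish list", "needs list")

def scan(domains_path="org-domains.txt", out_path="hits.csv", limit=100):
    with open(domains_path, encoding="utf-8") as fh, \
         open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["domain", "keyword"])
        for i, domain in enumerate(line.strip() for line in fh):
            if i >= limit:          # cap the crawl while testing
                break
            if not domain:
                continue
            try:
                resp = requests.get(f"https://{domain}", timeout=10)
                text = resp.text.lower()
            except requests.RequestException:
                continue            # unreachable site; move on
            for kw in KEYWORDS:
                if kw in text:
                    writer.writerow([domain, kw])
                    break

if __name__ == "__main__":
    scan()
```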

Hopefully this helps summarize for your team.

Mark

Hi Mark,

Thanks for your patience. Our scraper engineers had a chance to review this in detail, and it looks like we won’t be able to handle the scope of this project just yet. The browser agent we are developing is still in beta and isn’t yielding consistent enough results, and our scraper models would not be able to grab the data you need because the site structures vary so greatly.

We appreciate you considering Bardeen for this!
Lucy

Customer Support - bardeen.ai
Explore | @bardeenai | Bardeen Community