How are you? I understand how to use Bardeen to scrape websites that are fairly structured. Thanks so much also for helping me out with Google Maps and LinkedIn the other day!
Here is a new use case: I was wondering if there is a way to use Bardeen to scrape a website that has a huge number of sub-pages and is quite unstructured overall. Here is the website of the German Tax Ministry: BMF - Amtliches Einkommensteuer-Handbuch
They have an HTML-based ebook that explains how to do your tax declaration. The problem is that there are a ton of sub-pages, and you need to click on hyperlinks to open them. Many sub-pages have further sub-pages. Also, the structure is different on each page.
I was wondering if there is maybe a tool to first get all the sub-URLs of a given website somehow, and then do a complete scrape of all the sub-pages?
As a final output, I’d like to create a PDF which I want to use for a GermanTaxGPT.
We’re so glad to hear you’re moving on to more advanced use cases with Bardeen and exploring the world of possibilities.
We can use a custom scraper with a CSS selector that finds all the links on a page and then scrapes the pages those links point to. I have created two playbooks for you (linked below) that do just that.
Playbook 1 (file:///C:/Users/v3119/Pictures/Screenpresso/2023-11-22_18h15_39.gif): You’ll need to open the main page you shared in your initial message and run this playbook to scrape all the links from the website. It will then output them to a Google Sheet. Because many of the links are duplicated, you can then run ‘Remove duplicates’ in Google Sheets.
Playbook 2: This will open all the links, take the HTML of each page, convert it to text, and output it to a Google Doc.
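For anyone curious about the idea behind the two playbooks, here is a rough Python sketch of the same two steps: collect every link on a page and drop the duplicates, then reduce a page's HTML to plain text. It is only an illustration and not how Bardeen works internally. The class and function names (`LinkAndTextParser`, `extract_links`, `html_to_text`) and the `example.com` URL are made up for this example, and it uses only the Python standard library on a small inline HTML sample rather than the real website.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkAndTextParser(HTMLParser):
    """Hypothetical helper: collects <a href> values and visible text."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_links(html, base_url):
    """Step 1: return absolute http(s) links, de-duplicated, in page order."""
    parser = LinkAndTextParser()
    parser.feed(html)
    seen = set()
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" parts so that
        # page.html and page.html#top count as the same page.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def html_to_text(html):
    """Step 2: return the visible text of a page, one chunk per line."""
    parser = LinkAndTextParser()
    parser.feed(html)
    return "\n".join(parser.text_parts)


# Small made-up page standing in for one page of the ebook.
sample_html = """
<html><head><style>p { color: red; }</style></head>
<body>
  <p>Intro</p>
  <a href="page1.html">Page 1</a>
  <a href="page1.html#top">Page 1 again</a>
  <a href="/other/page2.html">Page 2</a>
  <a href="mailto:someone@example.com">Mail</a>
</body></html>
"""

links = extract_links(sample_html, "https://example.com/book/index.html")
text = html_to_text(sample_html)
print(links)
print(text)
```

To cover sub-pages of sub-pages, the same link step would have to be repeated on each newly found page while keeping track of the URLs already visited.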
From there, you’ll be able to create the PDF for your GPT. Here’s a video of how it works. I reduced the number of links to scrape to shorten the video, but you can run it for all the links.
If you would like to learn more about CSS selectors, here is a link to the part of our Masterclass that covers this topic.
Regards,
Vin
Customer Support - bardeen.ai
Thanks so much! For Playbook 1, I think there is a problem with the hyperlink. I only got this: “file:///C:/Users/v3119/Pictures/Screenpresso/2023-11-22_18h15_39.gif”. Can you send me the correct link?
I will try out both playbooks once I have feedback from your colleagues on how the credits work.