Scraping all content of an unstructured website

Hey guys,

How are you? I understand how to use Bardeen for scraping websites that are quite structured. Thanks so much also for helping me out with Google Maps and LinkedIn the other day!

Here is a new use case: I was wondering if there is a way to use Bardeen to scrape a website that has a huge number of sub-pages and is quite unstructured overall. Here is the website of the German Tax Ministry: BMF - Amtliches Einkommensteuer-Handbuch

They have an HTML-based ebook that explains how to do your tax declaration. The problem is that there are a ton of sub-pages, and you need to click on hyperlinks to open them. Many sub-pages have further sub-pages. Also, the structure is different on each page.

I was wondering if there is a tool to first get all the sub-URLs of a given website, and then do a complete scrape of all the sub-pages?

As a final output, I’d like to create a PDF which I want to use for a GermanTaxGPT.

How would you approach this?

Hi Moritz,

We’re so glad to hear you’re moving on to more advanced use cases with Bardeen and exploring the world of possibilities.

We can use a custom scraper with a CSS selector that looks for all links on a page, and then scrape the pages those links point to. I have created two playbooks for you (linked below) that do just that:

  1. Playbook 1 (file:///C:/Users/v3119/Pictures/Screenpresso/2023-11-22_18h15_39.gif): You’ll need to open the main page you shared in your initial message and run this playbook to scrape all the links from the website. It will then output them to a Google Sheet. Because many of the links are duplicated, you can then run ‘Remove duplicates’ in Google Sheets.

  2. Playbook 2: This will open all the links, take the HTML of each page, convert it to text and output it to a Google Doc.
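
For anyone curious what these two steps look like outside of Bardeen, here is a minimal Python sketch of the same idea. It is not how the playbooks are implemented; it only illustrates the technique using the standard library. The names `LinkCollector`, `TextExtractor`, `extract_links` and `html_to_text` are made up for this example, and the `example.org` URL and sample HTML are placeholders, not the real handbook pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, similar to the CSS selector `a[href]`."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop #fragments so that
            # anchors into the same page count as one URL.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.append(absolute)


class TextExtractor(HTMLParser):
    """Keeps visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_links(html, base_url):
    """Step 1: all links on a page, de-duplicated, in first-seen order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys(collector.links))


def html_to_text(html):
    """Step 2: reduce a page's HTML to its visible text."""
    extractor = TextExtractor()
    extractor.feed(html)
    return "\n".join(extractor.parts)


# Placeholder page standing in for the real start page.
sample = """
<html><head><style>p {color: red}</style></head><body>
<p>Start page</p>
<a href="/handbook/a.html">A</a>
<a href="/handbook/a.html#section-2">A again</a>
<a href="b.html">B</a>
</body></html>
"""
links = extract_links(sample, "https://example.org/handbook/index.html")
text = html_to_text(sample)
print(links)
print(text)
```

To cover sub-pages of sub-pages, you would fetch each collected link and run `extract_links` on it again, while keeping a set of URLs you have already visited so the same page is not scraped twice.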

From there, you’ll be able to create the PDF to create your GPT. Here’s a video of how it works. I reduced the number of links to scrape to shorten the video but you can run it for all the links.

If you would like to learn more about CSS selectors, here is a link to the part of our Masterclass that covers this topic.

Regards,
Vin

Customer Support - bardeen.ai
Explore | @bardeenai | Bardeen Community

Hi Vin,

Thanks so much! For Playbook 1, I think there is a problem with the hyperlink. I only got this: “file:///C:/Users/v3119/Pictures/Screenpresso/2023-11-22_18h15_39.gif”. Can you send me the correct link?

I will try out both playbooks once I have feedback from your colleagues on how the credits work :slight_smile:

Thanks!


Hi Moritz,

Apologies, here is the correct link: https://www.bardeen.ai/playbook/community/Scrape-German-Tax-42cOgtVcCPMEB5SlHo

You can find more details on how our credits work here: https://www.bardeen.ai/pricing under the FAQ ‘What’s a credit? What can I do with credits?’

Only premium playbook runs will consume credits. All non-premium playbooks remain available for unlimited usage, free of charge.

Let me know if you need further help.

Regards,
Vin

