Exclude a CSS class from scraping

Hi guys,

How can I exclude a CSS class from scraping?
Unfortunately, :has doesn’t seem to work yet.

For example, #content-main > p:not(:has(strong, em, iframe))

Xpath: //div[@id="content-main"]//p[not(.//strong) or contains(@class, "cssclass1")] | //div[@id="content-main"]//iframe/*

That doesn’t work either.

Hey @den

Can you please provide more information on what you are trying to scrape? Also, can you please let us know which browser and version you are using?

Thanks in advance.

Hey @manvel

I’m trying to scrape text from a page

where

#content-main - container with the content
p - tag for the text content

I need to exclude strong, .cssclass1 and iframe tags from the selection.

Basically, the XPath works perfectly in the developer console (Chrome 120.0.6099.199), but for some reason it doesn’t work within Bardeen: it grabs only the very first paragraph.

@den from what I see, the CSS you shared is accurate and works for me. The XPath doesn’t work on my side, neither in DevTools nor in Bardeen; the selector seems to be wrong. Am I missing something? See the Loom recording.

@den for the XPath, I think you meant to use something like:

//div[@id="content-main"]//p[not(.//strong or .//*[contains(@class, "cssclass1")] or .//iframe)]

Let me know if that fixes the issue. At least it selects the right elements on my side on the current URL.

But I would rather use the CSS equivalent, as it might be shorter and clearer.
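
For anyone comparing the two XPath predicates, here is a minimal sketch of their boolean logic in Python, using only the standard library. The sample markup is hypothetical (the real page is not shown in this thread), and the helper names `text_of`, `original_predicate` and `corrected_predicate` are made up for illustration. ElementTree does not support `not()` or `or` in its XPath subset, so the predicates are written as plain Python functions, and only the `p[...]` part of each expression is modelled.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real page structure is not shown in the thread.
SAMPLE = """
<div id="content-main">
  <p>Plain paragraph.</p>
  <p>Has <strong>bold</strong> text.</p>
  <p class="cssclass1">Promo paragraph.</p>
  <p>Wraps <span class="cssclass1">a marked span</span>.</p>
  <p>Video: <iframe src="about:blank"></iframe></p>
</div>
"""

def text_of(el):
    """Concatenate all text inside an element."""
    return "".join(el.itertext()).strip()

def original_predicate(p):
    # p[not(.//strong) or contains(@class, "cssclass1")]
    return p.find(".//strong") is None or "cssclass1" in p.get("class", "")

def corrected_predicate(p):
    # p[not(.//strong or .//*[contains(@class, "cssclass1")] or .//iframe)]
    has_strong = p.find(".//strong") is not None
    # .//* looks at descendants only, so the class on <p> itself is not tested.
    has_marked = any("cssclass1" in d.get("class", "") for d in p.iter() if d is not p)
    has_iframe = p.find(".//iframe") is not None
    return not (has_strong or has_marked or has_iframe)

paragraphs = ET.fromstring(SAMPLE).findall(".//p")
original = [text_of(p) for p in paragraphs if original_predicate(p)]
corrected = [text_of(p) for p in paragraphs if corrected_predicate(p)]

print(original)
print(corrected)
```

On this sample the first predicate lets the iframe and marked-span paragraphs through, because `or contains(...)` adds matches instead of removing them. The second predicate still keeps a paragraph that carries `cssclass1` on the `<p>` itself; if those should be dropped too, a `contains(@class, "cssclass1")` test would presumably have to go inside the `not(...)` as well.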

@manvel
Thanks for the help 👍

It seems that I figured out a workaround:
I created a new scraping template (list), defined a custom selector (DIV#content-main> P:not(:has(strong)):not(.cssclass1)) and then just used :scope. Then I used “Merge text” to merge the content.

I don’t know why or what the difference is, but it works. It is probably just a conflict within Bardeen’s logic for single page templates (where you can’t define a custom selector).

P.S. The XPath you’ve provided grabs the very first paragraph only.

> It seems that I figured out a workaround:
> I created a new scraping template (list), defined a custom selector (DIV#content-main> P:not(:has(strong)):not(.cssclass1)) and then just used :scope. Then I use “Merge text” to merge the content.

The List scraper and the Single page scraper work differently. In the List scraper, unless you put :scope before the selector of a field, the selector is evaluated against the whole page (the <html> element). If you do provide the scope, the path is calculated from the container, similar to the leading . in XPath.
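
As a rough analogy only (this is not Bardeen's implementation), the difference between a field selector evaluated from the page root and one evaluated from each container can be sketched with Python's standard library. The markup and the names `unscoped` and `scoped` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical list page with two item containers.
PAGE = """
<html><body>
  <p>Site banner</p>
  <div class="item"><p>First item</p></div>
  <div class="item"><p>Second item</p></div>
</body></html>
"""

root = ET.fromstring(PAGE)
containers = root.findall(".//div[@class='item']")

# No scope: the field lookup starts at the document root for every
# container, so each row gets the same first match on the page.
unscoped = [root.find(".//p").text for _ in containers]

# Scoped: the lookup starts at the container itself, like :scope in
# CSS or a leading "." in XPath.
scoped = [c.find(".//p").text for c in containers]

print(unscoped)
print(scoped)
```

Here the unscoped lookup returns the banner text twice, while the scoped lookup returns one value per container.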

I see that your selector is still targeting a different element. I think you meant to use something like:

DIV#content-main> P:not(:has(strong, .cssclass1, iframe))

> P.S. The XPath you’ve provided grabs the very first paragraph only.

Can you please provide a Loom, a recording or something similar? That’s not what I see. Unfortunately, unless I see the exact markup and the exact Scraper model you are building, it’s really hard to help meaningfully.

As mentioned above, it’s meant to be used on the current page, which was generated based on your request above, if I understood you correctly.

Here is the screenshot from my side, but again, I’m not sure what exactly you are building:

@manvel

Thanks for your help! 💪

Here is a more detailed explanation. The main question is how to make it grab all of the items (content), not just the very first one, and how to merge them into a single piece.
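
Outside of Bardeen, the "grab everything, then merge" idea can be sketched as follows. This is only an illustration under assumptions: the markup is invented, the `keep` helper is a hypothetical stand-in for the selector DIV#content-main > P:not(:has(strong)):not(.cssclass1), and the join separator is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; the real page is not shown in the thread.
ARTICLE = """
<div id="content-main">
  <p>First paragraph.</p>
  <p><strong>Related:</strong> another story</p>
  <p>Second paragraph.</p>
  <p class="cssclass1">Newsletter signup.</p>
  <p>Third paragraph.</p>
</div>
"""

def keep(p):
    """Approximate P:not(:has(strong)):not(.cssclass1)."""
    has_strong = p.find(".//strong") is not None
    is_marked = "cssclass1" in p.get("class", "").split()
    return not has_strong and not is_marked

container = ET.fromstring(ARTICLE)

# find() returns a single element, which resembles the
# "only the very first paragraph" symptom.
first_only = container.find("p")

# findall("p") returns every direct child <p>, like the > combinator.
all_matches = [p for p in container.findall("p") if keep(p)]

# Merge the kept paragraphs into one piece of text.
merged = "\n\n".join("".join(p.itertext()).strip() for p in all_matches)
print(merged)
```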

Thanks @den for the detailed information on the use case. Here is a Loom demo for you on how to build it.

And here is the playbook itself: https://www.bardeen.ai/playbook/community/Scraping-www.marijuanamoment.net-HnFaFOcrG9pll3SuGV

@manvel

Many thanks for your time and help! I do really appreciate this 🙏

The thing is, this looks like a very clumsy workaround. For example, scraping just a single page may take around 35 credits (using the premium background mode). I believe that if it worked as intended, it would use only a couple of credits per page.

Actually, I think that building filtering logic into the scraper could be very beneficial for many users, something like “Selector Include:” and “Selector Exclude:” fields. Just in case the developers read this thread.

Also, a freestyle Puppeteer action could be useful in case I’d like to combine scraping with other automation tasks.

P.S. BTW, I’ve tried Puppeteer itself with the mentioned CSS selector. It worked like a charm; all of the necessary information was scraped easily. So I believe something is wrong with Bardeen’s logic.

The same issue occurred with the listing itself: Bardeen grabs only the very first item (even though I pointed at 10 items), and that’s all.

Also:

  • You can barely update templates; in 90% of cases you have to create new ones.
  • Sometimes you can’t even update your entire playbook after all the changes you have made; it returns an error.
  • For some reason, new documents are not displayed in the Google Docs selection.
  • It consumes premium “credits” unpredictably.
  • Overall, the behavior of Bardeen is unstable and unpredictable.

It was a great try, but it seems that it is too early and the app still feels like a beta. Anyway, thanks for what you are doing; the idea itself is great. I hope that the project will reach real success very soon. For now, I think Browserless + Zapier will be more efficient and straightforward to use.
