Enable selecting which columns to include/exclude in the crawl function. #391
Hi @eliasdabbas, just to add more context on why I think this might be useful. While the solution you described obviously works, the underlying issue is the amount of data generated. In the case I'm currently working on, advertools generates roughly 50 GB per 1M URLs. Without being able to select the columns I want, I need to launch the crawl on a machine with at least 75 GB of free disk space. Not an issue when you have a single crawl in progress, trickier when you have several running in parallel.
Would you mind telling me what part of your code controls the way the output columns are written?
I agree! I see two options:

```python
columns = {
    'keep': ['h*', 'url'],
    'remove': ['h5']
}
```

This would allow complex matching (I want …). Invalid patterns could be caught before the crawl starts:

```python
import re

try:
    re.compile('[')
    is_valid = True
except re.error:
    is_valid = False
```

What do you think?
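A minimal sketch of how such a `columns` dict with `keep`/`remove` patterns could be applied, assuming glob-style patterns; the `filter_columns` helper is hypothetical, not part of advertools:

```python
import fnmatch

def filter_columns(all_columns, keep=None, remove=None):
    """Return the columns matching any `keep` glob pattern,
    minus those matching any `remove` pattern."""
    kept = [c for c in all_columns
            if keep is None or any(fnmatch.fnmatch(c, p) for p in keep)]
    return [c for c in kept
            if not any(fnmatch.fnmatch(c, p) for p in (remove or []))]

cols = ['url', 'title', 'h1', 'h2', 'h5', 'status']
print(filter_columns(cols, keep=['h*', 'url'], remove=['h5']))
# ['url', 'h1', 'h2']
```

With regex patterns instead of globs, `fnmatch.fnmatch` would simply be swapped for `re.fullmatch`.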
Crawling large websites can consume a lot of disk space. In many cases we are not interested in all the columns, only a subset.
This can be solved by enabling users to select which columns they do (or don't) want, making the output file as big or small as they need.
Possible approach: add two new parameters to the `crawl` function, `columns_to_keep` (I want only these columns) and `columns_to_discard` (I want everything but these), for example.

An important issue is groups of columns like `h1`..`h6`, `resp_headers_*`, `request_headers_*`, `jsonld_*`. Pattern matching can make this easier, because in many cases we don't know in advance which columns will actually exist, and they can change from page to page.

For example, `columns_to_keep=['url', 'title', 'status', 'jsonld_.*']` keeps `url`, `title`, `status`, and anything matching "jsonld_.*".
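Until such parameters exist, the same selection can be sketched as a post-processing step, matching regex patterns against the column names of a crawl output file (the column names below are hypothetical):

```python
import re

# hypothetical column names from a crawl output file
crawl_columns = ['url', 'title', 'status', 'h1', 'h2',
                 'jsonld_@type', 'jsonld_name', 'resp_headers_server']

columns_to_keep = ['url', 'title', 'status', 'jsonld_.*']

# keep a column if it fully matches any of the patterns
selected = [c for c in crawl_columns
            if any(re.fullmatch(p, c) for p in columns_to_keep)]
print(selected)
# ['url', 'title', 'status', 'jsonld_@type', 'jsonld_name']
```

Using `re.fullmatch` (rather than `re.search`) avoids a pattern like `"h1"` accidentally matching a longer column name it merely appears in.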