Enable selecting which columns to include/exclude in the crawl function. #391
Hi @eliasdabbas, just to add more context on why I think this might be useful. While the solution you described obviously works, the underlying issue is the amount of data generated. In the case I'm currently working on, advertools generates roughly 50 GB per 1M URLs. Without being able to select the columns I want, I need to launch the crawl on a machine with at least 75 GB of free disk space. Not an issue when you have a single crawl in progress, trickier when you have several running in parallel.
Would you mind telling me what part of your code controls the way the output columns are written?
I agree! I see two options:

```python
columns = {
    'keep': ['h*', 'url'],
    'remove': ['h5']
}
```

This would allow complex matching (I want …). Invalid patterns could be caught before the crawl starts:

```python
import re

try:
    re.compile('[')
    is_valid = True
except re.error:
    is_valid = False
```

What do you think?
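A minimal sketch of how such a `columns` dict with `keep`/`remove` patterns could be applied, assuming glob-style patterns; the `filter_columns` helper is hypothetical, not part of advertools:

```python
import fnmatch

def filter_columns(all_columns, keep=None, remove=None):
    """Return the columns matching any `keep` glob pattern,
    minus those matching any `remove` pattern."""
    kept = [c for c in all_columns
            if keep is None or any(fnmatch.fnmatch(c, p) for p in keep)]
    return [c for c in kept
            if not any(fnmatch.fnmatch(c, p) for p in (remove or []))]

cols = ['url', 'title', 'h1', 'h2', 'h5', 'status']
print(filter_columns(cols, keep=['h*', 'url'], remove=['h5']))
# ['url', 'h1', 'h2']
```

With regex patterns instead of globs, `fnmatch.fnmatch` would simply be swapped for `re.fullmatch`.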
Crawling large websites can consume a lot of disk space. In many cases we are not interested in all the columns, only a subset.
This can be solved by enabling users to select which columns they do (or don't) want, making the output file as big or small as they need.
Possible approach: add two new parameters to the `crawl` function, `columns_to_keep` (I want only these columns) and `columns_to_discard` (I want everything but these), for example.

An important issue is groups of columns like `h1`..`h6`, `resp_headers_*`, `request_headers_*`, `jsonld_*`. Pattern matching can make this easier, because in many cases we don't know in advance which columns will actually exist, and they can change from page to page.

For example, `columns_to_keep=['url', 'title', 'status', 'jsonld_.*']` keeps `url`, `title`, `status`, and anything matching "jsonld_.*".
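Until such parameters exist, the same selection can be sketched as a post-processing step, matching regex patterns against the column names of a crawl output file (the column names below are hypothetical):

```python
import re

# hypothetical column names from a crawl output file
crawl_columns = ['url', 'title', 'status', 'h1', 'h2',
                 'jsonld_@type', 'jsonld_name', 'resp_headers_server']

columns_to_keep = ['url', 'title', 'status', 'jsonld_.*']

# keep a column if it fully matches any of the patterns
selected = [c for c in crawl_columns
            if any(re.fullmatch(p, c) for p in columns_to_keep)]
print(selected)
# ['url', 'title', 'status', 'jsonld_@type', 'jsonld_name']
```

Using `re.fullmatch` (rather than `re.search`) avoids a pattern like `"h1"` accidentally matching a longer column name it merely appears in.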