Security issues in crawling #12

innovationchef · 2018-05-28T17:12:26Z

Thoughts -

Faithful crawling - the input website may not contain relevant Bioschemas data
Massive JSON-LD or web pages
Filter frontend inputs
Denial of service attacks

innovationchef · 2018-06-01T05:22:55Z

Faithful crawling: how do I check whether the JSON-LD received contains relevant life sciences data?
Massive pages: I don't think Scrapy takes care of massive web pages (I will update this if I find something in the documentation). So how do we check whether a web page is massive? What limit should be put on the size?
Frontend input filtering: this will be taken up later, when we start the frontend part.
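
One possible way to approach the first question is to look at the `@type` values in the extracted JSON-LD and compare them with a list of types we care about. This is only a minimal sketch: the `RELEVANT_TYPES` set and the helper names are assumptions for illustration, not an agreed list, and the real check would need to follow whichever Bioschemas profiles the project targets.

```python
import json

# Hypothetical allowlist: these type names are assumptions for illustration.
RELEVANT_TYPES = {"DataCatalog", "Dataset", "Sample", "BioChemEntity", "Protein"}


def iter_nodes(data):
    """Yield every dict nested anywhere inside parsed JSON-LD."""
    if isinstance(data, dict):
        yield data
        for value in data.values():
            yield from iter_nodes(value)
    elif isinstance(data, list):
        for item in data:
            yield from iter_nodes(item)


def has_relevant_type(raw_jsonld, relevant=RELEVANT_TYPES):
    """Return True if any node declares an @type found in `relevant`."""
    try:
        data = json.loads(raw_jsonld)
    except (ValueError, TypeError):
        return False  # not parseable as JSON, so nothing to index
    for node in iter_nodes(data):
        types = node.get("@type", [])
        if isinstance(types, str):
            types = [types]
        elif not isinstance(types, list):
            types = []
        if any(isinstance(t, str) and t in relevant for t in types):
            return True
    return False
```

A page whose JSON-LD has no matching type could then be skipped or logged rather than indexed.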

justinccdev · 2018-06-03T17:41:49Z

Good question. I think this is an argument for explicitly selecting the sites indexed rather than doing a general crawl. A very good source of sites may be https://fairsharing.org, run by @Drosophilic
Of course I believe you, but I'm surprised Scrapy doesn't handle that. Not sure about the size limit; what's the size of the average BioSamples page?

innovationchef · 2018-06-03T20:33:12Z

Took me some time to find this out, but Scrapy has a safety valve around the downloader. If the aggregate size of the Responses still being processed exceeds 5 MB, it stops feeding further Requests into the downloader.
See here
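
That valve throttles the flow of requests but does not cap the size of any single page. For a per-response cap, Scrapy's `DOWNLOAD_MAXSIZE` and `DOWNLOAD_WARNSIZE` settings look like the relevant knobs. A minimal sketch of a project settings fragment follows; the numeric limits are assumptions chosen for illustration, not measured from real pages.

```python
# settings.py (sketch; the limits below are assumed values, tune after measuring)

# Responses larger than this many bytes are not downloaded.
DOWNLOAD_MAXSIZE = 5 * 1024 * 1024

# Responses larger than this many bytes trigger a warning in the log.
DOWNLOAD_WARNSIZE = 1 * 1024 * 1024
```

Starting with only the warning threshold and watching the logs on a few known sites might be a reasonable way to pick the hard limit.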

innovationchef · 2018-06-03T20:48:31Z

@justinccdev
There are currently 791 life sciences databases listed on the fairsharing.org website. ebi.ac.uk/biosamples is one of them. So, should I check whether the website the user gives as input is listed in that database before crawling?
Link here

justinccdev · 2018-06-08T09:33:01Z

Whilst we may very probably want to use fairsharing.org information and/or Bioschemas live deploys in the future as sources for default sites to crawl, we don't want to restrict the user as to what they can crawl, I think.
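
A sketch of how that could look in practice: validate the user-supplied URL (which also covers part of the "filter frontend inputs" point) and only report whether the host is on a default list, without blocking anything else. The `DEFAULT_HOSTS` set and the function name are hypothetical; the real list might come from fairsharing.org or the Bioschemas live deploys.

```python
from urllib.parse import urlsplit

# Hypothetical default list; the entry below is only an example.
DEFAULT_HOSTS = {"www.ebi.ac.uk"}


def check_crawl_url(url):
    """Return (ok, is_default_site) for a user-supplied URL string.

    ok is False for anything that is not a plain http(s) URL with a host.
    is_default_site is advisory only and never blocks a crawl.
    """
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname
    except ValueError:
        return False, False  # malformed URL, e.g. a broken IPv6 literal
    if parts.scheme not in ("http", "https") or not host:
        return False, False
    return True, host.lower() in DEFAULT_HOSTS
```

For example, `check_crawl_url("javascript:alert(1)")` is rejected because the scheme is not http(s), while an unknown but well-formed site is still accepted for crawling.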
