Entity graph created by scraping the Game of Thrones Wiki (GOTWiki) - https://gameofthrones.fandom.com/wiki
Nodes- ['Organization', 'Person', 'Event', 'Episode', 'Animal', 'Location', 'HistoriesNLore', 'Weapon', 'House', 'PersonType', 'Religion', 'Season']
Relationships- ['SeenOrMentioned', 'Membership', 'Religion', 'Center', 'Location', 'Clergy', 'Allegiance', 'Leader', 'Founder', 'Predecessor', 'Death', 'Culture', 'Conflict', 'Place', 'Outcome', 'AssociatedLocation', 'Father', 'Mother', 'Spouse', 'Siblings', 'Battles', 'Rulers', 'Narratedby', 'Lovers', 'Successor', 'Children', 'Maker', 'Owner', 'Lord', 'Capital', 'Cities', 'Towns', 'Castles', 'Species', 'Range', 'Ruler', 'Population', 'Heir', 'Ancestralweapon', 'PlacesofNote', 'Formerly', 'Placesofnote', 'Military', 'Institutions', 'Villages', 'Placeoforigin', 'Formedfrom', 'Cadetbranches', 'Militarystrength', 'Premiere', 'Finale']
I have written an introductory blog post about web scraping - https://codefringo.wordpress.com/2018/10/22/webcralwer-in-python/
Important files-
- spiders/GotTGraphSpider.py is the main spider used to scrape the Fandom wiki.
- DataProcessor/ScrapyOutputProcessing.ipynb is the Jupyter notebook that processes the scraped output, generates tabular data for the entities, and creates the graph in a Neo4j instance.
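The processing step can be pictured roughly as follows. This is only a minimal sketch, not the notebook's actual code: the record shape (`name`, `type`, `relations`) and the helper names `to_tables` and `edge_to_cypher` are hypothetical, since the layout of the scraped JSON is not described here. It splits records into node rows and relationship rows, then builds a Cypher MERGE statement for each relationship.

```python
import json

# Hypothetical record shape: the real ScrapedData.json layout may differ.
SAMPLE = json.loads("""
[
  {"name": "Jon Snow", "type": "Person",
   "relations": {"Allegiance": ["House Stark"], "Father": ["Rhaegar Targaryen"]}},
  {"name": "House Stark", "type": "House",
   "relations": {"Capital": ["Winterfell"]}}
]
""")


def to_tables(records):
    """Split scraped records into node rows and relationship rows."""
    nodes, edges = [], []
    for rec in records:
        nodes.append({"name": rec["name"], "label": rec["type"]})
        for rel_type, targets in rec.get("relations", {}).items():
            for target in targets:
                edges.append(
                    {"source": rec["name"], "type": rel_type, "target": target}
                )
    return nodes, edges


def edge_to_cypher(edge):
    """Build a MERGE statement and its parameters for one relationship row."""
    # The relationship type is interpolated into the query text (backtick-quoted);
    # the node names are passed as query parameters.
    query = (
        "MERGE (a {name: $source}) "
        "MERGE (b {name: $target}) "
        f"MERGE (a)-[:`{edge['type']}`]->(b)"
    )
    return query, {"source": edge["source"], "target": edge["target"]}


nodes, edges = to_tables(SAMPLE)
for edge in edges:
    query, params = edge_to_cypher(edge)
    print(query, params)
```

Each `(query, params)` pair could then be sent to Neo4j with whichever client the notebook uses. Using MERGE rather than CREATE keeps repeated runs from duplicating nodes that appear in several records.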
Run the whole project-
- Run command- "scrapy crawl GotGraphSpider -o Data/ScrapedData.json"
- Execute the jupyter notebook - DataProcessor/ScrapyOutputProcessing.ipynb.
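The two steps above could also be chained from a single script. The following is a sketch under a few assumptions: `scrapy` and `jupyter` are on the PATH, the script is started from the project root, and the Neo4j instance used by the notebook is already running. The helper names `build_commands` and `run_pipeline` are made up for illustration.

```python
import subprocess

SCRAPED_OUTPUT = "Data/ScrapedData.json"
NOTEBOOK = "DataProcessor/ScrapyOutputProcessing.ipynb"


def build_commands(output=SCRAPED_OUTPUT, notebook=NOTEBOOK):
    """Return the two commands of the pipeline as argument lists."""
    return [
        # Step 1: crawl the wiki and write the scraped items to a JSON file.
        ["scrapy", "crawl", "GotGraphSpider", "-o", output],
        # Step 2: execute the notebook non-interactively, saving results in place.
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", notebook],
    ]


def run_pipeline():
    """Run each step in order and stop at the first failure."""
    for cmd in build_commands():
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` runs the crawl and then the notebook. In recent Scrapy versions `-o` appends to an existing file, so it may be worth deleting an old Data/ScrapedData.json before re-running.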