Merge pull request #210 from DerwenAI/prep-release: Prep release

Showing 13 changed files with 720 additions and 55 deletions.
# Bibliography

Where possible, the bibliography entries use conventions at
<https://www.bibsonomy.org/>
for [*citation keys*](https://bibdesk.sourceforge.io/manual/BibDeskHelp_2.html).

Journal abbreviations come from
<https://academic-accelerator.com/Journal-Abbreviation/System>
based on [*ISO 4*](https://en.wikipedia.org/wiki/ISO_4) standards.

Links to online versions of cited works use
[DOI](https://www.doi.org/)
for [*persistent identifiers*](https://www.crossref.org/education/metadata/persistent-identifiers/).
When available,
[*open access*](https://peerj.com/preprints/3119v1/)
URLs are listed as well.


## – B –

### bougouin-etal-2013-topicrank

["TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction"](https://aclanthology.org/I13-1062/)
[**Adrien Bougouin**](https://derwen.ai/s/t67yc26f6hcg), [**Florian Boudin**](https://derwen.ai/s/y89xdcbr3mj8), [**Béatrice Daille**](https://derwen.ai/s/25nynb9g79jt)
[*IJCNLP*](https://aclanthology.org/I13-1062/) pp. 543-551 (2013-10-14)
open: <a href="https://aclanthology.org/I13-1062.pdf" target="_blank">https://aclanthology.org/I13-1062.pdf</a>
slides: <a href="http://adrien-bougouin.github.io/publications/2013/topicrank_ijcnlp_slides.pdf" target="_blank">http://adrien-bougouin.github.io/publications/2013/topicrank_ijcnlp_slides.pdf</a>
> Keyphrase extraction is the task of identifying single or multi-word expressions that represent the main topics of a document. In this paper we present TopicRank, a graph-based keyphrase extraction method that relies on a topical representation of the document. Candidate keyphrases are clustered into topics and used as vertices in a complete graph. A graph-based ranking model is applied to assign a significance score to each topic. Keyphrases are then generated by selecting a candidate from each of the top-ranked topics. We conducted experiments on four evaluation datasets of different languages and domains. Results show that TopicRank significantly outperforms state-of-the-art methods on three datasets.
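The pipeline this abstract describes — cluster candidate phrases into topics, rank the topics on a complete graph, then emit one keyphrase per topic — can be sketched in dependency-free Python. This is a deliberately simplified stand-in, not the paper's implementation: clustering here is plain word overlap rather than stem overlap, and the candidate phrases and positions are invented for illustration.

```python
from itertools import combinations

def topicrank_sketch(candidates, damping=0.85, iters=100):
    """Toy TopicRank over (phrase, word-offset) candidate pairs."""
    # 1. cluster: candidates sharing any word fall into one topic
    topics = []
    for phrase, pos in candidates:
        words = set(phrase.split())
        for topic in topics:
            if any(words & set(p.split()) for p, _ in topic):
                topic.append((phrase, pos))
                break
        else:
            topics.append([(phrase, pos)])
    # 2. complete graph over topics, weighted by candidate proximity
    n = len(topics)
    w = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weight = sum(1.0 / abs(pi - pj)
                     for _, pi in topics[i] for _, pj in topics[j])
        w[i][j] = w[j][i] = weight
    # 3. weighted PageRank via power iteration
    totals = [sum(row) for row in w]
    ranks = [1.0 / n] * n
    for _ in range(iters):
        ranks = [(1 - damping) / n
                 + damping * sum(ranks[i] * w[i][j] / totals[i]
                                 for i in range(n) if totals[i])
                 for j in range(n)]
    # 4. keep the earliest-appearing candidate of each topic
    order = sorted(range(n), key=lambda t: -ranks[t])
    return [min(topics[t], key=lambda c: c[1])[0] for t in order]

candidates = [("graph ranking", 2), ("keyphrase extraction", 5),
              ("ranking model", 10), ("document", 20)]
keyphrases = topicrank_sketch(candidates)
```

Here "graph ranking" and "ranking model" merge into one topic, which then accumulates the most rank because its candidates sit close to the others in the document.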

## – F –

### florescuc17

["PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents"](https://doi.org/10.18653/v1/P17-1102)
[**Corina Florescu**](https://derwen.ai/s/y3w6mvj2r9wv), [**Cornelia Caragea**](https://derwen.ai/s/v3rq24nf6426)
[*ACL*](https://www.aclweb.org/anthology/venues/acl/) pp. 1105-1115 (2017-07-30)
DOI: 10.18653/v1/P17-1102
open: <a href="https://www.aclweb.org/anthology/P17-1102.pdf" target="_blank">https://www.aclweb.org/anthology/P17-1102.pdf</a>
> The large and growing amounts of online scholarly data present both challenges and opportunities to enhance knowledge discovery. One such challenge is to automatically extract a small set of keyphrases from a document that can accurately describe the document's content and can facilitate fast information processing. In this paper, we propose PositionRank, an unsupervised model for keyphrase extraction from scholarly documents that incorporates information from all positions of a word's occurrences into a biased PageRank. Our model obtains remarkable improvements in performance over PageRank models that do not take into account word positions as well as over strong baselines for this task. Specifically, on several datasets of research papers, PositionRank achieves improvements as high as 29.09%.
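The position bias described in this abstract amounts to one change to PageRank's restart vector: each word's restart weight is the sum of the reciprocals of its positions. A minimal sketch in plain Python (toy tokens, invented example; not the authors' implementation):

```python
def positionrank_sketch(words, window=2, damping=0.85, iters=100):
    """Toy PositionRank: co-occurrence graph over words, with the
    restart vector biased toward words that appear early."""
    vocab = sorted(set(words))
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    # restart vector: weight 1/position summed over all occurrences
    bias = [0.0] * n
    for pos, w in enumerate(words, start=1):
        bias[idx[w]] += 1.0 / pos
    total = sum(bias)
    bias = [x / total for x in bias]
    # undirected co-occurrence edges within a sliding window
    adj = [[0.0] * n for _ in range(n)]
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            a, c = idx[w], idx[words[j]]
            if a != c:
                adj[a][c] += 1.0
                adj[c][a] += 1.0
    deg = [sum(row) for row in adj]
    ranks = list(bias)
    for _ in range(iters):
        ranks = [(1 - damping) * bias[j]
                 + damping * sum(ranks[i] * adj[i][j] / deg[i]
                                 for i in range(n) if deg[i])
                 for j in range(n)]
    return dict(zip(vocab, ranks))

scores = positionrank_sketch(["graph", "ranking", "method", "graph", "model"])
```

A word that occurs first and co-occurs widely ("graph" here) collects both restart mass and link mass, so it ends up ranked highest.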

## – G –

### gleich15

["PageRank Beyond the Web"](https://doi.org/10.1137/140976649)
[**David Gleich**](https://derwen.ai/s/7zk738z8fn9t)
[*SIAM Review*](https://www.siam.org/publications/journals/siam-review-sirev) **57** 3 pp. 321-363 (2015-08-06)
DOI: 10.1137/140976649
open: <a href="https://www.cs.purdue.edu/homes/dgleich/publications/Gleich%202015%20-%20prbeyond.pdf" target="_blank">https://www.cs.purdue.edu/homes/dgleich/publications/Gleich%202015%20-%20prbeyond.pdf</a>
> Google's PageRank method was developed to evaluate the importance of web-pages via their link structure. The mathematics of PageRank, however, are entirely general and apply to any graph or network in any domain. Thus, PageRank is now regularly used in bibliometrics, social and information network analysis, and for link prediction and recommendation. It's even used for systems analysis of road networks, as well as biology, chemistry, neuroscience, and physics. We'll see the mathematics and ideas that unite these diverse applications.
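For reference, the method this survey generalizes fits in a short power-iteration loop. A toy sketch in plain Python, with dangling-node mass spread evenly; the four-node graph is invented for illustration:

```python
def pagerank(edges, n, damping=0.85, tol=1e-9):
    """Power iteration for PageRank on a directed graph given as
    (src, dst) pairs over nodes 0..n-1."""
    out_deg = [0] * n
    for src, _ in edges:
        out_deg[src] += 1
    ranks = [1.0 / n] * n
    while True:
        # teleport term, plus mass from dangling nodes spread evenly
        dangling = sum(r for i, r in enumerate(ranks) if out_deg[i] == 0)
        new = [(1.0 - damping) / n + damping * dangling / n] * n
        for src, dst in edges:
            new[dst] += damping * ranks[src] / out_deg[src]
        if max(abs(a - b) for a, b in zip(new, ranks)) < tol:
            return new
        ranks = new

# toy graph: a 0 -> 1 -> 2 -> 0 cycle, plus an extra link 3 -> 0
scores = pagerank([(0, 1), (1, 2), (2, 0), (3, 0)], 4)
```

Node 0 receives links from both the cycle and node 3, so it comes out on top; node 3, with no in-links, gets only the teleport share.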

## – K –

### kazemi-etal-2020-biased

["Biased TextRank: Unsupervised Graph-Based Content Extraction"](https://doi.org/10.18653/v1/2020.coling-main.144)
[**Ashkan Kazemi**](https://derwen.ai/s/rjsnrs5jhswk), [**Verónica Pérez-Rosas**](https://derwen.ai/s/svmndvvnndkv), [**Rada Mihalcea**](https://derwen.ai/s/wwrw59tbtzzp)
[*COLING*](https://www.aclweb.org/anthology/venues/coling/) **28** pp. 1642-1652 (2020-12-08)
DOI: 10.18653/v1/2020.coling-main.144
open: <a href="https://www.aclweb.org/anthology/2020.coling-main.144.pdf" target="_blank">https://www.aclweb.org/anthology/2020.coling-main.144.pdf</a>
> We introduce Biased TextRank, a graph-based content extraction method inspired by the popular TextRank algorithm that ranks text spans according to their importance for language processing tasks and according to their relevance to an input 'focus'. Biased TextRank enables focused content extraction for text by modifying the random restarts in the execution of TextRank. The random restart probabilities are assigned based on the relevance of the graph nodes to the focus of the task. We present two applications of Biased TextRank: focused summarization and explanation extraction, and show that our algorithm leads to improved performance on two different datasets by significant ROUGE-N score margins. Much like its predecessor, Biased TextRank is unsupervised, easy to implement and orders of magnitude faster and lighter than current state-of-the-art Natural Language Processing methods for similar tasks. ||
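The core move — replacing TextRank's uniform random restarts with focus-weighted ones — can be sketched as below. This is an illustration rather than the authors' code: sentence similarity here is plain Jaccard word overlap, and the sentences and focus query are invented.

```python
def biased_textrank_sketch(sentences, focus, damping=0.85, iters=100):
    """Toy Biased TextRank: rank sentences on a similarity graph,
    with restart probabilities set by overlap with a focus query."""
    bags = [set(s.lower().split()) for s in sentences]
    focus_bag = set(focus.lower().split())
    n = len(sentences)
    # restart vector: word overlap with the focus (floor avoids all-zero)
    bias = [len(bag & focus_bag) + 1e-6 for bag in bags]
    total = sum(bias)
    bias = [x / total for x in bias]
    # edge weights: Jaccard similarity between sentence pairs
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            jac = len(bags[i] & bags[j]) / len(bags[i] | bags[j])
            sim[i][j] = sim[j][i] = jac
    deg = [sum(row) for row in sim]
    ranks = list(bias)
    for _ in range(iters):
        ranks = [(1 - damping) * bias[j]
                 + damping * sum(ranks[i] * sim[i][j] / deg[i]
                                 for i in range(n) if deg[i])
                 for j in range(n)]
    return ranks

sentences = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "stock prices fell on the market",
]
ranks = biased_textrank_sketch(sentences, focus="cat")
```

With a focus on "cat", almost all of the restart mass lands on the first two sentences, so the off-topic third sentence sinks to the bottom; changing the focus reweights the whole ranking without touching the graph.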

## – M –

### mihalcea04textrank

["TextRank: Bringing Order into Text"](https://www.aclweb.org/anthology/W04-3252/) | ||
[**Rada Mihalcea**](https://derwen.ai/s/wwrw59tbtzzp), [**Paul Tarau**](https://derwen.ai/s/vnfvsgvc9gfy) | ||
[*EMNLP*](https://www.aclweb.org/anthology/venues/emnlp/) pp. 404-411 (2004-07-25) | ||
open: <a href="https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf" target="_blank">https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf</a> | ||
> In this paper, the authors introduce TextRank, a graph-based ranking model for text processing, and show how this model can be successfully used in natural language applications. | ||
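The ranking model for the keyword-extraction application reduces to PageRank over a word co-occurrence graph with uniform restarts. A toy sketch in plain Python (the token sequence is invented; real TextRank also filters candidates by part of speech, which is omitted here):

```python
def textrank_keywords(words, window=2, damping=0.85, iters=100):
    """Toy TextRank keywords: PageRank over an undirected
    co-occurrence graph built with a sliding window."""
    vocab = sorted(set(words))
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    adj = [[0.0] * n for _ in range(n)]
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            a, b = idx[w], idx[words[j]]
            if a != b:
                adj[a][b] += 1.0
                adj[b][a] += 1.0
    deg = [sum(row) for row in adj]
    ranks = [1.0 / n] * n
    for _ in range(iters):
        ranks = [(1 - damping) / n
                 + damping * sum(ranks[i] * adj[i][j] / deg[i]
                                 for i in range(n) if deg[i])
                 for j in range(n)]
    return sorted(vocab, key=lambda w: -ranks[idx[w]])

keywords = textrank_keywords(
    ["graph", "based", "graph", "ranking", "graph", "model"])
```

The repeated, widely co-occurring token ("graph") accumulates the most weighted links, so it surfaces first in the ranking.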

## – P –

### page1998

["The PageRank Citation Ranking: Bringing Order to the Web"](http://ilpubs.stanford.edu:8090/422/)
[**Lawrence Page**](https://derwen.ai/s/mk6xj6cfrrxg), [**Sergey Brin**](https://derwen.ai/s/j636dghdyws5), [**Rajeev Motwani**](https://derwen.ai/s/9hhpmgjs7kwt), [**Terry Winograd**](https://derwen.ai/s/jdxk7fz84nzq)
[*Stanford InfoLab*](http://infolab.stanford.edu/) (1999-11-11)
open: <a href="http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf" target="_blank">http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf</a>
> The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.

## – W –

### williams2016

["Summarizing documents"](https://mike.place/talks/pygotham/)
[**Mike Williams**](https://derwen.ai/s/2t2mbms2x4p3)
(2016-09-25)
> I've recently given a couple of talks (PyGotham video, PyGotham slides, Strata NYC slides) about text summarization. I cover three ways of automatically summarizing text. One is an extremely simple algorithm from the 1950s, one uses Latent Dirichlet Allocation, and one uses skip-thoughts and recurrent neural networks. The talk is conceptual, and avoids code and mathematics. So here is a list of resources if you're interested in text summarization and want to dive deeper. This list is hopefully also useful if you're interested in topic modelling or neural networks for other reasons.