v1.0.17 Release
- Added Turkish language support.
- Added Tamil language support.
- Added Italian language support.
- Added Afrikaans language support.
- Improved documentation, especially with regards to adding additional languages.
Donatello-za committed Jun 21, 2021
1 parent 567b7e9 commit 5522006
Showing 3 changed files with 164 additions and 38 deletions.
138 changes: 100 additions & 38 deletions README.md
@@ -1,47 +1,65 @@
# rake-php-plus
Yet another PHP implementation of the Rapid Automatic Keyword Extraction algorithm (RAKE).
A keyword and phrase extraction library based on the Rapid Automatic Keyword Extraction algorithm (RAKE).

[![Latest Stable Version](https://poser.pugx.org/donatello-za/rake-php-plus/v/stable)](https://packagist.org/packages/donatello-za/rake-php-plus)
[![Total Downloads](https://poser.pugx.org/donatello-za/rake-php-plus/downloads)](https://packagist.org/packages/donatello-za/rake-php-plus)
[![License](https://poser.pugx.org/donatello-za/rake-php-plus/license)](https://packagist.org/packages/donatello-za/rake-php-plus)

## Why is this package useful?
## Introduction

Keywords describe the main topics expressed in a document/text. Keyword *extraction* in turn allows for the extraction of important words and phrases from text. This in turn can be used for building a list of tags or to build a keyword search index or grouping similar content by its topics and much more. This library provides an easy method for PHP developers to get a list of keywords and phrases from a string of text.
Keywords describe the main topics expressed in a document or text. Keyword *extraction* identifies the important words and phrases in a text.

This project is based on another project called [RAKE-PHP](https://github.com/Richdark/RAKE-PHP) by Richard Filipčík, which is a translation from a Python implementation simply called [RAKE](https://github.com/aneesha/RAKE).
Extracted keywords can be used for things like:
- Building a list of useful tags out of a larger text
- Building search indexes and search engines
- Grouping similar content by its topic

*As described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010).
Extracted phrases can be used for things like:
- Highlighting important areas of a larger text
- Language or documentation analysis
- Building intelligent searches based on contextual terms

This library provides an easy method for PHP developers to get a list of keywords and phrases from a string of text
and is based on another smaller and unmaintained project called [RAKE-PHP](https://github.com/Richdark/RAKE-PHP) by Richard Filipčík,
which is a translation from a Python implementation simply called [RAKE](https://github.com/aneesha/RAKE).

> *As described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010).
[Automatic Keyword Extraction from Individual Documents](https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents).
In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons.*


This particular package intends to include the following benefits over the original [RAKE-PHP](https://github.com/Richdark/RAKE-PHP) package:

1. Add [PSR-2](http://www.php-fig.org/psr/psr-2/) coding standards.
2. Implement [PSR-4](http://www.php-fig.org/psr/psr-4/) in order to be [Composer](https://getcomposer.org) installable.
3. Add additional functionality such as method chaining.
4. Add multiple ways to provide source stopwords.
1. [PSR-2](http://www.php-fig.org/psr/psr-2/) coding standards.
2. [PSR-4](http://www.php-fig.org/psr/psr-4/) to be [Composer](https://getcomposer.org) installable.
3. Additional functionality such as method chaining.
4. Multiple ways to provide source stopwords.
5. Full unit test coverage.
6. Performance improvements.
7. Improved documentation.
8. Easy language integration and multibyte string support.

## Currently Supported Languages

* Arabic (United Arab Emirates)/لإمارات العربية المتحدة (ar_AE)
* Brazilian Portuguese/português do Brasil (pt_BR)
* English US (en_US)
* Spanish/español (es_AR)
* European Portuguese/português europeu (pt_PT)
* French/le français (fr_FR)
* German (Germany)/Deutsch (Deutschland) (de_DE)
* Italian (Italiano)
* Polish/język polski (pl_PL)
* Russian/русский язык (ru_RU)
* Brazilian Portuguese/português do Brasil (pt_BR)
* European Portuguese/português europeu (pt_PT)
* Sorani Kurdish/سۆرانی (ckb_IQ)
* Arabic (United Arab Emirates)/لإمارات العربية المتحدة (ar_AE)
* German (Germany)/Deutsch (Deutschland) (de_DE)
* Spanish/español (es_AR)
* Tamil (தமிழ்)
* Turkish (Türkçe)

> If your language is not listed here, it can be added. Please see the section
called **How to add additional languages** at the bottom of the page.

## Version

v1.0.16
v1.0.17

## Special Thanks

@@ -51,6 +69,9 @@ v1.0.16
* [Khoshbin Ali Ahmed](https://github.com/Xoshbin): Sorani Kurdish and Arabic languages.
* [RhaPT](https://github.com/RhaPT): European Portuguese language.
* [Peter Thaleikis](https://github.com/spekulatius): German language.
* [Yusuf Usta](https://github.com/yusufusta): Turkish language.
* [orthosie](https://github.com/orthosie): Tamil language.
* [ScIEnzY](https://github.com/ScIEnzY): Italian language.

## Installation

@@ -423,54 +444,95 @@ Array

## How to add additional languages

**Using the stopwords extractor tool**
The library requires a list of "stopwords" for each language. Stopwords are common words used in a language such as "and", "are", "or", etc.
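
To see why the stopword list matters, here is a minimal sketch of how RAKE-style extraction treats stopwords as delimiters that split text into candidate phrases. The `candidatePhrases()` helper and the five-word English list are assumptions made for illustration. They are not part of the rake-php-plus API, and the library's own parsing is more thorough.

```php
<?php

/**
 * Split text into candidate phrases by treating stopwords as delimiters.
 * Hypothetical helper for illustration; requires PHP 7.4+ and mbstring.
 */
function candidatePhrases(string $text, array $stopwords): array
{
    // Build one alternation that matches any stopword as a whole word.
    $quoted = array_map(fn ($w) => preg_quote($w, '/'), $stopwords);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/iu';

    // Replace each stopword with a delimiter, then split on delimiters and punctuation.
    $marked = preg_replace($pattern, '|', mb_strtolower($text, 'UTF-8'));
    $parts = preg_split('/[|.,;:!?]+/u', $marked);

    // Trim whitespace and drop empty fragments.
    $parts = array_map('trim', $parts);
    return array_values(array_filter($parts, fn ($p) => $p !== ''));
}

$phrases = candidatePhrases(
    'Keywords are the main topics and phrases of a document.',
    ['are', 'the', 'and', 'of', 'a']
);
print_r($phrases); // keywords, main topics, phrases, document
```

A list that misses common words leaves them inside the phrases, so the quality of the stopword file directly affects the quality of the extracted keywords.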

The library requires a list of "stopwords" for each language. Stopwords are common words used in a language such as "and", "are", "or", etc. An example list of such stopwords can be found [here (en_US)](http://www.lextek.com/manuals/onix/stopwords2.html). You can also [take a look at this list](https://github.com/Donatello-za/stopwords-json) which has stopwords for 50 different languages in individual JSON files.
There are [stopwords for 50 languages](https://github.com/Donatello-za/stopwords-json#languages) (including the ones already supported) available in JSON format.
If you are lucky enough to have your language listed then you can easily import it into the library. To
do so, read the section below:

When working with a simple list such as in the first example, you can copy and paste the text into a text file and use the extractor tool to convert it into a format that this library can read efficiently. *An example of such a stopwords file, copied from the hyperlink above, has been included for your convenience (console/stopwords_en_US.txt).*
**Using the stopwords extractor tool**

Alternatively, you can extract the stopwords from a JSON file. An example has also been supplied under `console/stopwords_en_US.json`.
> Note: These instructions assume you are using Linux.
**Note:** Simply replace `en_US` with whatever locale you wish to use in the examples below.
We will be using the Greek language as an example:

**Important:** Before using the `extractor` tool, make sure to use the following Linux command to check whether your locale is supported:
1. Check whether your operating system has the Greek localisation files. The Greek locale
code to look for is `el_GR`, so run the command `$ locale -a` to see if it is listed.
2. If it is not listed, you'll need to create it, so run:

```sh
$ locale -a
sudo locale-gen el_GR
sudo locale-gen el_GR.utf8
```

If you do not see the locale you wish to use in the list, you can install it as follows (in this case we are installing the French locale):
3. Go to the [list of stopword files](https://github.com/Donatello-za/stopwords-json#languages) and
find the Greek language. The file is called `el.json` and contains 75 stopwords.
4. Download the `el.json` file and store it somewhere on your system.
5. In your terminal, go to the directory of the `rake-php-plus` library. It will
be under `vendor/donatello-za/rake-php-plus` if you used Composer to install it.

We now need to use the JSON file to create two new files, one will be a `.php` file
that contains the stopwords as a PHP array and one will be a `.pattern` file which
is a text file containing the stopwords as a regular expression:

1. Extract and convert the .json file to a PHP file by running:

```sh
$ sudo locale-gen fr_FR
$ sudo locale-gen fr_FR.utf8
$ php ./console/extractor.php path/to/el.json --locale=el_GR --output=php > ./some/dir/el_GR.php
```

To extract stopwords from a text file, run the following from the command line:
2. Extract and convert the .json file to a .pattern file by running:

```sh
$ cd ./console
$ php extractor.php stopwords_en_US.txt --locale=en_US --output=php
$ php ./console/extractor.php path/to/el.json --locale=el_GR --output=pattern > ./some/dir/el_GR.pattern
```

That is it! You can now use the new stopwords by specifying it when creating an instance
of the RakePlus class, for example:

```php
$rake = RakePlus::create($text, '/some/dir/el_GR.pattern');
```

To extract stopwords from a JSON file, run the following from the command line:
or

`$ php extractor.php stopwords_en_US.json --locale=en_US --output=php`
```php
$rake = RakePlus::create($text, '/some/dir/el_GR.php');
```
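
As a rough picture of what those two files contain, the sketch below builds both variants from a short word list: a regular expression in the same `/\bword\b|\bword\b/i` shape as `lang/af_ZA.pattern` from this commit, and a PHP script that returns an array like `lang/af_ZA.php`. The helper names `buildStopwordPattern()` and `buildStopwordPhpSource()` are hypothetical, and `extractor.php` may well generate its output differently (for example, the bundled Afrikaans pattern has special handling for `ʼn` that is not reproduced here).

```php
<?php

/**
 * Build the regular expression stored in a ".pattern" file from a word list.
 * Illustrative only; extractor.php may construct its pattern differently.
 */
function buildStopwordPattern(array $stopwords): string
{
    $parts = [];
    foreach ($stopwords as $word) {
        // Escape regex metacharacters and the "/" delimiter, then anchor on word boundaries.
        $parts[] = '\b' . preg_quote($word, '/') . '\b';
    }
    return '/' . implode('|', $parts) . '/i';
}

/**
 * Build the source of a ".php" stopword file: a script that returns an array.
 */
function buildStopwordPhpSource(array $stopwords): string
{
    return "<?php\n\nreturn " . var_export(array_values($stopwords), true) . ";\n";
}

// The first three Afrikaans stopwords from lang/af_ZA.php in this commit.
$stopwords = ['wat', 'was', 'vir'];

$pattern = buildStopwordPattern($stopwords);

// Round-trip the PHP variant through a temporary file to show how it is read back.
$file = sys_get_temp_dir() . '/stopwords_demo_' . uniqid() . '.php';
file_put_contents($file, buildStopwordPhpSource($stopwords));
$loaded = require $file;
unlink($file);

echo $pattern, PHP_EOL;                             // /\bwat\b|\bwas\b|\bvir\b/i
echo preg_match($pattern, 'Wat is dit?'), PHP_EOL;  // 1
```

The pattern variant is a single precompiled string, while the array variant has to be turned into a matcher at load time, which is one plausible reason to generate both.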

It will output the results to the terminal. You will notice that the results look like PHP, and in fact they are. You can write the results directly to a PHP file by redirecting the output:
**Contribute by Adding a Language**

`$ php extractor.php stopwords_en_US.txt --locale=en_US --output=php > en_US.php`
If you want your language to be officially supported, you can fork this library,
generate the `.pattern` and `.php` stopword files as described above, place them
in the `./rake-php-plus/lang/` directory and submit them as a pull request.

Finally, copy the `en_US.php` file to the `lang/` directory and then instantiate php-rake-plus like so:
Once your language is officially supported, you'll be able to specify the language
without having to specify the file to use, for example:

```php
$rake = RakePlus::create($text, 'en_US');
$rake = RakePlus::create($text, 'el_GR');
```

RakePlus will always look for a `.pattern` file first and if not found it will
look for a `.php` file in the `./lang/` directory.
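
The lookup order described above can be sketched as follows. `resolveStopwordFile()` is a hypothetical helper written for this illustration, not the library's actual implementation, and the temporary directory stands in for the `./lang/` directory.

```php
<?php

/**
 * Resolve a stopword file for a locale, preferring ".pattern" over ".php".
 * Hypothetical helper that mirrors the documented lookup order.
 */
function resolveStopwordFile(string $langDir, string $locale): ?string
{
    foreach (['.pattern', '.php'] as $extension) {
        $candidate = rtrim($langDir, '/') . '/' . $locale . $extension;
        if (is_file($candidate)) {
            return $candidate;
        }
    }
    return null; // No stopword file exists for this locale.
}

// Demonstrate with a temporary directory that only has a .php file at first.
$dir = sys_get_temp_dir() . '/rake_lang_' . uniqid();
mkdir($dir);
file_put_contents($dir . '/el_GR.php', "<?php\n\nreturn [];\n");
$first = resolveStopwordFile($dir, 'el_GR');   // resolves to the .php file

// Once a .pattern file exists (placeholder content here), it takes precedence.
file_put_contents($dir . '/el_GR.pattern', '/\bx\b/i');
$second = resolveStopwordFile($dir, 'el_GR');  // resolves to the .pattern file
```

If you ship only one of the two files for a new language, this ordering means a stale `.pattern` file would hide an updated `.php` file, so it is worth regenerating both together.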

**I don't have a stopwords file for my language, what now?**

If your language is not covered in the [list of 50 languages here](https://github.com/Donatello-za/stopwords-json#languages),
you may have to find it elsewhere. Try searching for "yourlanguage stopwords". If you
find a list or decide to create your own, you can place it in a standard text
file instead of a .json file and extract the stopwords using the extractor tool, for
example:

```sh
$ php ./console/extractor.php path/to/mystopwords.txt --locale=LOCAL_CODE --output=php > ./some/dir/LOCAL_CODE.php
$ php ./console/extractor.php path/to/mystopwords.txt --locale=LOCAL_CODE --output=pattern > ./some/dir/LOCAL_CODE.pattern
```
To improve the initial loading speed of the language file within RakePlus, you can also set the exporter to produce the results as a regular expression pattern using the `--output` argument:

`$ php extractor.php stopwords_en_US.txt --locale=en_US --output=pattern > en_US.pattern`
*Remember to replace `LOCAL_CODE` with the locale code you wish to use.*

RakePlus will always look for a `.pattern` file first and if not found it will look for a `.php` file in the `./lang/` directory.
Here is an example text file containing stopwords that was copied and pasted from a
site: [stopwords_en_US](./console/stopwords_en_US.txt)

## To run tests

1 change: 1 addition & 0 deletions lang/af_ZA.pattern
@@ -0,0 +1 @@
/\bwat\b|\bwas\b|\bvir\b|\bvan\b|\buit\b|\btoe\b|\bte\b|\bsy\b|\bso\b|\bsien\b|\bse\b|\bsal\b|\bsaam\b|\bop\b|\bons\b|\bom\b|\bnie\b|\bna\b|\bʼn(?!(-|'))\b|\b'n\b|\bmy\b|\bmet\b|\bmaar\b|\bma\b|\bkom\b|\bkan\b|\bjy\b|\bjou\b|\bis\b|\bin\b|\bhy\b|\bhulle\b|\bhom\b|\bhet\b|\bhaar\b|\bgesê\b|\bgaan\b|\ben\b|\bek\b|\been\b|\bdit\b|\bdie\b|\bdat\b|\bdag\b|\bdaar\b|\bby\b|\bbaie\b|\bas\b|\bal\b|\baf\b|\baan\b/i
63 changes: 63 additions & 0 deletions lang/af_ZA.php
@@ -0,0 +1,63 @@
<?php

/**
* Stopwords list for the use in the PHP package rake-php-plus.
* See: https://github.com/Donatello-za/rake-php-plus
*
* Extracted using extractor.php @ 2021-06-21T12:26:39+00:00
*/

return [
'wat',
'was',
'vir',
'van',
'uit',
'toe',
'te',
'sy',
'so',
'sien',
'se',
'sal',
'saam',
'op',
'ons',
'om',
'nie',
'na',
'ʼn',
'\'n',
'my',
'met',
'maar',
'ma',
'kom',
'kan',
'jy',
'jou',
'is',
'in',
'hy',
'hulle',
'hom',
'het',
'haar',
'gesê',
'gaan',
'en',
'ek',
'een',
'dit',
'die',
'dat',
'dag',
'daar',
'by',
'baie',
'as',
'al',
'af',
'aan'
];
