V1.0.9 Extractor tool can now extract stopwords from JSON files.
Donatello-za committed Mar 1, 2019
1 parent 96aac94 commit 6c759ee
Showing 4 changed files with 79 additions and 31 deletions.
35 changes: 24 additions & 11 deletions README.md
@@ -41,7 +41,7 @@ This particular package intends to include the following benefits over the origi

## Version

-v1.0.8
+v1.0.9

## Special Thanks

@@ -397,29 +397,43 @@ Array

```

-## The stopword extractor tool
+## How to add additional languages

-The library requires a list of "stopwords". Stopwords are common words
-used in a language such as "and", "are", "or", etc. A list of such stopwords
-can be found [here](http://www.lextek.com/manuals/onix/stopwords2.html). You
-can copy and paste the text into a text file and use the extractor tool to
+**Using the stopwords extractor tool**
+
+The library requires a list of "stopwords" for each language. Stopwords are
+common words used in a language, such as "and", "are", "or", etc. An example
+list of such stopwords can be found
+[here (en_US)](http://www.lextek.com/manuals/onix/stopwords2.html). You can
+also [take a look at this list](https://github.com/Donatello-za/stopwords-json),
+which has stopwords for 50 different languages in individual JSON files.
+
+When working with a simple list such as the one in the first example, you can
+copy and paste the text into a text file and use the extractor tool to
convert it into a format that this library can read efficiently. *An example
of such a stopwords file, copied from the hyperlink above, has been included
for your convenience (console/stopwords_en_US.txt).*
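
For reference, a stopwords text file is simply one word per line; lines
beginning with `#` are skipped by the extractor. A short illustrative
fragment (not the full bundled file):

```
# Stopwords for en_US
# Source: http://www.lextek.com/manuals/onix/stopwords2.html
a
about
above
and
```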

-To extract and convert such a file, run the following from the command line:
+Alternatively, you can extract the stopwords from a JSON file, an example of
+which has also been supplied; see `console/stopwords_en_US.json`.
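
A minimal sketch of what the extractor does with such a JSON file. This
mirrors the JSON branch added to `console/extractor.php` in this commit,
inlining a tiny JSON string instead of reading a file:

```php
<?php
// Mirror of the extractor's JSON branch: decode the array of words,
// then flip it into a word => true lookup table.
$json = '["a","a\'s","able","about"]';
$stopwords = json_decode($json, true);
$lookup = array_fill_keys($stopwords, true);

var_dump(isset($lookup['about'])); // bool(true)
```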

To extract stopwords from a text file, run the following from the command line:

`$ php -q extractor.php stopwords_en_US.txt`

To extract stopwords from a JSON file, run the following from the command line:

`$ php -q extractor.php stopwords_en_US.json`

It will output the results to the terminal. You will notice that the results
look like PHP, and in fact they are. You can write the results directly to a
PHP file by piping the output:

`$ php -q extractor.php stopwords_en_US.txt > en_US.php`

Finally, copy the `en_US.php` file to the `lang/` directory (you may have to
set its permissions for the web server to execute it) and then instantiate
php-rake-plus like so:

```php
$rake = RakePlus::create($text, 'en_US');
@@ -430,10 +444,9 @@ using the `-p` switch:

`$ php -q extractor.php stopwords_en_US.txt -p > en_US.pattern`

-RakePHP will always first look for a .pattern file and if not found will look
+RakePHP will always look for a .pattern file first and, if not found, will look
for a .php file in the ./lang/ directory.
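
The lookup order can be pictured with a small helper. This is illustrative
only; the hypothetical `resolve_lang_file()` below is not part of the library:

```php
<?php
// Illustrative only: picks a .pattern file over a .php file,
// matching the documented lookup order in the ./lang/ directory.
function resolve_lang_file(array $available, string $lang): ?string
{
    foreach (["{$lang}.pattern", "{$lang}.php"] as $candidate) {
        if (in_array($candidate, $available, true)) {
            return $candidate;
        }
    }

    return null;
}

echo resolve_lang_file(['en_US.pattern', 'en_US.php'], 'en_US'), "\n"; // en_US.pattern
echo resolve_lang_file(['en_US.php'], 'en_US'), "\n";                  // en_US.php
```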
## To run tests

`./vendor/bin/phpunit tests/RakePlusTest.php`
9 changes: 5 additions & 4 deletions composer.json
@@ -23,12 +23,13 @@
}
],
"require": {
-        "php": ">=5.4.0"
+        "php": ">=5.4.0",
+        "ext-json": "*",
+        "ext-mbstring": "*"
},
"require-dev": {
"php": ">=5.5.0",
-        "phpunit/phpunit": "~4.0|~5.0",
-        "ext-mbstring": "*"
+        "phpunit/phpunit": "~4.0|~5.0"
},
"autoload": {
"psr-4": {
@@ -42,7 +43,7 @@
},
"extra": {
"branch-alias": {
-            "dev-master": "1.0.5-dev"
+            "dev-master": "1.0.9-dev"
}
},
"scripts": {
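
The dependency changes above move `ext-mbstring` from `require-dev` into the
runtime requirements and add `ext-json`. A quick sanity-check sketch (not part
of the package) to confirm both extensions are available on a target machine:

```php
<?php
// extension_loaded() is a PHP built-in; this simply reports whether
// the extensions composer.json now requires at runtime are loaded.
foreach (['json', 'mbstring'] as $ext) {
    printf("%s: %s\n", $ext, extension_loaded($ext) ? 'loaded' : 'missing');
}
```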
65 changes: 49 additions & 16 deletions console/extractor.php
@@ -1,17 +1,23 @@
<?php

/**
- * Extracts stopwords from a file copied and pasted from
+ * Stopwords are either supplied in simple text files that
+ * are copied from web pages such as this one:
* http://www.lextek.com/manuals/onix/stopwords2.html
*
- * and produces an output containing the contents for a
- * PHP language file containing an array with all the
- * stopwords.
+ * or supplied as a .json file stored in the
+ * format ["a","a's","able","about","above", .... ]
*
- * Usage:
+ * This tool extracts the stopwords from these files and
+ * produces either .php output (containing a PHP array)
+ * or a .pattern file containing a regular expression pattern.
+ *
+ * Usage:
+ * To generate PHP output:
* php -q extractor.php stopwords_en_US.txt
*
* To generate a regular expression pattern:
* php -q extractor.php stopwords_en_US.txt -p
*/

/**
@@ -24,11 +30,21 @@ function check_args($arg_count)
echo "Error: Please specify the filename of the stopwords file to extract.\n";
echo "Example:\n";
echo " php -q extractor.php stopwords_en_US.txt\n";
echo " php -q extractor.php stopwords_en_US.json\n";
echo "\n";
echo "For better RakePlus performance, use the -p switch to produce a regular\n";
echo "expression pattern instead of a PHP script.\n";
echo "Example:\n";
echo " php -q extractor.php stopwords_en_US.txt -p\n";
echo " php -q extractor.php stopwords_en_US.json -p\n";
echo "\n";
echo "You can pipe the output of this tool directly into a\n";
echo ".php or .pattern file:\n";
echo "Example:\n";
echo " php -q extractor.php stopwords_en_US.txt > en_US.php\n";
echo " php -q extractor.php stopwords_en_US.json -p > en_US.pattern\n";
echo "\n";

exit(1);
}
}
@@ -42,7 +58,7 @@ function check_args($arg_count)
*/
function get_arg($args, $arg_no, $default = null)
{
-    if ($arg_no <= count($args)) {
+    if ($arg_no < count($args)) {
return $args[$arg_no];
} else {
return $default;
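
Why the comparison above changed from `<=` to `<`: a self-contained sketch of
`get_arg()` showing the boundary case the fix covers.

```php
<?php
// For a CLI call like: php -q extractor.php stopwords_en_US.txt
// $args has 2 entries and the valid indexes are 0 and 1, so
// get_arg($args, 2) must fall back to the default instead of
// reading past the end of the array (as <= allowed).
function get_arg($args, $arg_no, $default = null)
{
    if ($arg_no < count($args)) {
        return $args[$arg_no];
    }

    return $default;
}

$args = ['extractor.php', 'stopwords_en_US.txt'];
echo get_arg($args, 1), "\n";           // the filename argument
echo get_arg($args, 2, '(none)'), "\n"; // out of range: the default
```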
@@ -58,21 +74,38 @@ function load_stopwords($stopwords_file)
{
$stopwords = [];

-    if ($h = @fopen($stopwords_file, 'r')) {
-        while (($line = fgets($h)) !== false) {
-            $line = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $line);
-            if (!empty($line) && $line[0] != '#') {
-                $stopwords[$line] = true;
-            }
-        }
-    } else {
+    $ext = pathinfo($stopwords_file, PATHINFO_EXTENSION);
+    if (!file_exists($stopwords_file)) {
         echo "\n";
-        echo "Error: Could not read file \"{$stopwords_file}\".\n";
+        echo "Error: Stopwords file \"{$stopwords_file}\" not found.\n";
         echo "\n";
         exit(1);
     }

-    return $stopwords;
+    if ($ext === 'txt') {
+        if ($h = @fopen($stopwords_file, 'r')) {
+            while (($line = fgets($h)) !== false) {
+                $line = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $line);
+                if (!empty($line) && $line[0] != '#') {
+                    $stopwords[$line] = true;
+                }
+            }
+
+            return $stopwords;
+        } else {
+            echo "\n";
+            echo "Error: Could not read text file \"{$stopwords_file}\".\n";
+            echo "\n";
+            exit(1);
+        }
+    }
+
+    if ($ext === 'json') {
+        $stopwords = json_decode(file_get_contents($stopwords_file), true);
+        return array_fill_keys($stopwords, true);
+    }
+
+    return [];
}
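
The text branch above strips Unicode separator/control characters and skips
comment lines. A condensed sketch of that filtering, assuming in-memory lines
instead of `fgets()` on a file handle:

```php
<?php
// Condensed sketch of the text-file branch of load_stopwords():
// trim Unicode separators/controls, skip blanks and '#' comments.
$lines = ["  and \n", "# a comment\n", "or\n", "\n"];

$stopwords = [];
foreach ($lines as $line) {
    $line = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $line);
    if (!empty($line) && $line[0] != '#') {
        $stopwords[$line] = true;  // keys give O(1) lookups later
    }
}

print_r(array_keys($stopwords)); // ['and', 'or']
```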

/**
1 change: 1 addition & 0 deletions console/stopwords_en_US.json
@@ -0,0 +1 @@
["a","a's","able","about","above","according","accordingly","across","actually","after","afterwards","again","against","ain't","all","allow","allows","almost","alone","along","already","also","although","always","am","among","amongst","an","and","another","any","anybody","anyhow","anyone","anything","anyway","anyways","anywhere","apart","appear","appreciate","appropriate","are","aren't","around","as","aside","ask","asking","associated","at","available","away","awfully","b","be","became","because","become","becomes","becoming","been","before","beforehand","behind","being","believe","below","beside","besides","best","better","between","beyond","both","brief","but","by","c","c'mon","c's","came","can","can't","cannot","cant","cause","causes","certain","certainly","changes","clearly","co","com","come","comes","concerning","consequently","consider","considering","contain","containing","contains","corresponding","could","couldn't","course","currently","d","definitely","described","despite","did","didn't","different","do","does","doesn't","doing","don't","done","down","downwards","during","e","each","edu","eg","eight","either","else","elsewhere","enough","entirely","especially","et","etc","even","ever","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","f","far","few","fifth","first","five","followed","following","follows","for","former","formerly","forth","four","from","further","furthermore","g","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","h","had","hadn't","happens","hardly","has","hasn't","have","haven't","having","he","he's","hello","help","hence","her","here","here's","hereafter","hereby","herein","hereupon","hers","herself","hi","him","himself","his","hither","hopefully","how","howbeit","however","i","i'd","i'll","i'm","i've","ie","if","ignored","immediate","in","inasmuch","inc","indeed","indicate","indicated","indicates","inner","insofar","instead","into","inward","is","isn't","it","it'd","it'll","it's","its","itself","j","just","k","keep","keeps","kept","know","known","knows","l","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","little","look","looking","looks","ltd","m","mainly","many","may","maybe","me","mean","meanwhile","merely","might","more","moreover","most","mostly","much","must","my","myself","n","name","namely","nd","near","nearly","necessary","need","needs","neither","never","nevertheless","new","next","nine","no","nobody","non","none","noone","nor","normally","not","nothing","novel","now","nowhere","o","obviously","of","off","often","oh","ok","okay","old","on","once","one","ones","only","onto","or","other","others","otherwise","ought","our","ours","ourselves","out","outside","over","overall","own","p","particular","particularly","per","perhaps","placed","please","plus","possible","presumably","probably","provides","q","que","quite","qv","r","rather","rd","re","really","reasonably","regarding","regardless","regards","relatively","respectively","right","s","said","same","saw","say","saying","says","second","secondly","see","seeing","seem","seemed","seeming","seems","seen","self","selves","sensible","sent","serious","seriously","seven","several","shall","she","should","shouldn't","since","six","so","some","somebody","somehow","someone","something","sometime","sometimes","somewhat","somewhere","soon","sorry","specified","specify","specifying","still","sub","such","sup","sure","t","t's","take","taken","tell","tends","th","than","thank","thanks","thanx","that","that's","thats","the","their","theirs","them","themselves","then","thence","there","there's","thereafter","thereby","therefore","therein","theres","thereupon","these","they","they'd","they'll","they're","they've","think","third","this","thorough","thoroughly","those","though","three","through","throughout","thru","thus","to","together","too","took","toward","towards","tried","tries","truly","try","trying","twice","two","u","un","under","unfortunately","unless","unlikely","until","unto","up","upon","us","use","used","useful","uses","using","usually","uucp","v","value","various","very","via","viz","vs","w","want","wants","was","wasn't","way","we","we'd","we'll","we're","we've","welcome","well","went","were","weren't","what","what's","whatever","when","whence","whenever","where","where's","whereafter","whereas","whereby","wherein","whereupon","wherever","whether","which","while","whither","who","who's","whoever","whole","whom","whose","why","will","willing","wish","with","within","without","won't","wonder","would","wouldn't","x","y","yes","yet","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves","z","zero"]
