V1.0.9 Extractor tool can now extract stopwords from JSON files.
Donatello-za committed Mar 1, 2019
1 parent 96aac94 commit 6c759ee
Showing 4 changed files with 79 additions and 31 deletions.
35 changes: 24 additions & 11 deletions README.md
@@ -41,7 +41,7 @@ This particular package intends to include the following benefits over the origi

## Version

-v1.0.8
+v1.0.9

## Special Thanks

@@ -397,29 +397,43 @@ Array

```

-## The stopword extractor tool
+## How to add additional languages

-The library requires a list of "stopwords". Stopwords are common words
-used in a language such as "and", "are", "or", etc. A list of such stopwords
-can be found [here](http://www.lextek.com/manuals/onix/stopwords2.html). You
-can copy and paste the text into a text file and use the extractor tool to
+**Using the stopwords extractor tool**
+
+The library requires a list of "stopwords" for each language. Stopwords are
+common words used in a language, such as "and", "are", "or", etc. An example
+list of such stopwords can be found
+[here (en_US)](http://www.lextek.com/manuals/onix/stopwords2.html). You can
+also [take a look at this list](https://github.com/Donatello-za/stopwords-json),
+which has stopwords for 50 different languages in individual JSON files.
+
+When working with a simple list such as the one in the first example, you can
+copy and paste the text into a text file and use the extractor tool to
convert it into a format that this library can read efficiently. *An example
of such a stopwords file, copied from the hyperlink above, has been included
for your convenience (console/stopwords_en_US.txt).*
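
For reference, a stopwords text file is simply one word per line; lines
beginning with `#` are skipped by the extractor. A short illustrative
fragment (not the full bundled file):

```
# Stopwords for en_US
# Source: http://www.lextek.com/manuals/onix/stopwords2.html
a
about
above
and
```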

-To extract and convert such a file, run the following from the command line:
+Alternatively, you can extract the stopwords from a JSON file, an example of
+which has also been supplied; see `console/stopwords_en_US.json`.
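
A minimal sketch of what the extractor does with such a JSON file. This
mirrors the JSON branch added to `console/extractor.php` in this commit,
inlining a tiny JSON string instead of reading a file:

```php
<?php
// Mirror of the extractor's JSON branch: decode the array of words,
// then flip it into a word => true lookup table.
$json = '["a","a\'s","able","about"]';
$stopwords = json_decode($json, true);
$lookup = array_fill_keys($stopwords, true);

var_dump(isset($lookup['about'])); // bool(true)
```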

To extract stopwords from a text file, run the following from the command line:

`$ php -q extractor.php stopwords_en_US.txt`

To extract stopwords from a JSON file, run the following from the command line:

`$ php -q extractor.php stopwords_en_US.json`

It will output the results to the terminal. You will notice that the results
look like PHP, and in fact they are. You can write the results directly to a
PHP file by piping the output:

`$ php -q extractor.php stopwords_en_US.txt > en_US.php`

Finally, copy the `en_US.php` file to the `lang/` directory (you may have to
set its permissions for the web server to execute it) and then instantiate
php-rake-plus like so:

```php
$rake = RakePlus::create($text, 'en_US');
@@ -430,10 +444,9 @@ using the `-p` switch:

`$ php -q extractor.php stopwords_en_US.txt -p > en_US.pattern`

-RakePHP will always first look for a .pattern file and if not found will look
+RakePHP will always look for a .pattern file first and, if not found, will look
for a .php file in the ./lang/ directory.
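
The lookup order can be pictured with a small helper. This is illustrative
only; the hypothetical `resolve_lang_file()` below is not part of the library:

```php
<?php
// Illustrative only: picks a .pattern file over a .php file,
// matching the documented lookup order in the ./lang/ directory.
function resolve_lang_file(array $available, string $lang): ?string
{
    foreach (["{$lang}.pattern", "{$lang}.php"] as $candidate) {
        if (in_array($candidate, $available, true)) {
            return $candidate;
        }
    }

    return null;
}

echo resolve_lang_file(['en_US.pattern', 'en_US.php'], 'en_US'), "\n"; // en_US.pattern
echo resolve_lang_file(['en_US.php'], 'en_US'), "\n";                  // en_US.php
```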
## To run tests

`./vendor/bin/phpunit tests/RakePlusTest.php`
9 changes: 5 additions & 4 deletions composer.json
@@ -23,12 +23,13 @@
}
],
"require": {
-        "php": ">=5.4.0"
+        "php": ">=5.4.0",
+        "ext-json": "*",
+        "ext-mbstring": "*"
},
"require-dev": {
"php": ">=5.5.0",
-        "phpunit/phpunit": "~4.0|~5.0",
-        "ext-mbstring": "*"
+        "phpunit/phpunit": "~4.0|~5.0"
},
"autoload": {
"psr-4": {
@@ -42,7 +43,7 @@
},
"extra": {
"branch-alias": {
-            "dev-master": "1.0.5-dev"
+            "dev-master": "1.0.9-dev"
}
},
"scripts": {
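
The dependency changes above move `ext-mbstring` from `require-dev` into the
runtime requirements and add `ext-json`. A quick sanity-check sketch (not part
of the package) to confirm both extensions are available on a target machine:

```php
<?php
// extension_loaded() is a PHP built-in; this simply reports whether
// the extensions composer.json now requires at runtime are loaded.
foreach (['json', 'mbstring'] as $ext) {
    printf("%s: %s\n", $ext, extension_loaded($ext) ? 'loaded' : 'missing');
}
```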
65 changes: 49 additions & 16 deletions console/extractor.php
@@ -1,17 +1,23 @@
<?php

/**
- * Extracts stopwords from a file copied and pasted from
+ * Stopwords are either supplied in simple text files that
+ * are copied from web pages such as this one:
* http://www.lextek.com/manuals/onix/stopwords2.html
*
- * and produces an output containing the contents for a
- * PHP language file containing an array with all the
- * stopwords.
+ * or supplied as a .json file stored in the
+ * format ["a","a's","able","about","above", .... ]
*
- * Usage:
+ * This tool extracts the stopwords from these files and
+ * produces either .php output (containing a PHP array)
+ * or a .pattern file containing a regular expression pattern.
+ *
+ * Usage:
+ * To generate PHP output:
* php -q extractor.php stopwords_en_US.txt
*
* To generate a regular expression pattern:
* php -q extractor.php stopwords_en_US.txt -p
*/

/**
@@ -24,11 +30,21 @@ function check_args($arg_count)
echo "Error: Please specify the filename of the stopwords file to extract.\n";
echo "Example:\n";
echo " php -q extractor.php stopwords_en_US.txt\n";
echo " php -q extractor.php stopwords_en_US.json\n";
echo "\n";
echo "For better RakePlus performance, use the -p switch to produce a regular\n";
echo "expression pattern instead of a PHP script.\n";
echo "Example:\n";
echo " php -q extractor.php stopwords_en_US.txt -p\n";
echo " php -q extractor.php stopwords_en_US.json -p\n";
echo "\n";
echo "You can pipe the output of this tool directly into a\n";
echo ".php or .pattern file:\n";
echo "Example:\n";
echo " php -q extractor.php stopwords_en_US.txt > en_US.php\n";
echo " php -q extractor.php stopwords_en_US.json -p > en_US.pattern\n";
echo "\n";

exit(1);
}
}
@@ -42,7 +58,7 @@ function check_args($arg_count)
*/
function get_arg($args, $arg_no, $default = null)
{
-    if ($arg_no <= count($args)) {
+    if ($arg_no < count($args)) {
return $args[$arg_no];
} else {
return $default;
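
Why the comparison above changed from `<=` to `<`: a self-contained sketch of
`get_arg()` showing the boundary case the fix covers.

```php
<?php
// For a CLI call like: php -q extractor.php stopwords_en_US.txt
// $args has 2 entries and the valid indexes are 0 and 1, so
// get_arg($args, 2) must fall back to the default instead of
// reading past the end of the array (as <= allowed).
function get_arg($args, $arg_no, $default = null)
{
    if ($arg_no < count($args)) {
        return $args[$arg_no];
    }

    return $default;
}

$args = ['extractor.php', 'stopwords_en_US.txt'];
echo get_arg($args, 1), "\n";           // the filename argument
echo get_arg($args, 2, '(none)'), "\n"; // out of range: the default
```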
@@ -58,21 +74,38 @@ function load_stopwords($stopwords_file)
{
$stopwords = [];

-    if ($h = @fopen($stopwords_file, 'r')) {
-        while (($line = fgets($h)) !== false) {
-            $line = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $line);
-            if (!empty($line) && $line[0] != '#') {
-                $stopwords[$line] = true;
-            }
-        }
-    } else {
+    $ext = pathinfo($stopwords_file, PATHINFO_EXTENSION);
+    if (!file_exists($stopwords_file)) {
         echo "\n";
-        echo "Error: Could not read file \"{$stopwords_file}\".\n";
+        echo "Error: Stopwords file \"{$stopwords_file}\" not found.\n";
         echo "\n";
         exit(1);
     }

-    return $stopwords;
+    if ($ext === 'txt') {
+        if ($h = @fopen($stopwords_file, 'r')) {
+            while (($line = fgets($h)) !== false) {
+                $line = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $line);
+                if (!empty($line) && $line[0] != '#') {
+                    $stopwords[$line] = true;
+                }
+            }
+
+            return $stopwords;
+        } else {
+            echo "\n";
+            echo "Error: Could not read text file \"{$stopwords_file}\".\n";
+            echo "\n";
+            exit(1);
+        }
+    }
+
+    if ($ext === 'json') {
+        $stopwords = json_decode(file_get_contents($stopwords_file), true);
+        return array_fill_keys($stopwords, true);
+    }
+
+    return [];
}
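
The text branch above strips Unicode separator/control characters and skips
comment lines. A condensed sketch of that filtering, assuming in-memory lines
instead of `fgets()` on a file handle:

```php
<?php
// Condensed sketch of the text-file branch of load_stopwords():
// trim Unicode separators/controls, skip blanks and '#' comments.
$lines = ["  and \n", "# a comment\n", "or\n", "\n"];

$stopwords = [];
foreach ($lines as $line) {
    $line = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $line);
    if (!empty($line) && $line[0] != '#') {
        $stopwords[$line] = true;  // keys give O(1) lookups later
    }
}

print_r(array_keys($stopwords)); // ['and', 'or']
```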

/**
1 change: 1 addition & 0 deletions console/stopwords_en_US.json
@@ -0,0 +1 @@
["a","a's","able","about","above","according","accordingly","across","actually","after","afterwards","again","against","ain't","all","allow","allows","almost","alone","along","already","also","although","always","am","among","amongst","an","and","another","any","anybody","anyhow","anyone","anything","anyway","anyways","anywhere","apart","appear","appreciate","appropriate","are","aren't","around","as","aside","ask","asking","associated","at","available","away","awfully","b","be","became","because","become","becomes","becoming","been","before","beforehand","behind","being","believe","below","beside","besides","best","better","between","beyond","both","brief","but","by","c","c'mon","c's","came","can","can't","cannot","cant","cause","causes","certain","certainly","changes","clearly","co","com","come","comes","concerning","consequently","consider","considering","contain","containing","contains","corresponding","could","couldn't","course","currently","d","definitely","described","despite","did","didn't","different","do","does","doesn't","doing","don't","done","down","downwards","during","e","each","edu","eg","eight","either","else","elsewhere","enough","entirely","especially","et","etc","even","ever","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","f","far","few","fifth","first","five","followed","following","follows","for","former","formerly","forth","four","from","further","furthermore","g","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","h","had","hadn't","happens","hardly","has","hasn't","have","haven't","having","he","he's","hello","help","hence","her","here","here's","hereafter","hereby","herein","hereupon","hers","herself","hi","him","himself","his","hither","hopefully","how","howbeit","however","i","i'd","i'll","i'm","i've","ie","if","ignored","immediate","in","inasmuch","inc","indeed","indicate","indicated","indicates","inner","insofar","instead","into","inward","is","isn't","it","it'd","it'll","it's","its","itself","j","just","k","keep","keeps","kept","know","known","knows","l","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","little","look","looking","looks","ltd","m","mainly","many","may","maybe","me","mean","meanwhile","merely","might","more","moreover","most","mostly","much","must","my","myself","n","name","namely","nd","near","nearly","necessary","need","needs","neither","never","nevertheless","new","next","nine","no","nobody","non","none","noone","nor","normally","not","nothing","novel","now","nowhere","o","obviously","of","off","often","oh","ok","okay","old","on","once","one","ones","only","onto","or","other","others","otherwise","ought","our","ours","ourselves","out","outside","over","overall","own","p","particular","particularly","per","perhaps","placed","please","plus","possible","presumably","probably","provides","q","que","quite","qv","r","rather","rd","re","really","reasonably","regarding","regardless","regards","relatively","respectively","right","s","said","same","saw","say","saying","says","second","secondly","see","seeing","seem","seemed","seeming","seems","seen","self","selves","sensible","sent","serious","seriously","seven","several","shall","she","should","shouldn't","since","six","so","some","somebody","somehow","someone","something","sometime","sometimes","somewhat","somewhere","soon","sorry","specified","specify","specifying","still","sub","such","sup","sure","t","t's","take","taken","tell","tends","th","than","thank","thanks","thanx","that","that's","thats","the","their","theirs","them","themselves","then","thence","there","there's","thereafter","thereby","therefore","therein","theres","thereupon","these","they","they'd","they'll","they're","they've","think","third","this","thorough","thoroughly","those","though","three","through","throughout","thru","thus","to","together","too","took","toward","towards","tried","tries","truly","try","trying","twice","two","u","un","under","unfortunately","unless","unlikely","until","unto","up","upon","us","use","used","useful","uses","using","usually","uucp","v","value","various","very","via","viz","vs","w","want","wants","was","wasn't","way","we","we'd","we'll","we're","we've","welcome","well","went","were","weren't","what","what's","whatever","when","whence","whenever","where","where's","whereafter","whereas","whereby","wherein","whereupon","wherever","whether","which","while","whither","who","who's","whoever","whole","whom","whose","why","will","willing","wish","with","within","without","won't","wonder","would","wouldn't","x","y","yes","yet","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves","z","zero"]
