- 1. Description
- 2. Installation
- 3. Generating Author Fingerprints
- 4. Matching Unknown Text
- 5. Counter-Stylometry
- 6. Plotting Text Distributions
- 7. Options
- 7.1. -t, --text <string or filename>
- 7.2. -n, --name <author>
- 7.3. -f, --fingerprint
- 7.4. -m, --match <directory>
- 7.5. -l, --list <directory>
- 7.6. -c, --compare <fingerprint>
- 7.7. --missing <fingerprint>
- 7.8. --more <fingerprint>
- 7.9. --less <fingerprint>
- 7.10. -p, --plot <directory>
- 7.11. --plottexts <directory>
- 7.12. --title <text>
- 7.13. -v, --version
- 7.14. -h, --help
- 8. Bugs
- 9. Author
This is a tool to facilitate the stylometric analysis of texts. It can be used for academic work on disputed authorship, and to help identify plagiarists, astroturfers, sockpuppets and guerrilla marketers. Another possible use is as an aid to anonymising writing style.
Every author has their own unique writing style, and if enough writing examples are available then it's possible to construct a quantitative model of their style which can be compared against others.
The easiest way to install is from a pre-compiled package, a number of which are available at:
https://build.opensuse.org/project/show?project=home%3Amotters%3Astylom
But if you prefer not to get involved with binaries then this program is pretty easy to compile and install as follows:
make
sudo make install
If you wish to generate a Debian package you can also run the following script:
./debian.sh
To plot graphs you will need to have gnuplot installed. For example:
sudo apt-get install gnuplot
Fingerprints are high-dimensional vectors which represent the writing style of particular authors. To generate a fingerprint, first gather some examples of the author's writing as plain text files and put them in a directory. Then run something similar to the following:
cat texts/dickens/* | stylom -n "Charles Dickens" -f > fingerprints/charles_dickens.style
Here the texts get piped into the command and the resulting fingerprint is then saved to a file. Ideally the amount of example text should be as large as possible.
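To give a rough idea of what a fingerprint vector might contain, here is a minimal Python sketch that reduces a text to the relative frequencies of its most common words. This is only an illustration: the features stylom actually computes are not described here, and the function name word_fingerprint and the top_n parameter are hypothetical.

```python
import re
from collections import Counter


def word_fingerprint(text, top_n=100):
    """Hypothetical helper: relative frequencies of the top_n most common words.

    This is an assumed, simplified feature set, not stylom's actual one.
    """
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Each dimension of the vector is one word's share of all tokens.
    return {word: count / total for word, count in counts.most_common(top_n)}


sample = "It was the best of times, it was the worst of times"
fingerprint = word_fingerprint(sample)
for word, frequency in sorted(fingerprint.items()):
    print(f"{word}\t{frequency:.3f}")
```

The more text is supplied, the more stable these frequencies become, which is one reason a large sample is preferable.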
If you have a text sample whose author is unknown, but who is likely to be one of the authors for whom you have already generated a fingerprint, you can find the most likely candidate in the following way:
cat unknown.txt | stylom --match fingerprints
Here fingerprints is the name of a directory containing previously saved fingerprints for a variety of possible authors. This returns a single name, but it's also possible to return a list of candidates in the following way:
cat unknown.txt | stylom --list fingerprints
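Conceptually, matching amounts to comparing the unknown text's vector against each stored fingerprint and ranking the authors by similarity. The Python sketch below uses cosine similarity over word-frequency dictionaries. The similarity measure, the function names and the toy data are all assumptions made for illustration; stylom's own scoring may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_authors(unknown, fingerprints):
    """Return (author, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(unknown, fp))
              for name, fp in fingerprints.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy fingerprints with made-up frequencies (illustrative only).
known = {
    "Author A": {"the": 0.5, "whilst": 0.3, "upon": 0.2},
    "Author B": {"the": 0.5, "gonna": 0.4, "yeah": 0.1},
}
unknown = {"the": 0.6, "upon": 0.3, "whilst": 0.1}

ranking = rank_authors(unknown, known)
best_author = ranking[0][0]   # analogous to a single best match
print(best_author)
for name, score in ranking:   # analogous to a ranked candidate list
    print(f"{name}\t{score:.3f}")
```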
Stylometry has various legitimate uses, such as the study of disputed historical texts or the detection of plagiarism. However, a possible danger of stylometric methods is their abuse by powerful organisations against bloggers or whistleblowers who may be trying to raise issues of public concern anonymously in order to avoid serious reprisals.
A known method of making stylometry less effective at identifying an author is to make the writing style as similar as possible to that of some existing author, hence obscuring the differences. Automatically transforming the writing style of a given text into the style of a known author is a very difficult and likely AI-complete problem, but this program can be used to provide the writer with advice on how to alter their work to achieve a greater degree of similarity.
cat mytext.txt | stylom -c fingerprints/charles_dickens.style
or
stylom -t mytext.txt -c fingerprints/charles_dickens.style
This takes some text and compares it to a pre-computed fingerprint for Charles Dickens. It then gives some indication of how to alter mytext.txt in order to make it more similar to Dickens' writing style. The number of differences is also shown, which can be used as an indicator of progress.
If you are only interested in which words are present in mytext.txt but missing from Dickens' writing:
cat mytext.txt | stylom --missing fingerprints/charles_dickens.style
or
stylom -t mytext.txt --missing fingerprints/charles_dickens.style
The results could then be used by some other program (for example, highlighted in a text editor GUI). The same can also be done for words which are more frequent within the fingerprint or less frequent within the fingerprint:
cat mytext.txt | stylom --more fingerprints/charles_dickens.style
or
cat mytext.txt | stylom --less fingerprints/charles_dickens.style
The above methods can be used to make the vocabulary and word frequencies similar to those of an existing known author. It's not a perfect solution, since it takes no account of syntax, and it still involves manual editing effort by the writer.
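The three kinds of report (missing, more frequent, less frequent) can be pictured as simple comparisons between two word-frequency tables. The Python sketch below is a hypothetical illustration of that idea, not stylom's implementation; the function name and the toy frequencies are assumptions.

```python
def compare_to_fingerprint(text_freq, author_freq):
    """Hypothetical comparison of a text's word frequencies with an author's.

    missing: words in the text that the author's fingerprint lacks
    more:    words more frequent in the fingerprint than in the text
    less:    words less frequent in the fingerprint than in the text
    """
    missing = sorted(w for w in text_freq if w not in author_freq)
    more = sorted(w for w, f in author_freq.items()
                  if f > text_freq.get(w, 0.0))
    less = sorted(w for w, f in author_freq.items()
                  if w in text_freq and f < text_freq[w])
    return {"missing": missing, "more": more, "less": less}


# Toy values, chosen only to exercise each category.
my_text = {"the": 0.4, "awesome": 0.2, "upon": 0.1}
author = {"the": 0.5, "upon": 0.05, "whilst": 0.1}

report = compare_to_fingerprint(my_text, author)
for category, words in report.items():
    print(category, words)
```

A writer would remove or replace the "missing" words, use the "more" words more often and the "less" words less often, then re-run the comparison to see whether the number of differences has dropped.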
You can plot the distribution of fingerprints in the following way:
stylom --plot <directory>
This plots any fingerprints found in the given directory on a 2D graph so that you can see how authors are distributed. By default the graph is saved with the filename result.png.
If necessary you can also specify a title for the graph:
stylom --title "My Graph Title" --plot <directory>
You can also plot texts directly, without having previously calculated fingerprints for them:
stylom --title "My Texts" --plottexts <directory>
-t, --text <string or filename>: This can be either text to be analysed or the name of a file containing plain text.
-l, --list <directory>: Match the current text against the fingerprints stored within a directory and show a list of the most similar authors.
-c, --compare <fingerprint>: Compare the current text against the given fingerprint and report differences.
--missing <fingerprint>: Report words which are present in the current text but which are missing from the given fingerprint.
--more <fingerprint>: Report words which are present more frequently in the given fingerprint than in the current text.
--less <fingerprint>: Report words which are present less frequently in the given fingerprint than in the current text.
-p, --plot <directory>: Plot fingerprints within the given directory using gnuplot. The resulting image is saved as result.png.
--plottexts <directory>: Plot texts within the given directory using gnuplot. The resulting image is saved as result.png.
Report all bugs to https://github.com/fuzzgun/stylom/issues
Bob Mottram <[email protected]>
GPG ID: 0xEA982E38
GPG Fingerprint: D538 1159 CD7A 2F80 2F06 ABA0 0452 CC7C EA98 2E38