Evaluating Binary n-gram Analysis For Authorship Attribution
Mark Carman, Helen Ashman
Pages - 60 - 91     |    Revised - 31-10-2019     |    Published - 01-12-2019
Volume - 10   Issue - 4    |    Publication Date - December 2019
Authorship Attribution, Binary n-gram, Stop Word, Cross-domain, Cross-genre.
Authorship attribution techniques focus on characters and words. However, the inclusion of words with meaning may complicate authorship attribution. Using only function words provides good authorship attribution with semantic or character n-gram analyses, but it is not yet known whether it improves binary n-gram analyses.

The literature mostly reports on authorship attribution at the word or character level. Binary n-gram analysis instead interprets text as a sequence of bits. Previous work with binary n-grams assessed authorship attribution of full texts only. This paper evaluates binary n-gram authorship attribution over text stripped of content words, as well as over a range of cross-domain scenarios.
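As a rough illustration of what interpreting text as binary can look like, the sketch below counts overlapping n-bit windows over the encoded bytes of a text. The UTF-8 encoding, the one-bit sliding step, the default window size, and the names binary_ngrams and profile_distance are all assumptions made for this example; the abstract does not specify these details, and the distance measure is a simple stand-in rather than the measure used in the paper.

```python
from collections import Counter


def binary_ngrams(text: str, n: int = 8, encoding: str = "utf-8") -> Counter:
    """Count overlapping n-bit windows in the encoded form of `text`.

    Assumption: the text is encoded as UTF-8 and the window slides one bit
    at a time. The paper may use a different encoding or step size.
    """
    # Render every byte as eight '0'/'1' characters and join them into one stream.
    bits = "".join(f"{byte:08b}" for byte in text.encode(encoding))
    # Collect each window of n consecutive bits; empty if the stream is shorter than n.
    return Counter(bits[i:i + n] for i in range(len(bits) - n + 1))


def profile_distance(a: Counter, b: Counter) -> float:
    """Sum of absolute differences between two relative-frequency profiles.

    Hypothetical dissimilarity chosen for illustration only.
    """
    total_a, total_b = sum(a.values()), sum(b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both profiles must contain at least one n-gram")
    keys = set(a) | set(b)
    return sum(abs(a[k] / total_a - b[k] / total_b) for k in keys)


if __name__ == "__main__":
    known = binary_ngrams("It was the best of times, it was the worst of times.")
    query = binary_ngrams("It was the age of wisdom, it was the age of foolishness.")
    print(profile_distance(known, query))
```

Under these assumptions, a query text would be attributed to whichever candidate author's profile lies closest to it.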

This paper reports a sequence of experiments. First, the binary n-gram analysis method is directly compared with character n-grams for authorship attribution. Then it is evaluated over three forms of input text: full text, stop words and function words only, and content words only. Finally, it is tested over cross-domain and cross-genre texts, as well as multiple-author texts.
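The three forms of input text can be pictured with a minimal sketch that splits one passage into a full version, a version keeping only stop and function words, and a version keeping only content words. The tiny STOP_WORDS set, the regular-expression tokenizer, and the function name split_forms are hypothetical choices for this example and are not the word list or preprocessing used in the paper.

```python
import re

# Tiny illustrative stop word list. This is an assumption for the example,
# not the list used in the paper.
STOP_WORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "in", "it", "is", "was", "on", "that"}
)


def split_forms(text: str, stop_words: frozenset = STOP_WORDS) -> dict:
    """Return the full, function-word-only, and content-word-only forms of `text`."""
    # Simplistic tokenizer: runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    function_only = [t for t in tokens if t.lower() in stop_words]
    content_only = [t for t in tokens if t.lower() not in stop_words]
    return {
        "full": " ".join(tokens),
        "function": " ".join(function_only),
        "content": " ".join(content_only),
    }


if __name__ == "__main__":
    for name, form in split_forms("The cat sat on the mat").items():
        print(f"{name}: {form}")
```

Each of the three resulting strings could then be passed to the same n-gram analysis, so that any difference in attribution accuracy reflects the kind of words retained.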
Mr. Mark Carman
School of Information Technology and Mathematical Sciences, University of South Australia, Adelaide, SA 5095, Australia
Dr. Helen Ashman
School of Information Technology and Mathematical Sciences, University of South Australia, Adelaide, SA 5095, Australia