Spotting opinion spammers using behavioral footprints

    KDD, pp. 632-640, 2013.

    Cited by: 305
    Keywords:
    latent population distribution; opinion spammers; behavioral footprint; product review; model inference result
    Weibo:
    This paper proposed a novel and principled method to exploit observed reviewing behaviors to detect opinion spammers in an unsupervised Bayesian inference framework

    Abstract:

    Opinionated social media such as product reviews are now widely used by individuals and organizations for their decision making. However, due to the reason of profit or fame, people try to game the system by opinion spamming (e.g., writing fake reviews) to promote or to demote some target products. In recent years, fake review detection h…

    Introduction
    • Online reviews of products and services are used extensively by consumers and businesses to make critical purchase, product design, and customer service decisions.
    • Due to the financial incentives associated with positive reviews, imposters try to game the system by posting fake reviews and giving unfair ratings to promote or demote target products and services.
    • Such individuals are called opinion spammers and their activities are called opinion spamming [14].
    • Unlike many other forms of spamming, the key difficulty in solving the opinion spam problem is that it is hard to find gold-standard data of fake and non-fake reviews for model building, because it is very difficult, if not impossible, to manually recognize/label fake and non-fake reviews by mere reading [14, 34]
    Highlights
    • Online reviews of products and services are used extensively by consumers and businesses to make critical purchase, product design, and customer service decisions
    • Due to the financial incentives associated with positive reviews, imposters try to game the system by posting fake reviews and giving unfair ratings to promote or demote target products and services
    • This paper proposes a novel and principled technique to model and to detect opinion spamming in a Bayesian framework
    • This paper proposed a novel and principled method to exploit observed reviewing behaviors to detect opinion spammers in an unsupervised Bayesian inference framework (a loose illustrative sketch follows this list)
    • Existing methods are mostly based on heuristics and/or ad-hoc labels for opinion spam detection
    • Because different authors naturally differ in writing style, some classification accuracy above 50% is expected even for an arbitrary ranking; accuracy well beyond that level indicates the spamicity ranking is meaningful
    • The paper proposed a novel way to evaluate the results of unsupervised opinion spam models using supervised classification, without the need for any manually labeled data
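
    A loose illustrative sketch of the modeling idea above: the paper's actual Author Spamicity Model defines its own generative process over behavioral features and infers author spamicities by posterior inference, so the code below is only a stand-in. It uses scikit-learn's variational Bayesian Gaussian mixture to cluster authors into two latent populations and reads the posterior responsibility of the more extreme component as a spamicity score. All feature choices and names here are hypothetical, not the paper's.

        # Minimal sketch of unsupervised, Bayesian-style spamicity scoring.
        # NOT the paper's Author Spamicity Model: a variational Bayesian
        # Gaussian mixture stands in for its generative model, and the
        # behavioral features below are hypothetical examples.
        import numpy as np
        from sklearn.mixture import BayesianGaussianMixture

        rng = np.random.default_rng(0)

        # Toy behavioral footprints, one row per author, values in [0, 1]:
        # e.g., share of 5-star ratings, rating deviation from product mean,
        # fraction of reviews posted in a burst, duplicate-content ratio.
        n_authors = 1000
        X = rng.beta(2, 5, size=(n_authors, 4))   # mostly "normal" behavior
        X[:50] = rng.beta(8, 2, size=(50, 4))     # a small extreme population

        # Two latent populations (spammer / non-spammer), variational inference.
        mix = BayesianGaussianMixture(n_components=2, covariance_type="full",
                                      max_iter=500, random_state=0).fit(X)

        # Posterior responsibility of the component with the more extreme
        # mean behavior serves as an author spamicity score in [0, 1].
        spam_comp = int(np.argmax(mix.means_.sum(axis=1)))
        spamicity = mix.predict_proba(X)[:, spam_comp]
        ranking = np.argsort(-spamicity)          # most suspicious first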
    Results
    • As noted in §1, the authors are not aware of any gold-standard ground truth labeled data for opinion spammers.
    • To evaluate the author spamicities computed by different systems, the authors use two methods: review classification and human evaluation.
    • Running the systems in §3.2 on the data generates a ranking of 50,704 reviewers.
    • Human evaluation on all authors is clearly impossible.
    • The authors sample the rank positions 1, 10, 20, ..., 50000 to construct the evaluation set of 5,000 rank positions.
    • An evaluation set of 5,000 is reasonable for review classification, but for human evaluation the authors need to use a smaller subset.
    • Using a fixed sampling lag ensures that performance on the evaluation set is a good approximation of performance over the entire ranking (a sketch of this protocol follows these bullets).
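
    The evaluation protocol in the bullets above can be sketched as follows. This is a hedged reconstruction, not the authors' exact code: the reviews_of accessor and the SVM/feature settings are assumptions. The intent is as described above: if reviews of top-ranked authors (labeled spam) separate well from reviews of bottom-ranked authors (labeled non-spam) under 5-fold cross-validation, the unsupervised ranking is meaningful.

        # Sketch of the labeled-data-free evaluation: train/test an SVM to
        # separate reviews of top-ranked (assumed spam) authors from those of
        # bottom-ranked (assumed non-spam) authors via 5-fold CV. Accuracy
        # near 50% would suggest the spamicity ranking carries no signal.
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        def fixed_lag_sample(n_ranked=50_000, lag=10):
            """Rank positions 1, 10, 20, ..., 50000 (fixed sampling lag)."""
            return [1] + list(range(lag, n_ranked + 1, lag))

        def classification_eval(ranked_authors, reviews_of, k_pct=5):
            """ranked_authors: ids sorted by descending spamicity;
            reviews_of(a): review texts of author a (hypothetical API)."""
            k = int(len(ranked_authors) * k_pct / 100)
            spam = [r for a in ranked_authors[:k] for r in reviews_of(a)]
            ham = [r for a in ranked_authors[-k:] for r in reviews_of(a)]
            y = np.array([1] * len(spam) + [0] * len(ham))
            clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
            return cross_val_score(clf, spam + ham, y, cv=5,
                                   scoring="accuracy").mean()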
    Conclusion
    • This paper proposed a novel and principled method to exploit observed reviewing behaviors to detect opinion spammers in an unsupervised Bayesian inference framework.
    • The Bayesian framework facilitates characterization of many behavioral phenomena of opinion spammers using the estimated latent population distributions (a small illustrative sketch follows these bullets).
    • It enables detection and posterior density analysis in a single framework.
    • A comprehensive set of experiments, based on the proposed automated classification evaluation and on human expert evaluation, has been conducted to evaluate the proposed model.
    • The results across both evaluations show that the proposed model is effective and outperforms strong competitors.
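
    A small illustrative sketch of the posterior density analysis mentioned above, under assumptions: given soft population assignments (e.g., the spamicity scores from the earlier sketch) and a 0/1 behavioral indicator per author, each latent population's rate for that behavior has a conjugate Beta posterior that can be summarized and compared. The indicator and the uniform prior are hypothetical choices, not the paper's.

        # Characterizing a latent population distribution via a conjugate
        # Beta-Bernoulli update; `spamicity` and `indicator` are stand-ins.
        import numpy as np
        from scipy.stats import beta

        rng = np.random.default_rng(1)
        spamicity = rng.beta(1, 3, size=1000)     # soft assignments in [0, 1]
        indicator = (rng.random(1000) < 0.2 + 0.6 * spamicity).astype(float)

        a0 = b0 = 1.0                             # uniform Beta(1, 1) prior
        for name, w in [("spammer pop.", spamicity),
                        ("non-spammer pop.", 1.0 - spamicity)]:
            # Soft (responsibility-weighted) success/failure pseudo-counts.
            post = beta(a0 + np.sum(w * indicator),
                        b0 + np.sum(w * (1.0 - indicator)))
            print(f"{name}: mean {post.mean():.3f}, "
                  f"95% CI ({post.ppf(0.025):.3f}, {post.ppf(0.975):.3f})")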
    Tables
    • Table 1: List of notations
    • Table 2 (a, b): 5-fold SVM cross-validation for review classification, using the reviews of the top k% ranked authors as the spam (+) class and those of the bottom k% as the non-spam (−) class. P: Precision, R: Recall, F1: F1-score, A: Accuracy. The classifier uses different features than those used in modeling; thus, if it can classify accurately, that gives good confidence that the unsupervised spamicity model is effective (details in §3.3)
    • Table 3: Number of spammers detected in each bucket (B1, B2, B3) by each judge (J1, J2, J3)
    Funding
    • This project was supported in part by a grant from the HP Labs Innovation Research Program and by a grant from the National Science Foundation (NSF) under grant no.
    References
    [1] Popken, B. 2010.
    [2] Bishop, C.M. 2006. Pattern Recognition and Machine Learning. Springer.
    [3] Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M. and Vigna, S. 2006. A reference collection for web spam. SIGIR Forum (2006).
    [4] Celeux, G., Chauveau, D. and Diebolt, J. 1996. Stochastic versions of the EM algorithm: an experimental study in the mixture case. Journal of Statistical Computation and Simulation (1996).
    [5] Chirita, P.A., Diederich, J. and Nejdl, W. 2005. MailRank: Using Ranking for Spam Detection. CIKM (2005).
    [6] Duda, R.O., Hart, P.E. and Stork, D.G. 2001. Pattern Classification. Wiley.
    [7] Fayyad, U. and Irani, K. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. IJCAI (1993), 1022–1027.
    [8] Fei, G., Mukherjee, A., Liu, B., Hsu, M., Castellanos, M. and Ghosh, R. 2013. Exploiting Burstiness in Reviews for Review Spammer Detection. ICWSM (2013).
    [9] Feng, S., Xing, L., Gogar, A. and Choi, Y. 2012. Distributional Footprints of Deceptive Product Reviews. ICWSM (2012).
    [10] Feng, S., Banerjee, R. and Choi, Y. 2011. Syntactic Stylometry for Deception Detection. ACL (2011).
    [11] Fleiss, J. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin (1971), 378–382.
    [12] Frietchen, C. 2009. How to spot fake user reviews. Consumersearch.com.
    [13] Ghosh, S., Viswanath, B., Kooti, F., Sharma, N.K., Korlam, G., Benevenuto, F., Ganguly, N. and Gummadi, K.P. 2012. Understanding and combating link farming in the Twitter social network. WWW (2012).
    [14] Jindal, N. and Liu, B. 2008. Opinion Spam and Analysis. WSDM (2008).
    [15] Jindal, N., Liu, B. and Lim, E.-P. 2010. Finding Unusual Review Patterns Using Unexpected Rules. CIKM (2010).
    [16] Joachims, T. 1999. Making large-scale support vector machine learning practical. Advances in Kernel Methods (1999).
    [17] Joachims, T. 2002. Optimizing Search Engines Using Clickthrough Data. KDD (2002).
    [18] Joachims, T. 1998. Text categorization with support vector machines: Learning with many relevant features. ECML (1998).
    [19] Kang, H., Wang, K., Soukal, D., Behr, F. and Zheng, Z. 2010. Large-scale bot detection for search engines. WWW (2010).
    [20] Keselj, V., Peng, F., Cercone, N. and Thomas, C. 2003. N-Gram-Based Author Profiles for Authorship Attribution. PACLING (2003), 255–264.
    [21] Klementiev, A., Roth, D. and Small, K. 2007. An Unsupervised Learning Algorithm for Rank Aggregation. ECML (2007).
    [22] Kolari, P., Java, A., Finin, T., Oates, T. and Joshi, A. 2006. Detecting Spam Blogs: A Machine Learning Approach. AAAI (2006).
    [23] Landis, J.R. and Koch, G.G. 1977. The measurement of observer agreement for categorical data. Biometrics (1977), 159–174.
    [24] Lauw, H.W., Lim, E. and Wang, K. 2007. Summarizing Review Scores of “Unequal” Reviewers. SIAM SDM (2007), 539–544.
    [25] Li, F., Huang, M., Yang, Y. and Zhu, X. 2011. Learning to Identify Review Spam. IJCAI (2011), 2488–2493.
    [26] Lim, E.-P., Nguyen, V.-A., Jindal, N., Liu, B. and Lauw, H.W. 2010. Detecting product review spammers using rating behaviors. CIKM (2010).
    [27] Liu, T.-Y. 2009. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval (2009), 225–331.
    [28] Blei, D.M. and McAuliffe, J.D. 2007. Supervised Topic Models. NIPS (2007).
    [29] Mukherjee, A., Liu, B. and Glance, N. 2012. Spotting Fake Reviewer Groups in Consumer Reviews. WWW (2012).
    [30] Mukherjee, A., Liu, B., Wang, J., Glance, N. and Jindal, N. 2011. Detecting Group Review Spam. WWW (2011).
    [31] Mukherjee, A., Venkataraman, V., Liu, B. and Glance, N. 2013. What Yelp Fake Review Filter Might Be Doing? ICWSM (2013).
    [32] Newman, M.L., Pennebaker, J.W., Berry, D.S. and Richards, J.M. 2003. Lying words: predicting deception from linguistic styles. Personality and Social Psychology Bulletin (2003), 665–675.
    [33] Ott, M., Cardie, C. and Hancock, J. 2012. Estimating the prevalence of deception in online review communities. WWW (2012).
    [34] Ott, M., Choi, Y., Cardie, C. and Hancock, J.T. 2011. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. ACL (2011), 309–319.
    [35] Pandit, S., Chau, D.H., Wang, S. and Faloutsos, C. 2007. NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks. WWW (2007).
    [36] Ramage, D., Hall, D., Nallapati, R. and Manning, C.D. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. EMNLP (2009).
    [37] Smyth, P. 1999. Probabilistic Model-Based Clustering of Multivariate and Sequential Data. AISTATS (1999).
    [38] Spirin, N. and Han, J. 2012. Survey on Web Spam Detection: Principles and Algorithms. ACM SIGKDD Explorations 13(2) (2012), 50–64.
    [39] Streitfeld, D. 2012. Buy Reviews on Yelp, Get Black Mark. New York Times (2012).
    [40] Streitfeld, D. 2012. Fake Reviews, Real Problem. New York Times (2012).
    [41] Vogt, C.C. and Cottrell, G.W. 1999. Fusion via a linear combination of scores. Information Retrieval (1999), 151–173.
    [42] Wang, G., Xie, S., Liu, B. and Yu, P.S. 2011. Review Graph Based Online Store Review Spammer Detection. ICDM (2011), 1242–1247.
    [43] Wang, Z. 2010. Anonymity, Social Image, and the Competition for Volunteers: A Case Study of the Online Market for Reviews. The B.E. Journal of Economic Analysis & Policy 10(1) (2010), 1–34.
    [44] Wei, F., Li, W. and Liu, S. 2010. iRANK: A Rank-Learn-Combine Framework for Unsupervised Ensemble Ranking. Journal of the American Society for Information Science and Technology (2010).
    [45] Wu, B., Goel, V. and Davison, B.D. 2006. Topical TrustRank: using topicality to combat Web spam. WWW (2006).
    [46] Xie, S., Wang, G., Lin, S. and Yu, P.S. 2012. Review spam detection via temporal pattern discovery. KDD (2012).
    [47] Freund, Y., Iyer, R., Schapire, R. and Singer, Y. 2003. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research 4 (2003), 933–959.
    [48] Zhu, C., Byrd, R.H., Lu, P. and Nocedal, J. 1997. L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (1997).