User:Jwalling


John Walling


Spam Patrol

--jwalling 22:08, 24 Jan 2006 (CET)

SpammerBlockPattern

Introduction

  • The following Regex (http://en.wikipedia.org/wiki/Regular_expression) fragment list came from scraping affiliate marketing (http://en.wikipedia.org/wiki/Affiliate_marketing) spam
  • The most useful regex fragment is height:\s*\dpx
  • The list minimizes domain names since they change constantly
  • The list is maintained with a regex editor: http://www.weitz.de/regex-coach/
  • The list can be used as a
    • spam blacklist [10] (http://meta.wikimedia.org/wiki/Spam_blacklist), [11] (http://www.communitywiki.org/en/BannedContent), [12] (http://wikitravel.org/bannedcontent.txt)
    • guide for spam search
    • regular expression ($wgSpamRegex = "/frag1|frag2|...|fragn/i";)
  • More details at KatrinaHelp.info [13] (http://www.katrinahelp.info/wiki/index.php/Spam_Patrol),[14] (http://www.katrinahelp.info/wiki/index.php/Talk:Spam_Patrol)
    • Botspam posting has dropped over 90% since installing the list Dec 15, 2005.
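
As a minimal sketch of the regular-expression use, the fragments can be joined with | into one case-insensitive pattern, mirroring the shape of $wgSpamRegex. Only height:\s*\dpx comes from this page; the other fragments and the function names are hypothetical placeholders, and Python is used purely for illustration.

```python
import re

# Only the first fragment comes from this page; the others are
# hypothetical placeholders standing in for the rest of the list.
FRAGMENTS = [
    r"height:\s*\dpx",       # single-digit pixel heights, e.g. height:1px
    r"overflow:\s*auto",     # hypothetical
    r"cheap\s+pills",        # hypothetical
]

def build_spam_regex(fragments):
    """Join fragments into one case-insensitive alternation."""
    # Each fragment is wrapped in a non-capturing group so that a
    # fragment containing its own | cannot leak into its neighbours.
    return re.compile("|".join(f"(?:{f})" for f in fragments), re.IGNORECASE)

SPAM_REGEX = build_spam_regex(FRAGMENTS)

def is_spam(text):
    """Return True if any fragment occurs anywhere in the text."""
    return SPAM_REGEX.search(text) is not None
```

Note that height:\s*\dpx matches only a single digit before px, so height: 100px in ordinary markup is not caught.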

Regex fragment list

Rationale: Sensitivity vs. Specificity

Maximizing the rates of True Positives and True Negatives while minimizing the rates of False Positives and False Negatives is a classic problem in medical testing (http://bmj.bmjjournals.com/cgi/content/full/327/7417/716). This tension is expressed as Sensitivity vs. Specificity. In practice, the more sensitive a test is, the more likely you are to see false positives (Type I error), and the more specific a test is, the more likely you are to see false negatives (Type II error). (This conundrum is reminiscent of the Uncertainty Principle.) We have the same problem with blacklists. (Whitelists counteract false positives.)

In a nutshell:

Sensitivity=TP/(TP+FN) 
Specificity=TN/(FP+TN) 
Prevalence=(TP+FN)/(TP+FN+FP+TN) 
Positive Predictive Value: PPV =TP/(TP+FP)
 (e.g. if PPV=80%, 80% of positives will be correct)
Negative Predictive Value: NPV =TN/(FN+TN)
 (e.g. if NPV=80%, 80% of negatives will be correct)
Positive Likelihood Ratio=SENS/(1-SPEC)
Negative Likelihood Ratio=(1-SENS)/SPEC
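
The formulas above can be sketched as one small function over the four confusion-matrix counts. The function name, dictionary keys, and the example counts are illustrative assumptions, not measurements from this wiki.

```python
def screening_metrics(tp, fp, fn, tn):
    """Compute the measures listed above from raw counts.

    Assumes every denominator is non-zero (for example, specificity
    must be below 1 for the positive likelihood ratio to be defined).
    """
    sens = tp / (tp + fn)
    spec = tn / (fp + tn)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "prevalence": (tp + fn) / (tp + fn + fp + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (fn + tn),
        "lr_positive": sens / (1 - spec),
        "lr_negative": (1 - sens) / spec,
    }

# Hypothetical counts: 1000 posts, 100 of them spam.
m = screening_metrics(tp=90, fp=20, fn=10, tn=880)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```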

The true rates of error depend on testing accuracy and precision, and on prevalence in the population of interest. These factors can be addressed with Bayesian inference and Bayesian analysis (http://www.intmed.mcw.edu/clincalc/bayes.html), which are beyond the scope of this discussion.

In medical screening tests on healthy populations, a common strategy is to test first with low-cost, high-sensitivity tests and then to confirm positives with higher-cost, high-specificity tests. This strategy has the benefit of reducing the cost of testing and minimizing the risk of false positives. As the prevalence of true positives increases, the predictive value of a positive screening test goes up.
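
The effect of prevalence on predictive value can be illustrated with a short calculation. This is only a sketch: the sensitivity and specificity of 0.95 and the three prevalence values are made-up inputs, and the function name is hypothetical.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value, TP/(TP+FP), expressed as rates."""
    true_pos = sensitivity * prevalence                # share of all cases that are TP
    false_pos = (1 - specificity) * (1 - prevalence)   # share of all cases that are FP
    return true_pos / (true_pos + false_pos)

# Same test, three hypothetical populations.
for prev in (0.01, 0.10, 0.50):
    print(f"prevalence={prev:.2f}  PPV={ppv(0.95, 0.95, prev):.3f}")
```

With these inputs the PPV rises from roughly 16% at 1% prevalence to 95% at 50% prevalence, even though the test itself is unchanged.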

To follow the medical model we need a two-stage blacklist. The first list would have high sensitivity (low-cost spammy words) and the second list would have high specificity (high-cost spammy domains). Unlike the medical model, both tests would be run on every post, and the domain blacklist would be kept trimmed to the spammers who avoid posting spammy words.
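
A minimal sketch of such a two-list check, assuming the word list is consulted first and the domain list only catches posts that the word list misses. The second word pattern, the banned domain, and all names here are hypothetical; this is not the code that runs on this wiki.

```python
import re
from urllib.parse import urlparse

# Stage 1: high-sensitivity word/pattern list (second pattern is hypothetical).
WORD_PATTERNS = re.compile(r"height:\s*\dpx|cheap\s+pills", re.IGNORECASE)

# Stage 2: high-specificity domain list (hypothetical entry).
BANNED_DOMAINS = {"spam.example"}

# Rough URL finder; square brackets are excluded to keep urlparse happy.
URL_RE = re.compile(r"https?://[^\s\"'<>\[\]]+", re.IGNORECASE)

def check_post(text):
    """Return the name of the list that blocks the text, or None."""
    if WORD_PATTERNS.search(text):
        return "words"
    for url in URL_RE.findall(text):
        host = (urlparse(url).hostname or "").lower()
        # Match the banned domain itself and any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in BANNED_DOMAINS):
            return "domains"
    return None
```

Because the domain list is only reached by posts without spammy words, it needs entries only for spammers who avoid those words, which is what keeps it short.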

In using blacklists we can measure indirect cost as the number of elements (words, URLs, patterns) that must be compared to new content. The direct costs are system loading, user inconvenience, list maintenance, and the harvesting of new spammy words and domains. Our goal is to keep the cost as low as possible without missing new spammers.

To recap (more% indicates percentage of a finite resource):

more% blacklisted URLs => more specificity
more specificity => more false negatives
more false negatives => more permission for bad content
more% blacklisted words => more sensitivity
more sensitivity => more false positives
more false positives => more blocks to good content

A minimax (http://en.wikipedia.org/wiki/Minimax) strategy for one or more blacklists:

  1. Reduce the 'cost' by relying more on spammy words that are associated with many spammer URLs
  2. Reduce the number of false positives by tuning spam text patterns with regex (http://en.wikipedia.org/wiki/Regular_expression)
  3. Reduce the number of false negatives by including spammer URLs that do not associate with spam text

Most blacklists that I have seen depend primarily on banning domains and URLs. Spammers have an easy time finding new URLs, which makes the blocking effort open-ended. (A Google search for free web hosting (http://www.google.com/search?q=%22free+web+hosting%22) produces a few million hits.)

I am developing a blacklist that relies primarily on banned text and uses banned URLs only where needed. The downside is that it may block legitimate text, which interferes with openness. (If a user attempts to post blocked text, feedback is solicited. [15] (http://www.katrinahelp.info/wiki/index.php/Spam_Protection_Comment)) The false positives can be mitigated by setting a threshold that requires more than one spammy word before blocking. The upside is that the blacklist should require less intervention by administrators to block unidentified spammers.
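
The threshold idea can be sketched as a simple count of distinct spammy patterns in a post. The patterns other than height:\s*\dpx, the default threshold of 2, and the function names are assumptions made for illustration.

```python
import re

# Only the first pattern comes from this page; the rest are hypothetical.
SPAMMY_WORDS = [r"height:\s*\dpx", r"casino", r"cheap\s+pills"]
COMPILED = [re.compile(p, re.IGNORECASE) for p in SPAMMY_WORDS]

def spam_score(text):
    """Count how many distinct patterns occur at least once in the text."""
    return sum(1 for p in COMPILED if p.search(text))

def is_blocked(text, threshold=2):
    """Block only when at least `threshold` distinct patterns match."""
    return spam_score(text) >= threshold
```

With a threshold of 2, a post that mentions a single listed word in passing is allowed, while a post that combines two or more listed patterns is blocked.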

--jwalling 04:25, 4 Feb 2006 (CET)
