Block spam by alphabet
James Williams
Fighting spam is an ongoing arms race. There will always be nefarious attempts to post unwanted content onto websites; that's just the nature of the global internet nowadays. But can we keep ahead of it? Some techniques are complex, maybe using AI / natural language processing, but there are also quite simple opportunities to reject spam.
We had a lot of contact requests come into our site that used the Cyrillic script, which is used for Russian among other languages. (привет!) Realistically, we're very unlikely to treat any request that comes to us in Cyrillic as something worth responding to. We're a UK web agency with most of our clients based in the UK, and whilst we do work with clients outside the UK, and on projects using languages other than English, we can afford to ignore any requests made to us in a language we don't usually read. Given that, we can immediately block a large proportion of the spam requests by simply detecting what alphabet they look to have been written in. In our experience most of this is Russian, because so much of the spam we see comes through Russia.
Drupal makes it easy for us to add an extra validation handler to our contact form with a form alter hook. We could then get on with creating custom logic to check whether submissions contained too much Cyrillic text. We needed to account for a few things:
How much Cyrillic content is there, in comparison to content from the Latin alphabet (a-z)?
Although much of the unwanted content was using Cyrillic, we realised content in anything other than the Latin alphabet could probably be rejected in the same way. We're just not likely to do much business with people who can't contact us in English, let alone in an alphabet we can't read.
How much is too much?
We will happily accept some amount of Cyrillic text in a contact request - for example, if someone is explaining about spam they are receiving, or asking to add some translations to a site.
Ignore links and HTML tags, since those are written using a-z characters from the Latin alphabet.
We've also seen a lot of spam containing links in an unusual format, which we could strip as well:
[url=http...]...[/url]
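As a rough illustration of that stripping step, here is a minimal PHP sketch. The function name `example_strip_links_and_markup()` and the exact patterns are assumptions for illustration, not the code from our site. It keeps the visible label of a `[url=...]` link (so any Cyrillic label still counts) and removes only the link syntax, HTML tags and bare URLs.

```php
<?php

/**
 * Removes link syntax and markup so only human-written text is measured.
 *
 * Hypothetical helper: the name and the patterns are illustrative only.
 */
function example_strip_links_and_markup(string $text): string {
  // BBCode-style links: [url=http://example.com]label[/url].
  // Keep the label, drop the wrapper and the URL inside it.
  $text = preg_replace('~\[url=[^\]]*\](.*?)\[/url\]~isu', ' $1 ', $text) ?? $text;
  // HTML tags such as <a href="...">; the text between the tags is kept.
  $text = strip_tags($text);
  // Bare URLs, which are always written in Latin characters.
  $text = preg_replace('~\bhttps?://\S+~iu', ' ', $text) ?? $text;
  // Collapse the leftover runs of whitespace.
  return trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
}
```

Whether to keep or drop the link label is a judgment call; dropping it would make the check slightly more lenient towards messages that are mostly links.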
Ignore whitespace and punctuation, to some extent.
We're going to end up using some regular expressions from PHP, so we can make use of its support for matching character scripts. For example,
/[^\p{Common}\p{Latin}]/u
will match any character that is NOT in either of the 'common' (punctuation etc) or Latin sets of characters.
We decided to ignore the potential for characters which are defined by multiple bytes (like emojis 👀) to interfere with calculating the proportion of non-Latin text. There are rarely that many of them in a message ... and pragmatically, do we really want to be doing business with people communicating using so many emojis?? (This is only for our website's contact form, they can always flood us with emojis later! 😄)
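To make the script classes concrete, here is a minimal sketch of how the proportion of non-Latin text could be calculated with them. The function name `example_non_latin_ratio()` is hypothetical, and this is one possible way of counting rather than the exact logic we run. Because the `u` modifier makes PCRE match whole code points, and emoji belong to the 'common' script, emoji are simply left out of both counts in this version.

```php
<?php

/**
 * Returns the share of script-specific characters that are not Latin.
 *
 * Hypothetical helper; the result is a float between 0.0 and 1.0.
 */
function example_non_latin_ratio(string $text): float {
  // Characters in neither the Common nor the Latin script (Cyrillic, Greek, ...).
  // preg_match_all() returns false on failure (e.g. invalid UTF-8); treat as 0.
  $non_latin = (int) preg_match_all('/[^\p{Common}\p{Latin}]/u', $text);
  // Latin letters, including accented ones such as é.
  $latin = (int) preg_match_all('/\p{Latin}/u', $text);
  // Whitespace, digits, punctuation and emoji are Common, so they are not counted.
  $total = $non_latin + $latin;
  return $total === 0 ? 0.0 : $non_latin / $total;
}
```

A message such as `Hi, да!` has two Latin and two Cyrillic letters, so it scores 0.5. Where to draw the line for "too much" is then just a matter of comparing this ratio against a threshold of your choosing.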
How can we best guide genuine leads to contact us more appropriately?
Drupal allows us to set the error message shown when form validation fails. We would quite like to advise real human beings, who want to pay us money for our services, on how best to do that!
All these things boiled down to a relatively simple form validation callback, stuffed with a few relatively unusual bits of code such as those regex script classes.
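As a rough illustration, a callback along these lines might look like the sketch below. Everything specific in it is an assumption: the module name `mymodule`, the form ID `contact_message_feedback_form` (Drupal's default site-wide feedback form), the `message` field, the 0.5 threshold and the wording of the error message are all placeholders rather than the values from our site.

```php
<?php

/**
 * Implements hook_form_alter().
 *
 * "mymodule" and the form ID are placeholders for illustration.
 */
function mymodule_form_alter(array &$form, \Drupal\Core\Form\FormStateInterface $form_state, $form_id) {
  if ($form_id === 'contact_message_feedback_form') {
    // Run our extra check alongside the form's existing validation.
    $form['#validate'][] = 'mymodule_contact_validate_script';
  }
}

/**
 * Validation callback: rejects messages written mostly in a non-Latin script.
 */
function mymodule_contact_validate_script(array &$form, \Drupal\Core\Form\FormStateInterface $form_state) {
  // Assumes the message body lives in the default "message" field.
  $message = (string) ($form_state->getValue(['message', 0, 'value']) ?? '');
  if (mymodule_is_mostly_non_latin($message)) {
    // Placeholder wording; the real message should tell genuine leads what to do.
    $form_state->setErrorByName('message', t('Please write to us in English so that we can read and respond to your message.'));
  }
}

/**
 * Returns TRUE when more than $threshold of the letters are non-Latin.
 *
 * The default threshold of 0.5 is an assumed value, not a recommendation.
 */
function mymodule_is_mostly_non_latin(string $text, float $threshold = 0.5): bool {
  // Drop HTML tags, [url=...] / [/url] wrappers and bare URLs before counting.
  $text = preg_replace('~\[/?url[^\]]*\]|https?://\S+~iu', ' ', strip_tags($text)) ?? '';
  $non_latin = (int) preg_match_all('/[^\p{Common}\p{Latin}]/u', $text);
  $latin = (int) preg_match_all('/\p{Latin}/u', $text);
  $total = $non_latin + $latin;
  return $total > 0 && ($non_latin / $total) > $threshold;
}
```

Keeping the counting logic in its own plain function means it can be tried out on sample messages without bootstrapping Drupal.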
So now users see an error message, explaining how best to reach us, if they attempt to contact us using too much Cyrillic.
This stops the unwanted messages coming through, whilst helping real users understand how to contact us more appropriately, should they really need to. Thankfully we now get very little spam coming through our contact page!