
This article was published on September 11, 2018

Google’s AI to detect toxic comments can be easily fooled with ‘love’


A group of researchers has found that simple changes to a sentence and its structure can fool Google's Perspective AI, which was built to detect toxic comments and hate speech. These methods involve inserting typos, adding spaces between words, or appending innocuous words to the original sentence.

The AI project, which was started in 2016 by a Google offshoot called Jigsaw, assigns a toxicity score to a piece of text. Google defines a toxic comment as a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion. The researchers suggest that even a slight change in a sentence can alter the toxicity score dramatically. They saw that changing “You are great” to “You are fucking great” made the score jump from a totally safe 0.03 to a fairly toxic 0.82.
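For context, Perspective exposes its toxicity scores through a simple REST endpoint. The sketch below (a minimal illustration, not the researchers' code; `YOUR_API_KEY` is a placeholder) builds the JSON body the `comments:analyze` endpoint expects and pulls the summary TOXICITY score out of a response.

```python
import json

# Perspective's analyze endpoint; YOUR_API_KEY is a placeholder
ANALYZE_URL = (
    "https://commentanalyzer.googleapis.com/v1alpha1/"
    "comments:analyze?key=YOUR_API_KEY"
)

def build_request(text):
    """Build the JSON body Perspective's analyze endpoint expects."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }

def extract_score(response_json):
    """Pull the summary TOXICITY score (0.0-1.0) from a response."""
    return (response_json["attributeScores"]["TOXICITY"]
                         ["summaryScore"]["value"])

if __name__ == "__main__":
    print(json.dumps(build_request("You are great")))
```

Posting that body for “You are great” versus “You are fucking great” is, per the article, what separates a 0.03 score from a 0.82 one.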

Simple changes to a sentence fool Google's AI

This suggests that the toxicity score is probably not the best measure for identifying hate speech. Last year, another study found that inserting spaces and making typos reduced the toxicity score drastically. Google has since improved its AI to detect these changes, but it's not perfect: the researchers behind the latest study said that introducing a word like ‘love’ into these sentences made the score plunge.

[Image: Change in toxicity score of a sentence after introducing a typo]

So anyone can probably sneak a few positive words into a hateful sentence to reduce the score, or insert a few cuss words to increase it.
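The two perturbations the article describes can be sketched in a few lines. This is an illustration of the general idea only, assuming hypothetical helper names; the actual study's method may differ.

```python
def insert_positive_word(sentence, word="love"):
    """Append an innocuous positive word, which the article says
    dilutes the toxicity score."""
    return f"{sentence} {word}"

def insert_typo(sentence):
    """Insert a space inside the first word longer than 3 characters,
    mimicking the space/typo trick from the earlier study."""
    words = sentence.split()
    for i, w in enumerate(words):
        if len(w) > 3:
            words[i] = w[:2] + " " + w[2:]
            break
    return " ".join(words)
```

For example, `insert_typo("idiot comment")` yields `"id iot comment"`: the classifier no longer sees the token “idiot”, which is exactly why this class of attack works against word-level models.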

Historically, tech companies and their algorithms have struggled with hate speech. In 2016, Microsoft released a Twitter bot called Tay, whose tweets quickly turned abusive because it learned from user responses. Twitter, meanwhile, had a curious case of banning users whose tweets contained the phrase ‘Kill me’, without considering the context. And between October 2017 and March 2018, Facebook’s systems were able to filter out only 38 percent of the hate speech posts that made their way onto the platform.

Google’s team will need to work on understanding the context in which a particular word is used in a sentence, and on detecting intentional and unintentional typos that can game the system. We have written to Google to learn more about the project.
