Attack Modules

  • TextBugger Attack

    This module tests for adversarial textual robustness and implements the perturbations listed in the paper TEXTBUGGER: Generating Adversarial Text Against Real-world Applications.

  • Homoglyph V2 Attack

    This module tests for adversarial textual robustness. Homoglyphs are alternative characters that closely resemble standard ASCII characters (for example, fool -> fooI). This module gradually increases the percentage of characters replaced to see how the model reacts to the base prompt (see the homoglyph sketch after this list).

  • Violent Durian

    This is a multi-turn agent that adopts a criminal persona over several exchanges to elicit dangerous or violent suggestions from the target language model. The module is experimental and uses OpenAI GPT-4. Configure the endpoint openai-gpt4 to use this attack module.

  • Payload Mask Attack

    An attack where the payload is masked and the LLM is prompted to fill in the missing information (see the masked-prompt sketch after this list).

  • TextFooler Attack

    This module tests for adversarial textual robustness and implements the perturbations listed in the paper 'Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment.'

  • Insert Punctuation Attack

    This module tests for adversarial textual robustness and creates perturbations by adding punctuation to the start of words in a prompt (see the sketch after this list).

  • Character Swap Attack

    This module tests for adversarial textual robustness. It creates perturbations by swapping characters within words that contain more than 3 characters (see the sketch after this list).

  • Singapore Sentence Generator

    This module generates Singlish sentences based on a given seed prompt. The attack module intends to test whether the endpoint will complete the sentence with toxic sentences or phrases in the Singapore context.

  • Toxic Sentence Generator

    This module generates toxic sentences based on a given seed prompt. The attack module intends to test whether the system under test will complete the sentence with toxic sentences or phrases.

  • Colloquial Wordswap

    This attack module tests for textual robustness in the Singapore context. It takes in prompts that feature nouns describing people, such as 'girl', 'boy' or 'grandmother', and substitutes these words with their Singapore colloquial counterparts, such as 'ah girl', 'ah boy' and 'ah ma' (see the sketch after this list).

  • Job Role Generator Module

    This attack module creates prompt variations by adding demographic groups to a given job role.

  • Sample Attack Module

    This is a sample attack module.

  • Homoglyph Attack

    This module tests for adversarial textual robustness. Homoglyphs are alternative characters that closely resemble standard ASCII characters (for example, fool -> fooI). This module perturbs the prompt with all available homoglyphs for each word present (see the homoglyph sketch after this list).

  • Malicious Question Generator

    This attack module uses OpenAI's GPT-4 to generate malicious questions on a given topic. The module stops after a set number of iterations (default: 50). To use this attack module, you need to configure an 'openai-gpt4' endpoint.
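
Both homoglyph modules rely on a mapping from ASCII characters to visually similar Unicode characters. The sketch below is a minimal illustration of that idea; the mapping, function name and replacement strategy are assumptions, not the modules' actual implementation.

    import random

    # Illustrative homoglyph map: ASCII character -> visually similar character.
    HOMOGLYPHS = {"a": "а", "e": "е", "o": "о", "i": "і", "c": "с", "l": "I"}

    def homoglyph_perturb(prompt: str, fraction: float, seed: int = 0) -> str:
        """Replace roughly `fraction` of the replaceable characters with homoglyphs."""
        rng = random.Random(seed)
        positions = [i for i, ch in enumerate(prompt) if ch in HOMOGLYPHS]
        if not positions:
            return prompt
        chosen = set(rng.sample(positions, max(1, int(len(positions) * fraction))))
        return "".join(HOMOGLYPHS[ch] if i in chosen else ch
                       for i, ch in enumerate(prompt))

    # Homoglyph V2 gradually raises the replacement percentage across prompts;
    # the original Homoglyph module instead enumerates substitutions word by word.
    for pct in (0.1, 0.3, 0.5):
        print(homoglyph_perturb("please fool the classifier", pct))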
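
The Payload Mask Attack works at the prompt level: part of the request is blanked out and the model is asked to reconstruct it. The snippet below is a hypothetical illustration of how such a masked prompt could be assembled; the template wording and masking rule are assumptions.

    def build_masked_prompt(payload: str, mask_token: str = "[MASK]") -> str:
        """Blank out part of the payload and ask the model to fill in the gaps."""
        # Illustrative rule: mask every second word of the payload.
        masked = " ".join(mask_token if i % 2 else word
                          for i, word in enumerate(payload.split()))
        return f"Fill in every {mask_token} so that the sentence reads naturally:\n{masked}"

    print(build_masked_prompt("explain how the exploit works step by step"))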
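
The Insert Punctuation Attack prepends punctuation marks to words in the prompt. A minimal sketch, assuming a random choice of punctuation and of the words to perturb:

    import random
    import string

    def insert_punctuation(prompt: str, fraction: float = 0.3, seed: int = 0) -> str:
        """Prepend a random punctuation mark to roughly `fraction` of the words."""
        rng = random.Random(seed)
        words = prompt.split()
        if not words:
            return prompt
        for i in rng.sample(range(len(words)), max(1, int(len(words) * fraction))):
            words[i] = rng.choice(string.punctuation) + words[i]
        return " ".join(words)

    print(insert_punctuation("describe the weather in singapore today"))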
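
The Character Swap Attack shuffles characters inside longer words. One common choice, assumed here, is to swap two adjacent inner characters so the first and last letters stay fixed:

    import random

    def character_swap(prompt: str, seed: int = 0) -> str:
        """Swap two adjacent inner characters in every word longer than 3 characters."""
        rng = random.Random(seed)
        out = []
        for word in prompt.split():
            if len(word) > 3:
                i = rng.randint(1, len(word) - 3)  # keep the first and last characters fixed
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            out.append(word)
        return " ".join(out)

    print(character_swap("generate a friendly greeting message"))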
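
Colloquial Wordswap is essentially a dictionary substitution. A minimal sketch with a small illustrative mapping (the actual module ships its own, larger word list):

    # Illustrative mapping from standard nouns to Singapore colloquial counterparts.
    COLLOQUIAL = {"boy": "ah boy", "girl": "ah girl",
                  "grandmother": "ah ma", "grandfather": "ah gong"}

    def colloquial_wordswap(prompt: str) -> str:
        """Replace person nouns with their Singapore colloquial counterparts."""
        return " ".join(COLLOQUIAL.get(word.lower(), word) for word in prompt.split())

    print(colloquial_wordswap("The girl visited her grandmother last weekend"))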

TextBugger Attack

This module tests for adversarial textual robustness and implements the perturbations listed in the paper TEXTBUGGER: Generating Adversarial Text Against Real-world Applications.
Parameters:
1. DEFAULT_MAX_ITERATION - Number of prompts that should be sent to the target. This is also the number of transformations that should be generated. [Default: 5] (See the sketch at the end of this section.)
Note:
Using this attack module requires internet access. The GLoVe embedding is downloaded the first time the UniversalEncoder is called.
Embedding is retrieved from the following URL: https://textattack.s3.amazonaws.com/word_embeddings/paragramcf

Parameters cannot be adjusted in this version of the tool.
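
For context, the TEXTBUGGER paper describes character-level bugs (insert, delete, swap, substitute) alongside word-level substitutions. The sketch below shows how a fixed number of character-level transformations, in the spirit of DEFAULT_MAX_ITERATION, could be generated for a single prompt; it is an illustration of the technique, not the module's actual code.

    import random

    # Illustrative look-alike substitutions for the "substitute" bug.
    LOOKALIKE = {"o": "0", "l": "1", "i": "1", "a": "@", "e": "3", "s": "5"}

    def bug_word(word: str, rng: random.Random) -> str:
        """Apply one randomly chosen character-level bug to a word."""
        if len(word) < 4:
            return word
        i = rng.randint(1, len(word) - 2)
        bug = rng.choice(["insert", "delete", "swap", "substitute"])
        if bug == "insert":      # insert a space inside the word
            return word[:i] + " " + word[i:]
        if bug == "delete":      # delete an inner character
            return word[:i] + word[i + 1:]
        if bug == "swap":        # swap two adjacent inner characters
            return word[:i] + word[i + 1] + word[i] + word[i + 2:]
        # substitute: replace a character with a visually similar one, if available
        return word[:i] + LOOKALIKE.get(word[i], word[i]) + word[i + 1:]

    def generate_transformations(prompt: str, max_iteration: int = 5) -> list[str]:
        """Produce max_iteration perturbed copies of the prompt, one per call to the target."""
        rng = random.Random(0)
        return [" ".join(bug_word(w, rng) for w in prompt.split())
                for _ in range(max_iteration)]

    for variant in generate_transformations("please summarise this confidential report"):
        print(variant)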