Personal Information Detection

Overview

The Personal Information Detection model detects personally identifiable information (PII) in any user-generated text, such as comments, messages, posts, and reviews. It is available as part of the Text Moderation API, along with other models such as Profanity Detection and Link Moderation.

The model detects the following types of personal information:

Email addresses, including obfuscated email addresses
Phone numbers from multiple countries, including obfuscated phone numbers
Usernames, typically from social networks
IP addresses (both IPv4 and IPv6)
US social security numbers (SSN)

Please contact us if you need us to detect any other type of personal information.

Detect Email addresses

The model will detect any email address present in the provided text item. Emails are flagged with type email.

Sightengine's email address detection is considerably more robust than standard regex-based approaches. Users sometimes obfuscate email addresses to evade filters, and the model has been designed to catch these obfuscated addresses.

Here are a few examples of the types of obfuscations that will be caught (not exhaustive):

Obfuscation	Example
Character weirdness	Ř𝐢¢Ҝ@𝑔𝕞𝔞𝕀𝓛.𝕔ỖＭ
Insertions	rick[@]gmail . com
Phonetic edits	rick(at)gmail(dot)com
Replacements	rick/gmail/com
Missing or altered parts	rick_at_gmail
Combinations of the above	ཞıƈƙ(αт)ꮐⓂḁ̷̫̤͂i̴̡͔̊l̸̪͒͜
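
To illustrate why plain pattern matching falls short, the sketch below runs a typical, simplified email regex over some of the examples from the table. The pattern and the helper name naive_find_emails are assumptions made for illustration; they are not part of the Sightengine API and do not describe how the model works.

```python
import re

# A common, simplified email pattern (an assumption, for illustration only).
NAIVE_EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def naive_find_emails(text: str) -> list[str]:
    """Return the substrings that the simple pattern recognizes as emails."""
    return NAIVE_EMAIL_RE.findall(text)


samples = [
    "rick@gmail.com",         # plain address
    "rick[@]gmail . com",     # insertions
    "rick(at)gmail(dot)com",  # phonetic edits
    "rick/gmail/com",         # replacements
    "rick_at_gmail",          # missing or altered parts
]

for sample in samples:
    print(f"{sample!r}: {naive_find_emails(sample)}")
```

Only the plain address is found; each obfuscated variant yields an empty list, which is the gap the model is designed to close.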

Detect Phone numbers

The model will detect phone numbers with formats that are valid in countries of your choosing.

Robustness and obfuscation

Phone numbers can be written in many different ways and have different formats, even within a given country. Area codes, regional codes, country codes, and international versus local formats all make phone number detection challenging.

Because users sometimes obfuscate phone numbers to evade filters, the model has been designed to be considerably more robust than standard regex-based approaches and to catch obfuscated phone numbers.

Here are a few examples of the types of obfuscations that will be caught (not exhaustive):

Obfuscation	Example
Standard	+1(605)493-3483
Character weirdness	16̶⊘❺４[̲̅9]➂𝟛４８3
Letter-digit mixing	1 6 zero fifty-four 9 three 3 fourty eight 3
Insertions	6/0/5/4_9.3+3_4 8_3
Combinations of the above	16̶⊘❺４🄽In̴̪͕̒̆̊̇e̸̮͗̉̚ ➂𝟛𝕗𝕠𝕣𝕥𝕪 𝑒𝒾𝑔𝒽𝓉3
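
As a rough illustration of how hard this is to handle by hand, the sketch below applies Unicode compatibility normalization, one preprocessing step a home-grown filter might try. The helper name fold_lookalikes is hypothetical, and this is not a description of the model's internals. Normalization folds some look-alike digits but leaves many symbols untouched.

```python
import unicodedata


def fold_lookalikes(text: str) -> str:
    """Fold compatibility characters to plain forms and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "1", then "6" with a combining long stroke overlay (U+0336),
# a fullwidth "4" (U+FF14) and a mathematical double-struck "3" (U+1D7DB).
disguised = "16\u0336\uff14\U0001d7db"
print(fold_lookalikes(disguised))  # prints "1643"

# A circled division slash (U+2298) used in place of "0" has no
# decomposition, so normalization alone does not recover the digit.
print(fold_lookalikes("\u2298") == "\u2298")  # prints True
```

Spelled-out digits, inserted separators, and symbols without a Unicode decomposition all survive this step, so normalization on its own does not catch the examples in the table.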

Specifying target countries

Given the very large variety of phone number formats across the world, having a detector that flags phone numbers from any country would result in many false positives, as almost any sequence of 5 to 12 digits matches a phone number somewhere.

Therefore, the countries you wish to support must be provided in the opt_countries parameter as a comma-separated list of ISO 3166 2-letter country codes, for instance us for the United States or fr for France. See the full list of supported countries.

If you do not specify any country, the API defaults to the following list of countries: United States (us), France (fr), and United Kingdom (gb).
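
A minimal sketch of how a client might assemble the opt_countries value before sending a request. The helper name build_opt_countries and its validation rules are assumptions for illustration; the API itself decides which codes are supported, and lowercase is used here only because the examples in this page use it.

```python
# Defaults listed in this page for requests that specify no country.
DEFAULT_COUNTRIES = ("us", "fr", "gb")


def build_opt_countries(countries=None) -> str:
    """Return a comma-separated list of lowercase 2-letter country codes."""
    codes = list(countries) if countries else list(DEFAULT_COUNTRIES)
    cleaned = []
    for code in codes:
        code = code.strip().lower()
        # Shape check only: two ASCII letters. It does not verify that the
        # code is an assigned ISO 3166 code or that the API supports it.
        if len(code) != 2 or not (code.isascii() and code.isalpha()):
            raise ValueError(f"not a 2-letter country code: {code!r}")
        if code not in cleaned:  # drop duplicates, keep first-seen order
            cleaned.append(code)
    return ",".join(cleaned)


print(build_opt_countries(["US", " gb", "fr", "us"]))  # prints "us,gb,fr"
```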

Detect Usernames

The model will detect any username present in the provided text item. Usernames are flagged with type username.

Here are a few examples of usernames that will be caught:

EXAMPLES
@lili760
iamthebest007
tom_the_cat
…

Detect IP addresses

In some situations, IP addresses are considered to be personal information and should therefore be redacted or filtered in text.

The PII detection model detects both IPv4 addresses (type ipv4) and IPv6 addresses (type ipv6).

IPv4 is the most frequent and oldest format for IP addresses. It is typically written as 4 numbers separated by dots, for instance 52.222.158.75. IPv6 is the newer format and allows for a greater variety of formatting options, for instance 2600:9000:2247:6000:8:a1f0:7e00:93a1.
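
To make the two formats concrete, the sketch below classifies a candidate string with Python's standard ipaddress module. The helper name ip_match_type is hypothetical; the returned labels simply mirror the ipv4 and ipv6 types mentioned above. This is a local check on a single string, not how the model finds addresses inside free text.

```python
import ipaddress


def ip_match_type(candidate: str):
    """Return 'ipv4' or 'ipv6' if the string parses as an IP address, else None."""
    try:
        address = ipaddress.ip_address(candidate)
    except ValueError:
        return None
    return f"ipv{address.version}"


print(ip_match_type("52.222.158.75"))                         # prints "ipv4"
print(ip_match_type("2600:9000:2247:6000:8:a1f0:7e00:93a1"))  # prints "ipv6"
print(ip_match_type("999.1.1.1"))                             # prints "None"
```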

Detect Social security numbers

The model will detect US social security numbers in the provided text item. Social security numbers are flagged with type ssn. Detected numbers are not checked against any database to confirm the actual validity or existence of the number.
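
Since detected numbers are not validated against any database, a client that wants an extra structural check could apply one locally. The sketch below is one such check, under stated assumptions: the value is in the dashed AAA-GG-SSSS form, and the helper name looks_like_issuable_ssn is hypothetical. It rejects combinations that are never issued (area 000, 666, or 900 to 999; group 00; serial 0000) and says nothing about whether a number actually exists.

```python
import re

# Structural check only; assumes the dashed AAA-GG-SSSS form.
SSN_SHAPE = re.compile(r"^(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}$")


def looks_like_issuable_ssn(value: str) -> bool:
    """Return True if the value has the shape of a number that could be issued."""
    return SSN_SHAPE.match(value) is not None


for value in ["123-45-6789", "000-12-3456", "666-12-3456", "123-00-4567"]:
    print(value, looks_like_issuable_ssn(value))
```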

Use the model

Simply send a POST request containing the UTF-8 encoded text, the ISO 639-1 language code (such as en for English), and the comma-separated list of countries for phone number detection (such as us,gb,fr for the United States, United Kingdom, and France). Here is an example:


curl -X POST 'https://api.sightengine.com/1.0/text/check.json' \
  -F 'text=I am rick(at)gmail(dot)com or 1(800)343-3598' \
  -F 'lang=en' \
  -F 'opt_countries=us,gb,fr' \
  -F 'mode=rules' \
  -F 'api_user={api_user}' \
  -F 'api_secret={api_secret}'


# this example uses requests
import requests
import json

data = {
  'text': 'I am rick(at)gmail(dot)com or 1(800)343-3598',
  'mode': 'rules',
  'lang': 'en',
  'opt_countries': 'us,gb,fr',
  'api_user': '{api_user}',
  'api_secret': '{api_secret}'
}
r = requests.post('https://api.sightengine.com/1.0/text/check.json', data=data)

output = json.loads(r.text)


$params = array(
  'text' => 'I am rick(at)gmail(dot)com or 1(800)343-3598',
  'lang' => 'en',
  'opt_countries' => 'us,gb,fr',
  'mode' => 'rules',
  'api_user' => '{api_user}',
  'api_secret' => '{api_secret}',
);

// this example uses cURL
$ch = curl_init('https://api.sightengine.com/1.0/text/check.json');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $params);
$response = curl_exec($ch);
curl_close($ch);

$output = json_decode($response, true);


// this example uses axios and form-data
const axios = require('axios');
const FormData = require('form-data');

const data = new FormData();
data.append('text', 'I am rick(at)gmail(dot)com or 1(800)343-3598');
data.append('lang', 'en');
data.append('opt_countries', 'us,gb,fr');
data.append('mode', 'rules');
data.append('api_user', '{api_user}');
data.append('api_secret', '{api_secret}');

axios({
  url: 'https://api.sightengine.com/1.0/text/check.json',
  method:'post',
  data: data,
  headers: data.getHeaders()
})
.then(function (response) {
  // on success: handle response
  console.log(response.data);
})
.catch(function (error) {
  // handle error
  if (error.response) console.log(error.response.data);
  else console.log(error.message);
});

Request parameters

Parameter	Type	Description
text	string	UTF-8 encoded text to moderate
mode	string	comma-separated list of modes: rules for the rule-based model, ml for the ML models
categories	string	comma-separated list of categories to check. Possible values: profanity, personal, link, drug, weapon, violence, self-harm, medical, extremism, spam, content-trade, money-transaction (optional)
lang	string	comma-separated list of target languages
opt_countries	string	comma-separated list of target countries for phone number detection (optional)
list	string	id of a custom list to be used for rule-based moderation (optional)
api_user	string	your API user id
api_secret	string	your API secret

The JSON response describes all personal information found, along with its position within the text string. The matches are listed under the personal key, as you can see below:


{
  "status": "success",
  "request": {
    "id": "req_6cujQglQPgGApjI5odv0P",
    "timestamp": 1471947033.92,
    "operations": 1
  },
  "profanity": {
    "matches": []
  },
  "personal": {
    "matches": [
      {
        "type": "email",
        "match": "rick(at)gmail(dot)com",
        "start": 5,
        "end": 25
      },
      {
        "type": "phone_number_us",
        "match": "1(800)343-3598",
        "start": 30,
        "end": 43
      }
    ]
  },
  "link": {
    "matches": []
  }
}
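
A minimal sketch of how the matches could be used to redact the original text. It assumes that start and end are zero-based character offsets and that end is inclusive, which is what the sample response above suggests for the submitted text. The helper name redact and the mask character are hypothetical choices.

```python
def redact(text: str, response: dict, mask: str = "*") -> str:
    """Mask every span reported under the 'personal' key of a response."""
    matches = response.get("personal", {}).get("matches", [])
    chars = list(text)
    for match in matches:
        # Assumption: 'end' is inclusive, as in the sample response.
        for index in range(match["start"], match["end"] + 1):
            chars[index] = mask
    return "".join(chars)


text = "I am rick(at)gmail(dot)com or 1(800)343-3598"

# Trimmed copy of the sample response shown above.
sample_response = {
    "status": "success",
    "personal": {
        "matches": [
            {"type": "email", "match": "rick(at)gmail(dot)com", "start": 5, "end": 25},
            {"type": "phone_number_us", "match": "1(800)343-3598", "start": 30, "end": 43},
        ]
    },
}

print(redact(text, sample_response))
```

Masking character by character keeps the redacted string the same length as the input, so the reported offsets remain valid if further processing is needed.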

Any other needs?

See our full list of Text models for details on other filters and checks you can run on your text content. You might also want to check our Image & Video models to moderate images and videos. This includes moderation of text in images/videos.
