Personal Information Detection

Overview

The Personal Information Detection model detects PII (personally identifiable information) in any user-generated text: comments, messages, posts, reviews, etc. The model is available as part of the Text Moderation API, along with other models such as Profanity Detection and Link Moderation.

The model detects the following types of personal information: email addresses, phone numbers, IP addresses (IPv4 and IPv6) and US social security numbers.

Please contact us if you need us to detect any other type of personal information.

Detect Email addresses

The model will detect any email address present in the provided text item. Emails are flagged with type email.

Sightengine's email address detection is considerably more robust than standard regex-based approaches. Users sometimes obfuscate email addresses to evade filters, and the model has been designed to catch these obfuscated addresses.

Here are a few examples of the types of obfuscations that will be caught (not exhaustive):

Obfuscation | Example
Character weirdness | Ř𝐢¢Ҝ@𝑔𝕞𝔞𝕀𝓛.𝕔ỖＭ
Insertions | rick[@]gmail . com
Phonetic edits | rick(at)gmail(dot)com
Replacements | rick/gmail/com
Missing or altered parts | rick_at_gmail
Combinations of the above | ཞıƈƙ(αт)ꮐⓂḁ̷̫̤͂i̴̡͔̊l̸̪͒͜
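To make the contrast with regex-based filtering concrete, here is a small sketch that runs a deliberately simple email pattern over the ASCII examples from the table. The pattern `NAIVE_EMAIL` is an illustrative assumption, not something Sightengine uses or recommends; it only shows the kind of filter that these obfuscations are meant to defeat.

```python
import re

# A deliberately simple email pattern, of the kind a basic filter might use.
# This is an illustrative assumption, not part of the Sightengine API.
NAIVE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

samples = [
    "rick@gmail.com",         # plain address
    "rick[@]gmail . com",     # insertions
    "rick(at)gmail(dot)com",  # phonetic edits
    "rick/gmail/com",         # replacements
    "rick_at_gmail",          # missing or altered parts
]

# Collect every sample the naive pattern fails to match.
missed = [s for s in samples if NAIVE_EMAIL.search(s) is None]

for s in missed:
    print("not matched by the naive pattern:", s)
```

Only the plain address matches; each obfuscated variant slips past the pattern.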

Detect Phone numbers

The model will detect phone numbers with formats that are valid in countries of your choosing.

Robustness and obfuscation

Phone numbers can be written in many different ways and have different formats, even within a given country. Area codes, regional codes, country codes, as well as international vs local formats make phone number detection challenging.

Because users sometimes obfuscate phone numbers to evade filters, the model has been designed to be considerably more robust than standard regex-based approaches and to help you catch obfuscated phone numbers.

Here are a few examples of the types of obfuscations that will be caught (not exhaustive):

Type | Example
Standard | +1(605)493-3483
Character weirdness | 16̶⊘❺４[̲̅9]➂𝟛４８3
Letter-digit mixing | 1 6 zero fifty-four 9 three 3 fourty eight 3
Insertions | 6/0/5/4_9.3+3_4 8_3
Combinations of the above | 16̶⊘❺４🄽In̴̪͕̒̆̊̇e̸̮͗̉̚ ➂𝟛𝕗𝕠𝕣𝕥𝕪 𝑒𝒾𝑔𝒽𝓉3
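As an illustration of why "character weirdness" is hard to handle with simple preprocessing, the sketch below applies standard Unicode NFKC normalization and strips combining marks. This is not a description of how the model works; the helper `fold_lookalikes` is a hypothetical name used only for this example.

```python
import unicodedata

def fold_lookalikes(text: str) -> str:
    """Map compatibility characters to plain forms and drop combining marks."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in normalized if not unicodedata.combining(ch))

# Fullwidth and double-struck digits fold to ASCII digits,
# and overlay marks such as a strike-through are dropped.
print(fold_lookalikes("\uff14\uff18"))  # fullwidth 4 and 8
print(fold_lookalikes("\U0001d7db"))    # double-struck 3
print(fold_lookalikes("6\u0336"))       # 6 with a combining strike-through

# Dingbat digits such as U+2782 have no compatibility mapping,
# so they pass through this step unchanged.
print(fold_lookalikes("\u2782") == "\u2782")
```

Normalization recovers some digits but leaves symbols such as circled dingbat digits untouched, which is one reason a normalize-then-regex pipeline tends to miss obfuscated numbers.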

Specifying target countries

Given the very large variety of phone number formats across the world, having a detector that flags phone numbers from any country would result in many false positives, as almost any sequence of 5 to 12 digits matches a phone number somewhere.

The list of countries you wish to support must therefore be provided in the opt_countries parameter as a comma-separated list of ISO 3166 two-letter country codes, for instance us for the United States or fr for France. See the full list of supported countries.

If you do not specify any country, the API defaults to the following countries: United States (us), France (fr) and United Kingdom (gb).
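A small helper can make it harder to send a malformed country list. The sketch below is an assumption about client-side hygiene, not part of the API: `build_opt_countries` and `DEFAULT_COUNTRIES` are hypothetical names, and the check only verifies the two-letter shape of each code, not that the country is actually supported.

```python
# Defaults stated in the documentation above.
DEFAULT_COUNTRIES = ("us", "fr", "gb")

def build_opt_countries(countries=None) -> str:
    """Return a comma-separated, lowercase, de-duplicated list of 2-letter codes."""
    codes = []
    for code in (countries or DEFAULT_COUNTRIES):
        code = code.strip().lower()
        # Shape check only: two ASCII letters. Support is decided by the API.
        if len(code) != 2 or not (code.isascii() and code.isalpha()):
            raise ValueError(f"not a two-letter country code: {code!r}")
        if code not in codes:
            codes.append(code)
    return ",".join(codes)

print(build_opt_countries(["US", " gb", "fr", "us"]))
```

The returned string can be passed as the value of the opt_countries parameter.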

Detect IP addresses

In some situations, IP addresses are considered to be personal information and should therefore be redacted or filtered in text.

The PII detection model detects both IPv4 addresses (type ipv4) and IPv6 addresses (type ipv6).

IPv4 is the most frequent and oldest format for IP addresses. It is typically written as 4 numbers separated by dots, for instance 52.222.158.75. IPv6 is the newer format and allows for a greater variety of formatting options, for instance 2600:9000:2247:6000:8:a1f0:7e00:93a1.
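To illustrate the two formats and the formatting variety of IPv6, here is a short sketch using Python's standard `ipaddress` module. It is unrelated to how the model performs detection; the helper `ip_type` is a hypothetical name that simply mirrors the ipv4 and ipv6 type labels mentioned above.

```python
import ipaddress

def ip_type(candidate: str):
    """Return 'ipv4' or 'ipv6' for a well-formed address, else None."""
    try:
        addr = ipaddress.ip_address(candidate)
    except ValueError:
        return None
    return "ipv4" if addr.version == 4 else "ipv6"

print(ip_type("52.222.158.75"))
print(ip_type("2600:9000:2247:6000:8:a1f0:7e00:93a1"))
print(ip_type("999.1.1.1"))  # out-of-range octet, not a valid address

# The same IPv6 address can be written in compressed or fully expanded form.
addr = ipaddress.ip_address("2001:db8::1")
print(addr.compressed, addr.exploded)
```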

Detect Social security numbers

The model will detect US social security numbers in the provided text item. Social security numbers are flagged with type ssn. Detected numbers are not checked against any database to confirm the actual validity or existence of the number.

Use the model

Send a POST request containing the UTF-8 encoded text along with the ISO 639-1 language code (such as en for English) and the comma-separated list of countries for phone number detection (such as us,gb,fr for the United States, United Kingdom and France). Here is an example:


curl -X POST 'https://api.sightengine.com/1.0/text/check.json' \
  -F 'text=I am rick(at)gmail(dot)com or 1(800)343-3598' \
  -F 'lang=en' \
  -F 'opt_countries=us,gb,fr' \
  -F 'mode=standard' \
  -F 'api_user={api_user}' \
  -F 'api_secret={api_secret}'


# this example uses requests
import requests
import json

data = {
  'text': 'I am rick(at)gmail(dot)com or 1(800)343-3598',
  'mode': 'standard',
  'lang': 'en',
  'opt_countries': 'us,gb,fr',
  'api_user': '{api_user}',
  'api_secret': '{api_secret}'
}
r = requests.post('https://api.sightengine.com/1.0/text/check.json', data=data)

output = json.loads(r.text)


$params = array(
  'text' => 'I am rick(at)gmail(dot)com or 1(800)343-3598',
  'lang' => 'en',
  'opt_countries' => 'us,gb,fr',
  'mode' => 'standard',
  'api_user' => '{api_user}',
  'api_secret' => '{api_secret}',
);

// this example uses cURL
$ch = curl_init('https://api.sightengine.com/1.0/text/check.json');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $params);
$response = curl_exec($ch);
curl_close($ch);

$output = json_decode($response, true);


// this example uses axios and form-data
const axios = require('axios');
const FormData = require('form-data');

const data = new FormData();
data.append('text', 'I am rick(at)gmail(dot)com or 1(800)343-3598');
data.append('lang', 'en');
data.append('opt_countries', 'us,gb,fr');
data.append('mode', 'standard');
data.append('api_user', '{api_user}');
data.append('api_secret', '{api_secret}');

axios({
  url: 'https://api.sightengine.com/1.0/text/check.json',
  method:'post',
  data: data,
  headers: data.getHeaders()
})
.then(function (response) {
  // on success: handle response
  console.log(response.data);
})
.catch(function (error) {
  // handle error
  if (error.response) console.log(error.response.data);
  else console.log(error.message);
});

The JSON response contains a description of all personal information found along with the positions within the text string. The description can be found under the personal key, as you can see below:


{
  "status": "success",
  "request": {
    "id": "req_6cujQglQPgGApjI5odv0P",
    "timestamp": 1471947033.92,
    "operations": 1
  },
  "profanity": {
    "matches": []
  },
  "personal": {
    "matches": [
      {
        "type": "email",
        "match": "rick(at)gmail(dot)com",
        "start": 5,
        "end": 25
      },
      {
        "type": "phone_number_us",
        "match": "1(800)343-3598",
        "start": 30,
        "end": 43
      }
    ]
  },
  "link": {
    "matches": []
  }
}
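A common next step is to mask the detected spans before storing or displaying the text. The sketch below assumes that start and end are inclusive character positions, which is consistent with the sample response (the 21-character email match spans positions 5 to 25). The function name `redact` and the mask character are illustrative choices, and offsets for text containing characters outside the Basic Multilingual Plane should be verified against the API before relying on this.

```python
def redact(text: str, response: dict, mask: str = "*") -> str:
    """Replace every character covered by a 'personal' match with the mask."""
    chars = list(text)
    for match in response.get("personal", {}).get("matches", []):
        # Assumption: 'end' is inclusive, as the sample response suggests.
        for i in range(match["start"], match["end"] + 1):
            chars[i] = mask
    return "".join(chars)

text = "I am rick(at)gmail(dot)com or 1(800)343-3598"
response = {
    "personal": {
        "matches": [
            {"type": "email", "match": "rick(at)gmail(dot)com", "start": 5, "end": 25},
            {"type": "phone_number_us", "match": "1(800)343-3598", "start": 30, "end": 43},
        ]
    }
}

print(redact(text, response))
```

Masking character by character keeps the text length unchanged, so positions reported for other matches remain valid after redaction.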

Any other needs?

See our full list of Text models for details on other filters and checks you can run on your text content. You might also want to check our Image & Video models to moderate images and videos. This includes moderation of text in images/videos.
