The Personal Information Detection model detects personally identifiable information (PII) in any user-generated text: comments, messages, posts, reviews, etc. This model is made available as part of the Text Moderation API, along with other models such as Profanity Detection and Link Moderation.
The model detects the following types of personal information: email addresses, phone numbers, usernames, IP addresses (IPv4 and IPv6) and US social security numbers.
Please contact us if you need us to detect any other type of personal information.
The model will detect any email address present in the provided text item. Emails are flagged with type email.
Sightengine's email address detection is considerably more robust than standard regex-based approaches. Users sometimes try to obfuscate email addresses to evade filters, so the model has been designed to catch obfuscated email addresses as well as plainly written ones.
Here are a few examples of the types of obfuscations that will be caught (not exhaustive):
Obfuscation | Example |
Character weirdness | ล๐ขยขา@๐๐๐๐๐.๐แป๏ผญ |
Insertions | rick[@]gmail . com |
Phonetic edits | rick(at)gmail(dot)com |
Replacements | rick/gmail/com |
Missing or altered parts | rick_at_gmail |
Combinations of the above | เฝฤฑฦฦ(ฮฑั)๊ฎโaฬทอฬฅฬซฬคiฬดฬฬกอlฬธอฬชอ |
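To illustrate why a pattern-only filter falls short, here is a minimal sketch of a naive regex baseline. The pattern and the helper name `naive_email_found` are assumptions made for this illustration; they are not part of the API and do not reflect how the model works. The baseline catches a plainly written address but misses the obfuscated variants from the table above.

```python
import re

# Hypothetical baseline pattern, for illustration only (not Sightengine's method).
NAIVE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def naive_email_found(text: str) -> bool:
    """Return True if the naive pattern finds an email-like substring."""
    return NAIVE_EMAIL.search(text) is not None


samples = [
    "rick@gmail.com",         # plain address
    "rick[@]gmail . com",     # insertions
    "rick(at)gmail(dot)com",  # phonetic edits
    "rick/gmail/com",         # replacements
    "rick_at_gmail",          # missing or altered parts
]
for sample in samples:
    print(f"{sample!r}: {naive_email_found(sample)}")
```

Only the first sample is reported as found; each obfuscated form would need its own hand-written rule in a regex-only filter.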
The model will detect phone numbers with formats that are valid in countries of your choosing.
Phone numbers can be written in many different ways and have different formats, even within a given country. Area codes, regional codes, county codes, as well as international vs local formats make phone number detection challenging.
Because users sometimes obfuscate phone numbers to evade filters, the model has been designed to be considerably more robust than standard regex-based approaches and to help you catch obfuscated phone numbers.
Here are a few examples of the types of obfuscations that will be caught (not exhaustive):
Obfuscation | Example |
Standard | +1(605)493-3483 |
Character weirdness | 16ฬถโโบ๏ผ[ฬ ฬฒ9]โ๐๏ผ๏ผ3 |
Letter-digit mixing | 1 6 zero fifty-four 9 three 3 fourty eight 3 |
Insertions | 6/0/5/4_9.3+3_4 8_3 |
Combinations of the above | 16ฬถโโบ๏ผ๐ฝInฬดฬฬฬฬฬชอeฬธอฬฬฬฎ โ๐๐๐ ๐ฃ๐ฅ๐ช ๐๐พ๐๐ฝ๐3 |
Given the very large variety of phone number formats across the world, having a detector that flags phone numbers from any country would result in many false positives, as almost any sequence of 5 to 12 digits matches a phone number somewhere.
Therefore, the list of countries you wish to support has to be provided as a comma-separated list of ISO 3166 2-letter country codes in the opt_countries parameter. For instance, us for the United States or fr for France. See the full list of supported countries.
If you do not specify any country, the API will default to the following list of countries: United States (us), France (fr) and United Kingdom (gb).
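As a minimal sketch of preparing the opt_countries value on the client side, the helper below normalizes a list of country codes into the comma-separated form described above. The function name `build_opt_countries` and the `DEFAULT_COUNTRIES` constant are hypothetical; the check only covers the 2-letter format and does not verify that the API supports a given country.

```python
# Defaults mirrored from the text above; the API applies them itself
# when opt_countries is omitted.
DEFAULT_COUNTRIES = ("us", "fr", "gb")


def build_opt_countries(countries=None) -> str:
    """Return a comma-separated, lowercase list of 2-letter country codes."""
    codes = [c.strip().lower() for c in (countries or DEFAULT_COUNTRIES)]
    for code in codes:
        # Format check only: two ASCII letters.
        if len(code) != 2 or not (code.isascii() and code.isalpha()):
            raise ValueError(f"not a 2-letter country code: {code!r}")
    # Drop duplicates while keeping the caller's order.
    return ",".join(dict.fromkeys(codes))


print(build_opt_countries(["US", "gb", "fr", "us"]))
```

The returned string can be passed as the opt_countries field of the request.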
The model will detect any username present in the provided text item. Usernames are flagged with type username.
Here are a few examples of usernames that will be caught:
EXAMPLES |
@lili760 |
iamthebest007 |
tom_the_cat |
… |
In some situations, IP addresses are considered to be personal information and should therefore be redacted or filtered in text.
The PII detection model detects both IPv4 addresses (type ipv4) and IPv6 addresses (type ipv6).
IPv4 is the most frequent and oldest format for IP addresses. It is typically written as 4 numbers separated by dots, for instance 52.222.158.75. IPv6 is the newer format and allows for a greater variety of formatting options, for instance 2600:9000:2247:6000:8:a1f0:7e00:93a1.
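The two written formats can be told apart with Python's standard `ipaddress` module, as in the sketch below. This only classifies an isolated string and is not how the model finds addresses inside free text; the helper name `classify_ip` is hypothetical, and its return values simply reuse the type labels mentioned above.

```python
import ipaddress


def classify_ip(candidate: str):
    """Return 'ipv4', 'ipv6', or None if the string is not a valid IP address."""
    try:
        parsed = ipaddress.ip_address(candidate)
    except ValueError:
        return None
    return "ipv4" if parsed.version == 4 else "ipv6"


for candidate in ("52.222.158.75",
                  "2600:9000:2247:6000:8:a1f0:7e00:93a1",
                  "2600:9000::93a1",   # IPv6 shorthand with '::'
                  "999.1.1.1"):        # out-of-range octet
    print(candidate, "->", classify_ip(candidate))
```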
The model will detect US social security numbers in the provided text item. Social security numbers are flagged with type ssn. Detected numbers are not checked against any database to confirm the actual validity or existence of the number.
Send a POST request containing the UTF-8 encoded text, the ISO 639-1 language code (such as en for English) and the comma-separated list of countries for phone number detection (such as us,gb,fr for the United States, United Kingdom and France). Here is an example:
curl -X POST 'https://api.sightengine.com/1.0/text/check.json' \
-F 'text=I am rick(at)gmail(dot)com or 1(800)343-3598' \
-F 'lang=en' \
-F 'opt_countries=us,gb,fr' \
-F 'mode=rules' \
-F 'api_user={api_user}' \
-F 'api_secret={api_secret}'
# this example uses requests
import requests
import json
data = {
'text': 'I am rick(at)gmail(dot)com or 1(800)343-3598',
'mode': 'rules',
'lang': 'en',
'opt_countries': 'us,gb,fr',
'api_user': '{api_user}',
'api_secret': '{api_secret}'
}
r = requests.post('https://api.sightengine.com/1.0/text/check.json', data=data)
output = json.loads(r.text)
$params = array(
'text' => 'I am rick(at)gmail(dot)com or 1(800)343-3598',
'lang' => 'en',
'opt_countries' => 'us,gb,fr',
'mode' => 'rules',
'api_user' => '{api_user}',
'api_secret' => '{api_secret}',
);
// this example uses cURL
$ch = curl_init('https://api.sightengine.com/1.0/text/check.json');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $params);
$response = curl_exec($ch);
curl_close($ch);
$output = json_decode($response, true);
// this example uses axios and form-data
const axios = require('axios');
const FormData = require('form-data');
const data = new FormData();
data.append('text', 'I am rick(at)gmail(dot)com or 1(800)343-3598');
data.append('lang', 'en');
data.append('opt_countries', 'us,gb,fr');
data.append('mode', 'rules');
data.append('api_user', '{api_user}');
data.append('api_secret', '{api_secret}');
axios({
url: 'https://api.sightengine.com/1.0/text/check.json',
method:'post',
data: data,
headers: data.getHeaders()
})
.then(function (response) {
// on success: handle response
console.log(response.data);
})
.catch(function (error) {
// handle error
if (error.response) console.log(error.response.data);
else console.log(error.message);
});
See request parameter description
Parameter | Type | Description |
text | string | UTF-8 encoded text to moderate |
mode | string | comma-separated list of modes. Modes are rules for the rule-based model or ml for ML models |
categories | string | comma-separated list of categories to check. Possible values: profanity, personal, link, drug, weapon, violence, self-harm, medical, extremism, spam, content-trade, money-transaction (optional) |
lang | string | comma-separated list of target languages |
opt_countries | string | comma-separated list of target countries for phone number detection (optional) |
list | string | id of a custom list to be used for rule-based moderation (optional) |
api_user | string | your API user id |
api_secret | string | your API secret |
The JSON response contains a description of all personal information found along with the positions within the text string. The description can be found under the personal key, as you can see below:
{
"status": "success",
"request": {
"id": "req_6cujQglQPgGApjI5odv0P",
"timestamp": 1471947033.92,
"operations": 1
},
"profanity": {
"matches": []
},
"personal": {
"matches": [
{
"type": "email",
"match": "rick(at)gmail(dot)com",
"start": 5,
"end": 25
},
{
"type": "phone_number_us",
"match": "1(800)343-3598",
"start": 30,
"end": 43
}
]
},
"link": {
"matches": []
}
}
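The start and end positions can be used to mask the detected items in the original text. The sketch below assumes that `end` is an inclusive character index, which is inferred from the sample response above (the 21-character email match spans positions 5 to 25) rather than from a stated specification. The function name `redact_personal` is hypothetical.

```python
def redact_personal(text: str, response: dict, mask: str = "*") -> str:
    """Replace every character covered by a 'personal' match with `mask`."""
    matches = response.get("personal", {}).get("matches", [])
    chars = list(text)
    for match in matches:
        start, end = match["start"], match["end"]
        # Assumption: `end` is inclusive, as suggested by the sample response.
        last = min(end, len(chars) - 1)
        for i in range(start, last + 1):
            chars[i] = mask
    return "".join(chars)


text = "I am rick(at)gmail(dot)com or 1(800)343-3598"
# Trimmed copy of the sample response shown above.
response = {
    "status": "success",
    "personal": {
        "matches": [
            {"type": "email", "match": "rick(at)gmail(dot)com",
             "start": 5, "end": 25},
            {"type": "phone_number_us", "match": "1(800)343-3598",
             "start": 30, "end": 43},
        ]
    },
}
print(redact_personal(text, response))
```

Masking character by character keeps the text length unchanged, so the positions of any remaining matches stay valid.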
See our full list of Text models for details on other filters and checks you can run on your text content. You might also want to check our Image & Video models to moderate images and videos. This includes moderation of text in images/videos.