Text anonymization in many languages using Faker
hal9ai/anonymization
Text anonymization in many languages for Python 3.6+ using Faker.
pip install anonymization
This example uses NamedEntitiesAnonymizer, which requires spaCy and a spaCy model.
pip install spacy
python -m spacy download en_core_web_lg
    >>> from anonymization import Anonymization, AnonymizerChain, EmailAnonymizer, NamedEntitiesAnonymizer
    >>> text = "Hi John,\nthanks for you for subscribing to Superprogram, feel free to ask me any question at secret.mail@Superprogram.com\n Superprogram the best program!"
    >>> anon = AnonymizerChain(Anonymization('en_US'))
    >>> anon.add_anonymizers(EmailAnonymizer, NamedEntitiesAnonymizer('en_core_web_lg'))
    >>> anon.anonymize(text)
    'Hi Holly,\nthanks for you for subscribing to Ariel, feel free to ask me any question at shanestevenson@gmail.com\n Ariel the best program!'
Or make it reversible with pseudonymize:
    >>> from anonymization import Anonymization, AnonymizerChain, EmailAnonymizer, NamedEntitiesAnonymizer
    >>> text = "Hi John,\nthanks for you for subscribing to Superprogram, feel free to ask me any question at secret.mail@Superprogram.com\n Superprogram the best program!"
    >>> anon = AnonymizerChain(Anonymization('en_US'))
    >>> anon.add_anonymizers(EmailAnonymizer, NamedEntitiesAnonymizer('en_core_web_lg'))
    >>> clean_text, patch = anon.pseudonymize(text)
    >>> clean_text
    'Christopher,\nthanks for you for subscribing to Audrey, feel free to ask me any question at colemanwesley@hotmail.com\n Audrey the best program!'
    >>> revert_text = anon.revert(clean_text, patch)
    >>> print(text == revert_text)
    True
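To make the idea behind `pseudonymize` and `revert` concrete, here is a minimal standard-library sketch of reversible replacement. It is not the library's implementation: the `pseudonymize` and `revert` functions, the email pattern, and the counter-based fake addresses are all assumptions made for illustration, and the real `patch` object may be structured differently.

```python
import re
from itertools import count

# Illustrative pattern only; not taken from the library.
EMAIL_PATTERN = re.compile(r"[\w.-]+@[\w.-]+\.\w+")


def pseudonymize(text):
    """Replace each distinct email with a fake one and record the mapping."""
    patch = {}    # fake value -> original value, used to revert
    forward = {}  # original value -> fake value, so repeats stay consistent
    counter = count(1)

    def substitute(match):
        original = match.group(0)
        if original not in forward:
            fake = f"user{next(counter)}@example.com"
            forward[original] = fake
            patch[fake] = original
        return forward[original]

    return EMAIL_PATTERN.sub(substitute, text), patch


def revert(text, patch):
    """Undo pseudonymize by swapping each fake value back to its original."""
    for fake, original in patch.items():
        text = text.replace(fake, original)
    return text


text = "Contact alice@corp.com or bob@corp.com; alice@corp.com replies faster."
clean_text, patch = pseudonymize(text)
restored = revert(clean_text, patch)
```

The key design point is that the replacement step returns both the cleaned text and a mapping, so the original can be restored only by someone who holds that mapping.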
Our solution supports many languages along with their specific information formats.
For example, we can generate a French phone number:
    >>> from anonymization import Anonymization, PhoneNumberAnonymizer
    >>> text = "C'est bien le 0611223344 ton numéro ?"
    >>> anon = Anonymization('fr_FR')
    >>> phoneAnonymizer = PhoneNumberAnonymizer(anon)
    >>> phoneAnonymizer.anonymize(text)
    "C'est bien le 0144939332 ton numéro ?"
More examples in /examples
name | lang |
---|---|
FilePathAnonymizer | - |

name | lang |
---|---|
EmailAnonymizer | - |
UriAnonymizer | - |
MacAddressAnonymizer | - |
Ipv4Anonymizer | - |
Ipv6Anonymizer | - |

name | lang |
---|---|
PhoneNumberAnonymizer | 47+ |
msisdnAnonymizer | 47+ |

name | lang |
---|---|
DateAnonymizer | - |

name | lang |
---|---|
NamedEntitiesAnonymizer | 7+ |
DictionaryAnonymizer | - |
SignatureAnonymizer | 7+ |
Custom anonymizers can be easily created to fit your needs:
    from anonymization import Anonymization

    class CustomAnonymizer():
        def __init__(self, anonymization: Anonymization):
            self.anonymization = anonymization

        def anonymize(self, text: str) -> str:
            # Template: compute modified_text yourself, then return it
            return modified_text
            # or replace by regex patterns in text using a faker provider
            # return self.anonymization.regex_anonymizer(text, pattern, provider)
            # or replace all occurrences using a faker provider
            # return self.anonymization.replace_all(text, matches, provider)
You may also add a new Faker provider with the helper Anonymization.add_provider(FakerProvider), or access the Faker instance directly through Anonymization.faker.
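As a concrete illustration of the custom anonymizer template, the sketch below defines an anonymizer for ticket IDs. So that it runs without installing the package, it uses a small stand-in class, `StubAnonymization`, which is hypothetical and only mimics the `regex_anonymizer(text, pattern, provider)` call shown in the template. The `TicketIdAnonymizer` name, the `TICKET-1234` format, and the `'ticket_id'` provider name are likewise assumptions. With the real library you would pass an `Anonymization` instance instead, and the provider would presumably need to be available on its Faker instance.

```python
import re


class StubAnonymization:
    """Hypothetical stand-in for Anonymization, for illustration only."""

    def __init__(self, providers):
        # Maps a provider name to a zero-argument callable returning a fake value.
        self.providers = providers

    def regex_anonymizer(self, text, pattern, provider):
        # Replace every regex match with a value from the named provider.
        generate = self.providers[provider]
        return re.sub(pattern, lambda match: generate(), text)


class TicketIdAnonymizer:
    """Custom anonymizer with the same constructor/anonymize shape as the template."""

    def __init__(self, anonymization):
        self.anonymization = anonymization

    def anonymize(self, text: str) -> str:
        # Assumed ticket format: the literal prefix TICKET- followed by four digits.
        return self.anonymization.regex_anonymizer(
            text, r"\bTICKET-\d{4}\b", "ticket_id"
        )


stub = StubAnonymization({"ticket_id": lambda: "TICKET-0000"})
result = TicketIdAnonymizer(stub).anonymize("See TICKET-1234 and TICKET-9876.")
```

Because the anonymizer only depends on the object it is given, it can be exercised against a stub like this in unit tests before being wired into a real chain.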
This module is benchmarked on synth_dataset from presidio-research and achieves a higher accuracy (0.79) than Microsoft's solution (0.75).
You can run the benchmark using docker:
docker build . -f ./benchmark/dockerfile -t anonbench
docker run -it --rm --name anonbench anonbench
License: MIT