[Artifact] Code for analyzing information types in mTLS certificates
We provide the code used for conducting the privacy analysis presented in Section 6 of the paper. This code allows you to retrieve the count of each information type (e.g., domain name, IP address, MAC address, personal name, etc.) in the CN and SAN fields of certificates, as shown in Table 8.
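As a rough illustration of what counting information types can look like, the sketch below labels the purely syntactic types (domain, IP, MAC, email, localhost, empty) with the standard library and tallies them. It is not the released analysis_code.py: the regular expressions, the function names `classify_value` and `count_types`, and the order of the checks are all assumptions made for this example.

```python
import ipaddress
import re
from collections import Counter

# Hypothetical patterns for illustration only; the rules used in the paper may differ.
MAC_RE = re.compile(r"^([0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
DOMAIN_RE = re.compile(
    r"^(\*\.)?([A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}$"
)


def classify_value(value: str) -> str:
    """Return a coarse information-type label for one CN or SAN value."""
    text = value.strip()
    if not text:
        return "Empty"
    if text.lower() == "localhost":
        return "Localhost"
    try:
        # Accepts both IPv4 and IPv6 literals; raises ValueError otherwise.
        ipaddress.ip_address(text)
        return "IP"
    except ValueError:
        pass
    if MAC_RE.match(text):
        return "MAC"
    if EMAIL_RE.match(text):
        return "Email"
    if DOMAIN_RE.match(text):
        return "Domain"
    # Anything else would need NER or name lists to classify further.
    return "Unknown"


def count_types(values) -> Counter:
    """Tally the label assigned to each value."""
    return Counter(classify_value(v) for v in values)
```

Calling `count_types` on a list of CN strings returns a `Counter` keyed by label. Labels such as PERSON, ORG, and PRODUCT cannot be derived from syntax alone and are left as Unknown in this sketch.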
Classifying types of information in CN and SAN
Prerequisites
- CCADB: We used CCADB to determine whether a certificate authority name appears in the Common Name (CN) or Subject Alternative Name (SAN).
- All Certificate Information (root and intermediate) in CCADB - CSV
- spaCy pretrained pipeline: We employed spaCy’s pre-trained model for Named Entity Recognition (NER), which categorizes text into labels such as ‘PERSON’, ‘ORG’, ‘PRODUCT’, and ‘DATE’.
- spaCy en_core_web_trf (specifically, version en_core_web_trf-3.7.3)
- Company name datasets: We utilized publicly available company name lists from Kaggle to further classify entries that were not recognized by the spaCy NER model.
- BigPicture 2023 Q4 Free Company Dataset - 17M+ Company Dataset
- People Data Labs 2019 Global Company Dataset - 7+ Million Company Dataset
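The list-based prerequisites (CCADB and the company name datasets) can be used as lookup tables for values the NER model does not label. The sketch below shows one simple way to do that with exact matching on normalized names. The column names (`Certificate Name`, `name`) and the helper names are assumptions for illustration, and exact matching is a simplification: the released code may use similarity-based matching instead.

```python
import csv
import io
import re


def normalize_name(name: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical names compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def load_name_set(csv_file, column: str) -> set:
    """Read one column of an open CSV file into a set of normalized names."""
    reader = csv.DictReader(csv_file)
    return {normalize_name(row[column]) for row in reader if row.get(column)}


def label_from_lists(value: str, ca_names: set, company_names: set) -> str:
    """Label a value as CA or Company if it appears in the corresponding list."""
    key = normalize_name(value)
    if key in ca_names:
        return "CA"
    if key in company_names:
        return "Company"
    return "Unknown"


# Tiny in-memory stand-ins for the real CSV files (made-up rows, assumed column names).
ccadb_csv = io.StringIO("Certificate Name\nExample Root CA\n")
company_csv = io.StringIO('name\n"Acme Widgets, Inc."\n')

ca_names = load_name_set(ccadb_csv, "Certificate Name")
company_names = load_name_set(company_csv, "name")
```

With the real files, the `io.StringIO` objects would be replaced by `open(CCADB_PATH, newline="")` and similar, after checking the actual column headers in each CSV.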
Code
You can download the code from our GitHub repository - Link
*Please note that our datasets were collected from within a campus network and contain sensitive user information. Because of these privacy concerns, we are unable to provide the exact code used in the paper; some parameters and functions have been modified accordingly.
How to run?
- Download the prerequisites and the code (analysis_code.py).
- Set the paths for the prerequisite files:
    SPACY_NER_PATH = "/path/to/en_core_web_trf-3.7.3/"
    CCADB_PATH = "/path/to/AllCertificateRecordsReport.csv"
    COMPANY_DATA1_PATH = "/path/to/companies-2023-q4-sm.csv"
    COMPANY_DATA2_PATH = "/path/to/companies_sorted.csv"
- Set the paths for intermediate files (used to save some intermediate results and reduce future running time):
    COMPANY_SIM_SEED_PATH = "/path/to/company_sim_seed.csv"  # Can be any name, but in CSV format
    NER_SEED_PATH = "/path/to/ner_seed.csv"  # Can be any name, but in CSV format
- Load the dataset as a pandas DataFrame:
    import pandas as pd

    def load_certificate_data():
        # We do not provide the raw data due to privacy concerns.
        path = "/path/to/certificate_dataset"
        # Load the data using pandas. read_csv is an assumption here; use the reader that matches your data format.
        df = pd.read_csv(path)
        return df
- Run the script.
For example, the script prints output like the following, showing the count of each information type:
- [CN] Public CA
    Domain, 3153
    Email, 2
    IP, 1
    MAC, 0
    SIP, 0
    Localhost, 1
    PERSON, 133
    Campus_person, 0  # Personal names included in certificates issued by the Campus CA
    Gov_person, 0  # Personal names associated with government-related domains' certificates
    Campus_ID, 0  # User accounts (IDs) used within the campus
    CA, 1  # Certificate authority names
    ORG, 40  # Organization names classified by spaCy's model
    Company, 16
    PRODUCT, 5603
    Unknown, 13397
    Empty, 1
    Total: 22461
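The intermediate files configured in the steps above (COMPANY_SIM_SEED_PATH and NER_SEED_PATH) exist to avoid recomputing expensive results on later runs. A minimal sketch of that cache-on-disk pattern follows. The two-column layout (`text`, `label`) and the function names are assumptions made for this example; the seed files written by analysis_code.py may be structured differently.

```python
import csv
import os


def load_seed(path: str) -> dict:
    """Load cached text-to-label pairs; return an empty dict if the file does not exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="", encoding="utf-8") as f:
        return {row["text"]: row["label"] for row in csv.DictReader(f)}


def save_seed(path: str, cache: dict) -> None:
    """Write the whole cache back to disk as a two-column CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        writer.writerows(cache.items())


def label_with_cache(texts, labeler, seed_path: str) -> dict:
    """Label each text, calling the (expensive) labeler only for texts not already cached."""
    cache = load_seed(seed_path)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    missing = [t for t in dict.fromkeys(texts) if t not in cache]
    for text in missing:
        cache[text] = labeler(text)
    if missing:
        save_seed(seed_path, cache)
    return {t: cache[t] for t in texts}
```

Here `labeler` stands in for any slow step, such as running the NER model on a string. On a second run with the same seed file, only strings that were not seen before reach the labeler.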
Dataset
Unfortunately, we are unable to provide the raw certificate data used in the analysis in Section 6 due to its sensitive nature.
*Sections 1-5: For the same reason, we are also unable to provide the artifacts used in the analyses in Sections 1-5.