[Artifact] Code for analyzing information types in mTLS certificates

We provide the code used for the privacy analysis presented in Section 6 of the paper. This code counts the occurrences of each information type (e.g., domain name, IP address, MAC address, personal name) in the CN and SAN fields of certificates, as shown in Table 8.

Classifying types of information in CN and SAN

Prerequisites

CCADB: We used CCADB to determine whether a Common Name (CN) or Subject Alternative Name (SAN) value is the name of a certificate authority.
- All Certificate Information (root and intermediate) in CCADB - CSV
spaCy pretrained pipeline: We employed spaCy’s pre-trained model for Named Entity Recognition (NER), which categorizes text into labels such as ‘PERSON’, ‘ORG’, ‘PRODUCT’, and ‘DATE’.
- spaCy en_core_web_trf (we used version en_core_web_trf-3.7.3)
Company name datasets: We utilized publicly available company name lists from Kaggle to further classify entries that were not recognized by the spaCy NER model.
- BigPicture 2023 Q4 Free Company Dataset - 17M+ Company Dataset
- People Data Labs 2019 Global Company Dataset - 7+ Million Company Dataset
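As an illustration of how entries might be checked against such company name lists, here is a minimal sketch using only the Python standard library. It is not the matching code used in the paper: the helper names (`normalize`, `match_company`), the suffix list, and the similarity cutoff are all assumptions made for this example.

```python
import difflib
import re

# Hypothetical legal-form suffixes to strip before comparison.
SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "gmbh"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and remove trailing legal-form words."""
    words = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    while words and words[-1] in SUFFIXES:
        words.pop()
    return " ".join(words)


def match_company(value: str, company_names, cutoff: float = 0.9):
    """Return the closest known company name, or None if nothing is close enough."""
    # Map normalized names back to their original spelling.
    index = {normalize(c): c for c in company_names}
    hits = difflib.get_close_matches(normalize(value), list(index), n=1, cutoff=cutoff)
    return index[hits[0]] if hits else None
```

For lists with millions of names, rebuilding the index on every call and comparing against every entry would be slow. Building the normalized index once and trying an exact dictionary lookup first is likely a more practical design.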

Code

You can download the code from our GitHub repository - Link

*Please note that we are unable to provide the exact code used in the paper, because our datasets were collected from within a campus network and contain sensitive user information. Some parameters and functions have been modified accordingly.

How to run?

Download the prerequisites and the code (analysis_code.py).

Set the paths for the prerequisite files:

SPACY_NER_PATH = "/path/to/en_core_web_trf-3.7.3/"
CCADB_PATH = "/path/to/AllCertificateRecordsReport.csv"
COMPANY_DATA1_PATH = "/path/to/companies-2023-q4-sm.csv"
COMPANY_DATA2_PATH = "/path/to/companies_sorted.csv"

Set the paths for intermediate files (used to save some intermediate results and reduce future running time):

COMPANY_SIM_SEED_PATH = "/path/to/company_sim_seed.csv"  # Can be any name, but must be a CSV file
NER_SEED_PATH = "/path/to/ner_seed.csv"  # Can be any name, but must be a CSV file
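To illustrate how an intermediate file of this kind can reduce running time, the sketch below caches (text, label) pairs in a two-column CSV so that an expensive labeling step runs only once per distinct string. The function names and the two-column layout are assumptions for illustration; the seed files written by analysis_code.py may be structured differently.

```python
import csv
import os


def load_seed(path: str) -> dict:
    """Read previously saved (text, label) pairs; return {} if the file is absent."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0]: row[1] for row in csv.reader(f) if len(row) >= 2}


def save_seed(path: str, seed: dict) -> None:
    """Write the (text, label) pairs back to disk as a two-column CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(seed.items())


def cached_label(text: str, seed: dict, labeler) -> str:
    """Return the cached label for text, calling labeler only on a cache miss."""
    if text not in seed:
        seed[text] = labeler(text)
    return seed[text]
```

A typical use would be to call `load_seed` once at startup, route every labeling request through `cached_label`, and call `save_seed` before the script exits.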

Load the certificate dataset as a pandas DataFrame:

import pandas as pd

def load_certificate_data():
    # We do not provide the raw data due to privacy concerns.
    path = "/path/to/certificate_dataset"
    # Assumes a CSV file; use the matching pandas reader for other formats.
    df = pd.read_csv(path)
    return df

Run the script.

The script prints the count of each information type. For example:

- [CN] Public CA
Domain, 3153
Email, 2
IP, 1
MAC, 0
SIP, 0
Localhost, 1
PERSON, 133
Campus_person, 0  # Personal names included in certificates issued by the Campus CA
Gov_person, 0   # Personal names associated with government-related domains' certificates
Campus_ID, 0   # User accounts (IDs) used within the campus
CA, 1       # Certificate authority names
ORG, 40     # Organization names classified by spaCy's model
Company, 16
PRODUCT, 5603
Unknown, 13397
Empty, 1
Total: 22461
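The format-based types in this output (Domain, Email, IP, MAC, SIP, Localhost, Empty) can typically be recognized with pattern checks before any NER or company-name step. The following is a minimal sketch of such checks, not the code used in the paper: the regular expressions, the order of the checks, and the function names are assumptions chosen for illustration.

```python
import ipaddress
import re
from collections import Counter

# Hypothetical patterns for illustration; the artifact's actual rules may differ.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
MAC_RE = re.compile(r"^(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}$")
DOMAIN_RE = re.compile(
    r"^(?:\*\.)?(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}$"
)


def classify_value(value: str) -> str:
    """Return a coarse information type for one CN or SAN value."""
    text = value.strip()
    if not text:
        return "Empty"
    if text.lower() in ("localhost", "localhost.localdomain"):
        return "Localhost"
    # SIP URIs are checked before email because they also contain "@".
    if text.lower().startswith(("sip:", "sips:")):
        return "SIP"
    try:
        ipaddress.ip_address(text)  # accepts both IPv4 and IPv6 literals
        return "IP"
    except ValueError:
        pass
    if MAC_RE.match(text):
        return "MAC"
    if EMAIL_RE.match(text):
        return "Email"
    if DOMAIN_RE.match(text):
        return "Domain"
    # Anything else would be handed to later steps (NER, company matching).
    return "Unknown"


def count_types(values):
    """Count how many values fall into each information type."""
    return Counter(classify_value(v) for v in values)
```

Calling `count_types` on a list of CN values yields a `Counter` keyed by type name, which can then be printed one type per line in the same "Type, count" layout.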

Dataset

Unfortunately, we are unable to provide the raw certificate data used in the analysis in Section 6 due to its sensitive nature.

*Sections 1-5: For the same reason, we are also unable to provide the artifacts used in the analyses in Sections 1-5.
