A domain can be used for malicious purposes such as:
- Malware, Virus or Trojan delivery
- Phishing
- Spam mails
- Malicious Ad Campaigns (Malvertising)
- Command and Control (C2)
- DGA (Domain Generation Algorithms)
- Data exfiltration
So our idea was to develop an open-source tool that detects malicious domains using machine learning. We use scikit-learn, a free machine learning library for the Python programming language.
Note that we detect malicious domains, not malicious URLs, because our focus is on protecting victims from attackers. Our reasoning is that 90% of attacks are carried out using the domain alone, so by detecting malicious domains rather than malicious URLs we can stop those 90% of attacks.
- Many repositories on GitHub detect malicious URLs, phishing domains, or DGAs, but each type of attack has its own separate solution.
- Even though most of these attacks share the same behaviours, they are handled by different solutions.
- The repositories are not kept up to date.
So we decided to consolidate these behaviours into a single problem and develop one prediction model for detecting malicious domains. This way we don't have to rely on different solutions or maintain different models.
The requirements.txt file lists the dependencies needed to run this project. Install them using the pip install -r requirements.txt command.
Features
- URL length
- Host length
- Number of dots
- Host ranking in city
- Host ranking in country
- URL average token length
- Host average token length
- Path average token length
- URL token count (each word counts as a token)
- Host token count
- Path token count
- URL largest token length
- Host largest token length
- Path largest token length
- IP address presence
- ASN (Autonomous System Number)
- Safe browsing
- Domain age
- Number of subdomains
- Is IDN (Internationalized Domain Name)
To-Do
- Add more machine learning models
- Add whether the domain comes from dynamic DNS as a feature
- Add shortened URL as a feature
- Add the number of special characters (- and _) as a feature
- Add website contents as a feature
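Several of the lexical features listed above (lengths, dot count, token statistics, IP address presence, IDN check) can be computed with the Python standard library alone. The following is a minimal sketch, not the project's actual extraction code: the function names, the dictionary keys, the definition of a token as a run of letters or digits, and counting dots over the whole URL are all assumptions made for illustration. Features that need external lookups (host ranking, ASN, Safe Browsing, domain age) are not covered.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Assumption: a token is a maximal run of ASCII letters or digits.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def token_stats(text):
    """Return (token count, average token length, largest token length)."""
    lengths = [len(token) for token in TOKEN_RE.findall(text)]
    if not lengths:
        return 0, 0.0, 0
    return len(lengths), sum(lengths) / len(lengths), max(lengths)


def has_ip_host(host):
    """True if the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def is_idn(host):
    """True if any label is non-ASCII or punycode-encoded (xn-- prefix)."""
    return any(
        label.lower().startswith("xn--") or not label.isascii()
        for label in host.split(".")
    )


def lexical_features(url):
    """Hypothetical helper: build a feature dictionary for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    url_count, url_avg, url_max = token_stats(url)
    host_count, host_avg, host_max = token_stats(host)
    path_count, path_avg, path_max = token_stats(parts.path)
    return {
        "url_length": len(url),
        "host_length": len(host),
        "dot_count": url.count("."),  # assumption: dots counted over the whole URL
        "url_token_count": url_count,
        "url_avg_token_length": url_avg,
        "url_largest_token_length": url_max,
        "host_token_count": host_count,
        "host_avg_token_length": host_avg,
        "host_largest_token_length": host_max,
        "path_token_count": path_count,
        "path_avg_token_length": path_avg,
        "path_largest_token_length": path_max,
        "has_ip": has_ip_host(host),
        "is_idn": is_idn(host),
    }


if __name__ == "__main__":
    for name, value in lexical_features("http://login.example.com/secure/update.php").items():
        print(name, value)
```

A dictionary per URL like this can be turned into a numeric feature matrix before being passed to a scikit-learn model; how the project itself does this is not shown here.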
Testing Accuracy :: 94.67%
Confusion Matrix :: [[102, 4], [5, 58]]
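The reported accuracy can be checked directly against the confusion matrix: the diagonal holds the correctly classified samples, so accuracy is the diagonal sum divided by the total. The sketch below assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels); the function name is illustrative, and the result does not depend on which class comes first.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified samples / all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    confusion = [[102, 4], [5, 58]]  # values reported above
    print(f"{accuracy_from_confusion(confusion) * 100:.2f}%")
```

With 160 of 169 test samples on the diagonal, this agrees with the 94.67% testing accuracy.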
Feel free to fork and submit pull requests in development.