TY - GEN
T1 - Privacy-preserving internet traffic publication
AU - Guo, Longkun
AU - Shen, Hong
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016
Y1 - 2016
N2 - As machine learning (ML)-based traffic classification develops, Internet traffic data is published publicly to serve as test data. Although the IP addresses therein are anonymized, the data explicitly indicates which records belong to the same user. Using this information, an adversary can identify a user among the anonymized users. The paper first gives a k-anonymity method that reduces the probability of information leak to P/k, where P is the probability of information leak without k-anonymity. Assuming the number of flows belonging to an IP address follows a normal distribution, the information loss is shown to be (μ² + σ²)/(kμ² + σ²), where μ and σ² are respectively the mean and the variance of the normal distribution. Random noise is then added to further reduce the probability of information leak to P/k², with an expected distortion rate of approximately 2d+log k-log|X|, where d is the number of dimensions and |X| is the number of vectors. Finally, real-world Internet traffic data is used to evaluate the utility of the anonymized traffic data. According to the experimental results, the k-anonymized, noise-added data can be clustered with an overall accuracy rate close to the state-of-the-art results for non-anonymized traffic data.
KW - Clustering
KW - K-anonymity
KW - Privacy preserving
KW - Traffic classification
UR - http://www.scopus.com/inward/record.url?scp=85015202184&partnerID=8YFLogxK
U2 - 10.1109/TrustCom.2016.0152
DO - 10.1109/TrustCom.2016.0152
M3 - Conference contribution
AN - SCOPUS:85015202184
T3 - Proceedings - 15th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 10th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Symposium on Parallel and Distributed Processing with Applications, IEEE TrustCom/BigDataSE/ISPA 2016
SP - 884
EP - 891
BT - Proceedings - 15th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 10th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Symposium on Parallel and Distributed Processing with Applications, IEEE TrustCom/BigDataSE/ISPA 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - Joint 15th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 10th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Symposium on Parallel and Distributed Processing with Applications, IEEE TrustCom/BigDataSE/ISPA 2016
Y2 - 23 August 2016 through 26 August 2016
ER -