TY - JOUR
T1 - Improved approximate detection of duplicates for data streams over sliding windows
AU - Shen, Hong
AU - Zhang, Yu
N1 - Funding Information:
This work is supported by the “Hundred Talents Program” of CAS and the National Natural Science Foundation of China under Grant No. 60772034.
PY - 2008/11
Y1 - 2008/11
N2 - Detecting duplicates in data streams is an important problem that has a wide range of applications. In general, precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios, and, on the other hand, the elements in data streams are always time sensitive. This makes it particularly important to approximately detect duplicates among newly arrived elements of a data stream within a fixed time frame. In this paper, we present a novel data structure, the Decaying Bloom Filter (DBF), an extension of the Counting Bloom Filter that effectively removes stale elements as new elements continuously arrive over sliding windows. Based on the DBF, we present an efficient algorithm to approximately detect duplicates over sliding windows. Our algorithm may produce false positive errors but, unlike many previous results, no false negative errors. We analyze the time complexity and detection accuracy, and give a tight upper bound on the false positive rate. For a given space of G bits and a sliding window of size W, our algorithm has an amortized time complexity of O(√(G/W)). Both analytical and experimental results on synthetic data demonstrate that our algorithm is superior in both execution time and detection accuracy to the previous results.
AB - Detecting duplicates in data streams is an important problem that has a wide range of applications. In general, precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios, and, on the other hand, the elements in data streams are always time sensitive. This makes it particularly important to approximately detect duplicates among newly arrived elements of a data stream within a fixed time frame. In this paper, we present a novel data structure, the Decaying Bloom Filter (DBF), an extension of the Counting Bloom Filter that effectively removes stale elements as new elements continuously arrive over sliding windows. Based on the DBF, we present an efficient algorithm to approximately detect duplicates over sliding windows. Our algorithm may produce false positive errors but, unlike many previous results, no false negative errors. We analyze the time complexity and detection accuracy, and give a tight upper bound on the false positive rate. For a given space of G bits and a sliding window of size W, our algorithm has an amortized time complexity of O(√(G/W)). Both analytical and experimental results on synthetic data demonstrate that our algorithm is superior in both execution time and detection accuracy to the previous results.
KW - Approximate query
KW - Bloom filter
KW - Data stream
KW - Duplicate detection
KW - Sliding window
UR - http://www.scopus.com/inward/record.url?scp=57049103006&partnerID=8YFLogxK
U2 - 10.1007/s11390-008-9192-1
DO - 10.1007/s11390-008-9192-1
M3 - Article
AN - SCOPUS:57049103006
SN - 1000-9000
VL - 23
SP - 973
EP - 987
JO - Journal of Computer Science and Technology
JF - Journal of Computer Science and Technology
IS - 6
ER -