Predicting disclosure risk in the numeric database

Internal data in an organization can grow rapidly over time. To reduce costs, the organization may choose a third-party storage provider to store the entire data set. A leakage crisis occurs if the provider cannot be trusted. In another scenario, a retailer collects all transaction data and publishes it to a data analytics company for marketing purposes. Privacy could be violated if that company is malicious. For these reasons, preserving privacy in the database has become a very important issue. This document concerns prediction disclosure risk in numerical databases. We present an efficient noise-generation method based on the Huffman coding algorithm. We also build a noise matrix that allows noise to be added to the original values in a straightforward way. Furthermore, we adopt a clustering technique before generating noise. The results show that the noise-generation running time of the clustered scheme is shorter than that of the unclustered scheme.

1. Introduction.

Technology brings convenience, and the technique of cloud computing has been on the rise in recent years. Internal data in an organization can grow rapidly. Even if the organization builds the storage space by itself, it may still publish this data to a data analytics company for marketing purposes. Data mining techniques therefore play an important role in Knowledge Discovery in Databases (KDD). However, a malicious data analytics company can record personal data when the organization publishes its statistical database to it. If the company cannot be trusted, a leakage crisis occurs. For these reasons, privacy research has become more and more popular in recent years.

Statistical databases (SDBs) are used to produce statistical aggregates, such as sum, average, maximum, and minimum. Statistical aggregate results do not reveal the contents of any individual tuple. However, a user can issue many legitimate queries and infer confidential information from the answers the database returns. Improving the security of statistical databases has therefore received much attention in recent years. The security problem in the classical statistical database involves three different roles [17]: the statistician, whose interest is to obtain aggregate data; the data owner, who wants individual data to be protected; and the database administrator, who must satisfy both of the above roles. Privacy challenges in statistical databases are classified into two aspects [15]: for the data owner, the system should prevent data theft by hackers, prevent data abuse by the service provider, and restrict access to authorized users; for the user, the content of the query should be hidden and the database should not reveal the details of the query. Many approaches have been proposed. Navarro-Arribas and Torra organize them into four categories [16]: 1) perturbative methods, which modify the original data to achieve a certain degree of privacy; the modification is usually called noise; 2) non-perturbative methods, which mask the data without introducing errors; unlike perturbative methods, the data is not distorted; 3) cryptographic methods, which use classical cryptographic systems; 4) synthetic data generation, which generates random data while maintaining a relationship with the original data. To protect confidential information in the database, Statistical Disclosure Control (SDC) is mainly used as the privacy protection solution in the statistical database.
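As a small illustration of the first category, a perturbative method can mask a numeric attribute by adding random noise before release. The Python sketch below is only a generic example of this idea, not the scheme proposed in this paper; the attribute values, noise scale, and function name are our own assumptions.

import random

def perturb(values, scale=1.0, seed=None):
    # Generic perturbative masking: add zero-mean uniform noise so that the
    # published values differ from the originals while aggregates such as
    # the mean stay close to the true ones.
    rng = random.Random(seed)
    return [v + rng.uniform(-scale, scale) for v in values]

ages = [23, 35, 35, 41, 52, 60]            # original confidential attribute
published = perturb(ages, scale=2.0, seed=7)
print(published)                           # distorted values released to the analyst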
Microaggregation techniques (MAT) belong to the SDC family and are perturbative methods. The microaggregation method has many attractive features, including robust performance, consistent responses, and ease of implementation [6]. A user is still able to obtain useful information, since this method does not reduce the information in the content; in other words, the information loss introduced by this method is minimal. Furthermore, we will examine some approaches to preserving privacy [1-5, 8, 12-14, 17]. In particular, in recent years the microaggregation scheme has attracted attention for use in statistical databases, because it replaces the original values, with little distortion, to prevent identity and prediction disclosure, and the replaced data causes no problems for data analysis or data mining applications. Each record in the database can be represented as a data point in a coordinate system. This paper considers that a combination of two or more non-confidential attributes, such as age and weight, can be used to link to an individual. This set of attributes is collectively called a quasi-identifier. A popular approach to replacing the original data is to use a clustering-based technique to prevent identity disclosure, so that the adversary may be confused when the original data is replaced by a cluster-based value. However, although the clustering-based technique makes the data in the dataset homogeneous, the problem of prediction disclosure remains.

2. Proposed scheme.

This article concerns the prediction disclosure problem that arises when the quasi-identifier is generalized by a homogeneous microaggregation method. The quasi-identifier has one or more attributes that can link to an individual. For brevity, we consider only a quasi-identifier with two attributes. First, all values of the quasi-identifier are converted to data points on the coordinate system. To address prediction disclosure, the homogeneous values produced by the original microaggregation method are first clustered. We then generate noise based on the centroids of these groups. To speed up noise injection, all noise values are collected into a set, called the noise matrix in this paper. Each original value corresponds to a noise value. In this section, we introduce the concept of microaggregation and then illustrate Prim's MST-based clustering technique. The main ideas of the paper are the noise generation and noise injection procedures; these are described in the remainder of this section.

2.1 Preliminary.

The microaggregation technique belongs to the family of statistical disclosure control and is applied to numerical data, categorical data, sequences, and heterogeneous data [16]. It computes a value to represent each group and replaces the original values with it to confuse the adversary. Each record forms a group with its closest records, where the minimum group size k is a constant threshold preset by the data protector. The higher k is, the higher the degree of privacy but the lower the data quality; conversely, the lower k is, the lower the degree of privacy but the higher the data quality. This is a trade-off between disclosure risk and information loss. Although this method perturbs the original data and leads to data distortion, it guarantees only a low level of distortion and does not affect the usefulness of the database. Therefore, minimizing information loss is one of the main challenges of this method.
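To make this trade-off concrete, the following Python sketch shows a minimal univariate microaggregation: values are sorted, partitioned into consecutive groups of at least k values, and each value is replaced by its group centroid (the mean). It is only a simplified illustration under our own naming, not the exact method used in this paper, and it instantiates the two operations, partition and aggregation, described next.

def microaggregate(values, k=3):
    # Sort the values, partition them into consecutive groups of at least k
    # members, and replace every value by the centroid (mean) of its group.
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = [0.0] * len(values)
    i = 0
    while i < len(order):
        # the last group absorbs the remainder so every group has >= k members
        j = len(order) if len(order) - i < 2 * k else i + k
        group = order[i:j]
        centroid = sum(values[g] for g in group) / len(group)
        for g in group:
            masked[g] = centroid
        i = j
    return masked

weights = [48, 51, 55, 60, 63, 70, 72, 90]
print(microaggregate(weights, k=3))
# a larger k means more records share one centroid: more privacy, more distortion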
There are two main operations for microaggregation, namely partition and aggregation, which we describe as follows. Partition: records are partitioned into multiple disjoint groups, and at least k records are included in each group. Aggregation: each record in a group is replaced by the centroid of the group, which is a value computed to represent the group.

2.2 MST Clustering.

We adopt the Prim minimum-cost spanning tree clustering technique proposed by Laszlo and Mukherjee in 2005 [11]. In the first step, the clustering technique builds a Prim minimum-cost spanning tree over all records in the dataset. Prim's algorithm is a greedy algorithm that finds a minimum-cost spanning tree for a connected, undirected, weighted graph; it finds a subset of the edges that forms a spanning tree connecting all nodes such that the total weight of the edges is minimized. Some notation is defined to facilitate the discussion. Each record with multiple attributes in dataset D can be converted to a data point in the coordinate system and is considered a node u in the minimum-cost spanning tree. A node u can be connected to another node v in dataset D, forming an edge e(u,v), where u, v ∈ D. For every edge, a value is computed from the two nodes it connects, and this value is used as the weight w of that edge. According to Prim's algorithm, we first select a single node u ∈ D and construct a minimum-cost spanning tree F = {u}, without edges. The next step of Prim's algorithm selects another node v ∈ D∖F that is closest to the set F, say closest to node u. A new edge e(u,v) is formed by the two nodes u, v ∈ D, the node v points to its parent node u, and v is added to the set F, so that F = {u, v}. Each node points to its parent node in the tree, but the starting node points to null; in this case, node u points to null. This process iterates until F = D. In other words, Prim's algorithm selects a single node in the graph, considered the root of the tree, and grows it into a minimum-cost spanning tree in which the total weight of all selected edges is minimized. The result of Prim's MST algorithm is shown in Fig. 1, where the tree nodes are connected by red lines and the weight is shown near each edge.

In the second step, to partition all nodes of the MST into clusters, we must consider which edges of the MST are removable. The idea is to visit all edges in the MST from longest to shortest and decide which edges to cut while keeping the remaining edges. After edge cutting, the MST is split into several subtrees, and these can be formed into clusters. All edges are placed into a priority queue in descending order of weight. We then take edges from the priority queue one at a time and consider whether each edge is removable, where v is the visited node and u is its parent node. We consider the sizes of the two subtrees on the sides of the visited node and the parent node, respectively, and check whether each size is at least the threshold k preset by the data protector. The edge is removable when both subtree sizes are at least k; otherwise, it is not removable. First, we obtain the size of the subtree rooted at the visited node v. Second, we follow the parent pointers from the visited node toward the root of its component and obtain the size of the subtree on the other side of the edge. To briefly illustrate, suppose the sizes of these two subtrees are both at least k; then the edge is removable.
We then remove the edge from the priority queue and set the parent pointer of the visited node to null to indicate that it is now the root of its own subtree. The final step is a simple pass that partitions all nodes into disjoint clusters. Each.
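To summarize the two steps above, the Python sketch below builds a Prim-style minimum spanning tree over 2-D data points with Euclidean edge weights and then cuts edges from longest to shortest whenever both resulting subtrees keep at least k nodes. It is a minimal sketch under our own naming and a fixed k, not the exact implementation of [11].

import math

def dist(a, b):
    # Euclidean distance between two 2-D points, used as the edge weight w
    return math.hypot(a[0] - b[0], a[1] - b[1])

def prim_mst(points):
    # Return parent links of a minimum-cost spanning tree over the points.
    n = len(points)
    in_tree = [False] * n
    parent = [None] * n            # the starting node keeps parent None (null)
    best = [math.inf] * n
    best[0] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        for v in range(n):
            if not in_tree[v] and dist(points[u], points[v]) < best[v]:
                best[v], parent[v] = dist(points[u], points[v]), u
    return parent

def cut_edges(points, parent, k=3):
    # Visit MST edges from longest to shortest and cut an edge only when both
    # resulting subtrees contain at least k nodes; return the final clusters.
    n = len(points)
    children = {i: [] for i in range(n)}
    for v, p in enumerate(parent):
        if p is not None:
            children[p].append(v)

    def size(v):                   # number of nodes in the subtree rooted at v
        return 1 + sum(size(c) for c in children[v])

    def root(v):                   # follow parent pointers up to the component root
        while parent[v] is not None:
            v = parent[v]
        return v

    for v in sorted((v for v in range(n) if parent[v] is not None),
                    key=lambda v: dist(points[v], points[parent[v]]),
                    reverse=True):
        below = size(v)
        above = size(root(v)) - below
        if below >= k and above >= k:        # removable: v becomes a new root
            children[parent[v]].remove(v)
            parent[v] = None

    clusters = {}
    for v in range(n):
        clusters.setdefault(root(v), []).append(v)
    return list(clusters.values())

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (15, 2), (15, 3), (16, 2)]
print(cut_edges(pts, prim_mst(pts), k=3))    # three spatially coherent groups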