Table of contents:
- Deduplication Methods
- Deduplication While Backing Up
- Advantages And Disadvantages
- Use Cases in Business
Many companies rely on a file server in their work. It can be considered one of the least efficient kinds of storage because, in addition to the necessary data, such a server often holds a huge amount of “unnecessary information”: duplicate files, old backups, and so on. The presence of such files depends not on the server itself but on the way the storage structure is organized.
For example, the database very often stores file templates that differ by only a few bits of information. As a result, the volume of stored data grows constantly, which increases the need for additional devices for storing backups.
The way to deal with this problem is data deduplication. The procedure eliminates redundant copies and reduces the need for storage space. As a result, storage capacity is used more efficiently and additional devices can be avoided.
The technology removes the numerous copies of a file and keeps only one instance on the storage medium. However, for the procedure to be effective and really eliminate all copies, you need to choose the right level of granularity.
Deduplication Methods
Data deduplication can be performed at several levels:
- blocks;
- separate files;
- bytes.
Each of the approaches has its own characteristics and advantages that should be considered when choosing a solution.
Using blocks is the most popular option. In short, deduplication at this level analyzes files as sequences of blocks and stores each unique block only once. A block is a logical unit of information with a specific size, and that size can vary depending on the task.
An important feature of deduplication at this level is the use of hashing. A hash function produces a signature that identifies each data block, and these signatures are stored in a common database.
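The block approach can be sketched as follows. This is a minimal illustration, not any product's implementation: the 4-byte block size, the in-memory `store` dictionary standing in for the common signature database, and the names `dedup_blocks` and `rebuild` are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4  # assumed, unrealistically small so the example is easy to follow


def dedup_blocks(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Split data into fixed-size blocks and store each unique block once.

    Returns the list of signatures needed to rebuild the data.
    """
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        signature = hashlib.sha256(block).hexdigest()
        store.setdefault(signature, block)  # keep only the first copy of a block
        recipe.append(signature)
    return recipe


def rebuild(recipe: list[str], store: dict[str, bytes]) -> bytes:
    """Reassemble the original data from its block signatures."""
    return b"".join(store[signature] for signature in recipe)


store: dict[str, bytes] = {}
recipe = dedup_blocks(b"AAAABBBBAAAACCCC", store)
print(len(recipe), len(store))  # 4 blocks referenced, 3 blocks stored
```

The repeated `AAAA` block is referenced twice but stored once, and the list of signatures is enough to reconstruct the original data.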
The next possible level of deduplication is the file level. In this case, each new file is compared with the files stored earlier. If the file is unique, it is stored. If it duplicates an existing file, only a link to the original is stored.
In other words, the original file is written once, and all subsequent copies are pointers to it. This option is quite simple to implement, and data processing typically does not degrade server performance. However, it is less effective than the block approach.
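A minimal sketch of the file-level approach, assuming files are identified by a SHA-256 digest of their full content. The file names and the `dedup_files` helper are hypothetical and exist only for illustration.

```python
import hashlib


def dedup_files(files: dict[str, bytes]) -> tuple[dict[str, bytes], dict[str, str]]:
    """Store each distinct file content once and point every name at it."""
    contents: dict[str, bytes] = {}  # digest -> the single stored copy
    pointers: dict[str, str] = {}    # file name -> digest of its content
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        contents.setdefault(digest, data)
        pointers[name] = digest
    return contents, pointers


files = {
    "template_v1.doc": b"quarterly report template",
    "template_copy.doc": b"quarterly report template",
    "template_v2.doc": b"quarterly report template!",
}
contents, pointers = dedup_files(files)
print(len(contents))  # 2: the exact copy is stored once, the near-copy in full
```

The one-character difference in `template_v2.doc` forces the whole file to be stored again, which shows why this level tends to save less space than the block approach.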
The third possible way is deduplication at the byte level. The principle is similar to the block method, but new and old files are compared byte by byte. This approach finds duplicates at the finest granularity and so removes the most redundancy. It has a drawback, though: the procedure consumes significant server resources, so it places higher demands on the hardware.
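The byte-level idea can be sketched by storing a new file as a delta against an old one. This example substitutes Python's general-purpose `difflib.SequenceMatcher` for whatever comparison algorithm a real product would use. The function names and the sample data are assumptions.

```python
from difflib import SequenceMatcher


def byte_delta(old: bytes, new: bytes) -> list[tuple]:
    """Describe `new` as copies from `old` plus the bytes that are new."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))      # reuse bytes already stored
        elif j2 > j1:
            delta.append(("data", new[j1:j2]))  # literal bytes not found in old
        # a pure deletion contributes nothing to the new file
    return delta


def apply_delta(old: bytes, delta: list[tuple]) -> bytes:
    """Rebuild the new file from the old file and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


old = b"invoice template 2023"
new = b"invoice template 2024"
delta = byte_delta(old, new)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
print(literal_bytes, "literal bytes stored out of", len(new))
```

Comparing at this granularity is expensive for large inputs, which matches the resource cost noted above.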
Deduplication While Backing Up
Duplicates are often removed while a backup is being saved. The process can differ in where it is executed: at the source of the information (the client), at the storage side (the server), or both.
The first option is a combined one, in which deduplication can be executed both on the client and on the server. Before sending information to the server, special software tries to determine what has already been written. Typically, block-level deduplication is used. A hash is calculated for each block of information, and the list of hash keys is sent to the server. The server compares the keys with those it already holds, after which only the blocks it still needs are transferred. This solution reduces the overall load on the network, because only unique blocks are sent.
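One way to read the exchange described above is sketched below. It assumes that the server answers with the keys it does not yet hold and that the client then uploads only those blocks. `BackupServer`, `missing_keys`, `upload`, and `client_backup` are hypothetical names, and the "network" is just in-memory method calls.

```python
import hashlib

BLOCK_SIZE = 4  # assumed, tiny for readability


def split_blocks(data: bytes) -> list[bytes]:
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def block_key(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


class BackupServer:
    """Hypothetical in-memory server; a real one would sit behind a network API."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def missing_keys(self, keys: list[str]) -> set[str]:
        """Compare the client's keys with the stored ones."""
        return {key for key in keys if key not in self.blocks}

    def upload(self, new_blocks: dict[str, bytes]) -> None:
        self.blocks.update(new_blocks)


def client_backup(data: bytes, server: BackupServer) -> int:
    """Back up data and return how many blocks were actually sent."""
    blocks = {block_key(b): b for b in split_blocks(data)}
    wanted = server.missing_keys(list(blocks))           # only keys cross the wire first
    server.upload({key: blocks[key] for key in wanted})  # then only unseen blocks
    return len(wanted)


server = BackupServer()
first = client_backup(b"AAAABBBBCCCC", server)
second = client_backup(b"AAAABBBBDDDD", server)
print(first, second)  # 3 1
```

The second backup shares two blocks with the first, so only one block is transferred.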
Deduplication on the server
This option is used when information is transmitted to the server without prior processing. The check for duplicates can then be done in software or in hardware. Software deduplication relies on special software that launches the required processes. With this approach, it is important to consider the load on the system, as it may be too high. Hardware deduplication relies on dedicated appliances that combine deduplication with backup procedures.
Deduplication on the client
This method uses only the resources of the client itself. The data is checked on the client, and only then is it sent to the server. Data deduplication on the client requires special software. The disadvantage of the solution is that it increases RAM usage on the client.
Advantages And Disadvantages
The advantages of the procedure include the following:
- Deduplication lets you retain backups for much longer in the same storage space.
- Deduplication can reduce storage requirements by up to almost 30 times.
- The solution works even over limited network bandwidth. Only unique data is transmitted, which saves traffic.
- Deduplication dramatically reduces storage costs.
- Data can be divided into chunks of arbitrary size.
- Data integrity can be protected and hash collisions guarded against.
- Data deduplication facilitates disaster recovery.
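A reduction figure such as "30 times" is a ratio of logical data (what was backed up) to physical data (what is actually stored). The sketch below shows the arithmetic. The 100 GB dataset, the 30 daily full backups, and the 2% daily change rate are made-up numbers chosen only to illustrate the calculation.

```python
def dedup_ratio(logical_gb: float, physical_gb: float) -> float:
    """Ratio of data backed up to data actually stored."""
    return logical_gb / physical_gb


DATASET_GB = 100  # assumed size of one full backup
BACKUPS = 30      # assumed number of retained daily backups

logical = DATASET_GB * BACKUPS

# Best case: nothing changes between backups, so one copy serves all 30.
best_case = dedup_ratio(logical, DATASET_GB)

# Assumed 2% of the data changes each day after the first backup.
changed_per_day = DATASET_GB * 0.02
physical = DATASET_GB + changed_per_day * (BACKUPS - 1)
with_changes = dedup_ratio(logical, physical)

print(round(best_case, 1), round(with_changes, 1))  # 30.0 19.0
```

Under these assumptions the ratio reaches 30 only when the data never changes. Any daily churn lowers it.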
However, the technology also has disadvantages. The main one is the risk of a hash collision, in which different blocks generate the same hash key. A collision can break the integrity of the stored data and make it impossible to restore the backup. In addition, various errors can occur when processing large amounts of data.
Difficulties often arise with the deduplication service in Windows Server. It can significantly slow down the file server, because files are first copied to disk and only afterwards checked for duplicates.
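One common safeguard against the collision risk is to compare the actual bytes whenever two blocks share a key. The sketch below assumes that safeguard and uses a deliberately weak key function (`weak_key`, a byte sum) so that a collision is easy to trigger. A real system would use a cryptographic hash, for which collisions are extremely unlikely. `VerifyingStore` is a hypothetical name.

```python
def weak_key(block: bytes) -> int:
    """Deliberately weak key (byte sum mod 256) so collisions are easy to show."""
    return sum(block) % 256


class VerifyingStore:
    """Block store that never trusts the key alone."""

    def __init__(self) -> None:
        self.buckets: dict[int, list[bytes]] = {}

    def put(self, block: bytes) -> tuple[int, int]:
        key = weak_key(block)
        bucket = self.buckets.setdefault(key, [])
        for index, existing in enumerate(bucket):
            if existing == block:  # byte-for-byte check confirms a true duplicate
                return key, index
        bucket.append(block)       # same key, different bytes: keep both blocks
        return key, len(bucket) - 1

    def get(self, ref: tuple[int, int]) -> bytes:
        key, index = ref
        return self.buckets[key][index]


store = VerifyingStore()
ref_ab = store.put(b"ab")
ref_ba = store.put(b"ba")   # collides with b"ab" under weak_key
ref_dup = store.put(b"ab")  # a true duplicate of the first block

print(store.get(ref_ab), store.get(ref_ba), ref_dup == ref_ab)
```

Without the byte comparison, `b"ba"` would have been treated as a copy of `b"ab"` and restored incorrectly.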
Use Cases in Business
Deduplication is used especially often by vendors in the backup market.
In addition, the technology is often used on the servers of production systems. In this case, the procedure can be performed by the OS itself or by additional software.
Deduplication is helpful regardless of the workload type. The greatest benefit is seen in virtual environments where multiple virtual machines are used for test/dev and application deployments.
Virtual desktop infrastructure (VDI) is another excellent candidate for deduplication, because the amount of duplicate data among desktops is very high.
Some relational databases, such as Oracle and SQL, do not benefit greatly from deduplication, because they often have a unique key for every database record, which prevents the deduplication engine from identifying the records as duplicates.
Regular deduplication reduces the amount of redundant data on file servers and improves the quality of their work, which ultimately benefits the company’s business processes.
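The effect of unique keys can be shown with a toy example. It assumes each record is hashed as its own block, which is a simplification: real databases lay out records in pages, and real deduplication engines choose their own block boundaries. The record format and the `unique_blocks` helper are made up for illustration.

```python
import hashlib


def unique_blocks(blocks: list[bytes]) -> int:
    """Count distinct blocks as a hash-based deduplication engine would see them."""
    return len({hashlib.sha256(block).digest() for block in blocks})


payload = b"status=shipped;country=DE"

# 100 records with identical content and no key.
without_keys = [payload for _ in range(100)]

# The same 100 records, each prefixed with a unique key.
with_keys = [b"id=%04d;" % i + payload for i in range(100)]

print(unique_blocks(without_keys), unique_blocks(with_keys))  # 1 100
```

Although the payload is the same in every record, the unique key makes every block hash differently, so nothing is recognized as a copy.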