...
  • By Ivan Gavryliuk
  • In C#
  • Posted Wednesday, January 3, 2018

How to Extract a ZIP Archive in Parallel

These days the .NET platform has built-in support for ZIP archives in the System.IO.Compression namespace. I find this exciting: there is no need to depend on a third-party library, and native support from Microsoft is encouraging.

One limitation we've hit with this library is that you cannot extract files in parallel, because the library is not thread-safe. That is somewhat understandable, as the .zip format wasn't designed for parallel processing. Internally, an archive consists of chunks of data belonging to different files, and a single ZipArchive reads all of them through one underlying stream. Jumping between those chunks from multiple threads is not a good option, because file handles are not thread-safe by nature. In fact, this is the primary limitation.

I've tried many answers on StackOverflow that either never worked for me or were overcomplicated, so here is a code fragment that solves the problem:

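   using System;
   using System.Collections.Concurrent;
   using System.Collections.Generic;
   using System.IO;
   using System.IO.Compression;
   using System.Linq;
   using System.Text;
   using System.Threading;
   using System.Threading.Tasks;

   public class ParallelZipArchive : IDisposable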
   {
      private readonly string _filePath;

      public ParallelZipArchive(string filePath)
      {
         _filePath = filePath;
      }

      public IReadOnlyCollection<string> GetEntries()
      {
         using (FileStream fs = File.OpenRead(_filePath))
         {
            using (var archive = new ZipArchive(fs, ZipArchiveMode.Read, true))
            {
               return archive.Entries.Select(e => e.FullName).ToList();
            }
         }
      }

      public Dictionary<string, string> Extract(IEnumerable<string> entries, int maxDop, int maxFilesPerThread, CancellationToken cancellationToken)
      {
         var result = new ConcurrentDictionary<string, string>();

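         // split the entry names into batches of at most maxFilesPerThread items;
         // Chunk is Enumerable.Chunk on .NET 6 and later, on older frameworks it has to be your own extension
         IEnumerable<IEnumerable<string>> batched = entries.Chunk(maxFilesPerThread);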

         try
         {
            Parallel.ForEach(
               batched,
               new ParallelOptions { MaxDegreeOfParallelism = maxDop, CancellationToken = cancellationToken },
               batch => ExtractSequentially(batch, result, cancellationToken));
         }
         catch(OperationCanceledException)
         {
            // when the operation is cancelled it's fine to swallow the exception
            // and return whatever was extracted before the cancellation
            log.Trace("zip extraction cancelled"); // log is the application's own logger, not shown here
         }

         return new Dictionary<string, string>(result);
      }

      private void ExtractSequentially(IEnumerable<string> entries, ConcurrentDictionary<string, string> result, CancellationToken cancellationToken)
      {
         using (FileStream fs = File.Open(_filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
         {
            using (var archive = new ZipArchive(fs, ZipArchiveMode.Read, true))
            {
               foreach (string entry in entries)
               {
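                  if (cancellationToken.IsCancellationRequested) return;

                  // GetEntry returns null when the archive has no entry with this name
                  ZipArchiveEntry ze = archive.GetEntry(entry);
                  if (ze == null) throw new InvalidOperationException("entry not found in archive: " + entry);

                  using (Stream es = ze.Open())
                  using (var ms = new MemoryStream())
                  {
                     // read the whole entry into memory and decode it as UTF-8 text
                     es.CopyTo(ms);
                     string s = Encoding.UTF8.GetString(ms.ToArray());

                     result[entry] = s;
                  }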
               }
            }
         }
      }

      public void Dispose()
      {
      }
   }
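
The Extract method relies on a Chunk extension to split the entry names into batches. On .NET 6 and later that call resolves to the built-in Enumerable.Chunk; on older frameworks you need a helper of your own. Below is a minimal sketch of such a helper. The names BatchingExtensions and Batch are made up for this example (they are not from the original code), and the name Batch is chosen so that it cannot clash with the built-in method.

```csharp
using System;
using System.Collections.Generic;

public static class BatchingExtensions
{
   // Hypothetical helper, not part of the original post: splits a sequence
   // into consecutive batches of at most batchSize items.
   public static IEnumerable<IReadOnlyList<T>> Batch<T>(this IEnumerable<T> source, int batchSize)
   {
      // validate eagerly, before the lazy iterator below is started
      if (source == null) throw new ArgumentNullException(nameof(source));
      if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

      return BatchIterator(source, batchSize);
   }

   private static IEnumerable<IReadOnlyList<T>> BatchIterator<T>(IEnumerable<T> source, int batchSize)
   {
      var batch = new List<T>(batchSize);

      foreach (T item in source)
      {
         batch.Add(item);

         if (batch.Count == batchSize)
         {
            yield return batch;
            batch = new List<T>(batchSize);
         }
      }

      // the last batch may be smaller than batchSize
      if (batch.Count > 0) yield return batch;
   }
}
```

With this in place, entries.Batch(maxFilesPerThread) can be assigned to the same IEnumerable<IEnumerable<string>> variable that Extract uses for its batches.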

The idea behind ParallelZipArchive is really simple: every thread opens its own handle to the same file, so no stream is ever shared between threads, and that is what makes the extraction thread-safe. maxDop is the maximum degree of parallelism, i.e. the maximum number of threads to use, and maxFilesPerThread specifies how many files go into each batch. The second value exists because opening a zip file is itself not a cheap operation, so it pays to reuse one open archive for several files. I found that setting both numbers to something like 20 achieves the best performance on an average server.
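
If you want to check the approach on your own machine, a small round trip is enough: write an archive with known contents, extract it in parallel, and compare. The sketch below is not the class from this post but a trimmed-down illustration of the same idea. The names ParallelZipDemo, CreateSample and ExtractAll are made up for this example, the entries are assumed to be UTF-8 text, and it assumes .NET 6 or later for Enumerable.Chunk.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

public static class ParallelZipDemo
{
   // Writes a small archive with text entries named file0.txt, file1.txt, ...
   public static void CreateSample(string zipPath, int count)
   {
      using (FileStream fs = File.Create(zipPath))
      using (var archive = new ZipArchive(fs, ZipArchiveMode.Create))
      {
         for (int i = 0; i < count; i++)
         {
            ZipArchiveEntry entry = archive.CreateEntry("file" + i + ".txt");

            // in Create mode only one entry can be open at a time,
            // so the writer is disposed before the next entry is created
            using (var writer = new StreamWriter(entry.Open(), new UTF8Encoding(false)))
            {
               writer.Write("content " + i);
            }
         }
      }
   }

   // Extracts every entry as text. Each batch opens its own FileStream and
   // ZipArchive, so no stream is shared between threads.
   public static Dictionary<string, string> ExtractAll(string zipPath, int maxDop, int maxFilesPerThread)
   {
      List<string> names;
      using (FileStream fs = File.OpenRead(zipPath))
      using (var archive = new ZipArchive(fs, ZipArchiveMode.Read))
      {
         names = archive.Entries.Select(e => e.FullName).ToList();
      }

      var result = new ConcurrentDictionary<string, string>();

      Parallel.ForEach(
         names.Chunk(maxFilesPerThread),
         new ParallelOptions { MaxDegreeOfParallelism = maxDop },
         batch =>
         {
            using (FileStream fs = File.Open(zipPath, FileMode.Open, FileAccess.Read, FileShare.Read))
            using (var archive = new ZipArchive(fs, ZipArchiveMode.Read))
            {
               foreach (string name in batch)
               {
                  using (var reader = new StreamReader(archive.GetEntry(name).Open(), Encoding.UTF8))
                  {
                     result[name] = reader.ReadToEnd();
                  }
               }
            }
         });

      return new Dictionary<string, string>(result);
   }
}
```

Calling CreateSample with a temporary path and then ExtractAll with, say, maxDop 4 and maxFilesPerThread 7 should return one dictionary entry per file in the archive. FileShare.Read is what lets several threads hold the same file open for reading at once.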
