Sunday, January 31, 2016

Preserving Encoding and BOM

I was running some files through Regex.Replace, and a diff showed that the first line of some files had changed, even though the text looked the same. It turned out that by using StreamReader to read the input files and StreamWriter to write the output files I had removed the BOMs (byte order marks). For example, some of the input XML files started with the UTF-8 BOM bytes 0xEF 0xBB 0xBF, but those bytes were stripped from the output files.

After an hour of searching and fiddling around I came to the conclusion that none of the System.IO classes correctly report a file's encoding, which means you can't simply and automatically round-trip a file's encoding when it's processed as a text file. Other comments on the web seem to support this, but if anyone knows otherwise, please let me know.
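Here's a minimal sketch of the problem (the two file names are hypothetical test files, one with a UTF-8 BOM and one without). StreamReader's CurrentEncoding property reports UTF-8 in both cases, so the with/without-BOM distinction is lost as soon as you read through a stream:

using (var withBom = new StreamReader(@"C:\temp\withbom.xml"))
using (var noBom = new StreamReader(@"C:\temp\nobom.xml"))
{
    withBom.ReadToEnd();
    noBom.ReadToEnd();
    // Both print "utf-8" -- CurrentEncoding can't tell you whether the
    // source file actually carried a BOM.
    Console.WriteLine(withBom.CurrentEncoding.WebName);
    Console.WriteLine(noBom.CurrentEncoding.WebName);
}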

I reluctantly used the following code to calculate a file's text encoding.

// Requires: using System.IO; using System.Linq; using System.Text;
private static Encoding CalcEncoding(string filename)
{
  // Collect every encoding that declares a preamble (BOM), longest preamble
  // first so that UTF-32 LE is tested before UTF-16 LE, whose 0xFF 0xFE
  // preamble is a prefix of it.
  var prencs = Encoding.GetEncodings()
      .Select(e => new { Enc = e.GetEncoding(), Pre = e.GetEncoding().GetPreamble() })
      .Where(e => e.Pre.Length > 0)
      .OrderByDescending(e => e.Pre.Length)
      .ToArray();
  using (var reader = File.OpenRead(filename))
  {
    var lead = new byte[prencs.Max(p => p.Pre.Length)];
    int count = reader.Read(lead, 0, lead.Length);
    // Only consider preambles that fit inside the bytes actually read,
    // otherwise a short file could falsely match against the zeroed buffer.
    var match = prencs.FirstOrDefault(p => p.Pre.Length <= count &&
        Enumerable.Range(0, p.Pre.Length).All(i => p.Pre[i] == lead[i]));
    return match == null ? null : match.Enc;
  }
}

This method 'sniffs' the file and finds an encoding whose preamble bytes match the start of the file. It's clumsy to have to do this. If you get null back then you have to choose a suitable default encoding; new UTF8Encoding(false) is a good choice on Windows, where UTF-8 without a BOM is the default for most text file processing.
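As a usage sketch (the path is made up), the null fallback looks like this:

// Hypothetical usage: null means no recognised BOM was found.
var detected = CalcEncoding(@"C:\temp\input.xml");
var encoding = detected ?? new UTF8Encoding(false);   // fall back to BOM-less UTF-8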

Once you have the original encoding (or a suitable default), pass it to the StreamWriter constructor and you can be sure that the original encoding and BOM will be preserved.
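Putting it together, here is a minimal sketch of the round trip. The inputFile, outputFile, pattern, and replacement variables are placeholders, not the code from my actual job, and Regex.Replace needs using System.Text.RegularExpressions:

var encoding = CalcEncoding(inputFile) ?? new UTF8Encoding(false);
var text = File.ReadAllText(inputFile, encoding);
var result = Regex.Replace(text, pattern, replacement);
// Passing the detected encoding to StreamWriter re-emits the original
// preamble (or none), so a diff only shows the intended change.
using (var writer = new StreamWriter(outputFile, false, encoding))
{
    writer.Write(result);
}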
