I am trying to create a database using Mongoid. Mongo does create the database, but I am running into an issue with the data being encoded as UTF-8.
My ExtractData class:
class ExtractData
  include Mongoid::Document
  include Mongoid::Timestamps

  def self.create_all_databases
    @cbsa2msa = DbForCsv.import!('./share/private/csv/cbsa_to_msa.csv')
    @zip2cbsa = DbForCsv.import!('./share/private/csv/zip_to_cbsa.csv')
  end

  def self.show_all_database
    ap @cbsa2msa.all.to_a
    ap @zip2cbsa.all.to_a
  end
end
The DbForCsv class works as below:
require 'csv'

class DbForCsv
  include Mongoid::Document
  include Mongoid::Timestamps
  include Mongoid::Attributes::Dynamic

  def self.import!(file_path)
    columns = []
    instances = []
    CSV.foreach(file_path, encoding: 'iso-8859-1:UTF-8') do |row|
      if columns.empty?
        # We don't want attributes with whitespace
        columns = row.collect { |c| c.downcase.gsub(' ', '_') }
        next
      end
      instances << create!(build_attributes(row, columns))
    end
    instances
  end

  private

  def self.build_attributes(row, columns)
    attrs = {}
    columns.each_with_index do |column, index|
      attrs[column] = row[index]
    end
    ap attrs
    attrs
  end
end
I am using that encoding to make sure only UTF-8 characters are handled, but I still see:
{
       "ï»¿zip" => "71964",
         "cbsa" => "31680",
    "res_ratio" => "0.086511098",
    "bus_ratio" => "0.012048193",
    "oth_ratio" => "0.000000000",
    "tot_ratio" => "0.082435345"
}
when doing 'ap attrs' in the code. How can I make sure that 'ï»¿zip' ends up as 'zip'?
I have also tried:
columns = row.collect { |c| c.encode(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''}).downcase.gsub(' ', '_')}
but I get the same result, and I also see:
ArgumentError - invalid byte sequence in UTF-8
Here is the CSV file.
Thanks
If I take the word you read in from the CSV file:
ï»¿zip
and paste it into a hex editor, it reveals that the word consists of the bytes:
  ï     »     ¿   z  i  p
  |     |     |   |  |  |
  V     V     V   V  V  V
C3 AF C2 BB C2 BF 7A 69 70
So what is that junk in front of "zip"?
The UTF-8 BOM is the string:
"\xEF\xBB\xBF"
If I force the encoding of the BOM string (which is UTF-8 by default in a ruby program) to iso-8859-1:
"\xEF\xBB\xBF".force_encoding("ISO-8859-1")
then look at an iso-8859-1 chart for those hex codes, I find:
EF => ï
BB => »
BF => ¿
Next, if I encode the BOM string to UTF-8:
"\xEF\xBB\xBF".force_encoding("ISO-8859-1").encode("UTF-8")
that asks ruby to replace the hex escapes in the string with the hex escapes for the same characters in the UTF-8 encoding, which are:
ï c3 af LATIN SMALL LETTER I WITH DIAERESIS
» c2 bb RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
¿ c2 bf INVERTED QUESTION MARK
giving me:
"\xC3\xAF\xC2\xBB\xC2\xBF"
Removing ruby's hex escape syntax gives me:
C3 AF C2 BB C2 BF
Compare that to what the hex editor revealed:
  ï     »     ¿   z  i  p
  |     |     |   |  |  |
  V     V     V   V  V  V
C3 AF C2 BB C2 BF 7A 69 70
Look familiar?
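You can verify that round trip yourself in irb; here is a quick sketch:

bom  = "\xEF\xBB\xBF".force_encoding("ISO-8859-1")
utf8 = bom.encode("UTF-8")
utf8                               # => "ï»¿"
utf8.bytes.map { |b| b.to_s(16) }  # => ["c3", "af", "c2", "bb", "c2", "bf"]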
You are asking ruby to do the same thing as above when you write this:
CSV.foreach(file_path, encoding: 'iso-8859-1:UTF-8')
You have a UTF-8 BOM at the beginning of the file:
"\xEF\xBB\xBF"
But, you tell ruby that the file is encoded in ISO-8859-1 and that you want ruby to convert the file to UTF-8 strings inside your ruby program:
                              external encoding
                                      |
                                      V
CSV.foreach(file_path, encoding: 'iso-8859-1:UTF-8')
                                               ^
                                               |
                                       internal encoding
Therefore ruby goes through the same process as described above to produce a string inside your ruby program that looks like the following:
"\xC3\xAF\xC2\xBB\xC2\xBF"
which screws up your first row of CSV data. You said:
I am using the encoding to make sure only UTF8 char are handled
but that doesn't make any sense to me. If the file is UTF-8, then tell ruby that the external encoding is UTF-8:
                            external encoding
                                    |
                                    V
CSV.foreach(file_path, encoding: 'UTF-8:UTF-8')
                                          ^
                                          |
                                  internal encoding
Ruby does not automatically skip the BOM when reading a file, so you will still get funny characters at the start of your first row. To fix that, you can use the external encoding 'BOM|UTF-8', which tells ruby to use a BOM if present to determine the external encoding, then skip over the BOM; or if no BOM is present, then use 'UTF-8' as the external encoding:
                              external encoding
                                      |
                                      V
CSV.foreach(file_path, encoding: 'BOM|UTF-8:UTF-8')
                                              ^
                                              |
                                      internal encoding
That encoding works fine with CSV.foreach(), and it will cause CSV to skip over the BOM after CSV determines the file's encoding.
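For example, a minimal sketch (assuming the file really is UTF-8 with a BOM, and using the zip_to_cbsa.csv path from your question):

require 'csv'

# 'BOM|UTF-8' consumes the BOM (if there is one) before CSV sees the
# first row, so the first header comes back as "zip" instead of "ï»¿zip".
CSV.foreach('./share/private/csv/zip_to_cbsa.csv',
            encoding: 'BOM|UTF-8:UTF-8') do |row|
  p row.first
  break
end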
Response to comment:
The file you posted isn't UTF-8 and there is no BOM. When you specify the external encoding as "BOM|UTF-8" and there is no BOM, you are telling CSV to fall back to an external encoding of UTF-8, and CSV errors out on this row:
"Doña Ana County"
The character ñ is stored as the single byte F1 in the file, which is the ISO-8859-1 code for ñ, and there is no UTF-8 character encoded as the single byte F1 (in UTF-8, the encoding of LATIN SMALL LETTER N WITH TILDE is actually C3 B1).
If you change the external encoding to "ISO-8859-1" and you specify the internal encoding as "UTF-8", then CSV will process the file without error, and CSV will convert the F1 read from the file to C3 B1 and hand your program UTF-8 encoded strings.

The bottom line is: you have to know the encoding of a file to read it. If you are reading many files and they all have different encodings, then you have to know the encoding of each file before you can read it. If you are certain all your files are either ISO-8859-1 or UTF-8, then you can try reading the file with one encoding, and if CSV errors out, you can catch the encoding error and try the other encoding.
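Here is a minimal sketch of that fallback (read_rows is a hypothetical helper, not part of your code):

require 'csv'

# Try UTF-8 first (skipping a BOM if present); if CSV chokes on an
# invalid byte sequence, re-read the file as ISO-8859-1 and let Ruby
# transcode the strings to UTF-8.
def read_rows(file_path)
  CSV.read(file_path, encoding: 'BOM|UTF-8:UTF-8')
rescue ArgumentError, Encoding::InvalidByteSequenceError
  CSV.read(file_path, encoding: 'ISO-8859-1:UTF-8')
end

rows = read_rows('./share/private/csv/zip_to_cbsa.csv')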