I am trying to create a database using Mongoid. Mongo does create the database, but I am running into an issue with the data being encoded as UTF-8.
My ExtractData class:
class ExtractData
  include Mongoid::Document
  include Mongoid::Timestamps

  def self.create_all_databases
    @cbsa2msa = DbForCsv.import!('./share/private/csv/cbsa_to_msa.csv')
    @zip2cbsa = DbForCsv.import!('./share/private/csv/zip_to_cbsa.csv')
  end

  def self.show_all_database
    ap @cbsa2msa.all.to_a
    ap @zip2cbsa.all.to_a
  end
end
The DbForCsv class works as below:
require 'csv'

class DbForCsv
  include Mongoid::Document
  include Mongoid::Timestamps
  include Mongoid::Attributes::Dynamic

  def self.import!(file_path)
    columns = []
    instances = []
    CSV.foreach(file_path, encoding: 'iso-8859-1:UTF-8') do |row|
      if columns.empty?
        # We don't want attributes with whitespace
        columns = row.collect { |c| c.downcase.gsub(' ', '_') }
        next
      end
      instances << create!(build_attributes(row, columns))
    end
    instances
  end

  private

  def self.build_attributes(row, columns)
    attrs = {}
    columns.each_with_index do |column, index|
      attrs[column] = row[index]
    end
    ap attrs
    attrs
  end
end
I am using that encoding to make sure only UTF-8 characters are handled, but I still see:
{
       "ï»¿zip" => "71964",
         "cbsa" => "31680",
    "res_ratio" => "0.086511098",
    "bus_ratio" => "0.012048193",
    "oth_ratio" => "0.000000000",
    "tot_ratio" => "0.082435345"
}
when doing 'ap attrs' in the code. How can I make sure that 'ï»¿zip' ends up as 'zip'?
I have also tried:
columns = row.collect { |c| c.encode(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''}).downcase.gsub(' ', '_')}
but I get the same result, and I also see:
ArgumentError - invalid byte sequence in UTF-8
Here is the CSV file.
Thanks
If I take the word you read in from the CSV file:
ï»¿zip
and paste it into a hex editor, it reveals that the word consists of the bytes:
  ï     »     ¿   z  i  p
  |     |     |   |  |  |
  V     V     V   V  V  V
C3 AF C2 BB C2 BF 7A 69 70
So what is that junk in front of "zip"?
The UTF-8 BOM is the string:
"\xEF\xBB\xBF"
If I force the encoding of the BOM string (which is UTF-8 by default in a ruby program) to iso-8859-1:
"\xEF\xBB\xBF".force_encoding("ISO-8859-1")
then look at an iso-8859-1 chart for those hex codes, I find:
EF => ï
BB => »
BF => ¿
Next, if I encode the BOM string to UTF-8:
"\xEF\xBB\xBF".force_encoding("ISO-8859-1").encode("UTF-8")
that asks ruby to replace the hex escapes in the string with the hex escapes for the same characters in the UTF-8 encoding, which are:
ï c3 af LATIN SMALL LETTER I WITH DIAERESIS
» c2 bb RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
¿ c2 bf INVERTED QUESTION MARK
giving me:
"\xC3\xAF\xC2\xBB\xC2\xBF"
Removing ruby's hex escape syntax gives me:
C3 AF C2 BB C2 BF
Compare that to what the hex editor revealed:
  ï     »     ¿   z  i  p
  |     |     |   |  |  |
  V     V     V   V  V  V
C3 AF C2 BB C2 BF 7A 69 70
Look familiar?
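You can verify that round trip yourself in irb; here is a quick sketch:

bom  = "\xEF\xBB\xBF".force_encoding("ISO-8859-1")
utf8 = bom.encode("UTF-8")
utf8                               # => "ï»¿"
utf8.bytes.map { |b| b.to_s(16) }  # => ["c3", "af", "c2", "bb", "c2", "bf"]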
You are asking ruby to do the same thing as above when you write this:
CSV.foreach(file_path, encoding: 'iso-8859-1:UTF-8')
You have a UTF-8 BOM at the beginning of the file:
"\xEF\xBB\xBF"
But, you tell ruby that the file is encoded in ISO-8859-1 and that you want ruby to convert the file to UTF-8 strings inside your ruby program:
                              external encoding
                                      |
                                      V
CSV.foreach(file_path, encoding: 'iso-8859-1:UTF-8')
                                               ^
                                               |
                                       internal encoding
Therefore ruby goes through the same process as described above to produce a string inside your ruby program that looks like the following:
"\xC3\xAF\xC2\xBB\xC2\xBF"
which screws up your first row of CSV data. You said:
I am using the encoding to make sure only UTF8 char are handled
but that doesn't make any sense to me. If the file is UTF-8, then tell ruby that the external encoding is UTF-8:
                            external encoding
                                    |
                                    V
CSV.foreach(file_path, encoding: 'UTF-8:UTF-8')
                                          ^
                                          |
                                  internal encoding
Ruby does not automatically skip the BOM when reading a file, so you will still get funny characters at the start of your first row. To fix that, you can use the external encoding 'BOM|UTF-8', which tells ruby to use a BOM if present to determine the external encoding, then skip over the BOM; or if no BOM is present, then use 'UTF-8' as the external encoding:
                              external encoding
                                      |
                                      V
CSV.foreach(file_path, encoding: 'BOM|UTF-8:UTF-8')
                                              ^
                                              |
                                      internal encoding
That encoding works fine with CSV.foreach(), and it will cause CSV to skip over the BOM after CSV determines the file's encoding.
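For example, a minimal sketch (assuming the file really is UTF-8 with a BOM, and using the zip_to_cbsa.csv path from your question):

require 'csv'

# 'BOM|UTF-8' consumes the BOM (if there is one) before CSV sees the
# first row, so the first header comes back as "zip" instead of "ï»¿zip".
CSV.foreach('./share/private/csv/zip_to_cbsa.csv',
            encoding: 'BOM|UTF-8:UTF-8') do |row|
  p row.first
  break
end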
Response to comment:
The file you posted isn't UTF-8 and there is no BOM. When you specify the external encoding as "BOM|UTF-8" and there is no BOM, you are telling CSV to fall back to an external encoding of UTF-8, and CSV errors out on this row:
"Doña Ana County"
The character ñ is stored as the single byte F1 in the file, which is the ISO-8859-1 code for ñ, and there is no UTF-8 character encoded as the single byte F1 (in UTF-8, the encoding of LATIN SMALL LETTER N WITH TILDE is actually C3 B1).
If you change the external encoding to "ISO-8859-1" and you specify the internal encoding as "UTF-8", then CSV will process the file without error, and CSV will convert the F1 read from the file to C3 B1 and hand your program UTF-8 encoded strings.

The bottom line is: you have to know the encoding of a file to read it. If you are reading many files and they all have different encodings, then you have to know the encoding of each file before you can read it. If you are certain all your files are either ISO-8859-1 or UTF-8, then you can try reading the file with one encoding, and if CSV errors out, you can catch the encoding error and try the other encoding.
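Here is a minimal sketch of that fallback (read_rows is a hypothetical helper, not part of your code):

require 'csv'

# Try UTF-8 first (skipping a BOM if present); if CSV chokes on an
# invalid byte sequence, re-read the file as ISO-8859-1 and let Ruby
# transcode the strings to UTF-8.
def read_rows(file_path)
  CSV.read(file_path, encoding: 'BOM|UTF-8:UTF-8')
rescue ArgumentError, Encoding::InvalidByteSequenceError
  CSV.read(file_path, encoding: 'ISO-8859-1:UTF-8')
end

rows = read_rows('./share/private/csv/zip_to_cbsa.csv')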