Get to know MySQL character sets and collation for efficient database design
By: James Dupont | Article ID: 892074 | Published: December 21, 2007 | Category: Computers and Technology, Software
Character sets and collations are important concepts to grasp before designing your database. A character set is a set of symbols and encodings, while a collation is a set of rules for comparing the characters in a character set. In practice, most character sets contain many characters, special symbols, and punctuation marks, so their collations need many rules. These include case insensitivity, accent insensitivity (an "accent" is a mark attached to a character, as in the German Ö), and multiple-character mappings (such as the rule that Ö = OE in one of the two German collations).
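To make the three kinds of rules concrete, here is a minimal Python sketch of comparison keys. It only imitates the idea of a collation and is not how MySQL implements one; the function names key_ci_ai and key_german_phonebook are made up for this illustration.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def key_ci_ai(text: str) -> str:
    """Hypothetical case-insensitive, accent-insensitive comparison key."""
    return strip_accents(text).casefold()

def key_german_phonebook(text: str) -> str:
    """Hypothetical key with multiple-character mappings (Ö compares as OE)."""
    expansions = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
    lowered = unicodedata.normalize("NFC", text).lower()
    return "".join(expansions.get(ch, ch) for ch in lowered)

# Two strings are "equal" under a collation when their keys match.
same_without_accent = key_ci_ai("Ödland") == key_ci_ai("odland")
same_with_expansion = key_german_phonebook("Ö") == key_german_phonebook("OE")
```

Under the first key Ö matches O, and under the second it matches OE, which is why the choice of collation changes the result of a comparison even when the stored bytes stay the same.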
In the single character set model (before MySQL 4.1), character values are interpreted with respect to the server's character set. This is the default character set selected when the server was built, usually latin1. It can be overridden when the server starts by using the default-character-set option. The drawback is that the server is limited to one character set at a time, and changing that character set after tables have been created can leave character-based indexes ordered incorrectly. The solution is to rebuild the indexes for each existing table that has a character-based index, using the collating order of the new character set.
From MySQL 4.1 onward, the server supports multiple character sets simultaneously (more than 70 collations for more than 30 character sets). However, you cannot mix character sets within a string or use different character sets for different rows in a given column.
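The index problem of the single character set model can be pictured with a sorted list standing in for an index. This is a rough Python sketch under that assumption, with hypothetical helper names; it says nothing about how MySQL stores its indexes.

```python
words = ["cherry", "apple", "Banana"]

def binary_order(items):
    """Order by code point value, so uppercase letters sort before lowercase."""
    return sorted(items)

def case_insensitive_order(items):
    """Order by a case-folded key, a stand-in for a case-insensitive collation."""
    return sorted(items, key=str.casefold)

def is_sorted_under(items, key=None):
    """Check whether a stored 'index' is still in order under a given key."""
    keys = [key(x) if key else x for x in items]
    return all(a <= b for a, b in zip(keys, keys[1:]))

index = binary_order(words)                       # built under the old ordering
still_valid = is_sorted_under(index, str.casefold)  # checked under the new one
rebuilt = case_insensitive_order(index)           # the "rebuild the index" step
```

An index that is sorted under one ordering is no longer sorted under another, so lookups that rely on the order would miss rows until the index is rebuilt.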
There are default settings for character sets and collations at four levels:
1. The server character set and collation are determined:
* According to the option settings in effect when the server starts.
* According to the values set at runtime.
2. Each database has a database character set and a database collation.
3. Every table has a table character set and table collation.
4. Every "character" column (that is, a column of type CHAR, VARCHAR, or TEXT) has a column character set and a column collation.
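One way to read the four levels is as a chain of fallbacks: a column uses its own character set if one was given, otherwise the table's, then the database's, then the server's. The Python sketch below models only that lookup. The function effective_charset is hypothetical, and the model is simplified: in MySQL each default is copied down when the object is created rather than looked up later.

```python
from typing import Optional

def effective_charset(server: str,
                      database: Optional[str] = None,
                      table: Optional[str] = None,
                      column: Optional[str] = None) -> str:
    """Return the most specific character set that was explicitly given."""
    for candidate in (column, table, database):
        if candidate is not None:
            return candidate
    return server  # nothing more specific was set, so the server default applies

# A column with no explicit setting in a utf8 database on a latin1 server.
inherited = effective_charset("latin1", database="utf8")
# The same column with its own explicit character set.
overridden = effective_charset("latin1", database="utf8", column="ucs2")
```

The same fallback idea applies to collations.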
Some character set and collation variables play a role in the connection between a client and the server. On connection, the client indicates to the server the name of the character set that it wants to use. The server sets the character_set_client, character_set_results, and character_set_connection variables to that character set. When that character set differs from the one in which the data is stored, the server converts between them, and the conversion may be lossy if some characters do not exist in both character sets.
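A lossy conversion is easy to reproduce outside the database. The Python sketch below pushes text through a single-byte encoding and back, replacing anything that does not fit. The helper convert is hypothetical, and Python's "latin-1" codec (ISO 8859-1) is used here only as an approximation of MySQL's latin1.

```python
def convert(text: str, target_encoding: str) -> str:
    """Round-trip text through target_encoding, replacing unrepresentable characters."""
    return text.encode(target_encoding, errors="replace").decode(target_encoding)

original = "Köln, Москва"
converted = convert(original, "latin-1")

# Characters present in both character sets survive; the rest are lost.
survived = convert("Köln", "latin-1") == "Köln"
lost = convert("Москва", "latin-1")
```

The German word survives because all of its characters exist in the target set, while the Cyrillic word comes back as placeholder characters and the original text cannot be recovered.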
There are many different character sets because of the different encodings required by the various languages. Different languages also require different numbers of bytes to represent a character. All characters in the latin1 character set can be represented by a single byte. Other languages may need more than one byte per character.
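The difference in bytes per character can be checked with Python's own codecs. This is a small sketch, assuming Python's "latin-1" and "shift_jis" codecs as stand-ins for a single-byte and a multi-byte character set; bytes_per_char is a name chosen for the example.

```python
from typing import List

def bytes_per_char(text: str, encoding: str) -> List[int]:
    """Return the encoded size in bytes of each character in text."""
    return [len(ch.encode(encoding)) for ch in text]

western = bytes_per_char("café", "latin-1")      # every character fits in one byte
japanese = bytes_per_char("日本語", "shift_jis")   # each kanji needs two bytes
```

This matters for design because the storage a column needs depends on the character set as well as on the number of characters.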
Unicode provides a single character-encoding system wherein character sets from all languages can be represented in a consistent manner. Within MySQL there are two Unicode sets:
1. ucs2, which corresponds to the Unicode UCS-2 encoding. Each character is represented by two bytes.
2. utf8, which has a variable-length format and represents each character with 1 to 3 bytes.
You can store text in about 650 languages using these character sets.
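The byte counts for the two Unicode sets can be illustrated in Python. This is a sketch under stated assumptions: Python has no codec named UCS-2, so "utf-16-be" is used for characters in the Basic Multilingual Plane, where the two encodings produce the same bytes. The helper names are hypothetical.

```python
def utf8_length(ch: str) -> int:
    """Bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))

def ucs2_length(ch: str) -> int:
    """Bytes needed in UCS-2; only defined for Basic Multilingual Plane characters."""
    if ord(ch) > 0xFFFF:
        raise ValueError("character is outside the range UCS-2 can represent")
    return len(ch.encode("utf-16-be"))

def fits_in_three_bytes(text: str) -> bool:
    """True if every character needs at most 3 bytes in UTF-8."""
    return all(utf8_length(ch) <= 3 for ch in text)

# (utf8 bytes, ucs2 bytes) for an ASCII letter, an accented letter, and a symbol.
lengths = {ch: (utf8_length(ch), ucs2_length(ch)) for ch in "AÖ€"}
```

Plain ASCII text takes half the space in the variable-length format, while text made up mostly of three-byte characters is more compact in the fixed two-byte format.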
www.idig.za.net
