Building JSON from MySQL

I used to loop through my MySQL queries in my server-side language of choice to build JSON. But there is a far better way that will save you some coding, keep things simple and might even save some valuable server time. If you’re running MySQL 4.1 or later you can use the nifty GROUP_CONCAT() function together with the normal CONCAT() function to build all your JSON straight from your SQL query.

username   email
mike       mike@mikesplace.com
jane       jane@bigcompany.com
stan       stan@stanford.com

Our SQL table.

SELECT 
     CONCAT("[",
          GROUP_CONCAT(
               CONCAT("{username:'",username,"'"),
               CONCAT(",email:'",email,"'}")
          )
     ,"]") 
AS json FROM users;

A MySQL query that returns JSON.

[
     {username:'mike',email:'mike@mikesplace.com'},
     {username:'jane',email:'jane@bigcompany.com'},
     {username:'stan',email:'stan@stanford.com'}
]

The returned JSON structure. (Strictly speaking the keys are unquoted and the values single-quoted, which makes this a JavaScript object literal rather than valid JSON – add the extra double quotes in the CONCAT() calls if your consumer is strict.)

You could build XML as well

Of course you could use GROUP_CONCAT() to build XML as well (or HTML tables, for that matter). If you want to read up on how to change the default comma separator, order things inside the GROUP_CONCAT() and so on, you’ll find all the information you need in the MySQL manual, section 12.10.1.
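
For instance, here’s a minimal sketch of the separator and ordering options at work, run from Python with MySQLdb against the users table above (the connection details are placeholders):

import MySQLdb

# Placeholder connection details – substitute your own
conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="test")
cur = conn.cursor()

# ORDER BY sorts the rows inside the aggregate; SEPARATOR replaces the
# default comma between group members
cur.execute("""
    SELECT CONCAT('[',
         GROUP_CONCAT(
              CONCAT("{username:'", username, "',email:'", email, "'}")
              ORDER BY username SEPARATOR ', '
         ),
    ']') AS json FROM users
""")
print cur.fetchone()[0]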


Install the RHEL EPEL repository on CentOS 5.x or 6.x

The following article will show you how to set up a CentOS 5.x-based or CentOS 6.x-based system with the Fedora EPEL repos and the third-party remi package repos. These package repositories are not officially supported by CentOS, but they provide much more current versions of popular applications like PHP or MySQL.

Install the extra repositories

The first step requires downloading some RPM files that contain the additional YUM repository definitions.

CentOS 5.x

wget http://dl.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
wget http://rpms.famillecollet.com/enterprise/remi-release-5.rpm
sudo rpm -Uvh remi-release-5*.rpm epel-release-5*.rpm

CentOS 6.x

wget http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-7.noarch.rpm
wget http://rpms.famillecollet.com/enterprise/remi-release-6.rpm
sudo rpm -Uvh remi-release-6*.rpm epel-release-6*.rpm

Once installed you should see some additional repo definitions under the /etc/yum.repos.d directory.

$ ls -1 /etc/yum.repos.d/epel* /etc/yum.repos.d/remi.repo
/etc/yum.repos.d/epel.repo
/etc/yum.repos.d/epel-testing.repo
/etc/yum.repos.d/remi.repo

Enable the remi repository at install time

The remi repository provides a variety of up-to-date packages that are useful for, or required by, many popular web-based services, so it is generally not a bad idea to pull from it. The repository is disabled by default; to use it for a particular install, pass the --enablerepo flag to yum (here installing PHP as an example):

yum --enablerepo=remi install php

You will now have a larger array of yum repositories to install from.

Set Up Metric Collection Using Graphite and Statsd on CentOS 6.3 x86_64

Visualizing metrics is as important as collecting them. We are using the awesome open-source tool Graphite to help us visualize our application and system metrics. We also build on the fantastic work done by Etsy and use statsd, their simple node.js daemon that collects and aggregates metrics via UDP and flushes them to Graphite.

Notes:

  • The install process assumes that you are running statsd and carbon/graphite-web on the same server
  • The installation configuration for graphite currently has hardcoded paths to /opt/graphite. It is possible to change this, but it would require modifying the graphite installation scripts.

The following gist has all the steps to install and configure graphite and statsd.

cd /usr/local/src
git clone git://github.com/joyent/node.git
cd node
./configure --prefix=/opt/node   # prefix must match the symlink targets below
make
make install
ln -s /opt/node/bin/node /usr/local/bin
ln -s /opt/node/bin/npm /usr/local/bin

cd /opt
git clone git://github.com/etsy/statsd.git

yum install python-devel memcached python-setuptools mysql-devel 
yum install cairo-devel
yum install pycairo-devel
yum install bitmap-fonts

pip install django==1.3
pip install django-tagging
pip install python-memcached
pip install carbon
pip install whisper
pip install graphite-web

pip install twisted

That takes care of all the package installs. Next comes configuration and running the services:
#
# Configure carbon
#
cd /opt/graphite/conf/
cp carbon.conf.example carbon.conf
# Create storage schema and copy it over
# Using the sample as provided in the statsd README
cat >> /tmp/storage-schemas.conf << EOF
# Schema definitions for Whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds.
#
# [name]
# pattern = regex
# retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, …
[stats]
priority = 110
pattern = ^stats\..*
retentions = 10s:6h,1m:7d,10m:1y
EOF
cp /tmp/storage-schemas.conf storage-schemas.conf
# Make sure log dir exists for webapp
sudo mkdir -p /opt/graphite/storage/log/webapp
# Copy over the local settings file and initialize database
cd /opt/graphite/webapp/graphite/
cp local_settings.py.example local_settings.py
python manage.py syncdb
# Follow the prompts, creating a superuser is optional
# StatsD configuration
cat >> /tmp/localConfig.js << EOF
{
graphitePort: 2003
, graphiteHost: "127.0.0.1"
, port: 8125
}
EOF
sudo cp /tmp/localConfig.js /opt/statsd/localConfig.js

To test your installation, do the following:

Run carbon-cache:

cd /opt/graphite && sudo ./bin/carbon-cache.py --debug start

Run graphite-web:

cd /opt/graphite && sudo ./bin/run-graphite-devel-server.py .

Run statsd:

cd /opt/statsd && node ./stats.js ./localConfig.js

Run the example client (any one will suffice, python client shown here):

cd /opt/statsd/examples && python ./python_example.py

Point your web browser at http://127.0.0.1:8080/ to bring up the graphite webapp and start creating graphs!
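
If the example clients aren’t handy, the statsd wire protocol is simple enough to poke by hand. A minimal Python sketch (the bucket name is made up; host and port match the localConfig.js above):

import socket

# statsd accepts plain-text UDP datagrams; "<bucket>:<value>|c" is a counter
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto("test.pings:1|c", ("127.0.0.1", 8125))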

Python, Unicode & MySQL

I spent a good two days trying to resolve a problem with a Python-based application I’ve been working on.
The app was running as a daemon process, sending items to the database.
The database, table and data are all UTF8 encoded. All was fine until an email arrived that actually made use of UTF-8 and contained characters beyond the usual English ones – the email that was sent out came up as either garbage or a series of question marks. From Python – no luck. WTF, I said (well, I didn’t actually speak the acronym out loud, but I did utter the expanded version a few times, making sure I repeated the F word for good measure).

My first route was to tackle the email encoding issues, but after a day of sweating it and getting no results, it dawned on me that perhaps I was bashing my head against the wrong wall. A quick interactive shell session confirmed that – the data coming back from the queries was returning as garbage whenever an extended character was supposed to show. Absolutely nothing would make it display properly. The only half-assed thing that worked was encoding the utf8 data as latin1 (and later setting the encoding to utf8 in the email MIME part). That just didn’t feel right. Any character that could not be represented in latin-1 would have been ignored, replaced, or would have thrown a Unicode error.

Many Googles later, a long session at #python on undernet (thanks tommi) and a pinch of perseverance, I had a working solution. The secret ingredient is to add charset="utf8" to your connection parameters, along with use_unicode=True. The former tells MySQL that it should work in the utf8 character set. The latter tells Python (MySQLdb, really) to hand back Unicode objects. A sketch (the pooling setup mirrors my app; the query, table and column names are just placeholders):

import MySQLdb
from DBUtils.PooledDB import PooledDB

# charset and use_unicode are the two magic connection parameters
pool = PooledDB(creator=MySQLdb, mincached=5, db=DB_NAME, host=DB_HOST,
                user=DB_USER, passwd=DB_PASS,
                charset="utf8", use_unicode=True)

conn = pool.connection()
cur = conn.cursor()
cur.execute("SELECT subject FROM emails LIMIT 1")  # placeholder query
row = cur.fetchone()
print row[0]

Python, Unicode and UnicodeDecodeError

For historical reasons, the most common encoding (in Western Europe and the US, anyway) is ASCII. This is also Python’s default encoding. 

Let’s think about ASCII for a moment. It’s an encoding that uses 7 bits, which limits it to 128 possible values. That’s enough to represent all the characters that Western Europe and the US use (letters in both cases, the numbers, punctuation, a few characters with diacritics). Therefore, Unicode strings that only include code points that are in these 128 ASCII characters can be encoded as ASCII. Conversely, any ASCII encoded string can be decoded to Unicode.

It’s worth reiterating that terminology, as you come across it a lot: the transformation from Unicode to an encoding like ASCII is called ‘encoding’. The transformation from ASCII back to Unicode is called ‘decoding’.

    Unicode  ---- encode ----> ASCII
    ASCII    ---- decode ----> Unicode

Non-ASCII encodings

Most people don’t live in the US or Western Europe, and therefore have a requirement to store more characters than can be represented with ASCII. What those folk need to represent *is* part of the Unicode set (Unicode is massive!) – so a different encoding is required. Common encodings have familiar names: UTF-8 and UTF-16. UTF-8, for example, uses a single byte for encoding all the ASCII values, then variable numbers of bytes to encode further characters. (The ins and outs of these encodings are beyond the scope of this article – check out their respective Wikipedia entries for the gory details.)

The fact that the first 128 values of UTF-8 are the same as ASCII is important, since it means that the encoding is backwards-compatible with ASCII. However, it can mask problems in software. We’ll come to this shortly.
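
A quick interpreter session makes the point (the pound sign lives outside the ASCII range):

>>> u'b'.encode('utf-8')       # within the ASCII range: one byte
'b'
>>> u'\u00a3'.encode('utf-8')  # the pound sign: two bytes
'\xc2\xa3'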

Some terminology

Unicode-related terminology can get confusing. Here’s a quick glossary:

  • To encode
    • Encoding (the verb) means to take a Unicode string and produce a byte string
  • To decode
    • Decoding (the verb) means to take a byte string and produce a Unicode string
  • An encoding
    • An encoding (the noun) is a mapping that describes how to represent a Unicode character as a byte or series of bytes. Encodings are named (like ‘ascii’, or ‘utf-8’) and are used both when encoding (verb!) Unicode strings and decoding byte strings.

In other words, when you encode or decode, you need to specify the encoding that you’re using. This will become clearer shortly.
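
A round trip through UTF-8 shows both directions (using é, code point 0x00E9):

>>> u'caf\xe9'.encode('utf-8')     # encode: Unicode -> bytes
'caf\xc3\xa9'
>>> 'caf\xc3\xa9'.decode('utf-8')  # decode: bytes -> Unicode
u'caf\xe9'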

Python, bytes and strings

You’ve probably noticed that there seems to be a couple of ways of writing down strings in Python. One looks like this:

  'this is a string'

Another looks like this:

  u'this is a string'

There’s a good chance that you also know that the second one of those is a Unicode string. But what’s the first one? And what does it actually mean to ‘be a Unicode string’?

The first one is simply a sequence of bytes. This byte sequence is, by convention, an ASCII representation (ie. encoding) of a string. The whole Python standard library, and most third-party modules, happily deal with strings natively in this encoding. As long as you live in the US or Western Europe, that’s probably fine for you.

The second one is a representation of a Unicode string. This can therefore contain any of the Unicode code points. It’s possible that whatever you’re using to edit the Python code (or just view it) might not be able to display the entire Unicode character set – for instance, a terminal usually has an encoding that it assumes data it’s trying to display is in. There’s a special notation, therefore, for representing arbitrary Unicode code points within a Python Unicode string: the \u and \U escapes. These will be followed by four or eight hex digits; there’s some subtlety here (see the Python string reference for further information) but you can simply think of the number after the \u (or \U) representing the Unicode code point of the character. So, for example, the following Python string:

  u'\u0062'

represents LATIN SMALL LETTER B, or more simply:

  u'b'

To summarise then: the Unicode character set encompasses all characters that we may wish to represent. Individual encodings (ASCII, UTF-8, UTF-16, etc.) are representations of all or some of that full Unicode character set.

Encoding and Decoding

Byte strings and Unicode strings provide methods to perform the encoding and decoding for you. Remembering that you *encode* from Unicode to an encoding, you might try the following:

>>> u'\u0064'.encode('ascii')
'd'

As you’d expect, the Unicode string has an ‘encode’ method. You tell Python which encoding you want (‘ascii’ in this case, there are lots more supported by Python – check the docs) using the first parameter to the encode() call.

Conversely, byte strings have a decode() method:

>>> 'b'.decode('ascii')
u'b'

Here, we’re telling Python to take the byte string ‘b’, decode it based on the ASCII decoder and return a Unicode string. 

Note that in both these previous cases, we didn’t really need to specify ‘ascii’ manually, since Python uses that as a default.

UnicodeEncodeError

So, we’ve established that there are encodings which can represent Unicode, or more usually, a certain subset of the Unicode character set. We’ve already talked about how ASCII can only represent 128 characters. So, what happens if you have a Unicode string that contains code points that are outside that 128 characters? Let’s try something all too familiar to UK users: the £ sign. The Unicode code point for this character is 0x00A3:

>>> u'\u00A3'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3'
in position 0: ordinal not in range(128)

Boom. This is Python telling you that it encountered a character in the Unicode string which it can’t represent in the requested encoding. There’s a fair amount of information in the error: it’s giving you the character that it’s having problems with, what position it was at in the string, and (in the case of ASCII) it’s telling you that the number it was expecting was in the range 0 – 127.

How do you fix a UnicodeEncodeError? Well, you’ve got a couple of options:

  • Pick an encoding that does have a representation for the problematic character
  • Use one of the error handling arguments to encode()

The first option is obviously ideal, although its practicality depends on what you’re doing with the encoded data. If you’re passing it to another system that (for example) requires its text files in ASCII format, you’re stuck. In that case, you’re left with the second option. You can pass ‘ignore’, ‘replace’, ‘xmlcharrefreplace’ or ‘backslashreplace’ to the encode call:

>>> u'\u0083'.encode('ascii', 'ignore')
''
>>> u'\u0083'.encode('ascii', 'replace')
'?'
>>> u'\u0083'.encode('ascii','xmlcharrefreplace')
'&#131;'
>>> u'\u0083'.encode('ascii','backslashreplace')
'\\x83'

If you choose one of those options, you’ll have to let the eventual consumer of your encoded text know how to handle these.

UnicodeDecodeError

This one is probably more familiar to most developers. A UnicodeDecodeError occurs when you ask Python to decode a byte string using a specified encoding, but Python encounters a byte sequence in that string that isn’t in the encoding that you specified (phew!). This one probably benefits from an example.

Consider once more the ASCII encoding. Being a 7-bit representation, ASCII only has 128 characters, represented by the numbers 0 – 127. So let’s imagine the ASCII-encoded string below:

'Hi!'

In terms of ASCII numbers, that is:

72 105 33

Or in actual Python:

>>> s = chr(72) + chr(105) + chr(33)
>>> s
'Hi!'
>>> s.decode('ascii')
u'Hi!'

That’s all great. But what happens if we add a byte that’s not in the ASCII range?

>>> s = s + chr(128)
>>> s.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80
in position 3: ordinal not in range(128)

Boom. Python is saying that it encountered a character 0x80 (128 in decimal, the one we added) which was at position 3 (counting from zero) in the source byte string which was not in the range 0 – 127.

This is normally caused by using the incorrect encoding to try to decode a byte string to Unicode. So, for example, if you were given a UTF-8 byte string, and tried to decode it as ASCII, then you might well see a UnicodeDecodeError.

But why only might?

Well, remember what I mentioned before – UTF-8 shares its first 128 characters with ASCII. That means that you can take a UTF-8 byte sequence, and decode it with the ASCII decoder, and *as long as there are no characters outside the ASCII range* it will work. *Only* when that byte string starts featuring characters which don’t exist within the ASCII encoding do errors start being thrown.
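
Here it is in the interpreter – pure-ASCII bytes decode happily with the wrong codec, and the pound sign blows up:

>>> 'all ascii here'.decode('ascii')   # these bytes are valid UTF-8 too
u'all ascii here'
>>> u'\u00a3'.encode('utf-8').decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2
in position 0: ordinal not in range(128)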

ASCII – the default codec

Lots of Python programmers (well, US and Western European ones) can get quite a way into their Python careers converting byte strings to unicode like this:

>>> print unicode('hi!')
u'hi!'

What’s going on here? Well, Python uses the ascii codec by default. So, the above is equivalent to:

>>> 'hi!'.decode('ascii')
u'hi!'

And, because most US/European test data is composed of this byte string:

  'test'

… nobody notices the problem until the Japanese office complains the intranet is broken.

Unicode Coercion

If you try to interpolate a byte string with a Unicode string, or vice-versa, Python will try and convert the byte string to Unicode using the default (ie. ascii) codec. So:

>>> u'Hi' + ' there'
u'Hi there'
>>> u'Hi %s' % 'there'
u'Hi there'
>>> 'Hi %s' % u'there'
u'Hi there'

These all work fine, because all the strings that we’re working with can be represented with ASCII. Look what happens when we try a character which can’t be represented with ASCII though:

>>> u'Hi ' + chr(128)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80
in position 0: ordinal not in range(128)

Python sees we’re trying to combine a Unicode string with a byte string, so it tries to decode the byte string to Unicode using the ASCII codec. Since character 128 (the Euro symbol in Windows-1252, as it happens) can’t be represented in ASCII, Python throws a UnicodeDecodeError.

In my experience, Unicode coercion is often where UnicodeDecodeErrors manifest themselves. The programmer has a Unicode string (probably a template) into which they’re trying to put some data from a database. Relational databases tend to supply byte strings. Usually the encoding is a property on the database connection. Often, however, developers simply assume it’s ASCII (or don’t do anything special at all, which in Python amounts to the same thing). They try to stick the data from the database (perhaps in UTF-8 or ISO-8859-1) into a Unicode string using the %s format specifier, Python tries to decode the byte string using the ascii codec, and the whole thing falls flat on its face.
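
The fix is to do the decode yourself, explicitly, with the encoding the database actually used (a sketch, assuming the database handed back UTF-8):

>>> from_db = 'caf\xc3\xa9'                    # UTF-8 bytes from the database
>>> u'Menu item: %s' % from_db.decode('utf-8')
u'Menu item: caf\xe9'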

Why do Python byte strings have an encode() method?

The sharp-eyed amongst you will have noticed that byte strings have an encode() method as well as a decode() method. What does this do? Quite simply, it does a decode-then-encode: the byte string is decoded to Unicode using the default (ascii) encoding, and is then encoded using the target encoding specified in the call to encode(). As you’d expect, fun and games ensue if the original byte string isn’t actually encoded in ASCII at all.
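
You can watch it happen – note that, counter-intuitively, the call to encode() raises a *decode* error:

>>> 'caf\xc3\xa9'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3
in position 3: ordinal not in range(128)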

Avoiding Unicode Errors

So – this is really what you care about, right? How do you avoid these Unicode problems? Well, there are three simple rules:

  • Within your application, always use Unicode
  • When you’re reading text in to your application, decode it as soon as possible with the correct encoding
  • When you’re outputting text from your application, encode at that point and do it explicitly

What does this mean in practice? Well, it means:

  • Whenever you’re writing string literals in code, always use the u'' prefix
  • Whenever you read any text in, call .decode('encoding') on the byte string to obtain Unicode
  • Whenever you’re writing text out, pick an appropriate encoding to handle whatever Unicode you’re outputting – remember that ASCII can only represent a very limited subset (see the sketch below)
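
Put together, a minimal sketch of all three rules (the file names and the choice of UTF-8 are illustrative – use whatever encoding your data actually arrives and leaves in):

# Decode on the way in
text = open('input.txt', 'rb').read().decode('utf-8')

# Unicode everywhere inside the application
greeting = u'Hello, %s!' % text

# Encode explicitly on the way out
out = open('output.txt', 'wb')
out.write(greeting.encode('utf-8'))
out.close()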

There are more places than you probably realise that text can get into your application. Here are some:

  • An incoming request from a web browser
  • Some text read in from a data file on disk
  • A template file read in from disk
  • Some user’s input from a form
  • Some data from a database
  • Data returned from a web services call

Frameworks help a lot here. Many frameworks handle the common encoding and decoding cases (usually the template encoding, and data encoding from a database) for you, and just pass you back Unicode strings. Watch out for web request variables – many of those may be plain byte strings. Also watch out for web service responses; you might need to inspect the response headers to find out the encoding. And even then, be careful; I’ve come across situations with in-house apps where the declared encodings were simply wrong, leading to unexpected UnicodeDecodeErrors.

Figuring out which encoding to use

When you’re faced with a byte string, how do you know which decoding to use? The answer is, unfortunately, simple: you don’t. Some environments (such as the Web) may help you – HTTP requests and responses contain headers which specify the encoding used within them. You can inspect those, and if they’re wrong – well, at least you’ve got someone else to blame.

If you’re lucky, you know the byte string is encoding some XML. XML gets a lot of flak, but one of the things it does right is to specify explicitly a default encoding that’s actually useful (UTF-8) and provide a mechanism to declare a different encoding. So with XML, you can scan the first few bytes of the file, decode using UTF-8, and look for the magic encoding declaration. If there isn’t one, then you can safely decode the rest of the file using UTF-8. If there is one, then switch encoding. Of course, your XML library of choice will do all this for you, and should give you Unicode text back once you’ve read your XML in.
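
If you ever need to do the sniffing by hand, it boils down to something like this rough sketch (the function is mine and the regex deliberately simplistic – a real XML parser handles the edge cases):

import re

def sniff_xml_encoding(raw_bytes):
    # Read the prologue as UTF-8 and look for an explicit declaration;
    # fall back to the XML default (UTF-8) when there isn't one
    head = raw_bytes[:100].decode('utf-8', 'replace')
    match = re.search(r'encoding=["\']([\w.-]+)["\']', head)
    return match.group(1) if match else 'utf-8'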

If you’re unlucky, then you’ve got two more options. First off, you can talk to the people who run your source (or destination) system – find out what encodings they’re using or accepting, and use those.

The final, last resort option is to simply have a range of common encodings to try. A list I often use is ASCII, ISO-8859-1, UTF-8, UTF-16. Keep trying to decode with each of those in turn until one works. Which encodings you pick of course depends on what kind of files you’re expecting to see. You may also run into problems of course if you have a byte string in encoding X which also happens to be valid when decoded using encoding Y – in this case, you’ll just get garbage data. This is the cause of many of the ‘funny character’ bugs you see in web applications: byte strings being decoded using an encoding which happened to work, but was in fact not the original encoding used to create the byte string.
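
As a sketch (note that ISO-8859-1 will decode *any* byte string without complaint, so its position in the list matters, and a ‘successful’ decode may still be garbage):

def guess_decode(byte_string,
                 encodings=('ascii', 'utf-8', 'iso-8859-1', 'utf-16')):
    # Try each candidate in turn; return the first decode that succeeds,
    # along with the name of the encoding that worked
    for name in encodings:
        try:
            return byte_string.decode(name), name
        except UnicodeError:
            continue
    raise ValueError('none of the candidate encodings matched')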

Python 3

I’m not going to talk too much about Python 3, since I haven’t actually used it yet. 

But – you rarely hear .NET or Java programmers complaining about Unicode errors. This is simply because both .NET and Java define a string to *be* Unicode in the first place. Anything involving the String class (in either runtime) is Unicode anyway; the developer sees encoding problems much less frequently as it’s much less common for unexpected byte data to creep into applications. This doesn’t mean the problems don’t exist, of course: at the end of the day, text is still being encoded to and from byte strings; it’s just done explicitly. (The fact that the default encoding on MS Windows, the OS on which many of these systems run, is UTF-16 helps here too – many more characters can be encoded in UTF-16 than ASCII).

My understanding is that Python 3 takes this general approach. Python 2’s byte-oriented ‘str’ type is gone. In its place is a string type that is Unicode through and through (equivalent to Java and .NET’s String class), plus a separate ‘bytes’ type. String operations are done on the Unicode strings.

Coding in a Unicode world

Unicode is here to stay. The days of writing software that only needs to work in American universities, where the only language and script used was US English in Latin text, are long gone. There’s no magic to Unicode and the various encodings, and once you understand what’s going on, there’s no reason to have that sick feeling in the pit of your stomach the next time you see a UnicodeDecodeError. Just remember these rules:

  • Decode on the way in
  • Unicode everywhere in your application
  • Encode on the way out

Building a High Performance Web Server with nginx, LuaJIT and php-fpm on CentOS

These are my logs for building a high performance nginx/LuaJIT-based server. I have added php-fpm as a component as well. Please note these are an actual dump of commands from my shell history; use them at your own discretion.

cd /usr/local/src
yum groupinstall "Development Tools"
wget http://nginx.org/download/nginx-1.2.4.tar.gz
yum -y install zlib-devel openssl-devel cpio expat-devel gettext-devel

# Build LuaJIT from source
git clone http://luajit.org/git/luajit-2.0.git
cd luajit-2.0/
git pull
make
make install
ln -sf luajit-2.0.0-beta10 /usr/local/bin/luajit

# Fetch the nginx devel kit and the Lua module, unpack nginx
cd /usr/local/src
git clone https://github.com/simpl/ngx_devel_kit.git
tar -zxvf nginx-1.2.4.tar.gz
git clone https://github.com/chaoslawful/lua-nginx-module.git

# Point the nginx build at the LuaJIT headers and libraries
export LUA_LIB=/usr/local/lib/
export LUA_INC=/usr/local/include/luajit-2.0/
ln -s /usr/local/lib/libluajit-5.1.so.2.0.0 /usr/local/lib/liblua.so

# Build nginx with both modules compiled in
cd /usr/local/src/nginx-1.2.4
yum install openssh pcre pcre-devel
./configure --prefix=/opt/nginx --with-http_ssl_module --add-module=/usr/local/src/ngx_devel_kit/ --add-module=/usr/local/src/lua-nginx-module/
make -j2
make install
export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH

# Smoke-test the binary
cd /opt/nginx/sbin/
./nginx
ps aux | grep nginx

# Create an init script (assumed to live in /etc/init.d; its contents
# were not part of this history) and register the service
cd /etc/init.d/
vi nginx
chmod +x nginx
chkconfig nginx on
chkconfig --list nginx
service nginx start
ps aux | grep nginx
service nginx stop