Friday, September 26, 2014

Big O playing catchup.

I see that a new release of MySQL has been made, and they are touting the fact that they are allowing the omission of unaggregated items in a SELECT list from a GROUP BY clause, if they are functionally dependent on the items in the GROUP BY clause. This would happen, for example, where the items in the GROUP BY list form a primary key. It's a nice feature.

It's also a feature that PostgreSQL has had for three years.

Importing JSON data

Say you have a file that consists of one JSON document per line. How can you import it into a table easily? This is a problem I was called on to help a colleague with yesterday. Using COPY is the obvious answer, but this turns out not to be quite so simple to do.

In text mode, COPY will be simply defeated by the presence of a backslash in the JSON. So, for example, any field that contains an embedded double quote mark, or an embedded newline, or anything else that needs escaping according to the JSON spec, will cause failure. And in text mode you have very little control over how it works - you can't, for example, specify a different ESCAPE character. So text mode simply won't work.

CSV mode is more flexible, but poses different problems. Here, instead of backslash causing a problem, QUOTE characters can cause a problem. First, JSON itself uses the default QUOTE character (double quote) to quote all string values. But if we change use an alternative like single quote, then the presence of any single quote in the JSON leads us into difficulties. Second, JSON also uses the default DELIMITER (comma) extensively. So, clearly we need to use something else for the QUOTE and DELIMITER options. (By default, in CSV mode, the ESCAPE character is the same as the QUOTE character, so we don't need to worry about it separately.)

What we in fact want is  to specify QUOTE and DELIMITER characters that can't appear at all in the JSON. Then the whole line will be seen as a single unquoted datum, which is exactly what we want. There is a small set of single-byte characters that happen to be illegal in JSON, so we can be sure that choosing them for these options should do the right thing with any legal JSON. These are the control characters. So the solution we came up with looks like this:
copy the_table(jsonfield) 
from '/path/to/jsondata' 
csv quote e'\x01' delimiter e'\x02';
Of course, if the JSON has embedded newlines as punctuation, this wont work. So it's important that you configure whatever is producing the JSON not to insert newlines anywhere but at the end of each JSON document.

Now this solution is a bit of a hack. I wonder if there's a case for a COPY mode that simply treats each line as a single datum. I also wonder if we need some more specialized tools for importing JSON, possibly one or more Foreign Data Wrappers. Such things could handle, say, embedded newline punctuation.

Note too that files produced by PostgreSQL's COPY ... TO command will be properly quoted and escaped and won't need to be handled like this to read them back. Of course, if you want them to be readable by other non-CSV processors, then you might need to use similar options to those above to avoid unwanted quoting and escaping.

Monday, September 8, 2014

PLV8 and harmony scoping

The other day I complained on the PostgreSQL hackers list about a couple of aspects of Javascript that make it quite bothersome for large scale programming, namely the old style variable scoping rules and the very limited string literals, which can't stretch across lines (don't mention the awful backslash trick, please) and don't have any provision for interpolation. If you're used as I am to Perl, which has had lexically scoped variables for about 20 years and awsome string literal support for a good deal longer than that, these things are really quite painful.

The good news if that the forthcoming standard, ECMAScript6, also known as "harmony", contains features to deal with both of these issues.

The latest versions of the V8 engine actually support harmony scoping rules, with one restriction, namely that it's only supported in contexts that are in "strict" mode. I believe that this restriction will go away in due course.

Petr Jelinek dropped me a note that other day to tell me how to set V8 flags, and based in that I have developed a patch for PLV8 that allows for harmony scoping. It requires a new GUC setting that is applied during the module's initialization code.

This is available in my clone of the plv8 code, and you can see what it's doing at https://code.google.com/r/amdunstan-harmony-scoping/source/detail?r=8acdcdabcd0c2b9ad99f66a5258920db805efdc3#

I'll do a bit more testing and then talk to the other PLV8 developers about merging it in.

Things are less rosy on the string handling front, I'm afraid. I have no idea when V8 will get the "template strings" feature that will address the string literal deficiencies. As far as I can tell nobody is working on it.