Comments | rushyo: ABNFLP thing

rushyo

ABNFLP thing

Jun 06, 2009 10:29

So on Thursday I had to write an Augmented Backus-Naur Form Lexical Parser...

Comments 3

vret June 6 2009, 13:19:04 UTC

Your colleagues' suggestions are very wrong; that would be incredibly slow if you are reading a lot of data. I use a mix of a complex regular expression and quote counting to read CSV files. The quote counting is needed because one cell can span multiple lines: for each CSV row, I keep reading lines of text until I have a string with an even number of double quotes.
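The quote-counting step might look roughly like this in Python. This is only a sketch of the idea as described, not the commenter's actual code; the function name `read_records` is made up for illustration.

```python
import io

def read_records(lines):
    """Yield logical CSV records from an iterable of physical lines.

    Lines are appended to a buffer until the buffer holds an even
    number of double quotes, i.e. no quoted field is still open.
    """
    buffer = ""
    for line in lines:
        buffer += line
        if buffer.count('"') % 2 == 0:
            yield buffer.rstrip("\r\n")
            buffer = ""
    if buffer:
        # Input ended inside an open quoted field; hand back what we have.
        yield buffer.rstrip("\r\n")

# Hypothetical sample data: the first record spans two physical lines.
sample = io.StringIO('a,"b\nc",d\ne,f\n')
for record in read_records(sample):
    print(repr(record))
```

An escaped quote inside a field is written as two quotes, so it never changes the parity. That is why a plain count is enough here.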

I can't remember where I got it from now, but this is the expression:
"(?:^|,)(\\\"(?:[^\\\"]+|\\\"\\\")*\\\"|[^,]*)"
Each match gives you one field, from which you may also need to remove the outer quotes and then change each doubled double quote ("") to a single double quote (").
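As a rough Python illustration of that expression plus the clean-up step: this assumes the backslashes in the quoted string are only string-literal escaping, so the regex itself contains plain double quotes. `split_fields` is a made-up name, and the record passed in is assumed to be one complete logical record.

```python
import re

# The expression from the comment with the string-literal escaping removed
# (assumption: \\\" in the original stands for a literal double quote).
FIELD_RE = re.compile(r'(?:^|,)("(?:[^"]+|"")*"|[^,]*)')

def split_fields(record):
    """Split one complete CSV record into a list of unquoted field values."""
    fields = []
    for match in FIELD_RE.finditer(record):
        field = match.group(1)
        if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
            # Drop the outer quotes, then collapse "" into ".
            field = field[1:-1].replace('""', '"')
        fields.append(field)
    return fields

print(split_fields('a,"b ""c"", d",,e'))
# prints ['a', 'b "c", d', '', 'e']
```

Because `[^"]` also matches line breaks, a quoted field that spans several lines is still captured as one match, provided the whole record is passed in as a single string.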

rushyo June 6 2009, 23:21:47 UTC

The parser seems to work very quickly thus far and conforms to RFC 4180, so it deals with all common CSV variants.

As regards that RegEx, would it deal with this?

This,is,a,"perfectly
valid bit of ""CSV""",representing,"a,
single,record,",of,nine,fields
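For reference, one way to confirm the nine-field reading of that record is Python's standard `csv` module. It is used here only as a stand-in reference parser; it is not the parser from the post.

```python
import csv
import io

# The record from the comment, spread over three physical lines.
text = (
    'This,is,a,"perfectly\n'
    'valid bit of ""CSV""",representing,"a,\n'
    'single,record,",of,nine,fields\n'
)

# newline="" leaves line endings untouched so the csv module handles them itself.
rows = list(csv.reader(io.StringIO(text, newline="")))

print(len(rows), "record,", len(rows[0]), "fields")
for number, field in enumerate(rows[0], start=1):
    print(number, repr(field))
```

The fourth and sixth fields contain the embedded line breaks, and the fourth has its doubled quotes collapsed to single ones.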

vret June 7 2009, 01:25:26 UTC

Should do, yes. I use it for parsing CSV with enormous XML records embedded in it.