ABNFLP thing

Jun 06, 2009 10:29

So on Thursday I had to write an Augmented Backus-Naur Form Lexical Parser ( Read more... )


Comments 3

vret June 6 2009, 13:19:04 UTC
Your colleagues' suggestions are very wrong; that would be incredibly slow if you are reading a lot of data. I use a mix of a complex regular expression and quote counting to read CSV files. The quote counting is needed because one cell can span multiple lines. For each CSV row, I keep reading lines of text until the accumulated string contains an even number of double quotes.

I can't remember where I got it from now, but this is the expression:
"(?:^|,)(\\\"(?:[^\\\"]+|\\\"\\\")*\\\"|[^,]*)"
Each match gives you one field. If the field is quoted, you also need to strip the outer quotes and then replace each doubled double quote ("") with a single double quote (").
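
For anyone who wants to try the approach, here is a minimal Python sketch of the quote counting plus the expression above. It assumes the backslashes in the quoted expression are only string-literal escaping, so the pattern is written as a raw string without them. The names read_records and split_fields are made up for illustration and are not from any library.

```python
import re

# The expression from the comment, with string-literal escaping removed
# (assumption: the backslashes were only there to escape the quotes).
FIELD_RE = re.compile(r'(?:^|,)("(?:[^"]+|"")*"|[^,]*)')


def read_records(lines):
    """Yield complete CSV records from an iterable of text lines.

    Lines are accumulated until the buffered text contains an even
    number of double quotes, which means no quoted cell is still open.
    """
    pending = []
    for line in lines:
        pending.append(line.rstrip("\r\n"))
        record = "\n".join(pending)
        if record.count('"') % 2 == 0:
            yield record
            pending = []
    if pending:
        # Hypothetical policy: treat an unterminated quoted cell as an error.
        raise ValueError("input ended inside a quoted field")


def split_fields(record):
    """Split one complete record into fields and unquote them."""
    fields = []
    for match in FIELD_RE.finditer(record):
        field = match.group(1)
        if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
            # Strip the outer quotes, then collapse "" into ".
            field = field[1:-1].replace('""', '"')
        fields.append(field)
    return fields
```

Typical use would be something like `for record in read_records(open(path, newline="")): row = split_fields(record)`, where `path` is whatever file you are reading. This sketch has not been tuned for speed or for malformed input.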


rushyo June 6 2009, 23:21:47 UTC
The parser seems to work very quickly thus far and conforms to RFC 4180, so it deals with all common CSV variants.

As regards that RegEx, would it deal with this?

This,is,a,"perfectly
valid bit of ""CSV""",representing,"a,
single,record,",of,nine,fields


vret June 7 2009, 01:25:26 UTC
Should do, yes. I use it for parsing CSV with enormous XML records embedded in it.
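
A quick way to check this is to run the expression over the three-line sample once it has been joined into a single record. The Python below is only a sketch: it assumes the backslashes in the posted expression are string-literal escaping and drops them, and it prints the raw matches without stripping the outer quotes.

```python
import re

# Same expression as posted above, written as a Python raw string.
FIELD_RE = re.compile(r'(?:^|,)("(?:[^"]+|"")*"|[^,]*)')

# The sample record, with its three physical lines joined by newlines.
record = (
    'This,is,a,"perfectly\n'
    'valid bit of ""CSV""",representing,"a,\n'
    'single,record,",of,nine,fields'
)

# Group 1 of each match is one raw field (outer quotes still attached).
raw_fields = [match.group(1) for match in FIELD_RE.finditer(record)]

for index, field in enumerate(raw_fields, start=1):
    print(index, repr(field))
```

The two quoted cells come back as single matches with their embedded newline and commas intact, for nine fields in total.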


