org.apache.solr.internal.csv

Class CSVParser



  • public class CSVParser
    extends Object
    Parses CSV files according to the specified configuration. Because CSV appears in many different dialects, the parser supports many configuration settings by allowing the specification of a CSVStrategy.

    Parsing of a csv-string having tabs as separators, '"' as an optional value encapsulator, and comments starting with '#':

      String[][] data =
          new CSVParser(new StringReader("a\tb\nc\td"), new CSVStrategy('\t', '"', '#')).getAllValues();
     

    Parsing of a csv-string in Excel CSV format:

      String[][] data =
          new CSVParser(new StringReader("a;b\nc;d"), CSVStrategy.EXCEL_STRATEGY).getAllValues();
     

    Internal parser state is completely covered by the strategy and the reader-state.

    See the package documentation for more details.

    • Field Summary

      Fields
      Modifier and Type       Field         Description
      protected static int    TT_EOF        Token (which can have content) when the end of the file is reached.
      protected static int    TT_EORECORD   Token with content when the end of a line is reached.
      protected static int    TT_INVALID    Token has no valid content.
      protected static int    TT_TOKEN      Token with content, at the beginning or in the middle of a line.
    • Constructor Detail

      • CSVParser

        public CSVParser(Reader input)
        CSV parser using the default CSVStrategy.
        Parameters:
        input - a Reader containing "csv-formatted" input
      • CSVParser

        public CSVParser(Reader input,
                         CSVStrategy strategy)
        Customized CSV parser using the given CSVStrategy
        Parameters:
        input - a Reader containing "csv-formatted" input
        strategy - the CSVStrategy used for CSV parsing
    • Method Detail

      • getAllValues

        public String[][] getAllValues()
                                throws IOException
        Parses the CSV according to the given strategy and returns the content as an array of records, where each record is an array of single values.

        The returned content starts at the current parse-position in the stream.

        Returns:
        matrix of records x values ('null' when end of file)
        Throws:
        IOException - on parse error or input read-failure
      • nextValue

        public String nextValue()
                         throws IOException
        Parses the CSV according to the given strategy and returns the next csv-value as string.
        Returns:
        next value in the input stream ('null' when end of file)
        Throws:
        IOException - on parse error or input read-failure
      • getLine

        public String[] getLine()
                         throws IOException
        Parses from the current point in the stream until the end of the current line.
        Returns:
        array of values until the end of the line ('null' when end of file has been reached)
        Throws:
        IOException - on parse error or input read-failure
      • getLineNumber

        public int getLineNumber()
        Returns the current line number in the input stream. ATTENTION: if your CSV contains multi-line values, the returned number does not correspond to the record number.
        Returns:
        current line number
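
        To illustrate why the line number and the record number can diverge, here is a minimal, library-independent sketch. It is not the parser's implementation: the class LineVersusRecordCount and its count method are hypothetical, and the sketch assumes ',' as separator, '"' as encapsulator, and '\n' as the only line terminator.

```java
final class LineVersusRecordCount {
    /**
     * Returns {physicalLines, records} for the given CSV text.
     * Assumption: '"' encapsulates values and '\n' terminates lines.
     */
    static int[] count(String csv) {
        int lines = 0;
        int records = 0;
        boolean inQuotes = false;
        boolean pending = false; // true when characters were seen since the last record end
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // A doubled quote ("") toggles twice, leaving the state unchanged.
                inQuotes = !inQuotes;
                pending = true;
            } else if (c == '\n') {
                lines++;
                if (!inQuotes) {
                    // Only a newline outside an encapsulated value ends a record.
                    records++;
                    pending = false;
                }
            } else {
                pending = true;
            }
        }
        if (pending) {
            // Last line has no trailing newline.
            lines++;
            records++;
        }
        return new int[] {lines, records};
    }

    public static void main(String[] args) {
        int[] r = count("a,\"x\ny\"\nb,c\n");
        System.out.println("lines=" + r[0] + " records=" + r[1]);
    }
}
```

        In the sample input the quoted value spans two physical lines, so the sketch counts three lines but only two records.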
      • nextToken

        protected org.apache.solr.internal.csv.CSVParser.Token nextToken()
                                                                  throws IOException
        Convenience method for nextToken(null).
        Throws:
        IOException - on stream access error
      • nextToken

        protected org.apache.solr.internal.csv.CSVParser.Token nextToken(org.apache.solr.internal.csv.CSVParser.Token tkn)
                                                                  throws IOException
        Returns the next token. A token corresponds to a term, a record change or an end-of-file indicator.
        Parameters:
        tkn - an existing Token object to reuse. The caller is responsible for initializing the Token.
        Returns:
        the next token found
        Throws:
        IOException - on stream access error
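
        The token types and the reuse contract can be sketched with a deliberately simplified lexer. Everything in this block is an assumption made for illustration: the class TokenSketch, the Token fields type and content, the reset method, and the numeric constant values are hypothetical, and quoting, comments, and escapes are ignored.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class TokenSketch {
    // Constant values are arbitrary choices for this sketch.
    static final int TT_INVALID = -1, TT_TOKEN = 0, TT_EOF = 1, TT_EORECORD = 2;

    /** Hypothetical stand-in for the parser's Token; field names are assumptions. */
    static final class Token {
        int type = TT_INVALID;
        final StringBuilder content = new StringBuilder();

        /** Puts the token back into its initialized state so it can be reused. */
        Token reset() {
            type = TT_INVALID;
            content.setLength(0);
            return this;
        }
    }

    /** Fills the caller-supplied token with the next comma-separated term. */
    static Token nextToken(Reader in, Token tkn) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (c == ',') { tkn.type = TT_TOKEN; return tkn; }      // term inside a line
            if (c == '\n') { tkn.type = TT_EORECORD; return tkn; }  // term that ends a line
            tkn.content.append((char) c);
        }
        tkn.type = TT_EOF; // may still carry the content of the last term
        return tkn;
    }

    /** Lexes the whole input with a single Token instance and renders "type:content" pairs. */
    static String lexAll(String csv) throws IOException {
        Reader in = new StringReader(csv);
        Token tkn = new Token();
        StringBuilder out = new StringBuilder();
        do {
            nextToken(in, tkn.reset()); // the caller initializes the token before each reuse
            out.append(tkn.type).append(':').append(tkn.content).append(' ');
        } while (tkn.type != TT_EOF);
        return out.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(lexAll("a,b\nc"));
    }
}
```

        Reusing one Token avoids allocating a new object per term; the cost is that the caller has to reset it before every call.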
      • unicodeEscapeLexer

        protected int unicodeEscapeLexer(int c)
                                  throws IOException
        Decodes Unicode escapes. Interprets "\\uXXXX" escape sequences, where XXXX is a hexadecimal number.
        Parameters:
        c - the current character, which is discarded because it is the "\\" of "\\uXXXX"
        Returns:
        the decoded character
        Throws:
        IOException - on a malformed Unicode escape sequence or read error
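
        The decoding step can be sketched independently of the parser. This is a minimal illustration, not the method's actual implementation: the class UnicodeEscapeSketch and both method names are hypothetical, and the sketch assumes exactly four hex digits follow the backslash-u prefix, as the XXXX notation suggests.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class UnicodeEscapeSketch {
    /** Reads the four hex digits that follow the escape prefix and returns the decoded character. */
    static int decodeHexDigits(Reader in) throws IOException {
        StringBuilder hex = new StringBuilder(4);
        for (int i = 0; i < 4; i++) {
            int c = in.read();
            if (c == -1 || Character.digit(c, 16) == -1) {
                // Input ended early or a non-hex character was found.
                throw new IOException("malformed unicode escape after " + hex.length() + " hex digit(s)");
            }
            hex.append((char) c);
        }
        return Integer.parseInt(hex.toString(), 16);
    }

    /** Replaces every backslash-u escape in the input with the character it encodes. */
    static String decode(String s) throws IOException {
        Reader in = new StringReader(s);
        StringBuilder out = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c != '\\') {
                out.append((char) c);
                continue;
            }
            int next = in.read();
            if (next == 'u') {
                // The backslash and the 'u' are discarded; only the decoded character is kept.
                out.append((char) decodeHexDigits(in));
            } else {
                // Not a unicode escape: keep the backslash and whatever followed it.
                out.append('\\');
                if (next != -1) {
                    out.append((char) next);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(decode("a\\u0041b"));
    }
}
```

        As in the documented method, a malformed sequence is reported as an IOException rather than being passed through silently.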
      • getStrategy

        public CSVStrategy getStrategy()
        Returns the CSVStrategy used by this parser. The returned strategy should not be modified.
        Returns:
        strategy currently being used