How to handle multiple date formats in large CSV files with tsload?

I have a large CSV file that I'm trying to load through the tsload command line tool. I see there's an option to specify the date format with --date_format. What if my file has multiple date formats in it? For example, my file has a column RECEIVED_DT with dates like "15102011" and a column RECEIVED_PROCESS_DATE with dates like "2011-15-10", so I have a file with both %d%m%Y and %Y-%d-%m. What is the best way of handling this? I'd like to import them both as date types. Is there a recommended ThoughtSpot way of handling this?
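
What I'm imagining is some kind of per-column mapping rather than a single global --date_format, something like this (I'm only guessing at the layout here):

{
  "columns": [
    { "name": "RECEIVED_DT", "date_format": "%d%m%Y" },
    { "name": "RECEIVED_PROCESS_DATE", "date_format": "%Y-%d-%m" }
  ]
}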

  • I have declared "DATE" for all date-related columns, defined the JSON format file as guided by Stuart Herd, and added the option to strip fractional seconds.

    ==================================

    tsload --target_database "my_dwh" --target_table "My_Dim" --skip_second_fraction --format_file params_file.json --null_value "" --source_data_format csv --field_separator ','

    ========================================

    {
      "database": "my_dwh",
      "schema": "falcon_default_schema",
      "table": "My_Dim",
      "columns": [{
        "name": "Expansion_Open_Date",
        "date_format": "%m/%d/%Y"
      }, {
        "name": "open_date",
        "date_format": "%m/%d/%Y"
      }, {
        "name": "Dry_Run_Date",
        "date_format": "%m/%d/%Y"
      }, {
        "name": "Row_Effective_Date",
        "date_format": "%Y-%m-%d"
      },
      {
        "name": "Row_End_Date",
        "date_format": "%Y-%m-%d"
      }]
    }

    ==============================

    Only the last two date columns in the JSON file above have a time component with fractional seconds.

    "0001-01-01 00:00:00.0000000","9999-12-31 23:59:59.9970000"

    =================================

    --

    Loki

  • Still throwing an error: unable to convert the timestamp in the data into the DATE field defined in the DDL.

    =======================

    , line no=1 ;Reason: Conversion failed, error=Can not convert "Open_Date" to date using date format "%m/%d/%Y", data column=18, schema column=Open_Date
    Source has 912 data rows, ignored row count 912
    E0919 16:04:20.604288 39786 data_importer.cpp:1823] Extract failed, all rows ignored

    =============================

    {
      "name": "Open_Date",
      "date_format": "%m/%d/%Y",
      "datetime_format": "%m/%d/%Y %H:%M:%S %p"
    }

     

    --

    Loki

    When importing to a date field, you need to use the date_format option; that is the correct one to go for. However, when you define the format, you have to tell it what the input looks like. In your case you need to define your date_format as "%Y-%m-%d %H:%M:%S" and then use --skip_second_fraction to truncate the fractional-second parts.

     

    As an example I took your two problematic timestamps:

    "0001-01-01 00:00:00.0000000","9999-12-31 23:59:59.9970000"

    and loaded them into a table consisting of two date fields using the following command:

    tsload --target_database test --target_table ts_to_date --source_file timestamps.csv --date_format "%Y-%m-%d %H:%M:%S" --skip_second_fraction
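
    For reference, the target table was just two DATE columns; the DDL below is a reconstruction (only the table name ts_to_date comes from the command above), and timestamps.csv contained just the one quoted row:

    CREATE TABLE "falcon_default_schema"."ts_to_date" (
      "d1" DATE,
      "d2" DATE
    );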

    Hope that helps clarify things. 

  • I have five date fields in my data file; only the last two in the JSON file have fractional seconds.

    ===============================

    {
      "database": "my_dwh",
      "schema": "falcon_default_schema",
      "table": "My_Dim",
      "columns": [
        {
          "name": "Open_Date",
          "date_format": "%m/%d/%Y",
          "datetime_format": "%m/%d/%Y %H:%M:%S %p"
        },
        {
          "name": "Expansion_Open_Date",
          "datetime_format": "%m/%d/%Y %H:%M:%S"
        },
        {
          "name": "Dry_Run_Date",
          "datetime_format": "%m/%d/%Y %H:%M:%S %p"
        },
        {
          "name": "Row_Effective_Date",
          "datetime_format": "%Y-%m-%d %H:%M:%S"
        },
        {
          "name": "Row_End_Date",
          "datetime_format": "%Y-%m-%d %H:%M:%S"
        }
      ]
    }

    =======================================

  • tsload --target_database "my_dwh" --target_table "My_Dim" --skip_second_fraction --format_file params_file.json --null_value "" --source_data_format csv --field_separator ','

  • If these are all dates, you need to specify "date_format", NOT "datetime_format". We don't care what the input looks like; we care about the field we're putting it into. In this case, if I understand correctly, that's a date. So if these are all DATE (not DATETIME) fields in ThoughtSpot, and based on what's in your error row, your JSON should be something like:

    {
      "database": "my_dwh",
      "schema": "falcon_default_schema",
      "table": "My_Dim",
      "columns": 
     [
      {
        "name": "Open_Date",
        "date_format": "%m/%d/%Y %H:%M:%S %p"
      }, 
      {
        "name": "Expansion_Open_Date",
        "date_format": "%m/%d/%Y %H:%M:%S %P"
      }, 
      {
        "name": "Dry_Run_Date",
        "date_format": "%m/%d/%Y %H:%M:%S %p"
      }, 
      {
        "name": "Row_Effective_Date",
        "date_format": "%Y-%m-%d %H:%M:%S"
      },
      {
        "name": "Row_End_Date",
        "date_format": "%Y-%m-%d %H:%M:%S"
      }
     ]
    }

  • All date columns in my table are set to the DATE datatype only.

    I have tried that option too; it fails on the very first date column in the data file:

    , line no=1 ;Reason: Conversion failed, error=Can not convert "Open_Date" to date using date format "%m/%d/%Y %H:%M:%S %p", data column=18, schema column=Open_Date
    ====

    {
      "database": "my_dwh",
      "schema": "falcon_default_schema",
      "table": "My_Dim",
      "columns": 
     [
      {
        "name": "Open_Date",
        "date_format": "%m/%d/%Y %H:%M:%S %p"
      }, 
      {
        "name": "Expansion_Open_Date",
        "date_format": "%m/%d/%Y %H:%M:%S %p"
      }, 
      {
        "name": "Dry_Run_Date",
        "date_format": "%m/%d/%Y %H:%M:%S %p"
      }, 
      {
        "name": "Row_Effective_Date",
        "date_format": "%Y-%m-%d %H:%M:%S"
      },
      {
        "name": "Row_End_Date",
        "date_format": "%Y-%m-%d %H:%M:%S"
      }
     ]
    }

     

    ====

     

    --

    Lokesh

  • The first three columns defined in the JSON file have this format:

    "11/25/1992 12:00:00 AM"
    "8/31/2011 12:00:00 AM",
    "11/25/1992 12:00:00 AM"

    The remaining two columns are timestamps with fractional seconds:

    "0001-01-01 00:00:00.0000000",

    "9999-12-31 23:59:59.9970000",

     

    --

    Loki

  • OK, I set up an input file:

    "11/25/1992 12:00:00 AM","8/31/2011 12:00:00 AM","11/25/1992 12:00:00 AM","0001-01-01 00:00:00.0000000","9999-12-31 23:59:59.9970000"
     

    Defined a table with five date columns (imaginative names):

    CREATE TABLE "falcon_default_schema"."dates" (
      "col1" DATE,
      "col2" DATE,
      "col3" DATE,
      "col4" DATE,
      "col5" DATE
    );

    Created a json file:

    {
      "database": "test",
      "schema": "falcon_default_schema",
      "table": "dates",
      "columns":
     [
      {
        "name": "col1",
        "date_format": "%m/%d/%Y %I:%M:%S %p"
      },
      {
        "name": "col2",
        "date_format": "%m/%d/%Y %I:%M:%S %p"
      },
      {
        "name": "col3",
        "date_format": "%m/%d/%Y %I:%M:%S %p"
      },
      {
        "name": "col4",
        "date_format": "%Y-%m-%d %H:%M:%S"
      },
      {
        "name": "col5",
        "date_format": "%Y-%m-%d %H:%M:%S"
      }
     ]
    }

     

    and tsloaded:

    [admin@ip-172-31-27-26 test]$ tsload --target_database "test" --target_table "dates" --skip_second_fraction --format_file params_file.json --null_value "" --source_data_format csv --field_separator ',' --source_file dates.csv
    Started processing data row
    Source has 1 data rows, ignored row count 0
    Waiting for rows to commit...(please wait)

    Source summary
    --------------
    Data source:                   dates.csv
    Source data format             csv
    Header row?                    no
    Tokenizer Options:             escape_char: "" field_separator: "," enclosing_char: "\"" null_value: "" trailing_field_separator: false
    Date format:                   %Y%m%d
    Date time format:              %Y%m%d %H:%M:%S
    Flexible mode?                 no

    Load summary
    ------------
    Target table:                  dates
    Should empty target?           no
    Status:                        Successful
    Rows total:                    1
    Rows successfully loaded:      1
    Rows failed to load:           0
    % of Rows successfully loaded: 100.00 %
    Load Rate (MB/s):              0.00 MB/s
    Load Rate (Rows/s):            0.88 Rows/s
    Start time (Wallclock):        Wed Sep 19 22:23:39
    End time (Wallclock):          Wed Sep 19 22:23:40
    Total load time = 1.14 seconds = 0.02 minutes = 0.00 hours
    Data size = 134 bytes = 0.13 KB = 0.00 MB

     

    I made a couple of changes:

    Replaced %H with %I, as I believe we are dealing with a 12-hour clock.

    Made all the %p values lower case, as we seem to have an issue with an uppercase P here (even though strptime says it should be OK).

    This all worked. If you try this and still have an issue then let's arrange a web meeting and sort it out once and for all :-)

  • Hi Stuart,

    I revised the input JSON file as recommended:

     

    line no=1 ;Reason: Conversion failed, error=Can not convert "Open_Date" to date using date format "%Y%m%d", data column=18, schema column=Open_Date
     

    The error persists. I think we can have a Webex session if you are available now?

     

    --

    Lokesh

  • That means you've not renamed the columns in the JSON file; remember I set mine to col1, col2, etc. You need to rename them in the JSON to match your column names. I know this is the reason, because "%Y%m%d" is the default format, so tsload has not found the column in your JSON.
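
    As a quick sanity check (just a generic shell one-liner, not a tsload feature), list the column names in your format file and compare them against the table definition, since any name that doesn't match falls back to the default format:

    grep '"name"' params_file.json
    # every name printed must exactly match a column in My_Dim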

     

    It's 23:25 here so I'm afraid I can't do a web meeting now, but give it one more try and we can schedule something tomorrow if needed.

  • This is my input JSON file:

    I made the changes as you stated: replaced %H with %I; the lowercase %p for the AM/PM field was already there.

    ================

    {
      "database": "my_dwh",
      "schema": "falcon_default_schema",
      "table": "My_Dim",
      "columns":
     [
      {
        "name": "Open_Date",
        "date_format": "%m/%d/%Y %I:%M:%S %p"
      },
      {
        "name": "Expansion_Open_Date",
        "date_format": "%m/%d/%Y %I:%M:%S %p"
      },
      {
        "name": "Dry_Run_Date",
        "date_format": "%m/%d/%Y %I:%M:%S %p"
      },
      {
        "name": "Row_Effective_Date",
        "date_format": "%Y-%m-%d %H:%M:%S"
      },
      {
        "name": "Row_End_Date",
        "date_format": "%Y-%m-%d %H:%M:%S"
      }
     ]
    }

     

    =============================

    By the way, I was able to create the dummy table "dates" as you demonstrated, and it worked.

     

    --

    Lokesh

  • ERROR MESSAGE:

    --------------

    line no=1 ;Reason: Conversion failed, error=Can not convert "Open_Date" to date using date format "%m/%d/%Y %I:%M:%S %p", data column=18, schema column=Open_Date
    ----------------

     

    --

    Lokesh

  • Hmm, maybe this is simpler. Does this file have a header row, and are you using --has_header_row?
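
    If it does, the fix should just be adding that flag to your existing command, i.e. something like:

    tsload --target_database "my_dwh" --target_table "My_Dim" --has_header_row --skip_second_fraction --format_file params_file.json --null_value "" --source_data_format csv --field_separator ','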

  • Hi Stuart,

    Wow! That fixed the problem. Thank you very much for your continued support, patience, and perseverance. Appreciated!

     

    --

    Lokesh

  • Glad we got there! Enjoy your new data 😉

  • Now I am working on a complex calendar with 23 Date columns :)
