Community sqldatamart and null codes

Hi,

Does the community sqldatamart support null codes like NI?

I am see an error when building
.\setup_sqldatamart.bat : psql:C:/sqldatamart/dm_build_commands.sql:109: ERROR: invalid input syntax for type numeric: "NI"

Thanks

SK



«1

Comments

  • lindsay.stevenslindsay.stevens Posts: 391 ✭✭
    Yes, it should handle null codes. This is done by getting all the null codes configured for the study by event and crf, and for items in the corresponding item group datasets, replacing those null codes with SQL null.

    The problem with null codes is as you've found, they're not valid for non-string data types, so the only thing to do is remove them. The originals are still in the "clinicaldata" view if they are needed.

    If there is somehow a null code that's not being removed by this process, I could imagine these possible causes:

    - null codes are being configured at the site level instead of study level, or

    - null codes are not configured at all but are being inserted via insertaction rules or xml import (bypassing data type checks), or

    - null codes were at some point configured at the study level, but after some data entry occurred they were turned off.

    Do any of these situations apply here?
  • skwalerskwaler Posts: 47
    Hi,

    The null codes (NI) have all been configured at the event/crf level. There are null used in items with none ST datatypes.

    How do you suggest to best clean them so the script will work?

    SK
  • lindsay.stevenslindsay.stevens Posts: 391 ✭✭
    I started writing up how to work around whatever the problem is and realised that that solution was actually how it should work in the first place. So I've updated the repository with this new code, version 2016.6.

    Now, instead of trying to be clever and only nullifying null code values when the null code configuration is set for the event CRF, the following algorithm is applied to all item values:

    - If the item value is all white space, then output NULL (white space breaks DATE conversions, and generally wastes space),
    - Next, if the item is a string, then output the value as-is (it may or may not be a null code, if one is configured for the event CRF),
    - Next, if the item is not a string but it matches one of the possible null codes, then output NULL,
    - Otherwise, output the value as-is (subject to appropriate data type conversions)


    Other recent improvements:

    - Revised the demo so it's easier to do (see "docs/demo" directory),
    - Added a test script for development (although very basic right now, see "test" directory),
    - Revised the Advanced setup instructions to make it easier to follow (see "docs/setup" directory),
    - Revised the Maintenance setup instructions too, and included a SQL script to create the pgAgent job definition instead of creating it manually (see "docs/setup" directory).

  • skwalerskwaler Posts: 47
    Hi Lindsay,

    I tried your new version and it now gets another error.
    Could you take a look at the output and see if you can find the problem?

    output attached as zip

    Thanks
  • lindsay.stevenslindsay.stevens Posts: 391 ✭✭
    Thanks very much for trying it and providing the output.

    It seems that you've hit an issue with the way multi-value fields are/were handled. The choices in these fields are split to separate columns. The split column names are constructed using the item_name or item_oid, plus the choice's coded value.

    This is all fine unless a coded value includes characters that are not valid in identifiers. In the output you provided there is one with "PNS/PNfS" (forward slash invalid), and one " Stim" (space invalid).

    A while back, Hans Troost from Utrecht Uni had reported the same problem but due to negative numbers ("-1"), which was discussed in a ticket on GitHub here: https://github.com/lindsay-stevens/openclinica_sqldatamart/issues/2, there's a bit more detail and background there.

    Anyway, the other day I had a brain-wave and realised that the simple and best solution would be to replace these invalid characters with their UTF-8 hex codes. This is a relatively standard practice (e.g. OpenClinica does it with translations), and the resulting code is always numbers or letters.

    I have implemented this now and pushed the updated code to GitHub on the master branch. If you could give it another try an let me know how it goes, it'd be greatly appreciated.

    Another problem Hans came across is enormous item groups - PostgreSQL has a limit of about 1500 which can be reached with huge questionnaires in a single item group, and/or lots of large multi-choice items in a single item group. I haven't got a solution for that yet, and I hope it doesn't catch you too :)
  • skwalerskwaler Posts: 47
    edited November 2016
    I pulled the latest and tried again. Looks to be same error - attached
  • skwalerskwaler Posts: 47
    Hi Lindsay,

    Did you have chance to look at the log file i sent?
  • lindsay.stevenslindsay.stevens Posts: 391 ✭✭
    Yes, it's the same problem I tried to fix earlier. Since your last post I've been working on a solution to resolve this once and for all, but it has kind of turned into a rewrite, lol.

    New stuff:

    - handle characters that are not valid as identifiers (not in [_A-z0-9]),
    - always ensure unique, valid names for multi-choice columns,
    - always ensure unique, valid names for items with very long names,
    - split large item groups (greater than 200 columns) into smaller groups,
    - handle data type casting failures (e.g. null codes or invalid study design like choosing DATE as the type for for an INT choice list) and output these failures per row as JSON in a special "cast_failures" column present in each item group data set,
    - output original CSV data for multi-choice columns (in case it's preferred), including a CSV column of the corresponding value labels,
    - optimise queries so that study itemgroup views can run against live data, or be concurrently refreshed if the instance is large (all queries will be concurrently refreshable),
    - rework study itemgroup query generation so you can audit their content before executing them, and many other related dev/debug improvements.

    Phew!

    I plan to push the work in progress code up to GitHub very soon, best to follow it on the repository.

    If you can provide a pg_dump of your anonymised test/dev environment it would be a huge help to ensuring these things work as intended.
  • skwalerskwaler Posts: 47
    I'll prepare a dump and PM you
  • skwalerskwaler Posts: 47
    Hi Lindsay - Did you checkout my test db? Should do some testing with the lastest master branch?
Sign In or Register to comment.