• ptu@sopuli.xyz
    3 days ago

    Interesting, could you enlighten me on what types of data are in those 100 columns? I’m aware of ATGC and thought it would be just one column, but maybe the rest indicate intensity or activity, or which sequence they are part of.

    • rockSlayer@lemmy.blahaj.zone
      3 days ago

      Well, it varies depending on what the file is meant for. Usually there are columns like chromosome, variant position, reference nucleotide, observed nucleotide, type of variation, codon sequence, gene name, etc.

      There are also columns that result from various analyses. In the file I’ve been working on lately, there are columns such as variant impact, level of confidence, pathogenicity, clinical significance, etc.
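
To make the column layout described above concrete, here is a minimal sketch of reading such a table in Python. It assumes a tab-separated file with a header row; the column names (`chrom`, `pos`, `ref`, `alt`, `variant_type`, `gene`, `impact`, `clinical_significance`) and every data row are invented for illustration and are not taken from any real file.

```python
import csv
import io

# Hypothetical example rows: column names and values are invented for illustration.
SAMPLE_TSV = (
    "chrom\tpos\tref\talt\tvariant_type\tgene\timpact\tclinical_significance\n"
    "1\t1000\tA\tG\tmissense\tGENE_A\tMODERATE\tuncertain\n"
    "1\t2000\tC\tT\tstop_gained\tGENE_A\tHIGH\tpathogenic\n"
    "2\t3000\tG\tGA\tframeshift\tGENE_B\tHIGH\tlikely_pathogenic\n"
)


def load_variants(text):
    """Parse tab-separated text into a list of dicts, one per variant row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        row["pos"] = int(row["pos"])  # store the position as an integer
        rows.append(row)
    return rows


def high_impact(rows):
    """Keep only the rows whose (hypothetical) impact column is HIGH."""
    return [r for r in rows if r["impact"] == "HIGH"]


variants = load_variants(SAMPLE_TSV)
for v in high_impact(variants):
    print(f'{v["chrom"]}:{v["pos"]} {v["ref"]}>{v["alt"]} {v["gene"]} {v["variant_type"]}')
```

A real file of this kind would have far more columns and rows, so a streaming reader like this (one row at a time) is usually a safer starting point than loading everything into memory.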

      • The_v@lemmy.world
        3 days ago

        That sounds like a marker file. It’s a bit different than a sequence file.

        Molecular markers are linked to specific sequences in the DNA. These markers are generally close to or inside the gene of interest. All the extra columns describe a marker’s characteristics and results. Any place in the entire genome where there is a single-nucleotide difference (a polymorphism) can be another marker. There are millions of these, and they add up to massive files.

        A sequence file is basically just a long, boring string of nucleotides and is not that large. The files you use to generate the sequence are another story. Let’s just say they had to wait almost 20 years for computers to get fast enough to process those files in a reasonable time. Those make the marker files look like child’s play.

        • rockSlayer@lemmy.blahaj.zone
          3 days ago

          I’m not familiar with the name of the file I’m currently working with, tbh. It’s used to create the annotation files for regenie analyses. It has every variant for every gene within the biobank. There’s far more than just missense variants; there are stop/start gain/loss, splice donor/acceptor, frameshift, and PTV entries. It contains PrimateAI scores, SpliceAI scores, CAVA data, ClinVar data, and more.
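
As a rough sketch of the step described above (turning a per-variant table into an annotation file), the snippet below assumes the annotation file is a whitespace-separated, header-less list of variant name, gene name, and one annotation category per line; check the regenie documentation for the exact format it expects. The `chrom:pos:ref:alt` variant naming, the `CATEGORY_BY_TYPE` mapping, and the sample rows are all hypothetical and not taken from the file discussed here.

```python
# Hypothetical input rows: names and values are invented for illustration.
ROWS = [
    {"chrom": "1", "pos": 1000, "ref": "A", "alt": "G",
     "variant_type": "missense", "gene": "GENE_A"},
    {"chrom": "1", "pos": 2000, "ref": "C", "alt": "T",
     "variant_type": "stop_gained", "gene": "GENE_A"},
    {"chrom": "2", "pos": 3000, "ref": "G", "alt": "GA",
     "variant_type": "frameshift", "gene": "GENE_B"},
    {"chrom": "2", "pos": 4000, "ref": "T", "alt": "C",
     "variant_type": "synonymous", "gene": "GENE_B"},
]

# Assumed mapping from variant type to annotation category (not an official scheme).
CATEGORY_BY_TYPE = {
    "stop_gained": "LoF",
    "frameshift": "LoF",
    "splice_donor": "LoF",
    "splice_acceptor": "LoF",
    "missense": "missense",
}


def variant_id(row):
    """Build a chrom:pos:ref:alt identifier (an assumed naming convention)."""
    return f'{row["chrom"]}:{row["pos"]}:{row["ref"]}:{row["alt"]}'


def annotation_lines(rows):
    """Return one 'variant gene category' line per row that has a mapped category."""
    lines = []
    for row in rows:
        category = CATEGORY_BY_TYPE.get(row["variant_type"])
        if category is None:
            continue  # skip variant types without a category in the mapping
        lines.append(f'{variant_id(row)} {row["gene"]} {category}')
    return lines


anno = annotation_lines(ROWS)
print("\n".join(anno))
```

In practice the category would probably depend on the score columns as well (for example a threshold on a pathogenicity score), which this sketch leaves out.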