Ambiguous genes due to aligners and their impact on RNA-seq data analysis

cris.virtual.author-orcid0000-0002-1806-0891
cris.virtual.author-orcid0000-0001-9465-3851
cris.virtual.author-orcid0000-0003-2793-7074
cris.virtualsource.author-orcid6a2f8857-003b-41ec-9112-0ef6941bfd06
cris.virtualsource.author-orcida96d2343-ee65-450d-8ce2-355a51255d10
cris.virtualsource.author-orcid56a24a0b-b8df-452f-9154-3dd1ce560fc3
dc.abstract.enThe main scope of the study is ambiguous genes, i.e. genes whose expression is difficult to estimate from the data produced by next-generation sequencing technologies. We focused on the RNA sequencing (RNA-Seq) type of experiment performed on the Illumina platform. It is crucial to identify such genes and understand the cause of their difficulty, as these genes may be involved in some diseases. By giving misleading results, they could contribute to a misunderstanding of the cause of certain diseases, which could lead to inappropriate treatment. We thought that the ambiguous genes would be difficult to map because of their complex structure. So we looked at RNA-seq analysis using different mappers to find genes that would have different measurements from the aligners. We were able to identify such genes using a generalized linear model with two factors: mappers and groups introduced by the experiment. A large proportion of ambiguous genes are pseudogenes. High sequence similarity of pseudogenes to functional genes may indicate problems in alignment procedures. In addition, predictive analysis verified the performance of difficult genes in classification. The effectiveness of classifying samples into specific groups was compared, including the expression of difficult and not difficult genes as covariates. In almost all cases considered, ambiguous genes have less predictive power.
dc.affiliationWydział Rolnictwa, Ogrodnictwa i Biotechnologii
dc.affiliation.instituteKatedra Metod Matematycznych i Statystycznych
dc.contributor.authorSzabelska-Beręsewicz, Alicja
dc.contributor.authorZyprych-Walczak, Joanna Grażyna
dc.contributor.authorSiatkowski, Idzi
dc.contributor.authorOkoniewski, Michał
dc.date.access2025-09-02
dc.date.accessioned2025-09-02T11:01:30Z
dc.date.available2025-09-02T11:01:30Z
dc.date.copyright2023-12-08
dc.date.issued2023
dc.description.accesstimeat_publication
dc.description.bibliographyill., bibliogr.
dc.description.financepublication_nocost
dc.description.financecost0,00
dc.description.if3,8
dc.description.points140
dc.description.versionfinal_published
dc.description.volume13
dc.identifier.doi10.1038/s41598-023-41085-6
dc.identifier.issn2045-2322
dc.identifier.urihttps://sciencerep.up.poznan.pl/handle/item/4588
dc.identifier.weblinkhttps://www.nature.com/articles/s41598-023-41085-6
dc.languageen
dc.pbn.affiliationbiotechnology
dc.pbn.affiliationagriculture and horticulture
dc.relation.ispartofScientific Reports
dc.relation.pagesart. 21770
dc.rightsCC-BY
dc.sciencecloudsend
dc.share.typeOPEN_JOURNAL
dc.titleAmbiguous genes due to aligners and their impact on RNA-seq data analysis
dc.typeJournalArticle
dspace.entity.typePublication
oaire.citation.issue1
oaire.citation.volume13