Ambiguous genes due to aligners and their impact on RNA-seq data analysis

Szabelska-Beręsewicz, Alicja; Zyprych-Walczak, Joanna Grażyna; Siatkowski, Idzi; Okoniewski, Michał

doi:10.1038/s41598-023-41085-6

cris.lastimport.scopus	2025-10-23T06:55:13Z
cris.virtual.author-orcid	0000-0002-1806-0891
cris.virtual.author-orcid	0000-0001-9465-3851
cris.virtual.author-orcid	0000-0003-2793-7074
cris.virtualsource.author-orcid	6a2f8857-003b-41ec-9112-0ef6941bfd06
cris.virtualsource.author-orcid	a96d2343-ee65-450d-8ce2-355a51255d10
cris.virtualsource.author-orcid	56a24a0b-b8df-452f-9154-3dd1ce560fc3
dc.abstract.en	The main scope of the study is ambiguous genes, i.e. genes whose expression is difficult to estimate from the data produced by next-generation sequencing technologies. We focused on the RNA sequencing (RNA-Seq) type of experiment performed on the Illumina platform. It is crucial to identify such genes and understand the cause of their difficulty, as these genes may be involved in some diseases. By giving misleading results, they could contribute to a misunderstanding of the cause of certain diseases, which could lead to inappropriate treatment. We thought that the ambiguous genes would be difficult to map because of their complex structure. So we looked at RNA-seq analysis using different mappers to find genes that would have different measurements from the aligners. We were able to identify such genes using a generalized linear model with two factors: mappers and groups introduced by the experiment. A large proportion of ambiguous genes are pseudogenes. High sequence similarity of pseudogenes to functional genes may indicate problems in alignment procedures. In addition, predictive analysis verified the performance of difficult genes in classification. The effectiveness of classifying samples into specific groups was compared, including the expression of difficult and not difficult genes as covariates. In almost all cases considered, ambiguous genes have less predictive power.
dc.affiliation	Wydział Rolnictwa, Ogrodnictwa i Bioinżynierii
dc.affiliation.institute	Katedra Metod Matematycznych i Statystycznych
dc.contributor.author	Szabelska-Beręsewicz, Alicja
dc.contributor.author	Zyprych-Walczak, Joanna Grażyna
dc.contributor.author	Siatkowski, Idzi
dc.contributor.author	Okoniewski, Michał
dc.date.access	2025-09-02
dc.date.accessioned	2025-09-02T11:01:30Z
dc.date.available	2025-09-02T11:01:30Z
dc.date.copyright	2023-12-08
dc.date.issued	2023
dc.description.accesstime	at_publication
dc.description.bibliography	illustrations, bibliography
dc.description.finance	publication_nocost
dc.description.financecost	0.00
dc.description.if	3.8
dc.description.points	140
dc.description.version	final_published
dc.description.volume	13
dc.identifier.doi	10.1038/s41598-023-41085-6
dc.identifier.issn	2045-2322
dc.identifier.uri	https://sciencerep.up.poznan.pl/handle/item/4588
dc.identifier.weblink	https://www.nature.com/articles/s41598-023-41085-6
dc.language	en
dc.pbn.affiliation	biotechnology
dc.pbn.affiliation	agriculture and horticulture
dc.relation.ispartof	Scientific Reports
dc.relation.pages	art. 21770
dc.rights	CC-BY
dc.sciencecloud	nosend
dc.share.type	OPEN_JOURNAL
dc.title	Ambiguous genes due to aligners and their impact on RNA-seq data analysis
dc.type	JournalArticle
dspace.entity.type	Publication
oaire.citation.issue	1
oaire.citation.volume	13
