Project Details

Data quality of textual, user-generated content

Subject Area: Management and Marketing
Term: since 2022
Project identifier: Deutsche Forschungsgemeinschaft (DFG), project number 494840328
 
In this age of increasing digitization, the relevance of textual user-generated content (UGC), such as customer reviews, wiki content and other social media content, is growing considerably for both science and practice. Accordingly, the data quality (DQ) of textual UGC, and in particular its automated assessment and improvement, is of utmost importance. Analyses of large amounts of textual UGC using modern machine learning methods, and the outcomes and decisions based on them, are only valid and valuable if the quality of the underlying input data is assured. In contrast to structured data, however, comparable approaches for the automated assessment and improvement of DQ are still missing for unstructured, textual UGC. Moreover, current state-of-the-art machine learning methods used to analyze textual UGC do not adequately consider that the underlying input data may be of poor DQ. These methods either assume that DQ defects are eliminated during data preprocessing or generally assume that input data are of high quality, which does not seem realistic for most real-world environments.

In summary, the proposed project DQNGI focuses on the following research questions:
1) How can DQ of textual UGC be assessed and improved in an automated manner?
2) How can DQ-annotated textual UGC be processed by machine learning methods in a well-founded manner?

To address these research questions, DQNGI comprises two subprojects, T1 and T2. With respect to research methodology, DQNGI applies analytical, mathematical modeling in combination with an experimental evaluation based on real-world data.

T1 develops new approaches for the assessment and improvement of the major data value-oriented DQ dimensions of textual UGC: correctness/currency, completeness, consistency and identity. T1 results in approaches for the automated assessment and improvement of DQ that are evaluated (e.g., regarding validity and reliability) and include publicly accessible implemented software, as well as datasets that annotate textual UGC with assessed and improved DQ, respectively.

T2 develops new approaches for machine learning methods that can methodically process not only UGC, but also the corresponding annotated DQ (metric) values, both assessed and improved, as input data. Here, DQNGI focuses on neural networks and random forests, two machine learning methods that are widely applied to the analysis of textual UGC. T2 results in new approaches for machine learning methods (including publicly accessible implemented software) that process DQ-annotated input data. Moreover, T2 provides insights into the (changed) quality and robustness of the results of these methods, and into their validity and reliability.
DFG Programme: Research Grants
 
 

