Intelligent Methods and Models for Mining Community Knowledge: Enabling enriched Understanding of Urban Development in Helsinki Metropolitan Region with Social Intelligence (DIGILENS-HKI)
(Arcada University of Applied Sciences)
Social media today abounds with potentially valuable information for deepening and enriching our understanding of cities and society. At the same time, it has become clear that more and more misinformation is generated and spread through social media. Extracting meaningful information and understanding from vast amounts of social media content calls for new smart applications and new methods for data analysis, modelling and Artificial Intelligence.
Meanwhile, more rigorous evaluation and repeated validation practices are necessary to advance the development of analysis tools. Social media data also needs to be complemented by other relevant data. Whenever social media data is used, key findings should be fact-checked and the potential cost of errors kept in mind.
To enable and improve the use of social media content in support of urban development, planning, policy and decision making, we need to address multiple technical challenges from data availability and quality to effective methods and digital tools for efficient content analysis.
The DIGILENS-HKI project investigated state-of-the-art technologies and important data issues for the analysis of online social media content. The project developed new applications to enrich our understanding of topics related to urban development in the Helsinki region. Valuable advice and practical guidance for analyzing social media data can be drawn from our research.
The methodology and analysis tools developed in the project are generally applicable to any type of topic. The project's contributions are open source and lay a foundation for a variety of uses in online content analysis and in the development of digital service applications.
Conclusions and lessons learned
• Social media data analysis has benefited greatly from recent advances in AI methods. Large amounts of data and analytical resources are available for content in English. However, data availability is limited for Swedish and even more so for Finnish-language content. To improve modelling performance on Finnish social media content, more labeled datasets in Finnish are needed.
• The project has helped us better understand the pros and cons of the current AI methods, models and tools, as well as the importance of testing and validation processes.
• In general, the nature of social media content data needs to be understood, and the quality of the data to be analysed needs to be better controlled.
Advice on the use of AI Tools in Social Media Data Analysis
• Topic Modelling Analysis with data visualization can offer a simple yet powerful means for exploring large amounts of social media content. It can help to discover specific topics and events with natural granularity. Its unsupervised nature makes it easy to apply in practice.
• Named Entity Recognition and high-dimensional dataset visualization can provide a quick overview of, for example, the names of people, places, organizations, products and creative works. However, emerging named entity recognition on social media data is still very challenging, with many problems far from solved, including the poor performance of available tools.
• Recent developments in AI methods, platforms and open source efforts have produced large amounts of analytical resources for social media data analysis. However, the methods work mostly for content in English and are very limited for other languages.
• Social media data preprocessing choices can have positive or negative effects on the analysis results. Fortunately, thanks to considerable effort from the research community, preprocessing components are becoming standardized and can be used as a reference when developing applications.
• Post-processing is often more effective at improving analytical results. Post-processing rules are accumulated from testing and validation processes. The importance of data validation cannot be overemphasized.
• In developing practical analytical applications, involvement of domain experts such as, in this case, planners or decision-makers, is imperative for quality control of the analytical process.
• Before considering social media content data as information for decision-making, it is extremely important to understand the nature and quality of the social media platforms and datasets in each case. It is fundamental to understand the effects of biases and information disorder on social media content, as well as biases in the analytical methods themselves.
• It is also important to stay aware of the role played by training datasets of limited size and span. Larger complementary data sources need to be incorporated to support real-world decision making.
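The point above about preprocessing choices can be illustrated with a minimal sketch. It is not the project's own pipeline: the function name preprocess, the keep_hashtags switch and the regular expressions are assumptions made for illustration, using only the Python standard library. It shows how a single choice (keeping or dropping hashtag words) changes the tokens that later analysis steps will see.

```python
import re
from typing import List

# Illustrative patterns (assumptions, not the project's rules).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
TOKEN_RE = re.compile(r"\w+")  # \w matches Unicode letters such as ä and ö


def preprocess(post: str, keep_hashtags: bool = True) -> List[str]:
    """Turn one social media post into a list of lowercase tokens.

    URLs and @mentions are always removed. Hashtags are either kept
    as plain words or dropped, depending on keep_hashtags.
    """
    text = URL_RE.sub(" ", post)
    text = MENTION_RE.sub(" ", text)
    if keep_hashtags:
        # "#cycling" becomes "cycling"
        text = HASHTAG_RE.sub(r"\1", text)
    else:
        text = HASHTAG_RE.sub(" ", text)
    return [token.lower() for token in TOKEN_RE.findall(text)]


if __name__ == "__main__":
    example = "Pyöräily Helsingissä! #cycling @friend https://example.com/x"
    print(preprocess(example))
    print(preprocess(example, keep_hashtags=False))
```

Running both variants on the same validation data is one simple way to check whether a given preprocessing choice helps or hurts the downstream analysis.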
Proposal for Action
• One critical restriction on developing better-performing analytical tools is the poor availability of training datasets. To improve analytical results for social media content in Finnish, it is important to invest more effort in building more and larger labeled datasets. This should be a joint effort of the research and user communities, and the process could be crowdsourced. The sooner such larger labeled datasets are available, the sooner the analysis of social media content in Finnish will improve. Meanwhile, research is emerging on better learning methods for low-resource languages. Progress in this area can hopefully bring more insights and tools for processing and analyzing Finnish data as well.
• When using social media data as information in decision-making and policy processes, always fact-check key findings against information external to the social media data. Analytical results always need to be validated, and conclusions drawn with great caution.
• Conventional ways of collecting data to support our understanding of cities and societies are generally considered more reliable, but they are very labor-intensive and often expensive. The processes can be slow, do not scale up easily, and often produce data that is sparse, with coarse location granularity and minimal context information.
• Today citizens generate and share large amounts of information on social media about where they are and what they are doing, leaving marks and notes of their interaction with their environment and creating considerable public discourse. Such social media content is sometimes biased and generally less reliable, but it is also much cheaper, easier and faster to collect in massive amounts, as timely geo-tagged data with fine-grained location, rich demographics and detailed context information.
• On the methodology and technology side, working with Instagram and Twitter data from the Helsinki region, the project explored state-of-the-art natural language processing techniques, as well as machine learning and AI methods for analysing social media content data. Our study helps to open up the understanding of the possibilities and means of using social media data in, for example, urban planning and urban development.
• First, in order to understand the topic content in large collections of social media posts and discussions, we developed topic modelling analysis and visualization tools to help explore the presence and prevalence of selected topics in social media, such as festival events, cycling and transportation, safety issues, and so on.
• In collaboration with the Digital Geography Lab of the University of Helsinki, we explored the application of our models and tools to analyzing cycling-related topics in Instagram data from summer 2016, in English and Finnish.
• Second, we addressed the challenging task of emerging Named Entity Recognition from user-generated ‘noisy’ online content. Applying recurrent neural networks and topic modelling methods, we developed deep neural models and a visualization tool for analyzing named entities. We also developed deep neural models for social media sentiment analysis, applying deep learning and transfer learning methods.
• On the data side, we conducted a literature and media study to help develop an understanding of bot activities and information disorder on social media, in order to guide the practical use of social media data. First we tried to understand the nature of bots’ activities and information disorder, after which we explored existing countermeasures.
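The idea of exploring the presence and prevalence of selected topics can be sketched in a few lines. This is a much simpler keyword-matching stand-in, not the unsupervised topic modelling developed in the project. The keyword lists, the function name topic_prevalence and the sample posts are all hypothetical and serve only as an illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Set

# Hypothetical keyword lists; a real study would need curated,
# validated vocabularies for each language.
TOPIC_KEYWORDS: Dict[str, Set[str]] = {
    "cycling": {"cycling", "bike", "pyöräily"},
    "festival": {"festival", "festivaali", "concert"},
    "safety": {"safety", "unsafe", "turvallisuus"},
}


def topic_prevalence(
    posts: Iterable[str], topic_keywords: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Return, per topic, the share of posts containing any of its keywords."""
    counts: Counter = Counter()
    total = 0
    for post in posts:
        total += 1
        tokens = set(re.findall(r"\w+", post.lower()))
        for topic, keywords in topic_keywords.items():
            if tokens & keywords:  # at least one keyword appears in the post
                counts[topic] += 1
    if total == 0:
        return {topic: 0.0 for topic in topic_keywords}
    return {topic: counts[topic] / total for topic in topic_keywords}


if __name__ == "__main__":
    sample_posts = [
        "Great bike ride along the shore",
        "Festival weekend!",
        "Pyöräily on kivaa",
        "Coffee time",
    ]
    for topic, share in topic_prevalence(sample_posts, TOPIC_KEYWORDS).items():
        print(f"{topic}: {share:.2f}")
```

Unlike topic modelling, this approach only finds topics that someone has already thought to list, so it is better suited to tracking known themes than to discovering new ones.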
Shuhua Liu and Patrick Jansson, “Topic Modelling Analysis of Instagram Data for the Greater Helsinki Region”, Arcada Working Paper 3/2017, Arcada UAS
Shuhua Liu, “Bot Activities and Information Disorder on Social Media”, Arcada Working Paper 2018, Arcada UAS
The Helsinki Metropolitan Region Urban Research Program (2010-2018) is a horizontal cooperation network between Helsinki metropolitan area cities, universities, universities of applied sciences and two state ministries. The main goal of the program is to promote and fund multi-disciplinary, high-quality urban research that takes the special characteristics of the Helsinki Metropolitan Region as its starting point.
The program aims to provide up-to-date scientific research results, data and practical knowledge as a basis for decision-making, to create best practices and to help generate new, innovative operational models of cooperation between different actors in the region. Special attention is paid to dissemination and to improving the usability of research data. The program funds demand-based urban research projects and development activities on current topics and issues in the Helsinki metropolitan area.