Student Projects, Academic Year 2018-2019

Below is a list of project topics for Masters and Bachelors theses offered by the software engineering research group for students who intend to defend in June 2019. The projects are divided into Masters projects and Bachelors projects.

If you're interested in any of these projects, please contact the corresponding supervisor.


Masters projects

What happens to all those hackathon projects? -- BOOKED

Alex Nolte (alexander[dot] nolte [ät] ut [dot] ee)

Hackathons started out as time-bounded competitive events during which young developers formed small ad-hoc teams and engaged in short-term intense collaboration on software projects for pizza and sometimes the prospect of a future job. Since those humble beginnings hackathons have become a global phenomenon, with the largest hackathon league alone organizing 200 collegiate events with more than 65,000 participants every year (MLH).

During such events participants create an amazing variety of ideas and innovative software products. This master project aims to assess what happens to those projects after a hackathon is over and the winners have been announced. You will thus focus on the following research question:

RQ: Which hackathon projects get continued and what are potential reasons for their (dis-)continuation?

Using a combination of qualitative and quantitative research techniques, you will start your investigation from a dataset that covers more than 2000 hackathons over the past 5 years (Devpost). Most projects in the dataset are connected to a Github repository, which not only allows you to track the progress of each project before and after the hackathon but also enables you to contact participants if necessary.

Hackathons as catalysts for future job opportunities

Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee), Alex Nolte, Irene-Angelica Chounta

Hackathons are often perceived as events during which participants can expand their personal networks and develop or showcase their skills for future job opportunities. It is thus common for individuals to take part in several hackathons that cover different themes and take place in different locations.

The goal of this master thesis is to develop an understanding of the connections between hackathon participants and the potential impact of those connections on future job opportunities. As a starting point you will work with an existing dataset which covers roughly 120,000 hackathon participants (Devpost). Most of those participant profiles are connected to personal Github repositories, private websites or Linkedin profiles. The student will analyze these data from a social network (network science) perspective to understand the relations among the hackathon participants. Basic concepts and libraries for network science and social network data analysis will be provided to speed up the process.
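
To give a flavour of the kind of network-science analysis involved, here is a minimal sketch of building a co-participation network from team records. The record format and field names ("hackathon", "members") are hypothetical placeholders, not the actual Devpost schema.

```python
# Minimal sketch: build a co-participation network from hackathon team data.
# Team records and field names are invented placeholders for illustration.
import itertools
import networkx as nx

teams = [
    {"hackathon": "Hack2018", "members": ["alice", "bob", "carol"]},
    {"hackathon": "Hack2019", "members": ["alice", "dave"]},
]

G = nx.Graph()
for team in teams:
    # Connect every pair of team members; repeated collaboration increases the edge weight.
    for u, v in itertools.combinations(team["members"], 2):
        w = G.get_edge_data(u, v, {"weight": 0})["weight"]
        G.add_edge(u, v, weight=w + 1)

# Basic network-science descriptives: degree centrality and connected components.
print(nx.degree_centrality(G))
print(nx.number_connected_components(G))
```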

Predicting hackathon outcomes using machine-learning (Data Analytics)

Irene-Angelica Chounta (chounta [ät] ut [dot] ee), Alex Nolte, Rajesh Sharma

Hackathons started out as coding competitions during which participants engaged in short-term intense collaboration on software projects for pizza and sometimes other prizes, e.g. in the form of hardware or cash. Winning hackathon competitions also increases the visibility of winning teams and can benefit participants in terms of future job opportunities and personal development.

In this thesis we aim to use machine learning and other data analytics approaches to identify aspects of hackathon teams that improve their chances of winning. This includes the exploration of how contextual and team-structural factors – such as the topic of the hackathon project and the diversity of the team members with respect to skills, social characteristics and expectations – can impact the project’s outcome and lead a team to victory!

Based on an existing dataset which covers more than 70,000 hackathon teams (Devpost), the student will extract features of hackathon teams and their participants and investigate whether winning teams share a certain set of characteristics that helps them win. Thus, using computational approaches, we would like to propose and evaluate models for predicting winning teams in hackathons.
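
As a first impression of such a prediction model, the sketch below trains a classifier on team-level features. The feature names (team_size, skill_diversity, prior_wins) and the toy data are hypothetical; the real features would be engineered from the Devpost dataset.

```python
# Illustrative sketch only: predicting hackathon winners from team-level features.
# Feature names and values are invented placeholders, not the Devpost schema.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.DataFrame({
    "team_size":       [3, 5, 2, 4, 6, 3, 5, 4],
    "skill_diversity": [0.2, 0.8, 0.1, 0.6, 0.9, 0.4, 0.7, 0.5],
    "prior_wins":      [0, 2, 0, 1, 3, 0, 1, 1],
    "won":             [0, 1, 0, 1, 1, 0, 1, 0],
})

X, y = df.drop(columns="won"), df["won"]
clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Cross-validated accuracy gives a first impression of the predictive signal.
print(cross_val_score(clf, X, y, cv=2).mean())
```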

Tracking Unusual Activities in Traffic Police Data.

Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee), Flavio Bertini and Stefano Rizzo

Sensors and cameras are often placed on highways to keep traffic flowing without disruption, to detect accidents, and to enable appropriate and timely actions. However, the traffic data can also be used for detecting unusual activities. In this thesis, you will analyze large-scale traffic police data shared by the Italian authorities to detect anomalies or unusual behavior on highways. The dataset will be provided. We expect you to find unusual behavior in this traffic data using machine learning techniques, in particular anomaly detection techniques.
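
A minimal sketch of one common anomaly detection technique (an Isolation Forest) applied to synthetic traffic-like measurements is shown below; the features and data are invented stand-ins for the real dataset, not the project's actual pipeline.

```python
# Sketch only: unsupervised anomaly detection with an Isolation Forest.
# The two synthetic columns stand in for per-interval traffic features,
# e.g. vehicle count and mean speed; the real data will be provided.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=[100, 110], scale=[10, 5], size=(500, 2))
unusual = rng.normal(loc=[20, 40], scale=[5, 5], size=(5, 2))  # e.g. a sudden slowdown
X = np.vstack([normal, unusual])

model = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = model.predict(X)            # -1 marks observations flagged as anomalous
print(np.where(labels == -1)[0])     # indices of the flagged intervals
```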

Behavior Analysis of Bike Users in a City Setting

Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee), Flavio Bertini and Stefano Rizzo

As part of smart city initiatives, authorities are creating separate lanes and bicycle racks for bikers. However, the key question is how well these resources, put in place for better traffic management, are actually utilized. In this thesis, we will analyze a real dataset from an Italian city with a population of 385,192 inhabitants. The dataset covers a period of 6 months, from April 2017 to September 2017. We will predict users' behavior in terms of how they use these resources. We expect you to use data science and machine learning techniques. The dataset is the property of SRM Reti e Mobilità Srl and all analysis must respect the NDA, preserving the privacy and anonymity of the users.

An investigation on the relationship between inequality and growth.

Jaan Ubi (jaanbi dot jb [ät] gmail.com ) and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

The purpose of this research is to investigate the relationship between inequality - of salaries - and growth - of firms. The first aim is to reproduce known statistical laws of this kind in the Estonian economy. First, we introduce a measure of salary inequality inside a firm. Next, we look at the (synchronous) correlation between firm size and inequality. Do firms in different sectors (e.g. banks) exhibit different levels of inequality? We then regress the current inequality measure against future firm growth - possibly in a non-linear manner. The research question is: “Does a more unequal distribution of salaries improve the performance of a firm?” This research strand is suitable for a student who aspires to apply data science in the domain of economic/business analysis, an active area of endeavor in Estonia - as the country is about to take an active stance in applying such techniques for driving its policies.
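
As one concrete candidate for the within-firm salary inequality measure, here is a minimal sketch of the Gini coefficient; whether this is the measure finally adopted in the thesis is open.

```python
# Sketch: Gini coefficient as one possible within-firm salary inequality measure.
import numpy as np

def gini(salaries):
    """Gini coefficient of a 1-D array of non-negative salaries (0 = equal, 1 = maximally unequal)."""
    x = np.sort(np.asarray(salaries, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    # Standard formula based on the cumulative share of the total salary mass.
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

print(gini([1000, 1000, 1000]))        # 0.0: perfectly equal firm
print(gini([500, 1000, 5000, 20000]))  # substantially unequal firm
```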

Left, Center or Right?: Controversial groups on Social Media

Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

With respect to political views, users in social media can often be classified broadly into three categories, namely left, right or center. In this thesis, users of social media platforms, particularly platforms like Facebook, will be studied anonymously. The crux of the problem will be to predict users' inclination towards right, center or left political parties. Data science techniques such as network science, machine learning and sentiment analysis will be explored for this prediction problem.

Analyzing echo chambers in social networks.

Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

An echo chamber is a metaphorical description of a situation in which information, ideas, or beliefs are amplified or reinforced by communication and repetition inside a defined system. In this thesis, we will investigate echo chambers in social media platforms such as Twitter or Facebook and their effect on social media users. Techniques from network science and machine learning will be explored for understanding echo chambers in social media.

Understanding filter bubbles in social media networks

Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

A filter bubble is an algorithmic bias that skews or limits the information an individual user sees on the internet. The bias is caused by the weighted algorithms that search engines, social media sites and marketers use to personalize the user experience. The concept is particularly important because filter bubbles can contribute to creating opinionated individuals. In this thesis, a study will be performed to understand the effect of filter bubbles on social media users.

Is the media biased? An empirical analysis

Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

News channels often try to portray news stories from their own perspectives. It has been observed that particular media houses are biased towards specific topics, people and political parties. In this thesis, you will analyze a set of news stories derived from different news websites (such as BBC, CNN, etc.). The study will explore whether news channels are biased towards specific 1) topics, 2) people or 3) political parties. You will use data science techniques (such as opinion mining and machine learning) for the empirical analysis of your study.

Predicting the transaction type to be performed by a mobile user

Huber Flores (huber dot flores [ät] helsinki dot fi) and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

This project consists of modeling the data transfer rates of mobile users based on mobility patterns within a trajectory. Given a dataset collected in the wild (features described below), the goal is to estimate the different types of transactions (app usage sessions) and the amount of data that can be transferred in a particular transaction type, such that it is possible to predict the transaction type to be performed by a mobile user. This prediction is important to extend mobility-based contracts that ensure there is enough time to perform a valid transaction while the user is on the move. Dataset: CellularTraffic_OneWeek, traffic data collected by an ISP in Shanghai between Aug 1st and Aug 7th 2014. For security reasons, the device IDs and base station IDs are all anonymized.

Analyzing question-answering system: Quora case study

Rajesh Sharma (rajesh dot sharma ät ut dot ee)

Question answering systems (QASs) generate answers to questions asked in natural language. Early QASs were developed for restricted domains and had limited capabilities. Currently, platforms like Quora have been helpful in pushing these boundaries. In this thesis, using Quora as a case study, you will perform user analytics to understand the reasons behind the success of a platform where "all kinds of questions are welcome". We expect you to perform an empirical study among the platform's users.

CV-Keskus: analysing resumes to create prediction models

Rajesh Sharma (rajesh dot sharma ät ut dot ee) and Jaan Masso

Job seekers often use online social platforms for job search. In this thesis, we will use the CV-Keskus dataset for descriptive and/or predictive analysis, with the aim of answering various questions such as 1) what is the wage gap between Estonians and foreigners when wages are also compared by gender, or 2) which fields attract foreigners. Another possible problem could be building a rating score for job applications by matching experience, nationality, age and skills against historical data from similar job descriptions. The thesis can be done either by investigating several small questions or by focusing on one main research question. Dataset: CV Keskus dataset.

Variation in Estonian folksongs

Mari Sarv (mari@haldjas.folklore.ee) and Rajesh Sharma (rajesh dot sharma ät ut dot ee)

Estonian Runo songs form a digital corpus of approximately 100,000 song texts unevenly spread over the 101 parishes of Estonia. The task of this MA research is to study the patterns of variation within this corpus on the basis of language data (word forms) and available metadata (singer, collector, place, time, classification of songs). The language of folk songs is highly variable, including archaisms and dialectal variation, thus NLP tools for the standard language are not easily applicable. The main question of the research is to find out whether different patterns of variation can be detected at the linguistic level and at the content level, to contribute to the general discussion on the essence of folkloric communication as well as to better knowledge of the regional variation of Estonian language and culture.

Predictive analytics of companies

Peep Kungas (peep.kungas@ir.ee) and Rajesh Sharma (rajesh dot sharma ät ut dot ee)

In this thesis, the student will perform data science activities, in particular predictive analysis of various companies with respect to their financial status. We are especially interested in tasks such as credit limit recommendation and late payment prediction. An example of credit limit recommendation: if company A sells to company B, then A sends B an invoice; if the invoice amount exceeds company B's credit limit, the invoice must be paid by B before goods/services are received from A. Late payment prediction concerns the same setting: when the invoice does not have to be prepaid, a payment date needs to be set. Typically the number of days between the invoice date and the due date is fixed in company policies, but when a company is optimizing its sales, it is interested in increasing the delay if that helps to close deals, keep customers, or reduce customers' financial stress. A dataset will be provided by Register OÜ for the thesis, including company credit score history, defaults, company background data features (board membership network metrics), financials and financial indicators, (tax) debts, market sector, paid taxes, etc.
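
To make the late payment prediction task concrete, the sketch below fits a simple classifier on invoice-level features. The feature names (invoice_amount, buyer_credit_score, prior_late_ratio) and values are illustrative placeholders, not the actual Register OÜ schema.

```python
# Hedged sketch: a first-pass late-payment classifier on invented invoice features.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "invoice_amount":     [200, 15000, 800, 40000, 120, 9000, 300, 25000],
    "buyer_credit_score": [0.9, 0.4, 0.8, 0.2, 0.95, 0.5, 0.85, 0.3],
    "prior_late_ratio":   [0.0, 0.6, 0.1, 0.8, 0.0, 0.4, 0.05, 0.7],
    "paid_late":          [0, 1, 0, 1, 0, 1, 0, 1],
})

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="paid_late"), df["paid_late"], test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(model.predict_proba(X_test)[:, 1])  # estimated probability of late payment
```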

Media monitoring

Rajesh Sharma (rajesh dot sharma ät ut dot ee)

When businesses are caught out engaging in illegal or immoral activities, their reputation might suffer. Corporate reputation is a reflection of how a business is regarded by its customers and the public in general. If corporate misbehaviour negatively affects a business’ reputation, customers might switch to rival businesses. For this reason, reputation has a central role in free markets, as it has the potential to deter businesses from misbehaving. The extent to which corporate wrongdoings trigger a reputational loss is still debated and is the subject of a large body of academic work. Most of these works are based on survey methods to measure reputation. This research relies on a more direct method to measure reputational changes, by conducting a sentiment analysis of how the public reacted on Twitter to some of the most high-profile corporate misconducts. In this particular thesis, corporate reputation will be studied using the Volkswagen (VW) scandal as a case study and the public reaction it created on Twitter. VW’s scandal has been chosen because it has been widely covered over time through both traditional and social media. Moreover, we can measure how changes in media coverage and social media reaction affected VW’s financial performance. The dataset and related literature will be provided for speeding up the work. Dataset: A dataset for one use case will be provided; however, we expect to collect some additional datasets for a comparative study.

Gender-based segregation in company boards and well-being

Peep Kungas (peep.kungas@ir.ee) and Rajesh Sharma (rajesh dot sharma ät ut dot ee)

Segregation is an unjustified separation or distance in social environments (physical, working, or online) between individuals on the basis of any physical or cultural trait. A segregation index measures the degree of segregation of a minority group within each of a set of units (e.g. schools), weighting each unit by some measure of relevance. The core hypothesis of this topic is that the distribution of gender and age segregation reflects changes in the labor market. There are some initial findings (e.g. the isolation index negatively correlates with the unemployment rate in the case of young men in Lääne-Virumaa (2008-2015)) which seem to confirm this hypothesis, but further studies are required to confirm it. Furthermore, researchers at the University of Pisa have developed models for measuring segregation in company boards for which input data is available on a daily basis. Hence, using segregation metrics to now-cast (un)employment means in practice that the quarterly delays in measuring the effect of policy changes on (un)employment can be reduced to virtually zero, which allows raising the quality of decision-making with respect to (un)employment. Typically, higher values of the index mean higher segregation. In this thesis, the student will analyse whether an even distribution of gender and age in company boards leads to improved credit risk management (low credit risks of companies in a region) and whether it has a strong positive correlation with the well-being of society. Dataset: Dataset will be provided.
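
For concreteness, below is a minimal sketch of one standard formulation of the isolation index mentioned above, computed over company boards as units. This is only an illustration; the thesis may rely on a different index variant (e.g. the University of Pisa models).

```python
# Sketch of one standard segregation measure, the isolation index.
def isolation_index(minority_counts, total_counts):
    """xPx = sum_i (x_i / X) * (x_i / t_i), where x_i is the minority head count
    in board i, X the overall minority head count, and t_i the board size."""
    X = sum(minority_counts)
    return sum((x / X) * (x / t)
               for x, t in zip(minority_counts, total_counts) if t > 0)

# Toy example: women on the boards of three companies.
print(isolation_index(minority_counts=[1, 0, 4], total_counts=[5, 6, 4]))
```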

Discovery of public-private corruption cases

Peep Kungas (peep.kungas@ir.ee) and Rajesh Sharma (rajesh dot sharma ät ut dot ee)

Corruption is a major obstacle to sustainable economic, political and social development. Overall, corruption reduces efficiency and increases inequality. CleanGovBiz estimates that the cost of corruption equals more than 5% of global GDP. Hence, effective means to prevent corruption have a significant effect on development. Corruption is also an issue in Estonia. In fact, the number of revealed corruption cases is on the rise – the registered numbers of corruption offences were 161 and 450 in 2012 and 2015, respectively. This project aims at reducing public-private corruption in Estonia by providing a mechanism for automatically revealing corruption patterns and using the mechanism to discover corruption cases already in their early stages. The approach is to apply a combination of social network analysis and machine learning techniques to analyze temporal networks of organizations, persons and assets (tenders, financial aid, real estate objects, etc.) in order to find temporal network patterns which describe existing corruption cases.

Datasets: The following data are available for the purpose of the project:

  • Board members and owners of businesses
  • Some features of businesses
  • Real estate ownership data

In addition, the project will benefit from the following datasets:

  • Public tender data (sums, descriptions and winners of tenders)
  • Grants, financial aid and subsidies
  • Public sector officials/employees
  • Corruption cases (specific organizations and persons) for learning

For more, please see the blog: http://sidh2017.ut.ee/2017/11/12/corruption/

Media monitoring for business analysis

Peep Kungas (peep.kungas@ir.ee) and Rajesh Sharma (rajesh dot sharma ät ut dot ee)

Social media reach and engagement have turned out to be key metrics that allow measuring the performance of posts. Reach tells the size of the audience reached by a post/mention, while engagement indicates the number of individuals who reacted to the post/mention. However, there is no model to measure the same for online mentions. In this thesis, a student can investigate the problem of the measurable impact of a company's marketing/communication activities (what are we measuring, and how?). This problem can be analysed either 1) using predictive modelling or 2) by an alternative approach, such as defining Web engagement through users' Web search actions. In particular, media mentions of Estonian businesses are of interest to us. Dataset: A dataset in the form of web server logs of company page visits and web visitor logs will be provided. In addition, we expect Google Trends logs to be crawled.

Tonality of company media mentions

Peep Kungas (peep.kungas@ir.ee) and Rajesh Sharma (rajesh dot sharma ät ut dot ee)

Social media is often used by companies to reach a broader audience quickly. However, it can also be used by a company to infer the public perception of the company itself. For example, a company might be interested in online activity that requires immediate action with respect to the company (e.g. for disaster prevention). In a different case, a company might be interested in determining what it should change in order to increase market share with respect to its competitors in terms of price, service/product, delivery, etc. In particular, we are interested in questions like: what is the measurable impact of a company's marketing/communication activities?

Model for estimating dynamics of market size

Peep Kungas (peep.kungas@ir.ee) and Rajesh Sharma (rajesh.sharma@ut.ee)

Companies often struggle with the question of whether to a) develop new products or b) just stick to existing products. The basic questions revolving around this topic are: 1) what is the size of a market, and 2) at what pace is the market growing or shrinking? In this thesis, a student will work on marketing activity data to explore some case studies.

Social capital at work places

Rajesh Sharma (rajesh dot sharma ät ut dot ee)

Group social capital is defined as the contribution of a group as a whole in terms of resources; in comparison, individual social capital is defined as the collection of resources of a node's neighbors. Using resource theory and social network analysis, a student will explore, in particular in workplace settings, how social capital can play an important role in career advancement. Dataset: The Enron dataset with nodes and their descriptions will be provided. However, we expect the student to collect additional resources.

Predictive Analysis on Twitter: Techniques and Applications

Rajesh Sharma (rajesh dot sharma ät ut dot ee) and Anurag Singh

Predictive analysis of social media data has attracted considerable attention from the research community as well as the business world because of the essential and actionable information it can provide. Over the years, extensive experimentation and analysis for insights have been carried out using Twitter data in various domains such as healthcare, public health, politics, social sciences, and demographics. Fine-grained analysis may be done, involving aspects such as sentiment and emotion, alongside the use of domain knowledge in coarse-grained analysis of Twitter data for making decisions and taking actions, relating a few success stories. Social media data has already enabled researchers to predict the trends and outcomes of several critical real-world events, and its reliability and coverage can be further improved by incorporating background knowledge.

Predicting Information Diffusion on Social Media

Rajesh Sharma (rajesh dot sharma ät ut dot ee) and Anna Jurek

Social media such as online social networks (Facebook), micromessaging services (Twitter) or sharing sites (Instagram) provide the space in which a significant part of social interactions takes place. Many real-life situations like elections are reflected by social media and in turn social media shapes them by forming opinions or strengthening trends. In addition to providing a large audience, social media has changed the speed of interaction: Information spreads within minutes or hours, triggering equally fast reactions.

The goal of the thesis is to develop an algorithm to predict how well a message will diffuse on Twitter.

  • The first step will be identifying significant user/message/network features that may be used to predict how fast a message will spread across the social media channel.
  • The second step will be the implementation of a classification model that predicts how well a message will diffuse on Twitter using the identified features.

A large Twitter dataset related to two different topics will be provided for your analysis. However, we also expect you to collect new data on one or two additional topics, to allow a comprehensive analysis across a variety of topics. Relevant literature will also be provided to speed up the work.
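
The sketch below illustrates the two steps on toy data: extract simple user/message features per tweet and fit a classifier for "diffused widely vs. not". The feature names and the retweet threshold defining "widely" are assumptions made only for this example.

```python
# Illustrative sketch of feature extraction plus a diffusion classifier on invented data.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

tweets = pd.DataFrame({
    "followers":   [120, 50000, 800, 320000, 45, 9000, 150, 76000],
    "has_hashtag": [0, 1, 1, 1, 0, 0, 1, 1],
    "has_url":     [0, 1, 0, 1, 0, 1, 0, 1],
    "hour_posted": [3, 18, 12, 20, 2, 9, 23, 17],
    "retweets":    [0, 450, 3, 2100, 0, 12, 1, 890],
})

# Binarise the target: a tweet "diffused widely" if it gathered at least 100 retweets.
tweets["widely_diffused"] = (tweets["retweets"] >= 100).astype(int)
X = tweets.drop(columns=["retweets", "widely_diffused"])
y = tweets["widely_diffused"]

clf = GradientBoostingClassifier(random_state=0)
print(cross_val_score(clf, X, y, cv=2).mean())
```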

Identifying Fake News using Linked Data and Network Science Approaches

Supervisor: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Deepak Padmanabhan

Fake news is often generated with the malicious intent of spreading misinformation and rumours. The content of fake news is generally created to mislead readers in order to gain financially or politically, as well as to grab attention. Apart from social media such as Twitter, Facebook and WhatsApp, there are dedicated news agencies that propagate fake news.

The goal of this thesis is to use the content of news stories to identify whether they are fake or not, by using “Linked Data” in combination with “Network Science” approaches. The Linked Data approach will be used for identifying fake news indicators, such as enhanced topical scatter in the news content being analyzed. The network science approach will be used for identifying the similarity among the topics of the content to boost the accuracy of fake news detection. This involves analysis of a corpus of news stories that will be collected for the purpose of this project. Guidance on network science and Linked Data will be provided to help you get started on the project.

Impact of product popularity in media on product sales

Supervisor: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

By analyzing social media discussions about products, we would like to determine whether there is any correlation between the sales or economic success of a particular product and the product's popularity before and after its announcement and sales launch. (Also, in case of a positive correlation, finding the parameters or features that drive success.)

In particular, we are interested in analyzing popular products (mobile phones, computers or video games) using social platforms like Facebook and Twitter, as well as blogs, web pages, etc. This may involve sentiment analysis of the discussions about the products, feature extraction of the products themselves, the use of critics' opinions and ratings, and looking at the sales numbers of the products in question.

Measuring Corporate Reputation through Online Social Media

Supervisor: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Peter Ormosi

When businesses are caught out engaging in illegal or immoral activities, their reputation might suffer. Corporate reputation is a reflection of how a business is regarded by its customers and the public in general. If corporate misbehaviour negatively affects a business’ reputation, customers might switch to rival businesses. For this reason, reputation has a central role in free markets, as it has the potential to deter businesses from misbehaving.

The extent to which corporate wrongdoings trigger a reputational loss is still debated and is the subject of a large body of academic work. Most of these works are based on survey methods to measure reputation. This research relies on a more direct method to measure reputational changes, by conducting a sentiment analysis of how the public reacted on Twitter to some of the most high-profile corporate misconducts. In this particular thesis, corporate reputation will be studied using the Volkswagen (VW) scandal as a case study and the public reaction it created on Twitter. VW’s scandal has been chosen because it has been widely covered over time through both traditional and social media. Moreover, we can measure how changes in media coverage and social media reaction affected VW’s financial performance. The dataset and related literature will be provided for speeding up the work.
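
As a flavour of the kind of sentiment scoring that could be applied to the VW-related tweets, here is a minimal sketch using the off-the-shelf VADER analyser from NLTK. The example tweets are invented; the real dataset will be provided by the supervisors, and the thesis may well use a different sentiment tool.

```python
# Minimal sketch: VADER sentiment scoring of example tweets.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyser = SentimentIntensityAnalyzer()

tweets = [
    "Shocked that VW cheated on emissions tests. Never buying from them again.",
    "Great service at my local VW dealer today, very happy with the car.",
]
for text in tweets:
    # The 'compound' score in [-1, 1] summarises the tweet's overall polarity;
    # aggregating it over time gives a reputation series to compare with media coverage.
    print(analyser.polarity_scores(text)["compound"], text)
```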

Opinion mining of Public data for a health initiative project.

Supervisor: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Ruth Hunter

Opinion mining involves text analysis techniques for gathering sentiments about a certain topic from a corpus. With the advancement of Web 2.0, various social platforms have given people opportunities to express their unbiased opinions on various topics. In this thesis, we are particularly interested in analysing opinion with respect to health-oriented initiatives by the UK government. The thesis will investigate two case studies in particular: 1) Is 20 plenty for health? and 2) the Connswater Community Greenway.

Is 20 plenty for health? – This project involves the implementation of a transport initiative across several sites in the UK, proposing a reduction of speed limits to 20 mph to achieve fewer casualties and lower traffic volumes, leading to an improvement in the perception of safety and a subsequent increase in cycling and walking.

Connswater Community Greenway – This is an urban regeneration project in east Belfast (Northern Ireland), which includes the development of a 9 km linear park and of purpose-built walkways, cycle paths and parks to encourage local residents to be more active and improve their health and wellbeing.

The work will involve: 1) analysing public sentiment, 2) proposing a model for predicting public mood, 3) developing a sentiment package specifically related to public health policy initiatives, and 4) investigating public vs policy-level views, i.e. those who are promoting and implementing the schemes vs those who are using them.

A small dataset will be provided. However, we also expect to collect more data as part of the thesis.

Learning Social Representation using Deep Neural Networks

Supervisor: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Shirin Dora

The catalogue of techniques in machine learning is massive, but recent research in this area has spotlighted the immense potential of deep neural networks for solving many problems. Deep learning is a field of machine learning that involves developing learning algorithms for training neural networks with a large number of layers. Deep neural networks are presented with a real-valued multidimensional representation of an input and, through multiple layers of processing, learn to extract meaningful information from this input.

The focus of this thesis will be the application of deep learning to learning social network representations. A social network is represented as a collection of nodes and the edges which connect these nodes. Each node represents a single member of the network and the edges emanating from this node represent the connections of this member. As a result of this representation mechanism, there is no straightforward way to represent each node using real-valued features. This makes it difficult to use machine learning techniques to deal with problems pertaining to social networks, like network classification, content recommendation, etc. The problem becomes more complex for large social networks.

To overcome this issue, many researchers focus on developing techniques that learn representations for each node using the information stored in the social network. These representations provide a real-valued multidimensional input for nodes in the social network, which can be processed by existing machine learning techniques, and they have been used for various problems in the area of social networks. In this thesis, the goal is to leverage the capabilities of deep neural networks to train a network that simultaneously learns representations and performs a given social-network-related task. This generic approach would involve training the neural network on a particular social network problem without worrying about presenting appropriate representations, as the onus of learning suitable representations lies with the neural network itself.
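
To give an impression of representation learning on graphs, the sketch below follows a DeepWalk-style recipe (random walks fed to word2vec) on a small built-in network; this is not the thesis method itself, only an illustration, and it assumes networkx and gensim 4.x are available.

```python
# DeepWalk-style sketch: random walks over a graph turned into node embeddings.
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()  # small built-in social network used only for illustration
random.seed(0)

def random_walk(graph, start, length=10):
    walk = [start]
    for _ in range(length - 1):
        neighbours = list(graph.neighbors(walk[-1]))
        if not neighbours:
            break
        walk.append(random.choice(neighbours))
    return [str(n) for n in walk]  # word2vec expects token strings

walks = [random_walk(G, node) for node in G.nodes() for _ in range(20)]
model = Word2Vec(sentences=walks, vector_size=32, window=5,
                 min_count=1, sg=1, seed=0, workers=1)
print(model.wv["0"][:5])  # first few dimensions of node 0's learned representation
```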

Analysing Server Logs for Predicting Job Failures.

Supervisor: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

Server logs generally refer to files which are created for monitoring the activities performed on servers. In recent years a lot of research has been done on analysing server logs to assess the status of the jobs or tasks that arrive on servers. In this thesis, you will be analysing logs from a Google cluster, which is a set of machines responsible for running real Google jobs, for example search queries. The research falls in the domain of large-scale predictive analytics. The main contribution of the thesis is a proposed model to predict job failures on servers. A real dataset of Google traces will be provided along with related literature to ramp up the learning process.

Wisdom of the crowd Vs. Expert views regarding movie’s box office results.

Supervisor: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

The term “wisdom of the crowd” refers to the collective opinion of a community or group. In comparison, expert views refer to the views expressed by the experts of a particular domain. In this thesis, you will investigate whether it is the experts or the wisdom of the crowd that can better predict the box office outcome of movies. In particular, you will analyse tweets about movies around the period of their release dates. A small dataset of tweets about various movies will be provided. However, we also expect to expand the analysis by collecting tweets about more movies during the thesis. The thesis involves sentiment analysis of the tweets and, subsequently, the proposal of a model for predicting the box office results of the movies.

Analysing Business Process Event Logs Using Network Science Approaches

Supervisors: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Fabrizio Maggi

Business process management (BPM) deals with improving corporate performance by managing and optimizing a company's business processes. Attempts have been made to understand BPM through network science, where actors performing activities are considered as nodes and a relation is established if two activities are connected or related to each other. In particular, we are interested in exploiting the concept of motifs, i.e. dominant recurrent patterns occurring in a network. In a broad sense, we would like to identify motifs in event logs that allow us to discriminate between "positive" process executions and "negative" process executions. More generally, we can have different types of labeling where we assign a label to each process execution and try to identify motifs that are characteristic of each label.
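
To make the motif idea concrete, the sketch below builds toy "directly-follows" graphs from hypothetical positive and negative traces and compares their triad frequencies; networkx's triadic census is used here only as a stand-in for general motif counting, not as the method the thesis must adopt.

```python
# Sketch: compare small recurring patterns (triads) in graphs derived from labelled traces.
import networkx as nx

# Hypothetical directly-follows pairs extracted from "positive" and "negative" traces.
positive_pairs = [("Register", "Check"), ("Check", "Approve"), ("Register", "Approve")]
negative_pairs = [("Register", "Check"), ("Check", "Register"), ("Check", "Reject")]

G_pos = nx.DiGraph(positive_pairs)
G_neg = nx.DiGraph(negative_pairs)

# Differences in triad frequencies hint at motifs that discriminate between the two labels.
print(nx.triadic_census(G_pos))
print(nx.triadic_census(G_neg))
```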

Case Study on Exploratory Testing

Supervisor: Dietmar Pfahl (dietmar dot pfahl ät ut dot ee)

Exploratory software testing (ET) is a powerful and fun approach to testing. The plainest definition of ET is that it comprises test design and test execution at the same time. This is the opposite of scripted testing (having test plans and predefined test procedures, whether manual or automated). Exploratory tests, unlike scripted tests, are not defined in advance and carried out precisely according to plan.

Testing experts like Cem Kaner and James Bach claim that - in some situations - ET can be orders of magnitude more productive than scripted testing, and a few empirical studies exist supporting this claim to some degree. Nevertheless, ET is often confused with (unsystematic) ad-hoc testing and thus not always well regarded in either academia or industrial practice.

The objective of this project will be to conduct a case study in a software company investigating the following research questions:

  • To what extent is ET currently applied in the company?
  • What are the advantages/disadvantages of ET as compared to other testing approaches (i.e., scripted testing)?
  • How can the current practice of ET be improved?
  • If ET is currently not used at all, what guidance can be provided to introduce ET in the company?

The method applied is a case study. Case studies follow a systematic approach as outlined in: "Guidelines for conducting and reporting case study research in software engineering" by Per Runeson and Martin Höst. Important elements of the thesis are a literature study, measurement, and interviews with experts in the target company.

This project requires that the student has (or is able to establish) access to a suitable software company to conduct the study.

Case Study on Test Automation

Supervisor: Dietmar Pfahl (firstname dot lastname ät ut dot ee)

Similar to the case study project on Exploratory Testing (see above), a student can work in a company to analyse the current state of the practice of test automation. The objective of this project will be to investigate the following research questions:

  • To what extent is test automation currently applied in the company (i.e., what test-related activities are currently automated and how is this done)?
  • What are the perceived strengths/weaknesses of the currently applied test automation techniques and tools?
  • How can the current practice of test automation be improved (i.e., how can the currently automated test process steps be made more productive, and which steps currently done manually are promising candidates for automation)?

The method applied is a case study. Case studies follow a systematic approach as outlined in: "Guidelines for conducting and reporting case study research in software engineering" by Per Runeson and Martin Höst. Important elements of the thesis are a literature study, measurement, and interviews with experts in the target company.

This project requires that the student has (or is able to establish) access to a suitable software company to conduct the study.

Using Data Mining & Machine Learning to Support Decision-Makers in SW Development/Testing/Management

Supervisor: Dietmar Pfahl (firstname dot lastname ät ut dot ee)

Project repositories contain much data about the software development activities ongoing in a company. In addition, there exists much data from open source projects. This opens up opportunities for analysis and learning from the past, which can be converted into models that help make better decisions in the future - where 'better' can mean either more efficient (i.e., cheaper) or more effective (i.e., with higher quality).

For example, we have recently started a research activity that investigates whether textual descriptions contained in issue reports can help predict the time (or effort) that a new incoming issue will require to be resolved.
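
As a flavour of this kind of analysis, one possible first baseline is to vectorise issue descriptions with TF-IDF and fit a simple regressor on resolution time, as sketched below. The example issues and times are invented; the real data would come from the selected project repository, and the thesis is free to use a different learning approach.

```python
# Hedged sketch: TF-IDF text features plus ridge regression on issue resolution time.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

descriptions = [
    "App crashes on startup when the config file is missing",
    "Typo in the about dialog",
    "Memory leak in the background sync service under heavy load",
    "Button label should be capitalised",
]
resolution_hours = [40, 1, 120, 2]  # invented ground-truth effort values

model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(descriptions, resolution_hours)
print(model.predict(["Crash when syncing large files in the background"]))
```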

There are, however, many more opportunities, e.g., analysing bug reports to help triagers assign issues to developers. And of course, there are other documents that could be analysed: requirements, design docs, code, test plans, test cases, emails, blogs, social networks, etc. But not only can the application vary; the analysis approach can vary as well. Different learning approaches may have different efficiency and effectiveness characteristics depending on the type, quantity and quality of data available.

Thus, this topic can be tailored according to the background and preferences of an interested student.

If you are a first year student and planning to do your thesis in 2019/20, it is also possible to combine the thesis project with an ERASMUS Traineeship. Currently, two of my students are doing such a traineeship with industry-oriented research centres in Germany and Austria. In such a setting, the topic must also be negotiated with the receiving research centre. Only top-performing students can get admission to an ERASMUS traineeship opening.

Tasks to be done (after definition of the exact topic/research goal):

  • Selection of suitable data sources
  • Application of machine learning / data mining technique(s) to create a decision-support model
  • Evaluation of the decision-support model

Prerequisite: Students interested in this topic should have successfully completed one of the courses on data mining / machine learning offered in the Master of Software Engineering program.

Using Active Learning and Self-Training for Mining Quality Aspects in App Reviews

Supervisor: Faiz Ali Shah (faizalishah at gmail dot com) and Dietmar Pfahl (dietmar dot pfahl at ut dot ee)

App users provide feedback on quality aspects (e.g., portability, usability, security and performance) of an app in the reviews submitted to the AppStore or PlayStore [1]. Knowledge of this information is valuable for developers for improving app quality, but this information is a needle in a haystack [1], which makes manual extraction impractical. Automatic classification of such information from user reviews using a machine learning approach is an attractive choice, but it requires manual annotation of reviews, which is hard and expensive to obtain. In recent years, active learning and self-training have been used successfully in combination to reduce the cost of manual annotation [2][3]. In the same direction, this study aims at exploring a hybrid of active learning and self-training for developing a multi-class classification model for mining reviews that mention feedback on non-functional aspects of an app.
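
To illustrate the active learning half of the idea, the sketch below runs pool-based uncertainty sampling on a handful of invented reviews (the self-training part is omitted for brevity). In the thesis the pool would consist of real AppStore/PlayStore reviews and the labels would come from a human annotator.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling on app reviews.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pool = [
    "battery drains really fast since the update",   # performance/energy
    "crashes every time I open the camera",          # reliability
    "love the new dark theme",                       # other
    "app is so slow on my old phone",                # performance
    "cannot log in, keeps rejecting my password",    # reliability
    "nice little app",                               # other
]
labels = np.array(["performance", "reliability", "other",
                   "performance", "reliability", "other"])

X = TfidfVectorizer().fit_transform(pool)
labelled = [0, 1, 2]      # indices we pretend an annotator has already labelled
unlabelled = [3, 4, 5]

for _ in range(2):        # two active-learning iterations
    clf = LogisticRegression(max_iter=1000).fit(X[labelled], labels[labelled])
    proba = clf.predict_proba(X[unlabelled])
    # Query the review the current model is least certain about.
    query = unlabelled[int(np.argmin(proba.max(axis=1)))]
    labelled.append(query)            # simulate receiving the annotator's answer
    unlabelled.remove(query)

print("Annotation order chosen by the learner:", labelled)
```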

Literature (starting points): [1] Groen, E. C., Kopczyńska, S., Hauer, M. P., Krafft, T. D., & Doerr, J. (2017, September). Users—The Hidden Software Product Quality Experts?: A Study on How App Users Report Quality Aspects in Online Reviews. In Requirements Engineering Conference (RE), 2017 IEEE 25th International (pp. 80-89). IEEE. URL: https://ieeexplore.ieee.org/abstract/document/8048893/

[2] Borg, M., Lennerstad, I., Ros, R., & Bjarnason, E. (2017, June). On Using Active Learning and Self-Training when Mining Performance Discussions on Stack Overflow. In Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering (pp. 308-313). ACM. URL: https://dl.acm.org/citation.cfm?id=3084273

[3] Dhinakaran, V. T., Pulle, R., Ajmeri, N., & Murukannaiah, P. K. App Review Analysis via Active Learning. URL: http://www.se.rit.edu/~pkm/doc/Dhinakaran-2018-RE-ActiveAppReview.pdf

Prerequisite: Students interested in this topic should have successfully completed one of the courses on data mining / machine learning offered in the Master of Software Engineering program.

Evaluation of a Toolkit for Energy Code Smell Detection

Supervisor: Hina Anwar (hina dot anwar2003 at gmail dot com) and Dietmar Pfahl (dietmar dot pfahl at ut dot ee)

Every day more and more Android apps are developed and published through stores like the Google Play Store. With the increased use of smartphones and smartphone-related development, mobile app developers are becoming more aware of energy-related problems in apps. However, research on the energy efficiency of mobile apps is still trying to catch up. One relatively new research project is the “Paprika Toolkit”, an open-source graph-based system for automatically detecting and cataloguing energy code smells in Android-based projects. However, the toolkit has been tested on a very limited number of apps and claims to detect only 7 Android-specific energy code smells, when in fact many more kinds of energy code smells exist in real-world apps. Therefore, this study aims at analysing the performance of the Paprika toolkit on other mobile apps, to check whether it can effectively detect the 7 code smells in different kinds of apps, and to suggest improvements in terms of covering new Android-specific energy code smells.

Literature (starting points):

Mining software repositories to understand team performance in agile software development

Supervisor: Ezequiel Scott (ezequiel dot scott at ut dot ee)

Mining software repositories consists in applying techniques to mine data from software repositories in order to leverage development data. Many kinds of repositories are intensively used by developers in today’s settings, such as source control repositories and issue tracking repositories (e.g. Bitbucket, Github, Jira). These repositories contain a wealth of information that can be extracted, analyzed and explored to study several development phenomena. For example, some studies have explored the evolution of projects and the prediction of relevant issues. However, very little attention has been paid to the role of human factors in the data analyzed from software repositories. This is surprising, since human factors are involved in every software development process. The goal of this project is to extract and use data from repositories about software developers in order to analyze their relationship with the team and their performance. We will provide a dataset of several software projects, and your task will be to calculate several performance metrics in the context of agile software development. In addition, you will use simple predictive models and/or statistics to describe the impact of human factors on software development.

References

Recommending issues to developers

Supervisor: Ezequiel Scott (ezequiel dot scott at ut dot ee)

In agile software development, issue allocation is often based on self-assignment. That is, developers choose the issues (e.g. user stories) that they will develop during the sprint according to their own preferences and experience. Industry practice gives some evidence to support this method of issue allocation, but how it takes place is not completely clear yet. As far as we know, developers apply different strategies for self-assigning different types of issues (new feature, enhancement, bug fix). However, applying these strategies to determine which issue to develop can be difficult for inexperienced developers. A recommender system can help those developers choose their issues. The goal of this project is to use features of the developers to recommend issues to them. To this end, you will use basic recommendation and clustering techniques to build a recommender. You will be provided with a dataset with user stories from several agile projects.

Analyzing the quality of User Stories in agile software projects

Supervisor: Ezequiel Scott (ezequiel dot scott at ut dot ee)

Requirements are usually expressed as User Stories in agile software development. Although User Stories are expected to follow a fixed structure (“As <a role> I want to <a feature> in order to <a benefit>”), they are still written using natural language and informal descriptions. Recent research has defined a framework to assess the quality of user stories [1], along with a tool to automatically detect errors in their descriptions [2]. However, this approach has not been applied to datasets that come from real projects. The aim of this project is to explore how the quality of user stories evolves over time in open source projects. In addition, our hypothesis is that low-quality user stories lead to a larger number of bugs than high-quality ones.
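
As a small illustration of the kind of automated check behind such quality frameworks, the sketch below tests whether a story follows the expected "As a ... I want to ... (in order to / so that ...)" template. This regex is a deliberate simplification for illustration, not the AQUSA implementation.

```python
# Sketch: a simplified well-formedness check for the user story template.
import re

TEMPLATE = re.compile(r"^as an? .+?,? i want to .+?( (?:so that|in order to) .+)?$",
                      re.IGNORECASE)

stories = [
    "As a customer, I want to filter orders by date so that I can find old invoices",
    "Fix the login page",                      # not a well-formed user story
]
for s in stories:
    print(bool(TEMPLATE.match(s.strip())), "-", s)
```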

References

  • [1] Lucassen, G., Dalpiaz, F., van der Werf, J.M.E. and Brinkkemper, S., 2016. Improving agile requirements: the quality user story framework and tool. Requirements Engineering, 21(3), pp.383-403.
  • [2] AQUSA Tool https://github.com/gglucass/AQUSA

Code generation in contemporary software development projects

Supervisor: Siim Karus (siim04 ät ut.ee)

A lot of modern software is generated by development tools or even by machine-learning-based methods. The aim of the project is to review the state of the art and practice of automated software development. Note that in this project we would like to look into methods for predicting what the future software will look like. These methods are likely to lead to next-generation code generation or even the development of a software-coder AI. The conclusion of the study should state what (if anything) is missing to create a practically usable AI that could replace some of the developers in a software project.

GPU-accelerated Data Analytics

Supervisor: Siim Karus (siim04 ät ut.ee)

In this project a set of GPU-accelerated data mining or analytics algorithms will be implemented as an extension to an analytical database solution. For this task, you will need to learn parallel processing optimisations specific to GPU programming (balancing between bandwidth and processing power), implement the analytics algorithms, and design a user interface to accompany them. As the aim is to provide extensions to analytical databases (preferably MSSQL, Oracle or PostgreSQL), you will also need to learn the extension interfaces of these databases and their native development and BI tools. Finally, you will assess the performance gains of your algorithms compared to similar algorithms in existing analytical database tools.

GPU-accelerated Developer Feedback System

Supervisor: Siim Karus (siim04 ät ut.ee)

In this project you will implement source code analytics algorithms on the GPU and devise a reliable and fast method for integrating the analysis feedback into integrated development environments (IDEs). For this task, you will need to learn parallel processing optimisations specific to GPU programming (balancing between bandwidth and processing power), implement the analytics algorithms, and design a user interface to accompany them. As the aim is to provide extensions to IDEs (preferably Visual Studio or Eclipse), you will also need to learn the extension interfaces of these IDEs and their native development tools. Finally, you will assess the performance gains of your algorithms compared to implementations of these algorithms running on the CPU.

Replication of Empirical Software Engineering Case Study Experiments

Supervisor: Siim Karus (siim04 ät ut.ee)

The empirical software engineering community publishes many case studies validating different approaches and analytical algorithms for software engineering. Unfortunately, these studies are rarely validated by independent replication. To make matters worse, the studies use different validation metrics, which makes them incomparable. Thus, your mission, should you choose to accept it, is to analyse different published case studies on one topic (e.g. bug detection, code churn estimation) to evaluate their replicability and to replicate the studies in order to make them comparable. In short, you will:

  1. envisage a workflow/pipeline for replicating published studies (including testing for replicability);
  2. use the workflow to replicate several studies;
  3. validate these studies and compare their results on a common scale.

Process Mining meets Data Science: Analyzing Processes Focusing on Data

Supervisor: Fabrizio Maggi (f.m.maggi ät ut dot ee)

The data perspective in the analysis of business processes is crucial. Process analysis is the ideal ground for the application of data science techniques when the focus of the analysis is on process data. This thesis consists in the development of a novel process mining approach strongly oriented towards data and leveraging data science methods. This thesis will include the implementation of a stand-alone Java application. The results of this thesis are suitable for publication in a top conference on Information Systems.

Predictive Process Monitoring: Business Process Management and Data Science to predict the future of a process execution

Supervisor: Fabrizio Maggi (f.m.maggi ät ut dot ee)

Business Process Management is becoming an important application domain for applying Data Science techniques in practical scenarios. This thesis will allow the candidate to become familiar with Machine Learning techniques for building predictive models and to work with concrete case studies to apply these techniques for monitoring real business processes. The thesis consists in the development of a novel technique for predictive process monitoring that is more user-oriented and more effective than state-of-the-art techniques when applied in real scenarios. The thesis will include an implementation in Python. The idea is suitable for publication in a top conference on Information Systems or Data Mining.

Process Mining in Industry

Supervisor: Fabrizio Maggi (f.m.maggi ät ut dot ee)

This thesis consists in a systematic literature review and a survey with industrial partners about the usability of process mining functionality in industry. For this thesis no programming skills are required. The outcome of this thesis is suitable for publication in a journal on Information Systems.

Artificial Intelligence and Business Process Management

Supervisor: Fabrizio Maggi (f.m.maggi ät ut dot ee)

This thesis is about the application of Automated Planning for solving well-known problems in business process analysis. The candidate will become familiar with the most important planners available in the literature and will apply them to solve concrete problems in the context of BPM. The thesis will be supported by an implementation in Java. The thesis is suitable for publication in a top conference on automated planning like ICAPS.

Beyond ProM and Disco: A lightweight tool for advanced process analysis

Supervisor: Fabrizio Maggi (f.m.maggi ät ut dot ee)

This thesis is purely implementation-oriented. It is ideal for people interested in developing a Java tool for process mining with a user-friendly interface. The outcome of this thesis is suitable for the demo session of a conference on Information Systems.

Robotic Automation of University Admission Processes

Supervisor: Marlon Dumas

The admission process for international students at the University of Tartu involves several information systems that are not designed to talk to each other (e.g. DreamApply, SAIS, ÕIS, Urkund, and sometimes also Google Sheets). In order to move data across these systems, secretaries in different departments of the university have to carry out manual, repetitive, and error-prone tasks (especially copy/pasting and searching).

Such manual tasks can be automated using an emerging type of technology called Robotic Process Automation (RPA). These tools allow business users to capture repetitive routines (e.g. moving data from one system into another via copy/paste actions). In this Masters project, you will analyze the existing process for admissions in at least two institutes of the university, and you will assess the possibility of automating parts of these processes (especially routine tasks) using RPA technology. As part of the project, you will review the capabilities of existing RPA tools, and you will implement at least two repetitive tasks using one of these tools in order to demonstrate the feasibility and potential benefits of using RPA in the university admission processes. This initial feasibility study and benefit assessment would be used as a basis for preparing a business case for the use of RPA technology to (partially) automate the admission process.

Given that RPA tools are meant to be used by business people, this project does not require any software development skills. However, knowledge of business process modeling and process automation (using a BPMS) would be very useful as a starting point. The topic is suitable for non-IT students (e.g. students in the Innovation and Technology Masters) but also for IT students (Computer Science or Software Engineering). In the case of an IT student, the emphasis would be on the automation part (i.e. developing RPA scripts for automating parts of this process). In the case of non-IT students, the emphasis would be on the feasibility and benefit assessment, and on the business case development.

This Masters thesis project will be done in cooperation with people involved in the university admission process. You will be expected to understand the process in detail, to analyze the (repetitive) work that people involved in this process perform, and to show how it can be (partially) automated.

Reasoning About Uncorrelated Event Logs

Supervisor: Marlon Dumas (marlon dot dumas ät ut dot ee) and Ahmed Awad

Note: This Masters thesis topic is inspired by a real-life problem found in the Estonian X-Road Infrastructure. The technique we will develop in this Masters thesis project will be applied to real-life datasets. You will be expected to spend at least 5 months full-time working on this thesis (e.g. January-May) and you will have office space in Liivi 2 to work together with other full-time Masters students and with the supervisors. A compensation of 400 euros/month will be offered to you if you commit to working on this project for a significant number of hours per week.

Process mining has been an active research topic and a relevant industrial practice for almost a decade. In its simplest form, process discovery, process mining can be seen as reverse-engineering the actual processes taking place within an organization into a process model. The main input to a process mining approach is the so-called process execution log, aka event log. An event log is a collection of traces, aka cases. Each case is a finite sequence of events. An event is a manifestation of a state change in an activity within a process instance. Most process discovery approaches assume at least three pieces of information to be available within each event: a case identifier (case ID), an activity identifier and a timestamp. The case ID is required to correlate events belonging to the same case and thus to define which activities have been executed within a certain process instance.

In some scenarios, the "case ID" is not necessarily known beforehand. In other words, the events in the log are uncorrelated (i.e. the relation between events and cases is not known, for example because the case identifier is not known when the events are being recorded). In such situations, process mining approaches cannot be applied, as we cannot form traces by correlating events. A few approaches have investigated the issue of reasoning about unlabeled events. In [1] and [2], the approaches generate a process model from the unlabeled log. However, they assume that the process model generating these unlabeled events is acyclic. The approach in [3] relaxes this assumption but requires instead knowledge about the process model behind the execution as well as execution heuristics about the individual tasks. The output of [3] is a set of labeled event logs, each with a probability. In case unlabeled events do not fit the knowledge about the process model or the execution heuristics, the approach generates incomplete labeled logs.

In this research we will follow the latter approach of generating a set of labeled logs out of the unlabeled one. However, we need to provide flexibility with respect to the a priori knowledge about the process. That is, we must not assume full knowledge about the process nor about its runtime characteristics. Yet, the more information (constraints) the user provides, the fewer candidate labelings we should generate and the closer to reality they should be.

We (the supervisors) have very concrete ideas on how to approach this problem using a method called constraint satisfaction. We have written some initial Python scripts that show that our ideas can work on real-life logs. We need a highly motivated student to learn this method and to develop and test it much further. Our ambition is to produce a tool that solves this problem in a robust manner and can be used in practical applications.
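To illustrate the general idea (and only as a hypothetical sketch, not our actual formulation), the following example uses the python-constraint library to enumerate candidate labelings of four uncorrelated events, under one assumed piece of domain knowledge: every case consists of a "Register" event followed by a "Pay" event.

    from constraint import Problem

    # Uncorrelated events, already ordered by timestamp (index = position in time).
    events = ["Register", "Register", "Pay", "Pay"]
    cases = [1, 2]                       # assume we know there are two cases

    problem = Problem()
    for i in range(len(events)):
        problem.addVariable(i, cases)    # variable i = the case that event i belongs to

    def valid_cases(*assignment):
        # Every case must contain exactly one "Register" followed by one "Pay".
        for c in cases:
            acts = [events[i] for i, case in enumerate(assignment) if case == c]
            if acts != ["Register", "Pay"]:
                return False
        return True

    problem.addConstraint(valid_cases, list(range(len(events))))
    print(problem.getSolutions())
    # Each solution maps event index -> case ID; every solution is one candidate labeled log.

Each additional constraint the user supplies prunes this solution space further, which is exactly the flexibility described above.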

References

  1. Diogo R. Ferreira, Daniel Gillblad: Discovering Process Models from Unlabelled Event Logs. BPM 2009: 143-158
  2. Shaya Pourmirza, Remco M. Dijkman, Paul Grefen: Correlation Miner: Mining Business Process Models and Event Correlations Without Case Identifiers. Int. J. Cooperative Inf. Syst. 26(2): 1-32 (2017)
  3. Dina Bayomie, Ahmed Awad, Ehab Ezat: Correlating Unlabeled Events from Cyclic Business Processes Execution. CAiSE 2016: 274-289
  4. Amine Abbad Andaloussi, Andrea Burattin, Barbara Weber: Toward an Automated Labeling of Event Log Attributes. Enterprise, Business-Process and Information Systems Modeling, Springer, 2018: 82-96

Deviance Analysis Using Redescription Mining

Supervisors: Marlon Dumas (marlon dot dumas ät ut dot ee) and Fabrizio Maggi

Business process deviance refers to the phenomenon whereby a subset of the executions of a business process deviate, in a negative or positive way, with respect to the expected or desirable outcomes of the process. Deviant executions of a business process include those that violate compliance rules, or executions that undershoot or exceed performance targets. For example, in an order-to-cash process, the process instances (cases) that end up in a cancellation of the purchase order can be said to be "deviant". Those that end up in a correct delivery of the order and its payment are considered to be "normal".

Deviance analysis is concerned with uncovering the reasons for deviant executions by analyzing business process execution logs. Existing techniques for deviance analysis suffer from the fact that the output they produce is not easily interpretable. For example, when applied to real-life datasets, some of these techniques produce hundreds of rules, each one capturing one possible cause for the deviant cases. Such large sets of rules are difficult to understand.

In this Masters project, you will develop and evaluate an alternative technique for deviance analysis based on an emerging data mining technique called Redescription Mining. The main outcome of the project will be a tool that takes as input two event logs (the log of the normal cases and the log of the deviant cases) and that produces as output readable statements explaining how the deviant cases differ from the normal cases, using redescription mining techniques. You will try out at least two different redescription mining tools, for example CLUS-RM and SIREN, and you will compare their performance using real-life business process execution logs. The project requires basic Python programming skills, some basic knowledge of business process management, and basic knowledge of data mining.

Learning Working Style as Images: A Deep Convolutional Neural Network Approach

Supervisors: Marcello Sarini (firstname.lastname [ät] unimib.it) and Marlon Dumas

Working Style is a relatively new concept in Business Informatics, although closely related to concepts existing in other research areas such as Organizational Mining, Pattern Recognition, and Visual Analytics. Working style is about characterizing the nature of work, focusing especially on the interdependencies among human actors in performing the activities related to the unfolding of a business process. In this view, working style makes visible the choices made by performers within the constraints posed by the workplace. This is crucial when the organization of work is supported by Process-aware Information Systems, because these systems pose some limits on the possible choices made by human actors while supporting the unfolding of a business process. It would therefore be useful, in different situations, to identify working style in order to make visible how people arrange their work duties in the presence of such technologies.

The aim of the project is to implement a tool for learning, predicting, and classifying working style by using deep learning approaches, in particular Convolutional Neural Networks (CNNs), to analyze working style as an image generated from log files.

The tool will provide three main functionalities:

  1. the management of the log file and its transformation into a suitable intermediate database structure;
  2. the definition and training of a CNN to learn, predict, and classify working style;

It is expected that the tool will be implemented using the following technologies:

  • Python as the main programming language;
  • Flask as the Python Web development framework;
  • Neo4j as the database (*);
  • a deep learning library (TensorFlow preferred).

(*) Neo4j is a graph database, i.e. it falls under the umbrella of so-called NoSQL database technologies. The choice of this database is driven by the fact that its query language, Cypher, is oriented towards the identification of patterns within the graph database, and the identification of working style is closely related to the identification of patterns.
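As a hedged illustration of the deep learning part (not the thesis design), the sketch below defines a small CNN with TensorFlow/Keras that classifies working-style images derived from a log; the image size (64x64, one channel) and the number of style classes (4) are assumptions made only for the example:

    import tensorflow as tf

    # Small CNN over 64x64 single-channel "working style" images.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 1)),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(4, activation="softmax"),   # one output per working-style class
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(images, labels, epochs=10)   # images shaped (n, 64, 64, 1), labels in 0..3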

Prerequisites: strong programming skills. Students interested in this topic should already know the Python programming language or should be interested in learning it. Basic knowledge of the Neo4j graph database is preferred. Knowledge of at least one deep learning library, such as TensorFlow, is also expected.

References:

  1. Sarini, M. "Can Working Style Be Identified?" (2017). Available at: http://ceur-ws.org/Vol-1898/paper1.pdf
  2. Xiaolei Ma et al. "Learning Traffic as Images: A Deep Convolutional Neural Network for Large-Scale Transportation Network Speed Prediction" (2017), Sensors (Vol 17)
  3. David Mack "Review prediction with Neo4j and TensorFlow" Available at: https://medium.com/octavian-ai/review-prediction-with-neo4j-and-tensorflow-1cd33996632a

Analyzing Organizational Mining from a Visual Analytics Perspective

Supervisors: Marcello Sarini (firstname.lastname [ät] unimib.it) and Marlon Dumas

The aim of the project is to test a tool (the Working Style Visual Analytics tool) implemented to identify working style from a visual analytics perspective, and to frame it within Organizational Mining research.

Working Style (1) is a relatively new concept in Business Informatics, although closely related to concepts existing in other research areas such as Organizational Mining, Pattern Recognition, and Visual Analytics. Working style is about characterizing the nature of work, focusing especially on the interdependencies among human actors in performing the activities related to the unfolding of a business process. In this view, working style makes visible the choices made by performers within the constraints posed by the workplace. This is crucial when the organization of work is supported by Process-aware Information Systems, because these systems pose some limits on the possible choices made by human actors while supporting the unfolding of a business process. It would therefore be useful, in different situations, to identify working style in order to make visible how people arrange their work duties in the presence of such technologies.

The main characteristic of the Working Style Visual Analytics tool is that it allows the user to analyze working style from a visual analytics perspective: the user is provided with a visual language to look for patterns within data gathered from a log file and represented as an image.

In particular, the student is asked to:

  • read the most relevant literature from the Organizational Mining area, starting from (2);
  • identify the most common techniques, approaches, results, and measures employed in the Organizational Mining area;
  • choose a real (large) event log file;
  • identify tools provided by Organizational Mining research and compare the results obtained from the Organizational Mining perspective with the results obtained using the Working Style Visual Analytics tool;
  • design (and possibly implement) measures related to the Organizational Mining perspective on top of the Working Style Visual Analytics tool.

References:

  1. Sarini, M. (2017) "Can Working Style Be Identified?". In Pre-BIR 2017 Forum Papers (pp. 1-8).
  2. Zhao W., Zhao X. (2014) Process Mining from the Organizational Perspective. In: Wen Z., Li T. (eds) Foundations of Intelligent Systems. Advances in Intelligent Systems and Computing, vol 277. Springer, Berlin, Heidelberg

Implementing a graph-based tool to identify working style

Supervisors: Marcello Sarini (firstname.lastname [ät] unimib.it) and Marlon Dumas

Working Style is a relatively new concept in Business Informatics, although closely related to concepts existing in other research areas such as Organizational Mining, Pattern Recognition, and Visual Analytics. Working style is about characterizing the nature of work, focusing especially on the interdependencies among human actors in performing the activities related to the unfolding of a business process. In this view, working style makes visible the choices made by performers within the constraints posed by the workplace. This is crucial when the organization of work is supported by Process-aware Information Systems, because these systems pose some limits on the possible choices made by human actors while supporting the unfolding of a business process. It would therefore be useful, in different situations, to identify working style in order to make visible how people arrange their work duties in the presence of such technologies.

The aim of the project is to implement a graph-based tool to support the identification of working style. It is expected that the output of the Masters thesis will become a publicly available tool, offered on a software-as-a-service basis.

The tool will focus on the use of a graph database and graph algorithms to support the identification and analysis of working style. It will provide three main functionalities:

  1. the management of the log file and its transformation into a suitable intermediate graph database structure;
  2. the management of the artifact representing the working style: its creation, and the identification of patterns, with a special emphasis on graph-based tools and techniques;
  3. the visualization of the artifact: the visualization of the patterns within the artifact and the visual comparison of different artifacts, taking into account the peculiarities of graph-based visualization.

It is expected that the tool will be implemented by using the following technologies: Python as the main programming language; Neo4j as the database (*); Flask as the Python Web development framework; and a graph visualization library such as Neovis.js.

(*) Neo4j is a graph database, i.e. it falls under the umbrella of so-called NoSQL database technologies. The choice of this database is driven by the fact that its query language, Cypher, is oriented towards the identification of patterns within the graph database, and the identification of working style is closely related to the identification of patterns.
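To illustrate why this pattern orientation helps (purely as a sketch; the node label, relationship type and property names are assumptions about the intermediate graph structure, not a fixed schema), the following snippet uses the official neo4j Python driver to count pairs of consecutive events performed by the same person, one simple building block of a working style:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Cypher expresses the search as a graph pattern rather than as joins.
    query = """
    MATCH (a:Event)-[:FOLLOWS]->(b:Event)
    WHERE a.performer = b.performer
    RETURN a.performer AS performer, a.activity AS first, b.activity AS second, count(*) AS freq
    ORDER BY freq DESC
    """

    with driver.session() as session:
        for record in session.run(query):
            print(record["performer"], record["first"], "->", record["second"], record["freq"])

    driver.close()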

Prerequisites: strong programming skills. Students interested in this topic should already know the Python programming language or should be interested in learning it. Students are also expected to put in the effort to learn the Neo4j technology and the Cypher query language.

Reference

Sarini, M. "Can Working Style Be Identified?" (2017).

Design and Implementation of Computational Models for Learning Analytics at the University of Tartu

Supervisors: Irene-Angelica Chounta and Marlon Dumas (marlon dot dumas ät ut dot ee)

Learning Analytics (LA) is a human-centered design discipline that employs computational methods to explore data traces originating from learning activities in order to promote learning by providing meaningful feedback. The aim of Learning Analytics is threefold: a) to help students improve their learning outcomes by scaffolding self-reflection, self-regulation and motivation, b) to support teachers in orchestrating learning activities and providing appropriate scaffolding for students and c) to assist researchers in uncovering underlying mechanisms of the learning process and determining the impact technology, context and other factors have on the ways people learn.

In this thesis, we aim to design computational models for the assessment of students’ academic performance (for example, successfully completing a course), identification of risks (such as, missing deadlines) and prevention of failures (such as, drop-outs) using Learning Analytics and Machine Learning for students in Higher Education. We will use existing data about students’ demographics, academic background and current practice in order to implement such computational models and to integrate them in a Learning Analytics infrastructure that aims to support stakeholders from the University of Tartu.
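As a minimal sketch of what such a computational model could look like (the feature names, the CSV file and the choice of logistic regression are illustrative assumptions, not the agreed design), one could start from a simple classifier over student records:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("students.csv")            # hypothetical export of student data
    features = data[["prior_gpa", "credits_attempted", "logins_per_week", "deadline_misses"]]
    target = data["passed_course"]                # 1 = completed successfully, 0 = failed or dropped out

    X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # The predicted probabilities could feed dashboards for students and teachers.
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))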

Segmented Process Model Generation from Software Logs

Supervisors: Fredrik Milani (milani ät ut dot ee) and Fabrizio Maggi

Process mining allows for extracting process models from execution logs of software systems. There are many algorithms available that do this. However, most of them create one single model capturing all observed behavior and, as such, produce models that are very difficult to understand. The understandability of such models can be improved by segmenting them into several models. Each model then shows certain traces of the process, selected based on, for instance, frequency. Furthermore, the models can be made more understandable if their complexity is below empirically established threshold values. In this thesis, you will take an existing algorithm for process discovery with BPMN and enhance it. The first enhancement is to allow the generation of process models based on frequency of occurrence (a filtering function). The second is to automatically segment the process model into several models based on frequency, main variants, and complexity.
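The following sketch shows the kind of frequency-based filtering this implies, under the assumption (made only for the example) that a segment is defined by the most frequent trace variants covering a chosen share of the log:

    from collections import Counter

    traces = [
        ("A", "B", "C"), ("A", "B", "C"), ("A", "B", "C"),
        ("A", "C"), ("A", "C"),
        ("A", "B", "B", "C"),
    ]

    variant_counts = Counter(traces)
    total = sum(variant_counts.values())

    coverage_threshold = 0.8        # keep the most frequent variants covering 80% of the cases
    kept, covered = [], 0
    for variant, count in variant_counts.most_common():
        if covered / total >= coverage_threshold:
            break
        kept.append(variant)
        covered += count

    print("Variants for the 'main behaviour' segment:", kept)
    # The remaining, infrequent variants would be discovered as separate segment models.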

Automated Identification of Parameters for Deviance Mining

Supervisors: Fredrik Milani (milani ät ut dot ee) and Fabrizio Maggi

Within the domain of process mining, it is possible to compare the execution paths of different outcomes of a process. For instance, a process for handling claims can have both slow cases and fast cases. The slow cases can be annotated as such and separated from the fast cases. Once this is done, one can compare the execution of slow versus fast cases and identify the differences. This is referred to as deviance mining. However, in order to perform deviance mining, one has to know what to compare, such as, in this case, slow versus fast cases. There are other aspects that can be compared as well, such as cheap versus expensive cases, or negative versus positive outcomes. These parameters have to be identified and set manually. However, at times one does not know what parameters to consider, and it is therefore helpful to have an automated way of discovering potentially interesting parameters to compare from a log. In this thesis, you will develop an algorithm that takes a log and automatically detects which parameters might be relevant for deviance mining. This is done by considering which aspects of the execution are sufficiently different and therefore potentially interesting. You will develop this algorithm and also consider how to visualize the results of the analysis.
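As a minimal sketch of the idea (the attribute names and the median split are illustrative assumptions), one could split the cases on a candidate performance measure and inspect which case attributes differ most between the two groups:

    import statistics

    cases = [
        {"duration_days": 2,  "channel": "online", "amount": 120},
        {"duration_days": 3,  "channel": "online", "amount": 200},
        {"duration_days": 14, "channel": "paper",  "amount": 150},
        {"duration_days": 20, "channel": "paper",  "amount": 90},
    ]

    median = statistics.median(c["duration_days"] for c in cases)
    fast = [c for c in cases if c["duration_days"] <= median]
    slow = [c for c in cases if c["duration_days"] > median]

    for attribute in ("channel", "amount"):
        print(attribute,
              "fast:", [c[attribute] for c in fast],
              "slow:", [c[attribute] for c in slow])
    # Attributes whose distributions differ sharply between the groups (here: channel)
    # are candidate parameters for a deviance mining comparison.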

Business Rule Modeling Languages – Systematic Literature Review

Supervisors: Fredrik Milani (milani ät ut dot ee) and Fabrizio Maggi

Process modelling languages such as BPMN, Petri nets, UML ADs, EPCs and BPEL are very useful in environments that are stable and where the decision procedures can be predefined. Participants can be guided based on such process models. However, they are less appropriate for environments that are more variable and that require more flexibility. Consider, for instance, a physician in a hospital confronted with a variety of patients that need to be handled in a flexible manner. Nevertheless, there are some general regulations and guidelines to be followed. In such cases, business rules are more effective than imperative process models. In comparison to imperative approaches, which produce “closed” models (what is not explicitly specified is forbidden), declarative languages are “open” (everything that is not forbidden is allowed). In this way, models offer flexibility and still remain compact. There are several types of business rules in the literature that can be used to describe a business process. This thesis aims at providing a framework to classify them. The possible types of business rules will be harvested based on a systematic literature review.

Process Mining for Conformance – Systematic Literature Review

Supervisors: Fredrik Milani (milani ät ut dot ee) and Fabrizio Maggi

One of the use cases of process mining is conformance checking. Such analyses compare process models with a log of the same process. With their aid, one can assess whether the model conforms to the execution logs and vice versa. Research within this area has grown, and this thesis aims at conducting a systematic literature review and developing a framework. The work required includes developing a research protocol, searching for and selecting papers on conformance checking, extracting data, and analyzing the final list of papers to produce a framework.

Measuring Business Process Performance Flexibility

Supervisors: Fredrik Milani (milani ät ut dot ee) and Fabrizio Maggi

Business processes can be assessed from the perspective of time, cost, quality, and flexibility (also known as the devil's quadrangle). Process mining techniques can be used to assess, based on the execution log, the time, cost, and quality metrics of a process's performance. However, there are no process mining tools to measure the flexibility of a process. This thesis addresses this limitation. The work will include a review of the literature to define aspects of flexibility and the development of an algorithm that can measure the flexibility of a process based on its event log.
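One candidate metric, sketched here only as an assumption (the thesis would define and justify its own), is the normalized entropy of the distribution of trace variants in a log: a log where every case follows the same path scores 0, while a log where every case is unique scores 1.

    import math
    from collections import Counter

    def variant_entropy(traces):
        counts = Counter(tuple(t) for t in traces)
        n = sum(counts.values())
        probs = [c / n for c in counts.values()]
        entropy = sum(-p * math.log2(p) for p in probs)
        max_entropy = math.log2(n) if n > 1 else 1.0
        return entropy / max_entropy

    rigid = [["A", "B", "C"]] * 10
    flexible = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"], ["C", "A", "B"]]
    print(variant_entropy(rigid))      # 0.0 -- no flexibility at all
    print(variant_entropy(flexible))   # 1.0 -- every case takes a different path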

Online Process Mining Tool

Supervisors: Fredrik Milani (milani ät ut dot ee) and Luciano Garcia Banuelos

As part of a project, simple process mining algorithms were produced together with a dashboard. The project is concluded but there is room for improvement of the online tool. This thesis is about further developing the tool to accommodate additional types of process mining use cases and improving the dashboard with novel ideas. This is an implementation thesis.

A Comparative Evaluation of Smart Contract Development Languages

Supervisors: Fredrik Milani (milani ät ut dot ee) and Luciano Garcia Banuelos

Smart contracts are an essential part of blockchain solutions. Over the past years, an impressive number of new blockchain platforms and development languages have emerged. For instance, Ethereum has grown, Corda has been introduced, Hyperledger Fabric is maturing, and Microsoft Azure Workbench is gaining traction. These blockchain solutions use different languages for coding smart contracts. This thesis is about finding the most used smart contract development languages and comparing them. The comparison will cover different aspects such as strengths and weaknesses, support, applicability, etc.

A Prototype for Wood Provenance in Hyperledger Fabric

Supervisors: Fredrik Milani (milani ät ut dot ee) and Luciano Garcia Banuelos

Provenance is one of the bigger use cases for blockchain technology. Hyperledger Fabric has grown and become one of the strongest platforms for commercial blockchain solutions. In this thesis, you will build a prototype in Fabric for provenance. This means that the solution should include a minimum viable product (MVP) for enabling provenance of wood. The work will most likely be conducted with a third-party partner.

Parking Solution on the Blockchain (EOS)

Supervisors: Luciano García-Bañuelos (luciano.garcia [ät] ut dot ee) & Fredrik Milani

Blockchain technology has received a tremendous amount of attention from industry. The decentralized attributes of distributed ledger technology (DLT) are being investigated. Several platforms, such as Hyperledger, the Ethereum Alliance, Corda, and EOS, are available as open source for exploring and for developing new applications. In this thesis, you will build a prototype of a parking solution on EOS. The parking use case allows those who have parking space to rent it out to those who need it. The solution is to be built on EOS and evaluated.

Blockchain and IoT Systems

Supervisors: Raimundas Matulevičius (contact: <rma@ut.ee>) and Satish Srirama

The Internet of Things (IoT) is a network of connected devices and systems that exchange or accumulate data and information generated by users of, and sensors embedded in, physical objects. Blockchain technologies have been named as a facilitator for IoT systems. However, there exist a number of difficulties and open questions [1] on how to integrate IoT systems and blockchain applications. To answer this research challenge, the following steps need to be performed:

  1. Understand the principles of the IoT and blockchain technologies and their integration;
  2. Explore (both analytically and through a hands-on application) how these technologies can be integrated;
  3. Validate the research results in the empirical settings.

[1] Reyna A., Martin C., Chen J., Soler E., Diaz M.: On blockchain and its integration with IoT. Challenges and opportunities. Future Generation Computer Systems, vol. 88, 2018, pages 173-190

Comparison of Requirements Modelling Languages for Blockchain Applications

Supervisor: Raimundas Matulevičius (contact: <rma@ut.ee>)

Nowadays, blockchain has become an emerging technology for developing distributed and collaborative information systems. There exist several platforms (e.g., Ethereum, Hyperledger, etc.) which provide means to implement these applications. However, the understanding and systematic definition of requirements for blockchain applications is still underdeveloped. There exist different modelling languages which can be used for system requirements engineering; the question, however, is whether these languages can capture the principles of blockchain applications. The goal of this thesis is to perform a systematic analysis and comparison of requirements modelling languages and to assess their suitability for modelling blockchain applications.

The following tasks should be performed to achieve the purpose:

  1. Explain principles of the blockchain application development;
  2. Select, compare, and discuss the requirements modelling languages;
  3. Validate comparison results.

Model-driven Development of Blockchain Applications

Supervisor: Raimundas Matulevičius (contact: <rma@ut.ee>)

Model-driven development plays an important role in automating software application development. Blockchain applications are not an exception, but currently there exists no approach for developing these applications using model-driven principles. The goal of this Masters thesis is to define principles which would allow the transformation of blockchain system models into running blockchain applications.

The following tasks should be performed to achieve the purpose:

  1. Understand the key components of a blockchain application language;
  2. Select a modelling language for modelling blockchain applications;
  3. Define model-driven principles (transformation rules to translate a blockchain application model into an executable blockchain application);
  4. Validate the transformation rules.

Enterprise Security Risk Management: Case Study

Supervisor: Raimundas Matulevičius (contact: <rma@ut.ee>)

Security risk management plays an important role when developing and maintaining enterprise systems. Different domains might require different approaches [1][2]. The goal of this Master thesis is to explore a selected application domain (mobility and automotive, global trade, supply chains, eHealth, smart energy, etc.) and to understand the existing processes, their security risks, and the mitigation means to reduce these risks.

[1] Matulevicius R., Norta A., Samarutel S.: Security Requirements Elicitation from Airline Turnaround Processes. BISE 60(1), 3-20 (2018)
[2] Abasi-Amefon Affia: Security Risk Management of E-commerce Systems. Master thesis, University of Tartu, 2018

A Selection Framework for Business Process Compliance Checking Approaches

Jake Tom (jaketom [ät] ut dot ee)

There exist several approaches to Business Process Compliance Checking (BPCC) that help business analysts ensure that their organizations are compliant with legal regulations. These include both forward and backward compliance checking methods. While the former is focused on compliance checking at the design phase or at run-time, the latter is a post-execution evaluation paradigm. A criticism of the existing approaches is that many of them are not practically feasible. In this thesis, you will carry out a comparison of the different BPCC approaches, particularly focused on backward compliance checking, assess them for practical feasibility, and develop guidelines for their selection based on their suitability to different scenarios. This could also lead to an understanding of what is missing from the current state of the art. You will then demonstrate the effectiveness of your guidelines by applying two of the approaches to a case study to support the selection framework you propose. The case study will be based on one (or more) business process(es) captured within an organization you may be working in or have familiarity with.

Starting points:

[1] El Kharbili, M., Alves De Medeiros, A.K., Stein, S., Van Der Aalst, W.M.P. Business process compliance checking: Current state and future challenges (2008) Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI), P-141, pp. 107-113.

[2] Becker, J., Eggert, M., Schwittay, S. How to evaluate the practical relevance of business process compliance checking approaches? (2012) Multikonferenz Wirtschaftsinformatik 2012 - Tagungsband der MKWI 2012, pp. 849-861.

Additional Masters Theses Topics

Sequence-Aware Recommender System for E-Commerce

Kristjan Eljand (STACC OÜ)

STACC has built an AI-based recommendation system that helps to guide web users towards the items that are relevant to them. For example, it shows you newspaper articles that fit your profile, recommends items that you might want to buy, or suggests job openings that might fit your profile. In this project, you will have an opportunity to enhance the current solution.

The current solution takes into account the history of items that a customer has visited or purchased (and other relevant events). However, it does not take into account the timestamps or the sequence in which these events have taken place.

The sequence of user events makes a difference when recommending items to a user. Example 1: if a user is clicking on a towel, then (a) when the previous clicks were on kitchen items, the recommendations should over-weight kitchen items, and (b) when the previous clicks were on bathroom items, those should be over-weighted. Example 2: if the recommendation model is able to look into the history, it might detect periodic patterns in user behavior, such as "the average runner buys a new pair of running shoes once per year". This information could then be used for recommendations.

In this project, you will develop a method for creating a sequence-aware recommendation model, which means a model that is able to make recommendations based on what the user did during the whole web-session or several sessions.
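As a hedged baseline sketch (not STACC's actual system), a first-order Markov model already captures the context effect from Example 1: it recommends the items most often clicked right after the user's current item, here over invented session data:

    from collections import Counter, defaultdict

    sessions = [
        ["towel", "soap", "bath_mat"],
        ["towel", "bath_mat"],
        ["pan", "towel", "kitchen_towel"],
    ]

    # Count item-to-item transitions across all sessions.
    transitions = defaultdict(Counter)
    for session in sessions:
        for current_item, next_item in zip(session, session[1:]):
            transitions[current_item][next_item] += 1

    def recommend(current_item, k=2):
        return [item for item, _ in transitions[current_item].most_common(k)]

    print(recommend("towel"))   # e.g. ['soap', 'bath_mat'] -- depends on the current context item

Longer-range patterns, such as the yearly running-shoe purchase in Example 2, would require richer sequence models, e.g. higher-order Markov chains or recurrent neural networks.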

The output will be open-source. The STACC team will guide you throughout the project so that you have a clear understanding of how our recommendation system and API work and how the model should be built.

You will have a part-time workplace in STACC's office in Tartu. We plan to pay you a monetary compensation if you are willing to engage in this project a significant number of hours per week.

Prescriptive Business Process Monitoring System

Kristjan Eljand and Marlon Dumas (STACC OÜ)

The aim of this project is to create a software tool that assists employees working on a business process (such as an order-to-cash or invoice-to-payment process) to preempt undesired business outcomes such as delayed deliveries, customer complaints, unpaid invoices, etc. The system will be built on top of an existing system for predictive monitoring of business processes, namely Nirdizati. The current Nirdizati system estimates the probability of a negative outcome (e.g. a customer complaint) during the execution of a business process, but it does not do anything more than predicting. In this Masters project, you will extend the Nirdizati system so that it not only makes predictions but also raises alarms and makes recommendations to employees working in the business process, in order to help them prevent negative outcomes while minimizing the amount of overhead (i.e. optimizing costs). During this project, we will offer you technical guidance on how to extend the Nirdizati system and we will explain the techniques you need to use to generate alarms and recommendations.
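As a hedged sketch of the kind of alarm logic such an extension could add on top of the predictions (the cost figures, the effectiveness factor and the decision rule are illustrative assumptions, not the Nirdizati design):

    # Raise an alarm only when intervening is expected to be cheaper than doing nothing.
    COST_OF_INTERVENTION = 50        # e.g. an employee calls the customer
    COST_OF_NEGATIVE_OUTCOME = 400   # e.g. an unpaid invoice written off
    EFFECTIVENESS = 0.7              # share of negative outcomes an intervention prevents

    def should_alarm(predicted_probability):
        expected_saving = predicted_probability * COST_OF_NEGATIVE_OUTCOME * EFFECTIVENESS
        return expected_saving > COST_OF_INTERVENTION

    for p in (0.05, 0.15, 0.40):
        print(p, should_alarm(p))
    # 0.05 False, 0.15 False, 0.40 True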

The goal of the project is to produce a software tool that can be used in practice. We will test it on real scenarios and real data.

The project requires basic knowledge of data mining/machine learning, knowledge of at least one modern Javascript framework (e.g. Vue.js, React, Angular, etc.) and willingness to engage and to learn fast.

We will pay you a monetary compensation of up to 750 euros/month (full-time-equivalent) if you are willing to engage in this project for more than 20 hours per week.

Development of software for chromatographic method validation

Koit Herodes (koit dot herodes [ät] ut dot ee), Asko Laaniste (asko dot laaniste [ät] ut [dot] ee), and Marlon Dumas (marlon dot dumas [ät] ut dot ee)

The aim of this project is to create a software tool which assists users in validating chromatographic analysis methods and would be useful in analytical chemistry laboratories all over the world. Chromatography is a technique of chemical analysis. When a chromatographic method is developed for a certain analysis (e.g. analysis of alcohol in blood), evidence must be provided that the method really works as intended, i.e. the method has to be validated. For validation, a set of analyses must be carried out and, based on the results, validation characteristics are calculated.

The planned software product would help analytical chemists to 1) plan the experiments; 2) calculate validation characteristics based on entered (or imported) experimental results; and 3) generate validation reports.

Analytical chemists are developing a mock-up of the software: http://lisa.chem.ut.ee/~koit/valc/

Software development should add functionality such as user rights, a database (e.g. MySQL), data import, reporting, an interface to a statistical calculation package (preferably R; scripts for calculating and generating data and graphs will be written by the analytical chemists), front-end development based on the mock-up, etc.

We plan to pay you a monetary compensation if you are willing to engage in this project a significant number of hours per week.


Additional topics proposed by other groups in the Institute of Computer Science are available here.


Topics for IT Conversion Masters Theses (15 ECTS)

What happens to all those hackathon projects?

Alex Nolte (alexander[dot] nolte [ät] ut [dot] ee)

Hackathons started out as time-bounded competitive events during which young developers formed small ad-hoc teams and engaged in short-term intense collaboration on software projects for pizza and sometimes the prospect of a future job. Since those humble beginnings hackathons have become a global phenomenon with the largest hackathon league alone organizing 200 collegiate events with more than 65.000 participants every year (MLH).

During such events participants create an amazing variety of ideas and innovative software products. This master project aims to assess what happens with those projects after a hackathon is over and the winners have been announced. In this master project you will thus focus on the following research question:

RQ: Which hackathon projects get continued and what are potential reasons for their (dis-)continuation?

Using a combination of qualitative and quantitative research techniques you will start your investigation from a dataset that covers more than 2000 hackathons over the past 5 years (Devpost). Most projects in the dataset are connected to a Github repository which not only allows you to track the progress of this project before and after the hackathon but also enables you to contact participants if necessary.

Hackathons as catalysts for future job opportunities

Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee), Alex Nolte, Irene-Angelica Chounta

Hackathons are often perceived as events during which participants can expand their personal networks and develop or showcase their skills for future job opportunities. It is thus common for participants to participate in several hackathons that cover different themes and that take place in different locations.

The goal of this master thesis is to develop an understanding about the connection between hackathon participants and the potential impact of those connections on future job opportunities. As a starting point you will work with an existing dataset which covers roughly 120.000 hackathon participants (Devpost). Most of those participant profiles are connected to personal Github repositories, private websites or Linkedin profiles. The student will analyze these data from social network (or network science) perspective to understand the relations among the hackathon participants. Basic concepts and libraries to be used for network science and social network data analysis will be provided to speed up the process.

Predicting hackathon outcomes using machine-learning (Data Analytics)

Irene-Angelica Chounta (chounta [ät] ut [dot] ee), Alex Nolte, Rajesh Sharma

Hackathons started out as coding competitions during which participants engaged in short-term intense collaboration on software projects for pizza and sometimes other prices e.g. in the form of hardware or cash. Winning hackathon competitions also increases the visibility for winning teams and can benefit participants in terms of future job opportunities and personal development.

In this thesis we aim to use machine learning and other data analytics approaches to identify aspects of hackathon teams that improves their chances of winning. This includes the exploration of how contextual and team-structural factors – such as the topic of the hackathon project and the diversity of the team members with respect to skills, social characteristics and expectations – can impact the project’s outcome and lead a team to victory!

Based on an existing dataset which covers more than 70.000 hackathon teams (Devpost), the student will predict, by extracting features of hackathon participants, whether winning teams share a certain set of characteristics that help them win hackathons. Thus, by using computational approaches, we would like to propose and evaluate models for predicting winning teams in hackathons.

The Role of the Business Analyst in the Digital Era

Supervisor: Fredrik Milani (milani [ät] ut [dot] ee)

The traditional understanding of the role of the business analyst is quite clear. However, in the past 10 years, the traditional company has changed. The emergence of digital companies, agile methods, and new roles such as product owner, together with the changing role of project managers, has changed the landscape and, with it, the role and competences of business analysts. This thesis will examine these different roles and analyse what the role of the business analyst can be in such contexts and what competences business analysts need in order to stay relevant.

The Role of the Business Analysts in Agile Processes

Supervisor: Fredrik Milani (milani [ät] ut [dot] ee)

The role of the business analyst was very clear in predictive software development processes (such as waterfall and the V-model). However, with the growing popularity of agile methods, the role of the business analyst is becoming less clear. This thesis aims at investigating agile methods, examining what work has been done to define the role of a business analyst in agile methods, and mapping out how business analysts can deliver value in agile methods.

Digitalization and Process Innovation

Supervisor: Fredrik Milani (milani [ät] ut [dot] ee)

Digitalization has disrupted many industries and changed the way business is conducted. It has reformed and revolutionized business models and been instrumental in driving process innovation. Still, many industries and companies have not yet utilized the full potential of digitalization. In this thesis, you will survey the digitalization opportunities exploited so far and, overlaying them with the business model canvas, create a framework for how digitalization has innovated processes within a certain industry. The topic is to be limited to one industry, such as accounting, savings, health care, manufacturing, shipping, retail, and so on. You are encouraged to choose an industry within which you have experience. The end result of the thesis will be a framework by which one can see, understand, and identify how digitalization can enable process innovation in different parts of the business model of a company.

Digitalization and the Role of the Business Analyst

Supervisor: Fredrik Milani (milani [ät] ut [dot] ee)

Business analysts have predominantly worked on identifying needs, mapping the current state, eliciting requirements, and designing solutions. As the profession grew and was established before the “big data” era, its competencies, methods, tools, and approaches were designed for traditional incumbents. However, with the emergence and penetration of the “data-driven” perspective, new sets of perspectives, competencies, methods, tools, and approaches are required. The role of the business analyst is changing, but it is not yet clear into what. In this thesis, you will survey the needs of “data-driven” projects, relate those to the role and competencies of the business analyst, and analyse the results so as to outline and explain what the role of a business analyst can and should be in the “data-driven” era in order to deliver value.

Blockchain and Business Processes

Supervisor: Fredrik Milani (milani [ät] ut [dot] ee)

Interest in blockchain is growing very strongly. As this new technology gains traction, many uses are being proposed, ranging from voting to financial market applications. Currently, the hype around the technology is overshadowing the value it can deliver by enabling changes in business processes. While blockchain can deliver value by replacing existing IT solutions, the real value comes from innovating the business processes. This topic is about exploring, for one of the industries/cases listed below, what the current business processes are, how blockchain could enable innovation of these processes, and finally comparing/contrasting them in order to draw conclusions.

Each of the cases listed below can be the starting point for a 15-ECTS Masters thesis. In each case, the goal will be to examine the current processes, identify improvement opportunities using blockchain technology, and propose and analyze a redesigned process.

  • Health Care – transferring and owning your own medical health records and prescription management
  • Registry – management of assets (digital and physical) including registration, tracking, change of ownership, licensing and so on
  • Financial Markets – covering one or several cases such as post trading settlement of securities and bilateral agreements
  • IoT – connecting multiple devices with blockchain

Eliciting User Stories from Business Process Models

Supervisor: Fredrik Milani (milani [ät] ut [dot] ee)

User stories have gained traction within agile methods. At the same time, process models are useful in creating a common understanding between different stakeholders. Process models also provide a fundamental basis for eliciting requirements such as user stories. This thesis explores the elicitation of requirements expressed as user stories from business process models. To achieve this end, work has to be conducted on mapping the components of process models to those of requirements (user stories), eliciting user stories from a set of process models, and validating the method in a case study.

Customer Journey Mapping

Supervisor: Marlon Dumas (marlon dot dumas ät ut dot ee)

A Customer Journey Map is a graphical representation of how a given customer interacts with an organization in order to consume its products or services, possibly across multiple channels. Several approaches for customer journey mapping exist nowadays. Each relies on different concepts and notations. In this thesis, you will review the most popular approaches that are currently in use for customer journey mapping, and you will distill from them a common set of concepts and notations. You will then show how these concepts and notations can be applied in an organization of your choice (preferably an organization where you work or one where you have a lot of experience interacting as a customer).

Case Study in Business Process Improvement or Business Data Analytics

Supervisor: Marlon Dumas (marlon dot dumas ät ut dot ee)

This is a "placeholder" Masters project topic, which needs to be negotiated individually. If you work in a IT company and you are actively engaged in a business process improvement or business data analytics project, or if you can convince your hierarchy to put in time and resources into such project in the near-term, we can make a case study out of it. We will sit down and formulate concrete hypotheses or questions that you will test/address as part of this project, and we will compare your approach and results against state-of-the-art practices. I am particularly interested in supervising theses topics related to customer analytics, product recommendation, business process analytics (process mining), and privacy-aware business analytics, but I welcome other topic areas.

Risk Management in a Startup Context (BOOKED)

Evgenia Trofimova

Startups are by nature highly risk-taking enterprises since their business model is uncemented or relies on a number of untested hypotheses. This risk-taking attitude however does not necessarily mean that startups do not need or do not actively engage in risk management.

In this project, you will interview a number of IT entrepreneurs and other stakeholders in the IT startup field to understand if and how risk management is handled in this environment. You will analyze on the one hand if traditional risk management approaches (e.g. TBQ) are used in this setting (why or why not?), or if other emerging risk management approaches (e.g. based on business model canvas) are already in active use, implicitly or explicitly.

Team diversity in Early-Stage Tech Companies (BOOKED)

Evgenia Trofimova

We like to talk about the importance of diversity for creating great products. In this project, you will collect data from startups via surveys and/or interviews, particularly in the Estonian and Nordic IT startup context, in order to shed light on the question of how important diversity is in the context of MVP and early-stage product development.

Product Management in Estonian Tech Companies (BOOKED)

Evgenia Trofimova

Product management is quite a new topic on the Estonian market. In this thesis, you will study how software development processes have changed in Estonian tech companies after they have hired their first product manager(s), and which aspects of product management are more or less developed in the Estonian tech sector, compared to international practices.

Case Study in Risk Management, Product Management and Release Management in an IT Company (BOOKED)

Evgenia Trofimova

This is a "placeholder" Masters project topic, which needs to be negotiated individually. If you work in a IT company that has actively engaged in risk management or product management in a more or less structured basis, you can focus your Masters thesis project on studying why and how these practices were introduced in the company? How have they evolved? And how the introduction and development of these management practices have impacted on the company's revenues, profit marging and other startegic KPIs. If you are in a company where these questions can be studied and are interested in digging into them, just contact me.

Minimum Viable Products (MVP) for Hardware: Is it or can it be done? (BOOKED)

Evgenia Trofimova

MVPs in the software industry are common practice and relatively easy to conceive because of their malleable nature: You push the updates every 2 weeks and people get them. In the hardware world, you can't just ship a new product every 2 weeks. Are there ways of applying a lean methodology based on the notion of MVP in the hardware context? Or do MVPs in the hardware context appear in other forms?


Bachelors projects

Note: The number of projects for Bachelors students is limited. But you can find several other potential project ideas by checking this year's list of proposed software projects. Some of the projects in the "Available" category could be used as Bachelors thesis topics. Also, we're open to student-proposed Bachelors thesis projects. If you have an idea for your Bachelors project and your idea falls in the area of software engineering (broadly defined), please contact the group leader: Marlon . Dumas ät ut.ee

An investigation of the relationship between inequality and growth

Jaan Ubi (jaanbi dot jb [ät] gmail.com ) and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

The purpose of this research is to investigate the relationship between inequality (of salaries) and growth (of firms). The first aim is to reproduce these statistical regularities in the Estonian economy. First, we introduce a measure of salary inequality inside a firm. Next, we look at the (synchronous) correlation between size and inequality. Do firms in different sectors adopt different levels of inequality (e.g. banks)? We then regress future firm growth on the current inequality measure, possibly in a non-linear manner. The research question is: "Does a more unequal distribution of salaries improve the performance of a firm?" This research strand is suitable for a student who aspires to apply data science in the domain of economic/business analysis, which is an active area of endeavor in Estonia, as the country is about to take an active stance in applying such techniques for driving its policies.
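As a minimal sketch of the first step (the salary figures are invented, and the choice of the Gini coefficient as the inequality measure is an assumption made only for the example), one firm's internal salary inequality could be computed as follows:

    def gini(salaries):
        # Gini coefficient of a list of salaries (0 = perfectly equal, close to 1 = very unequal).
        values = sorted(salaries)
        n = len(values)
        cumulative = sum((i + 1) * v for i, v in enumerate(values))
        return (2 * cumulative) / (n * sum(values)) - (n + 1) / n

    equal_firm = [1500, 1500, 1500, 1500]
    unequal_firm = [1000, 1200, 1500, 6000]
    print(gini(equal_firm))     # 0.0
    print(gini(unequal_firm))   # about 0.39

The per-firm values obtained this way would then be regressed against subsequent firm growth.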

Understanding behavior propagation through public goods game.

Supervisor: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

The public goods game is a standard of experimental economics. In the basic game, subjects secretly choose how many of their private tokens to put into a public pot. The tokens in this pot are multiplied by a factor (greater than one and less than the number of players, N) and this "public good" payoff is evenly divided among the players. Each subject also keeps the tokens they do not contribute. In this thesis, we aim to devise a public goods game and recruit real subjects to play it. The real dataset, obtained through the public goods game experiment, will be used for analysing the propagation of (altruistic and mean) behavior.
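As a worked sketch of the payoff rule described above (the endowment of 20 tokens, the multiplier of 1.6 and the group of four players are illustrative choices):

    ENDOWMENT = 20
    MULTIPLIER = 1.6   # greater than one and less than the number of players

    def payoffs(contributions):
        pot = sum(contributions) * MULTIPLIER
        share = pot / len(contributions)
        return [ENDOWMENT - c + share for c in contributions]

    print(payoffs([20, 20, 20, 20]))   # full cooperation: [32.0, 32.0, 32.0, 32.0]
    print(payoffs([0, 20, 20, 20]))    # one free-rider earns most: [44.0, 24.0, 24.0, 24.0]

This tension between individual and group payoffs is what makes the propagation of altruistic versus mean behavior interesting to study.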

Inferring social network of mobile users' interactions.

Supervisor: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Huber Flores

Mobile interactions have often been used in the past to infer the social network of individuals. In this study, which lies at the intersection of social network analysis and data analytics, the student will analyse a large set of real users' mobile interactions, recorded over a period of one month, to infer the social network of these real users. Basics of social network concepts and libraries will be provided to speed up the thesis process.

Opinion mining of Public data for a health initiative project.

Supervisor: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Ruth Hunter

Opinion mining involves text analysis techniques for gathering sentiments about a certain topic from a corpus. With the advancement of Web 2.0, various social platforms have provided opportunities for people to express their unbiased opinions on various topics. In this thesis, we are particularly interested in analysing opinions with respect to health-oriented initiatives by the UK government. The thesis will investigate two case studies in particular: 1) Is 20 plenty for health? and 2) Connswater Community Greenway.

Is 20 plenty for health? – The project involves the implementation of a transport initiative across several sites in the UK, proposing a reduction of speed limits to 20mph in order to achieve fewer casualties and lower traffic volumes, leading to an improvement in the perception of safety and a subsequent increase in cycling and walking.

Connswater Community Greenway – The project involves an urban regeneration project in east Belfast (Northern Ireland), which includes the development of a 9km linear park and of purpose-built walkways, cycle paths and parks to encourage local residents to be more active and improve their health and wellbeing.

The work will involve: 1) analysing public sentiment; 2) proposing a model for predicting public mood; 3) developing a sentiment package specifically related to public policy initiatives on health; and 4) investigating public vs. policy levels, i.e. those who are promoting and implementing the schemes vs. those who are using them.

Dataset will be provided for the analysis.
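As a minimal sketch of work item 1 (the example posts are invented, and the use of NLTK's VADER analyzer is just one possible starting point, not a prescribed choice):

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")
    analyzer = SentimentIntensityAnalyzer()

    posts = [
        "The 20mph limit makes our street feel much safer for kids walking to school.",
        "The new greenway is a waste of money and always flooded.",
    ]
    for post in posts:
        # compound score > 0 indicates positive sentiment, < 0 negative
        print(analyzer.polarity_scores(post)["compound"], post)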

Workflow Automation With Business Data Streams

Supervisor: Peep Küngas (peep.kungas ät ut.ee)

There exist services such as Flowxo and IFTTT which facilitate the automation of simple workflows by enabling the creation of trigger/action pairs or if-then recipes, whose execution orchestrates applications with respect to external stimuli (e.g. an application data stream). An example of a popular recipe is the following: "IF I post a picture on Instagram THEN save the photo to Dropbox". A more complex example is: "for every new deal in a CRM, send the deal by e-mail to a person via the GMail service, then wait for about one day and send a reminder SMS via the Twilio service".

Such systems mostly rely on proprietary application data, while there are cases where external stimuli would provide extra benefits. An example of such a case is the integration of CRM and credit management tools with external stimuli in the form of streaming company debt and risk score data for the order-to-cash business process. A Stream API for business data is currently under development at Register OÜ; it will provide a stream of events such as company debt changes, changes in board membership, and data about newly registered companies. Such data change events can easily be applied in the context of CRM and credit management (CM).

The aim of the project is to provide an analogue of IFTTT where users can define recipes for reacting to business data changes via actions in applications such as GMail, Odoo CRM, etc.
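Purely as a hypothetical sketch of the recipe concept (the event fields, the rule and the notification helper are invented; this is not the Register Stream API or IFTTT's actual model), a recipe can be seen as a trigger predicate plus an action applied to a stream of business events:

    def debt_increased(event):
        # Trigger: fires when a company's registered debt grows.
        return event["type"] == "debt_change" and event["new_debt"] > event["old_debt"]

    def notify_account_manager(event):
        # Action: in a real system this would send an e-mail or create a CRM task.
        print(f"ALERT: {event['company']} debt rose to {event['new_debt']} EUR -- review open orders")

    recipes = [(debt_increased, notify_account_manager)]

    stream = [
        {"type": "debt_change", "company": "Acme OÜ", "old_debt": 0, "new_debt": 12000},
        {"type": "board_change", "company": "Beta AS"},
    ]

    for event in stream:
        for trigger, action in recipes:
            if trigger(event):
                action(event)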

The project will be done in collaboration with Register OÜ. The application will be developed using the Complex Event Processing (CEP) feature of the Register Stream API.

Lead generator for accelerating B2B sales

Supervisor: Peep Küngas (peep.kungas ät ut.ee)

Companies care a lot about improving their sales, and to meet this demand numerous online solutions have been proposed. While for B2C sales the prevalent solutions use social media campaign and Web visitor data, for B2B sales there are solutions which allow generating a list of leads based on a set of attribute values over company data, such as activity field, size, financial metrics, etc.

Some solutions for the Estonian market include https://www.baltictarget.eu, http://sihtgrupid.ee, http://turundusnimekirjad.ee/ and http://www.kliendibaas.ee/. However, these solutions have the following deficiencies:

  1. The market segment must be known before generating the leads;
  2. The set of attributes is mostly limited to geographic, activity field, and financial data;
  3. The data is returned as a file.

This project aims at innovating B2B sales by providing a solution which differs from the existing ones in the following ways:

  1. instead of a list of feature/value pairs, a user can define their market segment by giving a set of prospective clients as input to lead generation;
  2. in addition to activity fields, company size and financial metrics, also data about owned real estate, credit history, credit risk, media coverage and related persons can be used;
  3. instead of outputting leads to a CSV file, lead data will be directly imported into an existing CRM system, or a new cloud instance of a CRM will be deployed and populated with the leads.

The project will be done in collaboration with Register OÜ.

Automated brand magazine

Supervisor: Peep Küngas (peep.kungas ät ut.ee)

It is essential for companies to acquire new customers and retain existing ones. To address this need, content marketing techniques have been developed and advocated. However, due to a lack of proper skills, these techniques are often underutilized. Multiple tools have been created to simplify content marketing. For instance, Flipboard (https://about.flipboard.com/advertisers) simplifies the creation of brand magazines for executing content marketing towards users. LinkedIn has been used by top management to deliver company news to their employees. Instant Articles by Facebook (https://instantarticles.fb.com) provides means for publishers to make their articles appear more attractive and to increase engagement rates on Facebook. Inforegister (https://www.inforegister.ee/) has developed a business media feature which allows publishing stories related to businesses. However, all of the mentioned solutions expect relevant content to be provided and managed manually.

In this project a solution will be developed which will simplify the creation and maintenance of brand magazines for companies (especially SMEs) and persons (e.g. bloggers). The key innovation of the project is that it will *automatically* search for mentions of companies, persons and brands from the Web via the Register Graph API (https://developers.ir.ee/graph-api) and create attractive brand pages out of them, which will be made visible via search engines to the target audience. Mentions originating from online news media, the blogosphere, forums, corporate blogs and other Web sources will be presented on brand pages with specific cards (see https://www.google.com/search/about/learn-more/now/ for the concept of cards), e.g. "Customers' feedback", "Our partners", "New product launched", "About company", etc.

Some initial requirements:

  1. Responsive design with Google Material
  2. Use of Register Stream API and Graph API as data sources
  3. A Web solution that is both search-engine-friendly and user-friendly

Integration of Privacy-Enhanced BPMN to a GDPR compliance evaluation tool

Jake Tom (jaketom ät ut dot ee)

Privacy-Enhanced Business Process Model and Notation (PE-BPMN) is an extension of BPMN aimed at capturing the different privacy-enhancing technologies that may be used along a business process flow. We have developed a web application prototype of a GDPR compliance checker that takes a business process model as input and outputs a compliance evaluation report. This report indicates where compliance with the GDPR may be lacking along the input model. In this thesis, you will work on enhancing the prototype to be compatible with the PE-BPMN extension.

Additionally, several steps in the tool's workflow could be automated, including the separation of system and human actors and the identification of sensitive personal data objects. There are also several user experience improvements that could be made for the tool's target users. You will need to be familiar with the Spring MVC framework for this project.
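
As one possible starting point for automating the identification of sensitive personal data objects, here is a small heuristic sketch that scans a BPMN 2.0 XML model and flags data objects whose names match a keyword list. The keyword list and the flagging rule are illustrative assumptions; the prototype's actual classification logic is left to the thesis.

    # Heuristic sketch: flag data objects in a BPMN 2.0 XML model whose names
    # suggest personal data. The keyword list is an assumption for illustration.
    import xml.etree.ElementTree as ET

    BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"
    PERSONAL_DATA_HINTS = ("name", "address", "email", "phone", "id code",
                           "health", "salary", "birth")

    def find_sensitive_data_objects(bpmn_file: str) -> list[str]:
        """Return names of data objects that look like personal data."""
        tree = ET.parse(bpmn_file)
        flagged = []
        for tag in ("dataObject", "dataObjectReference"):
            for node in tree.iter(f"{{{BPMN_NS}}}{tag}"):
                label = (node.get("name") or "").lower()
                if any(hint in label for hint in PERSONAL_DATA_HINTS):
                    flagged.append(node.get("name"))
        return flagged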

Lab Package Development & Evaluation for the Course 'Software Testing' (LTAT.05.006)

Supervisor: Dietmar Pfahl (dietmar dot pfahl at ut dot ee)

The course Software Testing (LTAT.05.006, formerly MTAT.03.159) currently has 9 labs (practice sessions) in which 2nd and 3rd year BSc students learn specific test techniques. We would like to improve the existing labs and add new ones.

This topic is intended for students who have already taken the software testing course and who feel that they can contribute to improving it while at the same time completing their Bachelors project. The scope of the project can be negotiated with the supervisor to fit the size of a Bachelors project.

The tasks to do for this project are as follows:

  • Selection of a test-related topic for which a lab package should be developed (see list below)
  • Development of the learning scenario (i.e., what shall students learn, what will they do in the lab, what results shall they produce, etc.)
  • Development of the materials for the students to use
  • Development of example solutions (for the lab supervisors)
  • Development of a grading scheme
  • Evaluation of the lab package

Topics for which lab packages should be developed (in order of urgency / list can be extended based on student suggestions):

  • Automated Unit & Systems Testing
  • Visual GUI Testing
  • Issue Reporting
  • Continuous Integration & Testing -- NOTE: This topic has already been taken by a student!
  • Other topics that you find interesting and would like to discuss with me regarding their suitability

Literature Survey on "Open Innovation – How to use it for software requirements elicitation?"

Supervisor: Dietmar Pfahl (dietmar dot pfahl at ut dot ee)

Open innovation (OI) is a new paradigm that aims at opening up organizational boundaries in order to use and recombine internal and external knowledge to develop and commercialize innovative products. The idea of OI can become an interesting new approach to requirements elicitation for software products. In particular, social media, blogs, and other freely accessible resources could be systematically analyzed for relevant ideas that would help improve the value of future products.

Project task: Find literature on reported attempts to exploit open-source software, social media, blogs, and other open sources for detecting new functionality or complementing existing functionality of existing and new software products. Summarize and discuss the literature you find. In your analysis you may focus on the type of information sources exploited, the ways in which they were analyzed, the kind of information extracted (new requirements, discussion/evaluation of existing functionality, etc.), the type of products for which new requirements were sought, etc.

Starting point for literature search:

  • Anton Barua, Stephen W. Thomas, Ahmed E. Hassan (2014) What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering, June 2014, Volume 19, Issue 3, pp 619-654.

Customer Journey Mapping [Already booked]

Supervisor: Marlon Dumas (marlon dot dumas ät ut dot ee)

A Customer Journey Map is a graphical representation of how a given customer interacts with an organization in order to consume its products or services, possibly across multiple channels. Several approaches to customer journey mapping exist nowadays, each relying on different concepts and notations. In this thesis, you will review the most popular approaches currently in use for customer journey mapping and distill from them a common set of concepts and notations. You will then show how these concepts and notations can be applied in an organization of your choice (preferably one with which you have experience interacting as a customer).

Automated grading tool for the projects of the course Veebirakenduste loomine (Web Application Development, LTAT.05.004)

Supervisor: Siim Karus (siim dot karus at ut dot ee)

In the course Veebirakenduste loomine (Web Application Development), students carry out a web application development project in teams. The project is graded in 7 stages, and grading the projects is fairly time-consuming, non-trivial work. I therefore propose, as a thesis project, building a tool that automates the grading of web applications. The tool should:

  1. analyze the source code. For example, it should locate uses of SQL statements in the code and check their compliance with the SQL standard and with the techniques required in the course;
  2. perform checks in the web application's test environment. For example, it should check compliance with the HTML and CSS standards (W3C has an open-source validator prototype).

A rough sketch of both checks is given after this description. Ideally, the tool should also be usable by the students themselves, so that they can check their own work before submitting it. In addition, the tool could offer students hints on how to fix common mistakes.
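
The sketch below illustrates the two checks in Python. The scanned file extension, the SQL-detection regular expression and the use of the public W3C Nu HTML checker are illustrative assumptions, not course requirements.

    # Rough sketch of two automated checks over a submitted project.
    # The scanned file extension, the SQL regex and the W3C Nu checker endpoint
    # are illustrative choices, not requirements of the course.
    import pathlib
    import re
    import requests

    SQL_PATTERN = re.compile(r"\b(SELECT|INSERT|UPDATE|DELETE)\b", re.IGNORECASE)

    def find_sql_statements(project_dir: str) -> list[tuple[str, str]]:
        """Collect (file, line) pairs for every SQL-looking line in the sources."""
        hits = []
        for path in pathlib.Path(project_dir).rglob("*.php"):  # assumed source language
            for line in path.read_text(errors="ignore").splitlines():
                if SQL_PATTERN.search(line):
                    hits.append((str(path), line.strip()))
        return hits

    def validate_html(html: str) -> list[dict]:
        """Send a page to the W3C Nu HTML checker and return reported errors."""
        resp = requests.post(
            "https://validator.w3.org/nu/?out=json",
            data=html.encode("utf-8"),
            headers={"Content-Type": "text/html; charset=utf-8"},
            timeout=30,
        )
        resp.raise_for_status()
        return [m for m in resp.json().get("messages", []) if m.get("type") == "error"]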

Estonian E-Governance Academy (two Bachelors project proposals)

Supervisor: Hannes Astok

Training app for government officials and integrated information screen for visitors