Student Projects, Academic Year 2017-2018

Below is a list of project topics for Masters and Bachelors theses offered by the software engineering research group for students who intend to defend in June 2018, grouped by degree level.

If you're interested in any of these projects, please contact the corresponding supervisor.


Masters projects

Comparative Evaluation of the Performance of Big Stream Processing Systems

Sherif Sakr (sakr [dot] sherif [ät] googlemail [dot] com)

As the world gets more instrumented and connected, we are witnessing a flood of digital data generated by various hardware (e.g., sensors) or software in the form of flowing streams of data. This phenomenon is crucial for several applications and domains, including financial markets, surveillance systems, manufacturing, smart cities, and scalable monitoring infrastructures. In these applications and domains, there is a crucial requirement to collect, process, and analyze big streams of data to extract valuable information. Recently, several systems have been proposed to tackle the problem of big stream processing (e.g., Apache Flink, Apache Heron, Spark Streaming). However, we still lack a deep understanding of the performance characteristics of the various design architectures, as well as comprehensive benchmarks for the various big data processing platforms. The aim of this project is to conduct an empirical evaluation and benchmarking of state-of-the-art big stream processing systems.

Benchmarking Modern Big SQL Systems

Sherif Sakr (sakr [dot] sherif [ät] googlemail [dot] com)

Recently, several Big SQL systems have been proposed to tackle the problem of large-scale structured data processing (e.g., SparkSQL, Presto, Cloudera Impala). However, we still lack a deep understanding of the performance characteristics of the various design architectures, as well as comprehensive benchmarks for the various Big SQL processing platforms. The aim of this project is to conduct an empirical evaluation and benchmarking of state-of-the-art Big SQL processing systems.

Implementing a SPARQL Query Processor on top of a Big SQL Engine

Sherif Sakr (sakr [dot] sherif [ät] googlemail [dot] com)

RDF (Resource Description Framework) is the main ingredient and the data representation format of Linked Data and the Semantic Web. It supports a generic graph-based data model and data representation format for describing things, including their relationships with other things. In practice, the SPARQL query language has been recommended by the W3C as the standard language for querying RDF data. The size of RDF databases is growing fast, thus RDF query processing engines must be able to deal with increasing amounts of data. The aim of this project is to build a scalable SPARQL query processor for massive RDF databases on top of modern Big SQL systems (e.g., Spark SQL, Cloudera Impala).
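A common implementation strategy, which this project could build on, stores RDF in a single relational triples(s, p, o) table and compiles each triple pattern of a SPARQL basic graph pattern into a self-join. A minimal sketch in Python; the table layout and the simplified pattern syntax are illustrative assumptions, not a full SPARQL parser:

```python
# Sketch: compile a SPARQL basic graph pattern into SQL over a single
# triples(s, p, o) table -- one self-join per triple pattern.
# Terms starting with '?' are variables; everything else is a constant.

def bgp_to_sql(patterns):
    """patterns: list of (subject, predicate, object) strings."""
    wheres, var_cols = [], {}
    for i, (s, p, o) in enumerate(patterns):
        alias = f"t{i}"
        for col, term in (("s", s), ("p", p), ("o", o)):
            ref = f"{alias}.{col}"
            if term.startswith("?"):
                if term in var_cols:              # shared variable -> join condition
                    wheres.append(f"{ref} = {var_cols[term]}")
                else:
                    var_cols[term] = ref          # first occurrence binds the column
            else:
                wheres.append(f"{ref} = '{term}'")  # constant -> selection predicate
    froms = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    cols = ", ".join(f"{ref} AS {v[1:]}" for v, ref in var_cols.items())
    sql = f"SELECT {cols} FROM {froms}"
    if wheres:
        sql += " WHERE " + " AND ".join(wheres)
    return sql

query = bgp_to_sql([("?person", "worksAt", "?org"),
                    ("?org", "locatedIn", "Tartu")])
print(query)
```

The generated SQL string can then be handed to the underlying Big SQL engine for optimization and distributed execution.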

Auto-Selection and -Tuning of Machine Learning Classification Algorithms

Sherif Sakr (sakr [dot] sherif [ät] googlemail [dot] com)

One of the major obstacles to supporting machine learning algorithms on big data is the challenging and time-consuming process of identifying and training an adequate predictive model. Machine learning is therefore a highly iterative, exploratory process in which most scientists work hard to find the best model or algorithm that meets their data challenge. In practice, there is no one-model-fits-all solution: no single model or algorithm can handle all data set varieties and the changes in data that may occur over time. All machine learning algorithms require user-defined inputs, referred to as tuning parameters, to achieve a balance between accuracy and generalizability. These tuning parameters impact the way the algorithm searches for the optimal solution. The aim of this project is to build an R package that can easily be used off the shelf to address the problem of simultaneously selecting a learning algorithm and setting its hyperparameters. The goal of this package is to help non-expert users more effectively identify machine learning algorithms and hyperparameter settings appropriate to their applications, and hence achieve improved performance.
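The selection-plus-tuning loop described above (often called the CASH problem: Combined Algorithm Selection and Hyperparameter optimization) can be sketched as a search over a joint space of algorithms and their tuning parameters. The sketch below is in Python for illustration only (the target package is in R), and the candidate algorithms and the scoring function are placeholders; in the actual package each candidate would be trained and cross-validated on the user's data set:

```python
import random

# Random search over a joint (algorithm, hyperparameters) space.
SEARCH_SPACE = {
    "knn":           {"k": [1, 3, 5, 7]},
    "decision_tree": {"max_depth": [2, 4, 8]},
    "svm":           {"C": [0.1, 1.0, 10.0]},
}

def cv_score(algorithm, params):
    # Placeholder for k-fold cross-validation accuracy on the user's data.
    toy = {"knn": 0.80, "decision_tree": 0.85, "svm": 0.82}
    return toy[algorithm] - 0.01 * list(params.values())[0]

def random_search(n_trials=20, seed=0):
    rng = random.Random(seed)
    best = (None, None, float("-inf"))
    for _ in range(n_trials):
        algo = rng.choice(sorted(SEARCH_SPACE))
        params = {name: rng.choice(values)
                  for name, values in SEARCH_SPACE[algo].items()}
        score = cv_score(algo, params)
        if score > best[2]:
            best = (algo, params, score)
    return best

algo, params, score = random_search()
print(algo, params, round(score, 3))
```

More sample-efficient strategies (e.g., Bayesian optimization, as in auto-sklearn or the R `mlr` tuning infrastructure) follow the same interface: propose a configuration, evaluate it by cross-validation, keep the best.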

Compiling StreamSQL to Flink

Sherif Sakr (sakr [dot] sherif [ät] googlemail [dot] com)

As the world gets more instrumented and connected, we are witnessing a flood of digital data generated from various hardware (e.g., sensors) or software in the format of flowing streams of data. StreamSQL is a query language that extends SQL with the ability to process real-time data streams. StreamSQL supports the ability to manipulate streams, which are infinite sequences of tuples that are not all available at the same time. The aim of this project is to build a query compiler that can compile StreamSQL queries into the low-level APIs of modern big stream processing systems (e.g., Apache Flink, Apache Heron, Spark Streaming, APEX) so that it adds a declarative programming layer on top of these systems.
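To illustrate the compilation target, the sketch below shows in plain Python generators the kind of low-level windowed dataflow a StreamSQL compiler could emit for a tumbling-window aggregation. The query in the comment and the operator names are illustrative assumptions, not real StreamSQL syntax or the APIs of Flink or Spark Streaming, which expose analogous operators:

```python
# Sketch of a compiled dataflow for something like:
#   SELECT AVG(temp) FROM sensor_stream WINDOW TUMBLING (SIZE 3)

def tumbling_window(stream, size):
    """Group an unbounded stream into consecutive, non-overlapping windows."""
    window = []
    for event in stream:
        window.append(event)
        if len(window) == size:
            yield window
            window = []

def avg_per_window(stream, size):
    # The aggregation operator the compiler would chain after the window operator.
    for window in tumbling_window(stream, size):
        yield sum(window) / len(window)

readings = [20, 22, 21, 30, 31, 29]          # finite stand-in for an infinite stream
print(list(avg_per_window(readings, 3)))     # -> [21.0, 30.0]
```

A real compiler would parse the StreamSQL text, build a logical plan of such operators, and then emit calls to the target engine's API instead of Python generators.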

Predicting Information Diffusion on Social Media

Rajesh Sharma (rajesh dot sharma ät ut dot ee) and Anna Jurek

Social media such as online social networks (Facebook), micromessaging services (Twitter) or sharing sites (Instagram) provide the space in which a significant part of social interactions takes place. Many real-life situations like elections are reflected by social media and in turn social media shapes them by forming opinions or strengthening trends. In addition to providing a large audience, social media has changed the speed of interaction: Information spreads within minutes or hours, triggering equally fast reactions.

The goal of the thesis is to develop an algorithm that predicts how well a message will diffuse on Twitter.

  • The first step will be to identify significant user/message/network features that may be used to predict how fast a message will spread across the social media channel.
  • The second step will be to implement a classification model able to predict, using the identified features, how well a message will diffuse on Twitter.

A large Twitter dataset related to two different topics will be provided for your analysis. However, we also expect to collect new data on one or two more topics, in order to have a comprehensive analysis across a variety of topics. Literature on the topics will also be provided to speed up the work.

Identifying Fake News using Linked Data and Network Science Approaches

Supervisors: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Deepak Padmanabhan

Fake news is often generated with the malicious intent of spreading misinformation and rumours. Its content is generally created to mislead readers for financial or political gain, or simply to grab attention. Apart from social media platforms such as Twitter, Facebook, and WhatsApp, there are dedicated news agencies that propagate fake news.

The goal of this thesis is to identify whether news stories are fake or not based on their content, using “Linked Data” in combination with “Network Science” approaches. The Linked Data approach will be used to identify fake-news indicators, such as enhanced topical scatter in the news content to be analyzed. The network science approach will be used to identify the similarity among the topics of the content, to boost the accuracy of fake news detection. This involves analysis of a corpus of news stories that will be collected for the purpose of this project. Guidance on network science and Linked Data will be provided to get started on the project.

Measuring Corporate Reputation through Online Social Media

Supervisors: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Peter Ormosi

When businesses are caught out engaging in illegal or immoral activities, their reputation might suffer. Corporate reputation is a reflection of how a business is regarded by its customers and the public in general. If corporate misbehaviour negatively affects a business’ reputation, customers might switch to rival businesses. For this reason, reputation has got a central role in free markets as it has the potential to deter businesses from misbehaving.

The extent to which corporate wrongdoings trigger a reputational loss is still debated and is the subject of a large body of academic work. Most of these works use survey methods to measure reputation. This research relies on a more direct method for measuring reputational changes: a sentiment analysis of how the public reacted on Twitter to some of the most high-profile corporate misconducts. In this thesis, corporate reputation will be studied using the Volkswagen (VW) scandal as a case study, together with the public reaction it created on Twitter. The VW scandal has been chosen because it has been widely covered over time through both traditional and social media. Moreover, we can measure how changes in media coverage and social media reaction affected VW's financial performance. The dataset and related literature will be provided to speed up the work.

Opinion Mining of Public Data for a Health Initiative Project

Supervisors: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Ruth Hunter

Opinion mining involves text analysis techniques for gathering sentiments about a certain topic from a corpus. With the advancement of Web 2.0, various social platforms have given people opportunities to express their unbiased opinions on various topics. In this thesis, we are particularly interested in analysing opinion with respect to health-oriented initiatives by the UK government. The thesis will investigate two case studies in particular: 1) Is 20 Plenty for Health? and 2) Connswater Community Greenway.

Is 20 Plenty for Health? – This project involves the implementation of a transport initiative across several sites in the UK that proposes reducing speed limits to 20 mph, with the aim of causing fewer casualties and lower traffic volumes, leading to an improved perception of safety and a subsequent increase in cycling and walking.

Connswater Community Greenway – This is an urban regeneration project in east Belfast (Northern Ireland), which includes the development of a 9 km linear park with purpose-built walkways, cycle paths, and parks to encourage local residents to be more active and improve their health and wellbeing.

The work will involve: 1) analysing public sentiment, 2) proposing a model for predicting public mood, 3) building a sentiment package specifically for public policy initiatives related to health, and 4) investigating public vs. policy levels, i.e., those who promote and implement the schemes vs. those who use them.

A small dataset will be provided. However, we also expect to collect more data as part of the thesis.

Learning Social Representation using Deep Neural Networks

Supervisor: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Shirin Dora

The catalogue of techniques in machine learning is massive, but recent research in this area has spotlighted the immense potential of deep neural networks for solving many problems. Deep learning is a field of machine learning concerned with developing learning algorithms for training neural networks with a large number of layers. Deep neural networks are presented with a real-valued multidimensional representation of an input and, through multiple layers of processing, learn to extract meaningful information from it.

The focus of this thesis will be the application of deep learning to learning social network representations. A social network is represented as a collection of nodes and the edges which connect them. Each node represents a single member of the network, and the edges emanating from a node represent that member's connections. As a result of this representation, there is no straightforward way to describe each node using real-valued features. This makes it difficult to apply machine learning techniques to problems pertaining to social networks, such as network classification, content recommendation, etc. The problem becomes more complex for large social networks.

To overcome this issue, many researchers focus on developing techniques that learn representations for each node using the information stored in the social network. These representations provide a real-valued multidimensional input for nodes in the social network which can be processed by existing machine learning techniques, and they have been used for various social network problems. In this thesis, the goal is to leverage the capabilities of deep neural networks to simultaneously learn representations and perform a given social-network-related task. This generic approach would involve training the neural network on a particular social network problem without worrying about presenting appropriate representations, as the onus of learning suitable representations lies with the neural network.
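One widely used family of techniques of this kind (e.g., DeepWalk and node2vec) first samples truncated random walks from the graph and then feeds them, like sentences, to a neural embedding model. A minimal sketch of the walk-sampling stage, where the tiny graph is an illustrative assumption and the embedding step is omitted:

```python
import random

# Stage 1 of DeepWalk-style representation learning: sample truncated random
# walks over the social graph. The walks would then be fed to a skip-gram
# model (not shown) to learn a real-valued vector per node.

graph = {                       # adjacency list of a toy social network
    "anna": ["ben", "carl"],
    "ben":  ["anna", "carl", "dora"],
    "carl": ["anna", "ben"],
    "dora": ["ben"],
}

def random_walk(graph, start, length, rng):
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))  # uniform next-hop choice
    return walk

def build_corpus(graph, walks_per_node=10, length=5, seed=0):
    rng = random.Random(seed)
    return [random_walk(graph, node, length, rng)
            for _ in range(walks_per_node)
            for node in sorted(graph)]

corpus = build_corpus(graph)
print(len(corpus), "walks, e.g.", corpus[0])
```

The thesis approach differs in that the embedding would not be learned in a separate stage but jointly with the downstream task, inside one deep network.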

Analysing Server Logs for Predicting Job Failures

Supervisor: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

Server logs generally refer to files created for monitoring the activities performed on servers. In recent years, a lot of research has analysed server logs to determine the status of the jobs or tasks that arrive on servers. In this thesis, you will analyse logs from a Google cluster, which is a set of machines responsible for running real Google jobs, for example search queries. The research falls within the domain of large-scale predictive analytics. The main contribution of the thesis is a model to predict job failures on servers. A real dataset of Google traces will be provided, along with related literature to ramp up the learning process.

Wisdom of the Crowd vs. Expert Views Regarding Movies' Box Office Results

Supervisor: Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

The term “wisdom of the crowd” refers to the collective opinion of a community or group. In comparison, expert views refer to the views expressed by the experts of a particular domain. In this thesis, you will investigate whether it is the experts or the wisdom of the crowd that better predicts the box office outcome of movies. In particular, you will analyse tweets about movies around their release dates. A small dataset of tweets about various movies will be provided. However, we also expect to expand the analysis by collecting tweets about more movies during the thesis. The thesis involves sentiment analysis of the tweets and, subsequently, the proposal of a model for predicting movies' box office results.
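As a starting point for the tweet analysis, sentiment can be scored with a simple lexicon-based approach before moving to a learned model. A minimal sketch, where the tiny word lists stand in for a real sentiment lexicon such as VADER or SentiWordNet:

```python
# Lexicon-based sentiment scoring: count positive and negative words.
POSITIVE = {"great", "amazing", "loved", "masterpiece", "fun"}
NEGATIVE = {"boring", "awful", "hated", "flop", "waste"}

def sentiment(tweet):
    # Crude tokenization: lowercase and strip basic punctuation.
    words = (tweet.lower().replace("!", " ").replace(".", " ")
                  .replace(",", " ").split())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tweets = [
    "Loved the new movie, what a masterpiece!",
    "Total flop. Boring from start to finish.",
    "Saw it yesterday with friends.",
]
print([sentiment(t) for t in tweets])   # -> ['positive', 'negative', 'neutral']
```

Aggregating such per-tweet labels over the pre-release window yields the crowd signal that can then be compared against critics' verdicts in the prediction model.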

Segmented Process Model Generation from Software Logs

Supervisors: Fredrik P. Milani (milani ät ut dot ee) and Fabrizio Maggi

Process mining allows for extracting process models from the execution logs of software systems. There are many algorithms available that do this. However, most of them create one single model capturing all behaviours and, as such, produce models that are very difficult to understand. The understandability of such models can be improved by segmenting them into several models, each showing certain traces of the process based on, for instance, frequency. Furthermore, models can be made more understandable if their complexity is below empirically proven threshold values. In this thesis, you will take an existing algorithm for process discovery with BPMN and enhance it. The first enhancement is to allow generation of process models based on frequency of occurrence (a filtering function). The second is to automatically segment the process model into several models based on frequency, main variants, and complexity.

Automated Identification of Parameters for Deviance Mining

Supervisors: Fredrik P. Milani (milani ät ut dot ee) and Fabrizio Maggi

Within the domain of process mining, it is possible to compare the execution paths of different outcomes of a process. For instance, a process for handling claims can have both slow and fast cases. The slow cases can be annotated as such and separated from the fast ones. Once this is done, one can compare the execution of slow versus fast cases and identify the differences. This is referred to as deviance mining. However, in order to perform deviance mining, one has to know what to compare, such as, in this case, slow versus fast cases. Other aspects can be compared as well, such as cheap versus expensive or negative versus positive outcomes. These parameters have to be identified and set manually. However, at times one does not know which parameters to consider, and as such it is helpful to have an automated way of discovering potentially interesting parameters to compare from a log. In this thesis, you will develop an algorithm that can take a log and automatically detect which parameters might be relevant for deviance mining. This is done by considering which aspects of the execution are sufficiently different and therefore potentially interesting. You will develop this algorithm and also consider how to visualize the results of the analysis.

Conformance Checking with Artificial Intelligence

Supervisors: Fabrizio Maggi (f.m.maggi ät ut dot ee) and Andrea Marrella

Conformance checking is a branch of process mining embracing approaches for verifying whether the behavior of a process, as recorded in a log, is in line with some expected behaviors provided in the form of a process model. One of the open challenges in the context of conformance checking is the capability of supporting multi-perspective specifications, i.e., data, time, and resources. In this thesis, you will close this gap by providing a multi-perspective framework for conformance checking that uses artificial intelligence techniques, such as automated planning, to identify and resolve discrepancies between a log and a reference model. The approach will be implemented in Java and empirically evaluated using real-life examples.
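The core of such a framework is computing the cheapest alignment between a logged trace and the runs allowed by the model. A minimal control-flow-only sketch in Python for illustration (the thesis implementation would be in Java and multi-perspective), ignoring the data, time, and resource perspectives, and using a made-up claim-handling model; alignment cost is approximated here by edit distance:

```python
# Alignment-based conformance checking, control-flow only: the cheapest
# alignment cost between a logged trace and any run of the reference model,
# computed as minimum edit distance (insertions = model moves,
# deletions = log moves).

def edit_distance(a, b):
    # Classic single-row dynamic-programming edit distance over activity lists.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[-1]

MODEL_RUNS = [          # the runs allowed by a toy claim-handling model
    ["register", "check", "decide", "notify"],
    ["register", "decide", "notify"],
]

def alignment_cost(trace):
    return min(edit_distance(trace, run) for run in MODEL_RUNS)

print(alignment_cost(["register", "check", "decide", "notify"]))  # -> 0 (conforming)
print(alignment_cost(["register", "notify"]))                     # -> 1 (skipped "decide")
```

Casting this search as an automated-planning problem, as the thesis proposes, allows richer cost functions over data and resource deviations instead of the uniform unit costs used here.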

Interpretable Predictive Process Monitoring

Supervisors: Anna Leontjeva (anna.leontjeva ät ut dot ee), Fabrizio Maggi (f.m.maggi ät ut dot ee), Chiara Di Francescomarino

Recent advances of supervised machine learning in various tasks stem from the use of powerful and complex models. However, the interpretability of these models is very limited. This poses a serious challenge in domains such as financial and medical services, where understanding the decision behind a prediction is crucial. Moreover, the interpretability of a model can provide valuable feedback for improving the predictive model even further.

In this thesis, you will adapt available techniques for interpreting complex models, such as recurrent neural networks, to business processes. The main goal is to propose a tool that is easy for a non-technical audience to use and understand, and that provides insights into the prediction logic behind the model.

Predictive Process Monitoring of Time-Related Properties

Supervisors: Ilya Verenich (ilyav [ät] ut [dot] ee) and Fabrizio Maggi (f.m.maggi ät ut dot ee)

Predictive business process monitoring methods exploit historical process execution logs to provide predictions about running instances of a process. The problem of predictive process monitoring has received considerable attention in the past years. In particular, a considerable number of methods have been proposed to predict the completion time of process instances – e.g. How long will it take to resolve an insurance claim? When will this bug be fixed?

Several methods for predictive process monitoring have been proposed in the past years. These approaches have been evaluated on different datasets, using different experimental settings, evaluation measures and baselines. As a result, it is unclear which methods perform better than others and under what conditions. This thesis will address this gap by conducting a systematic review and comparative experimental evaluation of time-related predictive monitoring methods (e.g. predicting remaining time), covering both classical feature engineering-based methods as well as more recent methods based on deep learning.

Causal Deviance Mining of Business Processes

Supervisor: Marlon Dumas (marlon dot dumas ät ut dot ee)

Business process deviance refers to the phenomenon whereby a subset of the executions of a business process deviate, in a negative or positive way, with respect to the expected or desirable outcomes of the process. Deviant executions of a business process include those that violate compliance rules, or executions that undershoot or exceed performance targets. Deviance mining is concerned with uncovering the reasons for deviant executions by analyzing business process event logs. Current deviance mining techniques are focused on identifying patterns or rules that are correlated with deviant outcomes. However, the obtained patterns might not actually help to explain the causes of the deviance. In this thesis, you will enhance existing deviance mining techniques with causal discovery techniques in order to more precisely identify the potential causes of deviant process executions.

Dynamic Time Warping for Predictive Monitoring of Business Processes

Supervisor: Marlon Dumas (marlon dot dumas ät ut dot ee)

Predictive business process monitoring refers to a family of online process monitoring methods that seek to predict as early as possible the outcome of each case given its current (incomplete) execution trace and given a set of traces of previously completed cases. In this context, an outcome may be the fulfillment of a compliance rule, a performance objective (e.g., maximum allowed cycle time) or business goal, or any other characteristic of a case that can be determined upon its completion. For example, in a sales process, a possible outcome is the placement of a purchase order by a potential customer, whereas in a debt recovery process, a possible outcome is the receipt of a debt repayment.

Existing approaches for predictive business process monitoring are designed for processes with a relatively high level of regularity, where most cases go through the same stages and these stages are more or less of the same length. In the case of very irregular processes where the number of stages and their length is variable, the accuracy of these techniques generally suffers. In this project, you will design an approach to predictive process monitoring that addresses this limitation by using a time series analysis technique known as dynamic time warping. The thesis will adopt an experimental approach. You will implement a prototype and compare it with implementations of other predictive process monitoring techniques using a collection of real-life event logs.
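Dynamic time warping itself is straightforward to implement. The sketch below computes the warping distance between two sequences; the toy numbers stand in for real trace encodings and illustrate how DTW treats a stretched case as equivalent to a faster one with the same shape:

```python
# Dynamic time warping: distance between two sequences that allows stages
# to stretch or compress, unlike a rigid position-by-position comparison.

def dtw(a, b):
    inf = float("inf")
    dp = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    dp[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # repeat a step of b
                                  dp[i][j - 1],      # repeat a step of a
                                  dp[i - 1][j - 1])  # advance both
    return dp[-1][-1]

fast_case = [1, 2, 2, 3]
slow_case = [1, 1, 2, 2, 2, 3, 3]   # same shape, unevenly stretched
print(dtw(fast_case, slow_case))    # -> 0.0 : identical up to warping
print(dtw(fast_case, [5, 5, 5, 5])) # large: genuinely different behaviour
```

In the predictive-monitoring setting, DTW distances between a running prefix and historical traces can drive a nearest-neighbour style prediction that is robust to stage-length variability.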

Identifying Working Styles

Marcello Sarini (firstname.lastname [ät] unimib.it) and Marlon Dumas

Identifying working styles is about characterizing the nature of work, especially the interdependencies among human actors in performing the activities of an unfolding business process. In this view, a working style makes visible the choices made by performers within the constraints posed by the workplace. This is crucial when the organization of work is supported by Process-Aware Information Systems, because these systems pose limits on the choices human actors can make while carrying out a business process. It would therefore be useful, in different situations, to identify working styles in order to make visible how people arrange their work duties in the presence of such technologies.

The aim of this Masters project is to implement a tool to support the identification of working styles from fine-grained log files recorded by the tools that workers use every day. It is expected that the output of the Masters thesis will become a publicly available tool, offered on a software-as-a-service basis.

The tool will provide three main functionalities:

  1. the management of the log file and its transformation into a suitable intermediate database structure;
  2. the management of the artifact representing the working style: its creation, and the identification of patterns;
  3. the visualization of the artifact: the visualization of the patterns within the artifact and the visual comparison of different artifacts.

It is expected that the tool will be implemented by using the following technologies:

  • Python as the main programming language;
  • Neo4j as the database(*);
  • Flask as the Python Web development framework.

(*) Neo4j is a graph database, falling under the umbrella of database technologies called NoSQL databases. This database was chosen because its query language, Cypher, is oriented towards the identification of patterns within the graph, and the identification of working styles is strictly related to the identification of patterns.
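For example, a reciprocal hand-over of work between two performers is the kind of pattern Cypher matches declaratively, e.g. `MATCH (a)-[:HANDS_OVER]->(b)-[:HANDS_OVER]->(a) RETURN a, b`. The sketch below finds the same pattern imperatively in Python on a toy hand-over graph; the relationship name and the data are illustrative assumptions:

```python
# Find reciprocal hand-over pairs: a hands work to b AND b hands work to a.
# In Cypher this would be a one-line MATCH; here it is done on a plain dict.

handovers = {            # who hands work over to whom, derived from the log
    "alice": {"bob", "carol"},
    "bob":   {"alice"},
    "carol": {"dave"},
    "dave":  {"carol"},
}

def reciprocal_pairs(graph):
    # Each unordered pair is reported once, sorted for determinism.
    return sorted({tuple(sorted((a, b)))
                   for a, targets in graph.items()
                   for b in targets
                   if a in graph.get(b, set())})

print(reciprocal_pairs(handovers))   # -> [('alice', 'bob'), ('carol', 'dave')]
```

The tool would express such patterns as Cypher queries against Neo4j rather than re-implementing the traversal, which is precisely the motivation for choosing a graph database.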

Case Study on Exploratory Testing

Supervisor: Dietmar Pfahl (dietmar dot pfahl ät ut dot ee)

Exploratory software testing (ET) is a powerful and fun approach to testing. The plainest definition of ET is that it comprises test design and test execution at the same time. This is the opposite of scripted testing (having test plans and predefined test procedures, whether manual or automated). Exploratory tests, unlike scripted tests, are not defined in advance and carried out precisely according to plan.

Testing experts like Cem Kaner and James Bach claim that, in some situations, ET can be orders of magnitude more productive than scripted testing, and a few empirical studies exist supporting this claim to some degree. Nevertheless, ET is often confused with (unsystematic) ad-hoc testing and is thus not always well regarded in either academia or industrial practice.

The objective of this project will be to conduct a case study in a software company investigating the following research questions:

  • To what extent is ET currently applied in the company?
  • What are the advantages/disadvantages of ET as compared to other testing approaches (i.e., scripted testing)?
  • How can the current practice of ET be improved?
  • If ET is currently not used at all, what guidance can be provided to introduce ET in the company?

The method applied is a case study. Case studies follow a systematic approach as outlined in: Guidelines for Conducting and Reporting Case Study Research in Software Engineering by Per Runeson and Martin Höst. Important elements of the thesis are a literature study, measurement, and interviews with experts in the target company.

This project requires that the student has (or is able to establish) access to a suitable software company to conduct the study.

Case Study on Test Automation

Supervisor: Dietmar Pfahl (firstname dot lastname ät ut dot ee)

Similar to the case study project on Exploratory Testing (see above), a student can work in a company to analyse the current state of the practice of test automation. The objective of this project will be to investigate the following research questions:

  • To what extent is test automation currently applied in the company (i.e., which test-related activities are currently automated and how is this done)?
  • What are the perceived strengths/weaknesses of the currently applied test automation techniques and tools?
  • How can the current practice of test automation be improved (i.e., how can the currently automated test process steps be made more productive, and what steps currently done manually are promising to be automated)?

The method applied is a case study. Case studies follow a systematic approach as outlined in: Guidelines for Conducting and Reporting Case Study Research in Software Engineering by Per Runeson and Martin Höst. Important elements of the thesis are a literature study, measurement, and interviews with experts in the target company.

This project requires that the student has (or is able to establish) access to a suitable software company to conduct the study.

Case Study on A/B Testing

Supervisor: Dietmar Pfahl (firstname dot lastname ät ut dot ee)

Similar to the case study project on Exploratory Testing (see above), a student can work in his/her company to analyse the current state of the practice of A/B testing. The objective of this project will be to investigate the following research questions:

  • To what extent (and how) is A/B testing currently applied in the company?
  • What are the perceived strengths/weaknesses of the currently applied A/B testing techniques and tools?
  • How can the current practice of A/B testing be improved?

The method applied is a case study. Case studies follow a systematic approach as outlined in: Guidelines for Conducting and Reporting Case Study Research in Software Engineering by Per Runeson and Martin Höst. Important elements of the thesis are a literature study, measurement, and interviews with experts in the target company.

This project requires that the student has (or is able to establish) access to a suitable software company to conduct the study.

Using Data Mining & Machine Learning to Support Decision-Makers in SW Development

Supervisor: Dietmar Pfahl (firstname dot lastname ät ut dot ee)

Project repositories contain much data about the software development activities ongoing in a company. In addition, much data exists from open source projects. This opens up opportunities for analysing and learning from the past, which can be converted into models that help make better decisions in the future, where 'better' can mean either more efficient (i.e., cheaper) or more effective (i.e., with higher quality).

For example, we have recently started a research activity that investigates whether textual descriptions contained in issue reports can help predict the time (or effort) that a new incoming issue will require to be resolved.

There are, however, many more opportunities, e.g., analysing bug reports to help triagers assign issues to developers. And of course, there are other documents that could be analysed: requirements, design documents, code, test plans, test cases, emails, blogs, social networks, etc. Not only the application can vary; the analysis approach can vary as well. Different learning approaches may have different efficiency and effectiveness characteristics depending on the type, quantity, and quality of the data available.
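As a sketch of this kind of decision support, the snippet below predicts whether a new issue will be slow or fast to resolve by comparing its bag-of-words to the centroid of previously resolved issues in each class. The training issues are made-up placeholders, not real repository data, and a real study would use proper tokenization, TF-IDF weighting, and cross-validated models:

```python
from collections import Counter
import math

# Nearest-centroid text classification over bags of words.
TRAINING = {
    "fast": ["fix typo in readme", "update copyright year",
             "rename variable in test"],
    "slow": ["crash race condition under heavy concurrent load",
             "memory corruption in native parser crash",
             "deadlock under concurrent writes"],
}

def bow(text):
    return Counter(text.lower().split())

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0

# One word-count centroid per resolution-speed class.
CENTROIDS = {label: sum((bow(t) for t in texts), Counter())
             for label, texts in TRAINING.items()}

def predict(issue_text):
    return max(CENTROIDS, key=lambda lbl: cosine(bow(issue_text), CENTROIDS[lbl]))

print(predict("crash when two threads write under load"))   # -> slow
print(predict("fix typo in docs"))                          # -> fast
```

The same pipeline shape (text features in, class or effort estimate out) carries over to the other applications listed above, such as triager support.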

Thus, this topic can be tailored according to the background and preferences of an interested student.

Tasks to be done (after definition of the exact topic/research goal):

  • Selection of suitable data sources
  • Application of machine learning / data mining technique(s) to create a decision-support model
  • Evaluation of the decision-support model

Prerequisite: Students interested in this topic should have successfully completed one of the courses on data mining / machine learning offered in the Master of Software Engineering program.

Crowdsourced Software Testing

Supervisor: Dietmar Pfahl (dietmar dot pfahl at ut dot ee)

Crowdsourcing has become a popular approach in software development. However, can the crowd also be used to enhance software testing? Today, several dedicated crowdsourcing services exist for the testing of mobile applications. They specifically address the problem of the exploding number of devices on which a mobile application may run, and which the developer or tester may not own, but which may be possessed by the crowd at large. Examples of these services include Mob4Hire (www.mob4hire.com), MobTest (www.mobtest.com), and uTest (www.utest.com).

Your task is to provide a systematic and coherent overview of the tools and techniques that have been employed to support crowdsourced software testing, and of the experience gained with using and managing such approaches. In addition, you should propose criteria for comparing the various platforms and conduct a comparative analysis applying those criteria.

Literature (starting points): NB: Only links to the publishing venues are provided. You are expected to retrieve the complete information for correct referencing on your own.

Summarizing Opinions Expressed in App Reviews

Supervisor: Faiz Ali Shah (faizalishah at gmail dot com) and Dietmar Pfahl (dietmar dot pfahl at ut dot ee)

Opinions expressed by app users in reviews on the AppStore or PlayStore are extremely large in number, making it difficult for app developers to digest them. In the past, researchers focused on generating sentiment summaries at the feature level or on ranking/extracting informative reviews for app developers. However, these summaries still force developers to go back to the original review text to get further details about the opinions expressed. Therefore, this study aims to apply approaches (e.g. unsupervised ones) that have been used for review summarization in other domains, such as restaurants and mp3 players, to app reviews.
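A minimal sketch of the idea, assuming hand-made feature and sentiment word lists (`FEATURES`, `POSITIVE`, `NEGATIVE` below are illustrative placeholders; an actual thesis would learn or reuse lexicons from the restaurant/mp3-player summarization literature):

```python
from collections import defaultdict

# Tiny hand-made lexicons -- purely illustrative assumptions.
FEATURES = {"battery", "login", "ads"}
POSITIVE = {"great", "good", "fast", "love"}
NEGATIVE = {"drains", "slow", "crashes", "annoying"}

def summarize(reviews):
    """Group review sentences by the app feature they mention, count
    positive/negative opinions, and keep example negative sentences so a
    developer does not have to re-read the full review text."""
    summary = defaultdict(lambda: {"pos": 0, "neg": 0, "examples": []})
    for review in reviews:
        for sentence in review.split("."):
            words = set(sentence.lower().split())
            for feature in FEATURES & words:
                entry = summary[feature]
                if words & POSITIVE:
                    entry["pos"] += 1
                if words & NEGATIVE:
                    entry["neg"] += 1
                    entry["examples"].append(sentence.strip())
    return dict(summary)

summary = summarize([
    "Battery drains too quickly. Login is fast",
    "I love the app but the battery drains overnight",
])
```

The resulting structure (per-feature counts plus example sentences) is the kind of opinion summary the paragraph above argues developers need.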

Literature (starting points):

Prerequisite: Students interested in this topic should have successfully completed one of the courses on data mining / machine learning offered in the Master of Software Engineering program.

Reverse Engineering of Graphical User Interfaces for Testing

Supervisor: Behzad Nazarbakhsh (behzad at ut dot ee) and Dietmar Pfahl (dietmar dot pfahl at ut dot ee)

GUI testing is the process of testing software through its Graphical User Interface (GUI). This process has gone through a lot of advancement, but the available GUI testing tools still use limited techniques. There are few open-source GUI test automation frameworks that present a unified solution to the GUI testing problem. One of the well-known GUI test frameworks is GUITAR (https://www.cs.umd.edu/~atif/GUITAR-Web/index.html.old). The GUITAR framework implements a fixed, predefined process with components such as a GUI ripper, an event instrumenter, a test oracle generator, a test case generator, a test case executor, and a coverage evaluator. GUI ripping is the reverse engineering of a GUI model directly from the executing application. It is the central part of GUI test frameworks: it extracts information from the executable files via low-level, implementation-dependent system calls. Developing a GUI ripper poses several challenges that require novel solutions. The specific contributions of this thesis are the following:

  • To provide an efficient algorithm to extract a GUI model without the need for its source code.
  • To implement a GUI ripper tool that can be applied to the Windows and Java Swing GUIs.
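The core of a ripper is a traversal of the running GUI's widget hierarchy that records, for each widget, the events it can fire. The sketch below illustrates only that traversal; the `Widget` class is a hypothetical stand-in for the runtime handles a real ripper would obtain via system calls or toolkit reflection (e.g., Java Swing's component tree):

```python
from collections import deque

class Widget:
    """Hypothetical stand-in for a runtime GUI handle."""
    def __init__(self, name, events=(), children=()):
        self.name, self.events, self.children = name, tuple(events), list(children)

def rip(root):
    """Breadth-first traversal of the running GUI: visit each widget,
    record its fireable events, and return an event-flow model as
    {widget name: [event names]} -- the structure a test case generator consumes."""
    model = {}
    queue = deque([root])
    while queue:
        w = queue.popleft()
        model[w.name] = list(w.events)
        queue.extend(w.children)
    return model

# Hypothetical GUI: a window with a menu and a dialog.
gui = Widget("MainWindow", ["close"], [
    Widget("FileMenu", ["open", "save"]),
    Widget("AboutDialog", ["ok"]),
])
model = rip(gui)
```

The hard part of the thesis is precisely what this sketch assumes away: discovering children and events of a live Windows or Swing application without its source code.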

Evaluation of a Toolkit for Energy Code Smell Detection

Supervisor: Hina Anwar (hina dot anwar2003 at gmail dot com) and Dietmar Pfahl (dietmar dot pfahl at ut dot ee)

Every day, more and more Android apps are developed and published through stores like the Google Play Store. With the increased use of smartphones and smartphone-related development, mobile app developers are becoming more aware of energy-related problems in apps, but research on the energy efficiency of mobile apps is still trying to catch up. One relatively new research project is the "Paprika Toolkit", an open-source, graph-based system for automatically detecting and cataloguing energy code smells in Android-based projects. However, the toolkit has been tested on a very limited number of apps and claims to detect only 7 Android-specific energy code smells, when in fact many more kinds of energy code smell exist in real-world apps. Therefore, this study aims at analysing the performance of the Paprika toolkit on other mobile apps, to check whether it can effectively detect the 7 code smells in different kinds of apps, and at suggesting improvements in terms of integrating coverage of new Android-specific energy code smells.
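To make the notion of an energy code smell concrete: one commonly cited Android example is acquiring a `PowerManager` wakelock without ever releasing it. The sketch below flags that pattern with a deliberately naive textual check. Paprika itself works on a graph model of the app, not on regexes; this is only an illustration of the decision such a detector makes.

```python
import re

def has_wakelock_smell(java_source):
    """Flag a (simplified) 'wakelock without release' energy code smell:
    the source acquires a wakelock but never releases it. A naive textual
    check for illustration only -- no control-flow analysis is done."""
    acquired = re.search(r"\.acquire\s*\(", java_source) is not None
    released = re.search(r"\.release\s*\(", java_source) is not None
    return acquired and not released

# Hypothetical Java snippets.
leaky = "wl = pm.newWakeLock(flags, tag); wl.acquire();"
fixed = "wl.acquire(); doWork(); wl.release();"
```

Evaluating Paprika means checking how reliably its graph-based detectors make this kind of judgement across many real apps.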

Literature (starting points):

Relationship between computer generated code and fault proneness

Supervisor: Siim Karus (siim04 ät ut.ee)

We have developed a method for quantitatively estimating the extent of computer-generated code used in software modules. The hypothesis is that computer-generated code leads to fewer errors. This thesis topic is about testing this hypothesis on software development data. In short, the student will collect or reuse source code revision data and calculate the estimated amount of computer-generated code in the modules at different points in time. They will then use issue repository data to check which modules have more errors found in them (at different points in time). Finally, they will try to model the relationship between the extent of computer-generated code and error proneness.
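The final modelling step could start with something as simple as correlating the per-module generated-code estimate with the number of errors found. A standard-library sketch, with entirely hypothetical measurements:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-module measurements: estimated share of generated
# code and number of errors found in the issue repository.
generated_ratio = [0.9, 0.7, 0.4, 0.1, 0.0]
errors_found = [1, 2, 4, 6, 8]
r = pearson(generated_ratio, errors_found)
```

A strongly negative `r` on real data would support the hypothesis; the thesis would of course need proper statistical modelling (and significance testing) rather than a single coefficient.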

GPU-accelerated Data Analytics

Supervisor: Siim Karus (siim04 ät ut.ee)

In this project, a set of GPU-accelerated data mining or analytics algorithms will be implemented as an extension to an analytical database solution. For this task, you will need to learn parallel processing optimisations specific to GPU programming (balancing between bandwidth and processing power), implement the analytics algorithms, and design a user interface to accompany them. As the aim is to provide an extension to analytical databases (preferably MSSQL, Oracle or PostgreSQL), you will also need to learn the extension interfaces of these databases and their native development and BI tools. Finally, you will assess the performance gains of your algorithms compared to comparable algorithms in existing analytical database tools.

GPU-accelerated Developer Feedback System

Supervisor: Siim Karus (siim04 ät ut.ee)

In this project, you will implement source code analytics algorithms on the GPU and devise a reliable and fast method for integrating the analysis feedback into integrated development environments (IDEs). For this task, you will need to learn parallel processing optimisations specific to GPU programming (balancing between bandwidth and processing power), implement the analytics algorithms, and design a user interface to accompany them. As the aim is to provide an extension to IDEs (preferably Visual Studio or Eclipse), you will also need to learn the extension interfaces of these IDEs and their native development tools. Finally, you will assess the performance gains of your algorithms compared to implementations of the same algorithms running on the CPU.

Replication of Empirical Software Engineering Case Study Experiments

Supervisor: Siim Karus (siim04 ät ut.ee)

The empirical software engineering community publishes many case studies validating different approaches to and analytical algorithms for software engineering. Unfortunately, these studies are rarely validated by independent replication. To make matters worse, the studies use different validation metrics, which makes them incomparable. Thus, your mission, should you choose to accept it, is to analyse different published case studies on one topic (e.g. bug detection, code churn estimation) to evaluate their replicability, and to replicate the studies in order to make them comparable. In short, you will:

  1. envisage a workflow/pipeline for replicating published studies (including testing for replicability);
  2. use the workflow to replicate several studies;
  3. validate these studies and compare their results on a common scale.
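The "common scale" step can be illustrated with one familiar case: studies that report precision and recall can all be mapped onto F1. The study names and numbers below are hypothetical; a real thesis would need conversions for whatever metrics the selected studies actually report.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall: one common scale onto
    which differently reported bug-detection results can be mapped."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical studies reporting (precision, recall) pairs.
studies = {"A": (0.8, 0.6), "B": (0.5, 0.9)}
common = {name: round(f1(p, r), 3) for name, (p, r) in studies.items()}
```

Once every replicated study is expressed on the same scale, their results become directly comparable.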

Linking rescue event data with public (dynamic) data

Supervisor: Siim Karus (siim04 ät ut.ee)

The operational planning of rescue services would benefit from exploiting operational public data (e.g. public events, roadworks, population density, land use, etc.). The task in this thesis is to create a solution that combines online public data with rescue event data. The ultimate aim is to find correlations between the rescue events (and their attributes) and the online data in order to estimate changes in rescue event risk. It is expected that the final solution can bring the Rescue Services' attention to public data that can affect the response times or the risk of an accident (thus allowing better planning of response teams).

The thesis will be conducted in cooperation with the Rescue Services. If the cooperation with the Rescue Services is fruitful, it might be possible to continue this research/analysis beyond Master's Thesis (i.e. by introducing predictive analytics and making use of private datasets made available to the Rescue Services).

The datasets to be used are in Estonian, so knowledge of Estonian is an advantage. The thesis can be written in Estonian.

Profiling of deaths and risk groups of rescue events

Supervisor: Siim Karus (siim04 ät ut.ee)

In order to better identify the people with a higher risk of accident or death, it is necessary to develop profiles of people at higher risk. In this thesis, you will get access to data collected by the Rescue Services regarding cases of death or injury. The scope of these data is extremely limited due to limits set by data privacy regulations. Thus, you will enrich the data by tapping into public data sources so as to build risk profiles. The first part of the thesis project will be to create a data crawler that finds supplemental data about the people injured or killed in accidents. The second part of the thesis project will be to build a discriminative model in order to identify what causes these people to be at a higher accident risk than other people. This information will be used to improve the focus of the preventive efforts of the Rescue Services.

This thesis will be conducted in cooperation with the Rescue Services. The datasets to be used are in Estonian, so knowledge of Estonian is an advantage. The thesis can be written in Estonian.

Designing Visually Effective and Intuitive Modelling Notations for Security Risk Management

Supervisor: Raimundas Matulevičius (rma [ät] ut [dot] ee)

Graphical notations play an important role in information systems modelling. To express different perspectives, one could use various modelling languages. Security risk management is an activity for understanding security risks at the early stages of system development. It could be performed using various modelling languages, such as security risk-oriented/aware BPMN, Secure Tropos, misuse cases, mal-activity diagrams, etc. However, to use these languages, one needs to learn which visual constructs actually represent the concepts of security risk management. The goal of this thesis is to understand the intuitiveness, boundaries, and constraints of the visual notations used to express security risk management concepts, and to improve them. The main steps of this work are as follows:

  1. Comprehend the principles of the graphical notation construction
  2. Study the state of the art and report how principles of the graphical notation construction are used in different modelling perspectives
  3. Perform a sequence of empirical studies to understand the limitations and advantages of the graphical notations of the current modelling languages for the security risk management. Compare the empirical results to the state of the art
  4. Design and propose the improvement of the graphical notations for the security risk management
  5. Validate the proposed notations empirically

Starting literature:

  • Domain Model for Security Risk Management:
    • Dubois, E., Heymans, P., Mayer, N., Matulevicius, R.: A Systematic Approach to Define the Domain of Information System Security Risk Management, pp. 289–306. Springer (2010)
    • Mayer, N.: Model-Based Management of Information System Security Risk. Phd thesis, University of Namur (2009)
  • Principles of the visual notation construction:
    • Moody, D., The “Physics” of Notations: Toward a Scientific Basis for Constructing Visual Notations in Software Engineering, IEEE Transactions on Software Engineering, vol 35, no 6, 2009
  • Security Risk-oriented/aware modelling languages:
    • Matulevicius, R., Mayer, N., Mouratidis, H., Dubois, E., Heymans, P.: Syntactic and Semantic Extensions to Secure Tropos to Support Security Risk Management. J. UCS 18(6), 816–844 (2012)
    • Altuhhova, O., Matulevicius, R., Ahmed, N.: An Extension of Business Process Model and Notation for Security Risk Management. International Journal of Information System Modeling and Design (IJISMD) 4(4), 93–113 (2013)
    • Chowdhury, M.J.M., Matulevicius, R., Sindre, G., Karpati, P.: Aligning Mal-activity Diagrams and Security Risk Management for Security Requirements Definitions. In: Requirements Engineering: Foundation for Software Quality. pp. 132–139. Springer (2012)
    • Soomro, I., Ahmed, N.: Towards Security Risk-Oriented Misuse Cases. In: Business Process Management Workshops. pp. 689–700. Springer LNBIP (2012)
  • Examples of the cognitive notations
    • Moody D., Heymans P., Matulevicius R., Visual syntax does matter: improving the cognitive effectiveness of the i* visual notation, Requirements Eng (2010) 15:141–175
    • Genon, N., Heymans, P., Amyot, D.: Analysing the cognitive effectiveness of the BPMN 2.0 visual notation. In: Malloy, B., Staab, S., van den Brand, M. (eds.) SLE 2010. LNCS, vol. 6563, pp. 377–396. Springer, Heidelberg (2011)
    • Leitner M., Schefer-Wenzl S., Rinderle-Ma S., An Experimental Study on the Design and Modeling of Security Concepts in Business Processes, IFIP Working Conference on The Practice of Enterprise Modeling, PoEM 2013: The Practice of Enterprise Modeling pp 236-250

Prediction Model for Tendencies in Cybersecurity (BOOKED)

Supervisors: Raimundas Matulevičius (rma [ät] ut [dot] ee) and Justinas Janulevičius (Vilnius Gediminas Technical University, Lithuania)

The current global focus on the IT industry is boosting the growth of economies, optimizing resources and providing space for new, previously unavailable markets. The line between the physical and virtual worlds is becoming vaguer with time, due to the growing use of network-enabled devices based on the Internet of Things concept. Accordingly, our daily lives are becoming highly dependent on cyber-physical things. This dependence raises security concerns that require in-depth understanding to predict. This topic should produce a prediction model for tendencies in cybersecurity based on open cybersecurity threat and vulnerability data.

The need to predict future events is crucial in most activities, ranging from the estimation of future production demand to expected climate changes. As IT is integrated into our lives, we are unavoidably exposed to cybersecurity threats and vulnerabilities. Avoiding IT-based equipment does not help anymore, as even the most basic daily devices, such as home appliances, gadgets and vehicles, are connected to the Internet. The current focus on cybersecurity threats provides past trends based on historical data, without aiming to predict future events. For example, the ENISA Threat Landscape provides a report on the trends of the past year, with a short description of emerging trends.

However, given there is enough obtainable statistical data on the subject matter, scientific procedures can be applied to design a prediction model for the cybersecurity tendencies.

Major steps:

  1. Get all the entries from years [2011;2015] from https://cve.mitre.org/data/downloads/index.html
  2. Extract information (or find on the internet) about possible categories (hint: ENISA Threat Landscape 2016)
  3. Extract scores and subscores of the vulnerabilities and exploits from (1.)
  4. Rank the vulnerabilities and exploits according to (2.)
  5. Analyze the changes in the score and subscores of the vulnerabilities and exploits of the categories that you made in (4.)
  6. Use statistical and AI models to predict what should be the average score and subscore in categories in year 2016
  7. Get data about year 2016 from (1.)
  8. Compare results of (6.) with (7.), make hypotheses and gather evidence to back them or dismiss them.
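Step 6 could start with the simplest of statistical models: an ordinary least squares trend fitted to the yearly average scores of a category, extrapolated to 2016 and then compared against the actual 2016 data from step 7. The yearly averages below are hypothetical placeholders, not real CVE statistics.

```python
def linear_trend(years, scores):
    """Ordinary least squares fit of score = a + b*year -- the simplest
    of the 'statistical and AI models' that step 6 calls for."""
    n = len(years)
    my, ms = sum(years) / n, sum(scores) / n
    b = sum((y - my) * (s - ms) for y, s in zip(years, scores)) / \
        sum((y - my) ** 2 for y in years)
    a = ms - b * my
    return a, b

# Hypothetical yearly average scores for one threat category,
# as would be aggregated from the 2011-2015 CVE downloads in step 1.
years = [2011, 2012, 2013, 2014, 2015]
avg_scores = [6.0, 6.2, 6.4, 6.6, 6.8]
a, b = linear_trend(years, avg_scores)
predicted_2016 = a + b * 2016
```

The gap between `predicted_2016` and the observed 2016 average is exactly the evidence step 8 asks you to gather for or against your hypotheses.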

Value-aware Information Systems Development

Raimundas Matulevičius (rma [ät] ut [dot] ee) and Christopher Feltus (LIST, Luxembourg)

Nowadays, most research that focuses on defining the semantics of value agrees on its abstract character, which mostly stems from its many different natures (money, privacy, security, freedom, quality…). When value is perceived at the provider side, economists largely argue that it is created (manufactured) by the firm and distributed in the market, usually through the exchange of goods and money. Indeed, the nature of value has traditionally been represented by the possession of wealth and money. However, it is also worth noting that considering the provider in the context of the digital society expands this narrow perception to other value elements, like the information collected about customers, which afterwards contributes to economic growth.

Let's take the example of an SME that outsources the privacy management of its assets to dedicated enterprises in order to remain focused on its core business. In this case, the privacy nature of the value is traditionally expressed with well-defined characteristics (e.g., pseudonymity, anonymity, consent, etc.). Moreover, two types of value are created by this outsourcing: a direct value (privacy of the assets) and an indirect value (more time for core activities). Over and above that, when this transaction happens with a customer who is a citizen, it also contributes to improving that citizen's well-being.

In this context, the goal is to define an ontology of value co-creation (VCC) and to establish a value co-creation reference language to support the analysis, the modelling, and the design of a new generation of information systems i.e., value-aware information systems (VAIS).

Major steps:

  1. reviewing the state of the art in the field of VCC and privacy in order to acquire the required knowledge;
  2. understanding specific cases of relationships between privacy and value (generated by the privacy);
  3. elaborating instances of well-defined reference languages to express the privacy-related value co-created in the frame of the cases analysed, validating the instances, and disseminating the results through the scientific community.

Security Risk Management Framework for Blockchain-based Applications

Raimundas Matulevicius (rma [ät] ut [dot] ee) and Nicolas Mayer (LIST, Luxembourg)

Blockchain-based applications are increasingly investigated to bring trust to processes that initially need a neutral third party, which is generally expensive to establish and maintain. However, security issues may arise and are considered the main reason not to adopt blockchain-based applications for critical processes. Since blockchain is still an emerging technology, assessing its security risks is not yet a mature practice, and no dedicated approach is available. The objective of this Master's thesis is to define a framework for assessing security risks of blockchain-based applications in a focused and exhaustive way.

State of the art includes:

  • Blockchain architecture and specific security issues / key risks
  • Risk management methods
  • Security assessment methods dedicated to applications (e.g., STRIDE approach)

Security Risk Management Framework for Internet of Things (IoT) Applications

Raimundas Matulevicius (rma [ät] ut [dot] ee) and Nicolas Mayer (LIST, Luxembourg)

Internet of Things (IoT) applications combine various devices and their aggregated applications, spanning physical, software, distributed, and social systems. In such a complex system, it is important to define and implement security policies to protect information against malicious misuse. The objective of this Master's thesis is to define a framework for assessing security risks of IoT applications in a focused and exhaustive way.

Major steps:

  1. Perform a state-of-the-art review of existing security risk management frameworks (including those for the IoT), security (and potentially privacy) threats, and security countermeasures.
  2. Define the security risk management framework for the IoT applications.
  3. Validate the constructed framework in the theoretical and/or empirical settings.

Model-based Secure Software System Development

Raimundas Matulevicius (firstname dot lastname ät ut dot ee)

The use of security models can support discussion of security needs and their importance. Models contribute to security requirements validation and can potentially guide secure software coding activities. However, there exist a number of modelling approaches that contribute different perspectives or viewpoints on the developed secure system. The major goal of this Master thesis is to establish a systematic method to align different security perspectives expressed using various modelling notations. The major research steps are:

  1. Perform literature survey (i) on the existing security modelling languages and (ii) on the existing transformation between different security models
  2. Develop a systematic approach that guides developers in aligning different security perspectives
  3. Validate the proposed method either through the proof of concept or empirically (e.g., experimental comparison with similar approaches).

Pattern-based Security Requirements Derivation from Use Case Models

Raimundas Matulevicius (firstname dot lastname ät ut dot ee)

Security requirements engineering plays an important role during software system development. However, in many cases security requirements are overlooked and considered only at the end of software development. A possible way to improve this situation is the development of systematic instruments that facilitate security requirements elicitation. For instance, security patterns describe a particular recurring security problem that arises in a specific security context and present a well-proven generic scheme for a security solution.

Use case diagrams are a popular modelling technique to describe, organize, and represent functional system and software requirements and to define the major actors who interact with the considered system. Recently, their security extension – misuse case diagrams – has been proposed to address negative scenarios.

The major goal of this Master thesis is to develop a set of security patterns using use and misuse cases, and to illustrate how these patterns could be used to derive security requirements from use cases. The thesis includes the following steps:

  1. Conduct a literature review on (i) security engineering and security patterns, (ii) use cases and misuse cases, and (iii) security risk-oriented misuse cases;
  2. Develop a set of security patterns using use/misuse case diagrams;
  3. Develop guidelines to derive security requirements using the developed security patterns;
  4. Validate the developed security patterns and their guidelines empirically.

Parking Solution on the Blockchain

Supervisors: Luciano García-Bañuelos (luciano.garcia [ät] ut dot ee) & Fredrik Milani

Blockchain technology, and in particular Distributed Ledger Technology (DLT), has received a tremendous amount of attention from industry, and the decentralized attributes of DLT are being widely investigated. Several platforms, such as Hyperledger, Ethereum, Corda, Chain, etc., are available as open source for exploring and for developing new applications. In this thesis, you will build a prototype of a parking solution. The parking use case allows those who have a parking space to rent it out to those who need one. The solution is to be built on blockchain technology. If possible, the same solution can be built on two different DLTs so as to compare and contrast them.

Rental marketplace on the Blockchain

Supervisors: Luciano García-Bañuelos (luciano.garcia [ät] ut dot ee) & Undisclosed industry partner

An online real estate intermediary based in Singapore is currently running a rental marketplace based on a traditional centralized-database Web application. The company is planning to deploy a new blockchain-based marketplace that will provide a wider range of functions in a tamper-proof and peer-to-peer manner. In the new marketplace, a notion of blockchain token will be used to facilitate rental transactions, and smart contracts will be deployed to trigger the associated payments. The company is willing to fly the Masters student who selects this project to Singapore for a short period of time in order to get the project started. The rest of the project will be done remotely from Estonia. The company already has detailed use cases documented and is ready to engage in the development of the platform.

This project requires excellent software development skills, strong analytical and problem-solving skills, and the ability to quickly learn how to deploy and configure blockchain platforms.

Due to confidentiality issues, details of this project are provided only upon request. The Masters thesis might have to be defended in a closed-doors fashion, but this will only be determined closer to the defense date.

Hot Deployment of Linked Data for Online Data Analytics

Supervisor: Peep Küngas (peep.kungas ät ut.ee)

The aim of this project is to design and implement a "hot" linked data deployment extension to an open source analytics server, such as RapidAnalytics. Software tools such as Weka or RapidMiner allow building analytical applications that exploit knowledge hidden in data. However, one of the bottlenecks of such toolkits, in settings where vast quantities of data with heterogeneous data models are available, is the amount of human effort required first for unification of the data models at the data pre-processing stage and then for extraction of relevant features for data mining. Furthermore, these steps are repeatedly executed each time a new dataset is added or an existing one is changed. In the case of open linked data, however, the uniform representation of data enables implicit handling of data model heterogeneity. Moreover, there exist open source toolkits, such as FeGeLOD [1], which automatically create data mining features from linked data. Unfortunately, the current approaches assume that a linked dataset is already pre-processed and available as a static file for which the features are created each time the file is loaded.

In this thesis project, first an extension will be developed for discovering and loading new datasets into an analytics server. Then, existing data mining feature extraction methods will be enhanced and incorporated into the framework. Finally, the developed solution will be validated on a real-life problem.
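To illustrate the kind of feature extraction FeGeLOD automates: from a linked dataset one can, for example, derive one binary feature per `rdf:type` attached to an entity. The sketch below is a heavily simplified, standard-library version of that idea, with a hypothetical three-triple dataset.

```python
def type_features(triples):
    """FeGeLOD-style feature generation (greatly simplified): for every
    entity, derive one binary feature per rdf:type found in the linked
    data, ready to be appended to a data mining table."""
    entities = {s for s, p, o in triples}
    types = sorted({o for s, p, o in triples if p == "rdf:type"})
    table = {}
    for e in sorted(entities):
        has = {o for s, p, o in triples if s == e and p == "rdf:type"}
        table[e] = {t: int(t in has) for t in types}
    return table

# Hypothetical linked data fragment as (subject, predicate, object) triples.
triples = [
    ("ex:Tartu", "rdf:type", "dbo:City"),
    ("ex:Tartu", "rdf:type", "dbo:Place"),
    ("ex:Estonia", "rdf:type", "dbo:Country"),
]
features = type_features(triples)
```

The "hot" deployment extension would have to recompute such feature tables incrementally whenever a dataset is added or changed, instead of reloading a static file.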

[1] Heiko Paulheim, Johannes Fürnkranz. Unsupervised Generation of Data Mining Features from Linked Open Data. Technical Report TUD–KE–2011–2 Version 1.0, Knowledge Engineering Group, Technische Universität Darmstadt, November 4th, 2011. Available at http://www.ke.tu-darmstadt.de/bibtex/attachments/single/297 .

Open Cloud Infrastructure for Cost-Effective Harvesting, Processing and Linking of Unstructured Open Government Data

Supervisor: Peep Küngas (peep.kungas ät ut.ee)

Bootstrapping Open Government Data projects, even when not considering the complementary five-star initiatives for linking the data, represents a tremendous task if implemented manually in an uncoordinated way in ad hoc settings. Although the essential datasets may be publicly available for downloading, locating them, documenting them, and preparing them for publication in a central (CKAN) repository represents a burden that is difficult for officials of public administration to absorb. Furthermore, linking the prepared and published datasets presents even further challenges, especially in the case of semi-structured and full-text documents, where understanding the content is complicated by the lack of a clear structure. Namely, detecting the entities that should be linked and the metamodels that should be used for linking is a search-intensive task even for machines, not only for humans.

Luckily, there are some open source tools for simplifying the process. A set of tools, including the ones for semi-automatic link discovery, is represented in the LOD2 Technology Stack (http://stack.lod2.eu/blog/). In addition there are general-purpose text processing frameworks such as Apache OpenNLP and for the Estonian language there is a named entity recognition solution (svn://ats.cs.ut.ee/u/semantika/ner/branches/ner1.1) available. Finally, there is NetarchiveSuite (https://sbforge.org/display/NAS/Releases+and+downloads) for Internet archival, which can be used for creating Web snapshots.

This project aims at developing a cloud platform for harvesting Open Government Data and transforming it into Linked Open Government Data. The platform consists of a Web archival subsystem, an open data repository (CKAN), a document content analysis pipeline with named entity recognition and resolution, and finally a linked data repository for serving the processed data. The Web archival subsystem will continuously monitor changes on the Web by creating monthly snapshots of the Estonian public administration Web, comparing the snapshots, and detecting new datasets (or changes) together with their metadata. The datasets together with their metadata are automatically published in a CKAN repository. The CKAN repository is continuously monitored for new datasets and updates, and each change will trigger execution of the document content analysis pipeline (i.e. analysis of CSV file content). The pipeline will detect named entities in the source documents, resolve the names with respect to other linked datasets (i.e. addresses or organizations), and finally publish the updates in a linked data repository with an open SPARQL endpoint. The latter will provide means for consumption of Linked Open Government Data.
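One small stage of the pipeline above, sketched under strong assumptions: deciding which columns of a harvested CSV file contain linkable entities. The real pipeline would call the Estonian named entity recognizer; here a hypothetical gazetteer (`KNOWN_ORGS`) stands in for it.

```python
import csv
import io

# Hypothetical gazetteer standing in for the Estonian NER component.
KNOWN_ORGS = {"Tartu Linnavalitsus", "Maanteeamet"}

def detect_entity_columns(csv_text, threshold=0.5):
    """Return the CSV columns in which at least `threshold` of the values
    match a known entity -- the columns worth resolving and linking
    against other datasets. Returns {column name: match count}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    hits = {}
    for col in rows[0]:
        matches = sum(1 for r in rows if r[col] in KNOWN_ORGS)
        if matches / len(rows) >= threshold:
            hits[col] = matches
    return hits

# Hypothetical harvested CSV content.
data = "asutus,summa\nTartu Linnavalitsus,100\nMaanteeamet,250\n"
cols = detect_entity_columns(data)
```

Columns flagged this way would then be passed to name resolution and published as links in the triple store.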

A Crawler for RESTful, SOAP Services and Web Forms

Supervisor: Peep Küngas (peep.kungas ät ut.ee)

The Deep Web, consisting of online databases hidden behind SOAP-based or RESTful Web services or Web forms, is estimated to contain about 500 times more data than the (visible) Web. Despite many advances in search technology, the full potential of the Deep Web has been left largely underexploited. This is partially due to the lack of effective solutions for surfacing and visualizing the data. The Deep Web research initiative at the University of Tartu's Institute of Computer Science has developed an experimental platform to surface and visualize Deep Web data sources hidden behind SOAP Web service endpoints. However, this experimental platform currently supports only a limited set of SOAP endpoints, updated on an ad hoc basis.

The aim of this project is to build a crawler and an indexing engine capable of recognizing endpoints behind Web forms, RESTful services and SOAP-based services, together with their explicit descriptions (e.g. WSDL interface descriptions, when available). Furthermore, the crawler should identify examples of queries that can be forwarded to those endpoints, especially for endpoints with no explicit interface descriptions such as Web forms.

This project is available both for Masters and for Bachelors students. The goal of the Masters project would be to build a crawler supporting endpoints with and without explicit interfaces. The goal of the Bachelors thesis would be to crawl WSDL interfaces only.
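The crawler's core decision, classifying a fetched page as a SOAP endpoint, a Web form, or neither, can be sketched with the standard library. The heuristics below (a `?wsdl` URL suffix, a `<wsdl:definitions` marker, `<form>` tags) are illustrative assumptions; a real crawler would also probe REST conventions and inspect content types.

```python
from html.parser import HTMLParser

class FormFinder(HTMLParser):
    """Collects the action URLs of HTML forms -- candidate Deep Web entry
    points that have no explicit interface description."""
    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.actions.append(dict(attrs).get("action", ""))

def classify(url, body):
    """Crude endpoint recognition for one crawled page: returns a
    (kind, endpoints) pair, where kind is 'soap', 'form', or 'other'."""
    if url.lower().endswith("?wsdl") or "<wsdl:definitions" in body:
        return ("soap", [url])
    finder = FormFinder()
    finder.feed(body)
    if finder.actions:
        return ("form", finder.actions)
    return ("other", [])

kind, endpoints = classify(
    "http://example.ee/search",
    '<html><form action="/query"><input name="q"></form></html>')
```

For form endpoints, the harder follow-up task named above remains: generating example queries that can actually be submitted to them.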

Transforming the Web into a Knowledge Base: Linking the Estonian Web

Supervisor: Peep Küngas (peep.kungas ät ut.ee)

The aim of the project is to study automated linking opportunities for Web content in the Estonian language. Recent advances in Web crawling and indexing have resulted in effective means for finding relevant content on the Web. However, getting answers to queries that require aggregation of results is still in its infancy, since a better understanding of the content is required. At the same time, there has been a fundamental shift in content linking - instead of linking Web pages, more and more Web content is tagged and annotated to facilitate linking of smaller fragments of Web pages by means of RDFa and microformat markup. Unfortunately, this technology has not been widely adopted yet, and further efforts are required to advance the Web in this direction.

This project aims at providing a platform for automating this task by exploiting existing natural language technologies, such as named entity recognition for the Estonian language, in order to link the content of the entire Estonian Web. To do this, two Masters students will work closely together, first setting up a conventional crawling and indexing infrastructure for the Estonian Web and then extending the indexing mechanism with a microtagging mechanism that will enable linking the crawled Web sites. The microtagging mechanism will take advantage of existing language technologies to extract names (such as names of persons, organizations, and locations) from the crawled Web pages. To validate the approach, a portion of the Estonian Web will be processed and exposed in RDF form through a SPARQL query interface such as the one provided by the Virtuoso OpenSource Edition.
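To illustrate the microtagging step, the sketch below turns extracted (name, type) pairs into RDF N-Triples that could be loaded into a triple store such as Virtuoso. The URIs and the function name are illustrative only; in the envisioned pipeline the entity pairs would come from an Estonian named-entity recognizer run over each crawled page:

```python
def entities_to_ntriples(page_url, entities):
    """Serialize (entity name, entity type) pairs found on a page as
    N-Triples linking the page to entity resources."""
    type_uris = {
        "PER": "http://xmlns.com/foaf/0.1/Person",
        "ORG": "http://xmlns.com/foaf/0.1/Organization",
        "LOC": "http://dbpedia.org/ontology/Place",
    }
    triples = []
    for name, etype in entities:
        entity_uri = "http://example.org/entity/" + name.replace(" ", "_")
        triples.append(f"<{page_url}> <http://purl.org/dc/terms/references> <{entity_uri}> .")
        triples.append(f"<{entity_uri}> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <{type_uris[etype]}> .")
        triples.append(f'<{entity_uri}> <http://www.w3.org/2000/01/rdf-schema#label> "{name}" .')
    return "\n".join(triples)

print(entities_to_ntriples("http://example.ee/news/1",
                           [("Tartu", "LOC"), ("Peep Küngas", "PER")]))
```

Once such triples are indexed, SPARQL queries over the store can aggregate mentions across pages, which is exactly the kind of query that plain keyword indexing cannot answer.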

Automated Estimation of Company Reputation

Supervisor: Peep Küngas (peep.kungas ät ut.ee)

Reputation is recognized as a fundamental instrument of social order: a commodity which is accumulated over time, hard to gain, and easy to lose. In the case of organizations, reputation is also linked to their identity, performance, and the way others respond to their behaviour. There is an intuition that the reputation of a company affects how investors perceive its value, helps to attract new customers, and helps to retain existing ones. Therefore, organizations focused on long-term operation care about their reputation.

Several frameworks, such as WMAC (http://money.cnn.com/magazines/fortune/most-admired/, http://www.haygroup.com/Fortune/research-and-findings/fortune-rankings.aspx), used by Fortune magazine, have been exploited to rank companies by their reputation. However, there are some serious issues associated with reputation evaluation in general. First, the existing evaluation frameworks are usually applicable only to the evaluation of large companies. Second, the costs of applying these frameworks are quite high in terms of the accumulated time of the professionals engaged: in the case of WMAC, more than 10,000 senior executives, board directors, and expert analysts filled in questionnaires to evaluate nine performance aspects of Fortune 1000 companies in 2009. Third, the evaluation is largely based on subjective opinions rather than objective criteria, which makes continuous evaluation cumbersome and increases the length of evaluation cycles.

This thesis project aims at finding a solution to these issues. More specifically, the project is expected to answer the following research question: to what degree is the reputation of a company determined by objective criteria such as its age, financial indicators, and the sentiment of news articles and comments on the Web? The more specific research questions are the following:

  1. What accuracy in reputation evaluation can be achieved by using solely objective criteria?
  2. Which objective criteria, and which combinations of them, best discriminate the reputation of organizations?
  3. To what extent does the reputation of an organization affect the reputation of another organization through people common to their management?
  4. How do temporal aspects (an organization's age, related past events, etc.) bias reputation?

To answer these questions, network analysis and machine learning methods will be exploited, and a number of experiments will be performed with a given dataset. The dataset to be used is an aggregation of data from the Estonian Business Registry, the Registry of Buildings, the Land Register, the Estonian Tax and Customs Board, the Register of Economic Activities, news articles from major Estonian newspapers and blogs, and some proprietary data sources.

Visualization of traffic flow and/or people density changes with animated textures/particles

Toivo Vajakas (firstname.lastname ät ut.ee)

Download the project description.

Machine learning in the cloud

Chris Thompson (chris [ät] speaklanguages dot com)

Cell constructor

Leopold Parts (firstname.lastname ät gmail.com)

Download the project description.

Situational awareness for cloud centers

Ilja Livenson (firstname.lastname [ät] ut [dot] ee)

In this project, you will develop a toolchain for the collection and analysis of cloud data center operation data. The tool will gather and process data across different sources (e.g. hardware sensors; operating system monitoring; internal, campus, and public network monitoring; network inspection (DPI, IDS); user behaviour; service behaviour; etc.).

The toolset should allow users to answer the following questions:

  • what is happening overall
  • what started happening 30 minutes ago
  • who/what is affected

This is conceptually similar to https://github.com/JakobRogstadius/CrisisTracker (with a different domain and probably a more dashboard-oriented focus).

This is a real problem and will lead to a prototype that will be used in practice. The prototype will be tested by six operators of data centers in Estonia. The project can be made smaller or larger by varying the amount and variety of data to be gathered and analyzed.
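A question like "what started happening 30 minutes ago" can be approached by comparing event frequencies in a recent window against a historical baseline. A minimal sketch, under the assumption that the gathered monitoring data has already been reduced to typed event records; the thresholds and event names are illustrative:

```python
from collections import Counter

def new_activity(baseline_events, recent_events, factor=3.0, min_count=5):
    """Flag event types whose share of the recent window (e.g. the last
    30 minutes) is at least `factor` times their baseline share. Shares
    are normalized, so the two windows may have different lengths."""
    base = Counter(baseline_events)
    recent = Counter(recent_events)
    base_total = max(len(baseline_events), 1)
    recent_total = max(len(recent_events), 1)
    flagged = []
    for event, count in recent.items():
        if count < min_count:
            continue  # ignore noise from very rare events
        base_rate = base[event] / base_total
        recent_rate = count / recent_total
        if base_rate == 0 or recent_rate / base_rate >= factor:
            flagged.append(event)
    return flagged

history = ["login"] * 90 + ["disk_warn"] * 10
last_30m = ["login"] * 10 + ["disk_warn"] * 10
print(new_activity(history, last_30m))  # ['disk_warn']
```

A production tool would of course use proper change-point detection and correlate across the data sources listed above, but the windowed comparison conveys the core idea.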


Topics for IT Conversion Masters Theses (15 ECTS)

The Role of the Business Analyst in the Digital Era

Supervisor: Fredrik Milani (milani [ät] ut [dot] ee)

The traditional understanding of the role of the Business Analyst is quite clear. However, in the past 10 years, the traditional company has changed. The emergence of digital companies, agile methods, and new roles such as product owner, together with the changing role of project managers, has changed the landscape and, with it, the role and competences of business analysts. This thesis will examine these different roles, analyse what the role of the business analyst can be in such contexts, and identify what competences business analysts would require to stay relevant.

The Role of the Business Analysts in Agile Processes

Supervisor: Fredrik Milani (milani [ät] ut [dot] ee)

The role of the business analyst was very clear in predictive software development processes (such as waterfall and the V-model). However, with the growing popularity of agile methods, the role of the business analyst is becoming less clear. This thesis aims to investigate agile methods, examine what work is being done to define the role of the business analyst in them, and map out how business analysts can deliver value in agile methods.

Digitalization and Process Innovation

Supervisor: Fredrik Milani (milani [ät] ut [dot] ee)

Digitalization has disrupted many industries and changed the way business is conducted. It has reformed and revolutionized business models and been instrumental in driving process innovation. Still, many industries and companies have not yet utilized the full potential of digitalization. In this thesis, you will survey the digitalization opportunities exploited so far and, by overlaying them on the business model canvas, create a framework for how digitalization has innovated processes within a certain industry. The topic is to be confined to one industry, such as accounting, savings, health care, manufacturing, shipping, retail, and so on. You are encouraged to choose an industry in which you have experience. The end result of the thesis will be a framework by which one can see, understand, and identify how digitalization can enable process innovation in different parts of the business model of a company.

Digitalization and the Role of the Business Analyst

Supervisor: Fredrik Milani (milani [ät] ut [dot] ee)

Business Analysts have predominantly worked with identifying needs, mapping the current state, eliciting requirements, and designing solutions. As the profession grew and was established in the pre-"big data" era, its competencies, methods, tools, and approaches were designed for traditional incumbents. However, with the emergence and penetration of the "data-driven" perspective, new sets of perspectives, competencies, methods, tools, and approaches are required. The role of the business analyst is changing, but it is not yet clear into what. In this thesis, you will survey the needs of "data-driven" projects, interlay those with the role and competencies of the business analyst, and analyse the results so as to outline and explain what the role of a business analyst can and should be in the "data-driven" era in order to deliver value.

Blockchain and Business Processes

Supervisor: Fredrik Milani (milani [ät] ut [dot] ee)

Interest in blockchain is growing very strongly. As this new technology gains traction, many uses are being proposed, ranging from voting to financial market applications. Currently, the hype around the technology overshadows the value it can deliver by enabling changes in business processes. While blockchain can deliver value by replacing existing IT solutions, the real value comes from innovating the business processes themselves. This topic is about exploring, for one of the industries/cases below, what the current business processes are and how blockchain could enable their innovation, and finally comparing/contrasting the two in order to draw conclusions.

Each of the cases listed below can be the starting point for a 15-ECTS Masters thesis. In each case, the goal will be to examine the current processes, identify improvement opportunities using blockchain technology, and propose and analyze a redesigned process.

  • Health Care – transferring and owning your own medical health records and prescription management
  • Registry – management of assets (digital and physical) including registration, tracking, change of ownership, licensing and so on
  • Financial Markets – covering one or several cases such as post trading settlement of securities and bilateral agreements
  • IoT – connecting multiple devices with blockchain

Customer Journey Mapping

Supervisor: Marlon Dumas (marlon dot dumas ät ut dot ee)

A Customer Journey Map is a graphical representation of how a given customer interacts with an organization in order to consume its products or services, possibly across multiple channels. Several approaches for customer journey mapping exist nowadays, each relying on different concepts and notations. In this thesis, you will review the most popular approaches currently in use for customer journey mapping and distill from them a common set of concepts and notations. You will then show how these concepts and notations can be applied in an organization of your choice (preferably an organization where you work or one with which you have a lot of experience interacting as a customer).

Case Study: Impact of GDPR on a Business Process

Article 5(1)-c of the European General Data Protection Regulation (GDPR) sets a requirement for data minimization. Specifically, this article establishes that companies that process personal data of European citizens must be able to demonstrate that every business process in which personal data is handled uses the personal data in a way that is "adequate, relevant and limited to what is necessary in relation to the purposes for which [the personal data] are processed" [1].

In this thesis, you will study what this data minimization requirement means concretely in a given case study within your company or another company you are very familiar with. To be able to study this topic, you need to have access to documentation and you should be able to interview the key stakeholders involved in one or more business processes where private data is handled. As part of the project, you will model the process(es) in detail using a rigorous process modeling method; you will analyze the data access and processing within this/these process(es), and you will determine if the process "as is" fulfills the data minimization requirement. If it does not, you will propose possible changes to the process to implement this requirement and analyze their trade-offs.

[1] https://www.privacy-regulation.eu/en/5.htm

Case Study in Business Process Improvement or Business Data Analytics

Supervisor: Marlon Dumas (marlon dot dumas ät ut dot ee)

This is a "placeholder" Masters project topic, which needs to be negotiated individually. If you work in an IT company and you are actively engaged in a business process improvement or business data analytics project, or if you can convince your hierarchy to put time and resources into such a project in the near term, we can make a case study out of it. We will sit down and formulate concrete hypotheses or questions that you will test/address as part of this project, and we will compare your approach and results against state-of-the-art practices. I am particularly interested in supervising thesis topics related to customer analytics, product recommendation, business process analytics (process mining), and privacy-aware business analytics, but I welcome other topic areas.

Risk Management in a Startup Context

Evgenia Trofimova (evgenia [dot] trofimova [ät] ut [dot] ee)

Startups are by nature highly risk-taking enterprises, since their business model is not yet cemented or relies on a number of untested hypotheses. This risk-taking attitude, however, does not necessarily mean that startups do not need, or do not actively engage in, risk management.

In this project, you will interview a number of IT entrepreneurs and other stakeholders in the IT startup field to understand if and how risk management is handled in this environment. You will analyze whether traditional risk management approaches (e.g. TBQ) are used in this setting (and why or why not), and whether other emerging risk management approaches (e.g. based on the business model canvas) are already in active use, implicitly or explicitly.

Team diversity in Early-Stage Tech Companies

Evgenia Trofimova (evgenia [dot] trofimova [ät] ut [dot] ee)

We like to talk about the importance of diversity for creating great products. In this project, you will collect data from startups via surveys and/or interviews, particularly in the Estonian and Nordic IT startup context, in order to shed light on the question of how important diversity is in the context of MVP and early-stage product development.

Product Management in Estonian Tech Companies

Evgenia Trofimova (evgenia [dot] trofimova [ät] ut [dot] ee)

Product management is quite a new topic on the Estonian market. In this thesis, you will study how software development processes have changed in Estonian tech companies after they hired their first product manager(s), and which aspects of product management are more or less developed in the Estonian tech sector compared to international practices.

Case Study in Risk Management, Product Management and Release Management in an IT Company

Evgenia Trofimova (evgenia [dot] trofimova [ät] ut [dot] ee)

This is a "placeholder" Masters project topic, which needs to be negotiated individually. If you work in an IT company that has actively engaged in risk management or product management on a more or less structured basis, you can focus your Masters thesis project on studying why and how these practices were introduced in the company, how they have evolved, and how the introduction and development of these management practices have impacted the company's revenues, profit margins, and other strategic KPIs. If you are in a company where these questions can be studied and you are interested in digging into them, just contact me.

Minimum Viable Products (MVP) for Hardware: Is it or can it be done?

Evgenia Trofimova (evgenia [dot] trofimova [ät] ut [dot] ee)

MVPs in the software industry are common practice and relatively easy to conceive because of software's malleable nature: you push updates every 2 weeks and people get them. In the hardware world, you can't just ship a new product every 2 weeks. Are there ways of applying a lean methodology based on the notion of MVP in the hardware context? Or do MVPs in the hardware context appear in other forms?


Bachelors projects

Note: The number of projects for Bachelors students is limited. But you can find several other potential project ideas by checking this year's list of proposed software projects. Some of the projects in the "Available" category could be used as Bachelors thesis topics. Also, we're open to student-proposed Bachelors thesis projects. If you have an idea for your Bachelors project and your idea falls in the area of software engineering (broadly defined), please contact the group leader: Marlon . Dumas ät ut.ee

Rescue event categorisation

Supervisor: Siim Karus (siim04 ät ut.ee)

In this thesis, you will analyze data provided by the Rescue Services in order to find commonalities in rescue events so as to categorise them. One of the aims will be to isolate and characterize less common rescue event categories, which are of special interest to the Rescue Services.

The thesis will be conducted in cooperation with the Rescue Services. The thesis can be written in Estonian.
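Once events carry category labels (which the thesis would first have to derive, e.g. by clustering), isolating the less common categories can start from a simple frequency cut-off. A toy sketch with invented category names and an illustrative threshold:

```python
from collections import Counter

def rare_categories(events, threshold=0.01):
    """Return (category, share) pairs whose share of all events is below
    `threshold`, rarest first; candidates for the 'special interest'
    categories."""
    counts = Counter(events)
    total = len(events)
    rare = [(cat, n / total) for cat, n in counts.items() if n / total < threshold]
    return sorted(rare, key=lambda item: item[1])

events = (["traffic_accident"] * 500 + ["fire"] * 480
          + ["animal_rescue"] * 15 + ["hazmat"] * 5)
print(rare_categories(events, threshold=0.02))
# [('hazmat', 0.005), ('animal_rescue', 0.015)]
```

In the actual thesis the interesting part is characterizing these rare categories (what attributes distinguish them), for which the frequency cut-off is only the starting point.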

Workflow Automation With Business Data Streams

Supervisor: Peep Küngas (peep.kungas ät ut.ee)

There exist services such as Flowxo and IFTTT which facilitate the automation of simple workflows through the creation of trigger/action pairs or if-then recipes, whose execution orchestrates applications with respect to external stimuli (e.g. an application data stream). An example of a popular recipe is the following: "IF I post a picture on Instagram THEN save the photo to Dropbox". A more complex example is "for every new deal in a CRM, send the deal with an e-mail to a person via the GMail service, then wait for about 1 day and send a reminder SMS via the Twilio service".

Such systems mostly rely on proprietary application data, while there are cases where external stimuli would provide extra benefits. An example of such a case is the integration of CRM and credit management tools with external stimuli in the form of streaming company debt and risk score data for the Order-to-Cash business process. There is a Stream API for business data currently under development at Register OÜ; it will provide a stream of events such as company debt changes, changes in board membership, and data about newly registered companies. Such data change events can be easily applied in the context of CRM and credit management (CM).

The aim of the project is to provide an analogue of IFTTT where users can define recipes for reacting to business data changes via actions in applications such as GMail, Odoo CRM, etc.

The project will be done in collaboration with Register OÜ. The application will be developed using the Complex Event Processing (CEP) feature of the Register Stream API.
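The trigger/action idea can be sketched as predicates paired with actions over a stream of events. The event fields and the connector stand-ins below are hypothetical; the real system would subscribe to the Register Stream API and invoke GMail, Odoo CRM, or Twilio connectors:

```python
# Minimal sketch of an IFTTT-style recipe engine over a business data
# stream. Event fields ("type", "company", ...) are made up for the
# example.

recipes = []

def recipe(trigger):
    """Register an action to run for every event matching `trigger`."""
    def register(action):
        recipes.append((trigger, action))
        return action
    return register

def dispatch(event):
    """Run all recipes whose trigger matches the incoming event."""
    fired = []
    for trigger, action in recipes:
        if trigger(event):
            fired.append(action(event))
    return fired

@recipe(lambda e: e["type"] == "debt_change" and e["new_debt"] > 0)
def warn_account_manager(event):
    # Stand-in for e.g. sending an e-mail via a GMail connector.
    return f"ALERT: {event['company']} now owes {event['new_debt']} EUR"

print(dispatch({"type": "debt_change", "company": "Acme OÜ", "new_debt": 1200}))
# ['ALERT: Acme OÜ now owes 1200 EUR']
print(dispatch({"type": "board_change", "company": "Acme OÜ"}))
# []
```

A CEP engine generalizes this by matching patterns over sequences of events (e.g. "debt change not followed by payment within 30 days") rather than single events.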

Lead generator for accelerating B2B sales

Supervisor: Peep Küngas (peep.kungas ät ut.ee)

Companies care a lot about improving their sales, and to meet this demand numerous online solutions have been proposed. While for B2C sales the prevalent solutions use social media campaigns and Web visitor data, for B2B sales there are solutions that allow generating a list of leads based on a set of attribute values over company data, such as activity field, size, financial metrics, etc.

Some solutions for the Estonian market include https://www.baltictarget.eu, http://sihtgrupid.ee, http://turundusnimekirjad.ee/ and http://www.kliendibaas.ee/. However, these solutions have the following deficiencies:

  1. The market segment must be known before generating the leads.
  2. The set of attributes is mostly limited to geographic, activity field, and financial data.
  3. The data is returned as a file.

This project aims to innovate B2B sales by providing a solution that differs from the existing ones in the following ways:

  1. Instead of a list of feature/value pairs, the user can define their market segment by giving a set of prospective clients as input to lead generation;
  2. In addition to activity fields, company size, and financial metrics, data about owned real estate, credit history, credit risk, media coverage, and related persons can be used;
  3. Instead of outputting leads to a CSV file, lead data will be directly imported into an existing CRM system, or a new cloud instance of a CRM will be deployed and populated with the leads.

The project will be done in collaboration with Register OÜ.
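Point 1 (defining a market segment by example clients rather than feature/value pairs) could, for instance, be approached by ranking candidate companies by similarity of their feature vectors to the seed set. A toy sketch with invented features; note that in practice the features would need normalization, since cosine similarity over raw magnitudes is dominated by the largest-scale feature:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_leads(seed_clients, candidates):
    """Rank candidate companies by average cosine similarity of their
    feature vectors to the seed client set, best matches first."""
    scored = []
    for name, features in candidates.items():
        score = sum(cosine(features, s) for s in seed_clients) / len(seed_clients)
        scored.append((name, round(score, 3)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Feature vectors: (employees, revenue in MEUR, credit risk score) -- toy data.
seeds = [(50, 4.0, 0.2), (60, 5.0, 0.25)]
candidates = {"Alpha AS": (55, 4.5, 0.22), "Beta OÜ": (3, 0.1, 0.9)}
print(rank_leads(seeds, candidates))  # 'Alpha AS' ranks first
```

The richer attributes from point 2 (real estate, credit history, media coverage, etc.) would simply extend the feature vectors.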

Automated brand magazine

Supervisor: Peep Küngas (peep.kungas ät ut.ee)

It is essential for companies to acquire new customers and retain existing ones. To address this need, content marketing techniques have been developed and advocated. However, due to a lack of proper skills, these techniques are often underutilized. Multiple tools have been created to simplify content marketing. For instance, Flipboard (https://about.flipboard.com/advertisers) simplifies the creation of brand magazines for executing content marketing among users. LinkedIn has been used by top management to deliver company news to their employees. Instant Articles by Facebook (https://instantarticles.fb.com) provides means for publishers to make their articles appear more attractive and to increase engagement rates on Facebook. However, all of the mentioned solutions expect relevant content to be provided and managed manually.

In this project, a solution will be developed that simplifies the creation and maintenance of brand magazines for companies (especially SMEs) and persons (e.g. bloggers). The key innovation of the project is that it will *automatically* search for mentions of companies, persons, and brands on the Web via the Register Graph API (https://developers.ir.ee/graph-api) and create attractive brand pages out of them, which will be made visible via search engines to the target audience. Mentions originating from online news media, the blogosphere, forums, corporate blogs, and other Web sources will be presented on brand pages with specific cards (see https://www.google.com/search/about/learn-more/now/ for the concept of cards), e.g. "Customers' feedback", "Our partners", "New product launched", "About the company", etc.

Some initial requirements:

  1. Responsive design with Google Material
  2. Use of Register Stream API and Graph API as data sources
  3. A search-engine- and user-friendly Web solution

Comparison of BPMN Security Extensions

Raimundas Matulevicius (raimundas.matulevicius ät ut.ee)

Recently, many BPMN extensions have been proposed for security analysis. These extensions concern different aspects, ranging from security problem definition to security requirements introduction and control identification. The goal of the thesis is to develop a systematic and coherent overview of these extensions and to define a set of guidelines for selecting particular BPMN security extensions for targeted problems. This should also provide an overview of emerging trends.

Starting points:

  1. Braun, R., Esswein, W.: Classification of Domain-Specific BPMN Extensions. In: The Practice of Enterprise Modeling. LNBIP, vol. 197, pp. 42-57 (2014)
  2. Menzel, M., Thomas, I., Meinel, C.: Security Requirements Specification in Service-oriented Business Process Management. In: ARES 2009, pp. 41-49 (2009)
  3. Altuhhova, O., Matulevičius, R., Ahmed, N.: An Extension of Business Process Model and Notation for Security Risk Management. International Journal of Information System Modeling and Design (IJISMD) 4(4), 93-113 (2013)
  4. Cherdantseva, Y., Hilton, J., Rana, O.: Towards SecureBPMN - Aligning BPMN with the Information Assurance and Security Domain. In: Business Process Model and Notation. LNBIP, vol. 125, pp. 107-115 (2012)
  5. Marcinkowski, B., Kuciapski, M.: A Business Process Modeling Notation Extension for Risk Handling. In: Cortesi, A., Chaki, N., Saeed, K., Wierzchoń, S. (eds.) CISIM 2012. LNCS, vol. 7564, pp. 374-381. Springer, Heidelberg (2012)
  6. Saleem, M., Jaafar, J., Hassan, M.: A Domain-Specific Language for Modelling Security Objectives in Business Process Models of SOA Applications. AISS 4(1), 353-362 (2012)
  7. Rodriguez, A., Fernandez-Medina, E., Piattini, M.: A BPMN Extension for the Modeling of Security Requirements in Business Processes. IEICE Transactions on Information and Systems 90(4), 745-752 (2007)

Comparison of CORAS and ArchiMate risk and security extension

Raimundas Matulevicius (raimundas.matulevicius ät ut.ee)

In this project, you will compare CORAS and the ArchiMate risk and security extension as visual notations for modeling security risks. This will include modelling a case study using CORAS and then using ArchiMate, and comparing the two models. The comparison will be based on cognitive effectiveness or any other relevant criteria. The research consists of the following steps:

  1. Introduce CORAS (book + a few papers + tool)
  2. Introduce the ArchiMate risk and security extension
  3. Define comparison criteria
  4. Use at least 3 criteria to assess the notations from steps 1 and 2 and compare the assessment results

Lab Package Development & Evaluation for the Course 'Software Testing' (MTAT.03.159)

Supervisor: Dietmar Pfahl (dietmar dot pfahl at ut dot ee)

The course Software Testing (MTAT.03.159) currently has 6 labs (practice sessions) in which 2nd and 3rd year BSc students learn specific test techniques. We would like to improve the existing labs and add new ones.

This topic is intended for students who have already taken this software testing course and who feel that they can contribute to improving it and by the same token complete their Bachelors project. The scope of the project can be negotiated with the supervisor to fit the size of a Bachelors project.

The tasks to do for this project are as follows:

  • Selection of a test-related topic for which a lab package should be developed (see list below)
  • Development of the learning scenario (i.e., what shall students learn, what will they do in the lab, what results shall they produce, etc.)
  • Development of the materials for the students to use
  • Development of example solutions (for the lab supervisors)
  • Development of a grading scheme
  • Evaluation of the lab package

Topics for which lab packages should be developed (in order of urgency / list can be extended based on student suggestions):

  • Combinatorial Testing
  • Automated Unit & Systems Testing
  • Issue Reporting
  • Debugging

Literature Survey on "Requirements Elicitation Techniques – Strengths and Weaknesses"

Supervisor: Dietmar Pfahl (dietmar dot pfahl at ut dot ee)

Elicitation is the process by means of which a software analyst gathers information about the problem domain. The analyst uses a series of analyst-user interaction mechanisms, called elicitation techniques, to acquire information. A very wide range of elicitation techniques have been proposed: interviews (structured, semi-structured, open), protocol analysis, laddering, work groups (or: focus groups), prototyping, etc. A number of studies suggest that elicitation techniques are not inter-changeable, and there are far-reaching differences with regard to what type of knowledge each technique can uncover. Other aspects, like quantity of information or elicitation efficiency, are features that might distinguish one elicitation technique from another.

Project task: Pick (at least) two software requirements elicitation techniques, find literature about them, and compare them with regard to

  • the type of requirements-related information each technique is best/worst at finding,
  • the effectiveness of each technique with regard to requirements elicitation,
  • the efficiency of each technique with regard to requirements elicitation

Starting point for finding literature: Dieste, O., & Juristo, N. (2011). Systematic review and aggregation of empirical studies on elicitation techniques. IEEE TSE 37(2), 283-304.

Literature Survey on "Open Innovation – How to use it for software requirements elicitation?"

Supervisor: Dietmar Pfahl (dietmar dot pfahl at ut dot ee)

Open innovation (OI) is a new paradigm that aims at opening up organizational boundaries in order to use and recombine internal and external knowledge to develop and commercialize innovative products. The idea of OI can become an interesting new approach to requirements elicitation for software products. In particular, social media, blogs, and other freely accessible resources could be systematically analyzed for relevant ideas that would help improve the value of future products.

Project task: Find literature on reported attempts to exploit social media, blogs, and other open sources for detecting new functionality, or complementing existing functionality, of existing and new software products. Summarize and discuss the literature you find. In your analysis, you may focus on the types of information sources exploited, the ways in which they were analyzed, the kind of information extracted (new requirements, discussion/evaluation of existing functionality, etc.), the types of products for which new requirements were sought, and so on.

Starting point for literature search:

  • Anton Barua, Stephen W. Thomas, Ahmed E. Hassan (2014) What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering, June 2014, Volume 19, Issue 3, pp 619-654.

Customer Journey Mapping

Supervisor: Marlon Dumas (marlon dot dumas ät ut dot ee)

A Customer Journey Map is a graphical representation of how a given customer interacts with an organization in order to consume its products or services, possibly across multiple channels. Several approaches for customer journey mapping exist nowadays, each relying on different concepts and notations. In this thesis, you will review the most popular approaches currently in use for customer journey mapping and distill from them a common set of concepts and notations. You will then show how these concepts and notations can be applied in an organization of your choice (preferably an organization with which you have experience interacting as a customer).

Blockchain and Business Processes

Supervisor: Fredrik P. Milani (milani ät ut dot ee)

Interest in blockchain is growing very strongly. As this new technology gains traction, many uses are being proposed, ranging from voting to financial market applications. Currently, the hype around the technology overshadows the value it can deliver by enabling changes in business processes. While blockchain can deliver value by replacing existing IT solutions, the real value comes from innovating the business processes themselves. This topic is about exploring, for one of the industries/cases below, what the current business processes are and how blockchain could enable their innovation, and finally comparing/contrasting the two in order to draw conclusions.

Each of the cases listed below can form a separate Bachelors thesis, in which the current processes are examined and conceptual blockchain-based processes are designed and analysed.

  • Voting – voting solutions based on blockchain technology
  • Insurance – insurance firms offering a range of products and how that would be transformed if supported by blockchain technology
  • Health Care – transferring and owning your own medical health records and prescription management
  • Registry – management of assets (digital and physical) including registration, tracking, change of ownership, licensing and so on
  • Financial Markets – covering one or several cases such as post trading settlement of securities and bilateral agreements
  • IoT – connecting multiple devices with blockchain

Estonian E-Governance Academy (two Bachelors project proposals)

Supervisor: Hannes Astok

Training app for government officials and integrated information screen for visitors