Performance Issues Using ORDER to Reduce the Number of Out Files – Apache Pig 0.16 Amazon EMR

Often you have a simple ETL process (a Pig job in our example) that just applies a filter to the source data, performs some calculations and saves the result to a target table.

So it is a Map-only job (no Reduce phase is required) that generates N output files, where N is the number of map tasks.

For example:

set mapreduce.output.fileoutputformat.compress true

-- Read data from a Hive table
data_all = LOAD 'dmtolpeko.mcp' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Filter selects less than 2% of rows from the source table
data_filtered = FILTER data_all BY event_name == 'CLIENT_LOGIN'; 

-- Define out columns and save the results to S3
data_out = FOREACH data_filtered GENERATE app_id, payload, event_timestamp;
STORE data_out INTO 's3://epic/hive/dmtolpeko.db/mcp_client_login/';   

From the log you can see that the job used 231 mappers:

INFO mapred.FileInputFormat: Total input paths to process : 250
INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths (combined) to process : 231
INFO util.MapRedUtil: Total input paths (combined) to process : 231
INFO mapreduce.JobSubmitter: number of splits:231

And we can see that the output contains 231 data files (_SUCCESS is an empty file):

aws s3 ls s3://epic/hive/dmtolpeko.db/mcp_client_login/ --summarize

2018-03-01 22:22:33          0 _SUCCESS
2018-03-01 22:22:30      11165 part-m-00000.gz
2018-03-01 22:22:29      11107 part-m-00001.gz
2018-03-01 22:22:30      11346 part-m-00002.gz
2018-03-01 22:22:27       5697 part-m-00228.gz
2018-03-01 22:22:26       5686 part-m-00229.gz
2018-03-01 22:22:28       5480 part-m-00230.gz

Total Objects: 232
   Total Size: 1396533
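
As a rough check of how small these files are, the summary figures above can be turned into an average size per data file. This is only a back-of-the-envelope sketch: the two numbers are copied from the listing, and the variable names are made up for illustration.

```python
# Figures copied from the `aws s3 ls --summarize` output above.
total_objects = 232          # includes the empty _SUCCESS marker
total_size_bytes = 1396533

data_files = total_objects - 1              # exclude _SUCCESS
avg_bytes = total_size_bytes / data_files   # average compressed size per file

print(f"{data_files} data files, about {avg_bytes / 1024:.1f} KiB each on average")
# prints: 231 data files, about 5.9 KiB each on average
```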

This job produced a large number of very small files. What can we do to reduce the number of output files?

MOBA Games Analytics – Platform Balance Details

In the previous article I started analyzing the platform balance for a MOBA game that supports multiple platforms and allows players on different platforms (e.g. PC and PlayStation 4) to play together in the same match.

Looking at the summary statistics on platform distribution, we could not find any obvious imbalance. Now I am going to continue the research, but only for PvP mode from now on.

First it makes sense to start looking at the platform distribution within teams over a time period (each column below is a day of observation):


This sample report shows data for 5 time intervals, and you can see that the most common team combination is 0-5 (0 PC and 5 PlayStation 4 players), followed by 1-4 (1 PC and 4 PlayStation 4 players) and so on.

The next question to ask is: what is the win rate for each platform combination in teams?
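
A distribution like this can be built by labeling each team with its PC-PS4 player counts and counting the labels. The sketch below is only an illustration: the `team_label` helper and the sample teams are hypothetical and are not taken from the game's data.

```python
from collections import Counter

def team_label(platforms):
    """Return a 'PC-PS4' label such as '1-4' for one team's platform list."""
    pc = sum(1 for p in platforms if p == "PC")
    ps4 = sum(1 for p in platforms if p == "PS4")
    return f"{pc}-{ps4}"

# Hypothetical sample: each entry is one team of 5 players.
teams = [
    ["PS4"] * 5,
    ["PS4"] * 5,
    ["PC"] + ["PS4"] * 4,
    ["PC", "PC"] + ["PS4"] * 3,
]

# Count how often each team combination occurs.
distribution = Counter(team_label(t) for t in teams)
for label, count in distribution.most_common():
    print(label, count)
```

Grouping the same counts by day would give one column per day, as in the report.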


OK, the win rate grows with the number of PC players in a match, but as you can see above, the number of games in which PC players outnumber PS4 players in a team is small, so the matchmaking already tries to balance the PC advantage (which is itself interesting to investigate in detail).

Once you know how the platforms are distributed in teams, it is interesting to see which team/platform combinations meet in matches:


In this example data set, 0-5/0-5 means a match where 5 players on PlayStation 4 play against another team of 5 players on PlayStation 4.


Fortunately (for PS4 players, obviously), the number of 5-0/0-5 matches is low. Let’s double-check by looking at the win rate for team/platform combinations:


Here we can see confirmation that the PC platform has some advantage (the view shows the win rate of the first team in each PC-PS4/PC-PS4 combination). 0-5/5-0 means that 5 PS4 players have a win rate of about 40% when they play against 5 PC players. For 1-4/0-5 and 1-4/1-4 the win rate is about 50%, as expected.

Why does one platform have an advantage over another? It is a very interesting question, and the answer may have nothing to do with the platform itself or its mechanics. I will try to go deeper in further posts.
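
The first-team win rate per matchup can be computed along these lines. This is a minimal sketch under stated assumptions: the record layout, the `first_team_win_rate` helper, and the sample matches are all hypothetical, chosen only so that the 0-5/5-0 matchup comes out at 40%.

```python
from collections import defaultdict

# Hypothetical records: (first team label, second team label, first team won?)
matches = [
    ("0-5", "5-0", False),
    ("0-5", "5-0", True),
    ("0-5", "5-0", False),
    ("0-5", "5-0", False),
    ("0-5", "5-0", True),
    ("1-4", "1-4", True),
    ("1-4", "1-4", False),
]

def first_team_win_rate(records):
    """Map 'team1/team2' matchup labels to the first team's win rate."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for first, second, first_won in records:
        key = f"{first}/{second}"
        games[key] += 1
        wins[key] += int(first_won)
    return {key: wins[key] / games[key] for key in games}

rates = first_team_win_rate(matches)
for matchup, rate in rates.items():
    print(f"{matchup}: {rate:.0%}")
```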

MOBA Games Analytics – Platform Balance Summary

Some MOBA games are cross-platform titles, so you can choose whether to play on PC, Sony PlayStation 4, Xbox One etc. Moreover, in some games players on different platforms can appear in the same match.

In a MOBA, how does this affect the game balance and matchmaking quality? First let’s look at the player distribution across the supported platforms over a time period:


In my sample data, you can see that the number of players on one platform is 3 times greater than on the other. Is it a problem? Let’s look at the win rate for each platform:


The win rates on the two platforms differ slightly, but together they exceed 120%. How is this possible?

We can assume that the win rate is higher for players in Solo (single player vs AI) and Coop (multiple real players vs AI) modes compared with the PvP (players vs players) mode. Let’s verify this by looking at the win rate distribution per match mode:


OK, now we see that the win rate in Coop matches is very high (bots are weak?) and that it is around 50% in PvP games, as we would expect. The Solo win rate is low, which can be explained in two ways: either the players who choose this mode are newcomers, or they are experienced players who want to experiment without disturbing other players.

Since we are mostly interested in PvP matches for game balance and matchmaking quality, we have not found any significant imbalance so far. But we cannot yet be sure that the balance is good.

What we can additionally do is look at how the win rate is distributed among players with different MMR (matchmaking rating):
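
The arithmetic behind this can be sketched with made-up numbers. Every PvP match has one winning and one losing team, so PvP alone averages out to about 50%, while matches against AI have no human loser and pull a platform's overall win rate upward. The counts below are hypothetical and only illustrate the effect; they are not the game's data.

```python
# Hypothetical per-mode results for one platform: (player-matches, wins)
by_mode = {
    "PvP":  (1000, 500),   # one winning and one losing team per match
    "Coop": (600, 540),    # real players vs AI, no human loser
    "Solo": (100, 40),     # single player vs AI
}

played = sum(n for n, _ in by_mode.values())
won = sum(w for _, w in by_mode.values())

overall = won / played
pvp_only = by_mode["PvP"][1] / by_mode["PvP"][0]

print(f"overall win rate: {overall:.1%}, PvP only: {pvp_only:.1%}")
# prints: overall win rate: 63.5%, PvP only: 50.0%
```

If both platforms had a mix like this one, their overall win rates would add up to well over 100% even though PvP itself is balanced.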


We can see that top players have a higher win rate. You probably cannot fix that: the number of top players is always small, so for various reasons you cannot always match them only against each other.

A deeper analysis is needed, and I will do it in the next posts.