Performance Issues Using ORDER to Reduce the Number of Out Files – Apache Pig 0.16 Amazon EMR

Often you have a simple ETL process (a Pig job in our example) that just applies a filter to the source data, performs some calculations and saves the result to a target table.

So it is a Map-only job (no Reduce phase is required) that generates N output files, where N is the number of map tasks.

For example:

set mapreduce.output.fileoutputformat.compress true

-- Read data from a Hive table
data_all = LOAD 'dmtolpeko.mcp' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Filter selects less than 2% of rows from the source table
data_filtered = FILTER data_all BY event_name == 'CLIENT_LOGIN'; 

-- Define out columns and save the results to S3
data_out = FOREACH data_filtered GENERATE app_id, payload, event_timestamp;
STORE data_out INTO 's3://epic/hive/dmtolpeko.db/mcp_client_login/';   

From the log you can see that the job used 231 mappers:

INFO mapred.FileInputFormat: Total input paths to process : 250
INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths (combined) to process : 231
INFO util.MapRedUtil: Total input paths (combined) to process : 231
INFO mapreduce.JobSubmitter: number of splits:231

And we can see that the output contains 231 data files (_SUCCESS is an empty file):

aws s3 ls s3://epic/hive/dmtolpeko.db/mcp_client_login/ --summarize

2018-03-01 22:22:33          0 _SUCCESS
2018-03-01 22:22:30      11165 part-m-00000.gz
2018-03-01 22:22:29      11107 part-m-00001.gz
2018-03-01 22:22:30      11346 part-m-00002.gz
2018-03-01 22:22:27       5697 part-m-00228.gz
2018-03-01 22:22:26       5686 part-m-00229.gz
2018-03-01 22:22:28       5480 part-m-00230.gz

Total Objects: 232
   Total Size: 1396533
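
To put a number on the file sizes, here is a quick back-of-the-envelope calculation. Python is used here only as a calculator; the two figures are copied from the summary above, and the variable names are made up for this sketch.

```python
# Figures copied from the `aws s3 ls --summarize` output above.
total_objects = 232          # includes the empty _SUCCESS marker file
total_size_bytes = 1396533

# One part-m-* file per map task; _SUCCESS holds no data.
data_files = total_objects - 1
avg_bytes = total_size_bytes / data_files

print(f"{data_files} files, {avg_bytes:.0f} bytes on average")
```

That works out to roughly 6 KB of compressed data per file.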

This job produced a large number of very small files. What can we do to reduce the number of output files?

MOBA Games Analytics – Platform Balance Details

In the previous article I started analyzing the platform balance for a MOBA game that supports multiple platforms and allows players on different platforms (PC and PlayStation 4, for example) to play together in the same match.

Looking at the summary statistics on platform distribution, we could not find any obvious imbalance issues. Now I am going to continue the research, but from now on only for PvP mode.

First it makes sense to start looking at the platform distribution within teams over a time period (each column below is a day of observation):


This sample report shows data for 5 time intervals, and you can see that the most common team combination is 0-5 (0 PC and 5 PlayStation 4 players), followed by 1-4 (1 PC and 4 PlayStation 4 players) and so on.
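
As a rough illustration of how such team labels could be derived, here is a minimal Python sketch. The roster data, the platform strings `"PC"` and `"PS4"`, and the `team_label` helper are all hypothetical; the actual report may be built in a different way.

```python
from collections import Counter

def team_label(platforms):
    """Return the 'PC-PS4' player counts for one team, e.g. '1-4'."""
    counts = Counter(platforms)
    return f"{counts['PC']}-{counts['PS4']}"

# Hypothetical rosters, for illustration only (not real game data).
teams = [
    ["PS4"] * 5,
    ["PC"] + ["PS4"] * 4,
    ["PS4"] * 5,
]

# Count how often each team combination occurs.
distribution = Counter(team_label(team) for team in teams)
print(distribution.most_common())
```

Running the same count per day would give one column of the report for each day of observation.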

The next question to ask is what the win rate is for each platform combination in teams:


OK, the win rate grows with the number of PC players in a match, but as you can see above, the number of games in which PC players outnumber PS4 players in a team is small, so the matchmaking already tries to balance the PC advantage (which is itself interesting to investigate in detail).

When you know how the platforms are distributed in teams, it is interesting to see which team/platform combinations meet in matches:


In this example data set, 0-5/0-5 means a match where 5 players on PlayStation 4 play against another team of 5 players on PlayStation 4.


Fortunately (for PS4 players, obviously), the number of 5-0/0-5 matches is low. Let’s double-check and look at the win rate for team/platform combinations:


Here we can see confirmation (the view shows the win rate of the first team in each PC-PS4/PC-PS4 combination) that the PC platform has some advantage. 0-5/5-0 means that 5 PS4 players have a win rate of about 40% when they play against 5 PC players. For 1-4/0-5 and 1-4/1-4 the win rate is about 50%, as expected.

Why does one platform have an advantage over another? It is a very interesting question, and the answer may have nothing to do with the platform and its mechanics. I will try to go deeper in further posts.
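
The first-team win rate per matchup can be computed along these lines. This is only a sketch: the `matches` records and the `first_team_win_rate` helper are hypothetical, and the sample values are invented to show the calculation, not taken from the game data.

```python
from collections import defaultdict

# Hypothetical match records, for illustration only:
# (first team label, second team label, did the first team win?)
matches = [
    ("0-5", "5-0", False),
    ("0-5", "5-0", True),
    ("0-5", "5-0", False),
    ("1-4", "1-4", True),
    ("1-4", "1-4", False),
]

def first_team_win_rate(records):
    """Return {'first/second': share of games won by the first team}."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for first, second, first_won in records:
        key = f"{first}/{second}"
        games[key] += 1
        wins[key] += int(first_won)
    return {key: wins[key] / games[key] for key in games}

rates = first_team_win_rate(matches)
for matchup, rate in sorted(rates.items()):
    print(f"{matchup}: {rate:.0%}")
```

Note that the matchup key is ordered, so 0-5/5-0 and 5-0/0-5 are tracked separately and their first-team win rates should add up to 100% when every match has a winner.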