Take Home Exercise 3
Mini Challenge 3 of VAST Challenge 2023

```r
pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork,
               tidytext, tidyverse, RColorBrewer, wordcloud2, DT,
               treemap, knitr, ggiraph, hrbrthemes, scales, igraph)
options(scipen = 999)
```
1. OVERVIEW
FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation's database on fishing-related companies. FishEye has transformed the database into a knowledge graph which includes companies, owners, workers, and financial status.
1.1 The Task
In this take-home exercise, we identify anomalies in the business groups using the knowledge graph that FishEye created from the database. In addition, we apply a visual analytics process to find similar businesses by focusing on each business's most important features.
2. Datasets
The trade data is stored in the MC3.json file. There are a total of 27,622 nodes and 24,036 edges in the knowledge graph.
2.1 Metadata
| Location | Variable Name | Description |
|---|---|---|
| Node, Edge | id | Identifier of the node |
| Node | country | Country associated with the entity |
| Node, Edge | type | Type of node; can be a person or a company |
| Node | revenue_omu | Operating revenue of the id node in Oceanus Monetary Units |
| Node | product_services | Description of the products and services that the id node provides |
| Edge | source | ID of the source node |
| Edge | target | ID of the target node |
3. Data Preparation
3.1 Install R-packages
Using p_load() of the pacman package to install and load the following libraries:

- jsonlite: To import data from a JSON file into R
- tidygraph: For graph manipulation
- ggraph: For visualizing graphs and networks
- visNetwork: For network visualization
- tidytext: To tidy, transform and analyze text data
- tidyverse: A collection of R packages used in everyday data analysis, supporting data science, data wrangling and analysis
- RColorBrewer: Ready-to-use color palettes for creating visually appealing plots
- wordcloud2: For creating word clouds, a visual representation of word frequency
- DT: For creating interactive data tables, with functions to display, filter and sort data for interactive exploration and analysis
- treemap: For visualizing hierarchical data using nested rectangles
- knitr: For dynamic report generation
- ggiraph: For creating interactive ggplot2 graphics
- hrbrthemes: Additional themes and utilities for ggplot2
- scales: For customizing the appearance of axis and legend labels
- igraph: For exploring and analyzing networks
options(scipen = 999) removes scientific notation in our exercise.
3.2 Importing Data
The JSON file is imported into R with the fromJSON() function from the jsonlite package. The code chunk below imports the knowledge graph that FishEye transformed from the database provided by an international finance corporation on fishing-related companies.
```r
MC3_challenge <- fromJSON("data/MC3.json")
```

3.2.1 Extracting Edges
As the imported data file is a large list, we will extract the edges from MC3_challenge and save it as a tibble data frame called MC3_edges. The code is extracted in the following manner:
- distinct() is used to remove duplicated records
- mutate() and as.character() are used to convert field data types from list to character
- group_by() and summarise() are used to count the number of unique links
- filter(source != target) is used to ensure that source and target companies are not identical
```r
MC3_edges <- as_tibble(MC3_challenge$links) %>%
  distinct() %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  group_by(source, target, type) %>%
  summarise(weights = n()) %>%
  filter(source != target) %>%
  ungroup()
```

3.2.2 Extracting Nodes
Similarly, we will extract the nodes from MC3_challenge and save it as a tibble data frame called MC3_nodes. The code is extracted in the following manner:
- mutate() and as.character() are used to convert data types from list to character
- as.numeric(as.character()) is used to convert revenue_omu from list to character before converting it to a numeric data type
- select() is used to reorganize the column sequence
```r
MC3_nodes <- as_tibble(MC3_challenge$nodes) %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services)
```

4. Creating a Master ID data frame
We would like to create a master ID data frame from the knowledge graph, since there might be source or target nodes that are not reflected in MC3_nodes. The code below selects the source and target columns individually and concatenates them through rbind(). Thereafter, left_join() is used to append the attributes from the MC3_nodes data frame.
Additionally, we rename id to label and create a new id column through mutate() and nrow(). Similarly, we rename source and target to sourcelabel and targetlabel respectively, and use left_join() to append the id before renaming it.
Show the code
```r
# default masterlist
id1 <- MC3_edges %>%
  select(source) %>%
  rename(id = source)
id2 <- MC3_edges %>%
  select(target) %>%
  rename(id = target)
MC3_nodes_master <- rbind(id1, id2) %>%
  distinct() %>%
  left_join(MC3_nodes,
            unmatched = "drop")

# create new node df to include id number
MC3_nodes_Masterlist <- MC3_nodes_master %>%
  select(id) %>%
  distinct() %>%
  rename(label = id) %>%
  ungroup()

# add ID to nodes dataframe
MC3_masternodes <- MC3_nodes_Masterlist %>%
  mutate(id = as.character(1:nrow(MC3_nodes_Masterlist))) %>%
  relocate(id, label) %>%
  ungroup()

# append corresponding id through left_join
MC3_edges_addID <- MC3_edges %>%
  rename(sourcelabel = source, targetlabel = target) %>%
  left_join(MC3_masternodes, by = c("sourcelabel" = "label")) %>%
  rename(source = id) %>%
  left_join(MC3_masternodes, by = c("targetlabel" = "label")) %>%
  rename(target = id) %>%
  relocate(source, target)

# output for dataframe using knitr::kable
kable(head(MC3_masternodes), "simple")
```

| id | label |
|---|---|
| 1 | 1 AS Marine sanctuary |
| 2 | 1 Ltd. Liability Co Cargo |
| 3 | 1 S.A. de C.V. |
| 4 | 1 and Sagl Forwading |
| 5 | 2 Limited Liability Company |
| 6 | 2 S.A. de C.V. |
5. Exploratory Data Visualization
In this section, we design interactive plots for users to explore the data. The plots are created with the ggiraph and treemap packages. We examine the distribution by type for nodes and edges, the distribution of revenue, and the distribution by country broken down by type.
A bar chart is chosen to compare segments of information across categorical variables. Aggregation is performed before creating the interactive graph. A tooltip is used to highlight the count of the data, while scale_fill_manual() is used to standardize the colors across graphs.
Note: The Treemap shows the Top 10 countries.
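The interactive bar charts in this section come from folded code chunks; the sketch below shows the tooltip + scale_fill_manual() approach on a toy data frame with hypothetical counts (the names and values are not the actual MC3 aggregates):

```r
library(ggplot2)
library(ggiraph)

# Toy counts standing in for the aggregated node/edge types (hypothetical values)
type_counts <- data.frame(
  type  = c("Company", "Beneficial Owner", "Company Contacts"),
  count = c(100, 60, 40)
)

# Interactive bar chart: hover shows the count, colors are fixed per type
p <- ggplot(type_counts,
            aes(x = reorder(type, -count), y = count, fill = type)) +
  geom_col_interactive(aes(tooltip = count, data_id = type)) +
  scale_fill_manual(values = c("#1b9e77", "#d95f02", "#7570b3")) +
  labs(x = "Type", y = "Count")

girafe(ggobj = p)
```

girafe() wraps the ggplot object into an htmlwidget so the tooltip works in the rendered page.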
5.1 Type & Revenue
Observations:
Distribution by Type: There are three categories in Nodes and two categories in Edges. We observe that Edges contain only Beneficial Owner and Company Contacts, whereas Company information is only available in Nodes. Therefore, we can infer that the type in Edges refers to the connection/ownership that people are associated with.
Distribution by Revenue: The distribution is right-skewed, with the highest proportion in the $10k to $100k bin, followed by revenue below $10k. The counts decrease as the revenue bins increase. There are eight IDs that earn more than $100M in revenue. However, more research is needed to determine whether the revenue was accumulated over time or within a specific period, and how it relates to each company's strength, size, or selling proposition.
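The revenue binning behind the distribution above can be sketched with cut(); the revenue values here are made up for illustration and the break points are an assumption based on the bins mentioned in the text:

```r
# Toy revenue values (hypothetical, not the actual MC3 figures)
revenue <- c(5e3, 5e4, 2e5, 8e4, 1.5e8, 3e4)

# Bin revenue into the ranges discussed above
bins <- cut(revenue,
            breaks = c(0, 1e4, 1e5, 1e6, 1e7, 1e8, Inf),
            labels = c("<$10k", "$10k-$100k", "$100k-$1M",
                       "$1M-$10M", "$10M-$100M", ">$100M"))
table(bins)
```

The resulting table of bin counts is what a histogram or bar chart of revenue would display.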
5.2 Country
As we have previously installed the package through devtools, install_github() does not need to be run again; we only load the d3treeR library. Note: the GitHub package only needs to be installed once.
```r
# install_github("timelyportfolio/d3treeR")
library(d3treeR)
```

The d3tree() function from the d3treeR package is used to build our interactive treemap.
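The e5 object passed to d3tree() comes from a folded code chunk; a minimal sketch of how such a treemap object can be built, using hypothetical country/type counts rather than the actual aggregates:

```r
library(treemap)

# Toy aggregate standing in for the country/type counts (hypothetical values)
by_country <- data.frame(
  country = c("ZH", "ZH", "Oceanus", "Marebak"),
  type    = c("Beneficial Owner", "Company", "Company", "Company"),
  n       = c(40, 30, 25, 10)
)

# treemap() draws the chart and invisibly returns an object
# that d3tree() can convert into an interactive widget
e5 <- treemap(by_country,
              index = c("country", "type"),
              vSize = "n",
              palette = "Blues")
```

Passing the returned object to d3tree() then produces the drill-down treemap shown below.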
```r
d3tree(e5, rootname = "Categories by Country")
```

Observation:
- Distribution by Country: Among the countries, ZH has the largest size and contains the darkest gradient, which belongs to Beneficial Owner. Meanwhile, the rest of the countries only have information about the Company.
6. Network Visualization and Analysis
Based on the distribution by type, we would like to explore the network distribution to identify anomalies. We create two separate networks to focus on Beneficial Owner and Company Contacts individually. We compute the centrality of the nodes using the graph_from_data_frame() function from the igraph package.
We use Betweenness Centrality in our network to identify nodes that may have considerable influence, instead of other centrality measures. visNetwork is used to plot the interactive network graph with the Fruchterman-Reingold layout.
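The network construction described above can be sketched as follows; the edge list is a toy stand-in for the MC3 edges, not the actual data:

```r
library(igraph)
library(visNetwork)

# Toy edge list standing in for the MC3 edges (hypothetical ids)
edges <- data.frame(from = c("1", "1", "2", "3"),
                    to   = c("2", "3", "4", "4"))

# Build the graph and compute betweenness centrality
g <- graph_from_data_frame(edges, directed = FALSE)
nodes <- data.frame(id    = V(g)$name,
                    value = betweenness(g))  # "value" scales node size

# Interactive network with the Fruchterman-Reingold layout
visNetwork(nodes, edges) %>%
  visIgraphLayout(layout = "layout_with_fr")
```

Mapping betweenness onto the node size makes high-influence intermediaries stand out visually.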
Observations:
The majority of Beneficial Owners have exclusivity or a partnership of 1-2 owners per company. However, we identified companies such as Dutch Oyster Sagl Cruise ship, Ola de la Costa N.V., Niger Bend AS Express, and Bahia de Plata Submarine with relatively high ownership.
Similarly, most Company Contacts maintain exclusivity or a partnership with the company they liaise with. As observed, there is one company that maintains a high number of contacts. In addition, there are personnel who form multiple connections with companies. For instance, Jennifer Johnson has 5 connections, David Smith has 4 connections, and John Williams has 3 connections.
6.1 Identify Connections
From the network visualization, we noticed that there are people with high ownership/connections. Hence, a bar chart is plotted to better visualize their connections. Since we are identifying anomalies, we aggregated the data and filtered away exclusive relationships with a count equal to one.
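The aggregation and filtering described above can be sketched as follows, with a toy edge list standing in for MC3_edges (the actual aggregate names such as MC3_edges_bc_agg come from folded chunks):

```r
library(dplyr)

# Toy edges standing in for MC3_edges (hypothetical values)
edges <- data.frame(
  source = c("A", "A", "B", "C", "C", "C"),
  to     = c("p1", "p2", "p1", "p1", "p3", "p4"),
  type   = "Beneficial Owner"
)

# Count connections per person, then drop exclusive (count == 1) links
edges_agg <- edges %>%
  group_by(to, type) %>%
  summarise(count = n(), .groups = "drop") %>%
  filter(count > 1)
```

Only p1, who is linked to three companies, survives the filter; the real analysis plots these counts as a bar chart.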
Observation:
- As the number of ownerships/connections increases, the frequency decreases. Hence, to detect anomalies, we look into Beneficial Owners with 4 or more companies, and Company Contacts with 3 or more.
6.2 Identify Personnel
To identify the personnel with high ownership/connections, we filter() them based on the criteria above and concatenate the results through rbind(). Thereafter, we use distinct() to ensure that there are no duplicate IDs, and left_join() to append fields from the MC3_edges_addID data frame.
With the finalized table, datatable() from the DT package is used to display MC3_anomalies as an interactive table with search and sorting features.
Show the code
```r
# filter for contacts with high connections
MC3_edges_cc <- MC3_edges_cc_agg %>%
  filter(count >= 3) %>%
  ungroup()

# filter for owners with high ownership
MC3_edges_bc <- MC3_edges_bc_agg %>%
  filter(count >= 4) %>%
  ungroup()

# combine the ids in both lists and merge with edges
MC3_anomalies <- rbind(MC3_edges_cc, MC3_edges_bc) %>%
  distinct() %>%
  left_join(MC3_edges_addID, by = c("to" = "target"), unmatched = "drop") %>%
  rename(target = to) %>%
  select(sourcelabel, targetlabel, type) %>%
  ungroup()

# output for dataframe using DT::datatable
DT::datatable(MC3_anomalies)
```

7. Product and Services
Since we are unable to gain insights based on revenue alone, we look into the business's most important feature - the products and services offered. Given a unique selling point, a product/service determines the prospects of the business.
To begin, we perform text mining using the tidytext package. We use tokenisation to break the given text into units called tokens. A unit could be an individual word, a phrase, or an entire sentence. Thereafter, unnest_tokens() is used: the text extracted from product_services goes into the output column word.
Show the code
```r
token_nodes <- MC3_nodes %>%
  unnest_tokens(word, product_services)

# output for dataframe using knitr::kable
kable(head(token_nodes), "simple")
```

| id | country | type | revenue_omu | word |
|---|---|---|---|---|
| Jones LLC | ZH | Company | 310612303 | automobiles |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | passenger |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | cars |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | trucks |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | vans |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | and |
From the table above, we notice that there are words that might not be useful for our analysis, e.g. and in row 6. Therefore, we would like to exclude such words. In the code chunk below, the stop_words dataset from the tidytext package aids the filtering process, while anti_join() from the dplyr package removes all stop words from the analysis.
Show the code
```r
stopwords_removed <- token_nodes %>%
  anti_join(stop_words)

# output for dataframe using knitr::kable
kable(head(stopwords_removed), "simple")
```

| id | country | type | revenue_omu | word |
|---|---|---|---|---|
| Jones LLC | ZH | Company | 310612303 | automobiles |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | passenger |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | cars |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | trucks |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | vans |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | buses |
As observed, the word and has been removed from the analysis; row 6 now shows buses instead. Even so, it seems the extracted words could be further grouped into broader categories.
7.1. Word Frequency
To get a better graphical representation of word frequency, a word cloud is used to give greater prominence to words that appear more frequently in the source text. Generic words such as character, 0, unknown, including and related that do not bring insights are removed from the frequency count. Thereafter, we create the data frame stopwords_removed_freq through the following steps:
- filter(!) to remove generic words
- group_by() word and summarize() by counting the frequency
- arrange(desc()) to sort in descending order
The wordcloud2 package is used for easy and quick insights into the text. For reproducibility, set.seed() is incorporated.
Show the code
```r
# remove generic words
remove_characters <- c("character", "0", "unknown", "products", "services",
                       "including", "source", "offers", "range", "related")

# create dataframe of each word with frequency
stopwords_removed_freq <- stopwords_removed %>%
  filter(!word %in% remove_characters) %>%
  group_by(word) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  ungroup()

# generate word cloud
set.seed(1234) # for reproducibility
wordcloud2(data = stopwords_removed_freq, size = 1.6, color = 'random-dark')
```

Observations:
Fish remains the top keyword in the word cloud. However, there are multiple words that are not grouped within the fish category (e.g. Shellfish, Tuna, Crab, Squid).
Given that the knowledge graph should focus on fishing-related companies, the word cloud surfaced words that are not affiliated with fishing: words such as bags, shoes, metal and power have high frequency counts. Therefore, they will be further grouped into broader categories.
7.2 Grouping Words into Categories
With the aid of the word cloud, we are able to group words into various categories. We begin by creating a custom stop-word list to remove generic words that we deem unsuitable; words like additional and range may not be part of stop_words. While creating the new data frame with a category column, we filter away numbers and special characters that appear in the word column.
grepl() is used to match words that follow a clear pattern (e.g. fish), while mutate() with case_when() matches words against curated lists. We identified 17 categories, ranging from Fishing, Other Food and Transportation to Apparel & Fashion. Unidentified words are filtered out of the analysis.
Show the code
```r
# custom stopwords list to remove generic words
toremove_stopwords <- c("character", "0", "unknown", "yellow", "pink", "red",
                        "â", "added", "adding", "additional", "additions",
                        "addition", "aid", "aids", "air", "related", "including",
                        "range", "products", "service", "services",
                        "source", "offers")

# new df with category column
stopwords_removed_addcategory <- stopwords_removed %>%
  filter(!word %in% toremove_stopwords) %>%   # filter out generic words
  filter(!grepl("\\d", word)) %>%             # remove numbers
  filter(!grepl("[^[:alnum:]\\s]", word)) %>% # remove words with symbols
  mutate(category = case_when(
    word %in% c("automobiles", "automobile", "car", "cars", "bus", "buses",
                "trucks", "truck", "trucking", "van", "vans",
                "aircraft", "airjet", "airfrieght", "airline", "airlines",
                "airport", "vehicle", "vehicles", "powertrain",
                "transportation", "trains", "automotive",
                "freight", "transport") ~ "Transportation",
    grepl("fish", word) ~ "Fishing",
    word %in% c("pangasius", "angunagas", "seafood", "seafoods",
                "shrimp", "shrimps", "tuna", "tunas", "pollock", "fin", "roe",
                "fillet", "haddock", "cod", "hake", "saithe", "scallops",
                "arrowtooth", "flounder", "chum", "salmon", "mahi", "tilapia",
                "crab", "oyster", "scallop", "clam", "cephalopods", "eel",
                "squid", "loligo", "illex", "cuttle", "abalone", "prawns",
                "lobster", "clams", "octopus", "oysters", "crabs", "halibut",
                "crustaceans", "sockeye", "trout", "herring", "sardines",
                "mackerel", "mussels", "mollusks", "lobsters", "molluscs",
                "caviar", "seabass", "shark", "yellowfin", "fillets") ~ "Fishing",
    word %in% c("oil", "oils", "gas", "cars", "gasoline", "energy",
                "evs", "electric", "biodiesel", "turbines", "steam",
                "power", "plants", "acidizing") ~ "Oil & Gas",
    word %in% c("chrysogaster", "insect", "insecticides", "pest",
                "pesticides") ~ "Insects",
    word %in% c("meat", "meats", "chicken", "beef", "food", "lamb", "foods",
                "pork", "vegetables", "fruit", "vegetable", "salad", "tea",
                "fruits", "poultry", "rice", "cakes", "sugarbeets",
                "bakery", "lambs", "lamb's", "meatballs", "bread", "cream",
                "crabmeat", "steak", "soup", "tomato", "cheese", "chocolate",
                "burger", "corn", "pasta") ~ "Other Food",
    grepl("beverage", word) ~ "Beverages",
    word %in% c("milk", "milks", "soymilk", "tea", "teas", "coffee", "juices",
                "water", "drinks", "drink", "juice", "wine", "wines") ~ "Beverages",
    word %in% c("production", "manufacturing", "manufacture") ~ "Manufacturer",
    word %in% c("dance", "zumba", "yoga") ~ "Dance",
    word %in% c("abrasive", "abrasives", "metal", "materials", "material",
                "steel", "plastic", "plastics", "paper", "rubber", "iron",
                "sealants") ~ "Materials",
    word %in% c("accessories", "accessory") ~ "Accessories",
    word %in% c("acid", "acids", "deacidification") ~ "Chemicals",
    grepl("chemical", word) ~ "Chemicals",
    grepl("animal", word) ~ "Animals",
    word %in% c("tech", "technology", "technologies", "nanotechnology",
                "biotech") ~ "Technology",
    word %in% c("explosive", "explosives", "pyrotechnics",
                "combustible", "powders", "gun") ~ "Explosives",
    word %in% c("shoes", "adidas", "footwear", "apparel", "bags", "shirts",
                "clothing", "fabric", "fabrics", "fashion", "textiles", "bag",
                "boots", "garments", "jewelry", "beauty", "wear", "belts",
                "slippers", "gloves", "caps", "dress", "short", "apparels",
                "skirts", "cosmetic", "garment", "jackets", "socks", "shoe",
                "pants", "cosmetics") ~ "Apparel & Fashion",
    word %in% c("toys") ~ "Toys",
    word %in% c("pharmaceutical", "construction", "leather", "dental",
                "stationery", "textile", "building", "optical", "researcher",
                "pharmaceuticals", "engineering", "freelance",
                "management") ~ "Other Industries",
    TRUE ~ "other"))

# create new df to filter away "other"
stopwords_removed_new <- stopwords_removed_addcategory %>%
  filter(category != "other")   # keep every category except other

# output for dataframe using knitr::kable
kable(head(stopwords_removed_new), "simple")
```

| id | country | type | revenue_omu | word | category |
|---|---|---|---|---|---|
| Jones LLC | ZH | Company | 310612303 | automobiles | Transportation |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | cars | Transportation |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | trucks | Transportation |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | vans | Transportation |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | buses | Transportation |
| Aqua Advancements Sashimi SE Express | Oceanus | Company | 115004667 | chemicals | Chemicals |
7.2.1 Word Extraction
While generating the categories above, we created the check_text data frame to cross-check high-frequency words. After running the code chunk above, the list below helps examine words that might have been missed.
To do so, we filter() only for the other category. Since we are only interested in high-frequency words, we compute frequency counts and examine them with a top-down approach.
Show the code
```r
# create a mini table of uncategorized words
check_text <- stopwords_removed_addcategory %>%
  filter(category == "other") %>% # filter unidentified categories
  group_by(word) %>%              # group by word
  summarize(count = n()) %>%      # frequency count
  arrange(desc(count))            # sort in descending order

# output for dataframe using knitr::kable
kable(head(check_text), "simple")
```

| word | count |
|---|---|
| frozen | 467 |
| equipment | 309 |
| fresh | 276 |
| systems | 180 |
| industrial | 164 |
| canned | 163 |
Notably, the words above remain uncategorized because they are either too generic or could belong to more than one category. For instance, frozen could apply to Fishing or to other types of food. It is therefore better to leave them uncategorized, as the distinctive keywords in each product/service description are still captured.
7.3 Distribution of Identified Word in each Category
A treemap displays hierarchical data as a set of nested rectangles, where each group is represented by an area that is proportional to its value. Based on our categorization, we would like to look into the distribution by our design category and the word involved.
Based on the frequency count, we are able to identify categories that are not related to fishing. We aggregate the stopwords_removed_new data frame using the dplyr package:
- group_by() is used to aggregate by category and word
- summarize() is used to compute the count
- arrange(desc()) to sort the count in descending order
Show the code
```r
# create an aggregated dataframe by category and word
clean_text <- stopwords_removed_new %>%
  group_by(category, word) %>%
  summarize(count = n()) %>%
  arrange(desc(count))

# treemap saved under the 'it' object
it <- treemap(clean_text,
              index = c("category", "word"),
              vSize = "count",
              vColor = "count",
              type = "value",
              algorithm = "pivotSize",
              sortID = "count",
              palette = "Blues",
              border.lwds = "white")
```

We build our treemap by passing the aggregated data frame (clean_text) into the treemap() function and saving it as an object called it. Thereafter, the d3tree() function from the d3treeR package is used to build our interactive treemap.
```r
d3tree(it, rootname = "Categories of Business")
```

Observations:
- Multiple words associated with Fishing have been identified, with fish and seafood being the main contributors to the category.
- It is concerning that there are multiple categories unrelated to Fishing. Apart from food-related categories, Apparel & Fashion has one of the highest counts, with the main contributors being shoes, footwear, bags, and apparel.
The treemap() arguments used above:

- index: the category followed by the associated word
- vSize: the size of the rectangles, based on the count
- vColor: the color intensity, based on the count
- type: the value type of the treemap
- algorithm: the pivot-by-size layout algorithm
- sortID: determines the order from top left to bottom right
- palette: colors the treemap using an RColorBrewer palette
- border.lwds: sets the border line widths
7.4 Analyzing Apparel and Fashion
In this section, we analyse the Apparel & Fashion category further to see whether the businesses in this group are similar in nature. We begin by creating a new data frame, apparel_type, through aggregation. An interactive pie chart is used, as the number of categories is small and we suspect that the differences among the types will be significant.
Observations:
Notably, there are only companies in Apparel & Fashion.
The distribution by country is spread across the countries, with higher concentrations in ZH, Oceanus and Marebak. This is in line with the distribution in Section 5. If the business groups have to be further aggregated, it is ideal to look into the country after its category.
The reorder() command helps to reorder the columns: - represents descending order, while + represents ascending order. However, coord_flip() in the code flips the resulting order.
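The reorder() and coord_flip() behaviour can be illustrated with a small example; the country counts here are made up:

```r
library(ggplot2)

# Toy country counts (hypothetical values)
df <- data.frame(country = c("ZH", "Oceanus", "Marebak"),
                 n       = c(50, 30, 20))

# reorder(country, -n) sorts the bars in descending order of n;
# coord_flip() then turns the chart sideways, reversing the visual order
p <- ggplot(df, aes(x = reorder(country, -n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country", y = "Count")
p
```

Without coord_flip(), ZH would appear leftmost; with it, the largest bar ends up at the bottom of the flipped axis.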
7.4.1 Interactive Data Table
As identified, there are 721 entries in the interactive data table, with 431 unique companies whose products/services are related to Apparel & Fashion. Further analysis might be needed to measure the similarity of these businesses.
Show the code
```r
# filter for the Apparel & Fashion category
stopwords_removed_new_af <- stopwords_removed_new %>%
  filter(category == "Apparel & Fashion") %>%
  select(id, country, word) %>%
  distinct() %>%
  ungroup()

# output for dataframe using DT::datatable
DT::datatable(stopwords_removed_new_af)
```

8. Conclusion
In our exercise, we observed the following:
Abnormality in business groups based on type: from the network graphs and points of contact, we identified multiple people with high ownership or large numbers of connections.
Multiple words not associated with fishing-related business: the identified keywords are distinct and can be categorized into broader groups.
Apparel & Fashion is one of the largest categories; it mainly comprises companies, and they operate widely across various countries.
References:
Chatterjee, S. (2019, January 10). Create a Word Cloud with R. Towards Data Science. Retrieved June 11, 2023, from https://towardsdatascience.com/create-a-word-cloud-with-r-bde3e7422e8a
R4VA. (2023, June 4). 27. Network Metrics Analysis. Retrieved June 11, 2023, from https://r4va.netlify.app/chap27.html#network-metrics-analysisr
VAST Challenge. (2023). MC3: VAST Challenge 2023 - Mini Challenge 3. Retrieved June 09, 2023, from https://vast-challenge.github.io/2023/MC3.html