<\/figure>\n\n\n\nWe can see that several addresses have been collected for some ransomware families (for example, over 7000 for the Locky ransomware). We will conduct our analysis starting from a single address; the clustering logic would remain the same with more addresses.<\/p>\n\n\n\n
In this case, let\u2019s take the seed address that belongs to the CryptXXX <\/strong>ransomware.<\/p>\n\n\n\nCryptXXX_seed_address = seed_addresses.loc[seed_addresses['family'] == 'CryptXXX']\nCryptXXX_seed_address<\/code><\/pre>\n\n\n\nNow we extract only the bitcoin address from the dataset and store it as a string. With the address in string format we can use the BlockSci library to create a so-called address object and, again through BlockSci, display all transactions received and sent by this address.<\/p>\n\n\n\n
#extract bitcoin address and convert value to string\nCryptXXX_seed_address = str(CryptXXX_seed_address.iloc[0]['address'])\n#create the address object from the string\naddress_obj = chain.address_from_string(address_string = CryptXXX_seed_address)\naddress_obj<\/code><\/pre>\n\n\n\n <\/figure>\n\n\n\nBefore continuing, it is important to understand the methodology that allows us to associate different addresses with the same person. These are the conditions we will check to decide whether an address is linked to the ransomware attack.<\/p>\n\n\n\n
Disclaimer: I am not an expert on blockchain analytics. If you have any doubts or believe there is an error please feel free to leave a comment! \u00a0\ud83d\ude09\u00a0<\/p>\n\n\n\n
Methodology for Linking Addresses<\/h2>\n\n\n\n In this section we will look at two blockchain-based heuristics (Common Spending<\/strong> and One-Time Change<\/strong>) that allow us to connect different addresses to the same actor. These heuristics, which we will use to identify ransomware wallets, have already been used in several academic studies for clustering bitcoin addresses. To start, let\u2019s define a bitcoin transaction as a triplet of elements:<\/p>\n\n\n\nt = (A, B, c)<\/em><\/p>\n\n\n\nA<\/em> represents the finite multiset of inputs<\/strong> of the transaction t<\/em><\/li>B<\/em> represents the finite multiset of outputs<\/strong> of the transaction t<\/em><\/li>c<\/em> represents the transaction fee<\/li><\/ul>\n\n\n\nCommon Spending (CS)<\/strong><\/p>\n\n\n\nThe first heuristic we will use for tracking ransomware addresses is called Common Spending. It is based on the assumption that if two or more input addresses send funds to the same output address in one transaction, then all addresses involved in the transaction are controlled by the same person. The assumption fails only when multiple people agree to execute a transaction together, but this is very rare and so we will ignore this possibility. Moreover, since we are talking about a criminal activity, even if the transaction was performed by multiple people, they would still all be involved in the ransomware. For the heuristic to be valid, the transaction must have only one output, because multi-output transactions (through coin-mixers) are often used to obfuscate transaction history. We can summarize this first heuristic this way:<\/p>\n\n\n\n
If two or more addresses are inputs of the same transaction with one output, then all these addresses are controlled by the same user.<\/p><\/blockquote>\n\n\n\n
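To make the rule concrete, here is a minimal sketch of the Common Spending check on a toy transaction. The `ToyTx` class and the `common_spending_cluster` function are hypothetical names introduced only for illustration; they are not part of BlockSci, and the addresses are made up. The sketch assumes that "two or more addresses" means two or more distinct input addresses.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ToyTx:
    """Hypothetical stand-in for a transaction t = (A, B, c)."""
    inputs: List[str]   # input addresses (A)
    outputs: List[str]  # output addresses (B)
    fee: int = 0        # transaction fee in satoshi (c)


def common_spending_cluster(tx: ToyTx) -> Set[str]:
    """Return the addresses linked by Common Spending, or an empty set."""
    distinct_inputs = set(tx.inputs)
    # the rule applies only with at least two input addresses and one output
    if len(distinct_inputs) >= 2 and len(tx.outputs) == 1:
        return distinct_inputs | set(tx.outputs)
    return set()


print(common_spending_cluster(ToyTx(['addr_a', 'addr_b'], ['addr_c'])))
print(common_spending_cluster(ToyTx(['addr_a', 'addr_b'], ['addr_c', 'addr_d'])))
```

The first call links all three addresses, while the second returns an empty set because the transaction has two outputs.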
One-Time Change (OTC)<\/strong><\/p>\n\n\n\nThe OTC heuristic is based on the standard Bitcoin mechanism where the change from a transaction is returned to a new address. When you send funds from your bitcoin wallet, the specified amount is sent to the intended bitcoin address and the rest of the funds held by the sending address are sent to what is called a \u201cchange address\u201d associated with the sender\u2019s wallet. The conditions we will use to check whether a transaction is an OTC transaction were taken from a paper on bitcoin address clustering [2]. These are the conditions that must be met:<\/p>\n\n\n\n
1<\/strong> Addr(B) = 2, i.e. the transaction t has exactly two outputs.<\/p>\n\n\n\n2<\/strong> Addr(A) \u2260 2, i.e. the number of t inputs is not equal to two. If Addr(A) = Addr(B) = 2 the transaction is most likely a shared-send mixer transaction.<\/p>\n\n\n\n3<\/strong> Both outputs of transaction t, B1 and B2, are not self-change addresses, i.e. B1, B2 \u2209 Addr(A).<\/p>\n\n\n\n4<\/strong> One output of the transaction B1 did not exist before transaction t and decimal representation of the value b1 has more than 4 digits after the dot.<\/p>\n\n\n\nIf the transaction satisfies the conditions of a One-Time Change transaction, input and output addresses belong to the same user.<\/p><\/blockquote>\n\n\n\n
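The four conditions can be sketched as a single check on a toy transaction. Everything here is illustrative: `is_one_time_change`, `MIN_VALUE_SATOSHI` and the `seen_before` set are hypothetical names, not BlockSci calls. The value test follows this article's reading of the fourth condition (an output worth at least 0.00001000 BTC, i.e. 1000 satoshi), which is an assumption about what the paper intends, and the sketch requires the same output to be both new and large enough.

```python
from typing import Dict, List, Set

MIN_VALUE_SATOSHI = 1000  # assumption: 0.00001000 BTC expressed in satoshi


def is_one_time_change(
    inputs: List[str],
    outputs: Dict[str, int],
    seen_before: Set[str],
) -> bool:
    """Check the four OTC conditions on a toy transaction.

    inputs:      input addresses of the transaction
    outputs:     mapping of output address -> value in satoshi
    seen_before: addresses that appeared in earlier transactions
    """
    input_set = set(inputs)
    # 1. exactly two outputs
    if len(outputs) != 2:
        return False
    # 2. the number of inputs is not two
    if len(input_set) == 2:
        return False
    # 3. no output is a self-change address
    if not input_set.isdisjoint(outputs):
        return False
    # 4. at least one output is new and carries a large enough value
    return any(
        addr not in seen_before and value >= MIN_VALUE_SATOSHI
        for addr, value in outputs.items()
    )


print(is_one_time_change(['addr_a'], {'addr_b': 50000, 'addr_c': 1234567}, {'addr_a', 'addr_b'}))
```

Passing the set of previously seen addresses explicitly keeps the function independent of any blockchain parser; with BlockSci that information would come from the first transaction of each address.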
Now that we have defined the conditions that will lead us to associate addresses with the seed address, let\u2019s take a look at all the transactions in the blockchain where the seed address appears as an input of the transaction.<\/p>\n\n\n\n
To do this we use the input_txes<\/code> attribute of the address object.<\/p>\n\n\n\ninputs_txs = list(address_obj.input_txes)\nprint('Number of transactions involved: ' + str(len(inputs_txs)))\ninputs_txs<\/code><\/pre>\n\n\n\n <\/figure>\n\n\n\nThere are 64 transactions where the seed address appears as an input of the transaction. The transaction list provides various information:<\/p>\n\n\n\n
len(txins)<\/code> is the number of inputs used in the transaction<\/li>len(txouts)<\/code> is the number of outputs used in the transaction<\/li>size_bytes<\/code> is the size of the transaction in bytes<\/li>block_height<\/code> is the height of the block that contains the transaction<\/li>tx_index<\/code> is the transaction identification index; this information is not derived from the blockchain, but was added during blockchain parsing<\/li><\/ul>\n\n\n\nFrom this list we create new_list<\/code>, a list of lists. We then convert it to a DataFrame so that it is easier to work with in pandas.<\/p>\n\n\n\n#convert each Tx object to a string and split its fields on ','\nnew_list = [str(tx).split(',') for tx in inputs_txs]\ndf = pd.DataFrame(new_list, columns=['inputs', 'outputs', 'size', 'block_height', 'tx_index'])\n#keep only the columns needed for the heuristics\ndf = df.drop(columns=['size', 'block_height'])\ndf<\/code><\/pre>\n\n\n\n <\/figure>\n\n\n\nValues are expressed in str<\/code> format. We must then remove what we do not need (for example the “len(txins)=” prefix) and convert these values to int<\/code> format.<\/p>\n\n\n\n#strip the literal prefixes and the closing parenthesis (regex=False treats them as plain text)\ndf['inputs'] = df['inputs'].str.replace('Tx(len(txins)=', '', regex=False)\ndf['outputs'] = df['outputs'].str.replace('len(txouts)=', '', regex=False)\ndf['tx_index'] = df['tx_index'].str.replace('tx_index=', '', regex=False)\ndf['tx_index'] = df['tx_index'].str.replace(')', '', regex=False)\ndf['inputs'] = df['inputs'].astype(int)\ndf['outputs'] = df['outputs'].astype(int)\ndf['tx_index'] = df['tx_index'].astype(int)\ndf<\/code><\/pre>\n\n\n\n <\/figure>\n\n\n\nFinding transactions that satisfy the first heuristic (Common Spending) is fairly simple. 
Only two conditions need to be met to conclude that the input and output addresses belong to the same person: the transaction must have at least two inputs (inputs >= 2) and exactly one output (outputs = 1).<\/p>\n\n\n\n
We create a function that adds a column named heuristic1<\/code> and, iterating over each row, inserts 1<\/code> if the conditions are verified, 0<\/code> if they are not.<\/p>\n\n\n\ndef heur1(row):\n    #1 if the transaction has at least two inputs and exactly one output, 0 otherwise\n    if row['inputs'] >= 2 and row['outputs'] == 1:\n        val = 1\n    else:\n        val = 0\n    return val\ndf['heuristic1'] = df.apply(heur1, axis=1)\ndf<\/code><\/pre>\n\n\n\n <\/figure>\n\n\n\ndf['heuristic1'].value_counts()\n<\/code><\/pre>\n\n\n\n <\/figure>\n\n\n\nWe found only two transactions that satisfy the first heuristic. We can say that the addresses used in these two transactions (inputs and outputs) belong to the same person.<\/p>\n\n\n\n
Let\u2019s now see whether the other transactions satisfy the second heuristic<\/strong>. We can easily check the first two conditions of the second heuristic. The first condition<\/strong> requires that the number of outputs be equal to 2<\/code>. We count the values present in the outputs<\/code> column.<\/p>\n\n\n\n#first condition -> outputs = 2\ndf['outputs'].value_counts()<\/code><\/pre>\n\n\n\n <\/figure>\n\n\n\nThere are 62 transactions with 2 outputs and 2 transactions with 1 output (the ones we have already identified with the first heuristic). We create a second DataFrame that contains only transactions with 2 outputs and delete the heuristic1<\/code> column.<\/p>\n\n\n\n#filter and drop in one step so that df2 is a new DataFrame, not a view of df\ndf2 = df.loc[df['outputs'] == 2].drop(columns='heuristic1')<\/code><\/pre>\n\n\n\nThe second condition<\/strong> requires that the number of inputs of the transaction not be 2 (Addr(A) \u2260 2).<\/p>\n\n\n\n#second condition -> inputs != 2\n\ndf2['inputs'].value_counts()<\/code><\/pre>\n\n\n\n <\/figure>\n\n\n\nWe can see that no transaction has exactly 2 inputs, so all transactions satisfy the second condition.<\/p>\n\n\n\n
Before continuing with the condition check, we write a function that allows us to extract addresses (and all elements) from blocksci objects<\/code>.<\/p>\n\n\n\ndef create_addresses_list(inputs):\n    addresses_list = []\n    for i in range(len(inputs)):\n        #convert values in str and append to the list\n        addresses_list.append(str(inputs[i]))\n        #split values with ','\n        addresses_list[i] = addresses_list[i].split(',')\n        #select value in position 1 i.e. the address\n        addresses_list[i] = addresses_list[i][1]\n        #remove everything around the address itself\n        stopwords = ['address', '=', '(', ')', 'PubkeyHashAddress', ' ', 'ScriptHashAddress']\n        for word in stopwords:\n            if word in addresses_list[i]:\n                addresses_list[i] = addresses_list[i].replace(word, '')\n    #remove duplicates while keeping the original order\n    addresses_list = list(dict.fromkeys(addresses_list))\n    return addresses_list<\/code><\/pre>\n\n\n\nFrom df2<\/code> we create a list with transaction indexes.<\/p>\n\n\n\ntx_list = list(df2['tx_index'])\n<\/code><\/pre>\n\n\n\nTo verify the third condition <\/strong>we need to check that the output addresses are not present among the input addresses (B1, B2 \u2209 Addr(A)).<\/p>\n\n\n\ncondition_dict = {}\n#we use a for loop to iterate each transaction in the transaction list\nfor i in range(len(tx_list)):\n    #create tx object for each transaction\n    tx_obj = chain.tx_with_index(tx_list[i])\n    #create a list of inputs of the transaction\n    inputs_addresses = create_addresses_list(list(tx_obj.inputs))\n    #create a list of outputs of the transaction\n    outputs_addresses = create_addresses_list(list(tx_obj.outputs))\n    #the condition holds when no output address also appears among the input addresses\n    if set(outputs_addresses).isdisjoint(inputs_addresses):\n        condition_dict[str(tx_list[i])] = 1\n    else:\n        condition_dict[str(tx_list[i])] = 0\n#count the number of transactions that satisfy the third 
condition\nsum(condition_dict.values())<\/code><\/pre>\n\n\n\n <\/figure>\n\n\n\nAll 62 transactions under consideration satisfy the third condition.<\/p>\n\n\n\n
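As a small illustration of the third condition, the sketch below checks that two address lists share no element by comparing them as sets. The function name `has_no_self_change` and the addresses are hypothetical and only serve the example.

```python
def has_no_self_change(input_addresses, output_addresses):
    """True when no output address also appears among the inputs."""
    return set(input_addresses).isdisjoint(output_addresses)


inputs_demo = ['addr_a', 'addr_b']
print(has_no_self_change(inputs_demo, ['addr_c', 'addr_d']))  # no shared address
print(has_no_self_change(inputs_demo, ['addr_b', 'addr_d']))  # addr_b is shared
```

A membership test between two lists, such as `['addr_b'] not in ['addr_a', 'addr_b']`, asks whether the whole list is an element of the other list, so it evaluates to `True` even when an address is shared. Comparing sets avoids that pitfall.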
Now there is the fourth condition<\/strong> to analyze:<\/p>\n\n\n\nOne output of the transaction B1 did not exist before transaction t and decimal representation of the value b1 has more than 4 digits after the dot.<\/p><\/blockquote>\n\n\n\n
The first part <\/strong>of the condition requires that, for at least one output address, this transaction is its first one, meaning that the address did not exist before it. The second part<\/strong> requires the value of this output to be at least 0.00001000 BTC.<\/p>\n\n\n\nWe will test the fourth condition in two parts.<\/p>\n\n\n\n
For the first part, we can compare the dates of the first transaction of the output addresses and the date of the transaction under analysis. If the date of the transaction matches at least one date of the first transaction of the outputs, then the first part of the condition is verified. We apply the same logic used previously for the third condition, a for loop that performs the analysis for each transaction and updates the dictionary.<\/p>\n\n\n\n
d = {}\nfor i in range(len(tx_list)):\n    #create tx object\n    tx_obj = chain.tx_with_index(tx_list[i])\n    outputs_addresses = create_addresses_list(list(tx_obj.outputs))\n    #collect the date of the first transaction of every output address\n    first_tx_times = []\n    for address in outputs_addresses:\n        first_tx_times.append(chain.address_from_string(address_string = address).first_tx.block_time)\n    block_time_tx = tx_obj.block_time\n    if block_time_tx in first_tx_times:\n        #the transaction date is equal to at least one of the dates of the first output address transactions: mark 1\n        d[str(tx_list[i])] = 1\n    else:\n        #the transaction date is not equal to any of the dates of the first transactions of the output addresses: mark 0\n        d[str(tx_list[i])] = 0<\/code><\/pre>\n\n\n\nLet\u2019s look at how many times the value 1 appears in the dictionary (the value indicating that the condition is verified):<\/p>\n\n\n\n
print('number of transactions that verify the condition: ' + str(sum(d.values())))<\/code><\/pre>\n\n\n\n <\/figure>\n\n\n\nAgain, all transactions satisfy the first part of the fourth condition.<\/p>\n\n\n\n
Now we need to verify that the outputs of these transactions were at least 0.00001000 BTC (1000 satoshi).<\/p>\n\n\n\n
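Since output values are stored in satoshi, it helps to see the threshold as an integer. The sketch below converts 0.00001000 BTC into satoshi and, for comparison only, counts how many digits after the dot are needed to write a satoshi value in BTC. The helper names are hypothetical, and the digit count reflects a literal reading of "more than 4 digits after the dot"; it is an assumption about the wording, not the check performed in this article.

```python
from decimal import Decimal

SATOSHI_PER_BTC = 100_000_000  # 1 BTC = 10^8 satoshi


def btc_to_satoshi(btc: str) -> int:
    """Convert a BTC amount given as a string into satoshi."""
    return int(Decimal(btc) * SATOSHI_PER_BTC)


def digits_after_dot(satoshi: int) -> int:
    """Digits needed after the dot to write a satoshi value in BTC."""
    digits = 8
    # every trailing zero in the satoshi value removes one decimal digit
    while digits > 0 and satoshi % 10 == 0:
        satoshi //= 10
        digits -= 1
    return digits


threshold = btc_to_satoshi('0.00001000')
print(threshold)
print(digits_after_dot(threshold))
print(digits_after_dot(150_000_000))
```

Under the literal reading, a round amount such as 1.5 BTC has a single digit after the dot even though it is far above 1000 satoshi, so the two readings can disagree on large round values.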
To extract transaction values we need to slightly modify the previous function used to create the list of addresses. Basically we only need to change the selection index of the list, which instead of selecting the second element, in position [1]<\/code>, will have to select the third, in position [2]<\/code>. We also change the name of the function and of the variables to make it clearer, but it remains practically the same function.<\/p>\n\n\n\ndef create_value_list(transactions):\n    value_list = []\n    for i in range(len(transactions)):\n        #convert values in str and append to the list\n        value_list.append(str(transactions[i]))\n        #split values with ','\n        value_list[i] = value_list[i].split(',')\n        #select value in position 2 i.e. the output value\n        value_list[i] = value_list[i][2]\n        stopwords = ['value', '=', ')', ' ']\n        for word in stopwords:\n            if word in value_list[i]:\n                value_list[i] = value_list[i].replace(word, '')\n    #unlike addresses, values are not de-duplicated: two outputs may carry the same amount\n    return value_list<\/code><\/pre>\n\n\n\nLet\u2019s check the condition:<\/p>\n\n\n\n
d2 = {}\nfor i in range(len(tx_list)):\n    #create tx object\n    tx_obj = chain.tx_with_index(tx_list[i])\n    outputs_transactions = create_value_list(list(tx_obj.outputs))\n    #convert outputs values from str to int (values are in satoshi)\n    outputs_transactions = list(map(int, outputs_transactions))\n    #0.00001000 BTC = 1000 satoshi: at least one output must reach this value\n    if any(value >= 1000 for value in outputs_transactions):\n        d2[str(tx_list[i])] = 1\n    else:\n        d2[str(tx_list[i])] = 0\nprint('number of transactions that verify the condition: ' + str(sum(d2.values())))<\/code><\/pre>\n\n\n\n <\/figure>\n\n\n\nAll transactions also satisfy the second part of the fourth condition.<\/p>\n\n\n\n
Let\u2019s see how many addresses involved in these transactions can be associated with the same person.<\/p>\n\n\n\n
inputs_addresses_related = create_addresses_list(list(address_obj.input_txes.inputs))\n \noutputs_addresses_related = create_addresses_list(list(address_obj.input_txes.outputs))\nprint('Addresses related to seed address: ' + str(len(inputs_addresses_related) + len(outputs_addresses_related)))<\/code><\/pre>\n\n\n\n
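If the same address shows up both among the inputs and among the outputs of different transactions, adding the two list lengths counts it twice. A minimal sketch of a de-duplicated count, using a hypothetical helper `count_related_addresses` and made-up addresses, could look like this:

```python
def count_related_addresses(input_addresses, output_addresses, seed_address=None):
    """Count distinct related addresses, optionally excluding the seed."""
    related = set(input_addresses) | set(output_addresses)
    # discard() does nothing when the seed is absent or None
    related.discard(seed_address)
    return len(related)


demo_inputs = ['seed', 'addr_a', 'addr_b']
demo_outputs = ['addr_b', 'addr_c']
print(count_related_addresses(demo_inputs, demo_outputs))
print(count_related_addresses(demo_inputs, demo_outputs, seed_address='seed'))
```

Whether to exclude the seed address from the total is a matter of convention, so the helper leaves that choice to the caller.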