How to find the DataNodes that actually store a file in HDFS?
A file in HDFS may be split into many blocks, and each block is replicated on several DataNodes. So how do you find the DataNodes that actually store a given file?
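You can use the fsck tool that ships with Hadoop (it is a command of its own, not a dfsadmin option) with the -files, -blocks and -locations options. On Hadoop 2 and later, hdfs fsck is the preferred form of the same command. Each numbered block line in the output ends with a bracketed list of the DataNodes (IP:port) that hold a replica of that block. Here is an example: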
$ hadoop fsck /user/aaa/file.name -files -locations -blocks
Connecting to namenode via http://dstore-170:50070
FSCK started by hadoop (auth:SIMPLE) from /10.0.3.170 for path /user/path/to/file.gz at Fri Oct 17 12:25:55 HKT 2014
/user/path/to/file.gz 12448905476 bytes, 93 block(s):  OK
0. BP-1960069741-10.0.3.170-1410430543652:blk_1074365040_625145 len=134217728 repl=2 [10.0.3.173:50010, 10.0.3.174:50010]
1. BP-1960069741-10.0.3.170-1410430543652:blk_1074365041_625146 len=134217728 repl=2 [10.0.3.175:50010, 10.0.3.174:50010]
2. BP-1960069741-10.0.3.170-1410430543652:blk_1074365042_625147 len=134217728 repl=2 [10.0.3.175:50010, 10.0.3.174:50010]
3. BP-1960069741-10.0.3.170-1410430543652:blk_1074365043_625148 len=134217728 repl=2 [10.0.3.175:50010, 10.0.3.174:50010]
4. BP-1960069741-10.0.3.170-1410430543652:blk_1074365044_625149 len=134217728 repl=2 [10.0.3.181:50010, 10.0.3.174:50010]
...
91. BP-1960069741-10.0.3.170-1410430543652:blk_1074365131_625236 len=134217728 repl=2 [10.0.3.175:50010, 10.0.3.174:50010]
92. BP-1960069741-10.0.3.170-1410430543652:blk_1074365132_625237 len=100874500 repl=2 [10.0.3.181:50010, 10.0.3.174:50010]
Status: HEALTHY
 Total size:	12448905476 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	93 (avg. block size 133859198 B)
 Minimally replicated blocks:	93 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	2
 Average block replication:	2.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		10
 Number of racks:		1
FSCK ended at Fri Oct 17 12:25:55 HKT 2014 in 1 milliseconds
The filesystem under path '/user/aaa/file.name' is HEALTHY
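For a file with many blocks, it can help to condense the listing into one line per DataNode. The following is a minimal sketch, not part of Hadoop: the function name datanodes_for_file is hypothetical, and it assumes the DataNodes appear as IPv4 address:port pairs on the lines containing blk_, as in the output above.

```shell
# Hypothetical helper (not a Hadoop command): reads fsck output on stdin and
# prints "address:port replica_count" for every DataNode, busiest first.
# Assumes DataNodes are shown as IPv4:port on the block lines (those with "blk_").
datanodes_for_file() {
  grep 'blk_' \
    | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+' \
    | sort | uniq -c \
    | awk '{ print $2, $1 }' \
    | sort -k2,2nr -k1,1
}

# Usage (requires a running cluster, so it is left commented out here):
#   hadoop fsck /user/aaa/file.name -files -blocks -locations | datanodes_for_file
```

The first grep keeps only the block lines, so addresses in the header (such as the client address on the "FSCK started" line) are not counted.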